Opened 4 years ago

Closed 3 years ago

Last modified 3 years ago

#1327 closed change (fixed)

Test migration of adblockplus.org/forum from phpBB to Discourse

Reported by: kzar Assignee: trev
Priority: P3 Milestone:
Module: Infrastructure Keywords: phpbb forum discourse
Cc: kzar, trev, arthur, mapx, greiner, matze Blocked By: #3242
Blocking: #1624 Platform: Unknown
Ready: yes Confidential: no
Tester: Unknown Verified working: no
Review URL(s):

Description (last modified by kzar)

Background

We want to replace phpBB with Discourse for the Adblock Plus user forum.

phpBB forum is our most used system, the one where we get most user interaction. It is also the crappiest one, not to mention outdated and too hard to update. A project to have it replaced has been picked up a few times over the years, the progress has been negligible because nobody can focus on it. We should consider adding this to the roadmap.

  • palant

To achieve that the first step is to test the migration.

What to change

The aim is to create a Vagrant VM to help test the migration from phpBB to Discourse. The VM needs to be provisioned using Puppet scripts, as with the real servers, and needs to be loaded with test data that is as realistic as possible.

The VM also needs to be set up so that migration tests can be run as easily as possible at provisioning time. This might require extra work that won't be used on the real servers. That is acceptable as long as the result works reliably and is reproducible.

The migration itself will be tested using Discourse's built-in phpBB data importer for now. Depending on how well it works we might need to customise it or write something ourselves.

The aim of this testing is also to enumerate any issues and questions the migration raises so that we can start to address them.

What we need to import:

  • Forum posts
  • User accounts (without passwords, the email address should be sufficient to log in with Google & Co.) - only the ones who actually posted something, we have a massive number of dead spammer accounts there.
  • Subforums should be mapped to categories, e.g. "Adblock Plus for Firefox" will become the "firefox" category. We will create the necessary categories and configure permissions for them manually.
  • Group memberships and ranks don't need to be converted, we can do that manually.
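
A minimal sketch of the account filter described in the list above, assuming users and posts have already been loaded into plain Python structures. The column names user_id and poster_id follow the phpBB tables referenced later in this ticket; the sample rows and the function name are made up for illustration.

```python
def users_with_posts(users, posts):
    """Return only the users that authored at least one post."""
    poster_ids = {post["poster_id"] for post in posts}
    return [user for user in users if user["user_id"] in poster_ids]


# Made-up sample data: user 3 never posted, so it is not imported.
users = [
    {"user_id": 1, "user_email": "alice@example.com"},
    {"user_id": 2, "user_email": "bob@example.com"},
    {"user_id": 3, "user_email": "spam@example.com"},
]
posts = [
    {"post_id": 10, "poster_id": 1},
    {"post_id": 11, "poster_id": 2},
    {"post_id": 12, "poster_id": 1},
]

kept = users_with_posts(users, posts)
print([user["user_id"] for user in kept])
```

Building the set of poster IDs once keeps the filter linear in the number of rows, which matters when most accounts are dead ones.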

Notes

  • We might need to make a script that helps phpBB users migrate. It could allow you to log in with your phpBB credentials and then give you a choice of buttons to link with your Google / other account for use with Discourse.
  • We might need to create a map of redirects from phpBB URLs to Discourse ones so that we can avoid broken links. We could serve the redirects from a web app or write them for the webserver to handle directly.
  • During migration we might need to set up a read only demonstration of Discourse with the data on a different URL, maybe a subdomain like forum.adblockplus.org . When it's ready we could then switch writing on and put phpBB to read only before we start redirecting people.

Resources

http://www.discourse.org/
https://meta.discourse.org/t/phpbb-3-importer/17397
https://meta.discourse.org/t/importing-phpbb-into-discourse/7956

Attachments (1)

dump.sql.gz (991.3 KB) - added by trev 4 years ago.
Cleaned up forum database dump


Change History (36)

comment:1 Changed 4 years ago by kzar

Without access to the adblockplus.org/forum database and files this ticket is unworkable IMHO; we need to test that the migration script works in practice with our data. I therefore think this ticket is blocked until someone with access to the data has time to work on the migration.

comment:2 Changed 4 years ago by kzar

  • Cc dave@… added
  • Owner kzar deleted

comment:3 follow-up: Changed 4 years ago by philll

  • Cc kzar trev added; dave@… removed

@trev: I guess it would be sufficient to provide some test dump to get this done by somebody without direct access to the database.

However, this Issue lacks a lot of concept, which is required to do such a migration with some specs about how logically to migrate subthreads, what to show on the start page, how to logically take over permissions etc.

comment:4 in reply to: ↑ 3 Changed 4 years ago by kzar

Yea I was thinking that, @snoack was saying how there's sensitive data like password hashes and private messages but perhaps it would be possible to strip them all out of the dump with a regexp? (Like set all passwords to a hash of "password" and the private messages with lorem ipsum or something appropriate.)

@philll you're right, there are a lot of questions about what happens in various situations. I've not thought them through at all really, to be honest! I thought the best place to start was to test and see what actually happens in practice; from that we could work out where the problem areas are and decide the best course of action for each of them. That said, if you guys already have ideas about some of those things, post away I guess!

Replying to philll:

@trev: I guess it would be sufficient to provide some test dump to get this done by somebody without direct access to the database.

However, this Issue lacks a lot of concept, which is required to do such a migration with some specs about how logically to migrate subthreads, what to show on the start page, how to logically take over permissions etc.

Last edited 4 years ago by kzar (previous) (diff)
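
A rough sketch of the scrubbing idea from this comment. Note that it works on already-parsed rows rather than running a regexp over the dump text, which is a swapped-in approach. The column names user_password, message_subject and message_text are assumed to match the phpBB schema, and the placeholder hash is only a stand-in, not a hash phpBB would accept for login.

```python
import hashlib

PLACEHOLDER_TEXT = "Lorem ipsum dolor sit amet."
# Stand-in value only: phpBB uses its own password hashing scheme, so this
# is not a valid phpBB login hash for "password".
PLACEHOLDER_HASH = hashlib.sha256(b"password").hexdigest()


def scrub_user(row):
    """Return a copy of a users row with the password hash replaced."""
    cleaned = dict(row)
    cleaned["user_password"] = PLACEHOLDER_HASH
    return cleaned


def scrub_private_message(row):
    """Return a copy of a private message row with subject and text replaced."""
    cleaned = dict(row)
    cleaned["message_subject"] = "Lorem ipsum"
    cleaned["message_text"] = PLACEHOLDER_TEXT
    return cleaned


# Made-up sample rows.
print(scrub_user({"user_id": 1, "user_password": "secret-hash"}))
print(scrub_private_message(
    {"msg_id": 5, "message_subject": "hi", "message_text": "private text"}
))
```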

comment:5 Changed 4 years ago by trev

  • Description modified (diff)
  • Priority changed from Unknown to P3
  • Ready set
  • Summary changed from Replace adblockplus.org/forum phpBB with Discourse to Migrate adblockplus.org/forum data from phpBB to Discourse

I updated the description and title. Note that this is only about migrating the data now, actually setting up a public Discourse instance should be a separate issue.

comment:6 Changed 4 years ago by trev

  • Review URL(s) modified (diff)

comment:7 Changed 4 years ago by trev

  • Description modified (diff)

comment:8 Changed 4 years ago by trev

  • Description modified (diff)

Changed 4 years ago by trev

Cleaned up forum database dump

comment:9 Changed 4 years ago by trev

The dump I attached contains only very old posts, up to July 2006 or something like that. There are no IP addresses, these have been anonymized away long ago. I removed the lines from the dump manually, so there might be topics without any posts but hopefully not vice versa. Also, the users table is completely empty - please use some dummy data there.

I left out the groups table - the import script uses it to recognize bot accounts, we should look at the post count instead. I also left out the forums table - the import script gets forum names from there, we should simply map forum IDs to categories "manually" instead.
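
The "manual" mapping could look roughly like the sketch below. The forum IDs and the "chrome" category are hypothetical; only the "firefox" category name comes from the ticket description. The forum_id key on topic rows is an assumption about how the topics would be loaded.

```python
# Hypothetical forum IDs mapped by hand to Discourse category names.
FORUM_TO_CATEGORY = {
    1: "firefox",
    2: "chrome",
}
FALLBACK_CATEGORY = "uncategorized"


def category_for(forum_id):
    """Look up the Discourse category for a phpBB forum ID."""
    return FORUM_TO_CATEGORY.get(forum_id, FALLBACK_CATEGORY)


def unmapped_forum_ids(topics):
    """List forum IDs that occur in topics but have no manual mapping yet."""
    seen = {topic["forum_id"] for topic in topics}
    return sorted(seen - FORUM_TO_CATEGORY.keys())


# Made-up topics: forum 7 has no mapping and should be reported.
topics = [
    {"topic_id": 100, "forum_id": 1},
    {"topic_id": 101, "forum_id": 7},
]
print(category_for(1), category_for(7), unmapped_forum_ids(topics))
```

Reporting unmapped IDs before the import avoids topics silently landing in a fallback category.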

Last edited 4 years ago by trev (previous) (diff)

comment:10 Changed 4 years ago by arthur

  • Cc arthur added

comment:11 Changed 4 years ago by christian

  • Cc cvervoorts@… added

comment:12 Changed 4 years ago by kzar

  • Description modified (diff)

comment:13 Changed 4 years ago by trev

We might need to make a script that helps phpBB users migrate. It could allow you to log in with your phpBB credentials and then give you a choice of buttons to link with your Google / other account for use with Discourse.

That would only help those who have an invalid email address in their phpBB account - meaning that it shouldn't be necessary. An announcement should be sufficient to make the forum regulars update their email addresses (if necessary at all, chances are that their email addresses are already correct - for the sake of notifications). Nobody else should care.

We might need to create a map of redirects from phpBB URLs to Discourse ones so that we can avoid broken links. We could serve the redirects from a web app or write them for the webserver to handle directly.

Our webserver of choice is Nginx. It might be efficient enough to handle 25000 redirects. The ideal solution would be to keep thread IDs, however; then a few redirects would be sufficient.
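
If topic IDs cannot be kept, the redirects could be generated as an Nginx map block keyed on the t query argument. The following is only a sketch: the variable name $discourse_topic_url and the sample ID mapping are made up, and the real mapping would have to come from the import.

```python
def nginx_topic_map(topic_urls):
    """Render an nginx map block keyed on the phpBB topic ID (?t=...)."""
    lines = ["map $arg_t $discourse_topic_url {", '    default "";']
    for old_id, new_path in sorted(topic_urls.items()):
        lines.append(f"    {old_id} {new_path};")
    lines.append("}")
    return "\n".join(lines)


# Made-up mapping from old phpBB topic IDs to new Discourse paths.
topic_urls = {43: "/t/topic-name/81", 7: "/t/other-topic/12"}
print(nginx_topic_map(topic_urls))
```

The generated block would be included in the http context, and a location for viewtopic.php could then redirect whenever the mapped variable is non-empty.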

comment:14 Changed 4 years ago by mapx

  • Cc mapx added

comment:15 Changed 4 years ago by greiner

  • Cc greiner added

comment:16 Changed 4 years ago by kzar

  • Owner set to kzar

comment:17 Changed 4 years ago by kzar

  • Description modified (diff)

comment:18 Changed 4 years ago by matze

  • Cc mathias@… added

comment:19 Changed 4 years ago by kzar

So after a few false starts I settled on what I think is the best approach: I've forked the infrastructure repo and have been making changes to get a forumserver running. Once it's working I will use the VM to test the migrations, and when we eventually go ahead we can use much of the same code in production.

I've written a phpBB puppet module and added a forumserver VM; that's working quite well. @matze, @aalvz and @poz2k4444 are now helping me get the Discourse module ready so that we can get some migration tests running.

If you're curious, the code so far is all online on GitHub: https://github.com/kzar/adblockplus-infrastructure/tree/1327-forum-migration/forum-migration .

comment:20 Changed 4 years ago by matze

  • Review URL(s) modified (diff)

comment:21 Changed 4 years ago by kzar

  • Description modified (diff)
  • Summary changed from Migrate adblockplus.org/forum data from phpBB to Discourse to Test migration of adblockplus.org/forum from phpBB to Discourse

comment:22 Changed 4 years ago by kzar

  • Blocked By 1614 added

comment:23 Changed 4 years ago by innerself

  • Blocking 1624 added

comment:24 Changed 4 years ago by kzar

  • Owner kzar deleted
  • Review URL(s) modified (diff)
  • Tester set to Unknown

I have not been working on this for some time and in the meeting it sounded like our approach is changing and that Matze is doing some work here. I'll unassign myself for now, but shout if I can help.

comment:25 Changed 3 years ago by matze

  • Cc matze added; cvervoorts@… mathias@… removed

comment:26 Changed 3 years ago by matze

  • Owner set to matze

comment:27 Changed 3 years ago by trev

  • Owner changed from matze to trev

Reassigning to myself, I've made some progress here. The phpBB import script works mostly fine, but one has to change the table names (it expects a "phpbb_" table prefix which we don't have). There is some weirdness: some topics were put under "Your first category" or "Your first category/Your first forum" instead of the correct forums. Also, anonymous posts now have "system" as author. There is an exception when processing invalid URLs, e.g. Object aasted.org/adblock has no method 'splice'. Finally, topic IDs weren't kept, which will make redirects from the current forum complicated.

The import script I used is outdated, the current Discourse version has a newer one. I'll need to look into whether it solves any of the issues above, and whether it will run with our Discourse version (updating Discourse is also an option of course).

Last edited 3 years ago by trev (previous) (diff)

comment:28 Changed 3 years ago by trev

  • Blocked By 3242 added; 1614 removed

comment:29 Changed 3 years ago by trev

Now that Discourse has been updated I started the new version of the migration script. It no longer handles missing avatars gracefully, these have to be present. Also, there is a crash which can be worked around by changing cache_rows: false to cache_rows: true in script/import_scripts/phpbb3/database/database_base.rb. And this fix needs to be applied, otherwise the script will fail on invalid birth dates.

comment:30 Changed 3 years ago by trev

  • Blocked By 4234 added

With the setup in #4234 we should be able to update Discourse easily in future. I'm testing with Discourse 1.5.3 now which is the latest stable version. Issues:

comment:31 Changed 3 years ago by trev

  • Blocked By 4234 removed
  • Resolution set to fixed
  • Status changed from new to closed

My test import completed. Two additional warnings popped up when importing private messages. One is referring to private messages without recipients - I guess that dropping these isn't an issue. The other is referring to conversations where the parent message no longer exists - Discourse treats private messages like regular discussion topics, and these would be incomplete here. While losing these private messages isn't too much of a deal (only 71 PMs out of 3.8k dropped and many of those appear to be mine), it shouldn't be too hard to import these as partial conversations (adjusting root in the MySQL database would do). But maybe Discourse devs decide to fix this issue, then we won't need to do anything.
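
Importing the orphaned messages as partial conversations could be sketched as below. This assumes the phpBB convention that root_level is 0 for the first message of a conversation and otherwise holds the msg_id of that first message, and that msg_id order reflects message order; both are assumptions, as are the sample rows.

```python
def reroot_orphans(messages):
    """Re-root conversations whose original root message no longer exists."""
    existing = {message["msg_id"] for message in messages}
    new_roots = {}  # missing root id -> msg_id of the earliest surviving reply
    fixed = []
    for message in sorted(messages, key=lambda m: m["msg_id"]):
        updated = dict(message)
        root = message["root_level"]
        if root != 0 and root not in existing:
            if root not in new_roots:
                # The earliest surviving reply becomes the conversation start.
                new_roots[root] = message["msg_id"]
                updated["root_level"] = 0
            else:
                updated["root_level"] = new_roots[root]
        fixed.append(updated)
    return fixed


# Made-up rows: message 2 was deleted, so 5 and 6 are orphaned.
messages = [
    {"msg_id": 5, "root_level": 2},
    {"msg_id": 6, "root_level": 2},
    {"msg_id": 7, "root_level": 0},
    {"msg_id": 8, "root_level": 7},
]
print(reroot_orphans(messages))
```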

Either way, the migrated data looks good, so that I can resolve this issue now. For reference, here is the settings.yml file used for the conversion (not really modified much compared to the example):

database:
  type: MySQL # currently only MySQL is supported - more to come soon
  host: localhost
  port: 3306
  username: root
  password:
  schema: phpbb3
  table_prefix: phpbb # Usually all table names start with phpbb. Change this, if your forum is using a different prefix.
  batch_size: 1000 # Don't change this unless you know what you're doing. The default (1000) should work just fine.

import:
  # Enable this option if you want to have a better conversion of BBCodes to Markdown.
  # WARNING: This can slow down your import.
  use_bbcode_to_md: false

  # This is the path to the root directory of your current phpBB installation (or a copy of it).
  # The importer expects to find the /files and /images directories within the base directory.
  # You need to change this to something like /var/www/phpbb if you are not using the Docker based importer.
  # This is only needed if you want to import avatars, attachments or custom smilies.
  phpbb_base_dir: /shared/import/data

  site_prefix:
    # this is needed for rewriting internal links in posts
    original: adblockplus.org/forum    # without http(s)://
    new: https://forum.adblockplus.org       # with http:// or https://

  # Enable this, if you want to redirect old forum links to the new locations.
  permalinks:
    categories: true  # redirects   /viewforum.php?f=1            to  /c/category-name
    topics: true      # redirects   /viewtopic.php?f=6&t=43       to  /t/topic-name/81
    posts: false      # redirects   /viewtopic.php?p=2455#p2455   to  /t/topic-name/81/4

  avatars:
    uploaded: true  # import uploaded avatars
    gallery: true   # import the predefined avatars phpBB offers
    remote: false   # WARNING: This can considerably slow down your import. It will try to download remote avatars.

  # When true: Anonymous users are imported as suspended users. They can't login and have no email address.
  # When false: The system user will be used for all anonymous users.
  anonymous_users: true

  # Enable this, if you want to import password hashes in order to use the "migratepassword" plugin.
  # This will allow users to login with their current password.
  # The plugin is available at: https://github.com/discoursehosting/discourse-migratepassword
  passwords: false

  # By default all the following things get imported. You can disable them by setting them to false.
  bookmarks: true
  attachments: true
  private_messages: true
  polls: true

  # When true: each imported user will have the original username from phpBB as its name
  # When false: the name of each user will be blank
  username_as_name: false

  # Map Emojis to smilies used in phpBB. Most of the default smilies already have a mapping, but you can override
  # the mappings here, if you don't like some of them.
  # The mapping syntax is: emoji_name: 'smiley_in_phpbb'
  # Or map multiple smilies to one Emoji: emoji_name: ['smiley1', 'smiley2']
  emojis:
    # here are two example mappings...
    smiley: [':D', ':-D', ':grin:']
    heart: ':love:'

Unfortunately, I messed up the site_prefix section originally so links didn't get converted - this should work for the real import however.
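
A small pre-flight check for the site_prefix section might catch this kind of mistake before a multi-hour import. The sketch below is not part of the importer; it only encodes the two rules stated in the settings comments (original without http(s)://, new with http:// or https://) and takes the section as an already-parsed mapping. The function name is hypothetical.

```python
def check_site_prefix(site_prefix):
    """Return a list of problems found in a parsed site_prefix section."""
    problems = []
    original = site_prefix.get("original") or ""
    new = site_prefix.get("new") or ""
    if not original:
        problems.append("original is empty")
    elif original.startswith(("http://", "https://")):
        problems.append("original must not include http(s)://")
    if not new.startswith(("http://", "https://")):
        problems.append("new must start with http:// or https://")
    return problems


good = {"original": "adblockplus.org/forum", "new": "https://forum.adblockplus.org"}
bad = {"original": "https://adblockplus.org/forum", "new": "forum.adblockplus.org"}
print(check_site_prefix(good))
print(check_site_prefix(bad))
```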

comment:32 Changed 3 years ago by trev

Noticed one more non-critical issue: permalinks didn't get created. Configurable permalink import is a new feature in Discourse 1.6.0 beta2, the importer in Discourse 1.5.3 will only create permalinks if a particular code section in script/import_scripts/phpbb3/importer.rb is uncommented.

comment:33 Changed 3 years ago by trev

  • Resolution fixed deleted
  • Status changed from closed to reopened

Reopening - it's probably a good idea to do one more go, this time with Discourse 1.6.0 Beta 11 and a current database dump. A few issues should go away then.

comment:34 Changed 3 years ago by trev

  • Resolution set to fixed
  • Status changed from reopened to closed

Ok, import of the current database dump with Discourse 1.6.0 Beta 11 succeeded, took around 5 hours on my machine. Detailed steps:

  • Unpack phpBB images under /var/discourse/shared/standalone/import/data (create directory as it doesn't exist initially).
  • sudo mkdir /var/discourse/shared/standalone/import/data/files (our phpBB instance doesn't allow file uploads).
  • Copy phpbb.dump.gz to /var/discourse/shared/standalone/import/
  • Create settings.yml under /var/discourse/shared/standalone/import/settings.yml with the following contents:
    database:
      type: MySQL
      host: localhost
      port: 3306
      username: root
      password:
      schema: phpbb3
      table_prefix: phpbb
      batch_size: 1000
    
    import:
      use_bbcode_to_md: false
      phpbb_base_dir: /shared/import/data
    
      site_prefix:
        original: adblockplus.org/forum
        new: https://forum.adblockplus.org
    
      permalinks:
        categories: true
        topics: true
        posts: false
    
      avatars:
        uploaded: true  # import uploaded avatars
        gallery: true   # import the predefined avatars phpBB offers
        remote: false   # WARNING: This can considerably slow down your import. It will try to download remote avatars.
    
      anonymous_users: true
      passwords: false
      bookmarks: true
      attachments: true
      private_messages: true
      polls: true
    
      username_as_name: true
    
  • Run sudo /var/discourse/launcher enter app --skip-prereqs to enter the container. All further commands should be run from within the Docker container.
  • apt-get install mysql-server libmysqlclient-dev
  • /etc/init.d/mysql start
  • echo 'CREATE DATABASE phpbb3;' | mysql
  • gzip -cd /shared/import/phpbb.dump.gz | sed -r 's/(TABLES?|INSERT INTO|TABLE IF EXISTS) /\1 phpbb_/' | mysql phpbb3 - note that modifying the dump should no longer be necessary with the next Discourse beta as empty table prefixes are supported properly now.
  • It might be a good idea to remove users without any posts, the overwhelming majority of those are spammers. This would currently remove 70758 out of 81577 users. Just for reference, I didn't do it for this import: echo 'DELETE FROM phpbb_users WHERE user_id NOT IN (SELECT poster_id FROM phpbb_posts);' | mysql phpbb3
  • echo "gem 'mysql2'" >> Gemfile
  • echo "gem 'ruby-bbcode-to-md', :github => 'nlalonde/ruby-bbcode-to-md'" >> Gemfile
  • su discourse -c 'bundle install --no-deployment --without test --without development --path vendor/bundle'
  • Apply patch from this bug report.
  • su discourse -c 'bundle exec rails r SiteSetting.email_domains_whitelist=\"\"' - clearing domain whitelist is necessary because I'm running import with the intraforum configuration, the real forum setup would allow all domains of course.
  • su discourse -c 'bundle exec ruby script/import_scripts/phpbb3.rb /shared/import/settings.yml'
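
To illustrate what the sed expression in the steps above does to the dump, here is an equivalent sketch using Python's re module. The sample lines assume a dump with unquoted table names. One difference to be aware of: sed picks the longest match at a given position, while Python tries alternatives in order, so "TABLE IF EXISTS" has to be listed first here.

```python
import re

# Same alternatives as the sed expression, reordered for Python's
# first-alternative-wins matching.
PREFIX_RE = re.compile(r"(TABLE IF EXISTS|TABLES?|INSERT INTO) ")


def add_table_prefix(line, prefix="phpbb_"):
    """Insert the prefix before the first table name on a dump line."""
    return PREFIX_RE.sub(lambda m: m.group(1) + " " + prefix, line, count=1)


# Made-up dump lines with unquoted table names.
for line in [
    "DROP TABLE IF EXISTS users;",
    "CREATE TABLE users (",
    "LOCK TABLES users WRITE;",
    "INSERT INTO posts VALUES (1);",
    "UNLOCK TABLES;",
]:
    print(add_table_prefix(line))
```

Only the first match per line is rewritten, mirroring sed's behaviour without the g flag.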

The known issues so far:

  • WARN -- : Badly formed IFD: undefined method 'unpack' for nil:NilClass - this warning comes from the exifr gem, it seems to refer to the avatar of some user (image optimization). From what I can tell, it doesn't affect anything.
  • TypeError: contents[0].splice is not a function - this prevents a few posts from being imported, especially those with abp:// links. This is a Discourse bug, not fixed yet. The messages like Parent post 16189 doesn't exist. Skipping 16211 are follow-up errors.
  • Private message without recipients. - from what I can tell, this means private messages sent to a user that no longer exists. Dropping these should be fine.
  • Conversion of internal links broke some of them, this is a Discourse bug, not fixed yet.
  • Anonymous users have only a username, the name is empty. This is an issue if their original username cannot be mapped to Discourse (e.g. for Cyrillic names), their original username will be lost then. Discourse bug filed.
  • Forums are imported as categories, permissions aren't taken over however. It will be necessary to adjust the permissions manually, also the category structure and names.
Last edited 3 years ago by trev (previous) (diff)

comment:35 Changed 3 years ago by trev

For reference, all the bugs I filed have been resolved - too late for Discourse 1.6 Beta 12 but still in time for the final release which is scheduled for August 5.
