Opened on 09/05/2014 at 09:03:34 AM
Closed on 07/13/2016 at 04:06:49 PM
Last modified on 07/28/2016 at 09:35:08 AM
#1327 closed change (fixed)
Test migration of adblockplus.org/forum from phpBB to Discourse
Reported by: | kzar | Assignee: | trev |
---|---|---|---|
Priority: | P3 | Milestone: | |
Module: | Infrastructure | Keywords: | phpbb forum discourse |
Cc: | kzar, trev, arthur, mapx, greiner, matze | Blocked By: | #3242 |
Blocking: | #1624 | Platform: | Unknown |
Ready: | yes | Confidential: | no |
Tester: | Unknown | Verified working: | no |
Review URL(s): |
Description (last modified by kzar)
Background
We want to replace phpBB with Discourse for the Adblock Plus user forum.
phpBB forum is our most used system, the one where we get most user interaction. It is also the crappiest one, not to mention outdated and too hard to update. A project to have it replaced has been picked up a few times over the years, the progress has been negligible because nobody can focus on it. We should consider adding this to the roadmap.
- palant
To achieve that the first step is to test the migration.
What to change
The aim is to create a Vagrant VM to help test the migration from phpBB to Discourse. The VM needs to be provisioned using Puppet scripts, as the real servers are, and loaded with test data that is as realistic as possible.
The VM also needs to be set up so that migration tests can be run as easily as possible right after provisioning. This might require extra work that won't be used on the real servers, which is acceptable as long as the result works reliably and is reproducible.
The migration itself will be tested using Discourse's built-in phpBB data importer for now. Depending on how well it works, we might need to customise it or write something ourselves.
The aim of this testing is also to enumerate any issues and questions the migration raises so that we can start to address them.
What we need to import:
- Forum posts
- User accounts (without passwords, the email address should be sufficient to log in with Google & Co.) - only the ones who actually posted something, we have a massive number of dead spammer accounts there.
- Subforums should be mapped to categories, e.g. "Adblock Plus for Firefox" will become the "firefox" category. We will create the necessary categories and configure their permissions manually.
- Group memberships and ranks don't need to be converted, we can do that manually.
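The "only the ones who actually posted" rule for user accounts can be sketched as a query. This is a minimal illustration against an in-memory SQLite stand-in, not the real forum database: the unprefixed table names follow what is said about our dump, `user_id` and `poster_id` are the phpBB column names, and `username`, `user_email` and the sample rows are assumptions made up for the example.

```python
import sqlite3

# In-memory stand-in for the phpBB tables; only the columns needed here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, username TEXT, user_email TEXT);
    CREATE TABLE posts (post_id INTEGER PRIMARY KEY, poster_id INTEGER);
    INSERT INTO users VALUES (2, 'alice', 'alice@example.com'),
                             (3, 'spammer', 'spam@example.com'),
                             (4, 'bob', 'bob@example.com');
    INSERT INTO posts VALUES (10, 2), (11, 2), (12, 4);
""")


def users_with_posts(conn):
    """Return (user_id, username, user_email) for users with at least one post."""
    return conn.execute("""
        SELECT u.user_id, u.username, u.user_email
        FROM users AS u
        WHERE EXISTS (SELECT 1 FROM posts AS p WHERE p.poster_id = u.user_id)
        ORDER BY u.user_id
    """).fetchall()


importable = users_with_posts(conn)
print(importable)
```

Using EXISTS rather than NOT IN avoids surprises if `poster_id` ever contains NULLs.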
Notes
- We might need to write a script that helps phpBB users migrate. It could let you log in with your phpBB credentials and then offer buttons to link your Google or other account for use with Discourse.
- We might need to create a map of redirects from phpBB URLs to Discourse ones to avoid broken links. We could serve the redirects from a web app or write them out for the web server to handle directly.
- During migration we might need to set up a read-only demonstration of Discourse with the data on a different URL, maybe a subdomain like forum.adblockplus.org. When it's ready we could enable writing there and make phpBB read-only before we start redirecting people.
Resources
http://www.discourse.org/
https://meta.discourse.org/t/phpbb-3-importer/17397
https://meta.discourse.org/t/importing-phpbb-into-discourse/7956
Attachments (1)
Change History (36)
comment:1 Changed on 09/05/2014 at 09:43:56 AM by kzar
comment:2 Changed on 09/05/2014 at 09:45:15 AM by kzar
- Cc dave@adblockplus.org added
- Owner kzar deleted
comment:3 follow-up: ↓ 4 Changed on 09/05/2014 at 12:11:16 PM by philll
- Cc kzar trev added; dave@adblockplus.org removed
@trev: I guess it would be sufficient to provide some test dump to get this done by somebody without direct access to the database.
However, this issue lacks a concept, which such a migration requires: specs for how to migrate subthreads, what to show on the start page, how to carry over permissions, etc.
comment:4 in reply to: ↑ 3 Changed on 09/05/2014 at 12:47:31 PM by kzar
Yeah, I was thinking that. @snoack was saying there's sensitive data like password hashes and private messages, but perhaps it would be possible to strip them all out of the dump with a regexp? (Like setting all passwords to a hash of "password" and replacing the private messages with lorem ipsum or something appropriate.)
@philll you're right, there are a lot of questions about what happens in various situations. I've not thought them through at all, to be honest! I thought the best place to start was to test and see what actually happens in practice; from that we could work out where the problem areas are and decide the best course of action for each. That said, if you guys have ideas about some of those things already, post away!
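A rough sketch of the scrubbing idea, done with SQL on a loaded copy rather than with a regexp over the dump file (a deliberate swap, since column-level updates are easier to get right). It runs against an in-memory SQLite stand-in; `user_password`, `message_subject` and `message_text` are assumed to be the relevant phpBB columns, and the MD5 placeholder does not reproduce phpBB's real hash format.

```python
import hashlib
import sqlite3

# In-memory stand-in for a loaded copy of the forum database (assumed columns).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, username TEXT, user_password TEXT);
    CREATE TABLE privmsgs (msg_id INTEGER PRIMARY KEY, message_subject TEXT, message_text TEXT);
    INSERT INTO users VALUES (2, 'alice', 'real-hash-1'), (3, 'bob', 'real-hash-2');
    INSERT INTO privmsgs VALUES (1, 'Hello', 'a private message');
""")

# Placeholder only: every account gets the same hash of "password".
PLACEHOLDER_HASH = hashlib.md5(b"password").hexdigest()
LOREM = "Lorem ipsum dolor sit amet."


def scrub(conn):
    """Overwrite password hashes and private message contents in place."""
    with conn:  # commit both updates together, roll back on error
        conn.execute("UPDATE users SET user_password = ?", (PLACEHOLDER_HASH,))
        conn.execute(
            "UPDATE privmsgs SET message_subject = ?, message_text = ?",
            (LOREM, LOREM),
        )


scrub(conn)
```

The scrubbed copy would then be dumped again; email addresses and other personal fields would need the same treatment before sharing.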
comment:5 Changed on 09/05/2014 at 07:38:59 PM by trev
- Description modified (diff)
- Priority changed from Unknown to P3
- Ready set
- Summary changed from Replace adblockplus.org/forum phpBB with Discourse to Migrate adblockplus.org/forum data from phpBB to Discourse
I updated the description and title. Note that this is only about migrating the data now, actually setting up a public Discourse instance should be a separate issue.
Changed on 09/05/2014 at 08:22:34 PM by trev
Cleaned up forum database dump
comment:9 Changed on 09/05/2014 at 08:27:02 PM by trev
The dump I attached contains only very old posts, up to July 2006 or something like that. There are no IP addresses, these have been anonymized away long ago. I removed the lines from the dump manually, so there might be topics without any posts but hopefully not vice versa. Also, the users table is completely empty - please use some dummy data there.
I left out the groups table - the import script uses it to recognize bot accounts, we should look at the post count instead. I also left out the forums table - the import script gets forum names from there, we should simply map forum IDs to categories "manually" instead.
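Mapping forum IDs to categories "manually" could be as simple as a lookup table plus a check that nothing is left unmapped. The forum IDs and the "chrome" slug below are hypothetical; only the "firefox" category is named in this ticket.

```python
# Hypothetical forum IDs; only the "firefox" slug comes from the ticket description.
FORUM_ID_TO_CATEGORY = {
    1: "firefox",  # "Adblock Plus for Firefox"
    2: "chrome",   # hypothetical second subforum
}


def category_for(forum_id):
    """Return the Discourse category slug for a phpBB forum ID."""
    try:
        return FORUM_ID_TO_CATEGORY[forum_id]
    except KeyError:
        raise KeyError(f"forum {forum_id} has no category mapping") from None


def unmapped_forums(topics):
    """topics: iterable of (topic_id, forum_id); return forum IDs lacking a mapping."""
    seen = {forum_id for _, forum_id in topics}
    return sorted(seen - set(FORUM_ID_TO_CATEGORY))


print(unmapped_forums([(100, 1), (101, 7)]))
```

Running `unmapped_forums` over the topics table before an import would show which subforums still need a category.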
comment:10 Changed on 09/08/2014 at 12:14:30 PM by arthur
- Cc arthur added
comment:11 Changed on 09/08/2014 at 12:42:19 PM by christian
- Cc cvervoorts@adblockplus.org added
comment:12 Changed on 09/08/2014 at 02:57:12 PM by kzar
- Description modified (diff)
comment:13 Changed on 09/09/2014 at 08:48:59 PM by trev
We might need to make a script that helps phpBB users migrate, it could allow you to login with your phpBB credentials and then give you a choice of buttons to link with your Google / other account for use with Discourse.
That would only help those who have an invalid email address in their phpBB account - meaning that it shouldn't be necessary. An announcement should be sufficient to make the forum regulars update their email addresses (if necessary at all, chances are that their email addresses are already correct - for the sake of notifications). Nobody else should care.
We might need to create map of redirects from phpBB URLs to Discourse ones so that we can avoid broken links. We could serve the redirections from a web app or write them for the webserver to handle directly.
Our webserver of choice is Nginx. It might be efficient enough to handle 25000 redirects. The ideal solution, however, would be to keep the thread IDs; then a few redirects would be sufficient.
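If topic IDs can't be kept, the redirect list could be generated rather than written by hand. A minimal sketch, assuming we end up with a table of old phpBB topic IDs and new Discourse paths: it renders an Nginx `map` block keyed on the `t` query argument. The variable name `$discourse_path` and the sample pair are made up for illustration.

```python
def nginx_topic_map(topic_redirects):
    """Render an Nginx `map` block keyed on the `t` query argument.

    topic_redirects: dict of old phpBB topic ID -> new Discourse path.
    """
    lines = ["map $arg_t $discourse_path {", '    default "";']
    for old_id, new_path in sorted(topic_redirects.items()):
        lines.append(f"    {old_id} {new_path};")
    lines.append("}")
    return "\n".join(lines)


# Illustrative pair: /viewtopic.php?t=43 should end up at /t/topic-name/81.
config = nginx_topic_map({43: "/t/topic-name/81"})
print(config)
```

The location handling viewtopic.php would then issue a 301 to the new host whenever `$discourse_path` is non-empty. Whether Nginx copes well with a map of around 25000 entries would still need to be measured.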
comment:14 Changed on 09/12/2014 at 08:25:27 AM by mapx
- Cc mapx added
comment:15 Changed on 09/16/2014 at 01:13:42 PM by greiner
- Cc greiner added
comment:16 Changed on 09/17/2014 at 01:36:41 PM by kzar
- Owner set to kzar
comment:17 Changed on 09/17/2014 at 02:19:46 PM by kzar
- Description modified (diff)
comment:18 Changed on 10/02/2014 at 08:18:30 AM by matze
- Cc mathias@adblockplus.org added
comment:19 Changed on 10/06/2014 at 09:21:28 AM by kzar
After a few false starts I settled on what I think is the best approach: I've forked the infrastructure repo and have been making changes to get a forumserver running. Once it's working I will use the VM to test the migration, and when we eventually go ahead we can use much of the same code in production.
I've written a phpBB Puppet module and added a forumserver VM, and that's working quite well. @matze, @aalvz and @poz2k4444 are now helping me fix up the Discourse module so that we can get some migration tests running.
If you're curious, the code so far is all online on GitHub: https://github.com/kzar/adblockplus-infrastructure/tree/1327-forum-migration/forum-migration
comment:20 Changed on 10/06/2014 at 10:30:31 AM by matze
- Review URL(s) modified (diff)
comment:21 Changed on 11/26/2014 at 03:57:38 PM by kzar
- Description modified (diff)
- Summary changed from Migrate adblockplus.org/forum data from phpBB to Discourse to Test migration of adblockplus.org/forum from phpBB to Discourse
comment:22 Changed on 11/26/2014 at 04:22:21 PM by kzar
- Blocked By 1614 added
comment:23 Changed on 11/27/2014 at 11:45:32 AM by innerself
- Blocking 1624 added
comment:24 Changed on 07/28/2015 at 11:26:43 AM by kzar
- Owner kzar deleted
- Review URL(s) modified (diff)
- Tester set to Unknown
I have not been working on this for some time and in the meeting it sounded like our approach is changing and that Matze is doing some work here. I'll unassign myself for now, but shout if I can help.
comment:25 Changed on 09/18/2015 at 09:53:49 AM by matze
- Cc matze added; cvervoorts@adblockplus.org mathias@adblockplus.org removed
comment:26 Changed on 09/18/2015 at 09:54:24 AM by matze
- Owner set to matze
comment:27 Changed on 10/28/2015 at 09:00:57 AM by trev
- Owner changed from matze to trev
Reassigning to myself, I've made some progress here. The phpBB import script works mostly fine, one has to change the table names however (it expects "phpbb_" table prefix which we don't have). There is some weirdness: some topics were put under "Your first category" or "Your first category/Your first forum" instead of the correct forums. Also, anonymous posts now have "system" as author. There is an exception processing invalid URLs, e.g. Object aasted.org/adblock has no method 'splice'. Finally, topic IDs weren't kept, that will make redirects from the current forum complicated.
The import script I used is outdated, the current Discourse version has a newer one. I'll need to look into whether it solves any of the issues above, and whether it will run with our Discourse version (updating Discourse is also an option of course).
comment:28 Changed on 10/28/2015 at 10:35:45 AM by trev
- Blocked By 3242 added; 1614 removed
comment:29 Changed on 10/28/2015 at 01:15:17 PM by trev
Now that Discourse has been updated I started the new version of the migration script. It no longer handles missing avatars gracefully, these have to be present. Also, there is a crash which can be worked around by changing cache_rows: false to cache_rows: true in script/import_scripts/phpbb3/database/database_base.rb. And this fix needs to be applied, otherwise the script will fail on invalid birth dates.
comment:30 Changed on 07/11/2016 at 03:07:33 PM by trev
- Blocked By 4234 added
With the setup in #4234 we should be able to update Discourse easily in future. I'm testing with Discourse 1.5.3 now which is the latest stable version. Issues:
- We actually had one user in our database without an email address - the import script complains about that so I changed his address in the forum.
- Import script still cannot deal with the scenario where no table prefix is defined, so I have to add that prefix when importing the database:
gzip -cd /shared/phpbb.dump.gz | sed -r 's/(TABLES?|INSERT INTO|TABLE IF EXISTS) `/\1 `phpbb_/' | mysql
- Importing anonymous users fails - due to a bug in active_record gem question marks in user names cannot be processed correctly. However, replacing function all_records_exist? in /var/www/discourse/script/import_scripts/base.rb by the current version from https://github.com/discourse/discourse/blob/master/script/import_scripts/base.rb makes this work (side-effect from this change I think).
- Importing anonymous users has letter case issues, I applied the patch listed in the bug report.
- This post and some other posts (usually with abp:// links) could not be imported because of a weird Discourse bug.
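To make it clearer what the sed expression in the list above does to the dump, here is the same pattern applied to a few typical mysqldump lines. This is only an illustration of the rewrite; the sample statements are made up, and `count=1` mirrors sed replacing just the first match per line.

```python
import re

# Same pattern as the sed expression: keyword(s), a space, then the opening backtick.
PREFIX_RE = re.compile(r"(TABLES?|INSERT INTO|TABLE IF EXISTS) `")


def add_table_prefix(line, prefix="phpbb_"):
    """Insert the table prefix right after the opening backtick of the table name."""
    return PREFIX_RE.sub(lambda m: f"{m.group(1)} `{prefix}", line, count=1)


samples = [
    "DROP TABLE IF EXISTS `users`;",
    "CREATE TABLE `users` (",
    "LOCK TABLES `users` WRITE;",
    "INSERT INTO `users` VALUES (1,'x');",
]
for line in samples:
    print(add_table_prefix(line))
```

The backtick in the pattern matters: without it, "TABLE " would already match in "DROP TABLE IF EXISTS" and the prefix would land in front of "IF".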
comment:31 Changed on 07/11/2016 at 08:20:40 PM by trev
- Blocked By 4234 removed
- Resolution set to fixed
- Status changed from new to closed
My test import completed. Two additional warnings popped up when importing private messages. One is referring to private messages without recipients - I guess that dropping these isn't an issue. The other is referring to conversations where the parent message no longer exists - Discourse treats private messages like regular discussion topics, and these would be incomplete here. While losing these private messages isn't too much of a deal (only 71 PMs out of 3.8k dropped and many of those appear to be mine), it shouldn't be too hard to import these as partial conversations (adjusting root in the MySQL database would do). But maybe Discourse devs decide to fix this issue, then we won't need to do anything.
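One way "adjusting root" could look, sketched against an in-memory SQLite stand-in rather than the real MySQL database. It assumes the private messages table has `msg_id` and `root_level` columns, that `root_level` is 0 for the first message of a conversation and otherwise holds that message's ID, and that a lower `msg_id` means an older message. None of this has been verified against our data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE privmsgs (msg_id INTEGER PRIMARY KEY, root_level INTEGER, message_subject TEXT);
    -- Conversation rooted at 1 is intact; root 5 was deleted, leaving 6 and 7 orphaned.
    INSERT INTO privmsgs VALUES (1, 0, 'intact'), (2, 1, 'Re: intact'),
                                (6, 5, 'Re: gone'), (7, 5, 'Re: gone');
""")


def reroot_orphans(conn):
    """Promote the oldest surviving reply to root where the original root is missing."""
    missing_roots = [row[0] for row in conn.execute("""
        SELECT DISTINCT root_level FROM privmsgs
        WHERE root_level <> 0
          AND root_level NOT IN (SELECT msg_id FROM privmsgs)
    """)]
    with conn:  # apply all changes in one transaction
        for old_root in missing_roots:
            (new_root,) = conn.execute(
                "SELECT MIN(msg_id) FROM privmsgs WHERE root_level = ?", (old_root,)
            ).fetchone()
            # Point the remaining replies at the promoted message ...
            conn.execute(
                "UPDATE privmsgs SET root_level = ? WHERE root_level = ? AND msg_id <> ?",
                (new_root, old_root, new_root),
            )
            # ... and mark the promoted message itself as a conversation root.
            conn.execute("UPDATE privmsgs SET root_level = 0 WHERE msg_id = ?", (new_root,))
    return len(missing_roots)


fixed = reroot_orphans(conn)
print(fixed)
```

The importer would then see these as complete, if partial, conversations instead of skipping them.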
Either way, the migrated data looks good, so that I can resolve this issue now. For reference, here is the settings.yml file used for the conversion (not really modified much compared to the example):
    database:
      type: MySQL # currently only MySQL is supported - more to come soon
      host: localhost
      port: 3306
      username: root
      password:
      schema: phpbb3
      table_prefix: phpbb # Usually all table names start with phpbb. Change this, if your forum is using a different prefix.
      batch_size: 1000 # Don't change this unless you know what you're doing. The default (1000) should work just fine.

    import:
      # Enable this option if you want to have a better conversion of BBCodes to Markdown.
      # WARNING: This can slow down your import.
      use_bbcode_to_md: false

      # This is the path to the root directory of your current phpBB installation (or a copy of it).
      # The importer expects to find the /files and /images directories within the base directory.
      # You need to change this to something like /var/www/phpbb if you are not using the Docker based importer.
      # This is only needed if you want to import avatars, attachments or custom smilies.
      phpbb_base_dir: /shared/import/data

      site_prefix:
        # this is needed for rewriting internal links in posts
        original: adblockplus.org/forum # without http(s)://
        new: https://forum.adblockplus.org # with http:// or https://

      # Enable this, if you want to redirect old forum links to the the new locations.
      permalinks:
        categories: true # redirects /viewforum.php?f=1 to /c/category-name
        topics: true # redirects /viewtopic.php?f=6&t=43 to /t/topic-name/81
        posts: false # redirects /viewtopic.php?p=2455#p2455 to /t/topic-name/81/4

      avatars:
        uploaded: true # import uploaded avatars
        gallery: true # import the predefined avatars phpBB offers
        remote: false # WARNING: This can considerably slow down your import. It will try to download remote avatars.

      # When true: Anonymous users are imported as suspended users. They can't login and have no email address.
      # When false: The system user will be used for all anonymous users.
      anonymous_users: true

      # Enable this, if you want import password hashes in order to use the "migratepassword" plugin.
      # This will allow users to login with their current password.
      # The plugin is available at: https://github.com/discoursehosting/discourse-migratepassword
      passwords: false

      # By default all the following things get imported. You can disable them by setting them to false.
      bookmarks: true
      attachments: true
      private_messages: true
      polls: true

      # When true: each imported user will have the original username from phpBB as its name
      # When false: the name of each user will be blank
      username_as_name: false

      # Map Emojis to smilies used in phpBB. Most of the default smilies already have a mapping, but you can override
      # the mappings here, if you don't like some of them.
      # The mapping syntax is: emoji_name: 'smiley_in_phpbb'
      # Or map multiple smilies to one Emoji: emoji_name: ['smiley1', 'smiley2']
      emojis:
        # here are two example mappings...
        smiley: [':D', ':-D', ':grin:']
        heart: ':love:'
Unfortunately, I originally messed up the site_prefix section, so links didn't get converted. This should work for the real import, however.
comment:32 Changed on 07/11/2016 at 08:30:19 PM by trev
Noticed one more non-critical issue: permalinks didn't get created. Configurable permalink import is a new feature in Discourse 1.6.0 beta2, the importer in Discourse 1.5.3 will only create permalinks if a particular code section in script/import_scripts/phpbb3/importer.rb is uncommented.
comment:33 Changed on 07/12/2016 at 05:16:48 PM by trev
- Resolution fixed deleted
- Status changed from closed to reopened
Reopening - it's probably a good idea to do one more go, this time with Discourse 1.6.0 Beta 11 and a current database dump. A few issues should go away then.
comment:34 Changed on 07/13/2016 at 04:06:49 PM by trev
- Resolution set to fixed
- Status changed from reopened to closed
Ok, import of the current database dump with Discourse 1.6.0 Beta 11 succeeded, took around 5 hours on my machine. Detailed steps:
- Unpack phpBB images under /var/discourse/shared/standalone/import/data (create directory as it doesn't exist initially).
- sudo mkdir /var/discourse/shared/standalone/import/data/files (our phpBB instance doesn't allow file uploads).
- Copy phpbb.dump.gz to /var/discourse/shared/standalone/import/
- Create settings.yml under /var/discourse/shared/standalone/import/ with the following contents:
    database:
      type: MySQL
      host: localhost
      port: 3306
      username: root
      password:
      schema: phpbb3
      table_prefix: phpbb
      batch_size: 1000

    import:
      use_bbcode_to_md: false
      phpbb_base_dir: /shared/import/data
      site_prefix:
        original: adblockplus.org/forum
        new: https://forum.adblockplus.org
      permalinks:
        categories: true
        topics: true
        posts: false
      avatars:
        uploaded: true # import uploaded avatars
        gallery: true # import the predefined avatars phpBB offers
        remote: false # WARNING: This can considerably slow down your import. It will try to download remote avatars.
      anonymous_users: true
      passwords: false
      bookmarks: true
      attachments: true
      private_messages: true
      polls: true
      username_as_name: true
- Run sudo /var/discourse/launcher enter app --skip-prereqs to enter the container. All further commands should be run from within the Docker container.
- apt-get install mysql-server libmysqlclient-dev
- /etc/init.d/mysql start
- echo 'CREATE DATABASE phpbb3;' | mysql
- gzip -cd /shared/import/phpbb.dump.gz | sed -r 's/(TABLES?|INSERT INTO|TABLE IF EXISTS) `/\1 `phpbb_/' | mysql phpbb3 - note that modifying the dump should no longer be necessary with the next Discourse beta as empty table prefixes are supported properly now.
- It might be a good idea to remove users without any posts, the overwhelming majority of those are spammers. This would currently remove 70758 out of 81577 users. I didn't do it for this import, but for reference: echo 'DELETE FROM phpbb_users WHERE user_id NOT IN (SELECT poster_id FROM phpbb_posts);' | mysql phpbb3
- echo "gem 'mysql2'" >> Gemfile
- echo "gem 'ruby-bbcode-to-md', :github => 'nlalonde/ruby-bbcode-to-md'" >> Gemfile
- su discourse -c 'bundle install --no-deployment --without test --without development --path vendor/bundle'
- Apply patch from this bug report.
- su discourse -c 'bundle exec rails r SiteSetting.email_domains_whitelist=\"\"' - clearing domain whitelist is necessary because I'm running import with the intraforum configuration, the real forum setup would allow all domains of course.
- su discourse -c 'bundle exec ruby script/import_scripts/phpbb3.rb /shared/import/settings.yml'
The known issues so far:
- WARN -- : Badly formed IFD: undefined method 'unpack' for nil:NilClass - this warning comes from the exifr gem, it seems to refer to the avatar of some user (image optimization). From what I can tell, it doesn't affect anything.
- TypeError: contents[0].splice is not a function - this prevents a few posts from being imported, especially those with abp:// links. This is a Discourse bug, not fixed yet. Messages like "Parent post 16189 doesn't exist. Skipping 16211" are follow-up errors.
- Private message without recipients. - from what I can tell, this means private messages sent to a user that no longer exists. Dropping these should be fine.
- Conversion of internal links broke some of them, this is a Discourse bug, not fixed yet.
- Anonymous users have only a username, the name is empty. This is an issue if the original username cannot be mapped to Discourse (e.g. for Cyrillic names), because the original username is then lost. Discourse bug filed.
- Forums are imported as categories, permissions aren't taken over however. It will be necessary to adjust the permissions manually, also the category structure and names.
comment:35 Changed on 07/28/2016 at 09:35:08 AM by trev
For reference, all the bugs I filed have been resolved - too late for Discourse 1.6 Beta 12 but still in time for the final release which is scheduled for August 5.
Without access to the adblockplus.org/forum database and files this ticket is unworkable IMHO: we need to test that the migration script works in practice with our data. I therefore think this ticket is blocked until someone with access to the data has time to work on the migration.