Switch English Wikipedia to uca-default collation
Closed, Resolved | Public | 8 Estimated Story Points

Description

In order to fix T8948, we'll need to switch English Wikipedia from the simple "uppercase" collation to the "uca-default" collation (based on the Unicode Collation Algorithm). This will cause several changes to category sorting; most noticeably, characters with diacritics will be grouped with their non-diacritic counterparts (instead of being considered distinct). We should run this change by the English Wikipedia community and make sure they are OK with it before any changes are actually made.
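
To make the sorting change concrete, here is a minimal Python sketch. It is not the ICU-based implementation that MediaWiki uses: `uppercase_key` and `folded_key` are hypothetical helpers that only approximate the old "uppercase" behaviour and the diacritic grouping described above, by stripping combining marks.

```python
import unicodedata

def uppercase_key(title: str) -> str:
    """Approximate the old 'uppercase' collation: compare uppercased code points."""
    return title.upper()

def folded_key(title: str) -> str:
    """Rough stand-in for UCA-style grouping (an assumption, not real ICU):
    decompose, drop combining marks, then uppercase."""
    decomposed = unicodedata.normalize("NFD", title)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.upper()

titles = ["Zebra", "Émile Zola", "Eagle", "Apple"]

# Code-point order puts "É" (U+00C9) after "Z".
print(sorted(titles, key=uppercase_key))
# With diacritics folded, "Émile Zola" sorts among the E's.
print(sorted(titles, key=folded_key))
```

Under these assumptions the first list ends with "Émile Zola", while the second places it between "Eagle" and "Zebra".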

Event Timeline

Just as an aside, I strongly recommend that this be done in one step, going straight to UCA with numeric sorting. Going first to uca-default and then to UCA + numeric would be unnecessarily disruptive given how long the conversion script takes (even including the recent improvements).

@Bawolff: Good idea. BTW, the discussion on the en.wiki village pump has unanimous approval for switching to uca-default.

MatmaRex actually informed me on IRC that the numeric option doesn't affect the non-numeric sort keys, which I hadn't realized, so going from uca-default to uca-default with numeric sorting would only mess up the sorting of things that start with a number. Thus it's probably less of an issue than I thought.
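
The point about numeric sorting can be illustrated with a small Python sketch. This is an approximation, not ICU's numeric collation: `plain_key` and `numeric_key` are hypothetical helpers, and treating digit runs as integers is an assumption about how the numeric option behaves.

```python
import re

def plain_key(title: str) -> str:
    """Non-numeric comparison: digits are compared character by character."""
    return title.upper()

def numeric_key(title: str) -> list:
    """Split into digit and non-digit runs; digit runs compare by integer value."""
    parts = []
    # re.split with a capturing group alternates text, digits, text, ...
    for i, chunk in enumerate(re.split(r"([0-9]+)", title.upper())):
        if not chunk:
            continue
        if i % 2:
            parts.append((0, int(chunk), ""))   # digit run
        else:
            parts.append((1, 0, chunk))         # text run
    return parts

with_digits = ["Route 10", "Route 9"]
without_digits = ["Banana", "Apple", "Cherry"]

print(sorted(with_digits, key=plain_key))    # "Route 10" first: '1' < '9'
print(sorted(with_digits, key=numeric_key))  # "Route 9" first: 9 < 10
# Titles without digits keep the same order under both keys.
print(sorted(without_digits, key=plain_key) == sorted(without_digits, key=numeric_key))
```

In this sketch only titles containing digits change position, which matches the observation that the second conversion would be less disruptive than first feared.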

So many tasks!

I just left a note at T32996#2336528 regarding the English Wikipedia proposal specifically, though a similar question applies to other wikis.

I also created https://meta.wikimedia.org/wiki/Collation.

kaldari moved this task from New & TBD Tickets to Up Next (June 3-21) on the Community-Tech board.

We should probably enable numeric sorting for enwiki at the same time so that we don't have to regenerate the sort-keys twice.

DannyH triaged this task as Medium priority. Jul 5 2016, 8:08 PM
DannyH moved this task from Older: Team Work to Up Next (June 3-21) on the Community-Tech board.
DannyH set the point value for this task to 5.
DannyH edited projects, added Community-Tech-Sprint; removed Community-Tech.
DannyH changed the point value for this task from 5 to 8.

Might be good to test this on a real wiki (i.e. with real articles and categories and defaultsort keys) before deploying to English Wikipedia. Maybe Swedish Wikipedia would work. Will ping @Johan when he gets back next week.

@Niharika: Steps:

  1. Make sure there is a record for 'en' in IcuCollation::$tailoringFirstLetters. (There is.)
  2. Check out the operations/mediawiki-config repo that has all the production configs in it.
  3. In wmf-config/InitialiseSettings.php, you'll add a new record to the $wgConf->settings['wgCategoryCollation'] array: 'enwiki' => 'uca-default-u-kn', // T136150. (You can refer to https://www.mediawiki.org/wiki/Manual:$wgCategoryCollation for all the possible options.)
  4. Commit your change and push it to Gerrit (referencing bug T136150)
  5. Go to https://wikitech.wikimedia.org/wiki/Deployments and add the commit as a deployment request during the Monday morning SWAT window.
  6. Make sure you can ssh to the terbium server: ssh terbium.eqiad.wmnet. If you can't, check your ssh config settings.
  7. Be in the wikimedia-operations IRC channel at the beginning of the SWAT window (18:00 UTC).
  8. Open up ssh connections to both terbium and fluorine.
  9. After the SWAT deployer has merged the change, but before they sync it across the production servers, start running the fatalmonitor script on fluorine to keep an eye on fatal errors. You can also watch the chart at https://grafana.wikimedia.org/dashboard/file/varnish-http-errors.json. If you notice serious problems caused by the change, ask the SWAT deployer to roll back the change.
  10. Test that category pages still load successfully on English Wikipedia.
  11. Test adding a page to a category (like your user sandbox).
  12. Log into terbium.
  13. Create a new screen session with the screen command. (You may have to hit space to dismiss the intro.)
  14. Run mwscript maintenance/updateCollation.php --wiki=enwiki --force within the screen session.
  15. Wait a day or two.
  16. See if everything worked correctly.
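
As a rough illustration of what step 3 changes, the per-wiki override can be pictured as a lookup with a fallback. This is a simplified Python stand-in, not the real PHP configuration: `CATEGORY_COLLATION` and `collation_for` are hypothetical names, the "default" entry is an assumption, and the production lookup rules are richer than a plain dictionary.

```python
# Hypothetical, simplified model of the wgCategoryCollation settings array.
CATEGORY_COLLATION = {
    "default": "uppercase",        # assumption: fallback for wikis with no override
    "enwiki": "uca-default-u-kn",  # the record added in step 3 (T136150)
}

def collation_for(wiki: str) -> str:
    """Return the wiki's own collation if one is set, else the default."""
    return CATEGORY_COLLATION.get(wiki, CATEGORY_COLLATION["default"])

print(collation_for("enwiki"))
print(collation_for("examplewiki"))  # "examplewiki" is a made-up wiki name
```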

@Johan: If there are no issues from Swedish Wikipedia before Monday, we'll plan on deploying the change to enwiki on Monday. I expect the maintenance script will take about 24 hours to run, starting at approximately 18:00 UTC Monday. Can you alert the enwiki community? Tell them that category sorting may be unreliable for up to a day.

Sure. I'll do that Monday morning UTC?

Change 307248 had a related patch set uploaded (by Niharika29):
Switch enwiki to uca-default collation

https://gerrit.wikimedia.org/r/307248

Change 307248 merged by jenkins-bot:
Switch enwiki to uca-default collation

https://gerrit.wikimedia.org/r/307248

Mentioned in SAL [2016-08-29T18:11:48Z] <thcipriani@tin> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:307248|Switch enwiki to uca-default collation (T136150)]] (duration: 00m 47s)

Close to 9 million rows covered so far by the maintenance script.

Most special characters (including é as in that example) should now sort properly, so yes, such defaultsorts are probably no longer needed. (They do have a minor effect, though. In the new system, accents are used as a tie-breaker for pages whose names are exactly the same apart from accents, so the defaultsort would prevent the tie-breaking, which is probably bad.)
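
The tie-breaking behaviour can be sketched in Python as a two-level sort key. This is only an approximation under stated assumptions: real UCA uses secondary collation weights rather than raw code points, and `primary` and `tiebreak_key` are hypothetical helpers.

```python
import unicodedata

def primary(title: str) -> str:
    """First level: the title with accents stripped and case folded to upper."""
    decomposed = unicodedata.normalize("NFD", title)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).upper()

def tiebreak_key(title: str) -> tuple:
    """Second level: the original title decides between equal primaries."""
    return (primary(title), title)

pages = ["Resumes", "Résumé", "Resume"]

# "Resume" and "Résumé" tie on the first level; the accent breaks the tie.
print(sorted(pages, key=tiebreak_key))
```

Here the two names that differ only by accents end up adjacent, with the unaccented one first, and both sort ahead of "Resumes".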

For reference, the ~9 million rows covered so far are out of 104,348,595 in total.

As of this writing, seems to be in the C's. (The A's took a long time, as about 10% of the category entries on enwiki belong to categories starting with either "All" or "Articles")

btw, once this script finishes, I would recommend doing a second pass via mwscript maintenance/updateCollation.php --wiki=enwiki --previous-collation uppercase (Don't specify --force). It appears that very occasionally the script misses a row, and this will get any stragglers. It should be very fast the second time around.

~39.5 million done as of writing this.

hashar subscribed.

The script was launched just before we updated enwiki code from 1.28.0-wmf.16 to 1.28.0-wmf.17. As such, it is currently running with an old code base and should be restarted (if that is at all possible). (from T144580)

It's covered 57 million rows out of 104,348,595 already. It's been running ~4 days now. I think it's possible to restart and it should cover the remaining rows. Or we can let it go on as it is and then run

mwscript maintenance/updateCollation.php --wiki=enwiki --previous-collation uppercase

afterwards when it ends, as Brian suggested.
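
For a back-of-the-envelope sense of how much work is left, the figures above can be plugged into a short Python sketch. It assumes a constant processing rate, which the thread itself suggests is not quite true (the A's took disproportionately long), and `estimate_remaining_days` is a hypothetical helper.

```python
TOTAL_ROWS = 104_348_595   # total rows, as quoted in the thread
ROWS_DONE = 57_000_000     # "57 million rows" covered so far
ELAPSED_DAYS = 4.0         # "running ~4 days now"

def estimate_remaining_days(total: int, done: int, elapsed_days: float) -> float:
    """Naive linear estimate: assumes the average rate so far continues."""
    rate_per_day = done / elapsed_days
    return (total - done) / rate_per_day

fraction_done = ROWS_DONE / TOTAL_ROWS
remaining = estimate_remaining_days(TOTAL_ROWS, ROWS_DONE, ELAPSED_DAYS)

print(f"{fraction_done:.1%} done, roughly {remaining:.1f} more days at the same rate")
```

On these numbers the script is a little over halfway through, with somewhat more than three days left if the rate holds.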

The script should be totally fine to stop and restart (just don't specify --force, or it will start from the beginning instead of resuming where it left off).
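
Why a restart without --force resumes rather than starting over can be sketched as follows. This is a toy Python model, not the actual updateCollation.php logic: `update_collation` is a hypothetical function, and it assumes the script picks rows by their stored collation name unless forced.

```python
def update_collation(rows, target, batch_size=2, force=False, max_batches=None):
    """Rewrite rows to the target collation in batches; return rows rewritten.

    Without force, only rows not yet on the target collation are selected,
    so a second run skips everything an interrupted run already converted.
    """
    if force:
        candidates = list(range(len(rows)))
    else:
        candidates = [i for i, r in enumerate(rows) if r["collation"] != target]

    processed = 0
    batches = 0
    for start in range(0, len(candidates), batch_size):
        if max_batches is not None and batches >= max_batches:
            break  # simulate the script being stopped part-way
        for i in candidates[start:start + batch_size]:
            rows[i]["collation"] = target
            processed += 1
        batches += 1
    return processed

rows = [{"collation": "uppercase"} for _ in range(5)]
first = update_collation(rows, "uca-default-u-kn", max_batches=1)   # interrupted
second = update_collation(rows, "uca-default-u-kn")                 # resumes
third = update_collation(rows, "uca-default-u-kn")                  # nothing left
forced = update_collation(rows, "uca-default-u-kn", force=True)     # redoes all
print(first, second, third, forced)
```

In this model the resumed run only touches the rows the interrupted run missed, while a forced run reprocesses every row.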

Selecting next 100 rows... processing...89000000 done.

And done!

104160942 rows processed

Also ran the follow-up script:

niharika29@terbium:~$ mwscript maintenance/updateCollation.php --wiki=enwiki --previous-collation uppercase
Collations up-to-date.

Hmm, something's odd then:

MariaDB [enwiki_p]> select count(*) from categorylinks where cl_collation = 'uppercase' limit 4;
+----------+
| count(*) |
+----------+
|      207 |
+----------+
1 row in set (0.11 sec)

Either something is out of sync, or there are just some stragglers. Running

mwscript maintenance/updateCollation.php --wiki=enwiki --previous-collation uppercase

as suggested before should fix them up very quickly. Leave it for a day, then check again to make sure.

Never mind, it appears that it's correct on the real DBs, and this is just a Tool Labs replication issue.

So this can be marked as Resolved?

kaldari moved this task from In Development to Q1 2018-19 on the Community-Tech-Sprint board.