In order to fix T8948, we'll need to switch English Wikipedia from the simple "uppercase" collation to the "uca-default" collation (based on the Unicode Collation Algorithm). This will cause several changes to category sorting. Most noticeably, characters with diacritics will be grouped with their non-diacritic counterparts (instead of being considered distinct). We should run this change by the English Wikipedia community and make sure they are OK with it before any changes are actually made.
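To illustrate the difference in grouping, here is a minimal Python sketch. It is only an approximation: `uppercase_key` and `folded_key` are hypothetical helpers, and stripping combining marks merely imitates how UCA treats diacritics at the primary level. It is not the ICU implementation that MediaWiki uses.

```python
import unicodedata

def uppercase_key(title: str) -> str:
    # Rough stand-in for the "uppercase" collation: compare uppercased code points.
    return title.upper()

def folded_key(title: str) -> str:
    # Rough stand-in for UCA primary ordering: decompose, then drop combining marks
    # so that accented letters compare like their base letters.
    decomposed = unicodedata.normalize("NFD", title)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.upper()

titles = ["Zebra", "Éclair", "Eagle", "Apple", "Ålesund"]
by_uppercase = sorted(titles, key=uppercase_key)
by_folded = sorted(titles, key=folded_key)

print(by_uppercase)  # accented titles land after "Zebra"
print(by_folded)     # accented titles sit next to their base letters
```

Under the code-point ordering, "Ålesund" and "Éclair" end up after "Zebra"; with the folded key they are listed among the A's and E's.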
Description
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
Switch enwiki to uca-default collation | operations/mediawiki-config | master | +1 -0
Status | Subtype | Assigned | Task
---|---|---|---
Open | | None | T32996 Change $wgCategoryCollation values to appropriate one for each Wikimedia wiki
Open | Feature | None | T47443 Deploy language-specific "uca-xx" collations on Wikimedia wikis
Resolved | | kaldari | T128483 Fix category headers for pages that begin with numbers
Resolved | | kaldari | T8948 Natural number sorting in category listings
Resolved | | Niharika | T136150 Switch English Wikipedia to uca-default collation
Resolved | | Johan | T144081 Notify English Wikipedia of switch to uca-default collation
Duplicate | | None | T144580 updateCollation.php on terbium still run code from 1.28.0-wmf.16 against enwiki ( LoadBalancer::reallyOpenConnection: 402+ connections made (master=db1057) LoadBalancer.php line 850 )
Open | | None | T144634 Investigation: New sort order: Hyphenated words should be sorted lower than the prefix
Event Timeline
Proposal posted on English Wikipedia village pump:
https://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28technical%29#OK_to_switch_English_Wikipedia.27s_category_collation_to_uca-default.3F
Just as an aside, I strongly recommend that this be done in one step, going straight to UCA with numeric sorting. Going first to uca-default and then to uca + numeric would be unnecessarily disruptive given how long the conversion script takes (even including the recent improvements).
@Bawolff: Good idea. BTW, the discussion on the en.wiki village pump has unanimous approval for switching to uca-default.
MatmaRex actually informed me on IRC that the numeric option doesn't affect the non-numeric sort keys, which I didn't realize, so going from uca-default -> uca-default-numeric would only mess up sorting of things that start with a number. Thus it's probably less of an issue than I thought.
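As a rough illustration of that point, the sketch below compares a plain key with a numeric-aware key. `plain_key` and `numeric_key` are hypothetical helpers written for this example and only imitate the idea of numeric collation; they are not how ICU builds sort keys.

```python
import re

def plain_key(title: str) -> str:
    # Digits are compared character by character, so "10" sorts before "2".
    return title.upper()

def numeric_key(title: str):
    # re.split with a capture group alternates non-digit runs (even indexes)
    # and digit runs (odd indexes). Digit runs are compared by numeric value.
    parts = re.split(r"([0-9]+)", title.upper())
    key = []
    for i, part in enumerate(parts):
        if not part:
            continue
        if i % 2 == 1:
            key.append((0, int(part), ""))
        else:
            key.append((1, 0, part))
    return key

titles = ["Zulu", "10 Downing Street", "Apple", "2 Broke Girls"]
plain_order = sorted(titles, key=plain_key)
numeric_order = sorted(titles, key=numeric_key)

print(plain_order)
print(numeric_order)
```

Only the two titles that begin with a number swap places; "Apple" and "Zulu" keep the same relative order under both keys.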
So many tasks!
I just left a note at T32996#2336528 regarding the English Wikipedia proposal specifically, though a similar question applies to other wikis.
I also created https://meta.wikimedia.org/wiki/Collation.
We should probably enable numeric sorting for enwiki at the same time so that we don't have to regenerate the sort-keys twice.
Might be good to test this on a real wiki (i.e. with real articles and categories and defaultsort keys) before deploying to English Wikipedia. Maybe Swedish Wikipedia would work. Will ping @Johan when he gets back next week.
@Niharika: Steps:
- Make sure there is a record for 'en' in IcuCollation::$tailoringFirstLetters. (There is.)
- Check out the operations/mediawiki-config repo that has all the production configs in it.
- In wmf-config/InitialiseSettings.php, you'll add a new record to the $wgConf->settings['wgCategoryCollation'] array: 'enwiki' => 'uca-default-u-kn', // T136150. (You can refer to https://www.mediawiki.org/wiki/Manual:$wgCategoryCollation for all the possible options.)
- Commit your change and push it to Gerrit (referencing bug T136150)
- Go to https://wikitech.wikimedia.org/wiki/Deployments and add the commit as a deployment request during the Monday morning SWAT window.
- Make sure you can ssh to the terbium server: ssh terbium.eqiad.wmnet. If you can't, check your ssh config settings.
- Be in the wikimedia-operations IRC channel at the beginning of the SWAT window (18:00 UTC).
- Open up ssh connections to both terbium and fluorine.
- After the SWAT deployer has merged the change, but before they sync it across the production servers, you'll start running the fatalmonitor script on fluorine to keep an eye on fatal errors. You can also watch the chart at https://grafana.wikimedia.org/dashboard/file/varnish-http-errors.json. If you notice serious problems caused by the change, ask the SWAT deployer to roll back the change.
- Test that category pages still load successfully on English Wikipedia.
- Test adding a page to a category (like your user sandbox).
- Log into terbium.
- Create a new screen session with the screen command. (You may have to hit space to dismiss the intro.)
- Run mwscript maintenance/updateCollation.php --wiki=enwiki --force within the screen session.
- Wait a day or two.
- See if everything worked correctly.
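The configuration step above amounts to adding a per-wiki override for one setting. The following Python sketch models that idea only. The `SETTINGS` table, the `setting_for` helper, the "default" fallback value, and "examplewiki" are assumptions made for illustration; the real lookup in the production config is more involved.

```python
# Hypothetical, simplified model of a per-wiki settings table.
SETTINGS = {
    "wgCategoryCollation": {
        "default": "uppercase",          # assumed fallback for wikis without an override
        "enwiki": "uca-default-u-kn",    # the record added for T136150
    }
}

def setting_for(name: str, wiki: str) -> str:
    # Use the wiki-specific value when present, otherwise fall back to "default".
    per_wiki = SETTINGS[name]
    return per_wiki.get(wiki, per_wiki["default"])

enwiki_collation = setting_for("wgCategoryCollation", "enwiki")
other_collation = setting_for("wgCategoryCollation", "examplewiki")
print(enwiki_collation, other_collation)
```

Changing the setting only affects newly written sort keys, which is why the steps go on to run updateCollation.php over the existing rows.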
@Johan: If there are no issues from Swedish Wikipedia before Monday, we'll plan on deploying the change to enwiki on Monday. I expect the maintenance script will take about 24 hours to run, starting at approximately 18:00 UTC Monday. Can you alert the enwiki community? Tell them that category sorting may be unreliable for up to a day.
Change 307248 had a related patch set uploaded (by Niharika29):
Switch enwiki to uca-default collation
Mentioned in SAL [2016-08-29T18:11:48Z] <thcipriani@tin> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:307248|Switch enwiki to uca-default collation (T136150)]] (duration: 00m 47s)
Should AWB stop adding DEFAULTSORT for these cases?
https://en.wikipedia.org/w/index.php?title=Pero_Qu%C3%A9_Necesidad&diff=next&oldid=737008876
Most special characters (including é as in that example) should now sort properly, so yes, such DEFAULTSORT keys are probably no longer needed. (They do have a minor effect though. In the new system, accents are used as a tie-breaker for pages with the exact same name not counting accents, so the DEFAULTSORT would prevent the tie-breaking, which is probably bad.)
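The tie-breaking behaviour can be sketched in Python as a two-level key. This is an approximation under stated assumptions: `fold` and `sort_key` are hypothetical helpers, the second title is an invented unaccented twin of the linked page, and comparing the raw string as the second level only imitates the accent level of UCA.

```python
import unicodedata

def fold(text: str) -> str:
    # Drop combining marks so accented letters compare like their base letters.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).upper()

def sort_key(title: str, defaultsort=None):
    # A DEFAULTSORT value, when present, replaces the title as the sort text.
    effective = defaultsort if defaultsort is not None else title
    # First level: accent-insensitive text. Second level: the text itself,
    # which breaks ties between names that differ only in accents.
    return (fold(effective), effective)

# Without DEFAULTSORT the accent decides the order of the two pages.
pages = [("Pero Qué Necesidad", None), ("Pero Que Necesidad", None)]
without_defaultsort = [t for t, d in sorted(pages, key=lambda p: sort_key(*p))]

# With an accent-stripped DEFAULTSORT both keys are identical, so the
# tie-break is lost and the input order is simply kept (Python's sort is stable).
pages = [("Pero Qué Necesidad", "Pero Que Necesidad"), ("Pero Que Necesidad", None)]
with_defaultsort = [t for t, d in sorted(pages, key=lambda p: sort_key(*p))]

print(without_defaultsort)
print(with_defaultsort)
```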
For reference, that's out of 104,348,595 rows.
As of this writing, seems to be in the C's. (The A's took a long time, as about 10% of the category entries on enwiki belong to categories starting with either "All" or "Articles")
btw, once this script finishes, I would recommend doing a second pass via mwscript maintenance/updateCollation.php --wiki=enwiki --previous-collation uppercase (Don't specify --force). It appears that very occasionally the script misses a row, and this will get any stragglers. It should be very fast the second time around.
The script was launched just before we updated the enwiki code from 1.28.0-wmf.16 to 1.28.0-wmf.17. As such, it is currently running with an old code base and should be restarted (if that is at all possible). (from T144580)
It's covered 57 million rows out of 104,348,595 already. It's been running ~4 days now. I think it's possible to restart and it should cover the remaining rows. Or we can let it go on as it is and then run
mwscript maintenance/updateCollation.php --wiki=enwiki --previous-collation uppercase
afterwards when it ends, as Brian suggested.
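For a sense of scale, the figures above give a rough progress estimate. The sketch assumes a constant processing rate, which is a simplification: as noted earlier in the thread, some stretches (such as the A's) ran much slower than others.

```python
TOTAL_ROWS = 104_348_595   # total categorylinks rows quoted in this thread
processed = 57_000_000     # "57 million rows" covered so far
elapsed_days = 4.0         # "~4 days" of running time

fraction_done = processed / TOTAL_ROWS
rate_per_day = processed / elapsed_days
remaining_days = (TOTAL_ROWS - processed) / rate_per_day

print(f"{fraction_done:.1%} done, roughly {remaining_days:.1f} more days at the same rate")
# prints "54.6% done, roughly 3.3 more days at the same rate"
```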
The script should be totally fine to stop and restart (just don't specify --force or it will start from the beginning instead of resuming where it left off).
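Why a restart without --force resumes rather than starts over can be shown with a small model. This is a sketch of the row-selection idea only, assuming rows are picked by their stored cl_collation value; `update_collation` and the sample rows are hypothetical and this is not the actual updateCollation.php code.

```python
def update_collation(rows, new, previous=None, force=False):
    """Rewrite the collation of selected rows; return how many were rewritten."""
    updated = 0
    for row in rows:
        if not force:
            # With a previous collation given, touch only rows still on it.
            if previous is not None and row["cl_collation"] != previous:
                continue
            # Otherwise skip rows that already carry the new collation.
            if previous is None and row["cl_collation"] == new:
                continue
        row["cl_collation"] = new
        updated += 1
    return updated

# Three rows were converted before the interruption, two were not.
rows = [{"cl_collation": "uca-default-u-kn"} for _ in range(3)]
rows += [{"cl_collation": "uppercase"} for _ in range(2)]

resumed = update_collation(rows, "uca-default-u-kn", previous="uppercase")
second_pass = update_collation(rows, "uca-default-u-kn", previous="uppercase")
forced = update_collation(rows, "uca-default-u-kn", force=True)

print(resumed, second_pass, forced)  # prints "2 0 5"
```

The resumed run only handles the two leftover rows, a second pass finds nothing to do, and a forced run goes through every row again.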
And done!
104160942 rows processed
Also ran the follow-up script:
niharika29@terbium:~$ mwscript maintenance/updateCollation.php --wiki=enwiki --previous-collation uppercase
Collations up-to-date.
Hmm, something's odd then:
MariaDB [enwiki_p]> select count(*) from categorylinks where cl_collation = 'uppercase' limit 4;
+----------+
| count(*) |
+----------+
|      207 |
+----------+
1 row in set (0.11 sec)
Either something is out of sync, or there are just some stragglers.
mwscript maintenance/updateCollation.php --wiki=enwiki --previous-collation uppercase
as suggested before should fix them up very quickly. Leave it for a day, and check on it again to make sure.
Never mind, it appears that it's correct on the real DBs, and this is just a Tool Labs replication issue.