We need to convert all databases/tables (that can support this) to binary, since binary is MediaWiki's default character set.
Description
Event Timeline
I could copy a database that's in a foreign language, try to convert it, and see if it works.
@Southparkfan could I have assistance on this please? This is a really big change and could lead to data loss.
I would like to assist here, but I am not very experienced with character sets. I have a few questions though.
- Shouldn't we migrate from latin1 (our MariaDB default) to binary (MediaWiki's default), instead of from latin1 to utf8mb4? (see also next question)
- T4425#137259: didn't they use binary?
- Do you have a conversion procedure in mind? What are the known caveats?
- Are other systems (Matomo, Phabricator, etc.) running on non-supported character sets too?
- Is read-only mode required for converting (read: can pt-osc handle this conversion for us without read-only mode enabled)? If so, what is the ETA?
- Are things going to explode if one wiki has binary tables, whereas another wiki has latin1 tables? Data corruption is likely and will have a major impact.
- Suggestion: we must back up the wikis prior to converting the tables.
- Suggestion: a Wikimedia DBA could help you answer these questions. They are very experienced with the MediaWiki database architecture.
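For reference, a latin1-to-binary conversion would presumably boil down to per-table ALTER statements along these lines (a sketch only; the database/table names here are illustrative, and MediaWiki's binary schema declares its text columns as VARBINARY/BLOB explicitly, so this should be verified column-by-column rather than applied blindly):

```sql
-- Hypothetical sketch: CONVERT TO CHARACTER SET binary changes the
-- table default and maps text column types to their binary
-- counterparts (CHAR -> BINARY, VARCHAR -> VARBINARY, TEXT -> BLOB).
ALTER TABLE metawiki.page CONVERT TO CHARACTER SET binary;
```

Note that this rewrites the whole table, which is why the read-only / pt-osc question above matters for large wikis.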
https://phabricator.wikimedia.org/T250058#6051265 suggests search will break when converting to binary?
Spoke briefly with a DBA at WMF (Marostegui). He said Wikimedia uses utf8mb4 and binary, and he guessed the conversion would require read-only mode and ALTER TABLE commands.
He also said he doesn't think it would cause corruption if one table were latin1 and another binary, but that it's worth checking.
Phabricator uses utf8mb4 (as it's forced).
Also:
MariaDB [metawiki]> show table status like 'searchindex';
             Name: searchindex
           Engine: InnoDB
          Version: 10
       Row_format: Dynamic
             Rows: 20670
   Avg_row_length: 2004
      Data_length: 41435136
  Max_data_length: 0
     Index_length: 425984
        Data_free: 4194304
   Auto_increment: NULL
      Create_time: 2020-07-15 23:25:09
      Update_time: 2021-06-02 11:40:55
       Check_time: NULL
        Collation: utf8_general_ci
         Checksum: NULL
   Create_options:
          Comment:
 Max_index_length: 0
        Temporary: N
I wonder if that should be moved to utf8mb4_unicode_ci?
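If we did want to change the searchindex collation, the statement would presumably look something like the following (a sketch; searchindex carries fulltext indexes, so this should be tested on a copy of the table first):

```sql
-- Hypothetical sketch: move searchindex from utf8_general_ci to
-- utf8mb4_unicode_ci, converting column charsets along with the
-- table default. Fulltext indexes are rebuilt as part of the ALTER.
ALTER TABLE searchindex
  CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
```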
Is this something we wish to pursue? There's a lot of discussion about data corruption and this being a big/dangerous task. Personally, I'm not aware of any major issues with the current setup that would justify running such a risk.
@Paladox can you elaborate on the necessity for this please, and if not, can we decline this task?
I don't think we necessarily have to. The point of doing this was so we can search in the backend using unicode (or, for wikis in another language, the original string). At the moment we can't, because the strings aren't stored/displayed in a UTF-8-readable way for us.
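To illustrate the problem described above (a hypothetical sketch, not a real query against our data): a UTF-8 string written through a latin1 connection gets stored as mojibake, so a backend search for the real string finds nothing and the stored value is unreadable:

```sql
-- Hypothetical illustration of the charset mismatch.
-- 'héllo' sent over a latin1 connection is stored as the bytes
-- for 'hÃ©llo', so it both displays as mojibake in the backend and
-- fails to match a search through a properly configured connection:
SET NAMES latin1;
INSERT INTO page (page_title) VALUES ('héllo');  -- stored mangled
SET NAMES utf8mb4;
SELECT * FROM page WHERE page_title = 'héllo';   -- no rows: bytes differ
```

Converting the tables (and using the right connection charset) is what would make these round-trip cleanly.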