Page MenuHomeMiraheze

Fix all mysql tables that are using latin rather than binary
Closed, DeclinedPublic

Description

We need to convert all databases/tables (that can support this) to binary. This is because binary is MediaWiki's default.

Event Timeline

Paladox triaged this task as High priority.May 29 2019, 14:01
Paladox renamed this task from Fix some mysql tables that are using latin rather then utf8 to Fix some mysql tables that are using latin rather then utf8mb4.May 29 2019, 14:11
Paladox renamed this task from Fix some mysql tables that are using latin rather then utf8mb4 to Fix all mysql tables that are using latin rather then utf8mb4.
Paladox updated the task description. (Show Details)
Paladox lowered the priority of this task from High to Normal.Jun 17 2019, 23:03
Paladox lowered the priority of this task from Normal to Low.Jul 1 2020, 19:26
In T4425#130025, @John wrote:

@Paladox update?

I'm not sure if this is safe to do hence why I've set this on the low priority.

I could copy a db which is in a foreign language and try and convert it and see if it works.

Paladox changed the task status from Open to Stalled.Jan 30 2021, 16:32
Paladox added a subscriber: Southparkfan.

@Southparkfan wants us to find out what the WMF does before we do this.

John changed the task status from Stalled to Open.Mar 6 2021, 19:53

@RhinosF1 did the db person from the wmf get back to you?

@RhinosF1 said they use mediawikis default.

Unknown Object (User) unsubscribed.Apr 3 2021, 19:59

@Southparkfan I'm wondering if I could have assistance on this please? This is a really big change and could lead to data loss.

In T4425#142254, @John wrote:

@Southparkfan See the above please

I would like to assist here, but I am not very experienced with character sets. I have a few questions though.

  • Shouldn't we migrate from latin1 (our MariaDB default) to binary (MediaWiki default), instead from latin1 to utf8mb4? (see also next question)
  • T4425#137259: didn't they use binary?
  • Do you have a conversion procedure in mind? What are the known caveats?
  • Are other systems (Matomo, Phabricator, etc.) running on non-supported character sets too?
  • Is read-only mode required for converting (read: can pt-osc handle this conversion for us without read-only mode enabled)? If so, what is the ETA?
  • Are things going to explode if one wiki has binary tables, whereas another wiki has latin1 tables? Data corruption is likely and will have a major impact.
  • Suggestion: we must back-up wikis prior to converting the tables.
  • Suggestion: a Wikimedia DBA could help you answering these questions. They are very experienced with the MediaWiki database architecture.

Discussed; paladox will contact Wikimedia DBAs.

Southparkfan renamed this task from Fix all mysql tables that are using latin rather then utf8mb4 to Fix all mysql tables that are using latin rather then binary.May 14 2021, 12:12
Southparkfan updated the task description. (Show Details)

https://phabricator.wikimedia.org/T250058#6051265 suggests search will break when converting to binary?

Redmin renamed this task from Fix all mysql tables that are using latin rather then binary to Fix all mysql tables that are using latin rather than binary.May 14 2021, 13:29

Spoked with a DBA at wmf briefly (Marostegui). He said it uses utf8mb4 and binary. Also he said he guessed it required read only and also alter table commands.

He also said he doesn't think it would cause corruption if one table had latin1 and the other binary but he also said it's worth checking out.

Phabricator uses utf8mb4 (as it's forced).

Also:

MariaDB [metawiki]> show table status like 'searchindex';
+-------------+--------+---------+------------+-------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+------------+-----------------+----------+----------------+---------+------------------+-----------+
| Name        | Engine | Version | Row_format | Rows  | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | Auto_increment | Create_time         | Update_time         | Check_time | Collation       | Checksum | Create_options | Comment | Max_index_length | Temporary |
+-------------+--------+---------+------------+-------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+------------+-----------------+----------+----------------+---------+------------------+-----------+
| searchindex | InnoDB |      10 | Dynamic    | 20670 |           2004 |    41435136 |               0 |       425984 |   4194304 |           NULL | 2020-07-15 23:25:09 | 2021-06-02 11:40:55 | NULL       | utf8_general_ci |     NULL |                |         |                0 | N         |
+-------------+--------+---------+------------+-------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+------------+-----------------+----------+----------------+---------+------------------+-----------+

I wonder if that should be moved to utf8mb4_unicode_ci?

Is this something we wish to pursue? There's a lot of chat about data corruption and this being a big/dangerous task - I'm not really aware of any major issues with the current setup that would require running such a risk personally.

@Paladox can you elaborate on the necessity for this please, and if not, can we decline this task?

In T4425#156452, @John wrote:

Is this something we wish to pursue? There's a lot of chat about data corruption and this being a big/dangerous task - I'm not really aware of any major issues with the current setup that would require running such a risk personally.

@Paladox can you elaborate on the necessity for this please, and if not, can we decline this task?

We don't necessarily have to do I don't think. The point of doing this was so we can search in the backend using unicode/or if in another language the string. Because at the moment we cannot as the string isn't displayed in a utf8 way for us.

John claimed this task.