Breaking out from the MySQL character-set hell

The MySQL charset hell. Read to know how you fall in it, what to do if you find yourself in it and how to migrate from latin1 to utf8mb4.

Emergence and Rise of UTF-8

Those who know how computer encoding works would know that for a long period of time, ASCII (American Standard Code for Information Interchange) was the default encoding standard. The 8-bit/1 byte character encoding covered all the characters in the English Language, Numbers and the most commonly used special characters (!,.* and so on…). But with time, many Non-English speakers also started using computers and eventually computers started supporting these foreign languages.

Shows the usage of the main encodings on the web from 2001 to 2012 as recorded by Google.

MySQL — Collation and Character Set

MySQL uses two basic terminologies:
1. Collation: A collation is a set of rules for comparing characters in a character set.
2. Character-Set: A character set is a set of symbols and encoding, mostly this information is derived from the type of collation.

Latin1 and MySQL

When working on MySQL you would have seen that when a database is created or a new table is created, the default collation of the database or the table is always latin1_swedish_ci with the default character-set being Latin1. Latin1 being an extension of ASCII defines some characters used by non-english languages & also shares commonly used symbols & glyphs.

MySQL Character-Set Hell — Reproducing The Problem

To find if you’re part of “character-set” hell, checkout the following things:

  1. From your application interface, Insert the following text, from a point where it is possible “English to Arabic….الإنجليزية إلى العربية
  2. Using the mysql-client CLI tool, connect to your application database. (Don’t use a GUI tool like phpMyAdmin to query, always stick to the CLI)
  3. Set the Database Connection variables as follows:
/*The commands below will allow your mysql client to act as a misconfigured application*/  SET SESSION CHARACTER_SET_CLIENT = latin1;
SET SESSION CHARACTER_SET_RESULTS = latin1;
SET SESSION CHARACTER_SET_CLIENT = utf8mb4;
SET SESSION CHARACTER_SET_RESULTS = utf8mb4;

Fixing The Problem

When following the above-mentioned steps, the result turned out to be the one mentioned above, then you can try out the following steps, to fix your character-set encoding problem.

  1. Make everything UTF-8 everywhere. Update mysqld, mysql and client settings in the /etc/mysql/*.cnf file as follows:[1]
[client]
default-character-set = utf8mb4

[mysql]
default-character-set = utf8mb4

[mysqld]
character_set_server=utf8mb4
collation_server=utf8mb4_unicode_ci
#The following should be set if you are using mysql version 5.6 or lower
innodb_file_format=barracuda
innodb_file_per_table=1
innodb_large_prefix=1
[mysqld]
character_set_client: utf8mb4
character_set_database: utf8mb4
character_set_results: utf8mb4
character_set_connection: utf8mb4
character_set_server: utf8mb4
collation_connection: utf8mb4_unicode_ci
collation_server: utf8mb4_unicode_ci
innodb_file_format=barracuda
innodb_file_per_table=1
innodb_large_prefix=1
mysql -h host -u username -p
  • The -u and -p options specify the username and password of the database user respectively.
  • For more information: Checkout the Mysql official doc for mysql.
SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
+--------------------------+--------------------+
| Variable_name | Value |
+--------------------------+--------------------+
| character_set_client | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | utf8mb4 |
| character_set_server | utf8mb4 |
| character_set_system | utf8 |
| collation_connection | utf8mb4_general_ci |
| collation_database | utf8mb4_unicode_ci |
| collation_server | utf8mb4_unicode_ci |
+--------------------------+--------------------+
10 rows in set (0.05 sec)
  • Method 2: I find this method more reliable, although it is 100% more time consuming than Method 1. This method involves taking a dump of the current database (database, character-set, table structure and table data) using the mysqldump CLI tool, updating the character-set and the collation in the dump file and finally re-inserting the fixed SQL dump back, replacing the original database and its content(associated tables and data). When following Method 2, make sure that any active applications connecting to the applications are DISCONNECTED.[5]
mysqldump --verbose -h host -u username -p --add-drop-database --opt --skip-set-charset --default-character-set=latin1 --skip-extended-insert --databases databasename > dump_file.sql
  • The --opt command specifies a list of other default arguments for mysqldump. Read here for more info.
  • The --skip-set-charset and -- default-char-set=latin1 options prevent MySQL from taking the already Latin1 collated table and “helpfully” converting it to any other character set for you. This should ensure that your mysqldump is really in the Latin1 character encoding scheme.
  • The --skip-extended-insert option forces mysqldump to put each INSERT command in the dump on its own line. This will make the dump take much longer to re-import, however, in my experimentation, adding this option was enough to prevent the dump from having syntax errors in anywhere. And in any case, should the re-import fail for any reason, having each row’s data on its own line really helps to be able to zero-in on which rows are causing you problems (and gives you easier options to workaround the problem rows).
  • The --databases allow dumping multiple databases. Furthermore, it also adds a CREATE and USE database command in the dump, which would allow us to change the collation and character-set of the Database as well.
  • The dump_file.sql is the file to which the mysqldump command would write the contents of the database dump. Not specifying an output file would simply lead the database dump to be printed in the terminal.
  • For more information: Checkout the Mysql official doc for mysqldump.
perl -i -pe 's/DEFAULT CHARACTER SET latin1/DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci/' dump_file.sql
perl -i -pe 's/DEFAULT CHARSET=latin1/DEFAULT CHARSET=utf8mb4 COLLATE utf8mb4_unicode_ci ROW_FORMAT=DYNAMIC/' dump_file.sql
mysql --verbose -h host -u username -p < dump_file.sql
mysqlcheck --verbose -h host -u root -p --auto-repair --optimize --all-databases
  • The --auto-repair option informs mysqlcheck to automatically fix a table, if the table, when checked is found to be corrupted or has errors.
  • The --optimize option informs mysqlcheck to optimize the checked tables
  • The above options enable mysqlcheck to run the CHECK TABLE, REPAIR TABLE, ANALYZE TABLE and OPTIMIZE TABLE queries in a convenient way for the user.
  • For more information: Checkout the Mysql official doc for mysqlcheck.
SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
+--------------------------+--------------------+
| Variable_name | Value |
+--------------------------+--------------------+
| character_set_client | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | utf8mb4 |
| character_set_server | utf8mb4 |
| character_set_system | utf8 |
| collation_connection | utf8mb4_general_ci |
| collation_database | utf8mb4_unicode_ci |
| collation_server | utf8mb4_unicode_ci |
+--------------------------+--------------------+
10 rows in set (0.05 sec)
SET NAMES 'utf8mb4';

Nth Encoded Characters

After following the above steps, in an ideal world with a relatively small dataset, everything would be fine and your application would run smoothly, properly displaying the data. Let’s be realistic, in a real world user facing application, with a huge dataset, the possibility of saving double, triple, nth encoded characters exists.

Application Precautions

After breaking out of the MySQL character-set hell, there are a few precautions you should take in your applications. Many programmers and developers, while learning how to code get to know about the default string comparison functions provided by the Language (i.e. In C++/PHP: strlen, strcmp etc.). But many are not aware of the fact, that these functions work on the string provided, with an assumption that 1 byte = 1 character, which was true for ASCII being 1 byte but proven false for UTF-8 which is a variable length character encoding system. A fact to be kept in mind would be to use UTF-8 safe string comparison functions (in PHP — mb* utilities).

Conclusion

I hope that this article provides you with a way to come out of the MySQL character-set hell problem. As people say, hindsight is better than foresight, it definitely applies to the problems associated with MySQL character-set encoding. Also, the above article is not just limited to converting your MySQL database from latin1 to utf8mb4. You can follow the article to move between any two charsets and collation.

Sources

[1]: https://mathiasbynens.be/notes/mysql-utf8mb4#character-sets
[2]: http://mechanics.flite.com/blog/2014/07/29/using-innodb-large-prefix-to-avoid-error-1071/
[3]: http://aprogrammers.blogspot.in/2014/12/utf8mb4-character-set-in-amazon-rds.html
[4]: http://codex.wordpress.org/Converting_Database_Character_Sets#Converting_columns_to_blob.2C_then_back_to_original_format_with_new_charset
[5]: https://www.bluebox.net/insight/blog-article/getting-out-of-mysql-character-set-hell
[6]: https://docs.moodle.org/24/en/Converting_your_MySQL_database_to_UTF8#Linux_.26_Mac
[7]: https://mathiasbynens.be/notes/mysql-utf8mb4#repair-optimize
[8]: https://gist.github.com/mdemblani/34d30015f24e1927736f189b88be143a

Archduke of Computer Engineering. Code to enjoy, like to build stuff, software architecture, DevOps, and learn new technologies. LFC Fan #YNWA.