Cozying up to Unicode: Refitting Legacy MariaDB instances for modern Python interpreters

Starting a new job in a startup environment can be a thrilling experience that lets you take the reins and start building cool stuff from the ground up. There are fewer surprises because you know the code base like the back of your hand and it's fresh! On the other hand, joining a larger enterprise that has been around for a while will require web developers to make sense of legacy code, fix bugs, and potentially modernize. Sure, sure, if it ain't broke, don't fix it, I get it. Legacy systems don't exist in a vacuum though, and moving parts around them will oftentimes cause friction. I experienced one of those cases that I felt was worth blurbin' about.

If you didn't already know, Django is a framework, written in Python, for building webapps, RESTful APIs, or both! It's easy to set up a data-access layer in Django as it includes an accessible database ORM as part of its core libraries. Typically you'd set up a Postgres instance and interface with that, as Django's ORM (django.db.models) has native support for Postgres. That's not important though, cause who needs array fields, and an easy way to store JSON, when you can hack something contrived together for your web UI forms. I'm not bitter though ...

Anyways! If you're using an older version of Django that runs on Python 2.7, regardless of the database you choose, and you decide to migrate your project to a version of Django that runs on Python 3.x, well .. that's a moving part that has the potential to grind against existing system components, like databases that aren't configured to store strings as Unicode. The reason is a very non-arbitrary change that came with the release of Python 3.x: how it handles strings. Python 3.x treats every string as Unicode text, with UTF-8 as the default encoding, whereas Python 2.7 strings were plain byte strings unless you explicitly asked for unicode. If you're solely a US-based company, big whoop. For multinationals with internationalized applications, however, there is a much wider surface area for encoding problems.

For instance, let's say you have a social media app and you announce the release of an internationalized version of the app, so that users all across the world can now log on and share their dirty laundry with their closest friends and family. A user from Germany signs up for an account, entering details of who he is and what type of things he likes, among other things. The database saves the information fine. The next day that same German user signs in, gets a 500 internal server error, and reports the problem. Management hands that down to an individual contributor, who starts crankin' away at the debug machine. They open up a Python shell, configure and start their Django project, and query for the user's details. They see this exception in their terminal:

In [27]: user = SiteUser.objects.get(user_id=1337)

--------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-27-3372z99d2a> in <module>
----> 1 user = SiteUser.objects.get(user_id=1337)

/usr/lib/lib/python3.8/site-packages/django/db/models/manager.py in manager_method(self, *args, **kwargs)
     83     def create_method(name, method):
     84         def manager_method(self, *args, **kwargs):
     85             return getattr(self.get_queryset(), name)(*args, **kwargs)
     86         manager_method.__name__ = method.__name__
     87         manager_method.__doc__ = method.__doc__

... (more lines from Django's internal files)

/usr/lib/lib/python3.8/site-packages/MySQLdb/cursors.py in _fetch_row(self, size)
    326         if not self._result:
    327             return ()
--> 328         return self._result.fetch_row(size, self._fetch_type)
    329 
    330     def __iter__(self):

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf4 in position 18: invalid continuation byte

The Problem with latin1

The latin1 character set, also known as ISO 8859-1, was the default character set in MySQL and MariaDB for years. However, it covers only Western European languages, which is a severe limitation in today's globalized digital world. As a Django project evolves, the need to support a wider array of characters becomes critical, especially when dealing with user-generated content from a global customer base.
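To see where an error like the one in the traceback comes from, here is a minimal sketch in plain Python. The sample word is made up for illustration, and I'm assuming the bytes were written under latin1 and read back by a client that expects UTF-8.

```python
# Hypothetical sample value; any text with an accented character will do.
word = "hôtel"

# What a latin1 column stores: one byte per character, so ô becomes 0xf4.
stored = word.encode("latin-1")

# What a UTF-8 client attempts when it reads those bytes back.
try:
    stored.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

print(stored)
print(decode_error)
```

In UTF-8, 0xf4 announces a multi-byte character, so the decoder expects a continuation byte next and gives up when it finds a plain ASCII letter instead.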

Transitioning to Unicode

To fully support internationalization, it's necessary to transition the database character set to utf8mb4. This character set supports the complete range of Unicode characters, including emojis, which are becoming increasingly common in user data.
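The "mb4" part matters because the older utf8 character set in MySQL and MariaDB stores at most three bytes per character, while UTF-8 needs four bytes for anything outside the Basic Multilingual Plane, emojis included. A quick sketch of that boundary (the helper name needs_utf8mb4 is mine, not a library function):

```python
def needs_utf8mb4(text: str) -> bool:
    """Return True if any character takes four bytes in UTF-8."""
    # Code points above U+FFFF encode to four bytes in UTF-8.
    return any(ord(ch) > 0xFFFF for ch in text)


print(len("ö".encode("utf-8")))   # 2
print(len("😀".encode("utf-8")))  # 4
print(needs_utf8mb4("Grüße"))
print(needs_utf8mb4("Grüße 😀"))
```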

Step 1: Back Up Your Database

Before making any changes to your database, ensure you have a complete backup. Use mysqldump or MariaDB's backup tools to secure your data.
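If you'd rather script the backup than type it out each time, something like the following could work. It's only a sketch: the helper names are hypothetical, and I'm assuming the password comes from an option file such as ~/.my.cnf rather than the command line.

```python
import subprocess


def build_dump_command(db_name, out_path, user, host="localhost"):
    """Build a mysqldump argument list (hypothetical helper)."""
    return [
        "mysqldump",
        f"--user={user}",
        f"--host={host}",
        "--single-transaction",  # consistent snapshot for InnoDB tables
        "--routines",            # include stored procedures and functions
        f"--result-file={out_path}",
        db_name,
    ]


def run_dump(db_name, out_path, user, host="localhost"):
    """Run the dump; raises CalledProcessError if mysqldump fails."""
    # Assumption: credentials are picked up from an option file.
    subprocess.run(build_dump_command(db_name, out_path, user, host), check=True)
```

Calling run_dump("your_db_name", "backup.sql", "your_db_user") would write the dump to backup.sql in the current directory.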

Step 2: Modify the Database and Table Settings

The following SQL commands will change the default character set and collation for your database and tables:

ALTER DATABASE your_db_name CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;
ALTER TABLE your_table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

Replace your_db_name and your_table_name with your actual database and table names.
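With more than a handful of tables, it may be easier to generate those statements than to write them by hand. Here is a rough sketch; the function names and the example table names are made up, and in practice the table list could come from a query against information_schema.tables.

```python
def quote_identifier(name: str) -> str:
    """Quote a MySQL/MariaDB identifier with backticks."""
    return "`" + name.replace("`", "``") + "`"


def conversion_statements(db_name, table_names,
                          charset="utf8mb4",
                          collation="utf8mb4_unicode_ci"):
    """Yield one ALTER DATABASE statement, then one ALTER TABLE per table."""
    yield (f"ALTER DATABASE {quote_identifier(db_name)} "
           f"CHARACTER SET = {charset} COLLATE = {collation};")
    for table in table_names:
        yield (f"ALTER TABLE {quote_identifier(table)} "
               f"CONVERT TO CHARACTER SET {charset} COLLATE {collation};")


# Hypothetical table names, for illustration only.
for statement in conversion_statements("your_db_name", ["site_user", "site_post"]):
    print(statement)
```

Review the printed statements before running them, ideally against a restored copy of the backup first.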

Step 3: Update Django Settings

In your Django settings file, update the database options to reflect the new character set:

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'your_db_name',
        'USER': 'your_db_user',
        'PASSWORD': 'your_db_password',
        'HOST': 'your_db_host',  # Or an IP Address that your DB is hosted on
        'PORT': 'your_db_port',
        'OPTIONS': {
            'charset': 'utf8mb4',
            'init_command': "SET sql_mode='STRICT_TRANS_TABLES'",
        },
    }
}

Step 4: Verify Your Application’s Unicode Handling

Once the database character set has been updated, you'll need to verify that your application correctly handles Unicode strings throughout. Pay special attention to form processing, URL handling, and any place where string manipulation occurs.
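One way to approach that verification is a round-trip check: write a few awkward strings, read them back, and compare. The sketch below is deliberately storage-agnostic. The save and load callables are placeholders you'd replace with functions that hit your own models, and the sample strings are just ones I picked.

```python
SAMPLES = ("Grüße aus Köln", "hôtel", "日本語", "😀")


def failed_round_trips(save, load, samples=SAMPLES):
    """Return the samples that do not come back unchanged."""
    failures = []
    for index, text in enumerate(samples):
        key = f"sample-{index}"
        try:
            save(key, text)
            if load(key) != text:
                failures.append(text)
        except UnicodeError:
            # Covers both encode and decode errors raised along the way.
            failures.append(text)
    return failures


# Stand-in storage for illustration; a dict never mangles text.
store = {}
print(failed_round_trips(store.__setitem__, store.get))
```

An empty list means every sample survived the trip; anything else is a string worth chasing through the stack.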

Handling Legacy Data

If you have legacy data stored using the latin1 character set, you’ll need to convert it to utf8mb4. This can be a complex process, especially if there are already mis-encoded characters in the database. You might need to write custom scripts to clean up your data before converting the character set.
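As a starting point for such a cleanup script, here is a sketch that undoes one common kind of damage: UTF-8 bytes that were read as latin1 and saved again. The function name repair_mojibake is hypothetical, and it assumes that exact failure mode. Other kinds of corruption will need different handling.

```python
def repair_mojibake(text: str) -> str:
    """Undo UTF-8 text that was mistakenly decoded as latin1."""
    try:
        # Recover the original bytes, then decode them the way they were meant.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of damage; leave the value untouched.
        return text


# Simulate the damage rather than hard-coding a garbled literal.
broken = "Jürgen".encode("utf-8").decode("latin-1")
print(broken)
print(repair_mojibake(broken))
print(repair_mojibake("Jürgen"))
```

A string that is already correct normally fails the decode step and is returned as-is, but a value that legitimately contains a sequence like the garbled one would be "repaired" too, so run this on a copy and review the changes before writing anything back.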

Updating your MariaDB configuration to support Unicode in a Django project is a critical step towards modernizing your application and providing a truly international user experience. While the transition can be challenging, especially when dealing with legacy systems and data, the benefits of a fully Unicode-compliant application are immense. By following the steps outlined above, you can ensure that your Django project is well-positioned for the diverse needs of a global audience.

Remember, thorough testing and validation are key after making these changes. It's not just about updating settings; it's about ensuring that every aspect of your application can handle the full spectrum of Unicode characters without issues.