Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat!: upgrade mysql charset and collation to utf8mb4 #1065

Open
wants to merge 4 commits into
base: redwood
Choose a base branch
from

Conversation

Danyal-Faheem
Copy link
Contributor

@Danyal-Faheem Danyal-Faheem commented May 16, 2024

fixes #938.

Changes

  • Changes the default character set of MySQL to utf8mb4 and the default collation to utf8mb4_unicode_ci.
  • Also updated the dosqlshell command to use utf8mb4 as the default character set.
  • Added a new do command tutor local do change-charset-collation to allow users to upgrade the charset and collation of their tutor installations
    • Users can upgrade all the tables at once
    • Users can choose to upgrade only certain tables or apps
    • Users can also choose to upgrade all bar certain tables or apps

Testing

  • I have tested these changes on tutor dev and tutor K8s. I followed these steps:
    • Ran tutor [dev|k8s] launch
    • Imported the demo course and made sure it was working fine
    • Created admin and student users successfully
    • Created a new course with emojis in the title and in subsection titles and made sure changes were being published and reflected in the LMS

@Danyal-Faheem Danyal-Faheem force-pushed the danyalfaheem/upgrade-mysql-utf8mb4 branch from 47f01f9 to 438b312 Compare May 28, 2024 17:17
docs/local.rst Outdated
Comment on lines 146 to 157

Your database's charset and collation might not support specific characters or emojis. To upgrade your charset and collation of all the tables in your database to utf8mb4 and utf8mb4_unicode_ci respectively, run::

tutor local do change-charset-collation --all-tables --charset=utf8mb4 --collation=utf8mb4_unicode_ci

Alternatively, if you only want to upgrade certain tables or exclude certain tables, you can use the `status` option. To upgrade the `courseware_studentmodule` and `courseware_studentmodulehistory` tables, run::

tutor local do change-charset-collation --status=include courseware_studentmodule courseware_studentmodulehistory

Tutor performs pattern matching from the start of the table name so you can just enter the name of the app to include/exclude all the tables under that app. To upgrade all the tables in the database except the ones under the student and wiki apps, run::

tutor local do change-charset-collation --status=exclude student wiki
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should specify that the change is not reversible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a warning note at the top.

@Danyal-Faheem Danyal-Faheem marked this pull request as ready for review June 3, 2024 10:35
@@ -66,3 +66,55 @@ def get_mongo_upgrade_parameters(
mv '/openedx/data/ora2/SET-ME-PLEASE (ex. bucket-name)' /openedx/data/ora2/openedxuploads
fi
"""

def get_mysql_change_charset_query(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we should move this util to another file as we are not using this in upgrade anymore.

docs/local.rst Outdated

Your database's charset and collation might not support specific characters or emojis. To upgrade your charset and collation of all the tables in your database to utf8mb4 and utf8mb4_unicode_ci respectively, run::

tutor local do change-charset-collation --all-tables --charset=utf8mb4 --collation=utf8mb4_unicode_ci
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In which situation(s) will users have to run this command? Is it for users upgrading from Quince? Is it optional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, they can run it after upgrading from Quince. It is optional as tutor runs fine without it unless you add specific characters such as emojis that will throw DB errors.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That explanation should be present in the docs :)

docs/local.rst Outdated

tutor local do change-charset-collation --all-tables --charset=utf8mb4 --collation=utf8mb4_unicode_ci

Alternatively, if you only want to upgrade certain tables or exclude certain tables, you can use the `status` option. To upgrade the `courseware_studentmodule` and `courseware_studentmodulehistory` tables, run::
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All verbatim comments need double backticks. E.g: ``status``.

context_settings={"ignore_unknown_options": True},
)
@click.option("--all-tables", is_flag=True, help="change all tables in the openedx database")
@click.option("-s", "--status", type=click.Choice(['include', 'exclude'], case_sensitive=False), help="Whether to include or exclude certain tables/apps")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This combination of options is not very intuitive. It means that the end users needs to specify at least one option -- and that is not a great pattern, as options are considered... optional :)

Instead, I suggest that:

  1. all tables are process by default.
  2. If --include is specified, then only those tables are processed.
  3. If --exclude is specified, then those tables are not processed.

Note that with this approach, --include and --exclude are mutually compatible (--exclude would override --include).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all tables are process by default

I did not want to make this the default scenario in case someone accidentally executes the command while trying out the different options provided and they did not want to upgrade all the tables since this is a potentially irreversible change. Upgrading all tables at once can also take a long time.

Maybe we can prompt the user if they wish to proceed?

Note that with this approach, --include and --exclude are mutually compatible (--exclude would override --include).

I think this can be achieved and sounds more plausible. I'll update the options.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can prompt the user if they wish to proceed?

Yes, this sounds perfect. Please make sure to include a -I/--non-interactive option for headless scripts.

@click.command(
short_help="Upgrade mysql to a specific charset and collation",
help=(
"Upgrade mysql to a specific charset and collation. You can either upgrade all tables, specify only certain tables to upgrade or specify certain tables to exclude from the upgrade process"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This docstring is misleading. It implies that it allows the user to upgrade to just any charset/collation. Instead, I suggest that we rename this command to something like convert-mysql-utf8mb4-charset.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original command was meant to allow users to upgrade to any charset/collation, however that caused issues with the connection string. However, since we have removed that option, I think the convert-mysql-utf8mb4-charset makes more sense. I'll update it.


SET FOREIGN_KEY_CHECKS = 0;
Select _table_name;
SET @statement = CONCAT('ALTER TABLE ', _table_name, ' CONVERT TO CHARACTER SET {charset} COLLATE {collation};');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this change columns that currently use utf8mb4_bin collation to be utf8mb4_unicode_ci? If so, that's a problem. The relatively small number of columns that use utf8mb4_bin are usually doing so because they explicitly want the values to be case sensitive, because case insensitivity on things like course keys has given us a lot of grief in the past.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify that last statement, an example was that MongoDB and the Python layer of CourseKeys are case sensitive, but the key stored in our CourseOverview model is not case sensitive. So at one point we had two course runs in production that had keys that differed only by case (because someone had misspelled when doing the initial creation), and updates to one could clobber the values for the other because they were fighting over the same row in CourseOverview for updates. (The inactive course run was still getting automated push updates from an external system.)

I realize that fixing CourseOverview is well outside the scope of this work, and I'm not trying to suggest we tackle that here. I only put it out there as an example of why apps might want to explicitly specify a sensitive field for storing things like identifier keys.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this change columns that currently use utf8mb4_bin collation to be utf8mb4_unicode_ci?

It seems this is the case. I think we can add a condition that columns with the utf8mb4_bin collation should not be upgraded. Any other specific charsets or collations that we should look out for here?

Or we can just upgrade all the columns that are using utf8mb3_general_ci so that if some other column has their charset/collation fixed by the model in the future, we don't have to update the check for them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on your findings, it seems like the only forced collations that are not utf8mb4_unicode_ci are:

  • utf8mb3_bin: This should be mapped into utf8mb4_bin, since that's the logically equivalent case-sensitive collation.
  • utf8mb4_bin: This should be preserved as-is, as discussed.
  • utf8mb4_general_ci: This would be mapped to utf8mb4_unicode_ci, but it looks like it only exists for the bundles app, and that app is unused in Redwood and removed entirely just after the Redwood cut, so it's fine to ignore or update it, depending on what's more convenient in the code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the command to perform the following mappings instead:

  • utf8mb3_general_ci to utf8mb4_unicode_ci
  • utf8mb3_* to utf8mb4_* meaning that the utf8mb3_bin collations are converted to utf8mb4_bin now.

utf8mb4_bin: This should be preserved as-is, as discussed.

We ignore any columns that already have the utf8mb4 charset.

utf8mb4_general_ci: This would be mapped to utf8mb4_unicode_ci, but it looks like it only exists for the bundles app, and that app is unused in Redwood and removed entirely just after the Redwood cut, so it's fine to ignore or update it, depending on what's more convenient in the code.

Yup, trying it out on the nightly release, the bundles app was not there. Therefore, I've ignored it.

LEAVE tables_loop;
END IF;

SET FOREIGN_KEY_CHECKS = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this cause any issue when trying to do a partial migration that includes the course_overviews_courseoverview table? That's the only table I'm aware of that has a varchar primary key and foreign keys from other tables.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In review
Development

Successfully merging this pull request may close these issues.

None yet

4 participants