Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Characters in the \x80 - \x9f range not returning as expected #390

Closed
sanjuroj opened this issue Oct 24, 2019 · 2 comments · Fixed by #398
Closed

Characters in the \x80 - \x9f range not returning as expected #390

sanjuroj opened this issue Oct 24, 2019 · 2 comments · Fixed by #398

Comments

@sanjuroj
Copy link

I’m trying to retrieve data from a MySQL v5.1 database in which most of the character settings are latin1. The trouble is with characters in the \x80 - \x9f range, the ones that are different between canonical latin1 (ISO-8859-1) and MySQL’s latin1 (cp1252). I’m using mysqlclient as part of a Django application but I’ve tried to replicate the problem with a more direct example below. The replication isn’t perfect, the Django application throws an error: "'utf-8' codec can't decode byte 0x92 in position 5: invalid start byte" when there is no charset option set. It prints "Women . s" (i.e. whitespace in place of the apostrophe) when the charset = 'latin1' option is provided. However, I hope the example below is close enough that the solution will work for Django as well.

In the example below, the word "Women’s", with the apostrophe as \x92, is being stored in a MySQL v5.7 database.

First, set up a local MySQL v5.7 server so it's similar to the MySQL v5.1 server that we're trying to hit.

mysql> CREATE DATABASE testchar;
mysql> USE testchar;
mysql> CREATE TABLE testchar (id INTEGER PRIMARY KEY NOT NULL AUTO_INCREMENT, data_value VARCHAR(80) CHARACTER SET latin1 ) ENGINE=MYISAM DEFAULT CHARSET=latin1;
mysql> INSERT INTO testchar(data_value) VALUES (UNHEX("576F6D656E9273"));

It looks ok with a CONVERT statement but not with a SELECT statement

mysql> SELECT CONVERT (UNHEX("576F6D656E9273") USING latin1);
+------------------------------------------------+
| CONVERT (UNHEX("576F6D656E9273") USING latin1) |
+------------------------------------------------+
| Women’s                                        |
+------------------------------------------------+

mysql> SELECT UNHEX("576F6D656E9273");
+-------------------------+
| UNHEX("576F6D656E9273") |
+-------------------------+
| Women?s                  |
+-------------------------+

Now set up the python environment

$ virtualenv -p python3.7 --no-site-packages venv
$ source venv/bin/activate
$ pip install mysqlclient

$ python
>>> import MySQLdb
>>> db = MySQLdb.connect(host="127.0.0.1", db="testchar", user="user", password="password")
>>> c = db.cursor()
>>> c.execute("show variables like 'char%';")
>>> c.fetchall()

Should see lots of utf-8 settings. In order to simulate what is shown when this is done on the actual v5.1 server, need to set names.

>>> c.execute("set names latin1")
>>> c.execute("show variables like 'char%';")
>>> c.fetchall()

Should see the connection, result, and client values change to latin1. This is close to what the v5.1 server is set to.

Now try the query

>>> c.execute("select * from testchar")
>>> c.fetchall()
((1, 'Women\x92s'),)

The same thing happens if I pass charset="latin1" to the db.connect(... statement.

Is there something I’m doing wrong in making this connection, or some setting I need to add to make this work as expected?

I did try posting to Stack Overflow, and got a response, but the suggestions didn’t fix the problem. Any suggestions you can provide would be great, thanks.

@methane
Copy link
Member

methane commented Oct 24, 2019

>>> c.execute("set names latin1")

You must not do this. It changes only server side variables.
In worst case, this cause charset mismatch between server and client, and cause SQL-injection attack.

I strongly recommend you to convert latin1 to utf8. I don't test charset other than utf8 well.

For workaround, I think you can do this:

conn = MySQLdb.connect(..., charset="latin1")
conn.encoding = "cp1252"

But note that this is just a workaround and may be not work in the future versions.

@sanjuroj
Copy link
Author

Ok, thanks for the info.

Using the SET NAMES was just a way to make the session variables in the connection to the test database match what we're seeing on the live database. We're not using SET NAMES in our live connection.

And unfortunately, I don't have control over the database we're accessing, and it's not clear that it will be upgraded. My fallback option is to schedule a nightly dump-and-convert to another DB, or a convoluted hack using Python's bytearray. I'll take a look at your suggestion for setting conn.encoding but I suspect Django doesn't give me an option to do that.

Thanks again for your help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants