Ill quotes

I spent a little bit of unpleasant time yesterday debugging what I thought was a quoting issue with the Rails runner script. Here is what happened.

The mystery

I wrote a script to populate my development database with some static data — the textual content of a web application. This textual content happens to have quote characters (’) embedded. The script instantiates and saves a number of model objects containing the text.

To make sure that the quotes won’t give me any problems, I issued these on the Rails console, which turned out fine:

Thing.create(:title => "I'm good", ...)
Thing.create(:title => "I'm bad", ...)

That is, the title column in the database appeared to have gone through the proper quoting or escaping — done automatically by ActiveRecord, as expected.

However, when I ran the script through the runner, all my quotes appeared as question marks (?) in the database. Could there be a bug in the runner? Do I need to escape the quotes manually? What’s going on?

Emacs to the rescue

As with all good stories with happy endings, the real problem turned out to be, let’s say, stupid.

You see, the script contains a lot more text than what I typed on the Rails console. Remember, this text populates sections of a website, so it runs longer than a handful of words. As a matter of fact, I cut and pasted it from an existing Microsoft Word document. By now, you can probably tell where I’m going with this…

Well, take a look at the content of this file, and see if you can spot the problem right away:

ill_quotes.txt

If you have Emacs, open the file, move the cursor to the first instance of the quote character, and press Ctrl-x = (what-cursor-position), which reports the character code at the cursor. Remember the result, and then do the same thing with the second instance.

Surprised? I was, too.
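
If Emacs is not at hand, roughly the same check can be sketched in Ruby. This is only an illustration: the helper name non_ascii_bytes and the sample string are my own assumptions, not part of the original script. It lists every byte above the 7-bit ASCII range along with its offset.

```ruby
# Hypothetical helper: report each byte outside 7-bit ASCII with its offset.
def non_ascii_bytes(data)
  data.each_byte.with_index
      .select { |byte, _offset| byte > 0x7F }
      .map { |byte, offset| [offset, format("0x%02X", byte)] }
end

# Made-up sample: one typed apostrophe and one byte pasted from Word.
# String#b returns a binary (ASCII-8BIT) copy, so no encoding is assumed.
sample = "I'm good, I\x92m bad".b

non_ascii_bytes(sample).each do |offset, hex|
  puts "offset #{offset}: #{hex}"
end
```

For a real file, something like non_ascii_bytes(File.binread("ill_quotes.txt")) should work the same way, since File.binread returns the raw bytes without guessing an encoding.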

Will the real character please stand up?

The first quote character in the file is hexadecimal ‘0x27’, the ASCII representation of the quote character, while the second one is hexadecimal ‘0x92’, the Windows-1252 (CP1252) representation of the curved quote character.
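
To make the difference concrete, here is a minimal Ruby sketch that converts the Word byte into a proper UTF-8 curved quote. The helper names cp1252_to_utf8 and hex_bytes are hypothetical, and the sketch assumes the pasted text really is Windows-1252; it is one possible cleanup, not necessarily what the script needs in every case.

```ruby
# Hypothetical helper: treat raw bytes as Windows-1252 and transcode to UTF-8.
def cp1252_to_utf8(raw)
  raw.dup.force_encoding("Windows-1252").encode("UTF-8")
end

# Hypothetical helper: show a string's bytes as space-separated hex.
def hex_bytes(str)
  str.bytes.map { |b| format("%02X", b) }.join(" ")
end

straight = "'"                             # typed on the keyboard
curly    = cp1252_to_utf8([0x92].pack("C")) # the byte pasted from Word

puts hex_bytes(straight)          # 27
puts hex_bytes(curly)             # E2 80 99
puts format("U+%04X", curly.ord)  # U+2019
```

Running pasted text through a helper like this before handing it to Thing.create would store a real right single quotation mark (U+2019), provided the database column and connection are set up for UTF-8.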

Unstable and Bitchy

This is what Wikipedia says about this odd mismatch:

It is very common to mislabel text data with the charset label ISO-8859-1, even though the data is really Windows-1252 encoded. In Windows-1252, codes between 0x80 and 0x9F are used for letters and punctuation, whereas they are control codes in ISO-8859-1. Many web browsers and e-mail clients will interpret ISO-8859-1 control codes as Windows-1252 characters in order to accommodate such mislabeling.
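
The mislabeling described in that passage is easy to reproduce. In this small Ruby sketch (decode_as is a hypothetical helper name), the same byte is decoded once under each label, and only the Windows-1252 reading yields a printable quote.

```ruby
# Hypothetical helper: decode raw bytes under a given charset label,
# then transcode the result to UTF-8.
def decode_as(raw, label)
  raw.dup.force_encoding(label).encode("UTF-8")
end

byte = [0x92].pack("C")  # the byte in question

latin1 = decode_as(byte, "ISO-8859-1")
cp1252 = decode_as(byte, "Windows-1252")

puts format("ISO-8859-1:   U+%04X", latin1.ord)  # U+0092, a control code
puts format("Windows-1252: U+%04X", cp1252.ord)  # U+2019, the curved quote
```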

Dang it, Dale. I should have known better than to trust Windows with supplying my characters…
