Thursday, January 10, 2008

Debugging Unicode Problems

This page describes what to do in a very specific situation. Namely, you've got some character data in one place (typically a database) which has to go through various steps and then ends up being shown to the user (often on a web page). Unfortunately, some characters aren't being displayed correctly. Due to the many steps involved, the problem can occur in various places. This page aims to help you find out what's wrong simply and reliably.

Step 1: Understand the basics of Unicode
If you feel comfortable with Unicode, character encodings, etc., feel free to skip this step. Basically, you need to know a little bit about what characters are and what conversions are likely to be applied to them before going much further. See my article on the subject (and the articles it references) for more information.

Step 2: Try to identify the possible conversions involved
If you can work out which conversions might be going wrong, it's much easier to isolate the one that actually is. Bear in mind not just how you're retrieving the data, but also how the data got there in the first place. (Some problems I've seen have been due to an old application writing to and reading from the database in an incorrect way, but with the bugs cancelling each other out. No problems occur while only this broken application accesses the database, but things go wrong as soon as anything else does.) Steps involved may well include fetching the data from the database, reading it from a file, sending it across a web connection, or displaying it on the screen.
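
To make the "cancelling bugs" scenario concrete, here is a minimal sketch of one way it can happen. It assumes the broken application treats UTF-8 bytes as ISO-8859-1 on the way in and makes the mirror-image mistake on the way out; the class and method names are hypothetical, and your own situation may involve different encodings entirely.

```csharp
using System;
using System.Text;

public static class CancellingBugsDemo
{
    // Hypothetical helper: formats each UTF-16 code unit as four hex digits.
    public static string ToHex(string value)
    {
        var sb = new StringBuilder();
        foreach (char c in value)
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append(((int)c).ToString("x4"));
        }
        return sb.ToString();
    }

    // Assumed first bug: UTF-8 bytes are decoded as if they were ISO-8859-1.
    public static string MisreadUtf8AsLatin1(string original)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(original);
        return Encoding.GetEncoding("iso-8859-1").GetString(utf8);
    }

    // Assumed second bug: the garbled text is encoded as ISO-8859-1 and the
    // resulting bytes are decoded as UTF-8, which happens to undo the first bug.
    public static string UndoMisread(string garbled)
    {
        byte[] latin1 = Encoding.GetEncoding("iso-8859-1").GetBytes(garbled);
        return Encoding.UTF8.GetString(latin1);
    }
}
```

With this sketch, ToHex(MisreadUtf8AsLatin1("caf\u00e9")) gives "0063 0061 0066 00c3 00a9": the single character U+00E9 has become two. Passing that garbled string through UndoMisread restores the original, which is why the broken application never notices, while any correctly written program sees the two-character version.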

Step 3: Verify the data at each step
The first lesson here is not to trust anything which tries to log the character data as a sequence of glyphs. Instead, you should log the character data as a sequence of Unicode values (integers). For instance, if I had a string containing the word "hello", I would display it as "0068 0065 006c 006c 006f". (Using hex makes it easier to check values against the Unicode code charts later.) To achieve this, step through each character in the string and display the character however you would display an integer. For instance, here is a C# method to dump all the characters in a string to the console (strictly speaking, each char in a C# string is a UTF-16 code unit):

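// Requires "using System;" at the top of the file.
// Writes each UTF-16 code unit of value as four hex digits, e.g. "0068 0065 ".
static void DumpString (string value)
{
    foreach (char c in value)
    {
        Console.Write ("{0:x4} ", (int)c);
    }
    Console.WriteLine();
}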



Depending on your exact environment, your method of logging will vary, but using something like the above should give you what you need.
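
One variation worth considering: a character outside the Basic Multilingual Plane is stored in a .NET string as two char values (a surrogate pair), so a per-char dump shows two numbers which you can't look up directly in the code charts. The following is a sketch of a dump which combines surrogate pairs into a single code point. The class and method names are my own invention, and it returns a string rather than writing to the console so that it can be passed to whatever logging you use.

```csharp
using System;
using System.Text;

public static class CodePointDump
{
    // Returns the code points of value as hex, separated by spaces.
    // Valid surrogate pairs are combined; anything else is logged as-is.
    public static string DumpCodePoints(string value)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < value.Length; i++)
        {
            int codePoint;
            if (char.IsHighSurrogate(value[i])
                && i + 1 < value.Length
                && char.IsLowSurrogate(value[i + 1]))
            {
                codePoint = char.ConvertToUtf32(value[i], value[i + 1]);
                i++; // skip the low surrogate we have just consumed
            }
            else
            {
                // BMP character, or an unpaired surrogate (itself a sign of corruption)
                codePoint = value[i];
            }
            if (sb.Length > 0) sb.Append(' ');
            sb.Append(codePoint.ToString("x4"));
        }
        return sb.ToString();
    }
}
```

For example, DumpCodePoints("hi\ud83d\ude00") returns "0068 0069 1f600". An unpaired surrogate is deliberately left in the output rather than hidden, as it usually means a string has been split or truncated in the wrong place.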

The reason for doing this is that it gets rid of problems with fonts, other encoding issues, etc. If you can't log even plain ASCII hex digits properly, you're in a world of trouble anyway - but you may well not be able to log Unicode in a reliable way, and as you already know you've got some problems on the Unicode front, it's worth being safe.

Now you need to make sure there's a test case to use. Find some (preferably small) example of where your application is failing, make sure you know exactly what the result should be, and then log the actual result at each of your possible problem points. (Some may be out of your control, but usually if you log as soon as you receive some data and just before you send some data, you'll find the problem.)
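
At some of those points (just after reading from a socket or file, or just before writing to one) the data may still be raw bytes rather than a string. In that case it can help to log the bytes themselves in hex, before any decoding is applied. This is only a sketch, with a hypothetical class and method name, built on BitConverter.ToString:

```csharp
using System;

public static class ByteDump
{
    // Formats raw bytes as lowercase two-digit hex values separated by spaces.
    public static string DumpBytes(byte[] data)
    {
        // BitConverter.ToString produces "C3-A9"; reshape it to "c3 a9".
        return BitConverter.ToString(data).Replace('-', ' ').ToLowerInvariant();
    }
}
```

As an illustration, the character U+00E9 encoded as UTF-8 is logged as "c3 a9", whereas the same character encoded as ISO-8859-1 is logged as "e9". Seeing which of those actually arrives tells you which encoding the sender used, whatever it claims to be using.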

Having logged a problematic string, you should verify whether or not it's what it should be. This is where the Unicode code charts page comes in. You can either pick which block you believe the correct character is in, or you can search for your character alphabetically. Check that each character in the string has its proper Unicode value. As soon as you find a point in your application flow where the character data is corrupted, you should investigate that area of the code, find out why it's being corrupted and fix it. When you've got it right throughout the application flow, the application should be working properly.
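
If you already know exactly what the string should be, you can let the code do some of the checking for you. Here is a sketch of a helper (the name is hypothetical) which reports where the actual string first differs from the expected one, so you know which logged value to look up first:

```csharp
using System;

public static class StringCheck
{
    // Returns the index of the first UTF-16 code unit at which the strings
    // differ, or -1 if they are identical. If one string is a prefix of the
    // other, the index is the length of the shorter string.
    public static int FirstMismatch(string expected, string actual)
    {
        int shared = Math.Min(expected.Length, actual.Length);
        for (int i = 0; i < shared; i++)
        {
            if (expected[i] != actual[i])
            {
                return i;
            }
        }
        return expected.Length == actual.Length ? -1 : shared;
    }
}
```

For instance, FirstMismatch("caf\u00e9", "caf\u00c3\u00a9") returns 3, pointing straight at the character which has been corrupted. Running the same check at each logging point in turn shows the first step at which the result stops being -1.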
