Tin Isles : Fixing Mojibake / borked Unicode text

Fixing Mojibake / borked Unicode text

Here’s a problem I’ve recently hit at work. I hit the Google geocode API with a couple of thousand addresses and stored the JSON results. I used the WebClient DownloadString method and didn’t think to set the encoding. Browsing through the output files I see “DorfstraÃƒÅ¸e” where I’m expecting to see “Dorfstraße”. Urgh, mojibake!

Here’s my understanding of what went wrong:

Google’s API serves the UTF8 bytes for “Dorfstraße”: 0x44, 0x6f, 0x72, 0x66, 0x73, 0x74, 0x72, 0x61, 0xc3, 0x9f, 0x65
WebClient’s Encoding property defaults to the system encoding, in my case Windows-1252. If I had set this property to UTF8 I wouldn’t have these mojibake files.
In UTF8 the ß character is represented by the bytes 0xc3, 0x9f
WebClient interprets these bytes as Windows-1252 and gets: “DorfstraÃŸe”
Saving this with File.WriteAllText “uses UTF-8 encoding without a Byte-Order Mark”. This turns “DorfstraÃŸe” into the bytes: 0x44, 0x6f, 0x72, 0x66, 0x73, 0x74, 0x72, 0x61, 0xc3, 0x83, 0xc5, 0xb8, 0x65
Open the file, which has no BOM, in a viewer that assumes Windows-1252 and you see: “DorfstraÃƒÅ¸e”
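
The chain of events above can be replayed on that one word. This is only a sketch in Python rather than the .NET calls the post describes: "cp1252" is Python's name for Windows-1252, the variable names are made up for illustration, and the non-ASCII characters are written as escapes so the snippet itself cannot be mangled.

```python
# Replay the corruption for the single word from the post.
original = "Dorfstra\u00dfe"  # Dorfstrasse spelled with U+00DF (sharp s)

# 1. The API sends UTF-8 bytes; the sharp s becomes 0xc3 0x9f.
wire_bytes = original.encode("utf-8")

# 2. The client wrongly decodes those bytes as Windows-1252,
#    producing U+00C3 followed by U+0178 in place of the sharp s.
misread = wire_bytes.decode("cp1252")

# 3. The misread string is written to disk as UTF-8 without a BOM.
file_bytes = misread.encode("utf-8")

# 4. A viewer that also assumes Windows-1252 adds a second layer of damage.
displayed = file_bytes.decode("cp1252")

print(wire_bytes.hex(" "))  # 44 6f 72 66 73 74 72 61 c3 9f 65
print(file_bytes.hex(" "))  # 44 6f 72 66 73 74 72 61 c3 83 c5 b8 65
print(len(original), len(misread), len(displayed))  # 10 11 13
```

Each wrong decode turns one two-byte character into two characters, which is why the string grows from 10 to 11 to 13 characters.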

The files are already on my disk, so how do I fix them? Here’s the plan to put things in reverse:

read in the file bytes and interpret as UTF8 so we are back to “DorfstraÃŸe”
get the Western European (Windows) bytes for that string
interpret those bytes as UTF8

…the code:
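
What follows is a minimal sketch of those three steps in Python, not the post's own .NET listing. The function name `fix_mojibake` and the sample bytes (taken from the byte list earlier in the post) are illustrative assumptions, and "cp1252" is Python's name for the Western European (Windows) encoding.

```python
def fix_mojibake(file_bytes: bytes) -> str:
    """Undo one UTF-8 -> Windows-1252 -> UTF-8 round of damage."""
    # 1. Interpret the file bytes as UTF-8 to get the misread string back.
    misread = file_bytes.decode("utf-8")
    # 2. Get the Windows-1252 bytes for that string; these are the
    #    bytes the server originally sent.
    wire_bytes = misread.encode("cp1252")
    # 3. Interpret those bytes as UTF-8, as should have happened at first.
    return wire_bytes.decode("utf-8")


# The damaged bytes for the example word, as stored on disk.
damaged = bytes([0x44, 0x6f, 0x72, 0x66, 0x73, 0x74, 0x72, 0x61,
                 0xc3, 0x83, 0xc5, 0xb8, 0x65])

repaired = fix_mojibake(damaged)
print(repaired == "Dorfstra\u00dfe")  # True
```

Both the encode and the final decode are strict, so text that was not damaged in exactly this way raises a UnicodeError instead of being silently rewritten. One caveat of this sketch: Python's cp1252 codec leaves bytes 0x81, 0x8d, 0x8f, 0x90 and 0x9d undefined, so characters whose UTF-8 form contains one of those bytes would need separate handling.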

Posted by russ on Thursday, September 6, 2012, at 12:09 pm. Filed under Uncategorized.
