Fixing Mojibake / borked Unicode text

Here’s a problem I recently hit at work. I called the Google geocode API with a couple of thousand addresses and stored the JSON results. I used the WebClient DownloadString method and didn’t think to set anything for the encoding. Browsing through the output files I see “DorfstraÃŸe” where I’m expecting to see “Dorfstraße”. Urgh, mojibake!

Here’s my understanding of what went wrong:

  • Google’s API serves the UTF8 bytes for “Dorfstraße”: 0x44, 0x6f, 0x72, 0x66, 0x73, 0x74, 0x72, 0x61, 0xc3, 0x9f, 0x65
  • WebClient’s Encoding property defaults to the system encoding. In my case, Windows-1252. If I had set this property to UTF8 I wouldn’t have these mojibake files.
  • In UTF8 the ß character is represented with the bytes 0xc3, 0x9f
  • WebClient interprets these bytes as Windows-1252 and gets: “DorfstraÃŸe” (0xc3 is “Ã” and 0x9f is “Ÿ” in Windows-1252)
  • Saving this with File.WriteAllText, which “uses UTF-8 encoding without a Byte-Order Mark”, turns “DorfstraÃŸe” into the bytes: 0x44, 0x6f, 0x72, 0x66, 0x73, 0x74, 0x72, 0x61, 0xc3, 0x83, 0xc5, 0xb8, 0x65
  • Open the file (which has no BOM) as UTF8 and you see: “DorfstraÃŸe”
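
To check that chain of events, here is a minimal sketch that replays it. It is written in Python purely for illustration (the downloads themselves went through WebClient), and the variable names are my own:

```python
original = "Dorfstraße"

# The API sends the UTF-8 bytes for the string.
utf8_bytes = original.encode("utf-8")

# The client wrongly decodes those bytes as Windows-1252.
garbled = utf8_bytes.decode("cp1252")

# The garbled string is then saved to disk as UTF-8.
on_disk = garbled.encode("utf-8")

print(utf8_bytes.hex(" "))  # 44 6f 72 66 73 74 72 61 c3 9f 65
print(garbled)              # DorfstraÃŸe
print(on_disk.hex(" "))     # 44 6f 72 66 73 74 72 61 c3 83 c5 b8 65
```

The two bytes for ß become four bytes on disk, which matches the byte sequences listed above.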

I’ve got the files on my disk. How would I fix them? Here’s the plan to put things in reverse:

  1. read in the file bytes and interpret as UTF8 so we are back to “DorfstraÃŸe”
  2. get the Western European (Windows) bytes for that string
  3. interpret those bytes as UTF8

…the code:
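
A minimal sketch of those three steps, written in Python for illustration rather than with the .NET calls mentioned above. The function names `fix_mojibake` and `fix_file` are hypothetical, and the sketch assumes each file went through exactly one round of the damage described:

```python
from pathlib import Path


def fix_mojibake(text: str) -> str:
    """Undo one round of UTF-8 bytes having been read as Windows-1252."""
    # Step 2: recover the bytes that the Windows-1252 decoder consumed.
    raw = text.encode("cp1252")
    # Step 3: interpret those bytes as the UTF-8 they always were.
    return raw.decode("utf-8")


def fix_file(path: Path) -> None:
    """Repair a single file in place (assumes it is UTF-8 on disk)."""
    # Step 1: read the file as UTF-8, which yields the garbled string.
    garbled = path.read_text(encoding="utf-8")
    path.write_text(fix_mojibake(garbled), encoding="utf-8")


print(fix_mojibake("DorfstraÃŸe"))  # Dorfstraße
```

One caveat with this sketch: Python’s `cp1252` codec raises an error for the five byte values that Windows-1252 leaves undefined (0x81, 0x8d, 0x8f, 0x90, 0x9d), so text whose UTF-8 bytes included any of those would need extra handling. Work on a copy of the files in any case.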
