Here’s a problem I recently hit at work. I sent a couple of thousand addresses to the Google geocode API and stored the JSON results. I used the WebClient DownloadString method and didn’t think to set the encoding. Browsing through the output files, I see “DorfstraÃƒÅ¸e” where I’m expecting “Dorfstraße”. Urgh, mojibake!
Here’s my understanding of what went wrong:
- Google’s API serves down the UTF8 bytes for “Dorfstraße”: 0x44, 0x6f, 0x72, 0x66, 0x73, 0x74, 0x72, 0x61, 0xc3, 0x9f, 0x65
- WebClient’s Encoding property defaults to the system default encoding, in my case Windows-1252. If I had set this property to UTF8 I wouldn’t have these mojibake files.
- In UTF8 the ß character is represented with the bytes 0xc3, 0x9f
- WebClient interprets these bytes as Windows-1252 (0xc3 is “Ã”, 0x9f is “Ÿ”) and gets: “DorfstraÃŸe”
- Saving this with File.WriteAllText “uses UTF-8 encoding without a Byte-Order Mark”. This turns “DorfstraÃŸe” into the bytes: 0x44, 0x6f, 0x72, 0x66, 0x73, 0x74, 0x72, 0x61, 0xc3, 0x83, 0xc5, 0xb8, 0x65
- Open the file in a viewer that assumes Windows-1252 (there is no BOM to tell it otherwise) and you see: “DorfstraÃƒÅ¸e”
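The chain above can be checked at the byte level. Here is a minimal sketch in Python rather than the .NET calls I actually used; it only mimics the encode and decode steps, and the variable names are made up for illustration.

```python
# Reproduce the corruption step by step (a sketch of the byte-level
# behaviour only, not the actual WebClient / File.WriteAllText calls).
original = "Dorfstraße"

# 1. The API serves UTF-8 bytes.
wire_bytes = original.encode("utf-8")

# 2. The client decodes them with the wrong encoding (Windows-1252).
misread = wire_bytes.decode("cp1252")

# 3. The misread string is saved to disk as UTF-8 without a BOM.
file_bytes = misread.encode("utf-8")

# 4. A viewer that assumes Windows-1252 adds a second layer of mojibake.
displayed = file_bytes.decode("cp1252")

print(wire_bytes.hex(" "))  # 44 6f 72 66 73 74 72 61 c3 9f 65
print(misread)              # DorfstraÃŸe
print(file_bytes.hex(" "))  # 44 6f 72 66 73 74 72 61 c3 83 c5 b8 65
print(displayed)            # DorfstraÃƒÅ¸e
```

The two hex dumps match the byte lists in the bullets, which suggests the explanation holds together.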
I’ve got the files on my disk. How would I fix them? Here’s the plan to put things in reverse:
- read in the file bytes and interpret as UTF8 so we are back to “DorfstraÃŸe”
- get the Western European (Windows) bytes, i.e. the Windows-1252 bytes, for that string
- interpret those bytes as UTF8
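The three steps can be sketched as follows, again in Python as a stand-in for the .NET code. The `repair` function and the `damaged` sample are hypothetical names for illustration, and the sample bytes are the ones from the saved file described earlier.

```python
def repair(file_bytes: bytes) -> str:
    """Undo one round of 'UTF-8 read as Windows-1252, then saved as UTF-8'."""
    # Step 1: read the file bytes as UTF-8, giving the misread string.
    misread = file_bytes.decode("utf-8")
    # Step 2: get the Windows-1252 bytes for that string. These should be
    # the bytes the API originally sent.
    wire_bytes = misread.encode("cp1252")
    # Step 3: interpret those bytes as UTF-8, as should have happened
    # in the first place.
    return wire_bytes.decode("utf-8")


# Bytes of the damaged "Dorfstraße" as they sit on disk.
damaged = bytes.fromhex("446f726673747261c383c5b865")
print(repair(damaged))  # Dorfstraße
```

Plain ASCII text passes through unchanged, so the same function can be run over a whole JSON file. One caveat, as far as I can tell: Python’s cp1252 codec rejects the five byte values that Windows-1252 leaves unassigned (0x81, 0x8d, 0x8f, 0x90, 0x9d), while .NET maps them to control characters, so a character like “Á” (UTF-8 0xc3, 0x81) may need special handling in this sketch.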