

"Isn't the demoroniser fix mentioned above intended for people generating web pages, and not for processing incoming web pages?" Yes, since that is where the mismatch is created. It's nice to be able to point site owners to it though. Garbage chars might also happen whenever the page's stated encoding (in the Content-Type meta tag at the top of the page) doesn't match the actual encoding. (fyi, Dan's weblog says it's charset=utf-8, which seems to be standard.)

The single Unicode character U+2019 has become *three* Unicode characters U+00E2 U+0080 U+0099, which are rendering for me as â, € and ™. Interestingly, U+0080 is *not* the Euro symbol (that's U+20AC), but in Windows-1252 encoding, 0x80 is the Euro symbol. (Some weird mapping is going on there in Safari to come up with that result.) Re-encoding those three characters gives 6 bytes of UTF-8, which is exactly what appears in the source of the HTML in that page. Those bytes are also three legitimate Unicode characters expressed in UTF-8. This is why you see three garbage characters on that page.
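To check the six bytes and the Windows-1252 mapping mentioned above, here is a minimal Python 3 sketch. The variable names are my own, and it assumes the browser treats the bytes 0x80 to 0x9F as Windows-1252, which would explain the € and ™ but is a guess rather than something confirmed for Safari.

```python
# The three characters the single quote turned into.
mangled = "\u00e2\u0080\u0099"

# Re-encoding them as UTF-8 takes two bytes per character: six in total.
six_bytes = mangled.encode("utf-8")
print(six_bytes)       # b'\xc3\xa2\xc2\x80\xc2\x99'
print(len(six_bytes))  # 6

# In Windows-1252 the byte 0x80 is the Euro sign (U+20AC) and 0x99 is the
# trade mark sign (U+2122). In ISO 8859-1 both are invisible C1 controls.
euro = b"\x80".decode("cp1252")
print(f"U+{ord(euro):04X}")  # U+20AC

# Assumption: a Windows-1252 reading of the original three UTF-8 bytes
# gives the a-circumflex, Euro, trade-mark sequence seen on the page.
as_cp1252 = b"\xe2\x80\x99".decode("cp1252")
print([f"U+{ord(c):04X}" for c in as_cp1252])
```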
(Now we've incorrectly interpreted each byte of the UTF-8 string as a separate ISO 8859-1 character when converting to Unicode. This happens a lot when lazy programmers assume that any string they come across must be ISO 8859-1, since it, or strictly speaking its close relative Windows-1252, is the default on English Windows.)
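The misinterpretation step can be reproduced in a couple of lines. This is a Python 3 sketch with variable names of my own choosing, not the original session:

```python
# The correct UTF-8 encoding of U+2019.
utf8_bytes = b"\xe2\x80\x99"

# The mistake: treating each byte as one ISO 8859-1 character.
# ISO 8859-1 maps every byte value straight to the code point with the
# same number, so this never raises an error, it just gives wrong text.
wrong = utf8_bytes.decode("iso-8859-1")

code_points = [f"U+{ord(c):04X}" for c in wrong]
print(code_points)  # ['U+00E2', 'U+0080', 'U+0099']
```

Because the decode cannot fail, nothing flags the problem at the point where it is introduced.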

(note how the Unicode character U+2019 is correctly represented as the three bytes 0xe2 0x80 0x99 in UTF-8)
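For reference, the correct encoding step looks like this in Python 3 (a sketch with my own variable names; the original session is not preserved here):

```python
# U+2019, the right single quotation mark.
quote = "\u2019"

# Encoding it as UTF-8 produces a three-byte sequence.
encoded = quote.encode("utf-8")
print(encoded)       # b'\xe2\x80\x99'
print(len(encoded))  # 3
```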

This is a really common problem I see nowadays, which results from programmers assuming that any string they input into their program is encoded in ISO 8859-1 when it's actually sometimes UTF-8. What's happened here is that each right single quote character (U+2019 in Unicode) was first encoded in UTF-8. Then *each byte* of the multibyte UTF-8 sequence was accidentally misinterpreted as a *separate* single-byte character encoded in ISO 8859-1. Finally, the incorrect characters were re-encoded in UTF-8.

Here's exactly what happened for the ’ (U+2019, the right single quotation mark in "We’d like."), which is \u2019 in Python syntax. I worked through it by opening Terminal on my Mac and running python.
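The three steps described above can be strung together in a short Python 3 sketch. The helper names `mangle` and `repair` are hypothetical, invented for this illustration, and `repair` only undoes this one specific ISO 8859-1 mix-up; it is not a general fix for every mismatched encoding.

```python
def mangle(text: str) -> str:
    """Encode as UTF-8, then wrongly read each byte as ISO 8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


def repair(text: str) -> str:
    """Undo mangle(): turn the characters back into bytes, read as UTF-8.

    Raises UnicodeEncodeError or UnicodeDecodeError if the text was not
    damaged in exactly this way.
    """
    return text.encode("iso-8859-1").decode("utf-8")


original = "We\u2019d like."

# Steps 1 and 2: correct UTF-8 bytes, misread one byte at a time.
broken = mangle(original)

# Step 3: the wrong characters are re-encoded as UTF-8 and served.
stored = broken.encode("utf-8")
print(stored)  # b'We\xc3\xa2\xc2\x80\xc2\x99d like.'

# Reversing the steps recovers the original text.
recovered = repair(stored.decode("utf-8"))
print(recovered == original)  # True
```

The round trip works here because ISO 8859-1 maps all 256 byte values one-to-one onto code points, so no information is lost in the bad decode.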
