169

What is the difference between utf8 and latin1?

3

2 Answers

185

UTF-8 is prepared for world domination; Latin1 isn't.

If you try to store non-Latin characters such as Chinese, Japanese, Hebrew, or Russian using the Latin1 encoding, they will end up as mojibake. You may find the introductory text of this article useful (even more so if you know a bit of Java).
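To make the mojibake effect concrete, here is a minimal Python sketch. It uses Python's own `latin-1` and `utf-8` codecs as stand-ins for the two encodings; it illustrates the principle only and does not reproduce how any particular database handles the conversion. The sample string is an arbitrary choice.

```python
# Text that Latin1 has no byte values for.
text = "\u65e5\u672c\u8a9e"  # three CJK characters

# Encoding directly to Latin1 fails, because Latin1 only covers 256 characters.
try:
    text.encode("latin-1")
except UnicodeEncodeError as exc:
    print("latin-1 cannot encode this text:", exc.reason)

# Mojibake: the UTF-8 bytes are correct, but they get interpreted as Latin1,
# so every single byte is shown as its own (wrong) character.
utf8_bytes = text.encode("utf-8")
garbled = utf8_bytes.decode("latin-1")
print(len(text), "characters became", len(garbled), "garbled characters")

# As long as the bytes were not altered, the misreading can be undone.
restored = garbled.encode("latin-1").decode("utf-8")
print(restored == text)
```

The round trip at the end works only because every byte survived untouched; once a layer replaces unknown bytes with `?`, the original text is gone.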

Note that full 4-byte UTF-8 support was only introduced in MySQL 5.5. Before that version, utf8 goes up to only 3 bytes per character, not 4. So it supports only the Basic Multilingual Plane (BMP) and not, for example, the supplementary planes where most emoji live. If you want full 4-byte UTF-8 support, upgrade MySQL to at least 5.5 or go for another RDBMS such as PostgreSQL. In MySQL 5.5+ the full 4-byte encoding is called utf8mb4.
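The 3-byte versus 4-byte distinction can be checked outside the database. The following Python sketch prints how many bytes a few sample characters need in UTF-8; the helper `fits_in_three_bytes` is a hypothetical name introduced here for illustration, not part of MySQL or any library.

```python
# Sample characters, written as escapes so the code points are unambiguous.
samples = {
    "A": "ASCII letter",
    "\u00e9": "Latin letter with acute accent",
    "\u20ac": "euro sign (BMP)",
    "\u8a9e": "CJK ideograph (BMP)",
    "\U0001F600": "emoji (outside the BMP)",
}

for char, label in samples.items():
    encoded = char.encode("utf-8")
    in_bmp = ord(char) <= 0xFFFF  # the BMP is U+0000 through U+FFFF
    print(f"U+{ord(char):04X} {label}: {len(encoded)} byte(s), in BMP: {in_bmp}")


def fits_in_three_bytes(text):
    """Return True if every character needs at most 3 bytes in UTF-8."""
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)
```

A check like `fits_in_three_bytes(user_input)` could be used in application code to detect text that a 3-byte-only column would not be able to hold.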

12
  • 33
    MySQL 5.1 supports only 3-byte UTF-8; MySQL 5.5, however, does support 4-byte UTF-8 as utf8mb4.
    – velcrow
    Aug 22, 2011 at 18:02
  • 2
    @BalusC Can you elaborate on how UTF-8 isn't fully supported? Does it mean that MySQL 5.1 can't store all Unicode characters?
    – Pacerier
    Jun 12, 2012 at 5:54
  • 2
    @Pacerier: it only supports 3 bytes per character, thus only the BMP (the first 65,536 code points, U+0000 through U+FFFF) is supported; the rest are not. For all characters, see en.wikipedia.org/wiki/Plane_(Unicode)
    – BalusC
    Jun 12, 2012 at 11:01
  • 2
    @BalusC For people who are using 5.1.63 and don't have the privilege to update the web server's MySQL version, what are the alternatives?
    – Pacerier
    Jun 12, 2012 at 18:54
  • 6
    @Pacerier: You could store the data as VARBINARY instead of VARCHAR and decode/encode it in the business tier yourself, but this is hacky. Consider asking a new question; maybe there are better ways.
    – BalusC
    Jun 12, 2012 at 18:57
63

In latin1 each character is exactly one byte long. In utf8 a character can consist of more than one byte. Consequently utf8 has more characters than latin1 (and the characters they do have in common aren't necessarily represented by the same byte or byte sequence).
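A small Python sketch of this point, using Python's `latin-1` and `utf-8` codecs. The function name `encodings` is made up for this example.

```python
def encodings(char):
    """Return the (latin1, utf8) byte representations of one character."""
    return char.encode("latin-1"), char.encode("utf-8")


# A character from the ASCII range: same single byte in both encodings.
print(encodings("A"))

# A character both encodings share, but represented differently:
# one byte in latin1, a two-byte sequence in utf8.
print(encodings("\u00e9"))  # e with acute accent
```

So even for a character that exists in both encodings, bytes written under one encoding cannot simply be read back under the other.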

3
  • 1
    What about ascii and bin?
    May 17, 2017 at 10:54
  • 10
    @YoushaAleayoub ASCII is a single-byte encoding that uses only the byte values 0 through 127, so it can encode half as many characters as latin1. It's a strict subset of both latin1 and utf8, meaning the bytes 0 through 127 encode the same things in latin1 and utf8 as they do in ASCII. Bin isn't an encoding. It's usually an option that you can give when reading a file, telling the IO functions not to apply any encoding but instead to read the file byte by byte.
    – sepp2k
    May 17, 2017 at 11:38
  • 1
    Thanks, I meant the binary collation. Which one is better for English/numeric fields: ascii_general_ci or ascii_bin?
    May 17, 2017 at 12:29
