What is the difference between utf8 and latin1?
-
They are different encodings that agree only on the ASCII characters, which map to the same bytes in both; accented letters such as é exist in both but are encoded differently. UTF-8 is one encoding of Unicode and covers all of its code points; Latin1 encodes at most 256 characters. – ShreevatsaR, Apr 25, 2010 at 16:45
-
There is also latin9, which is available in Linux locales and could have been mentioned in the question: en.wikipedia.org/wiki/ISO/IEC_8859-15 – baptx, Apr 6, 2020 at 17:19
-
Does this answer your question? What is the difference between UTF-8 and ISO-8859-1? – Karl Knechtel, Aug 5, 2022 at 2:57
2 Answers
UTF-8 is prepared for world domination, Latin1 isn't.
If you try to store non-Latin characters such as Chinese, Japanese, Hebrew, or Russian using the Latin1 encoding, they will end up as mojibake. You may find the introductory text of this article useful (even more so if you know a bit of Java).
Note that full 4-byte UTF-8 support was only introduced in MySQL 5.5. Before that version, MySQL's utf8 goes up to only 3 bytes per character, not 4. So it supports only the Basic Multilingual Plane (BMP) and not the supplementary planes, where e.g. most emoji live. If you want full 4-byte UTF-8 support, upgrade MySQL to at least 5.5 or go for another RDBMS like PostgreSQL. In MySQL 5.5+ it's called `utf8mb4`.
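To make the two points above concrete, here is a minimal Python sketch (Python rather than MySQL, purely as an illustration; the variable names are made up for this example). It shows that Latin1 cannot encode Cyrillic text at all, that UTF-8 bytes misread as Latin1 turn into mojibake, and that code points beyond the BMP need 4 bytes in UTF-8, which is exactly what a 3-byte limit cannot hold.

```python
# Illustrative names only; none of this is MySQL-specific.
russian = "\u041f\u0440\u0438\u0432\u0435\u0442"  # the Russian word "Privet" in Cyrillic

# 1. Latin1 has no Cyrillic letters, so a strict encode fails.
try:
    russian.encode("latin-1")
    latin1_can_encode = True
except UnicodeEncodeError:
    latin1_can_encode = False

# 2. Mojibake: correct UTF-8 bytes that are wrongly interpreted as Latin1.
utf8_bytes = russian.encode("utf-8")     # 2 bytes per Cyrillic letter
mojibake = utf8_bytes.decode("latin-1")  # never fails, but yields garbage text

# The damage is reversible as long as the bytes were not altered.
repaired = mojibake.encode("latin-1").decode("utf-8")

# 3. The BMP ends at U+FFFF; everything above it needs 4 bytes in UTF-8.
last_bmp_len = len("\uffff".encode("utf-8"))           # last BMP code point
first_supp_len = len("\U00010000".encode("utf-8"))     # first supplementary one
emoji_len = len("\U0001F600".encode("utf-8"))          # grinning face emoji
```

In this sketch `last_bmp_len` comes out as 3 while `first_supp_len` and `emoji_len` come out as 4, which is why a character set capped at 3 bytes per character stops at the BMP.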
-
MySQL 5.1 supports 3-byte UTF-8; MySQL 5.5, however, supports 4-byte UTF-8 as `utf8mb4`. – velcrow, Aug 22, 2011 at 18:02
-
@BalusC Can you elaborate on how UTF-8 isn't fully supported? Does it mean that MySQL 5.1 can't store all Unicode characters? – Pacerier, Jun 12, 2012 at 5:54
-
@Pacerier: it only supports 3 bytes per character, so only the BMP (the first 65,536 code points, U+0000 through U+FFFF) is supported and the rest is not. For all characters, see en.wikipedia.org/wiki/Plane_(Unicode) – BalusC, Jun 12, 2012 at 11:01
-
@BalusC For people who are on 5.1.63 and don't have the privilege to update the web server's MySQL version, what are the alternatives? – Pacerier, Jun 12, 2012 at 18:54
-
@Pacerier: You could save as `VARBINARY` instead of `VARCHAR` and decode/encode in the business tier yourself, but this is hacky. Consider asking a new question, maybe there are better ways. – BalusC, Jun 12, 2012 at 18:57
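The workaround from the last comment could look roughly like this in the application tier. This is only a sketch under assumptions: the helper names `to_varbinary` and `from_varbinary` and the `max_bytes` parameter are hypothetical, and the actual database call is left out. The one thing it relies on is that a `VARBINARY(n)` column is sized in bytes, not characters.

```python
def to_varbinary(text: str, max_bytes: int) -> bytes:
    """Encode text as full UTF-8 for a binary column (hypothetical helper).

    max_bytes stands for the declared size n of a VARBINARY(n) column,
    which counts bytes, so a 4-byte character uses 4 units of it.
    """
    raw = text.encode("utf-8")
    if len(raw) > max_bytes:
        raise ValueError(f"{len(raw)} bytes do not fit in VARBINARY({max_bytes})")
    return raw


def from_varbinary(raw: bytes) -> str:
    """Decode bytes read back from the binary column (hypothetical helper)."""
    return raw.decode("utf-8")


# Round trip with a character outside the BMP (grinning face emoji).
original = "a\U0001F600"
stored = to_varbinary(original, max_bytes=16)
restored = from_varbinary(stored)
```

The cost of this approach is that the database no longer knows the column holds text, so character-aware collation, comparison, and length functions are not available on it.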
In latin1 each character is exactly one byte long. In utf8 a character can consist of more than one byte. Consequently utf8 has more characters than latin1 (and the characters they do have in common aren't necessarily represented by the same byte or byte sequence).
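A small Python sketch of the point above, using three sample characters picked for illustration (the helper name `encoded_bytes` is made up here): an ASCII letter, an accented letter that both encodings have, and the euro sign, which Latin1 lacks.

```python
def encoded_bytes(ch: str):
    """Return (latin1_bytes_or_None, utf8_bytes) for a single character."""
    try:
        latin1 = ch.encode("latin-1")
    except UnicodeEncodeError:
        latin1 = None  # the character does not exist in Latin1
    return latin1, ch.encode("utf-8")


ascii_letter = encoded_bytes("A")       # same single byte in both encodings
e_acute = encoded_bytes("\u00e9")       # e with acute accent: 1 byte vs 2 bytes
euro = encoded_bytes("\u20ac")          # euro sign: absent from Latin1

for name, (latin1, utf8) in [("A", ascii_letter),
                             ("e-acute", e_acute),
                             ("euro", euro)]:
    print(name, latin1, utf8)
```

Only the ASCII letter has identical bytes in both encodings; the accented letter is one byte (0xE9) in Latin1 but two bytes (0xC3 0xA9) in UTF-8.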
-
@YoushaAleayoub ASCII is a single-byte encoding which uses only the byte values 0 through 127, so it can encode half as many characters as latin1. It's a strict subset of both latin1 and utf8, meaning the bytes 0 through 127 in both latin1 and utf8 encode the same things as they do in ASCII. Bin isn't an encoding. It's usually an option that you can give when reading a file, telling the IO functions not to apply any encoding but instead just read the file byte by byte. – sepp2k, May 17, 2017 at 11:38
-
Thanks, I meant the `binary` collation. And which one is better for English/numeric fields: `ascii_general_ci` or `ascii_bin`? May 17, 2017 at 12:29