
unicode – encode method for UTF16 and UTF32 in Python



I want to investigate how UTF-16 and UTF-32 work by looking at specific characters in binary.

I know we can use
"string".encode("(encoding name)") to check its byte value in a specific encoding, and it works fine with UTF-8.

But when it comes to UTF-16 or UTF-32, I found the result is different from the encoded value it is supposed to be.

For example, for the first letter "あ" in Japanese, according to https://www.compart.com/en/unicode/U+3042
the hex values for UTF-8, UTF-16, and UTF-32 are
E38182, 3042, and 00003042.
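(One quick way to confirm the code point from within Python itself, as a small sketch:)

```python
# ord() returns the Unicode code point of a single character;
# hex() formats it, confirming that あ is U+3042.
print(hex(ord("あ")))  # → 0x3042
```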

so if I execute the following code

print("あ".encode('utf-8'))
print("あ".encode('utf-16BE'))
print("あ".encode('utf-32BE'))

I will get

b'\xe3\x81\x82'
b'0B'
b'\x00\x000B'

As you can see, UTF-8 is identical to the code table, but 16 and 32 are weird…
No idea how 000B could convert to 3042 — do I misunderstand something about the encode method?
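(For reference, a minimal sketch of one way to inspect the raw bytes: `bytes.hex()` prints every byte as two hex digits, which avoids the shortcut where `repr()` shows printable ASCII bytes as characters — 0x30 happens to be the ASCII character '0' and 0x42 the ASCII character 'B'.)

```python
data = "あ".encode("utf-16-be")
# bytes.hex() shows each byte as two hex digits, with no ASCII shortcut.
print(data.hex())           # → 3042
# b'0B' and b'\x30\x42' are the very same bytes object.
print(data == b"\x30\x42")  # → True
```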


