Dealing with binary data and text encodings can be tricky in any programming language. In Python, you may encounter binary data when reading files opened in binary mode, interfacing with network sockets, or using libraries that return binary buffers.
On the other hand, Python’s string type uses Unicode by default. So you need to properly encode and decode between binary data and text.
Also read: 4 ways to Convert a Binary String to a Normal String
What is Binary Data?
Computers ultimately store all data as binary – sequences of 1’s and 0’s. This low-level binary data representation is great for storage and transmission efficiency. But binary data in raw form is not human-readable. So we have various encoding schemes that convert binary data to text characters and vice-versa. Here are some examples of binary data you may encounter in Python:
- Contents read from a file opened in binary mode
- Image data from files or camera devices
- Audio buffers from microphone input
- Network packets received from a socket
- Output of a C library function
Binary data can contain any arbitrary sequence of bytes. The text has structure and encoding rules to map binary patterns to human-readable characters.
What is Text Encoding?
Text encoding schemes provide rules to map binary byte sequences to text characters that humans can read and write.
Some examples of text encodings are:
- ASCII – Maps binary patterns to English characters and symbols
- UTF-8 – Unicode encoding that supports all major world languages
- Latin-1 – Encoding for Western European languages
The same binary data, when decoded with different text encodings, will result in entirely different text output.
So dealing with binary data and text requires you to be aware of encodings.
Python String and Bytes Types
Python has two main types for representing binary data and text:
- bytes – Immutable sequence of integers in the range 0 <= x < 256. Used for representing binary data.
- str – Immutable sequence of Unicode codepoints. Used for representing text.
You need to convert between these types using .encode() and .decode() methods when interfacing with binary data in Python.
Here is an example:
# Text string
text = "Hello World"
# Encode text to binary data
data = text.encode("utf-8")
# Decode binary data back to text string
text = data.decode("utf-8")
print(data)
print(text)
Now let’s go through different techniques to convert binary data to UTF-8 encoded text in Python.
Convert Binary Data to UTF-8 String
This section provides various methods to decode binary data to UTF-8 properly.
Method 1: Decode with UTF-8
If your binary data is already UTF-8 encoded, you can simply decode it to a text string:
data = b"Some binary data"
text = data.decode("utf-8")
Note that UTF-8 is the default encoding for decode(), so you can omit it:
text = data.decode() # UTF-8 default
This handles the majority of the UTF-8 decoding use cases.
Also read: Encoding an Image File With BASE64 in Python
Method 2: Decode Latin-1 Before UTF-8
For binary data with unknown encoding, you can first decode it as Latin-1. This will map each byte to a corresponding Unicode code point.
Then you can encode the resulting text as UTF-8:
data = b"\x12\xab" # Binary data with unknown encoding
text = data.decode("latin-1").encode("utf-8")
Caution: This can corrupt your data if it’s not actually Latin-1 compatible. Only use this method if you are sure your binary data meets the Latin-1 spec.
Method 3: Base64 Encode Before Decoding
Another technique that works for unknown binary data is to base64 encode it first. Base64 produces UTF-8 compatible text output from any binary input.
import base64
data = b"\x12\xab"
# Base64 encode
b64_data = base64.b64encode(data)
# Now decode base64 text from UTF-8 to string
text = b64_data.decode("utf-8")
print(text)
# Output: Eqs=
The output base64 text will be safe for UTF-8 decoding.
This method has the overhead of base64 encoding/decoding but reliably converts even non-text binary data to UTF-8.
Method 4: Hex Encode Before Decoding UTF-8
Similar to base64 encoding, you can hex encode the binary data first. The output hex text will be UTF-8 friendly:
import binascii
data = b"\x12\xab"
hex_data = binascii.hexlify(data)
text = hex_data.decode("utf-8")
print(text)
# Output: 12ab
So hex encoding is another method to reliably convert arbitrary binary data to UTF-8 text.
Handling Encoding Errors
The above decoding examples can fail with Unicode encoding errors:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x12 in position 0: invalid start byte
This means your binary data is not valid UTF-8. There are a few ways to handle such errors:
Ignore Errors
Pass errors="ignore"
to skip undecodable bytes:
data = b"Some \x12invalid utf-8\xab"
text = data.decode("utf-8", errors="ignore")
print(text) # Some invalid utf-8
The unknown bytes are omitted from the text output.
Replace Errors
Replace bad bytes with a placeholder like � using errors="replace"
:
data = b"Some \x12invalid utf-8\xab"
text = data.decode("utf-8", errors="replace")
print(text) # Some �invalid utf-8�
This indicates invalid byte sequences to the reader.
Use Fallback Encoding
If your text is a legacy encoding like Latin-1, you can decode first with ISO-8859-1 and then convert to UTF-8:
data = b"Some \x12invalid utf-8\xab"
text = (
data.decode("iso-8859-1")
.encode("utf-8", errors="replace")
)
This will salvage as much text as supported by Latin-1 encoding.
Summary
We went over various techniques to handle binary data to UTF-8 text conversion in Python:
- For UTF-8 encoded input, decode it directly
- For unknown input, try Latin-1, Base64, or Hex decode
- Handle encoding errors by ignoring, replacing, or using fallbacks
In summary, some best practices are:
- Know your input encoding – Explicitly decode with the required encoding
- Handle errors gracefully – Don’t let unmappable bytes crash your program
- Validate decoded text – Spot check output if corruption is unacceptable
Converting between binary data and text is unavoidable. I hope these practical examples give you a toolkit to handle these scenarios in your Python applications.
The key ideas are being aware of text encodings and using the appropriate decoding method based on your binary data source and use case requirements.