How to Handle Bytes and Unicode with Files in Python

Python opens files in text mode by default. Text mode works with Unicode strings (`str`), whereas binary mode, selected by appending `'b'` to the mode string, works with `bytes`. Let’s explore how to work with bytes and Unicode when reading and writing files in Python.
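As a minimal sketch of the difference between the two modes, the snippet below writes a short string to a throwaway temporary file (the file name `demo.txt` is only an assumption for illustration) and reads it back in both modes.

```python
import os
import tempfile

# Hypothetical throwaway file, used only for this illustration.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.txt")
    with open(demo_path, mode="w", encoding="utf-8") as f:
        f.write("非ASCII")

    # Text mode: the bytes are decoded and read() returns a str.
    with open(demo_path, mode="r", encoding="utf-8") as f:
        text = f.read()

    # Binary mode: read() returns the raw bytes, undecoded.
    with open(demo_path, mode="rb") as f:
        raw = f.read()

print(type(text).__name__, len(text))
print(type(raw).__name__, len(raw))
```

The character count and the byte count differ because the Chinese character takes three bytes in UTF-8, while each ASCII letter takes one.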

1. Example Data.

Consider a file containing non-ASCII characters encoded in UTF-8:

这是一个包含非ASCII字符的UTF-8编码文件示例。

Save the above UTF-8 encoded content to a file with the name example_none_ascii_file.txt. In my environment, the file is saved in the path ./resource-files.
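If you prefer to create the sample file from Python rather than from an editor, a sketch like the following should work. The helper name `create_sample_file` is hypothetical; the directory and file name are the ones used in this article.

```python
from pathlib import Path

SAMPLE_TEXT = "这是一个包含非ASCII字符的UTF-8编码文件示例。"


def create_sample_file(directory="./resource-files"):
    """Write the sample sentence as UTF-8 and return the file path."""
    target_dir = Path(directory)
    target_dir.mkdir(parents=True, exist_ok=True)
    sample_path = target_dir / "example_none_ascii_file.txt"
    # Pass the encoding explicitly so the result does not depend
    # on the platform's default encoding.
    sample_path.write_text(SAMPLE_TEXT, encoding="utf-8")
    return sample_path


if __name__ == "__main__":
    print(create_sample_file())
```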

2. Reading and Decoding Bytes.

When reading from a file in text mode, Python decodes the bytes using the specified encoding, so read(10) returns up to 10 characters. In binary mode nothing is decoded, and read(10) returns up to 10 raw bytes. Here’s an illustration:

def read_and_decode_bytes_automatically(path):
    # Reading from a file in text mode
    with open(path, mode='r', encoding='utf-8') as f:
        chars = f.read(10)  # up to 10 characters

        # read() returns an empty string at end of file
        while chars:
            print(chars)
            chars = f.read(10)

    # Reading from a file in binary mode
    with open(path, mode="rb") as f:
        data = f.read(10)  # up to 10 bytes
        while data:
            print(data)
            data = f.read(10)

if __name__ == "__main__":
    path = "./resource-files/example_none_ascii_file.txt"
    read_and_decode_bytes_automatically(path)

Output.

这是一个包含非ASC
II字符的UTF-8
编码文件示例。
b'\xe8\xbf\x99\xe6\x98\xaf\xe4\xb8\x80\xe4'
b'\xb8\xaa\xe5\x8c\x85\xe5\x90\xab\xe9\x9d'
b'\x9eASCII\xe5\xad\x97\xe7'
b'\xac\xa6\xe7\x9a\x84UTF-8'
b'\xe7\xbc\x96\xe7\xa0\x81\xe6\x96\x87\xe4'
b'\xbb\xb6\xe7\xa4\xba\xe4\xbe\x8b\xe3\x80'
b'\x82'

3. Decoding Bytes Manually.

Sometimes you need to decode bytes to Unicode text yourself. Decoding requires every encoded character in the byte string to be complete; otherwise decode() raises an error such as UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 9: unexpected end of data. That happens here because a 10-byte chunk can end in the middle of a 3-byte character.

The example below raises this UnicodeDecodeError.

def read_and_decode_bytes_manually_throw_error(path):
   
    # Reading from a file in binary mode
    with open(path, mode="rb") as f:
        data = f.read(10)
        while data:
            print(data)

            # Decoding each 10-byte chunk on its own fails as soon as
            # a chunk ends in the middle of a multi-byte character
            decoded_text = data.decode("utf-8")
            print(decoded_text)

            data = f.read(10)

if __name__ == "__main__":
    path = "./resource-files/example_none_ascii_file.txt"
    read_and_decode_bytes_manually_throw_error(path)

To avoid the error, collect all the bytes first and decode them once at the end, so that no character is split across chunks:

def read_and_decode_bytes_manually_without_error(path):
   
    # Reading from a file in binary mode
    with open(path, mode="rb") as f:

        full_data_array = bytearray()

        data = f.read(10)
        while data:
            print(data)
            full_data_array.extend(data)

            data = f.read(10)

        print(full_data_array)
        # Decode once, after all the bytes have been collected
        decoded_text = full_data_array.decode("utf-8")
        print(decoded_text)


if __name__ == "__main__":
    path = "./resource-files/example_none_ascii_file.txt"
    read_and_decode_bytes_manually_without_error(path)

Output.

b'\xe8\xbf\x99\xe6\x98\xaf\xe4\xb8\x80\xe4'
b'\xb8\xaa\xe5\x8c\x85\xe5\x90\xab\xe9\x9d'
b'\x9eASCII\xe5\xad\x97\xe7'
b'\xac\xa6\xe7\x9a\x84UTF-8'
b'\xe7\xbc\x96\xe7\xa0\x81\xe6\x96\x87\xe4'
b'\xbb\xb6\xe7\xa4\xba\xe4\xbe\x8b\xe3\x80'
b'\x82'
bytearray(b'\xe8\xbf\x99\xe6\x98\xaf\xe4\xb8\x80\xe4\xb8\xaa\xe5\x8c\x85\xe5\x90\xab\xe9\x9d\x9eASCII\xe5\xad\x97\xe7\xac\xa6\xe7\x9a\x84UTF-8\xe7\xbc\x96\xe7\xa0\x81\xe6\x96\x87\xe4\xbb\xb6\xe7\xa4\xba\xe4\xbe\x8b\xe3\x80\x82')
这是一个包含非ASCII字符的UTF-8编码文件示例。
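Collecting everything before decoding is fine for small files. For larger inputs, one alternative (not used in the original example) is the standard library's incremental decoder, which holds back the bytes of an incomplete character until the next chunk arrives. The sketch below uses an in-memory `io.BytesIO` stream in place of the file, and `decode_in_chunks` is a hypothetical helper name.

```python
import codecs
import io


def decode_in_chunks(stream, chunk_size=10, encoding="utf-8"):
    """Decode a binary stream chunk by chunk without splitting characters."""
    decoder = codecs.getincrementaldecoder(encoding)()
    pieces = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # Bytes of an incomplete trailing character are buffered by the decoder.
        pieces.append(decoder.decode(chunk))
    # final=True makes the decoder raise if incomplete bytes are left over.
    pieces.append(decoder.decode(b"", final=True))
    return pieces


sample = "这是一个包含非ASCII字符的UTF-8编码文件示例。"
pieces = decode_in_chunks(io.BytesIO(sample.encode("utf-8")))
print(pieces)
```

Joining the pieces gives back the original sentence, even though the 10-byte chunks cut several characters in half.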

4. Converting Unicode Encodings.

Text mode combined with the `encoding` option of the `open()` function makes it easy to convert a file from one text encoding to another, here from UTF-8 to GB2312:

def convert_unicode_encoding(src_path, target_path):
   
    full_data = ''

    # Reading from the UTF-8 source file in text mode.
    # Pass the encoding explicitly; otherwise the platform default is used.
    with open(src_path, mode="r", encoding="utf-8") as f:
        data = f.read(10)
        while data:
            print(data)
            full_data += data
            data = f.read(10)

    # 'x': Creates a new file and opens it for writing. If the file already exists, the operation fails.
    with open(target_path, "x", encoding="gb2312") as target:
        target.write(full_data)   

    print("==========================================")

    # Reading from the converted file
    with open(target_path, encoding="gb2312") as f:
        data = f.read(10)
        while data:
            print(data)
            data = f.read(10)

if __name__ == "__main__":
    path = "./resource-files/example_none_ascii_file.txt"
    target_path = "./resource-files/example_none_ascii_file_target.txt"
    convert_unicode_encoding(path, target_path)

Output.

这是一个包含非ASC
II字符的UTF-8
编码文件示例。
==========================================
这是一个包含非ASC
II字符的UTF-8
编码文件示例。
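The same conversion can be sketched in memory, which makes the effect on the byte length visible. This is only an illustration, not part of the file-based example above; the emoji string is an assumed example of a character that GB2312 cannot represent.

```python
sample = "这是一个包含非ASCII字符的UTF-8编码文件示例。"

utf8_bytes = sample.encode("utf-8")
# Decode from the source encoding, then encode to the target encoding.
gb_bytes = utf8_bytes.decode("utf-8").encode("gb2312")

print(len(utf8_bytes), len(gb_bytes))
print(gb_bytes.decode("gb2312") == sample)

# A character missing from the target encoding raises UnicodeEncodeError.
try:
    "😀".encode("gb2312")
    conversion_failed = False
except UnicodeEncodeError as exc:
    conversion_failed = True
    print("cannot encode:", exc.reason)
```

If the source text may contain such characters, the `errors` parameter of `open()` or `str.encode()` (for example `errors="replace"`) controls what happens instead of raising.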

5. Handling File Positioning.

Be careful with `seek()` on files opened in text mode: seeking to an arbitrary offset can land in the middle of a multi-byte character and cause a decoding error on the next read:

def handling_file_positioning(path):
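    # Handling file positioning in text mode
    with open(path, encoding='utf-8') as f:
        print(f.read(5))  # read the first 5 characters
        f.seek(4)         # offset 4 falls inside the second character
        print(f.read(2))  # raises UnicodeDecodeError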


if __name__ == "__main__":
    path = "./resource-files/example_none_ascii_file.txt"
    handling_file_positioning(path)

Output.

这是一个包
Traceback (most recent call last):
  File "/Users/songzhao/Documents/WorkSpace/ProgrammingCourses/PythonCourses/python-courses/python-files-io/how-to-handle-bytes-and-unicode-with-files-in-python.py", line 94, in <module>
    handling_file_positioning(path)
  File "/Users/songzhao/Documents/WorkSpace/ProgrammingCourses/PythonCourses/python-courses/python-files-io/how-to-handle-bytes-and-unicode-with-files-in-python.py", line 83, in handling_file_positioning
    print(f.read(2))
          ^^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x98 in position 0: invalid start byte

The code above fails because of improper file positioning. Let’s break it down and identify the issue:

Opening the File: The function `handling_file_positioning()` opens the file at the specified `path` using the `open()` function. The file is opened in text mode, which is the default when the mode contains no `b`, and its content is decoded as `'utf-8'`.

Reading from the File: After opening the file, you’re reading the first 5 characters using `f.read(5)`. This successfully reads the first 5 characters from the file and prints them.

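File Positioning: Next, you’re attempting to move the file pointer using `f.seek(4)`. This moves the pointer to byte offset 4, not to the 4th character. Each Chinese character here takes 3 bytes in UTF-8, so offset 4 is the second byte of the second character. In text mode, the only offsets that are safe to pass to `seek()` are 0 and values previously returned by `f.tell()`.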

Reading After Seeking: Immediately after seeking, you’re trying to read 2 characters from the file using `f.read(2)`. This is where the error occurs.

If the file contains UTF-8 encoded characters, moving the file pointer to an arbitrary byte position, which may not be the start of a valid UTF-8 character, can cause a decoding error on the next read. Python has to decode the bytes it reads as UTF-8, but the first byte at that position is not a valid start byte.

Since UTF-8 is a variable-length encoding, characters may span multiple bytes, and positioning the file pointer in the middle of a character could lead to a UnicodeDecodeError when attempting to read subsequent characters.

To resolve this issue, ensure that you’re positioning the file pointer at a valid UTF-8 character boundary to avoid decoding errors.
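One way to stay on a character boundary while remaining in text mode is to seek only to positions previously returned by `tell()`. The sketch below is an illustration under that assumption; it uses a temporary file (the name `sample.txt` is hypothetical) instead of the article's file so that it runs on its own.

```python
import os
import tempfile

sample = "这是一个包含非ASCII字符的UTF-8编码文件示例。"

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, mode="w", encoding="utf-8") as f:
        f.write(sample)

    with open(sample_path, encoding="utf-8") as f:
        f.read(1)                # consume the first character
        bookmark = f.tell()      # opaque position, valid for seek()
        first_pass = f.read(2)   # the next two characters
        f.seek(bookmark)         # return to the saved boundary
        second_pass = f.read(2)  # the same two characters again

print(first_pass, second_pass)
```

The value returned by `tell()` in text mode should be treated as an opaque bookmark rather than as a character index.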

You might need to adjust your file positioning logic based on your specific requirements and the structure of the file’s content. The version below opens the file in binary mode instead. Binary mode reads raw bytes without decoding them, so seeking to any byte offset cannot cause a decoding error:

def handling_file_positioning_correct(path):
    # Handling file positioning
    with open(path, mode='rb') as f:  # Open file in binary mode
        print(f.read(5))  # Read the first 5 bytes (no decoding)
        f.seek(4)  # Move the file pointer to byte offset 4
        print(f.read(2))  # Read 2 bytes from the current position


if __name__ == "__main__":
    path = "./resource-files/example_none_ascii_file.txt"
    handling_file_positioning_correct(path)

Output.

b'\xe8\xbf\x99\xe6\x98'
b'\x98\xaf'
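Note that the second result above starts in the middle of a character, so it still cannot be decoded on its own. If you need text back after seeking in binary mode, one possible approach is to step backwards to the nearest character start before decoding. The sketch below relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx; `align_to_char_start` is a hypothetical helper, and an in-memory `io.BytesIO` stream stands in for the file.

```python
import io


def align_to_char_start(stream, offset):
    """Seek to the nearest UTF-8 character start at or before offset."""
    while offset > 0:
        stream.seek(offset)
        byte = stream.read(1)
        # Continuation bytes look like 0b10xxxxxx; any other byte
        # (or end of data) means this offset is a safe place to stop.
        if not byte or (byte[0] & 0xC0) != 0x80:
            break
        offset -= 1
    stream.seek(offset)
    return offset


sample = "这是一个包含非ASCII字符的UTF-8编码文件示例。"
stream = io.BytesIO(sample.encode("utf-8"))

start = align_to_char_start(stream, 4)  # offset 4 is inside a character
text = stream.read(6).decode("utf-8")   # two complete 3-byte characters
print(start, text)
```

This only works for UTF-8 data; other encodings need their own boundary rules.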

6. Conclusion.

Mastering the handling of bytes and Unicode in Python file operations is essential, particularly for data analysis tasks involving non-ASCII text data.

Python’s Unicode functionality offers robust solutions for working with diverse text encodings, providing flexibility and reliability in handling files.

Explore Python’s documentation for comprehensive insights into Unicode handling and file operations.
