Converting File Formats Using Python 3

06 Mar 2022

I recently needed to convert text files created by one application into a completely different format, to be processed by a different application.

In some cases, the files were too large to be read into memory all at once (without causing resource problems). Here “too large” meant tens or hundreds of gigabytes.

The source files also did not always have line terminators (some did; most did not). Instead they had separately defined record layouts, which specified the number of fields in a record (with data types), together with the field terminator character.

I therefore needed to (a) find a way to read a source file as a stream of characters, and (b) define how the program would identify complete records, by counting field terminators.

The heart of the solution, written in Python 3, was based on the following:

Python
with open('in.txt', 'r', encoding='windows-1252') as in_file, \
     open('out.txt', 'w', encoding='utf-8') as out_file:

    read_chunk = lambda: in_file.read(4096) # characters, not bytes
    sentinel = ''

    for chunk in iter(read_chunk, sentinel):
        for ch in chunk:
            pass # conversion logic here

So, what is iter(callable, sentinel)?

The documentation introduces the functionality:

iter(object[, sentinel])

If the second argument, sentinel, is given, then object must be a callable object. The iterator created in this case will call object with no arguments for each call to its __next__() method; if the value returned is equal to sentinel, StopIteration will be raised, otherwise the value will be returned.
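
In other words, iter() keeps calling the callable until it sees the sentinel value. A minimal, self-contained illustration (a dice-rolling example invented for this post):

Python
import random

def roll():
    return random.randint(1, 6)

# Keep calling roll() until it returns 6; the 6 itself is never yielded.
for value in iter(roll, 6):
    print(value)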

What is a “callable object” in Python? Anything which can be called - a function, a method, a class, or any object whose type defines __call__().
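
The built-in callable() function reports whether an object can be called:

Python
print(callable(len))           # True - a built-in function
print(callable('hello'.upper)) # True - a bound method
print(callable('hello'))       # False - a plain string cannot be called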

Why is a callable object needed? Because we want the Python built-in function iter() to iterate over something which is not naturally iterable - in this case, a file with no line terminators. Another example would be a file containing binary data.

Therefore we wrap access to our source data in a callable, which iter() can then treat as if it were iterable.
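
For example, the same pattern handles a binary file (a hypothetical in.bin here) - the only change is that the sentinel must be an empty bytes object, because read() on a binary-mode file returns bytes:

Python
with open('in.bin', 'rb') as f:
    for block in iter(lambda: f.read(4096), b''):
        handle_block(block) # placeholder for your own bytes-processing logic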

Using a lambda expression for our callable is a convenient and succinct piece of syntax. It lets us read the file one chunk at a time (each chunk containing up to 4,096 characters).
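
The lambda is only shorthand; a named function, or functools.partial, does exactly the same job (assuming in_file is the open file object from the snippet above):

Python
import functools

def read_chunk():
    return in_file.read(4096)

# ...or, equivalently:
read_chunk = functools.partial(in_file.read, 4096)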

Regarding the sentinel: the sample code sets it to an empty string because that is exactly what file.read() returns once the end of the file has been reached. Every earlier call returns a non-empty chunk, so iteration continues until the empty string matches the sentinel and iter() raises StopIteration.
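
You can verify this behaviour directly (reusing the same in.txt):

Python
with open('in.txt', 'r', encoding='windows-1252') as f:
    while f.read(4096): # keep reading until read() returns ''
        pass
    print(repr(f.read(4096))) # -> '' - the empty-string sentinel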

Each resulting chunk is a string of text, and a string is naturally iterable in Python. We can therefore use for ch in chunk to iterate over that string, one character at a time.
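
For example:

Python
for ch in 'abc':
    print(ch) # prints 'a', then 'b', then 'c' on separate lines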

So, instead of reading the entire file into memory all at once, we only ever hold one small chunk in memory at a time.

Acknowledgement: The solution is based on information taken from the following Stack Overflow question and its answers:

What are the uses of iter(callable, sentinel)?

Below is the full solution, which shows how the script counts field terminators.

Note that while the input file uses field terminators (a control character at the end of every field), the output file is written with field separators (a control character between fields), together with a line terminator (the usual Linux ‘\n’ control character).
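
Before the full listing, a tiny illustration of the difference, showing one two-field record (field values 'a' and 'b') in each style:

Python
terminated = 'a\x01b\x01' # input style: SOH after *every* field
separated = 'a\x0bb\n'    # output style: VT between fields, LF at the end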

Python
#!/usr/bin/python3

#
# Convert file from one format to another for:
#  - line endings
#  - field separators or terminators
#  - double-quotes for strings
#  - reformatted dates
#  - file encoding
#

# ----------------------------------------------------

# Input file specs:

sample_spec = {
    'id': 'test_01',
    'fields_per_rec': 5,
    'in_field_terminator': '\x01', # SOH
    'string_fields': [2,3],
    'date_fields': [4]
}

# ----------------------------------------------------

in_file_encoding = 'windows-1252'
out_file_encoding = 'utf-8'

chunk_size = 4096 # characters

def convert(file_spec, file_name):
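    """Stream file_name character by character, rebuilding records by
    counting field terminators, and write out the converted records."""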
    field = []
    record = []
    field_count = 0
    in_record_count = 0
    out_record_count = 0

    with open(in_path + file_name, 'r', encoding=in_file_encoding) as in_file, \
         open(out_path + file_name + out_suffix, 'w', encoding=out_file_encoding) as out_file:

        file_callable = lambda: in_file.read(chunk_size)
        sentinel = ''

        for chunk in iter(file_callable, sentinel):
            for ch in chunk:

                if ch == '\x02' or ch == '\n':
                    pass # skip STX markers and any stray newlines
                elif ch == file_spec['in_field_terminator']:
                    field_count += 1
                    if len(field) > 0 and field_count in file_spec['string_fields']:
                        record.append('"' + ''.join(field) + '"')
                    elif len(field) > 0 and field_count in file_spec['date_fields']:
                        record.append(format_date_time(''.join(field)))
                    else:
                        record.append(''.join(field))
                    field = []
                else:
                    field.append(ch)

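                # a complete record has been assembled - write it out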
                if field_count == file_spec['fields_per_rec']:
                    out_file.write(out_field_sep.join(record))
                    out_file.write(out_record_sep)
                    record = []
                    field_count = 0
                    in_record_count += 1
                    out_record_count += 1

    print('')
    print('file: %s' % file_name)
    print('  source records: %d' % in_record_count)
    print('  target records: %d' % out_record_count)

def format_date_time(date):
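    # Assumes an 8-character YYYYMMDD input, e.g. '20220306' -> '2022-03-06 00:00:00'.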
    return date[0:4] + '-' + date[4:6] + '-' + date[6:8] + ' 00:00:00'

def convert_files(in_file_spec):
    for suffix in in_file_suffix_list:
        convert(in_file_spec, in_file_spec['id'] + suffix)

#
# -----------------------------------------------------------
#

# these are common to all output files:
out_suffix = '_converted'
out_field_sep = '\x0b' # VT
out_record_sep = '\x0a' # LF

# ----------

#
# Output files are generated here using the file
# specs created above:
#

in_path = 'data/'
out_path = in_path + 'converted/'
in_file_suffix_list = ['.txt']

convert_files(sample_spec)

print('\nDone.')

An example (very small) test file for the above input file spec:

(Image: /images/test_file_no_line_separators.png)

Here we see that there are no line separators, only a SOH control character at the end of each field. Therefore, more typical approaches such as using file.readline() will not work here.
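
If you want to recreate such a file yourself, here is a minimal sketch that writes a file matching sample_spec (the path and field values are invented):

Python
import os

os.makedirs('data', exist_ok=True)

with open('data/test_01.txt', 'w', encoding='windows-1252') as f:
    for rec in [['1', 'alpha', 'beta', '20220306', '99'],
                ['2', 'gamma', 'delta', '20220307', '100']]:
        for field in rec:
            f.write(field + '\x01') # SOH terminator after *every* field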


There are certainly alternative approaches - this is not the only way to get the job done.

For example, you could use file.read(chunk_size), something like this:

Python
with open('input_file.txt', 'r') as file:
    while True:
        chunk = file.read(4096)
        if not chunk: # an empty string means end of file
            break
        process_chunk(chunk) # placeholder for your own conversion logic
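
On Python 3.8 and later, the same loop can be written more compactly with an assignment expression (the “walrus” operator):

Python
with open('input_file.txt', 'r') as file:
    while chunk := file.read(4096):
        process_chunk(chunk) # placeholder, as above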