I recently needed to convert text files created by one application into a completely different format, to be processed by a different application.
In some cases, the files were too large to be read into memory all at once (without causing resource problems). Here “too large” meant tens or hundreds of gigabytes.
The source files also did not always have line terminators (some did; most did not). Instead they had separately defined record layouts, which specified the number of fields in a record, together with the field terminator character.
I therefore needed to (a) find a way to read a source file as a stream of characters, and (b) define how the program would identify complete records, by counting field terminators.
The heart of the solution, written in Python 3, was based on the following:
So, what is
The documentation introduces the functionality:
If the second argument,
sentinel, is given, then
objectmust be a callable object. The iterator created in this case will call
objectwith no arguments for each call to its
__next__()method; if the value returned is equal to
StopIterationwill be raised, otherwise the value will be returned.
What is a “callable object” in Python? Anything which can be called - such as a function, a method and so on.
Why is a callable object needed? Because we want the Python built-in function
iter() to iterate over something which is not naturally iterable - a file, in this case, with no line terminators. Another example, would be a file containing binary data.
Therefore we have to wrap our source data in something which can be treated as if it were iterable.
Using a lambda expression for our
callable is a convenient and succinct piece of syntax. It lets us read the file one chunk at a time (each chunk containing 4,096 characters).
sentinel, the technique used in the above sample code sets the sentinel to an empty string - and therefore the entire chunk of data is processed in one pass, since the sentinel is never encountered.
chunk is a string of text, and a string is naturally iterable in Python. We can therefore use
for ch in chunk to iterate over that string, one character at a time.
So, instead of reading the entire file into memory all at once, we only need to read in much smaller chunks, one after another.
Acknowledgement: The solution is based on information taken from the following SO question and its answers:
Below is the full solution, which shows how the script counts field terminators.
Note that while the input file uses field terminators (a control character at the end of every field), the output file is written with field separators (a control character in between each field) together with a line terminator (the usual Linux ‘\n’ control character).