Streaming and Parsing a CSV File

06 Mar 2022

The source data file:

This is a simple text file containing a list of words. It contains two fields (number and word) separated by tabs. The file also has a header record containing column names.

Basically, something like this:

num word
1 apple
2 banana
...

Dependencies - just the one:

XML
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-csv</artifactId>
    <version>1.9.0</version>
</dependency>

Header.java - this enum represents the columns of the source data file.

Java
public enum Header {
    NUMBER("number"),
    WORD("word");

    private final String headerName;

    Header(String headerName) {
        this.headerName = headerName;
    }

    public String headerName() {
        return this.headerName;
    }

    public static Header fromString(String text) {        
        for (Header header : Header.values()) {
            if (header.headerName.equalsIgnoreCase(text)) {
                return header;
            }
        }
        throw new IllegalArgumentException("No Header with text " + text + " found.");
    }
}

The enum uses a field headerName which is not strictly needed in my case. But I included it here to also show a method to find an enum from a string (the static fromString method).

Note that the names used in this enum do not have to match the column headers used in the file. The order of entries in the enum does have to correspond to the order of columns in the input file.

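(Another approach is to let Commons CSV take the column names from the file's own header record, using TDF, the tab-delimited predefined format provided by Commons CSV. CSVFormat.TDF.withHeader(); does this, although in version 1.9.0 the with...() methods are deprecated in favor of the builder: CSVFormat.Builder.create(CSVFormat.TDF).setHeader().build(). In this case your code can refer to the column heading names used directly in the source data file.)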

Another way to write the enum’s for loop would be:

Java
import java.util.Arrays;
import java.util.Optional;
...
Optional<Header> optHeader = Arrays.stream(Header.values())
        .filter(h -> h.headerName.equalsIgnoreCase(text))
        .findFirst();
return optHeader.orElseThrow(() -> new IllegalArgumentException("No enum with text " + text + " found."));

But I find my original for loop to be more readable in this case.

Word.java - this represents one row of data from the input file.

Java
import org.apache.commons.csv.CSVRecord;

public class Word {

    public static Word fromCsvRecord(CSVRecord csvRecord) {
        return new Word(
                Integer.parseInt(csvRecord.get(Header.NUMBER)),
                csvRecord.get(Header.WORD)
        );
    }

    private Word(int number, String word) {
        this.number = number;
        this.word = word;
    }

    private final int number;
    private final String word;

    public int getNumber() {
        return number;
    }

    public String getWord() {
        return word;
    }

    public void lowercase() {
        System.out.println(getNumber() + " - " + getWord().toLowerCase());
    }

}

In this class, the constructor is private. New instances of Word are created using the static factory method. This has some benefits over the constructor. One example: we can give it a meaningful name - especially helpful if we have multiple different factory methods.
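To illustrate the point about meaningful names, here is a minimal sketch of a class with two differently named factory methods. The Fruit class and both of its factory methods are hypothetical and are not part of the project above. The sketch uses only the JDK, so it can be tried without Commons CSV.

```java
// Hypothetical demo class, not part of the original project.
final class Fruit {

    private final int number;
    private final String name;

    // Private constructor: callers must go through one of the named factories.
    private Fruit(int number, String name) {
        this.number = number;
        this.name = name;
    }

    // Factory for values that are already typed.
    public static Fruit of(int number, String name) {
        return new Fruit(number, name);
    }

    // Factory for one raw tab-separated line such as "1\tapple".
    public static Fruit fromTabSeparatedLine(String line) {
        String[] fields = line.split("\t", -1);
        if (fields.length != 2) {
            throw new IllegalArgumentException("Expected 2 fields but found " + fields.length);
        }
        return new Fruit(Integer.parseInt(fields[0].trim()), fields[1]);
    }

    public int getNumber() {
        return number;
    }

    public String getName() {
        return name;
    }
}
```

Both factories take different inputs, and the method names tell the caller which is which. Two overloaded constructors could not express that as clearly.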

CommonsCsvExample.java - this is where the processing happens.

Java
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;
import java.util.stream.StreamSupport;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

public class CommonsCsvExample {

    private final CSVFormat format = CSVFormat.Builder
            .create(CSVFormat.TDF)
            .setSkipHeaderRecord(true)
            .setHeader(Header.class)
            .build();

    public void readFile() throws IOException {
        streamFileData("list.tsv", Word::lowercase);
    }

    private void streamFileData(String fileName, Consumer<Word> consumer) throws IOException {
        // try-with-resources closes the reader once the whole file has been consumed
        try (Reader reader = new FileReader(fileName, StandardCharsets.UTF_8)) {
            StreamSupport.stream(format.parse(reader).spliterator(), true)
                    .map((csvRecord) -> Word.fromCsvRecord(csvRecord))
                    .forEach(consumer);
        }
    }
}

CSVFormat.Builder - This assumes a TDF (tab-delimited fields) format. The header record is skipped. Our Header enum is used to provide names for the file’s data columns.

Word::lowercase - This is a method reference. It is a special form of a lambda expression - and, as such, it can be passed as a parameter to the streamFileData method.

streamFileData - This method takes a file name and a Consumer. In this simple demo, the consumer is the method reference described above. Its implementation method (in the Word class) does not produce any return value - it just prints some text to stdout. That’s what makes it a consumer.

Reader reader - This produces a character stream from the source file.
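The method reference and Consumer described above can be tried in isolation. The following is only a sketch: Label, emit and applyToAll are hypothetical names standing in for Word, lowercase and streamFileData. It shows that a method reference to a no-argument void instance method and the equivalent lambda are interchangeable as a Consumer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Consumer;

// Hypothetical stand-in for the Word class.
final class Label {

    private final String text;
    private final List<String> sink;

    Label(String text, List<String> sink) {
        this.text = text;
        this.sink = sink;
    }

    // Returns nothing and only has a side effect, so Label::emit fits Consumer<Label>.
    void emit() {
        sink.add(text.toLowerCase(Locale.ROOT));
    }
}

final class MethodReferenceDemo {

    // Hypothetical stand-in for streamFileData: hands every item to the consumer.
    static void applyToAll(List<Label> labels, Consumer<Label> consumer) {
        for (Label label : labels) {
            consumer.accept(label);
        }
    }

    static List<String> run() {
        List<String> sink = new ArrayList<>();
        List<Label> labels = List.of(new Label("Apple", sink), new Label("BANANA", sink));
        applyToAll(labels, Label::emit);            // method reference
        applyToAll(labels, label -> label.emit());  // equivalent lambda
        return sink;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```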

That just leaves one final line of code to discuss:

Java
StreamSupport.stream(format.parse(reader).spliterator(), true)
    .map((csvRecord) -> Word.fromCsvRecord(csvRecord))
    .forEach(consumer);

Breaking that down into its separate steps…

format.parse(reader) - This returns a Commons CSV CSVParser, which supports iteration.

.spliterator() - The iterable creates a spliterator over its elements. A spliterator is a counterpart of an iterator which can also split its elements into separate partitions, and that is what allows a stream to process them in parallel. Whether the data is actually processed in parallel or serially (sequentially) is decided when the stream is created, not by the spliterator itself.

StreamSupport.stream() - This creates a stream from the spliterator.

true - This asks StreamSupport.stream() for a parallel stream (false would give a sequential stream).
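The Iterable-to-Stream step can be sketched without Commons CSV. In this sketch the streamOf helper and the sample words are hypothetical. The Iterable is deliberately not a Collection, so (like CSVParser as used here) it has to go through StreamSupport.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

final class IterableToStreamDemo {

    // Wraps any Iterable in a Stream; 'parallel' is passed straight to StreamSupport.stream().
    static <T> Stream<T> streamOf(Iterable<T> iterable, boolean parallel) {
        return StreamSupport.stream(iterable.spliterator(), parallel);
    }

    public static void main(String[] args) {
        // A lambda-backed Iterable: it only knows how to hand out an iterator.
        Iterable<String> words = () -> List.of("apple", "banana", "cherry").iterator();

        System.out.println(streamOf(words, true).isParallel());
        System.out.println(streamOf(words, false).isParallel());
        System.out.println(streamOf(words, false).collect(Collectors.toList()));
    }
}
```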

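map() - A function is applied to each input record, and its results form the new stream. Here, the function is implemented as a lambda. Note that it is a Function (it takes a value and returns a value), not a Consumer.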

Before map() is applied, the items being streamed are a series of CSVRecord objects - or, rather, in our case, multiple parallel series of such objects. CSVRecord is the object which Commons CSV creates as it parses each row of source data using the provided format.

(csvRecord) -> Word.fromCsvRecord(csvRecord) - This is the lambda function handled by the map() method. It uses the Word class’s static factory method, as already described above. That method parses individual fields from each CSVRecord and constructs new Word objects.

forEach(consumer) - Each Word instance is passed to the consumer function, which (as noted above) is Word::lowercase.
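The difference between the two functional types in this pipeline can be seen in a small JDK-only sketch. The names here (MapThenForEachDemo, numberOf, process) are hypothetical. numberOf plays the role of Word.fromCsvRecord, and it is passed as a method reference, in the same way that the lambda above could also be written as Word::fromCsvRecord.

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

final class MapThenForEachDemo {

    // Hypothetical stand-in for Word.fromCsvRecord: extracts the number from "1\tapple".
    static Integer numberOf(String line) {
        return Integer.valueOf(line.split("\t", -1)[0].trim());
    }

    // map() takes a Function (value in, value out); forEach() takes a Consumer (value in, nothing out).
    static void process(List<String> lines, Consumer<Integer> consumer) {
        Function<String, Integer> mapper = MapThenForEachDemo::numberOf;
        lines.stream()
                .map(mapper)
                .forEach(consumer);
    }

    public static void main(String[] args) {
        process(List.of("1\tapple", "2\tbanana"), n -> System.out.println(n));
    }
}
```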

The result is a list of words (with their numbers) printed to stdout. The ordering of the output list is not guaranteed. In fact, because the stream is parallel and forEach() makes no promise about order, you can expect the output results to be interleaved from different sections of the original source file.

If you want to preserve the source file order in your output, you can request a sequential stream (pass false instead of true to StreamSupport.stream()) and change .forEach(consumer) to .forEachOrdered(consumer).
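A minimal sketch of that order-preserving variant, using only the JDK. The class and method names (OrderedOutputDemo, streamInOrder) are hypothetical, and a plain list of numbers stands in for the parsed file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.stream.StreamSupport;

final class OrderedOutputDemo {

    // Sequential stream (false) plus forEachOrdered: items reach the consumer in source order.
    static <T> void streamInOrder(Iterable<T> source, Consumer<? super T> consumer) {
        StreamSupport.stream(source.spliterator(), false)
                .forEachOrdered(consumer);
    }

    public static void main(String[] args) {
        List<Integer> input = new ArrayList<>();
        for (int i = 1; i <= 5; i++) {
            input.add(i);
        }
        List<Integer> seen = new ArrayList<>();
        streamInOrder(input, seen::add);
        System.out.println(seen.equals(input));
    }
}
```

Because the stream is sequential, the consumer is only ever called from one thread, so a plain ArrayList is safe to collect into here. With a parallel stream and forEach(), the consumer would need to be thread-safe.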


Acknowledgements:

Most of the core insights and techniques used above came to me from the following two Stack Overflow answers:

Convert Iterable to Stream using Java 8 JDK

Loading a CSV directly into a collection of object via Java Stream API