The source data file:
This is a simple text file containing a list of words. It contains two fields (number and word) separated by tabs. The file also has a header record containing column names.
Basically, something like this:
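A hypothetical example (tab-separated; the author's actual file contents are not shown, so these rows are illustrative only):

```
number	word
1	Apple
2	Banana
3	Cherry
```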
Dependencies - just the one:
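The one dependency is Apache Commons CSV. In Maven form (the version shown is only an example - use whatever is current):

```xml
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-csv</artifactId>
    <version>1.10.0</version>
</dependency>
```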
Header.java - this enum represents the columns of the source data file.
The enum uses a field headerName, which is not strictly needed in my case, but I included it to also show how to find an enum constant from a string (via the enum's static lookup method).
Note that the names used in this enum do not have to match the column headers used in the file. The order of entries in the enum does have to correspond to the order of columns in the input file.
(Another approach is to simply use the file's own header record: TDF is the tab-delimited predefined format provided by Commons CSV, and in that case your code can refer directly to the column heading names used in the source data file.)
Another way to write the enum's for loop would be in a more functional, stream-based style - but I find my original for loop to be more readable in this case.
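A sketch of what such an enum might look like (the original code is not shown, so the member and method names here - fromHeaderName in particular - are my assumptions). It includes both the for-loop lookup and the stream-based alternative:

```java
import java.util.stream.Stream;

// Represents the columns of the source data file.
// The order of constants must match the order of columns in the file.
enum Header {
    NUMBER("number"),
    WORD("word");

    private final String headerName;

    Header(String headerName) {
        this.headerName = headerName;
    }

    String getHeaderName() {
        return headerName;
    }

    // Finds the enum constant matching a column heading string.
    static Header fromHeaderName(String name) {
        for (Header h : values()) {
            if (h.headerName.equals(name)) {
                return h;
            }
        }
        throw new IllegalArgumentException("No such header: " + name);
    }

    // The stream-based alternative to the for loop above.
    static Header fromHeaderNameStream(String name) {
        return Stream.of(values())
                .filter(h -> h.headerName.equals(name))
                .findFirst()
                .orElseThrow(() -> new IllegalArgumentException("No such header: " + name));
    }
}
```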
Word.java - this represents one row of data from the input file.

In this class, the constructor is private. New instances of Word are created using the static factory method. This has some benefits over a constructor - one example: we can give it a meaningful name, which is especially helpful if we have multiple different factory methods.
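A minimal sketch of such a class. In the real class, the factory method is fromCsvRecord and takes a Commons CSV CSVRecord; to keep this sketch dependency-free it takes the two field values as strings instead, and the fromFields name and getters are my assumptions:

```java
// Represents one row (number + word) of the input file.
class Word {
    private final int number;
    private final String text;

    // Private constructor: instances come only from the factory method.
    private Word(int number, String text) {
        this.number = number;
        this.text = text;
    }

    // Static factory method. A meaningful name is one of its advantages
    // over a constructor. (The real class's factory parses a CSVRecord.)
    static Word fromFields(String number, String text) {
        return new Word(Integer.parseInt(number), text);
    }

    int getNumber() {
        return number;
    }

    String getText() {
        return text;
    }

    // The consumer used in the demo: prints the row to stdout.
    void lowercase() {
        System.out.println(number + "\t" + text.toLowerCase());
    }
}
```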
CommonsCsvExample.java - this is where the processing happens.
CSVFormat.Builder - This assumes a TDF (tab-delimited fields) format. The header record is skipped. Our Header enum is used to provide names for the file's data columns.
Word::lowercase - This is a method reference. It is a special form of a lambda expression - and, as such, it can be passed as a parameter to the streamFileData method described next.
streamFileData - This method takes a file name and a Consumer. In this simple demo, the consumer is the method reference described above. Its implementation method (in the Word class) does not produce any return value - it just prints some text to stdout. That's what makes it a consumer.
Reader reader - This produces a character stream from the source file.
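Putting those pieces together, the processing class might look roughly like this (a sketch, not the author's exact code: the main method and file name are assumed, while streamFileData, Word.fromCsvRecord, and Word::lowercase follow the description above; requires commons-csv on the classpath):

```java
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.function.Consumer;
import java.util.stream.StreamSupport;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;

public class CommonsCsvExample {

    public static void main(String[] args) throws Exception {
        // The consumer is the Word::lowercase method reference.
        streamFileData("words.txt", Word::lowercase);
    }

    static void streamFileData(String fileName, Consumer<Word> consumer) throws Exception {
        // Tab-delimited format; skip the header record and use the
        // Header enum to name the data columns.
        CSVFormat format = CSVFormat.Builder.create(CSVFormat.TDF)
                .setHeader(Header.class)
                .setSkipHeaderRecord(true)
                .build();

        try (Reader reader = Files.newBufferedReader(Paths.get(fileName));
             CSVParser parser = format.parse(reader)) {
            StreamSupport.stream(parser.spliterator(), true)
                    .map(csvRecord -> Word.fromCsvRecord(csvRecord))
                    .forEach(consumer);
        }
    }
}
```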
That just leaves one final line of code to discuss:
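Reconstructed from the breakdown that follows (the exact formatting of the original line may differ):

```java
StreamSupport.stream(format.parse(reader).spliterator(), true)
        .map((csvRecord) -> Word.fromCsvRecord(csvRecord))
        .forEach(consumer);
```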
Breaking that down into its separate steps…
format.parse(reader) - This returns a Commons CSV CSVParser, which supports iteration.
.spliterator() - The iterable creates a spliterator over its elements. A spliterator is an iterator which can split its input into separate partitions, so that the items can be processed in parallel. The data can also be processed serially (sequentially), if you prefer.
StreamSupport.stream() - This creates a stream from the spliterator.
true - This tells the StreamSupport.stream() method to create a parallel stream over the spliterator's data (false would create a sequential stream).
map() - a mapping function is applied to each input record, transforming it into something else. Here, this function is implemented as a lambda.
At this point, the items being streamed by StreamSupport are a series of CSVRecord objects - or, rather, in our case, multiple parallel series of such objects. CSVRecord is the object created by Commons CSV as it parses each row of source data using the provided format.
(csvRecord) -> Word.fromCsvRecord(csvRecord) - This is the lambda function handled by the map() method. It uses the Word class's static factory method, as already described above. That method parses individual fields from each CSVRecord and constructs a new Word instance.
forEach(consumer) - Each Word instance is passed to the consumer function, which (as noted above) is Word::lowercase.
The result is a list of words (with their numbers) printed to stdout. The ordering of the output is not guaranteed. In fact, because the spliterator is allowed to process partitions of the input in parallel, you can expect the output to be interleaved from different sections of the original source file.
If you want to preserve the source file order in your output, you can enforce sequential processing (pass false instead of true) and change forEach to forEachOrdered.
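Sketched out, the order-preserving variant of the pipeline would look like this (variable names assumed to match the earlier discussion):

```java
// Sequential stream: records are processed in their file order.
StreamSupport.stream(format.parse(reader).spliterator(), false)
        .map((csvRecord) -> Word.fromCsvRecord(csvRecord))
        .forEachOrdered(consumer);
```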
Most of the core insights and techniques used above came to me from the following two Stack Overflow answers: