The source data file:
This is a simple text file containing a list of words. It contains two fields (number and word) separated by tabs. The file also has a header record containing column names.
Basically, something like this:
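The sample data itself is not preserved here, but based on the description it would look something like this (two fields separated by a tab, with a header record):

```
number	word
1	Hello
2	World
3	Stream
```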
Dependencies - just the one:
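The dependency listing is not preserved; for a Maven build it would be the Commons CSV artifact (the version shown is illustrative):

```xml
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-csv</artifactId>
    <version>1.9.0</version>
</dependency>
```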
Header.java - this enum represents the columns of the source data file.
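The original listing is not preserved here, so the following is a reconstruction from the description above - the constant names are illustrative, but the shape (a headerName field plus a static fromString lookup) matches the text:

```java
// Sketch of Header.java, reconstructed from the description.
public enum Header {
    NUMBER("number"),
    WORD("word");

    private final String headerName;

    Header(String headerName) {
        this.headerName = headerName;
    }

    public String getHeaderName() {
        return headerName;
    }

    // Finds the enum constant whose headerName matches the given string.
    public static Header fromString(String headerName) {
        for (Header header : Header.values()) {
            if (header.headerName.equalsIgnoreCase(headerName)) {
                return header;
            }
        }
        throw new IllegalArgumentException("No header found for: " + headerName);
    }
}
```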
The enum uses a field headerName which is not strictly needed in my case. But I included it here to also show a method to find an enum from a string (the static fromString method).
Note that the names used in this enum do not have to match the column headers used in the file. The order of entries in the enum does have to correspond to the order of columns in the input file.
(Another approach is to simply use CSVFormat.TDF.withHeader(), where TDF is the tab-delimited predefined format provided by Commons CSV. In this case your code can refer directly to the column heading names used in the source data file.)
Another way to write the enum’s for loop would be:
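The original alternative is not preserved; a stream-based rewrite of fromString is one common way to replace the explicit for loop (shown here inside the same illustrative enum):

```java
import java.util.Arrays;

// The Header enum again, with fromString() rewritten using a stream
// instead of an explicit for loop (an illustrative alternative; the
// article's original variant is not preserved here).
enum Header {
    NUMBER("number"),
    WORD("word");

    private final String headerName;

    Header(String headerName) {
        this.headerName = headerName;
    }

    public String getHeaderName() {
        return headerName;
    }

    // Stream-based lookup: filter the constants, take the first match.
    public static Header fromString(String name) {
        return Arrays.stream(values())
                .filter(h -> h.headerName.equalsIgnoreCase(name))
                .findFirst()
                .orElseThrow(() ->
                        new IllegalArgumentException("No header found for: " + name));
    }
}
```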
But I find my original for loop to be more readable in this case.
Word.java - this represents one row of data from the input file.
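Again a reconstruction: in the article, the factory method is fromCsvRecord(CSVRecord), which reads its two fields from a Commons CSV record. To keep this sketch dependency-free, it takes the raw field strings instead (a hypothetical fromFields method), but the private constructor / static factory shape is the same:

```java
// Sketch of Word.java. The article's factory is fromCsvRecord(CSVRecord);
// this dependency-free stand-in takes the field values as strings.
public class Word {

    private final int number;
    private final String word;

    // Private constructor: new instances come from the factory method.
    private Word(int number, String word) {
        this.number = number;
        this.word = word;
    }

    // Static factory method - unlike a constructor, it gets a meaningful name.
    public static Word fromFields(String number, String word) {
        return new Word(Integer.parseInt(number), word);
    }

    // Used later as the consumer (Word::lowercase): prints the number and
    // the lower-cased word to stdout, returning nothing.
    public void lowercase() {
        System.out.println(number + " " + word.toLowerCase());
    }
}
```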
In this class, the constructor is private. New instances of Word are created using the static factory method. This has some benefits over the constructor. One example: we can give it a meaningful name - especially helpful if we have multiple different factory methods.
CommonsCsvExample.java - this is where the processing happens.
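The listing is not preserved; reconstructed from the walkthrough below, it would look roughly like this (the file name, class name, and error handling are illustrative, and the commons-csv dependency is required):

```java
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.function.Consumer;
import java.util.stream.StreamSupport;

import org.apache.commons.csv.CSVFormat;

public class CommonsCsvExample {

    // Tab-delimited format: skip the file's header record, and use our
    // Header enum to name the data columns.
    private static final CSVFormat format = CSVFormat.Builder.create(CSVFormat.TDF)
            .setHeader(Header.class)
            .setSkipHeaderRecord(true)
            .build();

    public static void main(String[] args) {
        streamFileData("words.txt", Word::lowercase);
    }

    private static void streamFileData(String fileName, Consumer<Word> consumer) {
        try (Reader reader = new FileReader(fileName)) {
            StreamSupport.stream(format.parse(reader).spliterator(), true)
                    .map(csvRecord -> Word.fromCsvRecord(csvRecord))
                    .forEach(consumer);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```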
CSVFormat.Builder - This assumes a TDF (tab-delimited fields) format. The header record is skipped. Our Header enum is used to provide names for the file’s data columns.
Word::lowercase - This is a method reference. It is a special form of a lambda expression - and, as such, it can be passed as a parameter to the streamFileData method.
streamFileData - This method takes a file name and a Consumer. In this simple demo, the consumer is the method reference described above. Its implementing method (in the Word class) does not produce any return value - it just prints some text to stdout. That’s what makes it a consumer.
Reader reader - This produces a character stream from the source file.
That just leaves one final line of code to discuss:
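Reconstructed from the step-by-step breakdown that follows, the line is:

```java
StreamSupport.stream(format.parse(reader).spliterator(), true)
        .map((csvRecord) -> Word.fromCsvRecord(csvRecord))
        .forEach(consumer);
```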
Breaking that down into its separate steps…
format.parse(reader) - This returns a Commons CSV CSVParser, which supports iteration.
.spliterator() - The iterable creates a spliterator over its elements. A spliterator is an iterator which can process its iterated items in parallel (by splitting the input into separate partitions). You can also tell the spliterator to process its data serially (sequentially), if you prefer.
StreamSupport.stream() - This creates a stream from the spliterator.
true - This tells the StreamSupport.stream() method that the spliterator it uses will process the data in parallel (false would cause processing to be serial).
map() - each input record is transformed by a mapping function. Here, this function is implemented as a lambda.
At this point, the items being streamed by StreamSupport are a series of CSVRecord objects - or, rather, in our case, multiple parallel series of such objects. This is because CSVRecord is the object which Commons CSV creates as it parses each row of source data using the provided format.
(csvRecord) -> Word.fromCsvRecord(csvRecord) - This is the lambda function applied by the map() method. It uses the Word class’s static factory method, as already described above. That method parses individual fields from each CSVRecord and constructs new Word objects.
forEach(consumer) - Each Word instance is passed to the consumer function, which (as noted above) is Word::lowercase.
The result is a list of words (with their numbers) printed to stdout. The ordering of the output list is not guaranteed. In fact, because the spliterator is allowed to process partitions of the list in parallel, you can expect the output results to be interleaved from different sections of the original source file.
If you want to preserve the source file order in your output, you can enforce sequential processing (pass false instead of true) and change .forEach(consumer) to .forEachOrdered(consumer).
Acknowledgements:
Most of the core insights and techniques used above came to me from the following two Stack Overflow answers:
Convert Iterable to Stream using Java 8 JDK
Loading a CSV directly into a collection of object via Java Stream API