Creating a Java 8 Stream from unbounded data using Spliterator

Problem Statement

I have a large XML file. I would like to read it and then group and aggregate its rows using Java 8. A DOM parser with JAXB cannot handle this, as the file is far too large to hold in memory. I would like to create a Stream from the unbounded data contained in the XML file.

Solution

I read the XML by streaming it with StAX, so the whole file is never loaded into memory. Going a step further, I use JAXB to unmarshal small portions of the file, each of which I call a row. A Spliterator backed by a BlockingQueue turns these rows into a Stream. Once I have the Stream, I apply the well-known grouping-by collector and aggregate the rows.

The XML

The sample XML looks like this:
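A minimal sketch of the structure, assuming a single root element wrapping many repeating "T" elements (the element and field names here are illustrative, not the actual dataset):

```xml
<data>
    <T>
        <id>1</id>
        <name>Widget A</name>
        <group>hardware</group>
        <price>10.50</price>
    </T>
    <T>
        <id>2</id>
        <name>Widget B</name>
        <group>hardware</group>
        <price>4.25</price>
    </T>
    <!-- ... thousands more T elements ... -->
</data>
```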

There are thousands of "T" elements. I have modeled my POJO on the "T" element. I use StAX to read the XML; each time I read a complete "T" element, I use JAXB to unmarshal it into a Java object and then feed it to the Stream.

The POJO

I have modeled the POJO as below:
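A sketch of such a POJO, assuming the illustrative fields from the XML above, standard JAXB field-level binding, and a hypothetical class name Row:

```java
import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlRootElement;

// Maps a single <T> element; the field names are illustrative
@XmlRootElement(name = "T")
@XmlAccessorType(XmlAccessType.FIELD)
public class Row {

    private long id;
    private String name;
    private String group;
    private double price;

    // remaining getters and setters omitted for brevity
    public String getGroup() {
        return group;
    }

    public double getPrice() {
        return price;
    }
}
```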

The StAX Parser

The heart of this is the StAX parser:
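The original listing is not reproduced here; the following is a minimal sketch of the idea, assuming the hypothetical Row POJO above. It walks the document with an XMLStreamReader, lets JAXB unmarshal each "T" subtree, and hands every Row to a BlockingQueue; a sentinel Row marking the end of the data is an assumption of this sketch:

```java
import java.io.InputStream;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;

import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// StAX-based reader: never holds the whole document in memory, it unmarshals
// one <T> element at a time and puts the resulting Row on the BlockingQueue.
public class StaxReader implements Runnable {

    // Sentinel telling the consumer that there are no more rows (an assumption of this sketch)
    public static final Row END_OF_DATA = new Row();

    private final InputStream xmlStream;
    private final BlockingQueue<Row> queue;
    private final CountDownLatch endOfDocument;

    public StaxReader(InputStream xmlStream, BlockingQueue<Row> queue, CountDownLatch endOfDocument) {
        this.xmlStream = xmlStream;
        this.queue = queue;
        this.endOfDocument = endOfDocument;
    }

    @Override
    public void run() {
        try {
            XMLStreamReader reader = XMLInputFactory.newFactory().createXMLStreamReader(xmlStream);
            Unmarshaller unmarshaller = JAXBContext.newInstance(Row.class).createUnmarshaller();
            while (reader.hasNext()) {
                if (reader.getEventType() == XMLStreamConstants.START_ELEMENT
                        && "T".equals(reader.getLocalName())) {
                    // Unmarshal just this <T> subtree; the reader is left positioned after </T>
                    Row row = unmarshaller.unmarshal(reader, Row.class).getValue();
                    queue.put(row); // blocks if the consumer is slower than the producer
                } else {
                    reader.next();
                }
            }
            queue.put(END_OF_DATA); // signal that the document is exhausted
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            // lets the JUnit test wait until the document has been read fully
            endOfDocument.countDown();
        }
    }
}
```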

 

I use the CountDownLatch only because I need my JUnit test to stay alive until the document has been read fully; it would not be needed in an actual server environment. Note the use of the BlockingQueue.

Spliterator implementation
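Again, a sketch rather than the original implementation: a Spliterator whose tryAdvance() blocks on the BlockingQueue until the producer delivers the next Row, and which terminates the Stream when it sees the sentinel (the sentinel approach and the class name are assumptions of this sketch):

```java
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

// A Spliterator backed by a BlockingQueue: each tryAdvance() waits for the
// StAX reader thread to put the next Row on the queue.
public class QueueBackedSpliterator extends Spliterators.AbstractSpliterator<Row> {

    private final BlockingQueue<Row> queue;

    public QueueBackedSpliterator(BlockingQueue<Row> queue) {
        // unknown size, ordered, non-null elements
        super(Long.MAX_VALUE, Spliterator.ORDERED | Spliterator.NONNULL);
        this.queue = queue;
    }

    @Override
    public boolean tryAdvance(Consumer<? super Row> action) {
        try {
            Row row = queue.take();               // waits for the producer
            if (row == StaxReader.END_OF_DATA) {  // no more rows coming
                return false;                     // terminates the Stream
            }
            action.accept(row);
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```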

 

The grouping logic

This part is very simple. Since the sample data is a GZip file, we stream it through a GZIPInputStream and then apply the groupingBy collector to aggregate the rows:
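A sketch of how the pieces might be wired together, assuming the hypothetical classes above; the file name, the grouping key (the group field) and the aggregation (summing price) are all illustrative:

```java
import java.io.FileInputStream;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;
import java.util.zip.GZIPInputStream;

public class GroupingDemo {

    public static void main(String[] args) throws Exception {
        BlockingQueue<Row> queue = new ArrayBlockingQueue<>(1000);
        CountDownLatch endOfDocument = new CountDownLatch(1);

        // The XML is GZip-compressed, so wrap it in a GZIPInputStream and let the
        // StAX reader consume it on a separate thread (file name is hypothetical)
        GZIPInputStream gzipStream = new GZIPInputStream(new FileInputStream("large-data.xml.gz"));
        new Thread(new StaxReader(gzipStream, queue, endOfDocument)).start();

        // Turn the queue-backed Spliterator into a sequential Stream
        Stream<Row> rows = StreamSupport.stream(new QueueBackedSpliterator(queue), false);

        // Group the rows and aggregate: here, total price per group
        Map<String, Double> totalPricePerGroup = rows.collect(
                Collectors.groupingBy(Row::getGroup, Collectors.summingDouble(Row::getPrice)));

        totalPricePerGroup.forEach((group, total) -> System.out.println(group + " -> " + total));

        // Only needed in a JUnit test, to stay alive until parsing completes
        endOfDocument.await();
    }
}
```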

 

Sources

https://github.com/paawak/blog/tree/master/code/reactive-streams/spliterator-demo

I found some large XML files at the location below:

http://www.cs.washington.edu/research/xmldatasets/www/repository.html