Monday, 21 November 2016

Assembling the pipeline

Latest Progress

This last week, I added TxtXMLImporter, which supplies txt files to ParseCit and returns the result (can also file it out). This worked fine on the test data, as well as on the actual txt files I got from PDFs using PDFBox. I then wrapped the usage of PDFBox and ParseCit into a PDFXMLImporter, which, in a next step, should also let the user choose between the different pipelines (also using pdftoxml and pdftotxt from the original EggShell) that result in a PDF-to-XML conversion.

I realized that I had some duplicated code, so I factored that out into the common superclass (PaperImporter). Unfortunately, the "inheritance" situation I have doesn't really reflect a good inheritance relationship, so I will want to refactor that.

I also spent some time on going through what I have so far, adding some class comments. This is where I came across some code smells and design flaws (like the bad inheritance mentioned above), which I want to get rid of, probably during this week.

Next Steps

First of all, I want to do a little refactoring and make sure the design is as good as I can make it right now. Then I want to add message comments as well, something I had neglected a little bit so far. The same thing goes for tests. Now that my pipelines seem to work, I think I should write some unit tests for them.

As I mentioned above, PDFXMLImporter should also be able to use the other tools available from the original EggShell. This will be one of the next things I want to do.

Once I have assembled all the tools for converting PDFs to XML, I want to come back to what I began a couple of weeks ago: studying the imported XML, analyzing what each pipeline can give me, and what information I might be able to deduct from it. I especially want to focus on the differences between the entire old pipeline, and the new pipeline with either PDFBox or pdftotxt as a PDF-to-text converter. It's also possible to use multiple pipelines, if it leads to a better result. so that's an approach I want to consider as well.

Since ParseCit can already retrieve a lot of information, I want to start working on taking that data out of the XML and modelling it. I suspect that this might be the easiest way to get a first extended data model rather soon.

Likely Challenges

As soon as I start analyzing the imported XML data, I suspect it might be a challenge to maintain the overview over all possible pipelines and to find out which ones are best suited for what, and to anticipate each one's drawbacks as early as possible. I then need to make a solid prediction about what further features I might be able to extract, in order to conduct the interviews as soon as possible.

No comments:

Post a Comment