Latest Progress
This last week, I added TxtXMLImporter, which supplies txt files to ParseCit and returns the result (can also file it out). This worked fine on the test data, as well as on the actual txt files I got from PDFs using PDFBox. I then wrapped the usage of PDFBox and ParseCit into a PDFXMLImporter, which, in a next step, should also let the user choose between the different pipelines (also using pdftoxml and pdftotxt from the original EggShell) that result in a PDF-to-XML conversion.
I realized that I had some duplicated code, so I factored that out into the common superclass (PaperImporter). Unfortunately, the "inheritance" situation I have doesn't really reflect a good inheritance relationship, so I will want to refactor that.
I also spent some time going through what I have so far and adding class comments. This is where I came across some code smells and design flaws (like the bad inheritance mentioned above), which I want to get rid of, probably during this week.
Next Steps
First of all, I want to do a little refactoring and make sure the design is as good as I can make it right now. Then I want to add message comments as well, something I had neglected a little bit so far. The same thing goes for tests. Now that my pipelines seem to work, I think I should write some unit tests for them.
As I mentioned above, PDFXMLImporter should also be able to use the other tools available from the original EggShell. This will be one of the next things I want to do.
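To make the idea of selectable pipelines concrete, here is a minimal sketch in Python (not the project's own language). It treats a pipeline as a PDF-to-text step followed by a text-to-XML step and lets the caller pick one by name. Every name in it (`Pipeline`, `PdfXmlImporter`, the stub converters) is hypothetical; the stubs only stand in for real tools such as PDFBox, pdftotxt or ParseCit and do not call them.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Each step is modelled as a plain function (an assumption of this sketch).
PdfToText = Callable[[bytes], str]
TextToXml = Callable[[str], str]


@dataclass(frozen=True)
class Pipeline:
    """One PDF-to-XML route: a text extractor followed by an XML producer."""
    name: str
    pdf_to_text: PdfToText
    text_to_xml: TextToXml

    def run(self, pdf: bytes) -> str:
        return self.text_to_xml(self.pdf_to_text(pdf))


class PdfXmlImporter:
    """Holds the available pipelines and runs the one the user asks for."""

    def __init__(self) -> None:
        self._pipelines: Dict[str, Pipeline] = {}

    def register(self, pipeline: Pipeline) -> None:
        self._pipelines[pipeline.name] = pipeline

    def import_pdf(self, pdf: bytes, pipeline_name: str) -> str:
        if pipeline_name not in self._pipelines:
            raise ValueError(f"unknown pipeline: {pipeline_name!r}")
        return self._pipelines[pipeline_name].run(pdf)

    def import_with_all(self, pdf: bytes) -> Dict[str, str]:
        # Run every registered pipeline so the results can be compared.
        return {name: p.run(pdf) for name, p in self._pipelines.items()}


# Stub converters: placeholders for the real tools, not their actual behaviour.
def stub_extractor_a(pdf: bytes) -> str:
    return pdf.decode("utf-8").strip()


def stub_extractor_b(pdf: bytes) -> str:
    return pdf.decode("utf-8").strip().upper()


def stub_text_to_xml(text: str) -> str:
    return f"<paper>{text}</paper>"


if __name__ == "__main__":
    importer = PdfXmlImporter()
    importer.register(Pipeline("a", stub_extractor_a, stub_text_to_xml))
    importer.register(Pipeline("b", stub_extractor_b, stub_text_to_xml))
    print(importer.import_with_all(b" some pdf content "))
```

Keeping the steps as interchangeable functions means a new converter only needs to be registered under a name, and running all pipelines on the same input gives a simple basis for comparing their output.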
Once I have assembled all the tools for converting PDFs to XML, I want to come back to what I began a couple of weeks ago: studying the imported XML, analyzing what each pipeline can give me, and what information I might be able to deduce from it. I especially want to focus on the differences between the entire old pipeline and the new pipeline with either PDFBox or pdftotxt as a PDF-to-text converter. It's also possible to combine multiple pipelines if that leads to a better result, so that's an approach I want to consider as well.
Since ParseCit can already retrieve a lot of information, I want to start working on taking that data out of the XML and modelling it. I suspect that this might be the easiest way to get a first extended data model rather soon.
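As an illustration of what taking data out of the XML and modelling it could look like, here is a small Python sketch that reads citation entries into simple objects. The sample XML is made up and heavily simplified, and the element names (`citation`, `authors`, `author`, `title`, `date`) are assumptions that would need to be checked against the actual output; `Citation` and `parse_citations` are hypothetical names.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List, Optional

# Made-up, simplified sample; the real XML will differ.
SAMPLE_XML = """<citationList>
  <citation>
    <authors><author>A Author</author><author>B Writer</author></authors>
    <title>An Example Title</title>
    <date>2001</date>
  </citation>
  <citation>
    <title>A Title Without Authors</title>
  </citation>
</citationList>"""


@dataclass
class Citation:
    """A first, minimal model of one cited work."""
    title: Optional[str]
    authors: List[str] = field(default_factory=list)
    year: Optional[int] = None


def parse_citations(xml_text: str) -> List[Citation]:
    root = ET.fromstring(xml_text)
    citations = []
    for node in root.iter("citation"):
        title = node.findtext("title")
        authors = [a.text.strip() for a in node.iter("author") if a.text]
        date = node.findtext("date")
        # Only accept a plain number as the year; anything else stays None.
        year = int(date) if date and date.strip().isdigit() else None
        citations.append(
            Citation(
                title=title.strip() if title else None,
                authors=authors,
                year=year,
            )
        )
    return citations


if __name__ == "__main__":
    for citation in parse_citations(SAMPLE_XML):
        print(citation)
```

Missing fields are kept as `None` or an empty list rather than guessed, so gaps in what a pipeline delivers stay visible in the model.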