Latest Progress

Last week, I addressed the problem of non-sequential XML import. I discussed it with Leonel, who pointed out that the imported XML provides position information for every text element. This information should be enough to distinguish between left- and right-hand column elements, which means I should be able to put them all back into the correct order.
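To make the reordering idea concrete, here is a minimal sketch in Python (not the Pharo code I will actually write). It assumes a strict two-column layout, a known page width, and coordinates whose origin is the top-left corner with y growing downward. The `TextElement` class and the `reading_order` function are hypothetical names used only for illustration; the real XML may expose positions differently.

```python
from dataclasses import dataclass


@dataclass
class TextElement:
    """Hypothetical stand-in for one positioned text element from the XML."""
    page: int
    x: float  # left edge of the element
    y: float  # top edge; assumed to grow downward from the top of the page
    text: str


def reading_order(elements, page_width):
    """Sort elements page by page, left column first, top to bottom."""
    middle = page_width / 2

    def sort_key(element):
        # Elements starting left of the page middle belong to column 0.
        column = 0 if element.x < middle else 1
        return (element.page, column, element.y, element.x)

    return sorted(elements, key=sort_key)


if __name__ == "__main__":
    scrambled = [
        TextElement(1, 320, 100, "right column, first line"),
        TextElement(1, 50, 100, "left column, first line"),
        TextElement(1, 50, 200, "left column, second line"),
        TextElement(1, 320, 200, "right column, second line"),
    ]
    for element in reading_order(scrambled, page_width=600):
        print(element.text)
```

A fixed split at the page middle is the simplest possible rule. Titles or figures that span both columns would need separate handling, which this sketch does not attempt.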
However, Leonel also mentioned that he had used a different XML importer for a similar purpose, called ParseCit. It takes raw text as input and can even extract the title, authors and citations, using a Conditional Random Field (CRF) approach. I downloaded ParseCit, installed the necessary components, and documented my installation process as well as I could, for future reference. I also ran the tool on the provided sample data, which worked fine.
Note that ParseCit doesn't parse PDFs itself, so I needed a second tool for that. I decided to give Apache PDFBox a try, and so far, it looks very good. The tool is simply a runnable jar file that offers a variety of command line utilities, well suited to my needs. I then built a pipeline to use PDFBox from within Pharo, which now works fine as well. The pipeline can also write the imported raw text back out, so that ParseCit can use it as input. It's worth mentioning that Dominik's modified version of XPdf also includes a text-based import, so I'll definitely want to try it with that as well. I will most likely change my text-import pipeline to allow switching between these two tools, so that I can try out both and see which one works better.
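The switchable text-import step could look roughly like the following Python sketch (again, not my Pharo code). It assumes the `ExtractText` utility of the PDFBox 1.x/2.x command line app (PDFBox 3.x renamed it to `export:text`) and the stock `pdftotext` program from XPdf. Dominik's modified XPdf may be invoked differently, and the jar name `pdfbox-app.jar` as well as both function names are placeholders.

```python
import subprocess


def extraction_command(tool, pdf_path, txt_path, pdfbox_jar="pdfbox-app.jar"):
    """Build the command line for the chosen text extraction tool."""
    if tool == "pdfbox":
        # PDFBox 1.x/2.x style: java -jar <jar> ExtractText <input> <output>
        return ["java", "-jar", pdfbox_jar, "ExtractText", pdf_path, txt_path]
    if tool == "xpdf":
        # Stock XPdf: pdftotext <input> <output>; a modified build may differ.
        return ["pdftotext", pdf_path, txt_path]
    raise ValueError(f"unknown extraction tool: {tool}")


def extract_text(tool, pdf_path, txt_path):
    """Run the chosen tool and return the raw text it wrote to txt_path."""
    subprocess.run(extraction_command(tool, pdf_path, txt_path), check=True)
    with open(txt_path, encoding="utf-8") as text_file:
        return text_file.read()


if __name__ == "__main__":
    # Only prints the commands; nothing is executed here.
    for tool in ("pdfbox", "xpdf"):
        print(" ".join(extraction_command(tool, "paper.pdf", "paper.txt")))
```

Keeping the command construction separate from the execution makes it easy to compare both tools on the same PDF, and the text file each one writes can then be handed to ParseCit unchanged.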
Since I've now gathered a number of tools I need or might want to use later, I organized them all in a GitHub repository. Of course, that means I am redistributing them, so I had to spend some time checking all the licenses and reading up on whether and how they allow redistribution. Since everything I'm using so far is published under either the GNU GPL or the Apache license, and my repository is merely an aggregation of distinct programs rather than a combination into a new one, this doesn't seem to be a problem.