Monday, 14 November 2016

Trying out a new importer

Latest Progress

Last week, I addressed the problem of non-sequential XML import. I talked about it with Leonel and he pointed out that the imported XML offers position information for every text element. This information should be enough to distinguish between left- and right-hand column elements, which means I should be able to put them all back into the correct order.

However, Leonel also mentioned that he had used a different XML importer for a similar purpose, called ParseCit. It takes raw text input and can even extract title, authors and citations, using a Conditional Random Field (CRF) approach. I downloaded ParseCit, installed the necessary components, and documented my installation process as good as possible, for future reference. I also ran the tool on the provided sample data, which worked fine.

Note that ParseCit doesn't parse PDFs itself, so I needed a second tool for that. I decided to give Apache PDFBox a try, and so far, it looks very good. The tool is merely a runnable jar-file, that offers a variety of command line utilities, well suited for my needs. I the built a pipeline to be able to use PDFBox from within Pharo, which now works fine as well. The pipeline can also re-output the imported raw text, for ParseCit to use it as input. It's worth mentioning that Dominik's modified version of XPdf also includes a text-based import, so I'll definitely want to try it with that as well. I will most likely change my text-import pipeline to enable changing between these two tools, so that I can try out both, and see which one works better.

Since I've now gathered a number of tools I need or might want to use later, I organized them all in a GitHub repository. Of course, that means I redistributed then, so I had to spend some time on checking all the licenses and read about if and how they allow redistribution. Since everything I'm using so far is published either under the GNU GPL or the Apache license, and my repository is merely an aggregation of distinct programs, rather than a combination into a new one, this doesn't seem to be a problem.

Next Steps

Even thought I tested ParseCit with the provided sample data and managed to establish the PDF-to-text pipeline, I didn't yet get to put the two together. This will be my top priority this week. Once that works, I want to modify the PDF-to-XML importing process to include all possible pipelines (i.e. all different tool chains that deliver an acceptable XML result), in a way that lets me easily choose the path I want to use.

Likely Challenges

As I mentioned, I've never tried ParseCit with actual data. Although there doesn't seem to be a reason why it shouldn't work, you can never be really sure. Since a good quality XML importer is vital to my project, it's very important to me to have at least one well-working import pipeline. Should ParseCit not work the way I'm hoping, I'll have to spend some time on re-ordering the import result of Dominik's pdftoxml.


  1. Nice work Silas. Could you share the URL of your repo and instructions to run the tool? Thanks !

    1. Thanks for the feedback. I've added the link to the repo in-line. Detailed instructions will follow, but it's mainly just downloading the tools folder from the repo and putting it into the directory of your Pharo image.

  2. This comment has been removed by the author.