Monday, 5 December 2016

Analyzing the XMLs and accessing the data

Latest Progress

I first worked a little more on the Metacello configuration, that is, I split my version configuration into version and baseline and made sure the CommandLine packages are added as a dependency. I now have a working version 0.1, which (hopefully) correct dependencies.

I then adapted my conversion utility that can convert the entire examplePDFs folder to XML documents to support all three pipelines. This works fine now, and I have three XML documents for each sample PDF, one document for each pipeline. With that, I could finally start comparing the different results. While obviously #xpdf is very different from #pcparsecit and #pcxpdf, the two latter ones are rather similar. Hower, #pcpdfbox seemed to perform better overall. Right now, I think the best idea is to take all the data ParseCit can extract, using #pcpdfbox, then extracting whatever additoinal features are possible and interesting from #xpdf.

What's interesting is that ParseCit uses three different algorithms for data extraction. One focusses on header data (ParsHed), on solely on citations (ParsCit) and one that extracts various different features throughout the document (SectLabel). There are some overlaps between the features extracted by each of these algorithms. For these overlaps, I need to check which algorithm tends to have a higher confidence value.

Back in Pharo, I tried to get comfortable with handling XML documents and nodes. This worked well rather quickly, but there were a couple of operations I didn't find in XMLNodeWithChildren, so I implemented then as extensions. With all of that working, I started implementing ParseCitDataExtractor (and the neccessary helper classes), which has the purpose of providing the extracted data through normal accessor methods, that is, in the language of Pharo. So far, the provided features are title, author names, author affiliations, author e-mails, and section headers.

During my work, I noticed that importing the PDFs takes quite some time. There isn't much I can do about this, but still I wanted to have an idea about how long exactly. Therefore, I built a small performance analyzer. These are the numbers I got for the sample PDFs, when importing them using #pcpdfbox, loading them in a ParseCitDataExtractor and putting that data into an ad-hoc sample model:

Next Steps

First of all, I'll continue expanding the ParseCitDataExtractor to more of the features provided by ParseCit's XML. I can also begin working on the model, since I now have some data to use it for. Furthermore, there's still the #xpdf pipeline, which offers positional data, that might be useful to extract further features using heuristics, the way EggShell used to do it. For that, I'll first have to identify what features could possibly be extracted. In second step, it would be a good idea to interview some people at SCG, as Leonel and I have discussed before. I want to give people some sample queries, so they have an idea what type of questions I want the data model to be able to answer, then I would like them to suggest further questions they might be interested in. Based on these results and my estimates about what features I might be able to extract, I think I can create an outline of what further extractions I want to attempt.

There's one more thing I want to attempt soon: As I mentioned above, the importing takes rather long. Even for a small amount of papers, the system may look unresponsive for a long time. Therefore I would like to have a progress bar (with e.g. 1 progress-step per imported paper), showing which PDF is being imported right now, maybe with a cancel option, and maybe even with an "estimated time remaining", but I'll have to see how far I can actually go (also in terms of how much time I can spend on it).

Likely Challenges

It appears that in the ParsCit algorithm, the names of cited papers are sometimes split into name and book title or something. I'll need to see how to make the best out of these results. Related to that is the fact that the titles of cited papers may not always completely be equal to the actual paper titles. This might be a problem when wanting to reference them. The same goes for ambiguous affiliation names: there's a good chance that someone might call their affiliation "SCG, University of Bern", someone might call it "Software Composition Group, University of Bern, Switzerland", etc. I don't yet know how to make sure that these are all regarded as the same affiliation, if that's at all possible. Also, there are some encoding problems: some characters are not correctly interpreted. I'll have to see what I can do about this.


No comments:

Post a Comment