Monday, 12 December 2016
Latest Progress
I've been rather busy this week, so I don't have too much to report this time. I've been working on creating a model and loading the data into it. This is going okay: I can now load a ScientificCommunity model for single papers. While implementing the model loader, I had to go over some of last week's work and change and extend some parts of the XML data extractor, which also took some time.
Next Steps
I'll continue working on the model, especially on adding the reference relation (citations). Once I have a good model of the scientific community, I can come up with some queries and start thinking about visualizations (that is, start to get comfortable with Roassal). Also, I haven't yet thought about what additional features, if any, I'll want to try and extract. If I do want to extract more features, I'll have to implement the xpdf-XML-sequencer first, then define the features to be extracted, develop heuristics to extract them, test the accuracy, and get the data into the model. However, additional features will not be my main focus right now. I think it's better to have a working, queryable model and some sort of visualization first.
Likely Challenges
I think I've mentioned this before, but affiliation names are sometimes ambiguous, that is, affiliations we would consider to be the same might not always have the exact same name. Also, some characters don't seem to be handled very well. This might become a problem when looking for a paper title (e.g. to establish a reference relation). I'll have to give these problems some thought as soon as I have time.
Monday, 5 December 2016
Latest Progress
I first worked a little more on the Metacello configuration, that is, I split my version configuration into version and baseline and made sure the CommandLine packages are added as a dependency. I now have a working version 0.1, which (hopefully) has correct dependencies.
I then adapted my conversion utility, which converts the entire examplePDFs folder to XML documents, to support all three pipelines. This works fine now, and I have three XML documents for each sample PDF, one document for each pipeline. With that, I could finally start comparing the different results. While #xpdf is obviously very different from #pcpdfbox and #pcxpdf, the latter two are rather similar. However, #pcpdfbox seemed to perform better overall. Right now, I think the best idea is to take all the data ParseCit can extract, using #pcpdfbox, and then extract whatever additional features are possible and interesting from #xpdf.
What's interesting is that ParseCit uses three different algorithms for data extraction. One focuses on header data (ParsHed), one solely on citations (ParsCit), and one extracts various features throughout the document (SectLabel). There are some overlaps between the features extracted by each of these algorithms. For these overlaps, I need to check which algorithm tends to have a higher confidence value.
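The confidence comparison for overlapping features could be sketched roughly as follows. This is a minimal illustration in Python rather than Pharo, and the `Extraction` record and the function name are hypothetical. It assumes each extracted value has already been reduced to an algorithm name, a feature name, and a confidence value.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, NamedTuple


class Extraction(NamedTuple):
    """One extracted value (hypothetical record, not part of ParseCit)."""
    algorithm: str     # e.g. "ParsHed" or "SectLabel"
    feature: str       # e.g. "title"
    confidence: float  # confidence value reported for this extraction


def preferred_algorithm(extractions: Iterable[Extraction]) -> Dict[str, str]:
    """For each feature, return the algorithm with the highest mean confidence."""
    # feature -> algorithm -> list of observed confidence values
    scores = defaultdict(lambda: defaultdict(list))
    for item in extractions:
        scores[item.feature][item.algorithm].append(item.confidence)
    return {
        feature: max(by_algorithm, key=lambda name: mean(by_algorithm[name]))
        for feature, by_algorithm in scores.items()
    }
```

Run over all sample papers, the resulting mapping would suggest which algorithm to trust for each overlapping feature. Comparing mean values is only one possible reading of "tends to have a higher confidence".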
Back in Pharo, I tried to get comfortable with handling XML documents and nodes. This worked well rather quickly, but there were a couple of operations I didn't find in XMLNodeWithChildren, so I implemented them as extensions. With all of that working, I started implementing ParseCitDataExtractor (and the necessary helper classes), whose purpose is to provide the extracted data through normal accessor methods, that is, in the language of Pharo. So far, the provided features are title, author names, author affiliations, author e-mails, and section headers.
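To illustrate the idea of exposing extracted data through plain accessors, here is a minimal sketch in Python (not the actual Pharo ParseCitDataExtractor). The class name `HeaderDataExtractor`, the sample XML, and the element names (`title`, `author`, `affiliation`, `email`) are assumptions made for illustration and may not match ParseCit's real output exactly.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional

# Hypothetical, simplified sample; real ParseCit output may be structured differently.
SAMPLE_XML = """
<algorithms>
  <algorithm name="ParsHed">
    <variant>
      <title confidence="0.99">A Sample Paper</title>
      <author confidence="0.98">Jane Doe</author>
      <author confidence="0.97">John Roe</author>
      <affiliation confidence="0.95">University of Bern</affiliation>
      <email confidence="0.99">jane.doe@example.org</email>
    </variant>
  </algorithm>
</algorithms>
"""


class HeaderDataExtractor:
    """Wraps an XML document and offers header data through accessors."""

    def __init__(self, xml_text: str) -> None:
        root = ET.fromstring(xml_text)
        # Only look at the header algorithm's section of the document.
        self._header = root.find("algorithm[@name='ParsHed']")

    def _texts(self, tag: str) -> List[str]:
        if self._header is None:
            return []
        return [
            element.text.strip()
            for element in self._header.iter(tag)
            if element.text and element.text.strip()
        ]

    @property
    def title(self) -> Optional[str]:
        titles = self._texts("title")
        return titles[0] if titles else None

    @property
    def author_names(self) -> List[str]:
        return self._texts("author")

    @property
    def affiliations(self) -> List[str]:
        return self._texts("affiliation")

    @property
    def emails(self) -> List[str]:
        return self._texts("email")
```

Client code then never touches XML nodes directly: `HeaderDataExtractor(SAMPLE_XML).author_names` yields a plain list of strings.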
During my work, I noticed that importing the PDFs takes quite some time. There isn't much I can do about this, but still I wanted to have an idea about how long exactly. Therefore, I built a small performance analyzer. These are the numbers I got for the sample PDFs, when importing them using #pcpdfbox, loading them in a ParseCitDataExtractor and putting that data into an ad-hoc sample model:
Next Steps
First of all, I'll continue expanding the ParseCitDataExtractor to cover more of the features provided by ParseCit's XML. I can also begin working on the model, since I now have some data to use it for. Furthermore, there's still the #xpdf pipeline, which offers positional data that might be useful for extracting further features using heuristics, the way EggShell used to do it. For that, I'll first have to identify what features could possibly be extracted. In a second step, it would be a good idea to interview some people at SCG, as Leonel and I have discussed before. I want to give people some sample queries, so they have an idea of what type of questions I want the data model to be able to answer. Then I would like them to suggest further questions they might be interested in. Based on these results and my estimates of what features I might be able to extract, I think I can create an outline of what further extractions I want to attempt.
There's one more thing I want to attempt soon: as I mentioned above, the importing takes rather long. Even for a small number of papers, the system may look unresponsive for a long time. Therefore I would like to have a progress bar (with e.g. one progress step per imported paper), showing which PDF is being imported right now, maybe with a cancel option, and maybe even with an "estimated time remaining", but I'll have to see how far I can actually go (also in terms of how much time I can spend on it).
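The progress reporting described here, with one step per paper and a rough "estimated time remaining", could look roughly like the following sketch. It is written in Python rather than Pharo, and all names are hypothetical. The estimate simply assumes that every remaining paper takes as long as the average of the papers imported so far.

```python
import time
from typing import Callable, List, Optional, Sequence


def estimate_remaining(elapsed: float, done: int, total: int) -> Optional[float]:
    """Estimate remaining seconds from the average time per finished item."""
    if done == 0:
        return None  # nothing finished yet, so there is no basis for an estimate
    return elapsed / done * (total - done)


def import_with_progress(
    paths: Sequence[str],
    import_one: Callable[[str], object],
    report: Callable[[str], None] = print,
    clock: Callable[[], float] = time.monotonic,
) -> List[object]:
    """Import each path in turn, reporting one progress step per paper."""
    total = len(paths)
    start = clock()
    results = []
    for index, path in enumerate(paths):
        remaining = estimate_remaining(clock() - start, index, total)
        eta = "unknown" if remaining is None else f"{remaining:.1f}s"
        report(f"[{index + 1}/{total}] importing {path} (estimated time remaining: {eta})")
        results.append(import_one(path))
    return results
```

A cancel option is left out here. It would need the import to run in a separate process or thread that the UI can interrupt.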
Likely Challenges
It appears that in the ParsCit algorithm, the names of cited papers are sometimes split into a name and a book title or something similar. I'll need to see how to make the best of these results. Related to that is the fact that the titles of cited papers may not always exactly match the actual paper titles. This might be a problem when wanting to reference them. The same goes for ambiguous affiliation names: there's a good chance that someone might call their affiliation "SCG, University of Bern", someone else might call it "Software Composition Group, University of Bern, Switzerland", etc. I don't yet know how to make sure that these are all regarded as the same affiliation, if that's possible at all. Also, there are some encoding problems: some characters are not correctly interpreted. I'll have to see what I can do about this.
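One possible starting point for the title-matching problem is to normalize both strings and compare them with a similarity ratio. The sketch below is in Python, and the function names and the 0.9 threshold are assumptions rather than a tested solution. It tolerates differences in case, punctuation, and accents. It would not resolve abbreviations such as "SCG", which would probably need a hand-maintained alias table.

```python
import difflib
import re
import unicodedata


def normalize(text: str) -> str:
    """Lower-case, strip accents, and collapse punctuation and whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    cleaned = re.sub(r"[^a-z0-9]+", " ", without_accents.lower())
    return cleaned.strip()


def similarity(a: str, b: str) -> float:
    """Similarity ratio between 0.0 and 1.0 of the two normalized strings."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def probably_same(a: str, b: str, threshold: float = 0.9) -> bool:
    """Treat two names as the same if they are similar enough (assumed threshold)."""
    return similarity(a, b) >= threshold
```

For example, `probably_same("Parsing PDFs: a case study", "Parsing PDFs - A Case Study.")` holds, while the two affiliation spellings quoted above stay below the threshold.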
Monday, 28 November 2016
Latest Progress
Unfortunately, I didn't get as much done this week as I wanted to. I feel like I managed to get rid of the code smells I'd found last week, though. I now distinguish between tool controllers and importers. A tool controller is basically just a wrapper that makes the command line services of a third-party tool available directly as Pharo messages. An importer (currently, there is only the PDFXMLImporter) is supposed to use various combinations of tools (that is, tool controllers) to get a specific importing task done. Currently, PDFXMLImporter supports PDF-to-XML importing using either a combination of PDFBox and ParseCit, a combination of XPdf (pdftotext) and ParseCit, or just XPdf on its own. In the code, these pipelines are referred to as #pcpdfbox, #pcxpdf, and #xpdf respectively.
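The split between tool controllers and an importer that chains them can be sketched as follows. This is Python rather than Pharo. The step names and the `PdfXmlImporter` class are hypothetical, and the tool controllers are reduced to plain callables, so no real command line invocation is shown or implied.

```python
from typing import Callable, Dict, List

# A tool controller is modelled as a function from an input path to an output path.
ToolController = Callable[[str], str]

# Pipeline name -> ordered tool steps (the step names are illustrative assumptions).
PIPELINES: Dict[str, List[str]] = {
    "pcpdfbox": ["pdfbox_to_text", "parsecit_to_xml"],
    "pcxpdf": ["xpdf_to_text", "parsecit_to_xml"],
    "xpdf": ["xpdf_to_xml"],
}


class PdfXmlImporter:
    """Runs the tool controllers of the chosen pipeline one after another."""

    def __init__(self, tools: Dict[str, ToolController]) -> None:
        self._tools = tools

    def import_pdf(self, pdf_path: str, pipeline: str) -> str:
        if pipeline not in PIPELINES:
            raise ValueError(f"unknown pipeline: {pipeline}")
        result = pdf_path
        for step in PIPELINES[pipeline]:
            # Each step consumes the previous step's output.
            result = self._tools[step](result)
        return result
```

Keeping the pipelines in a table means that adding a new tool chain only requires a new entry and, if needed, a new controller.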
Additionally, I started having a look at Metacello configurations, to make it easier to load the project. Loading the currently latest version of each package into the project already works fine, but I didn't really get into specifying the dependencies yet. There is also a third party project I heavily depend on (called CommandLine), which is not included in the standard Moose 6.0 image. I'm sure I can use that Metacello configuration to load these packages as well, which would be very nice.
Next Steps
More or less the same ones as last week. I might need to implement that text block sorter though, for putting XPdf imports back into the correct order. See "Likely Challenges" for why this might be the case. But basically, this week I want to start analyzing the imported PDFs, and also maybe do some more work on the Metacello configuration. Apart from that, there are always some smaller tasks on my to-do list, like making sure the tool controllers can actually use all of the command line services provided by the third-party tools, so I might also do some work on some of these tasks.
Likely Challenges
Also more or less the same as last week. Only addition: I just had a quick glance at the XMLs imported by ParseCit. While it detects title, author, affiliation, etc. really well, it doesn't seem to provide any layout information about the remaining text blocks. This information might be important for extracting further features, which means I might need to use both the ParseCit and XPdf pipelines in parallel. This isn't a problem (except maybe for a longer import process for each PDF), but it means that, if I actually need it, I'll have to implement the text block sorter rather soon, which will put the text blocks imported by XPdf back into the correct order.
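A very rough idea of what such a text block sorter could do is sketched below, in Python rather than Pharo. It assumes that each text block carries a page number and x/y coordinates, that y grows downwards, and that the page has a plain two-column layout. Full-width elements such as titles or wide figures are not handled, and the `TextBlock` type is hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class TextBlock:
    """A positioned piece of text (hypothetical structure)."""
    page: int
    x: float  # left edge of the block
    y: float  # top edge of the block, growing downwards (assumption)
    text: str


def reading_order(blocks: Iterable[TextBlock], page_width: float) -> List[TextBlock]:
    """Sort blocks by page, then left column before right column, then top to bottom."""
    middle = page_width / 2

    def key(block: TextBlock) -> Tuple[int, int, float]:
        column = 0 if block.x < middle else 1
        return (block.page, column, block.y)

    return sorted(blocks, key=key)
```

Blocks that straddle the middle of the page would need an extra rule, for example treating them as separators between column segments.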
Monday, 21 November 2016
Latest Progress
This last week, I added TxtXMLImporter, which supplies txt files to ParseCit and returns the result (can also file it out). This worked fine on the test data, as well as on the actual txt files I got from PDFs using PDFBox. I then wrapped the usage of PDFBox and ParseCit into a PDFXMLImporter, which, in a next step, should also let the user choose between the different pipelines (also using pdftoxml and pdftotxt from the original EggShell) that result in a PDF-to-XML conversion.
I realized that I had some duplicated code, so I factored that out into the common superclass (PaperImporter). Unfortunately, the "inheritance" situation I have doesn't really reflect a good inheritance relationship, so I will want to refactor that.
I also spent some time on going through what I have so far, adding some class comments. This is where I came across some code smells and design flaws (like the bad inheritance mentioned above), which I want to get rid of, probably during this week.
Next Steps
First of all, I want to do a little refactoring and make sure the design is as good as I can make it right now. Then I want to add message comments as well, something I had neglected a little bit so far. The same thing goes for tests. Now that my pipelines seem to work, I think I should write some unit tests for them.
As I mentioned above, PDFXMLImporter should also be able to use the other tools available from the original EggShell. This will be one of the next things I want to do.
Once I have assembled all the tools for converting PDFs to XML, I want to come back to what I began a couple of weeks ago: studying the imported XML, analyzing what each pipeline can give me, and what information I might be able to deduce from it. I especially want to focus on the differences between the entire old pipeline and the new pipeline with either PDFBox or pdftotxt as a PDF-to-text converter. It's also possible to use multiple pipelines if that leads to a better result, so that's an approach I want to consider as well.
Since ParseCit can already retrieve a lot of information, I want to start working on taking that data out of the XML and modelling it. I suspect that this might be the easiest way to get a first extended data model rather soon.
Likely Challenges
As soon as I start analyzing the imported XML data, I suspect it might be a challenge to keep an overview of all possible pipelines, to find out which ones are best suited for what, and to anticipate each one's drawbacks as early as possible. I then need to make a solid prediction about what further features I might be able to extract, in order to conduct the interviews as soon as possible.
Monday, 14 November 2016
Latest Progress
Last week, I addressed the problem of non-sequential XML import. I talked about it with Leonel and he pointed out that the imported XML offers position information for every text element. This information should be enough to distinguish between left- and right-hand column elements, which means I should be able to put them all back into the correct order.
However, Leonel also mentioned that he had used a different XML importer for a similar purpose, called ParseCit. It takes raw text input and can even extract title, authors and citations, using a Conditional Random Field (CRF) approach. I downloaded ParseCit, installed the necessary components, and documented my installation process as well as possible, for future reference. I also ran the tool on the provided sample data, which worked fine.
Note that ParseCit doesn't parse PDFs itself, so I needed a second tool for that. I decided to give Apache PDFBox a try, and so far, it looks very good. The tool is merely a runnable jar file that offers a variety of command line utilities, well suited for my needs. I then built a pipeline to be able to use PDFBox from within Pharo, which now works fine as well. The pipeline can also re-output the imported raw text, for ParseCit to use as input. It's worth mentioning that Dominik's modified version of XPdf also includes a text-based import, so I'll definitely want to try it with that as well. I will most likely change my text-import pipeline to allow switching between these two tools, so that I can try out both and see which one works better.
Since I've now gathered a number of tools I need or might want to use later, I organized them all in a GitHub repository. Of course, that means I redistributed them, so I had to spend some time checking all the licenses and reading about if and how they allow redistribution. Since everything I'm using so far is published either under the GNU GPL or the Apache license, and my repository is merely an aggregation of distinct programs, rather than a combination into a new one, this doesn't seem to be a problem.
Next Steps
Even though I tested ParseCit with the provided sample data and managed to establish the PDF-to-text pipeline, I didn't yet get to put the two together. This will be my top priority this week. Once that works, I want to modify the PDF-to-XML importing process to include all possible pipelines (i.e. all different tool chains that deliver an acceptable XML result), in a way that lets me easily choose the path I want to use.
Likely Challenges
As I mentioned, I've never tried ParseCit with actual data. Although there doesn't seem to be a reason why it shouldn't work, you can never be really sure. Since a good quality XML importer is vital to my project, it's very important to me to have at least one well-working import pipeline. Should ParseCit not work the way I'm hoping, I'll have to spend some time on re-ordering the import result of Dominik's pdftoxml.
Monday, 7 November 2016
Latest Progress
The first thing I did this week was reorganizing the parts of EggShell that are important for my project so far, putting them into different packages and adding these packages to my own repository. Now I have my own independent version of the project, called ExtendedEggShell, which contains all of what I've done with it until now. I then built a small utility that "intercepts" the importing and modelling process at the XML stage and exports the imported XML string into a new file, and ran this for all the example PDFs.
Once I was done with that, I spent some time thinking about how to assess the precision of the data extraction I will be implementing. I don't think it makes sense to build on EggShell's extensive work in that area, since that assessment will just be a necessity, rather than the focus of my work. I know this isn't the most important question right now, but it just happened to cross my mind.
Finally, I did some work towards comparing the PDFs to their XML representation and identifying interesting parts for extraction. I wanted to go through all papers and, in the XML document, categorize all the parts by adding tags to them. I wanted to tag all sections, all paragraphs within these sections, all figures and listings within these paragraphs, and so forth. To get started, I went through all the papers rather quickly, just to identify possible tags, then I organized these tags into something like a meta-model (or meta-meta-model?). While this tagging is a lot of work, I think it might be very helpful in the future, since it should enable me to programmatically identify all the parts of the sample papers. I can use that to build the reference model for testing and accuracy assessment, I can use it to extract parts to compare them while I design the heuristics, and it might even be a good training set for some sort of supervised learning algorithm, in case I, or a future student, want to attempt that.
However, I stumbled across a problem I hadn't expected: as soon as a PDF page has figures, embedded text fields, and other elements like these, the extracted content isn't strictly linear anymore. It does follow a pattern, but it doesn't extract the complete left-hand column first and then the complete right-hand column. This matter is discussed further in the "Problems" section below.
Problems
As I've mentioned above, I ran into problems with the extraction of a certain paper, namely Cara15a.pdf (although I'm most likely going to encounter them with other papers as well). The first page was extracted very well and I was able to tag it the way I wanted to. However, on the second page, the extraction wasn't linear anymore: it jumped between the left- and right-hand columns; usually, some sort of figure was involved. Here's an (approximate) sketch of the extraction sequence I observed:
I want to see if there is a good reason for this, and if I can still work with it. Maybe I need to have a look at the source code of the extractor tool, but this might be rather difficult, since my experience with C++ is very limited. I will also discuss the matter with Leonel and maybe, if necessary, ask Dominik about it.