Monday, 5 December 2016
Latest Progress
I first worked a little more on the Metacello configuration: I split my version configuration into version and baseline and made sure the CommandLine packages are added as a dependency. I now have a working version 0.1, which (hopefully) has correct dependencies.
I then adapted my conversion utility, which converts the entire examplePDFs folder to XML documents, to support all three pipelines. This works fine now, and I have three XML documents for each sample PDF, one for each pipeline. With that, I could finally start comparing the different results. While #xpdf is obviously very different from #pcpdfbox and #pcxpdf, the latter two are rather similar. However, #pcpdfbox seemed to perform better overall. Right now, I think the best idea is to take all the data ParseCit can extract using #pcpdfbox, then extract whatever additional features are possible and interesting from #xpdf.
What's interesting is that ParseCit uses three different algorithms for data extraction. One focusses on header data (ParsHed), one solely on citations (ParsCit), and one extracts various features throughout the document (SectLabel). There are some overlaps between the features extracted by each of these algorithms. For these overlaps, I need to check which algorithm tends to have a higher confidence value.
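A minimal sketch of how such a comparison could look, written in Python rather than Pharo for brevity. It assumes that every extracted value can be reduced to a (feature, algorithm, confidence) triple; the function name and the sample numbers are hypothetical and not taken from ParseCit's output.

```python
from collections import defaultdict


def preferred_algorithm(observations):
    """Return, per feature, the algorithm with the highest mean confidence.

    observations: iterable of (feature, algorithm, confidence) triples.
    """
    # Accumulate [sum, count] for every (feature, algorithm) pair.
    totals = defaultdict(lambda: [0.0, 0])
    for feature, algorithm, confidence in observations:
        entry = totals[(feature, algorithm)]
        entry[0] += confidence
        entry[1] += 1

    # Keep the algorithm with the best mean for each feature.
    best = {}
    for (feature, algorithm), (total, count) in totals.items():
        mean = total / count
        if feature not in best or mean > best[feature][1]:
            best[feature] = (algorithm, mean)
    return {feature: algorithm for feature, (algorithm, _) in best.items()}


# Hypothetical confidence values, for illustration only.
sample = [
    ("title", "ParsHed", 0.9),
    ("title", "SectLabel", 0.7),
    ("title", "ParsHed", 0.8),
    ("email", "SectLabel", 0.6),
    ("email", "ParsHed", 0.5),
]
print(preferred_algorithm(sample))
```

Averaging over all sample papers is only one way to read "tends to"; a per-document maximum would be an equally plausible rule.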
Back in Pharo, I tried to get comfortable with handling XML documents and nodes. This worked well rather quickly, but there were a couple of operations I didn't find in XMLNodeWithChildren, so I implemented them as extensions. With all of that working, I started implementing ParseCitDataExtractor (and the necessary helper classes), which has the purpose of providing the extracted data through normal accessor methods, that is, in the language of Pharo. So far, the provided features are title, author names, author affiliations, author e-mails, and section headers.
During my work, I noticed that importing the PDFs takes quite some time. There isn't much I can do about this, but still I wanted to have an idea about how long exactly. Therefore, I built a small performance analyzer. These are the numbers I got for the sample PDFs, when importing them using #pcpdfbox, loading them in a ParseCitDataExtractor and putting that data into an ad-hoc sample model:
Next Steps
First of all, I'll continue expanding the ParseCitDataExtractor to more of the features provided by ParseCit's XML. I can also begin working on the model, since I now have some data to use it for. Furthermore, there's still the #xpdf pipeline, which offers positional data that might be useful to extract further features using heuristics, the way EggShell used to do it. For that, I'll first have to identify what features could possibly be extracted. In a second step, it would be a good idea to interview some people at SCG, as Leonel and I have discussed before. I want to give people some sample queries, so they have an idea what type of questions I want the data model to be able to answer, then I would like them to suggest further questions they might be interested in. Based on these results and my estimates about what features I might be able to extract, I think I can create an outline of what further extractions I want to attempt.
There's one more thing I want to attempt soon: as I mentioned above, the importing takes rather long. Even for a small number of papers, the system may look unresponsive for a long time. Therefore, I would like to have a progress bar (with e.g. one progress step per imported paper), showing which PDF is being imported right now, maybe with a cancel option, and maybe even with an "estimated time remaining", but I'll have to see how far I can actually go (also in terms of how much time I can spend on it).
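The progress reporting described here could be sketched roughly as follows, in Python rather than Pharo. All names (`import_with_progress`, `report`, `should_cancel`) are hypothetical, and the time estimate is the naive one: average time per finished paper multiplied by the number of papers left.

```python
import time


def import_with_progress(paths, import_one, report,
                         should_cancel=lambda: False, clock=time.monotonic):
    """Import each path in turn, one progress step per paper.

    report(path, index, total, eta) is called before each import; eta is the
    estimated remaining time in seconds, or None while nothing has finished.
    """
    start = clock()
    total = len(paths)
    results = []
    for index, path in enumerate(paths):
        if should_cancel():
            break  # stop early, keeping what has been imported so far
        elapsed = clock() - start
        # Average duration of the finished imports times the papers left.
        eta = None if index == 0 else elapsed / index * (total - index)
        report(path, index, total, eta)
        results.append(import_one(path))
    return results


def print_progress(path, index, total, eta):
    remaining = "unknown" if eta is None else f"{eta:.0f}s"
    print(f"[{index + 1}/{total}] importing {path} (remaining: {remaining})")
```

Passing the clock in as a parameter is only there to make the estimate easy to check with fixed timestamps; a real progress bar would use the default.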
Likely Challenges
It appears that in the ParsCit algorithm, the names of cited papers are sometimes split into name and book title or something. I'll need to see how to make the best out of these results. Related to that is the fact that the titles of cited papers may not always completely be equal to the actual paper titles. This might be a problem when wanting to reference them. The same goes for ambiguous affiliation names: there's a good chance that someone might call their affiliation "SCG, University of Bern", someone might call it "Software Composition Group, University of Bern, Switzerland", etc. I don't yet know how to make sure that these are all regarded as the same affiliation, if that's at all possible. Also, there are some encoding problems: some characters are not correctly interpreted. I'll have to see what I can do about this.
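One possible way to approach the title mismatch, shown as a Python sketch and not as a decision made in this project: normalize both strings, then accept the closest known title if it is similar enough. The function names, the sample titles and the cutoff of 0.9 are assumptions for illustration. Abbreviations such as "SCG" would not be resolved this way and would presumably need a hand-written alias table.

```python
import re
from difflib import get_close_matches


def normalize(title):
    """Lower-case the title, drop punctuation and collapse whitespace.

    Note: non-ASCII letters are dropped as well, which is a limitation.
    """
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", title.lower())
    return " ".join(cleaned.split())


def match_title(cited, known_titles, cutoff=0.9):
    """Return the known title closest to the cited one, or None."""
    by_normal_form = {normalize(title): title for title in known_titles}
    hits = get_close_matches(normalize(cited), list(by_normal_form),
                             n=1, cutoff=cutoff)
    return by_normal_form[hits[0]] if hits else None


# Hypothetical titles, for illustration only.
known = ["Parsing PDFs: A Survey", "Moose in Practice"]
print(match_title("parsing  PDFs - a survey.", known))
```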
Monday, 28 November 2016
Latest Progress
Unfortunately, I didn't get as much done as I wanted to this week. I feel like I managed to get rid of the code smells I'd found by last week, though. I now distinguish between tool controllers and importers. A tool controller is basically just a wrapper for using the command line services of a third party tool directly as a Pharo message. An importer (currently, there is only the PDFXMLImporter) is supposed to use various combinations of tools (that is, tool controllers) to get a specific importing task done. Currently, PDFXMLImporter supports pdf-to-xml importing using either a combination of PDFBox and ParseCit, a combination of XPdf (pdftotext) and ParseCit, or just XPdf on its own. In the code, these pipelines are referred to as #pcpdfbox, #pcxpdf, and #xpdf respectively.
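The split between tool controllers and importers can be illustrated with a small Python sketch (the real code is Pharo and will look different). Here each tool controller is just a callable taking and returning a string, and the importer only decides which controllers to chain for a given pipeline. The class name, parameter names and stand-in lambdas are hypothetical; no real command lines are invoked.

```python
class PdfXmlImporter:
    """Chains tool controllers into the three named pipelines."""

    def __init__(self, pdfbox_text, xpdf_text, xpdf_xml, parsecit):
        # Each argument is a tool controller: a callable wrapping one
        # third-party command line service.
        self._pipelines = {
            "pcpdfbox": [pdfbox_text, parsecit],  # PDFBox, then ParseCit
            "pcxpdf": [xpdf_text, parsecit],      # XPdf pdftotext, then ParseCit
            "xpdf": [xpdf_xml],                   # XPdf on its own
        }

    def import_pdf(self, path, pipeline):
        """Run the path through every step of the chosen pipeline."""
        if pipeline not in self._pipelines:
            raise ValueError(f"unknown pipeline: {pipeline}")
        data = path
        for step in self._pipelines[pipeline]:
            data = step(data)
        return data


# Stand-in controllers that only show the data flow.
importer = PdfXmlImporter(
    pdfbox_text=lambda path: f"text({path})",
    xpdf_text=lambda path: f"xtext({path})",
    xpdf_xml=lambda path: f"<xpdf>{path}</xpdf>",
    parsecit=lambda text: f"<parsecit>{text}</parsecit>",
)
print(importer.import_pdf("sample.pdf", "pcpdfbox"))
```

Keeping the pipelines in a single table means a new tool combination is one more entry, without touching the controllers themselves.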
Additionally, I started having a look at Metacello configurations, to make it easier to load the project. Loading the currently latest version of each package into the project already works fine, but I didn't really get into specifying the dependencies yet. There is also a third party project I heavily depend on (called CommandLine), which is not included in the standard Moose 6.0 image. I'm sure I can use that Metacello configuration to load these packages as well, which would be very nice.
Next Steps
More or less the same ones as last week. I might need to implement that text block sorter though, for putting XPdf imports back into the correct order. See "Likely Challenges" about why this might be the case. But basically, this week I want to start analyzing the imported PDFs, and also maybe do some more work on the Metacello configuration. Apart from that, there are always some smaller tasks on my to-do list, like making sure the tool controllers can actually use all of the command line services provided by the third party tools, etc., so I might also do some work on some of these tasks.
Likely Challenges
Also more or less the same as last week. Only addition: I just had a quick glance at the XMLs imported by ParseCit. While it detects title, author, affiliation, etc. really well, it doesn't seem to provide any layout information about the remaining text blocks. This information might be important for extracting further features, which means I might need to use both the ParseCit and XPdf pipelines in parallel. This isn't a problem (except maybe for a longer import process for each PDF), but it means that, if I actually need it, I'll have to implement the text block sorter rather soon, which will put the text blocks imported by XPdf back into the correct order.
Monday, 21 November 2016
Latest Progress
This last week, I added TxtXMLImporter, which supplies txt files to ParseCit and returns the result (can also file it out). This worked fine on the test data, as well as on the actual txt files I got from PDFs using PDFBox. I then wrapped the usage of PDFBox and ParseCit into a PDFXMLImporter, which, in a next step, should also let the user choose between the different pipelines (also using pdftoxml and pdftotxt from the original EggShell) that result in a PDF-to-XML conversion.
I realized that I had some duplicated code, so I factored that out into the common superclass (PaperImporter). Unfortunately, the "inheritance" situation I have doesn't really reflect a good inheritance relationship, so I will want to refactor that.
I also spent some time on going through what I have so far, adding some class comments. This is where I came across some code smells and design flaws (like the bad inheritance mentioned above), which I want to get rid of, probably during this week.
Next Steps
First of all, I want to do a little refactoring and make sure the design is as good as I can make it right now. Then I want to add message comments as well, something I had neglected a little bit so far. The same thing goes for tests. Now that my pipelines seem to work, I think I should write some unit tests for them.
As I mentioned above, PDFXMLImporter should also be able to use the other tools available from the original EggShell. This will be one of the next things I want to do.
Once I have assembled all the tools for converting PDFs to XML, I want to come back to what I began a couple of weeks ago: studying the imported XML, analyzing what each pipeline can give me, and what information I might be able to deduce from it. I especially want to focus on the differences between the entire old pipeline, and the new pipeline with either PDFBox or pdftotxt as a PDF-to-text converter. It's also possible to use multiple pipelines if it leads to a better result, so that's an approach I want to consider as well.
Since ParseCit can already retrieve a lot of information, I want to start working on taking that data out of the XML and modelling it. I suspect that this might be the easiest way to get a first extended data model rather soon.
Likely Challenges
As soon as I start analyzing the imported XML data, I suspect it might be a challenge to maintain the overview over all possible pipelines and to find out which ones are best suited for what, and to anticipate each one's drawbacks as early as possible. I then need to make a solid prediction about what further features I might be able to extract, in order to conduct the interviews as soon as possible.
Monday, 14 November 2016
Latest Progress
Last week, I addressed the problem of non-sequential XML import. I talked about it with Leonel and he pointed out that the imported XML offers position information for every text element. This information should be enough to distinguish between left- and right-hand column elements, which means I should be able to put them all back into the correct order.
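The reordering idea can be sketched in a few lines of Python, under strong assumptions that the source does not confirm: a strict two-column layout, an x coordinate giving each element's left edge, and a y coordinate that grows downwards. The `TextBlock` fields and the function name are hypothetical and do not mirror the actual XML attributes.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    page: int
    x: float  # left edge of the block (assumed)
    y: float  # top edge, growing downwards (assumed)
    text: str


def reading_order(blocks, page_width):
    """Sort blocks by page, then left column before right, then top to bottom."""
    middle = page_width / 2

    def key(block):
        column = 0 if block.x < middle else 1
        return (block.page, column, block.y)

    return sorted(blocks, key=key)


# Blocks in the jumbled order an extractor might deliver them.
jumbled = [
    TextBlock(1, 320.0, 50.0, "right, top"),
    TextBlock(1, 50.0, 200.0, "left, bottom"),
    TextBlock(1, 50.0, 50.0, "left, top"),
]
print([block.text for block in reading_order(jumbled, page_width=600.0)])
```

Elements that span both columns, such as titles or wide figures, would break the simple midpoint test and need separate handling.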
However, Leonel also mentioned that he had used a different XML importer for a similar purpose, called ParseCit. It takes raw text input and can even extract title, authors and citations, using a Conditional Random Field (CRF) approach. I downloaded ParseCit, installed the necessary components, and documented my installation process as well as possible, for future reference. I also ran the tool on the provided sample data, which worked fine.
Note that ParseCit doesn't parse PDFs itself, so I needed a second tool for that. I decided to give Apache PDFBox a try, and so far, it looks very good. The tool is merely a runnable jar file that offers a variety of command line utilities, well suited for my needs. I then built a pipeline to be able to use PDFBox from within Pharo, which now works fine as well. The pipeline can also re-output the imported raw text, for ParseCit to use it as input. It's worth mentioning that Dominik's modified version of XPdf also includes a text-based import, so I'll definitely want to try it with that as well. I will most likely change my text-import pipeline to enable changing between these two tools, so that I can try out both, and see which one works better.
Since I've now gathered a number of tools I need or might want to use later, I organized them all in a GitHub repository. Of course, that means I redistributed them, so I had to spend some time on checking all the licenses and reading about if and how they allow redistribution. Since everything I'm using so far is published either under the GNU GPL or the Apache license, and my repository is merely an aggregation of distinct programs, rather than a combination into a new one, this doesn't seem to be a problem.
Next Steps
Even though I tested ParseCit with the provided sample data and managed to establish the PDF-to-text pipeline, I didn't yet get to put the two together. This will be my top priority this week. Once that works, I want to modify the PDF-to-XML importing process to include all possible pipelines (i.e. all different tool chains that deliver an acceptable XML result), in a way that lets me easily choose the path I want to use.
Likely Challenges
As I mentioned, I've never tried ParseCit with actual data. Although there doesn't seem to be a reason why it shouldn't work, you can never be really sure. Since a good quality XML importer is vital to my project, it's very important to me to have at least one well-working import pipeline. Should ParseCit not work the way I'm hoping, I'll have to spend some time on re-ordering the import result of Dominik's pdftoxml.
Monday, 7 November 2016
Latest Progress
The first thing I did this week was reorganizing the parts of EggShell that are important for my project so far, putting them into different packages and adding these packages to my own repository. Now I have my own independent version of the project, called ExtendedEggShell, which contains all of what I've done with it until now. I then built a small utility that "intercepts" the importing and modelling process at the XML stage and exports the imported XML string into a new file, and ran this for all the example PDFs.
Once I was done with that, I spent some time thinking about how to assess the precision of the data extraction I will be implementing. I don't think it makes sense to build on EggShell's extensive work in that area, since that assessment will just be a necessity, rather than the focus of my work. I know this isn't the most important question right now, but it just happened to cross my mind.
Finally, I did some work towards comparing the PDFs to their XML representation and identifying interesting parts for extraction. I wanted to go through all papers and, in the XML document, categorize all the parts by adding tags to them. I wanted to tag all sections, all paragraphs within these sections, all figures and listings within these paragraphs, and so forth. To get started, I went through all the papers rather quickly, just to identify possible tags, then I organized these tags into something like a meta-model (or meta-meta-model?). While this tagging is a lot of work, I think it might be very helpful in the future, since it should enable me to programmatically identify all the parts of the sample papers. I can use that to build the reference model for testing and accuracy assessment, I can use it to extract parts to compare them while I design the heuristics, and it might even be a good training set for some sort of supervised learning algorithm, in case I, or a future student, want to attempt that. However, I stumbled across a problem I hadn't expected: as soon as a PDF page has figures, embedded text fields, and other elements like these, the extracted content isn't strictly linear anymore. It does follow a pattern, but it's not extracting first the complete left-hand, then the complete right-hand column. This matter will be discussed further in the "Problems" section below.
Problems
As I've mentioned above, I ran into problems with the extraction of a certain paper, namely Cara15a.pdf (although I'm most likely going to encounter it with other papers as well). The first page was extracted very well and I was able to tag it the way I wanted to. However, on the second page, the extraction wasn't linear anymore: it jumped between left- and right-hand column; usually, some sort of figure was involved. Here's an (approximate) sketch of the extraction sequence I observed:
I want to see if there is a good reason for this, and if I can still work with it. Maybe I need to have a look at the source code of the extractor tool, but this might be rather difficult, since my experience with C++ is very limited. I will also discuss the matter with Leonel and maybe, if necessary, ask Dominik about it.
Next Steps
First, I need to get this problem out of the way or find a way to deal with it. Otherwise, I can hardly continue my work. Once this is taken care of, I want to do the tagging, if this is still possible. As soon as I have one paper completely tagged, I want to make sure the parts can be imported into Pharo and modeled in at least some ad-hoc model. Once I have identified all parts in the XML files, I can try and find out which of them we might be able to detect programmatically. With that knowledge, I can conduct some interviews, especially asking different people about sample queries they might want to make and questions they may want to answer, to see which features really should be extracted.
Likely Challenges
Obviously, solving the problem mentioned above will be an important challenge. I'll either have to find a way to fix the extraction, or I'll have to settle for the parts that are still extractable. Should I be able to solve this problem, I still want to do the tagging, since I think it might be very helpful. However, this will be a lot of work, and I'd need to find out if it's actually feasible.
Monday, 31 October 2016
Latest progress
I am going to give a first presentation about my project tomorrow morning, so this was my main focus this week. I took a deeper look at what problem we want to solve, how we want to do it, and what we are building on. My presentation will mostly talk about these points.
I still made some minor progress on the actual project. During the meeting with Oscar and Leonel last Tuesday, I got a better idea about how we might tackle the query language. Instead of creating a fully domain specific language right away or trying to build something very general (almost SQL-like), we should start out with a more OCL-like approach. That is, we define the meta-model in a UML diagram and then express queries by chasing through that graph and collecting the information we are interested in. This basically reduces the "DSL" down to a UML model and some accessor methods, however, at the price of sometimes rather complicated query expressions. We can take care of that in a later step, for example by providing shortcut methods for more tedious and frequently used queries.
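To make the navigation idea concrete, here is a small Python sketch (the project itself is in Pharo). The model is limited to what the current data model is said to contain, namely file names, paper titles and author names; the class layout, the `link` helper and the `co_authors` shortcut are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Author:
    name: str
    papers: list = field(default_factory=list)


@dataclass
class Paper:
    file_name: str
    title: str
    authors: list = field(default_factory=list)


def link(paper, author):
    """Connect a paper and an author in both directions."""
    paper.authors.append(author)
    author.papers.append(paper)


def co_authors(author):
    """Shortcut query: author -> papers -> authors, minus the author."""
    return sorted({other.name
                   for paper in author.papers
                   for other in paper.authors
                   if other is not author})


# Hypothetical sample data.
alice, bob, carol = Author("Alice"), Author("Bob"), Author("Carol")
first = Paper("first.pdf", "First Paper")
second = Paper("second.pdf", "Second Paper")
link(first, alice)
link(first, bob)
link(second, alice)
link(second, carol)
print(co_authors(alice))
```

The query is nothing more than a walk along the associations of the model, which is why a well-chosen shortcut such as `co_authors` can hide a fairly long navigation expression.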
Playing around with these ideas and formulating some possible queries (some of which already work on the current model) gave me a good idea about what the DSL might look like in the end, which is very helpful. However, since our current data model only contains file names, paper titles, and author names, this is as far as I can go with that part right now. Before I can continue working on that, we first have to spend some time on extending the data model.
Next steps
As I have mentioned above, our number one priority right now will be extending the data model. Additional features we might consider extracting include author affiliation, publishing venue, paragraphs, listings, figures, references, keyword lists, etc. First of all, I assume we should create a prioritized list of these features. Then I want to go through that list item by item, identify them on all sample papers and from that, try to define heuristic rules for automatic recognition. In order to improve them, I suppose these heuristics should then be assessed in terms of accuracy in a next step, but I think EggShell should offer good tools for that already, which just need to be adapted.
Updated project outlook
In a first step, we want our data model to contain more parts of the papers. Once it does so, we need to make sure it can be nicely queried. While we don't necessarily need more than some accessor methods for each model entity, it might be useful to provide certain shortcuts for more tedious and frequently used queries.
In a final step, we can tackle the actual visualizations, which should provide various views onto the data at hand. Our goal is that they encourage users to explore them, that is, to explore the data. Through these visualizations and through exploring them, users may be able to answer questions like:
- Which authors/universities/enterprises/etc work on which topics? Which venues do they publish at?
- How do groups of co-authors evolve over time? Do some of them combine to larger groups? Do some groups break apart?
- How does technology usage evolve over time, with respect to certain communities?
Likely challenges
Once we have identified the most interesting features of the papers, we have to extract them. I intend to use heuristic methods for that, the way EggShell does that for extracting title and contributors. However, not all of these parts, if any, might be easy to extract, especially at a decent precision. So, getting good data extraction for all important features will most likely be the biggest challenge during the next couple of weeks.
Monday, 24 October 2016
Latest Progress
Things have been going rather slowly the last two weeks. I'm still trying to figure out the best way to store the data and query that model. My approach was a more SQL-like one: have a selector method which can take a list of arbitrary attribute names and return only these attributes, provide means for conditional selection (similar to WHERE in SQL), and maybe in a further iteration even implement something like joins. However, there's one problem with that approach: it's very complicated and most likely absolute overkill. Sure, the data model will change over time and the DSL will have to adapt to such changes, but it's still not going to be a completely generalized database, capable of storing any arbitrary type of data. I'll need a simpler meta-model.
Last week, Oscar suggested going for an OCL-like approach: expressing the meta-model as a UML diagram and expressing queries as navigations that chase through the meta-model graph. I think if the initial UML model is well designed, it'll be open for expansion of the data model without breaking anything, which is pretty much what I need. I've been working on some drafts for a UML diagram and on a corresponding implementation, but I'm not yet sure if I'm going in the right direction. So far, I don't yet fully understand how to express queries on such a model. I'm meeting with Oscar tomorrow, to discuss his suggestion, so I hope to be making some more progress this week.
Next Steps
I'm meeting with Oscar tomorrow, to discuss the OCL approach and will pursue that during the next weeks. Also, I'm giving a first presentation of my project (what have I done so far, where is it going) on Tuesday, November 1st, so I'm also going to be working on that.
Likely Challenges
I spent quite a lot of time on my first approach without really achieving something useful. Although I've learned some things that might be helpful later, I really want to get ahead with my project. It's important for me to find out if the OCL approach is the way to go, and, if yes, really think it through and implement it well.
Tuesday, 11 October 2016
Latest Progress
I read further on in Pharo By Example, and I guess I do have a basic understanding of the language and environment by now. I'll definitely read it all the way through, but I can most likely do that alongside my work.
What's more important is that EggShell is now up and running on my machine. I decided to work on a Mac (currently running OS X El Capitan), since that's what the software has mainly been developed on and for. It didn't work just out of the box, but more about that below. I'll most likely try and set it up on Linux as well, but as for my main development environment, I'll stick to Mac.
Once I had EggShell running, I went through the first usage example Dominik gives in his thesis. It all worked very well and helped me get a basic understanding of how the tool works. I then spent some time on analyzing exactly how the recovered data is modeled. I'm now working on a first alteration of that model, to fit what I had in mind, for querying. Whether or not that will go well or if I'll eventually go back to the original model, remains to be seen.
Solved Problems
As I mentioned above, I first had some trouble with EggShell. When I ran the example in Dominik's thesis, importPdfAsXmlString: would always throw an error. Some debugging showed it was because the imported string came back empty. The PDF transformation is done by a PipableOSProcess, which is passed a Terminal command, so I extracted that command and ran it directly in a Terminal. Here's what that looked like:
Since I'm fairly (or actually completely) new to Mac OS X, I didn't know if libpng was something that should already be installed on my machine, if it was something I should have installed myself, and, in any case, how to get it to work now. So I contacted Dominik and showed him the screenshot above. Luckily, he found the problem pretty quickly: apparently, libpng is a Unix package which used to be pre-installed on older versions of OS X. At least with El Capitan, that's no longer the case. Dominik suggested I install Homebrew, which is a very handy package manager for OS X, and use it to install libpng. So that's what I did, and indeed, it fixed the problem. The libpng package was the only thing missing, and after installing it, EggShell worked without any further problems.