A couple of months ago, Dominik Seliner finished his bachelor's thesis on creating a workbench for modelling scientific communities at the Software Composition Group (SCG) at the University of Bern. The result of his research was a tool called EggShell, which extracts data such as titles and contributors from the proceedings of events like conventions, published in PDF format, and provides a network visualization of that data. The different parts of the PDFs are identified using heuristics rather than machine learning algorithms. Dominik's thesis focuses heavily on comparing how accurately two different extraction approaches detect these parts. The visualization is built with the Roassal visualization engine, which is implemented in Pharo. Most of EggShell is written in Moose, a sophisticated Pharo image. A modified build of Xpdf is used to transform the PDFs into text or XML files.
As a first step, my project will pick up where Dominik's work left off. I'll attempt to equip EggShell with a domain-specific language (DSL) for querying the data model, so that questions about scientific communities can be answered visually in a later step. Another goal is to extend the data model and the structure recovery so that more parts of the PDFs can be extracted. Whether I'll stick with EggShell permanently, re-implement certain parts of it, or develop an entirely new tool remains to be seen.
The exact project specifications aren't set in stone and are likely to change over the course of my work. Whenever something changes or more details are established, I'll give an updated project outlook in my next post.