Monday, 31 October 2016

Playing with queries and preparing the presentation

Latest progress

I am going to give a first presentation about my project tomorrow morning, so this was my main focus this week. I took a deeper look at what problem we want to solve, how we want to do it, and what we are building on. My presentation will mostly talk about these points.

I still made some minor progress on the actual project. During the meeting with Oscar and Leonel last Tuesday, I got a better idea of how we might tackle the query language. Instead of creating a fully domain-specific language right away or trying to build something very general (almost SQL-like), we should start out with a more OCL-like approach. That is, we define the meta-model in a UML diagram and then express queries by chasing through that graph and collecting the information we are interested in. This basically reduces the "DSL" to a UML model and some accessor methods, at the price of sometimes rather complicated query expressions. We can take care of that in a later step, for example by providing shortcut methods for tedious and frequently used queries.
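To make the idea concrete, here is a minimal sketch (in Python, purely for illustration; the class and accessor names are my invention, not the actual model) of what "chasing through the graph and collecting" looks like: the meta-model is just an object graph, and a query is a traversal through accessors plus a collection step.

```python
# Hypothetical sketch of the OCL-like approach: the meta-model is a small
# object graph, and a query chases accessors through it.

class Author:
    def __init__(self, name):
        self.name = name
        self.papers = []      # back-references to Paper objects

class Paper:
    def __init__(self, file_name, title, authors):
        self.file_name = file_name
        self.title = title
        self.authors = authors
        for a in authors:     # wire up the back-references
            a.papers.append(self)

# A query is a traversal plus a collection step, e.g.
# "titles of all papers a given author contributed to":
def titles_of(author):
    return sorted({p.title for p in author.papers})

alice = Author("Alice")
bob = Author("Bob")
Paper("p1.pdf", "On DSLs", [alice, bob])
Paper("p2.pdf", "On Queries", [alice])
# titles_of(alice) -> ['On DSLs', 'On Queries']
```

The price mentioned above shows up as soon as a question spans several hops (author to papers to co-authors to their papers, say): the traversal has to be spelled out in full each time.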

Playing around with these ideas and formulating some possible queries (some of which already work on the current model) gave me a good idea of what the DSL might look like in the end, which is very helpful. However, since our current data model only contains file names, paper titles, and author names, this is as far as I can go with that part right now. Before I can continue working on it, we first have to spend some time on extending the data model.

Next steps

As I have mentioned above, our number one priority right now will be extending the data model. Additional features we might consider extracting include author affiliations, publication venues, paragraphs, listings, figures, references, keyword lists, etc. As a first step, I think we should create a prioritized list of these features. Then I want to go through that list item by item, identify each feature in all sample papers, and from that try to define heuristic rules for automatic recognition. In a next step, these heuristics should be assessed in terms of accuracy so that they can be improved; I think EggShell already offers good tools for that, which would just need to be adapted.
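As an illustration of what such a heuristic rule and its assessment could look like, here is a small sketch (the rule, the regular expression, and the labelled sample are all invented for the example; the real rules would be derived from the sample papers):

```python
# Hypothetical example of one heuristic rule and a simple accuracy check.
# Rule: a line introduces a keyword list if it starts with "Keywords",
# allowing for case variations and ':' or '-' as separator.

import re

KEYWORD_RE = re.compile(r"^\s*keywords?\s*[:\-]", re.IGNORECASE)

def is_keyword_line(line):
    return bool(KEYWORD_RE.match(line))

def accuracy(heuristic, labelled_lines):
    """labelled_lines: list of (line, expected_bool) pairs
    hand-labelled on the sample papers."""
    hits = sum(1 for line, expected in labelled_lines
               if heuristic(line) == expected)
    return hits / len(labelled_lines)

sample = [
    ("Keywords: parsing, heuristics", True),
    ("KEYWORDS - software visualization", True),
    ("The keywords are discussed in Section 2.", False),
    ("1 Introduction", False),
]
# accuracy(is_keyword_line, sample) -> 1.0
```

Assessing every rule against such hand-labelled samples would give exactly the kind of feedback loop needed to refine the heuristics step by step.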

Updated project outlook 

In a first step, we want our data model to contain more parts of the papers. Once it does so, we need to make sure it can be nicely queried. While we don't necessarily need more than some accessor methods for each model entity, it might be useful to provide certain shortcuts for more tedious and frequently used queries.
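The shortcut idea can be sketched as follows (again in Python for illustration only; the data layout and names are assumptions): the frequent traversal gets a name once and is reused, instead of being spelled out at every call site.

```python
# Hypothetical contrast between a raw query and a shortcut for it.

papers = [
    {'title': 'On DSLs', 'authors': ['Alice', 'Bob'], 'venue': 'ICSE'},
    {'title': 'On Queries', 'authors': ['Alice'], 'venue': 'MODELS'},
]

# Raw form: the whole traversal is spelled out every time.
venues_raw = sorted({p['venue'] for p in papers if 'Alice' in p['authors']})

# Shortcut: name the frequent traversal once, reuse it everywhere.
def venues_of(papers, author):
    return sorted({p['venue'] for p in papers if author in p['authors']})

# venues_of(papers, 'Alice') == venues_raw -> ['ICSE', 'MODELS']
```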

In a final step, we can tackle the actual visualizations, which should provide various views onto the data at hand. Our goal is that they encourage users to explore them, that is, to explore the data. Through these visualizations and through exploring them, users may be able to answer questions like:
  • Which authors/universities/enterprises/etc. work on which topics? Which venues do they publish at?
  • How do groups of co-authors evolve over time? Do some of them combine to larger groups? Do some groups break apart?
  • How does technology usage evolve over time, with respect to certain communities?
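For the co-author question in particular, the underlying data could be derived with a very small traversal once the model records a publication year. A minimal sketch, assuming a per-paper `year` field (which the current model does not yet have):

```python
# Hypothetical sketch: collect the co-author groups active in each year,
# as raw material for the "how do groups evolve" question.

from collections import defaultdict

papers = [
    {'year': 2014, 'authors': ['Alice', 'Bob']},
    {'year': 2015, 'authors': ['Alice', 'Bob', 'Carol']},
    {'year': 2016, 'authors': ['Carol', 'Dave']},
]

def groups_by_year(papers):
    groups = defaultdict(set)
    for p in papers:
        # a frozenset per paper = one co-author group, duplicates collapse
        groups[p['year']].add(frozenset(p['authors']))
    return dict(groups)

# groups_by_year(papers)[2015] -> {frozenset({'Alice', 'Bob', 'Carol'})}
```

Comparing the groups of consecutive years (by intersection size, say) would then show groups merging or breaking apart; the visualizations would sit on top of exactly this kind of query result.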

Likely challenges

Once we have identified the most interesting features of the papers, we have to extract them. I intend to use heuristic methods for that, the way EggShell already does for extracting titles and contributors. However, not all of these parts may be easy to extract, especially at decent precision. So, getting good data extraction for all important features will most likely be the biggest challenge over the next couple of weeks.
