Monday, 12 December 2016

Working on the model

Latest Progress

I've been rather busy this week, so I don't have much to report this time. I've been working on creating a model and loading the data into it. This is going okay: I can now load a ScientificCommunity model for single papers. While implementing the model loader, I had to revisit some of last week's work and change and extend parts of the XML data extractor, which also took some time.

Next Steps

I'll continue working on the model, especially on adding the reference relation (citations). Once I have a good model of the scientific community, I can come up with some queries and start thinking about visualizations (that is, start getting comfortable with Roassal). I also haven't yet thought about which additional features, if any, I'll want to extract. If I do want to extract more features, I'll have to implement the xpdf-XML-sequencer first, then define the features to be extracted, develop heuristics to extract them, test their accuracy, and get the data into the model. However, additional features are not my main focus right now. I think it's better to have a working, queriable model and some sort of visualization first.

Likely Challenges

I think I've mentioned this before, but affiliation names are sometimes ambiguous: affiliations we would consider to be the same do not always have exactly the same name. Also, some characters don't seem to be handled very well. This might become a problem when looking up a paper by its title (e.g. to establish a reference relation). I'll have to give these problems some thought as soon as I have time.
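
One possible way to approach both problems is to normalize strings before comparing them and to accept near matches instead of exact ones. The Python sketch below is only an illustration under my own assumptions, not code from the project: the function names, the normalization rules, and the 0.9 threshold are all hypothetical and would need tuning against real data.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    # Decompose accented characters so the accent marks can be dropped.
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(
        c for c in decomposed if not unicodedata.combining(c)
    )
    # Replace everything that is not a letter or digit with a space.
    cleaned = "".join(c if c.isalnum() else " " for c in without_marks.lower())
    return " ".join(cleaned.split())


def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(query, candidates, threshold=0.9):
    """Return the candidate closest to query, or None if none is close enough.

    The threshold is an assumed value, not one derived from the data.
    """
    scored = [(similarity(query, c), c) for c in candidates]
    if not scored:
        return None
    score, candidate = max(scored, key=lambda pair: pair[0])
    return candidate if score >= threshold else None
```

The same `best_match` helper could be tried on affiliation names as well as paper titles, although affiliations that differ by more than spelling (abbreviations, department names) would likely need extra rules.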

1 comment:

  1. I'm glad that your work is going well. I completely agree with the statement: "I think it's better to have a working, queriable model and some sort of visualization first." Yes!
    You could write a "how to" with the steps to install and use your implementation, so I could give you some feedback. For the moment, having citations is enough. You will probably need to define a filter for false positives (maybe using the confidence attribute of ParsCit). So do not spend much time looking for other types of data to query. Instead, we could have sections and paragraphs (or maybe just a single field), so that when we search we can match text that is not exactly the same but close (to tolerate errors made during data extraction). Best regards, Leonel.