I've been rather busy this week, so I don't have too much to report this time. I've been working on creating a model and loading the data into it. This is going okay, I can now load a ScientificCommunity model for single papers. In the process of implementing the model loader, I had to go over some of last week's work and change and extend some parts of the XML data extractor, which also took some time to do.
I'll continue working on the model, especially on adding the reference-relation (citations). Once I have a good model of the scientific community, I can come up with some queries and start thinking about visualizations (that is, start to get comfortable with Roassal). Also, I haven't yet thought about what additional features, if any, I'll want to try and extract. If I do want to extract more features, I'll have to implement the xpdf-XML-sequencer first, then define the features to be extracted, develop heuristics to extract them, test the accuracy, and get the data into the model. However, additional features will not be my main focus right now. I think it's better to have a working, queriable model and some sort of visualization first.
I think I've mentioned that before, but sometimes, affiliation names are ambiguous, that is, affiliations we would consider to be the same might not always have the exact same name. Also, some characters don't seem to be handled very well. This might become a problem when looking for a paper title (e.g. to establish a reference relation). I'll have to give these problems some thought as soon as I have time.