Using Sindice to get the best URI for a person

19/02/2008, 19h01

One of the open issues we still had in SWAML was how to find the best document with more information about the subscribers of a mailing list. In most cases we only have a subscriber's name and email address, so we needed to use an InverseFunctionalProperty, such as foaf:mbox_sha1sum, to find more information. That was one of the main reasons why Iván and I started Futil, but unfortunately we never finished that project… at least we discovered a lot of people, and it was a very fun hacking experience.
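As a minimal sketch of how such a lookup key can be derived: FOAF defines foaf:mbox_sha1sum as the SHA1 hex digest of the mailbox URI, including the "mailto:" prefix. The function name mbox_sha1sum and the address alice@example.org are made up for illustration, and stripping the surrounding whitespace is my own assumption, not something the vocabulary requires.

```python
import hashlib


def mbox_sha1sum(email: str) -> str:
    """Return the foaf:mbox_sha1sum value for an email address."""
    # Assumption: surrounding whitespace is noise and can be dropped.
    address = email.strip()
    # The digest is computed over the mailbox URI, so the "mailto:"
    # scheme has to be part of the hashed string.
    if not address.startswith("mailto:"):
        address = "mailto:" + address
    return hashlib.sha1(address.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Hypothetical address, used only as an example.
    print(mbox_sha1sum("alice@example.org"))
```

The resulting 40-character hex string is what we can look up instead of the plain address, so the email itself does not need to be published.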

These days I had the opportunity to meet part of the Sindice project team. Sindice is a lookup index for the Semantic Web: it indexes Semantic Web documents and can tell you which sources mention a given resource URI, IFP or keyword. For more information, read their paper.

I decided to use it (I see that I’m not alone), and I’ve written a Python client for its API to test it. The service works well (see a query over the sha of my email), but the order of the results is not the best in many cases (for example, my URI is the last one in that query, although all the results are good). I talked with the developers (thanks Richard!), and currently the project only uses retrieval techniques to assign a score to the results.

So we get some good documents containing more information about our subscribers, but we still need to choose the best one. I think we can apply a simple SPARQL query to each document and take the first result where our person is the foaf:primaryTopic:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?person
WHERE {
  <http://.../file.rdf> foaf:primaryTopic ?person .
  ?person rdf:type foaf:Person .
  ?person foaf:mbox_sha1sum "..."
}

It may not be very efficient, I know, but it works. And now it’s implemented in SWAML :-)
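The selection step can be sketched in plain Python. This is not the code in SWAML: it assumes each candidate document has already been fetched and parsed into (subject, predicate, object) tuples, and it checks the same three triple patterns as the query by hand instead of running a SPARQL engine. The names primary_topic_person and best_uri, the example.org URIs and the "abc123" hash are all hypothetical.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF = "http://xmlns.com/foaf/0.1/"


def primary_topic_person(doc_uri, triples, sha1sum):
    """Return the person URI if the document's foaf:primaryTopic is a
    foaf:Person with the given foaf:mbox_sha1sum, otherwise None."""
    graph = set(triples)
    for subject, predicate, obj in triples:
        if subject == doc_uri and predicate == FOAF + "primaryTopic":
            is_person = (obj, RDF_TYPE, FOAF + "Person") in graph
            has_sha = (obj, FOAF + "mbox_sha1sum", sha1sum) in graph
            if is_person and has_sha:
                return obj
    return None


def best_uri(candidates, sha1sum):
    """Walk the candidates in the order the index returned them and
    return the first person URI that is the primary topic of its document."""
    for doc_uri, triples in candidates:
        person = primary_topic_person(doc_uri, triples, sha1sum)
        if person is not None:
            return person
    return None


if __name__ == "__main__":
    sha = "abc123"  # placeholder, not a real SHA1 digest
    group = "http://example.org/group.rdf"
    own = "http://example.org/bob/foaf.rdf"
    candidates = [
        # Mentions Bob, but the document is about somebody else.
        (group, [
            (group, FOAF + "primaryTopic", group + "#alice"),
            (group + "#bob", RDF_TYPE, FOAF + "Person"),
            (group + "#bob", FOAF + "mbox_sha1sum", sha),
        ]),
        # Bob's own FOAF file: he is its primary topic.
        (own, [
            (own, FOAF + "primaryTopic", own + "#me"),
            (own + "#me", RDF_TYPE, FOAF + "Person"),
            (own + "#me", FOAF + "mbox_sha1sum", sha),
        ]),
    ]
    print(best_uri(candidates, sha))
```

In this made-up data the first document only mentions the subscriber, so it is skipped, and the URI from the second document is the one returned.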