DBpedia

December 10, 2008

I posted something about DBpedia a year ago, but never really looked at what they offer.  You can do some pretty interesting things with their query interface to the Wikipedia infoboxes.  Any time you see structured data in an infobox on Wikipedia, you have a chance to grab it as TSV, JSON or XML.  Of course, it’s hit or miss whether Wikipedia has the numbers formatted nicely.  I tried looking up casualties for military conflicts, and the values ranged from “1000 men” and “12 galleys” to “heavy” and “20-30,000”.  That would be a good use case for Mechanical Turk.  Here’s a 5-minute example:

The goal: graph US cities by the population within the city against the population of the greater metropolitan area.  I used the online query explorer with this query:

SELECT ?subject ?poptotal ?popmetro WHERE {
  ?subject rdf:type <http://sw.opencyc.org/2008/06/10/concept/Mx4rvViEwJwpEbGdrcN5Y29ycA>.
  ?subject <http://dbpedia.org/property/populationTotal> ?poptotal.
  ?subject <http://dbpedia.org/property/populationMetro> ?popmetro.
}
ORDER BY DESC(xsd:integer(?poptotal))
LIMIT 1000

This means: give me the name, total population and metro population of US cities.  The opencyc.org URI is the class for US cities; it was the most convenient way to grab them.  You can figure out how to structure your query by looking at an example page, like http://dbpedia.org/page/New_York_City.  A DBpedia page has the same URI ending as the corresponding Wikipedia page.  This query returns:

subject    poptotal    popmetro
:New_York_City [http]    8274527    19750000
:Los_Angeles%2C_California [http]    3849378    17755322
:Chicago [http]    2836658    9785747
:Houston%2C_Texas [http]    2208180    5628101
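
To run the same query from a script instead of the online explorer, the request can be built as a plain URL.  This is only a sketch under assumptions the post does not state: that the endpoint lives at http://dbpedia.org/sparql, that it accepts `query` and `format` parameters, and that it predefines the rdf: and xsd: prefixes the way the explorer evidently does.  The function name build_query_url is made up for illustration.

```python
from urllib.parse import urlencode

# Assumption: DBpedia's public SPARQL endpoint (not named in the post).
ENDPOINT = "http://dbpedia.org/sparql"

# The query from the post, copied as-is. It relies on the endpoint
# predefining the rdf: and xsd: prefixes (an assumption).
QUERY = """
SELECT ?subject ?poptotal ?popmetro WHERE {
  ?subject rdf:type <http://sw.opencyc.org/2008/06/10/concept/Mx4rvViEwJwpEbGdrcN5Y29ycA>.
  ?subject <http://dbpedia.org/property/populationTotal> ?poptotal.
  ?subject <http://dbpedia.org/property/populationMetro> ?popmetro.
}
ORDER BY DESC(xsd:integer(?poptotal))
LIMIT 1000
"""


def build_query_url(query, result_format="text/tab-separated-values"):
    """Return a GET URL asking the endpoint to run `query`.

    `format` is assumed to be the parameter that selects the output type;
    swap in a JSON or XML media type to get those instead of TSV.
    """
    params = {"query": query.strip(), "format": result_format}
    return ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    # Prints the URL only; fetching it is left to a browser or HTTP client.
    print(build_query_url(QUERY))
```

Pasting the printed URL into a browser should return the same rows as the explorer, provided the assumptions above hold.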

Copy and paste the results into a text editor to clean up the city names and the few population numbers that contain commas, etc.  Then four lines of R:
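
The cleanup could also be scripted rather than done by hand.  The following is a minimal sketch, not what I actually did: it assumes each pasted line looks like the sample rows above (a name starting with a colon, the explorer’s “[http]” link label, then two numbers), and the helper name clean_row is hypothetical.  Values that are not plain numbers would raise an error and still need fixing by hand.

```python
from urllib.parse import unquote


def clean_row(line):
    """Turn one pasted result line into (city, poptotal, popmetro)."""
    # Drop the explorer's "[http]" link label, then split on whitespace.
    # City names use underscores, so they never contain spaces themselves.
    name, poptotal, popmetro = line.replace("[http]", " ").split()
    # ":Los_Angeles%2C_California" -> "Los Angeles, California"
    city = unquote(name.lstrip(":")).replace("_", " ")
    # Strip thousands separators such as "3,849,378" before converting.
    return city, int(poptotal.replace(",", "")), int(popmetro.replace(",", ""))


sample = """\
:New_York_City [http]    8274527    19750000
:Los_Angeles%2C_California [http]    3849378    17755322
:Chicago [http]    2836658    9785747
:Houston%2C_Texas [http]    2208180    5628101
"""

rows = [clean_row(line) for line in sample.splitlines() if line.strip()]

if __name__ == "__main__":
    # Tab-separated output with no header row.
    for city, total, metro in rows:
        print(f"{city}\t{total}\t{metro}")
```

Saving the printed output to a file gives a tab-separated table with no header line.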

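> # lplot() draws a scatter plot with a text label on each point
> library(Rlab)
> # "cities" is the cleaned-up file: name, total population, metro population
> cities <- read.delim("cities", header=FALSE)
> lplot(log(cities[,2]), log(cities[,3]), labels=cities[,1], xlim=c(9,17), ylim=c(11,17), xlab="Population total (log)", ylab="Population Metro (log)", tcex=.5)
> # Add the least-squares line of log(metro) on log(total)
> abline(lm(log(cities[,3])~log(cities[,2])))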

Here’s the plot.  Cities above the line have a bigger metro population than their city population would predict, so they are more sprawl-ish.

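[Figure: citiespop.  Labeled scatter plot of population total (log) against population metro (log), with the fitted line.]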
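
“Above the line” can be made concrete as a positive residual from the log-log fit.  Here is a rough sketch using only the four rows shown in the query output, so the numbers it prints say little about the full data set; fit_line is an illustrative helper, and the plot itself was made with R’s lm, not with this code.

```python
import math

# The four rows shown in the query output: (city, total, metro).
cities = [
    ("New York City", 8274527, 19750000),
    ("Los Angeles, California", 3849378, 17755322),
    ("Chicago", 2836658, 9785747),
    ("Houston, Texas", 2208180, 5628101),
]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


xs = [math.log(total) for _, total, _ in cities]
ys = [math.log(metro) for _, _, metro in cities]
intercept, slope = fit_line(xs, ys)

# Positive residual: metro population is larger than the line predicts,
# i.e. the city sits above the line in the plot.
residuals = {
    name: y - (intercept + slope * x)
    for (name, _, _), x, y in zip(cities, xs, ys)
}

if __name__ == "__main__":
    for name, r in sorted(residuals.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {r:+.3f}")
```

Run on the full cleaned table instead of four rows, the cities at the top of the sorted list would be the ones farthest above the line.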

I’m thinking of other things to do with this.  Wikipedians are dutiful about infoboxing particular kinds of data: the casualties of ancient sea battles, which characters appeared in which episodes of a show, and the measurements of every female adult model.

One Response to “DBpedia”

  1. dave love Says:

    It would be interesting to color-code cities by whether or not they have anti-sprawl legislation. Can you pull cities from other countries? It would be a nice comparison.

