Scraping Wikipedia using search results

I wanted to see how easily I could develop a method for scraping unstructured data from Wikipedia using only search results. Freebase was low on Short Non-Fictions, so I went looking for essays and their authors. I searched for “an essay by”, restricted to wikipedia.org, which gave me about 800 results like:

To the Person Sitting in Darkness – Wikipedia, the free encyclopedia
“To the Person Sitting in Darkness” is an essay by Mark Twain published in 1901. It is a satire of the Philippine-American War expressing Twain’s …
en.wikipedia.org/wiki/To_the_Person_Sitting_in_Darkness – 17k

I could have used the Yahoo Web Search API to get this as XML with <title> and <summary> elements, but instead I used a little grep to get it into a decent tab-separated format. For simplicity, I was only interested in identifying the author when the Wikipedia article was about a particular essay. By comparing the page title with the text just before “an essay by”, I eliminated a lot of cases like this:
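The title comparison can be sketched in a few lines of Python. This is only an illustration, not the grep I actually ran: the function names, the punctuation-stripping normalization, and the assumption that each result arrives as a (page title, summary) pair are all mine.

```python
import re

MARKER = "an essay by"

def normalize(text):
    """Lowercase, replace punctuation (including curly quotes) with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

def article_title(page_title):
    """Drop the ' – Wikipedia, the free encyclopedia' style suffix from a result title."""
    return re.split(r"\s+[–-]\s+(?:Simple English )?Wikipedia", page_title)[0].strip()

def is_essay_article(page_title, summary):
    """True when the article title appears in the text just before the marker."""
    idx = summary.find(MARKER)
    if idx == -1:
        return False
    title = normalize(article_title(page_title))
    before = normalize(summary[:idx])
    return bool(title) and title in before
```

With the two results shown above, the Mark Twain page passes (its title appears right before “an essay by”), while the Jorge Luis Borges page is rejected because the essay mentioned in its summary is not the subject of the article.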

Jorge Luis Borges – Simple English Wikipedia, the free encyclopedia
Hallucinating Spaces, or the Aleph An essay from Borgesland by Susana … Borges’ Bad Politics Slate.com presents an essay by Clive James arguing that …
simple.wikipedia.org/wiki/Jorge_Luis_Borges – 34k

At this point, I had to identify the cases with descriptive words before the author name, as in “A Nice Cup of Tea is an essay by British writer George Orwell”. Since these cases were rare, I put them aside to edit by hand. I ended up with about 130 clean rows of essay name, tab, author name.
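One way to separate the clean rows from the ones needing hand editing is sketched below. It is a rough heuristic under my own assumptions: the descriptor word list is hypothetical and incomplete, and `extract_author` is an illustrative name rather than anything from the original scripts.

```python
MARKER = "an essay by"

# Hypothetical, incomplete list of words that signal a description before the name.
DESCRIPTORS = {"writer", "author", "philosopher", "poet", "novelist", "critic", "essayist"}

def extract_author(summary):
    """Return (author, needs_review) for a search-result summary.

    The author is the run of capitalized words right after the marker.
    Rows that look like "British writer George Orwell" are flagged for
    manual review instead of being guessed at.
    """
    idx = summary.find(MARKER)
    if idx == -1:
        return None, True
    tokens = summary[idx + len(MARKER):].split()
    head = {t.strip(".,;:").lower() for t in tokens[:4]}
    if not tokens or not tokens[0][0].isupper() or DESCRIPTORS & head:
        return None, True
    name = []
    for token in tokens:
        word = token.strip(".,;:")
        if not word or not word[0].isupper():
            break
        name.append(word)
        if word != token:  # trailing punctuation ends the name
            break
    return " ".join(name), False
```

Names containing lowercase particles (“de”, “van”) would be cut short by this heuristic, which is one more reason to keep a manual-review pile.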

There are a few options for adding batches of data to Freebase. The most GUI-oriented is the Freebase tool for importing lists of things. The advantage of Freebase having a tool like this is that it helps people reconcile their list with the untyped topics (those lacking structured information) already imported from Wikipedia. Linking the essays with their authors requires a little more effort. It’s possible to use the Query Editor to issue write commands, so I just pasted a ton of write commands into it (Freebase asks you to experiment with write commands on sandbox.freebase.com first). Each row from the tab-separated text becomes one clause like the one below, and the clauses go between the square brackets, separated by commas.

{
  "query" : [
    {
      "author" : {
        "connect" : "insert",
        "id" : "/en/jorge_luis_borges"
      },
      "name" : "a new refutation of time",
      "type" : "/book/written_work"
    }
  ]
}
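Turning the tab-separated rows into that structure is mechanical, so it can be scripted. The sketch below is a guess at how one might do it, not what I ran: `build_write_query` and `guess_author_id` are made-up names, and deriving the author id by lowercasing the name and joining it with underscores is an assumption that only holds when the author’s Freebase key follows that pattern.

```python
import json

def guess_author_id(name):
    # ASSUMPTION: the author's key looks like /en/jorge_luis_borges.
    # Real ids should be checked against Freebase before writing.
    return "/en/" + "_".join(name.lower().split())

def build_write_query(tsv_text):
    """Build one write clause per 'essay<TAB>author' row."""
    clauses = []
    for line in tsv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        essay, author = line.split("\t", 1)
        clauses.append({
            "author": {"connect": "insert", "id": guess_author_id(author)},
            "name": essay.strip(),
            "type": "/book/written_work",
        })
    return {"query": clauses}

if __name__ == "__main__":
    rows = "a new refutation of time\tJorge Luis Borges\n"
    print(json.dumps(build_write_query(rows), indent=2))
```

The printed JSON can then be pasted into the Query Editor, ideally on the sandbox first.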


1 thought on “Scraping Wikipedia using search results”

  1. nice post! A couple of notes:

    If a subject has a Wikipedia article (not just a placeholder but an actual article), it’s safe to assume that there is a corresponding “topic” in Freebase (the Metaweb data team regularly “syncs” up to Wikipedia). So, as you mentioned, you should try not to duplicate that topic. Furthermore, the query you posted is perfectly fine, but if, by chance, there were another written_work with that name, you would get an error response; basically MQL would be telling you “there are two of these things, so your query is ambiguous”. The surefire way to make sure you’re linking to the precise thing you want is to use its id. In this case:

    "id": "/guid/9202a8c04000641f8000000004641c62"

    Like I said, if you’re pretty confident the name is unique (especially if you’ve indicated what type of thing it is) then your query is probably fine.
