Archive for the 'structured data' Category

interactiveSVG

March 17, 2009

interactiveSVG is an R package I made for adding ECMAScript interactions to SVG plots.  Flash is the main choice for data visualizations on webpages, but I like the idea of open formats and I think it’s great that you can read the source code of an SVG file.

The package uses the RSVGTipsDevice package that I posted here earlier, and does some scraping and rearranging of the SVG file.  ECMAScript onclick events do the work of rearranging SVG elements or updating a text box with a value.  I tried animations like easing at first, but noticed that Firefox, Safari and Opera were all pretty slow at rendering these.
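A rough sketch of the general idea, in base R only.  This is not the package's actual code: the SVG input, the values and the toggleBar handler are all made up for illustration.  It adds an onclick attribute to each bar and appends a small script plus a text element that shows the running total of the selected bars.

```r
# Hypothetical SVG with three bars (not output from RSVGTipsDevice).
svg_in <- c(
  '<svg xmlns="http://www.w3.org/2000/svg" width="220" height="120">',
  '<rect x="10" y="70" width="40" height="40"/>',
  '<rect x="60" y="50" width="40" height="60"/>',
  '<rect x="110" y="30" width="40" height="80"/>',
  '</svg>'
)
values <- c(10, 20, 30)  # placeholder values, one per bar

# Add an onclick handler to every <rect>, passing that bar's value.
is_rect <- grepl("<rect", svg_in, fixed = TRUE)
handlers <- sprintf(' onclick="toggleBar(evt, %s)"', values)
svg_out <- svg_in
svg_out[is_rect] <- paste0(sub("/>$", "", svg_in[is_rect]), handlers, "/>")

# ECMAScript that recolors the clicked bar and updates the total text.
script <- c(
  '<script type="text/ecmascript"><![CDATA[',
  'var total = 0;',
  'function toggleBar(evt, value) {',
  '  var bar = evt.target;',
  '  var on = bar.getAttribute("fill") == "orange";',
  '  bar.setAttribute("fill", on ? "black" : "orange");',
  '  total += on ? -value : value;',
  '  document.getElementById("total").textContent = "Selected total: " + total;',
  '}',
  ']]></script>',
  '<text id="total" x="10" y="20">Selected total: 0</text>'
)

# Insert the script and the text box just before the closing </svg> tag.
close_at <- which(svg_out == "</svg>")
svg_out <- append(svg_out, script, after = close_at - 1)
```

Writing svg_out to a file with writeLines() and opening that file in a browser should give clickable bars.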

Here’s an example SVG file of the 2010 U.S. Budget.  You can click the bars to get an updated total of the selection.  I got the budget numbers from this Wikipedia article.


Also, try this pairs plot of the states dataset, with tooltips and zoomable subplots.

Tar file and example SVG files at http://mike-love.net/interactiveSVG/

DBpedia

December 10, 2008

I posted something a year ago about DBpedia, but never really looked at what they offer.  You can do some pretty interesting things with their query interface to the Wikipedia infoboxes.  Anytime you see structured data in an infobox on Wikipedia, you have a chance to grab it as TSV, JSON or XML.  Of course, it’s hit or miss whether Wikipedia has numbers formatted nicely.  I tried looking up casualties for military conflicts and it varied from things like “1000 men”, “12 galleys”, “heavy”, to “20-30,000”.   Good use case for Mechanical Turk.  Here’s a 5 minute example:


Graph US cities by total population within the city and the greater metropolitan population.  I use the online query explorer with this query:

SELECT ?subject ?poptotal ?popmetro WHERE {
?subject rdf:type <http://sw.opencyc.org/2008/06/10/concept/Mx4rvViEwJwpEbGdrcN5Y29ycA>.
?subject <http://dbpedia.org/property/populationTotal> ?poptotal.
?subject <http://dbpedia.org/property/populationMetro> ?popmetro.
}
ORDER BY DESC(xsd:integer(?poptotal))
LIMIT 1000

This means, give me the name, population (total) and population (metro) of US Cities.  The opencyc.org bit means US Cities – it was the most convenient way to grab these.  You can figure out how to structure your query by looking at an example page, like http://dbpedia.org/page/New_York_City.  It will have the same URI ending as the Wikipedia page.  This query returns:

subject    poptotal    popmetro
:New_York_City [http]    8274527    19750000
:Los_Angeles%2C_California [http]    3849378    17755322
:Chicago [http]    2836658    9785747
:Houston%2C_Texas [http]    2208180    5628101
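The same request could probably be scripted instead of going through the query explorer.  A minimal sketch, assuming the public endpoint at https://dbpedia.org/sparql accepts the query in a "query" parameter and a "format" parameter for TSV output (both are assumptions here, and the 2008 property URIs may no longer resolve).  The sketch only builds the URL; the download line is left commented out.

```r
# The query from above, as a single string.
query <- paste(
  "SELECT ?subject ?poptotal ?popmetro WHERE {",
  "?subject rdf:type <http://sw.opencyc.org/2008/06/10/concept/Mx4rvViEwJwpEbGdrcN5Y29ycA>.",
  "?subject <http://dbpedia.org/property/populationTotal> ?poptotal.",
  "?subject <http://dbpedia.org/property/populationMetro> ?popmetro.",
  "}",
  "ORDER BY DESC(xsd:integer(?poptotal))",
  "LIMIT 1000",
  sep = "\n"
)

# Percent-encode the query so it can travel in a URL.
encoded <- URLencode(query, reserved = TRUE)

# Endpoint and parameter names are assumptions, see the note above.
request_url <- paste0(
  "https://dbpedia.org/sparql?query=", encoded,
  "&format=", URLencode("text/tab-separated-values", reserved = TRUE)
)

# Not run here: fetch the result as a data frame.
# cities <- read.delim(request_url, stringsAsFactors = FALSE)
```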

Copy and paste the results into a text editor to clean up the city names and the few population numbers that contain commas or other stray characters.  Four lines of R:

> library(Rlab)
> cities <- read.delim("cities", header=FALSE)
> lplot(log(cities[,2]), log(cities[,3]), labels=cities[,1], xlim=c(9,17), ylim=c(11,17), xlab="Population total (log)", ylab="Population Metro (log)", tcex=.5)
> abline(lm(log(cities[,3])~log(cities[,2])))

Here’s the plot.  Cities above the line are more sprawl-ish.

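The text-editor cleanup could also be done in R.  Here is a sketch using only the four rows shown above; the helper names clean_name and clean_number are made up for this example.  It also ranks cities by their residual from the log-log fit, which is one way to put a number on "above the line".  With four rows the fitted line is not the one in the plot, so the ranking is only illustrative.

```r
# First four rows of the query output, tab-separated as in the TSV.
raw <- c(
  ":New_York_City [http]\t8274527\t19750000",
  ":Los_Angeles%2C_California [http]\t3849378\t17755322",
  ":Chicago [http]\t2836658\t9785747",
  ":Houston%2C_Texas [http]\t2208180\t5628101"
)
cities <- read.delim(text = raw, header = FALSE, colClasses = "character",
                     col.names = c("subject", "poptotal", "popmetro"))

# Strip the " [http]" suffix and leading colon, decode %2C, drop underscores.
clean_name <- function(x) {
  x <- sub(" \\[http\\]$", "", x)
  x <- sub("^:", "", x)
  x <- vapply(x, URLdecode, character(1), USE.NAMES = FALSE)
  gsub("_", " ", x, fixed = TRUE)
}

# Remove thousands separators before converting to numbers.
clean_number <- function(x) as.numeric(gsub(",", "", x, fixed = TRUE))

cities$subject  <- clean_name(cities$subject)
cities$poptotal <- clean_number(cities$poptotal)
cities$popmetro <- clean_number(cities$popmetro)

# Positive residual: metro population is larger than the fit predicts.
fit <- lm(log(popmetro) ~ log(poptotal), data = cities)
cities$resid <- resid(fit)
cities[order(-cities$resid), c("subject", "resid")]
```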

I’m thinking of other stuff to do with this.  Wikipedians are dutiful about infoboxing particular data.  Things like the casualties of ancient sea battles, which characters appeared in which episodes of a show, and the measurements of every female adult model.

Google News Archive Timeline

December 6, 2008


Fun stuff.  Try it out:

plague – Google News Archive Search.

Brendan O’Connor puts machine learning and statistics in a jar and shakes the jar

December 3, 2008

Brendan O’Connor puts machine learning and statistics in a jar and shakes the jar:

ML sounds like it’s young, vibrant, interesting to learn, and growing; Stats does not.

Is marketing a problem? Machine learning terms definitely sound pretty cool. Maybe the perspective of computational intelligence lends itself to cool names. Though the Stanford statisticians certainly know how to play this game — for example, they made up their own names for variants of L1 and L2-regularized regression, leaving annoyed people like me forever googling “lasso” and “ridge” trying to remember which is which. (On the other hand, perhaps that’s child’s play compared to the true original sin of ML nomenclature: tossing around the highly deceptive term “neural network” for a stack of linear functions paired with a wonky, overhyped training algorithm; the combination of which, many years later, still causes confusion.)

Yelp rating trend

October 16, 2008

New feature on Yelp.  It appears only on some pages, and the number of reviews doesn’t seem to be the only thing that determines which ones.

The Pollster loess treatment includes faint actual datapoints.  In this case you’d have a lot of overlapping points.  How noisy is the actual data?

Wikipedia article traffic statistics

October 9, 2008

Nate Silver had a post recently that had links to this website.  Are Wikipedia article statistics a cleaner signal of aggregate interest in a topic than Google Trends?  Wikipedia funnels people using different queries to the same page.

Wikipedia article traffic statistics

“This is very much a beta service and may disappear or change at any time.”

Obama vs. McCain trend with error bars

September 9, 2008

I’ve been checking polling sites like an addict recently, especially Pollster.com and FiveThirtyEight.com. Pollster sometimes employs 68% and 95% confidence intervals around their loess trend lines, like in this article about historic convention bounces:

Intuitively, it seems like the error might fluctuate over time depending on frequency of observations and the variance. The simpleboot package has a function for bootstrapping of loess fits, which will return the standard error from these fits.
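The idea behind bootstrapping a loess fit can be sketched in base R.  This is a hand-rolled version on simulated data, not simpleboot's implementation and not the Pollster numbers; the variable names and the simulated series are assumptions for illustration.  Each replicate resamples the polls with replacement, refits the loess, and predicts on a fixed grid.  The standard deviation across replicates at each grid point is the bootstrap standard error.

```r
set.seed(1)

# Simulated "polls": a slow trend plus noise (synthetic, for illustration).
n <- 120
day <- sort(runif(n, 0, 100))
polls <- data.frame(day = day,
                    support = 45 + 3 * sin(day / 15) + rnorm(n, sd = 2))

# Fixed grid inside the data range, so every refit can predict on it.
grid <- data.frame(day = seq(5, 95, length.out = 50))
fit <- loess(support ~ day, data = polls)
center <- predict(fit, newdata = grid)

# Resample rows with replacement, refit, and predict on the grid.
B <- 200
boot_fits <- replicate(B, {
  resampled <- polls[sample(nrow(polls), replace = TRUE), ]
  predict(loess(support ~ day, data = resampled), newdata = grid)
})

# Standard error at each grid point is the spread across replicates.
boot_se <- apply(boot_fits, 1, sd, na.rm = TRUE)

# Bands at +/- 1 and 2 standard errors around the original fit.
bands <- data.frame(
  day = grid$day, fit = center,
  lo1 = center - boot_se,     hi1 = center + boot_se,
  lo2 = center - 2 * boot_se, hi2 = center + 2 * boot_se
)

# How much the standard error varies along the curve.
range(boot_se)
```

If the range of boot_se is narrow, a single constant standard error for the whole curve is a reasonable shortcut.  predict(fit, newdata = grid, se = TRUE)$se.fit gives the model-based standard errors for comparison.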

Here’s the current graph from Pollster:

And here is a similar loess with +/- 1 and 2 standard errors generated from bootstrapped fits.

Looking at the past 4 months shows McCain’s recent upturn in the polls.

It seems like a single, globally constant standard error is a decent assumption, at least from the graphs I generated from polling data.

Thanks to Brendan for the code for importing the Pollster data, from his post about polling loess at SocialScience++.

Here are my files for generating the graphs and the TSV data (convert .doc to .R or .tsv):

loess-error – testing predict() vs simpleboot, poll-boot – the poll graphs with bootstrap error bars, poll2 – TSV file