Another Freebase graph.  The x axis is a listing of 1,000+ non-user types.  The y axis is the number of topics in each type on a log10 scale (roughly the number of digits in the count).  The median is 39.5 topics per type.  For instance, there are 39 people listed as comic book pencilers and 40 building functions in the architecture domain.  Up at the top are 4.1 million musical tracks.  At the bottom are a number of types with only one topic, such as interview.
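
For anyone who wants to reproduce the numbers, here is a minimal sketch of how the median and the log10 bar heights could be computed. The dictionary is a tiny illustrative stand-in, not the real Freebase data; only the counts 1, 39, 40 and 4.1 million come from the post, and the type labels are informal.

```python
import math
from statistics import median

# Illustrative stand-in for the real per-type topic counts.
# Only these four figures appear in the post; a real run would
# load all 1,000+ types.
topic_counts = {
    "interview": 1,
    "comic book penciler": 39,
    "building function": 40,
    "musical track": 4_100_000,
}

# With an even number of types the median is the mean of the two
# middle values, which is how a fractional value like 39.5 arises.
median_count = median(topic_counts.values())

# Height of each bar on the log10 y axis, rounded to two decimals.
heights = {name: round(math.log10(n), 2) for name, n in topic_counts.items()}

print(median_count)
print(heights)
```

A type with a single topic sits at height 0 on this axis, which is why the bottom of the graph looks flat.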

Will Moffat mentioned a good idea a while ago: setting goals for adding data within domains and types, and having visual indicators of progress within those domains.  Freebase could make some back-of-the-envelope estimates of how many entries are missing in a given type.  These estimates are needed because a low count does not necessarily mean missing data (as with days of the week).


  1. Hi Mike. This is a great idea. I work at Metaweb, and have thought some about computing these estimates for all the types in the system. For instance, Freebase has about eight thousand composers. How many are there? First, that brings up notability: depending upon how notable they have to be to be in the list, the number will change.

    Notability notwithstanding, here is one heuristic. Take the strings that make up the names of composers, and collect the most predictive keywords found within a ten-word radius of those names in a large corpus, say the web. Remove the ones that are common across English in general. Now make queries with these keywords (e.g., ‘classical era’), and filter out English words and the names that you already had: what remains is a potential upper bound on the composers in your corpus.

    Shoot me an email if you’d like to talk more about this.

    — Praveen.
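
The heuristic in the comment above could be sketched roughly as follows. This is one reading of it under several assumptions, not Metaweb's method: names are single tokens, the "corpus" is a plain list of strings, "predictive" is approximated by raw frequency after dropping common words, a "query" is a simple keyword match, and the function names `context_keywords` and `candidate_names` are made up for illustration.

```python
import re
from collections import Counter

WINDOW = 10  # the "ten word radius" from the comment


def tokenize(text):
    """Split text into alphabetic tokens, keeping case."""
    return re.findall(r"[A-Za-z']+", text)


def context_keywords(docs, known_names, common_words, top_n=5):
    """Most frequent non-common words seen within WINDOW tokens of a known name."""
    counts = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        for i, tok in enumerate(tokens):
            if tok not in known_names:
                continue
            window = tokens[max(0, i - WINDOW): i + WINDOW + 1]
            for w in window:
                if w not in known_names and w.lower() not in common_words:
                    counts[w.lower()] += 1
    return [w for w, _ in counts.most_common(top_n)]


def candidate_names(docs, keywords, known_names, english_words):
    """Capitalized tokens from keyword-matching docs, minus English words and known names."""
    wanted = set(keywords)
    candidates = set()
    for doc in docs:
        tokens = tokenize(doc)
        if not wanted & {t.lower() for t in tokens}:
            continue  # this document does not match the keyword "query"
        for tok in tokens:
            if (tok[0].isupper()
                    and tok.lower() not in english_words
                    and tok not in known_names):
                candidates.add(tok)
    return candidates


# Toy corpus, invented for this example.
docs = [
    "Mozart wrote a famous symphony and a sonata",
    "Haydn wrote a symphony for the court",
    "Salieri wrote an opera and a symphony",
    "Paris is a city in France",
]
known = {"Mozart", "Haydn"}
common = {"a", "an", "and", "the", "for", "wrote", "famous",
          "is", "in", "city", "court"}
english = common | {"symphony", "sonata", "opera"}

keywords = context_keywords(docs, known, common, top_n=1)
new_names = candidate_names(docs, keywords, known, english)
print(keywords, new_names)
```

The known names plus the surviving candidates give a loose ceiling rather than an estimate: anything capitalized that appears near a keyword gets through, so the candidate set will contain plenty of non-composers on real text.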

