Notes on the Microsoft Concept Graph

Introduction

For a long time there has been a battle in Natural Language Processing between statistically oriented systems trained on large quantities of text and grammar-oriented systems motivated by the modern linguistics associated with Professor Noam Chomsky. That said, some modern NLP systems do start by parsing sentences, using something like the Stanford Parser.

I've long been interested in the problem of Named Entity Resolution and, in particular, the approach used by DBpedia Spotlight, which collects a large number of possible surface forms for each concept (for instance, both "NYPD" and "New York City Police Department" refer to the same concept). Spotlight is different from some other systems in that it resolves entities to concepts, rather than simply marking phrases in the text that correspond to particular roles (for instance, in a sentence like "Frank Boltz was an employee of the NYPD", the first phrase is a person's name while the second is the name of an organization -- this assignment could be made even if we had no idea who or what these entities are).

I like to think of this kind of system as a "magic magic marker", which highlights phrases in text to tag them with either general or specific meanings. A system like Spotlight works in two phases: first finding places where the surface form dictionary matches the text, and then determining which interpretations are correct. (For instance, the word "Kate" could be a surface form for the first name of a woman having one of over 1,000 names derived from Catherine -- an inspection of the context is necessary to narrow it down to a particular Kate.)
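
To make the two phases concrete, here is a minimal sketch of phase one in Python, using a toy surface form dictionary; the dictionary, the find_surface_forms function, and the greedy longest-match-first strategy are all my own illustration, not how Spotlight is actually implemented:

# A toy surface form dictionary mapping phrases to candidate concepts.
# (Illustrative only -- Spotlight's real dictionary is mined from Wikipedia.)
SURFACE_FORMS = {
    "nypd": ["New_York_City_Police_Department"],
    "new york city police department": ["New_York_City_Police_Department"],
    "kate": ["Kate_Winslet", "Kate_Middleton", "Kate_Bush"],
}
MAX_WORDS = 5  # longest phrase we will try to match

def find_surface_forms(text):
    """Phase one: scan the text for dictionary matches, longest match first."""
    tokens = text.lower().split()
    matches, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in SURFACE_FORMS:
                matches.append((phrase, SURFACE_FORMS[phrase]))
                i += n  # greedy: skip past the matched phrase
                break
        else:
            i += 1
    return matches

# [('kate', [...three candidates...]), ('nypd', ['New_York_City_Police_Department'])]
print(find_surface_forms("Kate filed a complaint with the NYPD"))

Phase two, the hard part, would score each candidate concept against the surrounding context and keep the best one.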

The Microsoft Concept Graph is a database of surface forms and possible interpretations. Rather than resolving phrases to specific concepts (say, "Pikachu" to Pikachu), it tags phrases with general concepts such as "Pokemon" or "Character". Produced with technology similar to that used to create word embeddings such as word2vec, the Concept Graph is positioned as a tool for understanding short texts such as search queries and tweets.

This chapter is based on my notes from a preliminary investigation of the Microsoft Concept Graph, intended as a rapid evaluation of the product for text analysis applications.

Overview

Let's look at the first 15 lines of the file to get a quick sense of the contents:

label surface form score
factor age 35167
free rich company datum size 33222
free rich company datum revenue 33185
state california 18062
supplement msm glucosamine sulfate 15942
factor gender 14230
factor temperature 13660
metal copper 11142
issue stress pain depression sickness 11110
variable age 9375
information name 9274
state new york 8925
social medium facebook 8919
material plastic 8628
supplemental material cds 8175

(Note that I added the header at the top; the actual file has no header.)

The file as a whole has 33,377,320 lines and is sorted by descending score. The label is a concept that could be applied to a text phrase, the surface form is the phrase itself, and the score is a measure of the strength of association between the two. In the top 15 lines we can already see some of the diversity of concepts, such as "factor", which represents various properties that an object could have, as well as "state" and "material". We can also see a few examples where the results are strange, such as the concept "free rich company datum" (which seems to represent a property that a company could have) and the issue "stress pain depression sickness" (which seems ill-formed and a bit verbose).
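
The snippets in the rest of these notes assume the file can be streamed as three tab-separated columns (whitespace splitting would be ambiguous, given the multi-word labels and surface forms); the file name and the reader function are my own, so check them against your copy of the download:

def read_concept_graph(path):
    """Stream (label, surface form, score) records from the raw file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            label, surface_form, score = line.rstrip("\n").split("\t")
            yield label, surface_form, int(score)

# Reproduce the listing above: the first 15 records of the file.
from itertools import islice
for record in islice(read_concept_graph("data-concept-instance-relations.txt"), 15):
    print(*record)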

Analysis of labels

Already we see one quality that most classifications based on unsupervised machine learning lack: the categories (labels) are meaningfully named, at least for the most part. In all, the system assigns 5,376,526 different labels. The most commonly assigned labels are these:

rank number of members label
1 364111 factor
2 203549 feature
3 201986 issue
4 172106 product
5 158829 item
6 142963 area
7 137435 topic
8 133715 service
9 122903 activity
10 112387 information
11 110915 event
12 108940 company
13 102032 common search term
14 92337 program
15 91842 technique
16 88835 application
17 88342 organization
18 84534 case
19 83271 method
20 82397 name
21 80643 project
22 77880 option
23 75264 parameter
24 73788 tool
25 68767 group
26 64969 term
27 62168 problem
28 61827 material
29 61768 variable
30 56243 technology
31 55607 place
32 55161 measure
33 54641 artist
34 53449 community
35 51183 element
36 50445 aspect
37 50411 player
38 49954 condition
39 48435 concept
40 47923 system
41 46679 function
42 44516 task
43 43694 brand
44 42806 initiative
45 42075 device
46 41886 component
47 41500 datum
48 39146 person
49 37843 site
50 37505 resource

The most common labels are short words with broad meanings. The one multi-word label that appears in the top 50, "common search term", turns out to be a strange one, containing search terms that would be used to find pirated software, such as "adobe photoshop crack" and "age of empires 3 serial". A look at some of the least used labels shows that there is plenty of room at the bottom:

rank number of members label
5376507 1 0168monuments
5376508 1 01527 1441
5376509 1 012 agonists
5376510 1 0 10v control application
5376511 1 00v diode
5376512 1 00portable item
5376513 1 00 later flst
5376514 1 00db features
5376515 1 00 construction equipment vehicle
5376516 1 0067j once pmma particle
5376517 1 0 027 in delivery microcatheter
5376518 1 00163192contaminated debris
5376519 1 0 014 inch guidewire
5376520 1 000 square foot building
5376521 1 00054j a medical device embodiment
5376522 1 0003j non invasive medical imaging technique
5376523 1 0002j thermoplastic
5376524 1 0002j hydrocarbon
5376525 1 0002highly absorbent article
5376526 1 00 01 04

There are 2,364,966 labels that have only one surface form and, looking at the examples above, these are often gibberish; for practical work, it's clear that a large number of junk records could be removed.
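
Frequency tables like the two above can be produced in a single pass over the file; here is a minimal sketch, reusing the hypothetical read_concept_graph() reader from the overview:

from collections import Counter

label_sizes = Counter()
for label, surface_form, score in read_concept_graph("data-concept-instance-relations.txt"):
    label_sizes[label] += 1  # count surface forms per label

# The 50 most commonly assigned labels, as in the first table above.
for rank, (label, n) in enumerate(label_sizes.most_common(50), start=1):
    print(rank, n, label)

# Labels with a single surface form -- the junk records discussed above.
singletons = sum(1 for n in label_sizes.values() if n == 1)
print(singletons, "labels have only one surface form")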

Deep drill into a few labels

Let's take a look at the top 25 "factors":

label surface form score
factor age 35167
factor gender 14230
factor temperature 13660
factor size 6709
factor stress 6433
factor education 6256
factor cost 5661
factor smoking 5532
factor location 5247
factor diet 5205
factor ph 5160
factor weather 4604
factor weight 4157
factor genetic 3844
factor climate 3756
factor income 3727
factor ethnicity 3708
factor obesity 2975
factor humidity 2945
factor time 2940
factor culture 2904
factor environment 2688
factor type 2268
factor experience 2211
factor lifestyle 2177

Note that the "factors" are related to what we could call "predicates" in the RDF world, being attributes that something could have -- it reads like a list of possible independent variables that could affect people. If we look at other labels assigned to age, we get a list of very similar looking concepts:

label surface form score
factor age 35167
variable age 9375
characteristic age 4494
demographic variable age 3703
information age 3465
risk factor age 3154
demographic datum age 2682
demographic characteristic age 2579
demographic factor age 2541
demographic information age 2433
datum age 1834
patient characteristic age 1834
demographic age 1573
continuous variable age 1374
parameter age 1321
personal information age 1261
personal characteristic age 1109
confounding factor age 1086
covariate age 1071
patient factor age 773
baseline characteristic age 649
potential confounder age 641
criterion age 630
clinical datum age 574
issue age 566

These concepts form a messy categorization, much like a folksonomy. For instance, the distinction between a continuous and a discrete variable is potentially interesting:

label surface form score
continuous variable age 1374
continuous variable bmi 133
continuous variable weight 125
continuous variable height 86
continuous variable income 74
continuous variable blood pressure 56
continuous variable patient age 54
continuous variable birth weight 44
continuous variable body mass index 41
continuous variable temperature 35
continuous variable hemoglobin 31
continuous variable tumor size 26
discrete variable gender 26
continuous variable time 25
continuous variable expenditure 21
continuous variable age at diagnosis 20
continuous variable vital sign 20
continuous variable education 19
continuous variable gestational age 19
continuous variable biochemical result 19
continuous variable patient s age 19
discrete variable marital status 19
discrete variable thickness 14
discrete variable group status 10
discrete variable count datum 8
discrete variable brazil nut harvest method 8
discrete variable mortality 7
discrete variable road 7
discrete variable river access 7
discrete variable specific management practice 7
discrete variable return of spontaneous circulation 6
discrete variable presence of somatic mutation 6
discrete variable land cover 6
discrete variable burnt area 6
discrete variable location 5
discrete variable presence 5
discrete variable asa 5
discrete variable anticipated career choice 5
discrete variable fare class 5
discrete variable method of research used 5

The Concept Graph picks up this distinction, but looking carefully, note that "thickness" isn't necessarily a discrete variable. Although a number of interesting categories exist in the graph, a considerable amount of cleanup would be necessary to create useful classifications.
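
All of the drill-downs in this section are simple filters over the raw file; if you plan to poke around interactively, it's convenient to build in-memory indexes in both directions. Again, a sketch on top of the hypothetical reader from the overview:

from collections import defaultdict

by_label = defaultdict(list)    # label -> [(surface form, score), ...]
by_surface = defaultdict(list)  # surface form -> [(label, score), ...]
for label, surface_form, score in read_concept_graph("data-concept-instance-relations.txt"):
    by_label[label].append((surface_form, score))
    by_surface[surface_form].append((label, score))

# The file is sorted by descending score, so the lists stay sorted too:
# the top 25 "factors" are simply the first 25 entries for that label.
for surface_form, score in by_label["factor"][:25]:
    print("factor", surface_form, score)

# All of the labels assigned to "age", as in the second table above.
for label, score in by_surface["age"][:25]:
    print(label, "age", score)

Be warned that holding all 33 million rows in memory this way takes several gigabytes; if that's a problem, filter while streaming instead.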

Precision and Recall Analysis

Let's take a detailed look at a label which ought to have a well-defined list of values, specifically "chemical element":

rank label surface form score
1 chemical element carbon 137
2 chemical element oxygen 112
3 chemical element nitrogen 77
4 chemical element iron 63
5 chemical element gold 49
... ... ... ...
27 chemical element fluoride 11
... ... ... ...
34 chemical element oxygen carbon gold molybdenum 6
... ... ... ...
39 chemical element heavy metal 5
... ... ... ...
50 chemical element cr 3
... ... ... ...
64 chemical element trace element 2
... ... ... ...
88 chemical element 7 li 1
... ... ... ...
143 chemical element helium 1
144 chemical element neon 1

Note that the surface forms fit into a number of categories: (i) chemical elements by name, (ii) chemical elements by abbreviation, (iii) phrases that could stand in for some class of chemical element (ex. "heavy metal", "rare earth"), (iv) isotopes (ex. "7 li"), (v) names of ions (ex. "fluoride"), and (vi) crazy misses (ex. "oxygen carbon gold molybdenum"). Many of these are examples of the kind of "near miss" that turns up in any kind of classification, particularly when language is involved. If we imagine, however, that we're looking for a list of chemical elements, and we're willing to consider abbreviations to be valid surface forms, there turn out to be 70 elements identified by name, 9 identified by abbreviation, and 65 surface forms that are not chemical elements at all. That gives us a precision of 79/144 = 54.9% and a recall of 70/118 = 59.3% for names and 9/118 = 7.6% for abbreviations.
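
The arithmetic is worth making explicit: precision is the fraction of retrieved surface forms that really name elements (counting abbreviations as valid), while recall is the fraction of the 118 elements that were found. A quick check of the numbers above:

names, abbreviations, junk = 70, 9, 65
retrieved = names + abbreviations + junk   # 144 surface forms in total
elements = 118                             # elements in the periodic table

precision = (names + abbreviations) / retrieved
recall_names = names / elements
recall_abbreviations = abbreviations / elements
# precision 54.9%, recall 59.3% (names), 7.6% (abbreviations)
print(f"precision {precision:.1%}, recall {recall_names:.1%} (names), "
      f"{recall_abbreviations:.1%} (abbreviations)")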

Considering that the Concept Graph discovered the concept of "chemical element", this is impressive, but if you really need a list of chemical elements you're better off getting them out of DBpedia, where the query

select ?element {
    ?element dct:subject dbc:Chemical_elements
}

gets 100% recall with 92% precision; the precision falls short because the query also returns a few spurious results such as Chemical_element and Transfermium_Wars. In either case one would need to apply curation to get a perfect list: DBpedia comes much closer for this well-defined concept, but the Microsoft Concept Graph had to discover the concept on its own.
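
If you want to run the query yourself, here is a minimal sketch against the public DBpedia endpoint using the SPARQLWrapper package; the prefixes are the standard DBpedia ones:

from SPARQLWrapper import SPARQLWrapper, JSON  # pip install sparqlwrapper

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dct: <http://purl.org/dc/terms/>
    PREFIX dbc: <http://dbpedia.org/resource/Category:>
    SELECT ?element {
        ?element dct:subject dbc:Chemical_elements .
    }
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["element"]["value"])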

A few sample surface forms

Just to give you a sense of what you will find, I'll show a few examples of the labels assigned to particular surface forms. "Foot" is a good example of polysemy, that is, a word having multiple meanings:

label surface form score
area foot 210
extremity foot 190
body part foot 180
symptom foot 126
animal disease foot 83
unit foot 66
personal weapon foot 60
measurement foot 42
part foot 35
physical feature foot 34
feature foot 31
side effect foot 30
dry area foot 29
site foot 28
event foot 27

Note that the current version of the Microsoft Concept Graph makes no attempt to distinguish between polysemous concepts, but they are working on this for the next phase. Such a classification isn't as simple as picking "the right choice", because a particular use of the word "foot" could be an "area", an "extremity", a "body part", and a "personal weapon" all at the same time.

The Concept Graph is rich in knowledge about the biomedical domain. We get a nice set of categories for a common drug:

label surface form score
antidiarrheal agent loperamide 120
antimotility drug loperamide 88
medication loperamide 86
antimotility agent loperamide 66
antidiarrheal drug loperamide 32
over the counter medication loperamide 31
antidiarrheal medication loperamide 30
agent loperamide 28
compound loperamide 26
antidiarrheal loperamide 26
anti diarrheal agent loperamide 23
antidiarrhoeal drug loperamide 22
medicine loperamide 19
anti cancer drug loperamide 19
opiate loperamide 18
over the counter medicine loperamide 18
over the counter anti diarrheal medication loperamide 18
antiperistaltic agent loperamide 17
antidiarrhoea medicine loperamide 17
synthetic opiate loperamide 16

I also often see nice (if repetitive) results when the surface form is a drug category:

label surface form score
medication calcium channel blocker 468
vasodilator calcium channel blocker 116
agent calcium channel blocker 108
antihypertensive drug calcium channel blocker 45
medicine calcium channel blocker 39
pharmacological agent calcium channel blocker 39
antihypertensive calcium channel blocker 30
antihypertensive medication calcium channel blocker 29
compound calcium channel blocker 27
vasodilators calcium channel blocker 23
therapy calcium channel blocker 22
pharmacologic agent calcium channel blocker 21
smooth muscle relaxant calcium channel blocker 20
blood pressure medication calcium channel blocker 13
cardiac medication calcium channel blocker 13
antihypertensives calcium channel blocker 12
cardiovascular drug calcium channel blocker 11
nonantimicrobial medication calcium channel blocker 11
combination calcium channel blocker 10
antihypertensive agent calcium channel blocker 10

(The trouble, however, is that medical applications are going to be held to account for errors, so a high level of accuracy will be required.)

One area where precision is less important is pop culture, and good results can be had for relatively obscure topics:

label surface form score
artist mf doom 15
rapper mf doom 3
producer mf doom 2
successful artist mf doom 2
american emcee mf doom 2
underground hip-hop producer mf doom 2
hip hop legend mf doom 2
name mf doom 1
act mf doom 1
musician mf doom 1
hip hop artist mf doom 1
record producer mf doom 1
signee mf doom 1
talented artist mf doom 1
rap artist mf doom 1
contemporary musician mf doom 1
others artist mf doom 1
intelligent conscious, talented rapper mf doom 1
prestigious great-producer-okay-rappers mf doom 1
influence mf doom 1
powerhouse artist mf doom 1
popular alternative rapper mf doom 1

Finally, I'll show an example of what I call a "critical error", the kind of small mistake which has an outsized effect:

label surface form score
dictator adolf hitler 27
person adolf hitler 25
leader adolf hitler 22
historical figure adolf hitler 21
powerful speaker adolf hitler 12
individual adolf hitler 11
good leader adolf hitler 9
nazi leader adolf hitler 7
name adolf hitler 6
charismatic leader adolf hitler 4
ruler adolf hitler 4
german leader adolf hitler 4
high ranking nazi leader adolf hitler 4
man adolf hitler 3
figure adolf hitler 3
key individual adolf hitler 3
military dictator adolf hitler 3
famous leader adolf hitler 3
madman adolf hitler 3
psychopath adolf hitler 3

Note that the system classifies Adolf Hitler as a "good leader" as well as many other things; 19 of the 20 labels look reasonable (95% precision), but the one bad label could deeply offend somebody -- a different situation from the gobbledygook (but not actively harmful) labels we see on so many topics.

Conclusion

The Microsoft Concept Graph is one of many databases that (i) involve the intersection of language and concepts, and (ii) apply to a broad range of human knowledge. In this chapter, I've shown a number of small cross sections which give some idea of what you'll find inside it. It's hard to give a meaningful evaluation of the Concept Graph in general, because quality is a matter of fitness for some specific application. The 54.9% precision and 59.3% recall I found for chemical elements is an example of how this kind of categorization can be evaluated in a particular domain.