For a long time there has been a battle in Natural Language Processing between statistically oriented systems, trained on large quantities of text, and grammar oriented systems motivated by the modern linguistics associated with Professor Noam Chomsky, although some modern NLP systems still start by parsing sentences using something like the Stanford Parser.
I've long been interested in the problem of Named Entity Resolution and, in particular, the approach used by DBpedia Spotlight, which collects a large number of possible surface forms for each concept (for instance, both "NYPD" and "New York City Police Department" refer to the same concept.) Spotlight differs from some other systems in that it resolves entities to concepts, rather than simply marking phrases in the text that play particular roles (for instance, in a sentence like "Frank Boltz was an employee of the NYPD", the first phrase is a person's name while the second is the name of an organization -- this assignment could be made even if we had no idea who or what these entities are.)
I like to think of this kind of system as a "magic magic marker", which highlights phrases in text to tag them with either general or specific meanings. A system like Spotlight works in two phases, first finding places where the surface form dictionary matches the text, and then determining which interpretations are correct. (For instance, the word "Kate" could be a surface form for any of more than 1,000 women whose names derive from Catherine -- an inspection of the context is necessary to narrow it down to a particular Kate.)
The Microsoft Concept Graph is a database of surface forms and possible interpretations. Rather than resolving phrases to specific concepts (say, "Pikachu" to Pikachu), it tags phrases with general concepts such as "Pokemon" or "Character". Produced with technology similar to that used to create word embeddings such as word2vec, the concept graph is positioned as a tool useful for understanding short texts such as search queries and tweets.
This chapter is based on my notes from a preliminary investigation of the Microsoft Concept Graph, intended as a rapid evaluation of the product for text analysis applications.
Let's look at the first 15 lines of the file to get a quick sense of the contents:
label | surface form | score |
---|---|---|
factor | age | 35167 |
free rich company datum | size | 33222 |
free rich company datum | revenue | 33185 |
state | california | 18062 |
supplement | msm glucosamine sulfate | 15942 |
factor | gender | 14230 |
factor | temperature | 13660 |
metal | copper | 11142 |
issue | stress pain depression sickness | 11110 |
variable | age | 9375 |
information | name | 9274 |
state | new york | 8925 |
social medium | | 8919 |
material | plastic | 8628 |
supplemental material | cds | 8175 |
(Note that I added the header at the top; the actual file has no header.)
The file as a whole has 33,377,320 lines and is sorted by descending score. The label is a concept that could be applied to a text phrase, the surface form is the phrase itself, and the score measures the strength of association between the two. In the top 15 lines we can already get some sense of the diversity of concepts, such as "factor" (which represents various properties an object could have), as well as "state" and "material". We can also see a few examples where the results are strange, such as the concept "free rich company datum" (which seems to represent a property a company could have) and the issue "stress pain depression sickness" (which seems ill formed and a bit verbose).
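For readers who want to follow along, here is a minimal sketch of how one might peek at the file. I'm assuming the dump is a single tab-separated text file with the three columns in that order; the filename is whatever your unpacked download is called:

```python
import csv

# Assumed filename for the unpacked Concept Graph dump; adjust to your copy.
PATH = "data-concept-instance-relations.txt"

# Assumed layout: (label, surface form, score), tab separated, no header row.
with open(PATH, encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="\t")
    for i, (label, surface_form, score) in enumerate(reader):
        print(f"{label} | {surface_form} | {score}")
        if i == 14:   # stop after the first 15 lines
            break
```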
Already we see one quality that most classifications based on unsupervised machine learning lack: the categories (labels) are meaningfully named, at least for the most part. Speaking of labels, the system assigns 5,376,526 different labels. The most commonly assigned labels are these:
rank | number of members | label |
---|---|---|
1 | 364111 | factor |
2 | 203549 | feature |
3 | 201986 | issue |
4 | 172106 | product |
5 | 158829 | item |
6 | 142963 | area |
7 | 137435 | topic |
8 | 133715 | service |
9 | 122903 | activity |
10 | 112387 | information |
11 | 110915 | event |
12 | 108940 | company |
13 | 102032 | common search term |
14 | 92337 | program |
15 | 91842 | technique |
16 | 88835 | application |
17 | 88342 | organization |
18 | 84534 | case |
19 | 83271 | method |
20 | 82397 | name |
21 | 80643 | project |
22 | 77880 | option |
23 | 75264 | parameter |
24 | 73788 | tool |
25 | 68767 | group |
26 | 64969 | term |
27 | 62168 | problem |
28 | 61827 | material |
29 | 61768 | variable |
30 | 56243 | technology |
31 | 55607 | place |
32 | 55161 | measure |
33 | 54641 | artist |
34 | 53449 | community |
35 | 51183 | element |
36 | 50445 | aspect |
37 | 50411 | player |
38 | 49954 | condition |
39 | 48435 | concept |
40 | 47923 | system |
41 | 46679 | function |
42 | 44516 | task |
43 | 43694 | brand |
44 | 42806 | initiative |
45 | 42075 | device |
46 | 41886 | component |
47 | 41500 | datum |
48 | 39146 | person |
49 | 37843 | site |
50 | 37505 | resource |
The most common labels are short words with broad meanings. The one multi-word label that appears in the top 50, "common search term", turns out to be a strange one, containing search terms that would be used to find pirated software, such as "adobe photoshop crack" and "age of empires 3 serial". A look at some of the least used labels shows that there is plenty of room at the bottom:
rank | number of members | label |
---|---|---|
5376507 | 1 | 0168monuments |
5376508 | 1 | 01527 1441 |
5376509 | 1 | 012 agonists |
5376510 | 1 | 0 10v control application |
5376511 | 1 | 00v diode |
5376512 | 1 | 00portable item |
5376513 | 1 | 00 later flst |
5376514 | 1 | 00db features |
5376515 | 1 | 00 construction equipment vehicle |
5376516 | 1 | 0067j once pmma particle |
5376517 | 1 | 0 027 in delivery microcatheter |
5376518 | 1 | 00163192contaminated debris |
5376519 | 1 | 0 014 inch guidewire |
5376520 | 1 | 000 square foot building |
5376521 | 1 | 00054j a medical device embodiment |
5376522 | 1 | 0003j non invasive medical imaging technique |
5376523 | 1 | 0002j thermoplastic |
5376524 | 1 | 0002j hydrocarbon |
5376525 | 1 | 0002highly absorbent article |
5376526 | 1 | 00 01 04 |
There are 2,364,966 labels which have only one surface form and, looking at the examples above, these are often gibberish; for practical work, it's clear that a large number of junk records could be removed.
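To make this concrete, here is a hedged sketch (using pandas, with the same assumed filename and column order as above) of how one might tabulate label sizes, drop singleton labels, and run the two kinds of lookup used in the tables below -- all surface forms for a label, and all labels for a surface form:

```python
import pandas as pd

# Assumed filename and column order for the tab-separated dump.
df = pd.read_csv(
    "data-concept-instance-relations.txt",
    sep="\t",
    names=["label", "surface_form", "score"],
)

# How many surface forms does each label have?
label_sizes = df.groupby("label").size().sort_values(ascending=False)
print(label_sizes.head(50))          # the top-50 labels shown above

# Drop labels that have only one surface form (often gibberish).
keep = label_sizes[label_sizes > 1].index
cleaned = df[df["label"].isin(keep)]

def surface_forms_for(label, n=25):
    """Top surface forms for a given label, e.g. 'factor'."""
    return cleaned[cleaned["label"] == label].nlargest(n, "score")

def labels_for(surface_form, n=25):
    """Top labels for a given surface form, e.g. 'age'."""
    return cleaned[cleaned["surface_form"] == surface_form].nlargest(n, "score")
```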
Let's take a look at the top 25 "factors":
label | surface form | score |
---|---|---|
factor | age | 35167 |
factor | gender | 14230 |
factor | temperature | 13660 |
factor | size | 6709 |
factor | stress | 6433 |
factor | education | 6256 |
factor | cost | 5661 |
factor | smoking | 5532 |
factor | location | 5247 |
factor | diet | 5205 |
factor | ph | 5160 |
factor | weather | 4604 |
factor | weight | 4157 |
factor | genetic | 3844 |
factor | climate | 3756 |
factor | income | 3727 |
factor | ethnicity | 3708 |
factor | obesity | 2975 |
factor | humidity | 2945 |
factor | time | 2940 |
factor | culture | 2904 |
factor | environment | 2688 |
factor | type | 2268 |
factor | experience | 2211 |
factor | lifestyle | 2177 |
Note that the "factors" are related to what we could call "predicates" in the RDF world, being attributes that something could have -- it reads like a list of possible independent variables that could affect people. If we look at other labels assigned to age, we get a list of very similar looking concepts:
label | surface form | score |
---|---|---|
factor | age | 35167 |
variable | age | 9375 |
characteristic | age | 4494 |
demographic variable | age | 3703 |
information | age | 3465 |
risk factor | age | 3154 |
demographic datum | age | 2682 |
demographic characteristic | age | 2579 |
demographic factor | age | 2541 |
demographic information | age | 2433 |
datum | age | 1834 |
patient characteristic | age | 1834 |
demographic | age | 1573 |
continuous variable | age | 1374 |
parameter | age | 1321 |
personal information | age | 1261 |
personal characteristic | age | 1109 |
confounding factor | age | 1086 |
covariate | age | 1071 |
patient factor | age | 773 |
baseline characteristic | age | 649 |
potential confounder | age | 641 |
criterion | age | 630 |
clinical datum | age | 574 |
issue | age | 566 |
These concepts form a messy categorization, much like a folksonomy. For instance, the distinction between a continuous and a discrete variable is potentially interesting:
label | surface form | score |
---|---|---|
continuous variable | age | 1374 |
continuous variable | bmi | 133 |
continuous variable | weight | 125 |
continuous variable | height | 86 |
continuous variable | income | 74 |
continuous variable | blood pressure | 56 |
continuous variable | patient age | 54 |
continuous variable | birth weight | 44 |
continuous variable | body mass index | 41 |
continuous variable | temperature | 35 |
continuous variable | hemoglobin | 31 |
continuous variable | tumor size | 26 |
discrete variable | gender | 26 |
continuous variable | time | 25 |
continuous variable | expenditure | 21 |
continuous variable | age at diagnosis | 20 |
continuous variable | vital sign | 20 |
continuous variable | education | 19 |
continuous variable | gestational age | 19 |
continuous variable | biochemical result | 19 |
continuous variable | patient s age | 19 |
discrete variable | marital status | 19 |
discrete variable | thickness | 14 |
discrete variable | group status | 10 |
discrete variable | count datum | 8 |
discrete variable | brazil nut harvest method | 8 |
discrete variable | mortality | 7 |
discrete variable | road | 7 |
discrete variable | river access | 7 |
discrete variable | specific management practice | 7 |
discrete variable | return of spontaneous circulation | 6 |
discrete variable | presence of somatic mutation | 6 |
discrete variable | land cover | 6 |
discrete variable | burnt area | 6 |
discrete variable | location | 5 |
discrete variable | presence | 5 |
discrete variable | asa | 5 |
discrete variable | anticipated career choice | 5 |
discrete variable | fare class | 5 |
discrete variable | method of research used | 5 |
The concept graph picks up this distinction, but on close inspection note that "thickness" isn't necessarily a discrete variable. Although a number of interesting categories exist in the graph, a considerable amount of cleanup would be necessary to turn them into useful classifications.
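As a rough illustration of what that cleanup might involve, here is a sketch of a few heuristic filters one could try: dropping labels that start with digits (like the "0067j once pmma particle" examples above), singleton labels, and very low-scoring pairs. The thresholds and patterns are arbitrary assumptions on my part, not anything the Concept Graph itself prescribes:

```python
import re

def looks_like_junk(label: str) -> bool:
    """Heuristic: labels such as '0067j once pmma particle' start with digits."""
    return bool(re.match(r"^\d", label))

def clean(df, min_score=5, min_members=2):
    """Apply a few arbitrary cleanup filters to the Concept Graph dataframe."""
    sizes = df.groupby("label")["surface_form"].transform("size")
    mask = (
        (df["score"] >= min_score)
        & (sizes >= min_members)
        & ~df["label"].map(looks_like_junk)
    )
    return df[mask]
```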
Let's take a detailed look at a label that ought to have a well-defined list of values, specifically "chemical element":
rank | label | surface form | score |
---|---|---|---|
1 | chemical element | carbon | 137 |
2 | chemical element | oxygen | 112 |
3 | chemical element | nitrogen | 77 |
4 | chemical element | iron | 63 |
5 | chemical element | gold | 49 |
... | ... | ... | ... |
27 | chemical element | fluoride | 11 |
... | ... | ... | ... |
34 | chemical element | oxygen carbon gold molybdenum | 6 |
... | ... | ... | ... |
39 | chemical element | heavy metal | 5 |
... | ... | ... | ... |
50 | chemical element | cr | 3 |
... | ... | ... | ... |
64 | chemical element | trace element | 2 |
... | ... | ... | ... |
88 | chemical element | 7 li | 1 |
... | ... | ... | ... |
143 | chemical element | helium | 1 |
144 | chemical element | neon | 1 |
Note that the surface forms fit into a number of categories: (i) chemical elements by name, (ii) chemical elements by abbreviation, (iii) phrases that could stand in for some class of chemical element (ex. "heavy metal", "rare earth"), (iv) isotopes (ex. "7 li"), (v) names of ions (ex. "fluoride"), and (vi) crazy misses (ex. "oxygen carbon gold molybdenum"). Many of these are examples of the kind of "near miss" situations that turn up in any kind of classification, particularly when language is involved. If we imagine, however, that we're looking for a list of chemical elements, and we're willing to consider abbreviations to be valid surface forms, there turn out to be 70 elements identified by name, 9 identified by abbreviation, and 65 surface forms that are not chemical elements. That gives us precision of 79/144 = 54.8% and recall of 70/118 = 59.3% for names and 9/118 = 7.6% for abbreviations.
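For clarity, those precision and recall figures follow directly from the counts just given, with 118 being the number of known chemical elements in the periodic table; a tiny sketch of the arithmetic:

```python
names, abbreviations, misses = 70, 9, 65        # counts from the table above
total_surface_forms = names + abbreviations + misses   # 144
known_elements = 118                            # elements in the periodic table

precision = (names + abbreviations) / total_surface_forms   # 79/144 ≈ 0.548
recall_names = names / known_elements                        # 70/118 ≈ 0.593
recall_abbreviations = abbreviations / known_elements        # 9/118  ≈ 0.076
```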
Considering that the concept graph discovered the concept of "chemical element", this is impressive, but if you really need a list of chemical elements you're better off getting them out of DBpedia, where the query
```sparql
select count(*) {
  ?element dct:subject dbc:Chemical_elements
}
```
gets 100% recall with 92% precision, because the category picks up a few spurious members such as Chemical_Element and Transfermium_Wars. In either case one would need to apply curation to get a perfect list: DBpedia comes much closer for this well-defined concept, but the Microsoft Concept Graph discovers concepts on its own.
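If you want to run that comparison yourself, a minimal sketch against the public DBpedia SPARQL endpoint (https://dbpedia.org/sparql, where the dct: and dbc: prefixes are predefined) might look like this; here I select the members rather than counting them, so they can be spot-checked by hand:

```python
import requests

ENDPOINT = "https://dbpedia.org/sparql"

QUERY = """
select ?element {
  ?element dct:subject dbc:Chemical_elements
}
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY, "format": "application/sparql-results+json"},
)
response.raise_for_status()

bindings = response.json()["results"]["bindings"]
members = [b["element"]["value"] for b in bindings]
print(len(members))    # category size, for the precision estimate
print(members[:5])     # spot-check a few members by hand
```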
Just to give you a sense of what you will find, I'll show a few examples of the labels assigned to particular surface forms. "Foot" is a good example of polysemy, that is, a word having multiple meanings:
label | surface form | score |
---|---|---|
area | foot | 210 |
extremity | foot | 190 |
body part | foot | 180 |
symptom | foot | 126 |
animal disease | foot | 83 |
unit | foot | 66 |
personal weapon | foot | 60 |
measurement | foot | 42 |
part | foot | 35 |
physical feature | foot | 34 |
feature | foot | 31 |
side effect | foot | 30 |
dry area | foot | 29 |
site | foot | 28 |
event | foot | 27 |
Note that the current version of the Microsoft Concept Graph makes no attempt to disambiguate polysemous surface forms, though the authors are working on this for the next phase. Such a classification isn't as simple as picking "the right choice", because a particular use of the word "foot" could be an "area", "extremity", "body part" and "personal weapon" all at the same time.
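One simple way to work with these overlapping senses is to treat the scores for a surface form as relative weights rather than picking a single winner. This is my own hedged sketch, not anything the Concept Graph distribution ships; it just normalizes the scores returned by the labels_for lookup sketched earlier:

```python
def label_distribution(surface_form, n=15):
    """Normalize label scores for a surface form into relative weights."""
    rows = labels_for(surface_form, n)      # lookup defined in the earlier sketch
    total = rows["score"].sum()
    return {
        label: score / total
        for label, score in zip(rows["label"], rows["score"])
    }

# e.g. label_distribution("foot") gives roughly
# {'area': 0.18, 'extremity': 0.16, 'body part': 0.15, ...}
```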
The Concept Graph is rich in knowledge about the biomedical domain. We get a nice set of categories for a common drug:
label | surface form | score |
---|---|---|
antidiarrheal agent | loperamide | 120 |
antimotility drug | loperamide | 88 |
medication | loperamide | 86 |
antimotility agent | loperamide | 66 |
antidiarrheal drug | loperamide | 32 |
over the counter medication | loperamide | 31 |
antidiarrheal medication | loperamide | 30 |
agent | loperamide | 28 |
compound | loperamide | 26 |
antidiarrheal | loperamide | 26 |
anti diarrheal agent | loperamide | 23 |
antidiarrhoeal drug | loperamide | 22 |
medicine | loperamide | 19 |
anti cancer drug | loperamide | 19 |
opiate | loperamide | 18 |
over the counter medicine | loperamide | 18 |
over the counter anti diarrheal medication | loperamide | 18 |
antiperistaltic agent | loperamide | 17 |
antidiarrhoea medicine | loperamide | 17 |
synthetic opiate | loperamide | 16 |
I also often see nice (if repetitive) results when the surface form is a drug category:
label | surface form | score |
---|---|---|
medication | calcium channel blocker | 468 |
vasodilator | calcium channel blocker | 116 |
agent | calcium channel blocker | 108 |
antihypertensive drug | calcium channel blocker | 45 |
medicine | calcium channel blocker | 39 |
pharmacological agent | calcium channel blocker | 39 |
antihypertensive | calcium channel blocker | 30 |
antihypertensive medication | calcium channel blocker | 29 |
compound | calcium channel blocker | 27 |
vasodilators | calcium channel blocker | 23 |
therapy | calcium channel blocker | 22 |
pharmacologic agent | calcium channel blocker | 21 |
smooth muscle relaxant | calcium channel blocker | 20 |
blood pressure medication | calcium channel blocker | 13 |
cardiac medication | calcium channel blocker | 13 |
antihypertensives | calcium channel blocker | 12 |
cardiovascular drug | calcium channel blocker | 11 |
nonantimicrobial medication | calcium channel blocker | 11 |
combination | calcium channel blocker | 10 |
antihypertensive agent | calcium channel blocker | 10 |
(The trouble is, however, that medical applications are going to be held to account for errors, so a high level of accuracy will be required.)
One area where precision is less important is pop culture, and good results can be had for relatively obscure topics:
label | surface form | score |
---|---|---|
artist | mf doom | 15 |
rapper | mf doom | 3 |
producer | mf doom | 2 |
successful artist | mf doom | 2 |
american emcee | mf doom | 2 |
underground hip-hop producer | mf doom | 2 |
hip hop legend | mf doom | 2 |
name | mf doom | 1 |
act | mf doom | 1 |
musician | mf doom | 1 |
hip hop artist | mf doom | 1 |
record producer | mf doom | 1 |
signee | mf doom | 1 |
talented artist | mf doom | 1 |
rap artist | mf doom | 1 |
contemporary musician | mf doom | 1 |
others artist | mf doom | 1 |
intelligent conscious, talented rapper | mf doom | 1 |
prestigious great-producer-okay-rappers | mf doom | 1 |
influence | mf doom | 1 |
powerhouse artist | mf doom | 1 |
popular alternative rapper | mf doom | 1 |
Finally, I'll show an example of what I call a "critical error", the kind of small mistake which has an outsized effect:
label | surface form | score |
---|---|---|
dictator | adolf hitler | 27 |
person | adolf hitler | 25 |
leader | adolf hitler | 22 |
historical figure | adolf hitler | 21 |
powerful speaker | adolf hitler | 12 |
individual | adolf hitler | 11 |
good leader | adolf hitler | 9 |
nazi leader | adolf hitler | 7 |
name | adolf hitler | 6 |
charismatic leader | adolf hitler | 4 |
ruler | adolf hitler | 4 |
german leader | adolf hitler | 4 |
high ranking nazi leader | adolf hitler | 4 |
man | adolf hitler | 3 |
figure | adolf hitler | 3 |
key individual | adolf hitler | 3 |
military dictator | adolf hitler | 3 |
famous leader | adolf hitler | 3 |
madman | adolf hitler | 3 |
psychopath | adolf hitler | 3 |
Note that the system classifies Adolf Hitler as a "good leader" as well as many other things; 19 of the 20 labels look reasonable (95% precision), but the one bad label could deeply offend somebody -- a different situation from the gobbledygook (but not actively harmful) labels we see on so many topics.
The Microsoft Concept Graph is one of many databases that (i) involve the intersection of language and concepts, and (ii) apply to a broad range of human knowledge. In this chapter, I've shown a number of small cross sections which give some idea of what you'll find inside it. It's hard to give a meaningful evaluation of the Concept Graph in general, because quality is a matter of fitness for some specific application. The measurement of 54.8% precision and 59.3% recall for chemical elements is an example of how this kind of categorization can be evaluated in a particular domain.