This is the second installment of the work that began with Looking For Metadata in All the Wrong Places. The short of it is that I scanned my computer for XMP metadata packets, stored those packets in a triple store, and am now using the SPARQL query language to explore them. This project is part of the development of the Gastrodon library, which is designed to make SPARQL queries from a Jupyter notebook as easy as pie.
In this essay, I take an inventory of the namespaces, classes, and properties that describe the files on my computer. Finding that properties that describe color swatches are the most numerically prevalent, I investigate the use of them, and along the way, look at the thumbnail embedded in a metadata packet.
As always, we start by importing libraries:
%load_ext autotime
import sys
sys.path.append("/Users/paul_000/Documents/Github/gastrodon")
from gastrodon import Endpoint,QName,ttl,URIRef
import pandas as pd
from base64 import b64decode
from rdflib import Graph,Namespace,RDF
from IPython.display import display_jpeg,display_html
prefixes=Graph()
prefixes.load('/xmp/prefixes.ttl',format='ttl')
endpoint=Endpoint("http://127.0.0.1:8890/sparql/",prefixes)
First I'll show you a list of RDF namespaces used in the XMP metadata packets. XMP metadata packets are encoded in RDF/XML and use XML namespace syntax to define RDF namespaces, which map short names to URIs, just as do XML namespaces.
The import process consolidates the namespace declarations used in the packets, bringing in declarations from a number of vendors. Most of them are in Adobe namespaces, as Adobe developed the XMP specification, but frequently those represent terminology, such as EXIF, that was adopted from other vendors.
Note that namespace declarations are local to a particular graph (or database), thus somebody working elsewhere could define different short names for a given namespace, or assign a different namespace to a given short name. RDF data is serialized together with namespace declarations (just like XML) so you can exchange data between RDF databases without worrying about this.
pd.options.display.max_rows = 100
endpoint.namespaces()
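To make the point about local short names concrete, here is a minimal sketch in plain Python, with no RDF library involved. The two prefix tables and the `expand` helper are hypothetical and exist only for illustration; the namespace URI is the XMP basic namespace.

```python
# Hypothetical prefix tables from two different databases.
# Both bind the same namespace URI, but under different short names.
prefixes_here = {"xap": "http://ns.adobe.com/xap/1.0/"}
prefixes_elsewhere = {"xmp": "http://ns.adobe.com/xap/1.0/"}


def expand(qname, prefixes):
    """Expand a prefixed name such as 'xap:CreatorTool' into a full URI."""
    prefix, local = qname.split(":", 1)
    return prefixes[prefix] + local


# The short names differ, but the full URIs (what RDF compares) agree.
uri_here = expand("xap:CreatorTool", prefixes_here)
uri_elsewhere = expand("xmp:CreatorTool", prefixes_elsewhere)
print(uri_here == uri_elsewhere)
```

Because the comparison happens on full URIs, data exported from one store keeps its meaning in another, whatever short names each side prefers.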
Technically, the database that I'm using, the open-source edition of OpenLink Virtuoso, is a quad store instead of a triple store, because triples are organized into named graphs. Each named graph holds a set of triples; these can be used in numerous ways, but they are significant here because Virtuoso comes with about 5000 triples pre-loaded, and while I survey the types and properties used in the XMP data, I don't want to look at the ones that came with the database. All of the graphs except for the last one come with the database, and will be ignored.
endpoint.select("""
SELECT DISTINCT ?g {
GRAPH ?g { ?s ?p ?o .}
}
""")
A "Class" in RDF is in some ways like a class in a programming language like Java or Python. In other ways it is completely different.
In a language like Java or C++ a class definition is absolutely necessary if you want to define methods and properties; this is linked to how those languages are implemented, and it is a situation much like how you need to define a table in SQL if you want to insert rows of data.
In other languages, such as Javascript and Python, it is possible to attach methods and properties to an existing instance without affecting other instances. However, even in languages like this, where programmers could do something completely different, programmers tend to write code in a "class first" style.
In RDF, we say that ?someInstance
is a member of ?someClass
if the fact
?someInstance rdf:type ?someClass .
is set. For better and for worse we can use the shorthand
?someInstance a ?someClass .
to mean the same thing. I say "for worse" because "a" is a contraction of "is a", and the verb "to be" (of which "is" is a form) is heavily overloaded in most languages. (Some advocate reform of this.) Frequently I find that people confuse "a" with rdf:type
, rdfs:subClassOf
and even owl:sameAs
and that can lead to a whole lot of trouble.
The above explanation applies because, although XMP packets are rich with data expressed in RDF properties, they make little use of RDF classes. In fact, almost all of the classes used are classes defined in the basic RDF namespace <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
that denote collections which are unordered rdf:Bag
, ordered rdf:Seq
or that define alternate values rdf:Alt
.
endpoint.select("""
SELECT ?type (COUNT(*) AS ?count) {
GRAPH <http://rdf.ontology2.com/xmp/xmpAll> { ?that a ?type .}
} GROUP BY ?type ORDER BY DESC(?count)
""")
Although looking at class assignments is not particularly productive for this data set, looking at the properties used is highly productive.
Off the bat, it is visible that the 7 most common properties have to do with representing color swatches, so I'll spend the rest of this essay exploring them.
Another kind of property that appears frequently has the form rdf:_1
, rdf:_2
and so forth. These are used together with rdf:Bag
, rdf:Seq
and rdf:Alt
to define Containers. Given a Container, the first element is linked with rdf:_1
, the second with rdf:_2
, etc.
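As a rough sketch of how these membership properties encode order, the plain-Python snippet below recovers the members of a container from a bag of triples. The triples are made up, the node name `_:groups` is hypothetical, and the helper names are my own rather than part of any library.

```python
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def member_index(predicate):
    """Return n for the membership property rdf:_n, or None for anything else."""
    suffix = predicate[len(RDF_NS):] if predicate.startswith(RDF_NS) else ""
    if suffix.startswith("_") and suffix[1:].isdecimal():
        return int(suffix[1:])
    return None


def container_members(triples, container):
    """List the members of a container in rdf:_1, rdf:_2, ... order."""
    found = []
    for subject, predicate, obj in triples:
        index = member_index(predicate)
        if subject == container and index is not None:
            found.append((index, obj))
    # Sort on the integer index so that rdf:_10 comes after rdf:_2.
    return [obj for index, obj in sorted(found)]


# Made-up triples, deliberately out of order; rdf:type is not a member.
triples = [
    ("_:groups", RDF_NS + "_2", "second"),
    ("_:groups", RDF_NS + "type", RDF_NS + "Seq"),
    ("_:groups", RDF_NS + "_1", "first"),
]
print(container_members(triples, "_:groups"))
```

Parsing the number out of the property name matters because triples carry no inherent order; the position lives entirely in the predicate.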
Note: the results of this query stop at 10,000 rows because Virtuoso is configured to return a maximum of 10,000 results via the public SPARQL endpoint. This is easy to change for a private SPARQL endpoint, but the limit is a feature that can protect a SPARQL endpoint from users who write excessively complex queries.
pd.options.display.max_rows = 50
endpoint.select("""
SELECT ?predicate (COUNT(*) AS ?count) {
GRAPH <http://rdf.ontology2.com/xmp/xmpAll> { ?subject ?predicate ?object.}
} GROUP BY ?predicate ORDER BY DESC(?count)
""")
By excluding the rdf:_#
properties, we evade the limit and get a look at some of the less commonly used properties.
endpoint.select("""
SELECT ?predicate (COUNT(*) AS ?count) {
GRAPH <http://rdf.ontology2.com/xmp/xmpAll> { ?subject ?predicate ?object.}
FILTER (!STRSTARTS(STR(?predicate),"http://www.w3.org/1999/02/22-rdf-syntax-ns#_"))
} GROUP BY ?predicate ORDER BY DESC(?count)
""")
I'd like to look for example swatches, but I can't do that by writing
?swatch a xapG:Swatch
because swatches are not members of a class. However, I can find swatches by looking for subjects that have an appropriate property:
endpoint.select("""
SELECT ?subject {
?subject xapG:swatchName ?predicate
} LIMIT 5
""")
I can now put <nodeID://b10447>
on the left hand side (LHS) of a triple to see what properties a swatch has:
endpoint.select("""
SELECT ?predicate ?object {
<nodeID://b10447> ?predicate ?object
}
""")
Putting <nodeID://b10447>
on the right hand side (RHS) of the triple and working to the left, I can find what has this swatch as a property:
endpoint.select("""
SELECT ?subject ?predicate {
?subject ?predicate <nodeID://b10447> .
}
""")
The swatch is a member of a list, so I go to the left again:
endpoint.select("""
SELECT ?subject ?predicate {
?subject ?predicate <nodeID://b10446> .
}
""")
and again...
endpoint.select("""
SELECT ?subject ?predicate {
?subject ?predicate <nodeID://b10445> .
}
""")
Another list!
base=endpoint.select("""
SELECT ?subject ?predicate {
?subject ?predicate <nodeID://b10444> .
}
""")
base
This gets to a non-blank node, meaning this is the root of the XMP packet. From the root, the path is
root -> xapTPg:SwatchGroups -> rdf:_1 -> xapG:Colorants -> rdf:_1 -> xapG:blue, xapG:green, etc.
At this point I use the peel
operator to extract the complete XMP packet, which starts at the root and crawls through blank nodes (e.g. nodeID://b10444
) to capture containers, nested properties, and such. This is a little RDF graph extracted from the big RDF graph stored in Virtuoso, and I print it out in the popular Turtle format, which most people find more readable than RDF/XML:
g=endpoint.peel(URIRef(base.at[0,'subject']))
ttl(g)
This packet has 431 facts:
len(g)
There are many interesting things to see in the above packet, such as history and a font list, but I will stay on track with the color swatches. The color swatches in the above packet are organized in a single group, which we can visualize (ignoring, for the moment, the single "SPOT" color, which is CMYK instead of RGB; I'll show an example of CMYK swatches later).
First I'll show the first five swatches...
s1=endpoint.select("""
SELECT ?swatchName ?red ?green ?blue {
<nodeID://b10446> ?any ?swatch .
?swatch xapG:swatchName ?swatchName .
?swatch xapG:type "PROCESS" .
OPTIONAL { ?swatch xapG:red ?red }
OPTIONAL { ?swatch xapG:blue ?blue }
OPTIONAL { ?swatch xapG:green ?green }
}
""")
s1.head()
Then convert numbers to color codes for cascading style sheets:
def colorcode(frame):
    # Build a CSS rgb(...) string per row from the rounded color channels
    frame["code"]=list(map(lambda x: "rgb(%d,%d,%d)" % x,zip(
        round(frame["red"]),
        round(frame["green"]),
        round(frame["blue"]))))
    return frame
colorcode(s1).head()
At which point it is straightforward to write a custom formatter that displays the color:
pd.options.display.max_colwidth=0
colorblock=lambda code: "<div style='height:20px; width:100px; background: %s; border: 1px solid black'></div>" % (code,)
display_html(s1.to_html(formatters=[None,None,None,None,colorblock],escape=False),raw=True)
A good way to research how properties are used is to group on and count the values. As described in Part 2 of the XMP Specification there are four color spaces which can be used to specify swatches:
Note: CMYK is a specification of dyes used to print colors on paper, GRAY is grayscale, RGB is the common additive color specification used on computer screens, and LAB separates the luminance (brightness) from hue and saturation.
endpoint.select("""
SELECT ?object (COUNT(*) AS ?count) {
?subject xapG:mode ?object.
} GROUP BY ?object ORDER BY DESC(?count)
""")
Color swatches can be Process colors (created by combining a number of dyes or light colors) or Spot colors (a specific color created with a particular dye). Process colors are much more popular today because they don't require any thinking on the part of the artist or printer, but in the past Spot colors were commonly used to save money by having fewer printing plates. (I think of Scientific American illustrations from the 1970s.) Spot colors are also effective because they can produce artistic and distinctive effects as opposed to Process color.
In terms of data, note that a handful of swatches are defined with a non-standard version of the word "PROCESS"; this is a frequent occurrence in most large data sets, although RDF contributes to the problem by having been slow to develop a standard for validation. (17 years from the beginning of RDF, SHACL is still only a Proposed Recommendation.)
endpoint.select("""
SELECT ?object (COUNT(*) AS ?count) {
?subject xapG:type ?object.
} GROUP BY ?object ORDER BY DESC(?count)
""")
Here are some common color names:
endpoint.select("""
SELECT ?object (COUNT(*) AS ?count) {
?subject xapG:swatchName ?object.
} GROUP BY ?object ORDER BY DESC(?count)
""")
These are dominated by uninteresting names derived from the color numbers; filtering most of these out, I find that colors like the middle-of-the-road blue PANTONE 300 C are popular:
endpoint.select("""
SELECT ?object (COUNT(*) AS ?count) {
?subject xapG:swatchName ?object.
FILTER(!CONTAINS(?object,"="))
} GROUP BY ?object ORDER BY DESC(?count)
""")
In the example above there was just one group of swatches, but what about cases where there are a large number of color swatches? The query below finds the documents which have the largest number of swatch groups. Put simply, it counts how many members are in the xapTPg:SwatchGroups
list. The list looks like
:Groups a rdf:Seq ;
    rdf:_1 "First Member" ;
    rdf:_2 "Second Member" ;
    rdf:_3 "Third Member" ;
    ...
Assuming a collection is well formed, the length of the list is the number of rdf:_#
properties that list has, which we can calculate like so, so long as the list has only rdf:_#
and rdf:type
properties:
many_groups=endpoint.select("""
SELECT ?packet (COUNT(*) AS ?count) {
?packet xapTPg:SwatchGroups ?group .
?group ?grouppred ?member .
FILTER (?grouppred!=rdf:type)
} GROUP BY ?packet ORDER BY DESC(?count) LIMIT 10
""")
many_groups
We can pull out the name of the packet...
many_groups.at[0,'packet']
At the risk of making you scroll down a lot, I'll show you the whole graph contained in this packet. One reason why this packet is so long is that it contains an embedded thumbnail image, so I will show you that image as a reward for scrolling:
g=endpoint.peel(URIRef(many_groups.at[0,'packet']))
ttl(g)
For all of that, the least I can do is show you the thumbnail. To do that, I am going to use the rdflib Graph
object that is stored in the g
variable (which just contains facts from the packet) as opposed to writing a SPARQL query against the big graph in Virtuoso which holds all of the packets.
The first step I take is to import some functions and then create rdflib Namespace
objects for libraries, because once I have the namespace objects, it is very easy to write the names either as
xap.Thumbnails
or
xap['Thumbnails']
def getNS(graph,name):
    # Look up the namespace URI bound to the prefix `name` in the graph
    return Namespace(list(filter(lambda x:x[0]==name,graph.namespaces()))[0][1])
xap=getNS(g,"xap")
xapG=getNS(g,"xapG")
xapGImg=getNS(g,"xapGImg")
There is an RDF
Namespace in the object which would let you write RDF.Seq
or RDF.Bag
but you cannot use it to write RDF._1
or RDF._2
because RDF
is a ClosedNamespace
which only resolves names from a short list. This protects you from using an invalid namespace/name combination, but for that reason I wrote a function that looks up the list member predicate for a given number.
While I am at it, I'll add one to the index so that member(0)
is the first element of the list, just as it would be in an ordinary Python list.
def member(index):
    # member(0) is rdf:_1, member(1) is rdf:_2, and so on
    return URIRef("http://www.w3.org/1999/02/22-rdf-syntax-ns#_"+str(index+1))
The rdflib Graph
uses Python slicing in an unusual way:
g[?s:?p:?o]
allows slicing over the triples that have a particular combination of subject, predicate and object. With that, the declared namespaces, and the member()
function, I can get the base64-encoded thumbnail, decode it, and display it.
thumblist=list(g[:xap.Thumbnails:])[0][1]
thumb=list(g[thumblist:member(0):])[0]
thumbData=b64decode(list(g[thumb:xapGImg.image:])[0].toPython())
display_jpeg(thumbData,raw=True)
Nice Fractal! Note that this is a monochrome image, so the color swatches don't tell us anything about the image itself. The XMP specification, however, was meant to satisfy Adobe's needs for its suite of media handling tools, and swatch groups are useful for artists to keep track of what colors are being used on a project that involves a number of different tools.
That said, I'll draw the color swatches attached to this document. The first step is to find which predicates are in use for this set of color swatches:
swatch_predicates=endpoint.select("""
SELECT DISTINCT ?predicate {
?packet xapTPg:SwatchGroups ?groups .
?groups ?grouppred ?group .
?group xapG:Colorants ?colorants .
?colorants ?colorpred ?swatch .
?swatch ?predicate ?value
}
""",bindings={"packet": URIRef(many_groups.at[0,'packet'])})
swatch_predicates
I next write a query which returns one row per color. A key thing to note in this query is the use of the OPTIONAL
clause, which keeps a result row whether or not the pattern inside it matches. It is important in this case because, most of the time, the xapG:tint
property is not set -- if I didn't use an optional pattern here, only the first seven rows would be returned.
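The effect of an optional pattern can be mimicked in plain Python, which may help if the SPARQL semantics feel abstract. This is only an analogy and not how Virtuoso evaluates the query; the swatch records below are invented for illustration.

```python
# Invented swatch records; only the first one carries a tint.
swatches = [
    {"swatchName": "Brand Blue", "cyan": 100.0, "tint": 50.0},
    {"swatchName": "Warm Gray", "cyan": 0.0},
    {"swatchName": "Deep Red", "cyan": 0.0},
]

# Required pattern: a swatch without a tint produces no row at all.
required = [
    (s["swatchName"], s["tint"]) for s in swatches if "tint" in s
]

# Optional pattern: every swatch produces a row, and a missing tint
# is simply left unbound (None here).
optional = [(s["swatchName"], s.get("tint")) for s in swatches]

print(len(required), len(optional))
```

The required version silently drops two of the three swatches, which is the same kind of loss a non-optional tint pattern would cause in the query.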
pd.options.display.max_rows = 100
swatch_colors=endpoint.select("""
SELECT ?groupName ?swatchName ?cyan ?magenta ?yellow ?black ?tint {
?packet xapTPg:SwatchGroups ?groups .
?groups ?grouppred ?group .
?group xapG:Colorants ?colorants .
?colorants ?colorpred ?swatch .
?group xapG:groupName ?groupName
FILTER (?colorpred!=rdf:type)
OPTIONAL { ?swatch xapG:black ?black}
OPTIONAL { ?swatch xapG:cyan ?cyan }
OPTIONAL { ?swatch xapG:mode ?mode }
OPTIONAL { ?swatch xapG:magenta ?magenta}
OPTIONAL { ?swatch xapG:swatchName ?swatchName}
OPTIONAL { ?swatch xapG:tint ?tint}
OPTIONAL { ?swatch xapG:type ?type}
OPTIONAL { ?swatch xapG:yellow ?yellow}
}
""",bindings={"packet": URIRef(many_groups.at[0,'packet'])})
swatch_colors
Unlike the examples from "computing-solutions.pdf", where the swatches were described mainly in the RGB color space, swatches from the fractal document are entirely in the CMYK color space.
Here are the definitions of the xapG namespace types, excerpted from Part 2 of the XMP specification:
| Name | Type | Description |
|---|---|---|
| xapG:A, xapG:B | Integer | A or B value when the mode is LAB. Range -128 to 127. |
| xapG:L | Real | L value when the mode is LAB. Range 0-100. |
| xapG:black, xapG:cyan, xapG:magenta, xapG:yellow | Real | Colour value when the mode is CMYK. Range 0-100. |
| xapG:blue, xapG:green, xapG:red | Integer | Colour value when the mode is RGB. Range 0-255. |
| xapG:mode | closed Choice | The colour space in which the colour is defined. One of: CMYK, RGB, LAB. Library colours are represented in the colour space for which they are defined. |
| xapG:swatchName | Text | Name of the swatch. |
| xapG:type | closed Choice | The type of colour, one of PROCESS or SPOT. |
It turns out that xapG:tint
does not appear in the specification. Looking at the Illustrator docs, I find
You can also create tints in the Swatches panel. A tint is a global process color or spot color with a modified intensity. Tints of the same color are linked together, so that if you edit the color of a tint swatch, all associated tint swatches (and the objects painted with those swatches) change color, though the tint values remain unchanged. Tints are identified by a percentage (when the Swatches panel is in list view)
At least in the CMYK space, a 100% tint means "print 100% of the ink specified in this swatch"; setting that tint to, say, 20% means "print only 20% of the ink specified in this swatch". The effect of this on the usual white background would be similar to blending the color 80% with white on the computer screen.
To display colors on a web site, current web specifications require that colors be specified in RGB format. Thus, to display CMYK colors, I have to convert colors to RGB color space. The following simple formulas (from here) approximately convert CMYK colors to RGB:
swatch_colors["red"]=2.55/100*(100-swatch_colors["cyan"])*(100-swatch_colors["black"])
swatch_colors["green"]=2.55/100*(100-swatch_colors["magenta"])*(100-swatch_colors["black"])
swatch_colors["blue"]=2.55/100*(100-swatch_colors["yellow"])*(100-swatch_colors["black"])
Conversion between CMYK and RGB colors is a bit iffy because the two color systems are not completely comparable. Red, Green and Blue colors on a screen roughly correspond to three kinds of stimulus applied to three kinds of cone in the eye. The exact stimulus from a printed page, however, depends on the color of the light shining on the page.
The formula above lacks a contribution for tint.
Tint could be emulated by multiplying all of the CMYK channels, before the conversion, by swatch_colors["tint"]/100.0
after assuming a default value of 100.0
. I don't bother to do this, since the tint makes no difference in this case.
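For completeness, here is a minimal sketch of a conversion that does take tint into account. It rests on an assumption of mine, namely that tint scales the amount of every ink before the conversion, and the function name `cmyk_to_rgb` is made up for this example. With the default tint of 100 it is algebraically the same as the formulas above.

```python
def cmyk_to_rgb(cyan, magenta, yellow, black, tint=100.0):
    """Approximate RGB (0-255) for CMYK percentages, with an optional tint.

    Assumption: tint scales the amount of every ink before conversion,
    so tint=100 leaves the swatch unchanged and tint=0 leaves bare paper.
    """
    scale = tint / 100.0
    # Convert percentages to ink fractions in the range 0..1.
    c, m, y, k = (channel * scale / 100.0 for channel in (cyan, magenta, yellow, black))
    # Each RGB channel is what remains after its ink and the black ink.
    return tuple(round(255 * (1 - ink) * (1 - k)) for ink in (c, m, y))


print(cmyk_to_rgb(100, 0, 0, 0))           # full cyan ink
print(cmyk_to_rgb(0, 0, 0, 100, tint=20))  # 20% tint of solid black
```

Scaling the inks rather than the RGB result is the important design choice: less ink should move the color toward white paper, not toward black.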
I delete columns that I no longer need, then show a few rows before visualizing color swatches with the functions defined previously:
for color in ["cyan","magenta","yellow","black","tint"]:
    del swatch_colors[color]
swatch_colors.head()
colorcode(swatch_colors)
swatch_colors.head()
for color in ["red","green","blue"]:
    del swatch_colors[color]
pd.options.display.max_colwidth=0
showgroup=lambda group: group.replace(' Swatch Group','')
display_html(swatch_colors.to_html(formatters=[showgroup,None,colorblock],escape=False),raw=True)
Standard SPARQL, unlike some query languages, doesn't have a specific function to count the members of a list. Fortunately, it is straightforward to write a query that returns rows for all members, then counts (and possibly groups) over the rows. In a future essay, I will use inference to simplify this kind of query, but for now, the following query returns one row per document with the number of swatches found in it, because this is the shape that matplotlib and other tools for investigating statistics require.
Note that this query only returns results for documents which have swatches; it does not contain a '0' for any of the documents that do not have swatches. I'll take a look at the majority of documents, which are swatchless, in a moment. Also, this query is adding together all of the swatches in all of the swatch groups associated with a document and producing just one number.
swatch_counts=endpoint.select("""
SELECT (COUNT(*) AS ?count) {
?packet xapTPg:SwatchGroups ?groups .
?groups ?grouppred ?group .
?group xapG:Colorants ?colorants .
?colorants ?colorpred ?swatch .
FILTER (?grouppred!=rdf:type) .
FILTER (?colorpred!=rdf:type)
} GROUP BY ?packet
""")
swatch_counts.head()
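Since a grouped count only reports the packets that matched the pattern, the zeros would have to be added back by hand if swatchless documents were meant to appear in the statistics. A minimal sketch in plain Python, with invented packet names standing in for the real URIs:

```python
from collections import Counter

# Invented data: every packet in the store, and one entry per swatch found.
all_packets = ["doc-a", "doc-b", "doc-c", "doc-d"]
swatch_owner = ["doc-a", "doc-a", "doc-c", "doc-a", "doc-c"]

# Analogue of GROUP BY ?packet with COUNT(*): only packets with swatches appear.
grouped = Counter(swatch_owner)

# Re-attach the packets that never matched, with an explicit count of zero.
with_zeros = {packet: grouped.get(packet, 0) for packet in all_packets}

print(sorted(grouped.items()))
print(sorted(with_zeros.items()))
```

Whether to include the zeros depends on the question being asked; a histogram of swatch counts among documents that use swatches, as here, is better off without them.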
import matplotlib.pyplot as plt
%matplotlib inline
The Pandas dataframe itself can return basic descriptive statistics, such as mean and median:
swatch_counts.describe()
The following histogram reveals peaks at particular counts, suggesting that many of the documents carry "standard" groups of swatches that probably have been supplied by editing tools. (Looking at the list of common colors above also suggests this.)
swatch_counts.plot.hist(bins=100,range=(0,100))
The xap:CreatorTool
predicate exists to describe the tool which created an XMP packet. If we look at the tools used to save documents that have color swatches, these appear to be entirely Adobe tools, almost always Adobe Illustrator.
tools_with_swatches=endpoint.select("""
SELECT ?creatorTool (COUNT(*) AS ?count) {
?packet xapTPg:SwatchGroups ?groups ;
xap:CreatorTool ?creatorTool .
} GROUP BY ?creatorTool ORDER BY ?creatorTool
""")
tools_with_swatches
The MINUS
clause in SPARQL is good for finding data records that lack a property. Looking at the tools that have created documents without swatches, it is a much more diverse list in terms of both Adobe and non-Adobe tools.
tools_without_swatches=endpoint.select("""
SELECT ?creatorTool (COUNT(*) AS ?count) {
?packet xap:CreatorTool ?creatorTool .
MINUS {?packet xapTPg:SwatchGroups ?groups .}
} GROUP BY ?creatorTool ORDER BY ?creatorTool
""")
tools_without_swatches
As is frequently the case, the number of different values is awkwardly large, thus pandas automatically cuts out the middle of the list to display it. I could tell pandas to show you 523 rows, but if it were 5230 rows, or 5,230,000 rows, it would be impractical to show all the values, in which case they need to be summarized in some way.
One answer would be to sort the "top N names" to the top, but I like how alphabetical order groups together different versions of the product. Thus I use pandas to filter for tools that produced more than 40 swatchless documents. This illustrates the diversity of tools, both desktop and server, that generate XMP annotated documents.
tools_without_swatches[tools_without_swatches["count"] >40]
RDF, and by extension XMP, supports the full Unicode character set. rdflib, the Python library I use to work with RDF, builds on Python 3's Unicode support to handle the full range of characters. The following function selects for strings that only contain ASCII characters:
def only_ascii(sample):
    # True when every character has a code point below 128
    return all(map(lambda x: x< 128,[ord(x) for x in sample]))
I then negate the only_ascii
function, apply it to a Pandas series, and filter the whole dataframe to see creatorTool(s)
that contain non-ascii characters:
not_ascii=tools_without_swatches["creatorTool"].apply(lambda a: not only_ascii(a))
tools_without_swatches[not_ascii]
A certain big software company likes to use the registered mark, another uses the copyright mark, and others contain Chinese characters. If you have a good eye you might notice row 271
doesn't have visible non-ASCII characters; instead it has an invisible non-ASCII character.
tools_without_swatches.at[271,'creatorTool']
Somebody just had to insert a non-breaking space; oddly, this turns up just once, while other names for Lightroom turn up far more often:
tools_without_swatches[tools_without_swatches["creatorTool"].apply(lambda x: x.find('Lightroom')>=0)]
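Invisible characters like this are easier to spot when each non-ASCII character is reported by code point and Unicode name. The sketch below uses only the standard library; the helper name and the sample strings are hypothetical, not values taken from my data.

```python
import unicodedata


def describe_non_ascii(text):
    """Return (code point, Unicode name) for each non-ASCII character in text."""
    return [
        ("U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
        if ord(ch) >= 128
    ]


# Hypothetical creator tool strings, written with escapes so nothing is hidden.
samples = ["Example Editor\u00a0 5.0", "Example Suite\u00ae 2.1"]
for sample in samples:
    print(repr(sample), describe_non_ascii(sample))
```

Applied to the creatorTool column, a helper like this would label the culprit directly instead of leaving it to a sharp eye.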
As usual in a data investigation, you encounter the occasional random thing, such as the URL of a web site that generated a document.
tools_without_swatches[tools_without_swatches["creatorTool"].apply(lambda x: x.find('http')==0)]
This URI fails to resolve, but I'd better take a look at the document to check that this link doesn't compromise my privacy:
endpoint.select("""
SELECT ?packet {
?packet xap:CreatorTool "https://ecf.nysd.uscourts.gov/cgi-bin/show_temp.pl?file=10815940-0--25817.pdf&type=application/pdf"
}
""")
cftc=endpoint.peel(URIRef(_.at[0,'packet']))
ttl(cftc)
It's just a court order regarding the liquidation of MF Global that I got when I crawled the CFTC web site.
In this episode I
This episode is part of a series that demonstrates the art and science of SPARQL queries against complex data sets, such as XMP. I've been working hard to make my queries clear, but in the next episode I'll tune up my tools and use inference to make short work of the information about document history that can be found in XMP packets.