The XMP specification describes a system of embedding metadata into files of all kinds. Although part 3 of the specification defines specific ways to embed XMP packets into common file formats such as PNG, PDF, WAV and SVG, part 1 defines a method of embedding a chunk of RDF/XML with a unique pattern of bytes in the header that allows a simple routine to find XMP packets in any kind of file.
show_image("md-pipeline-2.png")
Such a naive XMP extractor, it turns out, finds a large number of XMP packets inside files where you wouldn't expect them, such as executables, ZIP files, JAR files, Office documents and more. Many of these files are composite files that incorporate a number of media files into a larger file, often uncompressed on the expectation that the content of these files is already compressed.
I scanned a Windows computer, which is heavily used for software development, creative work, games and other things, so it contains a wide variety of files. Each XMP packet was extracted from the containing document, tagged with a small amount of metadata tying it to its source, and was then inserted into OpenLink Virtuoso Open Source Edition, a triple store that supports SPARQL queries.
In this Jupyter notebook I'll introduce some of the tools I use to make reports based on RDF data, and introduce the basics of the widespread, if obscure, XMP format.
%load_ext autotime
import sys
sys.path.append("/Users/paul_000/Documents/Github/gastrodon")
from gastrodon import Endpoint,QName,ttl,URIRef
import pandas as pd
pd.options.display.width=120
pd.options.display.max_colwidth=100
RDF refers to entities and properties with URIs. For instance, the Dublin Core vocabulary uses the term <http://purl.org/dc/elements/1.1/creator>
to describe the creator of a creative work, but we could make a statement like
@prefix dc: <http://purl.org/dc/elements/1.1/> .
to define a "dc" namespace such that we can write dc:creator
instead of the complete IRI.
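As a rough illustration of what a prefix declaration buys us, the sketch below expands and shortens names with a plain dictionary. The `expand` and `shorten` helpers are hypothetical names made up for this example; they are not part of gastrodon or rdflib, which handle this bookkeeping internally.

```python
# Hypothetical prefix table; only the dc entry comes from the text above.
PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
}

def expand(qname, prefixes=PREFIXES):
    """Expand a prefixed name such as 'dc:creator' into a full IRI."""
    prefix, _, local = qname.partition(":")
    if prefix not in prefixes:
        raise KeyError(f"unknown prefix: {prefix!r}")
    return prefixes[prefix] + local

def shorten(iri, prefixes=PREFIXES):
    """Return the prefixed form of an IRI, or the IRI unchanged if no prefix fits."""
    for prefix, namespace in prefixes.items():
        if iri.startswith(namespace):
            return f"{prefix}:{iri[len(namespace):]}"
    return iri

print(expand("dc:creator"))
```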
The gastrodon library automatically handles namespaces and prefixes for us, but we need to load a list of namespaces to make that happen.
from rdflib import Graph
prefixes=Graph()
prefixes.parse('/xmp/prefixes.ttl',format='ttl')
endpoint=Endpoint("http://127.0.0.1:8890/sparql/",prefixes)
An RDF database contains a number of facts, represented as (subject,predicate,object) triples. The following query matches all facts in the database (because it has variables in all three positions of the matching pattern) and returns a count, roughly 5.3 million.
endpoint.select("""
SELECT (COUNT(*) as ?cnt) {
?s ?p ?o .
}
""")
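The matching idea behind that query can be sketched in a few lines of plain Python. This is only a toy model with made-up facts, not how Virtuoso evaluates queries: `None` stands in for a SPARQL variable, and a triple matches when it agrees with the pattern in every fixed position.

```python
# Made-up facts for illustration; None plays the role of a SPARQL variable.
FACTS = [
    ("packet:1", "o2:file", "C:/a.png"),
    ("packet:1", "dc:format", "image/png"),
    ("packet:2", "o2:file", "C:/b.exe"),
]

def match(facts, s=None, p=None, o=None):
    """Yield every triple that agrees with the pattern in all fixed positions."""
    for triple in facts:
        if all(want is None or want == got for want, got in zip((s, p, o), triple)):
            yield triple

total = sum(1 for _ in match(FACTS))               # like ?s ?p ?o
files = sum(1 for _ in match(FACTS, p="o2:file"))  # predicate fixed, rest variable
print(total, files)
```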
Next we count facts that have the following predicate:
<http://rdf.ontology2.com/metadata/file>
Although this predicate is a URI, it is not necessary to fetch this URI in order to process this data. We've attached one file property to each XMP record to record which file we found it in, thus we count 54,652 XMP packets.
endpoint.select("""
SELECT (COUNT(*) AS ?cnt) {
?s <http://rdf.ontology2.com/metadata/file> ?o
}
""")
We can access that exact same predicate with
o2:file
because Gastrodon keeps a list of namespaces and automatically prepends
prefix o2: <http://rdf.ontology2.com/metadata/>
to the query. (It knows this because the namespace was declared in the prefixes file we loaded above.)
We can count distinct values for the file names in SPARQL the same way we would in SQL, thus finding that there are 24,228 files with XMP packets.
endpoint.select("""
SELECT (COUNT(DISTINCT ?file) AS ?cnt) {
?s o2:file ?file
}
""")
We sum the sizes of all XMP packets and discover 638MB of data on a hard drive with 425GB of used space, amounting to about 0.15% of all space in use.
endpoint.select("""
SELECT (SUM(?size)/1000000.0 as ?cnt) {
?s o2:xmpSize ?size .
}
""")
As a warm-up, let's look for files with very short names (otherwise they might be awkward to fit in the table below) and count how many XMP packets they contain.
We do this with a simple SPARQL query that queries over the o2:file
property which connects the XMP packet to the filename it was extracted from. Just like we would in SQL, we GROUP and ORDER the results to count the number of packets they contain.
endpoint.select("""
SELECT ?file (COUNT(*) AS ?cnt) {
?s o2:file ?file
} GROUP BY ?file ORDER BY STRLEN(?file) LIMIT 10
""")
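For readers more at home in Python than in SQL or SPARQL, the same group, order and limit steps can be sketched with the standard library. The file names and packet identifiers below are hypothetical stand-ins for the o2:file facts.

```python
from collections import Counter

# Hypothetical (packet, file) pairs standing in for the o2:file facts.
PACKET_FILES = [
    ("p1", "C:/a.png"),
    ("p2", "C:/Windows/explorer.exe"),
    ("p3", "C:/Windows/explorer.exe"),
    ("p4", "C:/ab.jar"),
    ("p5", "C:/ab.jar"),
    ("p6", "C:/ab.jar"),
]

def packets_per_file(pairs, limit=10):
    """Count packets per file, shortest file names first."""
    counts = Counter(filename for _, filename in pairs)          # GROUP BY ?file, COUNT(*)
    ordered = sorted(counts.items(), key=lambda kv: len(kv[0]))  # ORDER BY STRLEN(?file)
    return ordered[:limit]                                       # LIMIT 10

print(packets_per_file(PACKET_FILES))
```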
Explorer.exe
is called the File Explorer by Microsoft, but in addition to the folder browser, it is also responsible for the start menu, task bar, and other functions. If you use Windows, you use it every day.
The following query is as simple as a SPARQL query gets. This query matches triples that share a specific predicate and object (value) and returns the associated subjects.
frame=endpoint.select("""
SELECT ?s {
?s o2:file "C:/Windows/explorer.exe"
}
""")
frame
We get a list of file:
URIs, each of which is the name of an XMP packet, which we generated by appending a number and letter to the filename.
To get some idea of the content of one packet, we'll look at the facts referencing the first packet.
pngOne=endpoint.select("""
SELECT ?p ?o {
<file:///C:/Windows/explorer.exe/0001w> ?p ?o
}
""")
pngOne
Note that the properties in the o2
namespace are properties I added to identify the packets, as opposed to all of the other properties which were copied verbatim from the XMP packet. Other properties come from Dublin Core, XMP and Photoshop-specific namespaces, as well as namespaces such as exif
and tiff
which contain industry-standard terminology despite being on an Adobe URI.
Part 2 of the XMP specification describes a number of namespaces of types and properties to use with XMP. Despite that, XMP users are free to use any RDF vocabulary they like so long as they comply with certain conventions described in Part 1.
Let's look first at the 'o2' properties because these provide a map for finding and describing XMP packets.
pngOne[1:8]
Since all of these properties have a 1-to-1 relationship with an XMP packet, we can write a query which works like a typical SQL query, where all the patterns share the same subject and we get exactly one row per packet. This table shows the length of all the packets, whether they allow read or write access, and their exact location in the file.
packets=endpoint.select("""
SELECT ?packetNumber ?rwFlag ?xmpStart ?xmpEnd ?xmpSize {
?s o2:file "C:/Windows/explorer.exe" .
?s o2:packetNumber ?packetNumber .
?s o2:rwFlag ?rwFlag .
?s o2:xmpStart ?xmpStart .
?s o2:xmpEnd ?xmpEnd .
?s o2:xmpSize ?xmpSize .
}
""")
packets
We already see some things that are highly suspicious; if we just look at the first five, for instance, we see that they all have exactly the same size.
endpoint.select("""
SELECT ?packetNumber ?rwFlag ?xmpStart ?xmpEnd ?xmpSize {
?s o2:file "C:/Windows/explorer.exe" ;
o2:packetNumber ?packetNumber ;
o2:rwFlag ?rwFlag ;
o2:xmpStart ?xmpStart ;
o2:xmpEnd ?xmpEnd ;
o2:xmpSize ?xmpSize .
} LIMIT 5
""")
The next SPARQL query sums up the total size of all the packets contained in explorer.exe
and we find, amazingly, that the XMP packets add up to 20% of the total file size!
sumSize=endpoint.select("""
SELECT (SUM(?xmpSize) AS ?xmpBytes) {
?s o2:file "C:/Windows/explorer.exe" ;
o2:xmpSize ?xmpSize .
}
""")
sumSize
sumSize.at[0,'xmpBytes']
pngOne.at[2,'o']
100.0*sumSize.at[0,'xmpBytes']/pngOne.at[2,'o']
Knowing the location in the file, we can slice out a range of bytes and thus see the actual XMP packet:
fname="C:/Windows/explorer.exe"
offset=packets.at[0,"xmpStart"]
size=packets.at[0,"xmpSize"]
(fname,offset,size)
def getslice(fname,offset,size):
    with open(fname,"rb") as file:
        file.seek(offset)
        return file.read(size)
rawpacket=getslice(fname,offset,size).decode("utf-8")
print(rawpacket.rstrip())
Note in particular the snippet at the beginning that reads
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
this snippet contains a special sequence of characters that is unlikely to appear anyplace else, and can thus be scanned for in order to find XMP packets. This signature is not dependent on the structure of the file so a simple reader can find XMP packets in any kind of file, or even find XMP packets on a raw disk volume -- some tools for undeleting photos from flash cards can locate media files from their embedded XMP packets, extracting the deleted files the way that we're about to extract a PNG file from Explorer.exe
.
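A minimal sketch of such a signature scanner follows. It is not the extractor used to build the database in this article; it assumes packets in an 8-bit encoding such as UTF-8, reads the whole input into memory, and uses the hypothetical helper name `find_xmp_packets`. It looks for the header with the magic id, then for the next `<?xpacket end=...?>` trailer, whose `r` or `w` value says whether the packet may be modified in place.

```python
import re

# Header and trailer patterns for packets in 8-bit encodings (UTF-8 or similar).
# The begin attribute may hold a UTF-8 byte-order mark or be empty.
HEADER = re.compile(
    rb"<\?xpacket begin=[\"'](?:\xef\xbb\xbf)?[\"'] "
    rb"id=[\"']W5M0MpCehiHzreSzNTczkc9d[\"'][^>]*\?>"
)
TRAILER = re.compile(rb"<\?xpacket end=[\"']([rw])[\"']\?>")

def find_xmp_packets(data):
    """Yield (start, end, rw_flag) for each XMP packet found in a bytes object."""
    pos = 0
    while True:
        head = HEADER.search(data, pos)
        if head is None:
            return
        tail = TRAILER.search(data, head.end())
        if tail is None:
            return  # header without a trailer: treat as truncated and stop
        yield head.start(), tail.end(), tail.group(1).decode("ascii")
        pos = tail.end()
```

Because the scanner never parses the container format, the same function works on a PNG, an executable, or a raw disk image, at the cost of possibly reporting packets from deleted or partially overwritten files.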
The packet itself is written in RDF/XML which was the first RDF serialization format, with just a few constraints that are set in Part 1.
Note that the XMP packet is followed by a whopping 12,050 characters of whitespace, as opposed to just 2,281 characters of XML!
len(rawpacket)-rawpacket.rfind('>')
all(map(lambda x:x.isspace(),rawpacket[-12049:]))
This packet claims to be part of a PNG file, so let's see if we can extract the whole image; to start, we know that every PNG file starts with an eight-byte sequence, so we can look for the last occurrence of this sequence that appears before the XMP packet.
PNGSTART=bytes.fromhex("89504E470D0A1A0A")
PNGSTART
!pip install bitstring
from bitstring import ConstBitStream
x=ConstBitStream(filename='C:/Windows/explorer.exe')
x.pos=x.rfind(PNGSTART,end=offset*8,bytealigned=True)[0]
We can then move forward 8 bytes to skip past the header
begin=x.bytepos
x.bytepos += 8
The next problem is finding the end of the PNG file; although there is no indication of the exact length of the PNG file, a PNG file consists of a number of chunks, each of which has a chunk type, length, and checksum. The last chunk is called "IEND", and once we've read it, we're at the end of the file.
def readchunks(x:ConstBitStream):
    while True:
        length=x.read("uintbe:32")
        chunkId=x.read("bytes:4").decode("utf-8")
        start=x.bytepos
        x.bytepos += length   # skip the chunk data
        x.bytepos += 4        # skip the 4-byte CRC
        if chunkId=="IEND":
            break
        yield chunkId,start,length
Here we see all the chunks, including the iTXt
chunk which contains the XMP packet.
list(readchunks(x))
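The chunk walk above skips over the checksums. As a sketch of how one might also verify them without bitstring, the function below uses only the standard library; `iter_png_chunks` and `make_chunk` are hypothetical names for this example. In PNG, each chunk's CRC-32 covers the four-byte chunk type plus the chunk data, which is what `zlib.crc32` is applied to here. Unlike `readchunks`, this version also yields the final IEND chunk.

```python
import struct
import zlib

PNG_SIGNATURE = bytes.fromhex("89504E470D0A1A0A")

def make_chunk(chunk_type, body):
    """Build one PNG chunk: length, type, data, CRC-32 over type + data."""
    crc = zlib.crc32(chunk_type + body)
    return struct.pack(">I", len(body)) + chunk_type + body + struct.pack(">I", crc)

def iter_png_chunks(data, begin=0):
    """Yield (type, data_start, length, crc_ok) for each chunk, IEND included."""
    if data[begin:begin + 8] != PNG_SIGNATURE:
        raise ValueError("no PNG signature at the given offset")
    pos = begin + 8
    while True:
        # Truncated input raises struct.error here; good enough for a sketch.
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        (crc,) = struct.unpack(">I", data[pos + 8 + length:pos + 12 + length])
        yield chunk_type.decode("ascii"), pos + 8, length, zlib.crc32(chunk_type + body) == crc
        pos += 12 + length
        if chunk_type == b"IEND":
            return
```

A failed checksum on a chunk found inside an executable would be a hint that the signature match was a coincidence rather than a real embedded image.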
The cursor on the BitStream is now at the end of the PNG file, so we're ready to extract the image file once we store the end position:
end=x.bytepos
(begin,offset,end)
imagedata=x[begin*8:end*8].bytes
The PNG file has a total length of 14,600 bytes, of which 14,372 bytes are the XMP packet, and of which 12,050 bytes are whitespace. If you think that's outrageous, you should see what happens next...
len(imagedata)
In principle it is pretty easy to display an image in a Jupyter notebook, but when we try it the first time, we don't see anything at all:
from IPython.display import display_png,display_html,display,HTML
display_png(imagedata,raw=True)
My hunch when I saw that was that I was looking at white lines on a transparent background; I looked at it in Photoshop, confirmed I was right, and then worked out a way to display an image with a custom background. The trick is that you can embed an image, data and all, in a data URI in base64 and in turn embed it in HTML.
from base64 import b64encode
b64=b64encode(imagedata).decode("ascii")
b64url='data:image/png;base64,{}'.format(b64)
embedded="<div style='height:24px; width:24px; background:MidnightBlue'><img src='{}'></div>".format(b64url)
HTML(embedded)
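Since the same trick is useful for more than one image, it could be wrapped in a small helper. The function name `png_on_background` and its parameters are made up for this sketch; in a notebook the returned string would be passed to `IPython.display.HTML`.

```python
from base64 import b64encode

def png_on_background(png_bytes, background="MidnightBlue", size_px=24):
    """Return an HTML snippet showing PNG data on a solid background colour."""
    b64 = b64encode(png_bytes).decode("ascii")
    return (
        f"<div style='height:{size_px}px; width:{size_px}px; background:{background}'>"
        f"<img src='data:image/png;base64,{b64}'></div>"
    )
```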
This image is one you've seen if you use Windows 10, as it is part of the taskbar, which is provided by Explorer.exe
; it's clear now why this image is white-on-transparent:
def show_image(filename):
    with open(filename,"rb") as f:
        image=f.read()
    display_png(image,raw=True)
show_image("taskbar.png")
At this point you might think you're seeing the kind of madness you can get from two, not just one, billion-dollar software companies.
Actually it makes a little more sense than it looks, as shown by the diagram below. The PNG file consists of a number of "chunks", one of which is an iTXt chunk which is designed to hold uncompressed text information. The PNG file was written with a large amount of whitespace padding at the end of the XMP packet so that an XMP editor could later add a few kilobytes of metadata without having to modify the entire file.
This looks absurd in the case of a very simple image which contains just 94 bytes of compressed data; it would seem even more absurd if one were using the image on the web or in a mobile app. It makes more sense in a media workflow where people are working with large files: for instance, RAW camera files are frequently 20 MB or more in size, and 12 kB of whitespace is a small price to pay, in that context, for an application like Adobe Lightroom or Microsoft Photos to update the metadata without needing to rewrite the entire file.
In the process of building the Explorer.exe
binary, the compiler and linker consolidate fragments of code, data, and many kinds of files into a single file. The compiler adds metadata to the binary that the executable uses to find resources by name just as it would find a subroutine by name:
show_image("embedding.png")
Note there is no "metadata master plan" going on here! XMP data "rides along" in the iTXt chunk because most PNG tools ignore it, keeping metadata together with the image, as designed. What we find, however, is that there are also PNG files embedded in Explorer.exe
without any XMP data, for instance, the first PNG file starts at a byte position
x.pos=x.find(PNGSTART,bytealigned=True)[0]
x.bytepos
that is long before the first XMP packet, which occurs at:
packets.at[0,"xmpStart"]
In fact, we spot 260 PNG headers in Explorer.exe
, more than the 79 XMP packets contained in the file. There is no plan for managing the metadata for the files embedded in this executable, yet large amounts of whitespace bulk up this executable by more than 20%.
Quick note: The situation is no different on Linux, macOS, or other operating systems because all executable formats have some way to embed images. What publishers should do is remove unnecessary metadata before publishing, which is easy to do in the case of PNG because we can simply omit the iTXt chunk.
len(list(x.findall(PNGSTART,bytealigned=True)))
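The advice about omitting the iTXt chunk can be sketched as a small filter over the chunk list. This is an illustration rather than a production tool: the name `strip_xmp` is hypothetical, the function assumes a well-formed PNG held in memory, and it drops only iTXt chunks whose keyword is `XML:com.adobe.xmp` (the keyword used for XMP in PNG), so other text chunks survive.

```python
import struct

PNG_SIGNATURE = bytes.fromhex("89504E470D0A1A0A")
XMP_KEYWORD = b"XML:com.adobe.xmp\x00"

def strip_xmp(png):
    """Return a copy of a PNG file with its XMP iTXt chunks removed."""
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    out = bytearray(PNG_SIGNATURE)
    pos = 8
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        chunk_type = png[pos + 4:pos + 8]
        body = png[pos + 8:pos + 8 + length]
        chunk_end = pos + 12 + length          # length + type + data + CRC
        is_xmp = chunk_type == b"iTXt" and body.startswith(XMP_KEYWORD)
        if not is_xmp:
            out += png[pos:chunk_end]          # copy the chunk unchanged, CRC and all
        pos = chunk_end
        if chunk_type == b"IEND":
            break
    return bytes(out)
```

Each remaining chunk keeps its own checksum, so nothing needs to be recomputed after a chunk is dropped.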
Next we'd like to see what kinds of files are inside Explorer.exe
; here we do another GROUP BY
query but now we are using the OPTIONAL
clause, because if we did not, the pattern would not match when the dc:format
property is not specified. With OPTIONAL
specified, the ?format
variable is set to None
when that happens.
endpoint.select("""
SELECT ?format (COUNT(*) as ?cnt) {
?s o2:file "C:/Windows/explorer.exe" ;
o2:packetNumber ?packetNumber ;
o2:xmpSize ?xmpSize .
OPTIONAL { ?s dc:format ?format . }
} GROUP BY ?format
""")
At this point we can use the MINUS
clause in SPARQL to find cases where dc:format
is not specified, as MINUS matches only if the pattern inside of it does not match, producing a list of the mysterious files. Note in this case, all of the XMP packets are relatively short.
endpoint.select("""
SELECT ?packetNumber ?flag ?xmpSize {
?s o2:file "C:/Windows/explorer.exe" ;
o2:packetNumber ?packetNumber ;
o2:rwFlag ?flag ;
o2:xmpSize ?xmpSize .
MINUS { ?s dc:format ?format . }
}
""")
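For readers who think in terms of joins, OPTIONAL behaves roughly like a left outer join and MINUS like an anti-join. The sketch below shows both over two small dictionaries; the packet numbers, sizes and formats are hypothetical and are not taken from the query results.

```python
# Hypothetical data: packet number -> size, and packet number -> dc:format.
SIZES = {1: 14372, 2: 14372, 57: 3021}
FORMATS = {1: "image/png", 2: "image/png"}   # packet 57 has no dc:format

def with_optional(sizes, formats):
    """Like OPTIONAL: keep every packet; the format is None when it is missing."""
    return [(n, size, formats.get(n)) for n, size in sorted(sizes.items())]

def with_minus(sizes, formats):
    """Like MINUS: keep only the packets that have no format at all."""
    return [(n, size) for n, size in sorted(sizes.items()) if n not in formats]

print(with_optional(SIZES, FORMATS))
print(with_minus(SIZES, FORMATS))
```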
Looking at the facts in packet number 57 we don't see anything too unusual, but note that the file type is left unspecified.
endpoint.select("""
SELECT ?p ?o {
?s o2:file "C:/Windows/explorer.exe" .
?s o2:packetNumber 57 .
?s ?p ?o .
}
""")
Conjecturing that it is just another PNG file, I use the same PNG extractor as for the image before and find that it is just the logo for Cortana, which also appears in the snapshot. Although both of these images were created with Adobe Photoshop CC 2014 (Windows)
, it seems that they were not saved with a consistent set of properties.
x.pos=x.rfind(PNGSTART,end=4082948*8,bytealigned=True)[0]
begin=x.bytepos
x.bytepos += 8
list(readchunks(x))
end=x.bytepos
(begin,end)
imagedata=x[begin*8:end*8].bytes
... and you know it just happens to be the logo for Cortana, Microsoft's voice agent:
b64=b64encode(imagedata).decode("ascii")
b64url='data:image/png;base64,{}'.format(b64)
embedded="<div style='height:24px; width:24px; background:MidnightBlue'><img src='{}'></div>".format(b64url)
HTML(embedded)
It is axiomatic that we can express anything with subject-predicate-object triples. Some cases, however, take a little more work than others. RDF triples easily represent the relational structures commonly used in SQL databases, but what about the nested and sequential structures that people expect out of NoSQL?
RDF provides us with a powerful tool for representing post-relational structures in the form of blank nodes. To understand them, let's take a look at one fact from the first XMP packet, the one that describes the "Task View" image.
pngOne[21:22]
Note that the right hand side (the object) is nodeID://b814605
, which is a reference to a blank node.
Identifiers which derive from a URI, such as <http://ns.adobe.com/xap/1.0/mm/>
refer to the same thing everywhere in the world, so long as the string content is the same. Blank nodes are different, in that blank nodes are specific to a particular triple store. Some other triple store could use that exact same name for a different purpose, or if I loaded the same data into Virtuoso again, it could wind up with a different name.
Because blank nodes are local to a triple store, the exact behavior of blank nodes differs between RDF databases. In particular, most triple stores have ways to refer to specific blank nodes, because even if they do not exist globally, it is still necessary to talk about them locally to copy graphs from one triple store to another.
Let's look at the properties of this blank node:
frame=endpoint.select("""
SELECT ?p ?o {
<nodeID://b814605> ?p ?o
}
""")
frame
<nodeID://b814605>
turns out to be an ordered list (an rdf:Seq). In this case it is a list of one member, which is itself a blank node. If there were multiple members, these would be indicated as rdf:_2, rdf:_3 and so forth.
This method of representing a list is analogous to the Java ArrayList, which represents a list as an array. An RDF tool could, pretty easily, use the number as an index into an array to find a value.
There are three kinds of RDF container: the rdf:Bag (unordered), the rdf:Seq (ordered), and the rdf:Alt (a set of alternatives).
RDF contains another mechanism, called a 'Collection' (or rdf:List), that represents ordered lists as does the LISP programming language or the Java LinkedList. These also involve blank nodes, but we'll avoid them for now since they are not used in XMP packets.
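To make the array analogy concrete, here is a small sketch that turns the properties of an rdf:Seq node into an ordinary Python list by reading the number out of each rdf:_n predicate. The helper name `seq_members` is made up for this example, and the input is assumed to be plain (predicate, object) string pairs such as one might pull out of the query result above.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def seq_members(properties):
    """Turn (predicate, object) pairs of an rdf:Seq node into an ordered list."""
    members = []
    for predicate, obj in properties:
        if predicate.startswith(RDF + "_"):
            index = predicate[len(RDF) + 1:]
            if index.isdigit():                 # rdf:_1, rdf:_2, ...
                members.append((int(index), obj))
    # Sort numerically so that rdf:_10 comes after rdf:_2.
    return [obj for _, obj in sorted(members, key=lambda m: m[0])]
```

Other properties of the node, such as its rdf:type, are simply ignored.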
To continue exploring, we can follow the properties that lead from the single member of the above list...
frame=endpoint.select("""
SELECT ?p ?o {
<nodeID://b814606> ?p ?o
}
""")
frame
... and we find that <nodeID://b814606>
is just a data record that is nested inside the larger data record, much as one would see in a JSON file. This record represents one moment in the history of the file, the moment it was created. If the file had a longer history, it would contain a list with a large number of records.
The analogy to JSON gets closer if we extract all the facts in the XMP packet.
The process to do this is exactly like what we just did above, following links from one blank node to the next. The peel
function from gastrodon does this, copying all of the facts into a Graph
object from the rdflib
library.
Here we do this, listing all of the namespaces which are used in this particular Graph; all of the namespace
g=endpoint.peel(URIRef("file:///C:/Windows/explorer.exe/0001w"))
list(g.namespaces())
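The traversal idea can be sketched over a plain list of triples. This is not gastrodon's actual implementation of peel, which works against the SPARQL endpoint and returns an rdflib Graph; the function below, its `is_blank` convention (names starting with `_:`), and the sample data in the usage note are all assumptions made for illustration.

```python
def peel(facts, start, is_blank=lambda node: node.startswith("_:")):
    """Collect all triples reachable from start by following blank-node objects."""
    collected = []
    todo = [start]
    seen = {start}
    while todo:
        subject = todo.pop()
        for s, p, o in facts:
            if s != subject:
                continue
            collected.append((s, p, o))
            # Only blank nodes are followed; named resources mark the boundary.
            if is_blank(o) and o not in seen:
                seen.add(o)
                todo.append(o)
    return collected
```

Called with a packet's URI as `start`, this would gather the packet's own facts plus those of any nested lists and records hanging off it, while leaving facts about other packets behind.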
The Graph object from rdflib is a little triple store, which accepts SPARQL queries the same way that OpenLink Virtuoso does. For instance, the following query confirms that the recorded end of the XMP packet is the start position plus the length in bytes:
def scalar(result):
    return list(result)[0][0].toPython()
scalar(g.query("""
SELECT ?sizesMatch {
?s o2:xmpStart ?start .
?s o2:xmpEnd ?end .
?s o2:xmpSize ?size .
BIND (?end-?start AS ?computedSize)
BIND (?size=?computedSize AS ?sizesMatch)
}
"""))
This points out a way in which RDF and SPARQL are unique. With a standard data model and query language, we can use the same data and queries on a small scale (in memory), medium scale (disk based), and large scale (distributed cluster)! With our data in a graph we can process it directly, for instance, iterating over all triples in the graph in order to make a list of all URIs that appear in it:
uris=set()
for fact in g.triples((None,None,None)):
    for node in fact:
        if isinstance(node,URIRef):
            uris.add(node)
uris
Finally, we can write out the packet in Turtle format, which is somewhere in appearance between JSON and the RDF/XML packet. Turtle is the most popular way to write RDF data by hand today.
ttl(g)