Looking For Metadata in All The Wrong Places¶

The XMP specification describes a system for embedding metadata into files of all kinds. Although part 3 of the specification defines specific ways to embed XMP packets into common file formats such as PNG, PDF, WAV and SVG, part 1 defines a method of embedding a chunk of RDF/XML with a unique pattern of bytes in the header that allows a simple routine to find XMP packets in any kind of file.

show_image("md-pipeline-2.png")

time: 3 ms

Such a naive XMP extractor, it turns out, finds a large number of XMP packets inside files where you wouldn't expect them, such as executables, ZIP files, JAR files, Office documents and more. Many of these files are composite files that incorporate a number of media files into a larger file, often uncompressed on the expectation that the content of these files is already compressed.

I scanned a Windows computer, which is heavily used for software development, creative work, games and other things, so it contains a wide variety of files. Each XMP packet was extracted from the containing document, tagged with a small amount of metadata tying it to its source, and was then inserted into OpenLink Virtuoso Open Source Edition, a triple store that supports SPARQL queries.

In this Jupyter notebook I'll introduce some of the tools I use to make reports based on RDF data, and introduce the basics of the widespread, if obscure, XMP format.

Summary¶

Common files such as ZIP files, Office Documents, PDF, Executable Programs and Libraries are wholly or in part a composition of smaller files that contain copious metadata that is normally invisible
XMP Metadata Packets, written in a first-generation dialect of RDF, are widespread in media files of many kinds
Although XMP was developed before the SPARQL Query Language, XMP packets can be copied into a Triple store and queried with next generation tools -- without any data transformation on import!
We begin a gentle introduction to XMP, RDF and accessing RDF data in a Jupyter notebook with the in-development Gastrodon toolkit.
Important files from Microsoft Windows are 20% XMP metadata, most of that being whitespace

Getting Started¶

%load_ext autotime
import sys
sys.path.append("/Users/paul_000/Documents/Github/gastrodon")
from gastrodon import Endpoint,QName,ttl,URIRef
import pandas as pd
pd.options.display.width=120
pd.options.display.max_colwidth=100

The autotime extension is already loaded. To reload it, use:
  %reload_ext autotime
time: 7 ms

RDF refers to entities and properties with URIs. For instance, the Dublin Core vocabulary uses the term <http://purl.org/dc/elements/1.1/creator> to describe the creator of a creative work. Rather than write out the full URI every time, we can make a declaration like

@prefix dc: <http://purl.org/dc/elements/1.1/> .

to define a "dc" namespace such that we can write dc:creator instead of the complete IRI.
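As a rough illustration of what a prefix declaration buys us, the sketch below expands a prefixed name into a full IRI with a plain dictionary. The PREFIXES table and the expand helper are hypothetical names made up for this example; they are not part of gastrodon or rdflib.

```python
# Hypothetical prefix table; only the "dc" entry comes from the text above.
PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
}

def expand(qname, prefixes=PREFIXES):
    """Expand a prefixed name such as 'dc:creator' into a full IRI."""
    prefix, _, local = qname.partition(":")
    if prefix not in prefixes:
        raise KeyError("unknown prefix: " + prefix)
    return prefixes[prefix] + local

print(expand("dc:creator"))
```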

The gastrodon library automatically handles namespaces and prefixes for us, but we need to load a list of namespaces to make that happen.

from rdflib import Graph
prefixes=Graph()
prefixes.load('/xmp/prefixes.ttl',format='ttl')
endpoint=Endpoint("http://127.0.0.1:8890/sparql/",prefixes)

time: 37.5 ms

What is the size of the data?¶

An RDF database contains a number of facts, represented as (subject,predicate,object) triples. The following query matches all facts in the database (because it has variables in all three positions of the matching pattern) and returns a count, roughly 5.3 million.

endpoint.select("""
    SELECT (COUNT(*) as ?cnt) {
        ?s ?p ?o .
    }
""")

time: 284 ms
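To make the idea of a matching pattern concrete, here is a minimal in-memory sketch, assuming facts are stored as plain Python tuples. The toy data and the match helper are invented for illustration and say nothing about how Virtuoso evaluates queries internally.

```python
# Toy data: three made-up facts, each a (subject, predicate, object) tuple.
facts = [
    ("packet1", "file", "a.png"),
    ("packet1", "xmpSize", 100),
    ("packet2", "file", "b.exe"),
]

def match(facts, s=None, p=None, o=None):
    """Yield facts matching the pattern; None plays the role of a variable."""
    for fact in facts:
        if all(want is None or want == got
               for want, got in zip((s, p, o), fact)):
            yield fact

print(len(list(match(facts))))            # pattern ?s ?p ?o matches every fact
print(len(list(match(facts, p="file"))))  # only facts with the "file" predicate
```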

Next we count facts that have the following predicate:

<http://rdf.ontology2.com/metadata/file>

Although this predicate is a URI, it is not necessary to fetch this URI in order to process this data. We've attached one file property to each XMP record to record which file we found it in, thus we count 54,652 XMP packets.

endpoint.select("""
    SELECT (COUNT(*) AS ?cnt) {
        ?s <http://rdf.ontology2.com/metadata/file> ?o
    }
""")

time: 19 ms

We can access that exact same predicate with

o2:file

because Gastrodon keeps a list of namespaces and automatically prepends

prefix o2: <http://rdf.ontology2.com/metadata/>

to the query. (It knows this because the namespace was declared in the prefixes file we loaded above.)

We can count distinct values for the file names in SPARQL the same way we would in SQL, thus finding that there are 24,228 files with XMP packets.

endpoint.select("""
    SELECT (COUNT(DISTINCT ?file) AS ?cnt) {
        ?s o2:file ?file
    }
""")

time: 86.5 ms

We sum the sizes of all XMP packets and discover 638 MB of data on a hard drive with 425 GB of used space, amounting to about 0.15% of all space in use.

endpoint.select("""
    SELECT (SUM(?size)/1000000.0 as ?cnt) {
        ?s o2:xmpSize ?size .
    }
""")

time: 26 ms
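The percentage quoted above can be checked with a couple of lines of arithmetic. This assumes decimal units (1 MB = 10^6 bytes, matching the division by 1000000.0 in the query, and 1 GB = 10^9 bytes); the variable names are made up for the example.

```python
xmp_bytes = 638e6    # 638 MB of XMP packets, as reported by the query
used_bytes = 425e9   # 425 GB of used disk space
share = 100.0 * xmp_bytes / used_bytes
print(round(share, 2))
```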

What does the data look like?¶

As a warm-up, let's look for files with very short names (otherwise they might be awkward to fit in the table below) and count how many XMP packets they contain.

We do this with a simple SPARQL query that queries over the o2:file property which connects the XMP packet to the filename it was extracted from. Just like we would in SQL, we GROUP and ORDER the results to count the number of packets they contain.

endpoint.select("""
    SELECT ?file (COUNT(*) AS ?cnt) {
        ?s o2:file ?file
    } GROUP BY ?file ORDER BY STRLEN(?file) LIMIT 10
""")

time: 111 ms
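For readers more at home in Python than in SQL or SPARQL, the grouping, ordering and limiting steps can be mimicked on an in-memory list. This is only a sketch over made-up (packet, file) pairs, not the data in the triple store.

```python
from collections import Counter

# Made-up (packet, file) pairs standing in for the o2:file facts.
pairs = [
    ("p1", "C:/a.exe"),
    ("p2", "C:/a.exe"),
    ("p3", "C:/longer/name.dll"),
]

counts = Counter(file for _packet, file in pairs)         # GROUP BY ?file with COUNT(*)
rows = sorted(counts.items(), key=lambda kv: len(kv[0]))  # ORDER BY STRLEN(?file)
print(rows[:10])                                          # LIMIT 10
```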

Explorer.exe is called the File Explorer by Microsoft, but in addition to the folder browser, it is also responsible for the start menu, task bar, and other functions. If you use Windows, you use it every day.

The following query is as simple as a SPARQL query gets. This query matches triples that share a specific predicate and object (value) and returns the associated subjects.

frame=endpoint.select("""
    SELECT ?s {
        ?s o2:file "C:/Windows/explorer.exe"
    }
""")

frame

time: 24.5 ms

We get a list of file: URIs, each of which is the name of an XMP packet, which we generated by appending a number and letter to the filename.

To get some idea of the content of one packet, we'll look at the facts referencing the first packet.

pngOne=endpoint.select("""
    SELECT ?p ?o {
        <file:///C:/Windows/explorer.exe/0001w> ?p ?o
    }
""")
pngOne

time: 19 ms

Note that the properties in the o2 namespace are properties I added to identify the packets, as opposed to all of the other properties which were copied verbatim from the XMP packet. Other properties come from Dublin Core, XMP and Photoshop-specific namespaces, as well as namespaces such as exif and tiff which contain industry-standard terminology despite being on an Adobe URI.

Part 2 of the XMP specification describes a number of namespaces of types and properties to use with XMP. Despite that, XMP users are free to use any RDF vocabulary they like so long as they comply with certain conventions described in Part 1.

Let's look first at the 'o2' properties because these provide a map for finding and describing XMP packets.

pngOne[1:8]

time: 11 ms

Since all of these properties have a 1-to-1 relationship with an XMP packet, we can write a query which works like a typical SQL query, where all the patterns share the same subject and we get exactly one row per packet. This table shows the length of all the packets, whether they allow read or write access, and their exact location in the file.

packets=endpoint.select("""
    SELECT ?packetNumber ?rwFlag ?xmpStart ?xmpEnd ?xmpSize {
        ?s o2:file "C:/Windows/explorer.exe" .
        ?s o2:packetNumber ?packetNumber .
        ?s o2:rwFlag ?rwFlag .
        ?s o2:xmpStart ?xmpStart .
        ?s o2:xmpEnd ?xmpEnd .
        ?s o2:xmpSize ?xmpSize .
    }
""")
packets

time: 72.5 ms
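The reason this query yields exactly one row per packet can be seen by pivoting triples by hand. The sketch below uses made-up triples and assumes, as in the text, that every property occurs once per subject.

```python
from collections import defaultdict

# Made-up triples; each packet has exactly one value per property.
triples = [
    ("pkt1", "packetNumber", 1), ("pkt1", "xmpSize", 120),
    ("pkt2", "packetNumber", 2), ("pkt2", "xmpSize", 80),
]

rows = defaultdict(dict)
for s, p, o in triples:
    rows[s][p] = o   # one column per property, one row per subject

table = [rows[s] for s in sorted(rows)]
print(table)
```

If a property could occur more than once per subject, the SPARQL query would return one row per combination of values, whereas this dictionary-based pivot would silently keep only the last value.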

We already see some things that are highly suspicious; if we just look at the first five, for instance, we see that they all have exactly the same size.

endpoint.select("""
    SELECT ?packetNumber ?rwFlag ?xmpStart ?xmpEnd ?xmpSize {
        ?s o2:file "C:/Windows/explorer.exe" ;
           o2:packetNumber ?packetNumber ;
           o2:rwFlag ?rwFlag ;
           o2:xmpStart ?xmpStart ;
           o2:xmpEnd ?xmpEnd ;
           o2:xmpSize ?xmpSize .
    } LIMIT 5
""")

time: 27 ms

The next SPARQL query sums up the total size of all the packets contained in explorer.exe and we find, amazingly, that the XMP packets add up to 20% of the total file size!

sumSize=endpoint.select("""
    SELECT (SUM(?xmpSize) AS ?xmpBytes) {
        ?s o2:file "C:/Windows/explorer.exe" ;
           o2:xmpSize ?xmpSize .
    }
""")
sumSize

time: 12 ms

sumSize.at[0,'xmpBytes']

975131

time: 8.5 ms

pngOne.at[2,'o']

4847928

time: 15 ms

100.0*sumSize.at[0,'xmpBytes']/pngOne.at[2,'o']

20.114387012348367

time: 12.5 ms

Viewing the raw packet¶

Knowing the location in the file, we can slice out a range of bytes and thus see the actual XMP packet:

fname="C:/Windows/explorer.exe"
offset=packets.at[0,"xmpStart"]
size=packets.at[0,"xmpSize"]
(fname,offset,size)

('C:/Windows/explorer.exe', 3217844, 14331)

time: 10.5 ms

def getslice(fname,offset,size):
    with open(fname,"rb") as file:
        file.seek(offset)
        return file.read(size)

time: 13 ms

rawpacket=getslice(fname,offset,size).decode("utf-8")
print(rawpacket.rstrip())

<?xpacket begin="﻿" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.6-c014 79.156797, 2014/08/20-09:53:02        ">
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about=""
            xmlns:xmp="http://ns.adobe.com/xap/1.0/"
            xmlns:dc="http://purl.org/dc/elements/1.1/"
            xmlns:photoshop="http://ns.adobe.com/photoshop/1.0/"
            xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
            xmlns:stEvt="http://ns.adobe.com/xap/1.0/sType/ResourceEvent#"
            xmlns:tiff="http://ns.adobe.com/tiff/1.0/"
            xmlns:exif="http://ns.adobe.com/exif/1.0/">
         <xmp:CreatorTool>Adobe Photoshop CC 2014 (Windows)</xmp:CreatorTool>
         <xmp:CreateDate>2015-05-04T15:40:01-07:00</xmp:CreateDate>
         <xmp:ModifyDate>2015-05-05T10:55:34-07:00</xmp:ModifyDate>
         <xmp:MetadataDate>2015-05-05T10:55:34-07:00</xmp:MetadataDate>
         <dc:format>image/png</dc:format>
         <photoshop:ColorMode>3</photoshop:ColorMode>
         <xmpMM:InstanceID>xmp.iid:656c488a-1b92-8a4c-b879-5a834d7f5164</xmpMM:InstanceID>
         <xmpMM:DocumentID>xmp.did:656c488a-1b92-8a4c-b879-5a834d7f5164</xmpMM:DocumentID>
         <xmpMM:OriginalDocumentID>xmp.did:656c488a-1b92-8a4c-b879-5a834d7f5164</xmpMM:OriginalDocumentID>
         <xmpMM:History>
            <rdf:Seq>
               <rdf:li rdf:parseType="Resource">
                  <stEvt:action>created</stEvt:action>
                  <stEvt:instanceID>xmp.iid:656c488a-1b92-8a4c-b879-5a834d7f5164</stEvt:instanceID>
                  <stEvt:when>2015-05-04T15:40:01-07:00</stEvt:when>
                  <stEvt:softwareAgent>Adobe Photoshop CC 2014 (Windows)</stEvt:softwareAgent>
               </rdf:li>
            </rdf:Seq>
         </xmpMM:History>
         <tiff:Orientation>1</tiff:Orientation>
         <tiff:XResolution>720000/10000</tiff:XResolution>
         <tiff:YResolution>720000/10000</tiff:YResolution>
         <tiff:ResolutionUnit>2</tiff:ResolutionUnit>
         <exif:ColorSpace>65535</exif:ColorSpace>
         <exif:PixelXDimension>24</exif:PixelXDimension>
         <exif:PixelYDimension>24</exif:PixelYDimension>
      </rdf:Description>
   </rdf:RDF>
</x:xmpmeta>
time: 11 ms

Note in particular the snippet at the beginning that reads

<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>

This snippet contains a special sequence of characters that is unlikely to appear anywhere else, so a program can scan for it to find XMP packets. The signature does not depend on the structure of the file, so a simple reader can find XMP packets in any kind of file, or even on a raw disk volume. Some tools for undeleting photos from flash cards locate media files from their embedded XMP packets, extracting the deleted files the way that we're about to extract a PNG file from Explorer.exe.

The packet itself is written in RDF/XML which was the first RDF serialization format, with just a few constraints that are set in Part 1.
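A naive scanner of the kind described here might look like the sketch below. It is not the extractor used to build this dataset (that code is not shown); it assumes an 8-bit encoding such as UTF-8, and it reports each packet from the start of the header to the end of the <?xpacket end=...?> trailer, which is a convention chosen for this example. The names HEADER, TRAILER and find_xmp_packets are invented.

```python
import re

# The id value is the fixed signature shown in the header above.
HEADER = re.compile(
    rb'<\?xpacket begin=["\'][^"\']{0,4}["\'] id=["\']W5M0MpCehiHzreSzNTczkc9d["\']\?>')
TRAILER = re.compile(rb'<\?xpacket end=["\']([rw])["\']\?>')

def find_xmp_packets(data):
    """Yield (start, end, rw_flag) for each XMP packet found in a bytes object."""
    pos = 0
    while True:
        head = HEADER.search(data, pos)
        if head is None:
            return
        tail = TRAILER.search(data, head.end())
        if tail is None:
            return
        yield head.start(), tail.end(), tail.group(1).decode("ascii")
        pos = tail.end()

# A made-up byte string with one packet buried between unrelated bytes.
sample = (b"\x00junk"
          + b'<?xpacket begin="\xef\xbb\xbf" id="W5M0MpCehiHzreSzNTczkc9d"?>'
          + b"<x:xmpmeta/>   "
          + b'<?xpacket end="w"?>'
          + b"more junk")
found = list(find_xmp_packets(sample))
print(found)
```

Because the scanner never looks at the container format, the same function works on an executable, a ZIP file or a disk image, at the cost of missing packets stored in compressed streams.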

Note that the XMP packet is followed by a whopping 12,049 characters of whitespace (the expression below reports 12,050 because it also counts the final '>'), as opposed to only about 2,280 characters of XML!

len(rawpacket)-rawpacket.rfind('>')

12050

time: 13 ms

all(map(lambda x:x.isspace(),rawpacket[-12049:]))

True

time: 13 ms

Viewing the Embedded Image¶

This packet claims to be part of a PNG file, so let's see if we can extract the whole image. To start, we know that every PNG file starts with an eight-byte sequence, so we can search backwards for the last occurrence of this sequence before the XMP packet.

PNGSTART=bytes.fromhex("89504E470D0A1A0A")
PNGSTART

b'\x89PNG\r\n\x1a\n'

time: 9 ms

!pip install bitstring
from bitstring import ConstBitStream

Requirement already satisfied: bitstring in c:\users\paul_000\anaconda3\lib\site-packages
time: 2.18 s

x=ConstBitStream(filename='C:/Windows/explorer.exe')

time: 1.5 ms

x.pos=x.rfind(PNGSTART,end=offset*8,bytealigned=True)[0]

time: 9 ms

We can then move forward 8 bytes to skip past the header:

begin=x.bytepos
x.bytepos += 8

time: 10 ms

The next problem is finding the end of the PNG file; although there is no indication of the exact length of the PNG file, a PNG file consists of a number of chunks, each of which has a chunk type, length, and checksum. The last chunk is called "IEND", and once we've read it, we're at the end of the file.

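def readchunks(x:ConstBitStream):
    # Walk the PNG chunks from the current position, yielding (type, data offset, length)
    while True:
        length=x.read("uintbe:32")                  # 4-byte big-endian payload length
        chunkId=x.read("bytes:4").decode("utf-8")   # 4-byte chunk type, e.g. IHDR
        start=x.bytepos                             # offset where the payload begins
        x.bytepos += length                         # skip the payload
        x.bytepos += 4                              # skip the CRC
        if chunkId=="IEND":                         # IEND is consumed but not yielded
            break
        yield chunkId,start,length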

time: 14 ms

Here we see all the chunks that precede IEND, including the iTXt chunk which contains the XMP packet.

list(readchunks(x))

[('IHDR', 3217776, 13),
 ('pHYs', 3217801, 9),
 ('iTXt', 3217822, 14372),
 ('cHRM', 3232206, 32),
 ('IDAT', 3232250, 94)]

time: 15.5 ms
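The same chunk walk can be written with only the standard library. The sketch below is an alternative to the bitstring version, not a replacement for it: it works on a bytes object, also checks each chunk's CRC, and yields the IEND chunk too. The toy stream is structurally valid but made up (its IDAT payload is not real image data), and make_chunk and iter_chunks are hypothetical helper names.

```python
import struct
import zlib

PNG_SIGNATURE = bytes.fromhex("89504E470D0A1A0A")

def make_chunk(chunk_type, payload):
    """Build one chunk: length, type, payload, CRC over type + payload."""
    crc = zlib.crc32(chunk_type + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + chunk_type + payload + struct.pack(">I", crc)

def iter_chunks(data):
    """Yield (type, payload_offset, length) for every chunk, IEND included."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG stream")
    pos = 8
    while pos < len(data):
        length, chunk_type = struct.unpack_from(">I4s", data, pos)
        payload = data[pos + 8 : pos + 8 + length]
        (crc,) = struct.unpack_from(">I", data, pos + 8 + length)
        if crc != zlib.crc32(chunk_type + payload) & 0xFFFFFFFF:
            raise ValueError("bad CRC in chunk %r" % chunk_type)
        yield chunk_type.decode("ascii"), pos + 8, length
        pos += 12 + length   # 4 length + 4 type + payload + 4 CRC
        if chunk_type == b"IEND":
            return

toy = (PNG_SIGNATURE
       + make_chunk(b"IHDR", bytes(13))
       + make_chunk(b"IDAT", b"abc")
       + make_chunk(b"IEND", b""))
print(list(iter_chunks(toy)))
```

Note that the IHDR payload starts 16 bytes after the signature begins, which matches the gap between the start of the image and the first chunk offset in the listing above.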

The cursor on the BitStream is now at the end of the PNG file, so we're ready to extract the image file once we store the end position:

end=x.bytepos
(begin,offset,end)

(3217760, 3217844, 3232360)

time: 15 ms

imagedata=x[begin*8:end*8].bytes

time: 7.5 ms

The PNG file has a total length of 14,600 bytes, of which 14,372 bytes are the iTXt chunk that carries the XMP packet, and roughly 12,050 bytes of that are whitespace. If you think that's outrageous, you should see what happens next...

len(imagedata)

14600

time: 15 ms

In principle it is pretty easy to display an image in a Jupyter notebook, but when we try it the first time, we don't see anything at all:

from IPython.display import display_png,display_html,display,HTML
display_png(imagedata,raw=True)

time: 11.5 ms

My hunch when I saw that was that I was looking at white lines on a transparent background; I looked at it in Photoshop, confirmed I was right, and then worked out a way to display an image with a custom background. The trick is that you can embed an image, data and all, in a URI in base64 and in turn embed it in HTML.

from base64 import b64encode

time: 11 ms

b64=b64encode(imagedata).decode("ascii")
b64url='data:image/png;base64,{}'.format(b64)
embedded="<div style='height:24px; width:24px; background:MidnightBlue'><img src='{}'></div>".format(b64url)
HTML(embedded)

time: 17.5 ms

This image is one you've seen if you use Windows 10, as it is part of the taskbar, which is provided by Explorer.exe; it's clear now why this image is white-on-transparent:

def show_image(filename):
    with open(filename,"rb") as f:
        image=f.read()
        display_png(image,raw=True)
        
show_image("taskbar.png")

time: 8 ms

What's going on?¶

At this point you might think you're seeing the kind of madness you can get from two, not just one, billion dollar software companies.

Actually it makes a little more sense than it looks, as shown by the diagram below. The PNG file consists of a number of "chunks", one of which is an iTXt chunk, which is designed to hold UTF-8 text and here holds it uncompressed. The PNG file was written with a large amount of whitespace padding at the end of the XMP packet so that an XMP editor could later add a few kilobytes of metadata without having to modify the entire file.

This looks absurd in the case of a very simple image which contains just 94 bytes of compressed data; it would seem even more absurd if one were using the image on the web or in a mobile app. It makes more sense in a media workflow where people are working with large files: for instance, RAW camera files are frequently 20 MB or more in size, and 12 kB of whitespace is a small price to pay, in that context, for an application like Adobe Lightroom or Microsoft Photos to update the metadata without needing to rewrite the entire file.
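The padding trick can be sketched in a few lines: as long as the new metadata fits, the editor overwrites the packet and refills the remainder with whitespace, so the packet's byte length, and therefore every offset after it, stays the same. The function name and sample bytes below are made up; a real editor working on a PNG would also have to recompute the CRC of the iTXt chunk, which this sketch ignores.

```python
def rewrite_in_place(packet, new_xml):
    """Return a packet of the same byte length holding new_xml plus padding."""
    if len(new_xml) > len(packet):
        raise ValueError("new metadata does not fit in the padded packet")
    return new_xml + b" " * (len(packet) - len(new_xml))

old = b"<x:xmpmeta>short</x:xmpmeta>" + b" " * 100
new = rewrite_in_place(old, b"<x:xmpmeta>a somewhat longer description</x:xmpmeta>")
print(len(old) == len(new))
```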

In the process of building the Explorer.exe binary, the compiler and linker consolidate fragments of code, data, and many kinds of files into a single file. The compiler adds metadata to the binary that the executable uses to find resources by name just as it would find a subroutine by name:

show_image("embedding.png")

time: 13.5 ms

Note there is no "metadata master plan" going on here! XMP data "rides along" in the iTXt chunk because most PNG tools ignore it, keeping metadata together with the image, as designed. What we find, however, is that there are also PNG files embedded in Explorer.exe without any XMP data, for instance, the first PNG file starts at a byte position

x.pos=x.find(PNGSTART,bytealigned=True)[0]
x.bytepos

2556456

time: 30.5 ms

that is long before the first XMP packet, which occurs at:

packets.at[0,"xmpStart"]

3217844

time: 13 ms

In fact, we spot 260 PNG headers in Explorer.exe, more than the 79 XMP packets contained in the file. There is no plan for managing the metadata for the files embedded in this executable, yet large amounts of whitespace bulk up this executable by more than 20%.

Quick note: The situation is not different on Linux, macOS, or other operating systems because all executable formats have some way to embed images. What publishers should do is remove unnecessary metadata before publishing, which is easy to do in the case of the PNG because we can simply omit the iTXt chunk.

len(list(x.findall(PNGSTART,bytealigned=True)))

260

time: 45.5 ms
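Omitting the iTXt chunk could look roughly like the following, assuming the whole PNG is available as a bytes object. This is a sketch, not a hardened tool: it drops every iTXt chunk, including any that hold text other than XMP, and the toy stream is made up (its iTXt payload is only a stand-in for a padded packet). strip_chunks and chunk are hypothetical helper names.

```python
import struct
import zlib

PNG_SIGNATURE = bytes.fromhex("89504E470D0A1A0A")

def strip_chunks(png, drop=(b"iTXt",)):
    """Return a copy of a PNG byte string without the chunk types in `drop`."""
    if png[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG stream")
    out = [png[:8]]
    pos = 8
    while pos < len(png):
        length, chunk_type = struct.unpack_from(">I4s", png, pos)
        end = pos + 12 + length   # 4 length + 4 type + payload + 4 CRC
        if chunk_type not in drop:
            out.append(png[pos:end])
        pos = end
    return b"".join(out)

def chunk(chunk_type, payload):
    """Helper that builds one chunk for the toy stream below."""
    crc = zlib.crc32(chunk_type + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + chunk_type + payload + struct.pack(">I", crc)

toy = (PNG_SIGNATURE
       + chunk(b"IHDR", bytes(13))
       + chunk(b"iTXt", b"XML:com.adobe.xmp" + b" " * 1000)
       + chunk(b"IEND", b""))
slim = strip_chunks(toy)
print(len(toy), len(slim))
```

Since each remaining chunk is copied byte for byte together with its own CRC, nothing needs to be recomputed after the removal.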

Are all of the metadata packets for PNG files?¶

Next we'd like to see what kinds of files are inside Explorer.exe. Here we do another GROUP BY query, but now we use the OPTIONAL clause, because without it the pattern would not match when the dc:format property is not specified. With OPTIONAL specified, the ?format variable is set to None when that happens.

endpoint.select("""
    SELECT ?format (COUNT(*) as ?cnt) {
        ?s o2:file "C:/Windows/explorer.exe" ;
           o2:packetNumber ?packetNumber ;
           o2:xmpSize ?xmpSize .
        OPTIONAL { ?s dc:format ?format . }
    } GROUP BY ?format
""")

time: 25 ms

At this point we can use the MINUS clause in SPARQL to find cases where dc:format is not specified, as MINUS matches only if the pattern inside of it does not match, producing a list of the mysterious files. Note that in this case, all of the XMP packets are relatively short.

endpoint.select("""
    SELECT ?packetNumber ?flag ?xmpSize {
        ?s o2:file "C:/Windows/explorer.exe" ;
           o2:packetNumber ?packetNumber ;
           o2:rwFlag ?flag ;
           o2:xmpSize ?xmpSize .
        MINUS { ?s dc:format ?format . }
    }
""")

time: 19 ms
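The difference between OPTIONAL and MINUS can be loosely mirrored in plain Python. The sketch below uses two made-up packets, one without a format; it glosses over the finer points of SPARQL semantics and is only meant to show which rows survive in each case.

```python
# Made-up packets; the second one has no dc:format value.
packets_demo = {
    "pkt1": {"xmpSize": 14331, "format": "image/png"},
    "pkt2": {"xmpSize": 857},
}

# OPTIONAL: keep every packet, with None where the format is missing.
optional_rows = [(name, props.get("format")) for name, props in packets_demo.items()]

# MINUS: keep only the packets for which the format pattern would NOT match.
minus_rows = [name for name, props in packets_demo.items() if "format" not in props]

print(optional_rows)
print(minus_rows)
```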

Looking at the facts in packet number 57 we don't see anything too unusual, but note that the file type is left unspecified.

endpoint.select("""
    SELECT ?p ?o {
        ?s o2:file "C:/Windows/explorer.exe" .
        ?s o2:packetNumber 57 .
        ?s ?p ?o .
    }
""")

time: 16 ms

Conjecturing that it is just another PNG file, I use the same PNG extractor as for the image before and find that it is just the logo for Cortana, which also appears in the snapshot. Although both of these images were created with Adobe Photoshop CC 2014 (Windows), it seems that they were not saved with a consistent set of properties.

x.pos=x.rfind(PNGSTART,end=4082948*8,bytealigned=True)[0]
begin=x.bytepos
x.bytepos += 8

time: 1.5 ms

list(readchunks(x))

[('IHDR', 4082864, 13),
 ('tEXt', 4082889, 25),
 ('iTXt', 4082926, 878),
 ('IDAT', 4083816, 593)]

time: 14 ms

end=x.bytepos
(begin,end)

(4082848, 4084425)

time: 11 ms

imagedata=x[begin*8:end*8].bytes

time: 8.5 ms

... and you know it just happens to be the logo for Cortana, Microsoft's voice agent:

b64=b64encode(imagedata).decode("ascii")
b64url='data:image/png;base64,{}'.format(b64)
embedded="<div style='height:24px; width:24px; background:MidnightBlue'><img src='{}'></div>".format(b64url)
HTML(embedded)

time: 19 ms

Blank Nodes: beyond triples¶

It is axiomatic that we can express anything with subject-predicate-object triples. Some cases, however, take a little more work than others. RDF triples easily represent the relational structures commonly used in SQL databases, but what about the nested and sequential structures that people expect out of NoSQL?

RDF provides us with a powerful tool for representing post-relational structures in the form of blank nodes. To understand them, let's take a look at one fact from the first XMP packet, the one that describes the "Task View" image.

pngOne[21:22]

time: 21.5 ms

Note that the right-hand side (the object) is nodeID://b814605, which is a reference to a blank node.

Identifiers which derive from a URI, such as <http://ns.adobe.com/xap/1.0/mm/>, refer to the same thing everywhere in the world, so long as the string content is the same. Blank nodes are different, in that blank nodes are specific to a particular triple store. Some other triple store could use that exact same name for a different purpose, or if I loaded the same data into Virtuoso again, it could wind up with a different name.

Because blank nodes are local to a triple store, the exact behavior of blank nodes differs between RDF databases. In particular, most triple stores have ways to refer to specific blank nodes, because even if they do not exist globally, it is still necessary to talk about them locally to copy graphs from one triple store to another.

Let's look at the properties of this blank node:

frame=endpoint.select("""
    SELECT ?p ?o {
        <nodeID://b814605> ?p ?o
    }
""")
frame

time: 35.5 ms

<nodeID://b814605> turns out to be an ordered list (an rdf:Seq). In this case it is a list of one member, which is itself a blank node. If there were multiple members, these would be indicated as rdf:_2, rdf:_3 and so forth.

This method of representing a list is analogous to the Java ArrayList which represents a list as an array. An RDF tool could, pretty easily, use the number as an index into an array to find a value.
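A minimal sketch of that idea, assuming the facts are available as plain (subject, predicate, object) tuples: the numbered rdf:_n predicates are turned into positions and the members are returned in order. The seq_to_list helper and the sample facts are invented for illustration.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def seq_to_list(triples, container):
    """Collect the rdf:_1, rdf:_2, ... members of a container, in index order."""
    members = []
    for s, p, o in triples:
        if s == container and p.startswith(RDF + "_"):
            index = p[len(RDF) + 1:]
            if index.isdigit():
                members.append((int(index), o))
    return [o for _index, o in sorted(members, key=lambda m: m[0])]

# Made-up facts, deliberately listed out of order.
seq_facts = [
    ("_:b1", RDF + "_2", "second"),
    ("_:b1", RDF + "type", RDF + "Seq"),
    ("_:b1", RDF + "_1", "first"),
]
print(seq_to_list(seq_facts, "_:b1"))
```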

There are three kinds of RDF container:

rdf:Seq an ordered list
rdf:Alt an alternative list (of which the data consumer is intended to pick one value; a list of names in various languages could be an example)
rdf:Bag an unordered collection, in which the order of members carries no meaning

RDF contains another mechanism, called a 'Collection' (or rdf:List), that represents ordered lists as does the LISP programming language or the Java LinkedList. These also involve blank nodes, but we'll avoid them for now since they are not used in XMP packets.

To continue exploring, we can follow the properties that lead from the single member of the above list...

frame=endpoint.select("""
    SELECT ?p ?o {
        <nodeID://b814606> ?p ?o
    }
""")
frame

time: 23.5 ms

... and we find that <nodeID://b814606> is just a data record that is nested inside the larger data record, much as one would see in a JSON file. This record represents one moment in the history of the file, the moment it was created. If the file had a longer history, it would contain a list with a large number of records.

The analogy to JSON gets closer if we extract all the facts in the XMP packet.

The process to do this is exactly like what we just did above, following links from one blank node to the next. The peel function from gastrodon does this, copying all of the facts into a Graph object from the rdflib library.

Here we do this, listing all of the namespaces which are used in this particular Graph:

g=endpoint.peel(URIRef("file:///C:/Windows/explorer.exe/0001w"))
list(g.namespaces())

[('xml', rdflib.term.URIRef('http://www.w3.org/XML/1998/namespace')),
 ('rdf', rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#')),
 ('rdfs', rdflib.term.URIRef('http://www.w3.org/2000/01/rdf-schema#')),
 ('xsd', rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#')),
 ('xmp', rdflib.term.URIRef('http://ns.adobe.com/xap/1.0/')),
 ('xap', rdflib.term.URIRef('http://ns.adobe.com/xap/1.0/')),
 ('photoshop', rdflib.term.URIRef('http://ns.adobe.com/photoshop/1.0/')),
 ('xmpMM', rdflib.term.URIRef('http://ns.adobe.com/xap/1.0/mm/')),
 ('exif', rdflib.term.URIRef('http://ns.adobe.com/exif/1.0/')),
 ('tiff', rdflib.term.URIRef('http://ns.adobe.com/tiff/1.0/')),
 ('stEvt',
  rdflib.term.URIRef('http://ns.adobe.com/xap/1.0/sType/ResourceEvent#')),
 ('xapMM', rdflib.term.URIRef('http://ns.adobe.com/xap/1.0/mm/')),
 ('dc', rdflib.term.URIRef('http://purl.org/dc/elements/1.1/')),
 ('o2', rdflib.term.URIRef('http://rdf.ontology2.com/metadata/'))]

time: 33.5 ms
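The traversal behind peel can be approximated in a few lines. The sketch below is not gastrodon's implementation; it works on a made-up list of tuples and assumes, purely for the example, that blank nodes are strings beginning with "_:".

```python
def peel_sketch(triples, start):
    """Collect all facts about `start` plus, recursively, any blank nodes it points to."""
    seen, todo, result = set(), [start], []
    while todo:
        node = todo.pop()
        if node in seen:
            continue
        seen.add(node)
        for s, p, o in triples:
            if s == node:
                result.append((s, p, o))
                # Assumption: blank nodes are strings starting with "_:".
                if isinstance(o, str) and o.startswith("_:"):
                    todo.append(o)
    return result

peel_facts = [
    ("file:///x/0001w", "xmpMM:History", "_:b1"),
    ("_:b1", "rdf:_1", "_:b2"),
    ("_:b2", "stEvt:action", "created"),
    ("file:///x/0002w", "dc:format", "image/png"),   # a different packet, not reached
]
print(len(peel_sketch(peel_facts, "file:///x/0001w")))
```

Stopping at named (URI) nodes is what keeps the result confined to a single packet instead of pulling in everything reachable in the store.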

The Graph object from rdflib is a little triple store, which accepts SPARQL queries the same way that OpenLink Virtuoso does. For instance, the following query confirms that the recorded end of the XMP packet is the start position plus the length in bytes:

def scalar(result):
    return list(result)[0][0].toPython()

scalar(g.query("""
   SELECT ?matches {
      ?s o2:xmpStart ?start .
      ?s o2:xmpEnd ?end .
      ?s o2:xmpSize ?size .
      BIND (?end-?start AS ?computedSize)
      BIND (?size=?computedSize AS ?matches)
   }
"""))

True

time: 2.76 s

This points out a way in which RDF and SPARQL are unique. With a standard data model and query language, we can use the same data and queries on a small scale (in memory), medium scale (disk based), and large scale (distributed cluster)! With our data in a graph we can process it directly, for instance, iterating over all triples in the graph in order to make a list of all URIs that appear in it:

uris=set()
for fact in g.triples((None,None,None)):
    for node in fact:
        if isinstance(node,URIRef):
            uris.add(node)

uris

{rdflib.term.URIRef('file:///C:/Windows/explorer.exe/0001w'),
 rdflib.term.URIRef('http://ns.adobe.com/exif/1.0/ColorSpace'),
 rdflib.term.URIRef('http://ns.adobe.com/exif/1.0/PixelXDimension'),
 rdflib.term.URIRef('http://ns.adobe.com/exif/1.0/PixelYDimension'),
 rdflib.term.URIRef('http://ns.adobe.com/photoshop/1.0/ColorMode'),
 rdflib.term.URIRef('http://ns.adobe.com/tiff/1.0/Orientation'),
 rdflib.term.URIRef('http://ns.adobe.com/tiff/1.0/ResolutionUnit'),
 rdflib.term.URIRef('http://ns.adobe.com/tiff/1.0/XResolution'),
 rdflib.term.URIRef('http://ns.adobe.com/tiff/1.0/YResolution'),
 rdflib.term.URIRef('http://ns.adobe.com/xap/1.0/CreateDate'),
 rdflib.term.URIRef('http://ns.adobe.com/xap/1.0/CreatorTool'),
 rdflib.term.URIRef('http://ns.adobe.com/xap/1.0/MetadataDate'),
 rdflib.term.URIRef('http://ns.adobe.com/xap/1.0/ModifyDate'),
 rdflib.term.URIRef('http://ns.adobe.com/xap/1.0/mm/DocumentID'),
 rdflib.term.URIRef('http://ns.adobe.com/xap/1.0/mm/History'),
 rdflib.term.URIRef('http://ns.adobe.com/xap/1.0/mm/InstanceID'),
 rdflib.term.URIRef('http://ns.adobe.com/xap/1.0/mm/OriginalDocumentID'),
 rdflib.term.URIRef('http://ns.adobe.com/xap/1.0/sType/ResourceEvent#action'),
 rdflib.term.URIRef('http://ns.adobe.com/xap/1.0/sType/ResourceEvent#instanceID'),
 rdflib.term.URIRef('http://ns.adobe.com/xap/1.0/sType/ResourceEvent#softwareAgent'),
 rdflib.term.URIRef('http://ns.adobe.com/xap/1.0/sType/ResourceEvent#when'),
 rdflib.term.URIRef('http://purl.org/dc/elements/1.1/format'),
 rdflib.term.URIRef('http://rdf.ontology2.com/metadata/file'),
 rdflib.term.URIRef('http://rdf.ontology2.com/metadata/fileLength'),
 rdflib.term.URIRef('http://rdf.ontology2.com/metadata/packetNumber'),
 rdflib.term.URIRef('http://rdf.ontology2.com/metadata/rwFlag'),
 rdflib.term.URIRef('http://rdf.ontology2.com/metadata/xmpEnd'),
 rdflib.term.URIRef('http://rdf.ontology2.com/metadata/xmpSize'),
 rdflib.term.URIRef('http://rdf.ontology2.com/metadata/xmpStart'),
 rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#Seq'),
 rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#_1'),
 rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type')}

time: 6.5 ms

Finally, we can write out the packet in Turtle format, which is somewhere in appearance between JSON and the RDF/XML packet. Turtle is the most popular way to write RDF data by hand today.

ttl(g)

@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix exif: <http://ns.adobe.com/exif/1.0/> .
@prefix o2: <http://rdf.ontology2.com/metadata/> .
@prefix photoshop: <http://ns.adobe.com/photoshop/1.0/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix stEvt: <http://ns.adobe.com/xap/1.0/sType/ResourceEvent#> .
@prefix tiff: <http://ns.adobe.com/tiff/1.0/> .
@prefix xap: <http://ns.adobe.com/xap/1.0/> .
@prefix xapMM: <http://ns.adobe.com/xap/1.0/mm/> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xmp: <http://ns.adobe.com/xap/1.0/> .
@prefix xmpMM: <http://ns.adobe.com/xap/1.0/mm/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .


<file:///C:/Windows/explorer.exe/0001w> exif:ColorSpace "65535" ;
    exif:PixelXDimension "24" ;
    exif:PixelYDimension "24" ;
    photoshop:ColorMode "3" ;
    tiff:Orientation "1" ;
    tiff:ResolutionUnit "2" ;
    tiff:XResolution "720000/10000" ;
    tiff:YResolution "720000/10000" ;
    xap:CreateDate "2015-05-04T15:40:01-07:00" ;
    xap:CreatorTool "Adobe Photoshop CC 2014 (Windows)" ;
    xap:MetadataDate "2015-05-05T10:55:34-07:00" ;
    xap:ModifyDate "2015-05-05T10:55:34-07:00" ;
    xapMM:DocumentID "xmp.did:656c488a-1b92-8a4c-b879-5a834d7f5164" ;
    xapMM:History [ a rdf:Seq ;
            rdf:_1 [ stEvt:action "created" ;
                    stEvt:instanceID "xmp.iid:656c488a-1b92-8a4c-b879-5a834d7f5164" ;
                    stEvt:softwareAgent "Adobe Photoshop CC 2014 (Windows)" ;
                    stEvt:when "2015-05-04T15:40:01-07:00" ] ] ;
    xapMM:InstanceID "xmp.iid:656c488a-1b92-8a4c-b879-5a834d7f5164" ;
    xapMM:OriginalDocumentID "xmp.did:656c488a-1b92-8a4c-b879-5a834d7f5164" ;
    dc:format "image/png" ;
    o2:file "C:/Windows/explorer.exe" ;
    o2:fileLength 4847928 ;
    o2:packetNumber 1 ;
    o2:rwFlag "w" ;
    o2:xmpEnd 3232175 ;
    o2:xmpSize 14331 ;
    o2:xmpStart 3217844 .


time: 17 ms

Conclusion¶

Common files such as ZIP files, Office Documents, PDF, Executable Programs and Libraries are wholly or in part a composition of smaller files that contain copious metadata that is normally invisible
XMP Metadata Packets, written in a first-generation dialect of RDF, are widespread in media files of many kinds
Although XMP was developed before the SPARQL Query Language, XMP packets can be copied into a Triple store and queried with next generation tools -- without any data transformation on import!
This Jupyter notebook is a development case for the Gastrodon library, which puts RDF data at your fingertips in the Jupyter environment.


Output of the query that counts XMP packets for the files with the shortest names:

	file	cnt
0	C:/Windows/explorer.exe	79
1	C:/Windows/System32/dccw.exe	14
2	C:/Windows/SysWOW64/dccw.exe	14
3	C:/Windows/System32/WpcMon.exe	8
4	C:/home/paul_000/krawler/04.jpg	1
5	C:/home/paul_000/krawler/08.jpg	1
6	C:/Windows/Installer/178e6a.msi	4
7	C:/Windows/System32/printui.dll	1
8	C:/home/paul_000/krawler/14.jpg	1
9	C:/home/paul_000/krawler/01.jpg	1

	s
0	file:///C:/Windows/explorer.exe/0001w
1	file:///C:/Windows/explorer.exe/0002w
2	file:///C:/Windows/explorer.exe/0003w
3	file:///C:/Windows/explorer.exe/0004w
4	file:///C:/Windows/explorer.exe/0005w
5	file:///C:/Windows/explorer.exe/0006w
6	file:///C:/Windows/explorer.exe/0007w
7	file:///C:/Windows/explorer.exe/0008w
8	file:///C:/Windows/explorer.exe/0009w
9	file:///C:/Windows/explorer.exe/0010w
10	file:///C:/Windows/explorer.exe/0011w
11	file:///C:/Windows/explorer.exe/0012w
12	file:///C:/Windows/explorer.exe/0013w
13	file:///C:/Windows/explorer.exe/0014w
14	file:///C:/Windows/explorer.exe/0015w
15	file:///C:/Windows/explorer.exe/0016w
16	file:///C:/Windows/explorer.exe/0017w
17	file:///C:/Windows/explorer.exe/0018w
18	file:///C:/Windows/explorer.exe/0019w
19	file:///C:/Windows/explorer.exe/0020w
20	file:///C:/Windows/explorer.exe/0021w
21	file:///C:/Windows/explorer.exe/0022w
22	file:///C:/Windows/explorer.exe/0023w
23	file:///C:/Windows/explorer.exe/0024w
24	file:///C:/Windows/explorer.exe/0025w
25	file:///C:/Windows/explorer.exe/0026w
26	file:///C:/Windows/explorer.exe/0027w
27	file:///C:/Windows/explorer.exe/0028w
28	file:///C:/Windows/explorer.exe/0029w
29	file:///C:/Windows/explorer.exe/0030w
...	...
49	file:///C:/Windows/explorer.exe/0050w
50	file:///C:/Windows/explorer.exe/0051w
51	file:///C:/Windows/explorer.exe/0052w
52	file:///C:/Windows/explorer.exe/0053w
53	file:///C:/Windows/explorer.exe/0054w
54	file:///C:/Windows/explorer.exe/0055w
55	file:///C:/Windows/explorer.exe/0056r
56	file:///C:/Windows/explorer.exe/0057r
57	file:///C:/Windows/explorer.exe/0058r
58	file:///C:/Windows/explorer.exe/0059r
59	file:///C:/Windows/explorer.exe/0060r
60	file:///C:/Windows/explorer.exe/0061r
61	file:///C:/Windows/explorer.exe/0062r
62	file:///C:/Windows/explorer.exe/0063r
63	file:///C:/Windows/explorer.exe/0064r
64	file:///C:/Windows/explorer.exe/0065r
65	file:///C:/Windows/explorer.exe/0066r
66	file:///C:/Windows/explorer.exe/0067r
67	file:///C:/Windows/explorer.exe/0068w
68	file:///C:/Windows/explorer.exe/0069w
69	file:///C:/Windows/explorer.exe/0070w
70	file:///C:/Windows/explorer.exe/0071w
71	file:///C:/Windows/explorer.exe/0072w
72	file:///C:/Windows/explorer.exe/0073w
73	file:///C:/Windows/explorer.exe/0074w
74	file:///C:/Windows/explorer.exe/0075w
75	file:///C:/Windows/explorer.exe/0076w
76	file:///C:/Windows/explorer.exe/0077w
77	file:///C:/Windows/explorer.exe/0078w
78	file:///C:/Windows/explorer.exe/0079w

	packetNumber	rwFlag	xmpStart	xmpEnd	xmpSize
0	1	w	3217844	3232175	14331
1	2	w	3232444	3246775	14331
2	3	w	3247052	3261383	14331
3	4	w	3261676	3276007	14331
4	5	w	3276300	3290631	14331
5	6	w	3290932	3305263	14331
6	7	w	3305572	3319903	14331
7	8	w	3320236	3334567	14331
8	9	w	3334932	3349263	14331
9	10	w	3352636	3366967	14331
10	11	w	3370436	3384767	14331
11	12	w	3385044	3399375	14331
12	13	w	3399668	3413999	14331
13	14	w	3414300	3428631	14331
14	15	w	3428932	3443263	14331
15	16	w	3443580	3457911	14331
16	17	w	3458228	3472559	14331
17	18	w	3472900	3487231	14331
18	19	w	3487604	3501935	14331
19	20	w	3505332	3519663	14331
20	21	w	3523148	3537921	14773
21	22	w	3538188	3552961	14773
22	23	w	3553236	3568009	14773
23	24	w	3568292	3583065	14773
24	25	w	3583348	3598121	14773
25	26	w	3598412	3613185	14773
26	27	w	3613492	3628265	14773
27	28	w	3628596	3643369	14773
28	29	w	3643732	3658505	14773
29	30	w	3658884	3673657	14773
...	...	...	...	...	...
49	50	w	3974732	3989063	14331
50	51	w	3990076	4004407	14331
51	52	w	4005636	4019967	14331
52	53	w	4021164	4035495	14331
53	54	w	4036836	4051167	14331
54	55	w	4052604	4066935	14331
55	56	r	4081532	4082369	837
56	57	r	4082948	4083785	837
57	58	r	4084532	4085369	837
58	59	r	4086268	4087105	837
59	60	r	4088292	4089129	837
60	61	r	4090908	4091745	837
61	62	r	4094100	4094945	845
62	63	r	4095532	4096377	845
63	64	r	4097132	4097977	845
64	65	r	4098884	4099729	845
65	66	r	4100932	4101777	845
66	67	r	4103572	4104417	845
67	68	w	4130404	4144735	14331
68	69	w	4144996	4159327	14331
69	70	w	4159588	4173919	14331
70	71	w	4174188	4188519	14331
71	72	w	4188772	4203103	14331
72	73	w	4203364	4217695	14331
73	74	w	4217964	4232295	14331
74	75	w	4232588	4246919	14331
75	76	w	4247228	4261559	14331
76	77	w	4261868	4276199	14331
77	78	w	4276540	4290871	14331
78	79	w	4291204	4305535	14331

	packetNumber	flag	xmpSize
0	61	r	837
1	62	r	845
2	64	r	845
3	58	r	837
4	59	r	837
5	60	r	837
6	65	r	845
7	66	r	845
8	56	r	837
9	57	r	837
10	63	r	845
11	67	r	845

	p	o
0	dc:format	image/png
1	o2:file	C:/Windows/explorer.exe
2	o2:fileLength	4847928
3	o2:packetNumber	1
4	o2:rwFlag	w
5	o2:xmpEnd	3232175
6	o2:xmpSize	14331
7	o2:xmpStart	3217844
8	exif:ColorSpace	65535
9	exif:PixelXDimension	24
10	exif:PixelYDimension	24
11	photoshop:ColorMode	3
12	tiff:Orientation	1
13	tiff:ResolutionUnit	2
14	tiff:XResolution	720000/10000
15	tiff:YResolution	720000/10000
16	xap:CreateDate	2015-05-04T15:40:01-07:00
17	xap:CreatorTool	Adobe Photoshop CC 2014 (Windows)
18	xap:MetadataDate	2015-05-05T10:55:34-07:00
19	xap:ModifyDate	2015-05-05T10:55:34-07:00
20	xapMM:DocumentID	xmp.did:656c488a-1b92-8a4c-b879-5a834d7f5164
21	xapMM:History	nodeID://b814605
22	xapMM:InstanceID	xmp.iid:656c488a-1b92-8a4c-b879-5a834d7f5164
23	xapMM:OriginalDocumentID	xmp.did:656c488a-1b92-8a4c-b879-5a834d7f5164

	p	o
0	stEvt:action	created
1	stEvt:instanceID	xmp.iid:656c488a-1b92-8a4c-b879-5a834d7f5164
2	stEvt:softwareAgent	Adobe Photoshop CC 2014 (Windows)
3	stEvt:when	2015-05-04T15:40:01-07:00

	format	cnt
0	image/png	67
1	None	12

	p	o
0	rdf:type	rdf:Seq
1	rdf:_1	nodeID://b814606
