Looking For Metadata in All The Wrong Places

The XMP specification describes a system for embedding metadata into files of all kinds. Although Part 3 of the specification defines specific ways to embed XMP packets into common file formats such as PNG, PDF, WAV and SVG, Part 1 defines a method of embedding a chunk of RDF/XML with a unique pattern of bytes in the header that allows a simple routine to find XMP packets in any kind of file.

In [55]:
show_image("md-pipeline-2.png")
time: 3 ms

Such a naive XMP extractor, it turns out, finds a large number of XMP packets inside files where you wouldn't expect them, such as executables, ZIP files, JAR files, Office documents and more. Many of these files are composite files that incorporate a number of media files into a larger file, often uncompressed on the expectation that the content of these files is already compressed.

I scanned a Windows computer, which is heavily used for software development, creative work, games and other things, so it contains a wide variety of files. Each XMP packet was extracted from the containing document, tagged with a small amount of metadata tying it to its source, and was then inserted into OpenLink Virtuoso Open Source Edition, a triple store that supports SPARQL queries.

In this Jupyter notebook I'll introduce some of the tools I use to make reports based on RDF data, and introduce the basics of the widespread, if obscure, XMP format.

Summary

  • Common files such as ZIP files, Office Documents, PDF, Executable Programs and Libraries are wholly or in part a composition of smaller files that contain copious metadata that is normally invisible
  • XMP Metadata Packets, written in a first-generation dialect of RDF, are widespread in media files of many kinds
  • Although XMP was developed before the SPARQL Query Language, XMP packets can be copied into a Triple store and queried with next generation tools -- without any data transformation on import!
  • We begin a gentle introduction to XMP, RDF and accessing RDF data in a Jupyter notebook with the in-development Gastrodon toolkit.
  • Important files from Microsoft Windows are 20% XMP metadata, most of that being whitespace

Getting Started

In [60]:
%load_ext autotime
import sys
sys.path.append("/Users/paul_000/Documents/Github/gastrodon")
from gastrodon import Endpoint,QName,ttl,URIRef
import pandas as pd
pd.options.display.width=120
pd.options.display.max_colwidth=100
The autotime extension is already loaded. To reload it, use:
  %reload_ext autotime
time: 7 ms

RDF refers to entities and properties with URIs. For instance, the Dublin Core vocabulary uses the term <http://purl.org/dc/elements/1.1/creator> to describe the creator of a creative work. Rather than repeat the full URI every time, we can make a statement like

@prefix dc: <http://purl.org/dc/elements/1.1/> .

to define a "dc" namespace such that we can write dc:creator instead of the complete IRI.
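To make the prefix mechanism concrete, here is a minimal pure-Python sketch of expanding and abbreviating names. The helpers `expand` and `shorten` are hypothetical names invented for this illustration; this is not how Gastrodon or rdflib implement prefix handling.

```python
# Prefix table: both namespace IRIs are ones that appear in this notebook.
PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "o2": "http://rdf.ontology2.com/metadata/",
}

def expand(qname, prefixes=PREFIXES):
    """Turn a prefixed name such as 'dc:creator' into a full IRI."""
    prefix, _, local = qname.partition(":")
    return prefixes[prefix] + local

def shorten(iri, prefixes=PREFIXES):
    """Turn a full IRI into a prefixed name when a known namespace matches."""
    for prefix, base in prefixes.items():
        if iri.startswith(base):
            return prefix + ":" + iri[len(base):]
    return "<" + iri + ">"   # no prefix applies: keep the IRI in angle brackets

print(expand("dc:creator"))
print(shorten("http://rdf.ontology2.com/metadata/file"))
```

The prefix is purely a shorthand: both spellings name the same term, so tools are free to display whichever is more readable.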

The gastrodon library automatically handles namespaces and prefixes for us, but we need to load a list of namespaces to make that happen.

In [2]:
from rdflib import Graph
prefixes=Graph()
prefixes.parse('/xmp/prefixes.ttl',format='ttl')
endpoint=Endpoint("http://127.0.0.1:8890/sparql/",prefixes)
time: 37.5 ms

What is the size of the data?

An RDF database contains a number of facts, represented as (subject,predicate,object) triples. The following query matches all facts in the database (because it has variables in all three positions of the matching pattern) and returns a count, roughly 5.3 million.

In [3]:
endpoint.select("""
    SELECT (COUNT(*) as ?cnt) {
        ?s ?p ?o .
    }
""")
Out[3]:
cnt
0 5344416
time: 284 ms
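The idea of a pattern with variables in some positions can be pictured with a toy matcher over an in-memory list of tuples. This is only a sketch for intuition: the `match` function and the use of `None` as a stand-in for a variable are assumptions of this example, and a real triple store uses indexes rather than a linear scan.

```python
# A toy "database": each fact is a (subject, predicate, object) tuple.
# The values are copied from results shown elsewhere in this notebook.
FACTS = [
    ("file:///C:/Windows/explorer.exe/0001w", "o2:file", "C:/Windows/explorer.exe"),
    ("file:///C:/Windows/explorer.exe/0001w", "o2:xmpSize", 14331),
    ("file:///C:/Windows/explorer.exe/0002w", "o2:file", "C:/Windows/explorer.exe"),
]

def match(facts, s=None, p=None, o=None):
    """Return the facts that fit the pattern; None plays the role of a variable."""
    return [
        fact for fact in facts
        if all(want is None or want == got
               for want, got in zip((s, p, o), fact))
    ]

print(len(match(FACTS)))                # pattern ?s ?p ?o matches every fact
print(len(match(FACTS, p="o2:file")))   # pattern ?s o2:file ?o
```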

Next we count facts that have the following predicate:

<http://rdf.ontology2.com/metadata/file>

Although this predicate is a URI, it is not necessary to fetch this URI in order to process this data. We've attached one file property to each XMP record to record which file we found it in, thus we count 54,652 XMP packets.

In [4]:
endpoint.select("""
    SELECT (COUNT(*) AS ?cnt) {
        ?s <http://rdf.ontology2.com/metadata/file> ?o
    }
""")
Out[4]:
cnt
0 54652
time: 19 ms

We can access that exact same predicate with

o2:file

because Gastrodon keeps a list of namespaces and automatically prepends

prefix o2: <http://rdf.ontology2.com/metadata/>

to the query. (It knows this because the namespace was declared in the prefixes file we loaded above.)

We can count distinct values for the file names in SPARQL the same way we would in SQL, thus finding that there are 24,228 files with XMP packets.

In [5]:
endpoint.select("""
    SELECT (COUNT(DISTINCT ?file) AS ?cnt) {
        ?s o2:file ?file
    }
""")
Out[5]:
cnt
0 24228
time: 86.5 ms

We sum the sizes of all XMP packets and discover 638 MB of data on a hard drive with 425 GB of used space, amounting to about 0.15% of all space in use.

In [6]:
endpoint.select("""
    SELECT (SUM(?size)/1000000.0 as ?cnt) {
        ?s o2:xmpSize ?size .
    }
""")
Out[6]:
cnt
0 638.696476
time: 26 ms
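The percentage quoted above follows from simple arithmetic, sketched here under the assumption that both figures use decimal units (1 GB = 1000 MB). The variable names are made up for this example.

```python
xmp_megabytes = 638.696476    # result of the SUM query above
used_gigabytes = 425.0        # used disk space quoted in the text

# Convert gigabytes to megabytes, then take the ratio.
share = xmp_megabytes / (used_gigabytes * 1000.0)
print("{:.2%}".format(share))
```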

What does the data look like?

As a warm-up, let's look for files with very short names (otherwise they might be awkward to fit in the table below) and count how many XMP packets they contain.

We do this with a simple SPARQL query that queries over the o2:file property which connects the XMP packet to the filename it was extracted from. Just like we would in SQL, we GROUP and ORDER the results to count the number of packets they contain.

In [7]:
endpoint.select("""
    SELECT ?file (COUNT(*) AS ?cnt) {
        ?s o2:file ?file
    } GROUP BY ?file ORDER BY STRLEN(?file) LIMIT 10
""")
Out[7]:
file cnt
0 C:/Windows/explorer.exe 79
1 C:/Windows/System32/dccw.exe 14
2 C:/Windows/SysWOW64/dccw.exe 14
3 C:/Windows/System32/WpcMon.exe 8
4 C:/home/paul_000/krawler/04.jpg 1
5 C:/home/paul_000/krawler/08.jpg 1
6 C:/Windows/Installer/178e6a.msi 4
7 C:/Windows/System32/printui.dll 1
8 C:/home/paul_000/krawler/14.jpg 1
9 C:/home/paul_000/krawler/01.jpg 1
time: 111 ms

Explorer.exe is called the File Explorer by Microsoft, but in addition to the folder browser, it is also responsible for the Start menu, taskbar, and other functions. If you use Windows, you use it every day.

The following query is as simple as a SPARQL query gets. This query matches triples that share a specific predicate and object (value) and returns the associated subjects.

In [8]:
frame=endpoint.select("""
    SELECT ?s {
        ?s o2:file "C:/Windows/explorer.exe"
    }
""")

frame
Out[8]:
s
0 file:///C:/Windows/explorer.exe/0001w
1 file:///C:/Windows/explorer.exe/0002w
2 file:///C:/Windows/explorer.exe/0003w
3 file:///C:/Windows/explorer.exe/0004w
4 file:///C:/Windows/explorer.exe/0005w
5 file:///C:/Windows/explorer.exe/0006w
6 file:///C:/Windows/explorer.exe/0007w
7 file:///C:/Windows/explorer.exe/0008w
8 file:///C:/Windows/explorer.exe/0009w
9 file:///C:/Windows/explorer.exe/0010w
10 file:///C:/Windows/explorer.exe/0011w
11 file:///C:/Windows/explorer.exe/0012w
12 file:///C:/Windows/explorer.exe/0013w
13 file:///C:/Windows/explorer.exe/0014w
14 file:///C:/Windows/explorer.exe/0015w
15 file:///C:/Windows/explorer.exe/0016w
16 file:///C:/Windows/explorer.exe/0017w
17 file:///C:/Windows/explorer.exe/0018w
18 file:///C:/Windows/explorer.exe/0019w
19 file:///C:/Windows/explorer.exe/0020w
20 file:///C:/Windows/explorer.exe/0021w
21 file:///C:/Windows/explorer.exe/0022w
22 file:///C:/Windows/explorer.exe/0023w
23 file:///C:/Windows/explorer.exe/0024w
24 file:///C:/Windows/explorer.exe/0025w
25 file:///C:/Windows/explorer.exe/0026w
26 file:///C:/Windows/explorer.exe/0027w
27 file:///C:/Windows/explorer.exe/0028w
28 file:///C:/Windows/explorer.exe/0029w
29 file:///C:/Windows/explorer.exe/0030w
... ...
49 file:///C:/Windows/explorer.exe/0050w
50 file:///C:/Windows/explorer.exe/0051w
51 file:///C:/Windows/explorer.exe/0052w
52 file:///C:/Windows/explorer.exe/0053w
53 file:///C:/Windows/explorer.exe/0054w
54 file:///C:/Windows/explorer.exe/0055w
55 file:///C:/Windows/explorer.exe/0056r
56 file:///C:/Windows/explorer.exe/0057r
57 file:///C:/Windows/explorer.exe/0058r
58 file:///C:/Windows/explorer.exe/0059r
59 file:///C:/Windows/explorer.exe/0060r
60 file:///C:/Windows/explorer.exe/0061r
61 file:///C:/Windows/explorer.exe/0062r
62 file:///C:/Windows/explorer.exe/0063r
63 file:///C:/Windows/explorer.exe/0064r
64 file:///C:/Windows/explorer.exe/0065r
65 file:///C:/Windows/explorer.exe/0066r
66 file:///C:/Windows/explorer.exe/0067r
67 file:///C:/Windows/explorer.exe/0068w
68 file:///C:/Windows/explorer.exe/0069w
69 file:///C:/Windows/explorer.exe/0070w
70 file:///C:/Windows/explorer.exe/0071w
71 file:///C:/Windows/explorer.exe/0072w
72 file:///C:/Windows/explorer.exe/0073w
73 file:///C:/Windows/explorer.exe/0074w
74 file:///C:/Windows/explorer.exe/0075w
75 file:///C:/Windows/explorer.exe/0076w
76 file:///C:/Windows/explorer.exe/0077w
77 file:///C:/Windows/explorer.exe/0078w
78 file:///C:/Windows/explorer.exe/0079w

79 rows × 1 columns

time: 24.5 ms

We get a list of file: URIs, each of which is the name of an XMP packet, which we generated by appending a number and letter to the filename.
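The naming pattern visible in the output (a four-digit packet number followed by the read/write flag) can be sketched as follows. The scanner that actually produced these names is not shown in this notebook, so `packet_uri` and `split_packet_uri` are hypothetical helpers reconstructed from the output alone, and they ignore details such as percent-encoding of unusual characters in paths.

```python
def packet_uri(path, number, rw_flag):
    """Build a packet name like file:///C:/Windows/explorer.exe/0001w."""
    return "file:///{}/{:04d}{}".format(path, number, rw_flag)

def split_packet_uri(uri):
    """Recover (path, number, rw_flag) from a packet name."""
    body = uri[len("file:///"):]
    path, _, tail = body.rpartition("/")
    return path, int(tail[:-1]), tail[-1]

uri = packet_uri("C:/Windows/explorer.exe", 1, "w")
print(uri)
print(split_packet_uri(uri))
```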

To get some idea of the content of one packet, we'll look at the facts referencing the first packet.

In [9]:
pngOne=endpoint.select("""
    SELECT ?p ?o {
        <file:///C:/Windows/explorer.exe/0001w> ?p ?o
    }
""")
pngOne
Out[9]:
p o
0 dc:format image/png
1 o2:file C:/Windows/explorer.exe
2 o2:fileLength 4847928
3 o2:packetNumber 1
4 o2:rwFlag w
5 o2:xmpEnd 3232175
6 o2:xmpSize 14331
7 o2:xmpStart 3217844
8 exif:ColorSpace 65535
9 exif:PixelXDimension 24
10 exif:PixelYDimension 24
11 photoshop:ColorMode 3
12 tiff:Orientation 1
13 tiff:ResolutionUnit 2
14 tiff:XResolution 720000/10000
15 tiff:YResolution 720000/10000
16 xap:CreateDate 2015-05-04T15:40:01-07:00
17 xap:CreatorTool Adobe Photoshop CC 2014 (Windows)
18 xap:MetadataDate 2015-05-05T10:55:34-07:00
19 xap:ModifyDate 2015-05-05T10:55:34-07:00
20 xapMM:DocumentID xmp.did:656c488a-1b92-8a4c-b879-5a834d7f5164
21 xapMM:History nodeID://b814605
22 xapMM:InstanceID xmp.iid:656c488a-1b92-8a4c-b879-5a834d7f5164
23 xapMM:OriginalDocumentID xmp.did:656c488a-1b92-8a4c-b879-5a834d7f5164
time: 19 ms

Note that the properties in the o2 namespace are properties I added to identify the packets, as opposed to all of the other properties which were copied verbatim from the XMP packet. Other properties come from Dublin Core, XMP and Photoshop-specific namespaces, as well as namespaces such as exif and tiff which contain industry-standard terminology despite being on an Adobe URI.

Part 2 of the XMP specification describes a number of namespaces of types and properties to use with XMP. Despite that, XMP users are free to use any RDF vocabulary they like so long as they comply with certain conventions described in Part 1.

Let's look first at the 'o2' properties because these provide a map for finding and describing XMP packets.

In [10]:
pngOne[1:8]
Out[10]:
p o
1 o2:file C:/Windows/explorer.exe
2 o2:fileLength 4847928
3 o2:packetNumber 1
4 o2:rwFlag w
5 o2:xmpEnd 3232175
6 o2:xmpSize 14331
7 o2:xmpStart 3217844
time: 11 ms

Since all of these properties have a 1-to-1 relationship with an XMP packet, we can write a query which works like a typical SQL query, where all the patterns share the same subject and we get exactly one row per packet. This table shows the length of all the packets, whether they allow read or write access, and their exact location in the file.

In [11]:
packets=endpoint.select("""
    SELECT ?packetNumber ?rwFlag ?xmpStart ?xmpEnd ?xmpSize {
        ?s o2:file "C:/Windows/explorer.exe" .
        ?s o2:packetNumber ?packetNumber .
        ?s o2:rwFlag ?rwFlag .
        ?s o2:xmpStart ?xmpStart .
        ?s o2:xmpEnd ?xmpEnd .
        ?s o2:xmpSize ?xmpSize .
    }
""")
packets
Out[11]:
packetNumber rwFlag xmpStart xmpEnd xmpSize
0 1 w 3217844 3232175 14331
1 2 w 3232444 3246775 14331
2 3 w 3247052 3261383 14331
3 4 w 3261676 3276007 14331
4 5 w 3276300 3290631 14331
5 6 w 3290932 3305263 14331
6 7 w 3305572 3319903 14331
7 8 w 3320236 3334567 14331
8 9 w 3334932 3349263 14331
9 10 w 3352636 3366967 14331
10 11 w 3370436 3384767 14331
11 12 w 3385044 3399375 14331
12 13 w 3399668 3413999 14331
13 14 w 3414300 3428631 14331
14 15 w 3428932 3443263 14331
15 16 w 3443580 3457911 14331
16 17 w 3458228 3472559 14331
17 18 w 3472900 3487231 14331
18 19 w 3487604 3501935 14331
19 20 w 3505332 3519663 14331
20 21 w 3523148 3537921 14773
21 22 w 3538188 3552961 14773
22 23 w 3553236 3568009 14773
23 24 w 3568292 3583065 14773
24 25 w 3583348 3598121 14773
25 26 w 3598412 3613185 14773
26 27 w 3613492 3628265 14773
27 28 w 3628596 3643369 14773
28 29 w 3643732 3658505 14773
29 30 w 3658884 3673657 14773
... ... ... ... ... ...
49 50 w 3974732 3989063 14331
50 51 w 3990076 4004407 14331
51 52 w 4005636 4019967 14331
52 53 w 4021164 4035495 14331
53 54 w 4036836 4051167 14331
54 55 w 4052604 4066935 14331
55 56 r 4081532 4082369 837
56 57 r 4082948 4083785 837
57 58 r 4084532 4085369 837
58 59 r 4086268 4087105 837
59 60 r 4088292 4089129 837
60 61 r 4090908 4091745 837
61 62 r 4094100 4094945 845
62 63 r 4095532 4096377 845
63 64 r 4097132 4097977 845
64 65 r 4098884 4099729 845
65 66 r 4100932 4101777 845
66 67 r 4103572 4104417 845
67 68 w 4130404 4144735 14331
68 69 w 4144996 4159327 14331
69 70 w 4159588 4173919 14331
70 71 w 4174188 4188519 14331
71 72 w 4188772 4203103 14331
72 73 w 4203364 4217695 14331
73 74 w 4217964 4232295 14331
74 75 w 4232588 4246919 14331
75 76 w 4247228 4261559 14331
76 77 w 4261868 4276199 14331
77 78 w 4276540 4290871 14331
78 79 w 4291204 4305535 14331

79 rows × 5 columns

time: 72.5 ms

We already see some things that are highly suspicious; if we just look at the first five, for instance, we see that they all have exactly the same size.

In [12]:
endpoint.select("""
    SELECT ?packetNumber ?rwFlag ?xmpStart ?xmpEnd ?xmpSize {
        ?s o2:file "C:/Windows/explorer.exe" ;
           o2:packetNumber ?packetNumber ;
           o2:rwFlag ?rwFlag ;
           o2:xmpStart ?xmpStart ;
           o2:xmpEnd ?xmpEnd ;
           o2:xmpSize ?xmpSize .
    } LIMIT 5
""")
Out[12]:
packetNumber rwFlag xmpStart xmpEnd xmpSize
0 1 w 3217844 3232175 14331
1 2 w 3232444 3246775 14331
2 3 w 3247052 3261383 14331
3 4 w 3261676 3276007 14331
4 5 w 3276300 3290631 14331
time: 27 ms
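A quick sanity check on these five rows can be done in plain Python. This sketch copies the numbers from the table above; it verifies that each recorded size equals end minus start and measures how many bytes separate one packet from the next.

```python
# (xmpStart, xmpEnd, xmpSize) for the first five packets, copied from the table above.
rows = [
    (3217844, 3232175, 14331),
    (3232444, 3246775, 14331),
    (3247052, 3261383, 14331),
    (3261676, 3276007, 14331),
    (3276300, 3290631, 14331),
]

# Each recorded size should equal end minus start.
assert all(end - start == size for start, end, size in rows)

# Bytes between the end of one packet and the start of the next.
gaps = [nxt[0] - cur[1] for cur, nxt in zip(rows, rows[1:])]
print(gaps)
```

The gaps are only a few hundred bytes each, so whatever sits between the packets is tiny compared with the packets themselves.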

The next SPARQL query sums up the total size of all the packets contained in explorer.exe, and we find, amazingly, that the XMP packets add up to 20% of the total file size!

In [13]:
sumSize=endpoint.select("""
    SELECT (SUM(?xmpSize) AS ?xmpBytes) {
        ?s o2:file "C:/Windows/explorer.exe" ;
           o2:xmpSize ?xmpSize .
    }
""")
sumSize
Out[13]:
xmpBytes
0 975131
time: 12 ms
In [14]:
sumSize.at[0,'xmpBytes']
Out[14]:
975131
time: 8.5 ms
In [15]:
pngOne.at[2,'o']
Out[15]:
4847928
time: 15 ms
In [16]:
100.0*sumSize.at[0,'xmpBytes']/pngOne.at[2,'o']
Out[16]:
20.114387012348367
time: 12.5 ms

Viewing the raw packet

Knowing the location in the file, we can slice out a range of bytes and thus see the actual XMP packet:

In [17]:
fname="C:/Windows/explorer.exe"
offset=packets.at[0,"xmpStart"]
size=packets.at[0,"xmpSize"]
(fname,offset,size)
Out[17]:
('C:/Windows/explorer.exe', 3217844, 14331)
time: 10.5 ms
In [18]:
def getslice(fname,offset,size):
    with open(fname,"rb") as file:
        file.seek(offset)
        return file.read(size)
time: 13 ms
In [19]:
rawpacket=getslice(fname,offset,size).decode("utf-8")
print(rawpacket.rstrip())
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.6-c014 79.156797, 2014/08/20-09:53:02        ">
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about=""
            xmlns:xmp="http://ns.adobe.com/xap/1.0/"
            xmlns:dc="http://purl.org/dc/elements/1.1/"
            xmlns:photoshop="http://ns.adobe.com/photoshop/1.0/"
            xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
            xmlns:stEvt="http://ns.adobe.com/xap/1.0/sType/ResourceEvent#"
            xmlns:tiff="http://ns.adobe.com/tiff/1.0/"
            xmlns:exif="http://ns.adobe.com/exif/1.0/">
         <xmp:CreatorTool>Adobe Photoshop CC 2014 (Windows)</xmp:CreatorTool>
         <xmp:CreateDate>2015-05-04T15:40:01-07:00</xmp:CreateDate>
         <xmp:ModifyDate>2015-05-05T10:55:34-07:00</xmp:ModifyDate>
         <xmp:MetadataDate>2015-05-05T10:55:34-07:00</xmp:MetadataDate>
         <dc:format>image/png</dc:format>
         <photoshop:ColorMode>3</photoshop:ColorMode>
         <xmpMM:InstanceID>xmp.iid:656c488a-1b92-8a4c-b879-5a834d7f5164</xmpMM:InstanceID>
         <xmpMM:DocumentID>xmp.did:656c488a-1b92-8a4c-b879-5a834d7f5164</xmpMM:DocumentID>
         <xmpMM:OriginalDocumentID>xmp.did:656c488a-1b92-8a4c-b879-5a834d7f5164</xmpMM:OriginalDocumentID>
         <xmpMM:History>
            <rdf:Seq>
               <rdf:li rdf:parseType="Resource">
                  <stEvt:action>created</stEvt:action>
                  <stEvt:instanceID>xmp.iid:656c488a-1b92-8a4c-b879-5a834d7f5164</stEvt:instanceID>
                  <stEvt:when>2015-05-04T15:40:01-07:00</stEvt:when>
                  <stEvt:softwareAgent>Adobe Photoshop CC 2014 (Windows)</stEvt:softwareAgent>
               </rdf:li>
            </rdf:Seq>
         </xmpMM:History>
         <tiff:Orientation>1</tiff:Orientation>
         <tiff:XResolution>720000/10000</tiff:XResolution>
         <tiff:YResolution>720000/10000</tiff:YResolution>
         <tiff:ResolutionUnit>2</tiff:ResolutionUnit>
         <exif:ColorSpace>65535</exif:ColorSpace>
         <exif:PixelXDimension>24</exif:PixelXDimension>
         <exif:PixelYDimension>24</exif:PixelYDimension>
      </rdf:Description>
   </rdf:RDF>
</x:xmpmeta>
time: 11 ms

Note in particular the snippet at the beginning that reads

<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>

this snippet contains a special sequence of characters that is unlikely to appear anyplace else, and can thus be scanned for in order to find XMP packets. This signature is not dependent on the structure of the file so a simple reader can find XMP packets in any kind of file, or even find XMP packets on a raw disk volume -- some tools for undeleting photos from flash cards can locate media files from their embedded XMP packets, extracting the deleted files the way that we're about to extract a PNG file from Explorer.exe.
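A signature scan of this kind can be sketched in a few lines. This is an illustration under stated assumptions, not the scanner used to build the database: `find_xmp_packets` is a hypothetical name, the sketch only handles packets stored in an 8-bit encoding such as UTF-8, and it relies on the packet trailer (`<?xpacket end="w"?>` or `<?xpacket end="r"?>`) to find where a packet stops.

```python
import re

# Header: the processing instruction carrying the fixed packet id.
XPACKET_BEGIN = re.compile(rb"<\?xpacket begin=[^>]*W5M0MpCehiHzreSzNTczkc9d[^>]*\?>")
# Trailer: carries the read/write flag.
XPACKET_END = re.compile(rb"<\?xpacket end=[\"']([rw])[\"']\?>")

def find_xmp_packets(data):
    """Yield (start, end, rw_flag) for each XMP packet found in a bytes object."""
    pos = 0
    while True:
        begin = XPACKET_BEGIN.search(data, pos)
        if begin is None:
            return
        end = XPACKET_END.search(data, begin.end())
        if end is None:
            return
        yield begin.start(), end.end(), end.group(1).decode("ascii")
        pos = end.end()

# Synthetic test data: junk, one tiny packet, more junk.
blob = (b"\x00junk"
        + b'<?xpacket begin="\xef\xbb\xbf" id="W5M0MpCehiHzreSzNTczkc9d"?>'
        + b"<x:xmpmeta/>   "
        + b'<?xpacket end="w"?>'
        + b"tail")
print(list(find_xmp_packets(blob)))
```

Because the scan never interprets the container format, the same function would work on an executable, a ZIP archive, or a raw disk image read into memory.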

The packet itself is written in RDF/XML which was the first RDF serialization format, with just a few constraints that are set in Part 1.

Note that the XMP packet is followed by a whopping 12,050 characters of whitespace, as opposed to just 2,281 characters of XML!

In [20]:
len(rawpacket)-rawpacket.rfind('>')
Out[20]:
12050
time: 13 ms
In [21]:
all(map(lambda x:x.isspace(),rawpacket[-12049:]))
Out[21]:
True
time: 13 ms
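Another way to measure the padding, sketched here with a hypothetical helper, is to compare the length of the text before and after stripping trailing whitespace. Unlike `len(rawpacket)-rawpacket.rfind('>')`, which also counts the closing `>` itself, this counts whitespace characters only.

```python
def trailing_whitespace(text):
    """Return the number of whitespace characters at the end of text."""
    return len(text) - len(text.rstrip())

# Synthetic sample: a short element followed by 100 spaces and a newline.
sample = "<x:xmpmeta></x:xmpmeta>" + " " * 100 + "\n"
print(trailing_whitespace(sample))
```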

Viewing the Embedded Image

This packet claims to be part of a PNG file, so let's see if we can extract the whole image. To start, we know that every PNG file starts with an eight-byte sequence, so we can search backward from the XMP packet for the nearest occurrence of this sequence.

In [22]:
PNGSTART=bytes.fromhex("89504E470D0A1A0A")
PNGSTART
Out[22]:
b'\x89PNG\r\n\x1a\n'
time: 9 ms
In [23]:
!pip install bitstring
from bitstring import ConstBitStream
Requirement already satisfied: bitstring in c:\users\paul_000\anaconda3\lib\site-packages
time: 2.18 s
In [24]:
x=ConstBitStream(filename='C:/Windows/explorer.exe')
time: 1.5 ms
In [25]:
x.pos=x.rfind(PNGSTART,end=offset*8,bytealigned=True)[0]
time: 9 ms

We can then move forward 8 bytes to skip past the header

In [26]:
begin=x.bytepos
x.bytepos += 8
time: 10 ms

The next problem is finding the end of the PNG file; although there is no indication of the exact length of the PNG file, a PNG file consists of a number of chunks, each of which has a chunk type, length, and checksum. The last chunk is called "IEND", and once we've read it, we're at the end of the file.

In [27]:
def readchunks(x:ConstBitStream):
    while True:
        length=x.read("uintbe:32")
        chunkId=x.read("bytes:4").decode("utf-8")
        start=x.bytepos
        x.bytepos += length
        x.bytepos += 4
        if chunkId=="IEND":
            break
        yield chunkId,start,length
        
    
time: 14 ms

Here we see all the chunks, including the iTXt chunk which contains the XMP packet.

In [28]:
list(readchunks(x))
Out[28]:
[('IHDR', 3217776, 13),
 ('pHYs', 3217801, 9),
 ('iTXt', 3217822, 14372),
 ('cHRM', 3232206, 32),
 ('IDAT', 3232250, 94)]
time: 15.5 ms
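The same chunk walk can be written with only the standard library, which also makes it easy to verify each chunk's checksum. This is a sketch, not a replacement for the bitstring version above: `iter_png_chunks` and `make_chunk` are names invented for this example, and the demo data is a structurally valid chunk stream rather than a decodable image.

```python
import struct
import zlib

PNG_SIGNATURE = bytes.fromhex("89504E470D0A1A0A")

def iter_png_chunks(data, start=0):
    """Yield (type, payload_offset, length, crc_ok) for each chunk through IEND."""
    if data[start:start + 8] != PNG_SIGNATURE:
        raise ValueError("no PNG signature at offset {}".format(start))
    pos = start + 8
    while True:
        # Every chunk is: 4-byte big-endian length, 4-byte type, payload, 4-byte CRC.
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        payload = data[pos + 8:pos + 8 + length]
        (stored_crc,) = struct.unpack(">I", data[pos + 8 + length:pos + 12 + length])
        # The CRC covers the chunk type and payload, but not the length field.
        crc_ok = (zlib.crc32(ctype + payload) & 0xFFFFFFFF) == stored_crc
        yield ctype.decode("ascii"), pos + 8, length, crc_ok
        pos += 12 + length
        if ctype == b"IEND":
            return

def make_chunk(ctype, payload):
    """Build one well-formed chunk; used here only to create test data."""
    crc = zlib.crc32(ctype + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + ctype + payload + struct.pack(">I", crc)

# Two junk bytes, then a signature and two chunks.
demo = b"xx" + PNG_SIGNATURE + make_chunk(b"IHDR", bytes(13)) + make_chunk(b"IEND", b"")
print(list(iter_png_chunks(demo, 2)))
```

A failed checksum is a useful hint that a PNG signature found by a byte scan was a false positive rather than the start of a real image.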

The cursor on the BitStream is now at the end of the PNG file, so we're ready to extract the image file once we store the endpoint

In [29]:
end=x.bytepos
(begin,offset,end)
Out[29]:
(3217760, 3217844, 3232360)
time: 15 ms
In [30]:
imagedata=x[begin*8:end*8].bytes
time: 7.5 ms

The PNG file has a total length of 14,600 bytes, of which 14,372 bytes are the iTXt chunk that holds the XMP packet, and of which 12,050 bytes are whitespace. If you think that's outrageous, you should see what happens next...

In [31]:
len(imagedata)
Out[31]:
14600
time: 15 ms

In principle it is pretty easy to display an image in a Jupyter notebook, but when we try it the first time, we don't see anything at all

In [32]:
from IPython.display import display_png,display_html,display,HTML
display_png(imagedata,raw=True)
time: 11.5 ms

My hunch when I saw that was that I was looking at white lines on a transparent background; I looked at it in Photoshop, confirmed I was right, and then worked out a way to display an image with a custom background. The trick is that you can embed an image, data and all, in a base64-encoded data URI and in turn embed that in HTML.

In [33]:
from base64 import b64encode
time: 11 ms
In [34]:
b64=b64encode(imagedata).decode("ascii")
b64url='data:image/png;base64,{}'.format(b64)
embedded="<div style='height:24px; width:24px; background:MidnightBlue'><img src='{}'></div>".format(b64url)
HTML(embedded)
Out[34]:
time: 17.5 ms

This image is one you've seen if you use Windows 10, as it is part of the taskbar, which is provided by Explorer.exe; it's clear now why this image is white-on-transparent:

In [35]:
def show_image(filename):
    with open(filename,"rb") as f:
        image=f.read()
        display_png(image,raw=True)
        
show_image("taskbar.png")
time: 8 ms

What's going on?

At this point you might think you're seeing the kind of madness you can get from two, not just one, billion-dollar software companies.

Actually it makes a little more sense than it looks, as shown by the diagram below. The PNG file consists of a number of "chunks", one of which is an iTXt chunk, which holds text and here stores it uncompressed. The PNG file was written with a large amount of whitespace padding at the end of the XMP packet so that an XMP editor could later add a few kilobytes of metadata without having to modify the entire file.

This looks absurd in the case of a very simple image which contains just 94 bytes of compressed data; it would seem even more absurd if one were using the image on the web or in a mobile app. It makes more sense in a media workflow where people are working with large files: for instance, RAW camera files are frequently 20 MB or more in size, and 12 kB of whitespace is a small price to pay, in that context, for an application like Adobe Lightroom or Microsoft Photos to update the metadata without needing to rewrite the entire file.

In the process of building the Explorer.exe binary, the compiler and linker consolidate fragments of code, data, and many kinds of files into a single file. The compiler adds metadata to the binary that the executable uses to find resources by name just as it would find a subroutine by name:

In [36]:
show_image("embedding.png")
time: 13.5 ms

Note there is no "metadata master plan" going on here! XMP data "rides along" in the iTXt chunk because most PNG tools ignore it, keeping metadata together with the image, as designed. What we find, however, is that there are also PNG files embedded in Explorer.exe without any XMP data, for instance, the first PNG file starts at a byte position

In [37]:
x.pos=x.find(PNGSTART,bytealigned=True)[0]
x.bytepos
Out[37]:
2556456
time: 30.5 ms

that is long before the first XMP packet, which occurs at:

In [38]:
packets.at[0,"xmpStart"]
Out[38]:
3217844
time: 13 ms

In fact, we spot 260 PNG headers in Explorer.exe, more than the 79 XMP packets contained in the file. There is no plan for managing the metadata for the files embedded in this executable, yet large amounts of whitespace bulk up this executable by more than 20%.

Quick note: the situation is no different on Linux, macOS, or other operating systems, because all executable formats have some way to embed images. What publishers should do is remove unnecessary metadata before publishing, which is easy to do in the case of PNG because we can simply omit the iTXt chunk.

In [39]:
len(list(x.findall(PNGSTART,bytealigned=True)))
Out[39]:
260
time: 45.5 ms
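The advice to omit the iTXt chunk before publishing can be sketched with the standard library alone. This is a minimal illustration under assumptions: `strip_chunks` and `make_chunk` are hypothetical helpers, the demo data is a structurally valid chunk stream rather than a real image, and dropping every iTXt chunk would also discard any non-XMP text stored that way.

```python
import struct
import zlib

PNG_SIGNATURE = bytes.fromhex("89504E470D0A1A0A")

def strip_chunks(png, drop=(b"iTXt",)):
    """Return a copy of a PNG byte string without the chunk types listed in drop."""
    if png[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    kept = [PNG_SIGNATURE]
    pos = 8
    while pos + 8 <= len(png):
        length, ctype = struct.unpack(">I4s", png[pos:pos + 8])
        end = pos + 12 + length          # 4 length + 4 type + payload + 4 CRC
        if ctype not in drop:
            kept.append(png[pos:end])    # chunks are copied verbatim, CRC included
        pos = end
        if ctype == b"IEND":
            break
    return b"".join(kept)

def make_chunk(ctype, payload):
    """Build one well-formed chunk; used here only to create test data."""
    crc = zlib.crc32(ctype + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + ctype + payload + struct.pack(">I", crc)

demo = (PNG_SIGNATURE
        + make_chunk(b"IHDR", bytes(13))
        + make_chunk(b"iTXt", b"metadata" + b" " * 100)   # placeholder payload
        + make_chunk(b"IEND", b""))
slim = strip_chunks(demo)
print(len(demo), len(slim))
```

Because each chunk carries its own checksum, the remaining chunks can be copied unchanged and the result is still a well-formed chunk stream.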

Are all of the metadata packets for PNG files?

Next we'd like to see what kinds of files are inside Explorer.exe. Here we do another GROUP BY query, but now we use the OPTIONAL clause, because without it the pattern would not match when the dc:format property is not specified. With OPTIONAL, the ?format variable is left unbound when that happens, which shows up as None in the result.

In [40]:
endpoint.select("""
    SELECT ?format (COUNT(*) as ?cnt) {
        ?s o2:file "C:/Windows/explorer.exe" ;
           o2:packetNumber ?packetNumber ;
           o2:xmpSize ?xmpSize .
        OPTIONAL { ?s dc:format ?format . }
    } GROUP BY ?format
""")
Out[40]:
format cnt
0 image/png 67
1 None 12
time: 25 ms

At this point we can use the MINUS clause in SPARQL to find cases where dc:format is not specified, as MINUS keeps a result only if the pattern inside it does not match, producing a list of the mysterious files. Note that in this case, all of the XMP packets are relatively short.

In [41]:
endpoint.select("""
    SELECT ?packetNumber ?flag ?xmpSize {
        ?s o2:file "C:/Windows/explorer.exe" ;
           o2:packetNumber ?packetNumber ;
           o2:rwFlag ?flag ;
           o2:xmpSize ?xmpSize .
        MINUS { ?s dc:format ?format . }
    }
""")
Out[41]:
packetNumber flag xmpSize
0 61 r 837
1 62 r 845
2 64 r 845
3 58 r 837
4 59 r 837
5 60 r 837
6 65 r 845
7 66 r 845
8 56 r 837
9 57 r 837
10 63 r 845
11 67 r 845
time: 19 ms
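For readers who think in terms of ordinary data structures, the difference between OPTIONAL and MINUS can be pictured with a toy example. The list `packets_demo` is made-up illustration data (not query output), and a dict stands in for the facts about one subject.

```python
from collections import Counter

# Toy packets: a missing "format" key means there is no dc:format fact.
packets_demo = [
    {"packetNumber": 1, "format": "image/png"},
    {"packetNumber": 56},
    {"packetNumber": 57},
]

# OPTIONAL: keep every packet, with None where dc:format is missing,
# then group and count as the GROUP BY query does.
by_format = Counter(p.get("format") for p in packets_demo)

# MINUS: keep only the packets for which the dc:format pattern finds nothing.
without_format = [p["packetNumber"] for p in packets_demo if "format" not in p]

print(by_format)
print(without_format)
```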

Looking at the facts in packet number 57, we don't see anything too unusual, but note that the file type is left unspecified.

In [42]:
endpoint.select("""
    SELECT ?p ?o {
        ?s o2:file "C:/Windows/explorer.exe" .
        ?s o2:packetNumber 57 .
        ?s ?p ?o .
    }
""")
Out[42]:
p o
0 o2:file C:/Windows/explorer.exe
1 o2:fileLength 4847928
2 o2:packetNumber 57
3 o2:rwFlag r
4 o2:xmpEnd 4083785
5 o2:xmpSize 837
6 o2:xmpStart 4082948
7 xap:CreatorTool Adobe Photoshop CC 2014 (Windows)
8 xapMM:DerivedFrom nodeID://b814727
9 xapMM:DocumentID xmp.did:36CA86F2857E11E4A5EACBDE7A2C5308
10 xapMM:InstanceID xmp.iid:36CA86F1857E11E4A5EACBDE7A2C5308
11 xapMM:OriginalDocumentID xmp.did:68fdb7b7-8591-624f-ba66-aacfdce9ba29
time: 16 ms

Conjecturing that it is just another PNG file, I use the same PNG extractor as for the previous image and find that it is just the logo for Cortana, which also appears in the taskbar snapshot. Although both of these images were created with Adobe Photoshop CC 2014 (Windows), it seems that they were not saved with a consistent set of properties.

In [43]:
x.pos=x.rfind(PNGSTART,end=4082948*8,bytealigned=True)[0]
begin=x.bytepos
x.bytepos += 8
time: 1.5 ms
In [44]:
list(readchunks(x))
Out[44]:
[('IHDR', 4082864, 13),
 ('tEXt', 4082889, 25),
 ('iTXt', 4082926, 878),
 ('IDAT', 4083816, 593)]
time: 14 ms
In [45]:
end=x.bytepos
(begin,end)
Out[45]:
(4082848, 4084425)
time: 11 ms
In [46]:
imagedata=x[begin*8:end*8].bytes
time: 8.5 ms

... and you know it just happens to be the logo for Cortana, Microsoft's voice agent:

In [47]:
b64=b64encode(imagedata).decode("ascii")
b64url='data:image/png;base64,{}'.format(b64)
embedded="<div style='height:24px; width:24px; background:MidnightBlue'><img src='{}'></div>".format(b64url)
HTML(embedded)
Out[47]:
time: 19 ms

Blank Nodes: beyond triples

It is axiomatic that we can express anything with subject-predicate-object triples. Some cases, however, take a little more work than others. RDF triples easily represent the relational structures commonly used in SQL databases, but what about the nested and sequential structures that people expect out of NoSQL?

RDF provides us with a powerful tool for representing post-relational structures in the form of blank nodes. To understand them, let's take a look at one fact from the first XMP packet, the one that describes the "Task View" image.

In [48]:
pngOne[21:22]
Out[48]:
p o
21 xapMM:History nodeID://b814605
time: 21.5 ms

Note that the right-hand side (the object) is nodeID://b814605, which is a reference to a blank node.

Identifiers which derive from a URI, such as <http://ns.adobe.com/xap/1.0/mm/> refer to the same thing everywhere in the world, so long as the string content is the same. Blank nodes are different, in that blank nodes are specific to a particular triple store. Some other triple store could use that exact same name for a different purpose, or if I loaded the same data into Virtuoso again, it could wind up with a different name.

Because blank nodes are local to a triple store, the exact behavior of blank nodes differs between RDF databases. In particular, most triple stores have ways to refer to specific blank nodes, because even if they do not exist globally, it is still necessary to talk about them locally to copy graphs from one triple store to another.

Let's look at the properties of this blank node:

In [49]:
frame=endpoint.select("""
    SELECT ?p ?o {
        <nodeID://b814605> ?p ?o
    }
""")
frame
Out[49]:
p o
0 rdf:type rdf:Seq
1 rdf:_1 nodeID://b814606
time: 35.5 ms

<nodeID://b814605> turns out to be an ordered list (an rdf:Seq). In this case it is a list of one member, which is itself a blank node. If there were multiple members, these would be indicated as rdf:_2, rdf:_3 and so forth.

This method of representing a list is analogous to the Java ArrayList, which represents a list as an array. An RDF tool could, pretty easily, use the number as an index into an array to find a value.
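Turning the numbered membership properties into an ordinary list can be sketched like this. The function `seq_to_list` is a hypothetical helper, and the input is assumed to be the (predicate, object) pairs of one container node with predicates written as full IRIs.

```python
import re

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Membership properties are rdf:_1, rdf:_2, ... (numbering starts at 1).
MEMBER = re.compile(re.escape(RDF) + r"_([1-9][0-9]*)$")

def seq_to_list(properties):
    """Return the container members ordered by their rdf:_n index."""
    members = {}
    for predicate, obj in properties:
        found = MEMBER.match(predicate)
        if found:                      # skips rdf:type and anything else
            members[int(found.group(1))] = obj
    return [members[index] for index in sorted(members)]

print(seq_to_list([
    (RDF + "type", RDF + "Seq"),
    (RDF + "_2", "second"),
    (RDF + "_1", "first"),
]))
```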

There are three kinds of RDF container:

  • rdf:Seq an ordered list
  • rdf:Alt an alternative list (of which the data consumer is intended to pick one value; a list of names in various languages could be an example)
  • rdf:Bag an unordered collection (unlike a set, it may contain duplicates)

RDF contains another mechanism, called a 'Collection' (or rdf:List), that represents ordered lists as does the LISP programming language or the Java LinkedList. These also involve blank nodes, but we'll avoid them for now since they are not used in XMP packets.

To continue exploring, we can follow the properties that lead from the single member of the above list...

In [50]:
frame=endpoint.select("""
    SELECT ?p ?o {
        <nodeID://b814606> ?p ?o
    }
""")
frame
Out[50]:
p o
0 stEvt:action created
1 stEvt:instanceID xmp.iid:656c488a-1b92-8a4c-b879-5a834d7f5164
2 stEvt:softwareAgent Adobe Photoshop CC 2014 (Windows)
3 stEvt:when 2015-05-04T15:40:01-07:00
time: 23.5 ms

... and we find that <nodeID://b814606> is just a data record that is nested inside the larger data record, much as one would see in a JSON file. This record represents one moment in the history of the file, the moment it was created. If the file had a longer history, it would contain a list with a large number of records.

The analogy to JSON gets closer if we extract all the facts in the XMP packet.

The process to do this is exactly like what we just did above, following links from one blank node to the next. The peel function from gastrodon does this, copying all of the facts into a Graph object from the rdflib library.

Here we do this, listing all of the namespaces which are used in this particular Graph:

In [51]:
g=endpoint.peel(URIRef("file:///C:/Windows/explorer.exe/0001w"))
list(g.namespaces())
Out[51]:
[('xml', rdflib.term.URIRef('http://www.w3.org/XML/1998/namespace')),
 ('rdf', rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#')),
 ('rdfs', rdflib.term.URIRef('http://www.w3.org/2000/01/rdf-schema#')),
 ('xsd', rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#')),
 ('xmp', rdflib.term.URIRef('http://ns.adobe.com/xap/1.0/')),
 ('xap', rdflib.term.URIRef('http://ns.adobe.com/xap/1.0/')),
 ('photoshop', rdflib.term.URIRef('http://ns.adobe.com/photoshop/1.0/')),
 ('xmpMM', rdflib.term.URIRef('http://ns.adobe.com/xap/1.0/mm/')),
 ('exif', rdflib.term.URIRef('http://ns.adobe.com/exif/1.0/')),
 ('tiff', rdflib.term.URIRef('http://ns.adobe.com/tiff/1.0/')),
 ('stEvt',
  rdflib.term.URIRef('http://ns.adobe.com/xap/1.0/sType/ResourceEvent#')),
 ('xapMM', rdflib.term.URIRef('http://ns.adobe.com/xap/1.0/mm/')),
 ('dc', rdflib.term.URIRef('http://purl.org/dc/elements/1.1/')),
 ('o2', rdflib.term.URIRef('http://rdf.ontology2.com/metadata/'))]
time: 33.5 ms

The Graph object from rdflib is a little triple store, which accepts SPARQL queries the same way that OpenLink Virtuoso does. For instance, the following query confirms that the recorded end of the XMP packet is the start position plus the length in bytes:

In [52]:
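def scalar(result):
    # Return the single value of a one-row, one-column result as a Python object
    return list(result)[0][0].toPython()

scalar(g.query("""
   SELECT ?sizeMatches {
      ?s o2:xmpStart ?start .
      ?s o2:xmpEnd ?end .
      ?s o2:xmpSize ?size .
      # Each BIND needs a fresh variable; a bound variable cannot be rebound
      BIND (?end-?start AS ?computedSize)
      BIND (?size=?computedSize AS ?sizeMatches)
   }
"""))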
Out[52]:
True
time: 2.76 s
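
The same check can be done by hand with the three values this packet carries (they appear as o2:xmpStart, o2:xmpEnd and o2:xmpSize in the Turtle listing of the packet):

```python
# Values recorded for packet 1 of C:/Windows/explorer.exe
xmp_start = 3217844
xmp_end = 3232175
xmp_size = 14331

computed_size = xmp_end - xmp_start
print(computed_size, computed_size == xmp_size)
```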

This points out a way in which RDF and SPARQL are unique. With a standard data model and query language, we can use the same data and queries on a small scale (in memory), medium scale (disk based), and large scale (distributed cluster)! With our data in a graph we can process it directly, for instance, iterating over all triples in the graph in order to make a list of all URIs that appear in it:

In [53]:
uris=set()
for fact in g.triples((None,None,None)):
    for node in fact:
        if isinstance(node,URIRef):
            uris.add(node)

uris
Out[53]:
{rdflib.term.URIRef('file:///C:/Windows/explorer.exe/0001w'),
 rdflib.term.URIRef('http://ns.adobe.com/exif/1.0/ColorSpace'),
 rdflib.term.URIRef('http://ns.adobe.com/exif/1.0/PixelXDimension'),
 rdflib.term.URIRef('http://ns.adobe.com/exif/1.0/PixelYDimension'),
 rdflib.term.URIRef('http://ns.adobe.com/photoshop/1.0/ColorMode'),
 rdflib.term.URIRef('http://ns.adobe.com/tiff/1.0/Orientation'),
 rdflib.term.URIRef('http://ns.adobe.com/tiff/1.0/ResolutionUnit'),
 rdflib.term.URIRef('http://ns.adobe.com/tiff/1.0/XResolution'),
 rdflib.term.URIRef('http://ns.adobe.com/tiff/1.0/YResolution'),
 rdflib.term.URIRef('http://ns.adobe.com/xap/1.0/CreateDate'),
 rdflib.term.URIRef('http://ns.adobe.com/xap/1.0/CreatorTool'),
 rdflib.term.URIRef('http://ns.adobe.com/xap/1.0/MetadataDate'),
 rdflib.term.URIRef('http://ns.adobe.com/xap/1.0/ModifyDate'),
 rdflib.term.URIRef('http://ns.adobe.com/xap/1.0/mm/DocumentID'),
 rdflib.term.URIRef('http://ns.adobe.com/xap/1.0/mm/History'),
 rdflib.term.URIRef('http://ns.adobe.com/xap/1.0/mm/InstanceID'),
 rdflib.term.URIRef('http://ns.adobe.com/xap/1.0/mm/OriginalDocumentID'),
 rdflib.term.URIRef('http://ns.adobe.com/xap/1.0/sType/ResourceEvent#action'),
 rdflib.term.URIRef('http://ns.adobe.com/xap/1.0/sType/ResourceEvent#instanceID'),
 rdflib.term.URIRef('http://ns.adobe.com/xap/1.0/sType/ResourceEvent#softwareAgent'),
 rdflib.term.URIRef('http://ns.adobe.com/xap/1.0/sType/ResourceEvent#when'),
 rdflib.term.URIRef('http://purl.org/dc/elements/1.1/format'),
 rdflib.term.URIRef('http://rdf.ontology2.com/metadata/file'),
 rdflib.term.URIRef('http://rdf.ontology2.com/metadata/fileLength'),
 rdflib.term.URIRef('http://rdf.ontology2.com/metadata/packetNumber'),
 rdflib.term.URIRef('http://rdf.ontology2.com/metadata/rwFlag'),
 rdflib.term.URIRef('http://rdf.ontology2.com/metadata/xmpEnd'),
 rdflib.term.URIRef('http://rdf.ontology2.com/metadata/xmpSize'),
 rdflib.term.URIRef('http://rdf.ontology2.com/metadata/xmpStart'),
 rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#Seq'),
 rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#_1'),
 rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type')}
time: 6.5 ms
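
Once the URIs are in a plain Python set, ordinary string handling is enough to organize them. The sketch below groups a few of the URIs above by namespace, assuming the namespace ends at the last "#" or "/"; that rule is a simplification that happens to fit these vocabularies, and split_uri is a hypothetical helper.

```python
from collections import defaultdict

def split_uri(uri):
    """Split a URI into (namespace, local name) at the last '#' or '/'."""
    cut = max(uri.rfind("#"), uri.rfind("/")) + 1
    return uri[:cut], uri[cut:]

# A few of the URIs listed above
sample = [
    "http://ns.adobe.com/exif/1.0/ColorSpace",
    "http://ns.adobe.com/exif/1.0/PixelXDimension",
    "http://ns.adobe.com/xap/1.0/sType/ResourceEvent#action",
    "http://purl.org/dc/elements/1.1/format",
]

by_namespace = defaultdict(list)
for uri in sample:
    namespace, local_name = split_uri(uri)
    by_namespace[namespace].append(local_name)

for namespace in sorted(by_namespace):
    print(namespace, sorted(by_namespace[namespace]))
```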

Finally, we can write out the packet in Turtle format, which is somewhere in appearance between JSON and the RDF/XML packet. Turtle is the most popular way to write RDF data by hand today.

In [54]:
ttl(g)
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix exif: <http://ns.adobe.com/exif/1.0/> .
@prefix o2: <http://rdf.ontology2.com/metadata/> .
@prefix photoshop: <http://ns.adobe.com/photoshop/1.0/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix stEvt: <http://ns.adobe.com/xap/1.0/sType/ResourceEvent#> .
@prefix tiff: <http://ns.adobe.com/tiff/1.0/> .
@prefix xap: <http://ns.adobe.com/xap/1.0/> .
@prefix xapMM: <http://ns.adobe.com/xap/1.0/mm/> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xmp: <http://ns.adobe.com/xap/1.0/> .
@prefix xmpMM: <http://ns.adobe.com/xap/1.0/mm/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .


<file:///C:/Windows/explorer.exe/0001w> exif:ColorSpace "65535" ;
    exif:PixelXDimension "24" ;
    exif:PixelYDimension "24" ;
    photoshop:ColorMode "3" ;
    tiff:Orientation "1" ;
    tiff:ResolutionUnit "2" ;
    tiff:XResolution "720000/10000" ;
    tiff:YResolution "720000/10000" ;
    xap:CreateDate "2015-05-04T15:40:01-07:00" ;
    xap:CreatorTool "Adobe Photoshop CC 2014 (Windows)" ;
    xap:MetadataDate "2015-05-05T10:55:34-07:00" ;
    xap:ModifyDate "2015-05-05T10:55:34-07:00" ;
    xapMM:DocumentID "xmp.did:656c488a-1b92-8a4c-b879-5a834d7f5164" ;
    xapMM:History [ a rdf:Seq ;
            rdf:_1 [ stEvt:action "created" ;
                    stEvt:instanceID "xmp.iid:656c488a-1b92-8a4c-b879-5a834d7f5164" ;
                    stEvt:softwareAgent "Adobe Photoshop CC 2014 (Windows)" ;
                    stEvt:when "2015-05-04T15:40:01-07:00" ] ] ;
    xapMM:InstanceID "xmp.iid:656c488a-1b92-8a4c-b879-5a834d7f5164" ;
    xapMM:OriginalDocumentID "xmp.did:656c488a-1b92-8a4c-b879-5a834d7f5164" ;
    dc:format "image/png" ;
    o2:file "C:/Windows/explorer.exe" ;
    o2:fileLength 4847928 ;
    o2:packetNumber 1 ;
    o2:rwFlag "w" ;
    o2:xmpEnd 3232175 ;
    o2:xmpSize 14331 ;
    o2:xmpStart 3217844 .


time: 17 ms
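
The nesting of blank nodes in the Turtle above maps naturally onto nested JSON objects. As a rough sketch of that correspondence, the function below builds a nested dict from a handful of the packet's facts. It assumes triples are plain tuples, blank node labels start with "_:", and blank nodes form a tree with no cycles; to_nested is a hypothetical helper, not part of gastrodon or rdflib.

```python
import json

def to_nested(triples, subject):
    """Build a dict for `subject`, nesting blank-node objects recursively.

    A property with several values would keep only the last one here;
    a fuller version would collect them into a list.
    """
    record = {}
    for s, p, o in triples:
        if s != subject:
            continue
        if isinstance(o, str) and o.startswith("_:"):
            record[p] = to_nested(triples, o)
        else:
            record[p] = o
    return record

packet = "file:///C:/Windows/explorer.exe/0001w"
triples = [
    (packet, "dc:format", "image/png"),
    (packet, "xapMM:History", "_:b1"),
    ("_:b1", "rdf:_1", "_:b2"),
    ("_:b2", "stEvt:action", "created"),
    ("_:b2", "stEvt:when", "2015-05-04T15:40:01-07:00"),
]
nested = to_nested(triples, packet)
print(json.dumps(nested, indent=2))
```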

Conclusion

  • Common files such as ZIP files, Office Documents, PDF, Executable Programs and Libraries are wholly or in part a composition of smaller files that contain copious metadata that is normally invisible
  • XMP Metadata Packets, written in a first-generation dialect of RDF, are widespread in media files of many kinds
  • Although XMP was developed before the SPARQL Query Language, XMP packets can be copied into a Triple store and queried with next generation tools -- without any data transformation on import!
  • This Jupyter notebook is a development case for the Gastrodon library, which puts RDF data at your fingertips in the Jupyter environment.