Appendix A: Detailed testing results
Test 1: Redaction of embedded text
The aim of Test 1 was to determine whether remnants of redacted text could be found in PDF documents redacted with Adobe Acrobat Pro DC 2017.
A Microsoft Word document was created that contained a title and three lines of text. The last two lines represented sensitive information.
A corresponding PDF document was created using each of the rendering engines being examined: Adobe Acrobat, Adobe Acrobat Distiller, Microsoft Word, CutePDF and LibreOffice Writer. This produced five PDF documents.
The internal structures of the PDF documents were parsed with the PDF Stream Dumper tool. For each of the PDF documents, the objects within the file structures that contained embedded text were identified. For example, the embedded text object from the file generated using Adobe Acrobat is shown below with embedded text highlighted in green.
Examination of the embedded structural objects within each PDF document revealed that all rendering engines utilised font subsets to reduce file size. However, the engines used different mechanisms to map character codes to character selectors (glyphs).
Microsoft Word did not embed a CMap but instead used pre-defined WinAnsiEncoding. Adobe Acrobat Distiller only embedded a CMap for a single character mapping and for the remaining characters utilised pre-defined WinAnsiEncoding.
Adobe Acrobat embedded a custom ToUnicode CMap within the PDF document. The order of the mappings in the CMap followed Unicode order.
In contrast, the PDF documents created with CutePDF (which uses open source Ghostscript) and with open source LibreOffice Writer both embedded ToUnicode CMap objects in which the order of mappings reflected the order in which the characters first appeared in the text. The CMap object from the CutePDF document is shown.
Apart from facilitating the mapping of character codes to character selectors, the CMap created an artefact where the order of the mappings itself encoded a text string. The string contained the password from the PDF document which is highlighted in red below.
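The ordering artefact can be illustrated with a minimal Python sketch. It assumes an uncompressed ToUnicode CMap whose mappings are written as single `bfchar` entries. The sample CMap and the function name `cmap_order_string` are hypothetical and are not taken from the test files. Reading the Unicode targets in the order they are listed recovers the characters of the text in order of first appearance.

```python
import re

# Hypothetical ToUnicode CMap fragment (not taken from the test files).
# Character codes are assigned in order of first appearance in the text.
SAMPLE_CMAP = """
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
1 begincodespacerange
<00> <FF>
endcodespacerange
6 beginbfchar
<01> <0053>
<02> <0065>
<03> <0063>
<04> <0072>
<05> <0074>
<06> <0031>
endbfchar
endcmap
"""

BFCHAR_BLOCK = re.compile(r"beginbfchar(.*?)endbfchar", re.DOTALL)
BFCHAR_ENTRY = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def cmap_order_string(cmap_text: str) -> str:
    """Join the Unicode targets of bfchar entries in the order they are listed."""
    chars = []
    for block in BFCHAR_BLOCK.findall(cmap_text):
        for _code, target in BFCHAR_ENTRY.findall(block):
            # Targets are UTF-16BE code units written as hexadecimal.
            chars.append(bytes.fromhex(target).decode("utf-16-be"))
    return "".join(chars)


print(cmap_order_string(SAMPLE_CMAP))  # Secrt1
```

Because each character is mapped only once, a character that repeats in the text appears in the recovered string only at its first position. CMaps that use `bfrange` entries, or that sit in compressed streams, would need additional handling.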
The bottom two lines of text were then redacted from each PDF document with Adobe Acrobat.
The internal structures of the redacted PDF documents were parsed with the PDF Stream Dumper tool. In all cases, the redacted text was successfully removed from the embedded PDF text objects. For example, the object from the PDF document produced by Adobe Acrobat is shown below with embedded text highlighted in green.
However, examination of the CMap objects within the redacted PDF documents created by CutePDF or LibreOffice Writer revealed that remnants of redacted text remained. For example, the CMap object from the redacted PDF document that was generated with LibreOffice Writer is shown below with the data artefact that reveals the password string shown in red.
It appears that the redaction functionality of Adobe Acrobat does not identify artefacts of redacted text in CMap objects when the PDF document was generated by CutePDF or LibreOffice Writer. In this test case, the redacted password was fully recoverable.
Test 2: Redaction of text within an embedded image
The aim of Test 2 was to verify that an embedded image containing text within a PDF document was edited by the redaction process and not simply obscured.
A representation of the text from the PDF document in Test 1 was copied into an image file. The image file was in turn used to create five separate PDF documents with the rendering engines used in Test 1.
Each PDF document was analysed with the pdf2txt tool which extracts embedded text. In all cases, no extractable text was found. This was the expected result, given that all text was represented within an embedded image. For example, the result of running the pdf2txt tool against the PDF document generated by Adobe Acrobat is shown below. To verify this result, each PDF document was also parsed with the PDF Stream Dumper tool.
The pdfimages tool, which can detect and analyse embedded images, was run against the PDF documents. The tool successfully identified the embedded images. The output of the tool when run against the PDF document generated by Adobe Acrobat is shown below.
To test the redaction functionality, the last two lines of the PDF documents (representing sensitive information) were redacted using Adobe Acrobat. The redacted PDF documents were again analysed with the pdfimages tool, which revealed that the embedded image file had changed in size. For example, the image file in the redacted PDF document created with Adobe Acrobat had decreased from 22.9 KB to 21.1 KB, indicating it had been edited by the redaction process.
To verify that the sensitive information within the embedded image file had been successfully redacted, the embedded image was extracted from the PDF documents using the pdfimages tool and examined. For example, the embedded image file from the PDF document created with Adobe Acrobat is shown below. It had been edited by the redaction process to remove the sensitive information.
Test 3: Redaction of historical revisions of text
PDF documents store historical revisions of edited text. The aim of Test 3 was to verify that all historical revisions of sensitive text were removed by redaction with Adobe Acrobat.
The PDF documents from Test 1 were opened in Adobe Acrobat. In each case, one of the lines representing sensitive text was edited multiple times, making sure that each edit was saved.
The PDF documents were then parsed with the pdfwalker tool, which showed the revisions of each PDF document within its file structure. For example, the file generated by Adobe Acrobat is shown below.
Sensitive text was redacted using Adobe Acrobat and the PDF documents were again parsed with the pdfwalker tool. The output of the pdfwalker tool indicated that previous revisions of the sensitive text had been removed. For example, the output from parsing the PDF document generated by Adobe Acrobat is shown below. Note that the pdfwalker tool always shows ‘Revision 1’ as an artefact.
This result was verified by using the PDF Stream Dumper tool to identify the number of file objects in the redacted PDF documents. In the case of the PDF document generated with Adobe Acrobat, the number of file objects fell from 105 before redaction to 18 after redaction.
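A comparable object count can be approximated in Python by counting object headers in the raw bytes. This is a sketch under assumptions, not a description of how PDF Stream Dumper counts objects: the sample data and the name `count_object_definitions` are hypothetical, and objects stored inside compressed object streams would not be seen.

```python
import re

# Matches an indirect object header such as "12 0 obj".
OBJ_HEADER = re.compile(rb"(?<![\d.])(\d+)\s+(\d+)\s+obj\b")

# Hypothetical file bodies standing in for a document before and after
# redaction.
BEFORE = (
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages >>\nendobj\n"
    b"3 0 obj\n(some text)\nendobj\n"
)
AFTER = b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"


def count_object_definitions(pdf_bytes: bytes) -> int:
    """Count indirect object headers found in the raw file bytes."""
    return len(OBJ_HEADER.findall(pdf_bytes))


print(count_object_definitions(BEFORE), count_object_definitions(AFTER))  # 3 1
```

A file that has been saved several times defines some object numbers more than once, so the count reflects definitions rather than distinct objects.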
Use of the PDF Stream Dumper tool to examine the embedded text objects showed that in every case historical revisions of sensitive text were removed by redaction. In contrast, the data remnant within CMap objects remained for PDF documents generated with CutePDF or LibreOffice Writer, as was found in Test 1.
Test 4: Redaction of text within a PDF form
Using Adobe Acrobat, text was entered into two PDF form fields.
The PDF documents were subsequently analysed with the pdfwalker tool. Text was found in three objects, shown below. Two of the objects corresponded to the two form fields. The third object containing text data corresponded to the form dictionary.
The embedded text within sections of the form dictionary is shown below. Embedded text is highlighted in green.
For the first test, both the form fields in the PDF document were fully redacted using Adobe Acrobat.
For the second test, the form fields in the PDF document were partially redacted using Adobe Acrobat.
The redacted PDF documents were then parsed with the pdfwalker tool. In both cases, the form objects were deleted leaving only the object that had contained the form dictionary. The output from parsing the partially redacted PDF document is shown below.
The contents of the remaining form dictionary objects were further analysed with the PDF Stream Dumper tool and no text remnants were found. For example, the content of the form dictionary from the partially redacted PDF document is shown below.
The result was the same for the PDF document where the form fields were fully redacted.
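A check for leftover form field values of the kind performed in Test 4 can be sketched in Python. The sketch assumes field values are stored as literal strings under the `/V` key in uncompressed objects. The `BEFORE` and `AFTER` samples are hypothetical idealisations, not output from the test files, and `field_values` is an illustrative name.

```python
import re

# Matches a literal-string field value such as "/V (some text)".
FIELD_VALUE = re.compile(rb"/V\s*\(([^)]*)\)")

# Hypothetical form with two text fields and a form dictionary.
BEFORE = (
    b"4 0 obj\n<< /FT /Tx /T (Field1) /V (first entry) >>\nendobj\n"
    b"5 0 obj\n<< /FT /Tx /T (Field2) /V (second entry) >>\nendobj\n"
    b"6 0 obj\n<< /Fields [4 0 R 5 0 R] >>\nendobj\n"
)
# Hypothetical redacted form in which only the form dictionary remains.
AFTER = b"6 0 obj\n<< /Fields [] >>\nendobj\n"


def field_values(pdf_bytes: bytes) -> list[str]:
    """Return the literal-string values found under /V keys."""
    return [value.decode("latin-1") for value in FIELD_VALUE.findall(pdf_bytes)]


print(field_values(BEFORE))  # ['first entry', 'second entry']
print(field_values(AFTER))   # []
```

Values written as hexadecimal strings, values containing escaped parentheses, and appearance streams that also render the field text are outside the scope of this sketch.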
Test 5: Embedded text obscured with an image
On occasion, attempts at redaction have failed because the underlying text was merely obscured by covering it with another layer, such as a blackened rectangle or image.
Starting with the Microsoft Word file used in Test 1, text was obscured by inserting an overlying image file as shown below. A PDF document was then generated using each of the five rendering engines.
The pdf2txt tool showed that the underlying text remained within the PDF documents despite being obscured with an overlying image. For example, the output of the pdf2txt tool for the PDF document generated with CutePDF is shown below.
This demonstrated that obscuring information in a PDF document using an image is not an effective way to redact information.
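The reason is visible in the page content itself, as the following Python sketch suggests. It assumes a decompressed content stream in which text is drawn with the `Tj` operator and the covering image with the `Do` operator. The stream shown is hypothetical, and the function names are illustrative.

```python
import re

# Hypothetical decompressed page content: a line of text, then an image
# painted over the same area.
CONTENT_STREAM = (
    b"BT /F1 12 Tf 72 700 Td (Sensitive line) Tj ET\n"
    b"q 200 0 0 20 72 695 cm /Im1 Do Q\n"
)

SHOW_TEXT = re.compile(rb"\(([^)]*)\)\s*Tj")   # literal string shown with Tj
DRAW_XOBJECT = re.compile(rb"/(\w+)\s+Do\b")   # named XObject painted with Do


def shown_text(stream: bytes) -> list[str]:
    """Literal strings passed to the Tj text-showing operator."""
    return [text.decode("latin-1") for text in SHOW_TEXT.findall(stream)]


def drawn_xobjects(stream: bytes) -> list[str]:
    """Names of XObjects (such as images) painted with the Do operator."""
    return [name.decode("ascii") for name in DRAW_XOBJECT.findall(stream)]


print(shown_text(CONTENT_STREAM))      # ['Sensitive line']
print(drawn_xobjects(CONTENT_STREAM))  # ['Im1']
```

The image is simply painted after the text, so the text operators remain in the stream. Real content streams are usually compressed and may use `TJ` arrays or font-specific character codes, which tools such as pdf2txt handle.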
Test 6: Redacting encrypted PDF documents
PDF documents from previous tests were encrypted with Adobe Acrobat, redacted and then parsed with PDF analysis tools. The aim of this test was to determine whether encryption changed the underlying file structure of PDF documents and, if so, whether this affected the effectiveness of redaction.
For all previous test cases, the results were similar and were unaffected by starting with an encrypted PDF document. For PDF documents created with CutePDF or LibreOffice Writer, as found previously, remnants of redacted text were left in the ToUnicode CMap objects. No other remnants of redacted data were found.
Test 7: Sanitising PDF documents
Adobe Acrobat offers a sanitisation feature (i.e. ‘Remove Hidden Information’) that removes hidden data, for example metadata that might identify the author of a document.
Prior to sanitisation, the PDF documents contained objects with embedded metadata. The PDF documents were each parsed with the PDF Stream Dumper tool and the metadata analysed. The metadata from the PDF document produced by CutePDF is shown below. Useful information is highlighted in green and includes the rendering engine (CutePDF or Ghostscript), the software used to access the rendering engine (Microsoft Word) and the author’s user account (IEUser).
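Metadata of this kind can be pulled from an uncompressed document information dictionary with a short Python sketch. The sample object and its values are hypothetical placeholders, not the values found in the test files, and `info_fields` is an illustrative name. The sketch assumes the entries are literal strings without escaped parentheses.

```python
import re

# Hypothetical document information dictionary with placeholder values.
SAMPLE_INFO = (
    b"7 0 obj\n"
    b"<< /Producer (Example PDF Engine 1.0)\n"
    b"   /Creator (Example Word Processor)\n"
    b"   /Author (exampleuser)\n"
    b"   /CreationDate (D:20170101000000) >>\n"
    b"endobj\n"
)

INFO_KEYS = ("Producer", "Creator", "Author", "Title")


def info_fields(pdf_bytes: bytes) -> dict[str, str]:
    """Return the literal-string values of selected information keys."""
    found = {}
    for key in INFO_KEYS:
        pattern = rb"/" + key.encode("ascii") + rb"\s*\(([^)]*)\)"
        match = re.search(pattern, pdf_bytes)
        if match:
            found[key] = match.group(1).decode("latin-1")
    return found


print(info_fields(SAMPLE_INFO))
```

Running the same function on a sanitised file would be expected to return an empty dictionary if the metadata object has been deleted. Metadata held in an XMP stream, or in compressed objects, is not covered by this sketch.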
The metadata is also available from the PDF document’s properties within a PDF reader. The metadata from the PDF document produced by CutePDF and read by Adobe Acrobat Reader is shown below.
To check the efficacy of the sanitisation feature, the PDF documents were sanitised and parsed with the pdfwalker tool. In all test cases, sanitisation deleted the object containing metadata. For example, the file structure from the PDF document produced by CutePDF before (left) and after (right) sanitisation is shown below. The object containing metadata is indicated.
The successful sanitisation of metadata was also confirmed by checking the PDF document’s properties with Adobe Acrobat Reader as shown below. Empty metadata fields are highlighted.
This test also investigated whether sanitising a PDF document would affect the CMap data remnants identified in previous tests. The PDF documents that were rendered with CutePDF and LibreOffice Writer were sanitised and the resultant PDF documents were parsed with the PDF Stream Dumper tool. In both cases, this had no effect on the CMap remnants in the redacted PDF documents.
Finally, this test investigated whether sanitising a PDF document would remove multiple historical revisions of text, as seen previously in Test 3. When the PDF documents were sanitised and parsed with the pdfwalker tool, it was evident that historical revisions of text had been removed. The structure of the sanitised PDF document is shown below.