There have been numerous cases of security breaches resulting from a failure to effectively redact sensitive or private information from documents prior to release into the public domain. To assist in mitigating this security risk, Adobe Acrobat Pro DC 2017 provides redaction and sanitisation functionality that aims to completely remove undesirable information and other hidden information (e.g. metadata) from PDF documents.
Scope of testing
The Defence Signals Directorate (DSD) previously examined the redaction functionality in Adobe Acrobat Pro 10 in 2011 to determine if any redacted information could be recovered from PDF documents. The current round of testing by the Australian Cyber Security Centre aimed to examine the same functionality previously tested but in Adobe Acrobat Pro DC 2017.
For the purposes of this document, the definition of successful redaction was the complete removal of redacted data from every location in a PDF document’s file structure.
As part of testing, a number of test cases were considered that represented some of the different ways that information could be stored within a PDF document. This included:
- embedded text
- embedded image
- data from historical editing
- interactive form
- embedded text obscured by an embedded image
- embedded text in an encrypted PDF
- embedded metadata.
Unless otherwise stated, the following application versions were used:
- Adobe Acrobat Pro DC 2017 (2017.012.20093) which installs Adobe PDF Library 15 and Adobe Acrobat Distiller 17
- Microsoft Word 2010 (14.0.6023.1)
- LibreOffice Writer 126.96.36.199
- CutePDF Writer 3.2 which installs Ghostscript 8.15.
The Calibri font was used in Microsoft Office documents in Microsoft Windows. This font was installed in Ubuntu Linux so that test files could be opened in LibreOffice Writer.
PDF documents were generated using each of the below rendering engines:
- Adobe Acrobat (Using the ‘Create PDF’ Microsoft Word add-in or native PDF authoring within Adobe Acrobat, both of which use the Adobe PDF Library)
- Adobe Acrobat Distiller (Printing to the Adobe PDF printer)
- Microsoft Word (Using ‘Save As’ PDF functionality)
- CutePDF (Printing to the CutePDF Writer printer)
- LibreOffice Writer (Using the ‘Export As PDF’ functionality).
In some cases, not all rendering engines were tested as not all possessed the necessary functionality.
For the purposes of testing, the application used to create the PDF documents refers to the rendering engine that did the PDF conversion. For example, using the ‘Create PDF’ Microsoft Word add-in installed by Adobe Acrobat, the PDF conversion is performed by the Adobe Acrobat rendering engine (Adobe PDF Library). Similarly, when printing to the Adobe PDF printer installed by Adobe Acrobat within Microsoft Word, the file conversion is performed by the Adobe Acrobat Distiller rendering engine. Only choosing to save the file by selecting the ‘Save As’ PDF option in Microsoft Word results in a PDF being rendered by Microsoft Word.
The previous testing conducted by DSD in 2011 used a single rendering engine to create PDF documents. In the current round of testing, PDF documents using different rendering engines were used to determine whether the source of the PDF document had any impact upon successful redaction.
Depending on the rendering engine used, PDF documents were generated using different versions of the PDF standard:
- Adobe Acrobat, via the Adobe PDF Library (PDF version 1.5)
- Adobe Acrobat Distiller (PDF version 1.5)
- Microsoft Word (PDF version 1.5)
- CutePDF (PDF version 1.4)
- LibreOffice Writer (PDF version 1.4).
The functions of Adobe Acrobat that were applied during testing also produced PDF documents in different versions of the PDF standard:
- interactive form (PDF version 1.6)
- encryption (PDF version 1.6)
- sanitisation (PDF version 1.6)
- redaction (PDF version 1.7).
The redacted PDF documents were analysed with free or open source tools to determine whether any redacted information could be recovered:
- PDFMiner toolkit written by Yusuke Shinyama (pdf2txt)
- Poppler toolkit (pdfimages)
- Origami toolkit (pdfwalker)
- PDF Stream Dumper 9.3 by David Zimmer.
Testing results and recommendations
Successful redaction outcomes
No redacted information was recovered from PDF documents created with Adobe Acrobat, Adobe Acrobat Distiller and Microsoft Word. This result is similar to that of previous testing conducted by DSD in 2011 which examined the redaction functionality of Adobe Acrobat Pro 10.
Failures in redacting information
Remnants of redacted information were recovered from PDF documents created with CutePDF and LibreOffice Writer.
The remnants of redacted information were located within objects containing embedded font maps (CMap objects). The ability to recover these data remnants was the result of differences in the mechanisms used by CutePDF and LibreOffice Writer to embed font maps, and the Adobe Acrobat redaction functionality’s inability to identify and remove them. The Adobe Acrobat sanitisation functionality also failed to remove these data remnants.
It is not known whether PDF documents created by rendering engines not tested during this round of testing will also fail to be successfully redacted. Until this is known, assurance that data remnants cannot be recovered from redacted PDF documents requires that the creation of PDF documents be restricted to Adobe Acrobat, Adobe Acrobat Distiller and Microsoft Word.
To assist in identifying the software used to create a PDF document, the metadata can be examined via the document’s properties in Adobe Acrobat or Adobe Acrobat Reader. If the PDF document was created with Adobe Acrobat, Adobe Acrobat Distiller or Microsoft Word, the PDF Producer field will contain ‘Adobe PDF Library’, ‘Acrobat Distiller’ or ‘Microsoft Word’ respectively. If the PDF Producer field contains something else, there is a chance that redaction of sensitive or private information might fail. Note that if the PDF document had been previously sanitised, the metadata would have been deleted and the PDF Producer field will be empty. In these cases it should be treated as if it was created by a rendering engine that cannot be successfully redacted by Adobe Acrobat.
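The metadata check described above can be sketched as a simple test on the PDF Producer value. This is a minimal sketch rather than a definitive implementation: the function name `producer_is_trusted` is hypothetical, the value is assumed to have already been read from the document's properties, and stripping the registered-trademark sign is an assumption made because some versions of Microsoft Word are understood to include it in the field.

```python
# Producer substrings that the text associates with successful redaction.
TRUSTED_PRODUCERS = ("adobe pdf library", "acrobat distiller", "microsoft word")


def producer_is_trusted(producer):
    """Return True only if the Producer value names a tested, trusted engine.

    An empty or missing value (e.g. after sanitisation) is treated as
    untrusted, as the text recommends.
    """
    if not producer:
        return False
    # Assumption: drop the registered-trademark sign so that a value such as
    # "Microsoft(R) Word 2010" still matches "Microsoft Word".
    normalised = producer.replace("\u00ae", "").lower()
    return any(name in normalised for name in TRUSTED_PRODUCERS)


for value in ("Adobe PDF Library 15.0", "GPL Ghostscript 8.15", "", None):
    print(repr(value), "->", producer_is_trusted(value))
```

Treating an empty field as untrusted keeps the check conservative: a document whose origin cannot be established is handled as if redaction might fail.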
Detailed testing results
For a full breakdown and discussion of the testing results see Appendix A and B respectively.
When a requirement exists to redact sensitive or private information from PDF documents before releasing them into the public domain or to other third parties, organisations should:
- verify the original PDF document was created using Adobe Acrobat, Adobe Acrobat Distiller or Microsoft Word by checking the metadata of the PDF document
- perform redaction and sanitisation of the document using Adobe Acrobat Pro DC 2017.
The Australian Government Information Security Manual (ISM) assists in the protection of information that is processed, stored or communicated by organisations’ systems. It can be found at https://www.cyber.gov.au/acsc/view-all-content/ism.
The Strategies to Mitigate Cyber Security Incidents complements the advice in the ISM. The complete list of strategies can be found at https://www.cyber.gov.au/acsc/view-all-content/publications/strategies-mitigate-cyber-security-incidents.
A guide to redacting sensitive information from PDF documents, including step-by-step instructions, is available from Adobe at https://helpx.adobe.com/acrobat/using/removing-sensitive-content-pdfs.html.
The latest PDF standard (ISO 32000-2:2017) is available for purchase from the International Organization for Standardization (ISO) at https://www.iso.org/standard/63534.html.
If you have any questions regarding this guidance you can contact us via 1300 CYBER1 (1300 292 371) or https://www.cyber.gov.au/acsc/contact.
Appendix A: Detailed testing results
Test 1: Redaction of embedded text
The aim of Test 1 was to determine whether remnants of redacted text could be found in PDF documents redacted with Adobe Acrobat Pro DC 2017.
A Microsoft Word document was created that contained a title and three lines of text. The last two lines represented sensitive information.
A corresponding PDF document was created using each of the rendering engines being examined: Adobe Acrobat, Adobe Acrobat Distiller, Microsoft Word, CutePDF and LibreOffice Writer. This represented five PDF documents.
The internal structures of the PDF documents were parsed with the PDF Stream Dumper tool. For each of the PDF documents, the objects within the file structures that contained embedded text were identified. For example, the embedded text object from the file generated using Adobe Acrobat is shown below with embedded text highlighted in green.
Examination of the embedded structural objects within each PDF document revealed that all rendering engines utilised font subsets to reduce file size. However, in regard to the mapping of character codes to character selectors (glyphs), different engines used alternate mechanisms.
Microsoft Word did not embed a CMap but instead used pre-defined WinAnsiEncoding. Adobe Acrobat Distiller only embedded a CMap for a single character mapping and for the remaining characters utilised pre-defined WinAnsiEncoding.
Adobe Acrobat embedded a custom ToUnicode CMap within the PDF document. The order of the mappings in the CMap reflected that in Unicode.
In contrast, the PDF documents created with CutePDF (using open source Ghostscript), or open source LibreOffice Writer, both embedded ToUnicode CMap objects where the order of mappings reflected the order that the characters first appeared in text. The CMap object from the former is shown.
Apart from facilitating the mapping of character codes to character selectors, the CMap created an artefact where the order of the mappings itself encoded a text string. The string contained the password from the PDF document which is highlighted in red below.
The bottom two lines of text were then redacted from each PDF document with Adobe Acrobat.
The internal structures of the redacted PDF documents were parsed with the PDF Stream Dumper tool. In all cases, the redacted text was successfully removed from the embedded PDF text objects. For example, the object from the PDF document produced by Adobe Acrobat is shown below with embedded text highlighted in green.
However, examination of the CMap objects within the redacted PDF documents created by CutePDF or LibreOffice Writer revealed that remnants of redacted text remained. For example, the CMap object from the redacted PDF document that was generated with LibreOffice Writer is shown below with the data artefact that reveals the password string shown in red.
It appears that the redaction functionality of Adobe Acrobat does not identify artefacts of redacted text in CMap objects when the PDF document was generated by CutePDF or LibreOffice Writer. In this test case, the redacted password was fully recoverable.
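The recovery described in Test 1 can be illustrated with a short sketch that reads the Unicode values of a ToUnicode CMap in the order they are listed. The CMap fragment, the string "Key: x7Q" and the function name `read_cmap_order` are hypothetical and are not taken from the test files. Only `bfchar` entries with four-digit Unicode values are handled, and the CMap stream is assumed to have already been decompressed.

```python
import re

# One bfchar entry: <character code> <four-hex-digit Unicode value>.
BFCHAR_ENTRY = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]{4})>")

# Hypothetical CMap fragment whose mappings follow first-appearance order.
SAMPLE_CMAP = """
8 beginbfchar
<01> <004B>
<02> <0065>
<03> <0079>
<04> <003A>
<05> <0020>
<06> <0078>
<07> <0037>
<08> <0051>
endbfchar
"""


def read_cmap_order(cmap_text):
    """Return the mapped Unicode characters in the order they are listed."""
    chars = []
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for _code, unicode_hex in BFCHAR_ENTRY.findall(block):
            chars.append(chr(int(unicode_hex, 16)))
    return "".join(chars)


print(read_cmap_order(SAMPLE_CMAP))
```

Because every character in the hypothetical string occurs only once, the listing order reproduces it in full. Text with repeated characters would be only partially recoverable, since each character is mapped once.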
Test 2: Redaction of text within an embedded image
The aim of Test 2 was to verify that an embedded image containing text within a PDF document was edited by the redaction process and not simply obscured.
A representation of the text from the PDF document in Test 1 was copied into an image file. The image file was in turn used to create five separate PDF documents with the rendering engines used in Test 1.
Each PDF document was analysed with the pdf2txt tool which extracts embedded text. In all cases, no extractable text was found. This was the expected result, given that all text was represented within an embedded image. For example, the result of running the pdf2txt tool against the PDF document generated by Adobe Acrobat is shown below. To verify this result, each PDF document was also parsed with the PDF Stream Dumper tool.
The pdfimages tool, which can detect and analyse embedded images, was run against the PDF documents. The tool successfully identified the embedded images. The output of the tool when run against the PDF document generated by Adobe Acrobat is shown below.
To test the redaction functionality, the last two lines of the PDF documents (representing sensitive information) were redacted using Adobe Acrobat. The redacted PDF documents were again analysed with the pdfimages tool, which revealed that the embedded image file had changed in size. For example, the image file in the redacted PDF document created with Adobe Acrobat had decreased from 22.9 KB to 21.1 KB, indicating it had been edited by the redaction process.
To verify that the sensitive information within the embedded image file had been successfully redacted, the embedded image was extracted from the PDF documents using the pdfimages tool and examined. For example, the embedded image file from the PDF document created with Adobe Acrobat is shown below. It had been edited by the redaction process to remove the sensitive information.
Test 3: Redaction of historical revisions of text
PDF documents store historical revisions of edited text. The aim of Test 3 was to verify that all historical revisions of sensitive text were removed by redaction with Adobe Acrobat.
The PDF documents from Test 1 were opened in Adobe Acrobat. In each case, one of the lines representing sensitive text was edited multiple times, making sure that each edit was saved.
The PDF documents were then parsed with the pdfwalker tool with the revisions of the PDF documents being reflected within the file structures. For example, the file generated by Adobe Acrobat is shown below.
Sensitive text was redacted using Adobe Acrobat and the PDF documents were again parsed with the pdfwalker tool. The output of the pdfwalker tool indicated that previous revisions of the sensitive text had been removed. For example, the output from parsing the PDF document generated by Adobe Acrobat is shown below. Note the pdfwalker tool always has ‘Revision 1’ as an artefact.
This result was verified by using the PDF Stream Dumper tool to identify the number of file objects in the redacted PDF documents. In the case of the PDF document generated with Adobe Acrobat, the number of file objects had been reduced from 105 to 18, before and after redaction respectively.
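A rough way to reproduce this kind of object count without a dedicated tool is to scan the raw file for indirect object headers. The sketch below is illustrative only and is not how PDF Stream Dumper works: the function name `count_object_headers` and the sample bytes are hypothetical, objects stored inside compressed object streams are not counted, and byte sequences inside binary streams could be matched by accident.

```python
import re

# Matches an indirect object header such as "12 0 obj".
OBJECT_HEADER = re.compile(rb"(?<!\d)\d+\s+\d+\s+obj\b")


def count_object_headers(pdf_bytes):
    """Count indirect object headers found in the raw bytes of a PDF file."""
    return len(OBJECT_HEADER.findall(pdf_bytes))


# Hypothetical two-object fragment; a real file would be read with
# open(path, "rb").read() before and after redaction.
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages >>\nendobj\n"
)
print(count_object_headers(sample))
```

Comparing the count before and after redaction gives a quick indication of whether earlier revisions were discarded, although it does not show what the remaining objects contain.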
Use of the PDF Stream Dumper tool to examine the embedded text objects showed that in every case historical revisions of sensitive text were removed by redaction. In contrast, the data remnant within CMap objects remained for PDF documents generated with CutePDF or LibreOffice Writer, as was found in Test 1.
Test 4: Redaction of text within a PDF form
Using Adobe Acrobat, text was entered into two PDF form fields.
The PDF documents were subsequently analysed with the pdfwalker tool. Text was found in the three objects shown below. Two of the objects corresponded to the two form fields. The third object with text data corresponded to the form dictionary.
The embedded text within sections of the form dictionary is shown below. Embedded text is highlighted in green.
For the first test, both the form fields in the PDF document were fully redacted using Adobe Acrobat.
For the second test, the form fields in the PDF document were partially redacted using Adobe Acrobat.
The redacted PDF documents were then parsed with the pdfwalker tool. In both cases, the form objects were deleted leaving only the object that had contained the form dictionary. The output from parsing the partially redacted PDF document is shown below.
The contents of the remaining form dictionary objects were further analysed with the PDF Stream Dumper tool and no text remnants were found. For example, the content of the form dictionary from the partially redacted PDF document is shown below.
The result was the same for the PDF document where the form fields were fully redacted.
Test 5: Embedded text obscured with an image
On occasion, attempts at redaction have failed when underlying text was merely obscured by covering it with another layer in the form of a blackened rectangle or image.
Starting with the Microsoft Word file used in Test 1, text was obscured by inserting an overlying image file as shown below. A PDF document was then generated using each of the five rendering engines.
Using the pdf2txt tool, it was identified that the underlying text remained within the PDF documents despite being obscured with an overlying image. For example, the output of the pdf2txt tool for the PDF document generated with CutePDF is shown below.
This demonstrated that obscuring information in a PDF document using an image is not an effective way to redact information.
Test 6: Redacting encrypted PDF documents
PDF documents from previous tests were encrypted with Adobe Acrobat, redacted and then parsed with PDF analysis tools. The aim of this test was to verify whether encryption changed the underlying file structure of PDF documents and, if so, whether this affected the effectiveness of redaction.
For all previous test cases, the results were similar and were unaffected by starting with an encrypted PDF document. For PDF documents created with CutePDF or LibreOffice Writer, as found previously, remnants of redacted text were left in the ToUnicode CMap objects. No other remnants of redacted data were found.
Test 7: Sanitising PDF documents
Adobe Acrobat offers a sanitisation feature (i.e. ‘Remove Hidden Information’) that removes hidden data, such as metadata that might identify the author of a document.
Prior to sanitisation, the PDF documents contained objects with embedded metadata. The PDF documents were each parsed with the PDF Stream Dumper tool and the metadata analysed. The metadata from the PDF document produced by CutePDF is shown below. Useful information is highlighted in green and includes the rendering engine (CutePDF or Ghostscript), the software used to access the rendering engine (Microsoft Word) and the author’s user account (IEUser).
The metadata is also available from the PDF document’s properties within a PDF reader. The metadata from the PDF document produced by CutePDF and read by Adobe Acrobat Reader is shown below.
To check the efficacy of the sanitisation feature, the PDF documents were sanitised and parsed with the pdfwalker tool. In all test cases, sanitisation deleted the object containing metadata. For example, the file structure from the PDF document produced by CutePDF before (left) and after (right) sanitisation is shown below. The object containing metadata is indicated.
The successful sanitisation of metadata was also confirmed by checking the PDF document’s properties with Adobe Acrobat Reader as shown below. Empty metadata fields are highlighted.
This test also investigated whether sanitising a PDF document would affect the CMap data remnants identified in previous tests. The PDF documents that were rendered with CutePDF and LibreOffice Writer were sanitised and the resultant PDF documents were parsed with the PDF Stream Dumper tool. In both cases, this had no effect on the CMap remnants in the redacted PDF documents.
Furthermore, this test also investigated whether sanitising a PDF document would remove multiple historical revisions of text as seen previously in Test 3. When the PDF documents were sanitised and parsed with the pdfwalker tool, it was evident that historical revisions of text had been removed. The structure of the sanitised PDF document is shown below.
Appendix B: Discussion of testing results
It was demonstrated that there was a difference between the CMap objects generated by the different rendering engines. Microsoft Word did not embed a custom CMap and Adobe Acrobat Distiller only did so for a single character.
Adobe Acrobat, CutePDF and LibreOffice Writer all embedded custom CMap objects. Adobe Acrobat did not customise the order of character code to character selector mappings and instead used the order of mappings as they appear in Unicode. In contrast, CutePDF using open source Ghostscript and LibreOffice Writer both customised the order of mappings so that it reflected the order that characters first appeared in text. Thus, the order of mappings in the CMap objects created an encoding mechanism from which meaningful data could be extracted.
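The difference between the two ordering strategies can be sketched as follows. This is a simplified model and not the code of any rendering engine: the function names and the sample string are hypothetical, and a CMap is reduced to the ordered list of characters it maps.

```python
def first_appearance_order(text):
    """Order mappings by where each character first occurs in the text."""
    # dict.fromkeys keeps insertion order and drops repeated characters.
    return list(dict.fromkeys(text))


def unicode_order(text):
    """Order mappings by Unicode code point, independent of the text."""
    return sorted(set(text))


sample = "pw: Zebra42"  # hypothetical sensitive line
print("".join(first_appearance_order(sample)))
print("".join(unicode_order(sample)))
```

Both lists hold the same set of characters. Only the first-appearance ordering preserves the sequence in which they occurred, which is what allows meaningful data to be read back from the CMap.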
Within CMap data structures, mappings are created the first time a character appears in a per-font context. The CMap will remain unless all characters of a particular font are deleted from the PDF document. If a single character remains, the CMap will remain. Thus, if redaction removes all characters that are mapped within a CMap, it can be expected that the CMap will be deleted and redaction will be successful. This remains to be tested. In this case, the analysis only partially redacted the text and this left the CMap objects in place.
PDF documents created with Adobe Acrobat, Adobe Acrobat Distiller or Microsoft Word were able to be successfully redacted by Adobe Acrobat. Parsing of PDF documents failed to identify any remnants of redacted data. This is a result of the fact that these rendering engines either did not use embedded CMap objects or if they did, the order of mappings did not reflect the order that characters first appeared in text. In contrast, PDF documents created with CutePDF or LibreOffice Writer were not successfully redacted. Parsing of these PDF documents found remnants of redacted data within CMap data structures.
Data remnants that were found in redacted PDF documents were the result of:
- the rendering engine creating CMap objects in which the order of mappings was determined by the order that characters first appeared in text
- redaction failing to reorder the mappings within CMap objects
- redaction failing to delete orphaned mappings for characters that no longer existed
- CMap objects remaining because not all text of a particular font was redacted.
This represents a security vulnerability which occurs if the following pre-conditions are met:
- the PDF document was rendered by CutePDF (Ghostscript) or LibreOffice Writer
- Adobe Acrobat was used to redact the PDF document.
The PDF standard was checked for a requirement in regard to input characters and the order of mappings within a CMap object but none was found. Neither was a requirement found within the Adobe CMap and CIDFont Files Specification. Adobe has released a developer technical note that specifies that the order of character codes in a CMap must be in increasing byte order but not how the order of mappings relates to input characters.
If the relationship between input characters and the order of mappings within a CMap is arbitrary, it explains the results in this document and means that custom mechanisms are not prohibited by any specification. It might also mean that the redaction software should be responsible for identifying and removing artefacts of redacted data from a CMap object. In this regard, the Application Software Extended Package for Redaction Tools protection profile, published by the National Information Assurance Partnership (NIAP), is ambiguous. There is a requirement for the target of evaluation to remove all references and indicators in the structural data to objects that are ‘completely redacted’. If text is partially redacted on a per-font basis, this could mean that data remnants in a CMap would be allowed. The protection profile has not been assigned to any software products.
Since CMap objects are created on a per-font basis, the likelihood of recovering remnants increases the closer the redacted text is located to the first incidence of a character from a particular font. In addition, it was observed during testing that mappings within CMaps were unique, although there is no requirement that this be the case. As a result, no matter how much previous text exists on a per-font basis, the likelihood of recovering remnants increases if the redacted text is composed of unique characters that occur for the first time. The likelihood of recovering remnants also increases as the amount of text in the PDF document decreases.
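The effect of earlier text on what can be recovered can be sketched under the same simplified model, in which a single font maps each character once, in order of first appearance. The function name `recoverable_remnant` and the sample strings are hypothetical.

```python
def recoverable_remnant(preceding_text, redacted_text):
    """Return the characters that the redacted text adds to the CMap.

    Characters already mapped by earlier text in the same font create no
    new mapping, so they leave no trace in the mapping order.
    """
    seen = set(preceding_text)
    remnant = []
    for ch in redacted_text:
        if ch not in seen:
            seen.add(ch)
            remnant.append(ch)
    return "".join(remnant)


# A passphrase made of characters not used earlier survives intact.
print(repr(recoverable_remnant("Quarterly report", "Xq#9z")))
# A word built only from characters used earlier leaves nothing.
print(repr(recoverable_remnant("Quarterly report", "letter")))
```

The remnant shrinks as the preceding text covers more of the redacted text's characters, which matches the observation that short documents and unusual character sequences are the most exposed.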
It is simple to demonstrate this security vulnerability in a test environment. In real-world PDF documents, however, the likelihood of recovering redacted text from CMap objects would vary. Real-world examples that could be vulnerable to the recovery of remnants might include:
- redacted text occurring at the beginning of a paragraph where a font is used for the first time (e.g. at the beginning of information quoted from another source which is highlighted via use of a different font)
- redacted text that is a password, key or passphrase composed of unusual characters that occur nowhere else within the PDF document
- a small PDF document with very little text.
This security vulnerability could be mitigated if Adobe Acrobat’s redaction functionality:
- randomised the order of mappings in CMap objects or used a pre-existing order such as that found in Unicode
- parsed CMap objects and deleted orphaned mappings.
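The second mitigation listed above, deleting orphaned mappings, can be sketched as follows. This illustrates the idea only and does not describe how Adobe Acrobat behaves: the function name `prune_orphaned_mappings`, the mapping and the list of character codes still in use are hypothetical, and the CMap is modelled as a dictionary from character code to Unicode character.

```python
def prune_orphaned_mappings(cmap, codes_still_used):
    """Drop mappings whose character code no longer occurs in the document."""
    used = set(codes_still_used)
    return {code: char for code, char in sorted(cmap.items()) if code in used}


# Hypothetical CMap for the text "Hi pw:x9", with codes assigned in
# order of first appearance.
cmap = {1: "H", 2: "i", 3: " ", 4: "p", 5: "w", 6: ":", 7: "x", 8: "9"}

# After redacting "pw:x9", only the codes for "Hi " remain in the content.
remaining_codes = [1, 2, 3]

pruned = prune_orphaned_mappings(cmap, remaining_codes)
print("".join(pruned.values()))
```

Once the orphaned entries are removed, the mapping order reveals only characters that are still visible in the document.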
This security vulnerability could also be mitigated if the CutePDF and LibreOffice Writer rendering engines changed the order of mappings in CMap objects so that it did not reflect the order in which characters first appear in text.
Due to the above, the highest assurance that remnants of redacted data will not remain in PDF documents requires organisations to:
- verify the original PDF document was created using Adobe Acrobat, Adobe Acrobat Distiller or Microsoft Word by checking the metadata of the PDF document
- perform redaction and sanitisation of the document using Adobe Acrobat Pro DC 2017.