APT PDFs and metadata extraction

One of the modules in our new Rapid Reverse Engineering class is artifact extraction. For this section of the class, the students use a Python module we created to do some artifact/metadata extraction from samples. One of the more interesting pieces of metadata that attackers leave behind is the software that the malicious file was created with. In this case I was looking at some PDFs. I realized that I extract this information for individual samples, but I had never run a test on a large set of known APT malware to see what comes out. So I set out on a quick adventure, and wow, was I surprised by the results.
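
For readers who want to try this on their own samples, here is a minimal sketch of the idea. It is not the module used in the class; the function name extract_pdf_tool_fields is hypothetical, and the sketch assumes the /Producer and /Creator entries sit in an uncompressed, unencrypted Info dictionary as literal strings.

```python
import re

# Matches "/Producer (literal string)" or "/Creator (literal string)".
# Escaped parentheses (\( and \)) are tolerated; unescaped nested
# parentheses and hex strings (<...>) are NOT handled by this sketch.
_FIELD_RE = re.compile(
    rb"/(Producer|Creator)\s*\(((?:\\.|[^\\()])*)\)", re.DOTALL
)


def extract_pdf_tool_fields(data: bytes) -> dict:
    """Return the first /Producer and /Creator values found in raw PDF bytes."""
    fields = {}
    for match in _FIELD_RE.finditer(data):
        name = match.group(1).decode("ascii")
        raw = match.group(2)
        # Undo the backslash escapes for parentheses and the backslash itself.
        raw = re.sub(rb"\\([()\\])", rb"\1", raw)
        # Keep only the first occurrence of each field.
        fields.setdefault(name, raw.decode("latin-1"))
    return fields
```

Calling it on the raw bytes of a file (for example, open(path, "rb").read()) returns a small dictionary, or an empty one when the metadata is compressed, encrypted, or missing. A full PDF parser would be more reliable on real samples.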

I ended up with the following pie graph

The sample set was 300+ known APT samples that we have. It wasn't our whole set of PDFs, but it was a decent size for a start. The top 10 list looked like this:

Acrobat Web Capture 8.0 (15%)
Adobe LiveCycle Designer ES 8.2 (15%)
Acrobat Web Capture 9.0 (8%)
Python PDF Library - http://pybrary.net/pyPdf/ (7%)
Acrobat Distiller 9.0.0 (Windows) (7%)
Acrobat Distiller 6.0.1 (Windows) (7%)
pdfeTeX-1.21a (7%)
Adobe Acrobat 9.2.0 (4%)
Adobe PDF Library 9.0 (4%)
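
A breakdown like the one above can be produced with a simple tally. The following is a rough sketch, assuming you already have one producer string per sample; the function name top_producers is hypothetical and not part of any existing tool.

```python
from collections import Counter


def top_producers(producers, n=10):
    """Return the n most common producer strings with rounded percentages."""
    # Skip samples where no producer string could be extracted.
    counts = Counter(p.strip() for p in producers if p and p.strip())
    total = sum(counts.values())
    return [
        (name, round(100 * count / total))
        for name, count in counts.most_common(n)
    ]
```

The percentages are relative to the samples that actually had a producer string, so samples with stripped metadata do not dilute the numbers.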

A number of things amazed me about this data. One was the lack of opsec on the attackers' part, along with the old versions of software they are using. From the offensive perspective, if you are dealing with targets that have the resources to do deep forensics and operations, then every little bit of opsec is needed. It only takes a small amount of data to put together a large piece of the puzzle.

From the defensive position, it points to an opportunity for defense organizations to do some early detection. I doubt that most organizations are actually keeping track of or analyzing what types of clean, business case PDFs come through the front doors. What do the normal clean PDFs coming through your front doors actually look like? Are the clean business case PDFs being created by the "Python PDF Library - http://pybrary.net/pyPdf/" software? This is a piece of software that is no longer maintained. If you have a standard set of PDFs that come through your front doors and they aren't using strange libraries such as pyPdf, then it might be time to create a nice little Snort signature and alert on it. I wouldn't recommend blocking at that level (unless you are up for it), but alerting on something simple like that can pay extremely large dividends for response/defense teams. Imagine telling your CIO/CISO that you detected and remediated an APT* attack coming through the front door with a simple Snort sig.

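The same kind of content match can be prototyped offline before writing a network signature. This is only a sketch: the names SUSPICIOUS_MARKERS and matching_markers are hypothetical, the marker list contains just the pyPdf string mentioned above, and a plain byte search only works when the metadata is stored uncompressed.

```python
# Producer strings that are unusual for your environment. The single entry
# here is the pyPdf string from the list above; extend it from your own data.
SUSPICIOUS_MARKERS = (
    b"Python PDF Library - http://pybrary.net/pyPdf/",
)


def matching_markers(pdf_bytes, markers=SUSPICIOUS_MARKERS):
    """Return the markers found anywhere in the raw bytes of a PDF."""
    return [m.decode("ascii") for m in markers if m in pdf_bytes]
```

An empty list means nothing matched. A non-empty list is a reason to alert and take a closer look, not proof that the file is malicious.
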
Some honorable mentions that didn't make it into the top 10:

Advanced PDF Repair at http://www.pdf-repair.com
Acrobat Web Capture 6.0 (wow that is old)
¦ d o P D F V e r 6 . 2 B u i l d 2 8 8 ( W i n d o w s X P x 3 2 ) (yes, that is the way it shows up)
alientools PDF Generator 1.52
PDFlib 7.0.3 (C++/Win32)
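
A note on the spaced-out doPDF entry above: one plausible explanation (an assumption on my part, not something confirmed from the sample) is that the string was stored as UTF-16. PDF allows text strings to be UTF-16BE when they begin with the byte order mark FE FF, and reading such a string one byte at a time puts a NUL between every character. A minimal sketch of decoding it, with a hypothetical function name:

```python
def decode_pdf_text_string(raw: bytes) -> str:
    """Decode a PDF text string, honoring a UTF-16BE byte order mark."""
    if raw.startswith(b"\xfe\xff"):
        # FE FF marks the rest of the string as UTF-16BE.
        return raw[2:].decode("utf-16-be", errors="replace")
    # Without the mark the spec uses PDFDocEncoding; latin-1 is only an
    # approximation that is good enough for plain ASCII tool names.
    return raw.decode("latin-1")
```

Normalizing the strings this way before counting keeps the same tool from showing up under two different spellings.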

My point is that you should look at your data sets and see what information you can glean from them. This idea may or may not be feasible in your organization, but you as the defender are in a position to determine that for yourself.

At the end of April (25-26th) we are debuting Rapid Reverse Engineering in New York City with Trail of Bits: http://www.trailofbits.com/training/#rapidre. Rapid Reverse Engineering is a class designed to help students learn how to rapidly assess files in incident response scenarios.
