Update Oct 18, 2012: Jovana Milutinovich did an outstanding translation of this article in the Serbo-Croatian language which can be found here: Analiziranje PDF fajlova i Shellcode-a. Thank you Jovana for doing this and bringing it to my attention!
With all the talk about the recent Adobe CVE-2008-2992 vulnerability being exploited in the wild, I thought it would be a good time to document how I analyze PDF files and shellcode. Before I get into that, I would like to thank all the contributors to MalwareDomainList.com, who supplied me with several malicious PDF files and links to malicious PDF files. I should also note that PDF file analysis is a new subject for me; I have spent the last few days really diving into what exactly makes up a PDF file and the additional functionality Adobe makes available for PDF files.
The very first thing I do is take a look at the PDF file using either “less” or “more” to display its contents in the terminal window, just to see if I can spot anything out of the ordinary. The stream sections in a PDF file can be compressed, and will usually show up looking something like this if they are:
To decompress the stream data and clean up the formatting in the PDF file I use a tool called pdftk, and decompress the file using the following command:
pdftk bad.pdf output bad_dumped.pdf uncompress
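If pdftk is not at hand, the core idea of the uncompress step can be approximated in a few lines of Python. This is only a rough sketch under stated assumptions: it handles streams compressed with the FlateDecode filter alone, it finds streams with a naive regular expression rather than a real PDF parser, and the function name inflate_streams and the sample object are made up for illustration. It is not how pdftk itself works.

```python
import re
import zlib
from typing import List

# Naive match for the raw bytes between the "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes: bytes) -> List[bytes]:
    """Return the body of every stream that zlib is able to decompress."""
    inflated = []
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj ignores the newline that precedes "endstream".
            inflated.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            # Not Flate-compressed (or uses another filter), so skip it.
            continue
    return inflated


# Hand-built object standing in for a real file such as bad.pdf.
body = zlib.compress(b"util.printf('%45000f', 0);")
sample = (b"1 0 obj\n<< /Filter /FlateDecode /Length %d >>\nstream\n" % len(body)
          + body + b"\nendstream\nendobj\n")
for data in inflate_streams(sample):
    print(data.decode("latin-1"))
```

For a real file you would read the bytes with open("bad.pdf", "rb").read() and pass them to the function; streams using other filters or filter chains are simply skipped.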
As you have probably figured out already, the first image shows that this is one of the exploits currently targeting the Adobe CVE-2008-2992 vulnerability, and the second image is the shellcode being passed into the util.printf function. Now to take a look at the shellcode.
To extract the shellcode I simply highlight it, copy and paste it into a text editor for manipulation. I prefer to use vi, so to clean up the shellcode I normally use regular expressions and substitutions. To clean up this particular instance I simply executed these two commands in vi:
The first regex simply removes all the “ and + characters. The second simply joins all the lines together to make one long string of text. This resulted in the shellcode now looking like this:
Now there are several different ways to analyze shellcode, but I tend to use just two. The first way is some simple perl-fu that simply outputs the character representation of the shellcode. The perl-fu part I got from an ISC SANS diary entry made by Daniel Wesemann. Here is the command I execute:
cat shellcode.file | perl -pe 's/%u(..)(..)/chr(hex($2)).chr(hex($1))/ge'
In this case it does not work and displays the following:
Executing the exact same command string as before:
cat shellcode.file2 | perl -pe 's/%u(..)(..)/chr(hex($2)).chr(hex($1))/ge'
Results in this:
As you can see, the shellcode is simple: it downloads a file into the Windows system directory via urlmon and executes it.
The second method I use to analyze shellcode is with the libemu library and test application sctest. Libemu is a library providing basic x86 emulation and sctest is part of its test suite. sctest will not work in all cases, but you can extend its functionality by writing your own test application using the libemu library.
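For anyone who would rather script the cleanup and decoding than use vi and Perl, here is a minimal Python sketch of the same idea: strip the quotes and plus signs, join the lines, then turn each %uXXYY escape into the two bytes YY XX, just as the Perl substitution swaps $2 and $1. The function names and the sample string are hypothetical. Unlike the Perl one-liner, this version discards any text that is not a %u escape, and the printable-strings helper is an extra convenience that is not part of the original command.

```python
import re


def clean_fragments(js_text: str) -> str:
    """Strip double quotes, '+' signs and line breaks from pasted JavaScript."""
    parts = []
    for line in js_text.splitlines():
        parts.append(line.replace('"', "").replace("+", "").strip())
    return "".join(parts)


def unescape_u(text: str) -> bytes:
    """Convert each %uXXYY escape into the two bytes YY XX (byte-swapped)."""
    out = bytearray()
    for first, second in re.findall(r"%u([0-9A-Fa-f]{2})([0-9A-Fa-f]{2})", text):
        out.append(int(second, 16))
        out.append(int(first, 16))
    return bytes(out)


def printable_strings(data: bytes, min_len: int = 4) -> list:
    """Return runs of printable ASCII that are at least min_len bytes long."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, data)]


# Hypothetical sample: four NOP bytes followed by the text "http://example".
pasted = '"%u9090%u9090%u7468%u7074" +\n"%u2f3a%u652f%u6178%u706d%u656c"'
shellcode = unescape_u(clean_fragments(pasted))
print(printable_strings(shellcode))  # prints ['http://example']
```

Returning bytes rather than a text string keeps the non-printable opcodes intact, which matters if the result is later written to a file for emulation.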
cat shellcode.file | perl -pe 's/%u(..)(..)/chr(hex($2)).chr(hex($1))/ge' > shellcode.out
As you can probably see, we are simply redirecting the output to a file instead of to the console window. To verify this step worked, you can inspect the newly created file with a tool called hexdump. Comparing the output of the following two commands will verify this:
hexdump -C shellcode.out
cat shellcode.file | perl -pe 's/%u(..)(..)/chr(hex($2)).chr(hex($1))/ge' | hexdump -C
These two commands should result in the exact same output and look something like this:
The reason we cannot simply highlight and copy the output of the perl-fu command from the console window and paste it into a text file is that the console character set cannot display the full range of characters correctly, resulting in “???” being displayed. This is not the case when you redirect to a file, as the characters don’t have to be interpreted and displayed on the console. If you don’t believe me, simply cat out the shellcode.out file and compare it to what it looks like when you open it in a text editor like vi.
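The same point can be illustrated with a short Python sketch: bytes written to a file in binary mode come back unchanged, and a hex view shows every byte regardless of whether a terminal could display it. The helper names and the sample bytes below are made up for illustration, and the layout only loosely resembles what hexdump -C prints.

```python
import os
import tempfile


def hexdump(data: bytes, width: int = 16) -> str:
    """Render bytes as an offset, hex columns and a dotted ASCII column."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join("%02x" % b for b in chunk).ljust(width * 3)
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append("%08x  %s |%s|" % (offset, hex_part, text_part))
    return "\n".join(lines)


def round_trip(data: bytes) -> bytes:
    """Write data to a temporary file in binary mode and read it back."""
    fd, path = tempfile.mkstemp(suffix=".out")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        with open(path, "rb") as handle:
            return handle.read()
    finally:
        os.remove(path)


raw = b"\x90\x90\xeb\x10http"  # hypothetical bytes, not the real shellcode
print(round_trip(raw) == raw)  # the file preserves every byte
print(hexdump(raw))
```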
Now that we have dumped the shellcode into a file we can pass it into sctest via a stdin redirection for analysis. Here is the command I use:
sctest -Ss 100000 < shellcode.out
This results in the following output:
Looking at the output we can clearly see a URL, and you can conclude with a fair amount of confidence that this is the URL used to drop a binary using something like urlmon, as we saw before. There are plenty of more in-depth procedures for analyzing PDFs and shellcode, but I have found the procedures I explained in this post to work on about 90% of all the PDF files and shellcode I have looked at in the past.
The following command will output PDF document Metadata, Bookmarks and Page Labels:
pdftk bad.pdf data_dump output
PDF document metadata can be very useful for finding out information about the author of the PDF document, the creation date, and modification dates. Speaking of gathering information on the author and using metadata to investigate a PDF document, I found the article “Shoulder Surfing a Malicious PDF Author” by Didier Stevens to be an outstanding example of what can be learned from this data. Didier has published a PDF parsing tool written in Python called pdf-parser.py, which looks very promising for analyzing PDF files. I just started playing with the tool today, so I can’t really elaborate on its functionality and usability, but I can say that after only a few minutes I was able to extract the same data as I did with the pdftk tool.
Another tool written by Didier that I find useful for analyzing shellcode is XORSearch. It’s basically a small, lightweight application that will try to brute force shellcode that has been XORed, which is very common. A little hint to any Mac OS X users out there: to compile XORSearch you have to remove the #include<malloc.h> from the header, as it is deprecated and not installed.
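Once the metadata has been dumped to text, it is easy to pull into a script. The sketch below assumes the dump contains InfoKey: and InfoValue: line pairs, which is the layout I have seen from pdftk; other versions may add or change lines, so treat the format as an assumption. The function name and the sample values are invented for illustration.

```python
def parse_info(dump_text: str) -> dict:
    """Collect InfoKey/InfoValue line pairs into a dictionary."""
    info = {}
    key = None
    for line in dump_text.splitlines():
        if line.startswith("InfoKey: "):
            key = line[len("InfoKey: "):]
        elif line.startswith("InfoValue: ") and key is not None:
            info[key] = line[len("InfoValue: "):]
            key = None
    return info


# Hypothetical dump text; real output depends on the file and pdftk version.
sample_dump = """InfoKey: Creator
InfoValue: ExampleWriter
InfoKey: CreationDate
InfoValue: D:20081105120000
NumberOfPages: 1
"""
for name, value in parse_info(sample_dump).items():
    print(name, "=", value)
```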
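The brute-force idea is simple enough to sketch in Python: try every single-byte key, XOR the whole buffer with it, and report the keys that make a keyword such as "http" appear. This is only an illustration of the technique and not XORSearch's actual implementation; it covers the single-byte XOR case alone, and the function name and sample data are hypothetical.

```python
def xor_search(data: bytes, keyword: bytes) -> list:
    """Return (key, offset) pairs where XORing data with key reveals keyword."""
    hits = []
    for key in range(256):
        decoded = bytes(b ^ key for b in data)
        offset = decoded.find(keyword)
        if offset != -1:
            hits.append((key, offset))
    return hits


# Hypothetical sample: two NOP bytes plus a URL, all XORed with key 0x5a.
plain = b"\x90\x90http://example.test/a.exe"
encoded = bytes(b ^ 0x5A for b in plain)
for key, offset in xor_search(encoded, b"http"):
    print("key 0x%02x reveals the keyword at offset %d" % (key, offset))
```

Key 0 is included on purpose, so a keyword that is already present in the clear is reported as well.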
As always if you have any questions or comments regarding this post feel free to hit me up anytime, I always enjoy hearing from someone that actually read my post. ;)