How can I extract a JavaScript from a PDF file with a command line tool?


Question


How can I extract a JavaScript object from a PDF file using a command line tool?

I am trying to build a GUI in Python that provides this function.

I found these two modules but couldn't run them: pyPdf2 and pyPdf.

Solution

When you deal with JavaScript in PDFs, you have to be aware of two cases (which you cannot necessarily distinguish in advance, before closely investigating the file in question).

  1. "Harmless" JavaScript
  2. Malicious JavaScript

Case 1: Harmless, "useful", "open" JavaScript

The OP gave a link to a sample JavaScript-loaded PDF from PlanetPDF:

That one is easy to handle. Just use pdfinfo -js (but be sure that you use one of the most recent, Poppler-based releases -- the XPDF-based pdfinfo does not know about -js!)

Here is the result:

$ pdfinfo -js ppjslc_commonex_3.pdf

 Title:          Planet PDF JavaScript Learning Center Example #2
 Author:         Chris Dahl, ARTS PDF Global Services
 Creator:        PScript5.dll Version 5.2.2
 Producer:       Acrobat Distiller 6.0.1 (Windows)
 CreationDate:   Thu Oct 28 18:13:38 2004
 ModDate:        Thu Oct 28 18:17:46 2004
 Tagged:         no
 UserProperties: no
 Suspects:       no
 Form:           AcroForm
 JavaScript:     yes
 Pages:          1
 Encrypted:      no
 Page size:      612 x 792 pts (letter)
 Page rot:       0
 File size:      84720 bytes
 Optimized:      no
 PDF version:    1.5

 Name Dictionary "docOpened":
 // variable to store whether document has been opened already or not
 var bAlreadyOpened;

 function docOpened()
 {

    if(bAlreadyOpened != "true")
    {
        // document has just been opened
        var d = new Date();
        var sDate = util.printd("mm/dd/yyyy", d);

         // set date now
         app.alert("About to insert date into field now");
         this.getField("todaysDate").value = sDate;

         // now set bAlreadyOpened to true so it doesn't
         // run again
         bAlreadyOpened = "true";
    }
    else
    {
        // document has already been opened
    }
 }

 // call the docOpened() function
 docOpened();

As you can see, -js attempts to automatically extract all JavaScript from the PDF and prints it to <stdout>.

This one was a harmless JavaScript, not trying to hide itself, not obfuscated, inserting the current date into a form field, after popping up an info message about what it is going to do.
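
Since the question mentions a Python GUI, here is a minimal sketch of how the pdfinfo -js call could be wrapped from Python. The function names are made up for illustration, and it assumes a Poppler-based pdfinfo is installed and on the PATH; it simply returns whatever pdfinfo prints, metadata included.

```python
import shutil
import subprocess


def build_pdfinfo_command(pdf_path):
    """Return the argument list for `pdfinfo -js <file>` (hypothetical helper)."""
    return ["pdfinfo", "-js", pdf_path]


def extract_javascript(pdf_path):
    """Run pdfinfo and return its complete stdout as text.

    Assumption: the installed pdfinfo is Poppler-based; the XPDF-based
    one does not know the -js option.
    """
    if shutil.which("pdfinfo") is None:
        raise RuntimeError("pdfinfo was not found on the PATH")
    result = subprocess.run(
        build_pdfinfo_command(pdf_path),
        capture_output=True,
        text=True,
        errors="replace",  # do not fail on bytes that are not valid text
        check=True,        # raise CalledProcessError if pdfinfo fails
    )
    return result.stdout
```

A GUI could call extract_javascript("ppjslc_commonex_3.pdf") and show the returned text in a text widget.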

Case 2: Malicious, damaging, hidden and obfuscated JavaScript

There are numerous examples of PDFs in the wild containing JavaScripts which are not as harmless as the above, written by malware authors who are after your money, or just after the "fun" it gives them if they succeed.

The JavaScripts in these cases are very frequently hidden and obfuscated.

For example, in order to hide the fact that there is even JavaScript contained, they do not use the 'clear' /JavaScript and /JS names in the respective PDF object dictionaries. These names must be present for the PDF readers to know what they should do with the object.

Instead, they use another method to express the same names:

/#4Aava#53cript
/J#61vaScrip#74
/#4a#61#76#61#53#63#72#69#70#74
[...]

Unfortunately, this method is explicitly made "legal" by the official PDF specification documents. It allows any or all characters in a PDF name token to be replaced by their respective two-digit hexadecimal code (with a leading hash sign for each replaced char).

This can fool some of the more naive attempts to find the /JavaScript string inside a PDF (such as using a simple grep -a).
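
To illustrate why a plain grep is not enough, here is a small Python sketch that decodes the #xx escapes before comparing names. The function names are invented for this example, and it only looks at the raw bytes it is given: names inside compressed streams would stay invisible to it, so it is no replacement for the tools listed below.

```python
import re

# A '#' followed by exactly two hex digits encodes one byte of a PDF name.
_HEX_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")

# A name token: '/' followed by characters that are neither whitespace
# nor PDF delimiters.
_NAME_TOKEN = re.compile(rb"/[^\s()<>\[\]{}/%]+")


def decode_pdf_name(raw):
    """Replace every #xx escape in a raw name token with the byte it encodes."""
    return _HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)


def find_javascript_names(data):
    """Return (raw, decoded) pairs for tokens that decode to /JS or /JavaScript."""
    hits = []
    for match in _NAME_TOKEN.finditer(data):
        raw = match.group(0)
        decoded = decode_pdf_name(raw)
        if decoded in (b"/JS", b"/JavaScript"):
            hits.append((raw, decoded))
    return hits


if __name__ == "__main__":
    for name in (b"/#4Aava#53cript", b"/J#61vaScrip#74"):
        print(name, b"->", decode_pdf_name(name))
```

A token whose raw form differs from its decoded form was written with escapes, which is the obfuscation described above.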

There are a few Free Software tools available, which can be used to dissect and analyze such cases:

  • Didier Stevens' Python scripts pdfid.py and pdf-parser.py are very useful for a first look (and even for a complete analysis) of these cases.

  • Jose Miguel Esparza's Python framework peepdf is even more powerful. It can even de-obfuscate, beautify and make readable again any obfuscated JavaScript contents inside the PDF.

  • Origami is Ruby-based, and also quite powerful. And there are a few more...

But all these tools are only useful if you already have (at least some basic) knowledge about PDF syntax (and about JavaScript, of course).

Here are three short examples using pdfid.py against three different PDFs:

  1. the first does not contain any JavaScript that is discovered by pdfid.py:

    $ pdfid.py nojavascript.pdf
    
     PDFiD 0.2.1  nojavascript.pdf
      PDF Header: %PDF-1.5
      obj                  193
      endobj               193
      stream                54
      endstream             54
      xref                   1
      trailer                1
      startxref              1
      /Page                  1
      /Encrypt               0
      /ObjStm                0
      /JS                    0 
      /JavaScript            0
      /AA                   12
      /OpenAction            0
      /AcroForm              1
      /JBIG2Decode           0
      /RichMedia             0
      /Launch                0
      /EmbeddedFile          0
      /XFA                   0
      /Colors > 2^24         0
    

  2. the second contains JavaScript, and the name /JavaScript appears in clear text inside the PDF:

    $ pdfid.py javascript1.pdf | grep -E '(/JS|/JavaScript)'
    
      /JS                   30
      /JavaScript           30
    

  3. the last contains JavaScript, and the name tokens /JavaScript and /JS both are obfuscated:

    $ pdfid.py javascript2.pdf | grep -E '(/JS|/JavaScript)'
    
      /JS                   30(30)
      /JavaScript           30(30)
    

    The second number in parentheses shows that pdfid.py discovered the obfuscation: 30 out of 30 /JavaScript name tokens are obscured. This makes the PDF file highly suspicious and warrants further investigation, because no "normal" PDF generating tool (that is known to me) uses this obfuscation.
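
If you want to evaluate such reports from a script, the following Python sketch parses pdfid.py output of the shape shown above into counts. It assumes the "name, count, optional (obfuscated count)" layout seen in these examples; the function names are hypothetical, and lines that do not fit this pattern (the header, or the "/Colors > 2^24" row) are skipped.

```python
import re

# Matches e.g. "/JS   30" or "/JavaScript   30(30)".
_COUNT_LINE = re.compile(r"^\s*(/\S+)\s+(\d+)(?:\((\d+)\))?\s*$")


def parse_pdfid_counts(output):
    """Map each /Name to a (total, obfuscated) tuple of integers."""
    counts = {}
    for line in output.splitlines():
        match = _COUNT_LINE.match(line)
        if match:
            name, total, obfuscated = match.groups()
            counts[name] = (int(total), int(obfuscated or 0))
    return counts


def obfuscated_names(counts):
    """Return the names for which an obfuscated count was reported."""
    return [name for name, (_, obf) in counts.items() if obf > 0]


if __name__ == "__main__":
    sample = "  /JS                   30(30)\n  /JavaScript           30(30)\n"
    print(obfuscated_names(parse_pdfid_counts(sample)))
```

Any non-empty result from obfuscated_names is a reason to look at the file more closely.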


Update

A list of different methods (including command line tools) is available in another answer of mine here:

The best tool currently is peepdf.py, because it can handle even heavily obfuscated JavaScript. This is a Python framework to explore (and change) the source code of PDF files, specialized in analysing malicious PDFs.

Its author(s) recently added the extract sub-command, which extracts and prints the source code of JavaScripts contained in the PDF:

Short usage info:

  1. Check out the sources from GitHub:
    git clone https://github.com/jesparza/peepdf.git git.peepdf
  2. Create a symlink to the script in a directory that is in your $PATH:
    cd git.peepdf ;
    ln -s $(pwd)/peepdf.py ${HOME}/bin/peepdf.py
  3. Create a script file with the PeePDF subcommand to extract the javascript:
    echo 'extract js > all-javascripts-from-my.pdf' > xtract.txt
  4. Run PeePDF (with loose parsing mode, -l, and force mode to ignore errors, -f) so that it non-interactively executes the sub-command line(s) contained in the newly created script file, -s:
    peepdf.py -l -f -s xtract.txt my.pdf
  5. Investigate the contents of the extracted JavaScript:
    cat all-javascripts-from-my.pdf
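
Steps 3 to 5 could be driven from Python roughly as follows. This is only a sketch: the helper names are invented, it assumes peepdf.py is reachable through the PATH as set up in step 2, and it uses exactly the options and the extract sub-command shown above.

```python
import subprocess
import tempfile
from pathlib import Path


def build_peepdf_script(output_path):
    """Return the content of the script file from step 3 (hypothetical helper)."""
    return f"extract js > {output_path}\n"


def run_peepdf_extract(pdf_path, output_path, peepdf="peepdf.py"):
    """Write the script file, run peepdf on it, and return the extracted text.

    Assumption: `peepdf` names an executable that accepts -l, -f and -s
    as in step 4.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "xtract.txt"
        script.write_text(build_peepdf_script(output_path))
        subprocess.run(
            [peepdf, "-l", "-f", "-s", str(script), pdf_path],
            check=True,  # raise CalledProcessError if peepdf fails
        )
    return Path(output_path).read_text(errors="replace")
```

For example, run_peepdf_extract("my.pdf", "all-javascripts-from-my.pdf") would mirror the commands listed above.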
