PDF Data and Table Scraping to Excel

Question

I'm trying to figure out a good way to increase the productivity of my data entry job.

What I am looking to do is come up with a way to scrape data from a PDF and input it into Excel.

More specifically, the data I am working with comes from grocery store flyers. As it stands now, we have to manually enter every deal in the flyer into a database. A sample flyer is http://weeklyspecials.safeway.com/customer_Frame.jsp?drpStoreID=1551

What I am hoping to do is have columns for products, price, and predefined options (Loyalty Cards, Coupons, Select Variety... that sort of thing).

Any help would be appreciated, and if I need to be more specific let me know.

Recommended answer

After looking at the specific PDF linked to by the OP, I have to say that this is not quite displaying a typical table format.

It contains many images inside the "cells", but the cells are not all strictly vertically or horizontally aligned:

So this isn't even a 'nice' table, but an extremely ugly and awkward one to work with...

Having said that, I'll have to add:

Standard PDFs do not provide any hints about the semantics of what they draw on a page: the only distinction the syntax provides is between vector elements (lines, fills, ...), images and text.

Whether any character is part of a table or part of a line or just a lonely, single character within an otherwise empty area is not easy to recognize programmatically by parsing the PDF source code.
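
To make the difficulty concrete, here is a minimal Python sketch of the kind of guesswork an extractor has to do. It assumes a page has already been reduced to text fragments with coordinates; the `Fragment` type, the sample data, and the `group_into_rows` helper are all made up for illustration and are not taken from any PDF library or tool. It also assumes that a smaller y value means "higher on the page", which is not true of every coordinate system.

```python
from collections import namedtuple

# Hypothetical stand-in for what a page gives you: text plus a position.
Fragment = namedtuple("Fragment", ["x", "y", "text"])

def group_into_rows(fragments, tolerance=2.0):
    """Cluster fragments into rows by comparing y coordinates.

    A fragment joins the current row if its y value is within
    `tolerance` of the first fragment of that row.
    """
    rows = []
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        if rows and abs(rows[-1][0].y - frag.y) <= tolerance:
            rows[-1].append(frag)
        else:
            rows.append([frag])
    # Within each row, order the cells from left to right.
    return [[f.text for f in sorted(row, key=lambda f: f.x)] for row in rows]

# Made-up fragments: baselines on the same visual row differ slightly.
fragments = [
    Fragment(200.0, 100.4, "2.99"),
    Fragment(50.0, 100.0, "Apples"),
    Fragment(50.0, 120.0, "Bread"),
    Fragment(200.0, 119.5, "1.49"),
]
rows = group_into_rows(fragments)
print(rows)
```

Even this toy version depends on a hand-picked tolerance, and it breaks down as soon as cells are not aligned, which is exactly the situation in the flyer PDF.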

For a background about why the PDF file format should never, ever be thought of as suitable for hosting extractable, structured data, see this article:

Why Updating Dollars for Docs Was So Difficult (ProPublica-Website)

...but doing so with TabulaPDF works very well!

Having said the above, let me now add this:

  • For an amazing open source family of tools for extracting tabular data from PDFs (unless they are scanned pages), one that gets better from week to week and contradicts what I said in my introductory paragraphs, check out TabulaPDF.

Tabula-Extractor is written in Ruby. In the background it makes use of PDFBox (which is written in Java) and a few other third-party libs. To run, Tabula-Extractor requires JRuby-1.7 installed.

I'm using the 'bleeding-edge' version of Tabula-Extractor directly from its GitHub source code repository. Getting it to work was extremely easy, since on my system JRuby-1.7.4_0 is already present:

mkdir ~/svn-stuff
cd ~/svn-stuff
git clone https://github.com/tabulapdf/tabula-extractor.git git.tabula-extractor

The required libraries are already included in this Git clone, so there is no need to install PDFBox separately. The command line tool is in the bin/ subdirectory of the clone.

To explore the command line options:

~/svn-stuff/git.tabula-extractor/bin/tabula -h

Tabula helps you extract tables from PDFs

Usage:
       tabula [options] <pdf_file>
where [options] are:
         --pages, -p <s>:   Comma separated list of ranges, or all. Examples:
                            --pages 1-3,5-7, --pages 3 or --pages all. Default
                            is --pages 1 (default: 1)
          --area, -a <s>:   Portion of the page to analyze
                            (top,left,bottom,right). Example: --area
                            269.875,12.75,790.5,561. Default is entire page
       --columns, -c <s>:   X coordinates of column boundaries. Example
                            --columns 10.1,20.2,30.3
      --password, -s <s>:   Password to decrypt document. Default is empty
                            (default: )
             --guess, -g:   Guess the portion of the page to analyze per page.
             --debug, -d:   Print detected table areas instead of processing.
        --format, -f <s>:   Output format (CSV,TSV,HTML,JSON) (default: CSV)
       --outfile, -o <s>:   Write output to <file> instead of STDOUT (default:
                            -)
       --spreadsheet, -r:   Force PDF to be extracted using spreadsheet-style
                            extraction (if there are ruling lines separating
                            each cell, as in a PDF of an Excel spreadsheet)
    --no-spreadsheet, -n:   Force PDF not to be extracted using
                            spreadsheet-style extraction (if there are ruling
                            lines separating each cell, as in a PDF of an Excel
                            spreadsheet)
            --silent, -i:   Suppress all stderr output.
  --use-line-returns, -u:   Use embedded line returns in cells. (Only in
                            spreadsheet mode.)
           --version, -v:   Print version and exit
              --help, -h:   Show this message
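
If you would rather drive the tool from a script than type the options by hand, the argument list can be assembled programmatically. The following Python sketch is only an illustration: `build_tabula_command` is a hypothetical helper, not part of Tabula, and the default executable name `tabula` assumes the tool is on your PATH (otherwise pass the full path to the script in the clone). It uses only options that appear in the help text above.

```python
def build_tabula_command(pdf_file, pages="1", fmt="CSV", guess=False,
                         outfile=None, tabula_bin="tabula"):
    """Assemble an argument list from options listed in the help text."""
    cmd = [tabula_bin, "--pages", pages, "--format", fmt]
    if guess:
        # Let the tool guess the portion of each page to analyze.
        cmd.append("--guess")
    if outfile is not None:
        # Write to a file instead of STDOUT.
        cmd += ["--outfile", outfile]
    cmd.append(pdf_file)
    return cmd

# Made-up file names, purely for illustration.
cmd = build_tabula_command("report.pdf", pages="1-3,5", guess=True,
                           outfile="report.csv")
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`; passing a list rather than a single string avoids shell quoting problems with file names that contain spaces.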

Extracting the table which the OP wants

I'm not even trying to extract this ugly table from the OP's monster PDF. I'll leave it as an exercise to those readers who are feeling adventurous enough...

Instead, I'll demo how to extract a 'nice' table. I'll take pages 651-653 from the official PDF-1.7 specification, here represented with screenshots:

I used this command:

 ~/svn-stuff/git.tabula-extractor/bin/tabula \
   -p 651,652,653 -g -n -u -f CSV            \
    ~/Downloads/pdfs/PDF32000_2008.pdf

After importing the generated CSV into LibreOffice Calc, the spreadsheet looks like this:

To me this looks like the perfect extraction of a table which did spread over 3 different PDF pages. (Even the newlines used within table cells made it into the spreadsheet.)
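
Newlines inside cells can survive the trip because CSV allows a quoted field to span several lines. As a small illustration, the Python sketch below parses some made-up CSV text (not actual Tabula output) with the standard `csv` module and shows that a line break inside a quoted cell stays part of that cell instead of starting a new row.

```python
import csv
import io

# Made-up CSV text; the second cell of the data row contains a line break.
sample = 'Key,Description\r\n"Type","First line\nsecond line"\r\n'

# newline="" leaves line endings untouched so the csv module can
# decide which ones end a row and which belong to a quoted cell.
rows = list(csv.reader(io.StringIO(sample, newline="")))
for row in rows:
    print(row)
```

When reading a real file instead of a string, the same idea applies: open it with `open(path, newline="")` before passing it to `csv.reader`.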

Here is an ASCiinema screencast (which you also can download and re-play locally in your Linux/MacOSX/Unix terminal with the help of the asciinema command line tool), starring tabula-extractor:
