How to extract table data from PDF as CSV from the command line?

Problem description

I want to extract all rows from the PDF used in the command below, while ignoring the column headers as well as all page headers, i.e. Supported Devices.

pdftotext -layout DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - \
 | sed '$d'                                                  \
 | sed -r 's/ +/,/g; s/ //g'                                 \
 > output.csv

The resulting file should be in CSV spreadsheet format (comma separated value fields).

In other words, I want to improve the above command so that the output doesn't break at all. Any ideas?

Solution

I'll offer you another solution as well.

While in this case the pdftotext method works with reasonable effort, there may be cases where not each page has the same column widths (as your rather benign PDF shows).
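To make the pdftotext route concrete, here is a minimal sketch of a more careful conversion than a blanket space-to-comma substitution. It rests on an assumption that the PDF does not guarantee: that columns in the `pdftotext -layout` output are separated by two or more spaces, while single spaces occur only inside a field. The function name `layout_to_csv` is made up for illustration.

```shell
#!/bin/bash
# Sketch: convert `pdftotext -layout` text (read from stdin) to CSV.
# Assumption: columns are separated by runs of two or more spaces.
layout_to_csv() {
  awk '
    /^[[:space:]]*$/ { next }                               # skip blank lines
    /^[[:space:]]*Supported Devices[[:space:]]*$/ { next }  # skip page header
    /^[[:space:]]*Retail Branding[[:space:]]/ { next }      # skip column header
    {
      sub(/^[[:space:]]+/, ""); sub(/[[:space:]]+$/, "")    # trim both ends
      n = split($0, f, /  +/)                               # split on 2+ spaces
      line = ""
      for (i = 1; i <= n; i++) {
        gsub(/"/, "\"\"", f[i])                             # double embedded quotes
        if (f[i] ~ /[,"]/) f[i] = "\"" f[i] "\""            # quote when needed
        line = line (i > 1 ? "," : "") f[i]
      }
      print line
    }
  '
}
```

It would be used as `pdftotext -layout file.pdf - | layout_to_csv > output.csv`. Rows whose leading cells are empty collapse to fewer fields under this approach, because the leading whitespace carries no column information once it is trimmed; a layout-aware extractor handles that case better.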

For those cases, the not-so-well-known but pretty cool free and open-source tool Tabula-Extractor is the best choice.

I myself am using the direct GitHub checkout:

$ cd $HOME ; mkdir svn-stuff ; cd svn-stuff
$ git clone https://github.com/tabulapdf/tabula-extractor.git git.tabula-extractor

I wrote myself a pretty simple wrapper script like this:

$ cat ~/bin/tabulaextr

 #!/bin/bash
 # Run tabula from its checkout directory, passing all arguments through unchanged.
 cd "${HOME}/svn-stuff/git.tabula-extractor/bin" || exit 1
 ./tabula "$@"

Since ~/bin/ is in my $PATH, I just run

$ tabulaextr --pages all                                 \
         $(pwd)/DAC06E7D1302B790429AF6E84696FCFAB20B.pdf \
        | tee my.csv

to extract all the tables from all pages and convert them to a single CSV file.
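A side note on wrappers like the one above: whether arguments are forwarded as `$@` or `"$@"` matters as soon as a path contains a space. The following sketch uses made-up helper names (`count_args`, `forward_unquoted`, `forward_quoted`) purely to show the difference; it does not call tabula.

```shell
#!/bin/bash
# Print how many arguments were received.
count_args() { echo "$#"; }

# Unquoted $@ undergoes word splitting, so one path with a space becomes two arguments.
forward_unquoted() { count_args $@; }

# Quoted "$@" preserves each original argument as a single word.
forward_quoted() { count_args "$@"; }

forward_unquoted "my file.pdf"   # prints 2
forward_quoted "my file.pdf"     # prints 1
```

With the quoted form, a call such as `tabulaextr --pages all "$(pwd)/my file.pdf"` reaches the underlying program with the path intact.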

The first ten (out of a total of 8727) lines of the CSV look like this:

$ head DAC06E7D1302B790429AF6E84696FCFAB20B.csv 

 Retail Branding,Marketing Name,Device,Model
 "","",AD681H,Smartfren Andromax AD681H
 "","",FJL21,FJL21
 "","",Luno,Luno
 "","",T31,Panasonic T31
 "","",hws7721g,MediaPad 7 Youth 2
 3Q,OC1020A,OC1020A,OC1020A
 7Eleven,IN265,IN265,IN265
 A.O.I. ELECTRONICS FACTORY,A.O.I.,TR10CS1_11,TR10CS1
 AG Mobile,Status,Status,Status

[Screenshot: the same rows as they appear in the original PDF]

It even got these lines on the last page, 293, right:

 nabi,"nabi Big Tab HD\xe2\x84\xa2 20""",DMTAB-NV20A,DMTAB-NV20A
 nabi,"nabi Big Tab HD\xe2\x84\xa2 24""",DMTAB-NV24A,DMTAB-NV24A

[Screenshot: the same two rows as they appear on the PDF page]

TabulaPDF and Tabula-Extractor are really, really cool for jobs like this!
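The question also asked to leave out the column headers, and the extracted CSV shown above still begins with the header row. A small post-processing filter can remove it. This is only a sketch: the function name `drop_header_rows` is hypothetical, and it assumes the header text appears exactly as in the sample output, with no trailing carriage return. Whether the header repeats once per page in the real output is not something the sample shows, but the filter removes every exact occurrence either way.

```shell
#!/bin/bash
# Remove every line that is exactly the column-header row.
# -F: treat the pattern as a fixed string, -x: match whole lines only, -v: invert the match.
drop_header_rows() {
  grep -Fxv 'Retail Branding,Marketing Name,Device,Model'
}
```

For example, `drop_header_rows < my.csv > devices.csv` would write only the data rows; `devices.csv` is just a placeholder name.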


Update

The original answer embeds an asciinema screencast here, starring tabula-extractor. It can also be downloaded and replayed locally in a Linux/Mac OS X/Unix terminal with the asciinema command-line tool.
