Recognize PDF table using R


Question

I'm trying to extract data from tables inside some PDF reports.

I've seen some examples using pdftools and similar packages, and I was successful in getting the text. However, I just want to extract the tables.

Is there a way to use R to recognize and extract only the tables?

Recommended answer

Awesome question, I wondered about the same thing recently, thanks!

I did it with tabulizer ‘0.2.2’, as @hrbrmstr also suggests. If you are using R > 3.5.x, I'm providing the following solution. Install the three packages in this specific order:

# install.packages("rJava")
# library(rJava) # load and attach 'rJava' now
# install.packages("devtools")
# devtools::install_github("ropensci/tabulizer", args="--no-multiarch")

Update: After testing the approach again, it looks like it's enough to just run install.packages("tabulizer") now. rJava will be installed automatically as a dependency.
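Before installing anything, it can help to see which of the packages are already present. The snippet below is only a minimal sketch: it looks the package names up in the local library and installs nothing. The variable names `pkgs`, `installed` and `missing_pkgs` are made up for illustration.

```r
# Packages named in the installation steps above.
pkgs <- c("rJava", "tabulizer")

# TRUE/FALSE per package, based on the locally installed library only.
installed <- setNames(pkgs %in% rownames(installed.packages()), pkgs)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed yet: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```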

Now you are ready to extract tables from your PDF reports.

library(tabulizer)

## load report
l <- "https://sedl.org/afterschool/toolkits/science/pdf/ast_sci_data_tables_sample.pdf" 
m <- extract_tables(l, encoding="UTF-8")[[2]]  ## comes as a character matrix
## Note: peep into `?extract_tables` for further specs (page, location etc.)!

## use first row as column names
dat <- setNames(type.convert(as.data.frame(m[-1, ]), as.is=TRUE), m[1, ])
## example-specific date conversion
dat$Date <- as.POSIXlt(dat$Date, format="%m/%d/%y")
dat <- within(dat, Date$year <- ifelse(Date$year > 120, Date$year - 100, Date$year))

dat ## voilà
#    Speed (mph)          Driver                        Car    Engine       Date
# 1      407.447 Craig Breedlove          Spirit of America    GE J47 1963-08-05
# 2      413.199       Tom Green           Wingfoot Express    WE J46 1964-10-02
# 3      434.220      Art Arfons              Green Monster    GE J79 1964-10-05
# 4      468.719 Craig Breedlove          Spirit of America    GE J79 1964-10-13
# 5      526.277 Craig Breedlove          Spirit of America    GE J79 1965-10-15
# 6      536.712      Art Arfons              Green Monster    GE J79 1965-10-27
# 7      555.127 Craig Breedlove Spirit of America, Sonic 1    GE J79 1965-11-02
# 8      576.553      Art Arfons              Green Monster    GE J79 1965-11-07
# 9      600.601 Craig Breedlove Spirit of America, Sonic 1    GE J79 1965-11-15
# 10     622.407   Gary Gabelich                 Blue Flame    Rocket 1970-10-23
# 11     633.468   Richard Noble                   Thrust 2 RR RG 146 1983-10-04
# 12     763.035      Andy Green                 Thrust SSC   RR Spey 1997-10-15
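To try the post-processing steps without Java or a PDF at hand, here is a small sketch on a hand-made character matrix. The matrix `m` below is hypothetical and only mimics the shape that extract_tables() returns (header in the first row). As a variant of the code above, it stores the result as a `Date` column instead of `POSIXlt`, and it fixes the timezone to UTC; both are assumptions of this sketch and not part of the original answer.

```r
# Hypothetical character matrix, shaped like one table from extract_tables():
# first row is the header, remaining rows are the data.
m <- rbind(
  c("Speed (mph)", "Driver", "Date"),
  c("407.447", "Craig Breedlove", "8/5/63"),
  c("763.035", "Andy Green", "10/15/97")
)

# Drop the header row, let type.convert() guess column types,
# then attach the header row as column names.
dat <- setNames(type.convert(as.data.frame(m[-1, ]), as.is = TRUE), m[1, ])

# "%y" reads two-digit years 00-68 as 20xx and 69-99 as 19xx, so "63" becomes 2063.
# POSIXlt stores the year as an offset from 1900, so values above 120
# (later than 2020) are moved back by one century.
d <- as.POSIXlt(dat$Date, format = "%m/%d/%y", tz = "UTC")
d$year <- ifelse(d$year > 120, d$year - 100, d$year)
dat$Date <- as.Date(d)

str(dat)
```

The cutoff of 120 only makes sense for historical data like this table; for reports that contain recent dates a different rule would be needed.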

Hope it works for you.

Limitations: Of course, the table in this example is quite simple, and for messier tables you may have to mess around with gsub and similar functions.
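As a rough idea of what that gsub cleanup might look like, here is a sketch on made-up cell values. The vector `raw` and the "n/a" placeholder are assumptions for illustration; real PDFs will need their own patterns.

```r
# Hypothetical raw cells as they might come out of a messier PDF table.
raw <- c(" 1,234.50 ", "$2,000", "n/a", "3 456")

# Strip currency signs, thousands separators and any whitespace.
clean <- gsub("[$,[:space:]]", "", raw)

# Map the placeholder text to a missing value before converting.
clean[clean == "n/a"] <- NA

values <- as.numeric(clean)
values
```

Doing the text cleanup on the character matrix first, and converting types afterwards, keeps type.convert() from leaving a mostly numeric column as character.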
