Extracting tables from a PDF in R


Problem description

I need to extract tables from a PDF. Here's the link:

https://ainfo.cnptia.embrapa.br/digital/bitstream/item/155505/1/doc-202-1.pdf

I want to extract the tables on pages 15 through 21. All of these tables have the same structure (18 columns) and the same headings. Here's a snapshot of a single table.

In each table, I am only interested in columns 6-8 and 17: `Ciclo`, `Graus Dias/dias`, `Epcaja de Plantion` and `Regiao de adaptacao`.

This is what I did:

library(dplyr)
library(tabulizer)

out <- extract_tables("mydocument.pdf", pages = 15:21)

# this gives me a list of 7 tables. 

temp <- data.frame(out[[1]]) # taking the first table as an example
temp %>% dplyr::select(X3, X4, X5, X12) # these are the columns corresponding to `Ciclo`, `Graus Dias/dias`, `Epcaja de Plantion` and `Regiao de adaptacao`

# this is a snapshot of first table

However, when I extract the 7th table:

  temp <- data.frame(out[[7]])

# Columns 1 to 4 are merged into a single column.

In summary, the extract_tables function does not keep column positions consistent across tables, and it merges columns in some of them. How can I fix this so that I end up with one combined table, with the columns `Ciclo`, `Graus Dias/dias`, `Epcaja de Plantion` and `Regiao de adaptacao`, in a single csv file?

Recommended answer

In my experience this is a data preparation and wrangling problem rather than a parsing problem, because in this case tabulizer's parsing algorithms offer little leeway beyond switching between methods. From what I can see when I try to extract your tables, the table on the 7th page is not the only one that is parsed incorrectly. Every page is parsed differently, but all the data seem to be retained. Your first table comes out with 13 columns, the second with 17, the third with 12, the fourth with 10, and the last three with 11 columns each.

What I would propose instead is to parse each page individually, clean each result according to your desired output, and then bind the results together. This is a lengthy process and very specific to each parsed table, so I will only provide an example script:

library(dplyr)
library(tidyr)
library(tabulizer)
# Collect one data.frame per page (pages 15 to 21) in a list
pages <- 15:21
result <- vector("list", length(pages))
for (k in seq_along(pages)) {
  # extract_tables() returns a list of the tables found on the page
  result[[k]] <- as.data.frame(
    extract_tables("mydocument.pdf", pages = pages[k], method = "stream"),
    stringsAsFactors = FALSE
  )
}

## ------- DATA CLEANING OPERATIONS, examples:
# Remove the top 3 rows of the first table, which are not part of the data
result[[1]] <- result[[1]][-(1:3), ]
# Split or merge columns as needed. For instance, to split the first column
# into 4 (as in your original post), split it on whitespace
result[[1]] <- separate(result[[1]], 1, into = c('X1.1', 'X1.2', 'X1.3', 'X1.4'), sep = ' ', remove = TRUE)

## ---- After data cleaning operations:
# Bind all data frames into one (by now they should have the same number
# of columns and matching column names)
df <- bind_rows(result)
# Write your output csv file
write.csv(df, 'yourfilename.csv')
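
One way to organize the per-page cleaning is to record, for each page, which parsed column positions hold the four fields of interest, then select and rename those columns before binding. The following is only a minimal sketch on made-up data frames, not on the real PDF output. The helper `pick_columns`, the toy pages, and the position vectors are all hypothetical; the real positions have to be read off each parsed page by hand.

```r
# Common names for the four fields of interest (spelled as in the question)
target_names <- c("Ciclo", "Graus Dias/dias", "Epcaja de Plantion", "Regiao de adaptacao")

# Hypothetical helper: keep the columns at the given positions and give
# them the common names so that all pages can be stacked.
pick_columns <- function(df, positions) {
  stopifnot(length(positions) == length(target_names),
            max(positions) <= ncol(df))
  picked <- df[, positions, drop = FALSE]
  names(picked) <- target_names
  picked
}

# Toy stand-ins for two parsed pages with different layouts (made-up values)
page_a <- data.frame(X1 = "a", X2 = "b", X3 = "P", X4 = "800", X5 = "set", X6 = "S",
                     stringsAsFactors = FALSE)
page_b <- data.frame(X1 = "a b", X2 = "SP", X3 = "790", X4 = "out", X5 = "N",
                     stringsAsFactors = FALSE)

# Assumed positions of the four fields on each toy page
positions <- list(c(3, 4, 5, 6), c(2, 3, 4, 5))

# Select and rename per page, then stack the pages into one data frame
cleaned <- Map(pick_columns, list(page_a, page_b), positions)
combined <- do.call(rbind, cleaned)
```

Keeping the positions in one list makes it easy to see which page uses which layout, and the `stopifnot` check fails early if a page has fewer columns than expected.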

You may also want to look at tabulizer's different parsing methods. I set the method to 'stream' here because in my experience it usually yields the best results, but 'lattice' may work better for some of the tables.
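
To judge which method suits which page, it can help to tabulate how many tables and columns each method returns per page. This is a rough sketch, not part of the original answer: `count_columns` is a hypothetical helper, and its `extract` argument exists only so that the function can be tried with a stand-in extractor when tabulizer or the PDF is not available.

```r
# Hypothetical helper: for each method and page, report how many tables were
# found and how many columns each has. The default for `extract` is only
# evaluated if no other extractor is passed in.
count_columns <- function(file, pages, methods = c("stream", "lattice"),
                          extract = tabulizer::extract_tables) {
  rows <- list()
  for (m in methods) {
    for (p in pages) {
      tables <- extract(file, pages = p, method = m)
      rows[[length(rows) + 1]] <- data.frame(
        method = m,
        page = p,
        n_tables = length(tables),
        # Column counts of all tables on the page, comma-separated
        n_cols = paste(vapply(tables, ncol, integer(1)), collapse = ","),
        stringsAsFactors = FALSE
      )
    }
  }
  do.call(rbind, rows)
}

# Usage (needs tabulizer and the PDF on disk):
# count_columns("mydocument.pdf", 15:21)
```

Pages where one method gives a column count closer to the expected layout are candidates for parsing with that method.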
