Scraping large pdf tables which span across multiple pages
Question
I am trying to scrape PDF tables which span across multiple pages. I tried many things but the best seems to be pdftotext -layout as advised here. The problem is that the resultant text file is not easy to work with, as the table layout differs across pages, so the columns are not aligned. Also note missing values in lines beginning with "Solsonès":
TEMPERATURA MITJANA MENSUAL ( ºC ) - 2012
COMARCA CODI i NOM EMA GEN FEB MAR ABR MAI JUN JUL AGO SET OCT N
Alt Camp VY Nulles 7,5 5,5 10,9 12,3 16,7 21,6 22,3 24,4 20,1 15,9
Alt Camp DQ Vila-rodona 7,9 5,6 11,0 12,0 16,6 21,6 22,0 24,3 19,9 15,8
Alt Empordà U1 Cabanes 8,2 6,5 11,7 12,6 17,5 22,0 23,1 24,4 20,4 16,6
Alt Empordà W1 Castelló d'Empúries 8,1 6,4 11,6 12,9 17,0 21,1 22,0 23,4 20,1 16,4
[...]
TEMPERATURA MITJANA MENSUAL ( ºC ) - 2012
COMARCA CODI i NOM EMA GEN FEB MAR ABR MAI JUN JUL AGO SET OCT
Baix Empordà DF la Bisbal d'Empordà 6,6 5,3 10,9 12,6 17,2 21,9 22,9 24,6 20,3 16
Baix Empordà UB la Tallada d'Empordà 6,1 5,2 10,7 12,3 16,6 21,3 22,2 23,8 19,7 15
Baix Empordà UC Monells 6,1 4,6 9,9 11,4 16,5 21,7 23,0 24,5 19,6 15
[...]
TEMPERATURA MITJANA MENSUAL ( ºC ) - 2012
COMARCA CODI i NOM EMA GEN FEB MAR ABR MAI JUN JUL AGO SET OCT
[...]
Solsonès CA Clariana de Cardener 4,6 3,3 10,3 10,2 16,7 22,3 d.i.
Solsonès Z8 el Port del Comte (2.316 m) -0,9 -6,3 -0,2 -2,0 5,3 10,5 10,9 13,8 7,8 4,2
Solsonès VO Lladurs 3,0 2,6 9,5 9,0 15,3 21,4 21,6 24,3 17,5 13,0
Solsonès VP Pinós 3,0 1,6 8,9 9,2 15,4 21,1 21,3 23,8 17,6 13,3
Solsonès XT Solsona d.i. 24,3 18,0 13,5
Tarragonès VQ Constantí 7,9 6,0 11,2 13,1 17,1 21,9 22,6 24,6 20,6 16,6
Tarragonès XE Tarragona - Complex Educatiu 10,2 7,8 12,3 14,6 18,3 23,0 24,2 26,2 23,0 * 18,4
Tarragonès DK Torredembarra 9,7 7,7 12,3 14,3 17,9 22,8 24,3 26,2 22,7 18,5
Terra Alta WD Batea 6,3 5,0 11,2 12,1 18,3 23,0 23,3 25,5 20,2 15,9
Terra Alta XP Gandesa 6,6 5,2 11,2 12,2 18,1 22,9 23,4 25,6 20,4 16,0
So this output is not easy to parse. What other options do I have?
It seems that every tool I use is only capable of extracting information about the layout of the table cells, but it doesn't extract which column a cell belongs to. This is very apparent when cells are empty - the empty cells are not in the output, you only get the non-empty "cells" with their layout. Does the PDF itself even contain this tabular information? If not, it doesn't make sense to search for a tool that will extract it.
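One way to see why fixed column positions matter: with substring extraction at known offsets, an empty cell comes back as a blank instead of silently disappearing, whereas whitespace splitting loses track of which column a value came from. A minimal sketch; the two sample lines and the column offsets (23-25, 28-30) are made up to mimic the data above:

```r
# Two toy lines modeled on the table above; in the second one the
# first numeric value is missing. The offsets are hypothetical and
# chosen to match these toy lines only.
lines <- c("Solsones VO Lladurs   3,0  2,6",
           "Solsones XT Solsona        2,6")
# Whitespace splitting drops the empty cell entirely:
strsplit(lines[2], "\\s+")[[1]]   # "Solsones" "XT" "Solsona" "2,6"
# Fixed-width extraction keeps it as a blank string in the right column:
vals <- sapply(lines, function(l) trimws(c(substr(l, 23, 25),
                                           substr(l, 28, 30))))
unname(vals[, 2])                 # ""  "2,6"
```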
Paid solutions are not out of the question, as one might in the end be cheaper than investing several working days of my time...
Things I have tried:
- copy-paste - causes problems with missing values (pg 5)
- save as text from Acrobat (even worse result than copy-paste)
- open in Excel as an external data source - it will not recognize the table
- https://www.pdftoexcelonline.com/ - results in an error
- http://www.pdftoexcel.org/ as well as their trial of Able2Extract - both messed up some columns. They recognized the columns correctly in the preview, but in the Excel output they were messed up
- http://www.pdftoword.com/ - just takes my email and never sends anything back
- using python on scraperwiki http://schoolofdata.org/2013/06/18/get-started-with-scraping-extracting-simple-tables-from-pdf-documents/ seems very complicated, especially for non-python users, and https://scraperwiki.com/ is not free
I have encountered several python libraries like pdftables, but they are not easy to use for a non-python developer like me (I was not even able to run them). Is there an easier way to accomplish the task?
I am trying to use the tm library in R as recommended here, but I have encountered some problems.
I also tried the Cloud SDK recommended by Ian. I registered, but I absolutely don't know where to go from here - how to upload pages, recognize them, etc.
Answer
Here is an R solution, but it is not without its flaws.
# Read the lines of your file into R
x <- readLines("EMAtaules2012.txt")
# Make sure it shows up as UTF-8 to get proper accents and so on
Encoding(x) <- "UTF-8"
# Identify the lines where the data starts
Start <- grep("COMARCA", x)
# Grab the names of each table
ListNames <- gsub("\\s+", " ", x[Start-2])
# Figure out the number of rows of data per page
Runs <- rle(diff(cumsum(x != "")))
Nrows <- Runs$lengths[Runs$lengths > 4]+1
# Make our life easier by making this column name
# a single string
x <- gsub("i NOM EMA", "i_NOM_EMA", x)
# Since these are fixed width files, we need to figure
# out the widths of each column. This is the sum of
# the number of characters in the header row plus
# the number of spaces between each column name
Spaces <- gregexpr(x[Start], pattern="\\s+")
Spaces <- lapply(Spaces, function(x) c(attr(x, "match.length"), 0))
Chars <- lapply(strsplit(x[Start], "\\s+"), nchar)
Widths <- lapply(seq_along(Spaces),
                 function(x) rowSums(cbind(Spaces[[x]],
                                           Chars[[x]])))
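The `Runs`/`Nrows` trick above is worth unpacking: `diff(cumsum(x != ""))` is just the non-empty-line indicator (shifted by one line), so `rle()` measures how many consecutive non-empty lines (header plus data rows) each page block contains. A small self-contained check; the six toy lines are hypothetical:

```r
# Hypothetical mini-page: blank line, header, three data rows, blank line
y <- c("", "HEADER", "row1", "row2", "row3", "")
ind  <- diff(cumsum(y != ""))   # 1 for non-empty lines, 0 for blanks
runs <- rle(ind)
# The run of 1s has length 4: the header plus the three data rows
runs$lengths[runs$values == 1]  # 4
```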
Part 2: Using read.fwf to get the data in
# Now, you can use `read.fwf` to read your data files in
temp <- lapply(seq_along(Start), function(fwf) {
  A <- read.fwf(textConnection(x),
                widths = c(Widths[[fwf]]),
                header = FALSE,
                skip = Start[fwf] + 1,
                n = Nrows[fwf] - 2,
                blank.lines.skip = TRUE,
                strip.white = TRUE,
                stringsAsFactors = FALSE)
  # Add in the column names
  names(A) <- scan(what = "character",
                   file = textConnection(x[Start[fwf]]),
                   quiet = TRUE)
  A
})
# Assign the table names
names(temp) <- ListNames
# Some more cleanup. The original tables span multiple pages
# in the PDF, but we can `rbind` them together in R
Tables <- unique(ListNames)
final <- lapply(seq_along(Tables), function(final) {
  A <- do.call(rbind, temp[names(temp) %in% Tables[final]])
  rownames(A) <- NULL
  A
})
# Add the names back in
names(final) <- Tables
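The grouping step above relies on `temp` having duplicated names, one entry per page, so `do.call(rbind, ...)` stacks all pages that share a table name. A toy version of the same idea, with made-up per-page data frames:

```r
# Hypothetical per-page results: two pages of table "A", one page of "B"
pages <- list(A = data.frame(v = 1:2),
              A = data.frame(v = 3:4),
              B = data.frame(v = 5))
tabs <- unique(names(pages))
stacked <- lapply(tabs, function(i) {
  out <- do.call(rbind, pages[names(pages) %in% i])
  rownames(out) <- NULL
  out
})
names(stacked) <- tabs
nrow(stacked$A)   # 4 -- both pages of "A" stacked together
```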
Part 3: Did it work?
# View the first few rows and columns of the first three tables
lapply(final[1:3], function(y) head(y[1:5], 3))
# $` TEMPERATURA MITJANA MENSUAL ( ºC ) - 2012`
# COMARCA CODI i_NOM_EMA GEN FEB
# 1 Alt Camp DQ Vila-rodona 7,9 5,6
# 2 Alt Empordà U1 Cabanes 8,2 6,5
# 3 Alt Empordà W1 Castelló d'Empúries 8,1 6,4
#
# $` TEMPERATURA MÀXIMA MITJANA MENSUAL ( ºC ) - 2012`
# COMARCA CODI i_NOM_EMA GEN FEB
# 1 Alt Camp DQ Vila-rodona 13,1 11,7
# 2 Alt Empordà U1 Cabanes 15,1 12,4
# 3 Alt Empordà W1 Castelló d'Empúries 14,4 11,7
#
# $` TEMPERATURA MÍNIMA MITJANA MENSUAL ( ºC ) - 2012`
# COMARCA CODI i_NOM_EMA GEN FEB
# 1 Alt Camp DQ Vila-rodona 3,8 0,5
# 2 Alt Empordà U1 Cabanes 2,4 0,9
# 3 Alt Empordà W1 Castelló d'Empúries 2,1 0,5
# Some tables, like those on page 76 (for the table "DIRECCIÓ DOMINANT DEL VENT"), had more columns than others.
# Did our script take care of that?
names(final$` DIRECCIÓ DOMINANT DEL VENT`)
# [1] "COMARCA" "CODI" "i_NOM_EMA" "vent" "GEN" "FEB"
# [7] "MAR" "ABR" "MAI" "JUN" "JUL" "AGO"
# [13] "SET" "OCT" "NOV" "DES" "ANY"
It sort of worked. But your input file is not perfect, and that means there will still be a lot of cleaning up to do. For instance, some columns in the PDF seem to have multiple values; I'm not sure how you would be able to do any analysis on those.
Hopefully, the comments in the above code help get you started on figuring out how to go about scraping the data in a better way.
Continuing after "Part 1" above, here's a solution that relies on (gasp) Excel. The basic idea is that Excel actually does a pretty decent job of detecting where the column breaks are if you import text as Fixed Width.
So, we use R to break the text up into separate pages, one file per page, keeping only the data (not the column names or row names, which are mostly the same across all datasets).
With that, here's the last R step:
# Output just the data
temp <- lapply(seq_along(Widths), function(y) {
  DEL <- sum(Widths[[y]][1:3]) - 2
  A <- substring(x[(Start[y] + 1):(sum(Start[y], Nrows[y]))], DEL)
  writeLines(A, paste0("temp_", y, ".txt"))
  A
})
Let's open file "temp_9.txt", which is one that has the missing columns:
In Excel's Text Import Wizard, make sure "Fixed Width" is selected - it should be by default, since the file has no delimiters.
Excel will then show you a preview of where it is going to make the column breaks.
Check the "problem rows" (the ones with missing values) in that preview to see how the detection worked out.