Scraping large PDF tables which span across multiple pages

Question

I am trying to scrape PDF tables which span across multiple pages. I tried many things but the best seems to be pdftotext -layout as advised here. The problem is that the resultant text file is not easy to work with, as the table layout differs across pages, so the columns are not aligned. Also note missing values in lines beginning with "Solsonès":

                                                                        TEMPERATURA MITJANA MENSUAL ( ºC ) - 2012

COMARCA          CODI i NOM EMA                    GEN    FEB    MAR         ABR       MAI      JUN      JUL          AGO        SET        OCT        N

Alt Camp         VY   Nulles                        7,5    5,5   10,9         12,3     16,7     21,6     22,3         24,4       20,1        15,9
Alt Camp         DQ   Vila-rodona                   7,9    5,6   11,0         12,0     16,6     21,6     22,0         24,3       19,9        15,8
Alt Empordà      U1   Cabanes                       8,2    6,5   11,7         12,6     17,5     22,0     23,1         24,4       20,4        16,6
Alt Empordà      W1   Castelló d'Empúries           8,1    6,4   11,6         12,9     17,0     21,1     22,0         23,4       20,1        16,4

[...]
                                                                                 TEMPERATURA MITJANA MENSUAL ( ºC ) - 2012

COMARCA          CODI i NOM EMA                             GEN    FEB    MAR         ABR       MAI      JUN      JUL          AGO        SET        OCT

Baix Empordà     DF   la Bisbal d'Empordà                    6,6    5,3   10,9         12,6     17,2     21,9     22,9         24,6       20,3        16
Baix Empordà     UB   la Tallada d'Empordà                   6,1    5,2   10,7         12,3     16,6     21,3     22,2         23,8       19,7        15
Baix Empordà     UC   Monells                                6,1    4,6    9,9         11,4     16,5     21,7     23,0         24,5       19,6        15

[...]

                                                                        TEMPERATURA MITJANA MENSUAL ( ºC ) - 2012

COMARCA         CODI i NOM EMA                      GEN    FEB    MAR         ABR       MAI      JUN      JUL           AGO        SET        OCT
[...]

Solsonès        CA   Clariana de Cardener            4,6    3,3   10,3         10,2     16,7     22,3      d.i.
Solsonès        Z8   el Port del Comte (2.316 m)    -0,9   -6,3   -0,2         -2,0      5,3     10,5     10,9          13,8        7,8         4,2
Solsonès        VO   Lladurs                         3,0    2,6    9,5          9,0     15,3     21,4     21,6          24,3       17,5        13,0
Solsonès        VP   Pinós                           3,0    1,6    8,9          9,2     15,4     21,1     21,3          23,8       17,6        13,3
Solsonès        XT   Solsona                                                                               d.i.         24,3       18,0        13,5
Tarragonès      VQ   Constantí                       7,9   6,0    11,2         13,1     17,1     21,9     22,6          24,6       20,6        16,6
Tarragonès      XE   Tarragona - Complex Educatiu   10,2   7,8    12,3         14,6     18,3     23,0     24,2          26,2       23,0 *      18,4
Tarragonès      DK   Torredembarra                   9,7   7,7    12,3         14,3     17,9     22,8     24,3          26,2       22,7        18,5
Terra Alta      WD   Batea                           6,3   5,0    11,2         12,1     18,3     23,0     23,3          25,5       20,2        15,9
Terra Alta      XP   Gandesa                         6,6   5,2    11,2         12,2     18,1     22,9     23,4          25,6       20,4        16,0
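
For reference, the text above comes from the layout-preserving mode of pdftotext (part of poppler/xpdf). A minimal sketch of that call from R (the PDF file name is an assumption):

# Run pdftotext with -layout so that the visual table layout is kept
#  in the resulting text file; file names are assumptions.
system2("pdftotext",
        args = c("-layout", "EMAtaules2012.pdf", "EMAtaules2012.txt"))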

The full file is available for download (UTF-8).

So this output is not very easy to parse. What other approaches are available?

It seems that every tool I use is only capable of extracting information about the layout of the table cells, but not which column a given cell belongs to. This becomes very apparent when cells are empty - the empty cells are simply not in the output; you only get the non-empty "cells" with their layout. Does the PDF itself contain this tabular information? If not, it does not make sense to search for a tool that will extract it.

Paid solutions are not out of the question, as one might in the end be cheaper than investing several working days of my time...

Things I have tried:

  • copy-paste - causes problems with missing values (pg 5)
  • save as text from Acrobat (an even worse result than copy-paste)
  • open in Excel as an external data source - it will not recognize the table
  • https://www.pdftoexcelonline.com/ - results in an error
  • http://www.pdftoexcel.org/ as well as their trial of Able2Extract - they messed up some columns; the columns were recognized correctly in the preview, but were misaligned in the Excel output
  • http://www.pdftoword.com/ - just takes my email and never sends anything
  • using python on scraperwiki http://schoolofdata.org/2013/06/18/get-started-with-scraping-extracting-simple-tables-from-pdf-documents/ seems very complicated, especially for non-python users, and https://scraperwiki.com/ is not free
  • I have encountered several python libraries like pdftables, but they are not easy to use for a non-python developer like me (I was not even able to run them). Is there any easier way to accomplish the task?

I am trying to use the tm library in R, as recommended here, but I have encountered some problems.
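
For what it is worth, here is a minimal sketch of what that tm approach might look like; the control options and whether "-layout" is honoured depend on the tm version and the PDF engine installed, so treat this as an assumption rather than a recipe:

library(tm)
# readPDF() returns a reader function; on engines that shell out to
#  pdftotext, control$text passes extra flags such as "-layout".
read_pdf <- readPDF(control = list(text = "-layout"))
doc <- read_pdf(elem = list(uri = "EMAtaules2012.pdf"),
                language = "en", id = "temps2012")
txt <- content(doc)  # the extracted text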

I also tried the Cloud SDK recommended by Ian. I registered, but I have absolutely no idea where to go from here - how to upload pages, recognize them, etc.

Answer

Here is an R solution, but it is not without its flaws.

Part 1: Reading the text file and working out the column widths

# Read the lines of your file into R
x <- readLines("EMAtaules2012.txt")

# Make sure it shows up as UTF-8 to get proper accents and so on
Encoding(x) <- "UTF-8"

# Identify the lines where the data starts
Start <- grep("COMARCA", x)

# Grab the names of each table
ListNames <- gsub("\\s+", " ", x[Start-2])

# Figure out the number of rows of data per page
Runs <- rle(diff(cumsum(x != "")))
Nrows <- Runs$lengths[Runs$lengths > 4]+1

# Make our life easier by making this column name
#  a single string
x <- gsub("i NOM EMA", "i_NOM_EMA", x)

# Since these are fixed width files, we need to figure
#  out the widths of each column. This is the sum of
#  the number of characters in the header row plus
#  the number of spaces between each column name
Spaces <- gregexpr(x[Start], pattern="\\s+")
Spaces <- lapply(Spaces, function(x) c(attr(x, "match.length"), 0))
Chars <- lapply(strsplit(x[Start], "\\s+"), nchar)
Widths <- lapply(seq_along(Spaces), 
                 function(x) rowSums(cbind(Spaces[[x]], 
                                           Chars[[x]])))
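
Before moving on, it can help to sanity-check the intermediate objects built so far (this only prints what was detected, it does not modify anything):

# Quick sanity checks on what Part 1 detected
length(Start)    # number of header lines found (one per page)
head(ListNames)  # the table titles, taken from two lines above each header
head(Nrows)      # the number of data rows detected per page
Widths[[1]]      # the column widths derived from the first header row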

Part 2: Using read.fwf to get the data in

# Now, you can use `read.fwf` to read your data files in
temp <- lapply(seq_along(Start), function(fwf) {
  A <- read.fwf(textConnection(x), 
                widths = c(Widths[[fwf]]), 
                header = FALSE, 
                skip = Start[fwf]+1, 
                n = Nrows[fwf]-2, 
                blank.lines.skip = TRUE,
                strip.white = TRUE,
                stringsAsFactors = FALSE)
  # Add in the column names
  names(A) <- scan(what = "character", 
                   file = textConnection(x[Start[fwf]]), 
                   quiet = TRUE)
  A
})

# Assign the table names
names(temp) <- ListNames

# Some more cleanup. The original tables span multiple pages
#  in the PDF, but we can `rbind` them together in R
Tables <- unique(ListNames)
final <- lapply(seq_along(Tables), function(final) {
  A <- do.call(rbind, temp[names(temp) %in% Tables[final]])
  rownames(A) <- NULL
  A
})
# Add the names back in
names(final) <- Tables
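
Before looking at the contents, a quick check of how many distinct tables were recovered and how many rows each one has after its pages were combined:

length(final)        # number of distinct tables (one per unique title)
sapply(final, nrow)  # rows per table after rbind-ing its pages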

Part 3: Did it work?

# View the first few rows and columns of the first three tables
lapply(final[1:3], function(y) head(y[1:5], 3))
# $` TEMPERATURA MITJANA MENSUAL ( ºC ) - 2012`
#       COMARCA CODI           i_NOM_EMA GEN FEB
# 1    Alt Camp   DQ         Vila-rodona 7,9 5,6
# 2 Alt Empordà   U1             Cabanes 8,2 6,5
# 3 Alt Empordà   W1 Castelló d'Empúries 8,1 6,4
# 
# $` TEMPERATURA MÀXIMA MITJANA MENSUAL ( ºC ) - 2012`
#       COMARCA CODI           i_NOM_EMA  GEN  FEB
# 1    Alt Camp   DQ         Vila-rodona 13,1 11,7
# 2 Alt Empordà   U1             Cabanes 15,1 12,4
# 3 Alt Empordà   W1 Castelló d'Empúries 14,4 11,7
# 
# $` TEMPERATURA MÍNIMA MITJANA MENSUAL ( ºC ) - 2012`
#       COMARCA CODI           i_NOM_EMA GEN FEB
# 1    Alt Camp   DQ         Vila-rodona 3,8 0,5
# 2 Alt Empordà   U1             Cabanes 2,4 0,9
# 3 Alt Empordà   W1 Castelló d'Empúries 2,1 0,5

# Some tables, like those on page 76 (for the table "DIRECCIÓ DOMINANT DEL VENT"), had more columns than others. 
# Did our script take care of that?
names(final$` DIRECCIÓ DOMINANT DEL VENT`)
#  [1] "COMARCA"   "CODI"      "i_NOM_EMA" "vent"      "GEN"       "FEB"      
#  [7] "MAR"       "ABR"       "MAI"       "JUN"       "JUL"       "AGO"      
# [13] "SET"       "OCT"       "NOV"       "DES"       "ANY"    

It sort of worked. But your input file is not perfect, and that means that there will still be a lot of cleaning up to do. For instance, some columns in the PDF seem to have multiple values. I am not sure how you would be able to do any analysis on those.

Hopefully, the comments in the above code help get you started on figuring out how to go about scraping the data in a better way.
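
As one possible starting point for that cleanup, here is a sketch only: the "d.i." and "*" markers are taken from the sample output above, the name columns are assumed to be COMARCA, CODI and i_NOM_EMA as in Part 3, and genuinely non-numeric columns (such as "vent") would simply turn into NA.

# Strip footnote markers ("*") and the "d.i." placeholder, then convert
#  the comma decimals ("7,5") to numeric; empty cells become NA.
to_num <- function(v) {
  v <- gsub("\\*|d\\.i\\.", "", v)
  as.numeric(gsub(",", ".", trimws(v)))
}
final_num <- lapply(final, function(tab) {
  num_cols <- setdiff(names(tab), c("COMARCA", "CODI", "i_NOM_EMA"))
  tab[num_cols] <- lapply(tab[num_cols], to_num)
  tab
})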

Continuing after "Part 1" above, here's a solution that relies on (gasp) Excel. The basic idea is that Excel actually does a pretty decent job of detecting where the column breaks are if you import text as Fixed Width.

So, we use R to break the text up into separate files, one per page, containing only the data (not the column names or the row names, which are mostly the same across all of the datasets).

With that, here's the last R step:

# Output just the data
temp <- lapply(seq_along(Widths), function(y) {
  # Drop the first three columns (COMARCA, CODI, i_NOM_EMA) -- these "row
  #  names" are mostly the same across the datasets
  DEL <- sum(Widths[[y]][1:3])-2
  A <- substring(x[(Start[y]+1):(sum(Start[y], Nrows[y]))], DEL)
  writeLines(A, paste0("temp_", y, ".txt"))
  A
})

Let's open the file "temp_9.txt" in Excel; it is one of the pages that has the missing columns. In Excel's Text Import Wizard:

  • Make sure "Fixed Width" is selected -- it should be by default, since the file has no delimiters.
  • Excel shows you a preview of where it is going to make the column breaks.
  • Check the "problem rows" (the ones with missing values) to see how it worked out.
