Extract Text from Two-Column PDF with R
Question
I have a lot of PDFs in two-column format. I am using the pdftools package in R. Is there a way to read each PDF according to its two-column layout, without cropping each PDF individually?
Each PDF consists of selectable text, and the pdf_text function has no problem reading it. The only issue is that it reads the first line of the first column and then continues with the same line of the next column, instead of moving down the first column.
Thank you very much for your help.
Recommended answer
I had the same problem. What I did was find the most frequent space positions on each page of my PDFs and store them in a vector. Then I sliced each line of the page at those positions.
library(pdftools)

src <- ""  # path to the PDF file

# Strip leading and trailing whitespace.
trim <- function(x) gsub("^\\s+|\\s+$", "", x)

QTD_COLUMNS <- 2

# text: the lines of one page, all padded to the same width.
read_text <- function(text) {
  result <- character(0)
  # Get the index of every " " in each line of the page.
  lstops <- gregexpr(pattern = " ", text)
  # Put the positions of the most frequent " " in a vector.
  stops <- as.integer(names(sort(table(unlist(lstops)), decreasing = TRUE)[1:2]))
  # Slice based on the specified number of columns (this can be improved).
  for (i in seq_len(QTD_COLUMNS)) {
    temp_result <- sapply(text, function(x) {
      start <- 1
      stop <- stops[i]
      if (i > 1)
        start <- stops[i - 1] + 1
      if (i == QTD_COLUMNS)  # last column, read until the end
        stop <- nchar(x) + 1
      substr(x, start = start, stop = stop)
    }, USE.NAMES = FALSE)
    temp_result <- trim(temp_result)
    result <- append(result, temp_result)
  }
  result
}

txt <- pdf_text(src)
result <- character(0)  # start empty so the output has no leading "" element
for (i in seq_along(txt)) {
  page <- txt[i]
  t1 <- unlist(strsplit(page, "\n"))
  # Pad every line to the width of the longest line on the page.
  maxSize <- max(nchar(t1))
  t1 <- paste0(t1, strrep(" ", maxSize - nchar(t1)))
  result <- append(result, read_text(t1))
}
result
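To see how the most-frequent-space heuristic behaves without a PDF at hand, here is a minimal sketch that applies the same idea to a small made-up page. The sample text and the names left_col, right_col, page_lines, gutter and reading_order are assumptions for illustration only; they are not part of the answer above or of pdftools.

```r
# Hypothetical two-column page: the left column is padded to 14 characters,
# roughly mimicking the layout pdf_text() returns for a two-column page.
left_col  <- c("Alpha beta", "gamma", "eps zeta")
right_col <- c("Delta one", "two three", "four")
page_lines <- sprintf("%-14s%s", left_col, right_col)

# Pad every line to the same width so character positions are comparable.
width  <- max(nchar(page_lines))
padded <- paste0(page_lines, strrep(" ", width - nchar(page_lines)))

# Collect the position of every space on the page and take the most
# frequent one as a guess at where the gutter between the columns starts.
space_pos <- unlist(gregexpr(" ", padded, fixed = TRUE))
counts    <- table(space_pos)
gutter    <- as.integer(names(counts)[which.max(counts)])

# Slice each line at the gutter and trim the padding.
left  <- trimws(substr(padded, 1, gutter))
right <- trimws(substr(padded, gutter + 1, width))

# Reading order: the whole left column first, then the right column.
reading_order <- c(left, right)
print(reading_order)
```

On this toy page the guess lands inside the run of spaces that separates the columns, so both columns come back intact. On real pages a space inside a column can be just as frequent as the gutter, so it is worth checking the detected position on a few pages before processing a whole batch.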