How to extract bold and non-bold text from a PDF using R
Question
I am using R to extract text from a PDF. The code below extracts the non-bold text well, but it ignores the bold parts. Is there a way to extract both bold and non-bold text?
library(pdftools)
library(tesseract)
library(tiff)

news <- 'http://www.frbe-kbsb.be/sites/manager/ICN/14-15/ind01.pdf'

# Number of pages in the PDF
info <- pdf_info(news)
numberOfPageInPdf <- info$pages
numberOfPageInPdf

for (i in seq_len(numberOfPageInPdf)) {
  # Render the page to a bitmap, save it as a TIFF, then OCR the image
  bitmap <- pdf_render_page(news, page = i, dpi = 300, numeric = TRUE)
  file_name <- paste0("page", i, ".tiff")
  tiff::writeTIFF(bitmap, file_name)
  out <- ocr(file_name)
  # Write the recognized text for this page to its own file
  file_txt <- paste0("text", i, ".txt")
  writeLines(out, file_txt)
}
Recommended answer
I like using the tabulizer library for this. Here's a small example:
devtools::install_github("ropensci/tabulizer")
library(tabulizer)
news <-'http://www.frbe-kbsb.be/sites/manager/ICN/14-15/ind01.pdf'
# Note that you need to specify UTF-8 as the encoding; otherwise special characters
# won't come in correctly.
page1 <- extract_tables(news, guess = TRUE, pages = 1, encoding = 'UTF-8')
page1[[1]]
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] "" "Division: 1" "" "" "" "" "Série: A"
[2,] "" "514" "" "Fontaine 1 KBSK 1" "" "" "303"
[3,] "1" "62529 WIRIG ANTHONY" "" "2501 1⁄2-1⁄2" "51560" "CZEBE ATTILLA" "2439"
[4,] "2" "62359 BRUNNER NICOLAS" "" "2443 0-1" "51861" "PICEU TOM" "2401"
[5,] "3" "75655 CEKRO EKREM" "" "2393 0-1" "10391" "GEIRNAERT STEVEN" "2400"
[6,] "4" "50211 MARECHAL ANDY" "" "2355 0-1" "35181" "LEENHOUTS KOEN" "2388"
[7,] "5" "73059 CLAESEN PIETER" "" "2327 1⁄2-1⁄2" "25615" "DECOSTER FREDERIC" "2373"
[8,] "6" "63614 HOURIEZ CLEMENT" "" "2304 1⁄2-1⁄2" "44954" "MAENHOUT THIBAUT" "2372"
[9,] "7" "60369 CAPONE NICOLA" "" "2283 1⁄2-1⁄2" "10430" "VERLINDE TIEME" "2271"
[10,] "8" "70653 LE QUANG KIM" "" "2282 0-1" "44636" "GRYSON WOUTER" "2269"
[11,] "" "" "< 2361 >" "12 - 20" "" "< 2364 >" ""
You can also use the locate_areas function to select a specific region if you only care about some of the tables. Note that for locate_areas to work, I had to download the file locally first; using the URL returned an error.
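A minimal sketch of the download-first step, in case it helps. The helper name `local_pdf` is hypothetical (it is not part of tabulizer), and the commented usage assumes tabulizer is installed and the session is interactive:

```r
# Hypothetical helper (not part of tabulizer): return a local path for a PDF,
# downloading it to a temporary file first when given an http(s) URL.
local_pdf <- function(path) {
  if (!grepl("^https?://", path)) {
    return(path)  # already a local path, nothing to do
  }
  dest <- tempfile(fileext = ".pdf")
  # mode = "wb" keeps the binary PDF intact on Windows
  download.file(path, destfile = dest, mode = "wb", quiet = TRUE)
  dest
}

# Usage (interactive; needs tabulizer and a network connection):
# news_local <- local_pdf(news)
# areas <- tabulizer::locate_areas(news_local, pages = 1)
```

The local path can then be passed to extract_tables as well, which avoids downloading the file again on every call.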
You'll note that each table is its own element in the returned list.
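To illustrate working with that list, here is a small sketch that uses made-up matrices as a stand-in for real extract_tables output (the values and the names `tables`, `table_dims`, and `second` are assumptions for illustration only):

```r
# Stand-in for the list returned by extract_tables(): one character matrix
# per detected table (values are made up for illustration).
tables <- list(
  matrix(c("1", "WIRIG ANTHONY",
           "2", "BRUNNER NICOLAS"), ncol = 2, byrow = TRUE),
  matrix(c("1", "A", "2501",
           "2", "B", "2443",
           "3", "C", "2393"), ncol = 3, byrow = TRUE)
)

length(tables)                        # number of tables found

# One row per table, giving its number of rows and columns
table_dims <- t(sapply(tables, dim))
colnames(table_dims) <- c("rows", "cols")
table_dims

second <- tables[[2]]                 # pull out a single table by position
```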
Here's an example using a custom region to select just the first table on each page:
customArea <- extract_tables(news, guess = FALSE, pages = 1, area = list(c(84, 27, 232, 569)), encoding = 'UTF-8')
This is also a more direct method than OCR (Optical Character Recognition) with the tesseract library, because you're not relying on OCR to translate pixel arrangements back into text. In digital PDFs, each text element has an x and y position, and the tabulizer library uses that information in its table-detection heuristics to extract sensibly formatted data. You'll see you still have some cleanup to do, but it's pretty manageable.
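To make the idea of position-based extraction concrete, here is a deliberately simplified sketch in base R. It is not tabulizer's actual algorithm; the `words` data frame, its coordinates, and the tolerance `tol` are all made up for illustration:

```r
# Hypothetical word boxes, similar in spirit to what a PDF text layer stores:
# each word has an x (horizontal) and y (vertical) position in points.
words <- data.frame(
  x    = c(30, 120, 300, 31, 121, 299),
  y    = c(100, 101, 100, 115, 115, 116),
  text = c("1", "WIRIG", "2501", "2", "BRUNNER", "2443"),
  stringsAsFactors = FALSE
)

# Sort top to bottom, then start a new row whenever the vertical gap
# between consecutive words exceeds `tol` points.
tol <- 3
words <- words[order(words$y, words$x), ]
row_id <- cumsum(c(TRUE, diff(words$y) > tol))

# Within each row, order words left to right and paste them together.
rows <- as.vector(tapply(seq_len(nrow(words)), row_id, function(i) {
  paste(words$text[i][order(words$x[i])], collapse = " | ")
}))
rows
```

Grouping by y recovers the rows; a real table detector would also cluster the x positions to find column boundaries.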
Just for fun, here's a little example of starting the cleanup with data.table:
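library(data.table)
# stringsAsFactors = FALSE keeps the columns as character vectors
cleanUp <- setDT(as.data.frame(page1[[1]], stringsAsFactors = FALSE))
# Pull Division and Series out of the header row, take the ID from the start of V2,
# drop the empty columns, then remove the header row itself
cleanUp[, `:=`(Division = as.numeric(sub("^\\D*(\\d{1,2}).*$", "\\1", grep('Division', V2, value = TRUE))),
               Series = sub(".*:\\s(\\w).*", "\\1", grep('Série:', V7, value = TRUE)))
  ][, ID := tstrsplit(V2, " ", fixed = TRUE, keep = 1)
  ][, c("V1", "V3") := NULL
  ][-grep('Division', V2, fixed = TRUE)]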
Here we've moved Division, Series, and ID into their own columns, and removed the Division header row. This is just the general idea, and would need a little refinement to apply to all 27 pages.
V2 V4 V5 V6 V7 Division Series ID
1: 514 Fontaine 1 KBSK 1 303 1 A 514
2: 62529 WIRIG ANTHONY 2501 1/2-1/2 51560 CZEBE ATTILLA 2439 1 A 62529
3: 62359 BRUNNER NICOLAS 2443 0-1 51861 PICEU TOM 2401 1 A 62359
4: 75655 CEKRO EKREM 2393 0-1 10391 GEIRNAERT STEVEN 2400 1 A 75655
5: 50211 MARECHAL ANDY 2355 0-1 35181 LEENHOUTS KOEN 2388 1 A 50211
6: 73059 CLAESEN PIETER 2327 1/2-1/2 25615 DECOSTER FREDERIC 2373 1 A 73059
7: 63614 HOURIEZ CLEMENT 2304 1/2-1/2 44954 MAENHOUT THIBAUT 2372 1 A 63614
8: 60369 CAPONE NICOLA 2283 1/2-1/2 10430 VERLINDE TIEME 2271 1 A 60369
9: 70653 LE QUANG KIM 2282 0-1 44636 GRYSON WOUTER 2269 1 A 70653
10: 12 - 20 < 2364 > 1 A NA
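One possible way to extend this to several pages is to wrap the cleanup in a function and apply it to each page's matrix. The sketch below uses base R and two tiny made-up pages; the helper name `clean_page` is hypothetical, and it assumes every page contains exactly one header row shaped like the one above:

```r
# Hypothetical helper: tidy one page matrix shaped like page1[[1]] above.
# Assumes the page has a header row containing "Division" in column 2.
clean_page <- function(m) {
  df <- as.data.frame(m, stringsAsFactors = FALSE)
  header <- grep("Division", df$V2, fixed = TRUE)[1]
  # Pull the number after "Division:" and the letter after the colon in column 7.
  division <- as.numeric(sub("^\\D*(\\d+).*$", "\\1", df$V2[header]))
  series   <- sub("^.*:\\s*(\\w).*$", "\\1", df$V7[header])
  df <- df[-header, , drop = FALSE]
  df$Division <- division
  df$Series   <- series
  # The ID is the first space-separated token of V2.
  df$ID <- sub(" .*$", "", df$V2)
  df
}

# Two tiny made-up pages standing in for extract_tables() output.
# "\u00e9" is the accented e in the series label.
pages <- list(
  matrix(c("", "Division: 1", "", "", "", "", "S\u00e9rie: A",
           "1", "62529 WIRIG ANTHONY", "", "2501 0-1", "51560", "CZEBE ATTILLA", "2439"),
         ncol = 7, byrow = TRUE),
  matrix(c("", "Division: 2", "", "", "", "", "S\u00e9rie: B",
           "1", "75655 CEKRO EKREM", "", "2393 0-1", "10391", "GEIRNAERT STEVEN", "2400"),
         ncol = 7, byrow = TRUE)
)

# Clean each page, then stack the results into one data frame.
all_pages <- do.call(rbind, lapply(pages, clean_page))
all_pages[, c("ID", "Division", "Series")]
```

Pages whose layout differs (for example, more than one table or a missing header row) would need extra handling before this could run across the whole document.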