How to extract bold and non-bold text from a PDF using R

Question

I am using R for extracting text. The code below works well to extract the non-bold text from pdf but it ignores the bold part. Is there a way to extract both bold and non-bold text?

 news <-'http://www.frbe-kbsb.be/sites/manager/ICN/14-15/ind01.pdf'
 library(pdftools)
 library(tesseract)
 library(tiff)
 info <- pdf_info(news)
 numberOfPageInPdf <- as.numeric(info[2])
 numberOfPageInPdf
 for (i in 1:numberOfPageInPdf){
      bitmap <- pdf_render_page(news, page=i, dpi = 300, numeric = TRUE)
      file_name <- paste0("page", i, ".tiff") 
      file_tiff <- tiff::writeTIFF(bitmap, file_name)
      out <- ocr(file_name)
      file_txt <- paste0("text", i, ".txt") 
      writeLines(out, file_txt)
    }

Recommended answer

I like using the tabulizer library for this. Here's a small example:

devtools::install_github("ropensci/tabulizer")
library(tabulizer)

news <-'http://www.frbe-kbsb.be/sites/manager/ICN/14-15/ind01.pdf'

# note that you need to specify UTF-8 as the encoding, otherwise your special characters
# won't come in correctly

page1 <- extract_tables(news, guess=TRUE, page = 1, encoding='UTF-8')

page1[[1]]

      [,1] [,2]                    [,3]       [,4]                [,5]    [,6]                [,7]      
 [1,] ""   "Division: 1"           ""         ""                  ""      ""                  "Série: A"
 [2,] ""   "514"                   ""         "Fontaine 1 KBSK 1" ""      ""                  "303"     
 [3,] "1"  "62529 WIRIG ANTHONY"   ""         "2501 1⁄2-1⁄2"      "51560" "CZEBE ATTILLA"     "2439"    
 [4,] "2"  "62359 BRUNNER NICOLAS" ""         "2443 0-1"          "51861" "PICEU TOM"         "2401"    
 [5,] "3"  "75655 CEKRO EKREM"     ""         "2393 0-1"          "10391" "GEIRNAERT STEVEN"  "2400"    
 [6,] "4"  "50211 MARECHAL ANDY"   ""         "2355 0-1"          "35181" "LEENHOUTS KOEN"    "2388"    
 [7,] "5"  "73059 CLAESEN PIETER"  ""         "2327 1⁄2-1⁄2"      "25615" "DECOSTER FREDERIC" "2373"    
 [8,] "6"  "63614 HOURIEZ CLEMENT" ""         "2304 1⁄2-1⁄2"      "44954" "MAENHOUT THIBAUT"  "2372"    
 [9,] "7"  "60369 CAPONE NICOLA"   ""         "2283 1⁄2-1⁄2"      "10430" "VERLINDE TIEME"    "2271"    
[10,] "8"  "70653 LE QUANG KIM"    ""         "2282 0-1"          "44636" "GRYSON WOUTER"     "2269"    
[11,] ""   ""                      "< 2361 >" "12 - 20"           ""      "< 2364 >"          ""      

You can also use the locate_areas function to specify a specific region if you only care about some of the tables. Note that for locate_areas to work, I had to download the file locally first; using the URL returned an error.
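As a rough sketch of that workaround, the PDF can be saved to a temporary file before calling locate_areas. The helper name fetch_pdf is hypothetical (it is not part of tabulizer), and the interactive part is guarded so that it only runs in an interactive session with tabulizer installed:

```r
# Hypothetical helper (not part of tabulizer): save a remote PDF to a local
# file and return the local path.
fetch_pdf <- function(url, dest = tempfile(fileext = ".pdf")) {
  # mode = "wb" keeps the binary PDF intact on Windows
  download.file(url, dest, mode = "wb", quiet = TRUE)
  dest
}

# locate_areas() opens an interactive widget, so only run it interactively.
if (interactive() && requireNamespace("tabulizer", quietly = TRUE)) {
  news <- 'http://www.frbe-kbsb.be/sites/manager/ICN/14-15/ind01.pdf'
  local_pdf <- fetch_pdf(news)
  # Draw a rectangle around the table of interest on page 1
  areas <- tabulizer::locate_areas(local_pdf, pages = 1)
  selected <- tabulizer::extract_tables(local_pdf, pages = 1, guess = FALSE,
                                        area = areas, encoding = "UTF-8")
}
```

The coordinates returned by locate_areas can then be copied into a script so that later runs do not need the interactive step.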

You'll note that each table is its own element in the returned list.
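To illustrate working with that list structure, here is a minimal sketch that uses made-up matrices as a stand-in for the extract_tables output (the values are placeholders, not data from the PDF):

```r
# Made-up stand-in for what extract_tables() returns:
# a list with one character matrix per detected table.
tables <- list(
  matrix(c("1", "2", "A", "B"), nrow = 2),
  matrix(c("3", "4", "5", "C", "D", "E"), nrow = 3)
)

# Number of rows in each table
table_rows <- vapply(tables, nrow, integer(1))

# Convert every table to a data frame so each can be cleaned separately
table_dfs <- lapply(tables, as.data.frame, stringsAsFactors = FALSE)

# Individual tables are accessed with [[ ]], as with page1[[1]] above
second_table <- table_dfs[[2]]
```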

Here's an example using a custom region to just select the first table on each page:

# area is given as c(top, left, bottom, right)
customArea <- extract_tables(news, guess=FALSE, page = 1, area=list(c(84,27,232,569)), encoding = 'UTF-8')

This is also a more direct method than using the OCR (Optical Character Recognition) library tesseract, because you're not relying on OCR to translate pixel arrangements back into text. In digital PDFs, each text element has an x and y position, and the tabulizer library uses that information in its table-detection heuristics to extract sensibly formatted data. You'll see you still have some cleanup to do, but it's pretty manageable.
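To make the idea of positioned text concrete, here is a toy sketch that rebuilds lines of text from word coordinates. It is not tabulizer's actual algorithm. The data frame is made up, using the x, y and text column names that pdftools::pdf_data() returns, and it assumes that words on the same line share exactly the same y value:

```r
# Made-up word positions (placeholders, not read from the PDF)
words <- data.frame(
  x = c(120, 30, 30, 120),
  y = c(100, 100, 115, 115),
  text = c("ANTHONY", "WIRIG", "BRUNNER", "NICOLAS"),
  stringsAsFactors = FALSE
)

# Sort top-to-bottom, then left-to-right
words <- words[order(words$y, words$x), ]

# Words sharing a y coordinate are treated as one line of text
text_lines <- tapply(words$text, words$y, paste, collapse = " ")
```

Real PDFs usually need some tolerance when comparing y values, since words on the same visual line can differ by a point or two.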

Just for fun, here's a little example of starting the cleanup with data.table:

library(data.table)

cleanUp <- setDT(as.data.frame(page1[[1]]))

cleanUp[ ,  `:=` (Division = as.numeric(gsub("^.*(\\d+{1,2}).*", "\\1", grep('Division', cleanUp$V2, value=TRUE))),
  Series = as.character(gsub(".*:\\s(\\w).*","\\1", grep('Série:', cleanUp$V7, value=TRUE))))
  ][,ID := tstrsplit(V2," ", fixed=TRUE, keep = 1)
  ][, c("V1", "V3") := NULL
  ][-grep('Division', V2, fixed=TRUE)]

Here we've moved Division, Series, and ID into their own columns, and removed the Division header row. This is just the general idea, and would need a little refinement to apply to all 27 pages.
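One possible way to generalize this is to wrap the cleanup in a function and apply it to every table. The sketch below uses base R rather than data.table so that it stays self-contained. The helper name clean_page is hypothetical, the second matrix is made-up placeholder data, and the function assumes that each table has exactly one header row containing "Division" in column 2 and the series in column 7:

```r
# Hypothetical helper: tidy one table matrix shaped like page1[[1]] above.
# Assumes a single header row with "Division" in column 2, series in column 7.
clean_page <- function(m) {
  df <- as.data.frame(m, stringsAsFactors = FALSE)
  header <- grep("Division", df$V2, fixed = TRUE)[1]
  division <- as.numeric(sub("^[^0-9]*([0-9]+).*$", "\\1", df$V2[header]))
  series <- sub("^.*:\\s*(\\w).*$", "\\1", df$V7[header])
  df <- df[-header, , drop = FALSE]  # drop the header row
  df$Division <- division
  df$Series <- series
  df
}

# Two small stand-in tables; "\u00e9" is the accented e in the series label.
page_a <- matrix(c(
  "",  "Division: 1",         "", "",             "",      "",              "S\u00e9rie: A",
  "1", "62529 WIRIG ANTHONY", "", "2501 1/2-1/2", "51560", "CZEBE ATTILLA", "2439"
), ncol = 7, byrow = TRUE)
page_b <- matrix(c(  # made-up placeholder rows
  "",  "Division: 2",      "", "",         "",      "",           "S\u00e9rie: B",
  "1", "00001 PLAYER ONE", "", "2000 1-0", "00002", "PLAYER TWO", "1990"
), ncol = 7, byrow = TRUE)

# Clean each table, then stack the results into one data frame
all_pages <- do.call(rbind, lapply(list(page_a, page_b), clean_page))
```

With real data, the list passed to lapply would be the output of extract_tables for the pages of interest, and tables that lack the header row would need separate handling.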

                       V2                V4    V5                V6   V7 Division Series    ID
 1:                   514 Fontaine 1 KBSK 1                          303        1      A   514
 2:   62529 WIRIG ANTHONY      2501 1/2-1/2 51560     CZEBE ATTILLA 2439        1      A 62529
 3: 62359 BRUNNER NICOLAS          2443 0-1 51861         PICEU TOM 2401        1      A 62359
 4:     75655 CEKRO EKREM          2393 0-1 10391  GEIRNAERT STEVEN 2400        1      A 75655
 5:   50211 MARECHAL ANDY          2355 0-1 35181    LEENHOUTS KOEN 2388        1      A 50211
 6:  73059 CLAESEN PIETER      2327 1/2-1/2 25615 DECOSTER FREDERIC 2373        1      A 73059
 7: 63614 HOURIEZ CLEMENT      2304 1/2-1/2 44954  MAENHOUT THIBAUT 2372        1      A 63614
 8:   60369 CAPONE NICOLA      2283 1/2-1/2 10430    VERLINDE TIEME 2271        1      A 60369
 9:    70653 LE QUANG KIM          2282 0-1 44636     GRYSON WOUTER 2269        1      A 70653
10:                                 12 - 20                < 2364 >             1      A    NA
