Error while doing OCR on a PDF in R


Question

I am trying to run OCR on a PDF in R, and it gives me the error below. After running the code, the "i.txt" file is generated, but the error still appears.

pdftoppm version 4.00
Copyright 1996-2017 Glyph & Cog, LLC
Usage: pdftoppm [options] <PDF-file> <PPM-root>
  -f <int>          : first page to print
  -l <int>          : last page to print
  -r <number>       : resolution, in DPI (default is 150)
  -mono             : generate a monochrome PBM file
  -gray             : generate a grayscale PGM file
  -freetype <string>: enable FreeType font rasterizer: yes, no
  -aa <string>      : enable font anti-aliasing: yes, no
  -aaVector <string>: enable vector anti-aliasing: yes, no
  -opw <string>     : owner password (for encrypted files)
  -upw <string>     : user password (for encrypted files)
  -q                : don't print any messages or errors
  -cfg <string>     : configuration file to use in place of .xpdfrc
  -v                : print copyright and version info
  -h                : print usage information
  -help             : print usage information
  --help            : print usage information
  -?                : print usage information
convert.exe: unable to open image '*.ppm': Invalid argument @ error/blob.c/OpenBlob/3146.
convert.exe: no images defined `D:/PDF_OCR_File/test.pdf.tif' @ error/convert.c/ConvertImageCommand/3275.
Tesseract Open Source OCR Engine v3.05.01 with Leptonica
Error in fopenReadStream: file not found
Error in findFileFormat: image file not found
Error during processing.
[[1]]
[1] FALSE

Warning messages:
1: running command 'C:\Windows\system32\cmd.exe /c "D:/Software_for_PDF_OCR/xpdf-tools-win-4.00/bin64/pdftoppm.exe D:/PDF_OCR_File/test.pdf -f 1 -l 2 -r 600 ocrbook"' had status 99 
2: In shell(shQuote(paste0("D:/Software_for_PDF_OCR/xpdf-tools-win-4.00/bin64/pdftoppm.exe ",  :
  '"D:/Software_for_PDF_OCR/xpdf-tools-win-4.00/bin64/pdftoppm.exe D:/PDF_OCR_File/test.pdf -f 1 -l 2 -r 600 ocrbook"' execution failed with error code 99
3: running command 'C:\Windows\system32\cmd.exe /c "D:/Software_for_PDF_OCR/ImageMagick-7.0.7-Q16/convert.exe *.ppm D:/PDF_OCR_File/test.pdf.tif"' had status 1 
4: In shell(shQuote(paste0("D:/Software_for_PDF_OCR/ImageMagick-7.0.7-Q16/convert.exe *.ppm ",  :
  '"D:/Software_for_PDF_OCR/ImageMagick-7.0.7-Q16/convert.exe *.ppm D:/PDF_OCR_File/test.pdf.tif"' execution failed with error code 1
5: running command 'C:\Windows\system32\cmd.exe /c "D:/Software_for_PDF_OCR/Tesseract-OCR/tesseract.exe D:/PDF_OCR_File/test.pdf.tif D:/PDF_OCR_File/test.pdf -l eng"' had status 1 
6: In shell(shQuote(paste0("D:/Software_for_PDF_OCR/Tesseract-OCR/tesseract.exe ",  :
  '"D:/Software_for_PDF_OCR/Tesseract-OCR/tesseract.exe D:/PDF_OCR_File/test.pdf.tif D:/PDF_OCR_File/test.pdf -l eng"' execution failed with error code 1
7: In file.remove(paste0(i, ".tiff")) :
  cannot remove file 'D:/PDF_OCR_File/test.pdf.tiff', reason 'No such file or directory'

My working directory, set with setwd(), is "D:/PDF_OCR_File".

This is the code that produces the error:

dest <- "D:/PDF_OCR_File"
myfiles <- list.files(path = dest, pattern = "pdf",  full.names = TRUE)

sapply(myfiles, FUN = function(i){
  file.rename(from = i, to =  paste0(dirname(i), "/", gsub(" ", "", basename(i))))
})


myfiles <- list.files(path = dest, pattern = "pdf",  full.names = TRUE)




lapply(myfiles, function(i){

  shell(shQuote(paste0("D:/Software_for_PDF_OCR/xpdf-tools-win-4.00/bin64/pdftoppm.exe ", i, " -f 1 -l 2 -r 600 ocrbook")))
  # convert ppm to tif ready for tesseract
  shell(shQuote(paste0("D:/Software_for_PDF_OCR/ImageMagick-7.0.7-Q16/convert.exe *.ppm ", i, ".tif")))
  # convert tif to text file
  shell(shQuote(paste0("D:/Software_for_PDF_OCR/Tesseract-OCR/tesseract.exe ", i, ".tif ", i, " -l eng")))
  # delete tif file
  file.remove(paste0(i, ".tiff" ))
})

I don't know what is going wrong or what mistake I'm making. Any suggestions would be helpful. Thanks.

Recommended answer

I bet you based your code on this example, right? I found a lot of issues with that code, as well as some outdated syntax.

The solution I came up with is:

# Folder that holds the PDFs (adjust to your own path)
dest <- "C:\\users\\YOURNAME\\desktop"

# Strip spaces from the PDF file names so the shell commands below do not break
files <- list.files(path = dest, pattern = "pdf", full.names = TRUE)
sapply(files, FUN = function(a) {
  file.rename(from = a, to = paste0(dirname(a), "/", gsub(" ", "", basename(a))))
})

# PDF -> PPM: options come before the PDF, and the output root is the PDF's own name.
# Note: shell() is only available in R on Windows.
files <- tools::file_path_sans_ext(list.files(path = dest, pattern = "pdf", full.names = TRUE))
lapply(files, function(i) {
  shell(shQuote(paste0("pdftoppm -f 1 -l 10 -r 70 ", i, ".pdf", " ", i)))
})

# PPM -> TIF, deleting each PPM once it has been converted
myppms <- tools::file_path_sans_ext(list.files(path = dest, pattern = "ppm", full.names = TRUE))
lapply(myppms, function(y) {
  shell(shQuote(paste0("magick ", y, ".ppm", " ", y, ".tif")))
  file.remove(paste0(y, ".ppm"))
})

# TIF -> text, deleting each TIF once it has been processed
mytiffs <- tools::file_path_sans_ext(list.files(path = dest, pattern = "tif", full.names = TRUE))
lapply(mytiffs, function(z) {
  shell(shQuote(paste0("tesseract ", z, ".tif", " ", z)))
  file.remove(paste0(z, ".tif"))
})

The first problem with the GitHub snippet is that some options are missing and the rest are in the wrong position for the command line to parse, which is why you are getting the help menu. "ocrbook" is the output file name root (which is bad if you want to process more than one file), so whatever the format, PPM or PNG, you get a file named like "ocrbook-000001.png". The issue with function(i) in that block of code is that it looks for "originalpdfname.pdf.png" instead of the file that was actually written, "ocrbook-000001". I fixed that by creating a function within a function to find the PNG files and put them into (z).
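As a rough sketch of the argument order described above, the command can be assembled so that every option comes before the PDF path and the output root is derived from each PDF's own name. The helper names pdftoppm_command and expected_ppm are hypothetical, pdftoppm is assumed to be on the PATH, and the six-digit page suffix is inferred from the "ocrbook-000001" name mentioned above rather than taken from documentation.

```r
# Hypothetical helper: build a pdftoppm call with the options first,
# then the PDF, then a per-file output root.
pdftoppm_command <- function(pdf, first = 1, last = 2, dpi = 600,
                             exe = "pdftoppm") {  # assumption: pdftoppm is on the PATH
  root <- tools::file_path_sans_ext(pdf)
  paste(exe, "-f", first, "-l", last, "-r", dpi,
        shQuote(pdf, type = "cmd"), shQuote(root, type = "cmd"))
}

# Hypothetical helper: the page image names to look for afterwards
# (assumes the "<root>-000001.ppm" naming pattern).
expected_ppm <- function(pdf, pages) {
  sprintf("%s-%06d.ppm", tools::file_path_sans_ext(pdf), as.integer(pages))
}

cmd <- pdftoppm_command("D:/PDF_OCR_File/test.pdf")
ppms <- expected_ppm("D:/PDF_OCR_File/test.pdf", 1:2)
print(cmd)
print(ppms)
```

The resulting string could then be passed to shell() on Windows. Because the root changes with each PDF, pages from different files no longer overwrite one another.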

Tesseract [is supposed to] read PNG files just fine, so there is no need to use ImageMagick to convert from PPM to TIFF; just use xPDF to convert the PDF to PNG. Separately, the ImageMagick syntax in the GitHub example is outdated: "convert" apparently clashes with another CMD command, so later versions renamed it to "magick". See here. For consistency, I used the converter in the example anyway.
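One way to guard against that name clash is to look only for magick and never fall back to a bare convert. The sketch below uses base R's Sys.which(); the helper name find_magick is hypothetical, and it assumes ImageMagick, if installed, is reachable through the PATH.

```r
# Hypothetical helper: return the full path to "magick", or NA if it is not found.
# A bare "convert" is deliberately not used as a fallback, because on Windows
# that name may resolve to a different command.
find_magick <- function() {
  path <- unname(Sys.which("magick"))
  if (nzchar(path)) path else NA_character_
}

magick_path <- find_magick()
if (is.na(magick_path)) {
  message("magick was not found on the PATH; skip the conversion step or install ImageMagick")
} else {
  message("Using ImageMagick at: ", magick_path)
}
```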

Another thing about that code example is that tesseract defaults to English. This may be something introduced in newer versions, so there is no longer a need to specify "-l eng". See here. "out" apparently is the name of the exported txt file (judging purely from observation), so you need to strip the path down and use it in the function so that the output mimics the original file name and is not overwritten each time OCR runs on a new file.
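To illustrate the naming point, here is a minimal sketch that derives the tesseract output base from each image's own path, so every input gets its own txt file. The helper name tesseract_command is hypothetical, tesseract is assumed to be on the PATH, and the language flag is left optional rather than assumed to be unnecessary.

```r
# Hypothetical helper: build a tesseract call whose output base mirrors the
# input image name (tesseract appends ".txt" to the output base itself).
tesseract_command <- function(image, lang = NULL,
                              exe = "tesseract") {  # assumption: tesseract is on the PATH
  out_base <- tools::file_path_sans_ext(image)
  cmd <- paste(exe, shQuote(image, type = "cmd"), shQuote(out_base, type = "cmd"))
  # Add "-l <lang>" only when a language is requested explicitly
  if (!is.null(lang)) cmd <- paste(cmd, "-l", lang)
  cmd
}

cmd_default <- tesseract_command("D:/PDF_OCR_File/test-000001.tif")
cmd_eng <- tesseract_command("D:/PDF_OCR_File/test-000001.tif", lang = "eng")
print(cmd_default)
print(cmd_eng)
```

Passing the language explicitly costs nothing and avoids relying on the default of whichever tesseract version happens to be installed.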
