PDF text extraction from given coordinates

Problem description

I would like to extract text from a portion (using coordinates) of PDF using Ghostscript.

Can anyone help me out?

Solution

Yes, with Ghostscript, you can extract text from PDFs. But no, it is not the best tool for the job. And no, you cannot do it in "portions" (parts of single pages). What you can do: extract the text of a certain range of pages only.

First: Ghostscript's txtwrite output device (not so good)

 gs \
   -dBATCH \
   -dNOPAUSE \
   -sDEVICE=txtwrite \
   -dFirstPage=3 \
   -dLastPage=5 \
   -sOutputFile=- \
   /path/to/your/pdf

This will output all text contained on pages 3-5 to stdout. If you want output to a text file, use

   -sOutputFile=textfilename.txt


gs Update:

Recent versions of Ghostscript have seen major improvements and bug fixes in the txtwrite device. See the recent Ghostscript changelogs (search for txtwrite on that page) for details.


Second: Ghostscript's ps2ascii.ps PostScript utility (better)

This one requires you to download the latest version of the file ps2ascii.ps from the Ghostscript Git source code repository. You'd have to convert your PDF to PostScript, then run this command on the PS file:

gs \
  -q \
  -dNODISPLAY \
  -P- \
  -dSAFER \
  -dDELAYBIND \
  -dWRITESYSTEMDICT \
  -dSIMPLE \
   /path/to/ps2ascii.ps \
   input.ps \
  -c quit

If the -dSIMPLE parameter is not defined, each output line also carries information about the fonts and font sizes used, in addition to the plain text content.

If you replace that parameter with -dCOMPLEX, you'll get additional information about the colors and images used.

Read the comments inside ps2ascii.ps to learn more about this utility. It's not comfortable to use, but it worked for me in most cases where I needed it....
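The PDF-to-PostScript conversion that this approach requires is not spelled out above. One possible way to do it, as a minimal sketch, is Ghostscript's own ps2write device. The helper names ps_name and pdf_to_ps are made up for illustration, and whether ps2ascii.ps copes well with the resulting file is an assumption, not something the answer confirms.

```shell
#!/bin/sh
# Derive the PostScript file name from a PDF file name (report.pdf -> report.ps).
ps_name() {
    echo "${1%.pdf}.ps"
}

# Convert a PDF to PostScript with Ghostscript's ps2write device.
# $1 = path to the PDF; the .ps file is written next to it.
# (Hypothetical helper; adjust the options to your setup.)
pdf_to_ps() {
    gs -q -dBATCH -dNOPAUSE -dSAFER \
       -sDEVICE=ps2write \
       -sOutputFile="$(ps_name "$1")" \
       "$1"
}
```

Calling pdf_to_ps input.pdf would write input.ps, which can then be passed to the ps2ascii.ps command shown above. Poppler's pdftops (pdftops input.pdf input.ps) is another tool that could be used for this step.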

Third: XPDF's pdftotext CLI utility (more comfortable than Ghostscript)

A more comfortable way to do text extraction: use pdftotext (available for Windows as well as Linux/Unix or Mac OS X). This utility is based either on Poppler or on XPDF. This is a command you could try:

 pdftotext \
   -f 13 \
   -l 17 \
   -layout \
   -opw supersecret \
   -upw secret \
   -eol unix \
   -nopgbrk \
   /path/to/your/pdf \
   - | less

This extracts pages 13 (-f, first page) through 17 (-l, last page), preserves the layout of a PDF file protected by two passwords (user password secret, owner password supersecret), uses the Unix EOL convention, inserts no page breaks between PDF pages, and pipes the result through less...

pdftotext -h displays all available commandline options.

Of course, both Ghostscript and pdftotext only work for the text parts of PDFs (if they have any). Oh, and mathematical formulas also won't work too well... ;-)


pdftotext Update:

Recent versions of Poppler's pdftotext now have options to extract "a portion (using coordinates) of PDF" pages, like the OP asked for. The parameters are:

  • -x <int> : top left corner's x-coordinate of crop area
  • -y <int> : top left corner's y-coordinate of crop area
  • -W <int> : crop area's width in pixels (defaults to 0)
  • -H <int> : crop area's height in pixels (defaults to 0)

These options work best when combined with the -layout parameter.
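Since the crop options take pixel values, a region measured in PDF points (1/72 inch) may need converting first. The sketch below assumes Poppler's pdftotext and assumes that the crop values are interpreted at the resolution set with -r (72 DPI by default, so one pixel equals one point). The helper names pt_to_px and crop_args are hypothetical, not part of pdftotext.

```shell
#!/bin/sh
# Convert a length in PDF points (1/72 inch) to pixels at a given
# resolution, rounded to the nearest integer.
# $1 = length in points, $2 = resolution in DPI
pt_to_px() {
    echo $(( ($1 * $2 + 36) / 72 ))
}

# Print pdftotext crop options for a rectangle given in points.
# $1 = DPI, $2 = left, $3 = top, $4 = width, $5 = height
# Assumption: -x/-y/-W/-H are measured in pixels at the -r resolution.
crop_args() {
    echo "-r $1 -x $(pt_to_px "$2" "$1") -y $(pt_to_px "$3" "$1")" \
         "-W $(pt_to_px "$4" "$1") -H $(pt_to_px "$5" "$1")"
}
```

For example, to extract a box 4 inches wide and 1 inch high whose top-left corner sits 1 inch from the left and 2 inches from the top of page 3, one could run: pdftotext -f 3 -l 3 -layout $(crop_args 72 72 144 288 72) input.pdf - (the unquoted substitution is intentional so that the options split into separate words).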


Fourth: MuPDF's mutool draw command can also extract text

The cross-platform, open source MuPDF application (made by the same company that also develops Ghostscript) has bundled a command line tool, mutool. To extract text from a PDF with this tool, use:

mutool draw -F txt the.pdf

This will emit the extracted text to <stdout>. Use -o filename.txt to write it to a file.

Fifth: PDFLib's Text Extraction Toolkit (TET) (best of all... but it is PayWare)

TET, the Text Extraction Toolkit from the pdflib family of products, can find the x-y coordinates of text content in a PDF file (and much more). TET has a commandline interface, and it's the most powerful of all text extraction tools I'm aware of. (It can even handle ligatures...) Quote from their website:

Geometry
TET provides precise metrics for the text, such as the position on the page, glyph widths, and text direction. Specific areas on the page can be excluded or included in the text extraction, e.g. to ignore headers and footers or margins.

In my experience, while it does not sport the most straightforward CLI you can imagine, once you get used to it, it does what it promises to do for most PDFs you throw at it...


And there are even more options:

  1. podofotxtextract (CLI tool) from the PoDoFo project (Open Source)
  2. calibre (normally a GUI program to handle eBooks, Open Source) has a commandline option that can extract text from PDFs
  3. AbiWord (a GUI word processor, Open Source) can import PDFs and save its files as .txt: abiword --to=txt --to-name=output.txt input.pdf
