Error when extracting text from PDF using PDFBox
Problem description
The sample PDF is a three-page Chinese resume, processed with the standard code below:
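import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

try (PDDocument document = PDDocument.load(new File(path))) { // closes the document when done
    text = new PDFTextStripper().getText(document);
}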
The extraction result contains only some of the words.
If you run the text extraction code and enable logging, you'll see numerous warnings:
Feb 12, 2019 5:45:58 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARN: No Unicode mapping for CID+5482 (5482) in font GNPVNR+PingFangSC-Semibold
Feb 12, 2019 5:45:58 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARN: No Unicode mapping for CID+1842 (1842) in font GNPVNR+PingFangSC-Semibold
Feb 12, 2019 5:45:58 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARN: No Unicode mapping for CID+7566 (7566) in font GNPVNR+PingFangSC-Semibold
Feb 12, 2019 5:45:58 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARN: No Unicode mapping for CID+1915 (1915) in font GNPVNR+PingFangSC-Semibold
...
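To see at a glance which fonts are affected, the warnings can be tallied per font. The following is a minimal sketch that assumes the log lines have exactly the shape shown above; the class `UnmappedCidTally` and its method `countByFont` are made-up names for illustration and are not part of PDFBox.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: counts "No Unicode mapping" warnings per font.
class UnmappedCidTally {
    // Matches e.g. "WARN: No Unicode mapping for CID+5482 (5482) in font GNPVNR+PingFangSC-Semibold"
    private static final Pattern WARNING =
            Pattern.compile("No Unicode mapping for CID\\+(\\d+) \\((\\d+)\\) in font (\\S+)");

    // Returns font name -> number of matching warnings, sorted by font name.
    static Map<String, Integer> countByFont(List<String> logLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            Matcher m = WARNING.matcher(line);
            if (m.find()) {
                counts.merge(m.group(3), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Feb 12, 2019 5:45:58 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode",
                "WARN: No Unicode mapping for CID+5482 (5482) in font GNPVNR+PingFangSC-Semibold",
                "Feb 12, 2019 5:45:58 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode",
                "WARN: No Unicode mapping for CID+1842 (1842) in font GNPVNR+PingFangSC-Semibold");
        System.out.println(countByFont(lines));
    }
}
```

A high count for a font, relative to the amount of text set in it, suggests that the font carries little or no usable Unicode information.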
Indeed, when inspecting the PDF, one sees that numerous subsets of PingFangSC styles are embedded, but each time
- with a ToUnicode map without any entries at all,
- with an Identity-H encoding, and
- with an Adobe-Identity-0 ROS,
i.e. without any information about which glyph represents which Unicode code point. It is therefore no surprise that the text extraction results are so incomplete.
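Why these three properties together leave nothing to extract can be illustrated with a toy model. Under Identity-H, each 2-byte code in the content stream is used directly as the CID, and Adobe-Identity-0 gives those CIDs no predefined meaning, so the ToUnicode map is the only place a CID-to-Unicode mapping could come from. The sketch below is plain Java and does not use PDFBox; `ToUnicodeSketch`, `toCids`, and `decode` are hypothetical names, and the sample mapping for CID 5482 is invented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only, not PDFBox code: shows why an empty ToUnicode map yields no text.
class ToUnicodeSketch {
    // Identity-H: every 2 bytes (high-order byte first) form a code that equals the CID.
    static List<Integer> toCids(byte[] shownBytes) {
        List<Integer> cids = new ArrayList<>();
        for (int i = 0; i + 1 < shownBytes.length; i += 2) {
            cids.add(((shownBytes[i] & 0xFF) << 8) | (shownBytes[i + 1] & 0xFF));
        }
        return cids;
    }

    // Looks each CID up in the ToUnicode map; CIDs without an entry contribute nothing.
    static String decode(List<Integer> cids, Map<Integer, String> toUnicode) {
        StringBuilder text = new StringBuilder();
        for (int cid : cids) {
            String unicode = toUnicode.get(cid);
            if (unicode != null) {
                text.append(unicode);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // 0x156A = 5482 and 0x0732 = 1842, two of the CIDs from the warnings above.
        byte[] shown = {0x15, 0x6A, 0x07, 0x32};
        List<Integer> cids = toCids(shown);

        Map<Integer, String> withEntry = new HashMap<>();
        withEntry.put(5482, "X"); // invented mapping, for illustration only

        System.out.println(cids);
        System.out.println("with one entry: \"" + decode(cids, withEntry) + "\"");
        System.out.println("empty map:      \"" + decode(cids, Collections.<Integer, String>emptyMap()) + "\"");
    }
}
```

With an empty map, every lookup misses and the decoded text is empty, which corresponds to the situation the warnings describe.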
So if you really need to extract the text, ask the source of the PDF to provide a copy which includes the required information. If that is not possible, try OCR.
By the way, a good first check is usually to try to copy and paste the text from Adobe Reader. In the case at hand, that also results in mostly missing characters. That usually means that the information required for text extraction according to the PDF specification is missing.
You'll also find more background at the link @Tilman provided in a comment: https://pdfbox.apache.org/2.0/faq.html#text-extraction