Whitespace gone from PDF extraction, and strange word interpretation

Question

Using the snippet below, I've attempted to extract the text data from this PDF file.

import pyPdf

def get_text(path):
    content = ""
    # Open in binary mode; the context manager closes the file when done
    with open(path, "rb") as f:
        # Load PDF into pyPdf
        pdf = pyPdf.PdfFileReader(f)
        # Extract text page by page, one page per line
        for i in range(pdf.getNumPages()):
            content += pdf.getPage(i).extractText() + "\n"
    # Replace non-breaking spaces, then collapse runs of whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

The output I obtain, however, has no whitespace between most of the words. This makes it difficult to perform natural language processing on the text (my ultimate goal here).

Also, the 'fi' in the word 'finger' is consistently interpreted as something else. This is rather problematic since this paper is about spontaneous finger movements...

Does anybody know why this might be happening? I don't even know where to start!

Recommended answer

Your PDF file doesn't have printable space characters; it simply positions the words where they need to go. You'll have to do extra work to figure out the spaces, perhaps by assuming multi-character runs are words and putting spaces between them.

If you can select text in the PDF reader, and have spaces appear properly, then at least you know there is enough information to reconstruct the text.
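
To illustrate the kind of reconstruction described above, here is a minimal sketch in Python 3. It assumes you already have, for one line of the page, a list of text fragments with their horizontal start and end positions. The `(x_start, x_end, text)` tuple format, the `join_fragments` name, and the `gap_ratio` threshold are all hypothetical; pyPdf's `extractText()` does not return positions, so you would need another way to obtain them.

```python
def join_fragments(fragments, gap_ratio=0.25):
    """Join positioned text fragments from one line, inserting a space
    wherever the horizontal gap between neighbours is large enough.

    fragments: iterable of (x_start, x_end, text) tuples (hypothetical format).
    gap_ratio: gap, as a fraction of the average character width of the
               following fragment, above which a space is inserted (a guess
               that will need tuning per document).
    """
    pieces = []
    prev_end = None
    # Sort by x_start so fragments are processed left to right
    for x_start, x_end, text in sorted(fragments):
        if not text:
            continue
        if prev_end is not None:
            # Estimate character width from the fragment itself
            char_width = (x_end - x_start) / len(text)
            if x_start - prev_end > gap_ratio * char_width:
                pieces.append(" ")
        pieces.append(text)
        prev_end = x_end
    return "".join(pieces)


if __name__ == "__main__":
    # Made-up coordinates: "move" and "ments" touch, the others have gaps
    line = [(0, 55, "spontaneous"), (58, 88, "finger"),
            (91, 111, "move"), (111, 136, "ments")]
    print(join_fragments(line))
```

The threshold is relative to character width rather than an absolute distance so that the same setting has a chance of working across font sizes, but this is only a heuristic; justified text and kerning can both defeat it.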

"fi" is a typographic ligature, stored as a single character (typically U+FB01, "ﬁ") rather than as the two letters "f" and "i". You may find this is also happening with "fl", "ffi", and "ffl". You can use string replacement to substitute the plain letters "fi" for the ligature character.
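
As a rough sketch of that string replacement in Python 3: the snippet below assumes the extractor emits the standard Unicode ligature code points (U+FB00 to U+FB04). That is an assumption; depending on the font's encoding, your extractor may produce some other character, in which case the mapping would need adjusting. The `expand_ligatures` name is made up for this example.

```python
# Standard Unicode Latin ligature code points and their plain-letter forms
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

# Build the translation table once; str.translate does the replacement
_LIGATURE_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text):
    """Replace ligature characters in text with their component letters."""
    return text.translate(_LIGATURE_TABLE)


if __name__ == "__main__":
    print(expand_ligatures("spontaneous \ufb01nger movements"))
```

`unicodedata.normalize("NFKC", text)` from the standard library also expands these ligatures, but it rewrites many other compatibility characters too, so an explicit table keeps the change narrower.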
