Whitespace gone from PDF extraction, and strange word interpretation

Question

Using the snippet below, I've attempted to extract the text data from this PDF file.

import pyPdf

def get_text(path):
    # Keep the file open while pyPdf reads pages from it
    with open(path, "rb") as f:
        pdf = pyPdf.PdfFileReader(f)
        # Extract the text of every page
        pages = [pdf.getPage(i).extractText() for i in range(pdf.getNumPages())]
    # Replace non-breaking spaces, then collapse all runs of whitespace
    content = "\n".join(pages).replace(u"\xa0", " ")
    return " ".join(content.split())

The output I obtain, however, is devoid of whitespace between most of the words. This makes it difficult to perform natural language processing on the text (my ultimate goal here).

Also, the 'fi' in the word 'finger' is consistently interpreted as something else. This is rather problematic since this paper is about spontaneous finger movements...

Does anybody know why this might be happening? I don't even know where to start!

Recommended answer

Your PDF file doesn't contain printable space characters; it simply positions each word where it needs to go. You'll have to do extra work to recover the spaces, perhaps by treating each multi-character run as a word and putting spaces between the runs.
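
As a rough illustration of that idea, here is a minimal sketch that rebuilds one line of text from positioned fragments. It assumes you can already get each run of characters together with its horizontal start and end coordinates; the (text, x_start, x_end) tuple format, the join_fragments name, and the gap_ratio threshold are all hypothetical and not part of pyPdf.

```python
def join_fragments(fragments, gap_ratio=0.25):
    """Join positioned text fragments from one line, adding spaces at wide gaps.

    fragments: list of (text, x_start, x_end) tuples (hypothetical format).
    gap_ratio: a horizontal gap wider than this fraction of the average
               character width is treated as a word boundary (a heuristic).
    """
    if not fragments:
        return ""
    # Order the fragments left to right by their starting position
    fragments = sorted(fragments, key=lambda f: f[1])
    # Estimate the average character width from the fragments themselves
    total_width = sum(x1 - x0 for _, x0, x1 in fragments)
    total_chars = sum(len(text) for text, _, _ in fragments)
    avg_char = total_width / max(total_chars, 1)

    pieces = [fragments[0][0]]
    prev_end = fragments[0][2]
    for text, x0, x1 in fragments[1:]:
        # A gap noticeably wider than normal letter spacing means a new word
        if x0 - prev_end > gap_ratio * avg_char:
            pieces.append(" ")
        pieces.append(text)
        prev_end = x1
    return "".join(pieces)


# Made-up coordinates: "move" and "ments" touch, the other runs are separated
line = [
    ("spontaneous", 10.0, 65.0),
    ("finger", 68.0, 98.0),
    ("move", 101.0, 121.0),
    ("ments", 121.0, 146.0),
]
print(join_fragments(line))  # spontaneous finger movements
```

The threshold is relative to the average character width so that it scales with font size; a real document would likely need it tuned, and fragments would first have to be grouped into lines by their vertical position.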

If you can select text in the PDF reader, and have spaces appear properly, then at least you know there is enough information to reconstruct the text.

"fi" is a typographic ligature, shown as a single character. You may find this is also happening with "fl", "ffi", and "ffl". You can use string replacement to substitute "fi" for the fi ligature.
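
A minimal sketch of that string replacement follows. It assumes the extracted text contains the Unicode ligature code points (U+FB00 to U+FB04); if your extractor emits some other character for the ligature, inspect the output with repr() and add that character to the table instead. The LIGATURES and expand_ligatures names are only illustrative.

```python
# Unicode Latin ligatures and the plain letters they stand for
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def expand_ligatures(text):
    """Replace each ligature character with its plain-letter spelling."""
    for ligature, letters in LIGATURES.items():
        text = text.replace(ligature, letters)
    return text


print(expand_ligatures("spontaneous \ufb01nger movements"))
# spontaneous finger movements
```

Running this before tokenization means downstream NLP tools see "finger" rather than a word containing a character they do not recognize.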
