Extracting Text with PDFMiner and PyPDF2 Merges Columns

Question

I am trying to parse the text of a PDF file using PDFMiner, but the extracted text gets merged. I am using the PDF file from the following link.

PDF file

I am fine with any type of output (file or string). Here is the code that returns the extracted text as a string for me, but for some reason the columns are merged.

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from StringIO import StringIO  # Python 2; the function below calls StringIO() directly

def convert_pdf(filename):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec)

    fp = file(filename, 'rb')
    process_pdf(rsrcmgr, device, fp)
    fp.close()
    device.close()

    text = retstr.getvalue()  # avoid shadowing the built-in name str
    retstr.close()
    return text

I have also tried PyPDF2, but faced the same issue. Here is the sample code for PyPDF2:

from PyPDF2.pdf import PdfFileReader
import StringIO
import time

def getDataUsingPyPdf2(filename):
    pdf = PdfFileReader(open(filename, "rb"))
    content = ""

    for i in range(0, pdf.getNumPages()):
        print str(i)
        extractedText = pdf.getPage(i).extractText()
        content +=  extractedText + "\n"

    content = " ".join(content.replace("\xa0", " ").strip().split())
    return content.encode("ascii", "ignore")

I have also tried pdf2txt.py, but was unable to get formatted output.

Recommended answer

I recently struggled with a similar problem, although my PDF had a slightly simpler structure.

PDFMiner uses classes called "devices" to parse the pages in a PDF file. The basic device class is PDFPageAggregator, which simply parses the text boxes in the file. The converter classes, e.g. TextConverter, XMLConverter, and HTMLConverter, also write the result to a file (or to a string stream, as in your example) and do some more elaborate parsing of the contents.

The problem with TextConverter (and PDFPageAggregator) is that they don't recurse deep enough into the structure of the document to properly extract the different columns. The two other converters require some information about the structure of the document for display purposes, so they gather more detailed data. In your example PDF, both of the simplistic devices only parse (roughly) the entire text box containing the columns, which makes it impossible (or at least very difficult) to correctly separate the different rows. The solution that I found works pretty well is to either

  • create a new class that inherits from PDFPageAggregator, or
  • use XMLConverter and process its output with, for example, Beautifulsoup

In both cases you would have to combine the different text segments into rows using their bounding box y-coordinates.
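For the XML route, a rough sketch of the idea follows. It is not taken from the answer: it swaps in the standard library's xml.etree.ElementTree for Beautifulsoup so that it runs without extra packages, and SAMPLE_XML is a hand-written stand-in that only imitates the page/textbox/textline/text nesting I would expect from XMLConverter, so check the element and attribute names against real output. The helper name textlines_from_xml is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample (an assumption, not real converter output): two text
# lines at the same height in separate text boxes, i.e. two "columns".
SAMPLE_XML = """<pages>
<page id="1" bbox="0.000,0.000,612.000,792.000">
<textbox id="0" bbox="72.000,700.000,120.000,710.000">
<textline bbox="72.000,700.000,120.000,710.000">
<text font="Helvetica" bbox="72.000,700.000,78.000,710.000" size="10.000">A</text>
<text font="Helvetica" bbox="78.000,700.000,84.000,710.000" size="10.000">b</text>
<text>
</text>
</textline>
</textbox>
<textbox id="1" bbox="300.000,700.000,320.000,710.000">
<textline bbox="300.000,700.000,320.000,710.000">
<text font="Helvetica" bbox="300.000,700.000,306.000,710.000" size="10.000">4</text>
<text font="Helvetica" bbox="306.000,700.000,312.000,710.000" size="10.000">2</text>
</textline>
</textbox>
</page>
</pages>"""


def textlines_from_xml(xml_text):
    """Return (page, x0, y0, x1, y1, text) tuples, one per <textline>."""
    root = ET.fromstring(xml_text)
    records = []
    for page in root.iter("page"):
        page_id = int(page.get("id"))
        for line in page.iter("textline"):
            # The bbox attribute is assumed to be "x0,y0,x1,y1".
            x0, y0, x1, y1 = (float(v) for v in line.get("bbox").split(","))
            # Each <text> child holds one character; join and tidy whitespace.
            text = "".join(t.text or "" for t in line.iter("text"))
            text = " ".join(text.split())
            if text:
                records.append((page_id, x0, y0, x1, y1, text))
    return records


records = textlines_from_xml(SAMPLE_XML)
```

Both sample lines share the same y-coordinate, so a later grouping step could place them in one row even though they come from different text boxes.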

In the case of a new device class ('tis more eloquent, I think), you would have to override the method receive_layout, which gets called for each page during the rendering process. This method then recursively parses the elements in each page. For example, something like this might get you started:

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LTPage, LTChar, LTAnno, LTTextBox, LTTextLine

class PDFPageDetailedAggregator(PDFPageAggregator):
    def __init__(self, rsrcmgr, pageno=1, laparams=None):
        PDFPageAggregator.__init__(self, rsrcmgr, pageno=pageno, laparams=laparams)
        self.rows = []        # one tuple per text line found so far
        self.page_number = 0  # zero-based index of the page being rendered

    def receive_layout(self, ltpage):
        def render(item, page_number):
            if isinstance(item, (LTPage, LTTextBox)):
                # Containers: descend until individual text lines are reached.
                for child in item:
                    render(child, page_number)
            elif isinstance(item, LTTextLine):
                # Join the characters of the line and normalize whitespace.
                child_str = ''
                for child in item:
                    if isinstance(child, (LTChar, LTAnno)):
                        child_str += child.get_text()
                child_str = ' '.join(child_str.split())
                if child_str:
                    row = (page_number, item.bbox[0], item.bbox[1], item.bbox[2], item.bbox[3], child_str) # bbox == (x1, y1, x2, y2)
                    self.rows.append(row)

        render(ltpage, self.page_number)
        self.page_number += 1
        # Order by page, then top to bottom (PDF y-coordinates grow upward).
        self.rows = sorted(self.rows, key=lambda x: (x[0], -x[2]))
        self.result = ltpage

In the code above, each LTTextLine element that is found is stored in an ordered list of tuples containing the page number, the coordinates of the bounding box, and the text contained in that particular element. You would then do something similar to this:

from pprint import pprint
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams

with open('pdf_doc.pdf', 'rb') as fp:
    parser = PDFParser(fp)
    # PDFMiner releases that ship pdfminer.pdfpage take the password in the
    # constructor; use '' if the file has no password.
    doc = PDFDocument(parser, password='')

    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    device = PDFPageDetailedAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    for page in PDFPage.create_pages(doc):
        # Rendering a page calls device.receive_layout() for it.
        interpreter.process_page(page)
        # receive the LTPage object for this page
        device.get_result()

pprint(device.rows)

The variable device.rows contains the ordered list with all the text lines arranged by page number and y-coordinate. You can loop over the text lines and group lines with the same y-coordinate to form the rows, store the column data, etc.
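As a minimal sketch of that grouping step (my own illustration, not part of the original answer), the function below takes tuples shaped like the entries of device.rows and merges lines whose lower y-coordinates nearly coincide. The name group_into_rows, the y_tolerance parameter and its default of 2.0 points, and the sample data are all assumptions chosen for the example.

```python
def group_into_rows(lines, y_tolerance=2.0):
    """Group (page, x0, y0, x1, y1, text) tuples into rows of cell texts.

    Lines on the same page whose y0 lies within y_tolerance of the first
    line of the current row are treated as cells of that row.
    """
    # Top to bottom within each page (PDF y grows upward), then left to right.
    ordered = sorted(lines, key=lambda r: (r[0], -r[2], r[1]))
    rows = []
    for page, x0, y0, x1, y1, text in ordered:
        last = rows[-1] if rows else None
        if last and last["page"] == page and abs(last["y0"] - y0) <= y_tolerance:
            last["cells"].append((x0, text))
        else:
            rows.append({"page": page, "y0": y0, "cells": [(x0, text)]})
    # Within each row, order the cells by their left edge.
    return [[text for _, text in sorted(row["cells"])] for row in rows]


# Made-up sample: two columns whose baselines differ by a fraction of a point.
sample = [
    (0, 300.0, 700.2, 350.0, 710.2, "42"),
    (0, 72.0, 700.0, 120.0, 710.0, "Apples"),
    (0, 72.0, 685.0, 120.0, 695.0, "Pears"),
    (0, 300.0, 685.4, 350.0, 695.4, "17"),
    (1, 72.0, 700.0, 120.0, 710.0, "Plums"),
]
table = group_into_rows(sample)
```

A tolerance is used instead of exact equality because lines that look aligned often differ slightly in their coordinates; the right value depends on the document's font size and line spacing.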

I tried to parse your PDF using the above code, and the columns are mostly parsed correctly. However, some of the columns are so close together that the default PDFMiner heuristics fail to separate them into their own elements. You can probably get around this by tweaking the word margin parameter (the -W flag in the command line tool pdf2txt.py). In any case, you might want to read through the (poorly documented) PDFMiner API as well as browse through the source code of PDFMiner, which you can obtain from GitHub. (Alas, I cannot paste the link because I do not have sufficient rep points :'<, but you can hopefully google the correct repo.)
