Any way to multithread PDF mining?


Problem description

I have a script that searches for a particular string sequence throughout a batch of PDFs. The problem is that this process is extremely slow. (Sometimes I get PDFs with over 50,000 pages.)

Is there a way to do multithreading? Unfortunately, even though I searched, I couldn't make heads or tails of the threading code I found.

import os

import slate3k as slate

f = 'C:/Users/akhan37/Desktop/learning profiles/unzipped/unzipped_files'
idee = "123456789"

os.chdir(f)
for file in os.listdir('.'):
    print(file)
    with open(file, 'rb') as g:
        # slate.PDF returns a list with one text string per page
        extracted_text = slate.PDF(g)
    # "idee in extracted_text" would compare idee against whole-page
    # strings, so search inside each page instead
    if any(idee in page for page in extracted_text):
        print(file)

The run time is very long. I don't think it's the code's fault but rather the fact that I have to go through over 700 PDFs.

Recommended answer

I would suggest using pdfminer: you can convert the document into a list of page objects, which you can then process in parallel on different cores with multiprocessing.

    # pdfminer.six imports; the location of PDFTextExtractionNotAllowed
    # varies between pdfminer releases (pdfminer.pdfdocument in recent ones)
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument, PDFTextExtractionNotAllowed
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams

    password = ""  # supply the document password if there is one

    fp = open(pdf_path, "rb")
    parser = PDFParser(fp)
    document = PDFDocument(parser, password)
    if not document.is_extractable:
        raise PDFTextExtractionNotAllowed

    laparams = LAParams()
    resource_manager = PDFResourceManager()
    device = PDFPageAggregator(resource_manager, laparams=laparams)
    interpreter = PDFPageInterpreter(resource_manager, device)

    all_attributes = []

    list_of_page_obj = list(PDFPage.create_pages(document))
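One caveat: the page objects returned by `PDFPage.create_pages` are not picklable, so you cannot hand them directly to worker processes. A common approach is to extract the per-page text first (or have each worker re-open the file for its own page range) and then fan the substring search out with `multiprocessing.Pool`. Below is a minimal sketch of that fan-out pattern; plain strings stand in for the extracted page text, and `page_matches`/`search_pages` are illustrative names, not pdfminer API:

```python
from multiprocessing import Pool

IDEE = "123456789"  # the string sequence being searched for

def page_matches(page_text):
    # Worker: report whether the target sequence occurs on this page.
    # In the real pipeline this would receive text produced by pdfminer
    # (e.g. via PDFPageAggregator), not a hand-written string.
    return IDEE in page_text

def search_pages(page_texts, workers=4):
    # Fan the per-page checks out across processes and collect the
    # indices of the matching pages.
    with Pool(workers) as pool:
        hits = pool.map(page_matches, page_texts)
    return [i for i, hit in enumerate(hits) if hit]

if __name__ == "__main__":
    # Stand-in page texts; real code would build this list from the
    # pages of the document opened above.
    pages = ["lorem ipsum", "contains 123456789 here", "nothing"]
    print(search_pages(pages, workers=2))  # [1]
```

Since extraction (CPU-bound parsing) dominates the runtime, processes rather than threads are the right tool here; Python threads would serialize on the GIL. With 700 separate PDFs, parallelizing over whole files instead of pages is an equally valid, simpler split.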

