Convert PDF to text without creating a file

Question

I want to download PDF files from a website and work with their text, but I don't want to create a PDF file on disk and then convert it to text. I use the Python requests library. Is there any way to get the text directly after the following code?

res = requests.get(url, timeout=None)

Recommended answer

AFAIK, you will have to create at least a temporary file so that the conversion has something to read from.

You can use the following code, which reads a PDF file and returns its text as a string. It uses pdfminer and Python 3.7.

import io

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def convert(case, fname, pages=None):
    # 'case' is unused; it is kept so callers such as convert('text', path) still work.
    # An empty set of page numbers means "process every page".
    pagenums = set(pages) if pages else set()
    manager = PDFResourceManager()
    output = io.StringIO()  # collects the extracted text in memory
    converter = TextConverter(manager, output, codec='utf-8', laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)
    # The with block closes the PDF even if a page fails to parse.
    with open(fname, 'rb') as infile:
        for page in PDFPage.get_pages(infile, pagenums, caching=True, check_extractable=True):
            interpreter.process_page(page)

    converted_pdf = output.getvalue()
    converter.close()
    output.close()
    return converted_pdf

Main script that calls the function above (it imports the code above as a module named converter):

import os

import converter  # the module containing convert() above


class ConvertMultiple:
    @staticmethod
    def convert_multiple(pdf_dir, txt_dir):
        if pdf_dir == "":
            pdf_dir = os.getcwd()  # fall back to the current directory
        for pdf in os.listdir(pdf_dir):  # iterate through the entries in the pdf directory
            print("File name is %s" % pdf)
            file_extension = pdf.split(".")[-1]
            print("File extension is %s" % file_extension)
            if file_extension == "pdf":
                # build the paths from the arguments instead of a hard-coded folder
                path = os.path.join(pdf_dir, pdf)
                print(path)
                text = converter.convert('text', path)  # text content of the pdf as a string
                text_file_name = os.path.join(txt_dir, pdf + ".txt")
                # the with block closes the text file after writing
                with open(text_file_name, "w", encoding="utf-8") as text_file:
                    text_file.write(text)


pdf_dir = "E:/pdf"
txt_dir = "E:/text"
ConvertMultiple.convert_multiple(pdf_dir, txt_dir)

Of course, you can tune it further and there is probably room for improvement, but this works.

In your case, just pass a single temporary PDF file directly instead of a folder of PDFs.
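
The temporary-file step described above could be sketched roughly as follows, using only the standard library. The helper name with_temp_pdf and its process parameter are hypothetical and not part of the original answer; the sketch assumes the downloaded PDF is already available as bytes (for a requests response, that would be res.content).

```python
import os
import tempfile


def with_temp_pdf(pdf_bytes, process):
    """Write pdf_bytes to a temporary .pdf file, call process(path), then delete the file.

    Hypothetical helper: 'process' is any callable that takes a file path,
    for example a wrapper around the convert() function from the answer.
    """
    # mkstemp returns an open file descriptor and the path of the new file
    fd, path = tempfile.mkstemp(suffix=".pdf")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(pdf_bytes)
        # the file is closed here, so process() can reopen it on any platform
        return process(path)
    finally:
        # remove the temporary file whether or not process() succeeded
        os.remove(path)
```

With the convert function from the answer and res from the question, a call such as text = with_temp_pdf(res.content, lambda path: convert('text', path)) should return the extracted text while leaving no PDF behind on disk.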

Hope this helps. Happy coding!
