Convert PDF to text without creating a file
Question
I want to download PDF files from a website and work with their text, but I don't want to save a PDF file to disk and then convert it to text. I use the Python requests library. Is there any way to get the text directly after the following code?
res = requests.get(url, timeout=None)
Recommended answer
AFAIK, you will have to create at least a temporary file so that you can run the conversion on it.
You can use the following code, which reads a PDF file and converts it to text. It uses PDFMiner and Python 3.7.
# converter.py
import io

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def convert(case, fname, pages=None):
    # `case` is not used; it is kept so existing callers keep working.
    pagenums = set(pages) if pages else set()  # an empty set means all pages
    manager = PDFResourceManager()
    output = io.StringIO()
    converter = TextConverter(manager, output, codec='utf-8', laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)
    # The with block closes the PDF even if a page fails to parse.
    with open(fname, 'rb') as infile:
        for page in PDFPage.get_pages(infile, pagenums, caching=True, check_extractable=True):
            interpreter.process_page(page)
    converted_pdf = output.getvalue()
    print(converted_pdf)
    converter.close()
    output.close()
    return converted_pdf
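The convert function only needs a binary file object to hand to PDFPage.get_pages, so an in-memory buffer may be able to stand in for a file on disk. The sketch below is one way to try that, not part of the original answer: it wraps the downloaded bytes in io.BytesIO and passes the buffer to an extractor. The name pdf_bytes_to_text and its extract parameter are hypothetical; when extract is omitted, the sketch assumes the pdfminer.six package is installed and uses its high-level extract_text function.

```python
import io


def pdf_bytes_to_text(pdf_bytes, extract=None):
    """Extract text from PDF bytes held in memory, without writing a file.

    `extract` is a hypothetical hook: any callable that takes a binary
    file-like object and returns a string. When it is omitted, the
    high-level extract_text from pdfminer.six is used (assumption:
    pdfminer.six is installed).
    """
    if extract is None:
        # Imported lazily so the helper has no hard dependency on pdfminer.
        from pdfminer.high_level import extract_text as extract
    buffer = io.BytesIO(pdf_bytes)  # behaves like a file opened with 'rb'
    return extract(buffer)
```

With the request from the question, this would be called as text = pdf_bytes_to_text(res.content), since res.content holds the response body as bytes.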
A main script that calls the function above:
import os

import converter  # the module above, saved as converter.py


class ConvertMultiple:
    @staticmethod
    def convert_multiple(pdf_dir, txt_dir):
        if pdf_dir == "":
            pdf_dir = os.getcwd()  # default to the current directory
        for pdf in os.listdir(pdf_dir):  # iterate through the files in pdf_dir
            print("File name is %s" % pdf)
            file_extension = pdf.split(".")[-1]
            print("File extension is %s" % file_extension)
            if file_extension == "pdf":
                # Build the path from pdf_dir instead of a hard-coded folder.
                path = os.path.join(pdf_dir, pdf)
                print(path)
                text = converter.convert('text', path)  # text content of the PDF
                text_file_name = os.path.join(txt_dir, pdf + ".txt")
                with open(text_file_name, "w", encoding="utf-8") as text_file:
                    text_file.write(text)  # write the text to the output file


pdf_dir = "E:/pdf"
txt_dir = "E:/text"
ConvertMultiple.convert_multiple(pdf_dir, txt_dir)
Of course, you can tune this further and there is probably room for improvement, but it certainly works.
Just make sure to pass a single temporary PDF file directly instead of a folder of PDFs.
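As a rough illustration of the temporary-file route, the sketch below writes the downloaded bytes to a temporary .pdf file, hands its path to a callback, and removes the file afterwards. The helper name with_temp_pdf and its process parameter are hypothetical and only the standard library is used; in practice process could be something like lambda path: converter.convert('text', path).

```python
import os
import tempfile


def with_temp_pdf(pdf_bytes, process):
    """Write pdf_bytes to a temporary .pdf file, call process(path), clean up.

    `process` is a hypothetical callback that receives the file path and
    returns whatever the caller needs (for example, the extracted text).
    """
    # delete=False so the file can be reopened by name on Windows as well.
    tmp = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
    try:
        tmp.write(pdf_bytes)
        tmp.close()  # flush to disk before another reader opens the file
        return process(tmp.name)
    finally:
        tmp.close()  # harmless if the file is already closed
        os.remove(tmp.name)
```

With the request from the question, the bytes to pass in would be res.content. The file is deleted even if the callback raises, so nothing is left behind on disk.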
Hope this helps. Happy coding!