How can I read a PDF file from inline raw_bytes (not from a file)?
Problem description
I am trying to build a PDF puller for the Australian Stock Exchange website that will let me go through all the 'Announcements' made by companies and search for keywords in the PDFs of those announcements.
So far I am using requests and PyPDF2 to download the PDF file, write it to my drive, and then read it back. However, I would like to skip writing the PDF to disk and reading it again, and go straight from downloading the PDF to converting it to a string. What I have so far is:
    import requests, PyPDF2

    url = 'http://www.asx.com.au/asxpdf/20171108/pdf/43p1l61zf2yct8.pdf'
    response = requests.get(url)
    my_raw_data = response.content

    with open("my_pdf.pdf", 'wb') as my_data:
        my_data.write(my_raw_data)

    open_pdf_file = open("my_pdf.pdf", 'rb')
    read_pdf = PyPDF2.PdfFileReader(open_pdf_file)
    num_pages = read_pdf.getNumPages()

    ann_text = []
    for page_num in range(num_pages):
        if read_pdf.isEncrypted:
            read_pdf.decrypt("")
            print(read_pdf.getPage(page_num).extractText())
            page_text = read_pdf.getPage(page_num).extractText().split()
            ann_text.append(page_text)
        else:
            print(read_pdf.getPage(page_num).extractText())
    print(ann_text)
This prints the text of the PDF at the given URL as a list of strings.
Is there a way to convert the my_raw_data variable to a readable string directly?
Thanks so much in advance!
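You can use io.BytesIO to wrap the downloaded bytes in an in-memory file object and pass that to PyPDF2 instead of a file on disk: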
    import requests, PyPDF2, io

    url = 'http://www.asx.com.au/asxpdf/20171108/pdf/43p1l61zf2yct8.pdf'
    response = requests.get(url)

    with io.BytesIO(response.content) as open_pdf_file:
        read_pdf = PyPDF2.PdfFileReader(open_pdf_file)
        num_pages = read_pdf.getNumPages()
        print(num_pages)
Output: 2
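As a rough illustration of why this works: io.BytesIO gives you an object with the same read, seek and tell methods as a file opened in 'rb' mode, which is what a PDF parser needs. The sketch below uses only the standard library; fake_pdf_bytes is a made-up placeholder standing in for response.content and is not a valid PDF.

```python
import io

# Placeholder bytes for illustration only; in the answer above this would be response.content.
fake_pdf_bytes = b"%PDF-1.4\n% placeholder body, not a valid PDF\n%%EOF\n"

stream = io.BytesIO(fake_pdf_bytes)

# A BytesIO object supports the same binary-file operations as open(path, "rb").
header = stream.read(5)              # read the first five bytes
position_after_read = stream.tell()  # the cursor has advanced past them

# PDF parsers typically jump to the end of the data to locate the trailer,
# so the stream has to be seekable.
stream.seek(0, io.SEEK_END)
size = stream.tell()

stream.seek(0)                       # rewind before handing the stream to a parser
is_pdf = header == b"%PDF-"          # real PDF files start with this marker

print(is_pdf, position_after_read, size == len(fake_pdf_bytes), stream.seekable())
```

Checking the header like this before parsing is optional, but it can catch the case where the server returned an HTML error page instead of a PDF.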
P.S. To open files, always use a context manager (a with statement).
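To get from the in-memory bytes all the way to searchable text, something along these lines might work. It is only a sketch under a few assumptions: it uses the newer PyPDF2 names (PdfReader, pages, extract_text, is_encrypted) rather than the PdfFileReader calls shown above, which later PyPDF2 releases renamed, and the helper names extract_pages and pages_containing are invented for this example.

```python
import io


def extract_pages(pdf_bytes):
    """Return one text string per page of a PDF held in memory."""
    # Imported lazily so pages_containing() can be used without PyPDF2 installed.
    # Assumption: a PyPDF2 release that provides PdfReader.
    from PyPDF2 import PdfReader

    reader = PdfReader(io.BytesIO(pdf_bytes))
    if reader.is_encrypted:
        # Same empty-password attempt as in the question, done once up front.
        reader.decrypt("")
    # extract_text() may yield nothing for pages without a text layer.
    return [page.extract_text() or "" for page in reader.pages]


def pages_containing(pages, keyword):
    """Return the zero-based indexes of pages whose text mentions keyword."""
    needle = keyword.lower()
    return [i for i, text in enumerate(pages) if needle in text.lower()]


# Example usage (needs network access and PyPDF2; "dividend" is an arbitrary keyword):
#   import requests
#   response = requests.get(url)
#   pages = extract_pages(response.content)
#   print(pages_containing(pages, "dividend"))
```

Keeping the keyword search separate from the PDF parsing means the search can be tried out on plain strings first, and decrypting once before the loop avoids repeating it for every page.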