How can I read a PDF file from inline raw_bytes (not from a file)?

Problem description

I am trying to build a PDF puller for the Australian Stock Exchange website that will let me go through all the 'Announcements' made by companies and search for keywords in the PDFs of those announcements.

So far I am using requests and PyPDF2 to get the PDF file, write it to my drive, and then read it. However, I would like to skip writing the PDF file to my drive and reading it back, and go straight from fetching the PDF to converting it to a string. What I have so far is:

import requests, PyPDF2

url = 'http://www.asx.com.au/asxpdf/20171108/pdf/43p1l61zf2yct8.pdf'
response = requests.get(url)
my_raw_data = response.content

# Write the downloaded bytes to disk (the step I would like to skip)
with open("my_pdf.pdf", 'wb') as my_data:
    my_data.write(my_raw_data)

# Re-open the saved file and parse it
open_pdf_file = open("my_pdf.pdf", 'rb')
read_pdf = PyPDF2.PdfFileReader(open_pdf_file)

# Decrypt once, before any page is read
if read_pdf.isEncrypted:
    read_pdf.decrypt("")

num_pages = read_pdf.getNumPages()

ann_text = []
for page_num in range(num_pages):
    # Extract each page once, print it, and keep its words
    page_text = read_pdf.getPage(page_num).extractText()
    print(page_text)
    ann_text.append(page_text.split())
print(ann_text)

This prints a list of strings in the PDF file from the URL provided.

I am just wondering whether I can convert the my_raw_data variable into a readable string?

Many thanks!

Recommended answer

Yes, you can.
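One way to skip the round trip through the drive is sketched below. It is an illustration under stated assumptions, not necessarily the exact code the answer had in mind: the raw bytes are wrapped in the standard library's io.BytesIO, which gives PyPDF2 a seekable file-like object to read from. The helper names bytes_to_stream, pdf_pages_from_bytes and fetch_announcement_words are hypothetical, and the PyPDF2 calls are the legacy (pre-3.0) names already used in the question; newer releases renamed them.

```python
import io


def bytes_to_stream(raw_bytes):
    """Wrap raw bytes in an in-memory, seekable file-like object."""
    return io.BytesIO(raw_bytes)


def pdf_pages_from_bytes(raw_bytes, password=""):
    """Return one list of words per page, parsed straight from memory.

    Assumes the legacy PyPDF2 (< 3.0) API that the question already uses.
    """
    import PyPDF2  # third-party; imported lazily so bytes_to_stream needs only the stdlib

    reader = PyPDF2.PdfFileReader(bytes_to_stream(raw_bytes))
    if reader.isEncrypted:
        reader.decrypt(password)
    return [
        reader.getPage(i).extractText().split()
        for i in range(reader.getNumPages())
    ]


def fetch_announcement_words(url):
    """Download a PDF and parse it without writing it to disk."""
    import requests  # third-party

    response = requests.get(url)
    response.raise_for_status()
    return pdf_pages_from_bytes(response.content)
```

Calling fetch_announcement_words(url) with the URL from the question should give the same per-page word lists as the original script, with no my_pdf.pdf left behind.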

PS. To open files, always use a context manager (with-statement).
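As a small illustration of that advice, the sketch below reads a file inside a with-block so the handle is closed even if the read raises. The read_bytes helper and the placeholder file contents are hypothetical and only there to make the example self-contained.

```python
import os
import tempfile


def read_bytes(path):
    """Read a whole file; the with-block closes the handle on exit."""
    with open(path, "rb") as handle:
        return handle.read()


# Demonstration on a throwaway file with placeholder contents
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.4 placeholder")
    tmp_path = tmp.name

data = read_bytes(tmp_path)
os.remove(tmp_path)  # clean up the temporary file
```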
