How can I read a PDF file from inline raw_bytes (not from a file)?


Problem description

I am trying to create a PDF puller for the Australian Stock Exchange website that will let me go through all the 'Announcements' made by companies and search for keywords in the PDFs of those announcements.

So far I am using requests and PyPDF2 to get the PDF file, write it to my drive and then read it back. However, I want to skip the step of writing the PDF file to my drive and reading it, and go straight from downloading the PDF to converting it to a string. What I have so far is:

import requests, PyPDF2

url = 'http://www.asx.com.au/asxpdf/20171108/pdf/43p1l61zf2yct8.pdf'
response = requests.get(url)
my_raw_data = response.content

# Write the downloaded bytes to disk, then read them back
with open("my_pdf.pdf", 'wb') as my_data:
    my_data.write(my_raw_data)

ann_text = []
with open("my_pdf.pdf", 'rb') as open_pdf_file:
    read_pdf = PyPDF2.PdfFileReader(open_pdf_file)
    # Decrypt once, before reading any pages (empty password)
    if read_pdf.isEncrypted:
        read_pdf.decrypt("")
    num_pages = read_pdf.getNumPages()

    for page_num in range(num_pages):
        # Extract each page once, print it, and keep its words
        page_text = read_pdf.getPage(page_num).extractText()
        print(page_text)
        ann_text.append(page_text.split())
print(ann_text)

This prints the text of the PDF at the given URL as a list of strings.

Just wondering if I can convert the my_raw_data variable to a readable string without going through a file?

Thanks so much in advance!

Solution

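You can use io.BytesIO to wrap the downloaded bytes in an in-memory, file-like object and pass that to PdfFileReader instead of an open file: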

import requests, PyPDF2, io

url = 'http://www.asx.com.au/asxpdf/20171108/pdf/43p1l61zf2yct8.pdf'
response = requests.get(url)

with io.BytesIO(response.content) as open_pdf_file:
    read_pdf = PyPDF2.PdfFileReader(open_pdf_file)
    num_pages = read_pdf.getNumPages()
    print(num_pages)

Output: 2
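
This works because io.BytesIO turns a bytes object into a seekable binary stream that behaves like a file opened in 'rb' mode, which is the kind of object a PDF reader expects. A minimal standard-library sketch of that behaviour follows; the byte string is made up for illustration (it stands in for response.content and is not a valid PDF):

```python
import io

# Stand-in for response.content; these bytes are invented for the example
# and do not form a valid PDF.
raw_bytes = b"%PDF-1.4\nexample payload\n%%EOF\n"

with io.BytesIO(raw_bytes) as stream:
    header = stream.read(5)        # reads like a file opened in 'rb' mode
    stream.seek(-6, io.SEEK_END)   # seeking relative to the end is supported
    trailer = stream.read()        # reads the last six bytes
    seekable = stream.seekable()

print(header, trailer, seekable)
```

Nothing is written to disk at any point; the stream lives entirely in memory and is released when the with block exits.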

P.S. When opening files, always use a context manager (a with statement).
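
To get from the raw bytes all the way to searchable text, the same in-memory stream can feed the page extraction step. The sketch below is one possible way to do it, not the answer's own code: it assumes PyPDF2 2.x or later, where the reader class is PdfReader (the question uses the older PdfFileReader names), and the helper names extract_pages and find_keywords are hypothetical.

```python
import io


def extract_pages(raw_bytes, password=""):
    """Return the text of each page of a PDF held in memory as bytes."""
    # Imported here so find_keywords stays usable without PyPDF2 installed.
    # Assumption: PyPDF2 2.x or later, which provides PdfReader.
    from PyPDF2 import PdfReader

    with io.BytesIO(raw_bytes) as stream:
        reader = PdfReader(stream)
        if reader.is_encrypted:
            reader.decrypt(password)
        # Extract inside the with block, while the stream is still open.
        return [page.extract_text() or "" for page in reader.pages]


def find_keywords(pages, keywords):
    """Map each keyword to the 1-based page numbers whose text contains it."""
    hits = {}
    for keyword in keywords:
        needle = keyword.lower()
        hits[keyword] = [
            number
            for number, text in enumerate(pages, start=1)
            if needle in text.lower()
        ]
    return hits
```

With the download from the question, this could be called as find_keywords(extract_pages(response.content), ["dividend"]), where "dividend" is just an example keyword. Text extraction quality depends on how the PDF was produced, so some announcements may yield little or no text.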
