如何使用 Python 抓取 PDF;仅特定内容 [英] How to scrape PDFs using Python; specific content only

查看:83
本文介绍了如何使用 Python 抓取 PDF;仅特定内容的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试从网站上可用的 PDF 中获取数据

I am trying to get data from PDFs available on the site

https://usda.library.cornell.edu/concern/publications/3t945q76s?locale=en

例如,如果我查看 2019 年 11 月的报告

For example, If I look at November 2019 report

https://downloads.usda.library.cornell.edu/usda-esmis/files/3t945q76s/dz011445t/mg74r196p/latest.pdf

我需要第 12 页上的玉米数据,我必须为期末库存、出口等创建单独的文件.我是 Python 新手,我不确定如何单独抓取内容.如果我能弄清楚一个月,那么我就可以创建一个循环.但是,我对如何处理一个文件感到困惑.

I need the data on Page 12 for corns, I have to create separate files for ending stocks, exports etc. I am new to Python and I am not sure how to scrape the content separately. If I can figure it out for one month then I can create a loop. But, I am confused on how to proceed for one file.

有人能帮我一下吗,TIA.

Can someone help me out here, TIA.

推荐答案

这里有一个使用 PyPDF2、requests 和 BeautifulSoup 的小例子...请检查注释评论,这是第一块...如果你需要更多是必要的更改 url 变量中的值

Here a little example using PyPDF2 ,requests and BeautifulSoup ...pls check the notes comment , this is for first block ...if you need more is necesary change the value in url variable

# You need install :
# pip install PyPDF2 - > Read and parse your content pdf
# pip install requests - > request for get the pdf
# pip install BeautifulSoup - > for parse the html and find all url hrf with ".pdf" final
from PyPDF2 import PdfFileReader
import requests
import io
from bs4 import BeautifulSoup

url=requests.get('https://usda.library.cornell.edu/concern/publications/3t945q76s?locale=en#release-items')
soup = BeautifulSoup(url.content,"lxml")

for a in soup.find_all('a', href=True):
    mystr= a['href']
    if(mystr[-4:]=='.pdf'):
        print ("url with pdf final:", a['href'])
        urlpdf = a['href']
        response = requests.get(urlpdf)
        with io.BytesIO(response.content) as f:
            pdf = PdfFileReader(f)
            information = pdf.getDocumentInfo()
            number_of_pages = pdf.getNumPages()
            txt = f"""
            Author: {information.author}
            Creator: {information.creator}
            Producer: {information.producer}
            Subject: {information.subject}
            Title: {information.title}
            Number of pages: {number_of_pages}
            """
            # Here the metadata of your pdf
            print(txt)
            # numpage for the number page
            numpage=20
            page = pdf.getPage(numpage)
            page_content = page.extractText()
            # print the content in the page 20            
            print(page_content)

这篇关于如何使用 Python 抓取 PDF;仅特定内容的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆