Converting a pdf to text/html in python so I can parse it


Question

I have the following sample code where I download a pdf from the European Parliament website on a given legislative proposal:

I ended up just getting the link and feeding it to Adobe's online conversion tool (see the code below):

import mechanize
import urllib2
import re
from BeautifulSoup import BeautifulSoup

adobe = "http://www.adobe.com/products/acrobat/access_onlinetools.html"

url = "http://www.europarl.europa.eu/oeil/search_reference_procedure.jsp"

def get_pdf(soup2):
    link = soup2.findAll("a", "com_acronym")
    new_link = []
    amendments = []
    for i in link:
        if "REPORT" in i["href"]:
            new_link.append(i["href"])
    if not new_link:  # findAll returns a list, so check for emptiness, not None
        print "No A number"
    else:
        for i in new_link:
            page = br.open(str(i)).read()
            bs = BeautifulSoup(page)
            text = bs.findAll("a")
            for i in text:
                if re.search("PDF", str(i)) is not None:
                    pdf_link = "http://www.europarl.europa.eu/" + i["href"]
            pdf = urllib2.urlopen(pdf_link)
            name_pdf = "%s_%s.pdf" % (y,p)
            localfile = open(name_pdf, "wb")  # "wb": PDFs are binary data
            localfile.write(pdf.read())
            localfile.close()

            br.open(adobe)
            br.select_form(name = "convertFrm")
            br.form["srcPdfUrl"] = str(pdf_link)
            br.form["convertTo"] = ["html"]
            br.form["visuallyImpaired"] = ["notcompatible"]
            br.form["platform"] = ["Macintosh"]
            pdf_html = br.submit()

            soup = BeautifulSoup(pdf_html)


page = range(1,2) #can be set to 400 to get every document for a given year
year = range(1999,2000) #can be set to 2011 to get documents from all years

for y in year:
    for p in page:
        br = mechanize.Browser()
        br.open(url)
        br.select_form(name = "byReferenceForm")
        br.form["year"] = str(y)
        br.form["sequence"] = str(p)
        response = br.submit()
        soup1 = BeautifulSoup(response)
        test = soup1.find(text="No search result")
        if test is not None:
            print "%s %s No page skipping..." % (y,p)
        else:
            print "%s %s  Writing dossier..." % (y,p)
            for i in br.links(url_regex="file.jsp"):
                link = i
            response2 = br.follow_link(link).read()
            soup2 = BeautifulSoup(response2)
            get_pdf(soup2)

In the get_pdf() function I would like to convert the pdf file to text in python so I can parse the text for information about the legislative procedure. Can anyone explain to me how this can be done?

Thomas

Answer

It's not exactly magic. I suggest

  • downloading the PDF file to a temporary directory,
  • calling out to an external program to extract the text into a (temporary) text file,
  • reading the text file.
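The three steps above can be sketched roughly as follows (shown in Python 3, whereas the question's code is Python 2). This assumes the `pdftotext` utility from poppler-utils is installed and on the PATH; the helper names are illustrative, not part of any library:

```python
import subprocess
from pathlib import Path

def build_pdftotext_cmd(pdf_path, txt_path):
    # -layout asks pdftotext to preserve the physical page layout,
    # which makes the output easier to parse afterwards
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]

def pdf_to_text(pdf_path):
    """Run pdftotext on an already-downloaded PDF and return the text."""
    pdf_path = Path(pdf_path)
    txt_path = pdf_path.with_suffix(".txt")       # e.g. 1999_1.pdf -> 1999_1.txt
    subprocess.run(build_pdftotext_cmd(pdf_path, txt_path), check=True)
    return txt_path.read_text(errors="replace")   # step 3: read the text file
```

The returned string can then be searched with `re` for the procedure details, just like the HTML in the question.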

For text extraction command-line utilities you have a number of possibilities and there may be others not mentioned in the link (perhaps Java-based). Try them first to see if they fit your needs. That is, try each step separately (finding the links, downloading the files, extracting the text) and then piece them together. For calling out, use subprocess.Popen or subprocess.call().
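To illustrate the difference between the two call-out options: `subprocess.call()` simply blocks until the command finishes and returns its exit code, while `subprocess.Popen` gives finer control, such as capturing output. A small sketch with a harmless stand-in command (`echo` here, where the real script would run the extraction utility):

```python
import subprocess

# subprocess.call() runs the command, waits for it to finish,
# and returns the exit code (0 means success)
rc = subprocess.call(["echo", "extraction done"])

# subprocess.Popen gives more control, e.g. capturing stdout
proc = subprocess.Popen(["echo", "extraction done"], stdout=subprocess.PIPE)
out, _ = proc.communicate()
print(out.decode().strip())  # → extraction done
```

Checking the return code (or passing `check=True` to the newer `subprocess.run`) is worthwhile here, since a failed extraction would otherwise silently produce an empty text file.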
