Converting a pdf to text/html in python so I can parse it


Question

I have the following sample code where I download a pdf from the European Parliament website on a given legislative proposal:

I ended up just getting the link and feeding it to Adobe's online conversion tool (see the code below):

import mechanize
import urllib2
import re
from BeautifulSoup import BeautifulSoup

adobe = "http://www.adobe.com/products/acrobat/access_onlinetools.html"

url = "http://www.europarl.europa.eu/oeil/search_reference_procedure.jsp"

def get_pdf(soup2):
    link = soup2.findAll("a", "com_acronym")
    new_link = []
    amendments = []
    for i in link:
        if "REPORT" in i["href"]:
            new_link.append(i["href"])
    if not new_link:  # findAll returns a list, so check for emptiness, not None
        print "No A number"
    else:
        for i in new_link:
            page = br.open(str(i)).read()
            bs = BeautifulSoup(page)
            text = bs.findAll("a")
            for i in text:
                if re.search("PDF", str(i)) is not None:
                    pdf_link = "http://www.europarl.europa.eu/" + i["href"]
            pdf = urllib2.urlopen(pdf_link)
            name_pdf = "%s_%s.pdf" % (y,p)
            localfile = open(name_pdf, "wb")  # "wb": PDFs are binary data
            localfile.write(pdf.read())
            localfile.close()

            br.open(adobe)
            br.select_form(name = "convertFrm")
            br.form["srcPdfUrl"] = str(pdf_link)
            br.form["convertTo"] = ["html"]
            br.form["visuallyImpaired"] = ["notcompatible"]
            br.form["platform"] = ["Macintosh"]
            pdf_html = br.submit()

            soup = BeautifulSoup(pdf_html)


page = range(1,2) #can be set to 400 to get every document for a given year
year = range(1999,2000) #can be set to 2011 to get documents from all years

for y in year:
    for p in page:
        br = mechanize.Browser()
        br.open(url)
        br.select_form(name = "byReferenceForm")
        br.form["year"] = str(y)
        br.form["sequence"] = str(p)
        response = br.submit()
        soup1 = BeautifulSoup(response)
        test = soup1.find(text="No search result")
        if test is not None:
            print "%s %s No page skipping..." % (y,p)
        else:
            print "%s %s  Writing dossier..." % (y,p)
            for i in br.links(url_regex="file.jsp"):
                link = i
            response2 = br.follow_link(link).read()
            soup2 = BeautifulSoup(response2)
            get_pdf(soup2)

In the get_pdf() function I would like to convert the pdf file to text in python so I can parse the text for information about the legislative procedure. Can anyone explain to me how this can be done?

Thomas

Answer

It's not exactly magic. I suggest

  • downloading the PDF file to a temporary directory,
  • calling out to an external program to extract the text into a (temporary) text file,
  • reading the text file.
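The three steps above can be sketched roughly as follows (shown in Python 3, whereas the question's code is Python 2). This assumes the `pdftotext` utility from poppler-utils is installed and on the PATH; the helper names are illustrative, not part of any library:

```python
import subprocess
from pathlib import Path

def build_pdftotext_cmd(pdf_path, txt_path):
    # -layout asks pdftotext to preserve the physical page layout,
    # which makes the output easier to parse afterwards
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]

def pdf_to_text(pdf_path):
    """Run pdftotext on an already-downloaded PDF and return the text."""
    pdf_path = Path(pdf_path)
    txt_path = pdf_path.with_suffix(".txt")       # e.g. 1999_1.pdf -> 1999_1.txt
    subprocess.run(build_pdftotext_cmd(pdf_path, txt_path), check=True)
    return txt_path.read_text(errors="replace")   # step 3: read the text file
```

The returned string can then be searched with `re` for the procedure details, just like the HTML in the question.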

For text extraction command-line utilities you have a number of possibilities and there may be others not mentioned in the link (perhaps Java-based). Try them first to see if they fit your needs. That is, try each step separately (finding the links, downloading the files, extracting the text) and then piece them together. For calling out, use subprocess.Popen or subprocess.call().
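To illustrate the difference between the two call-out options: `subprocess.call()` simply blocks until the command finishes and returns its exit code, while `subprocess.Popen` gives finer control, such as capturing output. A small sketch with a harmless stand-in command (`echo` here, where the real script would run the extraction utility):

```python
import subprocess

# subprocess.call() runs the command, waits for it to finish,
# and returns the exit code (0 means success)
rc = subprocess.call(["echo", "extraction done"])

# subprocess.Popen gives more control, e.g. capturing stdout
proc = subprocess.Popen(["echo", "extraction done"], stdout=subprocess.PIPE)
out, _ = proc.communicate()
print(out.decode().strip())  # → extraction done
```

Checking the return code (or passing `check=True` to the newer `subprocess.run`) is worthwhile here, since a failed extraction would otherwise silently produce an empty text file.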
