Working with a pdf from the web directly in Python?

Problem description

I'm trying to use Python to read .pdf files from the web directly rather than save them all to my computer. All I need is the text from the .pdf and I'm going to be reading a lot (~60k) of them, so I'd prefer to not actually have to save them all.

I know how to save a .pdf from the internet using urllib and open it with PyPDF2. (example)

I want to skip the saving-to-file step.

import urllib, PyPDF2
urllib.urlopen('https://bitcoin.org/bitcoin.pdf')
wFile = urllib.urlopen('https://bitcoin.org/bitcoin.pdf')
lFile = PyPDF2.pdf.PdfFileReader(wFile.read())

I get an error that is fairly easy to understand:

Traceback (most recent call last):
  File "<pyshell#6>", line 1, in <module>
    fil = PyPDF2.pdf.PdfFileReader(wFile.read())
  File "C:\Python27\lib\PyPDF2\pdf.py", line 797, in __init__
    self.read(stream)
  File "C:\Python27\lib\PyPDF2\pdf.py", line 1245, in read
    stream.seek(-1, 2)
AttributeError: 'str' object has no attribute 'seek'

Obviously PyPDF2 doesn't like that I'm giving it the urllib.urlopen().read() (which appears to return a string). I know that this string is not the "text" of the .pdf but a string representation of the file. How can I resolve this?

EDIT: NorthCat's solution resolved my error, but when I try to actually extract the text, I get this:

>>> print lFile.getPage(0).extractText()
ˇˆ˘˘˙˘˘˝˘˛˘ˇ˘ˇ˚ˇˇˇ˘ˆ˘˘˘˚ˇˆ˘ˆ˘ˇ˜ˇ˝˚˘˛˘ˇ ˘˘˘ˇ˛˘˚˚ˆˇˇ!
˝˘˚ˇ˘˘˚"˘˘ˇ˘˚ˇ˘˘˚ˇ˘˘˘˙˘˘˘#˘˘˘ˆ˘˛˘˚˛˙ ˘˘˚˚˘˛˙#˘ˇ˘ˇˆ˘˘˛˛˘˘!˘˘˛˘˝˘˘˘˚ ˛˘˘ˇ˘ˇ˛$%&˘ˇ'ˆ˛
$%&˘ˇˇ˘˚ˆ˚˘˘˘˘ ˘ˆ(ˇˇ˘˘˘˘ˇ˘˚˘˘#˘˘˘ˇ˛!ˇ)˘˘˚˘˘˛ ˚˚˘ˇ˘˝˘˚'˘˘ˇˇ ˘˘ˇ˘˛˙˛˛˘˘˚ˇ˘˘ˆ˘˘ˆ˙
$˘˘˘*˘˘˘ˇˆ˘˘ˇˆ˛ˇ˘˝˚˚˘˘ˇ˘ˆ˘"˘ˆ˘ˇˇ˘˛ ˛˛˘˛˘˘˘˘˘˘˛˘˘˚˚˘$ˇ˘ˇˆ˙˘˝˘ˇ˘˘˘ˇˇˆˇ˘ ˘˛ˇ˝˘˚˚#˘˛˘˚˘˘ 
˘ˇ˘˚˛˛˘ˆ˛ˇˇˇ ˚˘˘˚˘˘ˇ˛˘˙˘˝˘ˇ˘ˆ˘˛˙˘˝˘ˇ˘˘˝˘"˘˛˘˝˘ˇ ˘˘˘˚˛˘˚)˘˘ˆ˛˘˘ 
˘˛˘˛˘ˆˇ˚˘˘˘˘˚˘˘˘˘˛˛˚˘˚˝˚ˇ˘#˘˘˚ˆ˘˘˘˝˘˚˘ˆˆˇ˘ˆ 
˘˘˘ˆ˘˝˘˘˚"˘˘˚˘˚˘ˇ˘ˆ˘ˆ˘˚ˆ˛˚˛ˆ˚˘˘˘˘˘˘˚˛˚˚ˆ#˘ˇˇˆˇ˘˝˘˘ˇ˚˘ˇˇ˘˛˛˚ ˚˘˘˘ˇ˚˘˘ˇ˘˘˚ˆ˘*˘ 
˘˘ˇ˘˚ˇ˘˙˘˚ˇ˘˘˘˙˙˘˘˚˚˘˘˝˘˘˘˛˛˘ˇˇ˚˘˛#˘ˆ˘˘ˇ˘˚˘ˇˇ˘˘ˇˆˇ˘$%&˘ˆ˘˛˘˚˘,

Solution

Try this:

import urllib, PyPDF2
import cStringIO

# Python 2: wrap the downloaded string in a file-like object so PyPDF2 can call seek() on it
wFile = urllib.urlopen('https://bitcoin.org/bitcoin.pdf')
lFile = PyPDF2.pdf.PdfFileReader(cStringIO.StringIO(wFile.read()))
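
On Python 3 the same idea can be sketched with io.BytesIO, since cStringIO and urllib.urlopen no longer exist there. This is only a sketch under assumptions: the helper names fetch_pdf_stream and first_page_text are hypothetical, the opener parameter exists only so the download step can be exercised offline, and the third-party pypdf package (the successor to PyPDF2) is assumed to be installed for the extraction step. Extraction may still return garbage for some files, as the question's edit shows.

```python
import io
import urllib.request


def fetch_pdf_stream(url, opener=urllib.request.urlopen):
    """Download url and return its bytes wrapped in a seekable in-memory stream."""
    with opener(url) as response:
        data = response.read()
    # Raw bytes have no seek(); BytesIO provides the file-like interface
    # that PDF readers expect, without touching the disk.
    return io.BytesIO(data)


def first_page_text(stream):
    """Extract text from page 0 (assumes the third-party pypdf package)."""
    from pypdf import PdfReader  # lazy import; assumption: pypdf is installed
    reader = PdfReader(stream)
    return reader.pages[0].extract_text()
```

For example, first_page_text(fetch_pdf_stream('https://bitcoin.org/bitcoin.pdf')) would run both steps without writing a file.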

Because PyPDF2 cannot extract the text from this file, here are a couple of alternatives. Both, however, require saving the file to disk.

Solution 1: You can use ps2ascii (on Linux or Mac) or xpdf (on Windows). An example using xpdf:

import os
os.system('C:\\xpdfbin-win-3.03\\bin32\\pdftotext.exe C:\\xpdfbin-win-3.03\\bin32\\bitcoin.pdf bitcoin1.txt')

or

import subprocess
subprocess.call(['C:\\xpdfbin-win-3.03\\bin32\\pdftotext.exe',  'C:\\xpdfbin-win-3.03\\bin32\\bitcoin.pdf', 'bitcoin2.txt'])
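
If a command-line converter is used, the PDF does not have to stay on disk: it can be written to a temporary directory that is removed as soon as the text has been read. A minimal Python 3 sketch, assuming a pdftotext executable is on the PATH and is invoked as "pdftotext input.pdf output.txt" like in the example above. The function name pdf_bytes_to_text and the converter parameter are hypothetical.

```python
import os
import subprocess
import tempfile


def pdf_bytes_to_text(pdf_bytes, converter=("pdftotext",)):
    """Convert in-memory PDF bytes to text via an external command.

    converter is the command prefix; it is called as
    <converter...> <input.pdf> <output.txt>.
    """
    # The temporary directory and both files are deleted when the block exits.
    with tempfile.TemporaryDirectory() as tmp:
        pdf_path = os.path.join(tmp, "input.pdf")
        txt_path = os.path.join(tmp, "output.txt")
        with open(pdf_path, "wb") as f:
            f.write(pdf_bytes)
        # check=True raises CalledProcessError if the converter fails.
        subprocess.run(list(converter) + [pdf_path, txt_path], check=True)
        with open(txt_path, "rb") as f:
            return f.read()  # raw bytes; decoding is left to the caller
```

The result is returned as bytes because the output encoding depends on the converter and its options.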

Solution 2: You can use one of the online PDF-to-text converters. An example using pdf.my-addr.com:

import MultipartPostHandler
import urllib2


def pdf2text( absolute_path ):
    url = 'http://pdf.my-addr.com/pdf-to-text-converter-tool.php'

    params = {  'file' : open( absolute_path, 'rb' ),
                'encoding': 'UTF-8',
    }
    opener = urllib2.build_opener( MultipartPostHandler.MultipartPostHandler )
    return opener.open( url, params ).read()

print pdf2text('bitcoin.pdf')

You can find the code for MultipartPostHandler here. I tried to use cStringIO instead of open(), but it did not work. Maybe it will be helpful for you.
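
One way to upload in-memory data instead of an open file is to build the multipart/form-data body by hand. The following Python 3 sketch uses only the standard library. The function name encode_multipart is hypothetical, and whether the converter site accepts such a request (or is still available) has not been verified.

```python
import uuid


def encode_multipart(fields, file_field, filename, file_bytes):
    """Build a multipart/form-data body from in-memory data.

    Returns (body, content_type) for use in an HTTP POST request.
    """
    boundary = uuid.uuid4().hex
    delimiter = b"--" + boundary.encode("ascii")
    lines = []
    # Plain form fields, e.g. {'encoding': 'UTF-8'}
    for name, value in fields.items():
        lines.append(delimiter)
        lines.append(('Content-Disposition: form-data; name="%s"' % name).encode("utf-8"))
        lines.append(b"")
        lines.append(str(value).encode("utf-8"))
    # The file part, taken from bytes rather than from a file on disk
    lines.append(delimiter)
    lines.append(
        ('Content-Disposition: form-data; name="%s"; filename="%s"'
         % (file_field, filename)).encode("utf-8")
    )
    lines.append(b"Content-Type: application/pdf")
    lines.append(b"")
    lines.append(file_bytes)
    # Closing delimiter, followed by a final CRLF
    lines.append(delimiter + b"--")
    lines.append(b"")
    body = b"\r\n".join(lines)
    return body, "multipart/form-data; boundary=" + boundary
```

The body could then be sent with urllib.request.Request(url, data=body, headers={'Content-Type': content_type}) and urllib.request.urlopen.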
