pyPDF2 TypeError when trying to extract text
Question
I have successfully installed pyPDF, but its extractText method does not work well, so I decided to try PyPDF2. The problem is that extracting text raises an exception:
Traceback (most recent call last):
  File "C:\Users\Asus\Desktop\pfdtest.py", line 44, in <module>
    test2()
  File "C:\Users\Asus\Desktop\pfdtest.py", line 41, in test2
    print(mypdf.getPage(0).extractText())
  File "C:\Python32\lib\site-packages\PyPDF2\pdf.py", line 1701, in extractText
    content = ContentStream(content, self.pdf)
  File "C:\Python32\lib\site-packages\PyPDF2\pdf.py", line 1783, in __init__
    stream = StringIO(stream.getData())
TypeError: initial_value must be str or None, not bytes
Here is my sample code:
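from PyPDF2 import PdfFileReader

filename = "myfile.pdf"
f = open(filename, 'rb')  # open the PDF in binary mode
mypdf = PdfFileReader(f)
print(f, mypdf, mypdf.getNumPages())
print(mypdf.getPage(0).extractText())  # this call raises the TypeError above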
It correctly determines the number of pages in the PDF, but it fails when reading the stream.
Recommended answer
This was a compatibility problem between PyPDF2 and Python 3.
In my case, I solved it by replacing pdf.py and utils.py with the ones you will find here. They basically check whether you are running Python 3 and, if you are, handle the data as bytes instead of strings.
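To illustrate the kind of check described above, here is a minimal sketch using only the standard library. It is not the code from the replacement pdf.py and utils.py files. The helper name make_stream and the sample bytes are hypothetical, chosen only to reproduce the str/bytes mismatch from the traceback and show one way to avoid it.

```python
import sys
from io import BytesIO, StringIO


def make_stream(data):
    """Wrap stream data in a file-like object that matches its type.

    Hypothetical helper: on Python 3, bytes must go into BytesIO,
    because StringIO only accepts str.
    """
    if sys.version_info[0] >= 3 and isinstance(data, bytes):
        return BytesIO(data)
    return StringIO(data)


# Hypothetical sample of what a PDF content stream might return as bytes.
raw = b"BT /F1 12 Tf (Hello) Tj ET"

# Reproduce the failure from the traceback: StringIO rejects bytes on Python 3.
error_message = None
try:
    StringIO(raw)
except TypeError as exc:
    error_message = str(exc)

# The version-aware helper picks BytesIO instead, so reading works.
stream = make_stream(raw)
first_token = stream.read(2)
```

Under Python 3, the StringIO(raw) call fails with a TypeError like the one in the question, while make_stream(raw) returns a BytesIO that can be read normally. Any code that then parses the stream has to compare against bytes literals (for example b"BT") rather than str.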