pypdf python tool
Question
Using the pyPdf Python module, how can I read the following PDF file? http://www.envis-icpe.com/pointcounterpointbook/Hindi_Book.pdf
# -*- coding: utf-8 -*-
import pyPdf

def getPDFContent(path):
    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

print getPDFContent("/home/tom/Desktop/Hindi_Book.pdf").encode("ascii", "xmlcharrefreplace")
The above prints only binary data.
And how can I print the contents using the code below?
from pyPdf import PdfFileWriter, PdfFileReader

output = PdfFileWriter()
input1 = PdfFileReader(file("/home/tom/Desktop/Hindi_Book.pdf", "rb"))
# print the title of Hindi_Book.pdf
print "title = %s" % (input1.getDocumentInfo().title)
Recommended answer
Note that most of the "text" in the PDF document you refer to isn't real text at all: it is mostly images. The actual text seems to be extracted correctly when I try it (although I must admit that, apart from some snippets on the front page and the page numbers, I can't read it ;-)).
As for the second question: I'm not sure what you're asking there.
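For readers on Python 3, here is a minimal sketch of what the two snippets from the question might look like with the current pypdf package instead of the old pyPdf module. It is an illustration under assumptions, not the asker's code: `PdfReader`, `reader.pages`, `page.extract_text()` and `reader.metadata` come from pypdf, while the helper names `collapse_whitespace` and `summarize_pdf` are hypothetical. Pages that consist only of scanned images are expected to come back as empty strings, which is consistent with the observation above that most of this document is not real text.

```python
def collapse_whitespace(text):
    """Swap non-breaking spaces for plain ones and squeeze whitespace runs."""
    # Mirrors the whitespace handling in the question's getPDFContent().
    return " ".join(text.replace("\xa0", " ").split())


def summarize_pdf(path):
    """Return (title, per_page_text) for the PDF at `path` (hypothetical helper)."""
    # Imported here so collapse_whitespace() stays usable without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    # metadata may be None when the file carries no document information.
    title = reader.metadata.title if reader.metadata else None
    # Guard with `or ""` in case a page yields no text at all.
    per_page_text = [
        collapse_whitespace(page.extract_text() or "") for page in reader.pages
    ]
    return title, per_page_text
```

Calling `summarize_pdf("/home/tom/Desktop/Hindi_Book.pdf")` and printing the result directly should avoid the `.encode("ascii", "xmlcharrefreplace")` step, since Python 3 strings are Unicode; pages whose entry is empty are the likely image-only pages, and reading those would need OCR rather than text extraction.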