Searching PDF files for certain info


Question

Not really a Python question... but here goes: Is there a way to read
the content of a PDF file and decode it with Python? I'd like to read
PDFs, decode them, and then search the data for certain strings.

Thanks, rbt

Recommended answer

rbt wrote:
> Not really a Python question... but here goes: Is there a way to read
> the content of a PDF file and decode it with Python? I'd like to read
> PDFs, decode them, and then search the data for certain strings.

There is a commercial tool, pdflib, available that might help. It has a free
evaluation version and Python bindings.

If it's only about text, maybe pdf2text helps.
--
Regards,

Diez B. Roggisch
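
The text-only route suggested above can be sketched roughly as follows. This assumes the tool meant is something like the `pdftotext` command-line utility shipped with Xpdf and Poppler (the post says "pdf2text", so the exact tool is a guess), and that it is installed and on the PATH. The function names `pdf_to_text` and `find_strings` are made up for illustration.

```python
import subprocess


def pdf_to_text(pdf_path):
    """Return the text of a PDF by running the external pdftotext tool.

    Assumption: pdftotext is installed; passing "-" as the output file
    makes it write the extracted text to standard output.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    # The output encoding depends on the tool's configuration, so undecodable
    # bytes are replaced instead of raising an error.
    return result.stdout.decode("utf-8", errors="replace")


def find_strings(text, needles):
    """Map each needle to the 1-based line numbers where it occurs.

    Matching is case-insensitive and purely substring based.
    """
    hits = {needle: [] for needle in needles}
    for lineno, line in enumerate(text.splitlines(), start=1):
        lowered = line.lower()
        for needle in needles:
            if needle.lower() in lowered:
                hits[needle].append(lineno)
    return hits
```

A call such as `find_strings(pdf_to_text("report.pdf"), ["invoice"])` would then list the lines mentioning "invoice", where `report.pdf` is a placeholder file name. Text that the PDF stores only as scanned images will not be found this way.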


Aloha,

rbt wrote:
> Not really a Python question... but here goes: Is there a way to read
> the content of a PDF file and decode it with Python? I'd like to read
> PDFs, decode them, and then search the data for certain strings.

First of all,
http://groups.google.de/groups?selm=...&output=gplain
still applies here.

If you can deal with a very basic implementation of a pdf-lib, you
might be interested in
http://sourceforge.net/projects/pdfplayground

In the CVS (or the current snapshot) you can find an example of
text extraction in ppg/Doc/text_extract.txt.

import pdffile
import pages
import pdftool  # provides parse_content, used below
import zlib

# Open the PDF and build its page list.
pf = pdffile.pdffile('../pdf-testset1/a.pdf')
pp = pages.pages(pf)
# Decompress the content stream of the first page.
c = zlib.decompress(pf[pp.pagelist[0]['/Contents']].stream)
# Parse the stream into operators and keep the text-showing ones (' and Tj).
op = pdftool.parse_content(c)
sop = [x[1] for x in op if x[0] in ["'", "Tj"]]
for a in sop:
    print(a[0])

Wishing a happy day
LOBI
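
For readers without the pdfplayground modules, the idea behind the example above (inflate a page's content stream, then pick out the strings passed to the text-showing operators `Tj` and `'`) can be sketched with the standard library alone. This is a simplified illustration, not the pdfplayground implementation: the regular expression and the name `extract_shown_strings` are assumptions made here, and the sketch ignores `TJ` arrays, hex strings, nested parentheses, and font encodings.

```python
import re
import zlib

# A literal string operand "( ... )" followed by a text-showing operator.
# Backslash escapes inside the string are tolerated; nested parentheses are not.
SHOW_TEXT = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*(?:Tj|')", re.DOTALL)


def extract_shown_strings(compressed_stream):
    """Inflate a Flate-compressed content stream and return the raw byte
    strings handed to the Tj and ' operators, in order of appearance."""
    content = zlib.decompress(compressed_stream)
    return [match.group(1) for match in SHOW_TEXT.finditer(content)]


# A hand-written stand-in for a page content stream, compressed the way a
# FlateDecode stream would be stored in the file.
sample = zlib.compress(
    b"BT /F1 12 Tf 72 720 Td (Hello) Tj 0 -14 Td (PDF world) ' ET"
)
strings = extract_shown_strings(sample)
print(strings)
```

Searching is then a matter of checking the returned byte strings for the wanted text. Real documents often split words across several operators or use custom font encodings, so a dedicated PDF library is usually the more reliable choice.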


Andreas Lobinger wrote:
> Aloha,
>
> rbt wrote:
>> Not really a Python question... but here goes: Is there a way to read
>> the content of a PDF file and decode it with Python? I'd like to read
>> PDFs, decode them, and then search the data for certain strings.

> First of all,
> http://groups.google.de/groups?selm=...&output=gplain
> still applies here.
>
> If you can deal with a very basic implementation of a pdf-lib, you
> might be interested in
> http://sourceforge.net/projects/pdfplayground
>
> In the CVS (or the current snapshot) you can find an example of
> text extraction in ppg/Doc/text_extract.txt.

> >>> import pdffile
> >>> import pages
> >>> import pdftool
> >>> import zlib
> >>> pf = pdffile.pdffile('../pdf-testset1/a.pdf')
> >>> pp = pages.pages(pf)
> >>> c = zlib.decompress(pf[pp.pagelist[0]['/Contents']].stream)
> >>> op = pdftool.parse_content(c)
> >>> sop = [x[1] for x in op if x[0] in ["'", "Tj"]]
> >>> for a in sop:
> ...     print(a[0])
>
> Wishing a happy day
> LOBI



Thanks guys... what if I convert it to PS by printing it to a file or
something? Would that make it easier to work with?

