Extract text from PDF in code


Question

I'm making an app for my school that people can use to check whether they have a schedule change. All schedule changes are listed here: http://www.augustinianum.eu/roosterwijzigingen/14062012.pdf. I want to search that page for a keyword (the user's group, which is entered in an EditText). I've found out how to make the app check whether the EditText matches a certain string, so now I only need to download all of the text on that page into a string. The problem is that it's not a simple web page but a PDF. I've heard that you need a special PDF library to extract the text from the PDF, put that text into a string, and then search the string for keywords using contains(). However, I have some questions about that:

  • This PDF was made with a PDF creator; it is not a scanned page. You can, for example, select the text or search it for keywords with CTRL+F. So I wonder whether extracting the text from the PDF is actually required, or whether there is an easier way.

  • I want the app to check for changes every hour, say. So it also has to download the PDF (about 8 pages) and extract the text every hour. Would that drain much battery?

  • I've heard that there are many libraries that do what I want. Which one should I use? (If possible, I'd like one that is free :))

  • Could anyone explain to me how to use it in my code? (I'm not really experienced, so please keep it simple :))

Thanks a lot, everyone!

Recommended answer

Unfortunately, I don't work with Java, so you will have to implement this in Java yourself. Here is how I finally did it in PHP:

1) I fetched the file from your link. In PHP this is done with @fopen("http://...").
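For the Java side, step 1 might look roughly like the sketch below. The class and method names (PdfFetcher, readAll, fetch) are made up for illustration; only the standard java.net and java.io calls are real. It reads the response as raw bytes, which matters for the binary handling in the next step.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

// Hypothetical helper, not part of any library.
final class PdfFetcher {
    /** Reads a stream to the end and returns its bytes unchanged (binary-safe). */
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    /** Downloads the resource at the given URL as raw bytes. */
    static byte[] fetch(String url) throws IOException {
        try (InputStream in = new URL(url).openStream()) {
            return readAll(in);
        }
    }
}
```

On Android, a call such as PdfFetcher.fetch("http://...") has to run off the main thread, and the app needs the INTERNET permission.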

2) I opened it as binary (this is important) and extracted two parts:

2.1) The data of 3 0 obj, which holds the creation and modification dates. I did it with a regex; it was simple, as I mentioned above.

2.2) The data stream from 5 0 obj, which holds the deflated (compressed) data. IMPORTANT! Microsoft Excel inserts the two bytes 0D 0A as a line break. Do not forget this when you filter the content with a regexp: these bytes at the start and at the end must not be included in the extracted string.
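A rough Java sketch of step 2, under several assumptions: the file is decoded as ISO-8859-1 so that every byte maps to exactly one character and regex offsets stay byte-accurate; the dates are found by scanning the whole file for /ModDate rather than only 3 0 obj; and the compressed data is assumed not to contain the text "endstream" itself (a robust reader would use the /Length entry instead). The names PdfParts, modDate and streamOf are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class PdfParts {
    /** Maps each byte to one char (ISO-8859-1), so string offsets equal byte offsets. */
    static String asLatin1(byte[] pdf) {
        return new String(pdf, StandardCharsets.ISO_8859_1);
    }

    /** Returns the raw /ModDate value (for example "D:20120614083000"), or null if absent. */
    static String modDate(byte[] pdf) {
        Matcher m = Pattern.compile("/ModDate\\s*\\((.*?)\\)").matcher(asLatin1(pdf));
        return m.find() ? m.group(1) : null;
    }

    /**
     * Returns the bytes between "stream" + line break and "endstream" of the object
     * "<objNum> 0 obj", without the surrounding line breaks, or null if not found.
     */
    static byte[] streamOf(byte[] pdf, int objNum) {
        Pattern p = Pattern.compile(
            "(?<![0-9])" + objNum + " 0 obj\\b"   // object header, not e.g. "15 0 obj"
            + "(?:(?!endobj).)*?stream\\r?\\n"    // stay inside this object; skip 0D 0A or 0A
            + "(.*?)(?:\\r?\\n)?endstream",       // payload, minus the trailing line break
            Pattern.DOTALL);
        Matcher m = p.matcher(asLatin1(pdf));
        return m.find() ? m.group(1).getBytes(StandardCharsets.ISO_8859_1) : null;
    }
}
```

Comparing the value returned by modDate with the one stored from the previous check is one way to skip the decompression and parsing when the file has not changed.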

3) I inflated the compressed data with $uncompressed = @gzuncompress($compressed) and put the result in an external file. You can see the results there.
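In Java, the counterpart of PHP's gzuncompress is java.util.zip.Inflater, which by default expects the same zlib-wrapped format. The following is a minimal sketch; the class name Flate and the method name inflate are made up, and it assumes the whole compressed stream is already in memory.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

// Hypothetical helper, not part of any library.
final class Flate {
    /** Decompresses zlib-wrapped deflate data, as PHP's gzuncompress does. */
    static byte[] inflate(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater(); // default mode expects a zlib header
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                // No progress and not finished: the input is cut short or needs a dictionary.
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new DataFormatException("truncated or unsupported stream");
                }
                out.write(buffer, 0, n);
            }
        } finally {
            inflater.end(); // release native resources
        }
        return out.toByteArray();
    }
}
```

A DataFormatException here usually means the extracted bytes still include a stray line break, or that the stream uses a filter other than deflate.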

4) The funniest part: the raw data inside the file is in textual format. It looks like [(V)-4(RI)16(J)] TJ, which means VRIJ. You can read about text in PDF in the PDF Reference v1.7, chapter 5 ("Text").
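Decoding such a line can be sketched with two regular expressions: one finds each array that precedes a TJ operator, the other pulls the parenthesised strings out of it and ignores the numeric spacing adjustments between them. This is only an illustration under simplifying assumptions: TjText and extract are made-up names, only the escapes \( \) \\ are unescaped, nested unescaped parentheses and octal escapes are not handled, and the plain Tj operator is ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class TjText {
    // Body of a literal string: escaped characters or anything except \ ( )
    private static final String STRING_BODY = "(?:\\\\.|[^\\\\()])*";
    // One literal string "( ... )", capturing its body.
    private static final Pattern STRING = Pattern.compile("\\((" + STRING_BODY + ")\\)");
    // An array of strings and numbers followed by the TJ operator: [ ... ] TJ
    private static final Pattern TJ_ARRAY =
        Pattern.compile("\\[((?:\\(" + STRING_BODY + "\\)|[^\\]()])*)\\]\\s*TJ\\b");

    /** Returns one text run per TJ operator, in the order they appear in the stream. */
    static List<String> extract(String content) {
        List<String> runs = new ArrayList<>();
        Matcher array = TJ_ARRAY.matcher(content);
        while (array.find()) {
            StringBuilder run = new StringBuilder();
            Matcher string = STRING.matcher(array.group(1));
            while (string.find()) {
                // Drop the backslash in front of an escaped ( ) or \
                run.append(string.group(1).replaceAll("\\\\([()\\\\])", "$1"));
            }
            runs.add(run.toString());
        }
        return runs;
    }
}
```

For the sample above, TjText.extract("[(V)-4(RI)16(J)] TJ") yields a list with the single entry VRIJ, which can then be searched with contains().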

5) I believe regular expressions can help you extract and/or transform the data.

IMPORTANT: I said "data stream from 5 0 obj", but the object number is subject to change. You have to follow the reference to the object through the dictionary -> pages -> page -> contents chain. You can find a description of these "bread crumbs" in the manual I mentioned above.
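As a shortcut that avoids walking the whole chain, one could scan the file for /Contents entries and collect the object numbers they refer to. The sketch below does only that and rests on several assumptions: the file has been decoded one byte per character (for example as ISO-8859-1), page dictionaries are stored uncompressed, and every /Contents entry belongs to a page. ContentRefs and contentObjects are made-up names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class ContentRefs {
    // "/Contents 5 0 R" or "/Contents [5 0 R 6 0 R]"
    private static final Pattern CONTENTS =
        Pattern.compile("/Contents\\s*(\\[[^\\]]*\\]|\\d+\\s+\\d+\\s+R)");
    // An indirect reference "<object number> <generation> R"
    private static final Pattern REF = Pattern.compile("(\\d+)\\s+\\d+\\s+R");

    /** Returns the object numbers referenced by /Contents entries, in file order. */
    static List<Integer> contentObjects(String pdfText) {
        List<Integer> numbers = new ArrayList<>();
        Matcher contents = CONTENTS.matcher(pdfText);
        while (contents.find()) {
            Matcher ref = REF.matcher(contents.group(1));
            while (ref.find()) {
                numbers.add(Integer.parseInt(ref.group(1)));
            }
        }
        return numbers;
    }
}
```

Each returned number identifies an object whose stream holds page text, so the 5 in "5 0 obj" no longer has to be hard-coded.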

Unfortunately, Excel does not embed any table structure in the PDF, but you can find the coordinates of the text portions and interpret them. Either way, it is a mess.

Do you think, dear Merlin, that this is hard? No, it is not. It is not hard because there are no Unicode symbols. Unicode in PDF is what REALLY sucks!

Good luck!
