I need to search some keywords in PDF programmatically


Question

I have a task: I need to write a program that crawls a set of PDFs and, whenever it finds one of a predefined list of keywords, highlights that text or shows a popup, then continues searching.


Thanks in advance.

Recommended answer

This is a list of PDF libraries you can use: http://csharp-source.net/open-source/pdf-libraries

In particular, you can try this one: https://pdfapi.codeplex.com.

—SA


As I mentioned, the only reliable way to extract text from a PDF is to run OCR on it. There are some free/open-source libraries you could use (like Tesseract), but I recommend buying an API with proper .NET support, such as one of these:
http://www.abbyy.com/ocr-sdk-windows/
https://www.leadtools.com/sdk/ocr/default.htm?SrcOrigin=Google-CPC-OCR%20API&MatchType=e&AdPos=1t2&gclid=CLjXx4Gx6K8CFdA2pAodAXth1Q
http://www.aspose.com/.net/ocr-component.aspx

Another approach is to use an IFilter, which is actually designed for full-text indexing, and there is an IFilter for PDF: http://www.adobe.com/support/downloads/detail.jsp?ftpID=5542. But I doubt you will be able to find the original position of the text with it.
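
Whichever route produces the text (OCR or an IFilter), the keyword search itself is a small step. Here is a minimal Python sketch of that step, assuming the text of each page has already been extracted into a string. `find_keywords` and the sample data are hypothetical names made up for illustration, not part of any library mentioned above. Note that it reports the page number and the character offset within the extracted text, which is not the same thing as the position on the rendered page.

```python
import re


def find_keywords(pages, keywords):
    """Return a sorted list of (page_number, keyword, offset) tuples.

    Assumption: `pages` holds one already-extracted text string per page,
    produced by whatever OCR or text-extraction tool is in use.
    Matching is case-insensitive and literal (no regex syntax in keywords).
    """
    # Escape each keyword so characters like "." or "+" are matched literally.
    patterns = [(kw, re.compile(re.escape(kw), re.IGNORECASE)) for kw in keywords]

    hits = []
    for page_number, text in enumerate(pages, start=1):
        for keyword, pattern in patterns:
            for match in pattern.finditer(text):
                hits.append((page_number, keyword, match.start()))

    # Order hits the way a reader would meet them: by page, then by offset.
    hits.sort(key=lambda hit: (hit[0], hit[2]))
    return hits


if __name__ == "__main__":
    # Made-up sample pages standing in for real extracted text.
    sample_pages = ["Invoice total due", "No match here", "Late INVOICE; total overdue"]
    for page_number, keyword, offset in find_keywords(sample_pages, ["invoice", "total"]):
        print(f"page {page_number}: '{keyword}' at offset {offset}")
```

The crawler can act on each hit as it is reported (log it, show a popup) and then keep going; highlighting inside the PDF itself would still need a PDF library that can map the match back to page coordinates.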

