How to extract text from PDF, Word and Excel documents?

Question

I need a .NET library that I can use to extract text data from PDF, Excel and Word files.

Ideally, a free tool!

What would you recommend?

Many thanks,

Recommended answer

As someone who has spent many days looking for free solutions to (nearly) this exact problem, I can tell you fairly honestly that you will not find a free library that extracts text well from all of those formats. The only library I'm aware of that does a great job with all of those formats (and more) is a commercial one, and it's not actually native to .NET: it's a C++/COM library with a C++/CLI .NET wrapper.

So, what are the options?


  • iTextSharp -- This one is absolutely fantastic at extracting text from PDFs. Earlier versions of this library were commercial friendly (LGPL), but the authors have since decided that they want to charge for the software and now release it under the AGPL, so unless you want to release all of your source code, you probably don't want to use one of the newer versions. However, the last version licensed under the LGPL (4.1.6) can be found all over the internet. This SO question has a link to a version that is under the LGPL.

  • PdfBox -- Another PDF library. This one, IMO, is better because it's under the Apache 2.0 license. There are a few issues with it: it sometimes (perhaps rarely) will not do as good a job as iTextSharp. I attribute this mostly to the fact that it's a newer library. However, my experience with this library is from months ago. The project is actively developed, and in the last month alone 52 issues have been resolved, so I would keep an eye on this one. Please note that this is a Java library. (Keep reading below for why I've included it.)

  • POI or NPOI -- These are libraries written specifically for Microsoft Office documents, particularly the pre-2007 OLE binary file formats. They do support the newer OpenXML formats, though I'm not sure how mature that part of the library is. POI is the Java version (keep reading below for why I've included it), whereas NPOI is a native .NET version. However, NPOI only supports Excel documents, whereas POI can do text extraction on many more types.

  • Open XML SDK 2.0 -- A library for reading and modifying Office 2007+ (unencrypted OpenXML) documents, created by Microsoft themselves! This is an amazing library for working with these kinds of documents. However, it is a lower-level library and therefore doesn't (as far as I know) have a do-everything text extraction class. There's a fairly good example of text extraction from a Word document in this SO answer (I'm not sure it covers certain cases, such as text in tables).

  • Tika -- Once again, another Java library (I'm not telling you about Java libraries for no reason. Keep on reading! :)), and this will be as close to "one library" for text extraction as you can get. Tika can extract metadata and structured text content from many different kinds of files, using existing parsing libraries. It actually uses POI and PdfBox under the hood for Office and PDF documents.
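
To make the Open XML SDK item above more concrete: an unencrypted .docx file is a ZIP package, and its body text normally lives in `w:t` elements inside `word/document.xml`. The following is a minimal, language-neutral sketch in Python (standard library only, not the Open XML SDK itself) of what a hand-rolled extractor has to do. The function name `extract_docx_text` is hypothetical, and the sketch assumes the main part sits at the conventional `word/document.xml` path; it does not handle headers, footers, footnotes or text boxes correctly.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a .docx file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_docx_text(source):
    """Return the body text of a .docx file, one line per paragraph.

    `source` is a file path or a binary file-like object.
    Assumption: the package is unencrypted and its main part is stored
    at word/document.xml (the conventional location).
    """
    with zipfile.ZipFile(source) as package:
        root = ET.fromstring(package.read("word/document.xml"))

    paragraphs = []
    # iter() walks the whole tree, so paragraphs nested in table cells
    # are visited too, in document order.
    for paragraph in root.iter(W_NS + "p"):
        # A paragraph's text is split across runs; each run keeps its
        # characters in one or more w:t elements.
        pieces = [node.text or "" for node in paragraph.iter(W_NS + "t")]
        paragraphs.append("".join(pieces))
    return "\n".join(paragraphs)
```

Calling `extract_docx_text("report.docx")` (the file name is made up) returns one line per paragraph. Even this small case shows why a "do-everything" extraction class is non-trivial: each additional part of the package (headers, footnotes, comments) needs its own handling.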

Non-free


  • dtSearch -- This is a library I'm very familiar with. It does a fantastic job and can parse a ridiculous number of file formats. However, it costs money and is probably overkill for what you need. It's actually exactly what we need, but we're trying to get rid of it ourselves, because we only use it for parsing (it's really a full-text search engine) and there are plenty of parsing libraries out there that we can use or modify to suit our needs. Honestly, though, it blows all these other libraries out of the water. As I mentioned before, it is also not native .NET code: a C++/CLI wrapper is used to interop between the DLL and the .NET runtime.

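IFilters can also be used, and they have been mentioned in several SO answers to other questions, but the text you get back is unstructured. Sometimes it's just bad... unreadable for humans, at least. I believe IFilters are also deprecated, and depending on licensing issues, you may not be able to redistribute them.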

Why did I mention all of those Java libraries? Well, for two reasons. First, there are no free .NET equivalents that come close to the quality of these Java libraries. Second, you can use these libraries in .NET through IKVM (I've personally done this myself with these libraries, so I can at least vouch for that). IKVM is an implementation of Java that runs inside .NET. Here is a good example of using IKVM to convert Tika into a .NET assembly that can be used in your project (http://www.dovetailsoftware.com/blogs/kmiller/archive/2010/07/02/using-the-tika-java-library-in-your-net-application-with-ikvm). Perhaps the scariest thing about IKVM is that it just works!

Edit: I forgot that the author of that blog had actually posted the code and the converted libraries in a GitHub project. So, if you want to check it out quickly, you can do so there. However, it uses a much older version of Tika and is over a year old. If the results aren't what you expected, I would suggest trying the conversion yourself with the latest version.
