PDFBox 0.7.3: convert PDF to text


Question

I want to convert PDF files to text files, but some PDF files do not work with the PDFBox DLL because they target an Acrobat version newer than Acrobat 5.x.

Please tell me how I can work around this.

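output.WriteLine("Begin Parsing.....");
output.WriteLine(DateTime.Now.ToString());

// Load the PDF and extract its text with PDFBox.
PDDocument doc = PDDocument.load(path);
try
{
    PDFTextStripper stripper = new PDFTextStripper();
    output.Write(stripper.getText(doc));
}
finally
{
    doc.close(); // release the document's resources
}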


Recommended answer

Your first step should be to try a current version of PDFBox. Your version 0.7.3 dates back to 2006! PDFBox has since become an Apache project, located at http://pdfbox.apache.org/, and the current version (as of May 2013) is 1.8.1. I'm very sure that PDFBox nowadays supports PDF object streams and cross-reference streams, which were new in PDF Reference version 1.5, the version Adobe Acrobat 6 was built for.
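To find out which files are affected, it can help to read the version each PDF declares in its header comment (for example %PDF-1.5). The following is a minimal sketch in plain Java, assuming the header appears within the first 1024 bytes of the file. The class PdfHeaderVersion and its methods are hypothetical helpers, not part of PDFBox or iText. Acrobat 5.x corresponds to PDF 1.4, so anything declaring 1.5 or later may use the newer features mentioned above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of PDFBox or iText.
class PdfHeaderVersion {
    // The header comment looks like "%PDF-1.5".
    private static final Pattern HEADER = Pattern.compile("%PDF-(\\d)\\.(\\d)");

    /** Returns the declared version, e.g. "1.5", or null if no header is found. */
    static String parse(String head) {
        Matcher m = HEADER.matcher(head);
        return m.find() ? m.group(1) + "." + m.group(2) : null;
    }

    /** True for PDF 1.5 or later, i.e. newer than the Acrobat 5.x generation (PDF 1.4). */
    static boolean newerThanAcrobat5(String version) {
        String[] parts = version.split("\\.");
        int major = Integer.parseInt(parts[0]);
        int minor = Integer.parseInt(parts[1]);
        return major > 1 || (major == 1 && minor >= 5);
    }

    /** Reads up to the first 1024 bytes of the file and parses the header from them. */
    static String readVersion(String path) throws IOException {
        byte[] buf = new byte[1024];
        int total = 0;
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            int n;
            while (total < buf.length
                    && (n = in.read(buf, total, buf.length - total)) > 0) {
                total += n;
            }
        }
        // ISO-8859-1 maps every byte to one char, so binary data cannot break decoding.
        return parse(new String(buf, 0, total, StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) throws IOException {
        for (String path : args) {
            String version = readVersion(path);
            if (version == null) {
                System.out.println(path + ": no PDF header found");
            } else {
                System.out.println(path + ": PDF " + version
                        + (newerThanAcrobat5(version) ? " (newer than Acrobat 5.x)" : ""));
            }
        }
    }
}
```

Run it with one or more file paths as arguments. The header is only a hint: since PDF 1.4 the document catalog may carry a Version entry that overrides it, and a file declaring 1.5 does not necessarily use object streams or cross-reference streams. The same logic ports directly to C#.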

If that does not work, you might want to try other PDF libraries, e.g. iText (or iTextSharp in your case) version 5.4.x, if the AGPL (or alternatively buying a license) is no problem for you.

Information on text parsing using iText(Sharp) can be found in chapter 15, "Marked content and parsing PDF", of iText in Action, 2nd Edition (http://itextpdf.com/book/index.php). The samples from that chapter can be found online for both Java and .Net.

For a first test, the sample ExtractPageContentSorted2.cs / ExtractPageContentSorted2.java would be a good start. The central code:

// Needs: using System.Text; using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
PdfReader reader = new PdfReader(PDF_FILE);
StringBuilder sb = new StringBuilder();
// iText page numbers are 1-based.
for (int i = 1; i <= reader.NumberOfPages; i++) {
    sb.AppendLine(PdfTextExtractor.GetTextFromPage(reader, i));
}
reader.Close();

If neither a current PDFBox version nor a current iText(Sharp) version can parse your PDF, you might want to post a sample for inspection; there are ways to drop all information required for text parsing from a PDF...
