Is there a C++ library to extract text from a PDF file like PDFBox for Java?

Problem description

Last year, I made an application in Java using PDFBox to get the raw text in some PDF files and I need to port that application to C++ now.

I wanted to know what was the best C++ alternative to accomplish what I need.

I'll give an example in case it helps:

Most files will look like this: http://www.jumbala.net/backup/league.pdf

With PDFBox, using that file, each line read on page 2 and most of page 3 would output all the data of a line, separated by a space instead of keeping it in a grid like it is now.

So the first relevant line in page 2 would look like this:

FB 847 - Tremblay, Gérard 179,63 56 16167 90 268 s27 p3 669 s14 199 223 193 615

or something like that since there are minor changes in the order they appear, but I don't care about that as long as similar lines output the same since I just parse them and put the values I need in different variables.

So, knowing all of that, is there a library that I can use in a C++ program to get similar results?

Edit: After looking at sacredFaith's link (http://www.codeproject.com/Articles/7056/Code-to-extract-plain-text-from-a-PDF-file) and trying the code, I'm getting weird output for the example file I mentioned earlier:

http://www.jumbala.net/backup/league.pdf.txt

The parts I actually need are in the weird characters at the beginning. Using Adobe Acrobat Reader X and using Save As... Text (accessible), I get the following result:

http://www.jumbala.net/backup/league_good.pdf.txt

That is approximately what I get in Java using PDFBox, and it is what I want to get as output in C++.

Solution

Xpdf is a C++ application/library which includes tools to extract plain text from a PDF file.
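
One low-effort way to use those tools from a C++ program is to run Xpdf's `pdftotext` command-line utility and capture its output. The sketch below is not from the original answer. It assumes a POSIX system (for `popen`) and that `pdftotext` is on the `PATH`. The function names `shell_quote`, `build_pdftotext_command`, `run_and_capture`, and `extract_pdf_text` are hypothetical helpers introduced for illustration.

```cpp
// Sketch: extract text from a PDF by running the pdftotext command-line tool
// (shipped with Xpdf) and capturing its standard output.
// Assumptions: POSIX popen/pclose are available and pdftotext is on the PATH.
#include <cassert>
#include <cstdio>
#include <stdexcept>
#include <string>

// Wrap a string in single quotes so the shell passes it through unchanged.
// An embedded single quote becomes '\'' (close quote, escaped quote, reopen).
std::string shell_quote(const std::string& s) {
    std::string out = "'";
    for (char c : s) {
        if (c == '\'') out += "'\\''";
        else out += c;
    }
    out += "'";
    return out;
}

// Build the command line. "-layout" asks pdftotext to keep the physical
// layout of the page, "-f" and "-l" select the first and last page, and
// "-" as the output file name sends the text to stdout.
std::string build_pdftotext_command(const std::string& pdf_path,
                                    int first_page, int last_page) {
    assert(first_page >= 1 && first_page <= last_page);
    return "pdftotext -layout -f " + std::to_string(first_page) +
           " -l " + std::to_string(last_page) + " " +
           shell_quote(pdf_path) + " -";
}

// Run a shell command and return everything it wrote to stdout.
std::string run_and_capture(const std::string& command) {
    FILE* pipe = popen(command.c_str(), "r");
    if (!pipe) throw std::runtime_error("popen failed");
    std::string output;
    char buffer[4096];
    std::size_t n;
    while ((n = std::fread(buffer, 1, sizeof buffer, pipe)) > 0) {
        output.append(buffer, n);
    }
    if (pclose(pipe) != 0) {
        throw std::runtime_error("command failed: " + command);
    }
    return output;
}

// Hypothetical convenience wrapper: text of pages first_page..last_page.
std::string extract_pdf_text(const std::string& pdf_path,
                             int first_page, int last_page) {
    return run_and_capture(
        build_pdftotext_command(pdf_path, first_page, last_page));
}
```

For the file in the question, a call such as `extract_pdf_text("league.pdf", 2, 3)` would return the text of pages 2 and 3 as one string, which can then be split into lines and parsed. Whether `-layout` yields one output line per table row depends on the PDF, so it may be worth comparing against `pdftotext`'s `-raw` option.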
