Is there a C++ library to extract text from a PDF file like PDFBox for Java?

Views: 299 Published: 2016/10/22 19:08:58 Tags: c++ pdf


Question

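Last year, I wrote a Java application that uses PDFBox to get the raw text from some PDF files, and I now need to port that application to C++.

I would like to know what the best C++ alternative is for what I need.

Here is an example, in case it helps:

Most files will look like this: http://www.jumbala.net/backup/league.pdf

With PDFBox, using that file, each row read on page 2 and most of page 3 is output as a single line, with all of the row's data separated by spaces instead of kept in a grid as it is in the PDF.

So the first relevant line on page 2 would look like this:

  FB 847 - Tremblay, Gérard 179,63 56 16167 90 268 s27 p3 669 s14 199 223 193 615

or something similar. There are minor differences in the order in which the values appear, but I don't care about that as long as similar lines are output the same way, since I just parse them and put the values I need into different variables.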

So, knowing all of that, is there a library I can use in a C++ program to get similar results?

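Edit: After looking at sacredFaith's link, http://www.codeproject.com/Articles/7056/Code-to-extract-plain-text-from-a-PDF-file, and trying it, I get weird output like this for the example file I mentioned earlier:

http://www.jumbala.net/backup/league.pdf.txt

The parts I actually need are in the weird characters at the beginning. Using Adobe Acrobat Reader X and Save As... Text (Accessible), I get the following result:

http://www.jumbala.net/backup/league_good.pdf.txt

This is approximately what I get in Java using PDFBox, and what I want to get as output in C++.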

Solution

Xpdf is a C++ application/library that includes tools for extracting plain text from a PDF file.
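One way to use Xpdf from a C++ program is to run its `pdftotext` command-line tool and read the text it writes to standard output. The sketch below assumes a POSIX system (for `popen`) with `pdftotext` available on the `PATH`. The helper names `shell_quote`, `build_pdftotext_command`, `run_and_capture`, and `split_fields` are made up for this illustration and are not part of Xpdf, and whether `-layout` yields the most convenient line structure for the example file has not been checked here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Quote a string for a POSIX shell: wrap it in single quotes and
// escape any single quotes it contains.
std::string shell_quote(const std::string& s) {
    std::string out = "'";
    for (char c : s) {
        if (c == '\'') out += "'\\''";
        else out += c;
    }
    out += "'";
    return out;
}

// Build a pdftotext command line that sends the extracted text to stdout.
// -layout asks pdftotext to preserve the physical layout, -f / -l select
// the first and last page, and a text-file argument of "-" means stdout.
std::string build_pdftotext_command(const std::string& pdf_path,
                                    int first_page, int last_page) {
    assert(first_page >= 1 && last_page >= first_page);
    return "pdftotext -layout -f " + std::to_string(first_page) +
           " -l " + std::to_string(last_page) + " " +
           shell_quote(pdf_path) + " -";
}

// Run a command through the shell and return everything it wrote to stdout.
std::string run_and_capture(const std::string& command) {
    FILE* pipe = popen(command.c_str(), "r");
    if (!pipe) throw std::runtime_error("popen failed");
    std::string output;
    char buffer[4096];
    std::size_t n;
    while ((n = std::fread(buffer, 1, sizeof buffer, pipe)) > 0) {
        output.append(buffer, n);
    }
    if (pclose(pipe) != 0) {
        throw std::runtime_error("command exited with a non-zero status");
    }
    return output;
}

// Split one extracted line into whitespace-separated fields, ready to be
// assigned to whatever variables the application needs.
std::vector<std::string> split_fields(const std::string& line) {
    std::istringstream in(line);
    std::vector<std::string> fields;
    std::string field;
    while (in >> field) fields.push_back(field);
    return fields;
}
```

For example, `run_and_capture(build_pdftotext_command("league.pdf", 2, 3))` would return the text of pages 2 and 3 as a single string, which can then be split into lines and passed to `split_fields` one line at a time.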

