Reading PDF document with iTextSharp creates string with repeating first page
Question
I currently use iTextSharp to read some PDF files and parse the text I get back as a string. With some PDF files I see strange behavior: for a 4-page PDF, for example, the returned string contains the pages in the following order:
1 2 1 3 1 4
My code for reading the files is as follows:
using (PdfReader reader = new PdfReader(fileStream))
{
    StringBuilder sb = new StringBuilder();
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    for (int page = 0; page < reader.NumberOfPages; page++)
    {
        string text = PdfTextExtractor.GetTextFromPage(reader, page + 1, strategy);
        if (!string.IsNullOrWhiteSpace(text))
            sb.Append(Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text))));
    }
    Debug.WriteLine(sb.ToString());
}
Here is a link to a file with which this behaviour occurs:
I hope you can help me!
Recommended answer
Thanks to Chris Haas I found out what was going wrong. The samples found online on how to use iTextSharp.Pdf are incorrect, or at least incorrect for my implementation.
A new SimpleTextExtractionStrategy needs to be instantiated for every page you read. The strategy object keeps the text it has already collected, so reusing one instance across pages causes text from previously read pages to be repeated in the resulting string.
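The effect of reusing a stateful object can be illustrated with a toy model. This is only a sketch: `AccumulatingStrategy` and its `Extract` method are hypothetical names invented for this example, not iTextSharp code, and the exact repetition pattern you see with a real PDF may differ.

```csharp
using System;
using System.Text;

// Hypothetical stand-in for a stateful extraction strategy; not part of iTextSharp.
class AccumulatingStrategy
{
    private readonly StringBuilder collected = new StringBuilder();

    // Appends the page text to the internal buffer and returns everything collected so far.
    public string Extract(string pageText)
    {
        collected.Append(pageText);
        return collected.ToString();
    }
}

static class Demo
{
    static void Main()
    {
        string[] pages = { "A", "B", "C" };

        // Shared instance: text from earlier pages leaks into later results.
        var shared = new AccumulatingStrategy();
        var reused = new StringBuilder();
        foreach (string page in pages)
            reused.Append(shared.Extract(page)).Append(' ');
        Console.WriteLine(reused.ToString().Trim()); // A AB ABC

        // Fresh instance per page: each result holds only its own page.
        var fresh = new StringBuilder();
        foreach (string page in pages)
            fresh.Append(new AccumulatingStrategy().Extract(page)).Append(' ');
        Console.WriteLine(fresh.ToString().Trim()); // A B C
    }
}
```

The only difference between the two loops is where the object is created, which mirrors the fix of passing a new strategy for each page.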
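Also, the line that appends to the StringBuilder can be simplified. The extracted text is already a .NET string, so converting it through Encoding.Default and back adds nothing and can only lose characters that the default code page cannot represent. Change it from:
sb.Append(Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text))));
to:
sb.Append(text);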
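To see why the encoding round trip is at best a no-op, here is a small sketch. It assumes `Encoding.ASCII` as a deterministic stand-in for a single-byte `Encoding.Default`; the actual default encoding depends on the runtime and system settings, so the characters affected on a real machine will vary.

```csharp
using System;
using System.Text;

static class EncodingRoundTrip
{
    static void Main()
    {
        string text = "café";

        // Assumption: ASCII stands in for a single-byte Encoding.Default.
        Encoding legacy = Encoding.ASCII;

        // Same chain as the original line: string -> legacy bytes -> UTF-8 bytes -> string.
        byte[] legacyBytes = legacy.GetBytes(text);
        byte[] utf8Bytes = Encoding.Convert(legacy, Encoding.UTF8, legacyBytes);
        string roundTripped = Encoding.UTF8.GetString(utf8Bytes);

        Console.WriteLine(roundTripped);         // caf?
        Console.WriteLine(roundTripped == text); // False
    }
}
```

Characters the intermediate encoding can represent come back unchanged, and all others are replaced, so appending the string directly is both simpler and safer.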
Thus the following code gives the correct result:
using (PdfReader reader = new PdfReader(fileStream))
{
    StringBuilder sb = new StringBuilder();
    for (int page = 0; page < reader.NumberOfPages; page++)
    {
        // Page numbers are 1-based; a new strategy per page keeps earlier pages out of the result.
        string text = PdfTextExtractor.GetTextFromPage(reader, page + 1, new SimpleTextExtractionStrategy());
        if (!string.IsNullOrWhiteSpace(text))
            sb.Append(text);
    }
    Debug.WriteLine(sb.ToString());
}