When I extract text from a PDF file using iText, I get values from previous pages

Question

I am trying to extract a block of text from a specific location on each page of a multi-page PDF file.

I have the location of the text, and I am able to extract it correctly on the first page. However, on the pages after the first, the extracted text seems to accumulate.

For example, if the text value on page 1 is "A", on page 2 is "B", and on page 3 is "C", then I receive the following values in my output string on each iteration of my for loop:

Loop 1: output = A

Loop 2: output = B A

Loop 3: output = C B A

I am using iTextSharp in my project, which is written in C#.

Any help would be greatly appreciated.

var reader = new PdfReader(foregroundFile);

RectangleJ customerIdRectangle = new RectangleJ(0, 495, 108, 27);
RenderFilter[] filters = new RenderFilter[1];
LocationTextExtractionStrategy regionFilter = new LocationTextExtractionStrategy();
filters[0] = new RegionTextRenderFilter(customerIdRectangle);
FilteredTextRenderListener strategy = new FilteredTextRenderListener(regionFilter, filters);

for (int i = 1; i <= reader.NumberOfPages; i++)
{
    string output = "";
    output = PdfTextExtractor.GetTextFromPage(reader, i, strategy);
    Console.WriteLine(output);
}

Recommended answer

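The strategy object keeps the text chunks it collects while a page is processed, so reusing one instance for every page returns the text of all pages processed so far. Create the strategy inside the loop instead. Please adapt your code like this: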

using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

var reader = new PdfReader(foregroundFile);

// Region that holds the customer ID on every page.
RectangleJ customerIdRectangle = new RectangleJ(0, 495, 108, 27);

for (int i = 1; i <= reader.NumberOfPages; i++)
{
    // Build a fresh filter and strategy for each page so that no text
    // collected from earlier pages is carried over.
    RenderFilter[] filters = { new RegionTextRenderFilter(customerIdRectangle) };
    LocationTextExtractionStrategy regionFilter = new LocationTextExtractionStrategy();
    FilteredTextRenderListener strategy = new FilteredTextRenderListener(regionFilter, filters);

    string output = PdfTextExtractor.GetTextFromPage(reader, i, strategy);
    Console.WriteLine(output);
}
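
The difference between the two versions can be illustrated without iTextSharp. The following is only a minimal sketch, assuming the extraction strategy stores every chunk it receives in an internal list and never clears it. `AccumulatingListener` and its members are hypothetical stand-ins written for this illustration; they are not iTextSharp types.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a stateful extraction strategy (not an iTextSharp class).
class AccumulatingListener
{
    private readonly List<string> chunks = new List<string>();

    // Called once for each piece of text found on a page.
    public void RenderText(string text)
    {
        chunks.Add(text);
    }

    // Returns everything collected so far; the list is never cleared.
    public string GetResultantText()
    {
        return string.Join(" ", chunks);
    }
}

static class Demo
{
    static void Main()
    {
        string[] pages = { "A", "B", "C" };

        // One shared instance: text from earlier pages stays in the list.
        var shared = new AccumulatingListener();
        foreach (string page in pages)
        {
            shared.RenderText(page);
            Console.WriteLine("shared: " + shared.GetResultantText());
        }

        // A fresh instance per page: each result holds only that page's text.
        foreach (string page in pages)
        {
            var fresh = new AccumulatingListener();
            fresh.RenderText(page);
            Console.WriteLine("fresh:  " + fresh.GetResultantText());
        }
    }
}
```

The shared instance prints a growing string on each iteration, while the per-page instances print one value each. The toy listener simply appends in arrival order, so the order of the accumulated values need not match what a location-based strategy reports.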
