iTextSharp read from specific position

Question

I have a problem using iTextSharp to read data from a PDF file. I want to read only a specific part of a PDF page: the address information, which is always located at the same position. I have seen iTextSharp used to read all pages, as in the following:

        // Requires: using System.IO; using System.Text;
        // using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
        StringBuilder text = new StringBuilder();

        if (File.Exists(fileName))
        {
            PdfReader pdfReader = new PdfReader(fileName);

            // Page numbers in iTextSharp are 1-based.
            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

                // GetTextFromPage already returns a .NET (Unicode) string, so no re-encoding is needed.
                text.Append(currentText);
            }
            pdfReader.Close();
        }
        return text.ToString();

But how can I restrict the extraction to a specific location? I am open to using anything, even OCR, since some files may be images in the future (but that is not necessary at this time). This project is only for me, so there is no commercial use.

Thanks!

Recommended answer

You are using a SimpleTextExtractionStrategy instead of a LocationTextExtractionStrategy. Please read the official documentation and the accompanying examples (Java / C#). If rect is a rectangle based on the coordinates of your address, you need:

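// Assumes fileName is the path to the PDF. llx, lly, urx and ury are placeholders for the
// lower-left and upper-right corners of the address area; Rectangle is iTextSharp.text.Rectangle.
PdfReader reader = new PdfReader(fileName);
Rectangle rect = new Rectangle(llx, lly, urx, ury);
RenderFilter[] filter = {new RegionTextRenderFilter(rect)};
ITextExtractionStrategy strategy;
StringBuilder sb = new StringBuilder();
for (int i = 1; i <= reader.NumberOfPages; i++) {
    // Create a fresh strategy for each page so text does not accumulate across pages.
    strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
    sb.AppendLine(PdfTextExtractor.GetTextFromPage(reader, i, strategy));
}
reader.Close();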

Now you'll get all the text snippets that intersect with rect (so part of the text may lie outside rect; iText doesn't cut text snippets into pieces).

Note that you can get the MediaBox of a page using:

Rectangle mediabox = reader.GetPageSize(pagenum);

The coordinate of the lower-left corner is x = mediabox.Left and y = mediabox.Bottom; the coordinate of the upper-right corner is x = mediabox.Right and y = mediabox.Top.

The values of x increase from left to right; the values of y increase from bottom to top. The unit of the measurement system in PDF is called "user unit". By default one user unit coincides with one point (this can change, but you won't find many PDFs with a different UserUnit value). In normal circumstances, 72 user units = 1 inch.
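Because y grows upward from the bottom of the page, a region measured on paper from the top-left corner has to be flipped before it can be used as rect. The following is a minimal sketch of that arithmetic, assuming the default UserUnit (72 user units per inch). PdfRegion and FromTopLeftInches are hypothetical names used for illustration and are not part of iTextSharp.

```csharp
using System;

// Hypothetical helper, not part of iTextSharp: converts a region measured in
// inches from the top-left corner of the page into PDF user-space coordinates
// (origin at the lower-left corner, y increasing upward).
public static class PdfRegion
{
    // Assumption: the page uses the default UserUnit, so 72 user units = 1 inch.
    public const float UnitsPerInch = 72f;

    // pageLeft and pageTop are the MediaBox values (mediabox.Left, mediabox.Top).
    // Returns { llx, lly, urx, ury } in user units.
    public static float[] FromTopLeftInches(
        float pageLeft, float pageTop,
        float leftIn, float topIn, float widthIn, float heightIn)
    {
        float llx = pageLeft + leftIn * UnitsPerInch;  // left edge of the region
        float ury = pageTop - topIn * UnitsPerInch;    // top edge, measured down from the page top
        float urx = llx + widthIn * UnitsPerInch;      // right edge
        float lly = ury - heightIn * UnitsPerInch;     // bottom edge
        return new[] { llx, lly, urx, ury };
    }
}
```

For example, on a page whose MediaBox is 612 x 792 user units with its lower-left corner at the origin, a region 3 inches wide and 1 inch tall that starts 1 inch from the left and 1 inch from the top maps to llx = 72, lly = 648, urx = 288, ury = 720. Those four values can then be passed to new Rectangle(llx, lly, urx, ury) to build rect.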
