iTextSharp Read Text From Single Layer of PDF

Question

Currently I am using a custom LocationTextExtractionStrategy to extract text from a PDF that returns a TextRenderInfo[]. I would like to be able to determine if a TextRenderInfo object (or PDFString, child of TextRenderInfo) appears in a specific layer. I am not sure if this is possible. To get the layers in a PDF, I am using:

Dictionary<string,PdfLayer> layers;
using (var pdfReader = new PdfReader(src))
{
    var newSrc = Path.Combine(["new file location"]);
    using (var stream = new FileStream(newSrc, FileMode.Create))
    {       
        PdfStamper stamper = new PdfStamper(pdfReader, stream);
        layers = stamper.GetPdfLayers();
        stamper.Close();
    }
    pdfReader.Close();
    src = newSrc;
}

To extract the text, I am using:

var textExtractor = new TextExtractionStrategy();
PdfTextExtractor.GetTextFromPage(pdfReader, pdfPageNum,textExtractor);
List<TextRenderInfo> results = textExtractor.Results;

Is there any way to check whether an individual TextRenderInfo result belongs to one of the layers obtained in the first code snippet? Any help would be much appreciated.

Recommended answer

It is possible to get the contents from a single layer, but you'll have to jump through a few hoops to work it out. Specifically, you will have to recreate some of the logic that is provided by the PdfTextExtractor and PdfReaderContentParser.

public static String GetText(PdfReader reader, int pageNumber, int streamNumber) {
    var strategy = new LocationTextExtractionStrategy();
    var processor = new PdfContentStreamProcessor(strategy);

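    // look up the page dictionary so its resources can be passed to the processor
    var pageDic = reader.GetPageN(pageNumber);
    var resourcesDic = pageDic.GetAsDict(PdfName.RESOURCES);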

    // assuming you still need to extract the page bytes
    byte[] contents = GetContentBytesForPageStream(reader, pageNumber, streamNumber);

    processor.ProcessContent(contents, resourcesDic);
    return strategy.GetResultantText();
}

public static byte[] GetContentBytesForPageStream(PdfReader reader, int pageNumber, int streamNumber) {
    PdfDictionary pageDictionary = reader.GetPageN(pageNumber);
    PdfObject contentObject = pageDictionary.Get(PdfName.CONTENTS);
    if (contentObject == null)
        return new byte[0];

    byte[] contentBytes = GetContentBytesFromContentObject(contentObject, streamNumber);
    return contentBytes;
}

public static byte[] GetContentBytesFromContentObject(PdfObject contentObject, int streamNumber) {
    // copy-paste logic from
    // ContentByteUtils.GetContentBytesFromContentObject(contentObject);
    // but in case PdfObject.ARRAY: only select the streamNumber you require
}
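
The stub above leaves the body to the reader. What follows is a minimal sketch of one way to fill it in, not the library's own code. It assumes that streamNumber is a zero-based index into the page's /Contents array, and that a page with a single content stream is addressed as index 0. The class name SingleStreamContent is hypothetical; the iTextSharp 5 calls used are PdfReader.GetPdfObject, PdfReader.GetStreamBytes and PdfArray.GetPdfObject.

```csharp
using System;
using iTextSharp.text.pdf;

public static class SingleStreamContent
{
    // Returns the decoded bytes of ONE content stream of a page's /Contents entry.
    // Assumption: streamNumber is a zero-based index into the /Contents array.
    public static byte[] GetContentBytesFromContentObject(PdfObject contentObject, int streamNumber)
    {
        switch (contentObject.Type)
        {
            case PdfObject.INDIRECT:
                // Resolve the indirect reference, then handle whatever it points to.
                PdfObject direct = PdfReader.GetPdfObject(contentObject);
                return GetContentBytesFromContentObject(direct, streamNumber);

            case PdfObject.STREAM:
                // A single stream: only index 0 is meaningful.
                if (streamNumber != 0)
                    throw new ArgumentOutOfRangeException("streamNumber");
                return PdfReader.GetStreamBytes((PRStream)contentObject);

            case PdfObject.ARRAY:
                // Unlike the stock helper, do not concatenate all streams;
                // pick only the requested element.
                PdfArray contentArray = (PdfArray)contentObject;
                if (streamNumber < 0 || streamNumber >= contentArray.Size)
                    throw new ArgumentOutOfRangeException("streamNumber");
                // The selected element is itself a single stream, so recurse with index 0.
                return GetContentBytesFromContentObject(contentArray.GetPdfObject(streamNumber), 0);

            default:
                throw new InvalidOperationException(
                    "Unable to handle Content of type " + contentObject.GetType());
        }
    }
}
```

One caveat with this approach: a content stream taken from the middle of the array may rely on graphics or text state set up by an earlier stream, so the extracted text positions are only reliable when each stream is self-contained.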

If you're specifically looking to just use PdfTextExtractor or PdfReaderContentParser, and ask the returned TextRenderInfo for the layer it's on, then I'm not sure it will be easily possible. There are a number of problems with that:

  • TextRenderInfo doesn't store that information, so you'd have to subclass it (which is possible)
  • you'd have to rewrite the logic that creates the TextRenderInfo objects. This is possible by registering custom IContentOperator objects for all text operators (Tj, TJ, ' and ") with the PdfTextExtractor or PdfReaderContentParser
  • the hardest part is that you have already lost layer information in ContentByteUtils.GetContentBytesFromContentObject - so you'd need to retain that somehow, which creates its own set of problems.
