itextsharp - Problems reading PDFs with 1 column (page 1) and 2 columns (page 2)


Question

My code below gets lost when it opens a PDF file that has only one column on the first page and more than one column on the other pages.

Can someone tell me what I'm doing wrong? My code is below:

PdfReader pdfreader = new PdfReader(pathNmArq);
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();

for (int page = 1; page <= lastPage; page++)
{
    extractText = PdfTextExtractor.GetTextFromPage(pdfreader, page, strategy);
    extractText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(extractText)));
    // ...
}


Recommended answer

You use the SimpleTextExtractionStrategy. This strategy assumes that the text drawing instructions in the PDF are already sorted in reading order. That does not seem to be true for your file.

If you cannot count on the PDF containing its drawing operations in reading order, and you only use the text extraction strategies that ship with iText, you have to know the areas that each constitute a single column. If a page contains multiple columns, you have to use a RegionTextRenderFilter to restrict extraction to one column and then use the LocationTextExtractionStrategy on it.
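The approach described above can be sketched roughly as follows for iTextSharp 5.x. This is only an illustration under stated assumptions: the class and method names (ColumnTextExtractor, ExtractTwoColumns) are hypothetical, and the sketch assumes exactly two equally wide columns split at the horizontal middle of the page, which will not match every layout. For a real document you would replace the two rectangles with the actual column areas, and use a single full-page rectangle for a one-column page.

```csharp
using System.Text;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class ColumnTextExtractor
{
    // Hypothetical helper: extracts the text of one page column by column.
    // Assumption: the page has two columns split at its horizontal middle.
    public static string ExtractTwoColumns(string path, int page)
    {
        PdfReader reader = new PdfReader(path);
        try
        {
            Rectangle size = reader.GetPageSize(page);
            float middle = (size.Left + size.Right) / 2;

            // Left column first, then right column, i.e. the reading order.
            Rectangle[] columns =
            {
                new Rectangle(size.Left, size.Bottom, middle, size.Top),
                new Rectangle(middle, size.Bottom, size.Right, size.Top)
            };

            StringBuilder text = new StringBuilder();
            foreach (Rectangle column in columns)
            {
                // Only text inside the column rectangle reaches the strategy.
                RenderFilter[] filters = { new RegionTextRenderFilter(column) };

                // A fresh strategy per column, so no text carries over.
                ITextExtractionStrategy strategy = new FilteredTextRenderListener(
                    new LocationTextExtractionStrategy(), filters);

                text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
            }
            return text.ToString();
        }
        finally
        {
            reader.Close();
        }
    }
}
```

A new strategy object is created for every column because the extraction strategies accumulate the text they receive. Reusing one instance across columns or pages would return earlier text again.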

PS: What exactly is your intention with the following line?

extractText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(extractText)));
