How to detect newline from PDF using iTextSharp

Problem description

I have used getbaseline[vector.I2] to detect subscripts and superscripts. With this approach I am not able to extract newlines from the PDF. Can you please suggest how to detect a new line in a PDF using iTextSharp?

Recommended answer

The code you supplied isn't completely self-explanatory, so I am making some assumptions. Foremost, I assume that your code is an excerpt of the RenderText(TextRenderInfo) method of a RenderListener implementation, probably an extension of the SimpleTextExtractionStrategy with added member variables lastBaseLine, firstcharacter_baseline, lastFontSize, and lastFont.

This implies that you are only interested in documents whose text occurs in the content stream in reading order; otherwise you would have based your code on the LocationTextExtractionStrategy or a similar base algorithm.

Furthermore, I don't understand some of your if statements, which are either always false, always true, or have an empty body. Nor is it clear what text_second is good for, or why you calculate difference = curBaseline[Vector.I2] - curBaseline[Vector.I2] in one place, which is always zero.

All this being said, your initial if statement seems to test whether the vertical baseline position of the new text differs from that of the preceding text. This is therefore also the place where you can spot the start of a new line.

I would propose that you store not only the last baseline but also the last descent line, which according to the docs is "the line that represents the bottom most extent that a string of the current font could have", and compare it with the current ascent line (by the docs "the line that represents the topmost extent that a string of the current font could have").

If the ascent line of the current text is below the descent line of the last text, we should have a new line, because the text is too far down for a subscript. In code, therefore:

[...]
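else if (curBaseline[Vector.I2] < lastBaseLine[Vector.I2])
{
    // The new text sits lower than the previous text.
    if (curAscentLine[Vector.I2] < lastDescentLine[Vector.I2])
    {
        // Even the top of the new text is below the bottom of the last text:
        // too far down for a subscript, so treat it as a new line.
        firstcharacter_baseline = character_baseline;
        this.result.Append("<br/>");
    }
    else
    {
        // The new text still overlaps the previous text vertically: subscript.
        difference = firstcharacter_baseline - curBaseline[Vector.I2];
        text_second.SetTextRise(difference);

        if (difference != 0)
        {
            SupSubFlag = 2;
        }
    }
}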
[...]
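
The decision made in that excerpt can also be looked at in isolation. The following is only a minimal sketch, assuming the default PDF coordinate system (Y grows upwards) and unrotated text. ChunkMetrics, VerticalRelation and LineBreakSketch are hypothetical names, not part of iTextSharp; in a real listener the three Y values would presumably be read from the Vector.I2 coordinate of the baseline, ascent line and descent line of each TextRenderInfo.

```csharp
// Hypothetical helper types for illustration; none of these names exist in iTextSharp.
public struct ChunkMetrics
{
    public readonly float BaselineY;
    public readonly float AscentY;
    public readonly float DescentY;

    public ChunkMetrics(float baselineY, float ascentY, float descentY)
    {
        BaselineY = baselineY;
        AscentY = ascentY;
        DescentY = descentY;
    }
}

public enum VerticalRelation { SameBaseline, Lowered, Raised, NewLine }

public static class LineBreakSketch
{
    // Assumes Y grows upwards, so "below" means a smaller Y value.
    public static VerticalRelation Classify(ChunkMetrics last, ChunkMetrics cur)
    {
        if (cur.BaselineY == last.BaselineY)
            return VerticalRelation.SameBaseline;

        if (cur.BaselineY > last.BaselineY)
            return VerticalRelation.Raised; // e.g. a superscript, or the end of a subscript

        // Lower baseline: a new line only if even the top of the new chunk
        // lies below the bottom of the previous chunk.
        return cur.AscentY < last.DescentY
            ? VerticalRelation.NewLine
            : VerticalRelation.Lowered; // e.g. a subscript
    }
}
```

With made-up values, a chunk with baseline 686 and ascent 695 that follows a chunk with baseline 700 and descent 697 is classified as NewLine, while a chunk with baseline 697 and ascent 703 after the same predecessor is classified as Lowered.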

As you expect the text in the content stream to occur in reading order, you can also try to recognize a new line by comparing the Vector.I1 coordinates of the end of the baseline of the last text and the start of the baseline of the new text. If the new one is less than the old one by a relevant amount, this looks like a carriage return, which hints at a new line.
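
The horizontal check could look roughly like this. It is a sketch under assumptions: CarriageReturnSketch and its parameters are hypothetical names, lastEndX and curStartX stand for the Vector.I1 coordinates of the end of the last baseline and the start of the new baseline, and minJumpBack is the "relevant amount", which has to be chosen per document (it should be large enough that small backward shifts such as kerning are ignored).

```csharp
// Hypothetical helper for illustration; not part of iTextSharp.
public static class CarriageReturnSketch
{
    // Returns true when the new text starts clearly to the left of where
    // the previous text ended, which hints at a carriage return.
    public static bool LooksLikeCarriageReturn(float lastEndX, float curStartX, float minJumpBack)
    {
        return lastEndX - curStartX > minJumpBack;
    }
}
```

For example, with a threshold of 20 units, text that ended at X = 500 followed by text starting at X = 72 counts as a carriage return, whereas a follow-up starting at X = 499 does not.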

The code will, of course, run into trouble in a number of situations:

  • Whenever your expectation that the text in the content stream occurs in reading order is not fulfilled, you'll get garbage all over.

  • When you have multi-column text, the test above won't catch the line break between the bottom of one column and the top of the next. To catch this as well, you might want to check (analogously to the proposed check for a jump one line down) whether the new text is way above the last text, comparing the last ascent line with the new descent line.

  • If you get PDFs with very densely packed text, lines might overlap with superscripts and subscripts of surrounding lines. In this case you will have to fine-tune the comparisons. But here you will definitely run into falsely detected breaks sometimes.

  • If you get PDFs with rotated text, you'll get garbage all over.
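
Some of these situations can at least be probed with small variations of the same comparison. The following sketch is an illustration only: JumpSketch, allowedOverlap and epsilon are hypothetical names and parameters, not iTextSharp API. An allowedOverlap of 0 reproduces the strict tests described above; a positive value relaxes them for densely packed text, at the price of more falsely detected breaks. The last method is one possible guard to skip text whose baseline is not horizontal and left to right.

```csharp
using System;

// Hypothetical helpers for illustration; not part of iTextSharp.
// All Y values assume the default PDF coordinate system (Y grows upwards).
public static class JumpSketch
{
    // New text lies below the last text (jump one or more lines down).
    public static bool JumpsDown(float lastDescentY, float curAscentY, float allowedOverlap)
    {
        return curAscentY < lastDescentY + allowedOverlap;
    }

    // New text lies above the last text, e.g. the top of the next column.
    public static bool JumpsUp(float lastAscentY, float curDescentY, float allowedOverlap)
    {
        return curDescentY > lastAscentY - allowedOverlap;
    }

    // True if the baseline runs horizontally from left to right;
    // rotated text (and zero-length chunks) fail this test.
    public static bool IsLeftToRightHorizontal(float startX, float startY, float endX, float endY, float epsilon)
    {
        return Math.Abs(endY - startY) <= epsilon && endX > startX;
    }
}
```

A listener could, for instance, emit a line break when either JumpsDown or JumpsUp holds and ignore the vertical logic altogether for chunks that fail IsLeftToRightHorizontal; whether that is appropriate depends on the documents at hand.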