Retrieve the respective coordinates of all words on the page with iTextSharp

Question

My aim is to retrieve the coordinates of every word on the page. What I have done so far is:

// Run the custom strategy over page 1 of the document.
PdfReader reader = new PdfReader("cde.pdf");
TextWithPositionExtractionStategy S = new TextWithPositionExtractionStategy();
PdfTextExtractor.GetTextFromPage(reader,1,S);

// Presumably inside the strategy's RenderText(TextRenderInfo renderInfo) method:
// lower-left corner from the descent line, upper-right corner from the ascent line.
Vector curBaseline = renderInfo.GetDescentLine().GetStartPoint();
Vector topRight = renderInfo.GetAscentLine().GetEndPoint();

iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle(curBaseline[Vector.I1], curBaseline[Vector.I2], topRight[Vector.I1], topRight[Vector.I2]);
string x1 = curBaseline[Vector.I1].ToString();
string x2 = curBaseline[Vector.I2].ToString();
string x3 = topRight[Vector.I1].ToString();
string x4 = topRight[Vector.I2].ToString();

However, what I get are the coordinates of a string containing all the words of a line, not of a single word. For example, if the content of the PDF is "i am a girl", I get the coordinates of "i am a girl" as a whole, but not the coordinates of "i", "am", "a", and "girl". How can I modify the code so that I get the coordinates of each word? Thanks.

Recommended answer

(I mostly work with the Java library iText, not with the .NET library iTextSharp, so please ignore the Java-isms here; everything should be easy to translate.)

To extract the contents of a page using iText(Sharp), you use the classes in the parser package, which feed the page content, after some preprocessing, to a RenderListener of your choice.

In a context in which you are only interested in the text, you most commonly use a TextExtractionStrategy, which is derived from RenderListener and adds a single method, getResultantText, to retrieve the aggregated text from the page.

As the initial intent of text parsing in iText was to implement this use case, most existing RenderListener samples are TextExtractionStrategy implementations and only make the text available.

Therefore, you will have to implement your own RenderListener, which you already seem to have christened TextWithPositionExtractionStategy.

Just like there is both a SimpleTextExtractionStrategy (which is implemented with some assumptions about the structure of the page content operators) and a LocationTextExtractionStrategy (which does not have the same assumptions but is somewhat more complicated), you might want to start with an implementation that makes some assumptions.

Thus, just like the SimpleTextExtractionStrategy, your first, simple implementation can expect the text rendering events forwarded to your listener to arrive line by line, and within a line from left to right. This way, as soon as you find a horizontal gap or a punctuation mark, you know your current word is finished and you can process it.
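The two boundary checks can be sketched roughly as follows. This is a simplified illustration, not iText code: Chunk is a hypothetical stand-in for the few values one would read from a TextRenderInfo, the text is assumed to run horizontally, and both the one-unit line tolerance and the "half a space width" threshold are assumptions that may need tuning for real documents.

```java
// Hypothetical stand-in for the values this sketch needs from a TextRenderInfo.
final class Chunk {
    final String text;
    final float startX;     // x of the baseline start point
    final float endX;       // x of the baseline end point
    final float baselineY;  // y of the baseline (horizontal text assumed)
    final float spaceWidth; // width of a single space in the current font

    Chunk(String text, float startX, float endX, float baselineY, float spaceWidth) {
        this.text = text;
        this.startX = startX;
        this.endX = endX;
        this.baselineY = baselineY;
        this.spaceWidth = spaceWidth;
    }
}

final class ChunkBoundaries {
    // Assumed tolerance: baselines further apart than this count as different lines.
    static final float LINE_TOLERANCE = 1.0f;

    // True if the current chunk starts on a different line than the previous one.
    static boolean isNewLine(Chunk previous, Chunk current) {
        return Math.abs(current.baselineY - previous.baselineY) > LINE_TOLERANCE;
    }

    // True if the distance between the chunks is wider than half a space.
    static boolean isGapFromPrevious(Chunk previous, Chunk current) {
        return Math.abs(current.startX - previous.endX) > current.spaceWidth / 2f;
    }
}
```

Either check returning true means the word collected so far is complete.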

In contrast to the text extraction strategies, you don't need a StringBuffer member to collect your result but instead a list of some "word with position" structure. Furthermore, you need a member variable to hold the TextRenderInfo events you have already collected for this page but could not yet finish processing (a single word may arrive in several separate events).

Whenever your renderText method is called with a new TextRenderInfo object, you should proceed like this (pseudo-code):

if (unprocessedTextRenderInfos not empty)
{
    if (isNewLine // Check this like the simple text extraction strategy checks for hardReturn
     || isGapFromPrevious) // Check this like the simple text extraction strategy checks whether to insert a space
    {
        process(unprocessedTextRenderInfos);
        unprocessedTextRenderInfos.clear();
    }
}

split new TextRenderInfo using its getCharacterRenderInfos() method;
while (characterRenderInfos contain word end)
{
    add characterRenderInfos up to, but excluding, the white space/punctuation to unprocessedTextRenderInfos;
    process(unprocessedTextRenderInfos);
    unprocessedTextRenderInfos.clear();
    remove used render infos from characterRenderInfos;
}
add remaining characterRenderInfos to unprocessedTextRenderInfos;
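
For illustration, the pseudo-code above might translate into something like the following self-contained Java sketch. It deliberately does not use iText itself: Glyph is a hypothetical stand-in for a single-character TextRenderInfo (what getCharacterRenderInfos() would yield), Word is a hypothetical "word with position" structure, and the tolerances, the punctuation set, and the glyphsOf helper are assumptions made only for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a single-character TextRenderInfo.
final class Glyph {
    final char ch;
    final float x, y, width;
    Glyph(char ch, float x, float y, float width) {
        this.ch = ch; this.x = x; this.y = y; this.width = width;
    }
}

// Hypothetical "word with position" structure holding the start coordinates.
final class Word {
    final String text;
    final float x, y;
    Word(String text, float x, float y) {
        this.text = text; this.x = x; this.y = y;
    }
}

final class WordCollector {
    private static final float LINE_TOLERANCE = 1.0f; // assumed
    private static final float GAP_TOLERANCE = 1.5f;  // assumed, e.g. half a space width

    private final List<Glyph> pending = new ArrayList<>(); // unprocessed glyphs
    private final List<Word> words = new ArrayList<>();    // result list

    // Counterpart of renderText, called once per text rendering event.
    void renderText(List<Glyph> glyphs) {
        if (glyphs.isEmpty()) return;
        if (!pending.isEmpty()) {
            Glyph last = pending.get(pending.size() - 1);
            Glyph first = glyphs.get(0);
            boolean isNewLine = Math.abs(first.y - last.y) > LINE_TOLERANCE;
            boolean isGap = Math.abs(first.x - (last.x + last.width)) > GAP_TOLERANCE;
            if (isNewLine || isGap) flush();
        }
        for (Glyph g : glyphs) {
            if (isWordEnd(g.ch)) flush(); // white space/punctuation ends the word
            else pending.add(g);
        }
    }

    // Call once page processing is finished to emit the last pending word.
    List<Word> finish() {
        flush();
        return words;
    }

    private static boolean isWordEnd(char ch) {
        return Character.isWhitespace(ch) || ".,;:!?\"()".indexOf(ch) >= 0;
    }

    // Counterpart of process(...) followed by clear().
    private void flush() {
        if (pending.isEmpty()) return;
        StringBuilder text = new StringBuilder();
        for (Glyph g : pending) text.append(g.ch);
        Glyph first = pending.get(0);
        words.add(new Word(text.toString(), first.x, first.y));
        pending.clear();
    }

    // Test helper: lays out a string as equally wide glyphs on one line.
    static List<Glyph> glyphsOf(String text, float startX, float y, float width) {
        List<Glyph> result = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            result.add(new Glyph(text.charAt(i), startX + i * width, y, width));
        }
        return result;
    }
}
```

Feeding it "i am", " a gi", and "rl" as three consecutive events on the same line and then calling finish() should yield the four words "i", "am", "a", and "girl", each with the coordinates of its first glyph, even though "girl" arrives split across two events.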

In process(unprocessedTextRenderInfos) you extract the information you need from the unprocessedTextRenderInfos; you concatenate the individual text contents to a word and take the coordinates you want; if you merely want starting coordinates, you take those from the first of those unprocessed TextRenderInfos. If you need more data, you also use the data from the other TextRenderInfos. With these data you fill a "word with position" structure and add it to your result list.
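
If a full bounding rectangle per word is wanted, as in the question, the processing step could look roughly like this sketch. TextBox is a hypothetical structure, not an iText class; it is assumed to carry the text of one render info together with the lower-left corner (start of the descent line) and the upper-right corner (end of the ascent line). Taking the minimum and maximum over all pieces is one possible choice that also copes with pieces of different heights.

```java
import java.util.List;

// Hypothetical structure: a piece of text and its bounding rectangle.
final class TextBox {
    final String text;
    final float llx, lly; // lower-left corner (start of the descent line)
    final float urx, ury; // upper-right corner (end of the ascent line)

    TextBox(String text, float llx, float lly, float urx, float ury) {
        this.text = text;
        this.llx = llx; this.lly = lly;
        this.urx = urx; this.ury = ury;
    }
}

final class WordBoxes {
    // Joins the pending pieces of one word into a single text with one rectangle.
    static TextBox process(List<TextBox> pending) {
        if (pending.isEmpty()) {
            throw new IllegalArgumentException("no pending render infos");
        }
        StringBuilder text = new StringBuilder();
        float llx = Float.POSITIVE_INFINITY, lly = Float.POSITIVE_INFINITY;
        float urx = Float.NEGATIVE_INFINITY, ury = Float.NEGATIVE_INFINITY;
        for (TextBox piece : pending) {
            text.append(piece.text);
            llx = Math.min(llx, piece.llx);
            lly = Math.min(lly, piece.lly);
            urx = Math.max(urx, piece.urx);
            ury = Math.max(ury, piece.ury);
        }
        return new TextBox(text.toString(), llx, lly, urx, ury);
    }
}
```

The four resulting values correspond to the four arguments passed to the Rectangle constructor in the question.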

When page processing is finished, you have to once more call process(unprocessedTextRenderInfos) and unprocessedTextRenderInfos.clear(); alternatively you may do that in the endTextBlock method.

Having done this, you might feel ready to implement the slightly more complex variant which does not have the same assumptions concerning the page content structure. ;)
