Split TextChunk into words


Question

I've found this example, which splits a PDF document into TextChunks.

Is there

a) a method to split each TextChunk further into its words/characters and still be able to find their locations?

or

b) a method to parse a PDF into words/characters instead of chunks and find their locations?

Recommended answer


Is there a method to split each TextChunk further into its words/characters and still be able to find their locations?

You cannot split these TextChunk objects further, because the TextChunk class is merely a helper class that carries very little information; cf. its constructor arguments String str, Vector startLocation, Vector endLocation, float charSpaceWidth. In particular, it holds no information on the individual character widths, nor on the text size and font from which those widths could be derived.

But you can of course change the method RenderText, in which the incoming, more complete TextRenderInfo instances are reduced to TextChunk instances:

public virtual void RenderText(TextRenderInfo renderInfo) {
  LineSegment segment = renderInfo.GetBaseline();
  TextChunk location = new TextChunk(renderInfo.GetText(), segment.GetStartPoint(), segment.GetEndPoint(), renderInfo.GetSingleSpaceWidth());
  locationalResult.Add(location);        
}

In particular, you can first split the TextRenderInfo instance into single-character TextRenderInfo instances using its GetCharacterRenderInfos() method, then loop through these and create an individual TextChunk instance for each of them.
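A minimal sketch of that change, written as a drop-in replacement for the RenderText method shown above. It assumes the surrounding class is the one from the example (so TextChunk and locationalResult are the members used there) and that the iTextSharp version in use already provides GetCharacterRenderInfos(). Passing the per-character GetSingleSpaceWidth() value on is my assumption; the original method uses the value of the whole chunk.

```csharp
// Replaces RenderText in the example's extraction strategy class.
// TextChunk and locationalResult are the members of that class shown above.
public virtual void RenderText(TextRenderInfo renderInfo) {
  // Split the incoming chunk into one TextRenderInfo per character.
  foreach (TextRenderInfo charInfo in renderInfo.GetCharacterRenderInfos()) {
    // Each character has its own baseline, hence its own start and end point.
    LineSegment segment = charInfo.GetBaseline();
    TextChunk location = new TextChunk(
        charInfo.GetText(),
        segment.GetStartPoint(),
        segment.GetEndPoint(),
        charInfo.GetSingleSpaceWidth()); // assumption: per-character value is acceptable here
    locationalResult.Add(location);
  }
}
```

With this change every entry in locationalResult describes a single character, so any code that later joins chunks into lines has to cope with many more, much shorter chunks.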

You probably don't see that method in the repository you are looking at, because iTextSharp has since moved to the new SourceForge version-control infrastructure. You should therefore switch to the current iTextSharp repository.


Is there a method to parse a PDF into words/characters instead of chunks and find their locations?

Of course you can implement IRenderListener to create an extraction strategy which does exactly what you need. You can find some discussions of that topic on Stack Overflow for iText and iTextSharp, e.g. "ITextSharp Find coordinates of specific text in PDF", "Get the exact Stringposition in PDF" (https://stackoverflow.com/questions/13632541/get-the-exact-stringposition-in-pdf), "Retrieve the respective coordinates of all words on the page with itextsharp", and others.
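As an illustration only, here is a rough sketch of such a listener that collects words together with the start and end points of their baselines. WordLocation, WordLocationListener and GetWords are hypothetical names of mine, not iTextSharp classes, and the word-break rule (whitespace, or a gap larger than half a space width) is a simplifying assumption that expects text to arrive in reading order.

```csharp
using System.Collections.Generic;
using System.Text;
using iTextSharp.text.pdf.parser;

// Hypothetical helper type: one word and the ends of its baseline.
public class WordLocation
{
    public string Text;
    public Vector Start;
    public Vector End;
}

// Hypothetical listener; only IRenderListener, TextRenderInfo,
// ImageRenderInfo, LineSegment and Vector come from iTextSharp.
public class WordLocationListener : IRenderListener
{
    private readonly List<WordLocation> words = new List<WordLocation>();
    private readonly StringBuilder current = new StringBuilder();
    private Vector wordStart;
    private Vector wordEnd;

    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo renderInfo) { }

    public void RenderText(TextRenderInfo renderInfo)
    {
        foreach (TextRenderInfo charInfo in renderInfo.GetCharacterRenderInfos())
        {
            string text = charInfo.GetText();
            if (string.IsNullOrWhiteSpace(text))
            {
                Flush(); // whitespace ends the current word
                continue;
            }

            LineSegment baseline = charInfo.GetBaseline();
            Vector start = baseline.GetStartPoint();

            // Assumed heuristic: a jump of more than half a space width
            // (including a jump to another line) starts a new word.
            if (current.Length > 0 &&
                start.Subtract(wordEnd).Length > charInfo.GetSingleSpaceWidth() / 2f)
            {
                Flush();
            }

            if (current.Length == 0)
            {
                wordStart = start;
            }
            current.Append(text);
            wordEnd = baseline.GetEndPoint();
        }
    }

    // Returns all words seen so far, closing the word still being built.
    public IList<WordLocation> GetWords()
    {
        Flush();
        return words;
    }

    private void Flush()
    {
        if (current.Length == 0) return;
        words.Add(new WordLocation { Text = current.ToString(), Start = wordStart, End = wordEnd });
        current.Clear();
    }
}
```

Such a listener could be run for a page with new PdfReaderContentParser(reader).ProcessContent(pageNumber, new WordLocationListener()), after which GetWords() returns the collected words. Rotated text, or content streams that draw text out of reading order, would need a more careful grouping step.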
