Split TextChunk into words
Question
I've found this example, which splits a PDF document into TextChunks.
Is there
a) a method to split each TextChunk further into its words/characters and still be able to find their locations?
or
b) a method to parse a PDF into words/characters instead of chunks and find their locations?
Recommended answer
Is there a method to split each TextChunk further into words/characters and still be able to find its location?
You cannot split these TextChunk objects further, because the TextChunk class is merely a helper class transporting a very small amount of information; cf. its constructor arguments String str, Vector startLocation, Vector endLocation, float charSpaceWidth. In particular, it carries no information on the individual character widths, nor on the associated text size and font from which those widths could be derived.
But you can, of course, change the method RenderText (in which the incoming, more complete TextRenderInfo instances are reduced to TextChunk instances):
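public virtual void RenderText(TextRenderInfo renderInfo) {
    // The baseline segment gives the start and end point of the whole text run.
    LineSegment segment = renderInfo.GetBaseline();
    // Everything else in renderInfo (font, per-character data) is dropped here.
    TextChunk location = new TextChunk(renderInfo.GetText(), segment.GetStartPoint(), segment.GetEndPoint(), renderInfo.GetSingleSpaceWidth());
    locationalResult.Add(location);
}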
In particular, you can first split the TextRenderInfo instance into single-character TextRenderInfo instances using its GetCharacterRenderInfos() method, then loop through these and create an individual TextChunk instance for each of them.
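A minimal sketch of that per-character loop follows. Since TextChunk and locationalResult belong to the library's own strategy class, the sketch does not modify that class; instead it assumes a small standalone listener with its own record type. The names CharChunk and CharLocationListener are hypothetical and not part of iTextSharp; only the TextRenderInfo, LineSegment, Vector and IRenderListener members are library calls.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf.parser;

// Hypothetical record type (not part of iTextSharp): one character and its baseline.
public class CharChunk {
    public string Text;
    public Vector Start;      // start point of the character's baseline
    public Vector End;        // end point of the character's baseline
    public float SpaceWidth;  // width of a space in the current font and size
}

// Hypothetical listener: same idea as RenderText above, but one entry per character.
public class CharLocationListener : IRenderListener {
    public readonly List<CharChunk> Chunks = new List<CharChunk>();

    public void RenderText(TextRenderInfo renderInfo) {
        // Split the incoming run into single-character TextRenderInfo instances.
        foreach (TextRenderInfo charInfo in renderInfo.GetCharacterRenderInfos()) {
            LineSegment segment = charInfo.GetBaseline();
            Chunks.Add(new CharChunk {
                Text = charInfo.GetText(),
                Start = segment.GetStartPoint(),
                End = segment.GetEndPoint(),
                SpaceWidth = charInfo.GetSingleSpaceWidth()
            });
        }
    }

    // The remaining IRenderListener callbacks are not needed for this sketch.
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo renderInfo) { }
}
```

Such a listener would typically be run per page through PdfReaderContentParser.ProcessContent(pageNumber, listener), after which Chunks holds one entry per character in content-stream order (which is not necessarily reading order).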
You probably don't see that method in the repository you are looking at, because iTextSharp has already switched to the new SourceForge versioning infrastructure. Thus, you should switch to the current iTextSharp repository.
Is there a method to parse a PDF into words/characters instead of chunks and find the location?
Of course, you can implement IRenderListener to create an extraction strategy which does exactly what you need. You can find some discussions of that topic on Stack Overflow for iText and iTextSharp, e.g. "ITextSharp Find coordinates of specific text in PDF", "Get the exact Stringposition in PDF" (https://stackoverflow.com/questions/13632541/get-the-exact-stringposition-in-pdf), "Retrieve the respective coordinates of all words on the page with itextsharp", and others.
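As one possible illustration of such a custom listener, the sketch below collects words together with the start and end points of their baselines. The class names WordLocation and WordLocationListener are hypothetical, and the word-splitting rule (a whitespace glyph, or a jump of more than half a space width between consecutive characters) is an assumption made for this example, not something iTextSharp prescribes.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using iTextSharp.text.pdf.parser;

// Hypothetical record type (not part of iTextSharp): a word and its baseline end points.
public class WordLocation {
    public string Text;
    public Vector Start;
    public Vector End;
}

// Hypothetical listener: builds words from the per-character render infos.
public class WordLocationListener : IRenderListener {
    public readonly List<WordLocation> Words = new List<WordLocation>();
    private readonly StringBuilder current = new StringBuilder();
    private Vector start;
    private Vector end;

    public void RenderText(TextRenderInfo renderInfo) {
        foreach (TextRenderInfo charInfo in renderInfo.GetCharacterRenderInfos()) {
            string text = charInfo.GetText();
            LineSegment baseline = charInfo.GetBaseline();
            if (string.IsNullOrWhiteSpace(text)) {
                Flush();  // a whitespace glyph ends the current word
                continue;
            }
            // Assumption: a jump of more than half a space width starts a new word.
            float maxGap = charInfo.GetSingleSpaceWidth() / 2f;
            if (current.Length > 0 && Distance(end, baseline.GetStartPoint()) > maxGap) {
                Flush();
            }
            if (current.Length == 0) {
                start = baseline.GetStartPoint();
            }
            current.Append(text);
            end = baseline.GetEndPoint();
        }
    }

    public void BeginTextBlock() { }
    public void EndTextBlock() { Flush(); }  // do not carry a word across text blocks
    public void RenderImage(ImageRenderInfo renderInfo) { }

    // Stores the word collected so far, if any, and resets the buffer.
    private void Flush() {
        if (current.Length == 0) return;
        Words.Add(new WordLocation { Text = current.ToString(), Start = start, End = end });
        current.Clear();
    }

    // Euclidean distance between two points in the x/y plane.
    private static float Distance(Vector a, Vector b) {
        float dx = a[Vector.I1] - b[Vector.I1];
        float dy = a[Vector.I2] - b[Vector.I2];
        return (float)Math.Sqrt(dx * dx + dy * dy);
    }
}
```

A listener like this could be applied with new PdfReaderContentParser(reader).ProcessContent(pageNumber, new WordLocationListener()). The half-space threshold is only a starting point: documents that position every glyph individually or use large kerning adjustments may need a different rule, and a word drawn across two text blocks would be reported as two words.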