Identify paragraphs of PDF files using iTextSharp


Question

For some semantic analysis work, I need to identify paragraphs in PDF files with iTextSharp. I know that the origin of the iTextSharp coordinate system is the bottom-left corner of the page. I have found three features that define paragraph boundaries:


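  1. The x-coordinate of the first word in a line is less than the usual x-coordinate at which lines start;

  2. The leading between two consecutive lines is greater than the usual leading between lines;

  3. The line ends with a period ('.') and the x-coordinate of its last word is less than that of the other lines.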

However, I am stuck on the second one. How can I know the usual leading between two lines in a paragraph? The gaps between consecutive lines seem to differ, because some letters such as 'f' and 'g' need more vertical space than others such as 'a' and 'n'.

Thanks for your help!

Recommended answer

I'm assuming that you are parsing your PDF files using the parser functionality available in iTextSharp. See for instance "Extract font height and rotation from PDF files with iText/iTextSharp" to see how others have done this before you. A more elaborate article can be found here: "Using Open Source PDF Technology to Solve the Unstructured Data Problem in Healthcare".

Your question is: how can I calculate the leading? That is: how do I know the distance between the baselines of two consecutive lines?

When you parse a PDF using iTextSharp, you see each line as a series of TextRenderInfo objects. These objects allow you to get the baseline of the text:

LineSegment baseline = renderInfo.GetBaseline();
Vector startpoint = baseline.GetStartPoint();

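A Vector consists of different elements; see: Getting coordinates of string using ITextExtractionStrategy and LocationTextExtractionStrategy in Itextsharp (https://stackoverflow.com/questions/23909893/getting-coordinates-of-string-using-itextextractionstrategy-and-locationtextextr)

You need startpoint[Vector.I2], which is the y-coordinate of the start of the baseline. See also: How to detect newline from PDF using iTextSharp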
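As a minimal sketch of how those y-values could be turned into leadings: the helper below takes the baseline y-coordinate of each text chunk (the value you would read from startpoint[Vector.I2]), collapses chunks that share a baseline into one line, and returns the distance between consecutive baselines. The class name LeadingEstimator, its methods, and the 0.5 point tolerance are assumptions made for illustration and are not part of iTextSharp. It also assumes unrotated, single-column text.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not an iTextSharp class.
public static class LeadingEstimator
{
    // Collapses per-chunk baseline y-values into one y-value per line.
    // A chunk starts a new line when its baseline lies more than
    // 'tolerance' points below the baseline of the current line.
    public static List<float> LineBaselines(IEnumerable<float> chunkBaselineYs,
                                            float tolerance = 0.5f)
    {
        var lines = new List<float>();
        // PDF y-coordinates grow upward, so descending order reads top to bottom.
        foreach (float y in chunkBaselineYs.OrderByDescending(v => v))
        {
            if (lines.Count == 0 || lines[lines.Count - 1] - y > tolerance)
            {
                lines.Add(y);
            }
        }
        return lines;
    }

    // Leading in the modern sense: the distance between the baselines
    // of two consecutive lines (upper baseline minus lower baseline).
    public static List<float> Leadings(IReadOnlyList<float> lineBaselines)
    {
        var result = new List<float>();
        for (int i = 1; i < lineBaselines.Count; i++)
        {
            result.Add(lineBaselines[i - 1] - lineBaselines[i]);
        }
        return result;
    }
}
```

Because the calculation uses baselines only, descenders in letters such as 'g' do not affect the result: a line containing 'g' and a line containing only 'a' and 'n' sit on baselines that are the same distance apart.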

The difference between that value for two consecutive lines gives you the leading in its modern meaning. In the old days of printing, every character was a block of a fixed size. Printers (the people, not the machines) put a strip of lead between the rows of blocks to create some extra space between the lines. In modern computing, the word was preserved, but its meaning changed. There are no "blocks" anymore, but you can work with the font size. The font size is an average size of the glyphs in a font. Some glyphs take more vertical space, some take less, but by taking both the leading (the distance between baselines) and the font size (the average height of each glyph) into account, you can get a fair idea of the "space between the lines".
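One possible way to apply this to the second feature in the question is to treat the median leading on a page as the "usual" leading and flag any gap that is noticeably larger. The sketch below does that. The class name ParagraphBreaks, its methods, and the factor of 1.3 are assumptions chosen for illustration, not values taken from iTextSharp or from the answer above, and the threshold would need tuning against real documents.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not an iTextSharp class.
public static class ParagraphBreaks
{
    // Median of the values; used here as the "usual" leading.
    public static float Median(IEnumerable<float> values)
    {
        var sorted = values.OrderBy(v => v).ToList();
        if (sorted.Count == 0)
        {
            throw new ArgumentException("At least one value is required.");
        }
        int mid = sorted.Count / 2;
        return sorted.Count % 2 == 1
            ? sorted[mid]
            : (sorted[mid - 1] + sorted[mid]) / 2f;
    }

    // leadings[i] is the baseline distance between line i and line i + 1.
    // Returns every index i whose gap exceeds the median leading times
    // 'factor', meaning a paragraph break is assumed after line i.
    public static List<int> BreakAfterLines(IReadOnlyList<float> leadings,
                                            float factor = 1.3f)
    {
        var breaks = new List<int>();
        if (leadings.Count == 0)
        {
            return breaks;
        }
        float threshold = Median(leadings) * factor;
        for (int i = 0; i < leadings.Count; i++)
        {
            if (leadings[i] > threshold)
            {
                breaks.Add(i);
            }
        }
        return breaks;
    }
}
```

The median is used rather than the mean so that a few large gaps between paragraphs do not pull the "usual" leading upward. Comparing the leading with the font size of the lines involved, as described above, would be a natural refinement for pages that mix font sizes.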
