What are the ways of checking whether a piece of text in a PDF document is bold using iTextSharp?


Question

I have an application that extracts headings from PDF files. The documents it is supposed to work with all have a more or less consistent structure and formatting, so telling whether a text chunk is bold is very important. Recently I came across a batch of files in which some chunks visually appear bold but have no "bold" part in the string representation of the font. The SO thread "How can I get text formatting with iTextSharp" helped me understand that there is one more way of making text appear bold. In my case, however, calling GetTextRenderMode() does not help either: it returns 0, as if it were normal text. Are there any other ways of making text appear bold, and is it possible to detect them using iTextSharp?
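The two checks described in the question can be summarized in a small, library-independent sketch. It is only an illustration under stated assumptions: `looks_bold` is a hypothetical helper, the font name and render mode are assumed to have been read already (for example from iTextSharp's text render info), and the full subset font name is assumed to follow the usual `TAG+Name` pattern.

```python
import re

# A subset font name starts with a six-uppercase-letter tag and a plus sign,
# e.g. "JOJJAH+TT116t00". The tag is stripped so it cannot match "bold" by accident.
_SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")

# PDF text rendering mode 2 fills the glyph and then strokes its outline,
# which is a common way to simulate bold text.
FILL_THEN_STROKE = 2


def looks_bold(font_name: str, render_mode: int) -> bool:
    """Hypothetical heuristic: True if either check from the question fires."""
    base_name = _SUBSET_PREFIX.sub("", font_name)
    name_says_bold = "bold" in base_name.lower()
    stroked_outline = render_mode == FILL_THEN_STROKE
    return name_says_bold or stroked_outline
```

For the files in question both checks fail: the name is `TT116t00` without any "bold" part and the render mode is 0, so a heuristic of this kind reports the text as regular.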

Answer

You are making the assumption that the font inside your PDF file knows if it's bold or not. Let's take a look inside and check if your assumption is correct.

This is what the subset JOJJAH of the font TT116t00 looks like when you look at the internals of the PDF file you have shared:

We see that the font is of subtype /TrueType, we see that the /ItalicAngle is 0, and... we see that the third bit of the /Flags value is set. Let's check the PDF reference to find out what this tells us:

I quote:

The font contains glyphs outside the Adobe standard Latin character set.

The glyphs look bold, because the glyphs are drawn in a way that they appear bold. You see the font as bold because you are human. However, when a machine looks at the font, it doesn't have a clue that the font is bold. A machine just follows the instructions stored in the /FontFile2 stream.

In short: iTextSharp has no indication that the font is bold.
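To make the flag reading above concrete, here is a minimal sketch that decodes a font descriptor's /Flags integer into the flag names defined in the PDF reference. It is an illustration, not iTextSharp code: `decode_font_flags` is a hypothetical helper, and the value 4 (only the third bit set) is an assumption, since the actual /Flags value of the shared file is not reproduced here.

```python
# Bit positions (1-based, lowest bit first) of the font descriptor flags
# as listed in the PDF reference.
FONT_FLAGS = {
    1: "FixedPitch",
    2: "Serif",
    3: "Symbolic",
    4: "Script",
    6: "Nonsymbolic",
    7: "Italic",
    17: "AllCap",
    18: "SmallCap",
    19: "ForceBold",
}


def decode_font_flags(flags: int) -> list:
    """Return the names of all flags whose bit is set in the /Flags value."""
    return [name for bit, name in FONT_FLAGS.items() if flags & (1 << (bit - 1))]


# Assumed example value: only bit 3 is set.
print(decode_font_flags(4))
```

With only the third bit set, the result is the Symbolic flag, which says nothing about weight. The only weight-related flag is ForceBold (bit 19), and even that one is a rendering hint for small text sizes rather than a statement that the typeface is bold.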
