Get font height/weight from TextRenderInfo how?


Problem description


When I parse an existing PDF using iText(Sharp), I create an object which implements IRenderListener which I pass into PdfReaderContentParser.ProcessContent() and sure enough, my object's RenderText() gets called repeatedly with all the text in the PDF.

The problem is, the TextRenderInfo tells me about the base font (in my case, Helvetica) but I can't tell the height of the font nor its weight (regular vs. bold). Is this a known deficiency of iText(Sharp) or am I missing something?

Solution

the TextRenderInfo tells me about the base font (in my case, Helvetica) but I can't tell the height of the font nor its weight (regular vs. bold)

Height

Unfortunately iTextSharp does not provide a public font size method or member in the TextRenderInfo. Some people worked around this by using the distance between its GetAscentLine() and its GetDescentLine().
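The ascent/descent workaround mentioned above could look roughly like the following. This is a minimal sketch, assuming iTextSharp 5.x; the class name TextHeight and the method name AscentToDescent are made up for illustration. Note that the result depends on the ascent and descent metrics of the font, so it approximates the text height rather than returning the exact font size.

```csharp
using iTextSharp.text.pdf.parser;

// Hypothetical helper, not part of iTextSharp.
public static class TextHeight
{
    // Distance between the ascent line and the descent line of a text chunk,
    // measured between the start points of the two lines.
    public static float AscentToDescent(TextRenderInfo renderInfo)
    {
        Vector ascentStart = renderInfo.GetAscentLine().GetStartPoint();
        Vector descentStart = renderInfo.GetDescentLine().GetStartPoint();
        return ascentStart.Subtract(descentStart).Length;
    }
}
```

It would be called from inside RenderText, e.g. `float height = TextHeight.AscentToDescent(renderInfo);`.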

If you are ready to use Reflection, though, you can do better by exposing and using the private TextRenderInfo member GraphicsState gs, e.g. like in this render listener:

using System;
using System.Collections.Generic;
using System.Reflection;
using iTextSharp.text.pdf.parser;

public class LocationTextSizeExtractionStrategy : LocationTextExtractionStrategy
{
    // Collects the font size, text, and font name of each chunk
    public List<SizeAndTextAndFont> myChunks = new List<SizeAndTextAndFont>();

    // Automatically called for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo wholeRenderInfo)
    {
        base.RenderText(wholeRenderInfo);
        // Read the private graphics state, which carries the font size
        GraphicsState gs = (GraphicsState) GsField.GetValue(wholeRenderInfo);
        myChunks.Add(new SizeAndTextAndFont(gs.FontSize, wholeRenderInfo.GetText(), wholeRenderInfo.GetFont().PostscriptFontName));
    }

    // Reflection handle for the private TextRenderInfo field "gs"
    FieldInfo GsField = typeof(TextRenderInfo).GetField("gs", BindingFlags.NonPublic | BindingFlags.Instance);
}

// Helper class that stores the font size, text, and font name of a chunk
public class SizeAndTextAndFont
{
    public float Size;
    public String Text;
    public String Font;
    public SizeAndTextAndFont(float size, String text, String font)
    {
        this.Size = size;
        this.Text = text;
        this.Font = font;
    }
}

You can extract information with such a render listener like this:

using (var pdfReader = new PdfReader(testFile))
{
    // Loop through each page of the document
    for (var page = startPage; page < endPage; page++)
    {
        Console.WriteLine("\n    Page {0}", page);

        LocationTextSizeExtractionStrategy strategy = new LocationTextSizeExtractionStrategy();
        PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

        foreach (SizeAndTextAndFont p in strategy.myChunks)
        {
            Console.WriteLine(string.Format("<{0}> in {2} at {1}", p.Text, p.Size, p.Font));
        }
    }
}

This produces an output like this:

    Page 1
<        The Philippine Stock Exchange, Inc> in Helvetica-Bold at 8
<       Daily Quotations Report> in Helvetica-Bold at 8
<       March 23 , 2015> in Helvetica-Bold at 8
<Name> in Helvetica at 7
<Symbol> in Helvetica at 7
<Bid> in Helvetica at 7
[...]

Considering transformations

The numbers you see in the output as font sizes are the values of the font size property in the PDF graphics state at the time the respective text is drawn.

Due to the flexibility of PDF, though, this may not be the font size you eventually see in the output: a custom transformation may stretch the output considerably. Some PDF producers even always use a font size of 1 and rely on transformations to stretch the output accordingly.

To get a good value for font sizes in such documents, you can improve the LocationTextSizeExtractionStrategy method RenderText like this:

public override void RenderText(TextRenderInfo wholeRenderInfo)
{
    base.RenderText(wholeRenderInfo);
    GraphicsState gs = (GraphicsState) GsField.GetValue(wholeRenderInfo);
    Matrix textToUserSpaceTransformMatrix = (Matrix) TextToUserSpaceTransformMatrixField.GetValue(wholeRenderInfo);
    float transformedFontSize = new Vector(0, gs.FontSize, 0).Cross(textToUserSpaceTransformMatrix).Length;

    myChunks.Add(new SizeAndTextAndFont(transformedFontSize, wholeRenderInfo.GetText(), wholeRenderInfo.GetFont().PostscriptFontName));
}

with this additional reflection FieldInfo member.

FieldInfo TextToUserSpaceTransformMatrixField = typeof(TextRenderInfo).GetField("textToUserSpaceTransformMatrix", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
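To see what the Cross call above computes, the arithmetic can be written out without iTextSharp. This is a sketch under the assumption that the combined matrix is given in the usual PDF form [a b c d e f] and that the vector is multiplied as a row vector from the left; FontSizeMath and TransformedFontSize are illustrative names only. The vector (0, fontSize, 0) is then mapped to (fontSize * c, fontSize * d, 0), so only c and d influence the result.

```csharp
using System;

// Hypothetical helper that mirrors the vector-times-matrix arithmetic.
public static class FontSizeMath
{
    // Length of (0, fontSize, 0) after multiplication with the matrix [a b c d e f].
    // The entries a, b, e and f drop out because the vector has no x or z part.
    public static float TransformedFontSize(float fontSize, float c, float d)
    {
        return fontSize * (float)Math.Sqrt(c * c + d * d);
    }
}
```

For a producer that sets the font size to 1 and scales by the matrix [12 0 0 12 e f], `FontSizeMath.TransformedFontSize(1f, 0f, 12f)` yields 12.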

Weight

As you can see in the output above, the name of the font may contain not only the font family name but also a weight indicator:

<       March 23 , 2015> in Helvetica-Bold at 8

In your example, therefore,

the TextRenderInfo tells me about the base font (in my case, Helvetica)

the Helvetica without any decorations would imply a regular weight.

Helvetica is one of the standard 14 fonts which every PDF viewer must provide out-of-the-box: Times-Roman, Helvetica, Courier, Symbol, Times-Bold, Helvetica-Bold, Courier-Bold, ZapfDingbats, Times-Italic, Helvetica-Oblique, Courier-Oblique, Times-BoldItalic, Helvetica-BoldOblique, Courier-BoldOblique. Thus, these names are pretty dependable.

Unfortunately font names in general may be chosen arbitrarily; a bold font may have "Bold" or "Black" or other indicators of boldness in its name or none at all.
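A name-based check might be sketched as follows. This is only a heuristic under stated assumptions: the list of hint words is a guess that you would extend for your own documents, FontNameHeuristics is an illustrative name, and the helper also strips the subset tag (six uppercase letters and a plus sign, e.g. "ABCDEF+") that embedded font subsets carry in front of their name.

```csharp
using System;

// Hypothetical helper, a heuristic only: font names may be chosen arbitrarily.
public static class FontNameHeuristics
{
    // Name fragments treated as hints of a bold face (assumed list, extend as needed).
    static readonly string[] BoldHints = { "Bold", "Black", "Heavy" };

    // Removes a subset tag such as "ABCDEF+" from the start of a font name.
    public static string StripSubsetPrefix(string fontName)
    {
        if (fontName != null && fontName.Length > 7 && fontName[6] == '+')
        {
            bool allUpper = true;
            for (int i = 0; i < 6; i++)
            {
                if (fontName[i] < 'A' || fontName[i] > 'Z') allUpper = false;
            }
            if (allUpper) return fontName.Substring(7);
        }
        return fontName;
    }

    // True if the name contains one of the hint words, ignoring case.
    public static bool LooksBold(string fontName)
    {
        if (string.IsNullOrEmpty(fontName)) return false;
        string name = StripSubsetPrefix(fontName);
        foreach (string hint in BoldHints)
        {
            if (name.IndexOf(hint, StringComparison.OrdinalIgnoreCase) >= 0) return true;
        }
        return false;
    }
}
```

With the sample output above, `FontNameHeuristics.LooksBold("Helvetica-Bold")` is true and `FontNameHeuristics.LooksBold("Helvetica")` is false. A bold font whose name carries no such hint will be missed.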

One might also try to use the font's FontDescriptor dictionary, for which an entry FontWeight is specified. Unfortunately this entry is optional, so you cannot count on it being there at all.
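Reading that entry could be sketched like this, assuming iTextSharp 5.x. The helper takes the font dictionary as a parameter because how you obtain it depends on your iTextSharp version; FontDescriptorWeight and TryGetFontWeight are illustrative names. In the PDF specification, FontWeight values range from 100 to 900, with 400 meaning normal and 700 meaning bold.

```csharp
using iTextSharp.text.pdf;

// Hypothetical helper, not part of iTextSharp.
public static class FontDescriptorWeight
{
    static readonly PdfName FontWeightKey = new PdfName("FontWeight");

    // Returns the FontWeight value of the font, or null if the entry is absent.
    public static int? TryGetFontWeight(PdfDictionary fontDictionary)
    {
        if (fontDictionary == null) return null;

        PdfDictionary descriptor = fontDictionary.GetAsDict(PdfName.FONTDESCRIPTOR);
        if (descriptor == null)
        {
            // Composite (Type0) fonts keep the descriptor in their descendant font.
            PdfArray descendants = fontDictionary.GetAsArray(PdfName.DESCENDANTFONTS);
            PdfDictionary descendant =
                (descendants != null && descendants.Size > 0) ? descendants.GetAsDict(0) : null;
            descriptor = (descendant != null) ? descendant.GetAsDict(PdfName.FONTDESCRIPTOR) : null;
        }
        if (descriptor == null) return null;

        PdfNumber weight = descriptor.GetAsNumber(FontWeightKey);
        return (weight != null) ? (int?)weight.IntValue : null;
    }
}
```

A null result only says that the entry is missing, not that the font has a regular weight.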

Furthermore, a font in a PDF can be artificially bolded (cf. the answer referenced in the original post): in the example shown there, all the numbers are drawn using the same font, merely with an increasing outline line width.

Thus, I'm afraid there is no dependable way to find the exact font weight, merely a number of heuristics which may or may not return acceptable approximations.
