How to determine artificial bold, artificial italic and artificial outline text styles using PDFBox

Question

I am using PDFBox to validate a PDF document. One requirement is to detect the following types of text in a PDF:

  • Artificial bold style text
  • Artificial italic style text
  • Artificial outline style text

I searched the PDFBox API documentation but could not find an API for this.

Can anyone tell me how to detect these artificial font/text styles in a PDF using PDFBox?

Solution

The general procedure and a PDFBox issue

In theory one should start this by deriving a class from PDFTextStripper and overriding its method:

/**
 * Write a Java string to the output stream. The default implementation will ignore the <code>textPositions</code>
 * and just calls {@link #writeString(String)}.
 *
 * @param text The text to write to the stream.
 * @param textPositions The TextPositions belonging to the text.
 * @throws IOException If there is an error when writing the text.
 */
protected void writeString(String text, List<TextPosition> textPositions) throws IOException
{
    writeString(text);
}

Your override should then use the List<TextPosition> textPositions instead of the String text; each TextPosition essentially represents a single letter together with information on the graphics state that was active when that letter was drawn.

Unfortunately the textPositions list does not contain the correct contents in the current version 1.8.3. For example, for the line "This is normal text." from your PDF, the method writeString is called four times, once each for the strings "This", " is", " normal", and " text." Each time, however, the textPositions list contains the TextPosition instances for the letters of the last string, " text."

This turned out to be already known as PDFBox issue PDFBOX-1804, which has meanwhile been resolved as fixed for versions 1.8.4 and 2.0.0.

That said, as soon as you have a PDFBox version in which this is fixed, you can check for some artificial styles as follows:

Artificial italic text

This text style is created like this in the page content:

BT
/F0 1 Tf
24 0 5.10137 24 66 695.5877 Tm
0 Tr
[<03>]TJ
...

The relevant part happens in setting the text matrix Tm. The 5.10137 is a factor by which the text is sheared.

When you check a TextPosition textPosition as indicated above, you can query this value using

textPosition.getTextPos().getValue(1, 0)

If this value is significantly greater than 0.0, you have artificial italics. If it is significantly less than 0.0, you have artificial backwards italics.
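What counts as "significantly" is left open above. One possible way to make it concrete is to relate the shear entry to the vertical scale of the same matrix (the value at (1, 1), which is 24 in the example) and compare the ratio against a threshold. The following is only a minimal sketch under that assumption: the class ShearCheck, its methods and the 0.05 threshold are hypothetical and not part of PDFBox, and rotated text would need a more careful analysis of the full matrix.

```java
final class ShearCheck {
    enum Slant { UPRIGHT, ITALIC, BACKWARD_ITALIC }

    // shear: text matrix entry (1, 0); scale: text matrix entry (1, 1).
    // minRatio: assumed threshold below which a shear is treated as noise.
    static Slant classify(float shear, float scale, float minRatio) {
        if (scale == 0f) {
            return Slant.UPRIGHT; // degenerate matrix, nothing sensible to report
        }
        float ratio = shear / scale;
        if (ratio > minRatio) {
            return Slant.ITALIC;
        }
        if (ratio < -minRatio) {
            return Slant.BACKWARD_ITALIC;
        }
        return Slant.UPRIGHT;
    }

    // Slant angle implied by the shear, in degrees (assumes unrotated text).
    static double slantAngleDegrees(float shear, float scale) {
        return Math.toDegrees(Math.atan2(shear, scale));
    }

    public static void main(String[] args) {
        // Values taken from the Tm operands in the italic example above.
        System.out.println(classify(5.10137f, 24f, 0.05f)); // ITALIC
        System.out.println(classify(0f, 24f, 0.05f));       // UPRIGHT
        System.out.println(slantAngleDegrees(5.10137f, 24f));
    }
}
```

With a real TextPosition, the two inputs would come from textPosition.getTextPos().getValue(1, 0) and getValue(1, 1).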

Artificial bold or outline text

These artificial styles print each letter twice, using different rendering modes; e.g. the capital 'T', in case of bold:

0 0 0 1 k
...
BT
/F0 1 Tf 
24 0 0 24 66.36 729.86 Tm 
<03>Tj 
4 M 0.72 w 
0 0 Td 
1 Tr 
0 0 0 1 K
<03>Tj
ET

(i.e. first drawing the letter in regular mode, filling the letter area, and then drawing it in outline mode, drawing a line along the letter border, both in black, CMYK 0, 0, 0, 1; this leaves the impression of a thicker letter.)

and in case of outline:

BT
/F0 1 Tf
24 0 0 24 66 661.75 Tm
0 0 0 0 k
<03>Tj
/GS1 gs
4 M 0.288 w 
0 0 Td
1 Tr
0 0 0 1 K
<03>Tj
ET

(i.e. first drawing the letter in regular mode white, CMYK 0, 0, 0, 0, filling the letter area, and then drawing it in outline mode, drawing a line along the letter border, in black, CMYK 0, 0, 0, 1; this leaves the impression of an outlined black on white letter.)
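To make the role of the Tr operator in the two snippets above more tangible, here is a toy scanner that walks over such a content stream fragment and reports the rendering mode in effect for each Tj. It is only an illustration under simplifying assumptions: the class RenderModeTrace is hypothetical, it handles hex strings and plain operands only (no literal strings in parentheses, no TJ arrays, no q/Q state saving), and it is no substitute for PDFBox's own stream parsing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RenderModeTrace {
    // A token is either a hex string such as <03> or a run of other non-space characters.
    private static final Pattern TOKEN = Pattern.compile("<[0-9A-Fa-f]*>|[^\\s<>]+");

    static List<String> trace(String content) {
        List<String> shown = new ArrayList<>();
        int mode = 0;           // the default text rendering mode is 0 (fill)
        String previous = null; // the token immediately before the current one
        Matcher matcher = TOKEN.matcher(content);
        while (matcher.find()) {
            String token = matcher.group();
            if (token.equals("Tr") && previous != null) {
                mode = Integer.parseInt(previous); // operand of Tr is the new mode
            } else if (token.equals("Tj") && previous != null) {
                shown.add(previous + " mode=" + mode); // operand of Tj is the string shown
            }
            previous = token;
        }
        return shown;
    }

    public static void main(String[] args) {
        // The artificial bold example from above, joined into one string.
        String bold = "0 0 0 1 k BT /F0 1 Tf 24 0 0 24 66.36 729.86 Tm <03>Tj "
                    + "4 M 0.72 w 0 0 Td 1 Tr 0 0 0 1 K <03>Tj ET";
        System.out.println(trace(bold)); // [<03> mode=0, <03> mode=1]
    }
}
```

The same glyph shows up twice, once filled (mode 0) and once stroked (mode 1), which is the pattern the detection has to look for.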

Unfortunately the PDFBox PDFTextStripper does not keep track of the text rendering mode. Furthermore it explicitly drops duplicate character occurrences in approximately the same position. Thus, it is not up to the task of recognizing these artificial styles.

If you really need to do so, you'd have to change TextPosition to also contain the rendering mode, PDFStreamEngine to add it to the generated TextPosition instances, and PDFTextStripper to not drop duplicate glyphs in processTextPosition.

Corrections

I wrote

"Unfortunately the PDFBox PDFTextStripper does not keep track of the text rendering mode."

This is not entirely true: you can find the current rendering mode using getGraphicsState().getTextState().getRenderingMode(). This means that during processTextPosition you do have the rendering mode available and can store rendering mode (and color!) information for the given TextPosition somewhere, e.g. in some Map<TextPosition, ...>, for later use.

"Furthermore it explicitly drops duplicate character occurrences in approximately the same position."

You can disable this by calling setSuppressDuplicateOverlappingText(false).

With these two changes you should be able to make the required tests for checking for artificial bold and outline, too.

The latter change might not even be necessary if you store and check for the styles early in processTextPosition.

How to retrieve rendering mode and color

As mentioned in Corrections it indeed is possible to retrieve rendering mode and color information by collecting that information in a processTextPosition override.

To this the OP commented that

"Always the stroking and non-stroking color is coming as Black"

This was a bit surprising at first but after looking at the PDFTextStripper.properties (from which the operators supported during text extraction are initialized), the reason became clear:

# The following operators are not relevant to text extraction,
# so we can silently ignore them.
...
K
k

Thus color setting operators (especially those for CMYK colors as in the present document) are ignored in this context! Fortunately the implementations of these operators for the PageDrawer can be used in this context, too.

So the following proof-of-concept shows how all required information can be retrieved.

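import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.pdfbox.pdmodel.graphics.color.PDColorState;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

// Proof of concept for PDFBox 1.8.x (1.8.4 or later because of PDFBOX-1804).
public class TextWithStateStripperSimple extends PDFTextStripper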
{
    public TextWithStateStripperSimple() throws IOException {
        super();
        setSuppressDuplicateOverlappingText(false);
        registerOperatorProcessor("K", new org.apache.pdfbox.util.operator.SetStrokingCMYKColor());
        registerOperatorProcessor("k", new org.apache.pdfbox.util.operator.SetNonStrokingCMYKColor());
    }

    @Override
    protected void processTextPosition(TextPosition text)
    {
        renderingMode.put(text, getGraphicsState().getTextState().getRenderingMode());
        strokingColor.put(text, getGraphicsState().getStrokingColor());
        nonStrokingColor.put(text, getGraphicsState().getNonStrokingColor());

        super.processTextPosition(text);
    }

    Map<TextPosition, Integer> renderingMode = new HashMap<TextPosition, Integer>();
    Map<TextPosition, PDColorState> strokingColor = new HashMap<TextPosition, PDColorState>();
    Map<TextPosition, PDColorState> nonStrokingColor = new HashMap<TextPosition, PDColorState>();

    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        writeString(text + '\n');

        for (TextPosition textPosition: textPositions)
        {
            StringBuilder textBuilder = new StringBuilder();
            textBuilder.append(textPosition.getCharacter())
                       .append(" - shear by ")
                       .append(textPosition.getTextPos().getValue(1, 0))
                       .append(" - ")
                       .append(textPosition.getX())
                       .append(" ")
                       .append(textPosition.getY())
                       .append(" - ")
                       .append(renderingMode.get(textPosition))
                       .append(" - ")
                       .append(toString(strokingColor.get(textPosition)))
                       .append(" - ")
                       .append(toString(nonStrokingColor.get(textPosition)))
                       .append('\n');
            writeString(textBuilder.toString());
        }
    }

    String toString(PDColorState colorState)
    {
        if (colorState == null)
            return "null";
        StringBuilder builder = new StringBuilder();
        for (float f: colorState.getColorSpaceValue())
        {
            builder.append(' ')
                   .append(f);
        }

        return builder.toString();
    }
}

Using this you get the period '.' in normal text as:

. - shear by 0.0 - 256.5701 88.6875 - 0 -  0.0 0.0 0.0 1.0 -  0.0 0.0 0.0 1.0

In artificial bold text you get:

. - shear by 0.0 - 378.86 122.140015 - 0 -  0.0 0.0 0.0 1.0 -  0.0 0.0 0.0 1.0
. - shear by 0.0 - 378.86002 122.140015 - 1 -  0.0 0.0 0.0 1.0 -  0.0 0.0 0.0 1.0

In artificial italics:

. - shear by 5.10137 - 327.121 156.4123 - 0 -  0.0 0.0 0.0 1.0 -  0.0 0.0 0.0 1.0

And in artificial outline:

. - shear by 0.0 - 357.25 190.25 - 0 -  0.0 0.0 0.0 1.0 -  0.0 0.0 0.0 0.0
. - shear by 0.0 - 357.25 190.25 - 1 -  0.0 0.0 0.0 1.0 -  0.0 0.0 0.0 0.0

So, there you are, all information required for recognition of those artificial styles. Now you merely have to analyze the data.
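As a starting point for that analysis, the following sketch classifies a pair of records for the same glyph, using the rendering modes and CMYK values printed above. It is a hedged illustration, not a complete detector: the class DoubleStrikeClassifier is hypothetical, it assumes the two records have already been matched as the same character at the same position, and it assumes DeviceCMYK colors (white looks different in other color spaces).

```java
import java.util.Arrays;

final class DoubleStrikeClassifier {
    enum Style { NORMAL, ARTIFICIAL_BOLD, ARTIFICIAL_OUTLINE }

    // White in DeviceCMYK; other color spaces are not handled in this sketch.
    private static final float[] CMYK_WHITE = {0f, 0f, 0f, 0f};

    // firstMode / fillColor: rendering mode and non-stroking color of the first pass.
    // secondMode / strokeColor: rendering mode and stroking color of the second pass.
    static Style classify(int firstMode, float[] fillColor, int secondMode, float[] strokeColor) {
        boolean fillThenStroke = firstMode == 0 && secondMode == 1;
        if (!fillThenStroke) {
            return Style.NORMAL;
        }
        if (Arrays.equals(fillColor, strokeColor)) {
            return Style.ARTIFICIAL_BOLD; // outline thickens a glyph of the same color
        }
        if (Arrays.equals(fillColor, CMYK_WHITE)) {
            return Style.ARTIFICIAL_OUTLINE; // white interior with a colored border
        }
        return Style.NORMAL;
    }

    public static void main(String[] args) {
        float[] black = {0f, 0f, 0f, 1f};
        float[] white = {0f, 0f, 0f, 0f};
        System.out.println(classify(0, black, 1, black)); // ARTIFICIAL_BOLD
        System.out.println(classify(0, white, 1, black)); // ARTIFICIAL_OUTLINE
    }
}
```

A producer could also use rendering mode 2 (fill, then stroke) to get a similar visual effect in a single pass; that case is not covered here.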

BTW, have a look at the artificial bold case: The coordinates might not always be identical but instead merely very similar. Thus, some leniency is required for the test whether two text position objects describe the same position.
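Such a lenient comparison could look roughly like this. The class PositionMatch and the tolerance of 0.01 are assumptions made for illustration; a tolerance relative to the font size may well be more appropriate for real documents.

```java
final class PositionMatch {
    // True if both coordinates differ by at most the given tolerance.
    static boolean samePosition(float x1, float y1, float x2, float y2, float tolerance) {
        return Math.abs(x1 - x2) <= tolerance && Math.abs(y1 - y2) <= tolerance;
    }

    public static void main(String[] args) {
        // Coordinates from the artificial bold output above.
        System.out.println(samePosition(378.86f, 122.140015f, 378.86002f, 122.140015f, 0.01f)); // true
        // Coordinates from the normal text and the bold text outputs.
        System.out.println(samePosition(256.5701f, 88.6875f, 378.86f, 122.140015f, 0.01f));     // false
    }
}
```

The inputs would be the getX() and getY() values of the two TextPosition objects being compared.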
