How to get font color using pdfbox
Question
I am trying to extract text with all of its information from a PDF using pdfbox. I got all the information I want except the color. I tried different ways to get the font color (including "Getting Text Colour with PDFBox"), but none of them worked. I have now copied code from the PageDrawer class of pdfbox, but the RGB value is still not correct.
protected void processTextPosition(TextPosition text) {
    Composite com;
    Color col;
    switch (this.getGraphicsState().getTextState().getRenderingMode()) {
        case PDTextState.RENDERING_MODE_FILL_TEXT:
            // filled text: read the non-stroking (fill) color
            com = this.getGraphicsState().getNonStrokeJavaComposite();
            int r = this.getGraphicsState().getNonStrokingColor().getJavaColor().getRed();
            int g = this.getGraphicsState().getNonStrokingColor().getJavaColor().getGreen();
            int b = this.getGraphicsState().getNonStrokingColor().getJavaColor().getBlue();
            int rgb = this.getGraphicsState().getNonStrokingColor().getJavaColor().getRGB();
            float[] cosp = this.getGraphicsState().getNonStrokingColor().getColorSpaceValue();
            PDColorSpace pd = this.getGraphicsState().getNonStrokingColor().getColorSpace();
            break;
        case PDTextState.RENDERING_MODE_STROKE_TEXT:
            // stroked text: read the stroking color
            System.out.println(this.getGraphicsState().getStrokeJavaComposite().toString());
            System.out.println(this.getGraphicsState().getStrokingColor().getJavaColor().getRGB());
            break;
        case PDTextState.RENDERING_MODE_NEITHER_FILL_NOR_STROKE_TEXT:
            // basic support for text rendering mode "invisible"
            Color nsc = this.getGraphicsState().getStrokingColor().getJavaColor();
            float[] components = {Color.black.getRed(), Color.black.getGreen(), Color.black.getBlue()};
            Color c1 = new Color(nsc.getColorSpace(), components, 0f);
            System.out.println(this.getGraphicsState().getStrokeJavaComposite().toString());
            break;
        default:
            System.out.println(this.getGraphicsState().getNonStrokeJavaComposite().toString());
            System.out.println(this.getGraphicsState().getNonStrokingColor().getJavaColor().getRGB());
    }
}
I am using the above code. The values I get are r = 0, g = 0, b = 0; the cosp array is [0.0]; inside the pd object, array = null and colorSpace = null; and the RGB value is always -16777216. Please help me. Thanks in advance.
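For reference, the value -16777216 reported above is what java.awt.Color returns from getRGB() for opaque black: the int packs alpha, red, green and blue into four bytes, and 0xFF000000 read as a signed int is -16777216. The small sketch below unpacks such a value; the class and method names are hypothetical and it does not use pdfbox at all.

```java
import java.awt.Color;

// Hypothetical helper, not part of pdfbox: unpacks the packed ARGB int
// that java.awt.Color.getRGB() returns.
final class ArgbInspector {

    /** Splits a packed ARGB int into its four 8-bit channels. */
    static String describe(int argb) {
        int a = (argb >>> 24) & 0xFF; // alpha occupies the top byte
        int r = (argb >>> 16) & 0xFF;
        int g = (argb >>> 8) & 0xFF;
        int b = argb & 0xFF;
        return String.format("a=%d r=%d g=%d b=%d", a, r, g, b);
    }

    public static void main(String[] args) {
        // The value from the question
        System.out.println(describe(-16777216));
        // The same value obtained from an opaque black Color
        System.out.println(Color.BLACK.getRGB());
    }
}
```

So the number by itself is consistent with r = g = b = 0; it says the color object was black, not that the unpacking went wrong.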
Recommended answer
I tried the code in the link you posted and it worked for me. The colors I get back are 148.92, 179.01001 and 214.965. I wish I could give you my PDF to work with; maybe I could store it somewhere external to SO? My PDF used a sort of palish blue color, and that seems to match. It was just one page of text created in Word 2010 and exported, nothing too intense.
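Some suggestions:
- Recall that the values returned are floats between 0 and 1. If a value is accidentally cast to an int, the values will of course end up being almost all 0. The linked code multiplies by 255 to get a range of 0 to 255.
- As the commenter said, the most common color in a PDF file is black, i.e. 0 0 0.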
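The first suggestion can be illustrated without pdfbox. The sketch below is only an illustration: the class and method names are hypothetical, and the sample components are chosen so that multiplying by 255 gives roughly the numbers reported above. It contrasts a truncating int cast with scaling to the 0 to 255 range.

```java
import java.awt.Color;

// Hypothetical helper, not part of pdfbox: converts color components
// given as floats in [0,1] into the 0..255 range used by java.awt.Color.
final class ColorComponents {

    /** Scales one component from [0,1] to 0..255, clamping out-of-range input. */
    static int toByte(float component) {
        float clamped = Math.max(0f, Math.min(1f, component));
        return Math.round(clamped * 255f);
    }

    /** Builds a java.awt.Color from three [0,1] components. */
    static Color toColor(float[] rgb) {
        if (rgb.length != 3) {
            throw new IllegalArgumentException("expected 3 components, got " + rgb.length);
        }
        return new Color(toByte(rgb[0]), toByte(rgb[1]), toByte(rgb[2]));
    }

    public static void main(String[] args) {
        float[] rgb = {0.584f, 0.702f, 0.843f}; // illustrative pale blue

        // Casting before scaling truncates every component below 1.0 to 0
        System.out.println((int) rgb[0] + " " + (int) rgb[1] + " " + (int) rgb[2]);

        // Scaling first keeps the information
        System.out.println(toColor(rgb));
    }
}
```

Rounding rather than truncating after the multiplication is a design choice made here; the linked code may do it differently.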
That is all I can think of for now. Otherwise, I have version 1.7.1 of pdfbox and fontbox and, like I said, I pretty much followed the link you gave.
Edit
Based upon my comments, here is perhaps a minimally invasive way of doing it for PDF files like color.pdf?
In PDFStreamEngine.java, in the processOperator method, one can do the following inside the try block:
    if (operation.equals("RG")) {
        // stroking color space
        System.out.println(operation);
        System.out.println(arguments);
    } else if (operation.equals("rg")) {
        // non-stroking color space
        System.out.println(operation);
        System.out.println(arguments);
    } else if (operation.equals("BT")) {
        System.out.println(operation);
    } else if (operation.equals("ET")) {
        System.out.println(operation);
    }
This will show you the information; it is then up to you to process the color information for each section according to your needs. Here is a snippet from the beginning of the output of the above code when run on color.pdf:
...
    BT
    rg
    [COSInt(1),COSInt(0),CosInt(0)]
    RG
    [COSInt(1),COSInt(0),CosInt(0)]
    ET
    BT
    ET
    BT
    rg
    [COSFloat {0.573},COSFloat {0.816},COSFloat {0.314}]
    RG
    [COSFloat {0.573},COSFloat {0.816},COSFloat {0.314}]
    ET
    ...
In the above output you can see an empty BT ET section; this is a section that is marked DEVICEGRAY. All the others give you [0,1] values for the R, G and B components.
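One possible way to "process the color information for each section" is to remember the most recent non-stroking color and record it whenever an ET is seen. The sketch below is an assumption-laden illustration and not pdfbox code: SectionColorTracker and onOperator are hypothetical names, operands are passed as plain floats instead of COS objects, and handling of the gray operator "g" is added here as an assumption to cover DEVICEGRAY sections. It relies on two facts from the PDF specification: the initial non-stroking color is black, and a color set in one section stays in effect until it is changed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical tracker, not part of pdfbox: records the fill color in
// effect at the end of each BT ... ET text section.
final class SectionColorTracker {

    // The PDF default non-stroking color is black
    private float[] fill = {0f, 0f, 0f};
    private final List<float[]> sectionFills = new ArrayList<>();

    /** Feeds one operator and its numeric operands to the tracker. */
    void onOperator(String op, float... operands) {
        switch (op) {
            case "rg": // non-stroking color, three components in [0,1]
                if (operands.length != 3) {
                    throw new IllegalArgumentException("rg needs 3 operands");
                }
                fill = operands.clone();
                break;
            case "g": // non-stroking gray (assumption: not printed by the code above)
                if (operands.length != 1) {
                    throw new IllegalArgumentException("g needs 1 operand");
                }
                fill = new float[] {operands[0], operands[0], operands[0]};
                break;
            case "ET": // end of a text section: remember the current fill color
                sectionFills.add(fill.clone());
                break;
            default: // BT, RG and everything else do not change the fill color
                break;
        }
    }

    /** Fill colors recorded at each ET, in document order. */
    List<float[]> sectionFills() {
        return sectionFills;
    }

    public static void main(String[] args) {
        SectionColorTracker tracker = new SectionColorTracker();
        tracker.onOperator("BT");
        tracker.onOperator("rg", 1f, 0f, 0f);
        tracker.onOperator("ET");
        tracker.onOperator("BT"); // empty section: inherits the previous color
        tracker.onOperator("ET");
        for (float[] c : tracker.sectionFills()) {
            System.out.println(c[0] + " " + c[1] + " " + c[2]);
        }
    }
}
```

Recording at ET is a simplification: a section can change color several times before it ends, so a finer-grained version would record the color at each text-showing operator instead.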