Identifying the text based on the output in PDF using PDFBox

Question

      I am using PDFBox to get the color information of the text in a PDF. I can get output using the following code, but I am unsure what the stroking color represents and what the non-stroking color represents. Based on these values, how do I decide which text has which color? Any suggestions? My current output looks like this: DeviceRGB DeviceCMYK java.awt.Color[r=63,g=240,b=0] java.awt.Color[r=35,g=31,b=32] 34.934998 31.11 31.875

      PDDocument doc = null;
      try {
          doc = PDDocument.load(strFilepath);
          PDFStreamEngine engine = new PDFStreamEngine(ResourceLoader.loadProperties("org/apache/pdfbox/resources/PageDrawer.properties"));
          PDPage page = (PDPage) doc.getDocumentCatalog().getAllPages().get(1);
          engine.processStream(page, page.findResources(), page.getContents().getStream());
          PDGraphicsState graphicState = engine.getGraphicsState();
          System.out.println(graphicState.getStrokingColor().getColorSpace().getName());
          System.out.println(graphicState.getNonStrokingColor().getColorSpace().getName());
          System.out.println(graphicState.getNonStrokingColor().getJavaColor());
          System.out.println(graphicState.getStrokingColor().getJavaColor());
          float[] colorSpaceValues = graphicState.getStrokingColor().getColorSpaceValue();
          for (float c : colorSpaceValues) {
              System.out.println(c * 255);
          }
      } finally {
          if (doc != null) {
              doc.close();
          }
      }
      

      Solution

      According to the clarifications in the comments, the OP wants to

      compare the font colors of one pdf page to another pdf page [...] if there is a text "Sample" in black color and some other text "sample1" in grey color....i need to know that sample--> black color, sample1-->grey color like this..i want the full text and its color

      PDFBox has a text extraction engine, the PDFTextStripper. There are some challenges in using it for the task at hand, though, among them:

      • It was not originally designed for extracting color information alongside the text; the TextPosition objects it uses don't even have an attribute for color. Thus, we will have to extend it somewhat.

        • We will first register listeners for color operations to keep track of colors at all.

        • We will furthermore store the color information for a TextPosition object in another structure (I would prefer to extend text position accordingly but due to several inaccessible private members that would have meant quite some hassle).

        • This has already been shown in detail in this answer; for the background, look there.

      • PDF allows many ways of drawing text. The letters may be filled with one color (the non-stroking color) and their outlines may be stroked with another (the stroking color). Their outlines may even serve as a clipping path for subsequent drawing operations. We will only consider filling and stroking colors.

      • Text drawn may later on be covered by other drawings, either completely hiding it or changing its apparent color. We will ignore this for now.
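To make the relation between text rendering mode and the two colors concrete, here is a minimal sketch that needs no PDFBox at all. It assumes the usual numbering of the PDF text rendering modes (the operand of the `Tr` operator, 0 to 7), which is also what the FILLING_MODES, STROKING_MODES and CLIPPING_MODES lists in the code further down encode. The names `TextPaint`, `RenderingModes`, `paintsFor` and `relevantColors` are made up for this illustration and are not part of any library.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical enum: what a text rendering mode does with the glyph outlines. */
enum TextPaint { FILL, STROKE, CLIP }

/** Hypothetical helper, not part of PDFBox. */
final class RenderingModes {
    private RenderingModes() {}

    /** Maps a PDF text rendering mode (0..7) to the operations applied to the glyphs. */
    static Set<TextPaint> paintsFor(int mode) {
        switch (mode) {
            case 0: return EnumSet.of(TextPaint.FILL);
            case 1: return EnumSet.of(TextPaint.STROKE);
            case 2: return EnumSet.of(TextPaint.FILL, TextPaint.STROKE);
            case 3: return EnumSet.noneOf(TextPaint.class); // invisible text
            case 4: return EnumSet.of(TextPaint.FILL, TextPaint.CLIP);
            case 5: return EnumSet.of(TextPaint.STROKE, TextPaint.CLIP);
            case 6: return EnumSet.of(TextPaint.FILL, TextPaint.STROKE, TextPaint.CLIP);
            case 7: return EnumSet.of(TextPaint.CLIP);
            default: throw new IllegalArgumentException("Unknown text rendering mode: " + mode);
        }
    }

    /** Names the graphics-state color(s) that determine how the text looks in this mode. */
    static String relevantColors(int mode) {
        Set<TextPaint> paints = paintsFor(mode);
        boolean fill = paints.contains(TextPaint.FILL);
        boolean stroke = paints.contains(TextPaint.STROKE);
        if (fill && stroke) return "non-stroking (interior) and stroking (outline)";
        if (fill) return "non-stroking (interior)";
        if (stroke) return "stroking (outline)";
        return "none (text is not painted)";
    }
}
```

For ordinary text (mode 0) only the non-stroking color matters, which is why reading just the stroking color of the final graphics state, as in the question, usually says little about the visible text color.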

      For PDFBox 1.8.x

      As indicated, we extend the PDFTextStripper like this:

      import java.io.IOException;
      import java.util.Arrays;
      import java.util.HashMap;
      import java.util.List;
      import java.util.Map;
      
      import org.apache.pdfbox.util.PDFTextStripper;
      import org.apache.pdfbox.util.TextPosition;
      
      public class ColorTextStripper extends PDFTextStripper
      {
          public ColorTextStripper() throws IOException
          {
              super();
              setSuppressDuplicateOverlappingText(false);
      
              registerOperatorProcessor("CS", new org.apache.pdfbox.util.operator.SetStrokingColorSpace());
              registerOperatorProcessor("cs", new org.apache.pdfbox.util.operator.SetNonStrokingColorSpace());
              registerOperatorProcessor("SC", new org.apache.pdfbox.util.operator.SetStrokingColor());
              registerOperatorProcessor("sc", new org.apache.pdfbox.util.operator.SetNonStrokingColor());
              registerOperatorProcessor("SCN", new org.apache.pdfbox.util.operator.SetStrokingColor());
              registerOperatorProcessor("scn", new org.apache.pdfbox.util.operator.SetNonStrokingColor());
              registerOperatorProcessor("G", new org.apache.pdfbox.util.operator.SetStrokingGrayColor());
              registerOperatorProcessor("g", new org.apache.pdfbox.util.operator.SetNonStrokingGrayColor());
              registerOperatorProcessor("RG", new org.apache.pdfbox.util.operator.SetStrokingRGBColor());
              registerOperatorProcessor("rg", new org.apache.pdfbox.util.operator.SetNonStrokingRGBColor());
              registerOperatorProcessor("K", new org.apache.pdfbox.util.operator.SetStrokingCMYKColor());
              registerOperatorProcessor("k", new org.apache.pdfbox.util.operator.SetNonStrokingCMYKColor());
          }
      
          @Override
          protected void processTextPosition(TextPosition text)
          {
              renderingMode.put(text, getGraphicsState().getTextState().getRenderingMode());
              strokingColor.put(text, getGraphicsState().getStrokingColor().getColorSpaceValue());
              nonStrokingColor.put(text, getGraphicsState().getNonStrokingColor().getColorSpaceValue());
      
              super.processTextPosition(text);
          }
      
          Map<TextPosition, Integer> renderingMode = new HashMap<TextPosition, Integer>();
          Map<TextPosition, float[]> strokingColor = new HashMap<TextPosition, float[]>();
          Map<TextPosition, float[]> nonStrokingColor = new HashMap<TextPosition, float[]>();
      
          final static List<Integer> FILLING_MODES = Arrays.asList(0, 2, 4, 6);
          final static List<Integer> STROKING_MODES = Arrays.asList(1, 2, 5, 6);
          final static List<Integer> CLIPPING_MODES = Arrays.asList(4, 5, 6, 7);
      
          @Override
          protected void writeString(String text, List<TextPosition> textPositions) throws IOException
          {
              for (TextPosition textPosition: textPositions)
              {
                  Integer charRenderingMode = renderingMode.get(textPosition);
                  float[] charStrokingColor = strokingColor.get(textPosition);
                  float[] charNonStrokingColor = nonStrokingColor.get(textPosition);
      
                  StringBuilder textBuilder = new StringBuilder();
                  textBuilder.append(textPosition.getCharacter())
                             .append("{");
      
                  if (FILLING_MODES.contains(charRenderingMode))
                  {
                      textBuilder.append("FILL:")
                                 .append(toString(charNonStrokingColor))
                                 .append(';');
                  }
      
                  if (STROKING_MODES.contains(charRenderingMode))
                  {
                      textBuilder.append("STROKE:")
                                 .append(toString(charStrokingColor))
                                 .append(';');
                  }
      
                  if (CLIPPING_MODES.contains(charRenderingMode))
                  {
                      textBuilder.append("CLIP;");
                  }
      
                  textBuilder.append("}");
                  writeString(textBuilder.toString());
              }
          }
      
          String toString(float[] values)
          {
              if (values == null)
                  return "null";
              StringBuilder builder = new StringBuilder();
              switch(values.length)
              {
              case 1:
                  builder.append("GRAY"); break;
              case 3:
                  builder.append("RGB"); break;
              case 4:
                  builder.append("CMYK"); break;
              default:
                  builder.append("UNKNOWN");
              }
              for (float f: values)
              {
                  builder.append(' ')
                         .append(f);
              }
      
              return builder.toString();
          }
      }
      

      You can call it like this:

      PDFTextStripper stripper = new ColorTextStripper();
      
      PDDocument document = PDDocument.load(SOURCE_FILE);
      String text;
      try {
          text = stripper.getText(document);
      } finally {
          document.close(); // always release the document's resources
      }
      

      The resulting text contains something like this:

      P{FILL:RGB 0.803 0.076 0.086;}e{FILL:RGB 0.803 0.076 0.086;}l{FILL:RGB 0.803 0.076 0.086;}l{FILL:RGB 0.803 0.076 0.086;}e{FILL:RGB 0.803 0.076 0.086;}
      

      and

      G{FILL:RGB 0.102 0.101 0.095;}r{FILL:RGB 0.102 0.101 0.095;}a{FILL:RGB 0.102 0.101 0.095;}z{FILL:RGB 0.102 0.101 0.095;}i{FILL:RGB 0.102 0.101 0.095;}e{FILL:RGB 0.102 0.101 0.095;}
      

      for the texts "Pelle" and "Grazie" in one sample document

      or

      K{FILL:RGB 0.0 0.322 0.573;}E{FILL:RGB 0.0 0.322 0.573;}Y{FILL:RGB 0.0 0.322 0.573;}
      

      and

      C{FILL:GRAY 0.0;}o{FILL:GRAY 0.0;}m{FILL:GRAY 0.0;}b{FILL:GRAY 0.0;}i{FILL:GRAY 0.0;}n{FILL:GRAY 0.0;}e{FILL:GRAY 0.0;}d{FILL:GRAY 0.0;}
      

      for the texts "KEY" and "Combined" in another sample document.

      Instead of serializing all the information into a String result, you can of course also create some class containing both the color and the character information in a structured way. Just like now the String result is created in writeString, you can change this method to add instances of such a class to some list in it.
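One possible shape for such a structured result is sketched below. It groups consecutive characters that share the same fill color into runs, which is close to the "sample --> black, sample1 --> grey" listing the OP asked for. `ColoredRun` and `ColoredRunCollector` are hypothetical names invented here; the sketch is independent of PDFBox, and the assumption is that writeString would call `add(...)` once per TextPosition with the character and the color components stored for it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical value class: consecutive characters that share one fill color. */
final class ColoredRun {
    final StringBuilder text = new StringBuilder();
    final float[] fillColor; // components as stored per TextPosition; may be null

    ColoredRun(float[] fillColor) {
        this.fillColor = fillColor;
    }

    @Override
    public String toString() {
        return text + " --> " + Arrays.toString(fillColor);
    }
}

/** Hypothetical collector; starts a new run whenever the fill color changes. */
final class ColoredRunCollector {
    private final List<ColoredRun> runs = new ArrayList<>();

    void add(String character, float[] fillColor) {
        ColoredRun last = runs.isEmpty() ? null : runs.get(runs.size() - 1);
        // Arrays.equals treats two null arrays as equal, so unknown colors also group together
        if (last == null || !Arrays.equals(last.fillColor, fillColor)) {
            last = new ColoredRun(fillColor);
            runs.add(last);
        }
        last.text.append(character);
    }

    List<ColoredRun> runs() {
        return runs;
    }
}
```

In the stripper, the body of the loop in writeString could then be reduced to a single call such as `collector.add(character, charNonStrokingColor)`, with the collector kept in a field and read after getText returns. Stroking colors and rendering modes could be added to `ColoredRun` in the same way if they matter for the comparison.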

      Requirements

      At least PDFBox version 1.8.4 is required to make this work. I tested it using 2.0.0-SNAPSHOT but 1.8.4 should suffice. 1.8.3, on the other hand, has a bug which sometimes forwards the wrong TextPosition objects to writeString, cf. PDFBOX-1804, and earlier versions don't provide a TextPosition collection to writeString at all.

      For PDFBox 2.x

      There were multiple refactorings and other changes in PDFBox 2.x which also concern the code above.

      Ported to PDFBox 2.x it may look like this:

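      import java.io.IOException;
      import java.util.Arrays;
      import java.util.HashMap;
      import java.util.List;
      import java.util.Map;
      
      import org.apache.pdfbox.pdmodel.graphics.state.RenderingMode;
      import org.apache.pdfbox.text.PDFTextStripper;
      import org.apache.pdfbox.text.TextPosition;
      
      public class ColorTextStripper extends PDFTextStripper {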
          public ColorTextStripper() throws IOException {
              super();
              setSuppressDuplicateOverlappingText(false);
      
              addOperator(new org.apache.pdfbox.contentstream.operator.color.SetStrokingColorSpace());
              addOperator(new org.apache.pdfbox.contentstream.operator.color.SetNonStrokingColorSpace());
              addOperator(new org.apache.pdfbox.contentstream.operator.color.SetStrokingColor());
              addOperator(new org.apache.pdfbox.contentstream.operator.color.SetNonStrokingColor());
              addOperator(new org.apache.pdfbox.contentstream.operator.color.SetStrokingColorN());
              addOperator(new org.apache.pdfbox.contentstream.operator.color.SetNonStrokingColorN());
              addOperator(new org.apache.pdfbox.contentstream.operator.color.SetStrokingDeviceGrayColor());
              addOperator(new org.apache.pdfbox.contentstream.operator.color.SetNonStrokingDeviceGrayColor());
              addOperator(new org.apache.pdfbox.contentstream.operator.color.SetStrokingDeviceRGBColor());
              addOperator(new org.apache.pdfbox.contentstream.operator.color.SetNonStrokingDeviceRGBColor());
              addOperator(new org.apache.pdfbox.contentstream.operator.color.SetStrokingDeviceCMYKColor());
              addOperator(new org.apache.pdfbox.contentstream.operator.color.SetNonStrokingDeviceCMYKColor());
          }
      
          @Override
          protected void processTextPosition(TextPosition text) {
              renderingMode.put(text, getGraphicsState().getTextState().getRenderingMode());
              strokingColor.put(text, getGraphicsState().getStrokingColor().getComponents());
              nonStrokingColor.put(text, getGraphicsState().getNonStrokingColor().getComponents());
      
              super.processTextPosition(text);
          }
      
          Map<TextPosition, RenderingMode> renderingMode = new HashMap<TextPosition, RenderingMode>();
          Map<TextPosition, float[]> strokingColor = new HashMap<TextPosition, float[]>();
          Map<TextPosition, float[]> nonStrokingColor = new HashMap<TextPosition, float[]>();
      
          final static List<RenderingMode> FILLING_MODES = Arrays.asList(RenderingMode.FILL, RenderingMode.FILL_STROKE, RenderingMode.FILL_CLIP, RenderingMode.FILL_STROKE_CLIP);
          final static List<RenderingMode> STROKING_MODES = Arrays.asList(RenderingMode.STROKE, RenderingMode.FILL_STROKE, RenderingMode.STROKE_CLIP, RenderingMode.FILL_STROKE_CLIP);
          final static List<RenderingMode> CLIPPING_MODES = Arrays.asList(RenderingMode.FILL_CLIP, RenderingMode.STROKE_CLIP, RenderingMode.FILL_STROKE_CLIP, RenderingMode.NEITHER_CLIP);
      
          @Override
          protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
              for (TextPosition textPosition: textPositions) {
                  RenderingMode charRenderingMode = renderingMode.get(textPosition);
                  float[] charStrokingColor = strokingColor.get(textPosition);
                  float[] charNonStrokingColor = nonStrokingColor.get(textPosition);
      
                  StringBuilder textBuilder = new StringBuilder();
                  textBuilder.append(textPosition.getUnicode()).append("{");
      
                  if (FILLING_MODES.contains(charRenderingMode)) {
                      textBuilder.append("FILL:").append(toString(charNonStrokingColor)).append(';');
                  }
      
                  if (STROKING_MODES.contains(charRenderingMode)) {
                      textBuilder.append("STROKE:").append(toString(charStrokingColor)).append(';');
                  }
      
                  if (CLIPPING_MODES.contains(charRenderingMode)) {
                      textBuilder.append("CLIP;");
                  }
      
                  textBuilder.append("}");
                  writeString(textBuilder.toString());
              }
          }
      
          String toString(float[] values)
          {
              if (values == null)
                  return "null";
              StringBuilder builder = new StringBuilder();
              switch(values.length) {
              case 1:
                  builder.append("GRAY"); break;
              case 3:
                  builder.append("RGB"); break;
              case 4:
                  builder.append("CMYK"); break;
              default:
                  builder.append("UNKNOWN");
              }
              for (float f: values) {
                  builder.append(' ')
                         .append(f);
              }
      
              return builder.toString();
          }
      }
      

      (ColorTextStripper)
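Since the stated goal is to compare font colors between pages, the stored component arrays eventually have to be compared, and the same visual color may be expressed in different color spaces (black is GRAY 0.0 but also CMYK 0 0 0 1). The following is a rough sketch of one way to do that, not something the code above provides. It uses the same length-based guess as the toString helper (1 component = gray, 3 = RGB, 4 = CMYK) and a naive device conversion for CMYK; real documents may use ICC-based, Separation, Indexed or Pattern color spaces, for which this guess is wrong. `ColorComponents`, `toRgb` and `sameColor` are hypothetical names.

```java
/** Hypothetical helper for comparing color component arrays; not part of PDFBox. */
final class ColorComponents {
    private ColorComponents() {}

    /**
     * Converts gray (1), RGB (3) or CMYK (4) components in the range 0..1 to RGB in 0..1.
     * The CMYK branch is a naive device conversion that ignores ICC profiles.
     */
    static float[] toRgb(float[] c) {
        switch (c.length) {
            case 1:
                return new float[] { c[0], c[0], c[0] };
            case 3:
                return c.clone();
            case 4: {
                float k = 1f - c[3];
                return new float[] { (1f - c[0]) * k, (1f - c[1]) * k, (1f - c[2]) * k };
            }
            default:
                throw new IllegalArgumentException("Unsupported component count: " + c.length);
        }
    }

    /** True if both colors are within the given per-channel tolerance after conversion to RGB. */
    static boolean sameColor(float[] a, float[] b, float tolerance) {
        float[] x = toRgb(a);
        float[] y = toRgb(b);
        for (int i = 0; i < 3; i++) {
            if (Math.abs(x[i] - y[i]) > tolerance) {
                return false;
            }
        }
        return true;
    }
}
```

The tolerance is a judgment call: with a small value such as 0.01 the near-black RGB 0.102 0.101 0.095 from the sample output above is reported as different from GRAY 0.0, while a larger value would treat them as the same color.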
