Detect Bold, Italic and Strike Through text using PDFBox with VB.NET

Question

Is there a way to preserve the text formatting when extracting a PDF with PDFBox?

I have a program that parses a PDF document for information. When a new version of the PDF is released, the authors use bold or italic text to indicate new information and strike-through or underlined text to indicate omitted text. Using the base stripper class in PDFBox returns all the text, but the formatting is removed, so I have no way of telling whether the text is new or omitted. I'm currently using the project example code below:

    Dim doc As PDDocument = Nothing

    Try
        doc = PDDocument.load(RFPFilePath)
        Dim stripper As New PDFTextStripper()

        stripper.setAddMoreFormatting(True)
        stripper.setSortByPosition(True)
        rtxt_DocumentViewer.Text = stripper.getText(doc)

    Finally
        If doc IsNot Nothing Then
            doc.close()
        End If
    End Try

My parsing code works well if I simply copy and paste the PDF text into a RichTextBox, which preserves the formatting. I was thinking of doing this programmatically by opening the PDF, selecting all, copying, closing the document and then pasting the text into my RichTextBox, but that seems clunky.

Answer

Since the OP mentioned in a comment that a Java example would do, and I have so far only used PDFBox with Java, this answer features a Java example. Furthermore, this example has been developed and tested with PDFBox version 1.8.11 only.

A customized text stripper

As already mentioned in a comment,

The bold and italic effects in the OP's sample document are generated by using a different font (containing bold or italic versions of the letters) to draw the text. The underline and strike-through effects in the sample document are generated by drawing a rectangle under / through the text line which has the width of the text line and a very small height. To extract this information, therefore, one has to extend the PDFTextStripper to somehow react to font changes and to rectangles near the text.

This is an example class extending the PDFTextStripper just like that:

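import java.awt.geom.Point2D;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// PDFBox 1.8.x package names
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSNumber;
import org.apache.pdfbox.util.Matrix;
import org.apache.pdfbox.util.PDFOperator;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;
import org.apache.pdfbox.util.operator.OperatorProcessor;

public class PDFStyledTextStripper extends PDFTextStripper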
{
    public PDFStyledTextStripper() throws IOException
    {
        super();
        registerOperatorProcessor("re", new AppendRectangleToPath());
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        for (TextPosition textPosition : textPositions)
        {
            Set<String> style = determineStyle(textPosition);
            if (!style.equals(currentStyle))
            {
                output.write(style.toString());
                currentStyle = style;
            }
            output.write(textPosition.getCharacter());
        }
    }

    Set<String> determineStyle(TextPosition textPosition)
    {
        Set<String> result = new HashSet<>();

        if (textPosition.getFont().getBaseFont().toLowerCase().contains("bold"))
            result.add("Bold");

        if (textPosition.getFont().getBaseFont().toLowerCase().contains("italic"))
            result.add("Italic");

        if (rectangles.stream().anyMatch(r -> r.underlines(textPosition)))
            result.add("Underline");

        if (rectangles.stream().anyMatch(r -> r.strikesThrough(textPosition)))
            result.add("StrikeThrough");

        return result;
    }

    class AppendRectangleToPath extends OperatorProcessor
    {
        public void process(PDFOperator operator, List<COSBase> arguments)
        {
            COSNumber x = (COSNumber) arguments.get(0);
            COSNumber y = (COSNumber) arguments.get(1);
            COSNumber w = (COSNumber) arguments.get(2);
            COSNumber h = (COSNumber) arguments.get(3);

            double x1 = x.doubleValue();
            double y1 = y.doubleValue();

            // create a pair of coordinates for the transformation
            double x2 = w.doubleValue() + x1;
            double y2 = h.doubleValue() + y1;

            Point2D p0 = transformedPoint(x1, y1);
            Point2D p1 = transformedPoint(x2, y1);
            Point2D p2 = transformedPoint(x2, y2);
            Point2D p3 = transformedPoint(x1, y2);

            rectangles.add(new TransformedRectangle(p0, p1, p2, p3));
        }

        Point2D.Double transformedPoint(double x, double y)
        {
            double[] position = {x,y}; 
            getGraphicsState().getCurrentTransformationMatrix().createAffineTransform().transform(
                    position, 0, position, 0, 1);
            return new Point2D.Double(position[0],position[1]);
        }
    }

    static class TransformedRectangle
    {
        public TransformedRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3)
        {
            this.p0 = p0;
            this.p1 = p1;
            this.p2 = p2;
            this.p3 = p3;
        }

        boolean strikesThrough(TextPosition textPosition)
        {
            Matrix matrix = textPosition.getTextPos();
            // TODO: This is a very simplistic implementation only working for horizontal text without page rotation
            // and horizontal rectangular strikeThroughs with p0 at the left bottom and p2 at the right top

            // Check if rectangle horizontally matches (at least) the text
            if (p0.getX() > matrix.getXPosition() || p2.getX() < matrix.getXPosition() + textPosition.getWidth() - textPosition.getFontSizeInPt() / 10.0)
                return false;
            // Check whether rectangle vertically is at the right height to strike through
            double vertDiff = p0.getY() - matrix.getYPosition();
            if (vertDiff < 0 || vertDiff > textPosition.getFont().getFontDescriptor().getAscent() * textPosition.getFontSizeInPt() / 1000.0)
                return false;
            // Check whether rectangle is small enough to be a line
            return Math.abs(p2.getY() - p0.getY()) < 2;
        }

        boolean underlines(TextPosition textPosition)
        {
            Matrix matrix = textPosition.getTextPos();
            // TODO: This is a very simplistic implementation only working for horizontal text without page rotation
            // and horizontal rectangular underlines with p0 at the left bottom and p2 at the right top

            // Check if rectangle horizontally matches (at least) the text
            if (p0.getX() > matrix.getXPosition() || p2.getX() < matrix.getXPosition() + textPosition.getWidth() - textPosition.getFontSizeInPt() / 10.0)
                return false;
            // Check whether rectangle vertically is at the right height to underline
            double vertDiff = p0.getY() - matrix.getYPosition();
            if (vertDiff > 0 || vertDiff < textPosition.getFont().getFontDescriptor().getDescent() * textPosition.getFontSizeInPt() / 500.0)
                return false;
            // Check whether rectangle is small enough to be a line
            return Math.abs(p2.getY() - p0.getY()) < 2;
        }

        final Point2D p0, p1, p2, p3;
    }

    final List<TransformedRectangle> rectangles = new ArrayList<>();
    Set<String> currentStyle = Collections.singleton("Undefined");
}

(PDFStyledTextStripper.java)

In addition to what the PDFTextStripper does, this class also

  • collects rectangles from the content (defined using the re instruction) using an instance of the AppendRectangleToPath operator processor inner class,
  • checks text for the style variants from the sample document in determineStyle, and
  • whenever the style changes, adds the new style to the result in writeString.
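
To make the geometric part of these checks easier to follow, here is a minimal sketch of the same idea without any PDFBox types. It assumes horizontal, unrotated text, and the class name LineDecorationSketch, the method classify and all of its parameters are made up for this illustration. Unlike the stripper above, it resolves a rectangle sitting exactly on the baseline as an underline.

```java
import java.awt.geom.Rectangle2D;

/** Hypothetical, PDFBox-free illustration of the underline / strike-through tests. */
final class LineDecorationSketch {
    enum Decoration { NONE, UNDERLINE, STRIKE_THROUGH }

    /**
     * Classifies a rectangle relative to a single glyph.
     * glyphX / baselineY: glyph origin in page coordinates; glyphWidth: its width;
     * fontSize: size in points; ascent / descent: font metrics in 1/1000 text units
     * (descent is negative, as in a PDF font descriptor).
     */
    static Decoration classify(Rectangle2D rect, double glyphX, double baselineY,
                               double glyphWidth, double fontSize,
                               double ascent, double descent) {
        // The rectangle has to be thin enough to count as a line.
        if (Math.abs(rect.getHeight()) >= 2) {
            return Decoration.NONE;
        }
        // It has to cover the glyph horizontally, with a tolerance of a tenth of the font size.
        if (rect.getMinX() > glyphX
                || rect.getMaxX() < glyphX + glyphWidth - fontSize / 10.0) {
            return Decoration.NONE;
        }
        double vertDiff = rect.getMinY() - baselineY;
        // On or below the baseline, but no lower than twice the descent: underline.
        if (vertDiff <= 0 && vertDiff >= descent * fontSize / 500.0) {
            return Decoration.UNDERLINE;
        }
        // Above the baseline, but no higher than the ascent: strike-through.
        if (vertDiff > 0 && vertDiff <= ascent * fontSize / 1000.0) {
            return Decoration.STRIKE_THROUGH;
        }
        return Decoration.NONE;
    }
}
```

Keeping both tests in one method makes it visible that they differ only in the vertical band they accept.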

Beware: This merely is a proof of concept! In particular

  • the implementations of the tests in TransformedRectangle.underlines(TextPosition) and TransformedRectangle.strikesThrough(TextPosition) are very simplistic and only work for horizontal text without page rotation and for horizontal rectangular strike-throughs and underlines with p0 at the left bottom and p2 at the right top;
  • all rectangles are collected, not checking whether they actually are filled with a visible color;
  • the tests for "bold" and "italic" merely inspect the name of the used font which may not suffice in general.
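
As an illustration of the last point, the font name test could be made slightly less fragile along the following lines. This is only a sketch: the class FontNameStyleSketch and its method are hypothetical, and treating "oblique" as italic is an assumption of mine, not something the sample document required. The one thing it relies on from the PDF format is that subset fonts carry a tag of six uppercase letters followed by a plus sign in front of the actual name, so a tag like BOLDAB+ would otherwise be mistaken for a bold font.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper; only inspects a font name, it does not touch PDFBox. */
final class FontNameStyleSketch {
    static Set<String> stylesFromFontName(String baseFont) {
        Set<String> result = new TreeSet<>();
        if (baseFont == null) {
            return result; // fonts without a name yield no style hints
        }
        String name = baseFont;
        // Drop a subset tag such as "ABCDEF+" so its letters cannot match "bold".
        if (name.matches("[A-Z]{6}\\+.*")) {
            name = name.substring(7);
        }
        name = name.toLowerCase(Locale.ROOT);
        if (name.contains("bold")) {
            result.add("Bold");
        }
        // Assumption: oblique faces are reported as italic.
        if (name.contains("italic") || name.contains("oblique")) {
            result.add("Italic");
        }
        return result;
    }
}
```

Even so, this remains a name heuristic. Fonts with names that do not mention their style would need other signals, for example the flags or the weight in the font descriptor where those are present.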

A test output

Using the PDFStyledTextStripper like this

String extractStyled(PDDocument document) throws IOException
{
    PDFTextStripper stripper = new PDFStyledTextStripper();
    stripper.setSortByPosition(true);
    return stripper.getText(document);
}

(from ExtractText.java, called from the test method testExtractStyledFromExampleDocument)

one gets the result

[]This is an example of plain text 
 
[Bold]This is an example of bold text 
[] 
[Underline]This is an example of underlined text[] 
 
[Italic]This is an example of italic text  
[] 
[StrikeThrough]This is an example of strike through text[]  
 
[Italic, Bold]This is an example of bold, italic text 

for the OP's sample document.
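
If the markers are to be processed further, for example to tell new text from omitted text as in the question, the output could be split into styled runs roughly like this. It is a sketch under assumptions: the class StyledRunParser and its members are hypothetical, and it assumes the extracted text itself never contains something that looks like a marker such as [Bold] or [], which the marker format above cannot rule out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical post-processing of the marker format produced by the stripper. */
final class StyledRunParser {
    /** A stretch of text together with the styles that were active for it. */
    static final class Run {
        final Set<String> styles;
        final String text;

        Run(Set<String> styles, String text) {
            this.styles = styles;
            this.text = text;
        }

        // Per the question: bold or italic marks new text.
        boolean isNew() {
            return styles.contains("Bold") || styles.contains("Italic");
        }

        // Per the question: strike-through or underline marks omitted text.
        boolean isOmitted() {
            return styles.contains("StrikeThrough") || styles.contains("Underline");
        }
    }

    private static final String STYLE = "(?:Bold|Italic|Underline|StrikeThrough)";
    // Matches "[]" as well as "[Bold]" or "[Italic, Bold]".
    private static final Pattern MARKER =
            Pattern.compile("\\[(" + STYLE + "(?:, " + STYLE + ")*)?\\]");

    static List<Run> parse(String extracted) {
        List<Run> runs = new ArrayList<>();
        Matcher matcher = MARKER.matcher(extracted);
        Set<String> current = new LinkedHashSet<>();
        int last = 0;
        while (matcher.find()) {
            // Text between the previous marker and this one belongs to the previous style.
            if (matcher.start() > last) {
                runs.add(new Run(current, extracted.substring(last, matcher.start())));
            }
            current = new LinkedHashSet<>();
            if (matcher.group(1) != null) {
                current.addAll(Arrays.asList(matcher.group(1).split(", ")));
            }
            last = matcher.end();
        }
        if (last < extracted.length()) {
            runs.add(new Run(current, extracted.substring(last)));
        }
        return runs;
    }
}
```

A more robust variant would not go through text markers at all but collect such runs directly in writeString; the parser merely shows that the proof-of-concept output already carries enough information.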


PS: The code of the PDFStyledTextStripper, in particular the code of its inner class TransformedRectangle, has meanwhile been changed slightly so that it also works for a sample document shared in a GitHub issue, cf. here.
