Reading legacy Word forms checkboxes converted to PDF

Problem description

Our customers send us orders as PDF forms, which are generated from a Word document built with legacy form fields.

Currently, people at our customer center key the orders into our system by hand, but we have decided to try to automate this task.

I'm able to read the content of the PDF with a simple PdfReader per page:

    public static string GetPdfText(string path)
    { 
        var text = string.Empty;
        using (var reader = new PdfReader(path))
        {
            for (var page = 1; page <= reader.NumberOfPages; page++)
            {
                text += PdfTextExtractor.GetTextFromPage(reader, page);
            }
        }
        return text;
    }

But not the checkboxes...

I am able to detect the checkboxes as dictionaries while running through every object in the PDF, but I'm unable to distinguish them from other objects or read the value...

    public static IEnumerable<PdfDictionary> ReadCheckboxes(string path)
    {
        using (var reader = new PdfReader(path))
        {
            var checkboxes = new List<PdfDictionary>();
            for (var i = 0; i < reader.XrefSize; i++)
            {
                // Not every indirect object is a dictionary (some are arrays,
                // numbers, or null), so a direct cast could throw.
                var dictionary = reader.GetPdfObject(i) as PdfDictionary;
                if (dictionary != null)
                {
                    checkboxes.Add(dictionary);
                }
            }
            return checkboxes;
        }
    }

What am I missing? I've also tried reading the AcroFields, but they're empty...

I have uploaded a sample PDF with legacy checkboxes here.

Currently there is no option to integrate our systems or to make any changes to the underlying PDF or Word document.

Solution

The OP indicated in comments that a solution which returns output like "checkbox at position x0, y0, checked; checkbox at position x1, y1, not checked; ..." would suffice, i.e. his "forms" are static enough that these positions identify the meaning of the respective checkboxes. Here is an implementation of that variant.

I only just noticed that the question targets C# (iTextSharp), while I have implemented the search in Java (iText). This should not be too big a problem, as the code should be easy to port. If there are problems porting, I'll add a C# version here.

As the checkboxes are drawn using vector graphics, the text extraction already used by the OP does not find them. Fortunately, though, the iText parsing framework can also be used to look for vector graphics.

Thus, we first need an ExtRenderListener (IExtRenderListener in iTextSharp) which collects the boxes. It only has non-trivial implementations of the interface methods modifyPath and renderPath:

@Override
public void modifyPath(PathConstructionRenderInfo renderInfo)
{
    switch (renderInfo.getOperation())
    {
    case PathConstructionRenderInfo.RECT:
    {
        float x = renderInfo.getSegmentData().get(0);
        float y = renderInfo.getSegmentData().get(1);
        float w = renderInfo.getSegmentData().get(2);
        float h = renderInfo.getSegmentData().get(3);
        rectangle = new Rectangle(x, y, x+w, y+h);
        // Without this break the case would fall through into MOVETO.
        break;
    }
    case PathConstructionRenderInfo.MOVETO:
    {
        float x = renderInfo.getSegmentData().get(0);
        float y = renderInfo.getSegmentData().get(1);
        moveToVector = new Vector(x, y, 1);
        lineToVector = null;
        break;
    }
    case PathConstructionRenderInfo.LINETO:
    {
        if (moveToVector != null)
        {
            float x = renderInfo.getSegmentData().get(0);
            float y = renderInfo.getSegmentData().get(1);
            lineToVector = new Vector(x, y, 1);
        }
        break;
    }
    default:
        moveToVector = null;
        lineToVector = null;
    }
}

@Override
public Path renderPath(PathPaintingRenderInfo renderInfo)
{
    if (renderInfo.getOperation() != PathPaintingRenderInfo.NO_OP)
    {
        if (rectangle != null)
        {
            Vector a = new Vector(rectangle.getLeft(), rectangle.getBottom(), 1).cross(renderInfo.getCtm());
            Vector b = new Vector(rectangle.getRight(), rectangle.getBottom(), 1).cross(renderInfo.getCtm());
            Vector c = new Vector(rectangle.getRight(), rectangle.getTop(), 1).cross(renderInfo.getCtm());
            Vector d = new Vector(rectangle.getLeft(), rectangle.getTop(), 1).cross(renderInfo.getCtm());

            Box box = new Box(new LineSegment(a, c), new LineSegment(b, d));
            boxes.add(box);

        }
        if (moveToVector != null && lineToVector != null)
        {
            if (!boxes.isEmpty())
            {
                Vector from = moveToVector.cross(renderInfo.getCtm());
                Vector to = lineToVector.cross(renderInfo.getCtm());

                boxes.get(boxes.size() - 1).selectDiagonal(new LineSegment(from, to));
            }
        }
    }

    moveToVector = null;
    lineToVector = null;
    rectangle = null;
    return null;
}

Vector moveToVector = null;
Vector lineToVector = null;
Rectangle rectangle = null;

public Iterable<Box> getBoxes()
{
    return boxes;
}

final List<Box> boxes = new ArrayList<Box>();

(from CheckBoxExtractionStrategy.java)
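The calls to cross(renderInfo.getCtm()) above move each corner from the coordinate system of the path into page coordinates. As a rough illustration of that step only, the following plain-Java sketch applies a PDF transformation matrix [a b c d e f] to a point without using iText. The class name CtmSketch, the method transform, and the example matrix are assumptions made up for this illustration; they are not part of the answer's code or of iText.

```java
// Minimal sketch of mapping a point through a PDF transformation matrix.
final class CtmSketch {
    /**
     * Applies the PDF matrix [a b c d e f] to the point (x, y):
     * x' = a*x + c*y + e and y' = b*x + d*y + f.
     */
    static double[] transform(double[] ctm, double x, double y) {
        double a = ctm[0], b = ctm[1], c = ctm[2], d = ctm[3], e = ctm[4], f = ctm[5];
        return new double[] { a * x + c * y + e, b * x + d * y + f };
    }

    public static void main(String[] args) {
        // Hypothetical CTM: scale by 0.5, then translate by (10, 20).
        double[] ctm = { 0.5, 0, 0, 0.5, 10, 20 };
        double[] p = transform(ctm, 100, 200);
        System.out.println(p[0] + ", " + p[1]); // prints "60.0, 120.0"
    }
}
```

With an identity matrix the point is returned unchanged, which is why untransformed rectangle coordinates may look correct on simple pages but drift on pages that scale or shift their content.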

The listener uses a helper class Box, which models each checkbox by its two diagonals:

public class Box
{
    public LineSegment getDiagonal()
    {
        return diagonalA;
    }

    public boolean isChecked()
    {
        return selectedA && selectedB;
    }

    Box(LineSegment diagonalA, LineSegment diagonalB)
    {
        this.diagonalA = diagonalA;
        this.diagonalB = diagonalB;
    }

    void selectDiagonal(LineSegment diagonal)
    {
        if (approximatelyEquals(diagonal, diagonalA))
            selectedA = true;
        else if (approximatelyEquals(diagonal, diagonalB))
            selectedB = true;
    }

    boolean approximatelyEquals(LineSegment a, LineSegment b)
    {
        float permissiveness = a.getLength() / 10.0f;
        if (approximatelyEquals(a.getStartPoint(), b.getStartPoint(), permissiveness) &&
                approximatelyEquals(a.getEndPoint(), b.getEndPoint(), permissiveness))
            return true;
        if (approximatelyEquals(a.getStartPoint(), b.getEndPoint(), permissiveness) &&
                approximatelyEquals(a.getEndPoint(), b.getStartPoint(), permissiveness))
            return true;
        return false;
    }

    boolean approximatelyEquals(Vector a, Vector b, float permissiveness)
    {
        return a.subtract(b).length() < permissiveness;
    }

    boolean selectedA = false;
    boolean selectedB = false;
    final LineSegment diagonalA, diagonalB;
}

(Inner class in CheckBoxExtractionStrategy.java)
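The comparison in Box is deliberately tolerant: a stroke counts as a diagonal if both end points lie within a tenth of the stroke's length of the diagonal's end points, in either direction. The sketch below restates that idea in plain Java without iText types, so it can be tried in isolation. The class DiagonalMatchSketch, its method names, and the sample coordinates are hypothetical and chosen only for this illustration.

```java
// Minimal sketch of the direction-agnostic, tolerant segment comparison.
final class DiagonalMatchSketch {
    /** True if the two points are closer together than the tolerance. */
    static boolean near(double[] p, double[] q, double tolerance) {
        return Math.hypot(p[0] - q[0], p[1] - q[1]) < tolerance;
    }

    /** Compares segments given as {x1, y1, x2, y2}, ignoring direction. */
    static boolean sameSegment(double[] s, double[] t) {
        double[] s1 = { s[0], s[1] }, s2 = { s[2], s[3] };
        double[] t1 = { t[0], t[1] }, t2 = { t[2], t[3] };
        // Tolerance: one tenth of the length of the first segment.
        double tolerance = Math.hypot(s[2] - s[0], s[3] - s[1]) / 10.0;
        return (near(s1, t1, tolerance) && near(s2, t2, tolerance))
            || (near(s1, t2, tolerance) && near(s2, t1, tolerance));
    }

    public static void main(String[] args) {
        double[] diagonal = { 0, 0, 10, 10 };      // one diagonal of a 10 x 10 box
        double[] stroke = { 9.8, 10.1, 0.2, -0.1 }; // drawn in reverse, slightly off
        double[] otherDiagonal = { 0, 10, 10, 0 };
        System.out.println(sameSegment(stroke, diagonal));      // true
        System.out.println(sameSegment(stroke, otherDiagonal)); // false
    }
}
```

Accepting both directions matters because nothing guarantees which corner the producing application starts the stroke from.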

Applying the extraction strategy to the sample document like this:

for (int page = 1; page <= pdfReader.getNumberOfPages(); page++)
{
    System.out.printf("\nPage %s\n====\n", page);

    CheckBoxExtractionStrategy strategy = new CheckBoxExtractionStrategy();
    PdfReaderContentParser parser = new PdfReaderContentParser(pdfReader);
    parser.processContent(page, strategy);

    for (Box box : strategy.getBoxes())
    {
        Vector basePoint = box.getDiagonal().getStartPoint();
        System.out.printf("at %s, %s - %s\n", basePoint.get(Vector.I1), basePoint.get(Vector.I2),
                box.isChecked() ? "checked" : "unchecked");
    }
}

one gets the output

Page 1
====
at 73.104, 757.8 - checked
at 86.544, 757.8 - checked
at 99.984, 757.8 - unchecked

for the OP's document
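Since the form layout is static, positions like these could then be mapped back to the fields they represent. The following sketch shows one possible way to do that with a nearest-match lookup. The field names optionA to optionC, the class CheckboxLookupSketch, and the tolerance of 2 units are assumptions for illustration only; the real meanings of the boxes have to come from the actual order form.

```java
// Minimal sketch: map a detected checkbox position to a known form field.
final class CheckboxLookupSketch {
    // Hypothetical field names; the coordinates are the ones printed above.
    static final String[] NAMES = { "optionA", "optionB", "optionC" };
    static final double[][] POSITIONS = {
        { 73.104, 757.8 }, { 86.544, 757.8 }, { 99.984, 757.8 }
    };

    /** Returns the nearest known field within the tolerance, or null if none is close enough. */
    static String nameAt(double x, double y, double tolerance) {
        String best = null;
        double bestDistance = tolerance;
        for (int i = 0; i < NAMES.length; i++) {
            double distance = Math.hypot(x - POSITIONS[i][0], y - POSITIONS[i][1]);
            if (distance <= bestDistance) {
                best = NAMES[i];
                bestDistance = distance;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(nameAt(86.5, 757.9, 2.0));   // optionB
        System.out.println(nameAt(300.0, 300.0, 2.0));  // null
    }
}
```

A small tolerance guards against minor coordinate differences between documents, while returning null for unknown positions makes layout changes visible instead of silently misassigning a box.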
