How to read data from a table-structured PDF using iTextSharp?


Problem description


I am having a problem reading some data from a PDF file.
My file is structured and contains tables and plain text. The standard parser reads the data of separate columns that sit on the same line together. For example:

Some Table Header  
Data Col1a     Data Col2a      Data Col3a
Data Col1b     Data Col2b      Data Col3b
               Data Col2c

with this code

        PdfReader reader = new PdfReader(pdfName);

        List<String> text = new List<String>();
        String page;
        List<String> pageStrings;
        string[] separators = { "\n", "\r\n" };

        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            page = PdfTextExtractor.GetTextFromPage(reader, i);
            pageStrings = new List<string>(page.Split(separators, StringSplitOptions.RemoveEmptyEntries));
            text.AddRange(pageStrings);

        }

        reader.Close();

        return text;

will be concatenated into strings:

Some Table Header
Data Col1a Data Col2a Data Col3a  
Data Col1b Data Col2b Data Col3b  
Data Col2c  

I'd like to get concatenated strings that reflect the data blocks instead. For the example above I'd like to get these strings:

Some Table Header
Data Col1a Data Col1b   
Data Col2a Data Col2b Data Col2c  
Data Col3a Data Col3b

Does anyone have an idea how to tune iTextSharp to get this behavior from the PDF parser? Maybe someone has an appropriate code sample?
The sample PDF file is here

Solution

The OP's sample file contains multiple sections like this one:

And the OP mentioned in a comment:

another one tool parse my PDF exactly like I want. [...]

PS: this tool is pdfbox

Using PDFBox (v1.8.10, the current release version) in this method:

String extract(PDDocument document) throws IOException
{
    PDFTextStripper stripper = new PDFTextStripper();
    return stripper.getText(document);
}

returns for the section shown above

Driver Book for 8/5/2015
Company IS MEDICAL; AND Date of Service IS BETWEEN 08/05/2015 AND 08/05/2015; AND Status IS Assigned; AND Vehicles IS  MEDICAL: 
CATY
 MEDICAL
Trip #: 314-A
Comments: ----LIVERY---
Destination:Pick-up:
Call Type: Livery
<Doctor Office>
REGO PARK,  (631) 
000-0000
(718) 896-5953
74- AVE 204E  HEIGHTS, NY 
11372 (718) 639-4154
11:00:00 PAT, MIKHAIL
Trip #: 314-B
Comments:  ----LIVERY---
Destination:Pick-up:
Call Type: Livery
74- AVE 204E  HEIGHTS, NY 
11372 (718) 639-4154
<Doctor Office>
63-6 REGO PARK, NY 
11374 (631) 000-0000
11:01:00 PAT, MIKHAIL

This is not really a neat column-wise extraction but certain blocks of information (like address blocks) remain together.

Getting the same output with iText(Sharp) actually is very easy: One merely has to explicitly use the SimpleTextExtractionStrategy instead of the LocationTextExtractionStrategy which is used by default, i.e. one has to replace this line

page = PdfTextExtractor.GetTextFromPage(reader, i);

by

page = PdfTextExtractor.GetTextFromPage(reader, i, new SimpleTextExtractionStrategy());

With the exception of one space character per dataset (iText(Sharp) extracts Destination: Pick-up: instead of Destination:Pick-up:) the results are identical.
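To check a claim like this on one's own files, the two extraction results can be compared while ignoring whitespace. The following is only a minimal sketch in plain Java; the class and method names are hypothetical and not part of iText or PDFBox, and the strings would in practice come from the two extractors.

```java
final class WhitespaceInsensitiveCompare {
    /** Removes every whitespace character (spaces, tabs, line breaks). */
    static String stripWhitespace(String s) {
        return s.replaceAll("\\s+", "");
    }

    /** True if both texts contain the same non-whitespace characters in the same order. */
    static boolean sameIgnoringWhitespace(String a, String b) {
        return stripWhitespace(a).equals(stripWhitespace(b));
    }

    public static void main(String[] args) {
        // The one difference mentioned above: iText(Sharp) adds a space, PDFBox does not.
        String fromIText = "Destination: Pick-up:";
        String fromPdfBox = "Destination:Pick-up:";
        System.out.println(sameIgnoringWhitespace(fromIText, fromPdfBox));
    }
}
```

Note that this comparison also ignores line breaks, so it only confirms that the characters come out in the same order, not that the line structure matches.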


Concerning your conclusion from PDFBox extracting the text as it does:

So I think that PDF is really table structured.

Actually this order of extraction means merely that the operations for drawing the string segments in the PDF page content stream occur in this very order. As the order of those operations is arbitrary according to the PDF specification, any update of the software generating those PDFs may result in files from which the PDFBox PDFTextStripper and the iText SimpleTextExtractionStrategy extract merely an unintelligible soup of characters.


PS: If one sets the PDFBox PDFTextStripper property SortByPosition to true like this

    PDFTextStripper stripper = new PDFTextStripper();
    stripper.setSortByPosition(true);
    return stripper.getText(document);

then PDFBox extracts the text just like iText(Sharp) does with the (default) LocationTextExtractionStrategy.
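The difference between the two extraction orders can be illustrated with a toy model. The sketch below uses no PDF library at all: `Chunk` is a hypothetical stand-in for a drawn string with a column and row position, and the sample data assumes a generator that happens to draw the question's table column by column. It is meant to show the principle, not how iText or PDFBox are implemented internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

final class ExtractionOrderDemo {
    /** Hypothetical stand-in for one drawn string and its position on the page. */
    static final class Chunk {
        final int column;
        final int row;
        final String text;

        Chunk(int column, int row, String text) {
            this.column = column;
            this.row = row;
            this.text = text;
        }
    }

    /** Sample chunks in the order an (assumed) generator drew them: column by column. */
    static List<Chunk> sample() {
        return Arrays.asList(
            new Chunk(0, 0, "Col1a"), new Chunk(0, 1, "Col1b"),
            new Chunk(1, 0, "Col2a"), new Chunk(1, 1, "Col2b"), new Chunk(1, 2, "Col2c"),
            new Chunk(2, 0, "Col3a"), new Chunk(2, 1, "Col3b"));
    }

    /** Joins the chunks exactly in the order of the drawing operations. */
    static String inStreamOrder(List<Chunk> chunks) {
        return chunks.stream().map(c -> c.text).collect(Collectors.joining(" "));
    }

    /** Sorts the chunks top to bottom, then left to right, before joining them. */
    static String sortedByPosition(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingInt((Chunk c) -> c.row)
                              .thenComparingInt(c -> c.column));
        return sorted.stream().map(c -> c.text).collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(inStreamOrder(sample()));    // column-wise, as drawn
        System.out.println(sortedByPosition(sample())); // row-wise, as laid out on the page
    }
}
```

Stream order keeps each column together only because the sample data was drawn that way; if the chunks were drawn in a different order, only the position-sorted result would stay the same.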


The OP indicated interest in a block structure inherent in the content stream. The most obvious structure like that in a generic PDF would be the text objects (in which multiple strings may be drawn).

In the case at hand the SimpleTextExtractionStrategy is used. It can easily be extended to also include markers corresponding to the start and end of text objects in its output. In Java this can be done by using an anonymous class like this:

return PdfTextExtractor.getTextFromPage(reader, pageNo, new SimpleTextExtractionStrategy()
{
    boolean empty = true;

    @Override
    public void beginTextBlock()
    {
        if (!empty)
            appendTextChunk("<BLOCK>");
        super.beginTextBlock();
    }

    @Override
    public void endTextBlock()
    {
        if (!empty)
            appendTextChunk("</BLOCK>\n");
        super.endTextBlock();
    }

    @Override
    public String getResultantText()
    {
        if (empty)
            return super.getResultantText();
        else
            return "<BLOCK>" + super.getResultantText();
    }

    @Override
    public void renderText(TextRenderInfo renderInfo)
    {
        empty = false;
        super.renderText(renderInfo);
    }
});

(TextExtraction.java method extractSimple)

(This Java code should easily be translatable into C#. The playing around with an empty boolean may look funny; it is necessary, though, because the base class assumes certain additional properties to be set as soon as some chunk has been appended to the extracted content.)

Using this extended strategy one gets for the section shown above:

<BLOCK>Driver Book for 8/5/2015
Company IS MEDICAL; AND Date of Service IS BETWEEN 08/05/2015 AND 08/05/2015; AND Status IS Assigned; AND Vehicles IS  MEDICAL: 
CATY</BLOCK>
<BLOCK>
 MEDICAL</BLOCK>
<BLOCK>
Trip #: 314-A</BLOCK>
<BLOCK>
Comments: ----LIVERY---</BLOCK>
<BLOCK>
Destination: Pick-up:</BLOCK>
<BLOCK>
Call Type: Livery
<Doctor Office>
REGO PARK,  (631) 
000-0000
(718) 896-5953</BLOCK>
<BLOCK>
74- AVE 204E  HEIGHTS, NY 
11372 (718) 639-4154</BLOCK>
<BLOCK>
11:00:00</BLOCK>
<BLOCK> PAT, MIKHAIL</BLOCK>
<BLOCK>
Trip #: 314-B</BLOCK>
<BLOCK>
Comments:  ----LIVERY---</BLOCK>
<BLOCK>
Destination: Pick-up:</BLOCK>
<BLOCK>
Call Type: Livery
74- AVE 204E  HEIGHTS, NY 
11372 (718) 639-4154</BLOCK>
<BLOCK>
<Doctor Office>
63-6 REGO PARK, NY 
11374 (631) 000-0000</BLOCK>
<BLOCK>
11:01:00</BLOCK>
<BLOCK> PAT, MIKHAIL</BLOCK>

As this keeps addresses in the same block, this might help during extraction.
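Once the text carries such markers, it can be cut into blocks with a regular expression. The following is a minimal sketch in plain Java; the class name `BlockSplitter` and its `split` method are hypothetical, and it assumes that the literal strings `<BLOCK>` and `</BLOCK>` never occur in the PDF text itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BlockSplitter {
    // DOTALL lets a block span several lines; the reluctant quantifier
    // stops at the first closing marker instead of the last one.
    private static final Pattern BLOCK =
        Pattern.compile("<BLOCK>(.*?)</BLOCK>", Pattern.DOTALL);

    /** Returns the trimmed content of every non-empty block, in order of appearance. */
    static List<String> split(String extracted) {
        List<String> blocks = new ArrayList<>();
        Matcher matcher = BLOCK.matcher(extracted);
        while (matcher.find()) {
            String content = matcher.group(1).trim();
            if (!content.isEmpty()) {
                blocks.add(content);
            }
        }
        return blocks;
    }

    public static void main(String[] args) {
        // A shortened excerpt of the marked-up output shown above.
        String extracted = "<BLOCK>\nTrip #: 314-A</BLOCK>\n"
            + "<BLOCK>\nCall Type: Livery\n<Doctor Office>\nREGO PARK,  (631) \n"
            + "000-0000\n(718) 896-5953</BLOCK>\n"
            + "<BLOCK>\n11:00:00</BLOCK>\n"
            + "<BLOCK> PAT, MIKHAIL</BLOCK>\n";
        for (String block : split(extracted)) {
            System.out.println("[" + block.replace("\n", " | ") + "]");
        }
    }
}
```

Text such as `<Doctor Office>` passes through untouched because only the exact marker strings are matched. If the markers could collide with real content, a less likely marker string would be the safer choice.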
