Reading a table or cell value in a PDF file using Java?


Problem description

I have gone through Java and PDF forums looking for a way to extract a text value from a table in a PDF file, but couldn't find any solution except JPedal (which is licensed and not open source).

So I would like to know whether any open-source APIs, such as PDFBox or iText, can achieve the same result as JPedal.

Ref. Example:

Solution

In comments, the OP clarified how he locates the text value he wants to extract from the table in the PDF file:

"By providing X and Y co-ordinates"

Thus, while the question initially sounded like generic extraction of tabular data from PDFs (which can be difficult, to say the least), it is essentially about extracting the text from a rectangular region on a page given by coordinates.

This is possible using either of the libraries you mentioned (and surely others, too).
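Before turning to the libraries, here is a library-free sketch of what "text from a rectangular region" boils down to: keep every positioned piece of text whose bounding box overlaps the region, then order the hits by position. The TextChunk class, the sample coordinates, and the top-left origin with y growing downward are all assumptions made for illustration; neither iText nor PDFBox exposes these exact types.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class RegionTextDemo {

    /** Hypothetical stand-in for a positioned piece of text reported by a PDF parser. */
    static final class TextChunk {
        final String text;
        final Rectangle2D box; // x, y, width, height; y grows downward in this sketch

        TextChunk(String text, double x, double y, double width, double height) {
            this.text = text;
            this.box = new Rectangle2D.Double(x, y, width, height);
        }
    }

    /** Joins the text of every chunk that overlaps the region, top to bottom, then left to right. */
    static String textInRegion(List<TextChunk> chunks, Rectangle2D region) {
        List<TextChunk> hits = new ArrayList<>();
        for (TextChunk chunk : chunks) {
            if (chunk.box.intersects(region)) {
                hits.add(chunk);
            }
        }
        hits.sort(Comparator.comparingDouble((TextChunk c) -> c.box.getY())
                .thenComparingDouble(c -> c.box.getX()));
        StringBuilder out = new StringBuilder();
        for (TextChunk chunk : hits) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(chunk.text);
        }
        return out.toString();
    }

    /** Made-up page content: a two-row table plus a footer, deliberately out of order. */
    static List<TextChunk> sampleChunks() {
        List<TextChunk> chunks = new ArrayList<>();
        chunks.add(new TextChunk("42", 100, 300, 30, 12));
        chunks.add(new TextChunk("Name", 10, 280, 50, 12));
        chunks.add(new TextChunk("Footer", 10, 700, 50, 12));
        chunks.add(new TextChunk("Qty", 100, 280, 30, 12));
        chunks.add(new TextChunk("Widget", 10, 300, 50, 12));
        return chunks;
    }

    public static void main(String[] args) {
        List<TextChunk> chunks = sampleChunks();
        // A region around a single cell: prints 42
        System.out.println(textInRegion(chunks, new Rectangle2D.Double(90, 295, 60, 20)));
        // A region around the whole table: prints Name Qty Widget 42
        System.out.println(textInRegion(chunks, new Rectangle2D.Double(0, 275, 200, 40)));
    }
}
```

Real PDF content rarely arrives as one tidy chunk per cell, so treat this only as a mental model for what the library classes below do for you.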

iText

To restrict the region from which you want to extract text, you can use the RegionTextRenderFilter in a FilteredTextRenderListener, e.g.:

/**
 * Parses a specific area of a PDF to a plain text file.
 * @param pdf the original PDF
 * @param txt the resulting text
 * @throws IOException
 */
public void parsePdf(String pdf, String txt) throws IOException {
    PdfReader reader = new PdfReader(pdf);
    PrintWriter out = new PrintWriter(new FileOutputStream(txt));
    Rectangle rect = new Rectangle(70, 80, 490, 580);
    RenderFilter filter = new RegionTextRenderFilter(rect);
    TextExtractionStrategy strategy;
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
        out.println(PdfTextExtractor.getTextFromPage(reader, i, strategy));
    }
    out.flush();
    out.close();
    reader.close();
}

(ExtractPageContentArea sample from iText in Action, 2nd edition)

Beware, though: iText extracts text based on the basic text chunks in the content stream, not based on each individual glyph in such a chunk. Thus, the whole chunk is processed even if only the tiniest part of it is in the area.

This may or may not suit you.

If you run into the problem that more is extracted than you wanted, you should split the chunks into their constituent glyphs beforehand. This Stack Overflow answer explains how to do that.
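To make the chunk-versus-glyph difference concrete, here is a small library-free sketch. It is not iText code: the Chunk class is hypothetical, and splitIntoGlyphs assumes every character in a chunk has the same width, which real fonts do not guarantee. It only shows why filtering whole chunks can return more text than filtering individual glyphs.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

final class GlyphSplitDemo {

    /** Hypothetical positioned text run; not an iText type. */
    static final class Chunk {
        final String text;
        final Rectangle2D.Double box;

        Chunk(String text, double x, double y, double width, double height) {
            this.text = text;
            this.box = new Rectangle2D.Double(x, y, width, height);
        }
    }

    /** Splits a chunk into one-character chunks, assuming equal glyph widths (a simplification). */
    static List<Chunk> splitIntoGlyphs(Chunk chunk) {
        List<Chunk> glyphs = new ArrayList<>();
        int count = chunk.text.length();
        if (count == 0) {
            return glyphs;
        }
        double glyphWidth = chunk.box.width / count;
        for (int i = 0; i < count; i++) {
            glyphs.add(new Chunk(String.valueOf(chunk.text.charAt(i)),
                    chunk.box.x + i * glyphWidth, chunk.box.y, glyphWidth, chunk.box.height));
        }
        return glyphs;
    }

    /** Chunk-level filtering: a chunk is kept whole as soon as any part of it overlaps the region. */
    static String extractByChunk(List<Chunk> chunks, Rectangle2D region) {
        StringBuilder out = new StringBuilder();
        for (Chunk chunk : chunks) {
            if (chunk.box.intersects(region)) {
                out.append(chunk.text);
            }
        }
        return out.toString();
    }

    /** Glyph-level filtering: each chunk is split first, so only overlapping characters are kept. */
    static String extractByGlyph(List<Chunk> chunks, Rectangle2D region) {
        List<Chunk> glyphs = new ArrayList<>();
        for (Chunk chunk : chunks) {
            glyphs.addAll(splitIntoGlyphs(chunk));
        }
        return extractByChunk(glyphs, region);
    }

    public static void main(String[] args) {
        // One 9-character run, 90 units wide, so each assumed glyph is 10 units wide.
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk("Total: 42", 100, 300, 90, 12));
        // The region starts at x = 170, where the digit "4" begins.
        Rectangle2D region = new Rectangle2D.Double(170, 295, 40, 20);

        System.out.println(extractByChunk(chunks, region)); // prints Total: 42
        System.out.println(extractByGlyph(chunks, region)); // prints 42
    }
}
```

Whether the glyph-level result is what you want depends on how tightly your rectangle follows the cell borders; glyphs that straddle the edge are still included in this sketch.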

PDFBox

To restrict the region from which you want to extract text, you can use the PDFTextStripperByArea, e.g.:

PDDocument document = PDDocument.load( args[0] );
if( document.isEncrypted() )
{
    document.decrypt( "" );
}
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
stripper.setSortByPosition( true );
Rectangle rect = new Rectangle( 10, 280, 275, 60 );
stripper.addRegion( "class1", rect );
List allPages = document.getDocumentCatalog().getAllPages();
PDPage firstPage = (PDPage)allPages.get( 0 );
stripper.extractRegions( firstPage );
System.out.println( "Text in the area:" + rect );
System.out.println( stripper.getTextForRegion( "class1" ) );

(ExtractTextByArea from the PDFBox 1.8.8 examples)
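Note that the two samples describe their rectangles differently: the iText Rectangle constructor takes lower-left and upper-right corners (llx, lly, urx, ury), while the java.awt.Rectangle used with PDFBox takes x, y, width, height. If you want to try the same area with both libraries, a helper like the following may be useful. It is only a sketch, and it assumes that the PDFBox region is measured from the top-left corner of the page with y growing downward, that the iText coordinates are measured from the bottom-left corner, and that the page is unrotated with its origin at a page corner; please verify these assumptions against the library versions you use. The method name and the page height of 842 are made up for illustration.

```java
import java.awt.geom.Rectangle2D;

final class RectConventionDemo {

    /**
     * Converts corner coordinates in a bottom-left-origin space (assumed iText convention)
     * into x, y, width, height in a top-left-origin space (assumed PDFBox region convention).
     * pageHeight is the height of the page in the same units.
     */
    static Rectangle2D.Double toTopLeftRect(double llx, double lly, double urx, double ury,
                                            double pageHeight) {
        double width = urx - llx;
        double height = ury - lly;
        // The top edge of the box sits at ury when measured from the bottom of the page,
        // hence at pageHeight - ury when measured from the top.
        return new Rectangle2D.Double(llx, pageHeight - ury, width, height);
    }

    public static void main(String[] args) {
        // The corners from the iText sample, on a hypothetical page that is 842 units high.
        Rectangle2D.Double r = toTopLeftRect(70, 80, 490, 580, 842);
        System.out.println(r.x + " " + r.y + " " + r.width + " " + r.height);
        // prints 70.0 262.0 420.0 500.0
    }
}
```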
