How to replace a space with a word while extracting data from a PDF using PDFBox
Problem description
I want to replace any empty column with a word, for example the word BLK, while extracting data from a PDF.
The tables below show the original table, the expected result, and the actual result.
Original Table
+--------------------------------------+
|#  |NAME    |TEL           |GENDER    |
|---------------------------|----------|
|1  |JOHN    |096587498     |M         |
|2  |VILLA   |              |F         |
+--------------------------------------+
Expected Result
# NAME TEL GENDER
1 JOHN 096587498 M
2 VILLA BLK F
Actual Result
# NAME TEL GENDER
1 JOHN 096587498 M
2 VILLA F
The actual result is from the class PDFTextStripper.
Solution
The PDFTextStripper does not see the graphical lines in the PDF; it merely sees text characters. Thus, in your line #2 it sees "2", "Villa", and "F" with gaps in between. With this class alone, therefore, you won't get what you want.
In general you have the following options using PDFBox:
You can first try and recognize the table cell regions in your PDF by parsing the vector graphics instructions of the page and then extract text cell by cell.
This answer provides a proof-of-concept for this. Beware: This answer focuses on the example document provided by the OP of that question. In particular it expects the lines to be drawn as thin filled rectangles; for a generic solution, the code collecting the table lines needs to be extended to also recognize lines drawn otherwise.
This approach obviously requires table rows and columns to be divided by lines (or, by extension, by background colors or something similar); this is not always the case.
In case of your example document the code works out of the box:
[A1] # [A2] Name [A3] Tel [A4] Gender
[B1] 1 [B2] John [B3] 096875959 [B4] M
[C1] 2 [C2] Villa [C3] [C4] F
(output of the ExtractBoxedText test testExtractBoxedTextsTestWPhromma)
You can extract the text attempting to reflect the layout of the PDF. If you know the general layout of the table in question (column n goes from here to there...), you can derive the table cell contents.
This answer provides a proof-of-concept for the layout-aware text extraction. Beware: the code is based on PDFBox 1.8.x, so some adaptations might be necessary.
This approach requires knowledge of the table column layout; this is not always given.
In case of your example document the code works out of the box:
#  Name   Tel        Gender
1  John   096875959  M
2  Villa             F
(output of the ExtractTextWithLayout test testExtractTestWPhromma)
For tagged PDFs you can try to extract the text including the tagging which reflects the table structure (if properly tagged).
As your example document is tagged, I'll show a quick & dirty proof-of-concept for this below.
This approach requires the PDF to be properly tagged; this is not always the case.
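To make the second option more concrete, here is a minimal sketch in plain Java, without PDFBox, of how cell contents might be derived once the column boundaries are known. The class ColumnLayoutSketch, the Fragment type, the toCells method, and the x coordinates are all hypothetical and chosen only for illustration; in practice the text pieces and their positions would have to come from a position-aware extraction of the page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ColumnLayoutSketch {

    /** Hypothetical holder: a piece of text and the x coordinate where it starts. */
    static class Fragment {
        final double x;
        final String text;

        Fragment(double x, String text) {
            this.x = x;
            this.text = text;
        }
    }

    /**
     * Assigns the fragments of one table row to columns.
     * columnStarts holds the left edge of each column in ascending order;
     * a column is assumed to extend up to the start of the next one.
     * Columns that receive no text are filled with the placeholder.
     */
    static List<String> toCells(List<Fragment> row, double[] columnStarts, String placeholder) {
        String[] cells = new String[columnStarts.length];
        Arrays.fill(cells, "");
        for (Fragment f : row) {
            int col = 0;
            for (int i = 0; i < columnStarts.length; i++) {
                if (f.x >= columnStarts[i]) {
                    col = i; // last column whose left edge is at or before the fragment
                }
            }
            cells[col] = cells[col].isEmpty() ? f.text : cells[col] + " " + f.text;
        }
        List<String> result = new ArrayList<>();
        for (String c : cells) {
            result.add(c.isEmpty() ? placeholder : c);
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed column edges for #, NAME, TEL, GENDER (made-up values)
        double[] columnStarts = {50, 80, 160, 260};
        List<Fragment> row2 = Arrays.asList(
                new Fragment(52, "2"),
                new Fragment(82, "VILLA"),
                new Fragment(262, "F"));
        System.out.println(String.join(" ", toCells(row2, columnStarts, "BLK")));
        // prints "2 VILLA BLK F"
    }
}
```

The same idea carries over to the first option: instead of column edges alone, each fragment would be matched against the cell rectangles recognized from the vector graphics.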
Extraction of content with tags
If your PDF is properly tagged, you can extract the content including the markup tags like this:
    PDDocument document = PDDocument.load(SOURCE);

    // Collect the marked content sequences of each page, keyed by their MCID
    Map<PDPage, Map<Integer, PDMarkedContent>> markedContents = new HashMap<>();

    for (PDPage page : document.getPages()) {
        PDFMarkedContentExtractor extractor = new PDFMarkedContentExtractor();
        extractor.processPage(page);

        Map<Integer, PDMarkedContent> theseMarkedContents = new HashMap<>();
        markedContents.put(page, theseMarkedContents);
        for (PDMarkedContent markedContent : extractor.getMarkedContents()) {
            theseMarkedContents.put(markedContent.getMCID(), markedContent);
        }
    }

    // Walk the structure tree and print tags together with the text they enclose
    PDStructureNode root = document.getDocumentCatalog().getStructureTreeRoot();
    showStructure(root, markedContents);
(ExtractMarkedContent test testExtractTestWPhromma)
using these two helper methods:
    // Prints the structure type of the node as an opening and closing tag
    // and recursively prints its kids in between.
    void showStructure(PDStructureNode node, Map<PDPage, Map<Integer, PDMarkedContent>> markedContents) {
        String structType = null;
        PDPage page = null;
        if (node instanceof PDStructureElement) {
            PDStructureElement element = (PDStructureElement) node;
            structType = element.getStructureType();
            page = element.getPage();
        }
        Map<Integer, PDMarkedContent> theseMarkedContents = markedContents.get(page);
        System.out.printf("<%s>\n", structType);
        for (Object object : node.getKids()) {
            if (object instanceof COSArray) {
                for (COSBase base : (COSArray) object) {
                    if (base instanceof COSDictionary) {
                        showStructure(PDStructureNode.create((COSDictionary) base), markedContents);
                    } else if (base instanceof COSNumber) {
                        showContent(((COSNumber) base).intValue(), theseMarkedContents);
                    } else {
                        System.out.printf("?%s\n", base);
                    }
                }
            } else if (object instanceof PDStructureNode) {
                showStructure((PDStructureNode) object, markedContents);
            } else if (object instanceof Integer) {
                showContent((Integer) object, theseMarkedContents);
            } else {
                System.out.printf("?%s\n", object);
            }
        }
        System.out.printf("</%s>\n", structType);
    }

    // Prints the text of the marked content sequence with the given MCID.
    void showContent(int mcid, Map<Integer, PDMarkedContent> theseMarkedContents) {
        PDMarkedContent markedContent = theseMarkedContents != null ? theseMarkedContents.get(mcid) : null;
        List<Object> contents = markedContent != null ? markedContent.getContents() : Collections.emptyList();
        StringBuilder textContent = new StringBuilder();
        for (Object object : contents) {
            if (object instanceof TextPosition) {
                textContent.append(((TextPosition) object).getUnicode());
            } else {
                textContent.append("?" + object);
            }
        }
        System.out.printf("%s\n", textContent);
    }
(ExtractMarkedContent helper methods)
The output for your example PDF is as follows (line breaks condensed and indentation added here for readability):
    <null>
      <Document>
        <Table>
          <THead>
            <TR>
              <TH> <P> # </P> </TH>
              <TH> <P> Name </P> </TH>
              <TH> <P> Tel </P> </TH>
              <TH> <P> Gender </P> </TH>
            </TR>
          </THead>
          <TBody>
            <TR>
              <TH> <P> 1 </P> </TH>
              <TD> <P> John </P> </TD>
              <TD> <P> 096875959 </P> </TD>
              <TD> <P> M </P> </TD>
            </TR>
            <TR>
              <TH> <P> 2 </P> </TH>
              <TD> <P> Villa </P> </TD>
              <TD> <P> </P> </TD>
              <TD> <P> F </P> </TD>
            </TR>
          </TBody>
        </Table>
        <P> </P>
      </Document>
    </null>
You recognize the empty cell:
<TD> <P> </P> </TD>
This proof-of-concept extracts to the standard output. You can obviously collect the data in a string builder or stream instead, or you can fill the <Table> data immediately into custom structures; after all, they already come separated into cells.
Beware: This is only a proof-of-concept. Where the code outputs data like this
    System.out.printf("?%s\n", ...);
some specific handling may be required. Also, other border conditions are likely not adequately considered. (Actually, I only implemented it to properly extract the contents of your example PDF.)
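As a rough illustration of collecting the tagged data into a custom structure and substituting a word for empty cells, here is a minimal sketch in plain Java. It assumes the tag and text lines have been collected into a list (one tag or one text line per entry, as the proof-of-concept prints them) instead of being written to standard output. The class TaggedTableSketch and the method toRows are hypothetical names, and the sketch treats any line of the form <Name> or </Name> as a tag, which would misread cell text that happens to look like one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TaggedTableSketch {

    /**
     * Turns a sequence of tag and text lines into table rows.
     * Each TH or TD element becomes one cell; cells without text
     * are replaced by the given placeholder.
     */
    static List<List<String>> toRows(List<String> lines, String placeholder) {
        List<List<String>> rows = new ArrayList<>();
        List<String> row = null;    // row currently being filled, if any
        StringBuilder cell = null;  // cell currently being filled, if any
        for (String raw : lines) {
            String line = raw.trim();
            if (line.equals("<TR>")) {
                row = new ArrayList<>();
            } else if (line.equals("</TR>")) {
                if (row != null) {
                    rows.add(row);
                }
                row = null;
            } else if (line.equals("<TH>") || line.equals("<TD>")) {
                cell = new StringBuilder();
            } else if (line.equals("</TH>") || line.equals("</TD>")) {
                if (row != null && cell != null) {
                    String text = cell.toString().trim();
                    row.add(text.isEmpty() ? placeholder : text);
                }
                cell = null;
            } else if (cell != null && !line.matches("</?\\w+>")) {
                // Text inside a cell; nested tags such as <P> are skipped
                if (cell.length() > 0) {
                    cell.append(' ');
                }
                cell.append(line);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "<Table>", "<TBody>", "<TR>",
                "<TH>", "<P>", "2", "</P>", "</TH>",
                "<TD>", "<P>", "Villa", "</P>", "</TD>",
                "<TD>", "<P>", "", "</P>", "</TD>",
                "<TD>", "<P>", "F", "</P>", "</TD>",
                "</TR>", "</TBody>", "</Table>");
        for (List<String> r : toRows(lines, "BLK")) {
            System.out.println(String.join(" ", r));
        }
        // prints "2 Villa BLK F"
    }
}
```

Working on collected lines keeps the sketch independent of PDFBox; the same bookkeeping could instead be done directly inside showStructure and showContent while walking the structure tree.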