Text associated to PDF paragraph in document content object with PDFBox
Problem description
I'm trying to get the text associated with a paragraph by navigating through the content tree of a PDF file. I am using PDFBox and cannot find the link between the paragraph and the text it contains (see code below):
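import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Map.Entry;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.documentinterchange.logicalstructure.PDStructureElement;
import org.apache.pdfbox.pdmodel.documentinterchange.logicalstructure.PDStructureTreeRoot;

// Imports assume the PDFBox 2.0.x package layout.
public class ReadPdf {
    public static void main(String[] args) throws IOException {
        // MyBufferedWriter is the asker's own writer class (not shown in the question)
        MyBufferedWriter out = new MyBufferedWriter(new FileWriter(new File("C:/Users/wip.txt")));
        RandomAccessFile raf = new RandomAccessFile(new File("C:/Users/mypdf.pdf"), "r");
        PDFParser parser = new PDFParser(raf);
        parser.parse();
        COSDocument cosDoc = parser.getDocument();
        out.write(cosDoc.getXrefTable().toString());
        out.write(cosDoc.getObjects().toString());
        PDDocument document = parser.getPDDocument();
        PDStructureTreeRoot treeRoot = document.getDocumentCatalog().getStructureTreeRoot();
        for (Object kid : treeRoot.getKids()) {
            for (Object kid2 : ((PDStructureElement) kid).getKids()) {
                PDStructureElement kid2c = (PDStructureElement) kid2;
                // Compare strings with equals(), not ==
                if ("P".equals(kid2c.getStandardStructureType())) {
                    for (Object kid3 : kid2c.getKids()) {
                        if (kid3 instanceof PDStructureElement) {
                            // Nested structure element: nothing done with it yet
                        } else {
                            // Print all the keys in the paragraph's COSDictionary
                            for (Entry<COSName, COSBase> entry : kid2c.getCOSObject().entrySet()) {
                                System.out.println(entry.getKey().toString());
                                System.out.println(entry.getValue().toString());
                            }
                        }
                    }
                }
            }
        }
        document.close();
    }
}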
When I print the contents I get the following keys:
- /P: reference to the parent
- /A: formatting of the paragraph
- /K: position of the paragraph within the section
- /C: name of the paragraph (!= text)
- /Pg: reference to the page
Example output:
COSName{K}
COSInt{2}
COSName{Pg}
COSObject{12, 0}
COSName{C}
COSName{Normal}
COSName{A}
COSObject{434, 0}
COSName{S}
COSName{Normal}
COSName{P}
COSObject{421, 0}
None of these points to the actual text inside the paragraph. I know the relation can be recovered, because Acrobat resolves it when I open the document.
Recommended answer
I found a way to do this by parsing the content stream of a page. According to section 10.6.3 of the PDF specification, each piece of text in the stream sits in a marked-content sequence that carries a number, written as /P <</MCID n>>, and that same number is stored as an attribute of the tag (a PDStructureElement in PDFBox), where it can be read from the tag's COS dictionary.
1) To get the text and the MCID:
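// Uses PDPage, PDStream (org.apache.pdfbox.pdmodel) and PDFStreamParser (org.apache.pdfbox.pdfparser), PDFBox 2.0.x
PDPage pdPage = document.getPage(0); // assumption: first page; use whichever page you need
StringBuilder tokenString = new StringBuilder();
Iterator<PDStream> inputStream = pdPage.getContentStreams();
while (inputStream.hasNext()) {
    try {
        PDFStreamParser parser2 = new PDFStreamParser(inputStream.next());
        parser2.parse();
        List<Object> tokens = parser2.getTokens();
        for (Object token : tokens) {
            tokenString.append(token).append(' ');
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
// Here comes the parsing of tokenString. Chapter 5 of the PDF specification explains the
// operators involved: Tj (actual text), Tm, BT, ET, plus BDC and EMC, which enclose the MCID.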
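The "parsing of the string" step above could look roughly like the following. This is only a sketch under several assumptions: it works on raw content-stream text in PDF syntax (not on the toString() output of PDFBox tokens), it assumes marked-content sequences are not nested, it handles only literal strings shown with Tj (no TJ arrays, no escaped parentheses), and it does no font decoding. The class and method names (McidTextExtractor, extract) are made up for illustration and are not part of PDFBox.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: collect the text shown by Tj operators under the MCID of the enclosing BDC ... EMC sequence. */
class McidTextExtractor {

    // One marked-content sequence: "<</MCID n>> BDC ... EMC" (assumed non-nested).
    private static final Pattern SEQUENCE =
            Pattern.compile("/MCID\\s+(\\d+)\\s*>>\\s*BDC(.*?)EMC", Pattern.DOTALL);

    // A literal string operand followed by Tj, e.g. "(Hello) Tj" (assumes no escaped parentheses).
    private static final Pattern SHOW_TEXT = Pattern.compile("\\(([^)]*)\\)\\s*Tj");

    static Map<Integer, String> extract(String contentStream) {
        Map<Integer, String> textByMcid = new LinkedHashMap<>();
        Matcher sequence = SEQUENCE.matcher(contentStream);
        while (sequence.find()) {
            int mcid = Integer.parseInt(sequence.group(1));
            StringBuilder text = new StringBuilder();
            Matcher show = SHOW_TEXT.matcher(sequence.group(2));
            while (show.find()) {
                text.append(show.group(1));
            }
            // If the same MCID appears more than once, concatenate its pieces.
            textByMcid.merge(mcid, text.toString(), String::concat);
        }
        return textByMcid;
    }

    public static void main(String[] args) {
        String stream = "/P <</MCID 0>> BDC BT /F1 12 Tf (Hello ) Tj (world) Tj ET EMC\n"
                + "/P <</MCID 1>> BDC BT (Second paragraph) Tj ET EMC";
        System.out.println(extract(stream)); // prints {0=Hello world, 1=Second paragraph}
    }
}
```

A regex is the simplest thing that shows the idea; for real documents, walking the token list and tracking BDC/EMC and Tj/TJ operators is likely to be more robust.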
2) Then get the tag and the attribute that matches the MCID:
// pDStructureElement is a tag from the structure tree, e.g. kid2c in the question's loop
// getInt returns -1 when /K is missing or is not a single integer
int mcid = pDStructureElement.getCOSObject().getInt(COSName.K);
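To make the matching step concrete, here is a small sketch that joins the two sides: the MCID-to-text map from step 1 and the /K values read from the tags. The Tag class is a hypothetical stand-in for PDStructureElement that holds only the structure type and the integer /K; ParagraphMatcher and paragraphTexts are likewise illustrative names. Note that MCIDs are only unique within one page's content stream, so a fuller version would also have to compare the tag's /Pg entry with the page being parsed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: look up the text of each paragraph tag by its MCID. */
class ParagraphMatcher {

    /** Hypothetical stand-in for a structure element: its structure type and integer /K entry. */
    static final class Tag {
        final String type;
        final int k;

        Tag(String type, int k) {
            this.type = type;
            this.k = k;
        }
    }

    static List<String> paragraphTexts(List<Tag> tags, Map<Integer, String> textByMcid) {
        List<String> result = new ArrayList<>();
        for (Tag tag : tags) {
            // Keep paragraph tags whose /K is a usable MCID (-1 means "no integer /K").
            if ("P".equals(tag.type) && tag.k >= 0) {
                String text = textByMcid.get(tag.k);
                if (text != null) {
                    result.add(text);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, String> textByMcid = new HashMap<>();
        textByMcid.put(2, "Some paragraph text");
        List<Tag> tags = Arrays.asList(new Tag("P", 2), new Tag("H1", 0));
        System.out.println(paragraphTexts(tags, textByMcid)); // prints [Some paragraph text]
    }
}
```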
That should do it. In documents without tags (document.getDocumentCatalog().getStructureTreeRoot() has no children) this match cannot be performed, but the text can still be read using step 1.