PDF find out if text is underlined or a table cell


Question

I have been playing around with PDFBox and its PDFTextStripperByArea class.

I was able to detect whether text is bold or italic, but I'm unable to get the underline information.

As far as I understand it, an underline in a PDF is produced by drawing lines. So in theory I should be able to get some information about lines located near the text. Given this information, I could then find out whether the text is underlined or sits in a table.

Here is my code so far:

List<TextPosition> textPos = charactersByArticle.get(index);

for (TextPosition t : textPos)
{
    // Some fonts carry no descriptor, so guard before reading style flags.
    PDFontDescriptor descriptor = t.getFont().getFontDescriptor();
    if (descriptor != null)
    {
        // Treat a heavy font weight or the ForceBold flag as bold.
        if (descriptor.getFontWeight() > BOLD_WEIGHT || descriptor.isForceBold())
        {
            isBold = true;
        }

        if (descriptor.isItalic())
        {
            isItalic = true;
        }
    }
}

I have tried working with the PDGraphicsState object, which is processed in the processEncodedText method of the PDFStreamEngine class, but found no information about lines there.

Any suggestions on where this information could be retrieved from?

Solution

Here is what I have found out so far:

PDFBox uses a resource file to bind PDF operators/instructions to certain classes, which then process the information.

If we take a look at the PDFTextStripper.properties resource file under:

pdfbox\src\main\resources\org\apache\pdfbox\resources\

we can see that for instance the BT operator is bound to the org.apache.pdfbox.util.operator.BeginText class and so on.

The PDFTextStripper under

pdfbox\src\main\java\org\apache\pdfbox\util\

takes this into account and uses these classes to process the PDF.

BUT all graphical objects are ignored, so no information about underlines or table structure is available!
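To illustrate the consequence of that binding scheme, here is a minimal sketch. It is not PDFBox's actual loader: the class OperatorBindings and its methods are hypothetical, and the properties text contains only the BT entry quoted above. The point is that an operator without an entry, such as the rectangle operator re, has no handler and is therefore skipped.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of PDFBox: models an operator-to-class table.
final class OperatorBindings {
    private final Properties bindings = new Properties();

    OperatorBindings(String propertiesText) {
        try {
            // Parse "operator=handler class name" lines.
            bindings.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the bound handler class name, or null if the operator is unbound. */
    String handlerFor(String operator) {
        return bindings.getProperty(operator);
    }

    /** An operator with no binding is silently ignored during processing. */
    boolean isIgnored(String operator) {
        return handlerFor(operator) == null;
    }

    public static void main(String[] args) {
        OperatorBindings stripperLike = new OperatorBindings(
                "BT=org.apache.pdfbox.util.operator.BeginText\n");
        // prints "org.apache.pdfbox.util.operator.BeginText"
        System.out.println(stripperLike.handlerFor("BT"));
        // prints "true": the rectangle operator has no handler here
        System.out.println(stripperLike.isIgnored("re"));
    }
}
```
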

Now if we take a look at the PageDrawer.properties resource file, we can see that it binds almost all available operators. It is used by the PageDrawer class under

pdfbox\src\main\java\org\apache\pdfbox\pdfviewer\

The "trick" is now to find out which graphical operators represent underlines and tables, and to use them in combination with PDFTextStripper.
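Once line positions are available, the combination step could be a simple geometric test. The sketch below is only an assumption about how that might look: TextBox, HLine, the 30% gap, and the 80% overlap threshold are all hypothetical choices, and coordinates are assumed to be in PDF user space with y growing upward (positions reported by PDFBox may need converting first).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical heuristic, not a PDFBox API.
final class UnderlineHeuristic {
    /** Text extent: left and right x, baseline y, and glyph height. */
    static final class TextBox {
        final double left, right, baseline, height;
        TextBox(double left, double right, double baseline, double height) {
            this.left = left; this.right = right;
            this.baseline = baseline; this.height = height;
        }
    }

    /** Horizontal line from x1 to x2 at height y; x1 <= x2 is enforced. */
    static final class HLine {
        final double x1, x2, y;
        HLine(double x1, double x2, double y) {
            this.x1 = Math.min(x1, x2); this.x2 = Math.max(x1, x2); this.y = y;
        }
    }

    /** True if some line lies just below the baseline and spans most of the text. */
    static boolean isUnderlined(TextBox text, List<HLine> lines) {
        double maxGap = 0.3 * text.height;   // assumed tolerance below the baseline
        double width = text.right - text.left;
        if (width <= 0) {
            return false;
        }
        for (HLine line : lines) {
            boolean justBelow = line.y <= text.baseline
                    && line.y >= text.baseline - maxGap;
            // Length of the horizontal stretch shared by text and line.
            double overlap = Math.min(text.right, line.x2) - Math.max(text.left, line.x1);
            if (justBelow && overlap >= 0.8 * width) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        TextBox word = new TextBox(100, 200, 500, 12);
        List<HLine> lines = Arrays.asList(new HLine(100, 200, 498.5));
        System.out.println(isUnderlined(word, lines)); // prints "true"
    }
}
```

Telling an underline apart from a table border would need more than this, for example checking for vertical lines on both sides of the text or for a line that extends far beyond the text width.
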

Now this would mean reading the PDF file specification, which is currently way too much work.

If someone knows which operators are used to draw underlines and table lines, please let me know.
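For what it's worth, the likely candidates are the path operators defined in the PDF specification: m (move to), l (line to), and re (rectangle), which are then painted with operators such as S (stroke) or f (fill). Underlines and table rules are commonly drawn either as stroked lines or as thin filled rectangles, though that depends on the producing application. The following is a deliberately simplified sketch, not a real content-stream parser: PathOperatorScan is a hypothetical class, it assumes an already decoded stream whose tokens are separated by whitespace, and it ignores the current transformation matrix (cm), curves, and string or array operands.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
final class PathOperatorScan {
    /** Returns each line segment or rectangle found as {x1, y1, x2, y2}. */
    static List<double[]> scan(String contentStream) {
        List<double[]> shapes = new ArrayList<>();
        List<Double> operands = new ArrayList<>();
        double curX = 0, curY = 0;   // current point set by m and l
        for (String token : contentStream.trim().split("\\s+")) {
            try {
                // Numeric tokens are operands for the next operator.
                operands.add(Double.parseDouble(token));
                continue;
            } catch (NumberFormatException notANumber) {
                // Anything else is treated as an operator below.
            }
            int n = operands.size();
            if (token.equals("m") && n >= 2) {
                curX = operands.get(n - 2);
                curY = operands.get(n - 1);
            } else if (token.equals("l") && n >= 2) {
                double x = operands.get(n - 2);
                double y = operands.get(n - 1);
                shapes.add(new double[] {curX, curY, x, y});
                curX = x;
                curY = y;
            } else if (token.equals("re") && n >= 4) {
                double x = operands.get(n - 4);
                double y = operands.get(n - 3);
                double w = operands.get(n - 2);
                double h = operands.get(n - 1);
                // Store the rectangle by its two opposite corners.
                shapes.add(new double[] {x, y, x + w, y + h});
            }
            operands.clear();   // operands belong to one operator only
        }
        return shapes;
    }

    public static void main(String[] args) {
        String stream = "100 498 m 200 498 l S 50 400 300 20 re f";
        System.out.println(scan(stream).size()); // prints "2"
    }
}
```

A handler bound to these operators could collect the same coordinates during normal page processing, which would then feed a position comparison against the extracted text.
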
