Excluding superscript when extracting text from PDF

Views: 98 | Posted: 2020/5/25 1:42:28 | Tags: parsing, extract, pdfbox, superscript, sentence

Problem description

I have extracted text from a PDF line by line using PDFBox so that my algorithm can process it sentence by sentence.

I recognize a sentence boundary as a period (.) followed by a word whose first letter is capitalized. The issue is that when a sentence ends with a word that carries a superscript, the extractor treats the superscript as a normal character and places it right after the period.

For example, when the expression "2 to the power of 22" is the last word of a sentence (that is, it is followed by a period), it is extracted as 2.22, which makes it difficult to identify the end of the sentence.
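
To make the failure concrete, here is a minimal sketch of a splitter built on the heuristic described above. The class name, the method name, and the exact regular expression are assumptions for illustration, not the asker's actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class SentenceSplitter {
    // Boundary: whitespace that follows a period and precedes an uppercase letter.
    private static final Pattern BOUNDARY =
            Pattern.compile("(?<=\\.)\\s+(?=[A-Z])");

    /** Splits text into sentences using the "period + capitalized word" rule. */
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        for (String part : BOUNDARY.split(text.trim())) {
            if (!part.isEmpty()) {
                sentences.add(part);
            }
        }
        return sentences;
    }
}
```

With clean input such as "The table has 2. The next sentence starts here." the rule finds the boundary. Once the superscript digits land after the period ("The table has 2.22 The next sentence starts here."), the period is no longer followed by whitespace and a capital letter, so the two sentences are returned as one.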

Please suggest a way to get rid of the superscripts, or a different logic for identifying the end of a sentence.

Thanks.

Solution

I am answering my own question, as some people may be directed here.

I solved this following @mkl's suggestion. After observing the values returned by getYScale() in PDFStreamEngine.java, I concluded that for superscripts the value was less than 8.9663. So I added a condition in PDFStreamEngine.java just before the TextPosition is created; the TextPosition is then processed by PDFTextStripper.java. The code is below:

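// Only pass the glyph on to the text stripper when its vertical scale is at
// least 8.9663; smaller glyphs (the superscripts in my case) are skipped.
if (textXctm.getYScale() >= 8.9663) {
    processTextPosition(
        new TextPosition(
            pageRotation,
            pageWidth,
            pageHeight,
            textMatrixStart,
            endXPosition,
            endYPosition,
            totalVerticalDisplacementDisp,
            widthText,
            spaceWidthDisp,
            c,
            codePoints,
            font,
            fontSizeText,
            (int)(fontSizeText * textMatrix.getXScale())
        ));
}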

Let me know if my approach has any flaws in eliminating only the superscripts. Thanks.
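
The effect of the size check can be illustrated without touching the PDFBox sources. The following is a minimal, self-contained sketch: Glyph is a hypothetical stand-in for PDFBox's TextPosition that keeps only the text and the vertical scale, and the class and method names are assumptions. The threshold is passed in as a parameter, because 8.9663 is the value observed for the documents in this answer and is unlikely to carry over to PDFs set in a different font size.

```java
import java.util.List;

class SuperscriptFilter {
    /** Hypothetical stand-in for a text position: the glyph text and its vertical scale. */
    static final class Glyph {
        final String text;
        final float yScale;

        Glyph(String text, float yScale) {
            this.text = text;
            this.yScale = yScale;
        }
    }

    /** Concatenates only the glyphs whose vertical scale is at least minYScale. */
    static String keepBodyText(List<Glyph> glyphs, float minYScale) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            if (g.yScale >= minYScale) {
                sb.append(g.text);
            }
        }
        return sb.toString();
    }
}
```

For the glyph sequence "2", ".", "22", where only "22" has a scale below the threshold, the filter yields "2." and the sentence boundary becomes detectable again. Note that a pure size test also drops subscripts, footnote markers, and any other small text, so it does not eliminate only superscripts.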
