iTextSharp inserting spaces within words from a PDF file


Question

Using iTextSharp, I am trying to extract the text from the following pdf file:

https://www.treasury.gov/ofac/downloads/sdnlist.pdf

This is the code:

var currentText = PdfTextExtractor.GetTextFromPage(pdfReader, 2, new SimpleTextExtractionStrategy());
if (currentText.Length > 0)
{
    var capture = new Capture();
    capture.Text = currentText;

    // write the results to the DB, if any data was found
    _dataService.AddCapture(capture);
}

Using the SimpleTextExtractionStrategy, the results are written to the database with many unwanted spaces inside words. The first several lines of page 2 come out as:

OFFICE OF FOREIGN ASSETS CONTROL SPECIALLY DESIGNATED NATIONALS & BLOCKED PERSONS February 3, 2017 - 2 - A.A. RASPLET IN; a .k. a. AL MAZ -AN TEY MSDB; a .k.a . AL MAZ -ANTEY PV O 'AI R DEFENSE' CO NCERN LEAD SYSTE M S DESIGN BUREAU OAO ' OPEN JO INT -STOCK COMPANY' IMENI ACADEMIC IAN A.A . RASPLETIN; a.k .a. GO LOVNOYE SISTEMN OYE KONS TRUKT ORSKOY E BYURO OPEN J OIN T-S TOCK C OMP ANY OF ALMAZ -AN TEY PVO C ONCERN I MEN I ACADEMICIAN A .A. RASPLE TIN; a.k. a. JO INT STOCK C OMPANY A LMA Z-AN TEY AI R DEFENSE CON CERN MA IN SYSTE M DESIGN BUREAU NAMED BY ACADE MICIAN A.A.

See for example the word "JO INT" in the 4th & 6th lines, and the word "CON CERN" in the 2nd to last line. These types of spaces occur throughout the entire results. This will make querying the text impossible, unfortunately.

Does anyone have any idea why this happens and how to resolve it?

Solution

why this happens

The cause actually is a feature of the text extraction strategy which in your case does not work as desired.

A bit of background: What you perceive as a space between words in a PDF file is not necessarily produced by an instruction that draws a space character. It can also be the result of an instruction that shifts the text insertion position a little to the right. Text extraction strategies therefore usually add a space character when they find a sufficiently large right-shift like that. For more on this (in particular the "sufficiently large" part), see e.g. this answer.

In your document, though, the font used for the body text declares character widths that are too small (used as is, the characters would appear glued together with no space in between whatsoever). The PDF therefore applies a small right-shift between each pair of consecutive characters, and some of these shifts are wide enough to be falsely identified as word separations by the mechanism explained above.
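To make the mechanism concrete, here is a minimal, self-contained sketch that does not use iTextSharp at all. The `Chunk` class, the `ImpliedSpaceDemo` class and all coordinates are made up for illustration. Only the threshold (a gap larger than half the width of a space character) is taken from the `RenderText` code quoted in this answer. It shows how one slightly larger inter-character shift turns "JOINT" into "JO INT", and what the same input yields when the implied-space check is switched off.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for one rendered piece of text: the text itself plus
// the x coordinates where its baseline starts and ends. Not an iTextSharp type.
public sealed class Chunk
{
    public Chunk(string text, float startX, float endX)
    {
        Text = text;
        StartX = startX;
        EndX = endX;
    }

    public string Text { get; }
    public float StartX { get; }
    public float EndX { get; }
}

public static class ImpliedSpaceDemo
{
    // Concatenates chunks that sit on one line. When insertImpliedSpaces is true,
    // a space is added whenever the gap to the previous chunk exceeds half the
    // width of a space character and neither neighbour is already a space.
    public static string Join(IEnumerable<Chunk> chunks, float singleSpaceWidth, bool insertImpliedSpaces)
    {
        var result = new StringBuilder();
        float lastEndX = 0f;

        foreach (var chunk in chunks)
        {
            if (insertImpliedSpaces
                && result.Length > 0
                && result[result.Length - 1] != ' '
                && chunk.Text.Length > 0
                && chunk.Text[0] != ' ')
            {
                float gap = Math.Abs(chunk.StartX - lastEndX);
                if (gap > singleSpaceWidth / 2f)
                {
                    result.Append(' ');
                }
            }

            result.Append(chunk.Text);
            lastEndX = chunk.EndX;
        }

        return result.ToString();
    }

    // Invented sample data: every character is drawn separately, followed by a
    // small right-shift of 1.0 units, except for one shift of 1.6 units before "I".
    public static List<Chunk> SampleChunks()
    {
        return new List<Chunk>
        {
            new Chunk("J", 0.0f, 2.0f),
            new Chunk("O", 3.0f, 5.0f),     // gap 1.0: below the threshold
            new Chunk("I", 6.6f, 8.6f),     // gap 1.6: above the threshold of 1.5
            new Chunk("N", 9.6f, 11.6f),    // gap 1.0
            new Chunk("T", 12.6f, 14.6f),   // gap 1.0
            new Chunk(" ", 15.6f, 18.6f),   // a real, drawn space character
            new Chunk("STOCK", 19.6f, 29.6f)
        };
    }

    public static void Main()
    {
        const float spaceWidth = 3.0f; // assumed space width, so the threshold is 1.5

        Console.WriteLine(Join(SampleChunks(), spaceWidth, true));  // prints "JO INT STOCK"
        Console.WriteLine(Join(SampleChunks(), spaceWidth, false)); // prints "JOINT STOCK"
    }
}
```

Because the sample line already contains a drawn space character between the two words, turning the check off loses nothing here. That is the situation the answer describes for this particular document, and it would not hold for PDFs that separate words purely by shifting the text position.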

how to resolve this

As word separations in your PDF are created by instructions drawing a space character, you do not need the feature explained above. Thus, the easiest way to resolve the issue is to use a text extraction strategy without that feature.

You can create such a strategy by copying the source code of the SimpleTextExtractionStrategy (e.g. from here) and commenting out some lines in the method RenderText as shown below:

public virtual void RenderText(TextRenderInfo renderInfo)
{
    [...]

    if (hardReturn)
    {
        //System.out.Println("<< Hard Return >>");
        AppendTextChunk('\n');
    }
    else if (!firstRender)
    {
//        if (result[result.Length - 1] != ' ' && renderInfo.GetText().Length > 0 && renderInfo.GetText()[0] != ' ')
//        { // we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
//            float spacing = lastEnd.Subtract(start).Length;
//            if (spacing > renderInfo.GetSingleSpaceWidth() / 2f)
//            {
//                AppendTextChunk(' ');
//                //System.out.Println("Inserting implied space before '" + renderInfo.GetText() + "'");
//            }
//        }
    }
    else
    {
        //System.out.Println("Displaying first string of content '" + text + "' :: x1 = " + x1);
    }

    [...]
}

Using this simplified extraction strategy, your text is properly extracted.
