How can we extract text from a PDF using iTextSharp with spaces?


Problem description

I am using the method below to extract PDF text line by line. The problem is that it does not read the spaces between words and figures. What could be the solution for this?

I just want to create a list of strings in which each string holds one line of text from the PDF exactly as it appears there, including spaces.

public void readtextlinebyline(string filename)
{
    List<string> strlist = new List<string>();
    PdfReader reader = new PdfReader(filename);
    string text = string.Empty;
    for (int page = 1; page <= 1; page++)
    {
        text += PdfTextExtractor.GetTextFromPage(reader, page, new LocationTextExtractionStrategy()) + " ";
    }
    reader.Close();

    string[] words = text.Split('\n');
    foreach (string word in words)
    {
        strlist.Add(word);
    }

    foreach (string st in strlist)
    {
        Response.Write(st + "<br/>");
    }
}

I have also tried this method with the strategy changed to SimpleTextExtractionStrategy, but that does not work for me either.

Recommended answer

The background on why spaces between words are sometimes not properly recognized by iText(Sharp) or other PDF text extractors has been explained in this answer to "itext java pdf to text creation" (http://stackoverflow.com/questions/13644419/itext-java-pdf-to-text-creation/13645183#13645183): These 'spaces' are not necessarily created using a space character; they may instead be created by an operation that produces a small gap. Such operations are also used for other purposes (which do not break words), though, so a text extractor must use heuristics to decide whether such a gap is a word break or not.

This in particular implies that you never get 100% reliable word break detection.

What you can do, though, is improve the heuristics used.

The iText and iTextSharp standard text extraction strategies, for example, assume a word break in a line if

a) there is a space character, or

b) there is a gap at least as wide as half a space character.

Item a is a sure hit, but item b may often fail for densely set text. The OP of the question behind the answer referenced above got quite good results using a fourth of the width of a space character instead.
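To make the gap criterion concrete, here is a minimal sketch of the decision as a standalone function. It does not use iTextSharp at all: the class name WordBreakHeuristic, the method IsWordBreak and the factor parameter are hypothetical names chosen for illustration, and the strict "greater than" comparison simply mirrors the iTextSharp snippets quoted in this answer.

```csharp
using System;

// Hypothetical helper, not part of iTextSharp: isolates the "gap versus
// fraction of a space width" decision so the threshold can be varied.
public static class WordBreakHeuristic
{
    // gap:        distance between the end of the previous chunk and the
    //             start of the next one
    // spaceWidth: width of a space character in the current font
    //             (same units as gap)
    // factor:     fraction of spaceWidth the gap must exceed;
    //             0.5f mirrors the default criterion, 0.25f the
    //             "fourth of a space" variant
    public static bool IsWordBreak(float gap, float spaceWidth, float factor = 0.5f)
    {
        if (spaceWidth <= 0f)
            throw new ArgumentOutOfRangeException(nameof(spaceWidth), "space width must be positive");
        return gap > spaceWidth * factor;
    }
}
```

With a space width of 3 units and a gap of 0.9 units, as might occur in densely set text, IsWordBreak(0.9f, 3f) reports no word break under the default factor, while IsWordBreak(0.9f, 3f, 0.25f) does report one. Lowering the factor also raises the risk of splitting words at kerning gaps, so the value is a trade-off rather than a fix.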

You can tweak these criteria by copying and changing the text extraction strategy of your choice.

In the SimpleTextExtractionStrategy you find this criterion embedded in the RenderText method:

if (spacing > renderInfo.GetSingleSpaceWidth()/2f){
    AppendTextChunk(' ');
}

In the case of the LocationTextExtractionStrategy, this criterion has meanwhile been put into a method of its own:

/**
 * Determines if a space character should be inserted between a previous chunk and the current chunk.
 * This method is exposed as a callback so subclasses can fine tune the algorithm for determining whether a space should be inserted or not.
 * By default, this method will insert a space if there is a gap of more than half the font space character width between the end of the
 * previous chunk and the beginning of the current chunk.  It will also indicate that a space is needed if the starting point of the new chunk 
 * appears *before* the end of the previous chunk (i.e. overlapping text).
 * @param chunk the new chunk being evaluated
 * @param previousChunk the chunk that appeared immediately before the current chunk
 * @return true if the two chunks represent different words (i.e. should have a space between them).  False otherwise.
 */
protected bool IsChunkAtWordBoundary(TextChunk chunk, TextChunk previousChunk) {
    float dist = chunk.DistanceFromEndOf(previousChunk);
    if(dist < -chunk.CharSpaceWidth || dist > chunk.CharSpaceWidth / 2.0f)
        return true;
    return false;
}

The intention behind putting this into a method of its own was that you would merely need to subclass the strategy and override this method to adjust the heuristics criteria. This works fine for the equivalent iText Java class, but during the port to iTextSharp unfortunately no virtual was added to the declaration (as of version 5.4.4). Thus, copying the whole strategy is currently still necessary for iTextSharp.
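The subclass-and-override idea can be sketched without iTextSharp. In the block below, Chunk, LineJoiner and QuarterSpaceLineJoiner are hypothetical stand-ins, not iTextSharp types. The sketch assumes all chunks lie on one horizontal line, and it leaves out other details of the real strategy such as sorting chunks by position and detecting line changes. Its point is that the override only compiles because the base method is declared virtual, which is exactly what the iTextSharp declaration quoted above lacks.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a positioned piece of text on a single line.
public sealed class Chunk
{
    public string Text { get; }
    public float StartX { get; }
    public float EndX { get; }
    public float SpaceWidth { get; }

    public Chunk(string text, float startX, float endX, float spaceWidth)
    {
        Text = text;
        StartX = startX;
        EndX = endX;
        SpaceWidth = spaceWidth;
    }
}

// Hypothetical stand-in for a strategy that joins chunks into a line.
public class LineJoiner
{
    // 'virtual' is what allows a subclass to replace the criterion.
    // The default mirrors the half-space-width rule quoted above.
    protected virtual bool IsChunkAtWordBoundary(Chunk chunk, Chunk previousChunk)
    {
        float dist = chunk.StartX - previousChunk.EndX;
        return dist < -chunk.SpaceWidth || dist > chunk.SpaceWidth / 2.0f;
    }

    // Concatenates the chunks in order, inserting a space wherever the
    // heuristic reports a word boundary.
    public string Join(IReadOnlyList<Chunk> chunks)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < chunks.Count; i++)
        {
            if (i > 0 && IsChunkAtWordBoundary(chunks[i], chunks[i - 1]))
                sb.Append(' ');
            sb.Append(chunks[i].Text);
        }
        return sb.ToString();
    }
}

// Subclass that only swaps the threshold: a quarter of the space width.
public class QuarterSpaceLineJoiner : LineJoiner
{
    protected override bool IsChunkAtWordBoundary(Chunk chunk, Chunk previousChunk)
    {
        float dist = chunk.StartX - previousChunk.EndX;
        return dist < -chunk.SpaceWidth || dist > chunk.SpaceWidth / 4.0f;
    }
}
```

For two chunks "Hello" (x from 0 to 25) and "world" (x from 26 to 50) with a space width of 3, new LineJoiner().Join(...) yields "Helloworld" because the gap of 1 does not exceed 1.5, while new QuarterSpaceLineJoiner().Join(...) yields "Hello world" because it exceeds 0.75. In a copied iTextSharp strategy the same change amounts to editing the constant in your own copy of IsChunkAtWordBoundary.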

@Bruno You might want to tell the iText -> iTextSharp porting team about this.

While you can fine-tune text extraction at these code locations, you should be aware that you will not find a 100% reliable criterion here. Some reasons are:


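  • Gaps between words in densely set text can be smaller than kerning gaps or other gaps inserted inside words for some optical effect. Thus, there is no one-size-fits-all factor here.

  • In PDFs that do not use the space character at all (which is possible, as you can always use gaps instead), the "width of a space character" might be some random value or not determinable at all!

  • There are funny PDFs that abuse the space character width (which can be stretched individually at any time for the operations that follow) to do some tabular formatting, while using gaps for word breaks. In such a PDF the current width of a space character cannot seriously be used to determine word breaks.

  • Sometimes you find s i n g l e words in a line printed spaced out for emphasis. Most heuristics will probably parse these as a collection of one-letter words.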

You can do better than the iText heuristics, and better than variants of them that merely use other constants, by taking into account the actual visual free space between all characters (using PDF rendering or font information analysis mechanisms), but a perceivable improvement requires a considerable investment of time.
