How to remove OCR artifacts from text?

Problem description

OCR-generated texts sometimes come with artifacts, such as this one:

Diese grundsätzliche V e r b o r g e n h e i t Gottes, die sich n u r dem N a c h f o l g e r ö f f n e t , ist m i t d e m Messiasgeheimnis gemeint

While it is not unusual that spacing between letters is used for emphasis (probably due to early printing press limitations), it is unfavorable for retrieval tasks.

How can one turn the above text into a more canonical form, such as:

Diese grundsätzliche Verborgenheit Gottes, die sich nur dem Nachfolger öffnet, ist mit dem Messiasgeheimnis gemeint

Can this be done efficiently for large amounts of text?

One idea would be to concatenate the whole string (to skip guessing where the word boundaries are) and then run a text segmentation algorithm on it, perhaps something similar to this: http://norvig.com/ngrams/

Answer

If you have a dictionary for the target language, and all spaced-out words consist of just a single word, then it's easy: Just scan through the text, looking for maximal-length runs of spaced-out single letters, and replace them with the single corresponding dictionary word if it exists (and otherwise leave them unchanged).
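A minimal sketch of that scan in Python, using a regular expression to find maximal runs of space-separated single letters (the function name, the regex, and the toy dictionary are illustrative assumptions, not part of the original answer):

```python
import re

# Toy dictionary; a real one would be a large word list for the
# target language.
DICTIONARY = {'verborgenheit', 'nur'}

# A maximal run of three or more single letters separated by spaces,
# e.g. 'V e r b o r g e n h e i t'.
SPACED_RUN = re.compile(r'\b(?:\w ){2,}\w\b')

def collapse_spaced_runs(text, dictionary):
    """Join each spaced-out run into one word, but only if the
    joined form is in the dictionary; otherwise leave it unchanged."""
    def repl(match):
        joined = match.group(0).replace(' ', '')
        return joined if joined.lower() in dictionary else match.group(0)
    return SPACED_RUN.sub(repl, text)

s = 'Diese grundsätzliche V e r b o r g e n h e i t Gottes, die sich n u r dem Nachfolger öffnet'
print(collapse_spaced_runs(s, DICTIONARY))
# → Diese grundsätzliche Verborgenheit Gottes, die sich nur dem Nachfolger öffnet
```

Note that a run like 'm i t d e m' is left untouched here, since 'mitdem' is not a dictionary word — that is exactly the multi-word case discussed below.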

The only real difficulty is with strings like m i t d e m that correspond to two or more separate words. A simple way would be to greedily "nibble off" prefixes that appear in the dictionary, but this might lead to suboptimal results, and in particular to a suffix that doesn't correspond to any dictionary string even though a different choice of breakpoints would have worked (e.g. b e i m A r z t won't work if you greedily grab bei instead of beim from the front). Fortunately there's a simple linear-time DP approach that will do a better job -- and can even incorporate weights on words, which can help to get the most likely decomposition in the event that there is more than one. Given a string S[1 .. n] (with spaces removed), we will compute f(i), the score of the best decomposition of the length-i prefix of S, for all 1 <= i <= n:

f(0) = 0
f(i) = max over all 0 <= j < i of f(j) + dictScore(S[j+1 .. i])

f(n) will then be the score of the best possible decomposition of the entire string. If you set dictScore(T) to 1 for words that exist in the dictionary and 0 for words that don't, you will get a decomposition into as many words as possible; if you set dictScore(T) to, e.g., -1 for words that exist in the dictionary and -2 for words that don't, you'll get a decomposition into as few words as possible. You can also choose to award higher scores for more "likely" words.

After computing these scores, you can walk back through the DP matrix to reconstruct a decomposition that corresponds to the maximal score.
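The recurrence and the walk-back can be sketched in Python as follows (the toy dictionary, the -10 penalty for out-of-dictionary substrings, and the function names are assumptions for illustration):

```python
def segment(s, dict_score):
    """Best-scoring decomposition of s via the prefix DP:
    f[i] = max over 0 <= j < i of f[j] + dict_score(s[j:i])."""
    n = len(s)
    f = [float('-inf')] * (n + 1)
    f[0] = 0.0
    back = [0] * (n + 1)  # back[i] = split point j that achieved f[i]
    for i in range(1, n + 1):
        for j in range(i):
            score = f[j] + dict_score(s[j:i])
            if score > f[i]:
                f[i] = score
                back[i] = j
    # Walk back through the table to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(s[back[i]:i])
        i = back[i]
    return f[n], words[::-1]

# Toy scoring: +1 for dictionary words, a heavy penalty otherwise.
german = {'beim', 'bei', 'arzt'}
score = lambda w: 1.0 if w.lower() in german else -10.0
print(segment('beimArzt', score))  # → (2.0, ['beim', 'Arzt'])
```

Note how the DP correctly prefers 'beim' + 'Arzt' over the greedy 'bei' prefix. As written, the inner loop makes this O(n²); bounding j from below by i minus the longest dictionary word length recovers the linear-time behaviour the answer describes.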
