Replacing Unicode punctuation with ASCII approximations

Question

I am reading some text files in a Java program and would like to replace some Unicode characters with ASCII approximations. These files will eventually be broken into sentences that are fed to OpenNLP. OpenNLP does not recognize these Unicode characters and gives improper results on a number of symbols (it tokenizes "girl's" as "girl" and "'s", but if the apostrophe is a Unicode quote, the whole word is treated as a single token).

For example, the source sentence may contain the Unicode directional quotation mark U+2018 (‘), and I would like to convert that to U+0027 ('). Eventually I will be stripping the remaining non-ASCII characters.

I understand that I am losing information, and I know that I could write regular expressions to convert each of these symbols, but I am asking whether there is existing code I can reuse to convert them.

This is what I could come up with, but I'm sure I will make mistakes, miss things, etc.:

    // double quotation (")
    replacements.add(new Replacement(Pattern.compile("[\u201c\u201d\u201e\u201f\u275d\u275e]"), "\""));

    // single quotation (')
    replacements.add(new Replacement(Pattern.compile("[\u2018\u2019\u201a\u201b\u275b\u275c]"), "'"));

replacements is a collection of a custom class, Replacement, that I later iterate over to apply each replacement:

    for (Replacement replacement : replacements) {
         text = replacement.pattern.matcher(text).replaceAll(replacement.replacement);
    }

As you can see, I had to find:


  • LEFT SINGLE QUOTATION MARK
  • RIGHT SINGLE QUOTATION MARK
  • SINGLE LOW-9 QUOTATION MARK (what is this/should I replace this?)
  • SINGLE HIGH-REVERSED-9 QUOTATION MARK (what is this/should I replace this?)

Recommended answer

Each Unicode character is assigned a general category. There are two separate categories for quotes:

  • Pi: Punctuation, Initial quote
  • Pf: Punctuation, Final quote

With these lists, you should be able to handle all quotes appropriately if you would like to code the regex manually.

In Java, Character.getType gives you the category of a character, for example FINAL_QUOTE_PUNCTUATION.

Now you can get the category of each (punctuation) character and replace it with an appropriate ASCII substitute.
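A minimal sketch of that idea, in case it helps. The class and method names (QuoteCategoryNormalizer, normalizeQuotes) are made up for illustration. One assumption to flag: the category alone does not say whether a quote is single or double, so this sketch looks at the character's Unicode name via Character.getName and leaves any Pi/Pf character untouched unless its name contains "QUOTATION MARK". Also note that, as far as I know, the low-9 marks (U+201A, U+201E) are in the Open Punctuation category and the U+275B to U+275E ornaments are in Other Symbol, so a check on the two quote categories will not catch them.

```java
final class QuoteCategoryNormalizer {

    /** Maps Pi/Pf characters whose name contains QUOTATION MARK to ASCII quotes. */
    static String normalizeQuotes(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // Walk the string by code point so supplementary characters stay intact.
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            int type = Character.getType(cp);
            boolean quoteCategory = type == Character.INITIAL_QUOTE_PUNCTUATION
                    || type == Character.FINAL_QUOTE_PUNCTUATION;
            // Assumption: the Unicode name tells single and double quotes apart.
            String name = quoteCategory ? Character.getName(cp) : null;
            if (name != null && name.contains("QUOTATION MARK")) {
                out.append(name.contains("DOUBLE") ? '"' : '\'');
            } else {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalizeQuotes("\u201cThe girl\u2019s book\u201d"));
    }
}
```

Compared with a hand-written character class, this picks up quote characters you did not think of (angle quotation marks, for instance), which may or may not be what you want before tokenizing.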

You can use the other punctuation categories accordingly. In 'Punctuation, Other' there are some characters, for example PRIME (′), which you may also want to substitute with an apostrophe.
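For characters like these, which fall outside the two quote categories, a small explicit table is probably the simplest route. The sketch below is only an illustration: the class name PunctuationFallback, its methods, and the particular entries in the table are my own picks, not a complete or authoritative list. It needs Java 9 or later for Map.of. The second method shows one way to do the final "strip whatever non-ASCII is left" step mentioned in the question.

```java
import java.util.Map;

final class PunctuationFallback {

    // Hypothetical, hand-picked table of code point -> ASCII stand-in; extend as needed.
    private static final Map<Integer, String> EXTRA = Map.of(
            0x2032, "'",   // PRIME
            0x2033, "\"",  // DOUBLE PRIME
            0x201A, "'",   // SINGLE LOW-9 QUOTATION MARK
            0x201E, "\""   // DOUBLE LOW-9 QUOTATION MARK
    );

    /** Replaces every code point found in the table; everything else is copied as-is. */
    static String replaceExtras(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            String ascii = EXTRA.get(cp);
            if (ascii != null) {
                out.append(ascii);
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    /** Drops every character outside the 7-bit ASCII range. */
    static String stripNonAscii(String text) {
        return text.replaceAll("[^\\x00-\\x7F]", "");
    }

    public static void main(String[] args) {
        System.out.println(stripNonAscii(replaceExtras("5\u2032 10\u2033 caf\u00e9")));
    }
}
```

Running the replacements before the stripping step matters: stripping first would delete the quotes and primes instead of converting them.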
