Replacing Unicode punctuation with ASCII approximations

Views: 137 | Published: 2018/11/28 20:08:44 | Tags: java, unicode, ascii

Problem description

I am reading some text files in a Java program and would like to replace some Unicode characters with ASCII approximations. These files will eventually be broken into sentences that are fed to OpenNLP. OpenNLP does not recognize these Unicode characters and gives improper results on a number of symbols (it tokenizes "girl's" as "girl" and "'s", but if the apostrophe is a Unicode quote, the whole word is treated as a single token).

For example, the source sentence may contain the Unicode directional quotation mark U+2018 (‘), and I would like to convert that to U+0027 ('). Eventually I will strip whatever non-ASCII characters remain.
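
The two steps described above (map a directional quote to its ASCII counterpart, then drop whatever non-ASCII characters are left) could be sketched roughly as follows. The class and method names (`AsciiFallback`, `convertLeftQuote`, `stripNonAscii`) are hypothetical and chosen only for this illustration; the stripping step simply assumes that "remaining Unicode" means every character outside the 7-bit ASCII range.

```java
class AsciiFallback {
    // Step 1: map LEFT SINGLE QUOTATION MARK (U+2018) to the ASCII apostrophe (U+0027).
    static String convertLeftQuote(String text) {
        return text.replace('\u2018', '\'');
    }

    // Step 2: drop every character outside the 7-bit ASCII range (assumed meaning of "stripping").
    static String stripNonAscii(String text) {
        return text.replaceAll("[^\\x00-\\x7F]", "");
    }

    public static void main(String[] args) {
        // Input contains U+2018 around "hi" (opening side only) and U+00E9 in "cafe".
        String text = "\u2018hi' caf\u00e9";
        String converted = convertLeftQuote(text);
        System.out.println(stripNonAscii(converted)); // prints: 'hi' caf
    }
}
```

Running the conversion before the stripping step matters: if the order were reversed, the directional quote would be deleted instead of approximated.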

I understand that I am losing information, and I know that I could write regular expressions to convert each of these symbols myself, but I am asking whether there is existing code I can reuse to convert some of them.

This is what I could come up with, but I'm sure I will make mistakes, miss things, etc.:

    // double quotation (")
    replacements.add(new Replacement(Pattern.compile("[\u201c\u201d\u201e\u201f\u275d\u275e]"), "\""));

    // single quotation (')
    replacements.add(new Replacement(Pattern.compile("[\u2018\u2019\u201a\u201b\u275b\u275c]"), "'"));

Replacement is a custom class; replacements is a collection of those objects that I later iterate over to apply each replacement:

    // apply each pattern in turn; later patterns see the output of earlier ones
    for (Replacement replacement : replacements) {
        text = replacement.pattern.matcher(text).replaceAll(replacement.replacement);
    }
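
For readers who want to run the fragments above, here is a minimal self-contained sketch. The `Replacement` class below is a hypothetical stand-in, since the original definition is not shown; it only assumes the two fields (`pattern` and `replacement`) that the loop uses. The wrapper class `QuoteReplacer` and the method `toAscii` are likewise names invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class QuoteReplacer {
    // Hypothetical stand-in for the custom Replacement class (original definition not shown).
    static final class Replacement {
        final Pattern pattern;
        final String replacement;

        Replacement(Pattern pattern, String replacement) {
            this.pattern = pattern;
            this.replacement = replacement;
        }
    }

    static final List<Replacement> REPLACEMENTS = new ArrayList<>();

    static {
        // double quotation (")
        REPLACEMENTS.add(new Replacement(
                Pattern.compile("[\u201c\u201d\u201e\u201f\u275d\u275e]"), "\""));
        // single quotation (')
        REPLACEMENTS.add(new Replacement(
                Pattern.compile("[\u2018\u2019\u201a\u201b\u275b\u275c]"), "'"));
    }

    // Applies every replacement in order and returns the converted text.
    static String toAscii(String text) {
        for (Replacement r : REPLACEMENTS) {
            text = r.pattern.matcher(text).replaceAll(r.replacement);
        }
        return text;
    }

    public static void main(String[] args) {
        // Input uses U+201C, U+2019 and U+201D; output uses only ASCII quotes.
        System.out.println(toAscii("\u201cthe girl\u2019s book\u201d")); // prints: "the girl's book"
    }
}
```

Because each pattern is a single character class with a fixed replacement, the order of the two entries does not affect the result here; it would start to matter only if one replacement produced characters that another pattern matches.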

As you can see, I had to find:

LEFT SINGLE QUOTATION MARK
RIGHT SINGLE QUOTATION MARK
SINGLE LOW-9 QUOTATION MARK (what is this/should I replace this?)
SINGLE HIGH-REVERSED-9 QUOTATION MARK (what is this/should I replace this?)
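
One way to double-check which characters a character class actually contains is to print the official Unicode name of each code point. The sketch below relies on the standard `Character.getName(int)` method (available since Java 7); the class name `QuoteNames` and the helper `describe` are hypothetical names used only for this illustration.

```java
class QuoteNames {
    // Formats a code point as "U+XXXX NAME" using the JDK's built-in Unicode name table.
    static String describe(int codePoint) {
        return String.format("U+%04X %s", codePoint, Character.getName(codePoint));
    }

    public static void main(String[] args) {
        // The code points from the single-quotation character class above.
        int[] singleQuotes = {0x2018, 0x2019, 0x201A, 0x201B, 0x275B, 0x275C};
        for (int codePoint : singleQuotes) {
            System.out.println(describe(codePoint));
        }
    }
}
```

The first line printed is `U+2018 LEFT SINGLE QUOTATION MARK`; the remaining lines make it easier to decide, name by name, whether each character belongs in the replacement set.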
