How to remove non-valid unicode characters from strings in java

Question

I am using the CoreNLP Neural Network Dependency Parser to parse some social media content. Unfortunately, the file contains characters which, according to fileformat.info, are not valid Unicode characters or are Unicode replacement characters. These are, for example, U+D83D or U+FFFD. If those characters are in the file, CoreNLP responds with error messages like this one:

Nov 15, 2015 5:15:38 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+D83D, decimal: 55357)
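
For reference, U+D83D is a high-surrogate code unit, which is only valid as the first half of a surrogate pair; on its own it is not a valid character. A minimal sketch for locating such lone code units with the standard java.lang.Character API (assuming document holds the file content as a string, as in the snippets below):

// flag lone surrogate code units and U+FFFD in the document string
for (int i = 0; i < document.length(); i++) {
    char c = document.charAt(i);
    boolean loneHigh = Character.isHighSurrogate(c)
            && (i + 1 >= document.length() || !Character.isLowSurrogate(document.charAt(i + 1)));
    boolean loneLow = Character.isLowSurrogate(c)
            && (i == 0 || !Character.isHighSurrogate(document.charAt(i - 1)));
    if (loneHigh || loneLow || c == '\uFFFD') {
        System.err.printf("suspect code unit U+%04X at index %d%n", (int) c, i);
    }
}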

Based on this answer, I tried document.replaceAll("\\p{C}", ""); to just remove those characters. document here is just the document as a string. But that didn't help.
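
One plausible reason it didn't help: \p{C} covers the C categories (control, format, surrogate, private use, unassigned), but U+FFFD, the replacement character, has general category So (Symbol, other) and is therefore never matched. A quick check:

String test = "a\uFFFDb";
System.out.println(test.replaceAll("\\p{C}", "").contains("\uFFFD")); // true: U+FFFD survives \p{C}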

How can I remove those characters from the string before passing it to coreNLP?

Update (November 16):

For the sake of completeness, I should mention that I asked this question only in order to avoid the huge number of error messages by preprocessing the file. CoreNLP just ignores characters it can't handle, so that is not the problem.

Answer

In a way, both answers provided by Mukesh Kumar and GsusRecovery are helpful, but not fully correct.

document.replaceAll("[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]", "");

seems to replace all invalid characters. But CoreNLP apparently fails on even more characters than those. I identified them manually by running the parser on my whole corpus, which led to this:

document.replaceAll("[\\uD83D\\uFFFD\\uFE0F\\u203C\\u3010\\u3011\\u300A\\u166D\\u200C\\u202A\\u202C\\u2049\\u20E3\\u300B\\u300C\\u3030\\u065F\\u0099\\u0F3A\\u0F3B\\uF610\\uFFFC]", "");

So right now I am running two replaceAll() calls before handing the document to the parser. The complete code snippet is:

import java.io.StringReader;
import java.util.List;

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.TaggedWord;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.trees.GrammaticalStructure;

// remove invalid unicode characters
String tmpDoc1 = document.replaceAll("[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]", "");
// remove other unicode characters coreNLP can't handle
String tmpDoc2 = tmpDoc1.replaceAll("[\\uD83D\\uFFFD\\uFE0F\\u203C\\u3010\\u3011\\u300A\\u166D\\u200C\\u202A\\u202C\\u2049\\u20E3\\u300B\\u300C\\u3030\\u065F\\u0099\\u0F3A\\u0F3B\\uF610\\uFFFC]", "");
DocumentPreprocessor tokenizer = new DocumentPreprocessor(new StringReader(tmpDoc2));
for (List<HasWord> sentence : tokenizer) {
    List<TaggedWord> tagged = tagger.tagSentence(sentence);
    GrammaticalStructure gs = parser.predict(tagged);
    System.err.println(gs);
}
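
For completeness: tagger and parser in the snippet above are assumed to be initialized elsewhere, for example as follows (the tagger model path varies between CoreNLP releases, so treat it as a placeholder):

// assumed setup; uses edu.stanford.nlp.tagger.maxent.MaxentTagger
// and edu.stanford.nlp.parser.nndep.DependencyParser
MaxentTagger tagger = new MaxentTagger("edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger");
DependencyParser parser = DependencyParser.loadFromModelFile(DependencyParser.DEFAULT_MODEL);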

This is not necessarily a complete list of unsupported characters, though, which is why I opened an issue on the stanfordnlp/CoreNLP GitHub repository.

Please note that CoreNLP automatically removes those unsupported characters. The only reason I want to preprocess my corpus is to avoid all those error messages.

Update (November 27):

Christopher Manning just answered the GitHub issue I opened. There are several ways to handle those characters using the class edu.stanford.nlp.process.TokenizerFactory. Take this code example, which tokenizes a document:

DocumentPreprocessor tokenizer = new DocumentPreprocessor(new StringReader(document));
TokenizerFactory<? extends HasWord> factory = PTBTokenizer.factory();
factory.setOptions("untokenizable=noneDelete");
tokenizer.setTokenizerFactory(factory);

for (List<HasWord> sentence : tokenizer) {
    // do something with the sentence
}

You can replace noneDelete in the setOptions call with other options. I am citing Manning:


"(...) the complete set of six options combining whether to log a warning for none, the first, or all, and whether to delete them or to include them as single character tokens in the output: noneDelete, firstDelete, allDelete, noneKeep, firstKeep, allKeep."

That means that the best way to keep the characters without getting all those error messages is to use the option noneKeep. This approach is far more elegant than any attempt to remove those characters.
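
Applied to the snippet above, that is a one-word change in the setOptions call:

factory.setOptions("untokenizable=noneKeep"); // keep the characters as tokens, log no warnings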
