Removing characters of a specific Unicode range from a string

Problem description

I have a program that parses tweets in real time from the Twitter streaming API. Before storing them, I encode them as UTF-8. Certain characters end up in the string as ?, ??, or ??? instead of their respective Unicode code points, and this causes problems. On further investigation, I found that the problematic characters come from the "Emoticons" block, U+1F600 - U+1F64F, and the "Miscellaneous Symbols and Pictographs" block, U+1F300 - U+1F5FF. I tried removing them, but without success: the matcher ended up replacing almost every character in the string, not just the ones in my desired Unicode range.

String utf8tweet = "";
try {
    byte[] utf8Bytes = status.getText().getBytes("UTF-8");
    utf8tweet = new String(utf8Bytes, "UTF-8");
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
Pattern unicodeOutliers = Pattern.compile("[\\u1f300-\\u1f64f]", Pattern.UNICODE_CASE | Pattern.CANON_EQ | Pattern.CASE_INSENSITIVE);
Matcher unicodeOutlierMatcher = unicodeOutliers.matcher(utf8tweet);
utf8tweet = unicodeOutlierMatcher.replaceAll(" ");

What can I do to remove these characters?

Recommended answer

In the regex pattern, add the negation operator ^. To keep only ASCII characters, you can use the expression [^\\x00-\\x7F], which matches every character outside the range U+0000 - U+007F. Replacing those matches removes the emoji, along with any other non-ASCII character, and should give the desired result.

import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UTF8 {
    public static void main(String[] args) {
        // The escaped surrogate pair below is U+1F600. It stands in for the
        // emoji of the original post, which was lost from this copy of the
        // page (an assumption).
        String tweet = "#Hello twitter \uD83D\uDE00 How are you?";

        // Same round trip as in the question; for a well-formed Java String
        // it changes nothing.
        byte[] utf8Bytes = tweet.getBytes(StandardCharsets.UTF_8);
        String utf8tweet = new String(utf8Bytes, StandardCharsets.UTF_8);

        // Match every code point outside U+0000 - U+007F. No flags are
        // needed for a plain character range.
        Pattern unicodeOutliers = Pattern.compile("[^\\x00-\\x7F]");
        Matcher unicodeOutlierMatcher = unicodeOutliers.matcher(utf8tweet);

        System.out.println("Before: " + utf8tweet);
        utf8tweet = unicodeOutlierMatcher.replaceAll(" ");
        System.out.println("After: " + utf8tweet);
    }
}

The result is as follows:

Before: #Hello twitter 😀 How are you?
After: #Hello twitter   How are you?
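
The pattern above strips every non-ASCII character, which is broader than the two blocks named in the question. If the goal is to drop only U+1F300 - U+1F64F, one option is Java's \x{h...h} escape, which accepts code points above U+FFFF. The question's "[\\u1f300-\\u1f64f]" most likely over-matched because the \u escape reads exactly four hex digits, so the class was parsed as U+1F30, the range '0' to U+1F64, and a literal 'f'. The following is a minimal sketch under those assumptions; the class name EmojiRangeFilter, the method stripEmojiBlocks, and the sample tweet are hypothetical.

```java
import java.util.regex.Pattern;

public class EmojiRangeFilter {
    // \x{h...h} can name any code point, including those above U+FFFF.
    // The range covers both blocks from the question, which are contiguous:
    // U+1F300 - U+1F5FF and U+1F600 - U+1F64F.
    private static final Pattern EMOJI_BLOCKS =
            Pattern.compile("[\\x{1F300}-\\x{1F64F}]");

    // Hypothetical helper: replaces each code point in the range with a space.
    static String stripEmojiBlocks(String text) {
        return EMOJI_BLOCKS.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        // Sample input (an assumption): U+1F600 written as a surrogate pair,
        // followed by two accented vowels that should survive.
        String tweet = "#Hello twitter \uD83D\uDE00 How are you? \u00E1 \u00E9";
        System.out.println("Before: " + tweet);
        System.out.println("After: " + stripEmojiBlocks(tweet));
    }
}
```

Unlike the ASCII-only pattern, this leaves accented letters and other scripts untouched, so it may be the better fit when tweets are not exclusively in English.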

Edit

To explain further, you can also express the range in the \u form, as [^\\u0000-\\u007F], which matches every character that is not one of the first 128 Unicode characters (the same as before). If you want to extend the range to keep additional characters, you can look up their code points in a Unicode character list.

For example, if you want to keep accented vowels (used in Spanish), extend the range to \u00FF, which gives [^\\u0000-\\u00FF] or [^\\x00-\\xFF]:

Before: #Hello twitter 😀 How are you? á é í ó ú
After: #Hello twitter   How are you? á é í ó ú
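
The same "keep everything up to U+00FF" rule can also be written without a regex, by walking the string's code points. This is only a sketch of an alternative, not part of the original answer; the class name Latin1Filter and the method replaceOutsideLatin1 are hypothetical.

```java
public class Latin1Filter {
    // Hypothetical helper mirroring the pattern [^\x00-\xFF] replaced by " ":
    // code points up to U+00FF are kept, every other code point becomes
    // a single space.
    static String replaceOutsideLatin1(String text) {
        return text.codePoints()
                .map(cp -> cp <= 0xFF ? cp : ' ')
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        // Sample input (an assumption): accented vowels plus U+1F600
        // written as a surrogate pair.
        String tweet = "\u00E1 \u00E9 \uD83D\uDE00 ok";
        System.out.println(replaceOutsideLatin1(tweet));
    }
}
```

Because codePoints() yields one value per code point, a character outside the Basic Multilingual Plane is replaced by one space rather than two, which matches the regex behavior shown above. Changing the comparison in map is enough to keep a wider or narrower range.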
