Java - read UTF-8 file with a single emoji symbol

Question

I have a file with a single unicode symbol.
The file is encoded in UTF-8.
It contains a single symbol represented as 4 bytes.
https://www.fileformat.info/info/unicode/char/1f60a/index.htm

F0 9F 98 8A

When I read the file I get two symbols/chars.

The program below prints:

?
2
?
?
55357
56842
======================================
&#55357;&#56842;
16
&
======================================
?
2
?
======================================

Is this normal... or a bug? Or am I misusing something?
How do I get that single emoji symbol in my code?

And... how do I escape it to XML?

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public class Test008 {

    public static void main(String[] args) throws Exception{
        BufferedReader in = new BufferedReader(
                   new InputStreamReader(
                              new FileInputStream("D:\\DATA\\test1.txt"), "UTF8"));
        
        String s = "";
        while ((s = in.readLine()) != null) {
            System.out.println(s);
            System.out.println(s.length());
            System.out.println(s.charAt(0));
            System.out.println(s.charAt(1));
            
            System.out.println((int)(s.charAt(0)));
            System.out.println((int)(s.charAt(1)));
            
            String z = org.apache.commons.lang.StringEscapeUtils.escapeXml(s);
            String z3 = org.apache.commons.lang3.StringEscapeUtils.escapeXml(s);
            
            System.out.println("======================================");
            System.out.println(z);
            System.out.println(z.length());
            System.out.println(z.charAt(0));
            
            System.out.println("======================================");
            System.out.println(z3);
            System.out.println(z3.length());
            System.out.println(z3.charAt(0));
            
            System.out.println("======================================");

        }

        in.close();
    }

}

Recommended answer

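Yes, this is normal. The symbol is a single Unicode code point (U+1F60A), but Java strings are sequences of UTF-16 chars, and a code point above U+FFFF takes 2 UTF-16 chars, a surrogate pair (1 char is 2 bytes). That is why s.length() is 2 and the two chars print as 55357 (0xD83D) and 56842 (0xDE0A).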

int codePoint = s.codePointAt(0); // The full code point, decoded from the surrogate pair.
System.out.printf("U+%04X, chars: %d%n", codePoint, Character.charCount(codePoint));

U+1F60A, chars: 2
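
To make the difference between chars and code points concrete, here is a minimal sketch that decodes the four bytes from the question and inspects the result. The class name CodePointDemo and the helpers countCodePoints and describe are made up for illustration; only standard JDK calls are used.

```java
import java.nio.charset.StandardCharsets;

class CodePointDemo {

    /** Number of Unicode code points (not UTF-16 chars) in s. */
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Lists every code point of s in U+XXXX notation, separated by spaces. */
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // The four UTF-8 bytes from the question: F0 9F 98 8A
        byte[] utf8 = {(byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x8A};
        String s = new String(utf8, StandardCharsets.UTF_8);

        System.out.println(s.length());          // 2 (UTF-16 chars)
        System.out.println(countCodePoints(s));  // 1 (code point)
        System.out.println(describe(s));         // U+1F60A

        // The "single symbol" as a String of its own, rebuilt from the code point.
        String emoji = new String(Character.toChars(s.codePointAt(0)));
        System.out.println(emoji.equals(s));     // true
    }
}
```

So the single emoji is best handled either as an int code point or as a String holding both surrogate chars; a lone char cannot represent it.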


After comments

In Java, using streams:

public static String escapeToAsciiHTML(String s) {
    StringBuilder sb = new StringBuilder();
    s.codePoints().forEach(cp -> {
        if (cp < 128) {
            sb.append((char) cp);
        } else{
            sb.append("&#").append(cp).append(";");
        }
    });
    return sb.toString();
}
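
The method above leaves ASCII characters such as & and < untouched, which is fine for the emoji alone but not for arbitrary text placed into XML. A possible XML-oriented variant, as a sketch only: the class XmlEscapeDemo and the method escapeToAsciiXml are hypothetical names, and the choice of hexadecimal character references is an assumption, not something the answer prescribes. It does not filter control characters that XML 1.0 forbids.

```java
class XmlEscapeDemo {

    /**
     * Escapes the five XML special characters and replaces every
     * non-ASCII code point with a hexadecimal character reference.
     */
    static String escapeToAsciiXml(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> {
            switch (cp) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:
                    if (cp < 128) {
                        sb.append((char) cp);
                    } else {
                        // One reference per code point, never one per surrogate char.
                        sb.append("&#x")
                          .append(Integer.toHexString(cp).toUpperCase())
                          .append(';');
                    }
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "a<\uD83D\uDE0A>";
        System.out.println(escapeToAsciiXml(s)); // a&lt;&#x1F60A;&gt;
    }
}
```

Because the loop iterates over code points, the emoji becomes the single reference &#x1F60A; (decimal &#128522; with the method above), not the two surrogate references &#55357;&#56842; seen in the question's output, which are not valid XML character references.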
