Remove non-UTF-8 characters from XML with declared encoding=utf-8 - Java


Question

I have to handle this scenario in Java:

I'm getting a request in XML form from a client with declared encoding=utf-8. Unfortunately, it may contain non-UTF-8 characters, and there is a requirement to remove these characters from the XML on my side (legacy).

Let's consider an example where this invalid XML contains £ (pound).

1) I get the XML as a Java String with £ in it (I don't have access to the interface right now, but I will probably get the XML as a Java String). Can I use replaceAll("£", "") to get rid of this character? Any potential issues?

2) I get the XML as an array of bytes. How do I handle this operation safely in that case?

Solution

1) I get the XML as a Java String with £ in it (I don't have access to the interface right now, but I will probably get the XML as a Java String). Can I use replaceAll("£", "") to get rid of this character?

I assume you actually want to get rid of non-ASCII characters, because you're talking about a "legacy" side. You can remove anything outside the printable ASCII range (0x20 to 0x7E) using the following regex:

string = string.replaceAll("[^\\x20-\\x7e]", "");
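As a quick illustration, here is a minimal, self-contained sketch of that regex applied to a string containing £. The class name `AsciiFilterDemo`, the helper `stripNonPrintableAscii`, and the sample XML are made up for this example and are not part of the original answer.

```java
public class AsciiFilterDemo {
    // Hypothetical helper: keeps only printable ASCII (0x20-0x7E).
    static String stripNonPrintableAscii(String s) {
        return s.replaceAll("[^\\x20-\\x7e]", "");
    }

    public static void main(String[] args) {
        // \u00a3 is the pound sign; the trailing \n is a line break.
        String xml = "<price>\u00a3100</price>\n";
        // prints <price>100</price>
        System.out.println(stripNonPrintableAscii(xml));
    }
}
```

Note that tabs, carriage returns and line feeds are below 0x20, so this character class removes them as well. If the XML is held in one multi-line string, all of its lines are joined together.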

2) I get the XML as an array of bytes. How do I handle this operation safely in that case?

You need to wrap the byte[] in a ByteArrayInputStream so that you can read it as a UTF-8 encoded character stream using an InputStreamReader, in which you specify the encoding. Then use a BufferedReader to read it line by line.

E.g.

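BufferedReader reader = null;
try {
    reader = new BufferedReader(new InputStreamReader(new ByteArrayInputStream(bytes), "UTF-8"));
    for (String line; (line = reader.readLine()) != null;) {
        // Drop everything outside the printable ASCII range (0x20-0x7E).
        line = line.replaceAll("[^\\x20-\\x7e]", "");
        // ...
    }
    // ...
} finally {
    // Close the reader even if reading fails.
    if (reader != null) try { reader.close(); } catch (IOException ignore) {}
}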
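If the goal is to drop only the bytes that are not well-formed UTF-8, while keeping correctly encoded non-ASCII characters, a different technique than the regex above may fit better: a `CharsetDecoder` configured with `CodingErrorAction.IGNORE`. The sketch below is one possible approach under that assumption, not the answer's method. The class name `Utf8Sanitizer` and the helper `decodeDroppingInvalid` are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Sanitizer {
    // Hypothetical helper: decodes the bytes as UTF-8 and silently drops
    // any byte sequences that are not well-formed UTF-8.
    static String decodeDroppingInvalid(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Not expected with IGNORE, but decode() declares it.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // 0xA3 is the pound sign in ISO-8859-1; a lone 0xA3 is not valid UTF-8.
        byte[] bytes = {'<', 'a', '>', (byte) 0xA3, '1', '<', '/', 'a', '>'};
        // prints <a>1</a>
        System.out.println(decodeDroppingInvalid(bytes));
    }
}
```

This only works on the raw bytes. An `InputStreamReader` replaces malformed input with U+FFFD while decoding, so once the XML has become a String the original invalid bytes can no longer be told apart. If the legacy side needs pure ASCII, the regex can still be applied to the decoded result.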
