Convert Unicode to ASCII without changing the string length (in Java)

Question

What is the best way to convert a string from Unicode to ASCII without changing its length (that is very important in my case)? Also, the characters that convert without problems must stay at the same positions as in the original string. So an "Ä" must be converted to "A" and not to something cryptic that has more characters.

Edit:

@novalis - Such symbols (for example, those of Asian languages) should just be converted to some placeholder. I am not too interested in those words or what they mean.

@MtnViewMark - I must preserve the total number of characters and the positions of the characters that are available in ASCII under any circumstance.

Here is some more info: I have some text mining tools that can only process ASCII strings. Most of the text to be processed is in English, but some of it contains non-ASCII characters. I am not interested in those words, but I must be sure that the words I am interested in (those that contain only ASCII characters) are at the same positions after the string conversion.

Recommended answer

As stated in this answer, the following code should work:

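    import java.nio.charset.StandardCharsets;
    import java.text.Normalizer;

    String s = "口水雞 hello Ä";

    // Decompose accented characters into a base letter plus combining marks
    String s1 = Normalizer.normalize(s, Normalizer.Form.NFKD);
    // Matches combining diacritical marks, modifier letters and modifier symbols
    String regex = "[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+";

    // Strip the marks, then round-trip through ASCII; unmappable characters become '?'
    String stripped = s1.replaceAll(regex, "");
    String s2 = new String(stripped.getBytes(StandardCharsets.US_ASCII), StandardCharsets.US_ASCII);

    System.out.println(s2);
    System.out.println(s.length() == s2.length());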

Output:

??? hello A
true

So you first remove the diacritical marks, then convert to ASCII. Non-ASCII characters will become question marks.
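The snippet above preserves the length for this particular input, but NFKD can expand a single character into several (for example, the ligature "ﬁ" decomposes to "fi"), which would shift every later position. A minimal sketch of one way to guard against that is to fold one `char` at a time and fall back to a placeholder whenever the result is not exactly one ASCII character. The class name `AsciiFolder`, the method `fold`, and the choice of `'?'` as the placeholder are assumptions made for illustration, not part of the original answer.

```java
import java.text.Normalizer;

class AsciiFolder {

    // Hypothetical helper: maps every UTF-16 char of the input to exactly one
    // ASCII char, so the result always has the same length() as the input.
    static String fold(String input, char placeholder) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c < 0x80) {
                out.append(c); // already ASCII: keep it at the same position
                continue;
            }
            if (Character.isSurrogate(c)) {
                out.append(placeholder); // each half of a surrogate pair gets its own placeholder
                continue;
            }
            // Decompose this single char and drop all combining marks
            String base = Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFKD)
                    .replaceAll("\\p{M}+", "");
            if (base.length() == 1 && base.charAt(0) < 0x80) {
                out.append(base.charAt(0)); // e.g. "Ä" becomes "A"
            } else {
                out.append(placeholder); // expanded, emptied, or still non-ASCII
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String s = "口水雞 hello Ä";
        String folded = fold(s, '?');
        System.out.println(folded); // prints "??? hello A"
        System.out.println(s.length() == folded.length()); // prints "true"
    }
}
```

Because every input `char` yields exactly one output `char`, offsets computed on the folded string can be applied directly to the original. The trade-off is that characters with a multi-character ASCII decomposition are replaced by the placeholder instead of being spelled out.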
