How to remove accents (A-Umlaut to A)


Question

Is there a method to replace special characters like Ä (A-Umlaut) with
A, Ö (O-Umlaut) with O, and so on?
Sure, I could look for each character separately and replace it with its
ASCII counterpart, but there are also such special characters in French
and Swedish and many other languages which I also want to catch. Is
there a generic way to do it?

Recommended answer

On Tue, 07 Aug 2007 14:05:46 +0200, cody <de********@gmx.de> wrote:

> Is there a method to replace special characters like Ä (A-Umlaut) with
> A, Ö (O-Umlaut) with O, and so on?
> Sure, I could look for each character separately and replace it with its
> ASCII counterpart, but there are also such special characters in French
> and Swedish and many other languages which I also want to catch. Is
> there a generic way to do it?



Hi Cody,

There is no generic way to do this. There is a hack that works in most cases: encode the string and read it back in a different encoding. It is by no means guaranteed to work for you, though. Your best bet is to create a lookup table and manually translate each character. If you anticipate a wide variety of characters, supporting Unicode or UTF-8 may be the better option.

--
Happy coding!
Morten Wennevik [C# MVP]
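
For illustration, the lookup-table approach suggested above might look roughly like the following. This is only a minimal sketch: the class name `LookupTableDemo`, the method name `ReplaceWithTable`, and the contents of the table are assumptions made for the example, and the table is deliberately incomplete. It also assumes the input uses precomposed characters (a single code point per accented letter).

```csharp
using System.Collections.Generic;
using System.Text;

static class LookupTableDemo
{
    // Hypothetical, deliberately incomplete table; extend it for the
    // languages you need to cover.
    static readonly Dictionary<char, string> Replacements =
        new Dictionary<char, string>
        {
            { '\u00C4', "A" }, // Ä
            { '\u00D6', "O" }, // Ö
            { '\u00DC', "U" }, // Ü
            { '\u00E4', "a" }, // ä
            { '\u00F6', "o" }, // ö
            { '\u00FC', "u" }, // ü
            { '\u00E9', "e" }, // é
            { '\u00E5', "a" }, // å
        };

    public static string ReplaceWithTable(string input)
    {
        StringBuilder builder = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            string replacement;
            if (Replacements.TryGetValue(c, out replacement))
            {
                // Character has an entry in the table: use its replacement.
                builder.Append(replacement);
            }
            else
            {
                // Anything not listed passes through unchanged.
                builder.Append(c);
            }
        }
        return builder.ToString();
    }
}
```

Calling `LookupTableDemo.ReplaceWithTable("Ärger")` would give `"Arger"`. The obvious drawback is the one raised in the question: every character has to be listed by hand.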


Morten Wennevik [C# MVP] <Mo************@hotmail.com> wrote:

> There is no generic way to do this. There is a hack that works in
> most cases involving switching Encoding the string and reading it in
> a different encoding, but this is by no means ensured to work for
> you. Your best bet is to create a lookup table and manually translate
> each character. If you anticipate a wide variety of characters, maybe
> Unicode or UTF-8 support is best.



Actually, as of .NET 2.0 there *is* a way of doing this using
System.Text.NormalizationForm.

Look at
http://groups.google.com/group/micro...neral/tree/browse_frm/thread/78a09bd184351bc5/99f090af662c126c?rnum=11
(the last response, from Chris Mullins).

Here's the code posted, which does some upper-casing which isn't needed
in this case - but it should be okay aside from that.

Original code:

string s = "á???ò?:usdBDlGXHHA";
// Decompose accented letters into base letter + combining mark.
string normalized = s.Normalize(NormalizationForm.FormKD);
// ASCII encoding that silently drops anything it cannot represent.
Encoding ascii = Encoding.GetEncoding(
    "us-ascii",
    new EncoderReplacementFallback(string.Empty),
    new DecoderReplacementFallback(string.Empty));
byte[] encodedBytes = new byte[ascii.GetByteCount(normalized)];
int numberOfEncodedBytes = ascii.GetBytes(normalized, 0,
                                          normalized.Length,
                                          encodedBytes, 0);
string newString = ascii.GetString(encodedBytes).ToUpper();
MessageBox.Show(newString);

End of original code.
Here's a slightly simpler (IMO) version:

// Needs: using System.Text;
static string RemoveAccents (string input)
{
    // Split each accented letter into a base letter plus combining marks.
    string normalized = input.Normalize(NormalizationForm.FormKD);
    // ASCII encoding that drops every character it cannot represent.
    Encoding removal = Encoding.GetEncoding
        (Encoding.ASCII.CodePage,
         new EncoderReplacementFallback(""),
         new DecoderReplacementFallback(""));

    byte[] bytes = removal.GetBytes(normalized);
    return Encoding.ASCII.GetString(bytes);
}

Or an alternative:

// Needs: using System.Globalization; using System.Text;
static string RemoveAccents (string input)
{
    string normalized = input.Normalize(NormalizationForm.FormKD);
    StringBuilder builder = new StringBuilder();
    foreach (char c in normalized)
    {
        // Keep everything except the combining accent marks.
        if (char.GetUnicodeCategory(c) !=
            UnicodeCategory.NonSpacingMark)
        {
            builder.Append(c);
        }
    }
    return builder.ToString();
}
--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
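
Both versions above normalize with `NormalizationForm.FormKD`, which applies compatibility decomposition as well as canonical decomposition. The sketch below takes the form as a parameter so the two can be compared; the class name `NormalizationFormDemo` and the method name `StripMarks` are made up for this illustration, and the loop is the same one used in the second version above.

```csharp
using System.Globalization;
using System.Text;

static class NormalizationFormDemo
{
    // Same loop as the StringBuilder version, with the form left to the caller.
    public static string StripMarks(string input, NormalizationForm form)
    {
        string normalized = input.Normalize(form);
        StringBuilder builder = new StringBuilder(normalized.Length);
        foreach (char c in normalized)
        {
            // Skip combining accents; keep every other character.
            if (char.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                builder.Append(c);
            }
        }
        return builder.ToString();
    }
}
```

For the input `"\uFB01anc\u00E9"` (the ligature ﬁ followed by "ancé"), `FormD` removes the accent but leaves the ligature in place, while `FormKD` also expands the ligature and yields `"fiance"`. Which one is preferable depends on whether compatibility characters such as ligatures should be folded too.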


On Tue, 07 Aug 2007 19:29:00 +0200, Jon Skeet [C# MVP] <sk***@pobox.com> wrote:

> Actually, as of .NET 2.0 there *is* a way of doing this using
> System.Text.NormalizationForm.

Interesting.

Well, it would remove what is defined as Unicode accents, which is what the OP asked for, but it does not normalize other characters into ASCII, like the Norwegian æøå. Of those, only å is defined as having an accent, though æ and ø could be translated to a and o. The first method would eat æø and return only a, and the second would return æøa.

--
Happy coding!
Morten Wennevik [C# MVP]
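
One way to cover letters such as æ and ø, which have no accent to strip, is to combine the normalization approach with a small fallback table. The following is only a sketch under assumptions: the class name `AccentFallbackDemo`, the method name `RemoveAccentsWithFallback`, and the table entries are hypothetical, and the mapping to a and o simply follows the suggestion above (other conventions, such as "ae", are equally possible).

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text;

static class AccentFallbackDemo
{
    // Hypothetical table for letters that normalization leaves untouched.
    static readonly Dictionary<char, string> Fallbacks =
        new Dictionary<char, string>
        {
            { '\u00E6', "a" }, // æ
            { '\u00F8', "o" }, // ø
            { '\u00C6', "A" }, // Æ
            { '\u00D8', "O" }, // Ø
        };

    public static string RemoveAccentsWithFallback(string input)
    {
        string normalized = input.Normalize(NormalizationForm.FormKD);
        StringBuilder builder = new StringBuilder(normalized.Length);
        foreach (char c in normalized)
        {
            if (char.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
            {
                continue; // drop combining accents such as the ring in å
            }
            string replacement;
            if (Fallbacks.TryGetValue(c, out replacement))
            {
                builder.Append(replacement); // letter with no decomposition
            }
            else
            {
                builder.Append(c);
            }
        }
        return builder.ToString();
    }
}
```

With this, `AccentFallbackDemo.RemoveAccentsWithFallback("æøå")` gives `"aoa"`. The table still has to be maintained by hand, but only for the comparatively few letters that Unicode does not decompose.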

