Converting combining diacritics to simple UTF
Question
I have a problem inserting a string into a database because of an encoding issue.
The string comes from an external RSS feed. It looks fine in a web browser, and even in the debugger the text appears correct. If I copy the string into Notepad, the result is also fine.
In Notepad++, however, I can see that the string uses combining characters. If I switch the encoding to ANSI, the base letter and its combining mark show up as two separate characters, e.g.
á appears as the letter a followed by a separate accent mark
(In Notepad++ it is like having two characters, one over the other. I can even select ... half of the character.)
I googled a lot and tried several very different approaches to this problem. I really want to find a clean way to convert a string with combining diacritics into simple, database-compatible UTF-8.
Any help? Thank you so much!
Recommended answer
This should work for you:
output = output.Normalize(NormalizationForm.FormC);  // Normalize returns a new string, so keep the result
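As a minimal sketch of how that call might be applied just before the database insert: the helper name DiacriticText.ToComposed and the sample text are hypothetical and not part of the original answer. It only assumes the feed text is already a .NET string.

```csharp
using System;
using System.Text;

public static class DiacriticText
{
    // Hypothetical helper: returns the composed (NFC) form of the input.
    public static string ToComposed(string input)
    {
        if (input == null) throw new ArgumentNullException(nameof(input));

        // Avoid allocating a new string when the text is already in Form C.
        return input.IsNormalized(NormalizationForm.FormC)
            ? input
            : input.Normalize(NormalizationForm.FormC);
    }
}

public static class NormalizeDemo
{
    public static void Main()
    {
        // 'a' followed by U+0301 COMBINING ACUTE ACCENT, as a feed might deliver it.
        string fromFeed = "a\u0301gua";
        string composed = DiacriticText.ToComposed(fromFeed);

        Console.WriteLine(fromFeed.Length);          // 5 UTF-16 code units
        Console.WriteLine(composed.Length);          // 4 UTF-16 code units
        Console.WriteLine(composed == "\u00E1gua");  // True: precomposed á (U+00E1)
    }
}
```

Calling the helper once at the point where feed text enters the application keeps every later comparison and insert working on the same composed form.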
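This little test gave 3, 2, 3. In the middle case, A and its diacritic are correctly combined into the single precomposed character Â (U+00C2), which takes 2 bytes in UTF-8 instead of 3. The last case stays at 3 bytes because there is no precomposed T with a circumflex, so Form C leaves the combining mark in place.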
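using System;
using System.Text;

Console.WriteLine(Encoding.UTF8.GetByteCount("A\u0302"));                                     // 3: A (1 byte) + combining circumflex (2 bytes)
Console.WriteLine(Encoding.UTF8.GetByteCount("A\u0302".Normalize(NormalizationForm.FormC)));  // 2: precomposed Â (U+00C2)
Console.WriteLine(Encoding.UTF8.GetByteCount("T\u0302".Normalize(NormalizationForm.FormC)));  // 3: no precomposed form, unchanged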
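Because the third case shows that Form C cannot compose every sequence, it may help to check whether any combining marks are left after normalization. The following is a rough sketch; CombiningMarkCheck and HasCombiningMarks are hypothetical names, and it only looks at non-spacing marks (Unicode category Mn), which is the category U+0302 belongs to.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

public static class CombiningMarkCheck
{
    // Hypothetical helper: true if any char in the text is a non-spacing mark (category Mn).
    public static bool HasCombiningMarks(string text) =>
        text.Any(c => CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark);

    public static void Main()
    {
        string a = "A\u0302".Normalize(NormalizationForm.FormC);
        string t = "T\u0302".Normalize(NormalizationForm.FormC);

        Console.WriteLine(HasCombiningMarks(a));  // False: composed into Â
        Console.WriteLine(HasCombiningMarks(t));  // True: the circumflex is still a separate mark
    }
}
```

If the database or a downstream tool cannot handle such leftovers, they at least become visible here instead of surfacing later as display glitches.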