Read a file with Unicode characters

Question

I have an ASP.NET C# page and am trying to read a file that contains the character ’ and convert it to '. (From a slanted apostrophe to a straight apostrophe.)

FileInfo fileinfo = new FileInfo(FileLocation);
string content = File.ReadAllText(fileinfo.FullName);

//strip out bad characters
content = content.Replace("’", "'");

This doesn't work and it changes the slanted apostrophes into ? marks.

Recommended answer

I suspect that the problem is not with the replacement but with the reading of the file itself. When I tried this the naive way (using Word and copy-paste) I ended up with the same results as you. However, examining content showed that, before the string replacement ran, the .NET Framework believed the character was Unicode character 65533 (U+FFFD, the replacement character, i.e. the "WTF?" character). You can check this yourself by examining the relevant character in the Visual Studio debugger, where it should show the character code:

content[0]; // 65533 '�'

The reason why the replace isn't working is simple - content doesn't contain the string you gave it:

content.IndexOf("’"); // -1
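
As a minimal sketch of why the lookup comes back empty, the following assumes the file stored the apostrophe as the single byte 0x92 (as Windows-1252 does) and that the bytes were decoded as UTF-8. The class and method names are hypothetical and not from the original post.

```csharp
using System;
using System.Text;

public static class DecodeDemo
{
    // Hypothetical helper: decodes raw bytes as UTF-8, the way a reader
    // that guessed the wrong encoding might.
    public static string DecodeAsUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    public static void Main()
    {
        // Assumed file content: "It’s" with the apostrophe stored as byte 0x92.
        byte[] bytes = { 0x49, 0x74, 0x92, 0x73 };
        string content = DecodeAsUtf8(bytes);

        // 0x92 is not valid UTF-8 on its own, so it decodes to U+FFFD (65533).
        Console.WriteLine((int)content[2]);
        // The curly apostrophe U+2019 never appears in the decoded string.
        Console.WriteLine(content.IndexOf('\u2019'));
    }
}
```

Once the byte has been turned into U+FFFD the original character is lost, so no replacement on the decoded string can recover it.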

As for why the file reading isn't working properly: you are probably using the wrong encoding when reading the file. (If no encoding is specified, the .NET Framework tries to determine the correct encoding for you, but there is no 100% reliable way to do this, so it can often get it wrong. For File.ReadAllText, detection relies on a byte order mark, and a file without one is read as UTF-8.) The exact encoding you need depends on the file itself. In my case the encoding being used was Extended ASCII, so to read the file I just needed to specify the correct encoding:

string content = File.ReadAllText(fileinfo.FullName, Encoding.GetEncoding("iso-8859-1"));

(See this question: http://stackoverflow.com/questions/666385/how-can-i-convert-extended-ascii-to-a-system-string)
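
To illustrate what specifying the encoding changes, here is a small sketch that writes a temporary file and reads it back as ISO-8859-1. It assumes the apostrophe was saved as byte 0x92. The helper name ReadLatin1 and the sample bytes are illustrative assumptions, not part of the original answer.

```csharp
using System;
using System.IO;
using System.Text;

public static class ReadDemo
{
    // Hypothetical helper: reads a file as ISO-8859-1, where every byte
    // maps to the code point with the same numeric value.
    public static string ReadLatin1(string path)
    {
        return File.ReadAllText(path, Encoding.GetEncoding("iso-8859-1"));
    }

    public static void Main()
    {
        string path = Path.GetTempFileName();
        try
        {
            // Assumed file content: "It’s" with the apostrophe stored as byte 0x92.
            File.WriteAllBytes(path, new byte[] { 0x49, 0x74, 0x92, 0x73 });
            string content = ReadLatin1(path);

            // The byte survives decoding as U+0092 (146) instead of U+FFFD.
            Console.WriteLine((int)content[2]);
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

Because ISO-8859-1 never rejects a byte, nothing is replaced with U+FFFD, but the decoded character is U+0092 rather than the curly apostrophe U+2019.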

You also need to make sure that you specify the correct character in your replacement string. When using "odd" characters in code, you may find it more reliable to specify the character by its character code rather than as a string literal (which may cause problems if the encoding of the source file changes). With ISO-8859-1, a byte 0x92 (the slanted apostrophe in Windows-1252) decodes to U+0092 rather than U+2019, which is why the replacement targets \u0092. For example, the following worked for me:

content = content.Replace("\u0092", "'");
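
If the same code may also see text that was decoded with an encoding that does produce the real curly apostrophe (U+2019), one option is to normalize both code points. This is only a sketch. ApostropheFixer and Normalize are hypothetical names, and handling U+2019 as well goes beyond what the original answer did.

```csharp
using System;

public static class ApostropheFixer
{
    // Hypothetical helper: maps U+0092 (what ISO-8859-1 yields for byte 0x92)
    // and U+2019 (the right single quotation mark) to a plain apostrophe.
    public static string Normalize(string content)
    {
        return content.Replace('\u0092', '\'').Replace('\u2019', '\'');
    }

    public static void Main()
    {
        Console.WriteLine(Normalize("It\u0092s")); // It's
        Console.WriteLine(Normalize("It\u2019s")); // It's
    }
}
```

Writing the characters as \u escapes keeps the source file pure ASCII, so the behaviour does not depend on how the .cs file itself is saved.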
