Is it possible to reliably auto-decode user files to Unicode? [C#]

Question

I have a web application that allows users to upload their content for processing. The processing engine expects UTF8 (and I'm composing XML from multiple users' files), so I need to ensure that I can properly decode the uploaded files.

Since I'd be surprised if any of my users even knew their files were encoded, I have very little hope they'd be able to correctly specify the encoding (decoder) to use. And so, my application is left with the task of detecting the encoding before decoding.

This seems like such a universal problem that I'm surprised not to find either a framework capability or a general recipe for the solution. Could it be that I'm not searching with meaningful search terms?

I've implemented BOM-aware detection (http://en.wikipedia.org/wiki/Byte_order_mark) but I'm not sure how often files will be uploaded w/o a BOM to indicate encoding, and this isn't useful for most non-UTF files.
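
For reference, a minimal sketch of the BOM check described above (the class and method names are just for illustration). Note that the four-byte UTF-32 LE pattern FF FE 00 00 has to be tested before the two-byte UTF-16 LE pattern FF FE:

    using System.Text;

    static class BomSniffer
    {
        // Returns the encoding indicated by a leading BOM, or null if none.
        public static Encoding DetectBom(byte[] d)
        {
            if (d.Length >= 4 && d[0] == 0xFF && d[1] == 0xFE && d[2] == 0x00 && d[3] == 0x00)
                return Encoding.UTF32;                // UTF-32 LE
            if (d.Length >= 4 && d[0] == 0x00 && d[1] == 0x00 && d[2] == 0xFE && d[3] == 0xFF)
                return new UTF32Encoding(true, true); // UTF-32 BE
            if (d.Length >= 3 && d[0] == 0xEF && d[1] == 0xBB && d[2] == 0xBF)
                return Encoding.UTF8;
            if (d.Length >= 2 && d[0] == 0xFF && d[1] == 0xFE)
                return Encoding.Unicode;              // UTF-16 LE
            if (d.Length >= 2 && d[0] == 0xFE && d[1] == 0xFF)
                return Encoding.BigEndianUnicode;     // UTF-16 BE
            return null;
        }
    }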

My questions boil down to:

  1. Is BOM-aware detection sufficient for the vast majority of files?
  2. In the case where BOM-detection fails, is it possible to try different decoders and determine if they are "valid"? (My attempts indicate the answer is "no"; see the sketch after this list.)
  3. Under what circumstances will a "valid" file fail with the C# encoder/decoder framework?
  4. Is there a repository anywhere that has a multitude of files with various encodings to use for testing?
  5. While I'm specifically asking about C#/.NET, I'd like to know the answer for Java, Python and other languages for the next time I have to do this.
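
Regarding question 2: by default, .NET decoders substitute U+FFFD for invalid byte sequences instead of throwing, so naive trial decodes appear to "succeed" for any input. A sketch of a strict validity probe, assuming you clone the encoding and install an exception fallback (the helper name is hypothetical):

    using System.Text;

    static class StrictDecode
    {
        // Returns true only if 'data' decodes with no invalid sequences.
        public static bool TryDecode(byte[] data, Encoding encoding, out string text)
        {
            var strict = (Encoding)encoding.Clone();
            strict.DecoderFallback = DecoderFallback.ExceptionFallback;
            try
            {
                text = strict.GetString(data);
                return true;
            }
            catch (DecoderFallbackException)
            {
                text = null;
                return false;
            }
        }
    }

Even a strict probe has blind spots: a UTF-16 LE file of ASCII text decodes "cleanly" as UTF-8, because its 0x00 high bytes are valid single-byte sequences; that matches the null-character observation below.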

So far I've found:

  • A "valid" UTF-16 file with Ctrl-S characters has caused encoding to UTF-8 to throw an exception (Illegal character?) (That was an XML encoding exception.)
  • Decoding a valid UTF-16 file with UTF-8 succeeds but gives text with null characters. Huh? (See the demo after this list.)
  • Currently, I only expect UTF-8, UTF-16 and probably ISO-8859-1 files, but I want the solution to be extensible if possible.
  • My existing set of input files isn't nearly broad enough to uncover all the problems that will occur with live files.
  • Although the files I'm trying to decode are "text" I think they are often created w/methods that leave garbage characters in the files. Hence "valid" files may not be "pure". Oh joy.
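
The null-character result above has a mundane cause: ASCII text in UTF-16 LE interleaves each character byte with a 0x00 high byte, and 0x00 is itself a valid single-byte UTF-8 sequence (U+0000), so a UTF-8 decode accepts the whole file. A small demonstration:

    using System;
    using System.Text;

    class NullByteDemo
    {
        static void Main()
        {
            byte[] utf16le = Encoding.Unicode.GetBytes("AB"); // 41 00 42 00
            string asUtf8 = Encoding.UTF8.GetString(utf16le); // "A\0B\0"
            Console.WriteLine(asUtf8.Length);                 // 4, not 2
            Console.WriteLine((int)asUtf8[1]);                // 0 (embedded NUL)
        }
    }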

Thanks.

Answer

There won't be an absolutely reliable way, but you may be able to get a "pretty good" result with some heuristics.

  • If the data starts with a BOM, use it.
  • If the data contains 0 bytes, it is likely utf-16 or ucs-32. You can distinguish between these, and between their big-endian and little-endian variants, by looking at the positions of the 0 bytes.
  • If the data can be decoded as utf-8 (without errors), then it is very likely utf-8 (or US-ASCII, but that's a subset of utf-8).
  • Next, if you want to go international, map the browser's language setting to the most likely encoding for that language.
  • Finally, assume ISO-8859-1.
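
A minimal sketch of this chain, assuming a BOM helper like the one sketched in the question and treating the browser-language step as a caller-supplied hint (all names are illustrative, and the zero-byte test is deliberately crude):

    using System;
    using System.Text;

    static class EncodingGuesser
    {
        public static Encoding Guess(byte[] data, Encoding languageHint)
        {
            Encoding bom = BomSniffer.DetectBom(data);        // 1: BOM wins
            if (bom != null)
                return bom;

            int firstZero = Array.IndexOf(data, (byte)0);     // 2: zero bytes
            if (firstZero >= 0)
            {
                // For ASCII-range text, a zero at an even offset suggests
                // UTF-16 BE; at an odd offset, UTF-16 LE. (UTF-32 would show
                // runs of three zeros; omitted here for brevity.)
                return firstZero % 2 == 0 ? Encoding.BigEndianUnicode : Encoding.Unicode;
            }

            var strictUtf8 = (Encoding)Encoding.UTF8.Clone(); // 3: strict UTF-8
            strictUtf8.DecoderFallback = DecoderFallback.ExceptionFallback;
            try { strictUtf8.GetString(data); return Encoding.UTF8; }
            catch (DecoderFallbackException) { }

            if (languageHint != null)                         // 4: language hint
                return languageHint;

            return Encoding.GetEncoding("ISO-8859-1");        // 5: last resort
        }
    }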

Whether "pretty good" is "good enough" depends on your application, of course. If you need to be sure, you might want to display the results as a preview, and let the user confirm that the data looks right. If it doesn't, try the next likely encoding, until the user is satisfied.

Note: this algorithm will not work if the data contains garbage characters. For example, a single garbage byte in otherwise valid utf-8 will cause utf-8 decoding to fail, making the algorithm go down the wrong path. You may need to take additional measures to handle this. For example, if you can identify possible garbage beforehand, strip it before you try to determine the encoding. (It doesn't matter if you strip too aggressively; once you have determined the encoding, you can decode the original unstripped data, just configure the decoders to replace invalid characters instead of throwing an exception.) Or count decoding errors and weight them appropriately. But this probably depends very much on the nature of your garbage, i.e. what assumptions you can make.
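
For the "replace instead of throwing" step, a minimal sketch (DecoderReplacementFallback is in fact the default for most .NET encodings, so this mostly makes the intent explicit):

    using System.Text;

    static class LenientDecode
    {
        // Decode the original, unstripped bytes; invalid sequences become
        // U+FFFD instead of aborting the decode.
        public static string Decode(byte[] data, Encoding encoding)
        {
            var lenient = (Encoding)encoding.Clone();
            lenient.DecoderFallback = new DecoderReplacementFallback("\uFFFD");
            return lenient.GetString(data);
        }
    }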
