Encoding formats in C#

Question

Hi all,

I have created a "Web Crawler" application that downloads pages from websites.

What I need now is this: if a site (say, a wiki) has a page in Hindi or Arabic, the page should appear in my saved file with the same encoding.

I download the pages and save them to files, but the saved files do not keep the same encoding. However, if I copy and paste a page into Notepad, the encoding is preserved and the text looks exactly the same.

Can anyone provide any information on this?

Thanks

Recommended answer

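The response headers should tell you which text encoding is used. Note that in HTTP this is the charset parameter of the Content-Type header (for example, Content-Type: text/html; charset=utf-8); the Content-Encoding header describes compression such as gzip, not the character set. You should then decode the text content of the response using that encoding.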
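As a rough illustration of that advice, the sketch below downloads the raw bytes, reads the charset from the response's Content-Type header, decodes the bytes with it, and saves the text as UTF-8. The class and method names (`PageDownloader`, `ResolveEncoding`, `SavePageAsync`) are hypothetical and not from the original thread, and the fallback to UTF-8 when no usable charset is declared is an assumption; a real crawler might also inspect a `<meta charset>` tag or a byte-order mark.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

// Hypothetical helper, for illustration only.
public static class PageDownloader
{
    private static readonly HttpClient Client = new HttpClient();

    // Maps a charset label to an Encoding; falls back to UTF-8 (an assumption)
    // when the label is missing or not supported on this runtime.
    public static Encoding ResolveEncoding(string charset)
    {
        if (string.IsNullOrWhiteSpace(charset))
        {
            return Encoding.UTF8;
        }
        try
        {
            // Some servers wrap the charset value in quotes.
            return Encoding.GetEncoding(charset.Trim('"', ' '));
        }
        catch (ArgumentException)
        {
            return Encoding.UTF8;
        }
    }

    public static async Task SavePageAsync(string url, string outputPath)
    {
        using (HttpResponseMessage response = await Client.GetAsync(url))
        {
            response.EnsureSuccessStatusCode();

            // Read bytes, not a string, so that we control the decoding step.
            byte[] raw = await response.Content.ReadAsByteArrayAsync();
            string charset = response.Content.Headers.ContentType?.CharSet;
            string html = ResolveEncoding(charset).GetString(raw);

            // Write as UTF-8 with a byte-order mark so editors detect it.
            File.WriteAllText(outputPath, html, new UTF8Encoding(true));
        }
    }
}
```

It could be called as `await PageDownloader.SavePageAsync(url, "page.html");` with any page URL. On .NET Core and later, legacy code pages such as windows-1256 are only available after registering `CodePagesEncodingProvider.Instance` through `Encoding.RegisterProvider`; without that, the sketch silently falls back to UTF-8 for those pages.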


There is probably something you don't understand. Both .NET and the Web use Unicode, but on the Web there are also many obsolete encodings. These days, the only valid way to encode text that may be in different languages (depending on which languages, though) is Unicode.

There is no single Unicode encoding. Unicode is just a table of one-to-one correspondence between characters (as cultural entities, abstracted from fonts, styles, encodings and other details) and integer values called code points (code points are understood as abstract mathematical integers, abstracted from binary representation, size, little- or big-endian byte order and other technical details; all of that is covered by encodings). Unicode also defines several UTFs, which are the physical encodings of code points. No, Unicode is not a 16-bit encoding! The full set of code points needs more than 16 bits (code points range up to U+10FFFF). All UTFs support code points beyond 16 bits.

Also, Unicode does not deal in languages. It has scripts, which are subsets of the code points. For example, Hindi is written in the Devanagari script, which also serves many of the most widely used languages of India, as well as Sanskrit.

See http://unicode.org/, http://unicode.org/faq/utf_bom.html.
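To make the distinction between code points and encodings concrete, here is a small sketch (the class name `UnicodeDemo` is hypothetical) that encodes one Devanagari letter and one code point above U+FFFF with the UTF encodings built into .NET. The code point stays the same; only the byte representation changes.

```csharp
using System;
using System.Text;

// Hypothetical demo class, for illustration only.
public static class UnicodeDemo
{
    public static void Main()
    {
        // DEVANAGARI LETTER KA, code point U+0915 (fits in 16 bits).
        string ka = "\u0915";
        // U+1F600 lies above U+FFFF, so it does not fit in 16 bits.
        string beyondBmp = char.ConvertFromUtf32(0x1F600);

        // Same code point, different number of bytes per encoding.
        Console.WriteLine(Encoding.UTF8.GetByteCount(ka));     // 3
        Console.WriteLine(Encoding.Unicode.GetByteCount(ka));  // 2 (UTF-16)
        Console.WriteLine(Encoding.UTF32.GetByteCount(ka));    // 4

        // A .NET string is UTF-16, so this code point takes two char
        // values (a surrogate pair) even though it is one code point.
        Console.WriteLine(beyondBmp.Length);                          // 2
        Console.WriteLine(Encoding.UTF8.GetByteCount(beyondBmp));     // 4
        Console.WriteLine(char.ConvertToUtf32(beyondBmp, 0).ToString("X")); // 1F600
    }
}
```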

—SA

