Internal string encoding


Question



I'm trying to understand how ASP classic handles strings internally. I've googled and debugged, but I still don't know how a string is encoded within the ASP script.

See the illustration below.

Is input data transformed so that all string variables have the same encoding no matter what source?

Most ASP pages are saved on disk as UTF-8. They do, however, #include ASP files that are saved with another encoding. At the top of the front-end pages I set the response encoding to Unicode (UTF-8).

Response.CodePage = 65001   ' UTF-8
Response.Charset = "utf-8"

http://www.designerline.se/db/aspclassicencoding.png

Solution

First of all, it's worth considering that UTF-8, Windows-1252, ISO-8859-1 and others are all based on US-ASCII. The first 128 characters in all of these code pages are identical: they use exactly the same byte values and each occupies just one byte.

In many cases the vast majority of the content is within the US-ASCII range, so it's hard to tell that there is any difference between the encodings. Frequently the whole file uses only US-ASCII characters, and hence the files are identical regardless of the chosen encoding (save perhaps for the BOM at the start of the file).
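As a rough illustration of that overlap, the sketch below compares the bytes produced by three encodings. Python is used only because it makes the bytes easy to inspect; it is not ASP code, and the sample strings are made up for the example.

```python
# Illustration only: Python stands in for a hex editor here; this is not ASP.
ascii_text = "Response.Write"   # sample text containing only US-ASCII characters
accented = "é"                  # sample non-ASCII character

# Pure ASCII text produces the same bytes in all three encodings.
same_bytes = (
    ascii_text.encode("utf-8")
    == ascii_text.encode("cp1252")
    == ascii_text.encode("latin-1")
)

# A non-ASCII character is where the encodings diverge.
utf8_bytes = accented.encode("utf-8")      # two bytes: C3 A9
cp1252_bytes = accented.encode("cp1252")   # one byte: E9

print(same_bytes, utf8_bytes.hex(), cp1252_bytes.hex())  # True c3a9 e9
```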

Basic Script Processing

First the processor combines an ASP file with all its includes and the includes of those includes. This is done very simply, by sequentially replacing each include marker with the content of the include file being referenced. This is done purely at the byte level; no attempt is made to convert files of different encodings.
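A minimal sketch of what a byte-level splice implies, assuming a UTF-8 page that includes a Windows-1252 file. The file contents and the `is_valid_utf8` helper are hypothetical, and Python is only a stand-in for looking at the bytes.

```python
# Hypothetical file contents: a UTF-8 page that includes a Windows-1252 file.
page_bytes = "<p>Smörgåsbord</p>".encode("utf-8")
include_bytes = "<p>Köttbullar</p>".encode("cp1252")

# Byte-level splice: the include is pasted in as-is, with no transcoding.
combined = page_bytes + include_bytes


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(page_bytes))   # True
print(is_valid_utf8(combined))     # False: the result is a mix of two encodings
```

The combined stream is no longer valid in any single encoding, which is why non-ASCII content in a differently encoded include causes trouble.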

Next the combined version of the file is parsed, tokenized, and even "compiled" into a tight, interpreter-friendly form. It's at this point that chunks of static content in the file (the stuff outside of script code blocks) are turned into a special form of Response.Write. It's special in that, when script execution reaches one of these writes, the processor simply copies the bytes verbatim from the file to the output stream; again, no attempt is made to convert any encodings.

Script code and character encoding

The ASP processor just doesn't cope well with anything that isn't ASCII. All your code and especially your string literals in your code should only be in ASCII.

What can be a bit confusing is that once a script is executing, all string variables are stored as Unicode (VBScript strings are COM BSTRs, which hold UTF-16).

When code writes content to the response using the proper Response.Write method, this is where Response.CodePage comes into effect. ASP encodes the Unicode string the script provides into the response code page before adding it to the output stream.
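The encoding step can be modelled roughly as follows. The `response_write` function and the `CODECS` table are hypothetical names for this sketch; they imitate the behaviour described above in Python and are not part of ASP.

```python
# Assumed mapping from Windows code page numbers to Python codec names.
CODECS = {65001: "utf-8", 1252: "cp1252"}


def response_write(text: str, codepage: int) -> bytes:
    """Model of Response.Write: encode a Unicode string to the response code page."""
    return text.encode(CODECS[codepage])


name = "Åsa"                           # the same in-memory Unicode string
as_utf8 = response_write(name, 65001)  # b'\xc3\x85sa'
as_1252 = response_write(name, 1252)   # b'\xc5sa'

print(as_utf8, as_1252)
```

The script sees one and the same string in both cases; only the bytes placed on the wire differ.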

What is the effect of Response.CharSet?

It adds the charset attribute to the Content-Type HTTP header. That is it; it has no other impact. If you set this to one character set but send a different one, either because your Response.CodePage doesn't match it or because the byte content of the files is not in that encoding, then you can expect problems.
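To see what such a mismatch looks like on the receiving end, here is a small sketch. `browser_decode` is a hypothetical stand-in for a browser that trusts the declared charset; real browsers may also sniff content, which this ignores.

```python
def browser_decode(body: bytes, declared_charset: str) -> str:
    """Decode a response body the way the declared charset says to."""
    return body.decode(declared_charset, errors="replace")


# Body encoded with code page 1252, but the header claims charset=utf-8.
body = "Åsa".encode("cp1252")

shown = browser_decode(body, "utf-8")       # first character becomes U+FFFD
matching = browser_decode(body, "cp1252")   # header and bytes agree

print(repr(shown), repr(matching))
```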

Input encoding

Things get really messy here. When form data is posted to the server, there is no provision in the form URL-encoding standard to declare the code page used. Browsers can be told what encoding to use, and they will default to the charset of the HTML page containing the form, but there is no mechanism to communicate that choice to the server.

ASP takes the view that the code page of posted form fields will be the same as the code page of the response it's about to send. Take a moment to absorb that. It means that, quite counterintuitively, the Response.CodePage value has an impact on the strings returned by Request.Form. For this reason it's important to get the correct code page set early; doing some form processing and then setting the code page later, just before sending a response, can lead to unexpected results.
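A rough model of that dependency, assuming the browser posts UTF-8. `request_form_value` and `CODECS` are hypothetical names that imitate how the decoding of a posted field depends on the response code page; this is Python, not ASP.

```python
from urllib.parse import quote, unquote

# Assumed mapping from Windows code page numbers to Python codec names.
CODECS = {65001: "utf-8", 1252: "cp1252"}


def request_form_value(raw: str, response_codepage: int) -> str:
    """Model of Request.Form: decode a posted value using the response code page."""
    return unquote(raw, encoding=CODECS[response_codepage])


# The browser percent-encodes the field as UTF-8 bytes.
posted = quote("Åsa", encoding="utf-8")      # '%C3%85sa'

correct = request_form_value(posted, 65001)  # 'Åsa'
garbled = request_form_value(posted, 1252)   # 'Ã…sa'

print(posted, correct, garbled)
```

The posted bytes are identical in both calls; only the code page in effect when the value is read decides whether the script gets the right string.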

The classic "the web page looks fine but the data in the database is corrupt" gotcha

One common gotcha this behaviour results in is where the developer has set CharSet="UTF-8" but left the codepage at something like "Windows-1252".

What ends up happening is that the user enters text, which is sent to the server in UTF-8 encoding, but the script code reads it as 1252. This corrupt string gets stored in the database. A subsequent web page looks at this data, the corrupt string it pulled from the DB. This string is then sent by Response.Write using 1252 encoding, but the destination page is told it's UTF-8. This has the effect of reversing the corruption, and everything looks fine to the user.

However, when other components, say a report generator, create content from the database, the data appears corrupt, because it is.
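The whole round trip can be traced step by step. This is a sketch in Python with a made-up sample value; each line imitates one stage of the scenario described above rather than any actual ASP call.

```python
typed = "Åsa"                      # what the user types into the form

posted = typed.encode("utf-8")     # browser sends UTF-8 (page says charset=utf-8)
stored = posted.decode("cp1252")   # script reads it as 1252: corrupt string in the DB
sent = stored.encode("cp1252")     # Response.Write encodes with code page 1252
displayed = sent.decode("utf-8")   # browser trusts charset=utf-8

report = stored                    # another component reads the DB directly

print(displayed == typed)          # True: the two errors cancel out on the web page
print(report)                      # Ã…sa: the stored data really is corrupt
```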

The Bottom Line

You are already doing the correct thing: get CharSet and CodePage set early and consistently. Where other files are not saved as UTF-8, you will have problems if there is non-ASCII content in them, but otherwise you will be fine.

Many included ASP files are purely code with no static content, and since that code ought to be purely ASCII, its encoding doesn't really matter.
