HTML encoding issues - "Â" character showing up instead of "&nbsp;"


Problem description

I've got a legacy app that has just started misbehaving, and I'm not sure why. It generates a bunch of HTML that gets turned into PDF reports by ActivePDF.

The process works like this:

  1. Pull an HTML template from a DB with tokens in it to be replaced (e.g. "~CompanyName~", "~CustomerName~", etc.)
  2. Replace the tokens with real data
  3. Tidy the HTML with a simple regex function that properly formats HTML tag attribute values (ensures quotation marks, etc., since ActivePDF's rendering engine hates anything but single quotes around attribute values)
  4. Send off the HTML to a web service that creates the PDF.

Somewhere in that mess, the non-breaking spaces from the HTML template (the &nbsp;s) are encoding as ISO-8859-1 so that they show up incorrectly as an "Â" character when viewing the document in a browser (Firefox). ActivePDF pukes on these non-UTF-8 characters.

My question: since I don't know where the problem stems from and don't have time to investigate it, is there an easy way to re-encode or find-and-replace the bad characters? I've tried sending it through this little function I threw together, but it doesn't change anything.

Private Shared Function ConvertToUTF8(ByVal html As String) As String
    Dim isoEncoding As Encoding = Encoding.GetEncoding("iso-8859-1")
    Dim source As Byte() = isoEncoding.GetBytes(html)
    Return Encoding.UTF8.GetString(Encoding.Convert(isoEncoding, Encoding.UTF8, source))
End Function

Any ideas?

EDIT:

I'm getting by with this for now, though it hardly seems like a good solution:

Private Shared Function ReplaceNonASCIIChars(ByVal html As String) As String
    Return Regex.Replace(html, "[^\u0000-\u007F]", "&nbsp;")
End Function

Solution

Somewhere in that mess, the non-breaking spaces from the HTML template (the &nbsp;s) are encoding as ISO-8859-1 so that they show up incorrectly as an "Â" character

That'd be encoding to UTF-8 then, not ISO-8859-1. The non-breaking space character is byte 0xA0 in ISO-8859-1; when encoded to UTF-8 it'd be 0xC2,0xA0, which, if you (incorrectly) view it as ISO-8859-1 comes out as "Â ". That includes a trailing nbsp which you might not be noticing; if that byte isn't there, then something else has mauled your document and we need to see further up to find out what.
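The byte values described above can be checked directly in .NET. The following is a minimal sketch and not part of the original answer; the module name `NbspDemo` is made up for illustration, and it assumes the standard `System.Text.Encoding` classes.

```vbnet
Imports System
Imports System.Text

Module NbspDemo
    Sub Main()
        Dim latin1 As Encoding = Encoding.GetEncoding("iso-8859-1")
        Dim nbsp As String = ChrW(&HA0)   ' U+00A0 NO-BREAK SPACE

        ' One byte in ISO-8859-1, two bytes in UTF-8.
        Console.WriteLine(BitConverter.ToString(latin1.GetBytes(nbsp)))         ' A0
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(nbsp)))  ' C2-A0

        ' Misreading the UTF-8 bytes as ISO-8859-1 yields two characters:
        ' U+00C2 (A with circumflex) followed by U+00A0.
        Dim misread As String = latin1.GetString(Encoding.UTF8.GetBytes(nbsp))
        For Each c As Char In misread
            Console.WriteLine("U+" & AscW(c).ToString("X4"))   ' U+00C2, then U+00A0
        Next

        ' Undo the misreading: re-encode as ISO-8859-1, then decode as UTF-8.
        Dim repaired As String = Encoding.UTF8.GetString(latin1.GetBytes(misread))
        Console.WriteLine(repaired = nbsp)   ' True
    End Sub
End Module
```

The last step only makes sense for a string known to have gone through exactly this misreading; applied to text that was never misread, it would corrupt legitimate non-ASCII characters. By contrast, the `ConvertToUTF8` function in the question encodes to ISO-8859-1, converts those bytes to UTF-8, and then decodes them as UTF-8, which is a round trip that returns its input unchanged for any character representable in ISO-8859-1.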

What's the regexp, and how does the templating work? There would seem to be a proper HTML parser involved somewhere if your &nbsp; strings are (correctly) being turned into U+00A0 NON-BREAKING SPACE characters. If so, you could just process your template natively in the DOM, and ask it to serialise using the ASCII encoding to keep non-ASCII characters as character references. That would also stop you having to do regex post-processing on the HTML itself, which is always a highly dodgy business.
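To illustrate what "keeping non-ASCII characters as character references" would produce, here is a rough sketch. It is a plain string pass, not the DOM-based serialisation described above, and the helper name `ToAsciiWithCharRefs` is hypothetical rather than anything from ActivePDF or the original code.

```vbnet
Imports System
Imports System.Text

Module AsciiSerialiseDemo
    ' Hypothetical helper: rewrites each non-ASCII character as a numeric
    ' character reference (e.g. U+00A0 becomes &#160;), leaving ASCII untouched.
    Function ToAsciiWithCharRefs(ByVal html As String) As String
        Dim sb As New StringBuilder(html.Length)
        Dim i As Integer = 0
        While i < html.Length
            Dim codePoint As Integer
            Dim width As Integer = 1
            If Char.IsSurrogatePair(html, i) Then
                ' Characters outside the BMP occupy two UTF-16 code units.
                codePoint = Char.ConvertToUtf32(html, i)
                width = 2
            Else
                codePoint = AscW(html(i))
            End If

            If codePoint < 128 Then
                sb.Append(html(i))
            Else
                sb.Append("&#").Append(codePoint).Append(";")
            End If
            i += width
        End While
        Return sb.ToString()
    End Function

    Sub Main()
        Console.WriteLine(ToAsciiWithCharRefs("A" & ChrW(&HA0) & "B"))   ' A&#160;B
    End Sub
End Module
```

Unlike the `ReplaceNonASCIIChars` workaround in the question, which turns every non-ASCII character into `&nbsp;`, this keeps each character's identity, so accented letters and symbols would survive as their own references. The output is pure ASCII, so it reads the same whether it is later decoded as ISO-8859-1 or as UTF-8.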

Well anyway, for now you can add one of the following to your document's <head> and see if that makes it look right in the browser:

  • for HTML4: <meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
  • for HTML5: <meta charset="utf-8">

If you've done that, then any remaining problem is ActivePDF's fault.
