Why is this A0 character appearing in my HTML::Element output?
Problem description
I'm parsing an HTML document with a couple of Perl modules: HTML::TreeBuilder and HTML::Element. For some reason, whenever the content of a tag is just &nbsp;, which is to be expected, it gets returned by HTML::Element as a strange character I've never seen before:
[Screenshot of the character: http://www.freeimagehosting.net/uploads/2acca201ab.jpg]
I can't copy the character, so I can't Google it; I couldn't find it in Character Map; and strangely, when I search with a regular expression, \w finds it. When I convert the returned document to ANSI or UTF-8, it disappears altogether. I couldn't find any information about it in the HTML::Element documentation either.
How can I detect this character and replace it with something more useful, like null? And how should I deal with strange characters like this in the future?
The character is "\xa0" (i.e. decimal 160), which is the standard Unicode translation for &nbsp;. (That is, it's Unicode's non-breaking space, U+00A0.) You should be able to replace them with ordinary spaces using s/\xa0/ /g if you like.
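As a rough sketch of how that might look in practice: the snippet below uses a hypothetical sample string in place of real parser output (the variable names are made up for illustration), prints each character's code point so an unfamiliar character can be identified, and then applies the substitution. It uses only core Perl, so it does not depend on how HTML::TreeBuilder is set up.

```perl
use strict;
use warnings;

# Hypothetical sample: the text of a tag whose only content was &nbsp;,
# standing in for what HTML::Element's as_text() might return.
my $text = "\xa0";

# Print each character's code point to identify an unfamiliar character.
printf "U+%04X\n", ord($_) for split //, $text;

# Replace every non-breaking space with an ordinary space,
# leaving the original string untouched.
(my $cleaned = $text) =~ s/\xa0/ /g;

# Alternatively, flag text that holds nothing but whitespace or
# non-breaking spaces, so it can be treated as empty.
my $is_blank = $text =~ /^[\s\xa0]*$/ ? 1 : 0;
```

Printing code points with ord is a general way to deal with characters that cannot be copied or looked up by eye: once the number is known (here A0), it can be searched for directly.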