Why is this A0 character appearing in my HTML::Element output?


Problem description

I'm parsing an HTML document with a couple of Perl modules: HTML::TreeBuilder and HTML::Element. For some reason, whenever the content of a tag is just &nbsp; (which is to be expected in this document), it gets returned by HTML::Element as a strange character I've never seen before:

[Screenshot of the character: http://www.freeimagehosting.net/uploads/2acca201ab.jpg]

I can't copy the character, so I can't Google it, and I couldn't find it in the character map. Strangely, when I search with a regular expression, \w finds it. When I convert the returned document to ANSI or UTF-8, it disappears altogether. I couldn't find any information on it in the HTML::Element documentation either.

How can I detect and replace this character with something more useful like null and how should I deal with strange characters like this in the future?

Solution

The character is "\xa0" (i.e. 160), which is the standard Unicode translation for &nbsp;. (That is, it's Unicode's non-breaking space, U+00A0.) You should be able to replace each one with an ordinary space using s/\xa0/ /g, or remove them entirely with s/\xa0//g, if you like.
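As a minimal sketch of the substitution above, using only core Perl: the helper names `nbsp_to_space` and `is_blank` are hypothetical and are not part of HTML::Element, and `$cell` is an assumed stand-in for the text an element might hand back when its only content was &nbsp;. Printing the code point with `ord` is one way to identify an unfamiliar character that cannot be copied and pasted.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Element): turn every
# non-breaking space (U+00A0) into an ordinary space.
sub nbsp_to_space {
    my ($text) = @_;
    $text =~ s/\xa0/ /g;
    return $text;
}

# Hypothetical helper: true when the text holds nothing but
# whitespace and/or non-breaking spaces (or is empty).
sub is_blank {
    my ($text) = @_;
    return $text =~ /\A[\s\xa0]*\z/ ? 1 : 0;
}

# Assumed stand-in for the text of a tag that contained only &nbsp;
my $cell = "\xa0";

# Show which character this really is, so it can be looked up.
printf "code point: U+%04X\n", ord $cell;

print "cell is blank\n" if is_blank($cell);
print "[", nbsp_to_space("a\xa0b"), "]\n";
```

Listing \xa0 explicitly in the character class avoids relying on whether \s matches the non-breaking space, which can depend on how the string is encoded internally.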
