Decoding ampersand hash strings (&#124&#120&#97) etc.


Problem description


The solutions in other answers do not work for me: when I try those methods, the output is the same string as the input.

I am trying to do web scraping using Python 2.7. I have downloaded the webpage, and it contains characters in the form &#120, where 120 seems to be the ASCII code. I tried the HTMLParser() and decode() methods, but nothing seems to work. Please note that those characters are all I have from the webpage. Example:

&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32

Please guide me to decode these strings using Python. I have read the other answers but the solutions don't seem to work for me.

Solution

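Depending on what you're doing, you may wish to convert that data to valid HTML character references (https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Character_reference_overview), which end with a semicolon, so you can parse it in context with a proper HTML parser.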
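As a rough illustration of that first option, the sketch below appends the missing semicolons and then hands the result to the standard library. It assumes Python 3 (where html.unescape is available) rather than the asker's Python 2.7, and the helper name add_semicolons is made up for this example.

```python
import html
import re


def add_semicolons(text):
    """Terminate every decimal reference such as '&#66' with ';'.

    Hypothetical helper for this sketch; references that already end
    with ';' are left with exactly one semicolon.
    """
    return re.sub(r'&#(\d+);?', r'&#\1;', text)


s = '&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32'

# Step 1: turn the data into well-formed numeric character references.
valid = add_semicolons(s)

# Step 2: let the standard library decode them.
decoded = html.unescape(valid)
print(repr(decoded))  # 'Blasterjaxx '
```

Once the references are well formed, the same text could also be fed to a full HTML parser together with the rest of the page.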

However, it's easy enough to extract the number strings and convert them to the equivalent ASCII characters yourself. For example:

s = '&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32'
# Split on '&#', drop the empty first piece, and map each code to its character.
print(''.join([chr(int(u)) for u in s.split('&#') if u]))

Output:

Blasterjaxx 

The "if u" test skips the empty string that split() returns as its first item because s begins with the separator '&#'. Alternatively, we could skip it by slicing:

''.join([chr(int(u)) for u in s.split('&#')[1:]])
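Both split-based versions assume the string contains nothing but references; any surrounding text would reach int() and raise a ValueError. If the scraped data mixes references with ordinary text, a regular-expression substitution is one possible alternative. The sketch below is only an illustration: it assumes Python 3, and the function name decode_numeric_refs is hypothetical.

```python
import re

# Matches '&#' followed by decimal digits, with an optional trailing ';'.
NUMERIC_REF = re.compile(r'&#(\d+);?')


def decode_numeric_refs(text):
    """Replace each decimal reference with its character.

    Hypothetical helper for this sketch; text outside the references
    is returned unchanged.
    """
    return NUMERIC_REF.sub(lambda m: chr(int(m.group(1))), text)


s = '&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32'
print(repr(decode_numeric_refs(s)))                    # 'Blasterjaxx '
print(repr(decode_numeric_refs('DJ &#66&#108 live')))  # 'DJ Bl live'
```

In Python 2.7, chr() only accepts values below 256, so unichr() would be needed there for code points outside that range.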
