Decoding html encoded strings in python


Problem description

I have the following string...

"Scam, hoax, or the real deal, he&#8217;s gonna work his way to the bottom of the sordid tale, and hopefully end up with an arcade game in the process."

I need to turn it into this string...

Scam, hoax, or the real deal, he’s gonna work his way to the bottom of the sordid tale, and hopefully end up with an arcade game in the process.

This is pretty standard HTML encoding and I can't for the life of me figure out how to convert it in Python.

I found this: GitHub

And it's very close to working; however, it does not output an apostrophe but instead some odd Unicode character.

Here is an example of the output from the GitHub script...

Scam, hoax, or the real deal, heâs gonna work his way to the bottom of the sordid tale, and hopefully end up with an arcade game in the process.

Solution

What you're trying to do is called "HTML entity decoding", and it's covered in a number of past Stack Overflow questions.

Here's a code snippet using the Beautiful Soup HTML parsing library to decode your example:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from BeautifulSoup import BeautifulSoup

string = "Scam, hoax, or the real deal, he&#8217;s gonna work his way to the bottom of the sordid tale, and hopefully end up with an arcade game in the process."
s = BeautifulSoup(string,convertEntities=BeautifulSoup.HTML_ENTITIES).contents[0]
print s

Here's the output:

Scam, hoax, or the real deal, he’s gonna work his way to the bottom of the sordid tale, and hopefully end up with an arcade game in the process.
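
The snippet above targets Python 2 and the older BeautifulSoup 3 package. On Python 3.4 or later, the standard library's html.unescape can do the same decoding without a third-party dependency. Here is a minimal sketch, assuming the input carries the numeric character reference &#8217; (right single quotation mark). The last two lines show one plausible cause of the stray "â" in the question, namely UTF-8 bytes being read as Latin-1; that explanation is an assumption, not something confirmed about the GitHub script.

```python
import html

# Assumed input: the apostrophe is encoded as the numeric reference &#8217;
encoded = ("Scam, hoax, or the real deal, he&#8217;s gonna work his way to "
           "the bottom of the sordid tale, and hopefully end up with an "
           "arcade game in the process.")

# html.unescape converts named and numeric character references to Unicode text
decoded = html.unescape(encoded)
print(decoded)

# Assumed cause of the garbled output: the UTF-8 bytes of U+2019 decoded as Latin-1
mojibake = "\u2019".encode("utf-8").decode("latin-1")
print(repr(mojibake))
```

If the decoded text is printed or written to a file, making sure the output encoding is UTF-8 should keep the apostrophe intact.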
