HTML elements in lxml get incorrectly encoded like &#x41D;&#x430;&#x439;

Views: 161. Posted: 2017/8/17 1:28:12. Tags: python, encoding, lxml

Question

I need to print the RSS link from a web page, but the link is decoded incorrectly. Here is my code:

import urllib2
from lxml import html, etree
import chardet

data = urllib2.urlopen('http://facts-and-joy.ru/')
S=data.read()
encoding = chardet.detect(S)['encoding']
#S=S.decode(encoding)
#encoding='utf-8'

print encoding
parser = html.HTMLParser(encoding=encoding)
content = html.document_fromstring(S,parser)
loLinks = content.xpath('//link[@type="application/rss+xml"]')

for oLink in loLinks:
    print oLink.xpath('@title')[0]
    print etree.tostring(oLink,encoding='utf-8')

This is my output:

utf-8
Позитивное мышление RSS Feed
<link rel="alternate" type="application/rss+xml" title="&#x41F;&#x43E;&#x437;&#x438;&#x442;&#x438;&#x432;&#x43D;&#x43E;&#x435; &#x43C;&#x44B;&#x448;&#x43B;&#x435;&#x43D;&#x438;&#x435; RSS Feed" href="http://facts-and-joy.ru/feed/" />&#13;

The title content is displayed correctly when printed on its own, but in the tostring() output it is replaced by strange &#... sequences. How can I print the whole link element correctly?

Thanks in advance for your help!

Recommended answer

Here is a simplified version of your program that works:

from lxml import html

url = 'http://facts-and-joy.ru/'
content = html.parse(url)
rsslinks = content.xpath('//link[@type="application/rss+xml"]')

for link in rsslinks:
    print link.get('title')
    print html.tostring(link, encoding="utf-8")

Output:

Позитивное мышление RSS Feed
<link rel="alternate" type="application/rss+xml" title="Позитивное мышление RSS Feed" href="http://facts-and-joy.ru/feed/">&#13;

The crucial line is

print html.tostring(link, encoding="utf-8")

That is the only thing you must change in your original program.

Using html.tostring() instead of etree.tostring() produces actual characters instead of numeric character references. You could also use etree.tostring(link, method="html", encoding="utf-8").
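
The two outputs carry the same text: a numeric character reference such as &#x41F; is only an escaped spelling of the character П. As a minimal sketch (Python 3, standard library only, so it does not exercise lxml itself; the variable names are illustrative), html.unescape can be used to check that the escaped title from the question decodes to the title shown in the answer:

```python
# Python 3, standard library only (no lxml involved).
from html import unescape

# The title attribute as etree.tostring() printed it in the question.
escaped = (
    "&#x41F;&#x43E;&#x437;&#x438;&#x442;&#x438;&#x432;&#x43D;&#x43E;&#x435; "
    "&#x43C;&#x44B;&#x448;&#x43B;&#x435;&#x43D;&#x438;&#x435; RSS Feed"
)

# The same attribute as html.tostring() printed it in the answer.
literal = "Позитивное мышление RSS Feed"

# Resolve every numeric character reference to the character it names.
decoded = unescape(escaped)

print(decoded)
print(decoded == literal)
```

So the etree.tostring() output is not corrupted, only escaped; any HTML or XML consumer would read it as the same title.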

It is not clear why this difference exists between the "html" and "xml" output methods. A post about it on the lxml mailing list did not get any replies: https://mailman-mail5.webfaction.com/pipermail/lxml/2011-September/006131.html.
