使用lxml在python中编码-复杂的解决方案 [英] Encoding in python with lxml - complex solution

查看:69
本文介绍了使用lxml在python中编码-复杂的解决方案的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要下载并使用lxml解析网页,并构建UTF-8 xml输出.我认为伪代码中的模式更具说明性:

I need to download and parse webpage with lxml and build UTF-8 xml output. I think schema in pseudocode is more illustrative:

from lxml import etree

webfile = urllib2.urlopen(url)
root = etree.parse(webfile.read(), parser=etree.HTMLParser(recover=True))

txt = my_process_text(etree.tostring(root.xpath('/html/body'), encoding=utf8))


output = etree.Element("out")
output.text = txt

outputfile.write(etree.tostring(output, encoding=utf8))

因此,webfile可以采用任何编码(lxml应该可以处理).输出文件必须在utf-8中.我不确定在哪里使用编码/编码.这种架构可以吗? (我找不到有关lxml和编码的很好的教程,但是我可以找到很多与此有关的问题...)我需要可靠的解决方案.

So webfile can be in any encoding (lxml should handle this). Outputfile have to be in utf-8. I'm not sure where to use encoding/coding. Is this schema ok? (I cant find good tutorial about lxml and encoding, but I can find many problems with this...) I need robust solution.

因此,我将utf-8发送到lxml

So for sending utf-8 to lxml I use

        converted = UnicodeDammit(webfile, isHTML=True)
        if not converted.unicode:
            print "ERR. UnicodeDammit failed to detect encoding, tried [%s]", \
                ', '.join(converted.triedEncodings)
            continue
        webfile = converted.unicode.encode('utf-8')

推荐答案

lxml对于输入编码可能有点不了解.最好发送UTF8并取出UTF8.

lxml can be a little wonky about input encodings. It is best to send UTF8 in and get UTF8 out.

您可能要使用 chardet 模块或

You might want to use the chardet module or UnicodeDammit to decode the actual data.

您想隐约地做某事:

import chardet
from lxml import html
content = urllib2.urlopen(url).read()
encoding = chardet.detect(content)['encoding']
if encoding != 'utf-8':
    content = content.decode(encoding, 'replace').encode('utf-8')
doc = html.fromstring(content, base_url=url)

我不确定为什么要在lxml和etree之间移动,除非您正在与已经使用etree的另一个库进行交互?

I'm not sure why you are moving between lxml and etree, unless you are interacting with another library that already uses etree?

这篇关于使用lxml在python中编码-复杂的解决方案的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆