HTML encoding and lxml parsing


Problem description


I'm trying to finally solve some encoding issues that pop up from trying to scrape HTML with lxml. Here are three sample HTML documents that I've encountered:

1.

<!DOCTYPE html>
<html lang='en'>
<head>
   <title>Unicode Chars: 은 —’</title>
   <meta charset='utf-8'>
</head>
<body></body>
</html>

2.

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ko-KR" lang="ko-KR">
<head>
    <title>Unicode Chars: 은 —’</title>
    <meta http-equiv="content-type" content="text/html; charset=utf-8" />
</head>
<body></body>
</html>

3.

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Unicode Chars: 은 —’</title>
</head>
<body></body>
</html>

My basic script:

from lxml.html import fromstring
...

doc = fromstring(raw_html)
title = doc.xpath('//title/text()')[0]
print(title)

The results are:

Unicode Chars: ì ââ
Unicode Chars: 은 —’
Unicode Chars: 은 —’

So there is obviously an issue with sample 1, which lacks the <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> tag. The solution from http://stackoverflow.com/questions/2686709/encoding-in-python-with-lxml-complex-solution will correctly recognize sample 1 as utf-8 and so it is functionally equivalent to my original code.
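
The garbled first result is consistent with the UTF-8 bytes of the title being decoded as ISO-8859-1 (Latin-1), which is what would happen if the parser does not pick up the declared charset and falls back to a single-byte default. The sketch below reproduces that effect with the standard library only. The Latin-1 fallback is an assumption made for illustration, not something the question confirms.

```python
# Sketch: reproduce the garbled title from sample 1 without lxml.
# Assumption: the parser fell back to ISO-8859-1 (Latin-1).

# The title from the samples, written with escapes so the source file
# encoding does not matter: U+C740, U+2014, U+2019.
original = "Unicode Chars: \uc740 \u2014\u2019"

# The document on disk holds UTF-8 bytes ...
utf8_bytes = original.encode("utf-8")

# ... which turn into mojibake when each byte is read as one Latin-1 character.
garbled = utf8_bytes.decode("latin-1")

# Latin-1 maps bytes 0x80-0x9F to invisible control characters;
# drop them so only the visible part of the damage is shown.
visible = "".join(ch for ch in garbled if ch.isprintable())
print(visible)

# The damage is reversible as long as no bytes were lost.
restored = garbled.encode("latin-1").decode("utf-8")
print(restored == original)
```

The visible characters match the first line of the results above, which suggests the bytes themselves were intact and only the decoding step went wrong.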

The lxml docs appear conflicted:

One example in the lxml docs seems to suggest we should use UnicodeDammit to decode the markup to unicode.

import lxml.html
from BeautifulSoup import UnicodeDammit

def decode_html(html_string):
    converted = UnicodeDammit(html_string, isHTML=True)
    if not converted.unicode:
        raise UnicodeDecodeError(
            "Failed to detect encoding, tried [%s]",
            ', '.join(converted.triedEncodings))
    # print converted.originalEncoding
    return converted.unicode

root = lxml.html.fromstring(decode_html(tag_soup))

However, another part of the lxml docs says:

[Y]ou will get errors when you try [to parse] HTML data in a unicode string that specifies a charset in a meta tag of the header. You should generally avoid converting XML/HTML data to unicode before passing it into the parsers. It is both slower and error prone.

If I try to follow the first suggestion in the lxml docs, my code is now:

from lxml.html import fromstring
from bs4 import UnicodeDammit
...
dammit = UnicodeDammit(raw_html)
doc = fromstring(dammit.unicode_markup)
title = doc.xpath('//title/text()')[0]
print(title)

I now get the following results:

Unicode Chars: 은 —’
Unicode Chars: 은 —’
ValueError: Unicode strings with encoding declaration are not supported.

Sample 1 now works correctly but sample 3 results in an error due to the <?xml version="1.0" encoding="utf-8"?> tag.
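
If already-decoded text has to be parsed, one possible way around this particular ValueError is to drop the leading XML declaration before handing the string to the parser. The sketch below shows only that string-cleaning step with the standard library. `strip_xml_declaration` is a hypothetical helper, not part of lxml or bs4, and it assumes the declaration is the first thing in the document.

```python
import re

# Matches a leading XML declaration such as
# <?xml version="1.0" encoding="utf-8"?> plus surrounding whitespace.
XML_DECLARATION = re.compile(r'^\s*<\?xml[^>]*\?>\s*')


def strip_xml_declaration(text):
    """Return text without a leading XML declaration (hypothetical helper)."""
    return XML_DECLARATION.sub('', text, count=1)


# Shortened stand-in for sample 3, already decoded to a text string.
sample_3 = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<!DOCTYPE html>\n'
    '<html><head><title>t</title></head><body></body></html>'
)

cleaned = strip_xml_declaration(sample_3)
print(cleaned.splitlines()[0])

# Documents without a declaration pass through unchanged.
print(strip_xml_declaration('<html></html>'))
```

The cleaned string could then be passed to fromstring in place of dammit.unicode_markup. Whether this is preferable to parsing the raw bytes is a judgment call.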

Is there a correct way to handle all of these cases? Is there a better solution than the following?

dammit = UnicodeDammit(raw_html)
try:
    doc = fromstring(dammit.unicode_markup)
except ValueError:
    doc = fromstring(raw_html)

Solution

lxml has several issues related to handling Unicode. It might be best to use bytes (for now) while specifying the character encoding explicitly:

#!/usr/bin/env python
import glob
from lxml import html
from bs4 import UnicodeDammit

for filename in glob.glob('*.html'):
    with open(filename, 'rb') as file:
        content = file.read()
        doc = UnicodeDammit(content, is_html=True)

    parser = html.HTMLParser(encoding=doc.original_encoding)
    root = html.document_fromstring(content, parser=parser)
    title = root.find('.//title').text_content()
    print(title)

Output

Unicode Chars: 은 —’
Unicode Chars: 은 —’
Unicode Chars: 은 —’
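
To illustrate what "specifying the character encoding explicitly" can mean for the three samples, here is a rough standard-library sketch that reads the declared encoding straight from the raw bytes. It is not how UnicodeDammit works internally. `sniff_declared_encoding`, its regular expressions, and the 2048-byte limit are assumptions made for this example, and it only looks at declarations rather than guessing from the byte content.

```python
import re

# XML declaration at the very start: <?xml version="1.0" encoding="utf-8"?>
XML_DECL_RE = re.compile(
    rb'^\s*<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')

# Covers both <meta charset='utf-8'> and
# <meta http-equiv="content-type" content="text/html; charset=utf-8" />
META_RE = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?\s*([A-Za-z0-9._-]+)', re.IGNORECASE)


def sniff_declared_encoding(raw, limit=2048):
    """Return the encoding declared near the start of raw bytes, or None.

    Hypothetical helper: only declarations are inspected, no guessing.
    """
    head = raw[:limit]
    for pattern in (XML_DECL_RE, META_RE):
        match = pattern.search(head)
        if match:
            return match.group(1).decode('ascii').lower()
    return None


# Shortened stand-ins for the three samples from the question.
sample_1 = b"<!DOCTYPE html><html lang='en'><head><meta charset='utf-8'></head></html>"
sample_2 = (b'<html><head><meta http-equiv="content-type" '
            b'content="text/html; charset=utf-8" /></head></html>')
sample_3 = b'<?xml version="1.0" encoding="utf-8"?>\n<html><head></head></html>'

for sample in (sample_1, sample_2, sample_3):
    print(sniff_declared_encoding(sample))
```

The returned name could be passed as the encoding argument of html.HTMLParser, as the script above does with doc.original_encoding. When the helper returns None, falling back to UnicodeDammit's detection would be a reasonable choice.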
