HTML encoding and lxml parsing

Views: 183 | Posted: 2016/8/5 18:54:28 | Tags: python unicode web-scraping beautifulsoup lxml

Question

I'm trying to finally solve some encoding issues that pop up from trying to scrape HTML with lxml. Here are three sample HTML documents that I've encountered:

1.
<!DOCTYPE html>
<html lang='en'>
<head>
   <title>Unicode Chars: 은 —’</title>
   <meta charset='utf-8'>
</head>
<body></body>
</html>
2.
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ko-KR" lang="ko-KR">
<head>
    <title>Unicode Chars: 은 —’</title>
    <meta http-equiv="content-type" content="text/html; charset=utf-8" />
</head>
<body></body>
</html>
3.
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Unicode Chars: 은 —’</title>
</head>
<body></body>
</html>
My basic script:
from lxml.html import fromstring
...

doc = fromstring(raw_html)
title = doc.xpath('//title/text()')[0]
print title
The results are:
Unicode Chars: ì ââ
Unicode Chars: 은 —’
Unicode Chars: 은 —’
So there is clearly a problem with sample 1, which lacks the <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> tag. The solution from here (http://stackoverflow.com/questions/2686709/encoding-in-python-with-lxml-complex-solution) will correctly recognize sample 1 as utf-8 and so it is functionally equivalent to my original code.
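
The garbled first result is consistent with UTF-8 bytes being read as a single-byte encoding. The following is a minimal sketch of that effect using only the standard library. It assumes the parser fell back to ISO-8859-1 when it found no charset declaration it recognized; that fallback is an assumption made for illustration, not something confirmed above.

```python
# Sample 1's title, stored on disk as UTF-8 bytes.
title = "Unicode Chars: 은 —’"
raw = title.encode("utf-8")

# Assumption: with no recognized charset, the bytes are read as ISO-8859-1,
# so every byte of a multi-byte UTF-8 sequence becomes its own character.
garbled = raw.decode("latin-1")
print(repr(garbled))

# The damage is reversible as long as no bytes were dropped along the way.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == title)
```

Each non-ASCII character turns into several Latin-1 characters, some of them invisible control characters, which would explain why only fragments such as "ì" and "â" show up in the printed result.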

The lxml docs appear conflicted:

One example in the docs seems to suggest we should use UnicodeDammit to decode the markup to unicode:
from BeautifulSoup import UnicodeDammit

def decode_html(html_string):
    converted = UnicodeDammit(html_string, isHTML=True)
    if not converted.unicode:
        raise UnicodeDecodeError(
            "Failed to detect encoding, tried [%s]",
            ', '.join(converted.triedEncodings))
    # print converted.originalEncoding
    return converted.unicode

root = lxml.html.fromstring(decode_html(tag_soup))
However, elsewhere the docs say:

[Y]ou will get errors when you try [to parse] HTML data in a unicode string that specifies a charset in a meta tag of the header. You should generally avoid converting XML/HTML data to unicode before passing it into the parsers. It is both slower and error prone.

If I try to follow the first suggestion in the lxml docs, my code is now:
from lxml.html import fromstring
from bs4 import UnicodeDammit
...
dammit = UnicodeDammit(raw_html)
doc = fromstring(dammit.unicode_markup)
title = doc.xpath('//title/text()')[0]
print title
I now get the following results:
Unicode Chars: 은 —’
Unicode Chars: 은 —’
ValueError: Unicode strings with encoding declaration are not supported.
Sample 1 now works correctly but sample 3 results in an error due to the <?xml version="1.0" encoding="utf-8"?> tag.

Is there a correct way to handle all of these cases? Is there a better solution than the following?
dammit = UnicodeDammit(raw_html)
try:
    doc = fromstring(dammit.unicode_markup)
except ValueError:
    doc = fromstring(raw_html)
Solution
lxml has several issues related to handling Unicode. It might be best to use bytes (for now) while specifying the character encoding explicitly:
#!/usr/bin/env python
import glob
from lxml import html
from bs4 import UnicodeDammit

for filename in glob.glob('*.html'):
    # Read raw bytes; do not decode them before parsing.
    with open(filename, 'rb') as file:
        content = file.read()
        # UnicodeDammit is used here only to detect the encoding.
        doc = UnicodeDammit(content, is_html=True)

    # Hand the detected encoding to the parser explicitly.
    parser = html.HTMLParser(encoding=doc.original_encoding)
    root = html.document_fromstring(content, parser=parser)
    title = root.find('.//title').text_content()
    print(title)
Output
Unicode Chars: 은 —’
Unicode Chars: 은 —’
Unicode Chars: 은 —’
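
If the markup has already been decoded to unicode and bytes are no longer available, one way to avoid the ValueError from sample 3 is to drop the leading XML declaration before parsing. The sketch below uses only the standard library; strip_xml_declaration is a hypothetical helper written for illustration and is not part of lxml or bs4.

```python
import re

# Matches a leading XML declaration such as
# <?xml version="1.0" encoding="utf-8"?>, with an optional BOM before it.
_XML_DECLARATION = re.compile(r"^\ufeff?\s*<\?xml[^>]*\?>\s*", re.IGNORECASE)


def strip_xml_declaration(text):
    """Return text without a leading XML declaration (hypothetical helper)."""
    return _XML_DECLARATION.sub("", text, count=1)


# A shortened stand-in for sample 3, already decoded to a unicode string.
sample3 = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    "<!DOCTYPE html>\n"
    "<html><head><title>Unicode Chars: 은 —’</title></head></html>"
)

cleaned = strip_xml_declaration(sample3)
print(cleaned.splitlines()[0])
```

The cleaned string could then be passed to fromstring in place of dammit.unicode_markup. The caution quoted from the lxml docs still applies to unicode input that carries a charset in a meta tag, so parsing bytes with an explicit encoding remains the safer route.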