python lxml can't parse Japanese in some cases

Question

I am using lxml 4.5.0 to scrape data from websites.

It works well in the following example:

chrome_ua = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 " \
            "(KHTML, like Gecko) Chrome/77.0.3864.0 Safari/537.36"

with requests.Session() as s:
    s.headers.update({'User-Agent': chrome_ua})
    resp = s.get('https://www.yahoo.co.jp')
    parser = etree.HTMLParser()
    tree = etree.parse(StringIO(resp.text), parser)
    result = tree.xpath('//*[@id="tabTopics1"]/a')[0]

result.text

result.text gives me the right text 'ニュース'.

But when I try another site, it fails to parse the Japanese properly.

chrome_ua = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 " \
            "(KHTML, like Gecko) Chrome/77.0.3864.0 Safari/537.36"

with requests.Session() as s:
    s.headers.update({'User-Agent': chrome_ua})
    resp = s.get('https://travel.rakuten.co.jp/')
    parser = etree.HTMLParser()
    tree = etree.parse(StringIO(resp.text), parser)
    result = tree.xpath('//*[@id="rt-nav-box"]/li[1]/a')[0]

result.text

result.text gives me 'å\x9b½å\x86\x85æ\x97\x85è¡\x8c', but it should be '国内旅行'.

I tried to use parser = etree.HTMLParser(encoding='utf-8'), but it still does not work.

How can I make lxml parse Japanese properly in this case?

Answer

Using

print(resp.encoding)

you can see that it used ISO-8859-1 to convert resp.content to resp.text.
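For context, requests falls back to ISO-8859-1 when the server's Content-Type header says text/* without a charset parameter. A quick way to confirm that on this response (a small check reusing resp from above; the header value in the comment is only what this server presumably returns):

print(resp.headers.get('Content-Type'))  # presumably 'text/html' with no charset parameter
print(resp.encoding)                     # ISO-8859-1, the fallback requests uses in that case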

But you can get resp.content directly and decode it with a different encoding:

StringIO( resp.content.decode('utf-8') )


Using the module chardet you can try to detect which encoding you should use:

print( chardet.detect(resp.content) )

Result:

{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}


import requests
from lxml import etree
from io import StringIO
import chardet

chrome_ua = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 " \
            "(KHTML, like Gecko) Chrome/77.0.3864.0 Safari/537.36"

with requests.Session() as s:
    s.headers.update({'User-Agent': chrome_ua})
    resp = s.get('https://travel.rakuten.co.jp/')

    print(resp.encoding)                   # encoding chosen by requests (ISO-8859-1)
    print( chardet.detect(resp.content) )  # encoding detected by chardet (utf-8)
    detected_encoding = chardet.detect(resp.content)['encoding']

    # decode the raw bytes with the detected encoding instead of relying on resp.text
    parser = etree.HTMLParser()
    #tree = etree.parse(StringIO(resp.content.decode('utf-8')), parser)
    tree = etree.parse(StringIO(resp.content.decode(detected_encoding)), parser)
    result = tree.xpath('//*[@id="rt-nav-box"]/li[1]/a')[0]

result.text


As @usr2564301 found in the answer to

Python requests.get() returns improperly decoded text instead of UTF-8?

you can also set

 resp.encoding = resp.apparent_encoding 

so that requests uses the encoding recognized by chardet.
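A minimal sketch of that approach, reusing the URL, user agent and XPath from the question (apparent_encoding comes from chardet, or from charset_normalizer in newer requests releases):

import requests
from lxml import etree
from io import StringIO

chrome_ua = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 " \
            "(KHTML, like Gecko) Chrome/77.0.3864.0 Safari/537.36"

with requests.Session() as s:
    s.headers.update({'User-Agent': chrome_ua})
    resp = s.get('https://travel.rakuten.co.jp/')

    # let requests re-detect the encoding before resp.text is used
    resp.encoding = resp.apparent_encoding

    parser = etree.HTMLParser()
    tree = etree.parse(StringIO(resp.text), parser)
    result = tree.xpath('//*[@id="rt-nav-box"]/li[1]/a')[0]

print(result.text)  # expected: 国内旅行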
