Trying to get the encoding of a webpage with Python and BeautifulSoup


Question

I'm trying to retrieve the charset of a webpage (the pages change all the time). At the moment I'm using BeautifulSoup to parse the page and then extract the charset from the meta tag in the header. This worked fine until I ran into a site that had:

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

My code up until now, which worked with other pages, is:

    def get_encoding(soup):
        encod = soup.meta.get('charset')
        if encod is None:
            encod = soup.meta.get('content-type')
            if encod is None:
                encod = soup.meta.get('content')
        return encod

Does anyone have a good idea for how to extend this code so that it retrieves the charset from the example above? Would tokenizing the value and pulling the charset out that way work? How would you go about it without changing the whole function? Right now the code above returns "text/html; charset=utf-8", which causes a LookupError because that is not a known encoding name.

Thanks

The final code that I ended up using:

    import re
    import chardet

    def get_encoding(soup):
        encod = soup.meta.get('charset')
        if encod is None:
            encod = soup.meta.get('content-type')
            if encod is None:
                content = soup.meta.get('content')
                match = re.search('charset=(.*)', content)
                if match:
                    encod = match.group(1)
                else:
                    # Fall back to chardet's guess (Python 2 code: unicode() does not exist on Python 3)
                    dic_of_possible_encodings = chardet.detect(unicode(soup))
                    encod = dic_of_possible_encodings['encoding']
        return encod

Solution

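    import re

    def get_encoding(soup):
        # Prefer an explicit charset attribute: <meta charset="...">
        encod = soup.meta.get('charset')
        if encod is None:
            encod = soup.meta.get('content-type')
            if encod is None:
                # Otherwise parse it out of content="text/html; charset=..."
                content = soup.meta.get('content')
                # "or ''" avoids a TypeError when the tag has no content attribute
                match = re.search('charset=(.*)', content or '')
                if match:
                    encod = match.group(1)
                else:
                    raise ValueError('unable to find encoding')
        return encod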
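
One caveat with the pattern 'charset=(.*)' is that it captures everything after the equals sign, so a value such as "text/html; charset=utf-8; foo=bar" would still yield a name that Python cannot look up. The sketch below is one possible way to tighten that step, using only the standard library. The helper name charset_from_content is hypothetical (it is not part of BeautifulSoup or of the answer above), and validating the name with codecs.lookup is an added assumption about what the caller wants, since that is the same lookup that raised the LookupError in the question.

```python
import codecs
import re

# Matches charset=<name>, tolerating optional quotes, spaces and any letter case.
# The name stops at the first character that cannot be part of a codec name.
_CHARSET_RE = re.compile(r'charset\s*=\s*["\']?([\w.:-]+)', re.IGNORECASE)


def charset_from_content(content):
    """Return a codec name found in a Content-Type style string, or None.

    Hypothetical helper for illustration; `content` is the value of the
    meta tag's content attribute and may be None.
    """
    if not content:
        return None
    match = _CHARSET_RE.search(content)
    if match is None:
        return None
    try:
        # codecs.lookup raises LookupError for names Python does not know
        return codecs.lookup(match.group(1)).name
    except LookupError:
        return None


print(charset_from_content("text/html; charset=utf-8"))
```

Inside get_encoding, the regex step could then become encod = charset_from_content(soup.meta.get('content')), with the ValueError raised when the helper returns None.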
