Get content-type from HTML page with BeautifulSoup


Problem Description

I am trying to get the character encoding for pages that I scrape, but in some cases it is failing. Here is what I am doing:

resp = urllib2.urlopen(request)
self.COOKIE_JAR.extract_cookies(resp, request)
content = resp.read()
encodeType = resp.headers.getparam('charset')
resp.close()

That is my first attempt. But if charset comes back as type None, I do this:

soup = BeautifulSoup(html)
if encodeType == None:
    try:
        encodeType = soup.findAll('meta', {'http-equiv':lambda v:v.lower()=='content-type'})
    except AttributeError, e:
        print e
        try:
            encodeType = soup.findAll('meta', {'charset':lambda v:v.lower() != None})
        except AttributeError, e:
            print e
            if encodeType == '':
                encodeType = 'iso-8859-1'

The page I am testing has this in the header: <meta charset="ISO-8859-1">

I would expect the first try statement to return an empty string, but I get this error on both try statements (which is why the 2nd statement is nested for now):

'NoneType' object has no attribute 'lower'

What is wrong with the 2nd try statement? I am guessing the 1st one is incorrect as well, since it's throwing an error and not just coming back blank.
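One likely cause, sketched below in Python 3 syntax: when a dictionary value in `findAll` is a function, BeautifulSoup calls that function once for every candidate tag, passing `None` when the tag does not have the attribute at all, so `v.lower()` raises exactly the `AttributeError` shown above. Guarding the filter against `None` avoids it (the names `unsafe` and `safe` are illustrative, not from the original code):

```python
# BeautifulSoup passes None to an attribute-filter function when a tag
# lacks that attribute entirely; calling .lower() on None then raises
# AttributeError: 'NoneType' object has no attribute 'lower'.
unsafe = lambda v: v.lower() == 'content-type'

# Checking for None first makes the filter safe for attribute-less tags.
safe = lambda v: v is not None and v.lower() == 'content-type'

print(safe(None))            # False: a tag without the attribute is skipped
print(safe('Content-Type'))  # True: case-insensitive match still works
```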

Or better yet, is there a more elegant way to just remove any special character encoding from a page? The end result I'm trying to accomplish is that I don't care about any of the specially encoded characters; I want to delete them and keep the raw text. Can I skip all of the above and tell BeautifulSoup to just strip anything that is encoded?

Recommended Answer

I decided to just go with whatever BeautifulSoup spits out. Then as I parse through each word in the document, if I can't convert it to a string, I just disregard it.

for word in doc.lower().split():
    try:
        word = str(word)
        word = self.handlePunctuation(word)
        if word is False:
            continue
    except UnicodeEncodeError, e:
        # word couldn't be converted to a string; most likely encoding
        # garbage we can toss anyway
        continue
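If the goal is simply to keep plain text and discard anything that won't encode, the same effect can be had in one pass by encoding with `errors='ignore'` instead of testing word by word. This is a sketch of that alternative in Python 3 syntax, not the answerer's original code; `strip_non_ascii` is an illustrative name:

```python
def strip_non_ascii(text):
    # Encode to ASCII, silently dropping any characters that cannot be
    # represented, then decode back to a plain string.
    return text.encode('ascii', 'ignore').decode('ascii')

print(strip_non_ascii('caf\u00e9 r\u00e9sum\u00e9'))  # -> 'caf rsum'
```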

