Trying to get encoding from a webpage Python and BeautifulSoup
Problem description
I'm trying to retrieve the charset of a webpage (it will vary from page to page). At the moment I'm using BeautifulSoup to parse the page and then extract the charset from the header. This was working fine until I ran into a site that had:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
My code so far, which worked with the other pages, is:
def get_encoding(soup):
    # Try the attributes of the first <meta> tag in turn.
    encod = soup.meta.get('charset')
    if encod is None:
        encod = soup.meta.get('content-type')
        if encod is None:
            # Returns the whole value, e.g. "text/html; charset=utf-8"
            encod = soup.meta.get('content')
    return encod
Does anyone have a good idea of how to extend this code to retrieve the charset from the example above? Would tokenizing the value and pulling the charset out that way work? How would you do it without changing the whole function? Right now the code above returns "text/html; charset=utf-8", which causes a LookupError because that is not a known encoding.
Thanks
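To make the failure concrete, here is a minimal sketch using only the standard library. It first shows that the full content value is not a codec name, then splits the value into parameters with `email.message.Message`. The helper name `charset_from_content_type` is made up for illustration and is not part of BeautifulSoup.

```python
import codecs
from email.message import Message


def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg['Content-Type'] = value
    # get_param parses the "; key=value" parameters of the header.
    return msg.get_param('charset')


raw = 'text/html; charset=utf-8'

# The whole string is not a codec name, hence the LookupError.
try:
    codecs.lookup(raw)
except LookupError:
    print('not a codec name:', raw)

print(charset_from_content_type(raw))
```

The value passed in would be whatever `soup.meta.get('content')` returned, so the rest of the function could stay as it is.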
The final code that I ended up using:
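import re
import chardet

def get_encoding(soup):
    encod = soup.meta.get('charset')
    if encod is None:
        encod = soup.meta.get('content-type')
        if encod is None:
            # "or ''" avoids a TypeError when the tag has no content attribute
            content = soup.meta.get('content') or ''
            match = re.search('charset=(.*)', content)
            if match:
                encod = match.group(1)
            else:
                # No declared charset: fall back to chardet's guess
                # (unicode is the Python 2 built-in)
                dic_of_possible_encodings = chardet.detect(unicode(soup))
                encod = dic_of_possible_encodings['encoding']
    return encod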
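If chardet is not available, a much cruder fallback can be sketched with the standard library alone. The sketch below is not chardet and does no statistical detection: it simply tries a short list of candidate encodings on the raw page bytes and returns the first one that decodes cleanly. The function name `guess_encoding` and the candidate list are assumptions chosen for illustration.

```python
def guess_encoding(raw_bytes, candidates=('utf-8', 'cp1252')):
    """Return the first candidate that decodes raw_bytes, or None."""
    for name in candidates:
        try:
            raw_bytes.decode(name)
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
        return name
    return None


print(guess_encoding('caf\u00e9'.encode('utf-8')))
print(guess_encoding('caf\u00e9'.encode('cp1252')))
```

Order matters here: strict encodings such as UTF-8 should come first, because permissive single-byte encodings accept almost any input.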
import re

def get_encoding(soup):
    encod = soup.meta.get('charset')
    if encod is None:
        encod = soup.meta.get('content-type')
        if encod is None:
            # "or ''" avoids a TypeError when the tag has no content attribute
            content = soup.meta.get('content') or ''
            # Capture everything after "charset=" in the content value
            match = re.search('charset=(.*)', content)
            if match:
                encod = match.group(1)
            else:
                raise ValueError('unable to find encoding')
    return encod
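The pattern `charset=(.*)` captures everything up to the end of the string, so trailing parameters, quotes, or an upper-case `CHARSET` would slip through or be missed. A slightly stricter variant might look like the sketch below, which works on the content string alone and checks the result with `codecs.lookup`. The names `CHARSET_RE` and `extract_charset` are hypothetical helpers, not part of any library.

```python
import codecs
import re

# Case-insensitive, tolerates spaces and an optional opening quote,
# and stops at the first character that cannot be part of a codec name.
CHARSET_RE = re.compile(r'charset\s*=\s*["\']?([\w.:-]+)', re.IGNORECASE)


def extract_charset(content):
    """Return a normalized codec name found in content, or None."""
    if not content:
        return None
    match = CHARSET_RE.search(content)
    if match is None:
        return None
    try:
        # codecs.lookup raises LookupError for names Python does not know.
        return codecs.lookup(match.group(1)).name
    except LookupError:
        return None


print(extract_charset('text/html; CHARSET="UTF-8"; foo=bar'))
```

Returning `None` instead of raising leaves it to the caller to decide between a fallback and an error, as the two versions above do.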