Handling Indian Languages in BeautifulSoup
Problem description
I'm trying to scrape the NDTV website for news titles. This is the page I'm using as an HTML source. I'm using BeautifulSoup (bs4) to handle the HTML, and everything works except that my code breaks when it encounters the Hindi titles on the page I linked to.
My code so far is:
import urllib2
from bs4 import BeautifulSoup

htmlUrl = "http://archives.ndtv.com/articles/2012-01.html"
FileName = "NDTV_2012_01.txt"

fptr = open(FileName, "w")
fptr.seek(0)

page = urllib2.urlopen(htmlUrl)
soup = BeautifulSoup(page, from_encoding="UTF-8")

li = soup.findAll('li')
for link_tag in li:
    hypref = link_tag.find('a').contents[0]
    strhyp = str(hypref)
    fptr.write(strhyp)
    fptr.write("\n")
The error I get is:
Traceback (most recent call last):
  File "./ScrapeTemplate.py", line 30, in <module>
    strhyp = str(hypref)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)
I got the same error even when I didn't include the from_encoding parameter. I initially used it as fromEncoding, but Python warned me that it was deprecated usage.
How do I fix this? From what I've read, I need to either skip the Hindi titles or explicitly encode them as non-ASCII text, but I don't know how to do that. Any help would be greatly appreciated!
Recommended answer
What you see is a NavigableString instance (which is derived from the Python unicode type):
(Pdb) hypref.encode('utf-8')
'NDTV'
(Pdb) hypref.__class__
<class 'bs4.element.NavigableString'>
(Pdb) hypref.__class__.__bases__
(<type 'unicode'>, <class 'bs4.element.PageElement'>)
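Calling str() on it makes Python 2 encode it with the default ASCII codec, which fails on the Hindi characters. Encode it explicitly instead:
hypref.encode('utf-8')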
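As a rough illustration of why the explicit encode helps, here is a minimal sketch that avoids the network and bs4 entirely. Because NavigableString derives from the text type, a plain text string stands in for it here. The sample title, the temporary file name, and the use of io.open with an encoding argument are assumptions added for illustration and are not part of the original code.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical stand-in for a scraped title: the word "Hindi" in Devanagari.
title = u"\u0939\u093f\u0928\u094d\u0926\u0940"

# Encoding to ASCII fails; this is what str() attempts implicitly on Python 2.
try:
    title.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding to UTF-8 succeeds and yields bytes that are safe to write to a file.
encoded = title.encode("utf-8")

# Alternative: open the file with an encoding and write the text directly,
# letting the file object do the encoding.
out_path = os.path.join(tempfile.mkdtemp(), "titles.txt")
with io.open(out_path, "w", encoding="utf-8") as handle:
    handle.write(title + u"\n")

# Read the file back as raw bytes to inspect what was written.
with io.open(out_path, "rb") as handle:
    raw = handle.read()
```

In the question's loop, this would likely mean either writing hypref.encode('utf-8') to the file, or opening the output file with an explicit encoding and writing hypref without calling str() on it.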