Handling Indian Languages in BeautifulSoup


Problem description

I'm trying to scrape the NDTV website for news titles. The page fetched in the code below is my HTML source. I'm using BeautifulSoup (bs4) to handle the HTML, and everything works except that my code breaks when it encounters the Hindi titles on that page.

My code so far is:

import urllib2
from bs4 import BeautifulSoup

htmlUrl = "http://archives.ndtv.com/articles/2012-01.html"
FileName = "NDTV_2012_01.txt"

fptr = open(FileName, "w")
fptr.seek(0)

page = urllib2.urlopen(htmlUrl)
soup = BeautifulSoup(page, from_encoding="UTF-8")

li = soup.findAll( 'li')
for link_tag in li:
   hypref = link_tag.find('a').contents[0]
   strhyp = str(hypref)
   fptr.write(strhyp)
   fptr.write("\n")

The error I get is:

Traceback (most recent call last):
  File "./ScrapeTemplate.py", line 30, in <module>
    strhyp = str(hypref)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)

I get the same error even when I leave out the from_encoding parameter. I initially spelled it fromEncoding, but Python warned me that this usage is deprecated.

How do I fix this? From what I've read, I need to either skip the Hindi titles or explicitly encode them as non-ASCII text, but I don't know how to do that. Any help would be greatly appreciated!

Recommended answer

What you see is a NavigableString instance (which is derived from the Python unicode type):

(Pdb) hypref.encode('utf-8')
'NDTV'
(Pdb) hypref.__class__
<class 'bs4.element.NavigableString'>
(Pdb) hypref.__class__.__bases__
(<type 'unicode'>, <class 'bs4.element.PageElement'>)
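
Because NavigableString is a unicode subclass, calling str() on it in Python 2 implicitly encodes the text with the ASCII codec, which is what fails for Hindi characters. The sketch below reproduces that failure without BeautifulSoup. FakeNavigableString and the sample title are hypothetical stand-ins, not part of bs4 or of the NDTV page, and the ASCII encode is written out explicitly so the snippet behaves the same on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-

# Hypothetical stand-in for bs4's NavigableString: a plain subclass of the
# text type (unicode on Python 2, str on Python 3).
class FakeNavigableString(type(u"")):
    """Text subclass used only for illustration."""


# Sample Hindi word written with escapes so the source file stays ASCII.
title = FakeNavigableString(u"\u0928\u092e\u0938\u094d\u0924\u0947")

# On Python 2, str(title) performs this ASCII encode implicitly.
try:
    title.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding explicitly as UTF-8 succeeds and yields a byte string.
utf8_bytes = title.encode("utf-8")

print(ascii_ok)
print(len(title), len(utf8_bytes))
```

The ASCII codec has no representation for Devanagari code points, while UTF-8 represents each of them with several bytes, so only the explicit UTF-8 encode succeeds.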

You need to use:

hypref.encode('utf-8')
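
In the question's loop, that means writing hypref.encode('utf-8') to the file instead of str(hypref). As an alternative that is not part of the answer above, the output file can be opened with an explicit encoding so that the encoding happens on write. This is a minimal sketch, assuming the titles have already been extracted as Unicode strings; write_titles, sample_titles, and out_path are hypothetical names, and a temporary directory stands in for the real output location.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def write_titles(titles, path):
    """Write each title on its own line, encoded as UTF-8 (hypothetical helper)."""
    # io.open takes an encoding argument on both Python 2 and Python 3,
    # so Unicode text is encoded when it is written.
    with io.open(path, "w", encoding="utf-8") as out:
        for title in titles:
            out.write(title + u"\n")


# Sample data standing in for titles scraped from the page (assumed values).
sample_titles = [u"NDTV", u"\u0928\u092e\u0938\u094d\u0924\u0947"]
out_path = os.path.join(tempfile.mkdtemp(), "titles.txt")
write_titles(sample_titles, out_path)

# Read the file back to confirm the text survived the round trip.
with io.open(out_path, "r", encoding="utf-8") as check:
    lines = check.read().splitlines()
```

With this approach the loop never calls str() on a title, so the implicit ASCII conversion is avoided entirely.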
