Handling Indian Languages in BeautifulSoup

Question

I'm trying to scrape the NDTV website for news titles. This is the page I'm using as an HTML source. I'm using BeautifulSoup (bs4) to parse the HTML, and everything works except that my code breaks when it encounters the Hindi titles on the page I linked to.

My code so far is:

import urllib2
from bs4 import BeautifulSoup

htmlUrl = "http://archives.ndtv.com/articles/2012-01.html"
FileName = "NDTV_2012_01.txt"

fptr = open(FileName, "w")
fptr.seek(0)

page = urllib2.urlopen(htmlUrl)
soup = BeautifulSoup(page, from_encoding="UTF-8")

li = soup.findAll( 'li')
for link_tag in li:
   hypref = link_tag.find('a').contents[0]
   strhyp = str(hypref)
   fptr.write(strhyp)
   fptr.write("\n")

The error I get is:

Traceback (most recent call last):
  File "./ScrapeTemplate.py", line 30, in <module>
    strhyp = str(hypref)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)

I get the same error even when I omit the from_encoding parameter. I initially spelled it fromEncoding, but Python warned me that this usage is deprecated.

How do I fix this? From what I've read, I need to either skip the Hindi titles or explicitly encode them as non-ASCII text, but I don't know how to do that. Any help would be greatly appreciated!

Recommended answer

What you get back is a NavigableString instance, which is derived from the Python unicode type:

(Pdb) hypref.encode('utf-8')
'NDTV'
(Pdb) hypref.__class__
<class 'bs4.element.NavigableString'>
(Pdb) hypref.__class__.__bases__
(<type 'unicode'>, <class 'bs4.element.PageElement'>)
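Because NavigableString derives from unicode, calling str() on it in Python 2 implicitly encodes the text with the ASCII codec, and that is the step that fails on Devanagari characters. A minimal sketch of the difference follows. The sample string is a hypothetical stand-in, not an actual title from the page, and the explicit .encode("ascii") call is used so that the same failure shows up on Python 3 as well.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample text (the word "Hindi" in Devanagari), standing in
# for a NavigableString taken from the page.
title = u"\u0939\u093f\u0928\u094d\u0926\u0940"

# Python 2's str(title) does the equivalent of title.encode("ascii").
try:
    title.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding explicitly as UTF-8 succeeds and yields a byte string.
utf8_bytes = title.encode("utf-8")
```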

Instead of calling str(), encode the string to UTF-8 explicitly:

hypref.encode('utf-8')
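An alternative to encoding every string by hand is to let the file object do it. The sketch below assumes the output file is opened with io.open and an explicit encoding. The helper write_titles and the sample titles are hypothetical names introduced for illustration, and plain unicode strings stand in for the NavigableString values produced by the scraping loop.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def write_titles(titles, path):
    """Write each title on its own line, encoded as UTF-8 (hypothetical helper)."""
    # io.open accepts encoding= on both Python 2 and 3, so the file object
    # encodes the text on write and no str() call is needed.
    with io.open(path, "w", encoding="utf-8") as fptr:
        for title in titles:
            fptr.write(title)
            fptr.write(u"\n")


# Stand-ins for the values the loop would read from link_tag.find('a').
sample_titles = [u"NDTV", u"\u0939\u093f\u0928\u094d\u0926\u0940"]
out_path = os.path.join(tempfile.mkdtemp(), "titles.txt")
write_titles(sample_titles, out_path)

# Read the raw bytes back to confirm what was stored on disk.
with io.open(out_path, "rb") as f:
    raw = f.read()
```

With this approach the loop would write hypref directly, and the with block also closes the file, which the original script never does.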
