Returning a lower case ASCII string from a (possibly encoded) string fetched using urllib2 or BeautifulSoup


Problem description

I am fetching data from a web page using urllib2. The content of all the pages is in English, so there is no issue of dealing with non-English text. The pages are encoded, however, and they sometimes contain HTML entities such as £ or the copyright symbol, etc.

I want to check whether portions of a page contain certain keywords - however, I want the check to be case-insensitive (for obvious reasons).

What is the best way to convert the returned page content to all lower-case letters?

def get_page_content_as_lower_case(url):
    request = urllib2.Request(url)
    page = urllib2.urlopen(request)
    temp = page.read()

    return str(temp).lower() # this doesn't work because the page contains utf-8 data

[[Update]]

I don't have to use urllib2 to get the data; in fact, I may use BeautifulSoup instead, since I need to retrieve data from specific element(s) in the page - for which BS is a much better choice. I have changed the title to reflect this.

HOWEVER, the problem remains that the fetched data is in some non-ASCII encoding, supposedly utf-8. I did check one of the pages, and its actual encoding was iso-8859-1.

Since I am only concerned with English, I want to know how to obtain a lower-case ASCII string version of the data retrieved from the page, so that I can run a case-insensitive test for whether a keyword is found in the text.

I am assuming that restricting myself to English (from English-speaking websites) narrows down the possible encodings. I don't know much about encodings, but I assume the valid choices are:

  • ASCII
  • iso-8859-1
  • utf-8

Is that a valid assumption, and if so, is there perhaps a way to write a 'robust' function that accepts an encoded string containing English text and returns a lower-case ASCII version of it?

Solution

BeautifulSoup stores data as Unicode internally, so you don't need to perform character-encoding manipulations manually.
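
If you do fetch raw bytes with urllib2 yourself, the 'robust' helper asked for in the question could be sketched as below. The name to_lower_ascii and the encoding list are illustrative assumptions (the candidates listed in the question); note that iso-8859-1 accepts every possible byte sequence, so it doubles as a catch-all:

# -*- coding: utf-8 -*-
import unicodedata

def to_lower_ascii(raw, encodings=('utf-8', 'iso-8859-1')):
    """Sketch only: decode raw bytes with the first encoding that
    works, then return a lower-case, ASCII-only version of the text."""
    if isinstance(raw, str):  # byte string -> unicode
        for enc in encodings:
            try:
                raw = raw.decode(enc)
                break  # decoded successfully
            except UnicodeDecodeError:
                continue  # try the next candidate encoding
    # NFKD splits accented letters apart, so e.g. u'é' survives as 'e'
    text = unicodedata.normalize('NFKD', raw)
    # drop anything that still has no ASCII equivalent, then lower-case
    return text.encode('ascii', 'ignore').lower()

print to_lower_ascii('Caf\xc3\xa9 \xc2\xa3100')  # utf-8 bytes -> 'cafe 100'

Characters without any ASCII equivalent (such as £) are simply dropped, which is usually acceptable for keyword matching but does lose information.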

To find keywords (case-insensitively) in the text (not in attribute values or tag names):

#!/usr/bin/env python
import urllib2
from contextlib import closing 

import regex # pip install regex
from BeautifulSoup import BeautifulSoup

# URL is a placeholder for the address of the page to search
with closing(urllib2.urlopen(URL)) as page:
    soup = BeautifulSoup(page)
    print soup(text=regex.compile(ur'(?fi)\L<keywords>',
                                  keywords=['your', 'keywords', 'go', 'here']))
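
If you'd rather not add the third-party regex dependency, a rough standard-library equivalent is sketched below. You give up the full Unicode case folding that (?f) provides (so u'poſt' would not match 'post' here), but plain case-insensitive matching still works; URL remains a placeholder, as above:

#!/usr/bin/env python
import re
import urllib2
from contextlib import closing

from BeautifulSoup import BeautifulSoup

keywords = ['your', 'keywords', 'go', 'here']
# '|'.join(...) plays the role of the regex module's \L<keywords> named list
pattern = re.compile('|'.join(re.escape(k) for k in keywords), re.IGNORECASE)

with closing(urllib2.urlopen(URL)) as page:
    soup = BeautifulSoup(page)
    print soup(text=pattern)  # the text nodes that contain a keyword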

Example (Unicode words by @tchrist)

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import regex
from BeautifulSoup import BeautifulSoup, Comment

html = u'''<div attr="PoSt in attribute should not be found">
<!-- it must not find post inside a comment either -->
<ol> <li> tag names must not match
<li> Post will be found
<li> the same with post
<li> and post
<li> and poſt
<li> this is ignored
</ol>
</div>'''

soup = BeautifulSoup(html)

# remove comments
comments = soup.findAll(text=lambda t: isinstance(t, Comment))
for comment in comments: comment.extract()

# find text with keywords (case-insensitive)
print ''.join(soup(text=regex.compile(ur'(?fi)\L<opts>', opts=['post', 'li'])))
# compare it with '.lower()'
print '.lower():'
print ''.join(soup(text=lambda t: any(k in t.lower() for k in ['post', 'li'])))
# or exact match
print 'exact match:'
print ''.join(soup(text=' the same with post\n'))

Output

 Post will be found
 the same with post
 and post
 and poſt

.lower():
 Post will be found
 the same with post

exact match:
 the same with post
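
Note that u'poſt' shows up in the regex results but not in the .lower() ones: U+017F (LATIN SMALL LETTER LONG S) is already lower-case, so .lower() leaves it unchanged, while the (?f) flag makes the regex module apply full Unicode case folding, which maps it to 's'. A quick demonstration:

# -*- coding: utf-8 -*-
import regex  # pip install regex

s = u'po\u017ft'  # u'poſt', with LATIN SMALL LETTER LONG S
print u'post' in s.lower()                  # False: .lower() leaves the long s alone
print bool(regex.search(ur'(?fi)post', s))  # True: full case folding maps it to 's'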

