Using SoupStrainer to parse selectively

Problem description

I'm trying to parse a list of video game titles from a shopping site. However, the whole item list is stored inside a single div tag.

This section of the documentation (http://www.crummy.com/software/BeautifulSoup/documentation.html#Improving%20Performance%20by%20Parsing%20Only%20Part%20of%20the%20Document) supposedly explains how to parse only part of the document, but I can't work it out. My code:

from BeautifulSoup import BeautifulSoup
import urllib
import re

url = "Some Shopping Site"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
for a in soup.findAll('a',{'title':re.compile('.+') }):
    print a.string

At present it prints the string inside any a tag that has a non-empty title attribute, but it also prints the items in the sidebar that are the "specials". If I could take only the product list div, I would kill two birds with one stone.

Many thanks.

Recommended answer

Oh boy, am I silly. I was searching for tags with the attribute id = products, but it should have been products_list.

Here's the final code, in case anyone comes searching.

from BeautifulSoup import BeautifulSoup, SoupStrainer
import urllib
import re

url = "http://someplace.com"
html = urllib.urlopen(url).read()
# Parse only the div that holds the product list, skipping the sidebar.
product = SoupStrainer('div', {'id': 'products_list'})
soup = BeautifulSoup(html, parseOnlyThese=product)
# Print the text of every link that has a non-empty title attribute.
for a in soup.findAll('a', {'title': re.compile('.+')}):
    print a.string
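
The code above targets Python 2 and BeautifulSoup 3. For anyone on Python 3 with the bs4 package, the same idea can be sketched roughly as follows, where the keyword argument is parse_only rather than parseOnlyThese. The sample HTML, the sidebar id, and the helper name product_titles are made up for illustration; only the products_list id comes from the answer.

```python
import re

from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical page: a sidebar of "specials" plus the main product list.
SAMPLE_HTML = """
<html><body>
  <div id="sidebar">
    <a href="/s1" title="Special One">Special One</a>
  </div>
  <div id="products_list">
    <a href="/g1" title="Game One">Game One</a>
    <a href="/g2" title="Game Two">Game Two</a>
    <a href="/g3">No title attribute</a>
  </div>
</body></html>
"""


def product_titles(html):
    """Return the text of titled links inside the products_list div."""
    # Only the div with id="products_list" (and its contents) is parsed.
    only_products = SoupStrainer("div", attrs={"id": "products_list"})
    soup = BeautifulSoup(html, "html.parser", parse_only=only_products)
    # Keep links whose title attribute is non-empty, as in the original.
    links = soup.find_all("a", attrs={"title": re.compile(".+")})
    return [a.get_text(strip=True) for a in links]


titles = product_titles(SAMPLE_HTML)
print(titles)
```

Because the strainer discards everything outside the matching div while parsing, the sidebar link never reaches the soup, so no extra filtering is needed afterwards.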
