Using SoupStrainer to parse selectively
Question
I'm trying to parse a list of video game titles from a shopping site. However, the item list is all stored inside a tag.
This section of the documentation (http://www.crummy.com/software/BeautifulSoup/documentation.html#Improving%20Performance%20by%20Parsing%20Only%20Part%20of%20the%20Document) supposedly explains how to parse only part of the document, but I can't work it out. My code:
from BeautifulSoup import BeautifulSoup
import urllib
import re

url = "Some Shopping Site"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
# Print the text of every <a> tag whose title attribute is non-empty
for a in soup.findAll('a', {'title': re.compile('.+')}):
    print a.string
At present it prints the string inside any tag that has a non-empty title attribute. But it is also printing the items in the sidebar that are the "specials". If I could take only the product list div, I would kill two birds with one stone.
Many thanks.
Recommended answer
Oh boy, am I silly. I was searching for tags with the attribute id = products, but it should have been products_list.
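One way to catch an id mismatch like this is to list the ids that actually occur on the page before building the strainer. The sketch below assumes the newer bs4 package (Beautiful Soup 4) on Python 3 rather than the BeautifulSoup 3 import used in this thread, and the HTML snippet is made up for illustration; the real page will look different.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real shopping page.
html = """
<div id="sidebar"><a title="Special A" href="#">Special A</a></div>
<div id="products_list"><a title="Game One" href="#">Game One</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect the id of every div that has one, in document order.
div_ids = [div["id"] for div in soup.find_all("div", id=True)]
print(div_ids)
```

Comparing that list against the id passed to SoupStrainer shows quickly whether the filter can match anything at all.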
Here's the final code, in case anyone comes searching.
from BeautifulSoup import BeautifulSoup, SoupStrainer
import urllib
import re
import time

start = time.clock()  # start time; not used further in this snippet
url = "http://someplace.com"
html = urllib.urlopen(url).read()
# Parse only the div that holds the product list
product = SoupStrainer('div', {'id': 'products_list'})
soup = BeautifulSoup(html, parseOnlyThese=product)
for a in soup.findAll('a', {'title': re.compile('.+')}):
    print a.string
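For readers on Python 3, roughly the same idea can be sketched with the bs4 package (Beautiful Soup 4), where the keyword argument is parse_only instead of parseOnlyThese and findAll is spelled find_all. This is not the code from the thread: the inline HTML is a made-up stand-in for the shopping page, and only the products_list id is taken from the answer.

```python
import re

from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical markup: a sidebar with "specials" plus the product list div.
html = """
<html><body>
<div id="sidebar">
  <a title="Special Offer" href="/s1">Special Offer</a>
</div>
<div id="products_list">
  <a title="Game One" href="/g1">Game One</a>
  <a title="Game Two" href="/g2">Game Two</a>
  <a href="/more">More</a>
</div>
</body></html>
"""

# Keep only the div with id="products_list" while parsing.
only_products = SoupStrainer("div", id="products_list")
soup = BeautifulSoup(html, "html.parser", parse_only=only_products)

# Text of every link with a non-empty title inside the retained div.
titles = [a.get_text() for a in soup.find_all("a", title=re.compile(".+"))]
print(titles)
```

Because the sidebar div never makes it into the tree, the "specials" links cannot show up in the results, and the link without a title is dropped by the title filter.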