Python using Beautiful Soup for HTML processing on specific content

Problem Description
So, I've decided to parse content from a website. For example: http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx
I want to parse the ingredients into a text file. The ingredients are located in:
<div class="ingredients" style="margin-top: 10px;">
and within this, each ingredient is stored between
<li class="plaincharacterwrap">
Someone was nice enough to provide code using regex, but it gets confusing when you are modifying from site to site. So I wanted to use Beautiful Soup, since it has a lot of built-in features. Except I am confused on how to actually do it.
Code:
import re
import urllib2, sys
from BeautifulSoup import BeautifulSoup, NavigableString

html = urllib2.urlopen("http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx")
soup = BeautifulSoup(html)
try:
    ingrdiv = soup.find('div', attrs={'class': 'ingredients'})
except IOError:
    print 'IO error'
Is this kind of how you get started? I want to find the actual div class and then parse out all those ingredients located within the li class.
Any help would be appreciated! Thanks!
Solution

import urllib2
import BeautifulSoup

def main():
    url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
    data = urllib2.urlopen(url).read()
    bs = BeautifulSoup.BeautifulSoup(data)
    ingreds = bs.find('div', {'class': 'ingredients'})
    ingreds = [s.getText().strip() for s in ingreds.findAll('li')]

    fname = 'PorkChopsRecipe.txt'
    with open(fname, 'w') as outf:
        outf.write('\n'.join(ingreds))

if __name__ == "__main__":
    main()
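For anyone on Python 3, the same extraction works with the newer bs4 package (`find` / `find_all` instead of `findAll`). This is a sketch, not part of the original answer; it assumes the page still uses the `div class="ingredients"` / `li class="plaincharacterwrap"` structure, and it parses an inline stand-in snippet so it runs without fetching the live page:

```python
from bs4 import BeautifulSoup

# Inline stand-in for the downloaded page (assumption: same markup
# as the recipe page described in the question).
html = """
<div class="ingredients" style="margin-top: 10px;">
  <ul>
    <li class="plaincharacterwrap"> 1/4 cup olive oil </li>
    <li class="plaincharacterwrap"> 1 cup chicken broth </li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Locate the ingredients container, then pull the text of each <li>.
ingreds_div = soup.find("div", {"class": "ingredients"})
ingredients = [li.get_text().strip() for li in ingreds_div.find_all("li")]
print(ingredients)
```

Against the real page you would replace the inline `html` string with `urllib.request.urlopen(url).read()`.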
results in
1/4 cup olive oil
1 cup chicken broth
2 cloves garlic, minced
1 tablespoon paprika
1 tablespoon garlic powder
1 tablespoon poultry seasoning
1 teaspoon dried oregano
1 teaspoon dried basil
4 thick cut boneless pork chops
salt and pepper to taste
Follow-up response to @eyquem:
from time import clock
import urllib
import re
import BeautifulSoup
import lxml.html

start = clock()
url = 'http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx'
data = urllib.urlopen(url).read()
print "Loading took", (clock()-start), "s"

# by regex
start = clock()
x = data.find('Ingredients</h3>')
patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
res1 = '\n'.join(patingr.findall(data, x))
print "Regex parse took", (clock()-start), "s"

# by BeautifulSoup
start = clock()
bs = BeautifulSoup.BeautifulSoup(data)
ingreds = bs.find('div', {'class': 'ingredients'})
res2 = '\n'.join(s.getText().strip() for s in ingreds.findAll('li'))
print "BeautifulSoup parse took", (clock()-start), "s - same =", (res2 == res1)

# by lxml
start = clock()
lx = lxml.html.fromstring(data)
ingreds = lx.xpath('//div[@class="ingredients"]//li/text()')
res3 = '\n'.join(s.strip() for s in ingreds)
print "lxml parse took", (clock()-start), "s - same =", (res3 == res1)
gives
Loading took 1.09091222621 s
Regex parse took 0.000432703726233 s
BeautifulSoup parse took 0.28126133314 s - same = True
lxml parse took 0.0100940499505 s - same = True
Regex is much faster (except when it's wrong); but if you consider loading the page and parsing it together, BeautifulSoup is still only 20% of the runtime. If you are terribly concerned about speed, I recommend lxml instead.
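As an aside (not from the original answer), the lxml XPath used in the benchmark can be tried offline against a stand-in snippet, assuming the same markup as the recipe page; `//li/text()` returns each item's direct text nodes, which is why the benchmark only needs `strip()`:

```python
import lxml.html

# Stand-in for the fetched page (assumption: same structure as the live site).
html = """
<div class="ingredients" style="margin-top: 10px;">
  <ul>
    <li class="plaincharacterwrap"> 1 tablespoon paprika </li>
    <li class="plaincharacterwrap"> 1 teaspoon dried basil </li>
  </ul>
</div>
"""

tree = lxml.html.fromstring(html)
# Select the text content of every <li> under the ingredients div.
items = [s.strip() for s in tree.xpath('//div[@class="ingredients"]//li/text()')]
print(items)
```

If an `<li>` contained nested tags, `text()` would miss the nested text; `//li` plus `text_content()` on each element would be the safer variant.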