Python using Beautiful Soup for HTML processing on specific content


Problem Description


So I've decided to parse content from a website, for example http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx

I want to parse the ingredients into a text file. The ingredients are located in:

<div class="ingredients" style="margin-top: 10px;">

and within this, each ingredient is stored between

<li class="plaincharacterwrap">
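
Put together, the markup being described looks roughly like this (a reconstruction from the fragments above, not the actual page source), and a minimal sketch with the modern bs4 package shows the extraction I'm after:

from bs4 import BeautifulSoup  # modern bs4 package; the code later in this question uses the older BeautifulSoup 3 API

# Reconstructed sample of the structure described above -- not the real page source.
sample_html = """
<div class="ingredients" style="margin-top: 10px;">
  <ul>
    <li class="plaincharacterwrap">1/4 cup olive oil</li>
    <li class="plaincharacterwrap">1 cup chicken broth</li>
  </ul>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
div = soup.find("div", class_="ingredients")
for li in div.find_all("li", class_="plaincharacterwrap"):
    print(li.get_text(strip=True))  # prints each ingredient on its own line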

Someone was nice enough to provide code using regex, but it gets confusing when you are modifying it from site to site. So I wanted to use Beautiful Soup instead, since it has a lot of built-in features, except I'm confused about how to actually do it.

Code:

import re
import urllib2,sys
from BeautifulSoup import BeautifulSoup, NavigableString
html = urllib2.urlopen("http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx")
soup = BeautifulSoup(html)

try:

        ingrdiv = soup.find('div', attrs={'class': 'ingredients'})

except IOError: 
        print 'IO error'

Is this kind of how you get started? I want to find the actual div class and then parse out all those ingredients located within the li class.

Any help would be appreciated! Thanks!

Solution

import urllib2
import BeautifulSoup

def main():
    url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
    data = urllib2.urlopen(url).read()      # fetch the raw HTML
    bs = BeautifulSoup.BeautifulSoup(data)  # parse it

    # grab the ingredients <div>, then the stripped text of every <li> inside it
    ingreds = bs.find('div', {'class': 'ingredients'})
    ingreds = [s.getText().strip() for s in ingreds.findAll('li')]

    fname = 'PorkChopsRecipe.txt'
    with open(fname, 'w') as outf:
        outf.write('\n'.join(ingreds))

if __name__ == "__main__":
    main()

results in

1/4 cup olive oil
1 cup chicken broth
2 cloves garlic, minced
1 tablespoon paprika
1 tablespoon garlic powder
1 tablespoon poultry seasoning
1 teaspoon dried oregano
1 teaspoon dried basil
4 thick cut boneless pork chops
salt and pepper to taste
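
The answer above is Python 2 with BeautifulSoup 3. A rough present-day equivalent, assuming Python 3 with the bs4 package and that the page still uses the same class names (it may well have changed since), would be:

import urllib.request

from bs4 import BeautifulSoup

def main():
    url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
    data = urllib.request.urlopen(url).read()
    bs = BeautifulSoup(data, "html.parser")

    # same extraction: the "ingredients" div, then the text of each <li> inside it
    ingreds_div = bs.find("div", class_="ingredients")
    ingreds = [li.get_text(strip=True) for li in ingreds_div.find_all("li")]

    with open("PorkChopsRecipe.txt", "w", encoding="utf-8") as outf:
        outf.write("\n".join(ingreds))

if __name__ == "__main__":
    main()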


Follow-up response to @eyquem:

from time import clock
import urllib
import re
import BeautifulSoup
import lxml.html

start = clock()
url = 'http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx'
data = urllib.urlopen(url).read()
print "Loading took", (clock()-start), "s"

# by regex
start = clock()
x = data.find('Ingredients</h3>')
patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
res1 = '\n'.join(patingr.findall(data,x))
print "Regex parse took", (clock()-start), "s"

# by BeautifulSoup
start = clock()
bs = BeautifulSoup.BeautifulSoup(data)
ingreds = bs.find('div', {'class': 'ingredients'})
res2 = '\n'.join(s.getText().strip() for s in ingreds.findAll('li'))
print "BeautifulSoup parse took", (clock()-start), "s  - same =", (res2==res1)

# by lxml
start = clock()
lx = lxml.html.fromstring(data)
ingreds = lx.xpath('//div[@class="ingredients"]//li/text()')
res3 = '\n'.join(s.strip() for s in ingreds)
print "lxml parse took", (clock()-start), "s  - same =", (res3==res1)

gives

Loading took 1.09091222621 s
Regex parse took 0.000432703726233 s
BeautifulSoup parse took 0.28126133314 s  - same = True
lxml parse took 0.0100940499505 s  - same = True

Regex is much faster (except when it's wrong); but if you consider loading the page and parsing it together, the BeautifulSoup parse still accounts for only about 20% of the total runtime. If you are terribly concerned about speed, I recommend lxml instead.
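
As a side note, the time.clock() used in the benchmark above was removed in Python 3.8, so a sketch of the same lxml timing on current Python (assuming lxml is installed and the page structure is unchanged) would use time.perf_counter() instead:

import time
import urllib.request

import lxml.html

url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
data = urllib.request.urlopen(url).read()

start = time.perf_counter()
lx = lxml.html.fromstring(data)                               # parse the document
ingreds = lx.xpath('//div[@class="ingredients"]//li/text()')  # same XPath as above
res = "\n".join(s.strip() for s in ingreds)
print("lxml parse took", time.perf_counter() - start, "s")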
