Pin down exact content location in HTML for web scraping with urllib2 and Beautiful Soup


Question


I'm new to web scraping, have little exposure to HTML, and wanted to know if there is a better, more efficient way to search for required content in the HTML version of a web page. Currently, I want to scrape reviews for a product here: http://www.walmart.com/ip/29701960?wmlspartner=wlpa&adid=22222222227022069601&wl0=&wl1=g&wl2=c&wl3=34297254061&wl4=&wl5=pla&wl6=62272156621&veh=sem

For this, I have the following code:

import re
import sys
import urllib2

from bs4 import BeautifulSoup

url = ('http://www.walmart.com/ip/29701960?wmlspartner=wlpa'
       '&adid=22222222227022069601&wl0=&wl1=g&wl2=c&wl3=34297254061'
       '&wl4=&wl5=pla&wl6=62272156621&veh=sem')
review_url = url

#-------------------------------------------------------------------------
# Scrape the ratings
#-------------------------------------------------------------------------
page_no = 1
sum_total_reviews = 0
more = True

while more:
    # Open the URL to get the review data
    request = urllib2.Request(review_url)
    try:
        page = urllib2.urlopen(request)
    except urllib2.HTTPError as e:
        # HTTPError carries both code and reason, so catch it first;
        # otherwise the URLError branch swallows 404s
        print 'Error: ', e.code
        sys.exit()
    except urllib2.URLError as e:
        print 'Failed to reach url'
        print 'Reason: ', e.reason
        sys.exit()

    content = page.read()
    soup = BeautifulSoup(content)
    results = soup.find_all('span', {'class': re.compile(r's_star_\d_0')})
    # NOTE: nothing here ever sets more = False or advances page_no,
    # so as posted this loop never terminates

With this, I'm not able to read anything. I'm guessing I have to give it an accurate destination. Any suggestions?
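For what it's worth, the regex-based selector itself works whenever the matching markup is actually present in the fetched HTML; the problem on this page is that the reviews are injected by AJAX and never appear in the static source (see the answer below). A minimal sketch against a made-up fragment, to separate "selector is wrong" from "content isn't there":

```python
import re

from bs4 import BeautifulSoup

# Made-up fragment mimicking the markup the question's regex targets.
html = '''
<div>
  <span class="s_star_4_0">4 stars</span>
  <span class="s_star_5_0">5 stars</span>
  <span class="other">not a rating</span>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
results = soup.find_all('span', {'class': re.compile(r's_star_\d_0')})
print([span.get_text() for span in results])  # ['4 stars', '5 stars']
```

If the same find_all returns an empty list against the downloaded page, the content simply is not in the static HTML, and no selector tweak will find it.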

Answer

I understand that the question was initially about BeautifulSoup, but since you haven't had any success using it in this particular situation, I suggest taking a look at selenium.

Selenium uses a real browser - you don't have to deal with parsing the results of ajax calls. For example, here's how you can get the list of review titles and ratings from the first reviews page:

from selenium.webdriver.firefox import webdriver


driver = webdriver.WebDriver()
driver.get('http://www.walmart.com/ip/29701960?page=seeAllReviews')

for review in driver.find_elements_by_class_name('BVRRReviewDisplayStyle3Main'):
    title = review.find_element_by_class_name('BVRRReviewTitle').text
    rating = review.find_element_by_xpath('.//div[@class="BVRRRatingNormalImage"]//img').get_attribute('title')
    print title, rating

driver.close()

It prints:

Renee Culver loves Clorox Wipes 5 out of 5
Men at work 5 out of 5
clorox wipes 5 out of 5
...
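Once rating strings like those are collected, converting them to numbers (e.g. to feed the question's sum_total_reviews counter) is a small standard-library step; the "5 out of 5" format below is taken from the output above:

```python
import re

def parse_rating(text):
    """Extract (score, scale) from a string ending in e.g. '5 out of 5'."""
    match = re.search(r'(\d+) out of (\d+)\s*$', text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

ratings = [
    'Renee Culver loves Clorox Wipes 5 out of 5',
    'Men at work 5 out of 5',
    'clorox wipes 5 out of 5',
]
scores = [parse_rating(r)[0] for r in ratings]
print(scores)       # [5, 5, 5]
print(sum(scores))  # 15
```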

Also, note that you can use the headless PhantomJS browser (example: http://www.realpython.com/blog/python/headless-selenium-testing-with-python-and-phantomjs/#.Uy9vAK1dXQg).

Another option is to use the Walmart API.
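With an API, the review data comes back as structured JSON rather than HTML, so there is no selector guessing at all. A sketch of the parsing side only — the response body here is a made-up stand-in, and the real endpoint, field names, and API-key requirements are whatever Walmart's documentation specifies:

```python
import json

# Made-up response body standing in for what a reviews API might return;
# consult the actual Walmart API docs for the real schema.
response_body = '''
{
  "reviews": [
    {"title": "Men at work", "rating": 5},
    {"title": "clorox wipes", "rating": 5}
  ]
}
'''

data = json.loads(response_body)
for review in data['reviews']:
    print(review['title'], review['rating'])
```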

Hope that helps.
