How to use Scrapy for Amazon.com links behind the "Next" button?

Question

I am relatively new to Python and Scrapy. I'm trying to scrape the links in "Customers who bought this item also bought". For example: http://www.amazon.com/Confessions-Economic-Hit-John-Perkins-ebook/dp/B001AFF266/. There are 17 pages for "Customers who bought this item also bought". If I ask Scrapy to scrape that URL, it only scrapes the first page (6 items). How do I ask Scrapy to press the "Next" button to scrape all the items in the 17 pages? Sample code (just the part that matters in crawler.py) would be greatly appreciated. Thank you for your time!

OK, here is my code. As I said, I am new to Python, so the code might look quite stupid, but it does scrape the first page (6 items). I work mostly with Fortran or Matlab. I would love to learn Python systematically if I have time, though.

# Code of my crawler.py:

# Import paths are those of Scrapy 1.0+; the original post used
# scrapy.contrib.spiders and scrapy.contrib.linkextractors.sgml.SgmlLinkExtractor.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from beta.items import BetaItem

class AlphaSpider(CrawlSpider):

    name = 'alpha'
    allowed_domains = ['amazon.com']
    start_urls = ['http://www.amazon.com/s/ref=lp_4366_nr_p_n_publication_date_0?rh=n%3A283155%2Cn%3A%211000%2Cn%3A4366%2Cp_n_publication_date%3A1250226011&bbn=4366&ie=UTF8&qid=1384729756&rnid=1250225011']
    # Follow every result title link and pass the product page to parse_item.
    rules = (Rule(LinkExtractor(restrict_xpaths=('//h3/a',)), callback='parse_item'), )

    def parse_item(self, response):
        sel = Selector(response)

        stuff = BetaItem()

        # ISBN-10: second space-separated token of the text after the label.
        isbn10R = sel.xpath('//li[b[contains(text(),"ISBN-10:")]]/text()').extract()
        isbn10 = []
        if len(isbn10R) > 0:
            isbn10 = [(isbn10R[0].split(' '))[1]]
        stuff['isbn10'] = isbn10

        # Average rating: first token of e.g. "4.1 out of 5 stars".
        starsR = sel.xpath('//div[contains(@id,"averageCustomerReviews")]/span/@title').extract()
        stars = []
        if len(starsR) > 0:
            stars = [(starsR[0].split(' '))[0]]
        stuff['stars'] = stars

        # Number of reviews: first token of the review link text.
        reviewsR = sel.xpath('//div[contains(@id,"averageCustomerReviews")]/a[contains(@href,"showViewpoints=1")]/text()').extract()
        reviews = []
        if len(reviewsR) > 0:
            reviews = [(reviewsR[0].split(' '))[0]]
        stuff['reviews'] = reviews

        # "Customers who bought this item also bought": the ASIN is the
        # path segment between "dp/" and "/ref" in each link.
        copsR = sel.xpath('//a[@class="sim-img-title"]/@href').extract()
        ncops = len(copsR)
        cops = [None] * ncops
        if ncops > 0:
            for idx, cop in enumerate(copsR):
                cops[idx] = ((cop.split('dp/'))[1].split('/ref'))[0]
        stuff['cops'] = cops

        return stuff

Recommended answer

So I understand you were able to scrape these "Customers Who Bought This Item Also Bought" product details. As you probably saw, these are within a ul in a div with class "shoveler-content":

<div id="purchaseButtonWrapper" class="shoveler-button-wrapper">
    <a class="back-button" onclick="return false;" style="" href="#Back">...</a>
    <div class="shoveler-content">
        <ul tabindex="-1">
            <li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">
                <div id="purchase_B003LSTK8G" class="new-faceout p13nimp" data-ref="pd_sim_kstore_1" data-asin="B003LSTK8G">
                ...
                </div>
            </li>
            <li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">...</li>
            <li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">...</li>
            <li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">...</li>
            <li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">...</li>
            <li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">...</li>
        </ul>
    </div>
    <a class="next-button" onclick="return false;" style="" href="#Next">
        <span class="auiTestSprite s_shvlNext">...</span>
    </a>
</div>

When you watch your browser's network activity (via Firebug or the Chrome inspector) while clicking the "next" button to load more suggested products, you'll see an AJAX query to this sort of URL:

http://www.amazon.com
    /gp/product/features/similarities/shoveler/cell-render.html/ref=pd_sim_kstore?
    id=B00261OOWQ,B003XQEVUI,B001NLL5WC,B000FC1KZC,B005G5PPGS,B0043RSJB8,
    B004TSBWYC,B000RH0C8G,B0035IID08,B002AQRVXQ,B005DIAUN6,B000FC10QG
    &pos=7&refTag=pd_sim_kstore&wdg=ebooks_display_on_website
    &shovelerName=purchase

(I'm using this product page: http://www.amazon.com/Boomerang-Travels-New-Third-World-ebook/dp/B005CRQ2OE)

The id query argument holds a list of ASINs, which are the next suggested products. Why 12 ASINs when only 6 are displayed? Probably some in-page caching for the next "next" click the user is likely to make.
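A minimal sketch of building that URL in Python. It assumes the endpoint path and the fixed parameters (refTag, wdg, shovelerName) stay exactly as in the request captured above; the helper name build_shoveler_url is hypothetical, and Amazon may well change or retire this endpoint.

```python
from urllib.parse import urlencode

# Endpoint copied from the AJAX request observed in the browser (assumption:
# it is the same for every product page in this category).
SHOVELER_URL = ("http://www.amazon.com/gp/product/features/similarities/"
                "shoveler/cell-render.html/ref=pd_sim_kstore")


def build_shoveler_url(asins, pos, ref_tag="pd_sim_kstore",
                       wdg="ebooks_display_on_website",
                       shoveler_name="purchase"):
    """Return the AJAX URL for one batch of ASINs (hypothetical helper)."""
    params = [
        ("id", ",".join(asins)),
        ("pos", pos),
        ("refTag", ref_tag),
        ("wdg", wdg),
        ("shovelerName", shoveler_name),
    ]
    # safe="," keeps the commas between ASINs unescaped, as in the captured URL.
    return SHOVELER_URL + "?" + urlencode(params, safe=",")
```

The default values for ref_tag, wdg and shoveler_name come from the example request; for other product categories they would probably have to be read from the page instead.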

What do you get back from this AJAX query? Still in your browser's inspector, you'll see that the response is of type application/json and that the response data is a JSON array of 12 elements, each element being an HTML snippet similar to:

<div class="new-faceout p13nimp" id="purchase_B00261OOWQ" data-asin="B00261OOWQ" data-ref="pd_sim_kstore_7">
    <a href="/Home-Game-Accidental-Guide-Fatherhood-ebook/dp/B00261OOWQ/ref=pd_sim_kstore_7" class="sim-img-title" >
        <div class="product-image">
            <img src="http://ecx.images-amazon.com/images/I/51ZBpvGgsUL._SL500_PIsitb-sticker-arrow-big,TopRight,35,-73_OU01_SS100_.jpg" width="100" alt="" height="100" border="0" /> 
        </div> Home Game: An Accidental Guide to Fatherhood
    </a> 
    <div class="byline">
        <span class="carat">&#8250</span> 
        <a href="http://www.amazon.com/Michael-Lewis/e/B000APZ33E/ref=pd_sim_kstore_bl_7">Michael Lewis</a> 
    </div> 

    <div class="rating-price"> 
        <span class="rating-stars">
            <span class="crAvgStars" style="white-space:no-wrap;">
                <span class="asinReviewsSummary" name="B00261OOWQ">
                    <a href="http://www.amazon.com/Home-Game-Accidental-Guide-Fatherhood-ebook/product-reviews/B00261OOWQ/ref=pd_sim_kstore_cm_cr_acr_img_7">
                        <span class="auiTestSprite s_star_4_0 " title="4.1 out of 5 stars" >
                            <span>4.1 out of 5 stars</span>
                        </span>
                    </a>&nbsp;
                </span>
                (<a href="http://www.amazon.com/Home-Game-Accidental-Guide-Fatherhood-ebook/product-reviews/B00261OOWQ/ref=pd_sim_kstore_cm_cr_acr_txt_7">99</a>)
            </span>
        </span> 
    </div> 
    <div class="binding-platform"> Kindle Edition </div> 
    <div class="pricetext"><span class="price" style="margin-right:5px">$11.36</span></div> 
</div>

So each element is basically what the original page's suggested-products section had in each <li> of <div class="shoveler-content"><ul>
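As a rough sketch, decoding such a response and pulling a few fields out of each snippet could look like the code below. It assumes the body really is a JSON array of HTML strings shaped like the snippet above; FaceoutParser and parse_shoveler_response are hypothetical names, and only the standard library is used so the example runs on its own. Inside a spider you would more likely feed each snippet to Scrapy's Selector(text=snippet) and reuse your XPath expressions.

```python
import json
from html.parser import HTMLParser


class FaceoutParser(HTMLParser):
    """Collects the ASIN, title link and star rating from one HTML snippet."""

    def __init__(self):
        super().__init__()
        self.asin = None
        self.href = None
        self.stars = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        title = attrs.get("title") or ""
        if tag == "div" and "new-faceout" in classes:
            self.asin = attrs.get("data-asin")
        elif tag == "a" and "sim-img-title" in classes:
            self.href = attrs.get("href")
        elif tag == "span" and title.endswith("out of 5 stars"):
            # "4.1 out of 5 stars" -> "4.1"
            self.stars = title.split(" ")[0]


def parse_shoveler_response(body):
    """Decode the JSON array and return one dict per suggested product."""
    items = []
    for snippet in json.loads(body):
        parser = FaceoutParser()
        parser.feed(snippet)
        parser.close()
        items.append({"asin": parser.asin,
                      "href": parser.href,
                      "stars": parser.stars})
    return items
```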

But how do you get those ASIN codes to put in the AJAX query's id parameter?

Well, in the product page, you'll notice this section:

<div id="purchaseSimsData" 
    class="sims-data" style="display:none" 
    data-baseAsin="B005CRQ2OE" data-featureId="pd_sim" 
    data-pageId="B005CRQ2OEr_sim_2" data-reftag="pd_sim_kstore"
    data-wdg="ebooks_display_on_website" data-widgetName="purchase">
    B003LSTK8G,B000VKVZR6,B003E20ZRY,B000RH0C9A,B000RH0CA4,B000YMDQRS,
    B00261OOWQ,B003XQEVUI,B001NLL5WC,B000FC1KZC,B005G5PPGS,B0043RSJB8,
    B004TSBWYC,B000RH0C8G,B0035IID08,B002AQRVXQ,B005DIAUN6,B000FC10QG,
    B0018QQQKS,B002OTKEP6,B005PUWUKS,B007V65R54,B00B3VOTTI,B004EYT932,
    B002UBRFFU,B000WJSB50,B000RH0DYE,B004JXXKWY,B003E8AJXI,B008TRU7PE,
    B00555X8OA,B007OSIOWM,B00DLJIA54,B00139XTG4,B0058Z4NR8,B00ALBR6JG,
    B004H0M8QS,B003F3PL7Q,B008UX8YPC,B000U913GG,B003HOXLVQ,B000VWM0MI,
    B000SEIU28,B006VE7YS0,B008KPMBIG,B003CIQ57E,B0064EHZY0,B008UX3ITE,
    B001NLKY38,B003VIWK4C,B005GSYZRA,B007YGGOVM,B004H4X84K,B00B5ZQ72Y,
    B000R1BAH4,B008W02TIG,B000W8HC8I,B0036QVOKU,B000VRBBDC,B00APDGFOC,
    B00EOAS0EK,B000QCS888,B001QIGZEK,B0074B55IK,B000FC12C8,B00AP2XVJ0,
    B000FCK5YE,B006ID6UAW,B001FA0W5W,B005HFI0X2,B006ZOYM9K,B003SNJZ3Y,
    B00C1N5WOI,B008EKORIY,B00C4GRK4W,B004V3WRNU,B00BV6RTUG,B001AFF266,
    B00DUM1W3E,B00APDGGCS,B008WOUFIS,B008EKOO46,B008JHXO6S,B005AJM3U6,
    B00BKRW6GI,B00CDUVSQ0,B00A287PG2,B009H679WA,B000VDUWMC,B009NF6IRW
</div>

which looks like the ASINs of all the suggested products.
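A small sketch of reading that list out of the raw page HTML, assuming the div keeps the id purchaseSimsData and contains nothing but the comma-separated ASINs. The function name extract_sim_asins is hypothetical, and the regular expression is only a stand-in that keeps the example standalone; in a spider, an XPath such as //div[@id="purchaseSimsData"]/text() on the response would be the more natural choice.

```python
import re

# Matches the hidden div and captures its text content (the ASIN list).
SIMS_RE = re.compile(r'<div\s+id="purchaseSimsData"[^>]*>(.*?)</div>', re.DOTALL)


def extract_sim_asins(page_html):
    """Return the ASINs listed in the purchaseSimsData div, in page order."""
    match = SIMS_RE.search(page_html)
    if match is None:
        return []
    # Split on commas and drop the whitespace and newlines around each code.
    return [asin.strip() for asin in match.group(1).split(",") if asin.strip()]
```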

Therefore, I suggest you emulate successive AJAX queries to get the suggested products, 12 ASINs at a time, decode each response using the json package, and then parse each HTML snippet to extract the product info you want.
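The batching step could be sketched as follows. It rests on two guesses drawn from the single request shown earlier: the first 6 ASINs are already rendered in the page, and pos is the 1-based position of the first ASIN in each batch (pos=7 went with the 7th ASIN in the list). shoveler_batches is a hypothetical helper name.

```python
def shoveler_batches(asins, already_shown=6, batch_size=12):
    """Yield (pos, batch) pairs covering the ASINs not yet shown in the page.

    pos is assumed to be the 1-based index of the first ASIN in the batch.
    """
    for start in range(already_shown, len(asins), batch_size):
        yield start + 1, asins[start:start + batch_size]


# Example with a made-up list of 30 placeholder codes:
codes = ["A%02d" % i for i in range(1, 31)]
for pos, batch in shoveler_batches(codes):
    print(pos, len(batch), batch[0], batch[-1])
```

In a spider, each (pos, batch) pair would typically become one scrapy.Request for the AJAX URL, with a callback that decodes the JSON body and yields one item per snippet.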
