Scrape using multiple POST data from the same URL

Question

I have already created a spider that collects a list of company names with matching phone numbers. This is then saved to a CSV file.

I now want to scrape data from another site, using the phone numbers from that CSV file as POST data. I want the spider to loop over the same start URL, scraping the data each phone number produces, until there are no numbers left in the CSV file.

This is what I have got so far:

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.http import FormRequest
from scrapy.selector import HtmlXPathSelector
from scrapy import log
import sys
from scrapy.shell import inspect_response
from btw.items import BtwItem
import csv

class BtwSpider(BaseSpider):
    name = "btw"
    allowed_domains = ["siteToScrape.com"]
    start_urls = ["http://www.siteToScrape.com/broadband/broadband_checker"] 

    def parse(self, response):
        phoneNumbers = ['01253873647','01253776535','01142726749']

        return [FormRequest.from_response(
            response,
            formdata={'broadband_checker[phone]': phoneNumbers[1]},
            callback=self.after_post)]


    def after_post(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@id="results"]')
        items = []
        for site in sites:
            item = BtwItem()

            fttcText = site.select("div[@class='content']/div[@id='btfttc']/ul/li/text()").extract()

            # Now we will change the text to be a boolean value
            if fttcText[0].count('not') > 0:
                fttcEnabled = 0
            else:
                fttcEnabled = 1

            item['fttcAvailable'] = fttcEnabled
            items.append(item)
        return items

At the minute I have just been trying to get this to loop through a list (phoneNumbers), but I have not managed to get even that working so far. Once I know how to do that I will be able to make it pull the numbers from a CSV file by myself. In its current state it only uses the phone number at index 1 in the list.
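
For reference, the loop being asked about would look roughly like this: instead of returning a single FormRequest built from phoneNumbers[1], parse() can yield one request per number in the list. This is only a sketch that reuses the form field name and callback from the code above; the spider name and everything else here are illustrative:

from scrapy.spider import BaseSpider
from scrapy.http import FormRequest


class PhoneLoopSpider(BaseSpider):
    # Illustrative spider; mirrors the BtwSpider above.
    name = "btw_loop"
    allowed_domains = ["siteToScrape.com"]
    start_urls = ["http://www.siteToScrape.com/broadband/broadband_checker"]

    def parse(self, response):
        phoneNumbers = ['01253873647', '01253776535', '01142726749']
        for number in phoneNumbers:
            # One POST per phone number, all submitted to the same form.
            yield FormRequest.from_response(
                response,
                formdata={'broadband_checker[phone]': number},
                callback=self.after_post)

    def after_post(self, response):
        # Parse the results for a single phone number, as in the code above.
        pass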

Answer

Assuming you have a phones.csv file with the phone numbers in it:

01253873647
01253776535
01142726749

Here's your spider:

import csv
from scrapy.item import Item, Field

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.http import FormRequest
from scrapy.selector import HtmlXPathSelector


class BtwItem(Item):
    fttcAvailable = Field()
    phoneNumber = Field()


class BtwSpider(BaseSpider):
    name = "btw"
    allowed_domains = ["samknows.com"]

    def start_requests(self):
        yield Request("http://www.samknows.com/broadband/broadband_checker", self.parse_main_page)

    def parse_main_page(self, response):
        with open('phones.csv', 'r') as f:
            reader = csv.reader(f)
            for row in reader:
                phone_number = row[0]
                yield FormRequest.from_response(response,
                                                formdata={'broadband_checker[phone]': phone_number},
                                                callback=self.after_post,
                                                meta={'phone_number': phone_number})

    def after_post(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@id="results"]')

        phone_number = response.meta['phone_number']
        for site in sites:
            item = BtwItem()

            fttc = site.select("div[@class='content']/div[@id='btfttc']/ul/li/text()").extract()
            item['phoneNumber'] = phone_number
            # Available when the result text does not contain the word 'not'
            # (same convention as the fttcEnabled check in the question).
            item['fttcAvailable'] = 'not' not in fttc[0]

            yield item

Here's what was scraped after running it:

{'fttcAvailable': False, 'phoneNumber': '01253873647'}
{'fttcAvailable': False, 'phoneNumber': '01253776535'}
{'fttcAvailable': True, 'phoneNumber': '01142726749'}

The idea is to fetch the main page using start_requests, then read the csv file line by line in the callback and yield a new FormRequest for each phone number (csv row). Additionally, phone_number is passed to the callback through the meta dictionary so it can be written to the corresponding item field (you need this to tell the items/results apart).
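
In case the meta round trip is unfamiliar: anything attached to a request's meta dict is copied onto the response that Scrapy passes to the callback, so it works for any per-request state, not just form data. A stripped-down sketch, where the spider name and URLs are made up purely for illustration:

from scrapy.spider import BaseSpider
from scrapy.http import Request


class MetaDemoSpider(BaseSpider):
    # Hypothetical spider, only to show how meta travels with a request.
    name = "meta_demo"
    start_urls = ["http://example.com/"]

    def parse(self, response):
        # Attach per-request state when the request is created...
        yield Request("http://example.com/page",
                      callback=self.parse_page,
                      meta={'phone_number': '01253873647'})

    def parse_page(self, response):
        # ...and read it back from the response inside the callback.
        self.log("phone_number = %s" % response.meta['phone_number'])

Once the items are being yielded, Scrapy's feed exports (the crawl command's -o option) can write them straight back out to a file such as CSV or JSON.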

Hope that helps.
