Scrape using multiple POST data from the same URL

Question
I have already created one spider that collects a list of company names with matching phone numbers. This is then saved to a CSV file.
I then want to scrape data from another site, using the phone numbers in the CSV file as POST data. I want it to loop through the same start URL, scraping the data that each phone number produces, until there are no more numbers left in the CSV file.
This is what I have so far:
```python
from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.http import FormRequest
from scrapy.selector import HtmlXPathSelector
from scrapy import log
import sys
from scrapy.shell import inspect_response
from btw.items import BtwItem
import csv


class BtwSpider(BaseSpider):
    name = "btw"
    allowed_domains = ["siteToScrape.com"]
    start_urls = ["http://www.siteToScrape.com/broadband/broadband_checker"]

    def parse(self, response):
        phoneNumbers = ['01253873647', '01253776535', '01142726749']
        return [FormRequest.from_response(response,
                                          formdata={'broadband_checker[phone]': phoneNumbers[1]},
                                          callback=self.after_post)]

    def after_post(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@id="results"]')
        items = []
        for site in sites:
            item = BtwItem()
            fttcText = site.select("div[@class='content']/div[@id='btfttc']/ul/li/text()").extract()
            # Now we will change the text to be a boolean value
            if fttcText[0].count('not') > 0:
                fttcEnabled = 0
            else:
                fttcEnabled = 1
            item['fttcAvailable'] = fttcEnabled
            items.append(item)
        return items
```
At the minute I have just been trying to get this looping through a list (phoneNumbers), but I have not even managed to get that to work so far. Once I know how to do that, I will be able to get it to pull the numbers from a CSV file by myself. In its current state it is just using the phone number with index 1 in the list.
Answer
Assuming you have a `phones.csv` file with phones in it:
```
01253873647
01253776535
01142726749
```
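As a standalone sanity check of the file-reading step (a minimal sketch; it writes a sample `phones.csv` first, purely for illustration), the rows can be loaded into a list the same way the spider's callback does:

```python
import csv

# Write a sample phones.csv in the answer's assumed format
# (one phone number per row) -- for illustration only.
with open('phones.csv', 'w') as f:
    f.write('01253873647\n01253776535\n01142726749\n')

# Read one phone number per row, as parse_main_page does below
with open('phones.csv', 'r') as f:
    phone_numbers = [row[0] for row in csv.reader(f) if row]

print(phone_numbers)  # ['01253873647', '01253776535', '01142726749']
```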
Here is your spider:
```python
import csv

from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.http import FormRequest
from scrapy.selector import HtmlXPathSelector


class BtwItem(Item):
    fttcAvailable = Field()
    phoneNumber = Field()


class BtwSpider(BaseSpider):
    name = "btw"
    allowed_domains = ["samknows.com"]

    def start_requests(self):
        yield Request("http://www.samknows.com/broadband/broadband_checker",
                      self.parse_main_page)

    def parse_main_page(self, response):
        with open('phones.csv', 'r') as f:
            reader = csv.reader(f)
            for row in reader:
                phone_number = row[0]
                yield FormRequest.from_response(response,
                                                formdata={'broadband_checker[phone]': phone_number},
                                                callback=self.after_post,
                                                meta={'phone_number': phone_number})

    def after_post(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@id="results"]')
        phone_number = response.meta['phone_number']
        for site in sites:
            item = BtwItem()
            fttc = site.select("div[@class='content']/div[@id='btfttc']/ul/li/text()").extract()
            item['phoneNumber'] = phone_number
            item['fttcAvailable'] = 'not' in fttc[0]
            yield item
```
Here's what was scraped after running it:
```
{'fttcAvailable': False, 'phoneNumber': '01253873647'}
{'fttcAvailable': False, 'phoneNumber': '01253776535'}
{'fttcAvailable': True, 'phoneNumber': '01142726749'}
```
The idea is to scrape the main page using `start_requests`, then read the csv file line-by-line in the callback and yield new `Request`s for each phone number (csv row). Additionally, pass `phone_number` to the callback through the `meta` dictionary in order to write it to the `Item` field (I think you need this to distinguish items/results).
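The `meta` hand-off itself can be sketched without Scrapy or any network access; the two stand-in classes below are hypothetical and exist only to show how the callback recovers the value stored when the request was built:

```python
# Hypothetical stand-ins for Scrapy's Request/Response, only to
# illustrate how meta carries state from a request to its callback.
class Request:
    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}

class Response:
    def __init__(self, request):
        # Scrapy exposes the originating request's meta on the response
        self.meta = request.meta

def after_post(response):
    # The callback reads back the phone number attached to the request
    return {'phoneNumber': response.meta['phone_number']}

req = Request('http://www.samknows.com/broadband/broadband_checker',
              after_post, meta={'phone_number': '01253873647'})
item = req.callback(Response(req))
print(item)  # {'phoneNumber': '01253873647'}
```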
Hope that helps.