Python Scrapy & Yield


Question


I am currently developing a scraper using Scrapy for the first time and I am using Yield for the first time as well. I am still trying to wrap my head around yield.

The scraper:

  • Crawls one page to get a list of dates (parse)
  • Uses those dates to format URLs, which it then crawls (parse_page_contents)
  • On those pages, it finds the URL of each individual listing and crawls them (parse_page_listings)
  • On each individual listing I want to extract all the data. There are also 4 links on each listing that contain more data. (parse_individual_listings)


I am struggling to understand how to combine the JSON from parse_individual_tabs and parse_individual_listings into one JSON string. There will be one JSON string per individual listing, and it will be sent to an API. Even just printing it for the time being will work.

    import urlparse

    import scrapy
    from bs4 import BeautifulSoup


    class MySpider(scrapy.Spider):
        name = "myspider"

        start_urls = [
                '',  # start URL elided in the original post
        ]

        def parse(self, response):
            rows = response.css('table.apas_tbl tr').extract()
            for row in rows[1:]:
                soup = BeautifulSoup(row, 'lxml')
                dates = soup.find_all('input')
                url = ""  # URL elided in the original post
                yield scrapy.Request(url, callback=self.parse_page_contents)

        def parse_page_contents(self, response):
            rows = response.xpath('//div[@id="apas_form"]').extract_first()
            soup = BeautifulSoup(rows, 'lxml')
            pages = soup.find(id='apas_form_text')
            urls = []
            urls.append(response.url)
            for link in pages.find_all('a'):
                urls.append('/'.format(link['href']))  # format string elided in the original post

            for url in urls:
                yield scrapy.Request(url, callback=self.parse_page_listings)

        def parse_page_listings(self, response):
            rows = response.xpath('//div[@id="apas_form"]').extract_first()
            soup = BeautifulSoup(rows, 'lxml')
            resultTable = soup.find("table", {"class": "apas_tbl"})

            for row in resultTable.find_all('a'):
                url = ""  # URL elided in the original post
                yield scrapy.Request(url, callback=self.parse_individual_listings)

        def parse_individual_listings(self, response):
            rows = response.xpath('//div[@id="apas_form"]').extract_first()
            soup = BeautifulSoup(rows, 'lxml')
            fields = soup.find_all('div', {'id': 'fieldset_data'})
            for field in fields:
                print field.label.text.strip()
                print field.p.text.strip()

            tabs = response.xpath('//div[@id="tabheader"]').extract_first()
            soup = BeautifulSoup(tabs, 'lxml')
            links = soup.find_all("a")
            for link in links:
                yield scrapy.Request(urlparse.urljoin(response.url, link['href']),
                                     callback=self.parse_individual_tabs)

To:

def parse_individual_listings(self, response): 
    rows = response.xpath('//div[@id="apas_form"]').extract_first() 
    soup = BeautifulSoup(rows, 'lxml')
    fields = soup.find_all('div',{'id':'fieldset_data'})
    data = {}
    for field in fields:
        data[field.label.text.strip()] = field.p.text.strip()

    tabs = response.xpath('//div[@id="tabheader"]').extract_first() 
    soup = BeautifulSoup(tabs, 'lxml')
    links = soup.find_all("a")
    for link in links:
        yield scrapy.Request(
            urlparse.urljoin(response.url, link['href']), 
            callback=self.parse_individual_tabs,
            meta={'data': data}
        )
    print data

...

    def parse_individual_tabs(self, response): 
        data = {}
        rows = response.xpath('//div[@id="tabContent"]').extract_first() 
        soup = BeautifulSoup(rows, 'lxml')
        fields = soup.find_all('div',{'id':'fieldset_data'})
        for field in fields:
            data[field.label.text.strip()] = field.p.text.strip()

        print json.dumps(data)

To:

    def parse_individual_tabs(self, response):
        data = {}
        rows = response.xpath('//div[@id="tabContent"]').extract_first() 
        soup = BeautifulSoup(rows, 'lxml')
        fields = soup.find_all('div',{'id':'fieldset_data'})
        for field in fields:
            data[field.label.text.strip()] = field.p.text.strip()

        yield json.dumps(data)

Answer


Normally when obtaining data you'll have to use Scrapy Items, but they can also be replaced with dictionaries (which would be the JSON objects you are referring to), so we'll use them here:
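As a point of reference, here is a minimal sketch of what an equivalent Scrapy Item could look like; the field names are placeholders, not taken from the original spider:

    import scrapy


    class ListingItem(scrapy.Item):
        # declare one scrapy.Field() per value you want to collect
        title = scrapy.Field()
        date = scrapy.Field()

An Item instance is populated exactly like a dictionary (item = ListingItem(); item['title'] = 'some value'), which is why a plain dict works as a drop-in replacement here.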


First, start creating the item (or dictionary) in the parse_individual_listings method, just as you did with data in parse_individual_tabs. Then pass it to the next request (which will be caught by parse_individual_tabs) with the meta argument, so it should look like:

def parse_individual_listings(self, response):
    ...
    data = {}
    data[field1] = 'data1'
    data[field2] = 'data2'
    ...
    yield scrapy.Request(
        urlparse.urljoin(response.url, link['href']), 
        callback=self.parse_individual_tabs,
        meta={'data': data}
    )


Then, you can get that data in parse_individual_tabs:

def parse_individual_tabs(self, response):
    data = response.meta['data']
    ...
    # keep populating `data`
    yield data


Now the data in parse_individual_tabs has all the information you want from both requests, and you can do the same between any pair of callback requests.
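For completeness, a minimal sketch of the whole hand-off under this approach. The CSS selectors are simplified stand-ins for the BeautifulSoup parsing in the question, and passing dict(data) (a copy) is a small deliberate tweak so the four tab requests do not all mutate one shared dict:

    import urlparse  # Python 2, matching the question; on Python 3 use urllib.parse

    import scrapy


    class MySpider(scrapy.Spider):
        name = "myspider"

        def parse_individual_listings(self, response):
            data = {}
            # ... populate `data` with the listing-level fields ...
            for href in response.css('#tabheader a::attr(href)').extract():
                yield scrapy.Request(
                    urlparse.urljoin(response.url, href),
                    callback=self.parse_individual_tabs,
                    meta={'data': dict(data)},  # pass a copy per tab request
                )

        def parse_individual_tabs(self, response):
            data = response.meta['data']
            # ... add the tab-level fields to `data` ...
            yield data  # Scrapy collects the merged dict as one scraped item

Note that this still yields one item per tab link (four per listing); merging all four tabs into a single item would additionally require chaining the tab requests one after another, carrying the growing dict along in meta.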
