Can't parse customized results without using requests within scrapy

Problem description

I've created a script using scrapy to fetch all the links connected to the names of different actors from imdb.com, then parse the first three of each actor's movie links, and finally scrape the names of the director and writer of those movies. My script does it flawlessly if I stick to the current attempt. However, I've used the requests module (which I don't want to) within the parse_results method to get the customized output.

Website address: https://www.imdb.com/list/ls058011111/

What the script does (consider the first named link, as in Robert De Niro):

  1. The script uses the URL above and scrapes each named link, then parses that actor's movie links from the section under the heading Filmography.

  2. It then parses the names of the director and the writer from each of those movie pages.

This is what I've written so far (the working one):

import scrapy
import requests
from bs4 import BeautifulSoup
from scrapy.crawler import CrawlerProcess

class ImdbSpider(scrapy.Spider):
    name = 'imdb'
    start_urls = ['https://www.imdb.com/list/ls058011111/']

    def parse(self, response):
        soup = BeautifulSoup(response.text,"lxml")
        for name_links in soup.select(".mode-detail")[:10]:
            name = name_links.select_one("h3 > a").get_text(strip=True)
            item_link = response.urljoin(name_links.select_one("h3 > a").get("href"))
            yield scrapy.Request(item_link,meta={"name":name},callback=self.parse_items)

    def parse_items(self,response):
        name = response.meta.get("name")
        soup = BeautifulSoup(response.text,"lxml")
        item_links = [response.urljoin(item.get("href")) for item in soup.select(".filmo-category-section .filmo-row > b > a[href]")[:3]]
        result_list = [i for url in item_links for i in self.parse_results(url)]
        yield {"actor name":name,"associated name list":result_list}

    def parse_results(self,link):
        response = requests.get(link)
        soup = BeautifulSoup(response.text,"lxml")
        try:
            director = soup.select_one("h4:contains('Director') ~ a").get_text(strip=True)
        except Exception as e: director = ""
        try:
            writer = soup.select_one("h4:contains('Writer') ~ a").get_text(strip=True)
        except Exception as e: writer = ""
        return director,writer


c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',

})
c.crawl(ImdbSpider)
c.start()

This is the output the above script produces (the desired results):

{'actor name': 'Robert De Niro', 'associated name list': ['Jonathan Jakubowicz', 'Jonathan Jakubowicz', '', 'Anthony Thorne', 'Martin Scorsese', 'David Grann']}
{'actor name': 'Sidney Poitier', 'associated name list': ['Gregg Champion', 'Richard Leder', 'Gregg Champion', 'Sterling Anderson', 'Lloyd Kramer', 'Theodore Isaac Rubin']}
{'actor name': 'Daniel Day-Lewis', 'associated name list': ['Paul Thomas Anderson', 'Paul Thomas Anderson', 'Paul Thomas Anderson', 'Paul Thomas Anderson', 'Steven Spielberg', 'Tony Kushner']}
{'actor name': 'Humphrey Bogart', 'associated name list': ['', '', 'Mark Robson', 'Philip Yordan', 'William Wyler', 'Joseph Hayes']}
{'actor name': 'Gregory Peck', 'associated name list': ['', '', 'Arthur Penn', 'Tina Howe', 'Walter C. Miller', 'Peter Stone']}
{'actor name': 'Denzel Washington', 'associated name list': ['Joel Coen', 'Joel Coen', 'John Lee Hancock', 'John Lee Hancock', 'Antoine Fuqua', 'Richard Wenk']}

In the above approach I used the requests module within the parse_results method to get the desired output, as I can't use yield within a list comprehension.
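
(The reason a list comprehension doesn't work here is that yield inside a comprehension does not yield from the enclosing callback. A hypothetical illustration of the pattern that fails, kept commented out on purpose:)

# Does NOT work: in Python 3.8+ this is a SyntaxError, and in earlier
# Python 3 releases the yield happens inside the comprehension's own
# implicit scope, so the Requests never reach Scrapy's engine.
# result_list = [(yield scrapy.Request(url, callback=self.parse_results))
#                for url in item_links]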

How can I let the script produce the exact same output without using requests?

Answer

One way you can address this is by using Request.meta to keep a list of pending URLs for an item across requests, popping URLs from it as they are processed.

As @pguardiario mentions, the drawback is that you are still only processing one request from that list at a time. However, if you have more items than the configured concurrency, that should not be a problem.
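
For reference, the concurrency mentioned above is governed by Scrapy's standard settings; a minimal sketch (16 is Scrapy's default value, shown only for illustration):

from scrapy.crawler import CrawlerProcess

# CONCURRENT_REQUESTS controls how many requests Scrapy processes in
# parallel across all pending items (default 16).
c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    'CONCURRENT_REQUESTS': 16,
})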

This approach would look like this:

def parse_items(self,response):
    # …
    if item_links:
        meta = {
            "actor name": name,
            "associated name list": [],
            "item_links": item_links,
        }
        yield Request(
            item_links.pop(),
            callback=self.parse_results,
            meta=meta
        )
    else:
        yield {"actor name": name}

def parse_results(self, response):
    # …
    response.meta["associated name list"].append((director, writer))
    if response.meta["item_links"]:
        yield Request(
            response.meta["item_links"].pop(),
            callback=self.parse_results,
            meta=response.meta
        )
    else:
        yield {
            "actor name": response.meta["actor name"],
            "associated name list": response.meta["associated name list"],
        }
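
For completeness, here is one way the pattern above could be slotted back into the original spider. This is a sketch rather than a drop-in answer: the elided parts are filled in with the BeautifulSoup selectors from the question, and each movie's director and writer are appended as two flat strings (instead of a (director, writer) tuple) so the result matches the desired output shown earlier.

import scrapy
from bs4 import BeautifulSoup
from scrapy.crawler import CrawlerProcess


class ImdbSpider(scrapy.Spider):
    name = 'imdb'
    start_urls = ['https://www.imdb.com/list/ls058011111/']

    def parse(self, response):
        # Collect the first ten actor links from the list page.
        soup = BeautifulSoup(response.text, "lxml")
        for name_links in soup.select(".mode-detail")[:10]:
            name = name_links.select_one("h3 > a").get_text(strip=True)
            item_link = response.urljoin(name_links.select_one("h3 > a").get("href"))
            yield scrapy.Request(item_link, meta={"name": name}, callback=self.parse_items)

    def parse_items(self, response):
        # Grab the first three movie links and start the chained requests.
        name = response.meta.get("name")
        soup = BeautifulSoup(response.text, "lxml")
        item_links = [
            response.urljoin(item.get("href"))
            for item in soup.select(".filmo-category-section .filmo-row > b > a[href]")[:3]
        ]
        if item_links:
            meta = {
                "actor name": name,
                "associated name list": [],
                "item_links": item_links,
            }
            # pop() takes the last link first; use pop(0) if the original order matters.
            yield scrapy.Request(item_links.pop(), callback=self.parse_results, meta=meta)
        else:
            yield {"actor name": name, "associated name list": []}

    def parse_results(self, response):
        # Append this movie's director and writer, then either follow the
        # next pending link or emit the finished item.
        soup = BeautifulSoup(response.text, "lxml")
        director = soup.select_one("h4:contains('Director') ~ a")
        writer = soup.select_one("h4:contains('Writer') ~ a")
        response.meta["associated name list"].extend([
            director.get_text(strip=True) if director else "",
            writer.get_text(strip=True) if writer else "",
        ])
        if response.meta["item_links"]:
            yield scrapy.Request(response.meta["item_links"].pop(),
                                 callback=self.parse_results, meta=response.meta)
        else:
            yield {
                "actor name": response.meta["actor name"],
                "associated name list": response.meta["associated name list"],
            }


if __name__ == "__main__":
    c = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    c.crawl(ImdbSpider)
    c.start()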
