Crawl and Concatenate in Scrapy

Question

I'm trying to crawl a movie list with Scrapy (I take only the director and movie title fields). Sometimes there are two directors, and Scrapy scrapes them as separate entries. So the first director ends up alongside the movie title, but the second one has no movie title.

So I created a condition like this:

if director2:
    item['director'] = map(unicode.strip,titres.xpath("tbody/tr/td/div/div[2]/div[3]/div[2]/div/h2/div/a/text()").extract())

The last div[2] exists only if there are two directors.

And I wanted to concatenate them like this: director1, director2

Here is my full code:

class movies(scrapy.Spider):
    name ="movielist"
    allowed_domains = ["domain.com"]
    start_urls = ["http://www.domain.com/list"]

    def parse(self, response):
        for titles in response.xpath('//*[contains(concat(" ", normalize-space(@class), " "), " grid")]'):
            item = MovieItem()
            director2 = Selector(text=html_content).xpath("h2/div[2]/a/text()")
            if director2:
                item['director'] = map(unicode.strip,titres.xpath,string-join("h2//concat(div[1]/a/text(), ".", div[2]/a/text())").extract())
            else:
                item['director'] = map(unicode.strip,titres.xpath("h2/div/a/text()").extract())
                item['director'] = map(unicode.strip,titres.xpath,string-join("h2//concat(div[1]/a/text(), ".", div[2]/a/text())").extract())
                item['title'] = map(unicode.strip,titres.xpath("h2/a/text()").extract())
            yield item

Sample HTML with one director:

<h2>
    <a href="#">Movie's title</a>
    <div>Info</div>
    <div><a href="#">Director's name</a></div>
</h2>

Sometimes, when there are two directors:

<h2>
    <a href="#">Movie's title</a>
    <div>Info</div>
    <div><a href="#">Director's name</a></div>
    <div><a href="#">Second director's name</a></div>
</h2>

Can you tell me what's wrong with my syntax?

I tested without the condition and without the concatenation, and it works very well.

This is my first time with Python, so please be indulgent.

Thanks a lot.

Recommended answer

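Get all the directors (1, 2 or more) in a single XPath query and join them with join(). No condition on the second director is needed, because the same expression matches however many director links exist: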

item['director'] = ", ".join(titles.xpath("h2/div/a/text()").extract())
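To see why a single query plus join() covers both cases, here is a minimal sketch run against the two sample HTML snippets from the question. It is an illustration only: it uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors (the samples happen to be well-formed XML), and the helper name join_directors is hypothetical.

```python
import xml.etree.ElementTree as ET

ONE_DIRECTOR = """
<h2>
    <a href="#">Movie's title</a>
    <div>Info</div>
    <div><a href="#">Director's name</a></div>
</h2>
"""

TWO_DIRECTORS = """
<h2>
    <a href="#">Movie's title</a>
    <div>Info</div>
    <div><a href="#">Director's name</a></div>
    <div><a href="#">Second director's name</a></div>
</h2>
"""


def join_directors(h2_markup):
    """Return every director name under the <h2>, comma-separated.

    Hypothetical helper: ElementTree stands in for a Scrapy selector here.
    """
    h2 = ET.fromstring(h2_markup.strip())
    # "div/a" matches the link inside each direct <div> child, mirroring the
    # "h2/div/a/text()" XPath; the "Info" div has no link and is skipped.
    names = [(a.text or "").strip() for a in h2.findall("div/a")]
    return ", ".join(names)


print(join_directors(ONE_DIRECTOR))
print(join_directors(TWO_DIRECTORS))
```

With one director, join() simply returns that single name, so no separate branch is needed.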

A better Scrapy-specific approach, though, would be to use an ItemLoader and the Join() processor. Define an ItemLoader:

from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import TakeFirst, MapCompose, Join

class MovieLoader(ItemLoader):

    default_output_processor = TakeFirst()

    director_in = MapCompose(unicode.strip)
    director_out = Join(", ")  # Join() alone would separate the names with a single space

And let it worry about stripping and joining:

loader = MovieLoader(MovieItem(), titles)
...
loader.add_xpath("director", "h2/div/a/text()")
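To make the processors less opaque, here is a rough plain-Python sketch of what they do to the collected values. The functions map_compose, join and take_first are simplified, hypothetical stand-ins, not Scrapy's implementation, and the input strings are made up for illustration. Two points that may matter on a current setup: in recent Scrapy releases the loader is imported from scrapy.loader and the processors from itemloaders.processors, and under Python 3 str.strip takes the place of unicode.strip.

```python
def map_compose(*functions):
    """Simplified stand-in for MapCompose: apply each function to every value."""
    def process(values):
        for function in functions:
            values = [function(value) for value in values]
            # Like MapCompose, drop values that a function turned into None.
            values = [value for value in values if value is not None]
        return values
    return process


def join(separator=" "):
    """Simplified stand-in for Join: the separator defaults to a single space."""
    return lambda values: separator.join(values)


def take_first(values):
    """Simplified stand-in for TakeFirst: first value that is not None or ''."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


# Made-up raw values, as an XPath query might return them before cleaning.
raw_directors = [" Director's name", "Second director's name \n"]

director_in = map_compose(str.strip)   # input processor: strip each value
directors = director_in(raw_directors)

print(join()(directors))       # default separator: a single space
print(join(", ")(directors))   # explicit separator: "director1, director2" form
print(take_first(["", None, "Movie's title"]))
```

The separator passed to Join decides the final shape of the field, so the "director1, director2" form asked for in the question corresponds to a ", " separator.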
