Only getting one line of output with Scrapy to JSON file

Problem description

Okay, so I'm new to programming in general and to Scrapy in particular. I wrote a crawler to get data from pins on pinterest.com. The problem is that I used to get data from all the pins on the page I am crawling, but now I only get the data of the first pin.

I think the problem lies in the pipeline or in the spider itself. Something changed after I added "strip" to the spider to get rid of the whitespace, but when I changed it back I got the same output, only with the whitespace. This is the spider:

from scrapy.spider import Spider
from scrapy.selector import Selector
from Pinterest.items import PinterestItem

class PinterestSpider(Spider):
    name = "pinterest"
    allowed_domains = ["pinterest.com"]
    start_urls = ["http://www.pinterest.com/llbean/pins/"]

    def parse(self, response):
        hxs = Selector(response)
        item = PinterestItem()
        items = []
        item ["pin_link"] = hxs.xpath("//div[@class='pinHolder']/a/@href").extract()[0].strip()
        item ["repin_count"] = hxs.xpath("//em[@class='socialMetaCount repinCountSmall']/text()").extract()[0].strip()
        item ["like_count"] = hxs.xpath("//em[@class='socialMetaCount likeCountSmall']/text()").extract()[0].strip()
        item ["board_name"] = hxs.xpath("//div[@class='creditTitle']/text()").extract()[0].strip()
        items.append(item)
        return items

This is my pipeline:

from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals
from scrapy.contrib.exporter import JsonLinesItemExporter

class JsonLinesExportPipeline(object):

    def __init__(self):
        dispatcher.connect(self.spider_opened, signals.spider_opened)
        dispatcher.connect(self.spider_closed, signals.spider_closed)
        self.files = {}

    def spider_opened(self, spider):
        file = open('%s_items.json' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = JsonLinesItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

When I use the command "scrapy crawl pinterest", this is the output I get in the JSON file:

{"pin_link": "/pin/94716398388365841/", "board_name": "Outdoor Fun", "like_count": "14", "repin_count": "94"}

This is exactly the output I want, but I get it for only one pin, not for all pins on the page. I spent a lot of time reading through similar questions but couldn't find any with the same problem. Any ideas on what is wrong? Thanks in advance!

Oh, I guess it's because of the [0] before the strip function? Sorry, I just realized this could be the problem...

Hmm, that was not the problem. I am pretty sure it has something to do with the strip function, but I can't seem to use it correctly to get multiple pins as output. Could the solution be part of this question?: Scrapy: Why extracted strings are in this format? I see some overlap but I have no idea how to use it.

Okay, so when I modified the spider like this:

from scrapy.spider import Spider
from scrapy.selector import Selector
from Pinterest.items import PinterestItem

class PinterestSpider(Spider):
    name = "pinterest"
    allowed_domains = ["pinterest.com"]
    start_urls = ["http://www.pinterest.com/llbean/pins/"]

    def parse(self, response):
        hxs = Selector(response)
        sites = hxs.xpath("//div[@class='pinWrapper']")
        items = []
        for site in sites:
            item = PinterestItem()
            item ["pin_link"] = site.select("//div[@class='pinHolder']/a/@href").extract()[0].strip()
            item ["repin_count"] = site.select("//em[@class='socialMetaCount repinCountSmall']/text()").extract()[0].strip()
            item ["like_count"] = site.select("//em[@class='socialMetaCount likeCountSmall']/text()").extract()[0].strip()
            item ["board_name"] = site.select("//div[@class='creditTitle']/text()").extract()[0].strip()
            items.append(item)
        return items

It did give me several lines of output, but apparently all with the same information: it produced one item for each pin on the page, but every item has the same values:

{"pin_link": "/pin/94716398388371133/", "board_name": "Take Me Fishing", "like_count": "3", "repin_count": "21"}
{"pin_link": "/pin/94716398388371133/", "board_name": "Take Me Fishing", "like_count": "3", "repin_count": "21"}
{"pin_link": "/pin/94716398388371133/", "board_name": "Take Me Fishing", "like_count": "3", "repin_count": "21"}
{"pin_link": "/pin/94716398388371133/", "board_name": "Take Me Fishing", "like_count": "3", "repin_count": "21"}

Recommended answer

I haven't used Scrapy, so this is a wild guess.

Your selectors are pulling back multiple results. You then select the first value out of each list (with the index [0]) and create a single PinterestItem called item, which you append to the items list before returning it. Nothing appears to be looping over all the possible results returned by the selectors.
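To make the role of [0] concrete, here is a minimal sketch in plain Python. The list below uses made-up values and only stands in for what a selector's .extract() call returns; it is not real Scrapy output. It suggests that strip is not the culprit: strip needs a single string, which is why [0] was added, and [0] is what discards the other pins.

```python
# Made-up values standing in for the list a selector's .extract() returns.
extracted = [" /pin/1/ ", " /pin/2/ ", " /pin/3/ "]

# A list has no strip() method, so the question indexes [0] to get one string.
list_has_strip = hasattr(extracted, "strip")

# Indexing keeps only the first pin's value.
first_only = extracted[0].strip()

# Stripping every element keeps one cleaned value per pin.
all_values = [value.strip() for value in extracted]

print(list_has_strip, first_only, all_values)
```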

So pull out all of the results, then iterate over them to create your items list:

def parse(self, response):
    hxs = Selector(response)
    pin_links = hxs.xpath("//div[@class='pinHolder']/a/@href").extract()
    repin_counts = hxs.xpath("//em[@class='socialMetaCount repinCountSmall']/text()").extract()
    like_counts = hxs.xpath("//em[@class='socialMetaCount likeCountSmall']/text()").extract()
    board_names = hxs.xpath("//div[@class='creditTitle']/text()").extract()

    items = []
    for pin_link, repin_count, like_count, board_name in zip(pin_links, repin_counts, like_counts, board_names):
        item = PinterestItem()
        item["pin_link"] = pin_link.strip()
        item["repin_count"] = repin_count.strip()
        item["like_count"] = like_count.strip()
        item["board_name"] = board_name.strip()
        items.append(item)
    return items
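One caveat with zipping four parallel lists: zip stops at the shortest input, so if a pin has no like count, the later values shift out of step with their pins. An alternative is to loop over each pin container and query relative to it. The sketch below illustrates that idea with the standard library's xml.etree.ElementTree rather than Scrapy's Selector, on hypothetical, simplified markup (the real Pinterest page will differ); parse_pins and text_or_none are illustrative names, not part of any library. In Scrapy the analogous move would be a relative XPath such as ".//div[@class='pinHolder']/a/@href" on each site selector. An XPath that starts with "//" searches from the document root even when called on a sub-selector, which would be consistent with the identical rows shown in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the second pin deliberately has no like count.
PAGE = """
<html><body>
  <div class="pinWrapper">
    <div class="pinHolder"><a href="/pin/1/">first</a></div>
    <em class="socialMetaCount repinCountSmall"> 94 </em>
    <em class="socialMetaCount likeCountSmall"> 14 </em>
    <div class="creditTitle"> Outdoor Fun </div>
  </div>
  <div class="pinWrapper">
    <div class="pinHolder"><a href="/pin/2/">second</a></div>
    <em class="socialMetaCount repinCountSmall"> 21 </em>
    <div class="creditTitle"> Take Me Fishing </div>
  </div>
</body></html>
"""


def text_or_none(node):
    """Return the element's stripped text, or None if the element is missing."""
    if node is None or node.text is None:
        return None
    return node.text.strip()


def parse_pins(markup):
    """Build one dict per pin container, querying relative to that container."""
    root = ET.fromstring(markup)
    pins = []
    for wrapper in root.iterfind(".//div[@class='pinWrapper']"):
        # The leading "." keeps every lookup inside the current wrapper.
        link = wrapper.find(".//div[@class='pinHolder']/a")
        pins.append({
            "pin_link": link.get("href") if link is not None else None,
            "repin_count": text_or_none(
                wrapper.find(".//em[@class='socialMetaCount repinCountSmall']")),
            "like_count": text_or_none(
                wrapper.find(".//em[@class='socialMetaCount likeCountSmall']")),
            "board_name": text_or_none(
                wrapper.find(".//div[@class='creditTitle']")),
        })
    return pins


pins = parse_pins(PAGE)
for pin in pins:
    print(pin)
```

Because each lookup is scoped to one wrapper, a missing field becomes None for that pin only instead of shifting every later value, at the cost of one query per field per pin.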
