TypeError when putting scraped data from scrapy into elasticsearch


Question

I've been following this tutorial (http://blog.florian-hopf.de/2014/07/scrapy-and-elasticsearch.html) and using this scrapy-elasticsearch pipeline (https://github.com/knockrentals/scrapy-elasticsearch). I am able to extract data from scrapy to a JSON file, and I have an elasticsearch server up and running on localhost.

However, when I attempt to send scraped data into elasticsearch using the pipeline, I get the following error:

2015-08-05 21:21:53 [scrapy] ERROR: Error processing {'link': [u'http://www.meetup.com/Search-Meetup-Karlsruhe/events/221907250/'],
 'title': [u'Alles rund um Elasticsearch']}
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapyelasticsearch/scrapyelasticsearch.py", line 70, in process_item
    self.index_item(item)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapyelasticsearch/scrapyelasticsearch.py", line 52, in index_item
    local_id = hashlib.sha1(item[uniq_key]).hexdigest()
TypeError: must be string or buffer, not list

My items.py scrapy file looks like this:

from scrapy.item import Item, Field

class MeetupItem(Item):
    title = Field()
    link = Field()
    description = Field()

And (I think only the relevant part of) my settings.py file looks like this:

from scrapy import log

ITEM_PIPELINES = [
    'scrapyelasticsearch.scrapyelasticsearch.ElasticSearchPipeline',
]

ELASTICSEARCH_SERVER = 'localhost' # If not 'localhost' prepend 'http://'
ELASTICSEARCH_PORT = 9200 # If port 80 leave blank
ELASTICSEARCH_USERNAME = ''
ELASTICSEARCH_PASSWORD = ''
ELASTICSEARCH_INDEX = 'meetups'
ELASTICSEARCH_TYPE = 'meetup'
ELASTICSEARCH_UNIQ_KEY = 'link'
ELASTICSEARCH_LOG_LEVEL= log.DEBUG

Any help would be greatly appreciated!

Answer

As you can see in the error message (Error processing {'link': [u'http://www.meetup.com/Search-Meetup-Karlsruhe/events/221907250/'], 'title': [u'Alles rund um Elasticsearch']}), your item's link and title fields are lists; the square brackets around the values indicate this.

This is because of your extraction in Scrapy. You did not post it with your question, but you should use response.xpath().extract()[0] to get the first result of the list. Naturally, in this case you should be prepared to encounter empty result sets, to avoid IndexErrors.
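The traceback itself points at the same cause: line 52 of the pipeline calls hashlib.sha1() on the field named by ELASTICSEARCH_UNIQ_KEY, and hashlib.sha1 accepts only a string or buffer, never a list. A minimal reproduction, using the extracted link value from the error message above:

```python
import hashlib

# The 'link' field as Scrapy extracted it: a one-element list, not a string.
link = [u'http://www.meetup.com/Search-Meetup-Karlsruhe/events/221907250/']

# Hashing the list raises the same TypeError the pipeline reports.
try:
    hashlib.sha1(link).hexdigest()
except TypeError as exc:
    print('TypeError: %s' % exc)

# Hashing the first element (a plain string, encoded to bytes) works fine.
digest = hashlib.sha1(link[0].encode('utf-8')).hexdigest()
print(digest)
```

This is why unwrapping the list in the spider, before the item reaches the pipeline, fixes the error.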

Update

For the situation where you do not extract anything, you could prepare with the following:

linkSelection = response.xpath().extract()
item['link'] = linkSelection[0] if linkSelection else ""

Or something similar, depending on your data and fields. Perhaps None could be valid too if the list is empty.

The basic idea is to split up the XPath extraction and the list-item selection, and then select an item from the list only if it contains the required element.
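That split can be wrapped in a tiny helper so each field assignment stays one line. This is a sketch of the idea, not part of the pipeline; the helper name first_or_default is hypothetical:

```python
def first_or_default(values, default=u''):
    """Return the first element of an extracted list, or a default.

    Keeps the list-item selection separate from the XPath extraction,
    so item fields always end up as plain strings rather than lists.
    """
    return values[0] if values else default

# With the lists from the error message above, the item fields
# become strings that hashlib.sha1 (and the pipeline) can accept:
print(first_or_default([u'Alles rund um Elasticsearch']))
print(first_or_default([]))  # empty extraction falls back to u''
```

In the spider, each assignment then reads like item['link'] = first_or_default(linkSelection), with no inline conditionals to repeat per field.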
