MongoDB InvalidDocument: Cannot encode object


Problem Description


I am using Scrapy to scrape blogs and then store the data in MongoDB. At first I got an InvalidDocument exception, so it seemed obvious to me that the data was not in the right encoding. So before persisting the object, my MongoPipeline checks that every document field is 'utf-8 strict', and only then does it try to persist the object to MongoDB. But I still get InvalidDocument exceptions, and now that is annoying.

This is the code of my MongoPipeline object that persists the items to MongoDB:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#

import pymongo
import sys, traceback
from scrapy.exceptions import DropItem
from crawler.items import BlogItem, CommentItem


class MongoPipeline(object):
    collection_name = 'master'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'posts')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):

        if type(item) is BlogItem:
            try:
                if 'url' in item:
                    item['url'] = item['url'].encode('utf-8', 'strict')
                if 'domain' in item:
                    item['domain'] = item['domain'].encode('utf-8', 'strict')
                if 'title' in item:
                    item['title'] = item['title'].encode('utf-8', 'strict')
                if 'date' in item:
                    item['date'] = item['date'].encode('utf-8', 'strict')
                if 'content' in item:
                    item['content'] = item['content'].encode('utf-8', 'strict')
                if 'author' in item:
                    item['author'] = item['author'].encode('utf-8', 'strict')

            except:  # catch *all* exceptions
                e = sys.exc_info()[0]
                spider.logger.critical("ERROR ENCODING %s", e)
                traceback.print_exc(file=sys.stdout)
                raise DropItem("Error encoding BLOG %s" % item['url'])

            if 'comments' in item:
                comments = item['comments']
                item['comments'] = []

                try:
                    for comment in comments:
                        if 'date' in comment:
                            comment['date'] = comment['date'].encode('utf-8', 'strict')
                        if 'author' in comment:
                            comment['author'] = comment['author'].encode('utf-8', 'strict')
                        if 'content' in comment:
                            comment['content'] = comment['content'].encode('utf-8', 'strict')

                        item['comments'].append(comment)

                except:  # catch *all* exceptions
                    e = sys.exc_info()[0]
                    spider.logger.critical("ERROR ENCODING COMMENT %s", e)
                    traceback.print_exc(file=sys.stdout)

        self.db[self.collection_name].insert(dict(item))

        return item

And I still get the following exception:

au coeur de l\u2019explosion de la bulle Internet n\u2019est probablement pas \xe9tranger au succ\xe8s qui a suivi. Mais franchement, c\u2019est un peu court comme argument !Ce que je sais dire, compte tenu de ce qui pr\xe9c\xe8de, c\u2019est quelles sont les conditions pour r\xe9ussir si l\u2019on est vraiment contraint de rester en France. Ce sont des sujets que je d\xe9velopperai dans un autre article.',
     'date': u'2012-06-27T23:21:25+00:00',
     'domain': 'reussir-sa-boite.fr',
     'title': u'Peut-on encore entreprendre en France ?\t\t\t ',
     'url': 'http://www.reussir-sa-boite.fr/peut-on-encore-entreprendre-en-france/'}
    Traceback (most recent call last):
      File "h:\program files\anaconda\lib\site-packages\twisted\internet\defer.py", line 588, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "H:\PDS\BNP\crawler\crawler\pipelines.py", line 76, in process_item
        self.db[self.collection_name].insert(dict(item))
      File "h:\program files\anaconda\lib\site-packages\pymongo\collection.py", line 409, in insert
        gen(), check_keys, self.uuid_subtype, client)
    InvalidDocument: Cannot encode object: {'author': 'Arnaud Lemasson',
     'content': 'Tellement vrai\xe2\x80\xa6 Il faut vraiment \xc3\xaatre motiv\xc3\xa9 aujourd\xe2\x80\x99hui pour monter sa bo\xc3\xaete. On est pr\xc3\xa9lev\xc3\xa9 de partout, je ne pense m\xc3\xaame pas \xc3\xa0 embaucher, cela me co\xc3\xbbterait bien trop cher. Bref, 100% d\xe2\x80\x99accord avec vous. Le probl\xc3\xa8me, je ne vois pas comment cela pourrait changer avec le gouvernement actuel\xe2\x80\xa6 A moins que si, j\xe2\x80\x99ai pu lire il me semble qu\xe2\x80\x99ils avaient en t\xc3\xaate de r\xc3\xa9duire l\xe2\x80\x99IS pour les petites entreprises et de l\xe2\x80\x99augmenter pour les grandes\xe2\x80\xa6 A voir',
     'date': '2012-06-27T23:21:25+00:00'}
    2015-11-04 15:29:15 [scrapy] INFO: Closing spider (finished)
    2015-11-04 15:29:15 [scrapy] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 259,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 252396,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2015, 11, 4, 14, 29, 15, 701000),
     'log_count/DEBUG': 2,
     'log_count/ERROR': 1,
     'log_count/INFO': 7,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2015, 11, 4, 14, 29, 13, 191000)}

Another funny thing: following the comment of @eLRuLL, I did the following:

>>> s = "Tellement vrai\xe2\x80\xa6 Il faut vraiment \xc3\xaatre motiv\xc3\xa9 aujourd\xe2\x80\x99hui pour monter sa bo\xc3\xaete. On est pr\xc3\xa9lev\xc3\xa9 de partout, je ne pense m\xc3\xaame pas \xc3\xa0 embaucher, cela me"
>>> s
'Tellement vrai\xe2\x80\xa6 Il faut vraiment \xc3\xaatre motiv\xc3\xa9 aujourd\xe2\x80\x99hui pour monter sa bo\xc3\xaete. On est pr\xc3\xa9lev\xc3\xa9 de partout, je ne pense m\xc3\xaame pas \xc3\xa0 embaucher, cela me'
>>> se = s.encode("utf8", "strict")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 14: ordinal not in range(128)
>>> se = s.encode("utf-8", "strict")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 14: ordinal not in range(128)
>>> s.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 14: ordinal not in range(128)

So my question is: if this text cannot be encoded, why doesn't the try/except in my MongoPipeline catch this exception? Only objects that don't raise any exception should be appended to item['comments'], right?

Solution

Finally I figured it out. The problem was not with the encoding; it was with the structure of the documents.

It happened because I had based my pipeline on the standard MongoPipeline example, which does not deal with nested Scrapy items.

What I am doing is:

    BlogItem:
        "url"
        ...
        comments = [CommentItem]

So my BlogItem has a list of CommentItems. The problem comes here; to persist the object in the database I do:

self.db[self.collection_name].insert(dict(item))

So here I am converting the BlogItem to a dict, but I am not converting the list of CommentItems. And because the traceback displays each CommentItem much like a dict, it did not occur to me that the problematic object is not a dict!
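
To see why this fails, here is a minimal sketch of what dict(item) does and does not convert. The field definitions below are my own illustration, not the actual crawler.items code:

import scrapy

class CommentItem(scrapy.Item):
    author = scrapy.Field()
    content = scrapy.Field()

class BlogItem(scrapy.Item):
    url = scrapy.Field()
    comments = scrapy.Field()

comment = CommentItem(author='Arnaud Lemasson', content='Tellement vrai...')
blog = BlogItem(url='http://example.com', comments=[comment])

doc = dict(blog)                 # only the top-level item becomes a plain dict
print(type(doc['comments'][0]))  # -> CommentItem, not dict
# pymongo's BSON encoder does not know how to serialize a CommentItem,
# so insert() fails with "InvalidDocument: Cannot encode object: {...}"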

So finally, the way to fix this problem is to change the line that appends the comment to the comment list, like so:

item['comments'].append(dict(comment))

Now MongoDB considers it a valid document.
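
An equivalent approach (just a sketch of my own, not part of the original fix) is to convert any nested items right before the insert, so the pipeline never hands pymongo anything but plain dicts and lists:

import scrapy

def to_plain(value):
    # Recursively rebuild Scrapy items and dicts as plain dicts so BSON can encode them
    if isinstance(value, (scrapy.Item, dict)):
        return {key: to_plain(val) for key, val in dict(value).items()}
    if isinstance(value, list):
        return [to_plain(element) for element in value]
    return value

# inside process_item:
#     self.db[self.collection_name].insert(to_plain(item))

With something like that in place, nested comments (and any deeper nesting added later) get converted in one place instead of field by field.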

Lastly, about the last part, where I asked why I get an exception in the Python console but not in the script:

The reason is that I was working in a Python console that only supports ASCII, hence the error.
