How can I use Python to transform MongoDB's bsondump into JSON?

Question

I have an enormous quantity of .bson files from a MongoDB dump. I am using bsondump on the command line and piping the output as stdin to Python. This successfully converts the BSON to 'JSON', but the result is in fact a string, and seemingly not legal JSON.

For example, an incoming line looks like this:

{ "_id" : ObjectId( "4d9b642b832a4c4fb2000000" ),
  "acted_at" : Date( 1302014955933 ),
  "created_at" : Date( 1302014955933 ),
  "updated_at" : Date( 1302014955933 ),
  "_platform_id" : 3,
  "guid" : 72106535190265857 }

I believe this is Mongo Extended JSON.

When I read in such a line and do:

json_line = json.dumps(line)

I get:

"{ \"_id\" : ObjectId( \"4d9b642b832a4c4fb2000000\" ),
\"acted_at\" : Date( 1302014955933 ),
\"created_at\" : Date( 1302014955933 ),
\"updated_at\" : Date( 1302014955933 ),
\"_platform_id\" : 3,
\"guid\" : 72106535190265857 }\n"

Which is still <type 'str'>.

I have also tried:

json_line = json.dumps(line, default=json_util.default)

(see pymongo json_util; spam detection prevents a third link), which seems to output the same as dumps above. loads gives an error:

json_line = json.loads(line, object_hook=json_util.object_hook)
ValueError: No JSON object could be decoded

So, how can I transform the string of TenGen JSON into parseable JSON? (The end goal is to stream tab-separated data to another database.)

Recommended answer

What you have is a dump in Mongo Extended JSON in TenGen mode (see here). Some possible ways to go:

  1. If you can dump again, use Strict output mode through the MongoDB REST API. That should give you real JSON instead of what you have now.

  2. Use bson from http://pypi.python.org/pypi/bson/ to read the BSON you already have into Python data structures, and then do whatever processing you need on those (possibly outputting JSON).

  3. Use the MongoDB Python bindings to connect to the database and get the data into Python, and then do whatever processing you need. (If needed, you could set up a local MongoDB instance and import your dumped files into it.)

  4. Convert the Mongo Extended JSON from TenGen mode to Strict mode. You could develop a separate filter to do it (read from stdin, replace TenGen structures with Strict structures, and write the result to stdout), or you could do it as you process the input.

Here's an example using Python and regular expressions:

import json, re
from bson import json_util

# open in text mode: the str regex patterns below cannot be applied to bytes
with open("data.tengenjson", "r") as f:
    # read the entire input; in a real application,
    # you would want to read a chunk at a time
    bsondata = f.read()

    # convert the TenGen JSON to Strict JSON
    # here, I just convert the ObjectId and Date structures,
    # but it's easy to extend to cover all structures listed at
    # http://www.mongodb.org/display/DOCS/Mongo+Extended+JSON
    jsondata = re.sub(r'ObjectId\s*\(\s*\"(\S+)\"\s*\)',
                      r'{"$oid": "\1"}',
                      bsondata)
    jsondata = re.sub(r'Date\s*\(\s*(\S+)\s*\)',
                      r'{"$date": \1}',
                      jsondata)

    # now we can parse this as JSON, and use MongoDB's object_hook
    # function to get rich Python data structures inside a dictionary
    data = json.loads(jsondata, object_hook=json_util.object_hook)

    # just print the output for demonstration, along with the type
    print(data)
    print(type(data))

    # serialise to JSON and print
    print(json_util.dumps(data))
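
For the stated end goal of streaming tab-separated data, the same regex idea could be applied one line at a time instead of reading the whole file. The following is only a minimal sketch using the standard library: it assumes bsondump emits one document per line, it handles only the ObjectId and Date structures, and the names `tengen_to_strict`, `flatten`, `convert_lines`, and the `COLUMNS` order are hypothetical choices made for illustration. Because the substitution is purely textual, a string value that happens to contain text such as `Date( 1 )` would also be rewritten.

```python
import json
import re

# Patterns for the two TenGen-mode structures shown in the question.
OBJECT_ID = re.compile(r'ObjectId\s*\(\s*"([0-9a-fA-F]+)"\s*\)')
DATE = re.compile(r'Date\s*\(\s*(-?\d+)\s*\)')

# Hypothetical column order for the tab-separated output.
COLUMNS = ["_id", "acted_at", "created_at", "updated_at", "_platform_id", "guid"]


def tengen_to_strict(line):
    """Rewrite ObjectId(...) and Date(...) as Strict-mode JSON objects."""
    line = OBJECT_ID.sub(r'{"$oid": "\1"}', line)
    return DATE.sub(r'{"$date": \1}', line)


def flatten(value):
    """Unwrap {"$oid": ...} and {"$date": ...} wrappers into plain scalars."""
    if isinstance(value, dict):
        if "$oid" in value:
            return value["$oid"]
        if "$date" in value:
            return value["$date"]
    return value


def convert_lines(lines, columns=COLUMNS):
    """Yield one tab-separated row per non-empty input line."""
    for raw in lines:
        if not raw.strip():
            continue
        doc = json.loads(tengen_to_strict(raw))
        yield "\t".join(str(flatten(doc.get(name, ""))) for name in columns)
```

Used as a filter behind `bsondump`, this would amount to iterating `convert_lines(sys.stdin)` and printing each row, so memory use stays bounded by the size of a single document.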

Depending on your goal, one of these should be a reasonable starting point.
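
As for why the attempts in the question behaved as they did, the short standard-library sketch below may help. `json.dumps` serialises a Python object to JSON text, so passing it a str simply returns that str as a quoted, escaped JSON string literal. `json.loads` is the parsing direction, but it rejects the TenGen line because `ObjectId( ... )` is not a JSON value. The sample line is shortened from the one in the question, and the variable names are illustrative only.

```python
import json

line = '{ "_id" : ObjectId( "4d9b642b832a4c4fb2000000" ), "guid" : 72106535190265857 }'

# dumps goes from Python object to JSON text: a str in, a quoted str out.
encoded = json.dumps(line)
print(type(encoded))
print(json.loads(encoded) == line)  # decoding gives back the original string

# loads goes from JSON text to Python object, but this text is not valid JSON.
try:
    json.loads(line)
    parsed_ok = True
except ValueError as exc:
    parsed_ok = False
    print("not valid JSON:", exc)

# The Strict-mode equivalent parses into a dict.
strict = '{ "_id" : {"$oid": "4d9b642b832a4c4fb2000000"}, "guid" : 72106535190265857 }'
doc = json.loads(strict)
print(type(doc))
```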
