Getting nested data from MongoDB into a Pandas data frame

Question

I'm collecting Twitter data (tweets + metadata) into a MongoDB server. Now I want to do some statistical analysis. To get the data from MongoDB into a Pandas data frame, I used the following code:

cursor = collection.find({}, {'id': 1, 'text': 1})
tweet_fields = ['id', 'text']
result = pd.DataFrame(list(cursor), columns=tweet_fields)

This way I successfully loaded the data into Pandas, which is great. Now I want to analyze the users who created the tweets, which is data I also collected. This data sits in a nested part of the JSON (I'm not 100% sure it is true JSON), for instance user.id, which is the ID of the Twitter user account.

I can just add that to the cursor using dot notation:

cursor = collection.find({},{'id': 1, 'text': 1, 'user.id': 1})

But this results in a NaN for that column. I found that the problem lies with the way the data is structured:

A bit of the cursor output without user.id:

[{'_id': ObjectId('561547ae5371c0637f57769e'),
  'id': 651795711403683840,
  'text': 'Video: Zuuuu gut! Caro Korneli besucht für extra 3 Pegida Via KFMW http://t.co/BJX5GKrp7s'},
 {'_id': ObjectId('561547bf5371c0637f5776ac'),
  'id': 651795781557583872,
  'text': 'Iets voor werkloze xenofobe PVV-ers, (en dat zijn waarschijnlijk wel de meeste).........Ze zoeken bij Frontex een paar honderd grenswachten.'},
 {'_id': ObjectId('561547ab5371c0637f57769c'),
  'id': 651795699881889792,
  'text': 'RT @ansichtssache47: Geht gefälligst arbeiten, die #Flüchtlinge haben Hunger! http://t.co/QxUYfFjZB5 #grenzendicht #rente #ZivilerUngehorsa…'}]

A bit of the cursor output with user.id:

[{'_id': ObjectId('561547ae5371c0637f57769e'),
  'id': 651795711403683840,
  'text': 'Video: Zuuuu gut! Caro Korneli besucht für extra 3 Pegida Via KFMW http://t.co/BJX5GKrp7s',
  'user': {'id': 223528499}},
 {'_id': ObjectId('561547bf5371c0637f5776ac'),
  'id': 651795781557583872,
  'text': 'Iets voor werkloze xenofobe PVV-ers, (en dat zijn waarschijnlijk wel de meeste).........Ze zoeken bij Frontex een paar honderd grenswachten.',
  'user': {'id': 3544739837}}]

So in short, I don't understand how to get the nested part of my collected data into a separate column of my Pandas data frame.

Recommended answer

I use a function like this to get nested JSON lines into a dataframe. It uses the handy pandas json_normalize function:

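import json

import pandas as pd
from bson import json_util
# pandas >= 1.0 exposes json_normalize at the top level;
# on older versions use: from pandas.io.json import json_normalize
from pandas import json_normalize

def mongo_to_dataframe(mongo_data):
    # Serialize BSON types (e.g. ObjectId) to JSON text, then parse back into plain dicts and lists
    sanitized = json.loads(json_util.dumps(mongo_data))
    # Flatten nested documents into dotted column names such as 'user.id'
    normalized = json_normalize(sanitized)
    # json_normalize already returns a DataFrame, so this wrap is optional
    df = pd.DataFrame(normalized)

    return df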

Just call the function with your mongo data as its argument.

sanitized = json.loads(json_util.dumps(mongo_data)) serializes the MongoDB documents (including BSON types such as ObjectId) to a JSON string and loads it back as regular JSON data

normalized = json_normalize(sanitized) un-nests the data
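To make "un-nesting" concrete, here is a minimal pure-Python sketch of the idea. The helper flatten_record is hypothetical and written only for illustration; it is not how pandas implements json_normalize, and the tweet text is a placeholder.

```python
def flatten_record(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining key paths with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into the nested dict, carrying the key path along
            flat.update(flatten_record(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# A document shaped like the ones in the question (placeholder text)
tweet = {"id": 651795711403683840, "text": "example tweet", "user": {"id": 223528499}}
print(flatten_record(tweet))
```

Each nested key ends up as a top-level key such as 'user.id', which is the shape a data frame needs in order to hold it in its own column.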

df = pd.DataFrame(normalized) simply turns it into a dataframe (json_normalize already returns one, so this step is optional)
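As a rough illustration that runs without a MongoDB server, the sketch below uses plain dicts shaped like the cursor output in the question (the tweet texts are placeholders). It assumes pandas 1.0 or later, where json_normalize is available as pd.json_normalize. It first reproduces the empty column from the question, then shows the flattened result.

```python
import pandas as pd

# Records shaped like the question's cursor output (texts are placeholders)
records = [
    {"id": 651795711403683840, "text": "first tweet", "user": {"id": 223528499}},
    {"id": 651795781557583872, "text": "second tweet", "user": {"id": 3544739837}},
]

# Naive construction: no top-level key is literally named 'user.id',
# so that column is filled with NaN
naive = pd.DataFrame(records, columns=["id", "text", "user.id"])
print(naive["user.id"].isna().all())

# json_normalize flattens the nested 'user' dict into a dotted 'user.id' column
flat = pd.json_normalize(records)
print(flat[["id", "text", "user.id"]])
```

With real MongoDB results, BSON values such as ObjectId are not plain JSON types, which is why the answer's function round-trips the data through json_util.dumps and json.loads before normalizing.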
