Python: Pandas dataframe from Series of dict
Question
I have a pandas DataFrame:
type(original)
pandas.core.frame.DataFrame
which includes the Series object original['user']:
type(original['user'])
pandas.core.series.Series
original['user'] points to a number of dicts:
type(original['user'].ix[0])
dict
Each dict has the same keys:
original['user'].ix[0].keys()
[u'follow_request_sent',
u'profile_use_background_image',
u'profile_text_color',
u'id',
u'verified',
u'profile_location',
# ... keys removed for brevity
]
Above is (part of) one of the dicts of user fields in a tweet from the Twitter API. I want to build a data frame from these dicts.
When I try to make a data frame directly, I get only one column for each row, and this column contains the whole dict:
pd.DataFrame(original['user'][:2])
user
0 {u'follow_request_sent': False, u'profile_use_...
1 {u'follow_request_sent': False, u'profile_use_..
When I try to create a data frame using from_dict(), I get the same result:
pd.DataFrame.from_dict(original['user'][:2])
user
0 {u'follow_request_sent': False, u'profile_use_...
1 {u'follow_request_sent': False, u'profile_use_..
Next I tried a list comprehension, which raised an error:
item = [[k, v] for (k,v) in users]
ValueError: too many values to unpack
When I create a data frame from a single row, it nearly works:
df = pd.DataFrame.from_dict(original['user'].ix[0])
df.reset_index()
index contributors_enabled created_at default_profile default_profile_image description entities favourites_count follow_request_sent followers_count following friends_count geo_enabled id id_str is_translation_enabled is_translator lang listed_count location name notifications profile_background_color profile_background_image_url profile_background_image_url_https profile_background_tile profile_image_url profile_image_url_https profile_link_color profile_location profile_sidebar_border_color profile_sidebar_fill_color profile_text_color profile_use_background_image protected screen_name statuses_count time_zone url utc_offset verified
0 description False Mon May 26 11:58:40 +0000 2014 True False {u'urls': []} 0 False 157
It works almost like I want it to, except that it sets the description field as the default index.
Each of the dicts has 40 keys, but I only need about 10 of them, and I have 28734 rows in the data frame.
How can I filter out the keys which I do not need?
Recommended answer
What I would try is the following:
new_df = pd.DataFrame(list(original['user']))
This converts the Series to a list and passes it to the pandas DataFrame constructor, which builds one row per dict with the dict keys as columns.
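To also address the part of the question about keeping only some of the keys, here is a minimal sketch. The sample data, the `wanted` list, and the variable names `users`, `users_small`, and `users_small_alt` are assumptions made up for illustration; only `original['user']` comes from the question.

```python
import pandas as pd

# Hypothetical stand-in for the question's data: a column of dicts
# that all share the same keys.
original = pd.DataFrame({
    "user": [
        {"id": 1, "verified": False, "screen_name": "alice", "followers_count": 157},
        {"id": 2, "verified": True, "screen_name": "bob", "followers_count": 42},
    ]
})

# One row per dict, one column per key. Passing the original index keeps
# the new rows aligned with the rows of `original`.
users = pd.DataFrame(list(original["user"]), index=original.index)

# Keep only the keys of interest (this list is an assumption).
wanted = ["id", "screen_name", "followers_count"]
users_small = users[wanted]

# Alternative: filter each dict before building the frame, so the
# unwanted columns are never created in the first place.
users_small_alt = pd.DataFrame(
    [{k: d[k] for k in wanted} for d in original["user"]],
    index=original.index,
)
```

Selecting columns after construction is the simpler of the two; filtering the dicts first may be preferable when most of the 40 keys are unneeded, since it avoids materializing columns that are discarded right away.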