Creating data histograms/visualizations using ipython and filtering out some values


Question

I posted a question earlier (Pandas-ipython, how to create new data frames with drill down capabilities) and it was pointed out that it was possibly too broad, so I have some more specific questions that may be easier to answer and will help me get started with graphing data.

I have decided to try creating some visualizations of my data using Pandas (or any package accessible through ipython). The first, obvious problem I run into is how to filter on certain conditions. For example, I type the command:

df.Duration.hist(bins=10)

but I get an error due to unrecognized dtypes (there are some entries that aren't in datetime format). How can I exclude these in the original command?

Also, what if I want to create the same histogram but filter to keep only records whose ids (in an account id field) start with the integer (or string?) '2'?

Ultimately, I want to be able to create histograms, line plots, box plots and so on, while filtering for certain months, user ids, or just bad 'dtypes'.

Can anyone help me modify the above command to add filters to it? (I'm decent with Python, but new to data.)

tnx

Update: a kind user below has been trying to help me with this problem. I have a few developments to add to the question and a more specific problem.

I have columns in my data frame for Start Time and End Time, and I created a 'Duration' column for the time elapsed.

The Start Time/End Time columns have fields that look like:

2014/03/30 15:45

and when I apply pd.to_datetime() to these columns, the resulting fields look like:

2014-03-30 15:45:00

I changed the format to datetime and created the new 'Duration' (time elapsed) column in one command:

df['Duration'] = pd.to_datetime(df['End Time'])-pd.to_datetime(df['Start Time'])

The format of the fields in the Duration column is:

01:14:00

or hh:mm:ss

to indicate the time elapsed, 74 minutes in the above example.

The dtype of the Duration column fields (hh:mm:ss) is:

dtype('<m8[ns]')  

The question is, how can I convert these fields to plain integers?

Recommended answer

I think you need to convert the duration (timedelta64) to a plain number (assuming you have a duration). Then the .hist method will work.

from pandas import Series
from numpy.random import rand
from numpy import timedelta64

In [21]:

a = (rand(3) *10).astype(int)
a
Out[21]:
array([3, 3, 8])
In [22]:

b = [timedelta64(x, 'D') for x in a] # This is a duration
b
Out[22]:
[numpy.timedelta64(3,'D'), numpy.timedelta64(3,'D'), numpy.timedelta64(8,'D')]
In [23]:

c = Series(b) # This is a duration
c
Out[23]:
0   3 days
1   3 days
2   8 days
dtype: timedelta64[ns]
In [27]:

d = c.apply(lambda x: x / timedelta64(1,'D')) # convert duration to a number of days (float64)
d
Out[27]:
0    3
1    3
2    8
dtype: float64
In [28]:

d.hist()

I converted the duration to days ('D'), but you can convert it to any other valid timedelta64 unit, such as hours ('h') or minutes ('m').
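
Applied to the Start Time/End Time columns from the question, the same idea might look like the sketch below. The sample rows are made up for illustration, and `Duration_min` is a hypothetical column name. It assumes the timestamps follow the `2014/03/30 15:45` layout shown in the question, and it uses `errors="coerce"` so that entries that are not valid datetimes become NaT instead of raising an error.

```python
import pandas as pd

# Hypothetical sample data; the last row has an unparseable End Time.
df = pd.DataFrame({
    "Start Time": ["2014/03/30 15:45", "2014/03/30 09:00", "2014/03/31 10:00"],
    "End Time": ["2014/03/30 16:59", "2014/03/30 09:30", "n/a"],
})

# Assumed timestamp layout, taken from the example in the question.
fmt = "%Y/%m/%d %H:%M"

# errors="coerce" turns entries that do not match the format into NaT.
start = pd.to_datetime(df["Start Time"], format=fmt, errors="coerce")
end = pd.to_datetime(df["End Time"], format=fmt, errors="coerce")
df["Duration"] = end - start

# Convert the whole timedelta column to minutes at once; NaT becomes NaN.
df["Duration_min"] = df["Duration"].dt.total_seconds() / 60

# Drop the rows that could not be parsed, then cast to plain integers.
minutes = df["Duration_min"].dropna().astype(int)

# minutes.hist(bins=10)  # uncomment to plot; requires matplotlib
```

Dividing by 60 gives minutes; dividing `total_seconds()` by 3600 instead would give hours. Working on the whole column avoids the row-by-row `apply` and keeps the bad entries visible as NaN until they are dropped explicitly.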

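The filtering part of the question (ids starting with '2', or a particular month) can be handled with boolean masks before calling `.hist`. The following is only a sketch under assumptions: the column names `Account ID`, `Start Time` and `Duration_min` and the sample values are hypothetical, and the ids are cast to strings so that integer and string ids are treated the same way.

```python
import pandas as pd

# Hypothetical data; column names and values are assumptions for illustration.
df = pd.DataFrame({
    "Account ID": [2001, 3105, "2417", 1999],
    "Start Time": pd.to_datetime([
        "2014-03-30 15:45", "2014-04-02 09:00",
        "2014-04-12 10:00", "2014-03-05 08:00",
    ]),
    "Duration_min": [74, 30, 12, 45],
})

# Cast ids to str so integer and string ids are compared the same way.
starts_with_2 = df["Account ID"].astype(str).str.startswith("2")

# Keep only rows whose Start Time falls in March.
in_march = df["Start Time"].dt.month == 3

# A single mask, or several combined with & (and) or | (or).
by_id = df.loc[starts_with_2, "Duration_min"]
by_id_and_month = df.loc[starts_with_2 & in_march, "Duration_min"]

# by_id.hist(bins=10)  # uncomment to plot; requires matplotlib
```

The same masks can be reused for line plots or box plots, since they only select rows and do not depend on the plotting method.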