Creating data histograms/visualizations using IPython and filtering out some values
Question
I posted a question earlier (Pandas-ipython, how to create new data frames with drill down capabilities) and it was pointed out that it was possibly too broad, so I have some more specific questions that may be easier to answer and will help me get started with graphing data.
I have decided to try creating some visualizations of my data using Pandas (or any package accessible through IPython). The first, obvious problem I run into is how to filter on certain conditions. For example, I type the command:
df.Duration.hist(bins=10)
but I get an error due to unrecognized dtypes (some entries aren't in datetime format). How can I exclude these in the original command?
Also, what if I want to create the same histogram but keep only records whose IDs (in an account ID field) start with the integer (or string?) '2'?
Ultimately, I want to be able to create histograms, line plots, box plots and so on, while filtering for certain months, user IDs, or just bad 'dtypes'.
Can anyone help me modify the above command to add filters to it? (I'm decent with Python but new to data.)
tnx
Update: a kind user below has been trying to help me with this problem. I have a few developments to add to the question and a more specific problem.
I have columns in my data frame for Start Time and End Time, and I created a 'Duration' column for the time elapsed.
The Start Time/End Time columns have fields that look like:
2014/03/30 15:45
and when I apply pd.to_datetime() to these columns, the resulting fields look like:
2014-03-30 15:45:00
I changed the format to datetime and created a new column, 'Duration' (the time elapsed), in one command:
df['Duration'] = pd.to_datetime(df['End Time'])-pd.to_datetime(df['Start Time'])
The format of the fields in the Duration column is:
01:14:00
or hh:mm:ss
to indicate the time elapsed, which is 74 minutes in the above example.
The dtype of the Duration column fields (hh:mm:ss) is:
dtype('<m8[ns]')
The question is, how can I convert these fields to plain integers?
Recommended answer
I think you need to convert the duration (timedelta64) to a plain number (assuming you have a duration). Then the .hist method will work.
from pandas import Series
from numpy.random import rand
from numpy import timedelta64
In [21]:
a = (rand(3) *10).astype(int)
a
Out[21]:
array([3, 3, 8])
In [22]:
b = [timedelta64(x, 'D') for x in a] # This is a duration
b
Out[22]:
[numpy.timedelta64(3,'D'), numpy.timedelta64(3,'D'), numpy.timedelta64(8,'D')]
In [23]:
c = Series(b) # This is a duration
c
Out[23]:
0 3 days
1 3 days
2 8 days
dtype: timedelta64[ns]
In [27]:
d = c.apply(lambda x: x / timedelta64(1,'D')) # convert duration to a number of days (float)
d
Out[27]:
0 3
1 3
2 8
dtype: float64
In [28]:
d.hist()
I converted the duration to days ('D'), but you can convert it to any legal unit.
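For the data described in the question, the same idea might look roughly like the sketch below. It is only an illustration under assumptions: the sample rows are made up, and the column name "Account ID" is a guess at the asker's account id field. It uses `errors="coerce"` so that entries that are not valid dates become NaT instead of raising, converts the duration to minutes with `.dt.total_seconds()`, and then filters with boolean masks before plotting.

```python
import pandas as pd

# Hypothetical sample data: the "Start Time"/"End Time" names come from the
# question; the values and the "Account ID" column name are assumptions.
df = pd.DataFrame({
    "Account ID": ["2001", "2002", "3001", "2003"],
    "Start Time": ["2014/03/30 15:45", "2014/03/30 10:00",
                   "2014/03/31 09:00", "not a date"],
    "End Time": ["2014/03/30 16:59", "2014/03/30 10:30",
                 "2014/03/31 09:15", "2014/03/31 12:00"],
})

# errors="coerce" turns unparseable entries into NaT instead of raising.
fmt = "%Y/%m/%d %H:%M"
start = pd.to_datetime(df["Start Time"], format=fmt, errors="coerce")
end = pd.to_datetime(df["End Time"], format=fmt, errors="coerce")

# timedelta64 -> float minutes; rows involving NaT end up as NaN.
df["Duration (min)"] = (end - start).dt.total_seconds() / 60

# Boolean masks: a valid duration, and an account id starting with "2".
valid = df["Duration (min)"].notna()
starts_with_2 = df["Account ID"].astype(str).str.startswith("2")

# Integer minutes for the rows that pass both filters.
minutes = df.loc[valid & starts_with_2, "Duration (min)"].astype(int)
print(minutes.tolist())
```

With matplotlib installed, `minutes.hist(bins=10)` should then draw the histogram for the filtered rows. Other filters (for example, a particular month of the start time) can be added as further boolean masks combined with `&`.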