pandas df iteration, binning of data based on time in milliseconds

Question

I have refocused my questions and tried to be as specific as possible. Below, I also include the code I have used so far.

(1) When I pull data from SQL, my time column is in a mixed format that contains a letter, which makes it hard to work with. To get around that I tried df.time = pd.to_timedelta(df.time, unit='ms'), which runs, but I don't know how to extract the hours and minutes from the result. Example: 2019.11.22D01:18:00.01000. I just need the 'time' column in the format '01:18:00.01000'. Could I use np.datetime64 to convert all my SQL time entries to the desired format and truncate to the number of characters I need? Please advise. I also tried data = np.datetime64('time'), but I get 'Error parsing datetime string "time" at position 0'.

(2) I am trying to group the data below by two factors: first by 'data2' and then by 'time'. This is because my real data will not arrive in the order shown below but in random order. I get: 'DataFrameGroupBy' is not callable. Is that because I have repeating data2 values? Could you please help me find what is causing this?

(3) After grouping my data by 'data2' and 'time', I need to bin the data within predefined time intervals (i.e. [0-10ms), [10-20ms), etc.), so that rows 0, 1, 2 fall under the [0-10ms) bin, for example. I therefore need to be able to define these bins first (I will have a fixed set of bins). Then, at the next 'data2' change (say from 55 to 56), the start time is reset to 0 and the rows are binned by the time elapsed from 0 until data2 changes again, and so on. How can I code this? Where I struggle most is setting the timer to 0 and referencing 'time' for every row as long as the 'data2' value has not changed, then starting over and binning accordingly when 'data2' changes.

Below is the code I have used so far:

import pyodbc 
import pandas as pd
import numpy as np

conn = pyodbc.connect('Driver={SQL Server};'
                      'Server=XXXXXXXXX;'
                      'Database=Dynamics;'
                      'Trusted_Connection=yes;')

cursor = conn.cursor()

SQL_Query = pd.read_sql_query('''select ID,time,data1,data2,data3,data4,data5 from Dynamics''', conn)
df = pd.DataFrame(SQL_Query, columns=['ID','time','data2','data3','data4','data5'])
df.time=pd.to_timedelta(df.time, unit='ms')
df[['data4']] = df[['data4']].apply(pd.to_numeric)
df['diff']=df['data4']-df['data5']
df['diff']=df['diff'].abs()
df=df.groupby(['data3','time'])
print(df)



                     time data_1  data_2 data_3  data_4  data_5
0 2019-11-22 01:18:00.010      a      55      A    1.20    1.24
1 2019-11-22 01:18:00.090      a      55      B    1.25    1.24
2 2019-11-22 01:18:00.100      a      55      C    1.26    1.24
3 2019-11-22 01:18:00.140      a      55      A    1.22    1.22
4 2019-11-22 01:18:00.160      a      55      B    1.23    1.22

Recommended answer

pandas has good support for date ranges. Here is an example that creates a one-minute range with one row per millisecond, where the timestamps also serve as the index.

import pandas as pd
from datetime import timedelta
import numpy as np

# one minute of timestamps at millisecond resolution
date_rng = pd.date_range(start='2019-11-22T01:18:00.00100', end='2019-11-22T01:19:00.00000', freq='ms')
n = len(date_rng)  # n = 60000
values = np.random.random(n)  # make n random numbers

df = pd.DataFrame({'values': values}, index=date_rng)
print('dataframe: ')
print(df.head())

Here is the head of df:

dataframe: 
                           values
2019-11-22 01:18:00.001  0.914796
2019-11-22 01:18:00.002  0.760555
2019-11-22 01:18:00.003  0.132992
2019-11-22 01:18:00.004  0.572391
2019-11-22 01:18:00.005  0.090188

Next, pandas has a convenient resample feature which, in this example, sums the values in 10 ms bins.

df2 = df.resample(rule=timedelta(milliseconds=10)).sum()  # df2 sums the values in 10 ms bins
print('beginning of df2')
print(df2.head())
print('...')
print(df2.tail())

Here is the output:

beginning of df2
                           values
2019-11-22 01:18:00.000  5.236037
2019-11-22 01:18:00.010  4.446964
2019-11-22 01:18:00.020  6.549635
2019-11-22 01:18:00.030  5.141522
2019-11-22 01:18:00.040  5.375919
...
                           values
2019-11-22 01:18:59.960  3.876523
2019-11-22 01:18:59.970  4.864252
2019-11-22 01:18:59.980  5.690987
2019-11-22 01:18:59.990  2.787247
2019-11-22 01:19:00.000  0.613545
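
To see how many samples land in each 10 ms bin, the same resample can be run with count() instead of sum(). This is only a small check built on the example above; it uses a column of ones instead of random numbers, and the string rule '10ms' in place of the timedelta, both of which are choices made here for illustration.

```python
import numpy as np
import pandas as pd

# Same one-minute millisecond index as in the example above.
date_rng = pd.date_range(start='2019-11-22T01:18:00.001',
                         end='2019-11-22T01:19:00.000', freq='ms')
df = pd.DataFrame({'values': np.ones(len(date_rng))}, index=date_rng)

# Count the rows per 10 ms bin; bins are closed on the left by default,
# so a bin labelled 01:18:00.010 covers [01:18:00.010, 01:18:00.020).
counts = df.resample('10ms').count()

print(counts.head(3))
print(counts.tail(3))
```

Most bins hold 10 rows; the first and last bins are only partly covered by the range.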

Note that the last value is much smaller, because that bin contains only a single 1 ms sample.
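
For the per-'data2' binning asked about in (3), resample is not quite the right tool, because its bins are aligned to the clock and not to the first row of each group. One possible approach, offered as a sketch and not a definitive solution, is to parse the timestamps, compute the time elapsed since the first row of each 'data2' group, and pass that to pd.cut with fixed bin edges. The sample rows below are made up to resemble the question's data (the 'data2' = 56 rows are hypothetical), the column names time_of_day, elapsed_ms and bin are illustrative, and the code assumes every timestamp matches the pattern '%Y.%m.%dD%H:%M:%S.%f' and that each 'data2' value forms a single run. Note also that groupby returns a grouped object, not a DataFrame, so it needs an aggregation or transform before it yields data.

```python
import numpy as np
import pandas as pd

# Hypothetical rows in the question's string format, deliberately out of order.
raw = pd.DataFrame({
    'time': ['2019.11.22D01:18:00.10000', '2019.11.22D01:18:00.20500',
             '2019.11.22D01:18:00.01000', '2019.11.22D01:18:00.16000',
             '2019.11.22D01:18:00.20000', '2019.11.22D01:18:00.09000',
             '2019.11.22D01:18:00.23000', '2019.11.22D01:18:00.14000'],
    'data2': [55, 56, 55, 55, 56, 55, 56, 55],
})

df = raw.copy()
# Parse the mixed format; the literal 'D' is part of the format string.
df['time'] = pd.to_datetime(df['time'], format='%Y.%m.%dD%H:%M:%S.%f')
# Put the rows in order: first by data2, then by time.
df = df.sort_values(['data2', 'time']).reset_index(drop=True)
# Time of day as text (strftime always emits six fractional digits).
df['time_of_day'] = df['time'].dt.strftime('%H:%M:%S.%f')

# Reset the "timer" per group: elapsed ms since the group's first timestamp.
start = df.groupby('data2')['time'].transform('min')
df['elapsed_ms'] = (df['time'] - start) / pd.Timedelta(milliseconds=1)

# Fixed, predefined bins: [0, 10), [10, 20), ..., [190, 200) ms.
edges = np.arange(0, 201, 10)
df['bin'] = pd.cut(df['elapsed_ms'], bins=edges, right=False)

# Rows per (data2, bin); observed=True skips bins that are empty.
counts = df.groupby(['data2', 'bin'], observed=True).size()

print(df[['data2', 'time_of_day', 'elapsed_ms', 'bin']])
print(counts)
```

If the same 'data2' value can reappear later as a separate run, grouping by value would merge the runs; in that case a run identifier such as (df['data2'] != df['data2'].shift()).cumsum() could be used as the grouping key on time-sorted data.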
