Merging time series data by timestamp using numpy/pandas

Problem description

I have time series data from three completely different sensor sources as CSV files and want to combine them into one big CSV file. I've managed to read them into numpy using numpy's genfromtxt, but I'm not sure what to do from here.

Basically, what I have is something like this:

Table 1:

timestamp    val_a   val_b   val_c

Table 2:

timestamp    val_d   val_e   val_f   val_g

Table 3:

timestamp    val_h   val_i

All timestamps are UNIX millisecond timestamps stored as numpy.uint64.

What I want is:

timestamp    val_a   val_b   val_c   val_d   val_e   val_f   val_g   val_h   val_i

...where all data is combined and ordered by timestamp. Each of the three tables is already ordered by timestamp. Since the data comes from different sources, there is no guarantee that a timestamp from table 1 will also appear in table 2 or 3, and vice versa. In that case, the empty values should be marked as N/A.

So far, I have tried using pandas to convert the data like so:

df_sensor1 = pd.DataFrame(numpy_arr_sens1)
df_sensor2 = pd.DataFrame(numpy_arr_sens2)
df_sensor3 = pd.DataFrame(numpy_arr_sens3)

and then tried using pandas.DataFrame.merge, but I'm pretty sure that won't work for what I'm trying to do. Can anyone point me in the right direction?

Recommended answer

I think you can simply:

  • Define the timestamp as the index of each DataFrame (using set_index)
  • Use join with the 'outer' method to merge them
  • Optionally convert the timestamp to a datetime

Here is what it looks like.

import pandas as pd

# Generate some test data (these sample timestamps are in seconds)
timestamp = [1440540000, 1450540000]
df1 = pd.DataFrame(
    {'timestamp': timestamp, 'a': ['val_a', 'val2_a'], 'b': ['val_b', 'val2_b'], 'c': ['val_c', 'val2_c']})
# Build a different index that shares no timestamp with df1
timestamp = [1440550000, 1450550000]
df2 = pd.DataFrame(
    {'timestamp': timestamp, 'd': ['val_d', 'val2_d'], 'e': ['val_e', 'val2_e'], 'f': ['val_f', 'val2_f'],
     'g': ['val_g', 'val2_g']})
# Keep one value in common with the first index
timestamp = [1440540000, 1450560000]
df3 = pd.DataFrame({'timestamp': timestamp, 'h': ['val_h', 'val2_h'], 'i': ['val_i', 'val2_i']})

# Set the timestamp as the index
df1.set_index('timestamp', inplace=True)
df2.set_index('timestamp', inplace=True)
df3.set_index('timestamp', inplace=True)

# Optional: convert the timestamps to datetimes.
# Use unit='ms' instead for millisecond timestamps such as those in the question.
df1.index = pd.to_datetime(df1.index, unit='s')
df2.index = pd.to_datetime(df2.index, unit='s')
df3.index = pd.to_datetime(df3.index, unit='s')

# An outer join keeps every timestamp; cells with no data become NaN
result = df1.join(df2, how='outer').join(df3, how='outer')
print(result)
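
The question also asks for missing values to be written as N/A in one combined CSV file. As a minimal sketch of that last step, assuming the three tables are already DataFrames with a `timestamp` column in UNIX milliseconds: the frames `sens1`, `sens2`, `sens3` and their values below are hypothetical stand-ins, not the asker's data, and the CSV is written to an in-memory buffer only so the result can be inspected.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-ins for the three sensor tables (timestamps in UNIX milliseconds)
sens1 = pd.DataFrame({'timestamp': np.array([1440540000000, 1440540002000], dtype=np.uint64),
                      'val_a': [1.0, 2.0]})
sens2 = pd.DataFrame({'timestamp': np.array([1440540001000, 1440540002000], dtype=np.uint64),
                      'val_d': [3.0, 4.0]})
sens3 = pd.DataFrame({'timestamp': np.array([1440540000000, 1440540003000], dtype=np.uint64),
                      'val_h': [5.0, 6.0]})

# Index every table by its timestamp so the join aligns rows on it
frames = [df.set_index('timestamp') for df in (sens1, sens2, sens3)]

# Outer joins keep every timestamp from every table; gaps become NaN
combined = frames[0].join(frames[1], how='outer').join(frames[2], how='outer')
combined = combined.sort_index()

# Write NaN cells as "N/A"; pass a file path instead of the buffer to save to disk
buffer = io.StringIO()
combined.to_csv(buffer, na_rep='N/A', index_label='timestamp')
csv_text = buffer.getvalue()
print(csv_text)
```

The `na_rep` argument of `to_csv` controls how missing values are rendered, so the NaN cells produced by the outer join need no separate replacement step.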
