Merging time series data by timestamp using numpy/pandas
Question
I have time series data from three completely different sensor sources as CSV files and want to combine them into one big CSV file. I've managed to read them into numpy using numpy's genfromtxt, but I'm not sure what to do from here.
Basically, what I have is something like this:
Table 1:
timestamp val_a val_b val_c
Table 2:
timestamp val_d val_e val_f val_g
Table 3:
timestamp val_h val_i
All timestamps are UNIX millisecond timestamps stored as numpy.uint64.
What I want is:
timestamp val_a val_b val_c val_d val_e val_f val_g val_h val_i
...where all data is combined and ordered by timestamp. Each of the three tables is already ordered by timestamp. Since the data comes from different sources, there is no guarantee that a timestamp from table 1 will also be in table 2 or 3, and vice versa. In that case, the empty values should be marked as N/A.
So far, I have tried using pandas to convert the data like so:
df_sensor1 = pd.DataFrame(numpy_arr_sens1)
df_sensor2 = pd.DataFrame(numpy_arr_sens2)
df_sensor3 = pd.DataFrame(numpy_arr_sens3)
and then tried using pandas.DataFrame.merge, but I'm pretty sure that won't work for what I'm trying to do. Can anyone point me in the right direction?
Recommended answer
I think you could simply:
- Define the timestamp as the index of each DataFrame (using set_index)
- Use a join to merge them with the 'outer' method
- Optionally convert the timestamp to datetime
Here is what it looks like.
import pandas as pd

# Generate some test data (these test timestamps are in seconds)
timestamp = [1440540000, 1450540000]
df1 = pd.DataFrame(
    {'timestamp': timestamp, 'a': ['val_a', 'val2_a'], 'b': ['val_b', 'val2_b'], 'c': ['val_c', 'val2_c']})
# Build a different index that shares no timestamp with df1
timestamp = [1440545000, 1450545000]
df2 = pd.DataFrame(
    {'timestamp': timestamp, 'd': ['val_d', 'val2_d'], 'e': ['val_e', 'val2_e'], 'f': ['val_f', 'val2_f'],
     'g': ['val_g', 'val2_g']})
# Keep one value in common with the first index
timestamp = [1440540000, 1450560000]
df3 = pd.DataFrame({'timestamp': timestamp, 'h': ['val_h', 'val2_h'], 'i': ['val_i', 'val2_i']})
# Set the timestamp as the index
df1.set_index('timestamp', inplace=True)
df2.set_index('timestamp', inplace=True)
df3.set_index('timestamp', inplace=True)
# Converting the timestamps to datetimes is optional.
# The test data is in seconds; for millisecond timestamps use unit='ms'.
df1.index = pd.to_datetime(df1.index, unit='s')
df2.index = pd.to_datetime(df2.index, unit='s')
df3.index = pd.to_datetime(df3.index, unit='s')
# Perform the outer joins; rows missing from a table are filled with NaN
result = df1.join(df2, how='outer').join(df3, how='outer')
result
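To connect this back to the question (CSV input, millisecond timestamps, N/A markers in a combined CSV), here is a minimal end-to-end sketch. The CSV contents, the numeric values, and the load helper are all made up for illustration, and io.StringIO stands in for the real sensor files, so treat it as one possible way to wire the steps together rather than a definitive implementation.

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for the three sensor files;
# timestamps are UNIX milliseconds, as in the question.
csv1 = "timestamp,val_a,val_b,val_c\n1440540000000,1,2,3\n1440540001000,4,5,6\n"
csv2 = ("timestamp,val_d,val_e,val_f,val_g\n"
        "1440540000500,7,8,9,10\n1440540001000,11,12,13,14\n")
csv3 = "timestamp,val_h,val_i\n1440540000000,15,16\n1440540002000,17,18\n"


def load(text):
    # Hypothetical helper: parse CSV text and use the timestamp column as the index.
    # With real files, pass the file path to pd.read_csv instead of a StringIO.
    return pd.read_csv(io.StringIO(text), index_col="timestamp")


df1, df2, df3 = load(csv1), load(csv2), load(csv3)

# Outer joins keep every timestamp that appears in any table;
# sort_index makes the ordering by timestamp explicit.
merged = df1.join(df2, how="outer").join(df3, how="outer").sort_index()

# Write the combined table, marking missing values as N/A.
out = io.StringIO()  # with real data, pass an output file path instead
merged.to_csv(out, na_rep="N/A")
csv_text = out.getvalue()
print(csv_text)
```

Note that integer columns which end up with missing values are upcast to float by pandas, so the numbers may be written with a decimal point. If readable dates are preferred over raw integers, pd.to_datetime(merged.index, unit='ms') can be applied to the index before writing.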