Merge two rows in the same DataFrame if their index is the same?
Question
I have created a large DataFrame by pulling data from an Azure database. Constructing it wasn't simple: I had to do it in parts, using the concat function to add new columns to the data set as they were pulled from the database.
This worked fine. However, I am indexing by entry date, and when concatenating I sometimes get two data rows with the same index. Is it possible to merge rows that share the same index? I have searched online for solutions, but I only come across examples that merge two separate DataFrames rather than rows within the same DataFrame.
For example, I would like to go from this:
                     Col1  Col2
2015-10-27 22:22:31  1400
2015-10-27 22:22:31        50.5
to this:
                     Col1  Col2
2015-10-27 22:22:31  1400  50.5
I have tried using the groupby function on the index, but that just made a mess: most of the data columns disappeared and a few very large numbers were spat out.
The data is in this sort of format, except that it has many more columns and is generally quite sparse:
Col1 Col2 ... Col_n-1 Col_n
2015-10-27 21:15:60+0 1220
2015-10-27 21:25:4+0 1420
2015-10-27 21:28:8+0 1410
2015-10-27 21:37:10+0 51.5
2015-10-27 21:37:11+0 1500
2015-10-27 21:46:14+0 51
2015-10-27 21:46:15+0 1390
2015-10-27 21:55:19+0 1370
2015-10-27 22:04:24+0 1450
2015-10-27 22:13:28+0 1350
2015-10-27 22:22:31+0 1400
2015-10-27 22:22:31+0 50.5
2015-10-27 22:25:33+0 1300
2015-10-27 22:29:42+0 ... 1900
2015-10-27 22:29:42+0 63
2015-10-27 22:34:36+0 1280
Recommended answer
For anyone interested: I ended up writing my own function to:
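- iterate through the DataFrame
- note which rows need merging by recording their index
- sum or average the values across those rows
- delete all but one row in each set to be merged, replacing its values with the aggregate or average (depending on my needs)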
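The steps listed above can also be sketched with pandas built-ins instead of an explicit loop. This is only a minimal sketch under assumptions, not the function from this answer: it assumes the data has a `Timestamp` column, and the helper name `average_by_time_block`, its `freq` parameter and the sample values are made up for illustration.

```python
import pandas as pd


def average_by_time_block(df, freq="10s", timestamp_col="Timestamp"):
    """Average rows whose timestamps fall in the same time block.

    Hypothetical helper for illustration; not part of the original answer.
    """
    # Round each timestamp down to the start of its block (e.g. every 10 seconds).
    block_start = pd.to_datetime(df[timestamp_col]).dt.floor(freq)
    # mean() skips NaN, so sparse columns are averaged over the values present.
    return df.drop(columns=timestamp_col).groupby(block_start).mean()


# Illustrative data only: two rows in the 22:22:30 block, one in the 22:22:40 block.
df = pd.DataFrame({
    "Timestamp": ["2015-10-27 22:22:31", "2015-10-27 22:22:34", "2015-10-27 22:22:45"],
    "Col1": [1400.0, 1300.0, None],
    "Col2": [None, 50.5, 63.0],
})
result = average_by_time_block(df, freq="10s")
print(result)
```

Note one difference in behaviour: `mean()` divides by the number of non-null values in each column, whereas a running total divided by the row count treats missing values as zero. Swapping `mean()` for `sum(min_count=1)` would give an aggregate instead of an average.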
My code:
import numpy as np
import pandas as pd

def groupDataOnTimeBlock(data, timeBlock_type, timeBlock_factor):
    '''
    Filter Dataframe to merge lines which are within the same time block,
    i.e. being part of the same x number of seconds, minutes, weeks...
    Assumes a default 0..n-1 integer index and a 'Timestamp' column.
    timeStampReformat is a helper that is not shown in this post; it is
    assumed to round a timestamp down to the start of its time block.

    data:
        Dataframe to filter (modified in place).
    timeBlock_type:
        Time period with which to group data rows. This can be data per:
        MILLISECONDS, SECONDS, MINUTES, WEEKS
    timeBlock_factor:
        Number of timeBlock types to group on.
    '''
    tBt = timeBlock_type.upper()
    tBf = timeBlock_factor
    if tBt in ('SEC', 'SECOND', 'SECONDS'):
        roundType = 'SECONDS'
    elif tBt in ('MIN', 'MINS', 'MINUTES'):
        roundType = 'MINUTES'
    elif tBt in ('MILLI', 'MILLISECONDS'):
        roundType = 'MILLISECONDS'
    elif tBt in ('WEEK', 'WEEKS'):
        roundType = 'WEEKS'
    else:
        raise ValueError('Invalid time block type entered')
    pd.options.mode.chained_assignment = None  # default='warn'
    numElements = len(data.columns)
    # Hard-coded position of a timestamp cell in my data; adjust for other layouts.
    anchorValue = timeStampReformat(data.iloc[1, len(data.columns) - 7], roundType, tBf)
    delIndex = []
    mergeCount = 0
    av_agg_arr = np.zeros([1, numElements], dtype=float)
    # Cycle through the dataframe to get averages and note which rows to delete
    for i, row in data.iterrows():  # i is the index value, from 0
        backDate = timeStampReformat(row['Timestamp'], roundType, tBf)
        data.loc[i, 'Timestamp'] = backDate  # can be done better. Not all rows need updating.
        if backDate > anchorValue:  # a new time block starts at this row
            delIndex.pop()  # keep the last row of the previous block
            if mergeCount != 0:
                av_agg_arr = av_agg_arr / mergeCount
            for idx in range(1, numElements - 1):
                if isinstance(row.values[idx], float):
                    # previous row (i - 1) is the last of the prior datetime group
                    data.iloc[i - 1, idx] = av_agg_arr[0, idx]
            anchorValue = backDate
            mergeCount = 0
            av_agg_arr = np.zeros([1, numElements], dtype=float)  # re-initialise aggregates
        # Add the current row's non-null float values to the running totals
        for idx in range(1, numElements - 1):
            value = row.values[idx]
            if isinstance(value, float) and not pd.isnull(value):
                av_agg_arr[0, idx] += value
        mergeCount += 1
        delIndex.append(i)  # flag the row for deletion unless it turns out to end its block
    # The loop never closes the final block, so keep its last row and fill it in here
    if delIndex:
        last = delIndex.pop()
        if mergeCount != 0:
            av_agg_arr = av_agg_arr / mergeCount
        for idx in range(1, numElements - 1):
            if isinstance(data.iloc[last, idx], float):
                data.iloc[last, idx] = av_agg_arr[0, idx]
    data.drop(data.index[delIndex], inplace=True)  # delete all flagged rows
    data.reset_index(drop=True, inplace=True)  # the bare reset_index() call discarded its result
    pd.options.mode.chained_assignment = 'warn'  # default='warn'
    return data
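For the narrower case in the question, where two rows share exactly the same index and each holds values for different columns, a shorter route than the loop above may be enough. The following is a minimal sketch, not the method used in this answer; the sample values are illustrative and modelled on the question's example. It relies on the groupby `first()` aggregation returning the first non-null value in each column.

```python
import pandas as pd

# Illustrative data: the last two rows share one timestamp but fill different columns.
idx = pd.to_datetime([
    "2015-10-27 22:13:28",
    "2015-10-27 22:22:31",
    "2015-10-27 22:22:31",
])
df = pd.DataFrame(
    {"Col1": [1350.0, 1400.0, None], "Col2": [None, None, 50.5]},
    index=idx,
)

# Group on the index itself (level 0) and take the first non-null value per column.
merged = df.groupby(level=0).first()
print(merged)
```

If duplicate rows can both hold a value in the same column, `first()` silently keeps only the earlier one; `mean()` or `sum(min_count=1)` in its place would combine them instead.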