How to find identical rows based on one column and label them into groups
Problem description
I have a large dataset in the form of a table:
Filename A B
xxxxx 1 2
xxxxx 3 4
xxxxx 5 5
xxxxx 6 .
xxxxx . .
yyyyy . .
yyyyy
yyyyy
yyyyy
zzzzz
zzzzz
I need to scan the first column of all the rows (approximately 10,000) to see where the filename changes, and create a label for every unique filename. That way I would have another column containing the generated labels.
file filename A B
1 xxxxx
xxxxx
xxxxx
2 yyyyy
yyyyy
yyyyy
3 zzzzz
zzzzz
I also need to find the maximum value in column A for each unique file (file 1, file 2, ...). Any suggestions would be appreciated. Thanks.
Recommended answer
I'll start with a sample table like the one in your example:
T =
Filename A B
________ __ _
'xxxxx' 4 4
'xxxxx' 6 2
'xxxxx' 1 8
'xxxxx' 1 4
'xxxxx' 6 6
'yyyyy' 8 2
'yyyyy' 10 7
'yyyyy' 2 3
'yyyyy' 6 7
'zzzzz' 5 7
'zzzzz' 1 8
We can extract the first column of file names and use the function unique to create a set of indices (i.e. labels) for each unique file. The 'stable' option numbers the labels in order of first appearance instead of in sorted order. We can then create a table from this vector of labels and concatenate it with our existing table:
[~, ~, index] = unique(T.Filename, 'stable');
T = [table(index, 'VariableNames', {'Label'}) T];
T =
Label Filename A B
_____ ________ __ _
1 'xxxxx' 4 4
1 'xxxxx' 6 2
1 'xxxxx' 1 8
1 'xxxxx' 1 4
1 'xxxxx' 6 6
2 'yyyyy' 8 2
2 'yyyyy' 10 7
2 'yyyyy' 2 3
2 'yyyyy' 6 7
3 'zzzzz' 5 7
3 'zzzzz' 1 8
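The question asks for a new label "where the filename changes", which matches the unique-based labels only if each filename occupies one contiguous block of rows. The sketch below rebuilds the sample table (values copied from the printed T above) and compares the two interpretations. The variable names byName, isNewRun and byRun are made up for illustration, and the run-based variant is an alternative under the assumption that a name reappearing later should get a fresh label; it is not part of the original answer.

```matlab
% Rebuild the sample table from the values printed above.
Filename = [repmat({'xxxxx'}, 5, 1); repmat({'yyyyy'}, 4, 1); repmat({'zzzzz'}, 2, 1)];
A = [4; 6; 1; 1; 6; 8; 10; 2; 6; 5; 1];
B = [4; 2; 8; 4; 6; 2; 7; 3; 7; 7; 8];
T = table(Filename, A, B);

% Labels per unique name, numbered in order of first appearance.
[~, ~, byName] = unique(T.Filename, 'stable');

% Alternative (assumption: a new label at every change of name, even if
% the same name shows up again further down). Flag each row whose name
% differs from the row above it, then take a running count of the flags.
isNewRun = [true; ~strcmp(T.Filename(1:end-1), T.Filename(2:end))];
byRun = cumsum(isNewRun);

% For this table the filenames are contiguous, so both give the same labels.
disp(isequal(byName, byRun))
```

If the filenames in the real data are always grouped together, either vector can be used as the Label column.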
We can then use this label vector with accumarray to collect the maximum value of column A for each unique file:
maxVals = accumarray(T.Label, T.A, [], @max)
maxVals =
6 % For file 1
10 % For file 2
5 % For file 3
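As an alternative sketch, the same per-file maximum can be computed with findgroups and splitapply, assuming a MATLAB release that ships them (R2015b or later, as far as I know). Note that findgroups numbers the groups in sorted order of the names, not in order of appearance; for this sample the two orders coincide. The names G, names, maxA and summary are chosen here for illustration.

```matlab
% Rebuild the relevant columns of the sample table from the values above.
Filename = [repmat({'xxxxx'}, 5, 1); repmat({'yyyyy'}, 4, 1); repmat({'zzzzz'}, 2, 1)];
A = [4; 6; 1; 1; 6; 8; 10; 2; 6; 5; 1];
T = table(Filename, A);

% G holds one group number per row; names holds one filename per group.
[G, names] = findgroups(T.Filename);

% Apply max to the values of A within each group.
maxA = splitapply(@max, T.A, G);

% Collect the result in a small summary table, one row per file.
summary = table(names, maxA, 'VariableNames', {'Filename', 'MaxA'});
disp(summary)
```

Keeping the filenames next to the maxima avoids having to remember which label number belongs to which file.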