How to find the identical rows based on one column and label them into groups

Question

I have a large data set in the form of a table:

    Filename    A     B
    xxxxx       1     2
    xxxxx       3     4   
    xxxxx       5     5 
    xxxxx       6     .
    xxxxx       .     .
    yyyyy       .     .
    yyyyy
    yyyyy
    yyyyy
    zzzzz
    zzzzz

I need to scan the first column across all rows (approximately 10,000) to see where the filename changes, and create a label for each unique filename. This would give me an additional column containing the generated labels:

    file   filename    A     B
    1      xxxxx
           xxxxx
           xxxxx
    2      yyyyy
           yyyyy
           yyyyy
    3      zzzzz
           zzzzz

I also need to find the maximum value in column A for each unique file (file 1, file 2, ...). Any suggestions would be appreciated. Thanks.

Recommended answer

I'll start with a sample table like in your example:

T = 

    Filename    A     B
    ________    __    _

    'xxxxx'      4    4
    'xxxxx'      6    2
    'xxxxx'      1    8
    'xxxxx'      1    4
    'xxxxx'      6    6
    'yyyyy'      8    2
    'yyyyy'     10    7
    'yyyyy'      2    3
    'yyyyy'      6    7
    'zzzzz'      5    7
    'zzzzz'      1    8
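
To reproduce the steps, a table like the one above can be built as follows. This is only a sketch: the numeric values are copied from the display above, and the file names are assumed to be stored as a cell array of character vectors (which is what the quoted display suggests).

```matlab
% Build the sample table (values copied from the display above).
% Assumption: file names are stored as a cell array of character vectors.
Filename = [repmat({'xxxxx'}, 5, 1); ...
            repmat({'yyyyy'}, 4, 1); ...
            repmat({'zzzzz'}, 2, 1)];
A = [4; 6; 1; 1; 6; 8; 10; 2; 6; 5; 1];
B = [4; 2; 8; 4; 6; 2; 7; 3; 7; 7; 8];

% table() takes the column names from the workspace variable names.
T = table(Filename, A, B);
```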

We can extract the first column of file names and use the function unique to create a set of indices (i.e. labels) for each unique file. We can then create a table from this vector of labels and concatenate it with our existing table:

[~, ~, index] = unique(T.Filename, 'stable');
T = [table(index, 'VariableNames', {'Label'}) T];

T = 

    Label    Filename    A     B
    _____    ________    __    _

    1        'xxxxx'      4    4
    1        'xxxxx'      6    2
    1        'xxxxx'      1    8
    1        'xxxxx'      1    4
    1        'xxxxx'      6    6
    2        'yyyyy'      8    2
    2        'yyyyy'     10    7
    2        'yyyyy'      2    3
    2        'yyyyy'      6    7
    3        'zzzzz'      5    7
    3        'zzzzz'      1    8
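
Note that unique labels rows by value: if a file name reappears further down the table, it receives the same label as before. If the goal is instead to start a new label every time the name changes from one row to the next, one possible variation is to compare each name with the previous one and take a cumulative sum. The sketch below uses a small hypothetical list of names (not the table above) to show the difference:

```matlab
% Hypothetical names in which 'xxxxx' reappears after 'yyyyy'.
names = {'xxxxx'; 'xxxxx'; 'yyyyy'; 'xxxxx'; 'xxxxx'};

% Label by value: both blocks of 'xxxxx' share label 1.
[~, ~, byValue] = unique(names, 'stable');   % [1; 1; 2; 1; 1]

% Label by run: a new label starts wherever the name differs
% from the name on the previous row.
isNewRun = [true; ~strcmp(names(1:end-1), names(2:end))];
byRun = cumsum(isNewRun);                    % [1; 1; 2; 3; 3]
```

When every file name occupies a single contiguous block, as in the example table, both approaches give the same labels.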

We can then use this label vector with accumarray to collect the maximum value of column A for each unique file:

maxVals = accumarray(T.Label, T.A, [], @max)

maxVals =

     6    % For file 1
    10    % For file 2
     5    % For file 3
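
As an alternative to accumarray, the same grouped maximum can probably be obtained with findgroups and splitapply, assuming a MATLAB release that provides them (R2015b or later). This is a sketch on a shortened, hypothetical version of the data, not part of the original answer. Unlike unique with 'stable', findgroups numbers the groups in sorted order of the names rather than in order of first appearance.

```matlab
% Shortened, hypothetical data for illustration.
Filename = {'xxxxx'; 'xxxxx'; 'yyyyy'; 'yyyyy'; 'zzzzz'};
A = [4; 6; 8; 10; 5];

% G holds one group number per row; name lists the group names
% in sorted order.
[G, name] = findgroups(Filename);

% Apply max to the values of A within each group.
maxA = splitapply(@max, A, G);               % [6; 10; 5]

% Pair each file name with its maximum.
summary = table(name, maxA, 'VariableNames', {'Filename', 'MaxA'});
```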
