pandas python根据一个或多个其他列的子集更新列A的子集 [英] pandas python Update subset of column A based on subset of one or more other columns

查看:89
本文介绍了pandas python根据一个或多个其他列的子集更新列A的子集的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

编辑:我已经修改了以下描述的部分内容,以阐明功能"和组"的含义,修正输入错误,并包括我尝试过的其他代码.

Edit I've revised portions of the description below to clarify what I mean by "feature" and "group", fix a typo, and include additional code I've tried.

我的熊猫df有450万行和23列.下表显示了df2的几行,而df2是从df生成的.它显示了两个组( eeskin hduquant )和三个功能(失败退出状态 job_number ):

My pandas df has 4.5 mil rows and 23 columns. The table below shows a few lines from df2 which is generated from df. It shows two groups (eeskin and hduquant) and three features (failed, exit_status, and job_number):

# report by group

        group      feature  #_cats #_jobs  rank        top_value  freq  \
10   eeskin    failed       1       6     -1     100              6      
21   eeskin    exit_status  1       6     -1     0                6     
0    eeskin    job_number   1       6     -1     4.08219e+06      6      
21   hduquant  exit_status  5       64    -1     37               58     
11   hduquant  failed       2       64    -1     0                63     
1    hduquant  job_number   2       64    -1     4.07192e+06      61     

等级"列值为-1是占位符.

The "rank" column value of -1 is a placeholder.

我想为每个功能更新每个组的排名.在这种情况下,功能"是指功能"列中的每个唯一值:失败退出状态工作编号.例如,更新 job_number 的排名意味着仅在功能"列等于 job_number 的行上修改"rank"列中的值.事实证明,这些行中的每一行还对应于组"列中的不同组值.

I want to update each group's rank for each feature. "Feature" in this case means each unique value in the "feature" column: failed, exit_status, and job_number. Updating the rankings for job_number, for example, means modifying the values in column "rank" only on rows where column "feature" equals job_number. As it turns out each of these rows also corresponds to a different group value in the "group" column.

因此,我不想一次更新每个列"rank"中的所有值,而是希望逐个功能地对它们进行处理,其中每次写入都会更新单个功能上所有组的值.

So rather than update all values in column "rank" at once, I want to do them feature by feature where each write updates the values for all groups on a single feature.

要素"job_number"的等级基于"#_jobs" col的值(最高职位数为等级1).对于功能失败",排名基于"top_value"的频率". exits_status现在可以保持-1.

The rank for feature "job_number" is based on the value of "#_jobs" col (highest number of jobs is rank 1). For feature "failed", rank is based on the "freq" of "top_value". exits_status can remain -1 for now.

结果应如下所示:

        group      feature  #_cats #_jobs  rank        top_value  freq  \
10   eeskin    failed       1       6      1     100              6      
21   eeskin    exit_status  1       6     -1     0                6     
0    eeskin    job_number   1       6      2     4.08219e+06      6      
21   hduquant  exit_status  5       64    -1     37               58     
11   hduquant  failed       2       64     2     0                63     
1    hduquant  job_number   2       64     1     4.07192e+06      61     

"eeskin"在failed中的排名为1,在job_number中的排名为2. "hdquant"在failed上排名2,在job_number上排名1.

"eeskin" ranks 1 for failed and 2 for job_number. "hdquant" ranks 2 for failed and 1 for job_number.

我可以使用以下代码更新job_number的排名值:

I am able to update the rank values for job_number with this code:

if feat == 'job_number':
     grouped = grouped.sort_values("#_jobs", ascending=False)
     grouped['rank'] = grouped.index + 1

        group      feature  #_cats #_jobs  rank        top_value  freq  \
10   eeskin    failed       1       6     -1     100              6      
21   eeskin    exit_status  1       6     -1     0                6     
0    eeskin    job_number   1       6      2     4.08219e+06      6      
21   hduquant  exit_status  5       64    -1     37               58     
11   hduquant  failed       2       64    -1     0                63     
1    hduquant  job_number   2       64     1     4.07192e+06      61     

但是当我尝试同时更新两者时,都不会更新:

But when I try to update both, neither is updated:

feat = ['job_number', 'failed']

for f in feat:
    if f == 'job_number':
        grouped = grouped.sort_values("#_jobs", ascending=False)
        grouped['rank'] = grouped.index + 1
    elif f == 'failed':  #  or f == 'exit_status'
        x = len(not grouped[f] == 0)
        grouped['x'] = x
        grouped = grouped.sort_values("x", ascending=False)
        grouped['rank'] = grouped.index + 1      
        del grouped['x']

        group      feature  #_cats #_jobs  rank        top_value  freq  \
10   eeskin    failed       1       6     -1     100              6      
21   eeskin    exit_status  1       6     -1     0                6     
0    eeskin    job_number   1       6     -1     4.08219e+06      6      
21   hduquant  exit_status  5       64    -1     37               58     
11   hduquant  failed       2       64    -1     0                63     
1    hduquant  job_number   2       64    -1     4.07192e+06      61     

我已经尝试实施Matt W.的建议,但到目前为止没有成功:

I've tried to implement the suggestion from Matt W. but so far without success:

df.loc[df.feature == 'job', 'rank'] = df.loc[df.feature == 'job', 'jobs'].rank(ascending=False)

我对他的代码做了如下修改,但也没有成功:

I modified his code a bit as follows but also without success:

df2.loc[df2['feature' == 'job_number'] & df2['rank']] = (df2.loc[df2['#_jobs']].rank(ascending=False))


附录 @Matt W.


Addendum @Matt W.

输入:

import pandas as pd

df = pd.DataFrame([['g1', 'u1', 3902779, '2018-09-27 21:38:06', '2018-10-01 07:24:38', '2018-10-01 08:00:42', 0, 0, 'single', 1, 55696, 609865728.0, 4.0, 6.0, 0, 0, 4.0, 0, 'single', 1, 0, pd.Timedelta('3 days 09:46:32'), pd.Timedelta('00:36:04')]], 
                          columns=['group', 'owner', 'job_number', 'submission_time', 'start_time', 'end_time', 'failed', 'exit_status', 'granted_pe', 'slots', 'task_number', 'maxvmem', 'h_data', 'h_rt', 'highp', 'exclusive', 'h_vmem', 'gpu', 'pe', 'slot', 'campus', 'wait_time', 'wtime'])
df = (df.astype(dtype={'group':'str', 'owner':'str', 'job_number':'int', 'submission_time':'datetime64[ns]', 'start_time':'datetime64[ns]', 'end_time':'datetime64[ns]', 'failed':'int', 'exit_status':'int', 'granted_pe':'str', 'slots':'int', 'task_number':'int', 'maxvmem':'float', 'h_data':'float', 'h_rt':'float', 'highp':'int', 'exclusive':'int', 'h_vmem':'float', 'gpu':'int', 'pe':'str', 'slot':'int', 'campus':'int', 'wait_time':'timedelta64[ns]', 'wtime':'timedelta64[ns]'}))
df

输出:

         group  owner  job_number      submission_time           start_time             end_time  failed  exit_status  granted_pe  slots  task_number       maxvmem  h_data  h_rt  highp  exclusive  h_vmem  gpu      pe  slot  campus       wait_time     wtime
      0  g1     u1     3902779     2018-09-27 21:38:06  2018-10-01 07:24:38  2018-10-01 08:00:42  0       0            single      1      55696         609865728.0  4.0     6.0   0      0          4.0     0    single  1     0      3 days 09:46:32  00:36:04  
4080243  g50    u92    4071923     2018-10-25 02:08:14  2018-10-27 01:41:58  2018-10-27 02:08:50  0       0            shared      1      119          7.654482e+08  2.5     1.5   0      1          16.0    0    shared  1     0      1 days 23:33:44  00:26:52
4080244  g50    u92    4071922     2018-10-25 02:08:11  2018-10-27 01:46:53  2018-10-27 02:08:53  0       0            shared      1      2208         1.074463e+09  2.5     1.5   0      10         24.0    0    shared  1     0      1 days 23:38:42  00:22:00

代码产生第一行.我只是为了多样化而增加了几行.

The code produces the first line. I tacked on a couple more lines just for variety.

有203个组,699个所有者.有数千个作业:作业"定义为job_number,task_number和submission_time的唯一组合.

There are 203 groups, 699 owners. There are thousands of jobs: a "job" is defined as a unique combination of job_number, task_number, and submission_time.

我想创建一个整体报告,每个组创建一个报告,都集中在资源使用上.

I want to create an overall report and one report per group, both focused on resource usage.

总体报告的组成部分:

一般统计:

  • 计数,平均值,标准,最小值,25%,50%,75%,最大值(数字)
  • 计数,唯一,顶部,频率(字符串)
  • 首先计数,最后#次时间增量cols(时间增量)

工作:

  • 任务编号最多的工作编号,任务编号最多的提交时间
  • 最早/最新的
  • 工作(如上定义)
    • 提交时间,开始时间和结束时间
    • job_number with the most task_numbers, job_number w most submission times
    • job (as defined above) with earliest/latest
      • submission_time, start_time, and end_time
      • 失败!= 0
      • exit_status!= 0
      • granted_pe,插槽,maxvmem,h_data,h_rt,exclusive,h_vmem和gpu
      • pe ==单身
      • pe ==共享
      • pe ==用于pe的每个addtl类别
      • wait_time和wtime

      所有者:

      • 工作最多的老板
      • 最早/最新的所有者
        • 提交时间,开始时间,结束时间
        • owner with the most jobs
        • owner with earliest/latest
          • submission_time, start_time, end_time
          • 失败!= 0
          • exit_status!= 0
          • granted_pe,插槽,maxvmem,h_data,h_rt,独占,h_vmem,gpu
          • pe ==单身
          • pe ==共享
          • pe ==用于pe的每个addtl类别
          • wait_time和wtime

          组:

          • 工作最多的组
          • 拥有最多所有者的组
          • 最早/最新的
            • 提交时间,开始时间和结束时间
            • group with the most jobs
            • group with the most owners
            • group with earliest/latest
              • submission_time, start_time, and end_time
              • 失败!= 0
              • exit_status!= 0
              • granted_pe,插槽,maxvmem,h_data,h_rt,独占,h_vmem和gpu
              • pe ==单身
              • pe ==共享
              • pe ==用于pe的每个addtl类别
              • wait_time和wtime

              单个按组"报告的组成部分:

              按功能(在df中为列):

              By feature (column in df):

              一般统计:

              • 计数,平均值,标准,最小值,25%,50%,75%,最大值(数字)
              • 计数,唯一,顶部,频率(字符串)
              • 首先计数,最后#次时间增量cols(时间增量)

              该组的统计信息:

              按工作:

              • 任务编号最多的工作编号,任务编号最多的提交时间
              • 最早/最新的
              • 工作(如上定义)
                • 提交时间,开始时间和结束时间
                • job_number with the most task_numbers, job_number w most submission_times
                • job (as defined above) with earliest/latest
                  • submission_time, start_time, and end_time
                  • 失败!= 0
                  • exit_status!= 0
                  • granted_pe,插槽,maxvmem,h_data,h_rt,exclusive,h_vmem和gpu
                  • pe ==单身(count/len)
                  • pe ==共享(数/len)
                  • pe ==用于pe的每个附加类别(计数/长度)
                  • wait_time和wtime

                  由所有者:

                  • 工作最多的老板
                  • 最早/最新的所有者
                    • 提交时间,开始时间,结束时间
                    • owner with the most jobs
                    • owner with earliest/latest
                      • submission_time, start_time, end_time
                      • 失败!= 0
                      • exit_status!= 0
                      • granted_pe,插槽,maxvmem,h_data,h_rt,exclusive,h_vmem和gpu
                      • pe ==单身(count/len)
                      • pe ==共享(数/len)
                      • pe ==用于pe的每个附加类别(计数/长度)
                      • wait_time和wtime

                      按排名:

                      在这里,我希望每个组在所有其他组中都排在前面,从使用率最高或失败最多的1个到使用率最低的203个.我将使用这些值绘制每个组的图形.

                      Here I want each group ranked against all other groups, from 1 with highest usage or most "fails" to 203 for lowest usage. I will use these values to plot a graph for each group.

                      排名:

                        • 职位,职位编号,任务编号,提交时间
                        • number of
                          • jobs, job_numbers, task_numbers, submission times
                          • 提交时间,开始时间,结束时间
                          • 提交时间,开始时间,结束时间
                          • 失败!= 0
                          • exit_status!= 0
                          • granted_pe,插槽,maxvmem,h_data,h_rt,exclusive,h_vmem和gpu
                          • pe ==单身
                          • pe ==共享
                          • pe ==用于pe的每个addtl类别
                          • wait_time和wtime

                          推荐答案

                          您可以通过使用熊猫.loc

                          初始化数据框:

                          df = pd.DataFrame({'group':['e','e','e','h','h','h'],
                                             'feature':['fail', 'exit', 'job', 'exit', 'fail', 'job'],
                                             'cats':[1, 1, 1, 5, 2, 2],
                                             'jobs':[1, 1, 1, 64, 64, 64],
                                             'rank':[-1, -1, -1, -1, -1, -1],
                                             'topvalue':[100, 0, 4, 37, 0, 3.9],
                                             'freq':[1, 1, 1, 58, 63, 61]
                          })
                          

                          我们想对职位进行排名,因此我们只使用.loc隔离排名位置,然后在任务的右侧,使用.loc隔离Jobs列并使用.rank()函数

                          We want to rank jobs feature so we just isolate the rank locations using .loc, and then on the right side of the assignment, we isolate the jobs column using .loc and use the .rank() function

                          按职位价值排序职位:

                          df.loc[df.feature == 'job', 'rank'] = df.loc[df.feature == 'job', 'jobs'].rank(ascending=False)
                          

                          最高频率不为0时按频率排列的排名失败功能:

                          Rank failure feature by frequency where top value is not 0:

                          为此,您确实将0排名,这似乎与您所说的背道而驰.因此,我们将通过两种方式进行此操作.

                          For this one you do rank the ones that are 0 which seems to go against what you said. So we'll do this two ways.

                          这样,我们过滤掉0s 开始,并对其他所有内容进行排名.这将使top_value == 0排名保持为-1

                          This way we filter out the 0s to start, and rank everything else. This will have the top_value == 0 ranks stay as -1

                          df.loc[(df.feature == 'fail') & (df.topvalue != 0), 'rank'] = (
                                  df.loc[(df.feature == 'fail') & (df.topvalue != 0), 'freq']).rank(ascending=True)
                          

                          这样,我们就不会滤除0.

                          This way we don't filter out the 0s.

                          df.loc[(df.feature == 'fail') & (df.topvalue != 0), 'rank'] = (
                                  df.loc[(df.feature == 'fail') & (df.topvalue != 0), 'freq']).rank(ascending=True)
                          

                          这篇关于pandas python根据一个或多个其他列的子集更新列A的子集的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆