遍历df列并根据行索引,列引用在数据框中返回值 [英] iterate through df column and return value in dataframe based on row index, column reference

查看:60
本文介绍了遍历df列并根据行索引,列引用在数据框中返回值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我的目标是将年份"列中的每个值与相应的年份(即1999年,2000年)进行比较.然后,我想从相应的列返回相应的值.例如,对于2004年的阿富汗(第一行),我想找到名为"2004"的列,并从包含阿富汗的行中返回值.

My goal is to compare each value from the column "year" against the appropriate column year (i.e. 1999, 2000). I then want to return the corresponding value from the corresponding column. For example, for Afghanistan (first row), year 2004, I want to find the column named "2004" and return the value from the row that contains afghanistan.

这是桌子.供参考,此表是单个定义年的教育程度与1999-2010年每个国家的gdp表之间的sql连接的结果.我的最终目标是返回教育数据所来自的年份的gdp.

Here is the table. For reference this table is the result of a sql join between educational attainment in a single defined year and a table for gdp per country for years 1999 - 2010. My ultimate goal is to return the gdp from the year that the educational data is from.

country year    men_ed_yrs  women_ed_yrs    total_ed_yrs    1999    2000    2001    2002    2003    2004    2005    2006    2007    2008    2009    2010
0   Afghanistan 2004    11  5   8   NaN NaN 2461666315  4128818042  4583648922  5285461999  6.275076e+09    7.057598e+09    9.843842e+09    1.019053e+10    1.248694e+10    1.593680e+10
1   Albania 2004    11  11  11  3414760915  3632043908  4060758804  4435078648  5746945913  7314865176  8.158549e+09    8.992642e+09    1.070101e+10    1.288135e+10    1.204421e+10    1.192695e+10
2   Algeria 2005    13  13  13  48640611686 54790060513 54744714110 56760288396 67863829705 85324998959 1.030000e+11    1.170000e+11    1.350000e+11    1.710000e+11    1.370000e+11    1.610000e+11
3   Andorra 2008    11  12  11  1239840270  1401694156  1484004617  1717563533  2373836214  2916913449  3.248135e+09    3.536452e+09    4.010785e+09    4.001349e+09    3.649863e+09    3.346317e+09
4   Anguilla    2008    11  11  11  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

gdp_ed_list = []
for value in df_combined_column_named['year']: #loops through each year in year column
        if value in df_combined_column_named.columns: #compares year to column names
            idx = df_combined_column_named[df_combined_column_named['year'][value]].index.tolist() #supposed to get the index associated with value
            gdp_ed = df_combined_column_named.get_value(idx, value) #get the value of the cell found at idx, value
            gdp_ed_list.append(gdp_ed) #append to a list

当前,我的代码被卡在index.list()部分.它返回错误:

Currently, my code is getting stuck at the index.list() section. It is returning the error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-85-361acb97edd4> in <module>()
      2 for value in df_combined_column_named['year']: #loops through each year in year column
      3     if value in df_combined_column_named.columns: #compares year to column names
----> 4         idx = df_combined_column_named[df_combined_column_named['year'][value]].index.tolist()
      5         gdp_ed = df_combined_column_named.get_value(idx, value)
      6         gdp_ed_list.append(gdp_ed)
KeyError: u'2004'

有什么想法吗?

推荐答案

您似乎正在尝试将year列中的值与列标签匹配,然后在相应的单元格中提取该值.您可以通过遍历行来做到这一点(见下文),但我认为这不是最快的方法. 相反,您可以使用 pd.melt 合并将具有年份标签的列合并为单个列,例如year_col:

It looks like you are trying to match the value in the year column to column labels and then extract the value in the corresponding cells. You could do that by looping through the rows (see below) but I think it would be not be the fastest way. Instead, you could use pd.melt to coalesce the columns with year-like labels into a single column, say, year_col:

In [38]: melted = pd.melt(df, id_vars=['country', 'year', 'men_ed_yrs', 'women_ed_yrs', 'total_ed_yrs'], var_name='year_col')

In [39]: melted
Out[39]: 
        country  year  men_ed_yrs  women_ed_yrs  total_ed_yrs year_col         value  
0   Afghanistan  2004          11             5             8     1999            NaN   
1       Albania  2004          11            11            11     1999   3.414761e+09   
2       Algeria  2005          13            13            13     1999   4.864061e+10   
3       Andorra  2008          11            12            11     1999   1.239840e+09   
4      Anguilla  2008          11            11            11     1999            NaN   
5   Afghanistan  2004          11             5             8     2000            NaN
...

以这种方式融合" DataFrame的好处是 现在您将同时拥有yearyear_col列. 您要查找的值在year等于year_col 的行中.这很容易通过.loc来获得:

The benefit of "melting" the DataFrame in this way is that now you would have both year and year_col columns. The values you are looking for are in the rows where year equals year_col. And that is easy to obtain by using .loc:

In [41]: melted.loc[melted['year'] == melted['year_col']]
Out[41]: 
        country  year  men_ed_yrs  women_ed_yrs  total_ed_yrs year_col  \
25  Afghanistan  2004          11             5             8     2004   
26      Albania  2004          11            11            11     2004   
32      Algeria  2005          13            13            13     2005   
48      Andorra  2008          11            12            11     2008   
49     Anguilla  2008          11            11            11     2008   

           value  
25  5.285462e+09  
26  7.314865e+09  
32  1.030000e+11  
48  4.001349e+09  
49           NaN  


因此,您可以使用


Thus, you could use

import numpy as np
import pandas as pd
nan = np.nan
df = pd.DataFrame({'1999': [nan, 3414760915.0, 48640611686.0, 1239840270.0, nan],
 '2000': [nan, 3632043908.0, 54790060513.0, 1401694156.0, nan],
 '2001': [2461666315.0, 4060758804.0, 54744714110.0, 1484004617.0, nan],
 '2002': [4128818042.0, 4435078648.0, 56760288396.0, 1717563533.0, nan],
 '2003': [4583648922.0, 5746945913.0, 67863829705.0, 2373836214.0, nan],
 '2004': [5285461999.0, 7314865176.0, 85324998959.0, 2916913449.0, nan],
 '2005': [6275076000.0, 8158549000.0, 103000000000.0, 3248135000.0, nan],
 '2006': [7057598000.0, 8992642000.0, 117000000000.0, 3536452000.0, nan],
 '2007': [9843842000.0, 10701010000.0, 135000000000.0, 4010785000.0, nan],
 '2008': [10190530000.0, 12881350000.0, 171000000000.0, 4001349000.0, nan],
 '2009': [12486940000.0, 12044210000.0, 137000000000.0, 3649863000.0, nan],
 '2010': [15936800000.0, 11926950000.0, 161000000000.0, 3346317000.0, nan],
 'country': ['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Anguilla'],
 'men_ed_yrs': [11, 11, 13, 11, 11],
 'total_ed_yrs': [8, 11, 13, 11, 11],
 'women_ed_yrs': [5, 11, 13, 12, 11],
 'year': ['2004', '2004', '2005', '2008', '2008']})

melted = pd.melt(df, id_vars=['country', 'year', 'men_ed_yrs', 'women_ed_yrs', 
                              'total_ed_yrs'], var_name='year_col')
result = melted.loc[melted['year'] == melted['year_col']]
print(result)


为什么提出KeyError :


Why was a KeyError raised:

KeyErrordf_combined_column_named['year'][value]引发.假设value'2004'.然后df_combined_column_named['year']是一个包含年份的字符串表示形式的序列,并由整数(如0、1、2,...)索引. df_combined_column_named['year'][value]失败,因为它尝试使用不在整数索引中的字符串'2004'为该系列建立索引.

The KeyError is being raised by df_combined_column_named['year'][value]. Suppose value is '2004'. Then df_combined_column_named['year'] is a Series containing string representations of years and indexed by integers (like 0, 1, 2, ...). df_combined_column_named['year'][value] fails because it attempts to index this Series with the string '2004' which is not in the integer index.

或者,这是通过使用 iterrows .这也许更容易理解,但是与其他以列为中心的以熊猫为中心的方法相比,使用iterrows通常慢:

Alternatively, here is another way to achieve the goal by looping through the rows using iterrows. This is perhaps simpler to understand, but in general using iterrows is slow compared to other column-based Pandas-centric methods:

data = []
for idx, row in df.iterrows():
    data.append((row['country'], row['year'], row[row['year']]))
result = pd.DataFrame(data, columns=['country', 'year', 'value'])
print(result)

打印

       country  year         value
0  Afghanistan  2004  5.285462e+09
1      Albania  2004  7.314865e+09
2      Algeria  2005  1.030000e+11
3      Andorra  2008  4.001349e+09
4     Anguilla  2008           NaN

这篇关于遍历df列并根据行索引,列引用在数据框中返回值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆