Pandas Dataframe iterate over rows

Problem description

I have a dataframe where X and Y are cell coordinates and mRNA is the number of mRNA per cell.

        ID        X        Y  mRNA
0        0  149.492  189.153     0
1        1  115.084  194.082     2
2        2  135.331  194.831     7
3        3  136.965  184.493     2
4        4  124.025  190.069     1
...    ...      ...      ...   ...
2410  2410  452.596  256.313     0
2411  2411  196.448  333.959    46
2412  2412  190.779  318.418    71
2413  2413  202.941  335.446    37
2414  2414  254.967  369.431    13

At the moment I am trying to apply this formula, but I cannot get it to work. Ideally I want to do this operation:

For ID 0: sqrt[((X0-X1)^2)+((Y0-Y1)^2)]
          sqrt[((X0-X2)^2)+((Y0-Y2)^2)]
          ............
          sqrt[((X0-Xn)^2)+((Y0-Yn)^2)]

(where n is the last cell ID in my CSV file, 2414)

Then the same operation will have to be done for ID 1 against all the cells, then ID 2, and so on.

import pandas as pd
import numpy as np

df=pd.read_csv('Detailed2.csv', sep=',')
print(df)

df1 = np.sqrt(((df['X'].sub(df['X']))^2).add((df['Y'].sub(df['Y']))^2)).to_frame('col')
print(df1)

This code is not working.

Solution

PMende posted a NumPy solution while I was working on mine, and it's even better. Kudos to him.
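PMende's answer is not reproduced on this page. As a rough sketch of what a pure NumPy version could look like (an assumption, not his actual code), broadcasting computes every pairwise difference at once. It also shows the two things the attempt in the question gets wrong: `^` is bitwise XOR in Python (exponentiation is `**`), and `df['X'].sub(df['X'])` subtracts each value from itself rather than from every other row. The five-row frame below is just the head of the sample data.

```python
import numpy as np
import pandas as pd

# First five rows of the sample data from the question.
df = pd.DataFrame({
    'ID': [0, 1, 2, 3, 4],
    'X': [149.492, 115.084, 135.331, 136.965, 124.025],
    'Y': [189.153, 194.082, 194.831, 184.493, 190.069],
    'mRNA': [0, 2, 7, 2, 1],
})

coords = df[['X', 'Y']].to_numpy()  # shape (n, 2)

# coords[:, None, :] has shape (n, 1, 2) and coords[None, :, :] has shape
# (1, n, 2), so the subtraction broadcasts to every pair: shape (n, n, 2).
diff = coords[:, None, :] - coords[None, :, :]

# Square with **, sum over the X/Y axis, then take the square root.
distances = np.sqrt((diff ** 2).sum(axis=-1))  # shape (n, n)

print(distances.round(3))
```

Row i of `distances` holds the distances from row i of the frame to every row, so the diagonal is zero and the matrix is symmetric. The intermediate `diff` array has n × n × 2 entries, which is worth keeping in mind for much larger inputs.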


Here is a slight variation on his answer which I like because it doesn't use any explicit loops.

from io import StringIO

import pandas as pd
import scipy.spatial as spsp

raw_str = \
    '''
            ID        X        Y  mRNA
    0        0  149.492  189.153     0
    1        1  115.084  194.082     2
    2        2  135.331  194.831     7
    3        3  136.965  184.493     2
    4        4  124.025  190.069     1
    2410  2410  452.596  256.313     0
    2411  2411  196.448  333.959    46
    2412  2412  190.779  318.418    71
    2413  2413  202.941  335.446    37
    2414  2414  254.967  369.431    13
    '''

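# Split on runs of whitespace (sep=r'\s+' replaces delim_whitespace=True,
# which is deprecated since pandas 2.2).
df_1 = pd.read_csv(StringIO(raw_str), header=0, sep=r'\s+', usecols=[1, 2, 3, 4])

# (n, 2) array of X/Y coordinates
coords = df_1[['X', 'Y']].to_numpy()

# (n, n) matrix of pairwise Euclidean distances
distances = spsp.distance_matrix(coords, coords)

# One column label per cell ID; rename() with no argument clears the Series name
col_names = df_1['ID'].map(lambda x: f'col_id_{x}').rename()

df_2 = pd.DataFrame(data=distances, columns=col_names)

# Append the distance columns to the original frame
df_3 = pd.concat((df_1, df_2), axis=1)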

The extra variables cost a little overhead; they're here simply for the sake of clarity.
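To illustrate how the wide frame might be used afterwards, here is a small sketch that looks up a single distance by cell ID. It rebuilds `df_3` from three of the sample rows so that it runs on its own; the three-row subset and the variable `d` are choices made for this example only.

```python
import pandas as pd
import scipy.spatial as spsp

# Three rows taken from the sample data (IDs 0, 1 and 2411).
df_1 = pd.DataFrame({
    'ID': [0, 1, 2411],
    'X': [149.492, 115.084, 196.448],
    'Y': [189.153, 194.082, 333.959],
})

coords = df_1[['X', 'Y']].to_numpy()
distances = spsp.distance_matrix(coords, coords)

# Same column naming scheme as above: one column per cell ID.
col_names = [f'col_id_{i}' for i in df_1['ID']]
df_3 = pd.concat((df_1, pd.DataFrame(distances, columns=col_names)), axis=1)

# Distance from the cell with ID 0 to the cell with ID 2411.
d = df_3.loc[df_3['ID'] == 0, 'col_id_2411'].item()
print(d)
```

The row is selected by its `ID` value and the column by the `col_id_…` label, so the lookup does not depend on the row order.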


Creating thousands of columns is kind of crazy. A more reasonable solution is to save the distances as a list in each row.

from io import StringIO

import pandas as pd
import scipy.spatial as spsp

raw_str = \
    '''
            ID        X        Y  mRNA
    0        0  149.492  189.153     0
    1        1  115.084  194.082     2
    2        2  135.331  194.831     7
    3        3  136.965  184.493     2
    4        4  124.025  190.069     1
    2410  2410  452.596  256.313     0
    2411  2411  196.448  333.959    46
    2412  2412  190.779  318.418    71
    2413  2413  202.941  335.446    37
    2414  2414  254.967  369.431    13
    '''

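# Split on runs of whitespace (sep=r'\s+' replaces delim_whitespace=True,
# which is deprecated since pandas 2.2).
df_1 = pd.read_csv(StringIO(raw_str), header=0, sep=r'\s+', usecols=[1, 2, 3, 4])

# (n, 2) array of X/Y coordinates
coords = df_1[['X', 'Y']].to_numpy()

# (n, n) matrix of pairwise Euclidean distances
distances = spsp.distance_matrix(coords, coords)

# Store row i of the matrix as a Python list in row i of the frame
df_1['dist'] = distances.tolist()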

df_1:

     ID        X  ...  mRNA                                               dist
0     0  149.492  ...     0  [0.0, 34.759250639218344, 15.256919905406859, ...
1     1  115.084  ...     2  [34.759250639218344, 0.0, 20.26084919246971, 2...
2     2  135.331  ...     7  [15.256919905406859, 20.26084919246971, 0.0, 1...
3     3  136.965  ...     2  [13.36567727427235, 23.889894976746966, 10.466...
4     4  124.025  ...     1  [25.483468072458283, 9.800288261066603, 12.267...
5  2410  452.596  ...     0  [310.45531146366295, 343.201176433007, 323.167...
6  2411  196.448  ...    46  [152.2289183171187, 161.81988637061886, 151.96...
7  2412  190.779  ...    71  [135.69840306355857, 145.56501613025023, 135.4...
8  2413  202.941  ...    37  [155.75120368716253, 166.4410794996235, 156.02...
9  2414  254.967  ...    13  [208.86630390994137, 224.30899556192568, 211.6...
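As a quick sanity check, the second entry of the first list can be reproduced by hand from the formula in the question, using only the coordinates of ID 0 and ID 1. The names `x0`, `y0`, `x1`, `y1` and `d01` are introduced just for this sketch.

```python
import math

# Coordinates of the cells with ID 0 and ID 1 from the sample data.
x0, y0 = 149.492, 189.153
x1, y1 = 115.084, 194.082

# sqrt((X0 - X1)^2 + (Y0 - Y1)^2), written with ** for the squares.
d01 = math.sqrt((x0 - x1) ** 2 + (y0 - y1) ** 2)

print(round(d01, 6))  # 34.759251
```

This matches the 34.759250639218344 shown in the `dist` column for row 0. Note that positions inside each list follow row order rather than cell ID: in the ten-row sample, position 5 corresponds to ID 2410.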
