Pandas Dataframe iterate over rows
I have a dataframe where X and Y are cell coordinates and mRNA is the number of mRNA per cell.
        ID        X        Y  mRNA
0        0  149.492  189.153     0
1        1  115.084  194.082     2
2        2  135.331  194.831     7
3        3  136.965  184.493     2
4        4  124.025  190.069     1
...    ...      ...      ...   ...
2410  2410  452.596  256.313     0
2411  2411  196.448  333.959    46
2412  2412  190.779  318.418    71
2413  2413  202.941  335.446    37
2414  2414  254.967  369.431    13
At the moment I am trying to apply this formula, but I can't get it to work. Ideally I want to do this operation:
For ID 0: sqrt[((X0-X1)^2)+((Y0-Y1)^2)]
sqrt[((X0-X2)^2)+((Y0-Y2)^2)]
............
sqrt[((X0-Xn)^2)+((Y0-Yn)^2)]
(where n is the last cell ID in my csv file 2414)
Then the same operation will have to be done for ID 1 against all the cells, then ID 2, and so on.
import pandas as pd
import numpy as np
df=pd.read_csv('Detailed2.csv', sep=',')
print(df)
df1 = np.sqrt(((df['X'].sub(df['X']))^2).add((df['Y'].sub(df['Y']))^2)).to_frame('col')
print(df1)
This code is not working.
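Two things stand in the way of the snippet above. In Python, `^` is the bitwise XOR operator, not exponentiation (that is `**`). And `df['X'].sub(df['X'])` subtracts the column from itself row by row, so every difference is zero instead of comparing each cell with every other cell. Below is a minimal sketch of the intended formula using NumPy broadcasting. It assumes a small hand-typed sample (the first five rows of the table) in place of `Detailed2.csv`, and the names `sample`, `dx`, `dy` and `dist` are illustrative only:

```python
import numpy as np
import pandas as pd

# First five rows of the table above, typed in by hand for illustration.
sample = pd.DataFrame({
    "ID": [0, 1, 2, 3, 4],
    "X": [149.492, 115.084, 135.331, 136.965, 124.025],
    "Y": [189.153, 194.082, 194.831, 184.493, 190.069],
    "mRNA": [0, 2, 7, 2, 1],
})

x = sample["X"].to_numpy()
y = sample["Y"].to_numpy()

# x[:, None] has shape (n, 1) and x[None, :] has shape (1, n); subtracting
# them broadcasts to an (n, n) array holding every pairwise difference.
dx = x[:, None] - x[None, :]
dy = y[:, None] - y[None, :]

# Use ** for powers; ^ would be bitwise XOR.
dist = np.sqrt(dx ** 2 + dy ** 2)

# dist[i, j] is the distance between the cells in rows i and j.
print(dist.round(3))
```

Row 0 of `dist` then holds the values from the "For ID 0" lines of the formula, row 1 those for ID 1, and so on.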
PMende posted a NumPy solution while I was working on mine, and it's even better. Kudos to him.
Here is a slight variation on his answer which I like because it doesn't use any explicit loops.
from io import StringIO
import pandas as pd
import scipy.spatial as spsp
raw_str = \
'''
ID X Y mRNA
0 0 149.492 189.153 0
1 1 115.084 194.082 2
2 2 135.331 194.831 7
3 3 136.965 184.493 2
4 4 124.025 190.069 1
2410 2410 452.596 256.313 0
2411 2411 196.448 333.959 46
2412 2412 190.779 318.418 71
2413 2413 202.941 335.446 37
2414 2414 254.967 369.431 13
'''
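# Parse the whitespace-separated sample. sep=r'\s+' is equivalent to
# delim_whitespace=True, which newer pandas versions deprecate.
df_1 = pd.read_csv(StringIO(raw_str), header=0, sep=r'\s+', usecols=[1, 2, 3, 4])
# (n, 2) array of cell coordinates.
coords = df_1[['X', 'Y']].to_numpy()
# (n, n) matrix of pairwise Euclidean distances.
distances = spsp.distance_matrix(coords, coords)
# One column label per cell ID.
col_names = df_1['ID'].map(lambda x: f'col_id_{x}').rename()
df_2 = pd.DataFrame(data=distances, columns=col_names)
# Append the distance columns to the original frame.
df_3 = pd.concat((df_1, df_2), axis=1)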
The extra variables add a little overhead; they are here only for the sake of clarity.
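One thing to keep in mind with this layout: the columns of `df_2` are named after cell IDs, but its rows are labelled by position (0, 1, 2, ...), and the two stop lining up as soon as the IDs are not consecutive. A possible variation labels both axes of the distance matrix with the IDs so a pair can be looked up directly. This sketch uses a few hand-typed rows, and `cells` and `dist_by_id` are assumed names:

```python
import pandas as pd
import scipy.spatial as spsp

# A few rows from the sample data, including non-consecutive IDs.
cells = pd.DataFrame({
    "ID": [0, 1, 2, 2410, 2411],
    "X": [149.492, 115.084, 135.331, 452.596, 196.448],
    "Y": [189.153, 194.082, 194.831, 256.313, 333.959],
})

coords = cells[["X", "Y"]].to_numpy()
distances = spsp.distance_matrix(coords, coords)

# Use the cell IDs as both row and column labels.
ids = cells["ID"].to_numpy()
dist_by_id = pd.DataFrame(distances, index=ids, columns=ids)

# Distance between cell 0 and cell 2410, looked up by ID rather than position.
print(dist_by_id.loc[0, 2410])
```

Keeping the matrix in its own frame also avoids widening the original data with one column per cell.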
Creating thousands of columns is kind of crazy. A more reasonable solution is to save the distances as a list in each row:
from io import StringIO
import pandas as pd
import scipy.spatial as spsp
raw_str = \
'''
ID X Y mRNA
0 0 149.492 189.153 0
1 1 115.084 194.082 2
2 2 135.331 194.831 7
3 3 136.965 184.493 2
4 4 124.025 190.069 1
2410 2410 452.596 256.313 0
2411 2411 196.448 333.959 46
2412 2412 190.779 318.418 71
2413 2413 202.941 335.446 37
2414 2414 254.967 369.431 13
'''
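# Parse the whitespace-separated sample. sep=r'\s+' is equivalent to
# delim_whitespace=True, which newer pandas versions deprecate.
df_1 = pd.read_csv(StringIO(raw_str), header=0, sep=r'\s+', usecols=[1, 2, 3, 4])
coords = df_1[['X', 'Y']].to_numpy()
# (n, n) matrix of pairwise Euclidean distances.
distances = spsp.distance_matrix(coords, coords)
# Store row i of the matrix as a Python list in row i of the frame.
df_1['dist'] = distances.tolist()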
df_1
Output:
     ID        X  ...  mRNA                                               dist
0     0  149.492  ...     0  [0.0, 34.759250639218344, 15.256919905406859, ...
1     1  115.084  ...     2  [34.759250639218344, 0.0, 20.26084919246971, 2...
2     2  135.331  ...     7  [15.256919905406859, 20.26084919246971, 0.0, 1...
3     3  136.965  ...     2  [13.36567727427235, 23.889894976746966, 10.466...
4     4  124.025  ...     1  [25.483468072458283, 9.800288261066603, 12.267...
5  2410  452.596  ...     0  [310.45531146366295, 343.201176433007, 323.167...
6  2411  196.448  ...    46  [152.2289183171187, 161.81988637061886, 151.96...
7  2412  190.779  ...    71  [135.69840306355857, 145.56501613025023, 135.4...
8  2413  202.941  ...    37  [155.75120368716253, 166.4410794996235, 156.02...
9  2414  254.967  ...    13  [208.86630390994137, 224.30899556192568, 211.6...
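To read a value back out of the `dist` column, note that entry j of each list is the distance to the cell in row j of the frame. A small sketch of one way to do that, again with a few hand-typed rows; `cells` and `row0` are hypothetical names:

```python
import pandas as pd
import scipy.spatial as spsp

# A few rows from the sample data, typed in by hand for illustration.
cells = pd.DataFrame({
    "ID": [0, 1, 2, 2410, 2411],
    "X": [149.492, 115.084, 135.331, 452.596, 196.448],
    "Y": [189.153, 194.082, 194.831, 256.313, 333.959],
})

coords = cells[["X", "Y"]].to_numpy()
cells["dist"] = spsp.distance_matrix(coords, coords).tolist()

# Wrap the list stored in the first row in a Series indexed by cell ID,
# so a distance can be picked out by ID instead of list position.
row0 = pd.Series(cells.loc[0, "dist"], index=cells["ID"].to_numpy())
print(row0.loc[2410])
```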