Count row value change for each group in pandas DataFrame
Problem description
I have a pandas DataFrame with information about people's locations over time. It has more than 300 million rows.
Example:
import pandas as pd

# One record per observation of a person at an address in a given year
inp = [
    {'Name': 'John', 'Year': 2018, 'Address': 'Beverly hills'},
    {'Name': 'John', 'Year': 2018, 'Address': 'Beverly hills'},
    {'Name': 'John', 'Year': 2019, 'Address': 'Beverly hills'},
    {'Name': 'John', 'Year': 2019, 'Address': 'Orange county'},
    {'Name': 'John', 'Year': 2019, 'Address': 'New York'},
    {'Name': 'Steve', 'Year': 2018, 'Address': 'Canada'},
    {'Name': 'Steve', 'Year': 2019, 'Address': 'Canada'},
    {'Name': 'Steve', 'Year': 2019, 'Address': 'Canada'},
    {'Name': 'Steve', 'Year': 2020, 'Address': 'California'},
    {'Name': 'Steve', 'Year': 2020, 'Address': 'Canada'},
    {'Name': 'John', 'Year': 2020, 'Address': 'Canada'},
    {'Name': 'John', 'Year': 2021, 'Address': 'Canada'},
    {'Name': 'John', 'Year': 2021, 'Address': 'Beverly hills'},
    {'Name': 'Steve', 'Year': 2021, 'Address': 'California'},
    {'Name': 'Steve', 'Year': 2022, 'Address': 'California'},
    {'Name': 'Steve', 'Year': 2018, 'Address': 'NewYork'},
    {'Name': 'Steve', 'Year': 2018, 'Address': 'California'},
    {'Name': 'Steve', 'Year': 2022, 'Address': 'NewYork'},
]
df = pd.DataFrame(inp)
print(df)
Output:
Address Name Year
0 Beverly hills John 2018
1 Beverly hills John 2018
2 Beverly hills John 2019
3 Orange county John 2019
4 New York John 2019
5 Canada Steve 2018
6 Canada Steve 2019
7 Canada Steve 2019
8 California Steve 2020
9 Canada Steve 2020
10 Canada John 2020
11 Canada John 2021
12 Beverly hills John 2021
13 California Steve 2021
14 California Steve 2022
15 NewYork Steve 2018
16 California Steve 2018
17 NewYork Steve 2022
I want to count the address changes that occur in a specific year. In other words, how many times did people move from "Canada" to "California" in 2018?
Desired output:
1) A matrix like the one below for each year. Example: all address changes in the year 2019 (including changes from 2018 to 2019).
+---------------+---------------+---------------+----------+------------+
| From\ To | Beverly hills | Orange county | New York | California |
+---------------+---------------+---------------+----------+------------+
| Beverly hills | 0 | 1 | 0 | 0 |
+---------------+---------------+---------------+----------+------------+
| Orange county | 0 | 0 | 1 | 0 |
+---------------+---------------+---------------+----------+------------+
| New York | 0 | 2 | 0 | 0 |
+---------------+---------------+---------------+----------+------------+
| California | 0 | 0 | 0 | 0 |
+---------------+---------------+---------------+----------+------------+
2) Address changes across the years.
+---------------+---------------+------+------+------+
| Address 1 | Address 2 | 2018 | 2019 | 2020 |
+---------------+---------------+------+------+------+
| Beverly hills | Orange county | 0 | 1 | 0 |
+---------------+---------------+------+------+------+
| New York | Canada | 0 | 0 | 1 |
+---------------+---------------+------+------+------+
| Canada | New York | 1 | 0 | 0 |
+---------------+---------------+------+------+------+
| California | Canada | 0 | 1 | 2 |
+---------------+---------------+------+------+------+
My solution so far: thanks to @QuangHoang, I can capture changes of "Year" and changes of "Address" with the following code:
groups = df.groupby('Name')
for col in ['Year', 'Address']:
    # 1 where the value differs from the previous row of the same Name, else 0
    df[f'cng-{col}'] = groups[col].shift().fillna(df[col]).ne(df[col]).astype(int)
groups[col].shift() shifts the corresponding column by 1 within each name. fillna(df[col]) fills the first row of each (shifted) group with the original value, indicating no change. Finally, ne(df[col]) compares the shifted values with the original values to flag changes.
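The three steps described above can be followed one at a time on a tiny frame. This is only an illustrative sketch: the toy data and the intermediate names prev and prev_filled are made up for the example and are not part of the original code.

```python
import pandas as pd

# Toy data (hypothetical): two people, rows already in time order.
toy = pd.DataFrame({
    "Name": ["A", "A", "A", "B", "B"],
    "Address": ["X", "X", "Y", "Z", "X"],
})

groups = toy.groupby("Name")

# Step 1: previous value within each Name; the first row of each group becomes NaN.
prev = groups["Address"].shift()

# Step 2: fill each group's first row with its own value so it compares as "no change".
prev_filled = prev.fillna(toy["Address"])

# Step 3: 1 where the value differs from the previous row of the same Name, else 0.
toy["cng-Address"] = prev_filled.ne(toy["Address"]).astype(int)

print(toy)
```

Without the fillna step, the first row of every group would compare NaN against a real value and be flagged as a change.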
This yields:
+----+---------------+-------+------+----------+-------------+
| ID | Address | Name | Year | cng-Year | cng-Address |
+----+---------------+-------+------+----------+-------------+
| 0 | Beverly hills | John | 2018 | 0 | 0 |
+----+---------------+-------+------+----------+-------------+
| 1 | Beverly hills | John | 2018 | 0 | 0 |
+----+---------------+-------+------+----------+-------------+
| 2 | Beverly hills | John | 2019 | 1 | 0 |
+----+---------------+-------+------+----------+-------------+
| 3 | Orange county | John | 2019 | 0 | 1 |
+----+---------------+-------+------+----------+-------------+
| 4 | New York | John | 2019 | 0 | 1 |
+----+---------------+-------+------+----------+-------------+
| 10 | Canada | John | 2020 | 1 | 1 |
+----+---------------+-------+------+----------+-------------+
| 11 | Canada | John | 2021 | 1 | 0 |
+----+---------------+-------+------+----------+-------------+
| 12 | Beverly hills | John | 2021 | 0 | 1 |
+----+---------------+-------+------+----------+-------------+
| 5 | Canada | Steve | 2018 | 0 | 0 |
+----+---------------+-------+------+----------+-------------+
| 15 | NewYork | Steve | 2018 | 1 | 1 |
+----+---------------+-------+------+----------+-------------+
| 16 | California | Steve | 2018 | 0 | 1 |
+----+---------------+-------+------+----------+-------------+
| 6 | Canada | Steve | 2019 | 1 | 0 |
+----+---------------+-------+------+----------+-------------+
| 7 | Canada | Steve | 2019 | 0 | 0 |
+----+---------------+-------+------+----------+-------------+
| 8 | California | Steve | 2020 | 1 | 1 |
+----+---------------+-------+------+----------+-------------+
| 9 | Canada | Steve | 2020 | 0 | 1 |
+----+---------------+-------+------+----------+-------------+
| 13 | California | Steve | 2021 | 1 | 1 |
+----+---------------+-------+------+----------+-------------+
| 14 | California | Steve | 2022 | 1 | 0 |
+----+---------------+-------+------+----------+-------------+
| 17 | NewYork | Steve | 2022 | 1 | 1 |
+----+---------------+-------+------+----------+-------------+
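Note that the flags depend on row order. In the table above, row 15 (Steve, 2018) has cng-Year = 1 even though it is listed next to other 2018 rows, which suggests the flags were computed in the original row order and the rows were sorted only for display. The sketch below compares both orders on a three-row subset. The helper add_change_flags is a hypothetical wrapper around the loop shown earlier, and it assumes that rows within the same year are already in chronological order.

```python
import pandas as pd

# A small subset of the question's rows for Steve, in their original relative order.
df = pd.DataFrame(
    [
        {"Name": "Steve", "Year": 2018, "Address": "Canada"},
        {"Name": "Steve", "Year": 2019, "Address": "Canada"},
        {"Name": "Steve", "Year": 2018, "Address": "NewYork"},
    ]
)


def add_change_flags(frame):
    """Return a copy with cng-Year / cng-Address computed in the frame's current row order."""
    out = frame.copy()
    groups = out.groupby("Name")
    for col in ["Year", "Address"]:
        out[f"cng-{col}"] = groups[col].shift().fillna(out[col]).ne(out[col]).astype(int)
    return out


# Original order: the 2018 NewYork row follows a 2019 row, so it is flagged as a year change.
unsorted_flags = add_change_flags(df)

# Assumption: a stable sort by Year keeps the recorded order within each year,
# so each person's rows end up in chronological order before the flags are computed.
sorted_flags = add_change_flags(df.sort_values("Year", kind="stable"))

print(unsorted_flags)
print(sorted_flags)
```

Whether sorting first is appropriate depends on whether the original row order carries meaning of its own, which the question does not state.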
Recommended answer
If I understand the question correctly:
df.drop_duplicates().groupby(['Name','Year']).size().reset_index(name="changes")
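Applied to the original three-column frame, this counts the distinct addresses recorded for each name in each year and gives this output: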
Name Year changes
0 John 2018 1
1 John 2019 3
2 John 2020 1
3 John 2021 2
4 Steve 2018 3
5 Steve 2019 1
6 Steve 2020 2
7 Steve 2021 1
8 Steve 2022 2
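The counts above are distinct addresses per name and year, not From/To pairs. One possible way to get closer to the two desired outputs is to carry the previous address along with groupby().shift() and then tabulate the pairs. The following is a sketch under stated assumptions, not a tested solution for the full 300-million-row frame: the mini data set is hypothetical, a move is attributed to the year in which the new address is recorded, and rows within a year are assumed to be in chronological order.

```python
import pandas as pd

# Hypothetical mini data set in the same shape as the question's frame.
df = pd.DataFrame(
    [
        {"Name": "John", "Year": 2018, "Address": "Beverly hills"},
        {"Name": "John", "Year": 2019, "Address": "Orange county"},
        {"Name": "John", "Year": 2019, "Address": "New York"},
        {"Name": "Steve", "Year": 2018, "Address": "Canada"},
        {"Name": "Steve", "Year": 2019, "Address": "Canada"},
        {"Name": "Steve", "Year": 2020, "Address": "California"},
    ]
)

# Assumption: a stable sort by Year preserves the recorded order within a year.
moves = df.sort_values("Year", kind="stable").copy()

# Previous address of the same person; NaN for each person's first row.
moves["From"] = moves.groupby("Name")["Address"].shift()
moves = moves.rename(columns={"Address": "To"})

# Keep only real moves: a previous address exists and differs from the current one.
moves = moves[moves["From"].notna() & moves["From"].ne(moves["To"])]

# Output 1: From/To matrix for a single year.
in_2019 = moves[moves["Year"] == 2019]
matrix_2019 = pd.crosstab(in_2019["From"], in_2019["To"])

# Output 2: one row per (From, To) pair, one column per year.
by_year = moves.groupby(["From", "To", "Year"]).size().unstack("Year", fill_value=0)

print(matrix_2019)
print(by_year)
```

pd.crosstab only includes addresses that actually occur in the selected year, so a square matrix with all addresses would need an additional reindex on both axes.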