Data manipulation, kind of downsampling
Question
I have a large CSV file; a sample of the data is shown below. I will use eight matches to illustrate.
home_team away_team home_score away_score year
belgium france 2 2 1990
brazil uruguay 3 1 1990
italy belgium 1 2 1990
sweden mexico 3 1 1990
france chile 3 1 1991
brazil england 2 1 1991
italy belgium 1 2 1991
chile switzerland 2 2 1991
My data spans many years. I would like the total score of each team in each year, counting both home and away matches, as in the example below:
team total_scores year
belgium 4 1990
france 2 1990
brazil 3 1990
uruguay 1 1990
italy 1 1990
sweden 3 1990
mexico 1 1990
france 3 1991
chile 3 1991
brazil 2 1991
england 1 1991
italy 1 1991
belgium 2 1991
switzerland 2 1991
Any ideas?
Recommended answer
Here is yet another solution in R.
#Packages needed (dplyr re-exports %>%, so magrittr and tidyr are not strictly required by the code below)
library(dplyr)
library(magrittr)
library(tidyr)
#Your data
home_team <- c("belgium", "brazil", "italy", "sweden",
"france", "brazil", "italy", "chile")
away_team <- c("france", "uruguay", "belgium", "mexico",
"chile", "england", "belgium", "switzerland")
home_score <- c(2,3,1,3,
3,2,1,2)
away_score <- c(2,1,2,1,
1,1,2,2)
year <- c(1990, 1990, 1990, 1990,
1991, 1991, 1991, 1991)
df <- data.frame(home_team, away_team, home_score, away_score, year, stringsAsFactors = FALSE)
df
# home_team away_team home_score away_score year
# 1 belgium france 2 2 1990
# 2 brazil uruguay 3 1 1990
# 3 italy belgium 1 2 1990
# 4 sweden mexico 3 1 1990
# 5 france chile 3 1 1991
# 6 brazil england 2 1 1991
# 7 italy belgium 1 2 1991
# 8 chile switzerland 2 2 1991
#Column names for the new data.frames
my_colnames <- c("team", "score", "year")
#Using select() to create separate home and away datasets
df_home <- df %>% select(matches("home|year")) %>% setNames(my_colnames) %>% mutate(game_where = "home")
df_away <- df %>% select(matches("away|year")) %>% setNames(my_colnames) %>% mutate(game_where = "away")
#Stack the two data.frames with rbind()
#Group the rows by team and year
#Sum the scores within each group (tally() names the result column n)
#Sort the resulting data.frame by year
df_1 <- rbind(df_home, df_away) %>% group_by(team, year) %>% tally(score) %>% arrange(year)
df_1
# team year n
# <chr> <dbl> <dbl>
# 1 belgium 1990 4
# 2 brazil 1990 3
# 3 france 1990 2
# 4 italy 1990 1
# 5 mexico 1990 1
# 6 sweden 1990 3
# 7 uruguay 1990 1
# 8 belgium 1991 2
# 9 brazil 1991 2
#10 chile 1991 3
#11 england 1991 1
#12 france 1991 3
#13 italy 1991 1
#14 switzerland 1991 2
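For readers who would rather avoid extra packages, the same stack-then-sum idea can be sketched in base R. This is only an illustrative alternative, not part of the original answer; the names long and totals are made up for the example, and the result column is renamed to total_scores to match the layout requested in the question.

```r
# Sample data copied from the question
df <- data.frame(
  home_team  = c("belgium", "brazil", "italy", "sweden",
                 "france", "brazil", "italy", "chile"),
  away_team  = c("france", "uruguay", "belgium", "mexico",
                 "chile", "england", "belgium", "switzerland"),
  home_score = c(2, 3, 1, 3, 3, 2, 1, 2),
  away_score = c(2, 1, 2, 1, 1, 1, 2, 2),
  year       = c(1990, 1990, 1990, 1990, 1991, 1991, 1991, 1991),
  stringsAsFactors = FALSE
)

# Stack home and away rows into one long table:
# one row per team per match (hypothetical helper name: long)
long <- data.frame(
  team  = c(df$home_team, df$away_team),
  score = c(df$home_score, df$away_score),
  year  = c(df$year, df$year)
)

# Sum the scores within each (team, year) group
totals <- aggregate(score ~ team + year, data = long, FUN = sum)
names(totals)[names(totals) == "score"] <- "total_scores"

# Sort by year, then team, and reorder the columns as in the question
totals <- totals[order(totals$year, totals$team),
                 c("team", "total_scores", "year")]
rownames(totals) <- NULL
totals
```

On the real CSV, df would presumably come from read.csv() instead of being typed in by hand; the stacking and aggregation steps stay the same as long as the column names match.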