Julia Dataframe结合了特定的计算和转置 [英] Julia Dataframe combine specific calculations and tranpose

查看:71
本文介绍了Julia Dataframe结合了特定的计算和转置的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要做一些非常具体的事情,我正在尝试以一种很好的方式做到这一点,尤其是我希望对其进行优化.

I need to do something quite specific and i'm trying to do it the good way , especially i want it to be optimized .

所以我有一个看起来像这样的DataFrame:

So i have a DataFrame that look like this :

v = ["x","y","z"][rand(1:3, 10)]
df = DataFrame(Any[collect(1:10), v, rand(10)], [:USER_ID, :GENRE_MAIN, :TOTAL_LISTENED])

 Row │ USER_ID  GENRE_MAIN  TOTAL_LISTENED 
     │ Int64    String      Float64        
─────┼─────────────────────────────────────
   1 │       1  x                 0.237186
  12 │       1  y                 0.237186
  13 │       1  x                 0.254486
   2 │       2  z                 0.920804
   3 │       3  y                 0.140626
   4 │       4  x                 0.653306
   5 │       5  x                 0.83126
   6 │       6  x                 0.928973
   7 │       7  y                 0.519728
   8 │       8  x                 0.409969
   9 │       9  z                 0.798064
  10 │      10  x                 0.701332

我想按用户汇总(每个user_id有很多行)并进行许多计算

I want to aggregate it by user (i have many rows per user_id ) and do many calculations

我需要计算前1、2、3、4、5个流派,专辑名称,每个user_id的艺术家名称及其各自的值(对应的total_listened),并且它必须像这样:

I need to calculate the top 1 ,2 ,3 ,4 ,5 genre, album name, artist name per user_id and its respective values (the total_listened that correspond) and it has to be like this :

USER_ID │ ALBUM1_NAME      │ ALBUM2_NAME  | ALBUM1_NAME_VALUE | ALBUM2_NAME_VALUES | ......│ GENRE1       │ GENRE2  

每个user_id一行.

One line per user_id .

我得到的解决方案可以满足我想要的90%,但是我无法对其进行修改以包含total_listened的值:

I got this solution that fits 90% of what i wanted but i can't modify it to also include the values of total_listened:

using DataFrames, Pipe, Random, Pkg

Pkg.activate(".")
Pkg.add("DataFrames")
Pkg.add("Pipe")

Random.seed!(1234)

df = DataFrame(USER_ID=rand(1:10, 80),
               GENRE_MAIN=rand(string.("genre_", 1:6), 80),
               ALBUM_NAME=rand(string.("album_", 1:6), 80),
               ALBUM_ARTIST_NAME=rand(string.("artist_", 1:6), 80))

function top5(sdf, col, prefix)
    return @pipe groupby(sdf, col) |>
                 combine(_, nrow) |>
                 sort!(_, :nrow, rev=true) |>
                 first(_, 5) |>
                 vcat(_[!, 1], fill(missing, 5 - nrow(_))) |>
                 DataFrame([string(prefix, i) for i in 1:5] .=> _)
end

@pipe groupby(df, :USER_ID) |>
      combine(_,
              x -> top5(x, :GENRE_MAIN, "genre"),
              x -> top5(x, :ALBUM_NAME, "album"), 
              x -> top5(x, :ALBUM_ARTIST_NAME, "artist"))

一个例子:

对于刚好为DataFrame的用户1,我希望结果为:

for the user 1 of the DataFrame just up i want the result to be :

 Row │ USER_ID  GENRE1  GENRE2   GENRE1_VALUE GENRE2_VALUE   ......
     │ Int64    String  String    Float64     Float64      
─────┼─────────────────────────────────────────────────────
   1 │       1  x         y       0.491672    0.237186.     ......

我在这里只参加了GENRE,但我也想将其用于ALBUM_NAME,ALBUM_ARTIST_NAME

I took only GENRE here , but i also want it for ALBUM_NAME, ALBUM_ARTIST_NAME

我也想在之后排名最高, 通过total_listened对用户进行排序,并计算其百分位数. 将它们按总排名的top5%,top10%,top20%进行排名 我可以用我想要的加标签的分位数

I also want after to do a top rank % , Order the users by total_listened and calculate their percentile. to rank them by top5% , top10%, top20% of the total I can calculate the tagetted quantile i want with

x = .05
quantile(df.TOTAL_LISTENED, x)

,然后将所有用户的total_listened优于此分位数 但是我不知道如何在联合收割机中正确地计算它...

and then just put all the users's total_listened that is superior to this quantile but i don't know how to calculate it properly in the combine...

谢谢

推荐答案

如前一篇文章中所评论-我建议您提出一个特定的问题,不要在StackOverflow上重做整个项目(如果您需要这样的帮助,请 https://discourse.julialang.org/是一个讨论的好地方,尤其是您需要许多分析步骤并且需要精确定义所需的内容-最好在 https://discourse.julialang上使用. org/您共享了完整的数据集,因为您在此处提供的采样器不足以进行稍后的适当分析,因为它太小了.

As commented in the previous post - I would recommend you to ask a specific question not to redo your whole project on StackOverflow (if you need such help https://discourse.julialang.org/ is a good place to discuss, especially that you need many steps of the analysis and they require a precise definition of what you want exactly - also it would be best if on https://discourse.julialang.org/ you shared your full data set, as the sampler you provide here is not enough to do a proper analysis later since it is too small).

以下是如何添加总计列的示例(我假设您希望按总计对数据进行排序):

Here is an example how to add totals columns (I assume that you want data to be ordered by the totals):

julia> using Random, DataFrames, Pipe

julia> Random.seed!(1234);

julia> df = DataFrame([rand(1:10, 100), rand('a':'k', 100), rand(100)],
                      [:USER_ID, :GENRE_MAIN, :TOTAL_LISTENED]);

julia> function top5(sdf, col, prefix)
           @pipe groupby(sdf, col) |>
                 combine(_, :TOTAL_LISTENED => sum => :SUM) |>
                 sort!(_, :SUM, rev=true) |>
                 first(_, 5) |>
                 vcat(_[!, 1], fill(missing, 5 - nrow(_)),
                      _[!, 2], fill(missing, 5 - nrow(_))) |>
                 DataFrame([[string(prefix, i) for i in 1:5];
                            [string(prefix, i, "_VALUE") for i in 1:5]] .=> _)
       end;

julia> @pipe groupby(df, :USER_ID) |>
             combine(_, x -> top5(x, :GENRE_MAIN, "genre"))
10×11 DataFrame
 Row │ USER_ID  genre1  genre2  genre3  genre4  genre5   genre1_VALUE  genre2_VALUE  genre3_VALUE  genre4_VALUE  genre5_VALUE    
     │ Int64    Char    Char    Char    Char    Char?    Float64       Float64       Float64       Float64       Float64?        
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │       1  d       b       j       e       i             2.34715      2.014         1.68587       0.693472        0.377869
   2 │       4  b       e       d       c       missing       0.90263      0.589418      0.263121      0.107839  missing         
   3 │       8  c       d       i       k       j             1.55335      1.40416       0.977785      0.779468        0.118024
   4 │       2  a       e       f       g       k             1.34841      0.901507      0.87146       0.797606        0.669002
   5 │      10  a       e       f       i       d             1.60554      1.07311       0.820425      0.757363        0.678598
   6 │       7  f       i       g       c       a             2.59654      1.49654       1.15944       0.670488        0.258173
   7 │       9  i       b       e       a       g             1.57373      0.954117      0.603848      0.338918        0.133201
   8 │       5  f       g       c       k       d             1.33899      0.722283      0.664457      0.54016         0.507337
   9 │       3  d       c       f       h       e             1.63695      0.919088      0.544296      0.531262        0.0540101
  10 │       6  d       g       f       j       i             1.68768      0.97688       0.333207      0.259212        0.0636912

这篇关于Julia Dataframe结合了特定的计算和转置的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆