Find missing rows by timestamp + ID with sparklyr


Problem description

I am trying to find missing timestamps. There are many solutions for that problem on its own; however, I also want to find where a timestamp is missing per ID.

So, for example, the test dataset would look like this:

elemuid timestamp
1232    2018-02-10 23:00:00
1232    2018-02-10 23:01:00
1232    2018-02-10 22:58:00
1674    2018-02-10 22:40:00
1674    2018-02-10 22:39:00
1674    2018-02-10 22:37:00
1674    2018-02-10 22:35:00

The solution should look like this:

elemuid timestamp
1232    2018-02-10 22:59:00
1674    2018-02-10 22:38:00
1674    2018-02-10 22:36:00

My problem is that I can only use dplyr, because I would like to use this code with sparklyr as well. I would be really happy about your help!
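For reference, the test dataset above can be recreated locally and copied into Spark roughly like this (a sketch; `sc` is assumed to be an existing `spark_connect()` connection, and the table name `"df"` is arbitrary):

```r
library(sparklyr)
library(dplyr)

# sc is assumed to already exist, e.g.:
# sc <- spark_connect(master = "local")

# Build the example data locally, then copy it to Spark
local_df <- tibble::tribble(
  ~elemuid, ~timestamp,
  1232, "2018-02-10 23:00:00",
  1232, "2018-02-10 23:01:00",
  1232, "2018-02-10 22:58:00",
  1674, "2018-02-10 22:40:00",
  1674, "2018-02-10 22:39:00",
  1674, "2018-02-10 22:37:00",
  1674, "2018-02-10 22:35:00"
)

df <- copy_to(sc, local_df, "df", overwrite = TRUE)
```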

Answer

For simplicity, let's assume you've already followed the instructions from your previous question (https://stackoverflow.com/q/49871925/6910411), and computed the minimum and maximum (min_max) Epoch time in seconds.
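The min_max values referenced here are not shown in this answer; one way to compute them (a sketch, assuming `df` holds the data from the question, and that dplyr translates `unix_timestamp` to the corresponding Spark SQL function) would be:

```r
library(dplyr)

# Smallest and largest observed timestamps as Epoch seconds,
# collected back to R as a two-element numeric vector
min_max <- df %>%
  summarise(
    min = min(unix_timestamp(timestamp)),
    max = max(unix_timestamp(timestamp))
  ) %>%
  collect() %>%
  unlist()
```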

The remaining steps are quite similar to the ones we followed before:


  • Generate a range of values:

epoch_range <- spark_session(sc) %>% 
  invoke("range", as.integer(min_max[1]), as.integer(min_max[2]), 60L) %>%
  invoke("withColumnRenamed", "id", "timestamp")


  • Compute distinct elemuid:

    elemuids <- df %>% select(elemuid) %>% distinct() %>% spark_dataframe()
    


  • Now, we want to generate a reference table as a Cartesian product of the range and unique ids:

    ref <- epoch_range %>% 
      invoke("crossJoin", elemuids) %>% 
      sdf_register() %>%
      mutate(timestamp = from_unixtime(timestamp, "yyyy-MM-dd HH:mm:ss.SSS"))
    

    You might be tempted to use a more dplyr-ish method:

    sdf_register(epoch_range) %>% mutate(dummy = 1) %>% 
      left_join(sdf_register(elemuids) %>% mutate(dummy = 1), by = "dummy") %>%
      select(-dummy)
    

    This would work fine if the size of the product is small (in that case Spark should use a broadcast join), but will cause complete data skew otherwise, so it is not safe to use in general.

    Finally, we'll outer join the data as before:

    ref %>% left_join(df, by = c("timestamp", "elemuid"))
    

    to fill things out, or (as already explained in the answer provided by akrun, https://stackoverflow.com/users/3732271/akrun) anti join to remove missing points:

    ref %>% anti_join(df, by = c("timestamp", "elemuid"))
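Collecting the anti-joined result back to R (a sketch; row ordering is not guaranteed without an explicit `arrange`) should reproduce the missing rows listed in the question:

```r
library(dplyr)

# Rows present in the reference grid but absent from the data,
# i.e. the missing (elemuid, timestamp) combinations
missing <- ref %>%
  anti_join(df, by = c("timestamp", "elemuid")) %>%
  arrange(elemuid, timestamp) %>%
  collect()
```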
    
