How to make good reproducible Apache Spark examples


Question

I've spent a fair amount of time reading questions with the pyspark and spark-dataframe tags, and very often I find that posters don't provide enough information to truly understand their question. I usually comment asking them to post an MCVE, but sometimes getting them to show some sample input/output data is like pulling teeth.

Perhaps part of the problem is that people just don't know how to easily create an MCVE for spark-dataframes. I think it would be useful to have a spark-dataframe version of this pandas question as a guide that can be linked.

So how does one go about creating a good, reproducible example?

Solution

Provide small sample data that can be easily recreated.

At the very least, posters should provide a couple of rows and columns from their dataframe, along with code that can be used to easily create it. By easy, I mean cut and paste. Make it as small as possible while still demonstrating your problem.

I have the following dataframe:

+-----+---+-----+----------+
|index|  X|label|      date|
+-----+---+-----+----------+
|    1|  1|    A|2017-01-01|
|    2|  3|    B|2017-01-02|
|    3|  5|    A|2017-01-03|
|    4|  7|    B|2017-01-04|
+-----+---+-----+----------+
 

which can be created with this code:

df = sqlCtx.createDataFrame(
    [
        (1, 1, 'A', '2017-01-01'),
        (2, 3, 'B', '2017-01-02'),
        (3, 5, 'A', '2017-01-03'),
        (4, 7, 'B', '2017-01-04')
    ],
    ('index', 'X', 'label', 'date')
)
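
As a side note (not part of the original post): on Spark 2.x the same frame would typically be created from a SparkSession rather than a SQLContext, so an equivalent, self-contained snippet might look like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Same rows and column names as above, built through the Spark 2.x entry point.
df = spark.createDataFrame(
    [
        (1, 1, 'A', '2017-01-01'),
        (2, 3, 'B', '2017-01-02'),
        (3, 5, 'A', '2017-01-03'),
        (4, 7, 'B', '2017-01-04')
    ],
    ('index', 'X', 'label', 'date')
)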
 


Show the desired output.

Ask your specific question and show us your desired output.

How can I create a new column 'is_divisible' that has the value 'yes' if the day of month of 'date' plus 7 days is divisible by the value in column 'X', and 'no' otherwise?

Desired output:

+-----+---+-----+----------+------------+
|index|  X|label|      date|is_divisible|
+-----+---+-----+----------+------------+
|    1|  1|    A|2017-01-01|         yes|
|    2|  3|    B|2017-01-02|         yes|
|    3|  5|    A|2017-01-03|         yes|
|    4|  7|    B|2017-01-04|          no|
+-----+---+-----+----------+------------+
 


Explain how to get your output.

Explain, in great detail, how you get your desired output. It helps to show an example calculation.

For instance, in row 1, X = 1 and date = 2017-01-01. Adding 7 days to the date yields 2017-01-08. The day of the month is 8, and since 8 is divisible by 1, the answer is 'yes'.

Likewise, for the last row, X = 7 and date = 2017-01-04. Adding 7 days to the date yields 11 as the day of the month. Since 11 % 7 is not 0, the answer is 'no'.
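
The same two checks can be reproduced with a few lines of plain Python (a quick sanity check, not part of the original post):

from datetime import date, timedelta

# Row 1: X = 1, date = 2017-01-01
d = date(2017, 1, 1) + timedelta(days=7)
print(d, d.day % 1 == 0)   # 2017-01-08, day 8  -> True,  so 'yes'

# Last row: X = 7, date = 2017-01-04
d = date(2017, 1, 4) + timedelta(days=7)
print(d, d.day % 7 == 0)   # 2017-01-11, day 11 -> False, so 'no'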


Share your existing code.

Show us what you have done or tried, including all* of the code, even if it does not work. Tell us where you are getting stuck, and if you receive an error, please include the error message.

(*You can leave out the code that creates the Spark context, but you should include all imports.)

I know how to add a new column that is date plus 7 days, but I'm having trouble getting the day of the month as an integer.

from pyspark.sql import functions as f

# adds a column with the date shifted forward by 7 days
df = df.withColumn("next_week", f.date_add("date", 7))
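
For illustration only (this is not part of the original post), one possible way to finish that step, building on the next_week column created above, is to take the day of month with f.dayofmonth and apply the divisibility test:

from pyspark.sql import functions as f

# Sketch: day of the month of next_week, then test divisibility by X.
result = df.withColumn(
    "is_divisible",
    f.when(f.dayofmonth("next_week") % f.col("X") == 0, "yes").otherwise("no")
)
result.show()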
 


Include versions, imports, and use syntax highlighting
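
For example (a sketch, not part of the original post; the exact form is up to the poster), the relevant versions can be printed and pasted straight into the question:

import sys
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

print(sys.version)          # Python version
print(pyspark.__version__)  # PySpark package version
print(spark.version)        # version of the running Spark session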


For performance tuning posts, include the execution plan
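
For example (df here stands for whatever DataFrame the question is about), the plan can be printed with explain() and pasted into the post:

# Physical plan only:
df.explain()

# Parsed, analyzed, optimized and physical plans:
df.explain(True)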


Parsing spark output files

• MaxU provided useful code in this answer to help parse Spark output files into a DataFrame.
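
The linked code is not reproduced here. As a rough sketch of the idea only (assuming the input is the plain text printed by df.show() and that keeping every column as a string is acceptable), a minimal parser might look like this:

import pandas as pd

def parse_show_output(text):
    # Keep only the pipe-delimited lines, dropping the +---+ separator rows.
    lines = [ln for ln in text.strip().splitlines() if ln.startswith("|")]
    rows = [[cell.strip() for cell in ln.strip("|").split("|")] for ln in lines]
    header, data = rows[0], rows[1:]
    return pd.DataFrame(data, columns=header)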

Other notes.
