How to make good reproducible Apache Spark examples

Problem Description

I've been spending a fair amount of time reading through some questions with the pyspark and spark-dataframe tags, and very often I find that posters don't provide enough information to truly understand their question. I usually comment asking them to post an MCVE, but sometimes getting them to show some sample input/output data is like pulling teeth.

Perhaps part of the problem is that people just don't know how to easily create an MCVE for spark-dataframes. I think it would be useful to have a spark-dataframe version of this pandas question as a guide that can be linked.

So how does one go about creating a good, reproducible example?

Solution

Provide small sample data, that can be easily recreated.

At the very least, posters should provide a couple of rows and columns of their dataframe, along with code that can be used to easily recreate it. By easy, I mean cut and paste. Make it as small as possible while still demonstrating your problem.


I have the following dataframe:

+-----+---+-----+----------+
|index|  X|label|      date|
+-----+---+-----+----------+
|    1|  1|    A|2017-01-01|
|    2|  3|    B|2017-01-02|
|    3|  5|    A|2017-01-03|
|    4|  7|    B|2017-01-04|
+-----+---+-----+----------+

which can be created with this code:

df = sqlCtx.createDataFrame(
    [
        (1, 1, 'A', '2017-01-01'),
        (2, 3, 'B', '2017-01-02'),
        (3, 5, 'A', '2017-01-03'),
        (4, 7, 'B', '2017-01-04')
    ],
    ('index', 'X', 'label', 'date')
)
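
Once pasted, the snippet above can be checked quickly to confirm the data was recreated as intended. A minimal sanity check, using nothing beyond the df just defined:

df.show()
df.printSchema()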


Show the desired output.

Ask your specific question and show us your desired output.


How can I create a new column 'is_divisible' that has the value 'yes' if the day of the month of 'date' plus 7 days is divisible by the value in column 'X', and 'no' otherwise?

Desired output:

+-----+---+-----+----------+------------+
|index|  X|label|      date|is_divisible|
+-----+---+-----+----------+------------+
|    1|  1|    A|2017-01-01|         yes|
|    2|  3|    B|2017-01-02|         yes|
|    3|  5|    A|2017-01-03|         yes|
|    4|  7|    B|2017-01-04|          no|
+-----+---+-----+----------+------------+


Explain how to get your output.

Explain, in great detail, how you get your desired output. It helps to show an example calculation.


For instance, in row 1, X = 1 and date = 2017-01-01. Adding 7 days to date yields 2017-01-08. The day of the month is 8, and since 8 is divisible by 1, the answer is 'yes'.

Likewise, for the last row, X = 7 and date = 2017-01-04. Adding 7 days to the date yields 11 as the day of the month. Since 11 % 7 is not 0, the answer is 'no'.
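
For reference, one way this particular logic could be expressed in PySpark is sketched below. This is only an illustration, not necessarily the best approach; it assumes the df created earlier and uses Spark's built-in date functions:

from pyspark.sql import functions as f

# day of month of (date + 7 days), tested for divisibility by X
result = df.withColumn(
    "is_divisible",
    f.when(f.dayofmonth(f.date_add("date", 7)) % f.col("X") == 0, "yes").otherwise("no")
)
result.show()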


Share your existing code.

Show us what you have done or tried, including all* of the code even if it does not work. Tell us where you are getting stuck and if you receive an error, please include the error message.

(*You can leave out the code to create the spark context, but you should include all imports.)


I know how to add a new column that is date plus 7 days but I'm having trouble getting the day of the month as an integer.

from pyspark.sql import functions as f
df.withColumn("next_week", f.date_add("date", 7))


Include versions, imports, and use syntax highlighting
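
For example, the relevant versions can be printed at the top of a question. A minimal sketch (spark here is assumed to be an already-created SparkSession):

import sys
import pyspark

print(sys.version)          # Python version
print(pyspark.__version__)  # PySpark package version
# with an active session, spark.version reports the Spark version:
# print(spark.version)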


For performance tuning posts, include the execution plan
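
In PySpark this can be as simple as pasting the output of explain, run against the df defined earlier (pass True for the extended logical and physical plans):

df.explain(True)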


Parsing spark output files

  • MaxU provided useful code in this answer to help parse Spark output files into a DataFrame.
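
That answer is not reproduced here, but the general idea can be sketched with a small, hypothetical helper that turns show()-style text back into a table via pandas (illustration only):

import pandas as pd

def parse_show_output(text):
    # Keep only the header and data rows (skip the +-----+ border lines)
    lines = [l for l in text.strip().splitlines() if not set(l) <= set("+- ")]
    rows = [[cell.strip() for cell in l.strip("|").split("|")] for l in lines]
    return pd.DataFrame(rows[1:], columns=rows[0])

For example, parse_show_output(pasted_table) returns a pandas DataFrame with columns index, X, label, and date when given the table shown at the top of this answer.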

Other notes.
