How to make good reproducible Apache Spark examples
I've been spending a fair amount of time reading through some questions with the pyspark and spark-dataframe tags and very often I find that posters don't provide enough information to truly understand their question. I usually comment asking them to post an MCVE but sometimes getting them to show some sample input/output data is like pulling teeth.
Perhaps part of the problem is that people just don't know how to easily create an MCVE for spark-dataframes. I think it would be useful to have a spark-dataframe version of this pandas question as a guide that can be linked.
So how does one go about creating a good, reproducible example?
Provide small sample data, that can be easily recreated.
At the very least, posters should provide a couple of rows and columns on their dataframe and code that can be used to easily create it. By easy, I mean cut and paste. Make it as small as possible to demonstrate your problem.
I have the following dataframe:
+-----+---+-----+----------+
|index| X|label| date|
+-----+---+-----+----------+
| 1| 1| A|2017-01-01|
| 2| 3| B|2017-01-02|
| 3| 5| A|2017-01-03|
| 4| 7| B|2017-01-04|
+-----+---+-----+----------+
which can be created with this code:
df = sqlCtx.createDataFrame(
    [
        (1, 1, 'A', '2017-01-01'),
        (2, 3, 'B', '2017-01-02'),
        (3, 5, 'A', '2017-01-03'),
        (4, 7, 'B', '2017-01-04')
    ],
    ('index', 'X', 'label', 'date')
)
Show the desired output.
Ask your specific question and show us your desired output.
How can I create a new column 'is_divisible' that has the value 'yes' if the day of month of the 'date' plus 7 days is divisible by the value in column 'X', and 'no' otherwise?
Desired output:
+-----+---+-----+----------+------------+
|index| X|label| date|is_divisible|
+-----+---+-----+----------+------------+
| 1| 1| A|2017-01-01| yes|
| 2| 3| B|2017-01-02| yes|
| 3| 5| A|2017-01-03| yes|
| 4| 7| B|2017-01-04| no|
+-----+---+-----+----------+------------+
Explain how to get your output.
Explain, in great detail, how you get your desired output. It helps to show an example calculation.
For instance, in row 1, X = 1 and date = 2017-01-01. Adding 7 days to the date yields 2017-01-08. The day of the month is 8, and since 8 is divisible by 1, the answer is 'yes'.
Likewise, for the last row X = 7 and the date = 2017-01-04. Adding 7 to the date yields 11 as the day of the month. Since 11 % 7 is not 0, the answer is 'no'.
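The two calculations above can be checked with a few lines of plain Python, independent of Spark (the function name `is_divisible` here just mirrors the desired column name):

```python
from datetime import date, timedelta

def is_divisible(d, x):
    # Add 7 days, take the day of the month, and test divisibility by X
    day = (d + timedelta(days=7)).day
    return 'yes' if day % x == 0 else 'no'

print(is_divisible(date(2017, 1, 1), 1))  # row 1: day 8, 8 % 1 == 0, so 'yes'
print(is_divisible(date(2017, 1, 4), 7))  # row 4: day 11, 11 % 7 != 0, so 'no'
```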
Share your existing code.
Show us what you have done or tried, including all* of the code even if it does not work. Tell us where you are getting stuck and if you receive an error, please include the error message.
(*You can leave out the code to create the spark context, but you should include all imports.)
I know how to add a new column that is date plus 7 days, but I'm having trouble getting the day of the month as an integer.
from pyspark.sql import functions as f
df.withColumn("next_week", f.date_add("date", 7))
Include versions, imports, and use syntax highlighting
- Full details in this answer written by desertnaut.
For performance tuning posts, include the execution plan
- Full details in this answer written by Alper t. Turker.
- It helps to use standardized names for contexts.
Parsing Spark output files
- MaxU provided useful code in this answer to help parse Spark output files into a DataFrame.
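As a minimal sketch of that idea (not MaxU's exact code): Spark writes one `part-*` file per partition, so for CSV output the files can be globbed and concatenated into a single pandas DataFrame:

```python
import glob
import os
import pandas as pd

def parse_spark_output(dir_path):
    # Spark writes one 'part-*' file per partition; read each CSV part
    # and concatenate them into a single pandas DataFrame
    parts = sorted(glob.glob(os.path.join(dir_path, 'part-*')))
    return pd.concat((pd.read_csv(p) for p in parts), ignore_index=True)
```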
Other notes.
- Be sure to read How to Ask and How to create a Minimal, Complete, and Verifiable example first.
- Read the other answers to this question, which are linked above.
- Have a good, descriptive title.
- Be polite. People on SO are volunteers, so ask nicely.