How to make good reproducible Apache Spark examples


Question

I've spent a fair amount of time reading questions with the pyspark and spark-dataframe tags, and very often I find that posters don't provide enough information to truly understand their question. I usually comment asking them to post an MCVE, but sometimes getting them to show some sample input/output data is like pulling teeth.

Perhaps part of the problem is that people just don't know how to easily create an MCVE for spark-dataframes. I think it would be useful to have a spark-dataframe version of this pandas question as a guide that can be linked.

So how does one go about creating a good, reproducible example?

Solution

Provide small sample data that can be easily recreated.

At the very least, posters should provide a couple of rows and columns from their dataframe, along with code that can be used to easily create it. By easy, I mean cut and paste. Make it as small as possible while still demonstrating your problem.

I have the following dataframe:

+-----+---+-----+----------+
|index|  X|label|      date|
+-----+---+-----+----------+
|    1|  1|    A|2017-01-01|
|    2|  3|    B|2017-01-02|
|    3|  5|    A|2017-01-03|
|    4|  7|    B|2017-01-04|
+-----+---+-----+----------+
 

which can be created with this code:

df = sqlCtx.createDataFrame(
    [
        (1, 1, 'A', '2017-01-01'),
        (2, 3, 'B', '2017-01-02'),
        (3, 5, 'A', '2017-01-03'),
        (4, 7, 'B', '2017-01-04')
    ],
    ('index', 'X', 'label', 'date')
)
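
As a side note (not part of the original post): on Spark 2.x the same frame would typically be created from a SparkSession rather than a SQLContext, so an equivalent, self-contained snippet might look like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Same rows and column names as above, built through the Spark 2.x entry point.
df = spark.createDataFrame(
    [
        (1, 1, 'A', '2017-01-01'),
        (2, 3, 'B', '2017-01-02'),
        (3, 5, 'A', '2017-01-03'),
        (4, 7, 'B', '2017-01-04')
    ],
    ('index', 'X', 'label', 'date')
)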
 


Show the desired output.

Ask your specific question and show us your desired output.

How can I create a new column 'is_divisible' that has the value 'yes' if the day of month of 'date' plus 7 days is divisible by the value in column 'X', and 'no' otherwise?

Desired output:

+-----+---+-----+----------+------------+
|index|  X|label|      date|is_divisible|
+-----+---+-----+----------+------------+
|    1|  1|    A|2017-01-01|         yes|
|    2|  3|    B|2017-01-02|         yes|
|    3|  5|    A|2017-01-03|         yes|
|    4|  7|    B|2017-01-04|          no|
+-----+---+-----+----------+------------+
 


Explain how to get your output.

Explain, in great detail, how you get your desired output. It helps to show an example calculation.

For instance, in row 1, X = 1 and date = 2017-01-01. Adding 7 days to the date yields 2017-01-08. The day of the month is 8, and since 8 is divisible by 1, the answer is 'yes'.

Likewise, for the last row, X = 7 and date = 2017-01-04. Adding 7 days to the date yields 11 as the day of the month. Since 11 % 7 is not 0, the answer is 'no'.
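
The same two checks can be reproduced with a few lines of plain Python (a quick sanity check, not part of the original post):

from datetime import date, timedelta

# Row 1: X = 1, date = 2017-01-01
d = date(2017, 1, 1) + timedelta(days=7)
print(d, d.day % 1 == 0)   # 2017-01-08, day 8  -> True,  so 'yes'

# Last row: X = 7, date = 2017-01-04
d = date(2017, 1, 4) + timedelta(days=7)
print(d, d.day % 7 == 0)   # 2017-01-11, day 11 -> False, so 'no'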


Share your existing code.

Show us what you have done or tried, including all* of the code, even if it does not work. Tell us where you are getting stuck, and if you receive an error, please include the error message.

(*You can leave out the code that creates the Spark context, but you should include all imports.)

I know how to add a new column that is date plus 7 days, but I'm having trouble getting the day of the month as an integer.

from pyspark.sql import functions as f

# adds a column with the date shifted forward by 7 days
df = df.withColumn("next_week", f.date_add("date", 7))
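
For illustration only (this is not part of the original post), one possible way to finish that step, building on the next_week column created above, is to take the day of month with f.dayofmonth and apply the divisibility test:

from pyspark.sql import functions as f

# Sketch: day of the month of next_week, then test divisibility by X.
result = df.withColumn(
    "is_divisible",
    f.when(f.dayofmonth("next_week") % f.col("X") == 0, "yes").otherwise("no")
)
result.show()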
 


Include versions, imports, and use syntax highlighting
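
For example (a sketch, not part of the original post; the exact form is up to the poster), the relevant versions can be printed and pasted straight into the question:

import sys
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

print(sys.version)          # Python version
print(pyspark.__version__)  # PySpark package version
print(spark.version)        # version of the running Spark session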


For performance tuning posts, include the execution plan
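
For example (df here stands for whatever DataFrame the question is about), the plan can be printed with explain() and pasted into the post:

# Physical plan only:
df.explain()

# Parsed, analyzed, optimized and physical plans:
df.explain(True)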


Parsing spark output files

• MaxU provided useful code in this answer to help parse Spark output files into a DataFrame.
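
The linked code is not reproduced here. As a rough sketch of the idea only (assuming the input is the plain text printed by df.show() and that keeping every column as a string is acceptable), a minimal parser might look like this:

import pandas as pd

def parse_show_output(text):
    # Keep only the pipe-delimited lines, dropping the +---+ separator rows.
    lines = [ln for ln in text.strip().splitlines() if ln.startswith("|")]
    rows = [[cell.strip() for cell in ln.strip("|").split("|")] for ln in lines]
    header, data = rows[0], rows[1:]
    return pd.DataFrame(data, columns=header)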

Other notes.
