How to make good reproducible Apache Spark examples
I've been spending a fair amount of time reading through some questions with the pyspark and spark-dataframe tags and very often I find that posters don't provide enough information to truly understand their question. I usually comment asking them to post an MCVE but sometimes getting them to show some sample input/output data is like pulling teeth.
Perhaps part of the problem is that people just don't know how to easily create an MCVE for spark-dataframes. I think it would be useful to have a spark-dataframe version of this pandas question as a guide that can be linked.
So how does one go about creating a good, reproducible example?
Provide small sample data, that can be easily recreated.
At the very least, posters should provide a couple of rows and columns on their dataframe and code that can be used to easily create it. By easy, I mean cut and paste. Make it as small as possible to demonstrate your problem.
I have the following dataframe:
+-----+---+-----+----------+
|index| X|label| date|
+-----+---+-----+----------+
| 1| 1| A|2017-01-01|
| 2| 3| B|2017-01-02|
| 3| 5| A|2017-01-03|
| 4| 7| B|2017-01-04|
+-----+---+-----+----------+
which can be created with this code:
df = sqlCtx.createDataFrame(
    [
        (1, 1, 'A', '2017-01-01'),
        (2, 3, 'B', '2017-01-02'),
        (3, 5, 'A', '2017-01-03'),
        (4, 7, 'B', '2017-01-04')
    ],
    ('index', 'X', 'label', 'date')
)
Show the desired output.
Ask your specific question and show us your desired output.
How can I create a new column 'is_divisible' that has the value 'yes' if the day of month of the 'date' plus 7 days is divisible by the value in column 'X', and 'no' otherwise?
Desired output:
+-----+---+-----+----------+------------+
|index| X|label| date|is_divisible|
+-----+---+-----+----------+------------+
| 1| 1| A|2017-01-01| yes|
| 2| 3| B|2017-01-02| yes|
| 3| 5| A|2017-01-03| yes|
| 4| 7| B|2017-01-04| no|
+-----+---+-----+----------+------------+
Explain how to get your output.
Explain, in great detail, how you get your desired output. It helps to show an example calculation.
For instance, in row 1, X = 1 and date = 2017-01-01. Adding 7 days to the date yields 2017-01-08. The day of the month is 8, and since 8 is divisible by 1, the answer is 'yes'.
Likewise, for the last row X = 7 and the date = 2017-01-04. Adding 7 to the date yields 11 as the day of the month. Since 11 % 7 is not 0, the answer is 'no'.
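The two calculations above can be checked with a few lines of plain Python, independent of Spark (the function name `is_divisible` here just mirrors the desired column name):

```python
from datetime import date, timedelta

def is_divisible(d, x):
    # Add 7 days, take the day of the month, and test divisibility by X
    day = (d + timedelta(days=7)).day
    return 'yes' if day % x == 0 else 'no'

print(is_divisible(date(2017, 1, 1), 1))  # row 1: day 8, 8 % 1 == 0, so 'yes'
print(is_divisible(date(2017, 1, 4), 7))  # row 4: day 11, 11 % 7 != 0, so 'no'
```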
Share your existing code.
Show us what you have done or tried, including all* of the code even if it does not work. Tell us where you are getting stuck and if you receive an error, please include the error message.
(*You can leave out the code to create the spark context, but you should include all imports.)
I know how to add a new column that is date plus 7 days, but I'm having trouble getting the day of the month as an integer.
from pyspark.sql import functions as f
df.withColumn("next_week", f.date_add("date", 7))
Include versions, imports, and use syntax highlighting
- Full details in this answer written by desertnaut.
For performance tuning posts, include the execution plan
- Full details in this answer written by Alper t. Turker.
- It helps to use standardized names for contexts.
Parsing Spark output files
- MaxU provided useful code in this answer to help parse Spark output files into a DataFrame.
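As a minimal sketch of that idea (not MaxU's exact code): Spark writes one `part-*` file per partition, so for CSV output the files can be globbed and concatenated into a single pandas DataFrame:

```python
import glob
import os
import pandas as pd

def parse_spark_output(dir_path):
    # Spark writes one 'part-*' file per partition; read each CSV part
    # and concatenate them into a single pandas DataFrame
    parts = sorted(glob.glob(os.path.join(dir_path, 'part-*')))
    return pd.concat((pd.read_csv(p) for p in parts), ignore_index=True)
```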
Other notes.
- Be sure to read How to Ask and How to create a Minimal, Complete, and Verifiable example first.
- Read the other answers to this question, which are linked above.
- Have a good, descriptive title.
- Be polite. People on SO are volunteers, so ask nicely.