How to make good reproducible Apache Spark examples
Question
I've been spending a fair amount of time reading through some questions with the pyspark and spark-dataframe tags and very often I find that posters don't provide enough information to truly understand their question. I usually comment asking them to post an MCVE but sometimes getting them to show some sample input/output data is like pulling teeth.
Perhaps part of the problem is that people just don't know how to easily create an MCVE for spark-dataframes. I think it would be useful to have a spark-dataframe version of this pandas question as a guide that can be linked.
So how does one go about creating a good, reproducible example?
Solution
Provide small sample data that can be easily recreated.
At the very least, posters should provide a couple of rows and columns from their dataframe, along with code that can be used to easily create it. By easy, I mean cut and paste. Make it as small as possible to demonstrate your problem.
I have the following dataframe:
+-----+---+-----+----------+
|index|  X|label|      date|
+-----+---+-----+----------+
|    1|  1|    A|2017-01-01|
|    2|  3|    B|2017-01-02|
|    3|  5|    A|2017-01-03|
|    4|  7|    B|2017-01-04|
+-----+---+-----+----------+
which can be created with this code:
df = sqlCtx.createDataFrame(
    [
        (1, 1, 'A', '2017-01-01'),
        (2, 3, 'B', '2017-01-02'),
        (3, 5, 'A', '2017-01-03'),
        (4, 7, 'B', '2017-01-04')
    ],
    ('index', 'X', 'label', 'date')
)
Show the desired output.
Ask your specific question and show us your desired output.
How can I create a new column 'is_divisible' that has the value 'yes' if the day of month of the 'date' plus 7 days is divisible by the value in column 'X', and 'no' otherwise?
Desired output:
+-----+---+-----+----------+------------+
|index|  X|label|      date|is_divisible|
+-----+---+-----+----------+------------+
|    1|  1|    A|2017-01-01|         yes|
|    2|  3|    B|2017-01-02|         yes|
|    3|  5|    A|2017-01-03|         yes|
|    4|  7|    B|2017-01-04|          no|
+-----+---+-----+----------+------------+
Explain how to get your output.
Explain, in great detail, how you get your desired output. It helps to show an example calculation.
For instance in row 1, the X = 1 and date = 2017-01-01. Adding 7 days to date yields 2017-01-08. The day of the month is 8 and since 8 is divisible by 1, the answer is 'yes'.
Likewise, for the last row X = 7 and the date = 2017-01-04. Adding 7 to the date yields 11 as the day of the month. Since 11 % 7 is not 0, the answer is 'no'.
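Before posting, it can help to check that the worked calculation and the desired output actually agree. The snippet below reproduces the rule in plain Python for the four sample rows. It is only a sanity check and not a Spark solution, and the helper name is_divisible is made up for this illustration.

```python
from datetime import date, timedelta


def is_divisible(date_str, x):
    """Return 'yes' if the day of month of date_str + 7 days is divisible by x."""
    next_week = date.fromisoformat(date_str) + timedelta(days=7)
    return 'yes' if next_week.day % x == 0 else 'no'


# (X, date) pairs taken from the sample dataframe
rows = [(1, '2017-01-01'), (3, '2017-01-02'), (5, '2017-01-03'), (7, '2017-01-04')]
for x, d in rows:
    print(d, x, is_divisible(d, x))
```

If the printed answers differ from the desired output table, the question contains a contradiction that is worth fixing before anyone tries to answer it.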
Share your existing code.
Show us what you have done or tried, including all* of the code even if it does not work. Tell us where you are getting stuck and if you receive an error, please include the error message.
(*You can leave out the code to create the spark context, but you should include all imports.)
I know how to add a new column that is date plus 7 days but I'm having trouble getting the day of the month as an integer.
from pyspark.sql import functions as f
df.withColumn("next_week", f.date_add("date", 7))
Include versions, imports, and use syntax highlighting
- Full details in this answer written by desertnaut.
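One possible way to collect version information for a post is sketched below. It uses only the Python standard library; the helper name report_versions and the default package list are assumptions made for this illustration, and the snippet is not taken from the linked answer.

```python
import platform
from importlib import metadata


def report_versions(packages=("pyspark", "pandas", "numpy")):
    """Map each package name to its installed version, or None if it is absent."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in report_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

This reports the locally installed Python packages only. With a live session, spark.version gives the version of Spark that is actually running, which may differ from the installed pyspark package.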
For performance tuning posts, include the execution plan
- Full details in this answer written by user8371915.
- It helps to use standardized names for contexts.
Parsing spark output files
- MaxU provided useful code in this answer to help parse Spark output files into a DataFrame.
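For readers who only have the printed table from a question, a minimal standard-library sketch of the idea looks like this. It is not MaxU's code; the function name parse_show_output is hypothetical, every value comes back as a string, and cell values that themselves contain a | character are not handled.

```python
def parse_show_output(text):
    """Parse the text printed by DataFrame.show() into (columns, rows) of strings."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    # Border lines start with '+'; header and data lines start with '|'.
    table = [ln for ln in lines if ln.startswith('|')]
    # Drop the outer pipes, split on the inner ones, and trim the padding.
    cells = [[c.strip() for c in ln[1:-1].split('|')] for ln in table]
    return cells[0], cells[1:]


sample = """
+-----+---+-----+----------+
|index|  X|label|      date|
+-----+---+-----+----------+
|    1|  1|    A|2017-01-01|
|    2|  3|    B|2017-01-02|
+-----+---+-----+----------+
"""

columns, rows = parse_show_output(sample)
print(columns)
print(rows)
```

The resulting columns and rows can then be passed to createDataFrame, after casting the strings to the types the original dataframe is meant to have.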
Other notes.
- Be sure to read How to Ask and How to create a Minimal, Complete, and Verifiable example first.
- Read the other answers to this question, which are linked above.
- Have a good, descriptive title.
- Be polite. People on SO are volunteers, so ask nicely.