Efficient way of using for loops in Scala
Question
I am trying to divide a data frame into n groups based on certain values of its columns, and ended up with the code below. But the nested for loops don't look efficient, so I am looking for a more elegant approach to implementing it. Can someone please provide input?
The input will be the column names on which the data frame should be divided. So I have a val storing the distinct values of those columns. It is stored like this:
(0)(0) = F
(0)(1) = M
(1)(0) = drugY
(1)(1) = drugC
(1)(2) = drugX
So in total I have 6 groups (2 x 3 combinations) created from the column values, as follows:
F and drugY
M and drugY
F and drugC
M and drugC
F and drugX
M and drugX
Recommended answer
I don't really understand what you want to do, but if you want to generate the combinations using the Spark DataFrame API, you can do it like this:
val patients = Seq(
(1, "f"),
(2, "m")
).toDF("id", "name")
val drugs = Seq(
(1, "drugY"),
(2, "drugC"),
(3, "drugX")
).toDF("id", "name")
patients.createOrReplaceTempView("patients")
drugs.createOrReplaceTempView("drugs")
sqlContext.sql("select p.id as patient_id, p.name as patient_name, d.id as drug_id, d.name as drug_name from patients p cross join drugs d").show
+----------+------------+-------+---------+
|patient_id|patient_name|drug_id|drug_name|
+----------+------------+-------+---------+
|         1|           f|      1|    drugY|
|         1|           f|      2|    drugC|
|         1|           f|      3|    drugX|
|         2|           m|      1|    drugY|
|         2|           m|      2|    drugC|
|         2|           m|      3|    drugX|
+----------+------------+-------+---------+
Or with the DataFrame API:
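// crossJoin (Spark 2.1+) makes the Cartesian product explicit; a plain join with no
// condition can be rejected by Spark 2.x unless spark.sql.crossJoin.enabled is set
val cartesian = patients.crossJoin(drugs)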
cartesian.show
+---+----+---+-----+
| id|name| id| name|
+---+----+---+-----+
|  1|   f|  1|drugY|
|  1|   f|  2|drugC|
|  1|   f|  3|drugX|
|  2|   m|  1|drugY|
|  2|   m|  2|drugC|
|  2|   m|  3|drugX|
+---+----+---+-----+
After that you can use a crosstab to get a table of the frequency distribution:
// c is assumed to be the DataFrame returned by the cross join SQL query above
// (the one with the patient_name and drug_name columns)
c.stat.crosstab("patient_name","drug_name").show
+----------------------+-----+-----+-----+
|patient_name_drug_name|drugC|drugX|drugY|
+----------------------+-----+-----+-----+
|                     m|    1|    1|    1|
|                     f|    1|    1|    1|
+----------------------+-----+-----+-----+
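If the distinct values have already been collected into nested sequences, as laid out in the question, the combinations can also be built in plain Scala without hand-written nested for loops. The following is only a minimal sketch under that assumption: the object name `CombinationsSketch`, the helper `combinations`, and the value names are hypothetical and do not come from the original post. It folds over the columns, so it works for any number of them.

```scala
object CombinationsSketch {

  // Hypothetical helper (not from the original answer): builds every combination
  // that takes exactly one value from each inner sequence.
  def combinations[A](values: Seq[Seq[A]]): Seq[Seq[A]] =
    values.foldLeft(Seq(Seq.empty[A])) { (acc, column) =>
      for {
        prefix <- acc    // every partial combination built so far
        value  <- column // extended by each distinct value of the next column
      } yield prefix :+ value
    }

  // Distinct values laid out as in the question: index 0 holds F/M, index 1 the drugs.
  val distinctValues: Seq[Seq[String]] = Seq(
    Seq("F", "M"),
    Seq("drugY", "drugC", "drugX")
  )

  val groups: Seq[Seq[String]] = combinations(distinctValues)

  def main(args: Array[String]): Unit =
    groups.foreach(group => println(group.mkString(" and ")))
}
```

For the question's input this yields six combinations, one for each pairing of a first-column value with a second-column value. Adding a third column only means adding another inner sequence; the helper itself does not change.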