Efficient way of using for loops in Scala

Views: 101 | Posted: 2017/3/26 3:53:55 | Tags: scala apache-spark dataframe


Question

I am trying to divide a data frame into n groups based on certain values of its columns, and ended up with the code below. It doesn't look efficient because of the nested for loops, so I am looking for a more elegant way to implement it. Can someone please suggest an approach?

The input will be the column names on which the data frame should be divided. So I have a val storing the distinct values of those columns. It is stored like this:

(0)(0) = F
(0)(1) = M
(1)(0) = drugY
(1)(1) = drugC
(1)(2) = drugX

From these 5 distinct values, the groups to create (one per combination of column values, 6 in total) are as follows:

F and drugY
M and drugY
F and drugC
M and drugC
F and drugX
M and drugX

Recommended answer

I don't really understand what you want to do, but if you want to generate the combinations using the Spark DataFrame API, you can do it like this:

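// Assumes a spark-shell or notebook session where `spark.implicits._` (needed for toDF) is already in scope
val patients = Seq(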
    (1, "f"),
    (2, "m")
).toDF("id", "name")

val drugs = Seq(
    (1, "drugY"),
    (2, "drugC"),
    (3, "drugX")
).toDF("id", "name")

patients.createOrReplaceTempView("patients")
drugs.createOrReplaceTempView("drugs")

sqlContext.sql("select p.id as patient_id, p.name as patient_name, d.id as drug_id, d.name as drug_name  from patients p cross join drugs d").show



+----------+------------+-------+---------+
|patient_id|patient_name|drug_id|drug_name|
+----------+------------+-------+---------+
|         1|           f|      1|    drugY|
|         1|           f|      2|    drugC|
|         1|           f|      3|    drugX|
|         2|           m|      1|    drugY|
|         2|           m|      2|    drugC|
|         2|           m|      3|    drugX|
+----------+------------+-------+---------+

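Or with the DataFrame API:

// crossJoin (available since Spark 2.1) states the cartesian product explicitly;
// a bare join with no condition can be rejected in Spark 2.x unless spark.sql.crossJoin.enabled is set
val cartesian = patients.crossJoin(drugs)

cartesian.show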
+---+----+---+-----+
| id|name| id| name|
+---+----+---+-----+
|  1|   f|  1|drugY|
|  1|   f|  2|drugC|
|  1|   f|  3|drugX|
|  2|   m|  1|drugY|
|  2|   m|  2|drugC|
|  2|   m|  3|drugX|
+---+----+---+-----+
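
For the small list of distinct values described in the question, the same combinations could probably also be built in plain Scala without hand-written nested loops. The following is only a sketch under assumptions: the object name `Combinations`, the value `distinctValues`, and the helper `cartesian` are made up for illustration and are not part of the original post, and the distinct values are assumed to be held as one `Seq[String]` per column.

```scala
object Combinations {
  // Distinct values per column, laid out as in the question
  // (hypothetical names): index 0 is the first column, index 1 the second.
  val distinctValues: Seq[Seq[String]] = Seq(
    Seq("F", "M"),
    Seq("drugY", "drugC", "drugX")
  )

  // Cartesian product over any number of columns: start with a single empty
  // combination, then extend every partial combination with each value of
  // the next column. Adding a column needs no extra loop.
  def cartesian(columns: Seq[Seq[String]]): Seq[Seq[String]] =
    columns.foldLeft(Seq(Seq.empty[String])) { (partial, values) =>
      for {
        combo <- partial
        value <- values
      } yield combo :+ value
    }

  def main(args: Array[String]): Unit =
    cartesian(distinctValues).foreach(c => println(c.mkString(" and ")))
}
```

Because the fold walks over the columns, the nesting depth no longer has to match the number of columns. This only makes sense while the distinct values fit comfortably in driver memory; for anything larger, the cross join shown earlier keeps the work inside Spark.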

Once you have the cross-joined DataFrame, you can use a crosstab to get a table of the frequency distribution:

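// `c` is not defined above; it is assumed to be the DataFrame returned by the SQL cross join,
// which carries the patient_name and drug_name columns
c.stat.crosstab("patient_name", "drug_name").show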

+----------------------+-----+-----+-----+
|patient_name_drug_name|drugC|drugX|drugY|
+----------------------+-----+-----+-----+
|                     m|    1|    1|    1|
|                     f|    1|    1|    1|
+----------------------+-----+-----+-----+
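
To make it clearer what the crosstab computes, here is a rough plain-Scala sketch of the same counting over the six (patient_name, drug_name) pairs from the cross-join output. It is an illustration only, not how Spark implements `crosstab`; the names `CrosstabSketch`, `pairs`, and `crosstab` are hypothetical.

```scala
object CrosstabSketch {
  // (patient_name, drug_name) pairs, matching the rows of the cross-join output.
  val pairs: Seq[(String, String)] = for {
    patient <- Seq("f", "m")
    drug    <- Seq("drugY", "drugC", "drugX")
  } yield (patient, drug)

  // For each row value (first element), count how often each column value
  // (second element) occurs alongside it.
  def crosstab(rows: Seq[(String, String)]): Map[String, Map[String, Int]] =
    rows.groupBy(_._1).map { case (rowKey, group) =>
      rowKey -> group.groupBy(_._2).map { case (colKey, hits) => colKey -> hits.size }
    }

  def main(args: Array[String]): Unit =
    crosstab(pairs).foreach { case (patient, counts) =>
      println(s"$patient: " + counts.toSeq.sorted.mkString(", "))
    }
}
```

Every count is 1 here because a cross join produces each pair exactly once; on real data the counts would reflect how many rows fall into each group.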
