Efficient way of using for loops in Scala


Problem description


I am trying to divide a data frame into n groups based on certain values of its columns, and ended up with the below code. The nested for loops don't look efficient, though, so I am looking for a more elegant way to implement it. Can someone please provide input?


The input will be the names of the columns by which the data frame should be divided. I have a val that stores the distinct values of those columns, laid out like this:

(0)(0) = F
(0)(1) = M
(1)(0) = drugY
(1)(1) = drugC
(1)(2) = drugX


So from these 5 distinct values, the groups should be created from the combinations of column values, as follows:

F and drugY
M and drugY
F and drugC
M and drugC
F and drugX
M and drugX


Recommended answer


I don't really understand what you want to do, but if you want to generate the combinations using the Spark DataFrame API, you can do it like this:

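// Assumes a spark-shell or notebook session, where `spark.implicits._`
// (needed for toDF) is already in scope.
val patients = Seq(
    (1, "f"),
    (2, "m")
).toDF("id", "name")

val drugs = Seq(
    (1, "drugY"),
    (2, "drugC"),
    (3, "drugX")
).toDF("id", "name")

// Register both data frames as temporary views so they can be queried with SQL.
patients.createOrReplaceTempView("patients")
drugs.createOrReplaceTempView("drugs")

// The cross join pairs every patient row with every drug row.
// The result is named `c` because the crosstab step further down refers to it.
val c = sqlContext.sql("select p.id as patient_id, p.name as patient_name, d.id as drug_id, d.name as drug_name from patients p cross join drugs d")

c.show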



+----------+------------+-------+---------+
|patient_id|patient_name|drug_id|drug_name|
+----------+------------+-------+---------+
|         1|           f|      1|    drugY|
|         1|           f|      2|    drugC|
|         1|           f|      3|    drugX|
|         2|           m|      1|    drugY|
|         2|           m|      2|    drugC|
|         2|           m|      3|    drugX|
+----------+------------+-------+---------+

Or with the DataFrame API:

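// crossJoin (Spark 2.1+) makes the Cartesian product explicit; a bare join(drugs) may be rejected in Spark 2.x
val cartesian = patients.crossJoin(drugs)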

cartesian.show
+---+----+---+-----+
| id|name| id| name|
+---+----+---+-----+
|  1|   f|  1|drugY|
|  1|   f|  2|drugC|
|  1|   f|  3|drugX|
|  2|   m|  1|drugY|
|  2|   m|  2|drugC|
|  2|   m|  3|drugX|
+---+----+---+-----+


After that you can use a crosstab on the SQL result, which has the patient_name and drug_name columns, to get a table of the frequency distribution:

c.stat.crosstab("patient_name","drug_name").show

+----------------------+-----+-----+-----+
|patient_name_drug_name|drugC|drugX|drugY|
+----------------------+-----+-----+-----+
|                     m|    1|    1|    1|
|                     f|    1|    1|    1|
+----------------------+-----+-----+-----+
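
If the distinct values are already held in a local collection, as in the question, the combinations can also be produced in plain Scala without nested loops. The following is only a minimal sketch under that assumption: the object name `CombinationsSketch` and the names `distinctValues`, `pairs` and `combinations` are made up for illustration and do not come from the question's code. It shows a single for-comprehension for the two-column case and a `foldLeft` that handles any number of columns.

```scala
object CombinationsSketch {
  // Distinct values per column, laid out as in the question (hypothetical data):
  // distinctValues(0) holds the first column's values, distinctValues(1) the second's.
  val distinctValues: Seq[Seq[String]] = Seq(
    Seq("F", "M"),
    Seq("drugY", "drugC", "drugX")
  )

  // Two columns: one for-comprehension replaces the nested loops.
  // The drug generator comes first so the pairs follow the order listed in the question.
  val pairs: Seq[(String, String)] =
    for {
      drug <- distinctValues(1)
      sex  <- distinctValues(0)
    } yield (sex, drug)

  // Any number of columns: fold over the columns, extending every partial
  // combination with each value of the next column.
  def combinations(columns: Seq[Seq[String]]): Seq[Seq[String]] =
    columns.foldLeft(Seq(Seq.empty[String])) { (acc, values) =>
      for {
        combo <- acc
        value <- values
      } yield combo :+ value
    }

  def main(args: Array[String]): Unit = {
    // Print each combination in the "F and drugY" style used in the question.
    combinations(distinctValues).foreach(c => println(c.mkString(" and ")))
  }
}
```

Each combination could then serve as the basis for one filter on the data frame, but for large inputs the cross join shown above keeps the work inside Spark.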
