How do I compare each column in a table using a Scala DataFrame without caring what the column is?

Question

My previous question (Last question) was as follows.

Table 1 -- ID pairs table

Table 2 -- Attribute table

Table 3

For example, id1 and id2 have different colors and sizes, so the id1 and id2 row (2nd row in Table 3) has "id1 id2 0 0";

id1 and id3 have the same color but different sizes, so the id1 and id3 row (3rd row in Table 3) has "id1 id3 1 0".

Same attribute: 1. Different attribute: 0.

But what if I do not know how many attribute columns Table 2 has? For example, I might not know the column names color or size, and there might be another column called brand. How can I get Table 3 then?

Solution

The following solution should work for any number of unknown attribute columns in Table 2. I have edited the answer from your Last Question:

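// Run in spark-shell, where `spark` (the SparkSession) is already defined.
import org.apache.spark.sql.functions.{col, when}
import spark.implicits._ // enables toDF and the $"..." column syntax

// Table 1: the ID pairs to compare
val t1 = List(
  ("id1","id2"),
  ("id1","id3"),
  ("id2","id3")
).toDF("id_x", "id_y")

// Table 2: one row of attributes per ID
val t2 = List(
  ("id1","blue","m","brand1"),
  ("id2","red","s","brand1"),
  ("id3","blue","s","brand2")
).toDF("id", "color", "size", "brand")

// Every column after "id" is treated as an attribute, whatever its name
val outSchema = t2.columns.tail

// Attach the attributes of both IDs in each pair, aliased as x and y
var t3 = t1
  .join(t2.as("x"), $"id_x" === $"x.id", "inner")
  .join(t2.as("y"), $"id_y" === $"y.id", "inner")

// Replace each x/y attribute pair with a flag: 1 if equal, 0 otherwise
for (columnName <- outSchema) {
  t3 = t3
    .withColumn(columnName + "s", when(col(s"x.$columnName") === col(s"y.$columnName"), 1).otherwise(0))
    .drop(columnName) // remove the original x and y copies
    .drop("id")
    .withColumnRenamed(columnName + "s", columnName)
}
t3.show(false)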

The final output is

+----+----+-----+----+-----+
|id_x|id_y|color|size|brand|
+----+----+-----+----+-----+
|id1 |id2 |0    |0   |1    |
|id1 |id3 |1    |0   |0    |
|id2 |id3 |0    |1   |0    |
+----+----+-----+----+-----+

Because the attribute columns are read from t2.columns rather than hard-coded, the same code works for any number of attribute columns, whatever their names.
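As a variation on the loop above, the same comparison can be sketched as a single select, which avoids the mutable var and the temporary column names. This is only a sketch, not part of the original answer: the helper name compareAttributes is hypothetical, it assumes the first column of the attribute table is the id, and it creates a local SparkSession so that it runs outside spark-shell.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, when}

// Assumption: a local session for a standalone run.
// In spark-shell, skip this line because `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("compare-columns").getOrCreate()
import spark.implicits._

val t1 = Seq(("id1", "id2"), ("id1", "id3"), ("id2", "id3")).toDF("id_x", "id_y")
val t2 = Seq(
  ("id1", "blue", "m", "brand1"),
  ("id2", "red", "s", "brand1"),
  ("id3", "blue", "s", "brand2")
).toDF("id", "color", "size", "brand")

// Hypothetical helper (not from the original answer).
// Assumes attrs has an "id" column first and pairs has "id_x" and "id_y".
def compareAttributes(pairs: DataFrame, attrs: DataFrame): DataFrame = {
  // Every column after "id" is an attribute, whatever its name
  val attributeColumns = attrs.columns.tail
  // One 1/0 flag per attribute, named after the attribute itself
  val flags = attributeColumns.map { c =>
    when(col(s"x.$c") === col(s"y.$c"), 1).otherwise(0).as(c)
  }
  pairs
    .join(attrs.as("x"), col("id_x") === col("x.id"), "inner")
    .join(attrs.as("y"), col("id_y") === col("y.id"), "inner")
    .select((col("id_x") +: col("id_y") +: flags): _*)
}

val t3 = compareAttributes(t1, t2)
t3.orderBy("id_x", "id_y").show(false)
```

For the sample data this should give the same rows as the table above. Note that a null attribute value never compares as equal here, so it yields 0, which matches the behavior of the loop version.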
