Representing nested structures in Scala

Views: 138 | Published: 2016/5/22 16:12:30 | Tags: scala, apache-spark


Problem description


I have a sparse table that has nested sub-tables in some of its rows, as shown below. How do I represent this structure with Scala collections?

| rowkey  | orderid   | name          | amount    | supplier           | account     |
| rowkey1 | id0: 1001 | id1: "apple"  | id1: 1000 | id3: "fruits, inc" |             |
|         |           | id2: "apple2" | id2: 1200 |                    |             |
| rowkey2 | id4: 1002 | id5: "orange" | id5: 5000 |                    |             |
| rowkey3 | id6: 1003 | id7: "pear"   | id7: 500  |                    | id10: 77777 |
|         |           | id8: "pear2"  | id8: 350  |                    |             |
|         |           | id9: "pear3"  | id9: 500  |                    |             |

Note: id1, id2, id3, ... are unique identifiers for each "group attribute", which is basically the group id of each sub-row. For example, in the first row, id2: "apple2" and id2: 1200 belong to the same group id2 (a sub-row with two attributes, name and amount, under rowkey1).

Another way to look at these 3 rows:

    rowkey1, (orderid, id0, 1001), (name, id1, "apple"), (amount, id1, 1000), (name, id2, "apple2"), (amount, id2, 1200), (supplier, id3, "fruit inc.")
    rowkey2, (orderid, id4, 1002), (name, id5, "orange"), (amount, id5, 5000)
    rowkey3, (orderid, id6, 1003), (name, id7, "pear"), (amount, id7, 500), (name, id8, "pear2"), (amount, id8, 350), (name, id9, "pear3"), (amount, id9, 250), (account, id10, 777777)
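
The flattened view above maps fairly directly onto Scala collections. Here is a minimal sketch, assuming a hypothetical Cell case class (not part of the original post) for each (column, group id, value) triple. Grouping the cells by group id recovers the sub-rows:

```scala
object FlatTriples {
  // Hypothetical type: one (column, group id, value) triple under a row key.
  final case class Cell(column: String, groupId: String, value: Any)

  // The first two rows from the listing above, keyed by row key.
  val table: Map[String, List[Cell]] = Map(
    "rowkey1" -> List(
      Cell("orderid", "id0", 1001),
      Cell("name", "id1", "apple"),
      Cell("amount", "id1", 1000),
      Cell("name", "id2", "apple2"),
      Cell("amount", "id2", 1200),
      Cell("supplier", "id3", "fruit inc.")
    ),
    "rowkey2" -> List(
      Cell("orderid", "id4", 1002),
      Cell("name", "id5", "orange"),
      Cell("amount", "id5", 5000)
    )
  )

  // Rebuild the sub-rows of one row: group id -> (column name -> value).
  def subRows(cells: List[Cell]): Map[String, Map[String, Any]] =
    cells.groupBy(_.groupId).map { case (gid, cs) =>
      gid -> cs.map(c => c.column -> c.value).toMap
    }

  def main(args: Array[String]): Unit =
    // Prints the name and amount that share group id2 under rowkey1.
    println(subRows(table("rowkey1"))("id2"))
}
```

Because the values are typed as Any, this layout copes with any number of columns, at the cost of type checks being deferred to query time.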

Edit: note that the table has 2000 columns. Is it possible to create a class (or add attributes to a class) dynamically in Scala, e.g. by loading field names and types from an external file? I know that case classes are limited to 22 fields.

Edit 2: also note that any of the attributes except rowkey can have multiple lines, i.e. orderid, name, amount, supplier, account and the 1995+ other columns, so creating individual "singleLine" classes for all of them is not feasible. I'm looking for the most general solution.
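
One way to avoid a fixed class per column set is to keep the schema as data rather than as class fields. The sketch below assumes a hypothetical schema format with one "colname:coltype" pair per line (the post does not specify any file format), and the names parseSchema and parseValue are made up for illustration:

```scala
object DynamicSchema {
  // Assumed format: one "colname:coltype" pair per line, e.g. "amount:Int".
  def parseSchema(lines: Seq[String]): Map[String, String] =
    lines.map(_.trim).filter(_.nonEmpty).map { line =>
      line.split(":", 2) match {
        case Array(name, tpe) => name.trim -> tpe.trim
        case _                => sys.error(s"Malformed schema line: $line")
      }
    }.toMap

  // Convert a raw string according to the declared type.
  // Types other than Int and Long are kept as plain strings.
  def parseValue(tpe: String, raw: String): Any = tpe match {
    case "Int"  => raw.toInt
    case "Long" => raw.toLong
    case _      => raw
  }

  def main(args: Array[String]): Unit = {
    // In practice these lines could be read from a file,
    // e.g. with scala.io.Source.fromFile(path).getLines().
    val schema = parseSchema(Seq("orderid:Int", "name:String", "amount:Int", "account:Long"))
    println(parseValue(schema("amount"), "1200"))
  }
}
```

With the schema held in a Map, 2000 columns are just 2000 entries, and no generated classes are needed.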

Thanks for the answers. To make this more general, I guess I can create these classes:

case class ColumnLine(
  id: Int,
  value: Option[Any]
)
case class Column(
  colname: String,
  coltype: String,
  lines: Option[List[ColumnLine]]
)
case class Row(
  rowkey: String,
  columns: Map[String, Column] // colname -> Column
)
case class Table(
  name: String,
  rows: Map[String, Row] // rowkey -> Row
)

Now I'm trying to figure out how to query this structure, i.e. return the rows where the column with colname == "amount" contains lines whose value is greater than 500.
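
A query of that kind can be sketched with nested exists calls over the classes above. This is only one possible approach: the function name rowsWhereGt, the amount helper, and the sample data (including the table name "orders" and the numeric line ids) are assumptions made for illustration.

```scala
object QuerySketch {
  final case class ColumnLine(id: Int, value: Option[Any])
  final case class Column(colname: String, coltype: String, lines: Option[List[ColumnLine]])
  final case class Row(rowkey: String, columns: Map[String, Column])
  final case class Table(name: String, rows: Map[String, Row])

  // Rows in which column `colname` has at least one Int line greater than `min`.
  def rowsWhereGt(table: Table, colname: String, min: Int): Iterable[Row] =
    table.rows.values.filter { row =>
      row.columns.get(colname).exists { col =>
        col.lines.getOrElse(Nil).exists { line =>
          line.value.exists {
            case i: Int => i > min
            case _      => false // non-Int values never match
          }
        }
      }
    }

  // Hypothetical helper: builds an "amount" column with line ids 0, 1, 2, ...
  def amount(values: Int*): Column =
    Column("amount", "Int",
      Some(values.toList.zipWithIndex.map { case (v, i) => ColumnLine(i, Some(v)) }))

  val sample: Table = Table("orders", Map(
    "rowkey1" -> Row("rowkey1", Map("amount" -> amount(1000, 1200))),
    "rowkey3" -> Row("rowkey3", Map("amount" -> amount(500, 350, 500)))
  ))

  def main(args: Array[String]): Unit =
    println(rowsWhereGt(sample, "amount", 500).map(_.rowkey).toList) // prints List(rowkey1)
}
```

A missing column, an absent line list, or a non-Int value simply fails to match here instead of throwing.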

Edit 3: OK, this is a "quick and dirty" way, but it seems to work. It scans 10M records in ~15 seconds on my laptop.

object hello {

  case class Single(id: String, value: Option[Any])

  case class Column(
    colname: String,
    coltype: String,
    lines: List[Single]
  )

  case class Row(
    rowkey: String,
    columns: List[Column]
  )

  def main(args: Array[String]): Unit = {
    val n = 10000000
    def uuid = java.util.UUID.randomUUID.toString
    val row: Row = Row(uuid, List(
      Column("orderid", "String", List(Single("id2", Some(uuid)))),
      Column("name", "String", List(Single("id2", Some("apple")), Single("id3", Some("apple2")))),
      Column("amount", "Int", List(Single("id2", Some(1000)), Single("id3", Some(1200)))),
      Column("supplier", "String", List(Single("id4", Some("fruits.inc")))),
      Column("account", "Int", List(Single("id10", Some(7777))))
    ))
    println(new java.util.Date)
    val table: List[Row] = List.fill(n)(row)
    // .par is built in up to Scala 2.12; Scala 2.13+ needs the
    // scala-parallel-collections module and
    // import scala.collection.parallel.CollectionConverters._
    table.par
      .filter(row => gt(row, "amount", 500))
      .filter(row => eq(row, "supplier", "fruits.inc"))
      .filter(row => eq(row, "account", 7777))
      //.foreach(println)
    println(new java.util.Date)
  }

  // True if any line of the column holds exactly colvalue.
  def eq(row: Row, colname: String, colvalue: Any): Boolean =
    getCol(row, colname).lines.exists(_.value == Some(colvalue))

  // True if any line of the column holds an Int greater than colvalue.
  // Empty or non-Int values do not match (instead of throwing a ClassCastException).
  def gt(row: Row, colname: String, colvalue: Int): Boolean =
    getCol(row, colname).lines.exists(_.value.exists {
      case i: Int => i > colvalue
      case _      => false
    })

  // Throws NoSuchElementException if the row has no such column.
  def getCol(row: Row, colname: String): Column =
    row.columns.find(_.colname == colname).get

}

Solution

The most natural way to represent this in Scala, assuming that the column structure can be treated as fixed, would be something like

case class Single(name: String, amount: Int)

case class SingleEntry(
  orderid: Int,
  name: String,
  amount: Int,
  supplier: Option[String], // same type as Entry.supplier
  account: Option[Long]
)

case class Entry(
  orderid: Int,
  items: List[Single],
  supplier: Option[String],
  account: Option[Long]
) {
  // One SingleEntry per item that satisfies p.
  def singly(p: Single => Boolean): List[SingleEntry] =
    items.filter(p).map { case Single(name, amount) =>
      SingleEntry(orderid, name, amount, supplier, account)
    }
}

And then, to pull out the items you want, you would write

table.
  filter(_.supplier.exists(_ == "fruits.inc")).
  flatMap(_.singly(_.amount > 500))

But there are many ways you could represent this data structure, including with maps (nested or otherwise); I wouldn't take any particular answer as canonical.
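
As one illustration of the map-based alternative, here is a minimal sketch. The layout (row key -> group id -> column name -> value) and the names group, amountOver and groupsWithAmountOver are assumptions chosen for this example, not something the answer prescribes:

```scala
object NestedMaps {
  // row key -> group id -> column name -> value
  type Group = Map[String, Any]
  type Table = Map[String, Map[String, Group]]

  // Hypothetical helper that fixes the value type to Any.
  def group(kvs: (String, Any)*): Group = kvs.toMap

  // rowkey1 and part of rowkey3 from the table in the question.
  val table: Table = Map(
    "rowkey1" -> Map(
      "id0" -> group("orderid" -> 1001),
      "id1" -> group("name" -> "apple", "amount" -> 1000),
      "id2" -> group("name" -> "apple2", "amount" -> 1200),
      "id3" -> group("supplier" -> "fruits, inc")
    ),
    "rowkey3" -> Map(
      "id6" -> group("orderid" -> 1003),
      "id7" -> group("name" -> "pear", "amount" -> 500),
      "id8" -> group("name" -> "pear2", "amount" -> 350)
    )
  )

  // True if the group has an Int "amount" greater than min.
  def amountOver(g: Group, min: Int): Boolean = g.get("amount") match {
    case Some(a: Int) => a > min
    case _            => false
  }

  // All (row key, group) pairs whose amount exceeds min.
  def groupsWithAmountOver(t: Table, min: Int): List[(String, Group)] =
    for {
      (rowkey, groups) <- t.toList
      g                <- groups.values.toList
      if amountOver(g, min)
    } yield rowkey -> g

  def main(args: Array[String]): Unit =
    groupsWithAmountOver(table, 500).foreach(println)
}
```

Compared with the case-class version, this trades compile-time checking of column names and types for the ability to hold an arbitrary number of columns.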
