Summary Statistics for string types in Spark


Question

Is there something in Spark like the summary function in R?

The summary calculation that comes with Spark (MultivariateStatisticalSummary) operates only on numeric types.

I am also interested in getting results for string types, such as the four most frequently occurring strings (a groupBy kind of operation), the number of unique values, etc.

Is there any preexisting code for this?

If not, please suggest the best way to deal with string types.

Recommended answer

I don't think there is such a thing for String in MLlib. But it would probably be a valuable contribution, if you are going to implement it.

Calculating just one of these metrics is easy. For example, the top 4 strings by frequency:

def top4(rdd: org.apache.spark.rdd.RDD[String]) =
  rdd
    .map(s => (s, 1))
    .reduceByKey(_ + _)
    .map { case (s, count) => (count, s) }
    .top(4)
    .map { case (count, s) => s }

Or the number of unique values:

def numUnique(rdd: org.apache.spark.rdd.RDD[String]) =
  rdd.distinct.count

But computing all of these metrics in a single pass takes more work.
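One possible way to get several of these metrics in a single pass is to fold every value into a small in-memory summary and merge the per-partition summaries afterwards. The following is only a sketch under assumptions: `StringSummary` and its methods are hypothetical names (they are not part of Spark or MLlib), and it assumes the set of distinct strings is small enough to fit in memory on each executor and on the driver.

```scala
// Hypothetical helper, not part of Spark or MLlib.
// Assumes the number of distinct values fits in memory.
final case class StringSummary(total: Long, counts: Map[String, Long]) {
  // Fold one value into the summary (usable as the seqOp of RDD.aggregate).
  def add(s: String): StringSummary =
    StringSummary(total + 1, counts.updated(s, counts.getOrElse(s, 0L) + 1))

  // Merge two partial summaries (usable as the combOp of RDD.aggregate).
  def merge(other: StringSummary): StringSummary = {
    val merged = other.counts.foldLeft(counts) { case (acc, (s, n)) =>
      acc.updated(s, acc.getOrElse(s, 0L) + n)
    }
    StringSummary(total + other.total, merged)
  }

  // Number of distinct values seen so far.
  def numUnique: Int = counts.size

  // Most frequent values first; ties are broken alphabetically.
  def top(k: Int): Seq[(String, Long)] =
    counts.toSeq.sortBy { case (s, n) => (-n, s) }.take(k)
}

object StringSummary {
  val empty: StringSummary = StringSummary(0L, Map.empty)
}

// With Spark, for some rdd: RDD[String], this could be used as:
//   val summary = rdd.aggregate(StringSummary.empty)(_.add(_), _.merge(_))
//   summary.top(4); summary.numUnique
```

The trade-off is memory: for high-cardinality columns the count map can grow large, in which case the separate reduceByKey and distinct approaches above are safer.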

These examples assume that, if you have multiple "columns" of data, you have split each column into a separate RDD. This is a good way to organize the data, and it's necessary for operations that perform a shuffle.

What I mean by splitting up the columns:

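import org.apache.spark.rdd.RDD

// Split an RDD of (row id, row values) into one RDD per column,
// keeping the row id as the key of every per-column RDD.
def split(together: RDD[(Long, Seq[String])],
          columns: Int): Seq[RDD[(Long, String)]] = {
  together.cache() // We will do N passes over this RDD.
  (0 until columns).map { i =>
    together.mapValues(s => s(i))
  }
}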
