Summary Statistics for string types in Spark
Question
Is there something like the summary function in Spark, like that in R?
The summary calculation that comes with Spark (MultivariateStatisticalSummary) operates only on numeric types.
I am interested in getting the results for string types as well, such as the four most frequently occurring strings (a groupBy-style operation), the number of unique values, etc.
Is there any pre-existing code for this?
If not, please suggest the best way to deal with string types.
Answer
I don't think there is such a thing for String in MLlib. But it would probably be a valuable contribution if you were to implement it.
Calculating just one of these metrics is easy. E.g., for the top 4 by frequency:
def top4(rdd: org.apache.spark.rdd.RDD[String]): Array[String] =
  rdd
    .map(s => (s, 1))                      // pair each string with a count of 1
    .reduceByKey(_ + _)                    // sum the counts per distinct string
    .map { case (s, count) => (count, s) } // swap so the ordering compares counts first
    .top(4)                                // take the 4 largest (count, string) pairs
    .map { case (count, s) => s }          // keep only the strings
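The swap to (count, s) matters because top uses the natural tuple ordering, which compares the count first. A plain-Scala sketch of the same ranking idea, with made-up data:

```scala
// Tuples order lexicographically, so (count, string) pairs sort by
// count first; reversing the ordering and taking a prefix gives top-N.
val pairs = Seq((3, "a"), (1, "c"), (2, "b"), (1, "d"))
val top2 = pairs.sorted(Ordering[(Int, String)].reverse).take(2).map(_._2)
// top2 == Seq("a", "b")
```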
Or the number of unique values:
def numUnique(rdd: org.apache.spark.rdd.RDD[String]): Long =
  rdd.distinct.count
But doing this for all metrics in a single pass takes more work.
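One way to share work across metrics is to build the per-string frequency counts once and derive everything from them. Here is a sketch of that idea on plain Scala collections (the function name and data are illustrative; on an RDD the equivalent would be one reduceByKey, with top and count run on the cached result):

```scala
// Build the per-string frequency map once, then derive both metrics from it.
def stringSummary(data: Seq[String]): (Seq[String], Int) = {
  val freq = data.groupBy(identity).map { case (s, v) => (s, v.size) }
  // Rank by descending count, breaking ties alphabetically for determinism.
  val top4 = freq.toSeq.sortBy { case (s, c) => (-c, s) }.take(4).map(_._1)
  (top4, freq.size) // freq.size is the number of unique strings
}
```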
These examples assume that, if you have multiple "columns" of data, you have split each column into a separate RDD. This is a good way to organize the data, and it's necessary for operations that perform a shuffle.
Here is what I mean by splitting the columns:
import org.apache.spark.rdd.RDD

def split(together: RDD[(Long, Seq[String])],
          columns: Int): Seq[RDD[(Long, String)]] = {
  together.cache // We will do N passes over this RDD.
  (0 until columns).map {
    i => together.mapValues(s => s(i))
  }
}
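A quick illustration of what split produces, using plain collections in place of RDDs (the data here is made up):

```scala
// Each (key, row) pair is split so that column i becomes its own sequence;
// with Spark, each element of `columns` would instead be a separate RDD.
val together = Seq((0L, Seq("alice", "NY")), (1L, Seq("bob", "LA")))
val columns = (0 until 2).map(i => together.map { case (k, row) => (k, row(i)) })
// columns(0) == Seq((0L, "alice"), (1L, "bob"))
// columns(1) == Seq((0L, "NY"),    (1L, "LA"))
```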