Generating a frequency map for a string in Scala
Problem description
Let's say I have a string, "hello", and I want to generate a character frequency map:
Map[Char,Int] = Map(h -> 1, e -> 1, o -> 1, l -> 2)
I could do this iteratively:
val str = "hello"
var counts = new scala.collection.mutable.HashMap[Char,Int]
for (i <- str) {
if (counts.contains(i))
counts.put(i, counts(i) + 1)
else
counts.put(i, 1)
}
By messing around in the REPL, I've found I can do something a bit more concise and not using a mutable collection:
> str.groupBy(_.toChar).map{ p => (p._1, p._2.length)}
scala.collection.immutable.Map[Char,Int] = Map(h -> 1, e -> 1, o -> 1, l -> 2)
But I don't know about the performance characteristics of groupBy() nor what is going on in the block passed to map (like what, exactly, p is).
How do I do this idiomatically using the functional paradigms in Scala?
For background, I'm just coming to Scala for the first time from Ruby. In Ruby, I would use inject, but I'm not sure what the parallel way to do it in Scala is:
counts = str.each_byte.inject(Hash.new(0)){ |h, c| h[c] += 1; h}
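The closest Scala parallel to Ruby's inject is foldLeft, threading an immutable map through the string; this sketch mirrors the Ruby one-liner above (withDefaultValue(0) plays the role of Hash.new(0)):

```scala
// foldLeft plays the role of Ruby's inject: the accumulator is an
// immutable Map[Char, Int] whose missing keys default to 0.
val str = "hello"
val counts = str.foldLeft(Map[Char, Int]().withDefaultValue(0)) {
  (h, c) => h.updated(c, h(c) + 1)
}
// counts: Map(h -> 1, e -> 1, l -> 2, o -> 1)
```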
1) What does p mean?
groupBy takes a function which maps an element to a key of type K. When invoked on some collection Coll, it returns a Map[K, Coll] which contains mappings from each key K to all the elements which mapped to that key.
So, in your case, str.groupBy(_.toChar) yields a map from a key k (which is a character) to a string containing all the elements (characters) c such that k == c.toChar.
You get this:
Map(e -> "e", h -> "h", l -> "ll", o -> "o")
A Map is an iterable of pairs of keys and values. In this case, each pair is a character and a string of elements. Calling the map operation on a Map means mapping over these pairs: p is a pair where p._1 is a character, and p._2 is the associated string (on which you can call length, as you did above).
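If the _1/_2 accessors feel opaque, a pattern match on each pair is equivalent and names the parts explicitly (a minimal sketch; the names char and group are illustrative):

```scala
val str = "hello"
// groupBy produces Map(e -> "e", h -> "h", l -> "ll", o -> "o");
// the case clause destructures each (key, group) pair by name.
val freqs = str.groupBy(_.toChar).map { case (char, group) => (char, group.length) }
// freqs: Map(h -> 1, e -> 1, l -> 2, o -> 1)
```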
2) How to do this idiomatically
The above is the idiomatic way: use groupBy and map. Alternatively, you can use an immutable map and recursion on the string length to compute the frequencies, or an immutable map and a foldLeft.
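A sketch of the recursive alternative mentioned above, walking the string by index with a tail-recursive helper (the names frequencies and loop are illustrative, not from the original answer):

```scala
import scala.annotation.tailrec

// Tail-recursive frequency count: an immutable map is threaded through
// each call as the accumulator, so no mutable state is needed.
def frequencies(s: String): Map[Char, Int] = {
  @tailrec
  def loop(i: Int, acc: Map[Char, Int]): Map[Char, Int] =
    if (i == s.length) acc
    else loop(i + 1, acc.updated(s(i), acc.getOrElse(s(i), 0) + 1))
  loop(0, Map.empty)
}
// frequencies("hello") == Map(h -> 1, e -> 1, l -> 2, o -> 1)
```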
3) Performance characteristics
Best to benchmark to see the differences. Here are a couple of microbenchmarks for a highly repetitive string (~3GHz iMac, JDK7, Scala 2.10.0 nightly):
object Imperative extends testing.Benchmark {
val str = "abc" * 750000
def run() {
val counts = new scala.collection.mutable.HashMap[Char,Int]
var i = 0
val until = str.length
while (i < until) {
val c = str(i)
if (counts.contains(c))
counts.put(c, counts(c) + 1)
else
counts.put(c, 1)
i += 1
}
}
}
object Combinators extends testing.Benchmark {
val str = "abc" * 750000
def run() {
val f = str.groupBy(_.toChar).map(p => (p._1, p._2.length))
}
}
object Fold extends testing.Benchmark {
val str = "abc" * 750000
def run() {
val f = str.foldLeft(Map[Char, Int]() withDefaultValue 0){(h, c) => h.updated(c, h(c)+1)}
}
}
Results:
Imperative:
$ 103 57 53 58 53 53 53 53 53 53
Combinators:
$ 72 51 63 56 53 52 52 54 53 53
Fold:
$ 163 62 71 62 57 57 57 58 57 57
Note that changing the imperative version to use withDefaultValue:
val counts = new scala.collection.mutable.HashMap[Char,Int].withDefaultValue(0)
var i = 0
val until = str.length
while (i < until) {
val c = str(i)
counts.put(c, counts(c) + 1)
i += 1
}
is apparently terribly slow, due to forwarding each put call:
withDefaultValue:
$ 133 87 109 106 101 100 101 100 101 101
Conclusion: the boxing and unboxing of characters in this case is high-enough so that the differences in performance between these approaches are hard to observe.
EDIT: You may want to use ScalaMeter inline benchmarking in place of the Benchmark trait.
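Assuming ScalaMeter's inline measurement API (org.scalameter.measure with a warmer; check the current ScalaMeter docs, as the API has changed across versions and the library must be on the classpath), an inline benchmark looks roughly like this:

```scala
import org.scalameter._

// Measures a single expression inline; withWarmer runs JIT warm-up
// iterations before timing. Requires the ScalaMeter dependency.
val time = withWarmer(new Warmer.Default) measure {
  ("abc" * 750000).groupBy(_.toChar).map(p => (p._1, p._2.length))
}
println(s"groupBy/map took: $time")
```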