查找字符串中最常见字符的更有效方法 [英] More effective method for Finding the most common character in a string

查看:178
本文介绍了查找字符串中最常见字符的更有效方法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我创建了一个查找字符串中最常见字符的方法:

I created a method for finding the most common character in a string:

public static char getMax(String s) {

char maxappearchar = ' ';
int counter = 0;
int[] charcnt = new int[Character.MAX_VALUE + 1];


for (int i = 0 ; i < s.length() ; i++)
{
    char ch = s.charAt(i);
    // increment this character's cnt and compare it to our max.
    charcnt[ch]++ ;
    if (charcnt[ch] >= counter)
    {
        counter = charcnt[ch];
        maxappearchar = ch;
    } 
}
System.out.println("the max char is   " +maxappearchar + "  and displayed  " +counter+ "  times");
return maxappearchar;
}

我问的是不同的解决方案:

I am asking about different solutions for it:


  • 解决方案1 ​​ - 最快的代码(是我的附加代码吗?)

  • 解决方案2 - 在内存方面最有效,减少使用数组和变量

我使用HashMap创建了我的方法 - 是否更适合解决方案2?如果是这样的话?什么是利弊?

I created my method using HashMap - is that more Suitable for solution 2? If so why? And what are the pros/cons?

附加的代码是否适用于o技术(o ^,o logn ...)?如果是这样的原因?

Is the attached code is suitable for o Technique (o^ , o logn ... )? If so why?

推荐答案

最快的方法是计算每个字符的出现次数,然后取最大值计数数组。如果您的字符串很长,那么在循环字符串中的字符时不会跟踪当前最大值会获得不错的加速。

The fastest way to do this will be to count occurrences of every character, then take the max value in the count array. If your string is long, you'll gain a decent speedup from not tracking the current max while looping over characters in the String.

参见如何计算字符串中字符的频率?以获取有关如何计算频率的许多其他想法。

See How to count frequency of characters in a string? for many other ideas about how to count frequencies.

如果你的字符串主要是ASCII,那么count循环中的一个分支可以在低128字符值的数组或其余的HashMap之间进行选择,应该是值得的。如果您的字符串没有非ASCII字符,分支将很好地预测。如果在ascii和非ascii之间有很多交替,那么与使用HashMap相比,分支可能会有点伤害。

If your Strings are mostly ASCII, a branch in the count loop to choose between an array for the low 128 char values, or a HashMap for the rest, should be worth it. The branch will predict well if your strings don't have non-ASCII characters. If there's a lot of alternating between ascii and non-ascii, the branch might hurt a bit, compared to using HashMap for everything.

public static char getMax(String s) {

    char maxappearchar = ' ';
    int counter = 0;
    int[] ascii_count = new int[128];  // fast path for ASCII
    HashMap<Character,Integer> nonascii_count = new HashMap<Character,Integer>();

    for (int i = 0 ; i < s.length() ; i++)
    {
        char ch = s.charAt(i);  // This does appear to be the recommended way to iterate over a String
        // alternatively, iterate over 32bit Unicode codepoints, not UTF-16 chars, if that matters.
        if (ch < 128) {
            ascii_count[ch]++;
        } else {
            // some code to set or increment the nonascii_count[ch];
        }
    }

    // loop over ascii_count and find the highest element
    // loop over the keys in nonascii_count, and see if any of them are even higher.
    return maxappearchar;
}

我没有充实代码,因为我没有做很多Java,所以IDK如果有一个容器比可以插入 - 1 - 或 - 增量操作比HashMap更有效 get put 对。 https://stackoverflow.com/a/6712620/224132 建议Guava MultiSet< Character> ,看起来不错。

I didn't flesh out the code, since I don't do a lot of Java, so IDK if there's a container than can do the insert-1-or-increment operation more efficiently than a HashMap get and put pair. https://stackoverflow.com/a/6712620/224132 suggests Guava MultiSet<Character>, which looks good.

这可能比你的数组2 ^ 16 <$更好C $ C> INT 秒。但是,如果您只触摸此阵列的低128个元素,则可能永远不会触及大部分内存。分配但未触及的内存并没有真正受到伤害,或者用尽RAM /交换。

This may do better than your array of 2^16 ints. However, if you only ever touch the low 128 elements of this array, then most of the memory may never be touched. Allocated but untouched memory doesn't really hurt, or use up RAM / swap.

然而,在结尾处循环所有65536个条目意味着至少读取它,所以操作系统必须对其进行软页面故障并将其连接起来。它会污染缓存。实际上,更新每个角色的最大值可能是更好的选择。 Microbenchmarks可能会显示迭代字符串,然后循环遍历 charcnt [Character.MAX_VALUE] 获胜,但这不会解释触及那么多的缓存/ TLB污染 - 真正需要的记忆。

However, looping over all 65536 entries at the end means at least reading it, so the OS would have to soft pagefault it in and wire it up. And it will pollute caches. So actually, updating the max on every character might be a better choice. Microbenchmarks might show that iterating over the String, then looping over charcnt[Character.MAX_VALUE] wins, but that wouldn't account for the cache / TLB pollution of touching that much not-really-needed memory.

这篇关于查找字符串中最常见字符的更有效方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆