Why are most string manipulations in Java based on regexp?

Question

In Java there are a bunch of methods that all have to do with manipulating Strings. The simplest example is the String.split("something") method.

Now the actual definition of many of those methods is that they all take a regular expression as their input parameter(s), which makes them all very powerful building blocks.
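To make the "takes a regular expression" point concrete, here is a minimal sketch (the class and method names are made up for illustration). It shows that the argument to String.split is interpreted as a pattern, so a separator that happens to be a regex metacharacter must be quoted with Pattern.quote to be treated as fixed text.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original post.
final class SplitRegexDemo {
    // The argument is a regex, so "." matches every single character.
    static String[] splitOnRawDot(String input) {
        return input.split(".");
    }

    // Pattern.quote turns the separator into a literal pattern.
    static String[] splitOnLiteralDot(String input) {
        return input.split(Pattern.quote("."));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnRawDot("a.b.c")));     // prints []
        System.out.println(Arrays.toString(splitOnLiteralDot("a.b.c"))); // prints [a, b, c]
    }
}
```

The first call returns an empty array because every character matches, leaving only empty pieces, and String.split drops trailing empty strings.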

Now there are two effects you'll see in many of those methods:


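  1. They recompile the expression each time the method is invoked. As such, they impose a performance impact.

  2. I've found that in most "real-life" situations these methods are called with "fixed" texts. The most common usage of the split method is even worse: it is usually called with a single char (usually a ' ', a ';' or a '&') to split by.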

So it's not only that the default methods are powerful, they also seem overpowered for what they are actually used for. Internally we've developed a "fastSplit" method that splits on fixed strings. I wrote a test at home to see how much faster I could do it if the separator was known to be a single char. Both are significantly faster than the "standard" split method.

So I was wondering: why was the Java API designed the way it is now? What was the good reason to go for this instead of having something like split(char), split(String) and splitRegex(String)?
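As a rough sketch of what such a three-method API might look like, the utility class below is purely hypothetical: none of these names exist in the JDK, and this is not the internal "fastSplit" mentioned above. The fixed-string variant relies only on indexOf and substring.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical utility class; the names are assumptions for illustration.
final class Splitters {
    private Splitters() {}

    // Split on a fixed string using indexOf only; no regex involved.
    static String[] split(String input, String separator) {
        if (separator.isEmpty()) {
            throw new IllegalArgumentException("separator must not be empty");
        }
        List<String> pieces = new ArrayList<>();
        int start = 0;
        int hit = input.indexOf(separator, start);
        while (hit != -1) {
            pieces.add(input.substring(start, hit));
            start = hit + separator.length();
            hit = input.indexOf(separator, start);
        }
        pieces.add(input.substring(start)); // last piece, possibly empty
        return pieces.toArray(new String[0]);
    }

    // Split on a single char; delegates for brevity, a tuned version
    // would call input.indexOf(char, int) directly.
    static String[] split(String input, char separator) {
        return split(input, String.valueOf(separator));
    }

    // The regex variant keeps the full power of java.util.regex.
    static String[] splitRegex(String input, String regex) {
        return Pattern.compile(regex).split(input);
    }
}
```

Note one behavioral difference in this sketch: unlike String.split(regex), the fixed-string variants keep trailing empty strings, so splitting "a,b,," on ',' yields four pieces rather than two.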

Update: I slapped together a few calls to see how much time the various ways of splitting a string would take.

Short summary: It makes a big difference!

I did 10,000,000 iterations for each test case, always using the input

"aap,noot,mies,wim,zus,jet,teun" 

and always using ',' or "," as the split argument.
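The test harness itself is not shown in the post. A naive timing loop along these lines could reproduce the general idea; this is a sketch with assumed names, not the code that produced the numbers in this post, and a serious measurement would use a benchmarking tool such as JMH to account for JIT warm-up.

```java
import java.util.regex.Pattern;

// Hypothetical timing sketch; class and method names are assumptions.
final class SplitTiming {
    static final String INPUT = "aap,noot,mies,wim,zus,jet,teun";
    static final Pattern COMMA = Pattern.compile(",");

    // Time String.split(regex); returns elapsed nanoseconds.
    static long timeStringSplit(int iterations) {
        long pieces = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            pieces += INPUT.split(",").length;
        }
        long elapsed = System.nanoTime() - start;
        check(pieces, iterations);
        return elapsed;
    }

    // Time Pattern.split with a pattern compiled once up front.
    static long timePrecompiled(int iterations) {
        long pieces = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            pieces += COMMA.split(INPUT).length;
        }
        long elapsed = System.nanoTime() - start;
        check(pieces, iterations);
        return elapsed;
    }

    // Using the result keeps the loop from being optimized away
    // and verifies that every split produced 7 pieces.
    private static void check(long pieces, int iterations) {
        if (pieces != 7L * iterations) {
            throw new IllegalStateException("unexpected piece count: " + pieces);
        }
    }

    public static void main(String[] args) {
        int iterations = 1_000_000; // smaller than the 10,000,000 used in the post
        System.out.println("String.split:  " + timeStringSplit(iterations) / 1_000_000 + " ms");
        System.out.println("Pattern.split: " + timePrecompiled(iterations) / 1_000_000 + " ms");
    }
}
```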

This is what I got on my Linux system (it's an Atom D510 box, so it's a bit slow):

fastSplit STRING
Test  1 : 11405 milliseconds: Split in several pieces
Test  2 :  3018 milliseconds: Split in 2 pieces
Test  3 :  4396 milliseconds: Split in 3 pieces

homegrown fast splitter based on char
Test  4 :  9076 milliseconds: Split in several pieces
Test  5 :  2024 milliseconds: Split in 2 pieces
Test  6 :  2924 milliseconds: Split in 3 pieces

homegrown splitter based on char that always splits in 2 pieces
Test  7 :  1230 milliseconds: Split in 2 pieces

String.split(regex)
Test  8 : 32913 milliseconds: Split in several pieces
Test  9 : 30072 milliseconds: Split in 2 pieces
Test 10 : 31278 milliseconds: Split in 3 pieces

String.split(regex) using precompiled Pattern
Test 11 : 26138 milliseconds: Split in several pieces 
Test 12 : 23612 milliseconds: Split in 2 pieces
Test 13 : 24654 milliseconds: Split in 3 pieces

StringTokenizer
Test 14 : 27616 milliseconds: Split in several pieces
Test 15 : 28121 milliseconds: Split in 2 pieces
Test 16 : 27739 milliseconds: Split in 3 pieces

As you can see, it makes a big difference if you have a lot of "fixed char" splits to do.

To give you guys some insight: I'm currently working in the Apache logfiles and Hadoop arena with the data of a big website. So to me this stuff really matters :)

Something I haven't factored in here is the garbage collector. As far as I can tell, compiling a regular expression into a Pattern/Matcher/... allocates a lot of objects that need to be collected at some point. So perhaps in the long run the differences between these versions are even bigger ... or smaller.

My conclusions so far:

  • Only optimize this if you have a LOT of strings to split.
  • If you use the regex methods, always precompile if you repeatedly use the same pattern.
  • Forget the (obsolete) StringTokenizer.
  • If you want to split on a single char, then use a custom method, especially if you only need to split it into a specific number of pieces (like ... 2).
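For the "specific number of pieces" case, the standard API also accepts a limit argument, which may be worth trying before writing a custom method. The sketch below (class and method names are assumptions) combines a precompiled Pattern with a limit of 2.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original post.
final class SplitLimitDemo {
    private static final Pattern COMMA = Pattern.compile(",");

    // At most two pieces: everything after the first separator stays together.
    static String[] firstAndRest(String input) {
        return COMMA.split(input, 2);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(firstAndRest("aap,noot,mies"))); // prints [aap, noot,mies]
        System.out.println(Arrays.toString(firstAndRest("aap")));           // prints [aap]
    }
}
```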

P.S. I'm giving you all my homegrown split-by-char methods to play with (under the license that everything on this site falls under :) ). I never fully tested them ... yet. Have fun.

private static String[]
        stringSplitChar(final String input,
                        final char separator) {
    int pieces = 0;

    // First we count how many pieces we will need to store ( = separators + 1 )
    int position = -1; // start at -1 so that a separator at index 0 is counted too
    do {
        pieces++;
        position = input.indexOf(separator, position + 1);
    } while (position != -1);

    // Then we allocate memory
    final String[] result = new String[pieces];

    // And start cutting and copying the pieces.
    int previousposition = 0;
    int currentposition = input.indexOf(separator);
    int piece = 0;
    final int lastpiece = pieces - 1;
    while (piece < lastpiece) {
        result[piece++] = input.substring(previousposition, currentposition);
        previousposition = currentposition + 1;
        currentposition = input.indexOf(separator, previousposition);
    }
    result[piece] = input.substring(previousposition);

    return result;
}

private static String[]
        stringSplitChar(final String input,
                        final char separator,
                        final int maxpieces) {
    if (maxpieces <= 0) {
        return stringSplitChar(input, separator);
    }
    int pieces = maxpieces;

    // Then we allocate memory
    final String[] result = new String[pieces];

    // And start cutting and copying the pieces.
    int previousposition = 0;
    int currentposition = input.indexOf(separator);
    int piece = 0;
    final int lastpiece = pieces - 1;
    while (currentposition != -1 && piece < lastpiece) {
        result[piece++] = input.substring(previousposition, currentposition);
        previousposition = currentposition + 1;
        currentposition = input.indexOf(separator, previousposition);
    }
    result[piece] = input.substring(previousposition);

    // All remaining array elements are uninitialized and assumed to be null
    return result;
}

private static String[]
        stringChop(final String input,
                   final char separator) {
    String[] result;
    // Find the separator.
    final int separatorIndex = input.indexOf(separator);
    if (separatorIndex == -1) {
        result = new String[1];
        result[0] = input;
    }
    else {
        result = new String[2];
        result[0] = input.substring(0, separatorIndex);
        result[1] = input.substring(separatorIndex + 1);
    }
    return result;
}


Recommended answer

Note that the regex need not be recompiled each time. From the Javadoc:


An invocation of this method of the form str.split(regex, n) yields the same result as the expression



Pattern.compile(regex).split(str, n) 

That is, if you are worried about performance, you may precompile the pattern and then reuse it:

Pattern p = Pattern.compile(regex);
...
String[] tokens1 = p.split(str1); 
String[] tokens2 = p.split(str2); 
...

instead of

String[] tokens1 = str1.split(regex);
String[] tokens2 = str2.split(regex);
...

I believe that the main reason for this API design is convenience. Since regular expressions include all "fixed" strings/chars too, it simplifies the API to have one method instead of several. And if someone is worried about performance, the regex can still be precompiled as shown above.
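A self-contained version of the precompiled approach might look like the sketch below. The class name, method name and the choice of "," as separator are assumptions for illustration; the pattern is compiled once into a static final field, which is safe to share because Pattern instances are immutable and thread-safe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
final class LineSplitter {
    // Compiled once when the class is loaded, then reused for every line.
    private static final Pattern SEPARATOR = Pattern.compile(",");

    // Split every line with the shared, precompiled pattern.
    static List<String[]> splitAll(List<String> lines) {
        List<String[]> result = new ArrayList<>(lines.size());
        for (String line : lines) {
            result.add(SEPARATOR.split(line));
        }
        return result;
    }
}
```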

My feeling (which I can't back with any statistical evidence) is that in most cases String.split() is used in a context where performance is not an issue, e.g. it is a one-off action, or the performance difference is negligible compared to other factors. IMO the cases where you split strings using the same regex thousands of times in a tight loop, where performance optimization indeed makes sense, are rare.

It would be interesting to see a performance comparison of a regex matcher implementation with fixed strings/chars compared to that of a matcher specialized to these. The difference might not be big enough to justify the separate implementation.
