What is the fastest way to search for a substring in Java?

Question

I want to understand the performance issues that can arise when searching for a substring in Java. I know of two built-in ways to search for a substring in Java.

1. String.indexOf()

As far as I understand, this method uses the brute-force substring search algorithm, so its complexity is O(nm), where n and m are the lengths of the string and the pattern.
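To make the O(nm) bound concrete, here is a minimal sketch of a brute-force search. It is not the JDK's actual String.indexOf code; the class name NaiveSearch and its indexOf method are made up for illustration. The outer loop tries up to n - m + 1 start positions and the inner loop compares up to m characters at each one.

```java
// Hypothetical illustration of brute-force substring search.
// This is NOT the JDK implementation of String.indexOf.
final class NaiveSearch {
    /** Returns the index of the first occurrence of pattern in text, or -1 if there is none. */
    static int indexOf(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        // Try every candidate start position: at most n - m + 1 of them.
        for (int start = 0; start + m <= n; start++) {
            int matched = 0;
            // Compare up to m characters at this position.
            while (matched < m && text.charAt(start + matched) == pattern.charAt(matched)) {
                matched++;
            }
            if (matched == m) {
                return start; // all m characters matched
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Each result is printed next to the built-in method's result for comparison.
        System.out.println(indexOf("hello world", "world") + " " + "hello world".indexOf("world"));
        System.out.println(indexOf("aaaaaaab", "aab") + " " + "aaaaaaab".indexOf("aab"));
        System.out.println(indexOf("abc", "abd") + " " + "abc".indexOf("abd"));
    }
}
```

The bad case is an input like a long run of 'a' searched for a pattern of 'a's ending in 'b': almost every start position matches nearly all m characters before failing.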

2. Using Pattern and Matcher

I know nothing about how the regex algorithms are implemented or about their complexity.

So the questions are:

1) Which of these methods is preferable from the perspective of performance?

2) What is the complexity of regex search? Does it depend on the regex itself?

Recommended answer

Honestly, if you care about worst-case performance, use JNI to call native code that invokes your standard library's strstr function. A well-implemented strstr, like the one in recent versions of glibc, has linear worst-case running time and constant worst-case space usage. I believe glibc's strstr can also make Boyer-Moore-like long jumps through the text. C standard libraries are maintained by people who know how to write and maintain good general-purpose libraries and who practise their craft. The same cannot be said for the Java standard class library.

You will have to turn the Java UTF-16 string into something suitable for strstr, such as a UTF-8 string. You will also have to handle embedded zero bytes in the UTF-8 string gracefully, because strstr treats a zero byte as the end of the string. Other than that, you will reap the benefits of a well-written and well-maintained library.
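The Java side of that conversion might look roughly like the sketch below. The class name NativeStringPrep and the method toCString are hypothetical, and rejecting strings that contain U+0000 is only one possible way to deal with embedded zero bytes; it is an assumption, not something the approach requires. The JNI call itself is left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: prepares a Java string for a native strstr call.
final class NativeStringPrep {
    /**
     * Encodes s as UTF-8 and appends the terminating zero byte that C expects.
     * Assumption: strings containing U+0000 are rejected, since strstr would
     * silently stop at the first zero byte.
     */
    static byte[] toCString(String s) {
        // In UTF-8, a zero byte can only come from the character U+0000.
        if (s.indexOf('\0') >= 0) {
            throw new IllegalArgumentException("embedded NUL character");
        }
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        // copyOf pads with zeros, which gives the trailing terminator.
        return Arrays.copyOf(utf8, utf8.length + 1);
    }

    public static void main(String[] args) {
        byte[] bytes = toCString("needle");
        System.out.println(bytes.length); // six ASCII bytes plus the terminator
    }
}
```

Note that a match offset reported by strstr is a byte offset into the UTF-8 data, so it would still have to be mapped back to a char index in the original Java string.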

Java does regex searches (for this particular case) using a Boyer-Moore string search hacked into a naive regex implementation. Compiling a Pattern from just your string will give you a Matcher that performs relatively well. Note, however, that this does NOT extend to anything beyond plain string searching with the regex library; if you feed it a nontrivial regular expression, you are still stuck with a naive, backtracking regex implementation.
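For reference, compiling a Pattern from a plain string could look like the sketch below. The class name LiteralPatternSearch and the method find are made up for illustration. Pattern.LITERAL is used so that characters such as '.' or '*' in the needle are matched literally; which search algorithm the JDK picks internally is an implementation detail that this code does not control or verify.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: plain substring search through the regex API.
final class LiteralPatternSearch {
    /** Returns the index of the first occurrence of needle in haystack, or -1 if there is none. */
    static int find(String haystack, String needle) {
        // LITERAL makes the whole needle a literal string, not a regex.
        Pattern pattern = Pattern.compile(needle, Pattern.LITERAL);
        Matcher matcher = pattern.matcher(haystack);
        return matcher.find() ? matcher.start() : -1;
    }

    public static void main(String[] args) {
        // Results are printed next to String.indexOf for comparison.
        System.out.println(find("hello world", "world") + " " + "hello world".indexOf("world"));
        System.out.println(find("a.b axb", ".b") + " " + "a.b axb".indexOf(".b"));
    }
}
```

If the same needle is searched for many times, compiling the Pattern once and reusing it avoids paying the compilation cost on every call.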

As evidence for why you shouldn't use Java regex for actual regexes, I present the following:

import java.util.regex.Pattern;

public class regex {
  public static void main(String[] args) throws Exception {
    String haystack = "ab";
    String needle = "abab?.*";
    // Double the haystack 7 times: 2 * 2^7 = 256 characters.
    for (int i = 0; i < 7; i++) haystack = haystack + haystack;
    // Double the needle 4 times: 7 * 2^4 = 112 characters.
    for (int i = 0; i < 4; i++) needle = needle + needle;
    System.out.println(haystack.length() + " " + needle.length());
    // Time a single full-string match attempt.
    long before = System.currentTimeMillis();
    System.out.println(Pattern.matches(needle, haystack));
    long after = System.currentTimeMillis(); // long after indeed...
    System.out.println(after - before);
  }
}

This searches a 256-character haystack for a 112-character needle regex (an honest regex of the kind you learnt about in compilers class). It takes about 24 seconds to complete on my machine.
