String.contains vs Set<String>.contains vs regex String.matches()


Question

I have two sets of strings, neither very long (200~500 words), stored in two files that look like this:

File1          File2

this           window
that           good
word           work
java           fine
book           home

All words are unique.

First, the strings are read from the files (line by line) and stored in one of two ways:

  1. Set<String> set1 and Set<String> set2, which may look like this: [this, that, word, java, book] and [window, good, work, fine, home]
  2. String str1 and String str2, which may look like this: str1: thisthatwordjava and str2: windowgoodworkfinehome, or comma-separated, e.g. str1: this,that,word,java.

Now there are three ways to check whether the word home is present in a Set or a String:

  1. Using set1/2.contains("home")
  2. Using str1/2.contains("home")
  3. Using str1/2.matches("home")

All of the above will work fine, but which one is the best?

Note: I am asking because these checks will be performed very frequently.

Recommended answer

Don't Make Performance Assumptions

What makes you think that String.contains will have "better performance"?

It won't, except in very simple cases, that is, if:

  • your list of strings is short,
  • the strings to compare are short,
  • you only want to do a one-off lookup.

For all other cases, the Set approach will scale and work better. Sure, you'll have a memory overhead for the Set as opposed to a single string, but lookups stay O(1) on average in the number of stored strings, even if you store millions of them and compare long strings.
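In keeping with the advice not to assume, the difference can be measured roughly. The following is only a minimal sketch: the class name LookupTiming, the synthetic words w0, w1, ... and the size of 100,000 are all assumptions made for illustration, and a single System.nanoTime() reading is a crude indicator at best (a harness such as JMH is the more reliable tool).

```java
import java.util.HashSet;
import java.util.Set;

class LookupTiming {
    // Builds n distinct synthetic words: w0, w1, ..., w(n-1).
    static Set<String> buildSet(int n) {
        Set<String> set = new HashSet<>();
        for (int i = 0; i < n; i++) {
            set.add("w" + i);
        }
        return set;
    }

    // Joins the same words into one string, with a leading and trailing comma
    // so that a whole word can be searched for as ",word,".
    static String buildString(int n) {
        StringBuilder sb = new StringBuilder(",");
        for (int i = 0; i < n; i++) {
            sb.append("w").append(i).append(',');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int n = 100_000;
        Set<String> set = buildSet(n);
        String joined = buildString(n);
        String target = "w" + (n - 1); // last word: worst case for the linear scan

        long t0 = System.nanoTime();
        boolean inSet = set.contains(target);                   // hash lookup
        long t1 = System.nanoTime();
        boolean inString = joined.contains("," + target + ","); // linear scan
        long t2 = System.nanoTime();

        System.out.println("set:    " + inSet + " in " + (t1 - t0) + " ns");
        System.out.println("string: " + inString + " in " + (t2 - t1) + " ns");
    }
}
```

The string variant has to scan the text from the start, so its cost grows with the total length, while the hash lookup does not depend on how many words are stored.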

Use the safer and more robust design, especially since it is not difficult to implement here. And as you mention that you will check frequently, a set approach is definitely better for you.

Also, String.contains is unsafe: if your data holds a string and you look up one of its substrings, the lookup will report a false match. As kennytm said in a comment, if we use your example and you have the "java" string in your list, looking up "ava" will match it, which you apparently don't want.
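The difference in semantics between the three calls can be illustrated with the question's sample words. This is a small sketch, not part of the original answer: the class name LookupSemantics and its helper methods are made up for the example. It also shows that String.matches only returns true when the entire string matches the pattern, so str2.matches("home") is not a containment test.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class LookupSemantics {
    // Sample data taken from the question's File1 and File2.
    static final Set<String> SET1 =
        new HashSet<>(Arrays.asList("this", "that", "word", "java", "book"));
    static final String STR1 = "this,that,word,java,book";
    static final String STR2 = "window,good,work,fine,home";

    // Exact-word membership: true only if the whole word was stored.
    static boolean inSet(String word) {
        return SET1.contains(word);
    }

    // Substring search: true if the characters occur anywhere in the text.
    static boolean inString(String text, String word) {
        return text.contains(word);
    }

    // Regex match: String.matches requires the ENTIRE text to match the pattern.
    static boolean matchesWhole(String text, String regex) {
        return text.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(inSet("java"));                         // true
        System.out.println(inSet("ava"));                          // false
        System.out.println(inString(STR1, "ava"));                 // true (false positive)
        System.out.println(matchesWhole(STR2, "home"));            // false
        System.out.println(matchesWhole(STR2, ".*\\bhome\\b.*"));  // true
    }
}
```

Only the Set answers the question "was this exact word stored?" without extra delimiters or a more elaborate pattern.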

You may not want a plain HashSet with its default settings, though. For instance, you could consider a Guava ImmutableSet if your set will be created only once but checked very often.

Here's what I'd do, assuming you want an immutable set (as you say you read the list of strings from a file). This is off-hand and unverified, so forgive the lack of ceremony.

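import com.google.common.base.Splitter;
import com.google.common.collect.ImmutableSet;
import com.google.common.io.Files;

import java.io.File;
import java.nio.charset.StandardCharsets;
import java.util.Set;

// Read the whole file, split it on commas, and build an immutable lookup set.
// read() throws IOException, so this must run where that exception is handled.
// Assumes comma-separated entries; for one word per line, split on newlines instead.
final Set<String> lookupTable = ImmutableSet.copyOf(
  Splitter.on(',')
    .trimResults()
    .omitEmptyStrings()
    .split(Files.asCharSource(new File("YOUR_FILE_PATH"), StandardCharsets.UTF_8).read())
);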

Season to taste with the correct path and charset, and with or without trimming, depending on whether you want to allow spaces and empty strings.

If you don't want to use Guava and only vanilla Java, then simply do something like this in Java 8 (again, apologies, untested):

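import java.nio.file.*;
import java.util.*;
import static java.util.stream.Collectors.toSet;

// Files.lines throws IOException; each line is split on commas and flattened into words.
final Set<String> lookupTable =
    Files.lines(Paths.get("YOUR_FILE_PATH"))
      .map(line -> line.split(",+"))
      .flatMap(Arrays::stream)
      .collect(toSet());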

Using Java < 8

If you have Java < 8, then use the usual FileInputStream to read the file, then String.split() or StringTokenizer to extract the words, and finally add them to a Set.
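The steps above could be sketched roughly as follows. This is an untested-in-production illustration rather than the answerer's code: the class name PreJava8Lookup, the helper addWords, the comma delimiter and the UTF-8 charset are all assumptions, and it avoids lambdas and try-with-resources so that it also compiles on older Java versions.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashSet;
import java.util.Set;

class PreJava8Lookup {
    // Splits one line on commas and adds each non-empty, trimmed word to the set.
    static void addWords(String line, Set<String> target) {
        for (String word : line.split(",")) {
            String trimmed = word.trim();
            if (trimmed.length() > 0) {
                target.add(trimmed);
            }
        }
    }

    // Reads the file line by line and collects every word into a HashSet.
    static Set<String> readWords(String path) throws IOException {
        Set<String> words = new HashSet<String>();
        BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream(path), "UTF-8"));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                addWords(line, words);
            }
        } finally {
            reader.close(); // always release the file handle
        }
        return words;
    }
}
```

A call such as PreJava8Lookup.readWords("YOUR_FILE_PATH").contains("home") then performs the lookup; because each line is also split on commas, the same code handles both one word per line and comma-separated files.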
