Fastest way to strip all non-printable characters from a Java String

Question

What is the fastest way to strip all non-printable characters from a String in Java?

So far I've tried the following approaches and measured each on a 138-byte, 131-character String:

  • String's replaceAll() - slowest method
    • 517009 results / sec
  • Precompile a Pattern, then use Matcher's replaceAll()
    • 637836 results / sec
  • Use StringBuffer, get code points using codePointAt() one-by-one and append to StringBuffer
    • 711946 results / sec
  • Use StringBuffer, get chars using charAt() one-by-one and append to StringBuffer
    • 1052964 results / sec
  • Preallocate a char[] buffer, get chars using charAt() one-by-one and fill this buffer, then convert back to String
    • 2022653 results / sec
  • Preallocate 2 char[] buffers - old and new, get all chars for existing String at once using getChars(), iterate over old buffer one-by-one and fill new buffer, then convert new buffer to String - my own fastest version
    • 2502502 results / sec
  • Same stuff with 2 buffers - only using byte[], getBytes() and specifying encoding as "utf-8"
    • 857485 results / sec
  • Same stuff with 2 byte[] buffers, but specifying encoding as a constant Charset.forName("utf-8")
    • 791076 results / sec
  • Same stuff with 2 byte[] buffers, but specifying encoding as 1-byte local encoding (barely a sane thing to do)
    • 370164 results / sec
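For reference, the three slowest families in the list above might look roughly like the sketch below. It assumes that "non-printable" means any char below U+0020, matching the `ch >= ' '` test in the question's own loop, and it uses StringBuilder where the list mentions StringBuffer. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

final class SlowStrippers {
    // Assumption: control characters are the chars U+0000..U+001F.
    private static final Pattern CONTROL = Pattern.compile("[\\x00-\\x1F]");

    // String.replaceAll() compiles the regex again on every call.
    static String viaStringReplaceAll(String s) {
        return s.replaceAll("[\\x00-\\x1F]", "");
    }

    // The Pattern is compiled once and only a Matcher is created per call.
    static String viaPrecompiledPattern(String s) {
        return CONTROL.matcher(s).replaceAll("");
    }

    // Copies the kept chars one by one into a pre-sized builder.
    static String viaStringBuilder(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch >= ' ') {
                sb.append(ch);
            }
        }
        return sb.toString();
    }
}
```

All three return the same result for the same input; they differ only in how much work is done per call.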

My best try was the following:

    char[] oldChars = new char[s.length()];
    s.getChars(0, s.length(), oldChars, 0);
    char[] newChars = new char[s.length()];
    int newLen = 0;
    for (int j = 0; j < s.length(); j++) {
        char ch = oldChars[j];
        if (ch >= ' ') {
            newChars[newLen] = ch;
            newLen++;
        }
    }
    s = new String(newChars, 0, newLen);

Any thoughts on how to make it even faster?

Bonus points for answering a very strange question: why does using the "utf-8" charset name directly yield better performance than using a pre-allocated static constant Charset.forName("utf-8")?
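To make the bonus question concrete, the two byte[] variants being compared presumably differ only in how the charset is passed. The sketch below is an assumption about their shape, not the benchmarked code: it compacts a single buffer in place rather than using two, and the class and method names are hypothetical. It does not try to explain the performance gap.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

final class ByteStrippers {
    private static final Charset UTF8 = Charset.forName("utf-8");

    // Variant 1: charset given by name on every call.
    static String viaCharsetName(String s) throws UnsupportedEncodingException {
        byte[] bytes = s.getBytes("utf-8");
        int newLen = compact(bytes);
        return new String(bytes, 0, newLen, "utf-8");
    }

    // Variant 2: charset given as a pre-allocated constant.
    static String viaCharsetConstant(String s) {
        byte[] bytes = s.getBytes(UTF8);
        int newLen = compact(bytes);
        return new String(bytes, 0, newLen, UTF8);
    }

    // In UTF-8, every byte of a multi-byte sequence has its high bit set,
    // so it is negative as a Java byte. Only single-byte control codes
    // fall in the range 0..0x1F, and those are the bytes dropped here.
    private static int compact(byte[] buf) {
        int newLen = 0;
        for (int j = 0; j < buf.length; j++) {
            byte b = buf[j];
            if (b < 0 || b >= ' ') {
                buf[newLen++] = b;
            }
        }
        return newLen;
    }
}
```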

Update

  • A suggestion from ratchet freak yields an impressive 3105590 results / sec, a +24% improvement!
  • A suggestion from Ed Staub yields yet another improvement: 3471017 results / sec, +12% over the previous best.

Update 2

I've done my best to collect all the proposed solutions and their cross-mutations, and published them as a small benchmarking framework on GitHub. It currently sports 17 algorithms. One of them is "special": the Voo1 algorithm (provided by SO user Voo) employs intricate reflection tricks and thus achieves stellar speeds, but it messes up the state of JVM strings, so it is benchmarked separately.

You're welcome to check it out and run it to see the results on your box. Here's a summary of the results I got on mine. Its specs:

  • Debian sid
  • Linux 2.6.39-2-amd64 (x86_64)
  • Java installed from a package sun-java6-jdk-6.24-1, JVM identifies itself as
    • Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
    • Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)

Different algorithms show markedly different results given different sets of input data. I've run the benchmark in 3 modes:

Same single string

This mode works on the same single string, provided by the StringSource class as a constant. The showdown is:

 Ops / s  │ Algorithm
──────────┼──────────────────────────────
6 535 947 │ Voo1
──────────┼──────────────────────────────
5 350 454 │ RatchetFreak2EdStaub1GreyCat1
5 249 343 │ EdStaub1
5 002 501 │ EdStaub1GreyCat1
4 859 086 │ ArrayOfCharFromStringCharAt
4 295 532 │ RatchetFreak1
4 045 307 │ ArrayOfCharFromArrayOfChar
2 790 178 │ RatchetFreak2EdStaub1GreyCat2
2 583 311 │ RatchetFreak2
1 274 859 │ StringBuilderChar
1 138 174 │ StringBuilderCodePoint
  994 727 │ ArrayOfByteUTF8String
  918 611 │ ArrayOfByteUTF8Const
  756 086 │ MatcherReplace
  598 945 │ StringReplaceAll
  460 045 │ ArrayOfByteWindows1251

In charted form:
(source: greycat.ru)

Multiple strings, 100% of strings contain control characters

The source string provider pre-generated lots of random strings using the (0..127) character set, so almost all strings contained at least one control character. Algorithms received strings from this pre-generated array in round-robin fashion.

 Ops / s  │ Algorithm
──────────┼──────────────────────────────
2 123 142 │ Voo1
──────────┼──────────────────────────────
1 782 214 │ EdStaub1
1 776 199 │ EdStaub1GreyCat1
1 694 628 │ ArrayOfCharFromStringCharAt
1 481 481 │ ArrayOfCharFromArrayOfChar
1 460 067 │ RatchetFreak2EdStaub1GreyCat1
1 438 435 │ RatchetFreak2EdStaub1GreyCat2
1 366 494 │ RatchetFreak2
1 349 710 │ RatchetFreak1
  893 176 │ ArrayOfByteUTF8String
  817 127 │ ArrayOfByteUTF8Const
  778 089 │ StringBuilderChar
  734 754 │ StringBuilderCodePoint
  377 829 │ ArrayOfByteWindows1251
  224 140 │ MatcherReplace
  211 104 │ StringReplaceAll

In charted form:
(source: greycat.ru)

Multiple strings, 1% of strings contain control characters

Same as the previous mode, but only 1% of the strings were generated with control characters; the other 99% were generated using the [32..127] character set, so they couldn't contain control characters at all. This synthetic load comes closest to the real-world application of this algorithm at my place.

 Ops / s  │ Algorithm
──────────┼──────────────────────────────
3 711 952 │ Voo1
──────────┼──────────────────────────────
2 851 440 │ EdStaub1GreyCat1
2 455 796 │ EdStaub1
2 426 007 │ ArrayOfCharFromStringCharAt
2 347 969 │ RatchetFreak2EdStaub1GreyCat2
2 242 152 │ RatchetFreak1
2 171 553 │ ArrayOfCharFromArrayOfChar
1 922 707 │ RatchetFreak2EdStaub1GreyCat1
1 857 010 │ RatchetFreak2
1 023 751 │ ArrayOfByteUTF8String
  939 055 │ StringBuilderChar
  907 194 │ ArrayOfByteUTF8Const
  841 963 │ StringBuilderCodePoint
  606 465 │ MatcherReplace
  501 555 │ StringReplaceAll
  381 185 │ ArrayOfByteWindows1251

In charted form:
(source: greycat.ru)
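The two multi-string modes above could be reproduced with a string source along the following lines. This is only a sketch under stated assumptions, not the code from the benchmarking framework: the class name, constructor parameters and fixed string length are invented for illustration, and (0..127) is read as inclusive on both ends.

```java
import java.util.Random;

final class RandomStringSource {
    private final String[] strings;
    private int index = 0;

    // controlFraction is the share of strings drawn from 0..127 (these may
    // contain control characters); the rest are drawn from 32..127 only.
    RandomStringSource(int count, int length, double controlFraction, long seed) {
        Random rnd = new Random(seed);
        strings = new String[count];
        for (int i = 0; i < count; i++) {
            int lo = rnd.nextDouble() < controlFraction ? 0 : 32;
            char[] chars = new char[length];
            for (int j = 0; j < length; j++) {
                chars[j] = (char) (lo + rnd.nextInt(128 - lo));
            }
            strings[i] = new String(chars);
        }
    }

    // Hands out the pre-generated strings in round-robin order.
    String next() {
        String s = strings[index];
        index = (index + 1) % strings.length;
        return s;
    }
}
```

With this sketch, `new RandomStringSource(n, len, 1.0, seed)` would correspond to the "100%" mode and `new RandomStringSource(n, len, 0.01, seed)` to the "1%" mode.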

It's very hard for me to decide who provided the best answer, but given that the best solution for the real-world application was given/inspired by Ed Staub, I guess it would be fair to mark his answer. Thanks to all who took part in this; your input was very helpful and invaluable. Feel free to run the test suite on your box and propose even better solutions (working JNI solution, anyone?).

Answer

If it is reasonable to embed this method in a class which is not shared across threads, then you can reuse the buffer:

    char[] oldChars = new char[5];

    String stripControlChars(String s)
    {
        final int inputLen = s.length();
        if ( oldChars.length < inputLen )
        {
            oldChars = new char[inputLen];
        }
        s.getChars(0, inputLen, oldChars, 0);

etc...

This is a big win: roughly 20% over the current best case, as I understand it.
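One possible way to fill in the part elided by "etc..." is sketched below. It assumes the rest of the method follows the question's loop (keep every char `>= ' '`). Compacting in place within the reused buffer is a choice made here for brevity rather than something the answer spells out. The class name is hypothetical.

```java
final class ControlCharStripper {
    // Reused across calls. Not thread-safe: keep one instance per thread.
    private char[] oldChars = new char[5];

    String stripControlChars(String s) {
        final int inputLen = s.length();
        if (oldChars.length < inputLen) {
            oldChars = new char[inputLen];   // grow only when needed
        }
        s.getChars(0, inputLen, oldChars, 0);

        // Assumption: compact in place. The write index never overtakes
        // the read index, so no second buffer is required.
        int newLen = 0;
        for (int j = 0; j < inputLen; j++) {
            char ch = oldChars[j];
            if (ch >= ' ') {
                oldChars[newLen++] = ch;
            }
        }
        // new String(...) copies the chars, so the buffer can be reused.
        return new String(oldChars, 0, newLen);
    }
}
```

Because the loop is bounded by `inputLen` rather than `oldChars.length`, stale data left in the buffer by a longer earlier input never leaks into the result.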

If this is to be used on potentially large strings and the memory "leak" is a concern, a weak reference can be used.
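A minimal sketch of the weak-reference idea, assuming the same stripping loop as in the question: the buffer is held through a `java.lang.ref.WeakReference`, so the garbage collector may reclaim it between calls and it is reallocated on demand. The class name is invented for illustration.

```java
import java.lang.ref.WeakReference;

final class WeakBufferStripper {
    // Not thread-safe. The referent may be cleared by the GC at any time.
    private WeakReference<char[]> bufRef = new WeakReference<char[]>(null);

    String stripControlChars(String s) {
        final int inputLen = s.length();
        // Holding the array in a local keeps it strongly reachable
        // for the duration of this call.
        char[] buf = bufRef.get();
        if (buf == null || buf.length < inputLen) {
            buf = new char[inputLen];
            bufRef = new WeakReference<char[]>(buf);
        }
        s.getChars(0, inputLen, buf, 0);

        int newLen = 0;
        for (int j = 0; j < inputLen; j++) {
            char ch = buf[j];
            if (ch >= ' ') {
                buf[newLen++] = ch;
            }
        }
        return new String(buf, 0, newLen);
    }
}
```

Weakly referenced objects are eligible for collection at the next GC cycle, so under frequent collections much of the reuse benefit may be lost. A `java.lang.ref.SoftReference`, which is cleared only under memory pressure, is a drop-in alternative worth measuring.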
