Fastest way to strip all non-printable characters from a Java String

Question

What is the fastest way to strip all non-printable characters from a String in Java?

So far I've tried the following approaches and measured them on a 138-byte, 131-character String:


  • String's replaceAll() - slowest method
    • 517009 results / sec
  • 637836 results / sec
  • 711946 results / sec
  • 1052964 results / sec
  • 2022653 results / sec
  • 2502502 results / sec
  • 857485 results / sec
  • 791076 results / sec
  • 370164 results / sec
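For reference, the regex-based baseline named in the first entry of the list above could look roughly like this. It is a sketch, not the code that was measured: the class name, the method names, and the character class `[\x00-\x1F]` are assumptions, the latter chosen to match the `ch >= ' '` test used in the hand-written loop further down.

```java
import java.util.regex.Pattern;

// Sketch only: names and pattern are assumptions, not the benchmarked code.
final class RegexStrip {
    // Matches U+0000..U+001F, i.e. every char below ' '.
    private static final Pattern CONTROL = Pattern.compile("[\\x00-\\x1F]");

    // String.replaceAll compiles the regex on every call.
    static String viaReplaceAll(String s) {
        return s.replaceAll("[\\x00-\\x1F]", "");
    }

    // Reuses a precompiled Pattern, avoiding the per-call compilation.
    static String viaPrecompiled(String s) {
        return CONTROL.matcher(s).replaceAll("");
    }
}
```

Both methods return the same result; they differ only in whether the pattern is compiled once or on each call.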

      My best try was the following:

          char[] oldChars = new char[s.length()];
          s.getChars(0, s.length(), oldChars, 0);
          char[] newChars = new char[s.length()];
          int newLen = 0;
          for (int j = 0; j < s.length(); j++) {
              char ch = oldChars[j];
              if (ch >= ' ') {
                  newChars[newLen] = ch;
                  newLen++;
              }
          }
          s = new String(newChars, 0, newLen);
      

      Any thoughts on how to make it even faster?

      Bonus points for answering a very strange question: why does using the "utf-8" charset name directly yield better performance than using a pre-allocated static const Charset.forName("utf-8")?
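To make the two variants being compared concrete, here is a minimal sketch of a byte-based filter that passes the encoding once by name and once as a preallocated constant. The class and method names are hypothetical and the filtering loop is an assumption, since the measured code is not shown here. The sketch only illustrates the two call shapes; it does not explain the performance difference.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

// Sketch only: class and method names are hypothetical.
final class ByteStrip {
    private static final Charset UTF8 = Charset.forName("utf-8");

    // Variant 1: encoding passed by name.
    static String byName(String s) {
        try {
            byte[] buf = s.getBytes("utf-8");
            int n = filter(buf);
            return new String(buf, 0, n, "utf-8");
        } catch (UnsupportedEncodingException e) {
            // Every JVM is required to support UTF-8, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    // Variant 2: encoding passed as a preallocated Charset constant.
    static String byConstant(String s) {
        byte[] buf = s.getBytes(UTF8);
        int n = filter(buf);
        return new String(buf, 0, n, UTF8);
    }

    // Compacts buf in place, dropping bytes 0x00..0x1F; returns the new length.
    // Java bytes are signed, so the bytes of multi-byte UTF-8 sequences are
    // negative and have to be kept explicitly.
    private static int filter(byte[] buf) {
        int n = 0;
        for (byte b : buf) {
            if (b < 0 || b >= ' ') {
                buf[n++] = b;
            }
        }
        return n;
    }
}
```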


      • A suggestion from ratchet freak yields an impressive 3105590 results / sec, a +24% improvement!
      • A suggestion from Ed Staub yields yet another improvement - 3471017 results / sec, +12% over the previous best.

      I've tried my best to collect all the proposed solutions and their cross-mutations, and published the result as a small benchmarking framework at github. Currently it sports 17 algorithms. One of them is "special" - the Voo1 algorithm (provided by SO user Voo) employs intricate reflection tricks, thus achieving stellar speeds, but it messes up the JVM strings' state, so it's benchmarked separately.

      You're welcome to check it out and run it to determine the results on your box. Here's a summary of the results I've got on mine. Its specs:


        • Debian sid
        • Linux 2.6.39-2-amd64 (x86_64)
        • Java installed from a package sun-java6-jdk-6.24-1, JVM identifies itself as
          • Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
          • Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)

          Different algorithms show ultimately different results given a different set of input data. I've run the benchmark in 3 modes:

          Same single string

          This mode works on the same single string, provided by the StringSource class as a constant. The showdown is:

          
           Ops / s  │ Algorithm
          ──────────┼──────────────────────────────
          6 535 947 │ Voo1
          ──────────┼──────────────────────────────
          5 350 454 │ RatchetFreak2EdStaub1GreyCat1
          5 249 343 │ EdStaub1
          5 002 501 │ EdStaub1GreyCat1
          4 859 086 │ ArrayOfCharFromStringCharAt
          4 295 532 │ RatchetFreak1
          4 045 307 │ ArrayOfCharFromArrayOfChar
          2 790 178 │ RatchetFreak2EdStaub1GreyCat2
          2 583 311 │ RatchetFreak2
          1 274 859 │ StringBuilderChar
          1 138 174 │ StringBuilderCodePoint
            994 727 │ ArrayOfByteUTF8String
            918 611 │ ArrayOfByteUTF8Const
            756 086 │ MatcherReplace
            598 945 │ StringReplaceAll
            460 045 │ ArrayOfByteWindows1251
          

          In chart form:
          Same single string chart: http://www.greycat.ru/img/os-chart-single.png

          Multiple strings, 100% concentration

          The source string provider pre-generated lots of random strings using the (0..127) character set - thus almost all strings contained at least one control character. Algorithms received strings from this pre-generated array in round-robin fashion.

          
           Ops / s  │ Algorithm
          ──────────┼──────────────────────────────
          2 123 142 │ Voo1
          ──────────┼──────────────────────────────
          1 782 214 │ EdStaub1
          1 776 199 │ EdStaub1GreyCat1
          1 694 628 │ ArrayOfCharFromStringCharAt
          1 481 481 │ ArrayOfCharFromArrayOfChar
          1 460 067 │ RatchetFreak2EdStaub1GreyCat1
          1 438 435 │ RatchetFreak2EdStaub1GreyCat2
          1 366 494 │ RatchetFreak2
          1 349 710 │ RatchetFreak1
            893 176 │ ArrayOfByteUTF8String
            817 127 │ ArrayOfByteUTF8Const
            778 089 │ StringBuilderChar
            734 754 │ StringBuilderCodePoint
            377 829 │ ArrayOfByteWindows1251
            224 140 │ MatcherReplace
            211 104 │ StringReplaceAll
          

          In chart form:
          Multiple strings, 100% concentration chart: http://www.greycat.ru/img/os-chart-multi100.png

          Multiple strings, 1% concentration

          Same as the previous mode, but only 1% of the strings were generated with control characters - the other 99% were generated using the [32..127] character set, so they couldn't contain control characters at all. This synthetic load comes the closest to the real-world application of this algorithm at my place.

          
           Ops / s  │ Algorithm
          ──────────┼──────────────────────────────
          3 711 952 │ Voo1
          ──────────┼──────────────────────────────
          2 851 440 │ EdStaub1GreyCat1
          2 455 796 │ EdStaub1
          2 426 007 │ ArrayOfCharFromStringCharAt
          2 347 969 │ RatchetFreak2EdStaub1GreyCat2
          2 242 152 │ RatchetFreak1
          2 171 553 │ ArrayOfCharFromArrayOfChar
          1 922 707 │ RatchetFreak2EdStaub1GreyCat1
          1 857 010 │ RatchetFreak2
          1 023 751 │ ArrayOfByteUTF8String
            939 055 │ StringBuilderChar
            907 194 │ ArrayOfByteUTF8Const
            841 963 │ StringBuilderCodePoint
            606 465 │ MatcherReplace
            501 555 │ StringReplaceAll
            381 185 │ ArrayOfByteWindows1251
          

          In chart form:
          Multiple strings, 1% concentration chart: http://www.greycat.ru/img/os-chart-multi1.png
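The two multi-string modes described above differ only in the share of strings that may contain control characters, so a single generator can cover both. The following is a rough sketch under that assumption; it is not the StringSource class from the benchmark repository, and the class name, constructor parameters, and seeding are all hypothetical.

```java
import java.util.Random;

// Sketch only: not the StringSource class from the benchmark repository.
final class RandomStringSource {
    private final String[] pool;
    private int next = 0;

    // controlShare is the fraction of strings drawn from 0..127 (these may
    // contain control chars); the rest are drawn from 32..127 and contain none.
    RandomStringSource(int poolSize, int length, double controlShare, long seed) {
        Random rnd = new Random(seed);
        pool = new String[poolSize];
        for (int i = 0; i < poolSize; i++) {
            int low = rnd.nextDouble() < controlShare ? 0 : 32;
            char[] chars = new char[length];
            for (int j = 0; j < length; j++) {
                chars[j] = (char) (low + rnd.nextInt(128 - low));
            }
            pool[i] = new String(chars);
        }
    }

    // Hands out the pre-generated strings in round-robin order.
    String next() {
        String s = pool[next];
        next = (next + 1) % pool.length;
        return s;
    }
}
```

With this sketch, a `controlShare` of 1.0 would correspond to the 100% mode and 0.01 to the 1% mode. Generating the pool up front keeps string construction out of the timed loop.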

          It's very hard for me to decide who provided the best answer, but given that the best solution for the real-world application was given/inspired by Ed Staub, I guess it would be fair to mark his answer. Thanks to all who took part in this; your input was very helpful and invaluable. Feel free to run the test suite on your box and propose even better solutions (working JNI solution, anyone?).

          • GitHub repository with a benchmarking suite

          Recommended answer

          If it is reasonable to embed this method in a class which is not shared across threads, then you can reuse the buffer:

          char [] oldChars = new char[5];
          
          String stripControlChars(String s)
          {
              final int inputLen = s.length();
              if ( oldChars.length < inputLen )
              {
                  oldChars = new char[inputLen];
              }
              s.getChars(0, inputLen, oldChars, 0);
          

          etc...

          This is a big win - 20% or so, as I understand the current best case.
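The fragment above stops right after getChars(). One way it might be completed is sketched below. Everything past the getChars() call is an assumption: the in-place compaction reuses the `ch >= ' '` test from the question, and the early return for unchanged input is an extra shortcut that the answer text does not mention. The class name is hypothetical.

```java
// Sketch only: everything after getChars() is an assumed completion.
final class ControlCharStripper {
    // Reused across calls, so an instance must not be shared between threads.
    private char[] oldChars = new char[5];

    String stripControlChars(String s) {
        final int inputLen = s.length();
        if (oldChars.length < inputLen) {
            oldChars = new char[inputLen];
        }
        s.getChars(0, inputLen, oldChars, 0);

        // Compact in place: the write index never overtakes the read index.
        int newLen = 0;
        for (int j = 0; j < inputLen; j++) {
            char ch = oldChars[j];
            if (ch >= ' ') {
                oldChars[newLen++] = ch;
            }
        }
        // Nothing was removed: return the input and skip the allocation.
        return newLen == inputLen ? s : new String(oldChars, 0, newLen);
    }
}
```

The buffer only ever grows, so one instance per thread (or per single-threaded worker) pays the allocation cost at most a handful of times.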

          If this is to be used on potentially large strings and the memory "leak" is a concern, a weak reference can be used.
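A minimal sketch of the weak-reference idea follows, assuming the buffer is simply re-created whenever the collector has cleared it. The class name is hypothetical and the filtering loop is the same assumed `ch >= ' '` test as in the question.

```java
import java.lang.ref.WeakReference;

// Sketch only: one possible reading of "a weak reference can be used".
final class WeakBufferStripper {
    // The buffer may be reclaimed by the GC between calls.
    private WeakReference<char[]> bufRef = new WeakReference<char[]>(null);

    String stripControlChars(String s) {
        final int inputLen = s.length();
        // The local variable holds a strong reference for the whole call.
        char[] buf = bufRef.get();
        if (buf == null || buf.length < inputLen) {
            buf = new char[inputLen];
            bufRef = new WeakReference<char[]>(buf);
        }
        s.getChars(0, inputLen, buf, 0);

        int newLen = 0;
        for (int j = 0; j < inputLen; j++) {
            char ch = buf[j];
            if (ch >= ' ') {
                buf[newLen++] = ch;
            }
        }
        return new String(buf, 0, newLen);
    }
}
```

A weakly referenced buffer can be cleared at any collection, so it may be reallocated often under allocation pressure. A SoftReference is typically retained longer and might be worth measuring as an alternative.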
