Fastest way to strip all non-printable characters from a Java String
Question
What is the fastest way to strip all non-printable characters from a String in Java?
So far I've tried and measured on a 138-byte, 131-character String:
- String's replaceAll() - slowest method - 517009 results / sec
- 637836 results / sec
- 711946 results / sec
- 1052964 results / sec
- 2022653 results / sec
- 2502502 results / sec
- 857485 results / sec
- 791076 results / sec
- 370164 results / sec
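For reference, the regex-based baseline can be sketched as follows. The exact regular expression used in the measurements is not shown in the question, so the character class below is an assumption: it treats everything below the space character as non-printable. The class name `RegexStripSketch` and both method names are made up for illustration.

```java
import java.util.regex.Pattern;

final class RegexStripSketch {
    // Assumption: "non-printable" means code units 0x00..0x1F, i.e. below ' '.
    private static final String CONTROL_CHARS = "[\\x00-\\x1F]";

    // Compiled once, so each call skips the regex compilation step.
    private static final Pattern CONTROL_PATTERN = Pattern.compile(CONTROL_CHARS);

    // Baseline: String.replaceAll() compiles the pattern on every call.
    static String viaReplaceAll(String s) {
        return s.replaceAll(CONTROL_CHARS, "");
    }

    // Same result, reusing the precompiled Pattern through a Matcher.
    static String viaPrecompiledPattern(String s) {
        return CONTROL_PATTERN.matcher(s).replaceAll("");
    }
}
```

Both methods return the same string; they differ only in whether the pattern is compiled per call.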
My best try was the following:
    char[] oldChars = new char[s.length()];
    s.getChars(0, s.length(), oldChars, 0);
    char[] newChars = new char[s.length()];
    int newLen = 0;
    for (int j = 0; j < s.length(); j++) {
        char ch = oldChars[j];
        if (ch >= ' ') {
            newChars[newLen] = ch;
            newLen++;
        }
    }
    s = new String(newChars, 0, newLen);
Any thoughts on how to make it even faster?
Bonus points for answering a very strange question: why does using the "utf-8" charset name directly yield better performance than using a pre-allocated static const Charset.forName("utf-8")?
- Suggestion from ratchet freak yields impressive 3105590 results / sec performance, a +24% improvement!
- Suggestion from Ed Staub yields yet another improvement - 3471017 results / sec, a +12% over previous best.
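The two byte[] variants compared in the bonus question might look roughly like the sketch below. This is not the benchmarked code: the class name, the method names, and the in-place filtering step are assumptions. Only the two ways of naming the charset come from the question.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

final class ByteArrayStripSketch {
    private static final Charset UTF8 = Charset.forName("utf-8");

    // Variant 1: charset passed by name on every call.
    static String viaCharsetName(String s) {
        try {
            byte[] bytes = s.getBytes("utf-8");
            return new String(bytes, 0, compact(bytes), "utf-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Variant 2: charset passed as a pre-allocated constant.
    static String viaCharsetConst(String s) {
        byte[] bytes = s.getBytes(UTF8);
        return new String(bytes, 0, compact(bytes), UTF8);
    }

    // Compacts the array in place, dropping bytes 0x00..0x1F, and returns the
    // new length. Bytes of multi-byte UTF-8 sequences are negative as Java
    // bytes and are kept.
    private static int compact(byte[] bytes) {
        int n = 0;
        for (int i = 0; i < bytes.length; i++) {
            byte b = bytes[i];
            if (b < 0 || b >= ' ') {
                bytes[n++] = b;
            }
        }
        return n;
    }
}
```

Note the `b < 0` check: a plain `b >= ' '` comparison would also discard every non-ASCII character, because their UTF-8 bytes are negative when read as signed Java bytes.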
I've tried my best to collect all the proposed solutions and their cross-mutations and published them as a small benchmarking framework on GitHub. Currently it sports 17 algorithms. One of them is "special": the Voo1 algorithm (provided by SO user Voo) employs intricate reflection tricks, thus achieving stellar speeds, but it messes up the JVM strings' state, so it is benchmarked separately.
You're welcome to check it out and run it to determine results on your box. Here's a summary of the results I got on mine. Its specs:
- Debian sid
- Linux 2.6.39-2-amd64 (x86_64)
- Java installed from a package sun-java6-jdk-6.24-1, JVM identifies itself as
  - Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
  - Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)
Different algorithms show ultimately different results given a different set of input data. I've run the benchmark in 3 modes:
Same single string
This mode works on the same single string, provided by the StringSource class as a constant. The showdown is:
  Ops / s │ Algorithm
──────────┼──────────────────────────────
6 535 947 │ Voo1
──────────┼──────────────────────────────
5 350 454 │ RatchetFreak2EdStaub1GreyCat1
5 249 343 │ EdStaub1
5 002 501 │ EdStaub1GreyCat1
4 859 086 │ ArrayOfCharFromStringCharAt
4 295 532 │ RatchetFreak1
4 045 307 │ ArrayOfCharFromArrayOfChar
2 790 178 │ RatchetFreak2EdStaub1GreyCat2
2 583 311 │ RatchetFreak2
1 274 859 │ StringBuilderChar
1 138 174 │ StringBuilderCodePoint
  994 727 │ ArrayOfByteUTF8String
  918 611 │ ArrayOfByteUTF8Const
  756 086 │ MatcherReplace
  598 945 │ StringReplaceAll
  460 045 │ ArrayOfByteWindows1251
In chart form:
Same single string chart: http://www.greycat.ru/img/os-chart-single.png
Multiple strings, 100% concentration
The source string provider pre-generated lots of random strings using the (0..127) character set, so almost all strings contained at least one control character. Algorithms received strings from this pre-generated array in round-robin fashion.
  Ops / s │ Algorithm
──────────┼──────────────────────────────
2 123 142 │ Voo1
──────────┼──────────────────────────────
1 782 214 │ EdStaub1
1 776 199 │ EdStaub1GreyCat1
1 694 628 │ ArrayOfCharFromStringCharAt
1 481 481 │ ArrayOfCharFromArrayOfChar
1 460 067 │ RatchetFreak2EdStaub1GreyCat1
1 438 435 │ RatchetFreak2EdStaub1GreyCat2
1 366 494 │ RatchetFreak2
1 349 710 │ RatchetFreak1
  893 176 │ ArrayOfByteUTF8String
  817 127 │ ArrayOfByteUTF8Const
  778 089 │ StringBuilderChar
  734 754 │ StringBuilderCodePoint
  377 829 │ ArrayOfByteWindows1251
  224 140 │ MatcherReplace
  211 104 │ StringReplaceAll
In chart form:
Multiple strings, 100% concentration chart: http://www.greycat.ru/img/os-chart-multi100.png
Multiple strings, 1% concentration
Same as the previous mode, but only 1% of the strings were generated with control characters. The other 99% were generated using the [32..127] character set, so they could not contain control characters at all. This synthetic load comes closest to the real-world application of this algorithm at my place.
  Ops / s │ Algorithm
──────────┼──────────────────────────────
3 711 952 │ Voo1
──────────┼──────────────────────────────
2 851 440 │ EdStaub1GreyCat1
2 455 796 │ EdStaub1
2 426 007 │ ArrayOfCharFromStringCharAt
2 347 969 │ RatchetFreak2EdStaub1GreyCat2
2 242 152 │ RatchetFreak1
2 171 553 │ ArrayOfCharFromArrayOfChar
1 922 707 │ RatchetFreak2EdStaub1GreyCat1
1 857 010 │ RatchetFreak2
1 023 751 │ ArrayOfByteUTF8String
  939 055 │ StringBuilderChar
  907 194 │ ArrayOfByteUTF8Const
  841 963 │ StringBuilderCodePoint
  606 465 │ MatcherReplace
  501 555 │ StringReplaceAll
  381 185 │ ArrayOfByteWindows1251
In chart form:
Multiple strings, 1% concentration chart: http://www.greycat.ru/img/os-chart-multi1.png
It's very hard for me to decide who provided the best answer, but given that the best solution for the real-world application was given/inspired by Ed Staub, I guess it would be fair to mark his answer. Thanks to all who took part in this; your input was very helpful and invaluable. Feel free to run the test suite on your box and propose even better solutions (working JNI solution, anyone?).
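The input for the two multi-string modes can be approximated with a small generator like the one below. It is a hypothetical stand-in, not the StringSource class from the benchmark suite: the class name, constructor parameters, and the per-string random choice controlled by `controlShare` are all assumptions.

```java
import java.util.Random;

final class RandomStringSourceSketch {
    private final String[] strings;
    private int position = 0;

    // Pre-generates `count` strings of `len` chars each. With probability
    // `controlShare` a string is drawn from 0..127 (may contain control
    // characters); otherwise it is drawn from 32..127 (cannot contain any).
    RandomStringSourceSketch(int count, int len, double controlShare, long seed) {
        Random rnd = new Random(seed);
        strings = new String[count];
        for (int i = 0; i < count; i++) {
            int low = rnd.nextDouble() < controlShare ? 0 : 32;
            char[] chars = new char[len];
            for (int j = 0; j < len; j++) {
                chars[j] = (char) (low + rnd.nextInt(128 - low));
            }
            strings[i] = new String(chars);
        }
    }

    // Hands out the pre-generated strings in round-robin order.
    String next() {
        String s = strings[position];
        position = (position + 1) % strings.length;
        return s;
    }
}
```

Under these assumptions, `controlShare = 1.0` corresponds to the 100% mode and `controlShare = 0.01` to the 1% mode. Generating the strings up front keeps the generation cost out of the measured loop.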
- GitHub repository with a benchmarking suite
Recommended answer
If it is reasonable to embed this method in a class which is not shared across threads, then you can reuse the buffer:
    char[] oldChars = new char[5];

    String stripControlChars(String s) {
        final int inputLen = s.length();
        if (oldChars.length < inputLen) {
            oldChars = new char[inputLen];
        }
        s.getChars(0, inputLen, oldChars, 0);
etc...
This is a big win: 20% or so over what I understand to be the current best case.
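The excerpt above stops at "etc...", so here is one way the whole class could look. Everything after the `getChars` call is an assumed continuation: the in-place compaction and the early return for unchanged input are illustrative choices, and the class name `ReusableBufferStripper` is made up.

```java
final class ReusableBufferStripper {
    // Grows on demand and is reused between calls.
    // Not safe to share between threads.
    private char[] oldChars = new char[5];

    String stripControlChars(String s) {
        final int inputLen = s.length();
        if (oldChars.length < inputLen) {
            oldChars = new char[inputLen];
        }
        s.getChars(0, inputLen, oldChars, 0);

        // Assumed continuation: compact the kept chars in place. This is safe
        // because the write index never overtakes the read index.
        int newLen = 0;
        for (int j = 0; j < inputLen; j++) {
            char ch = oldChars[j];
            if (ch >= ' ') {
                oldChars[newLen++] = ch;
            }
        }
        // Nothing was removed: return the input and skip the allocation.
        return newLen == inputLen ? s : new String(oldChars, 0, newLen);
    }
}
```

A caller would keep one instance per thread, for example `new ReusableBufferStripper().stripControlChars("a\tb")`, which yields `"ab"`.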
If this is to be used on potentially large strings and the memory "leak" is a concern, a weak reference can be used.
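A minimal sketch of the weak-reference idea, assuming the same stripping loop as in the question. The class name `WeakBufferStripper` is hypothetical, and the answer does not say how the reference should be managed, so this is only one possible arrangement.

```java
import java.lang.ref.WeakReference;

final class WeakBufferStripper {
    // The buffer is only weakly reachable between calls, so the garbage
    // collector is free to reclaim a large buffer. Not thread-safe.
    private WeakReference<char[]> bufferRef = new WeakReference<char[]>(null);

    String stripControlChars(String s) {
        final int inputLen = s.length();
        // Hold a strong reference for the duration of the call.
        char[] buf = bufferRef.get();
        if (buf == null || buf.length < inputLen) {
            buf = new char[inputLen];
            bufferRef = new WeakReference<char[]>(buf);
        }
        s.getChars(0, inputLen, buf, 0);

        int newLen = 0;
        for (int j = 0; j < inputLen; j++) {
            char ch = buf[j];
            if (ch >= ' ') {
                buf[newLen++] = ch;
            }
        }
        return newLen == inputLen ? s : new String(buf, 0, newLen);
    }
}
```

The trade-off is that a weakly referenced buffer may be collected at any garbage collection, in which case the next call allocates again. `java.lang.ref.SoftReference`, which is cleared in response to memory demand, is a possible alternative if the buffer should survive longer.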