StringTokenizer - reading lines with integers


Question

I have a question about optimization of my code (which works but is too slow...). I am reading input in the form

X1 Y1
X2 Y2
etc

where Xi, Yi are integers. I am using a BufferedReader to read lines and then a StringTokenizer to process those numbers, like this:

StringTokenizer st = new StringTokenizer(line, " ");

int x = Integer.parseInt(st.nextToken());
int y = Integer.parseInt(st.nextToken());

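For reference, a self-contained version of the loop described above might look like the sketch below (the summing is only a stand-in for whatever processing is actually done with x and y, and the StringReader stands in for the real input source):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.StringTokenizer;

public class ReadPairs {
    // Reads "X Y" lines from the reader, parses the two ints per line,
    // and sums them as placeholder processing.
    static long sumPairs(BufferedReader reader) throws IOException {
        long total = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            StringTokenizer st = new StringTokenizer(line, " ");
            int x = Integer.parseInt(st.nextToken());
            int y = Integer.parseInt(st.nextToken());
            total += x + y;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader r = new BufferedReader(new StringReader("12 34\n5 6\n"));
        System.out.println(sumPairs(r)); // prints 57
    }
}
```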
The problem is that this approach seems time-inefficient when coping with large data sets. Could you suggest some simple improvement (I have heard that hand-rolled integer parsing or a regex can be used) which would improve the performance? Thanks for any tips

EDIT: Perhaps I misjudged, and some improvements have to be made elsewhere in the code...
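On the "integer parse" rumour mentioned in the question: the idea is to read digits straight out of the line instead of creating substrings or tokens first. A minimal sketch of that idea (the helper name `parsePair` is mine, and it assumes non-negative integers separated by a single space, with no sign or overflow handling):

```java
public class FastParse {
    // Parses "X Y" with no substring or tokenizer allocation.
    // Assumes non-negative decimal integers and exactly one space.
    static int[] parsePair(String line) {
        int i = 0, x = 0, y = 0;
        // accumulate digits until the space
        while (i < line.length() && line.charAt(i) != ' ') {
            x = x * 10 + (line.charAt(i++) - '0');
        }
        i++; // skip the space
        // accumulate the remaining digits
        while (i < line.length()) {
            y = y * 10 + (line.charAt(i++) - '0');
        }
        return new int[]{x, y};
    }

    public static void main(String[] args) {
        int[] p = parsePair("12 34");
        System.out.println(p[0] + " " + p[1]); // prints 12 34
    }
}
```

Whether this is worth the loss of error handling depends entirely on whether parsing is actually the bottleneck, which the answer below measures.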

Answer

(Updated answer)

I can say that whatever the problems in your program's speed, the choice of tokenizer is not one of them. After an initial run of each method to even out initialisation quirks, I can parse 1,000,000 rows of "12 34" in milliseconds. You could switch to using indexOf if you like, but I really think you need to look at other parts of the code for the bottleneck rather than at this micro-optimisation. Split was a surprise for me - it's really, really slow compared to the other methods. I've added in a Guava split test; it's faster than String.split but slightly slower than StringTokenizer.


  • Split: 371ms
  • IndexOf: 48ms
  • StringTokenizer: 92ms
  • Guava Splitter.split(): 108ms
  • CsvMapper building a CSV doc and parsing into POJOs: 237ms (or 175 if you build the lines into one doc!)

The difference here is pretty negligible, even over millions of rows.
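Since indexOf came out fastest in the list above, here is what it looks like wired into the read loop from the question; a sketch only (the summing is placeholder processing, and the StringReader stands in for the real input):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class IndexOfReader {
    // Parses "X Y" lines using indexOf/substring instead of a tokenizer.
    static long sumPairs(BufferedReader reader) throws IOException {
        long total = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            int index = line.indexOf(' ');
            int x = Integer.parseInt(line.substring(0, index));
            int y = Integer.parseInt(line.substring(index + 1));
            total += x + y;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader r = new BufferedReader(new StringReader("12 34\n56 78\n"));
        System.out.println(sumPairs(r)); // prints 180
    }
}
```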

There's now a write-up of this on my blog: http://demeranville.com/battle-of-the-tokenizers-delimited-text-parser-performance/

The code I ran was:

import java.util.StringTokenizer;
import org.junit.Test;

public class TestSplitter {

private static final String line = "12 34";
private static final int RUNS = 1000000;

public final void testSplit() {
    long start = System.currentTimeMillis();
    for (int i=0;i<RUNS;i++){
        String[] st = line.split(" ");
        int x = Integer.parseInt(st[0]);
        int y = Integer.parseInt(st[1]);
    }
    System.out.println("Split: "+(System.currentTimeMillis() - start)+"ms");
}

public final void testIndexOf() {
    long start = System.currentTimeMillis();
    for (int i=0;i<RUNS;i++){
        int index = line.indexOf(' ');
        int x = Integer.parseInt(line.substring(0,index));
        int y = Integer.parseInt(line.substring(index+1));
    }       
    System.out.println("IndexOf: "+(System.currentTimeMillis() - start)+"ms");      
}

public final void testTokenizer() {
    long start = System.currentTimeMillis();
    for (int i=0;i<RUNS;i++){
        StringTokenizer st = new StringTokenizer(line, " ");
        int x = Integer.parseInt(st.nextToken());
        int y = Integer.parseInt(st.nextToken());
    }
    System.out.println("StringTokenizer: "+(System.currentTimeMillis() - start)+"ms");
}

@Test
public final void testAll() {
    this.testSplit();
    this.testIndexOf();
    this.testTokenizer();
    this.testSplit();
    this.testIndexOf();
    this.testTokenizer();
}

}

eta: here's the Guava code:

// requires com.google.common.base.Splitter and java.util.Iterator
public final void testGuavaSplit() {
    long start = System.currentTimeMillis();
    Splitter split = Splitter.on(" ");
    for (int i=0;i<RUNS;i++){
        Iterator<String> it = split.split(line).iterator();
        int x = Integer.parseInt(it.next());
        int y = Integer.parseInt(it.next());
    }
    System.out.println("GuavaSplit: "+(System.currentTimeMillis() - start)+"ms");
}

Update

I've added in a CsvMapper test too:

public static class CSV{
    public int x;
    public int y;
}

// requires com.fasterxml.jackson.dataformat.csv (CsvMapper, CsvSchema, ColumnType)
public final void testJacksonSplit() throws JsonProcessingException, IOException {
    CsvMapper mapper = new CsvMapper();
    CsvSchema schema = CsvSchema.builder().addColumn("x", ColumnType.NUMBER).addColumn("y", ColumnType.NUMBER).setColumnSeparator(' ').build();

    long start = System.currentTimeMillis();
    StringBuilder builder = new StringBuilder();
    for (int i = 0; i < RUNS; i++) {
        builder.append(line);
        builder.append('\n');
    }       
    String input = builder.toString();
    MappingIterator<CSV> it = mapper.reader(CSV.class).with(schema).readValues(input);
    while (it.hasNext()){
        CSV csv = it.next();
    }
    System.out.println("CsvMapperSplit: " + (System.currentTimeMillis() - start) + "ms");
}
