Split a very large text file by max rows


Problem description


I want to split a huge file containing strings into a set of new (smaller) files and tried to use nio2.

I do not want to load the whole file into memory, so I tried it with BufferedReader.

The smaller text files should be limited by the number of text rows.

The solution works; however, I want to ask if someone knows a solution with better performance using Java 8 (maybe lambdas with the stream() API?) and nio2:

public void splitTextFiles(Path bigFile, int maxRows) throws IOException{

        int i = 1;
        try(BufferedReader reader = Files.newBufferedReader(bigFile)){
            String line = null;
            int lineNum = 1;

            Path splitFile = Paths.get(i + "split.txt");
            BufferedWriter writer = Files.newBufferedWriter(splitFile, StandardOpenOption.CREATE);

            while ((line = reader.readLine()) != null) {

                if(lineNum > maxRows){
                    writer.close();
                    lineNum = 1;
                    i++;
                    splitFile = Paths.get(i + "split.txt");
                    writer = Files.newBufferedWriter(splitFile, StandardOpenOption.CREATE);
                }

                writer.append(line);
                writer.newLine();
                lineNum++;
            }

            writer.close();
        }
}

Solution

Beware of the difference between the direct use of InputStreamReader/OutputStreamWriter and their subclasses and the Reader/Writer factory methods of Files. While in the former case the system’s default encoding is used when no explicit charset is given, the latter always default to UTF-8. So I strongly recommend always specifying the desired charset, even if it is Charset.defaultCharset() or StandardCharsets.UTF_8, to document your intention and avoid surprises if you switch between the various ways to create a Reader or Writer.
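
For illustration, here is a minimal sketch of that difference (the file name example.txt and the class name CharsetDemo are made up for this example):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CharsetDemo {
    public static void main(String[] args) throws IOException {
        Path file = Paths.get("example.txt"); // hypothetical input file

        // Files factory method: defaults to UTF-8 when no charset is given
        try(BufferedReader utf8Reader = Files.newBufferedReader(file)) {
            System.out.println(utf8Reader.readLine());
        }

        // Stream wrapper: falls back to the platform default encoding
        try(BufferedReader defaultReader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file.toFile())))) {
            System.out.println(defaultReader.readLine());
        }

        // Recommended: state the charset explicitly, whichever style you use
        try(BufferedReader explicitReader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            System.out.println(explicitReader.readLine());
        }
    }
}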


If you want to split at line boundaries, there is no way around looking into the file’s contents. So you can’t optimize it the way you could when merging files.

If you are willing to sacrifice portability, you could try some optimizations. If you know that the charset encoding unambiguously maps '\n' to (byte)'\n', as is the case for most single-byte encodings as well as for UTF-8, you can scan for line breaks at the byte level to get the file positions for the split and avoid any data transfer from your application to the I/O system.

// note: READ, CREATE_NEW and WRITE below are static imports from java.nio.file.StandardOpenOption
public void splitTextFiles(Path bigFile, int maxRows) throws IOException {
    MappedByteBuffer bb;
    try(FileChannel in = FileChannel.open(bigFile, READ)) {
        bb=in.map(FileChannel.MapMode.READ_ONLY, 0, in.size());
    }
    for(int start=0, pos=0, end=bb.remaining(), i=1, lineNum=1; pos<end; lineNum++) {
        while(pos<end && bb.get(pos++)!='\n');
        if(lineNum < maxRows && pos<end) continue;
        Path splitFile = Paths.get(i++ + "split.txt");
        // if you want to overwrite existing files use CREATE, TRUNCATE_EXISTING
        try(FileChannel out = FileChannel.open(splitFile, CREATE_NEW, WRITE)) {
            bb.position(start).limit(pos);
            while(bb.hasRemaining()) out.write(bb);
            bb.clear();
            start=pos;
            lineNum = 0;
        }
    }
}

The drawbacks are that it doesn’t work with encodings like UTF-16 or EBCDIC and that, unlike BufferedReader.readLine(), it won’t support a lone '\r' as a line terminator, as used in the old MacOS 9.

Further, it only supports files smaller than 2GB; the limit is likely even smaller on 32-bit JVMs due to the limited virtual address space. For files larger than the limit, it would be necessary to iterate over chunks of the source file and map them one after another.
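
A rough sketch of that chunk-wise mapping, purely as an illustration (the ChunkedScan class, the 1 GiB chunk size, and the assumption that no single line is longer than a chunk are mine, not part of the original answer); it merely counts lines, standing in for the actual split logic:

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import static java.nio.file.StandardOpenOption.READ;

public class ChunkedScan {

    // Hypothetical chunk size (1 GiB); it must stay below the 2GB mapping limit and is
    // assumed to be larger than the longest line in the file.
    private static final long CHUNK_SIZE = 1L << 30;

    // Counts '\n' bytes as a stand-in for the split logic, mapping one chunk at a time.
    public static long countLines(Path bigFile) throws IOException {
        long lines = 0;
        try(FileChannel in = FileChannel.open(bigFile, READ)) {
            long fileSize = in.size();
            long chunkStart = 0;
            while(chunkStart < fileSize) {
                long chunkLen = Math.min(CHUNK_SIZE, fileSize - chunkStart);
                MappedByteBuffer bb = in.map(FileChannel.MapMode.READ_ONLY, chunkStart, chunkLen);
                int lastNl = -1;
                for(int pos = 0; pos < chunkLen; pos++) {
                    if(bb.get(pos) == '\n') { lines++; lastNl = pos; }
                }
                // Start the next mapping right after the last complete line, so no line is
                // split across two mappings; on the final chunk (or if, contrary to the
                // assumption above, a chunk contains no newline) just move past it entirely.
                boolean lastChunk = chunkStart + chunkLen >= fileSize;
                chunkStart += (lastNl >= 0 && !lastChunk) ? lastNl + 1 : chunkLen;
            }
        }
        return lines;
    }
}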

These issues could be fixed, but they would raise the complexity of this approach. Given that the speed improvement is only about 15% on my machine (I didn’t expect much more, as the I/O dominates here) and would be even smaller as the complexity rises, I don’t think it’s worth it.


The bottom line is that for this task the Reader/Writer approach is sufficient, but you should pay attention to the Charset used for the operation.
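
As one way of applying that advice (not code from the original answer), here is the method from the question with the charset passed explicitly and each part file’s writer managed by try-with-resources:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class Splitter {

    // Same splitting strategy as in the question, but with an explicit charset and
    // automatic closing of each part file's writer.
    public static void splitTextFiles(Path bigFile, int maxRows) throws IOException {
        try(BufferedReader reader = Files.newBufferedReader(bigFile, StandardCharsets.UTF_8)) {
            String line = reader.readLine();
            for(int i = 1; line != null; i++) {
                Path splitFile = Paths.get(i + "split.txt");
                try(BufferedWriter writer = Files.newBufferedWriter(
                        splitFile, StandardCharsets.UTF_8, StandardOpenOption.CREATE)) {
                    // write up to maxRows lines into the current part file
                    for(int lineNum = 0; line != null && lineNum < maxRows; lineNum++) {
                        writer.append(line);
                        writer.newLine();
                        line = reader.readLine();
                    }
                }
            }
        }
    }
}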
