How to split a large text file into smaller chunks using Java multithreading

Question

I'm trying to develop a multithreaded Java program that splits a large text file into smaller text files. Each smaller file must contain a fixed, user-supplied number of lines. For example, if the input file has 100 lines and the requested number of lines per file is 10, the program should split the input into 10 files. I've already developed a single-threaded version of my program:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

public class TextFileSingleThreaded {

    public static void main(String[] args) {
        if (args.length != 2) {
            System.out.println("Invalid Input!");
            return; // stop here instead of failing later on args[0]
        }

        // First argument is the path of the file to split.
        File file = new File(args[0]);

        // Second argument is the number of lines per chunk:
        // every output file except possibly the last has numLinesPerChunk lines.
        int numLinesPerChunk = Integer.parseInt(args[1]);
        if (numLinesPerChunk <= 0) {
            // A non-positive value would create empty files forever.
            System.out.println("Invalid Input!");
            return;
        }

        long start = System.currentTimeMillis();

        // try-with-resources closes the reader even if an exception is thrown.
        try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
            String line = reader.readLine();
            for (int i = 1; line != null; i++) {
                // Each chunk gets its own writer, closed before the next chunk starts.
                try (PrintWriter writer = new PrintWriter(new FileWriter(args[0] + "_part" + i + ".txt"))) {
                    for (int j = 0; j < numLinesPerChunk && line != null; j++) {
                        writer.println(line);
                        line = reader.readLine();
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }

        long end = System.currentTimeMillis();

        System.out.println("Taken time[sec]:");
        System.out.println((end - start) / 1000);
    }

}

I want to write a multithreaded version of this program but I don't know how to read a file beginning from a specified line. Help me please. :(

Recommended answer

I want to write a multithreaded version of this program but I don't know how to read a file beginning from a specified line. Help me please. :(

I would not, as this implies, have each thread read from the beginning of the file and skip lines until it reaches its portion of the input. That is highly inefficient. As you imply, a reader has to read all of the prior lines if the file is to be divided into chunks by line count, because the position of a given line is unknown until every earlier line has been scanned. With several threads this means a lot of duplicate read I/O, which will result in a much slower application.

You could instead have 1 reader and N writers. The reader adds the lines to be written to some sort of BlockingQueue per writer. The problem is that chances are you won't get any concurrency: most likely only one writer will be working at a time while the rest wait for the reader to reach their part of the input file. Also, if the reader is faster than the writers (which is likely), you could easily run out of memory queueing up lines if the file to be divided is large. You could use a size-limited blocking queue, which means the reader may block waiting for the writers, but again, multiple writers will most likely not be running at the same time.
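
For illustration only, here is a minimal sketch of the one-reader, several-writers idea with a size-limited queue. It is a variation on what is described above, not the exact layout: instead of one queue per writer, a single shared ArrayBlockingQueue carries whole chunks, so any idle writer can pick up the next one. The class name QueueSplitter, the method split, the writerCount parameter and the UTF-8 encoding are all assumptions made for this example.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueueSplitter {

    /** One unit of work: the lines of a single output file. */
    private static final class Chunk {
        final int index;
        final List<String> lines;
        Chunk(int index, List<String> lines) { this.index = index; this.lines = lines; }
    }

    // Sentinel telling a writer thread to stop (compared by identity).
    private static final Chunk POISON = new Chunk(-1, new ArrayList<>());

    /** Splits input into files of linesPerChunk lines; returns the number of files. */
    public static int split(Path input, int linesPerChunk, int writerCount)
            throws IOException, InterruptedException {
        if (linesPerChunk < 1 || writerCount < 1) {
            throw new IllegalArgumentException("linesPerChunk and writerCount must be >= 1");
        }
        // Bounded queue: put() blocks when the writers fall behind, which caps memory use.
        BlockingQueue<Chunk> queue = new ArrayBlockingQueue<>(writerCount * 2);

        List<Thread> writers = new ArrayList<>();
        for (int w = 0; w < writerCount; w++) {
            Thread t = new Thread(() -> drain(queue, input));
            t.start();
            writers.add(t);
        }

        int chunks = 0;
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            List<String> lines = new ArrayList<>(linesPerChunk);
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
                if (lines.size() == linesPerChunk) {
                    queue.put(new Chunk(++chunks, lines));
                    lines = new ArrayList<>(linesPerChunk);
                }
            }
            if (!lines.isEmpty()) {
                queue.put(new Chunk(++chunks, lines)); // last, possibly shorter, chunk
            }
        } finally {
            // One sentinel per writer so that every thread terminates.
            for (int w = 0; w < writerCount; w++) {
                queue.put(POISON);
            }
            for (Thread t : writers) {
                t.join();
            }
        }
        return chunks;
    }

    /** Writer loop: takes chunks until the sentinel arrives. */
    private static void drain(BlockingQueue<Chunk> queue, Path input) {
        try {
            for (Chunk c = queue.take(); c != POISON; c = queue.take()) {
                Path out = input.resolveSibling(input.getFileName() + "_part" + c.index + ".txt");
                try (BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
                    for (String line : c.lines) {
                        writer.write(line);
                        writer.newLine();
                    }
                } catch (IOException e) {
                    // Keep consuming so the reader is never blocked forever on a full queue.
                    e.printStackTrace();
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

A call such as QueueSplitter.split(path, 10, 4) would produce the same file names as the single-threaded program. Because one reader still feeds everything, throughput is likely bounded by that single read pass, which is the limitation discussed above.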

As mentioned in the comments, the most efficient way of doing this is single-threaded because of these restrictions. If you are doing this as an exercise, then it sounds like you will need to read the file through once and note the start and end positions in the file for each of the output files. You can then fork the threads with those locations so they can re-read the file and write it into their separate output files in parallel without a lot of line buffering.
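
The two-pass approach just described could be sketched roughly as follows. This is one possible reading of it, not a definitive implementation: the first pass records byte offsets by counting '\n' bytes, and the second pass lets a thread pool copy each byte range with FileChannel.transferTo. The names OffsetSplitter, chunkOffsets, copyRange and the threads parameter are hypothetical, and the sketch assumes an encoding in which a line feed is always the single byte 0x0A (ASCII, ISO-8859-1 or UTF-8, but not UTF-16).

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class OffsetSplitter {

    /** Pass 1: byte offset where each chunk starts, plus the end of the last chunk. */
    static List<Long> chunkOffsets(Path input, int linesPerChunk) throws IOException {
        List<Long> offsets = new ArrayList<>();
        offsets.add(0L);
        long pos = 0;
        int linesInChunk = 0;
        try (InputStream in = new BufferedInputStream(Files.newInputStream(input))) {
            int b;
            while ((b = in.read()) != -1) {
                pos++;
                if (b == '\n' && ++linesInChunk == linesPerChunk) {
                    offsets.add(pos); // next chunk starts right after this line feed
                    linesInChunk = 0;
                }
            }
        }
        // Close the last chunk unless the file ended exactly on a chunk boundary.
        if (offsets.get(offsets.size() - 1).longValue() != pos) {
            offsets.add(pos);
        }
        return offsets;
    }

    /** Pass 2: copy the byte range [start, end) of input into out. */
    static void copyRange(Path input, Path out, long start, long end) throws IOException {
        try (FileChannel src = FileChannel.open(input, StandardOpenOption.READ);
             FileChannel dst = FileChannel.open(out, StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            long pos = start;
            while (pos < end) {
                // transferTo may move fewer bytes than requested, so loop.
                long n = src.transferTo(pos, end - pos, dst);
                if (n <= 0) {
                    throw new IOException("Unexpected end of input at byte " + pos);
                }
                pos += n;
            }
        }
    }

    /** Splits input into files of linesPerChunk lines; returns the number of files. */
    public static int split(Path input, int linesPerChunk, int threads)
            throws IOException, InterruptedException, ExecutionException {
        if (linesPerChunk < 1 || threads < 1) {
            throw new IllegalArgumentException("linesPerChunk and threads must be >= 1");
        }
        List<Long> offsets = chunkOffsets(input, linesPerChunk);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (int i = 0; i + 1 < offsets.size(); i++) {
                final long start = offsets.get(i);
                final long end = offsets.get(i + 1);
                final Path out = input.resolveSibling(
                        input.getFileName() + "_part" + (i + 1) + ".txt");
                pending.add(pool.submit(() -> {
                    copyRange(input, out, start, end);
                    return null;
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // rethrows any copy failure as ExecutionException
            }
        } finally {
            pool.shutdown();
        }
        return offsets.size() - 1;
    }
}
```

Each worker opens its own channel on the input, so no lines are buffered in memory and no state is shared between threads. The file is still read twice in total, so whether this beats the single-threaded version depends on the storage and should be measured rather than assumed.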
