How to read the first line of a Hadoop (HDFS) file efficiently using Java?
Question
I have a large CSV file on my Hadoop cluster. The first line of the file is a 'header' line, which consists of field names. I want to do an operation on this header line, but I do not want to process the whole file. Also, my program is written in Java and using Spark.
What is an efficient way to read just the first line of a large CSV file on a Hadoop cluster?
Answer
You can access HDFS files with the FileSystem class (see the Hadoop API docs at /docs/r1.1.1/api/org/apache/hadoop/fs/FileSystem.html) and friends:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
// FileSystem.get picks the right implementation (here DistributedFileSystem) for the URI
FileSystem fileSystem = FileSystem.get(new URI("hdfs://namenode-host:54310"), conf);

// Open the file and read only its first line; try-with-resources closes the stream
try (FSDataInputStream input = fileSystem.open(new Path("/path/to/file.csv"));
     BufferedReader reader = new BufferedReader(new InputStreamReader(input))) {
    System.out.println(reader.readLine());
}
This code doesn't use MapReduce and will run at a reasonable speed, since only the first line of the file is actually read.
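Since FSDataInputStream is just a java.io.InputStream, the header-reading logic can be factored into a small helper and tested locally on an in-memory stream before pointing it at the cluster. A minimal sketch (the class and method names are illustrative, and the naive comma split does not handle quoted CSV fields):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class HeaderReader {
    // Read only the first line of any stream and split it into field names.
    // On a cluster, pass the stream returned by fileSystem.open(path).
    static String[] readHeaderFields(InputStream in) throws IOException {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        String header = reader.readLine();           // stops at the first newline
        return header == null ? new String[0] : header.split(",");
    }

    public static void main(String[] args) throws IOException {
        // Demonstrated on an in-memory CSV instead of HDFS
        InputStream csv = new ByteArrayInputStream(
                "id,name,price\n1,apple,0.5\n2,pear,0.6\n"
                        .getBytes(StandardCharsets.UTF_8));
        String[] fields = readHeaderFields(csv);
        System.out.println(String.join("|", fields)); // prints "id|name|price"
    }
}
```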