How to read specific lines from sparkContext


Problem description

Hi, I am trying to read specific lines from a text file using Spark.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> lines = sc.textFile("data.txt");
String firstLine = lines.first();

I can use the .first() command to fetch the first line of the data.txt document. How can I access the Nth line of the document? I need a Java solution.

Recommended answer

Apache Spark RDDs are not meant to be used for lookups. The most "efficient" way to get the nth (zero-based) line would be lines.take(n + 1).get(n). Every time you do this, it will read the first n + 1 lines of the file. You could run lines.cache() to avoid re-reading, but it will still move those first lines over the network in a very inefficient dance.
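
Below is a minimal, self-contained sketch of that take-based lookup. The local[*] master, the data.txt path, and the zero-based index n = 4 are placeholders for illustration, not taken from the question:

import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class NthLineTake {
    public static void main(String[] args) {
        // Local master and file path are placeholders for this sketch.
        SparkConf conf = new SparkConf().setAppName("nth-line").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("data.txt");
        lines.cache(); // keeps the RDD around if you look up several lines

        int n = 4; // zero-based index of the wanted line (assumed for illustration)
        // take(n + 1) ships the first n + 1 lines to the driver; get(n) picks the last of them
        String nthLine = lines.take(n + 1).get(n);
        System.out.println(nthLine);

        sc.stop();
    }
}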

If the data can fit on one machine, just collect it all once, and access it locally: List<String> local = lines.collect(); local.get(n);.
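
If the file really does fit in driver memory, that collect-once pattern can be wrapped in a small helper. The class name LocalLookup below is made up for illustration; it assumes the lines RDD from the question and a zero-based index:

import java.util.List;

import org.apache.spark.api.java.JavaRDD;

public class LocalLookup {
    private final List<String> local;

    // Only safe when the whole file fits in driver memory.
    public LocalLookup(JavaRDD<String> lines) {
        this.local = lines.collect(); // one Spark job pulls every line to the driver
    }

    public String nthLine(int n) {
        return local.get(n); // zero-based, plain in-memory lookup, no further Spark jobs
    }
}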

If the data does not fit on one machine, you need a distributed system which supports efficient lookups. Popular examples are HBase and Cassandra.

It is also possible that your problem can be solved efficiently with Spark, but not via lookups. If you explain the larger problem in a separate question, you may get a solution like that. (Lookups are very common in single-machine applications, but distributed algorithms have to think differently.)
