Get a specific row from a Spark DataFrame
Question
Is there any alternative to df[100, c("column")] for Scala Spark DataFrames? I want to select a specific row from a column of a Spark DataFrame, for example the 100th row, as in the R expression above.
Answer
First, you must understand that DataFrames are distributed. That means you can't access them in the typical procedural, index-based way; you have to analyze the data first. Although you are asking about Scala, I suggest you read the PySpark documentation, because it has more examples than any of the other language bindings' docs.
Continuing with the explanation, I will use some methods of the RDD API, because every DataFrame exposes an RDD as an attribute. See the example below, and notice how the second record is extracted.
df = sqlContext.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["letter", "name"])

myIndex = 1
values = (df.rdd.zipWithIndex()                     # pair each Row with its index: (Row, i)
          .filter(lambda pair: pair[1] == myIndex)  # keep only the row at myIndex
          .map(lambda pair: tuple(pair[0]))         # drop the index, keep the row values
          .collect())

print(values[0])
# ('b', 2)
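To see why this pipeline picks out exactly one row, the same zipWithIndex/filter/map logic can be sketched in plain Python over a local list (a stand-in for the distributed RDD; the names `rows` and `my_index` are illustrative, not part of the Spark API):

```python
# Plain-Python sketch of the RDD pipeline above.
# zipWithIndex pairs each element with its position, filter keeps only the
# wanted position, and map strips the index off again.
rows = [("a", 1), ("b", 2), ("c", 3)]

my_index = 1
pairs = list(zip(rows, range(len(rows))))      # like rdd.zipWithIndex()
kept = [p for p in pairs if p[1] == my_index]  # like .filter(...)
values = [p[0] for p in kept]                  # like .map(...).collect()

print(values[0])
# ('b', 2)
```

The difference in Spark is that `zip`, the filter, and the map all run in parallel across partitions, and only `collect()` brings the surviving row back to the driver.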
Hopefully, someone can offer another solution with fewer steps.