Is there a way to slice a DataFrame based on index in PySpark?


Question

In Python or R, there are ways to slice a DataFrame using an index.

For example, in pandas:

df.iloc[5:10,:]

Is there a similar way in PySpark to slice data based on the position of rows?

Recommended answer

Short answer

If you already have an index column (suppose it is called 'id'), you can filter on it using pyspark.sql.Column.between, which is inclusive at both ends:

from pyspark.sql.functions import col
df.where(col("id").between(5, 10))

If you don't already have an index column, you can add one yourself and then use the code above. The index needs a well-defined ordering, so derive it from one or more existing columns (for example, orderBy("someColumn")).

Full explanation

No, it is not easily possible to slice a Spark DataFrame by index, unless the index is already present as a column.

Spark DataFrames are inherently unordered and do not support random access. (There is no concept of a built-in index as there is in pandas.) Each row is treated as an independent collection of structured data, and that is what allows for distributed parallel processing. Thus, any executor can take any chunk of the data and process it without regard for the order of the rows.

It is of course possible to perform operations that depend on ordering (lead, lag, etc.), but these are slower because they require Spark to shuffle data between the executors. (Shuffling data is typically one of the slowest parts of a Spark job.)

Related / further reading
