Is there a way to slice dataframe based on index in pyspark?


Problem description

In Python or R, there are ways to slice a DataFrame using an index.

For example, in pandas:

df.iloc[5:10,:]

Is there a similar way in pyspark to slice data based on the location of rows?

Answer

Short answer

If you already have an index column (suppose it is called 'id'), you can filter using pyspark.sql.Column.between:

from pyspark.sql.functions import col
df.where(col("id").between(5, 10))

If you don't already have an index column, you can add one yourself and then use the code above. Your data should have some ordering built in based on some other column (orderBy("someColumn")); a sketch of this is shown below.
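Here is a minimal sketch of that approach, assuming a column named "someColumn" supplies a meaningful order; the column names and toy data are placeholders, not part of the original question:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Toy data; "someColumn" stands in for whatever column defines your ordering.
df = spark.createDataFrame([(i, i * 10) for i in range(20)], ["someColumn", "value"])

# row_number() over an ordered window yields a consecutive 1-based index.
# Note: a window with no partitionBy pulls all rows into a single partition,
# which is fine for small data but expensive at scale.
w = Window.orderBy("someColumn")
indexed = df.withColumn("id", row_number().over(w))

# Slice "rows 5 through 10" by the generated index (between is inclusive).
indexed.where(col("id").between(5, 10)).show()

An alternative is pyspark.sql.functions.monotonically_increasing_id, which is cheaper to compute, but its values are only guaranteed to be unique and increasing, not consecutive, so it is less suited to positional slicing like this.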

Full explanation

No, it is not easily possible to slice a Spark DataFrame by index, unless the index is already present as a column.

Spark DataFrames are inherently unordered and do not support random access. (There is no concept of a built-in index as there is in pandas, because the rows are distributed across executors rather than stored in a fixed order.)

Now obviously it is possible to perform operations that do involve ordering (lead, lag, etc.), but these will be slower because they require Spark to shuffle data between the executors. (Shuffling data is typically one of the slowest components of a Spark job.) A sketch of such an ordering-dependent operation follows.
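For illustration, reusing the df and placeholder column names from the sketch above, a lag over an ordered window looks like this:

from pyspark.sql.functions import lag
from pyspark.sql.window import Window

# lag/lead need a window specification that defines an ordering,
# which forces Spark to sort (and typically shuffle) the data first.
w = Window.orderBy("someColumn")
df.withColumn("prev_value", lag("value", 1).over(w)).show()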

Related / further reading

  • PySpark DataFrames - way to enumerate without converting to Pandas?
  • PySpark - get row number for each row in a group
  • how to add Row id in pySpark dataframes
