Is there a way to slice dataframe based on index in pyspark?


Problem description

In Python or R, there are ways to slice a DataFrame using an index.

For example, in pandas:

df.iloc[5:10,:]

Is there a similar way in pyspark to slice data based on the location of rows?

Answer

Short answer

If you already have an index column (suppose it is called 'id'), you can filter using pyspark.sql.Column.between:

from pyspark.sql.functions import col
df.where(col("id").between(5, 10))

If you don't already have an index column, you can add one yourself and then use the code above. Your data should have some ordering built in based on some other column (orderBy("someColumn")); a sketch of this is shown below.
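Here is a minimal sketch of that approach, assuming a column named "someColumn" supplies a meaningful order; the column names and toy data are placeholders, not part of the original question:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Toy data; "someColumn" stands in for whatever column defines your ordering.
df = spark.createDataFrame([(i, i * 10) for i in range(20)], ["someColumn", "value"])

# row_number() over an ordered window yields a consecutive 1-based index.
# Note: a window with no partitionBy pulls all rows into a single partition,
# which is fine for small data but expensive at scale.
w = Window.orderBy("someColumn")
indexed = df.withColumn("id", row_number().over(w))

# Slice "rows 5 through 10" by the generated index (between is inclusive).
indexed.where(col("id").between(5, 10)).show()

An alternative is pyspark.sql.functions.monotonically_increasing_id, which is cheaper to compute, but its values are only guaranteed to be unique and increasing, not consecutive, so it is less suited to positional slicing like this.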

Full explanation

No, it is not easily possible to slice a Spark DataFrame by index, unless the index is already present as a column.

Spark DataFrames are inherently unordered and do not support random access. (There is no concept of a built-in index as there is in pandas, because the rows are distributed across executors rather than stored in a fixed order.)

Now obviously it is possible to perform operations that do involve ordering (lead, lag, etc.), but these will be slower because they require Spark to shuffle data between the executors. (Shuffling data is typically one of the slowest components of a Spark job.) A sketch of such an ordering-dependent operation follows.
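For illustration, reusing the df and placeholder column names from the sketch above, a lag over an ordered window looks like this:

from pyspark.sql.functions import lag
from pyspark.sql.window import Window

# lag/lead need a window specification that defines an ordering,
# which forces Spark to sort (and typically shuffle) the data first.
w = Window.orderBy("someColumn")
df.withColumn("prev_value", lag("value", 1).over(w)).show()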

Related / further reading

  • PySpark DataFrames - way to enumerate without converting to Pandas?
  • PySpark - get row number for each row in a group
  • how to add Row id in pySpark dataframes
