How to slice a pyspark dataframe in two, row-wise


Question

I am working in Databricks.

I have a dataframe which contains 500 rows, and I would like to create two dataframes: one containing 100 rows and the other containing the remaining 400 rows.

+--------------------+----------+
|              userid| eventdate|
+--------------------+----------+
|00518b128fc9459d9...|2017-10-09|
|00976c0b7f2c4c2ca...|2017-12-16|
|00a60fb81aa74f35a...|2017-12-04|
|00f9f7234e2c4bf78...|2017-05-09|
|0146fe6ad7a243c3b...|2017-11-21|
|016567f169c145ddb...|2017-10-16|
|01ccd278777946cb8...|2017-07-05|

I have tried the below but I receive an error:

df1 = df[:99]
df2 = df[100:499]


TypeError: unexpected item type: <type 'slice'>
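The error occurs because a PySpark DataFrame's __getitem__ accepts a column name, a Column expression, or a list of column names, but not a Python slice. A quick sketch of the accepted forms, using the df from the question:

df['userid']                        # a single Column
df[['userid', 'eventdate']]         # a new dataframe with the selected columns
df[df['eventdate'] > '2017-10-01']  # a filtered dataframe

# df[:99]  -> TypeError: unexpected item type: <type 'slice'>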

Answer

Initially I misunderstood and thought you wanted to slice the columns. If you want to select a subset of rows, one method is to create an index column using monotonically_increasing_id(). From the docs:

The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.

You can use this ID to sort the dataframe and subset it using limit() to ensure you get exactly the rows you want.

For example:

import pyspark.sql.functions as f
import string

# create a dummy df with 500 rows and 2 columns
N = 500
numbers = [i%26 for i in range(N)]
letters = [string.ascii_uppercase[n] for n in numbers]

df = sqlCtx.createDataFrame(
    list(zip(numbers, letters)),  # wrap in list() so this also runs on Python 3
    ('numbers', 'letters')
)

# add an index column
df = df.withColumn('index', f.monotonically_increasing_id())

# sort ascending and take first 100 rows for df1
df1 = df.sort('index').limit(100)

# sort descending and take 400 rows for df2
df2 = df.sort('index', ascending=False).limit(400)

Just to verify that this did what you wanted:

df1.count()
#100
df2.count()
#400

We can also verify that the index ranges of the two dataframes don't overlap:

df1.select(f.min('index').alias('min'), f.max('index').alias('max')).show()
#+---+---+
#|min|max|
#+---+---+
#|  0| 99|
#+---+---+

df2.select(f.min('index').alias('min'), f.max('index').alias('max')).show()
#+---+----------+
#|min|       max|
#+---+----------+
#|100|8589934841|
#+---+----------+
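Note that the max index in df2 is 8589934841 rather than 499: monotonically_increasing_id() stores the partition ID in the upper 31 bits of the 64-bit result, so the IDs jump at partition boundaries. If you need consecutive 0..N-1 positions instead, one common alternative (not part of the original answer, just a sketch) is a row_number() window over the index column; be aware that a global window like this shuffles all rows into a single partition, so it only suits modest data sizes:

from pyspark.sql import Window
import pyspark.sql.functions as f

# assign consecutive 0-based positions in the order of the existing 'index' column
w = Window.orderBy('index')
df_num = df.withColumn('rn', f.row_number().over(w) - 1)

df1 = df_num.filter(f.col('rn') < 100)   # exactly rows 0-99
df2 = df_num.filter(f.col('rn') >= 100)  # the remaining 400 rows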
