How to slice a PySpark dataframe into two, row-wise


Problem description

I am working in Databricks.

I have a dataframe which contains 500 rows, and I would like to create two dataframes: one containing 100 rows and the other containing the remaining 400 rows.

+--------------------+----------+
|              userid| eventdate|
+--------------------+----------+
|00518b128fc9459d9...|2017-10-09|
|00976c0b7f2c4c2ca...|2017-12-16|
|00a60fb81aa74f35a...|2017-12-04|
|00f9f7234e2c4bf78...|2017-05-09|
|0146fe6ad7a243c3b...|2017-11-21|
|016567f169c145ddb...|2017-10-16|
|01ccd278777946cb8...|2017-07-05|

I have tried the below, but I receive an error:

df1 = df[:99]
df2 = df[100:499]


TypeError: unexpected item type: <type 'slice'>
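
As context for the error: a PySpark DataFrame's [] operator only accepts column names, Column expressions, or lists of these, so passing a Python slice raises the TypeError above. A minimal sketch, assuming the same df:

first_100 = df.limit(100)   # limit(100) keeps (at most) 100 rows, with no guaranteed order
userid_col = df['userid']   # [] indexing selects columns...
# df[100:499]               # ...but a slice is not a valid item, hence the TypeError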

Answer

Initially I misunderstood and thought you wanted to slice the columns. If you want to select a subset of rows, one method is to create an index column using monotonically_increasing_id(). From the docs:

The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.

You can use this ID to sort the dataframe and subset it using limit() to ensure you get exactly the rows you want.

For example:

import pyspark.sql.functions as f
import string

# create a dummy df with 500 rows and 2 columns
N = 500
numbers = [i%26 for i in range(N)]
letters = [string.ascii_uppercase[n] for n in numbers]

df = sqlCtx.createDataFrame(
    zip(numbers, letters),
    ('numbers', 'letters')
)

# add an index column
df = df.withColumn('index', f.monotonically_increasing_id())

# sort ascending and take first 100 rows for df1
df1 = df.sort('index').limit(100)

# sort descending and take 400 rows for df2
df2 = df.sort('index', ascending=False).limit(400)

Just to verify that this did what you wanted:

df1.count()
#100
df2.count()
#400

We can also verify that the index ranges don't overlap:

df1.select(f.min('index').alias('min'), f.max('index').alias('max')).show()
#+---+---+
#|min|max|
#+---+---+
#|  0| 99|
#+---+---+

df2.select(f.min('index').alias('min'), f.max('index').alias('max')).show()
#+---+----------+
#|min|       max|
#+---+----------+
#|100|8589934841|
#+---+----------+
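
If you would rather not rely on the descending sort for the larger split, an alternative sketch (reusing the index column created above) is to build df2 from everything that is not already in df1 via an anti join:

# alternative: df2 = all rows whose index is not in df1,
# which works even though the generated IDs are not consecutive
df2 = df.join(df1.select('index'), on='index', how='left_anti')
df2.count()
# 400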
