Show distinct column values in pyspark dataframe

Published: 2021/11/14 21:38:41. Tags: python, pyspark, apache-spark-sql

With a pyspark dataframe, how do you do the equivalent of Pandas df['col'].unique()?

I want to list out all the unique values in a pyspark dataframe column.

Not the SQL way (registerTempTable, then a SQL query for distinct values).

Also, I don't need groupby->countDistinct; instead, I want to see the distinct VALUES in that column.

Let's assume we're working with the following representation of data (two columns, k and v, where k contains three entries, two of them unique):

+---+---+
|  k|  v|
+---+---+
|foo|  1|
|bar|  2|
|foo|  3|
+---+---+

With a Pandas dataframe:

import pandas as pd
p_df = pd.DataFrame([("foo", 1), ("bar", 2), ("foo", 3)], columns=("k", "v"))
p_df['k'].unique()

This returns an ndarray, i.e. array(['foo', 'bar'], dtype=object)

You asked for a "pyspark dataframe alternative for pandas df['col'].unique()". Now, given the following Spark dataframe:

s_df = sqlContext.createDataFrame([("foo", 1), ("bar", 2), ("foo", 3)], ('k', 'v'))

If you want the same result from Spark, i.e. an ndarray, use toPandas():

s_df.toPandas()['k'].unique()

Alternatively, if you don't need an ndarray specifically and just want a list of the unique values of column k:

s_df.select('k').distinct().rdd.map(lambda r: r[0]).collect()

Finally, you can also use a list comprehension as follows:

[i.k for i in s_df.select('k').distinct().collect()]
