Percentage count per group and pivot with PySpark

Views: 125 | Published: 2020/4/25 6:52:00 | Tags: sql, apache-spark, pyspark, jupyter-notebook

Question

I have a dataframe with the columns from and to. Both hold country codes: from is the country of origin and to is the destination country.

+----+---+
|from| to|
+----+---+
|  TR| tr|
|  TR| tr|
|  TR| tr|
|  TR| gr|
|  ES| tr|
|  GR| tr|
|  CZ| it|
|  LU| it|
|  AR| it|
|  DE| it|
|  IT| it|
|  IT| it|
|  US| it|
|  GR| fr|
+----+---+

Is there a way to get a dataframe that shows, for each country of origin, the percentage of each destination country, with one column per destination country code?

The percentage must be computed out of the total number of destinations for the same country of origin (that is, per row).

For example:

+----+---+----+---+----+
|from| tr|  it| fr|  gr|
+----+---+----+---+----+
|  TR|0.6|0.12|0.2|0.09|
|  IT|0.3| 0.3|0.3| 0.8|
|  US|0.1|0.34|0.3| 0.2|
+----+---+----+---+----+

Recommended answer

You can pivot with count and then adjust the result. First, some imports:

from pyspark.sql.functions import col, lit, coalesce
from itertools import chain

Find the levels (the distinct destination codes):

levels = [x for x in chain(*df.select("to").distinct().collect())]

Pivot:

pivoted = df.groupBy("from").pivot("to", levels).count()

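Compute the row count expression (this relies on Python's built-in sum, not pyspark.sql.functions.sum):

# coalesce turns the nulls that pivot yields for missing pairs into 0 before summing
row_count = sum(coalesce(col(x), lit(0)) for x in levels)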

Create the list of adjusted columns:

adjusted = [(col(c) / row_count).alias(c) for c in levels]

And select:

pivoted.select(col("from"), *adjusted)
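
As a sanity check on what the steps above should produce, the same per-row shares can be worked out in plain Python for the sample rows in the question. This is only an illustrative sketch, not part of the original answer: the names rows, counts and shares are made up for the example, and it uses the standard library rather than Spark.

```python
from collections import Counter, defaultdict

# Sample (from, to) pairs copied from the question's table.
rows = [
    ("TR", "tr"), ("TR", "tr"), ("TR", "tr"), ("TR", "gr"),
    ("ES", "tr"), ("GR", "tr"), ("CZ", "it"), ("LU", "it"),
    ("AR", "it"), ("DE", "it"), ("IT", "it"), ("IT", "it"),
    ("US", "it"), ("GR", "fr"),
]

# Distinct destination codes, playing the role of `levels`.
levels = sorted({dest for _, dest in rows})

# Count destinations per origin, mirroring groupBy("from").pivot("to").count().
counts = defaultdict(Counter)
for origin, dest in rows:
    counts[origin][dest] += 1

# Divide each count by its row total. Pairs that never occur become 0.0 here.
shares = {
    origin: {dest: c[dest] / sum(c.values()) for dest in levels}
    for origin, c in counts.items()
}

for origin in sorted(shares):
    print(origin, shares[origin])
```

For the sample data this gives TR a share of 0.75 for tr and 0.25 for gr, and every row sums to 1. One difference to keep in mind: in the PySpark result, origin/destination pairs that never occur come out as null rather than 0, since pivot leaves those cells empty. If zeros are preferred, wrapping the numerator in coalesce(col(c), lit(0)) is one possible way to get them.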
