How to concatenate the strings of one DataFrame column into another column that holds the cumulative value of the original column

Published: 2020/9/4 4:17:19. Tags: python, apache-spark, dataframe, pyspark

Problem description

I have a DataFrame with the following data:

+---------------+--------------+----------+------------+----------+
|           name|      DateTime|       Seq|sessionCount|row_number|
+---------------+--------------+----------+------------+----------+
|            abc| 1521572913344|        17|           5|         1|
|            xyz| 1521572916109|        17|           5|         2|
|           rafa| 1521572916118|        17|           5|         3|
|             {}| 1521572916129|        17|           5|         4|
|     experience| 1521572917816|        17|           5|         5|
+---------------+--------------+----------+------------+----------+

The column 'name' is of type string. I want a new column, "effective_name", that contains the cumulative concatenation of "name", as shown below:

+---------------+--------------+----------+------------+----------+-------------------------+
|name           |DateTime      |Seq       |sessionCount|row_number|effective_name           |
+---------------+--------------+----------+------------+----------+-------------------------+
|abc            |1521572913344 |17        |5           |1         |abc                      |
|xyz            |1521572916109 |17        |5           |2         |abcxyz                   |
|rafa           |1521572916118 |17        |5           |3         |abcxyzrafa               |
|{}             |1521572916129 |17        |5           |4         |abcxyzrafa{}             |
|experience     |1521572917816 |17        |5           |5         |abcxyzrafa{}experience   |
+---------------+--------------+----------+------------+----------+-------------------------+

Each value in the new column is the concatenation of all "name" values up to and including the current row.

Recommended answer

You can achieve this by combining a pyspark.sql.Window that orders by DateTime with pyspark.sql.functions.collect_list and pyspark.sql.functions.concat_ws:

import pyspark.sql.functions as f
from pyspark.sql import Window

w = Window.orderBy("DateTime")  # define Window for ordering

df.drop("Seq", "sessionCount", "row_number").select(
    "*",
    f.concat_ws(
        "",
        f.collect_list(f.col("name")).over(w)
    ).alias("effective_name")
).show(truncate=False)
#+---------------+--------------+-------------------------+
#|name           |      DateTime|effective_name           |
#+---------------+--------------+-------------------------+
#|abc            |1521572913344 |abc                      |
#|xyz            |1521572916109 |abcxyz                   |
#|rafa           |1521572916118 |abcxyzrafa               |
#|{}             |1521572916129 |abcxyzrafa{}             |
#|experience     |1521572917816 |abcxyzrafa{}experience   |
#+---------------+--------------+-------------------------+

I dropped "Seq", "sessionCount", and "row_number" only to make the output easier to read.
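One caveat worth checking on your own data: a window with orderBy and no explicit frame uses Spark's default frame, which is a RANGE frame from the start of the partition to the current row. Rows that share the same DateTime are treated as peers, so they all receive the same concatenated value. The plain-Python sketch below only models that behaviour (it does not call Spark); the function name running_concat and the tied timestamp are made up for illustration.

```python
def running_concat(rows, frame="range"):
    """Model a cumulative concatenation over rows of (date_time, name).

    frame="range" mimics the default frame of an ordered window: every row
    whose date_time is <= the current row's date_time is included, so tied
    rows see each other.
    frame="rows" mimics a row-based frame: only rows up to and including the
    current position are included.
    """
    ordered = sorted(rows, key=lambda r: r[0])  # sorted() is stable
    result = []
    for i, (date_time, name) in enumerate(ordered):
        if frame == "rows":
            included = ordered[: i + 1]
        else:
            included = [r for r in ordered if r[0] <= date_time]
        result.append((date_time, name, "".join(n for _, n in included)))
    return result


# Hypothetical data: "rafa" is given the same timestamp as "xyz" to force a tie.
rows = [
    (1521572913344, "abc"),
    (1521572916109, "xyz"),
    (1521572916109, "rafa"),
]

for frame in ("range", "rows"):
    print(frame)
    for row in running_concat(rows, frame):
        print("  ", row)
```

If ties are possible in your data, one option in PySpark is to make the frame row-based with Window.orderBy("DateTime").rowsBetween(Window.unboundedPreceding, Window.currentRow). Note that the relative order of tied rows is then not guaranteed unless you add a tie-breaking column to the orderBy.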

If you need to do this per group, add a partitionBy to the Window. For example, to group by Seq, you can do the following:

w = Window.partitionBy("Seq").orderBy("DateTime")

df.drop("sessionCount", "row_number").select(
    "*",
    f.concat_ws(
        "",
        f.collect_list(f.col("name")).over(w)
    ).alias("effective_name")
).show(truncate=False)
#+---------------+--------------+----------+-------------------------+
#|name           |      DateTime|Seq       |effective_name           |
#+---------------+--------------+----------+-------------------------+
#|abc            |1521572913344 |17        |abc                      |
#|xyz            |1521572916109 |17        |abcxyz                   |
#|rafa           |1521572916118 |17        |abcxyzrafa               |
#|{}             |1521572916129 |17        |abcxyzrafa{}             |
#|experience     |1521572917816 |17        |abcxyzrafa{}experience   |
#+---------------+--------------+----------+-------------------------+

If you prefer to use withColumn, the select call with the partitioned window is equivalent to:

df.drop("sessionCount", "row_number").withColumn(
    "effective_name",
    f.concat_ws(
        "",
        f.collect_list(f.col("name")).over(w)
    )
).show(truncate=False)
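The sample data has a single Seq value, so the effect of partitionBy is not visible in the output. As a rough illustration, here is a plain-Python model of the partitioned window (it does not call Spark). The function name cumulative_names and the second Seq value 18 are assumptions added for the example; they are not part of the original data.

```python
from collections import defaultdict


def cumulative_names(rows):
    """Model partitionBy("Seq").orderBy("DateTime") with a running concat.

    rows is a list of (seq, date_time, name) tuples. Returns a list of
    (seq, date_time, name, effective_name) tuples, grouped by seq and
    ordered by date_time inside each group.
    """
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[0]].append(row)

    result = []
    for seq in sorted(partitions):
        running = ""  # the concatenation restarts for every partition
        for _, date_time, name in sorted(partitions[seq], key=lambda r: r[1]):
            running += name
            result.append((seq, date_time, name, running))
    return result


# Hypothetical variation of the sample data: two rows are moved to Seq 18.
rows = [
    (17, 1521572913344, "abc"),
    (17, 1521572916109, "xyz"),
    (18, 1521572916118, "rafa"),
    (18, 1521572916129, "{}"),
    (17, 1521572917816, "experience"),
]

for row in cumulative_names(rows):
    print(row)
```

The running value restarts for each Seq, which is what partitionBy does to the window. The model sorts its output by Seq for readability; Spark itself does not guarantee the order of the returned rows unless you sort the result.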

Explanation

You want to apply a function over multiple rows, which is called an aggregation. For a windowed aggregation, you need to define which rows to aggregate over and in what order, and a Window does both. In this case, w = Window.partitionBy("Seq").orderBy("DateTime") partitions the data by Seq and sorts each partition by DateTime.

We first apply the aggregate function collect_list("name") over the window. Because the window is ordered and no frame is specified, Spark uses its default frame, which runs from the start of the partition to the current row. For each row, collect_list therefore gathers the name values up to and including that row and puts them in a list. The order of insertion is defined by the Window's order.

For example, the intermediate output of this step would be:

df.select(
    f.collect_list("name").over(w).alias("collected")
).show()
#+--------------------------------+
#|collected                       |
#+--------------------------------+
#|[abc]                           |
#|[abc, xyz]                      |
#|[abc, xyz, rafa]                |
#|[abc, xyz, rafa, {}]            |
#|[abc, xyz, rafa, {}, experience]|
#+--------------------------------+

Now that each list holds the appropriate values, we can concatenate them using an empty string as the separator.

df.select(
    f.concat_ws(
        "",
        f.collect_list("name").over(w)
    ).alias("concatenated")
).show()
#+-----------------------+
#|concatenated           |
#+-----------------------+
#|abc                    |
#|abcxyz                 |
#|abcxyzrafa             |
#|abcxyzrafa{}           |
#|abcxyzrafa{}experience |
#+-----------------------+
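To make the two steps concrete without a Spark session, the following plain-Python sketch models them on the sample names. It is only an analogy for what collect_list and concat_ws do over the ordered window; the helper names running_lists and join_lists are made up for this example.

```python
def running_lists(values):
    """Model collect_list over an ordered window: one growing list per row."""
    collected = []
    running = []
    for value in values:
        running = running + [value]  # new list, so earlier rows stay unchanged
        collected.append(running)
    return collected


def join_lists(lists, separator=""):
    """Model concat_ws(separator, ...): join each list into a single string."""
    return [separator.join(items) for items in lists]


# The "name" values from the question, already in DateTime order.
names = ["abc", "xyz", "rafa", "{}", "experience"]

collected = running_lists(names)
effective_names = join_lists(collected)

for name, items, effective in zip(names, collected, effective_names):
    print(name, items, effective)
```

Keeping the two steps separate mirrors the Spark expression: the window decides which values each row sees, and the join only flattens them into a string.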
