Why does my PySpark regular expression not give more than the first row?


Problem description

Taking inspiration from this answer: https://stackoverflow.com/a/61444594/4367851, I have been able to split my .txt file into columns in a Spark DataFrame. However, it only gives me the first game, even though the sample .txt file contains many more.

My code:

from pyspark.sql.functions import col  # needed for the col() calls below

basefile = spark.sparkContext.wholeTextFiles("example copy 2.txt").toDF().\
    selectExpr("""split(replace(regexp_replace(_2, '\\\\n', ','), ""),",") as new""").\
    withColumn("Event", col("new")[0]).\
    withColumn("White", col("new")[2]).\
    withColumn("Black", col("new")[3]).\
    withColumn("Result", col("new")[4]).\
    withColumn("UTCDate", col("new")[5]).\
    withColumn("UTCTime", col("new")[6]).\
    withColumn("WhiteElo", col("new")[7]).\
    withColumn("BlackElo", col("new")[8]).\
    withColumn("WhiteRatingDiff", col("new")[9]).\
    withColumn("BlackRatingDiff", col("new")[10]).\
    withColumn("ECO", col("new")[11]).\
    withColumn("Opening", col("new")[12]).\
    withColumn("TimeControl", col("new")[13]).\
    withColumn("Termination", col("new")[14]).\
    drop("new")


basefile.show()

Output:

+--------------------+---------------+-----------------+--------------+--------------------+--------------------+-----------------+-----------------+--------------------+--------------------+-----------+--------------------+--------------------+--------------------+
|               Event|          White|            Black|        Result|             UTCDate|             UTCTime|         WhiteElo|         BlackElo|     WhiteRatingDiff|     BlackRatingDiff|        ECO|             Opening|         TimeControl|         Termination|
+--------------------+---------------+-----------------+--------------+--------------------+--------------------+-----------------+-----------------+--------------------+--------------------+-----------+--------------------+--------------------+--------------------+
|[Event "Rated Cla...|[White "BFG9k"]|[Black "mamalak"]|[Result "1-0"]|[UTCDate "2012.12...|[UTCTime "23:01:03"]|[WhiteElo "1639"]|[BlackElo "1403"]|[WhiteRatingDiff ...|[BlackRatingDiff ...|[ECO "C00"]|[Opening "French ...|[TimeControl "600...|[Termination "Nor...|
+--------------------+---------------+-----------------+--------------+--------------------+--------------------+-----------------+-----------------+--------------------+--------------------+-----------+--------------------+--------------------+--------------------+

Input file:

[Event "Rated Classical game"]
[Site "https://lichess.org/j1dkb5dw"]
[White "BFG9k"]
[Black "mamalak"]
[Result "1-0"]
[UTCDate "2012.12.31"]
[UTCTime "23:01:03"]
[WhiteElo "1639"]
[BlackElo "1403"]
[WhiteRatingDiff "+5"]
[BlackRatingDiff "-8"]
[ECO "C00"]
[Opening "French Defense: Normal Variation"]
[TimeControl "600+8"]
[Termination "Normal"]

1. e4 e6 2. d4 b6 3. a3 Bb7 4. Nc3 Nh6 5. Bxh6 gxh6 6. Be2 Qg5 7. Bg4 h5 8. Nf3 Qg6 9. Nh4 Qg5 10. Bxh5 Qxh4 11. Qf3 Kd8 12. Qxf7 Nc6 13. Qe8# 1-0

[Event "Rated Classical game"]
.
.
.

Each game starts with [Event, so I feel like it should be doable since the file has a repeating structure; alas, I can't get it to work.

Bonus:

  1. I don't actually need the move list, so it can be dropped if that's easier.
  2. I only want the contents inside the quotation marks for each row once it is in the Spark DataFrame.

Many thanks.

Recommended answer

wholeTextFiles reads each file into a single record. If you read only one file, the result will be an RDD with only one row, containing the whole text file. The regexp logic in the question returns only one result per row, and this will be the first entry in the file.
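As a quick illustration (a minimal sketch; the file name games.pgn is an assumption), wholeTextFiles produces one record per file, not per line:

rdd = spark.sparkContext.wholeTextFiles("games.pgn")
print(rdd.count())           # prints 1: the whole file is a single record
path, content = rdd.first()  # content holds the complete text of the file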

Probably the best solution would be to split the file at the OS level into one file per game (for example, here) so that Spark can read the games in parallel; a minimal sketch of such a splitter is shown below. But if a single file is not too big, splitting the games can also be done within PySpark, as in the steps that follow.
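A hedged sketch of the OS-level approach (the file and directory names are assumptions, not from the original post): start a new output file at every line that begins with [Event .

import os

os.makedirs("games_split", exist_ok=True)
out, game_no = None, 0
with open("games.pgn") as src:
    for line in src:
        if line.startswith("[Event "):   # each game starts with an [Event tag
            if out:
                out.close()
            game_no += 1
            out = open(f"games_split/game_{game_no}.pgn", "w")
        if out:                          # ignore anything before the first game
            out.write(line)
if out:
    out.close()

Spark can then read all the games in parallel with spark.sparkContext.wholeTextFiles("games_split/*.pgn").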

Read the file:

basefile = spark.sparkContext.wholeTextFiles(<....>).toDF()
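toDF() turns the (path, content) pairs from wholeTextFiles into a DataFrame with two columns, _1 (the file path) and _2 (the file content); the content column _2 is what gets split further below.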

Create a list of columns and convert this list into a list of column expressions using regexp_extract:

from pyspark.sql import functions as F

cols = ['Event', 'White', 'Black', 'Result', 'UTCDate', 'UTCTime', 'WhiteElo', 'BlackElo', 'WhiteRatingDiff', 'BlackRatingDiff', 'ECO', 'Opening', 'TimeControl', 'Termination']
# for each tag name, capture the text between the quotes that follow it
cols = [F.regexp_extract('game', rf'{col} \"(.*)\"', 1).alias(col) for col in cols]
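Each of these expressions yields only the captured tag value (an empty string if the tag is missing from a game), so the resulting columns contain just the contents inside the quotation marks, which covers bonus point 2.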

Extract the data:

  1. split the whole file into an array of games
  2. explode this array into single records
  3. delete the line breaks within each record so that the regular expression works
  4. use the column expressions defined above to extract the data

basefile.selectExpr("split(_2,'\\\\[Event ') as game") \
  .selectExpr("explode(game) as game") \
  .withColumn("game", F.expr("concat('Event ', replace(game, '\\\\n', ''))")) \
  .select(cols) \
  .show(truncate=False)

Output (for an input file containing three copies of the game):

+---------------------+-----+-------+------+----------+--------+--------+--------+---------------+---------------+---+--------------------------------+-----------+-----------+
|Event                |White|Black  |Result|UTCDate   |UTCTime |WhiteElo|BlackElo|WhiteRatingDiff|BlackRatingDiff|ECO|Opening                         |TimeControl|Termination|
+---------------------+-----+-------+------+----------+--------+--------+--------+---------------+---------------+---+--------------------------------+-----------+-----------+
|Rated Classical game |BFG9k|mamalak|1-0   |2012.12.31|23:01:03|1639    |1403    |+5             |-8             |C00|French Defense: Normal Variation|600+8      |Normal     |
|Rated Classical game2|BFG9k|mamalak|1-0   |2012.12.31|23:01:03|1639    |1403    |+5             |-8             |C00|French Defense: Normal Variation|600+8      |Normal     |
|Rated Classical game3|BFG9k|mamalak|1-0   |2012.12.31|23:01:03|1639    |1403    |+5             |-8             |C00|French Defense: Normal Variation|600+8      |Normal     |
+---------------------+-----+-------+------+----------+--------+--------+--------+---------------+---------------+---+--------------------------------+-----------+-----------+
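Note that bonus point 1 is handled automatically: the move list is part of each exploded record, but none of the tag patterns match it, so it never shows up in the result.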

