Error message in a for loop on PySpark using regexp_replace

Views: 57  Published: 2021/6/25 18:32:50  Tags: pyspark, extract-error-message

Problem description

I'm running a loop in PySpark and I get this error message:

"Column is not iterable"

This is the code:

(regexp_replace(data_join_result[varibale_choisie],
(random.choice(data_join_result.collect()[j][varibale_choisie])),
data_join_result.collect()[j][lettre_choisie] ))))

According to the error message, the problem occurs at this expression:

data_join_result.collect()[j][lettre_choisie]

Does anyone know how to fix it? Thanks!

Solution

Finally, I found how to create a **loop to corrupt a dataset**. I'm sharing it in case someone needs it one day!

First, define the error types you want to create, the letters used for replacements and insertions, the variables you want to corrupt, and, as I also added, a set of special characters:

lettre = [ "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z"]

code_erreur= [ "replace","inserte","delete","espace","caract_spe", "NA","inverse"]

nombre_erreur=["1","1","1","2"]

varibale =["VARIABLEA","VARIABLEB"]

caract_spe =["_", "^", "¨", "", ".", "é", "-", "*","ù","ï","à","è","î","â"]

I created the list "nombre_erreur" with three "1" entries and one "2" because I want 75% of my dataset to get 1 error and 25% to get 2 errors when a value is drawn at random from it.
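
The weighting can be checked with a quick simulation. This is only a sketch: the fixed seed, the sample size, and the names `rng`, `tirages`, `share_one` and `share_two` are assumptions made for illustration, not part of the original answer.

```python
import random
from collections import Counter

nombre_erreur = ["1", "1", "1", "2"]

# Fixed seed (an assumption) so the run is repeatable.
rng = random.Random(0)
n_draws = 10_000

# Draw uniformly from the list; "1" appears three times out of four entries.
tirages = Counter(rng.choice(nombre_erreur) for _ in range(n_draws))

share_one = tirages["1"] / n_draws
share_two = tirages["2"] / n_draws
print(share_one, share_two)  # roughly 0.75 and 0.25
```

Repeating a value in the list is a simple way to weight a uniform draw; `random.choices` with its `weights` argument would be an alternative.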

Next, define the function:

import random

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# The parameter names are the ones the body actually uses; the signature
# as originally posted listed the global lists instead.
def def_code_erreur(col1, type_erreur, nb_erreur, lettre_choisie, caract_spe_choisi):
  # Null values are returned unchanged.
  if col1 is None:
    return col1

  for i in range(0, int(nb_erreur)):
    longueur = len(col1)
    # range(1, longueur) is empty for strings shorter than 2 characters.
    if longueur < 2:
      break
    pos = random.choice(range(1, longueur))

    if type_erreur == "delete":
      # Drop the character at pos.
      col1 = col1[:pos] + col1[(pos+1):]
    elif type_erreur == "espace":
      # Insert a space before pos.
      col1 = col1[:pos] + " " + col1[pos:]
    elif type_erreur == "inserte":
      # Insert the chosen letter before pos.
      col1 = col1[:pos] + lettre_choisie + col1[pos:]
    elif type_erreur == "caract_spe":
      # Insert the chosen special character before pos.
      col1 = col1[:pos] + caract_spe_choisi + col1[pos:]
    elif type_erreur == "replace":
      # Replace the character at pos-1 with the chosen letter.
      col1 = col1[:pos-1] + lettre_choisie + col1[pos:]
    elif type_erreur == "inverse":
      # Swap the characters at pos-1 and pos.
      col1 = col1[:pos-1] + col1[pos:pos+1] + col1[pos-1:pos] + col1[(pos+1):]
    # "NA" leaves the value unchanged.

  return col1


udf_def_code_erreur = udf(def_code_erreur, StringType())
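
The slice expressions in the function above can be checked on a plain string, without Spark. In this sketch the position is fixed rather than drawn at random, and the sample string, the letter "X", the character "_" and the `results` dictionary are assumptions chosen for illustration.

```python
col1 = "ABCDEF"
pos = 3                  # fixed here; the function draws it from range(1, len(col1))
lettre_choisie = "X"     # illustrative replacement letter
caract_spe_choisi = "_"  # illustrative special character

results = {
    # drops the character at index pos ("D")
    "delete": col1[:pos] + col1[pos + 1:],                      # "ABCEF"
    # inserts before index pos
    "espace": col1[:pos] + " " + col1[pos:],                    # "ABC DEF"
    "inserte": col1[:pos] + lettre_choisie + col1[pos:],        # "ABCXDEF"
    "caract_spe": col1[:pos] + caract_spe_choisi + col1[pos:],  # "ABC_DEF"
    # overwrites the character at index pos - 1 ("C")
    "replace": col1[:pos - 1] + lettre_choisie + col1[pos:],    # "ABXDEF"
    # swaps the characters at indexes pos - 1 and pos ("C" and "D")
    "inverse": col1[:pos - 1] + col1[pos:pos + 1] + col1[pos - 1:pos] + col1[pos + 1:],  # "ABDCEF"
}

for name, value in results.items():
    print(f"{name:10} {value!r}")
```

Because the position is always at least 1, "delete" never removes the first character, while "replace" and "inverse" act on index pos - 1 and so can reach it.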

Finally, you have to call "udf_def_code_erreur". You can call it in a loop if you want to corrupt the whole dataset.
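
The answer does not show the loop itself, so here is one possible sketch. `pick_error_params` and `corrupt_columns` are hypothetical helpers that are not in the original answer, and the lists are shortened copies so the block runs on its own. The sketch also assumes the UDF takes its arguments in the order (value, error type, number of errors, letter, special character).

```python
import random

# Shortened copies of the lists defined earlier, so the sketch is self-contained.
lettre = ["A", "B", "C"]
code_erreur = ["replace", "inserte", "delete", "espace", "caract_spe", "NA", "inverse"]
nombre_erreur = ["1", "1", "1", "2"]
caract_spe = ["_", "^", ".", "-", "*"]


def pick_error_params(rng=random):
    """Draw one (type_erreur, nb_erreur, lettre_choisie, caract_spe_choisi) tuple."""
    return (
        rng.choice(code_erreur),
        rng.choice(nombre_erreur),
        rng.choice(lettre),
        rng.choice(caract_spe),
    )


def corrupt_columns(df, corrupt_udf, columns, rng=random):
    """Hypothetical helper: apply corrupt_udf once to each named column of df."""
    # Imported here so the parameter-drawing part runs without a Spark install.
    from pyspark.sql.functions import col, lit

    for name in columns:
        type_erreur, nb_erreur, lettre_choisie, caract_spe_choisi = pick_error_params(rng)
        # lit() wraps the plain Python values as Column literals for the UDF call.
        df = df.withColumn(
            name,
            corrupt_udf(col(name), lit(type_erreur), lit(nb_erreur),
                        lit(lettre_choisie), lit(caract_spe_choisi)),
        )
    return df


print(pick_error_params(random.Random(0)))
```

With a DataFrame such as data_join_result, the call would look like `corrupt_columns(data_join_result, udf_def_code_erreur, ["VARIABLEA", "VARIABLEB"])`. In this sketch every row of a column receives the same error type. Drawing the parameters inside the UDF would be one way to vary them per row.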
