Error message in a for loop on PySpark using regexp_replace


Problem description


I'm running a loop in PySpark and I get this error message:

"Column is not iterable"

This is the code:

(regexp_replace(data_join_result[varibale_choisie],
(random.choice(data_join_result.collect()[j][varibale_choisie])),
data_join_result.collect()[j][lettre_choisie] )))) 

According to the error message, the problem occurs at this expression:

data_join_result.collect()[j][lettre_choisie]

My input:
VARIABLEA | VARIABLEB
BLUE      | WHITE
PINK      | DARK

My expected output:
VARIABLEA | VARIABLEB
BLTE      | WHITE
PINK      | DARM

Does anyone know how to fix this? Thanks!

Solution



>Finally, I found how to create a **loop to corrupt a dataset**. I'm sharing it in case someone needs it one day!

First, you need to define the errors you want to create, the letters to use for replacements, the variables you want to corrupt, and, since I also add errors with special characters, the list of those characters:

lettre = [ "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z"]

code_erreur= [ "replace","inserte","delete","espace","caract_spe", "NA","inverse"]

nombre_erreur=["1","1","1","2"]

varibale =["VARIABLEA","VARIABLEB"]

caract_spe =["_", "^", "¨", "", ".", "é", "-", "*","ù","ï","à","è","î","â"]

  • I create the list "nombre_erreur" with three "1" entries and one "2" because I want 75% of my dataset to have 1 error and 25% to have 2 errors.
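
To illustrate why that list gives roughly a 75/25 split, here is a minimal pure-Python sketch (no Spark needed). It assumes the error count for each row is drawn with `random.choice`, which the answer implies but does not show. The seed, the sample size and the names `tirages` and `part_une_erreur` are only there for the illustration.

```python
import random
from collections import Counter

nombre_erreur = ["1", "1", "1", "2"]

# Assumption: the number of errors for each row is drawn uniformly from the list.
rng = random.Random(0)  # fixed seed only to make the sketch reproducible
tirages = Counter(rng.choice(nombre_erreur) for _ in range(10_000))

# "1" fills 3 of the 4 slots, so it is drawn with probability 3/4.
part_une_erreur = tirages["1"] / 10_000
print(tirages, part_une_erreur)
```

Changing the ratio only requires changing how often each value appears in the list.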

Next, define the function and register it as a UDF:

import random

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType


# The parameters are named after the variables the body actually uses:
# col1 is the string to corrupt, type_erreur one entry of code_erreur,
# nb_erreur one entry of nombre_erreur, lettre_choisie one entry of lettre,
# and caract_spe_choisi one entry of caract_spe.
def def_code_erreur(col1, type_erreur, nb_erreur, lettre_choisie, caract_spe_choisi):

  # "NA" means no error; null values are returned untouched as well.
  if col1 is None or type_erreur == "NA":
    return col1

  for i in range(0, int(nb_erreur)):
    longueur = len(col1)
    # A position can only be drawn from strings of at least 2 characters.
    if longueur < 2:
      break
    pos = random.choice(range(1, longueur))

    if type_erreur == "delete":
      # Drop the character at pos.
      col1 = col1[:pos] + col1[(pos+1):]

    elif type_erreur == "espace":
      # Insert a space before pos.
      col1 = col1[:pos] + " " + col1[pos:]

    elif type_erreur == "inserte":
      # Insert the chosen letter before pos.
      col1 = col1[:pos] + lettre_choisie + col1[pos:]

    elif type_erreur == "caract_spe":
      # Insert the chosen special character before pos.
      col1 = col1[:pos] + caract_spe_choisi + col1[pos:]

    elif type_erreur == "replace":
      # Replace the character at pos-1 with the chosen letter.
      col1 = col1[:pos-1] + lettre_choisie + col1[pos:]

    elif type_erreur == "inverse":
      # Swap the characters at pos-1 and pos.
      col1 = col1[:pos-1] + col1[pos:pos+1] + col1[pos-1:pos] + col1[(pos+1):]

  return col1


udf_def_code_erreur = udf(def_code_erreur, StringType())

Finally, you have to call "udf_def_code_erreur" on the column you want to corrupt. You can call it in a loop if you want to corrupt the whole dataset.
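
The answer does not show how the random choices reach the function. One possible way, sketched below in plain Python so it can be tried without Spark, is to draw them inside a one-argument helper so that every value gets its own error type, error count, letter and special character. The helper `corrompre`, its `rng` parameter and the shortened `caract_spe` list are assumptions made for this illustration and are not part of the original answer.

```python
import random

lettre = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
code_erreur = ["replace", "inserte", "delete", "espace", "caract_spe", "NA", "inverse"]
nombre_erreur = ["1", "1", "1", "2"]
caract_spe = ["_", "^", ".", "-", "*"]  # shortened list, for illustration only


def corrompre(valeur, rng=random):
    """Hypothetical helper: draw all random choices for one value, then corrupt it."""
    if valeur is None:
        return valeur
    type_erreur = rng.choice(code_erreur)
    for _ in range(int(rng.choice(nombre_erreur))):
        # "NA" means no error; strings shorter than 2 characters have no valid position.
        if type_erreur == "NA" or len(valeur) < 2:
            break
        pos = rng.choice(range(1, len(valeur)))
        if type_erreur == "delete":
            valeur = valeur[:pos] + valeur[pos + 1:]
        elif type_erreur == "espace":
            valeur = valeur[:pos] + " " + valeur[pos:]
        elif type_erreur == "inserte":
            valeur = valeur[:pos] + rng.choice(lettre) + valeur[pos:]
        elif type_erreur == "caract_spe":
            valeur = valeur[:pos] + rng.choice(caract_spe) + valeur[pos:]
        elif type_erreur == "replace":
            valeur = valeur[:pos - 1] + rng.choice(lettre) + valeur[pos:]
        elif type_erreur == "inverse":
            # Swap the characters at pos-1 and pos.
            valeur = valeur[:pos - 1] + valeur[pos] + valeur[pos - 1] + valeur[pos + 1:]
    return valeur


rng = random.Random(42)  # seeded generator, only to make the demo repeatable
print([corrompre(v, rng) for v in ["BLUE", "PINK", "WHITE", "DARK"]])
```

In Spark, a one-argument function like this could presumably be wrapped with `udf(corrompre, StringType())` and applied with `withColumn` to each name in `varibale`. That wrapping is not exercised here.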
