How to find a typo in a data frame and replace it


Problem description


I have a data frame with names, surnames, birthdays and some random variables. Let's say it looks like this:

    BIRTH  NAME    SURNAME random_value
 1      1  Luke  Skywalker            1
 2      1  Luke  Skywalker            2
 4      2  Leia     Organa            3
 5      3   Han       Solo            7
 7      1   Ben       Solo            1
 8      5 Lando Calrissian            3
 9      3   Han       Solo            4
 10     3   Ham       Solo            4
 11     1  Luke  Wkywalker            9

How can I figure out whether there is a typo in a name or surname, based on BIRTH, NAME and SURNAME, and then replace the typo with the correct name or surname?

For example, there are two Han Solos with birthday 3, and then there is a Ham Solo with the same birthday. I would like the algorithm to figure out that Ham is wrong and replace it with Han.

If two different spellings occur equally often (for the same BIRTH), it doesn't really matter which one is chosen, as long as all the NAME or SURNAME values for this group are the same (so always Ham or always Han, but not mixed for the same BIRTH).

So the end result would be this:

    BIRTH  NAME    SURNAME random_value
 1      1  Luke  Skywalker            1
 2      1  Luke  Skywalker            2
 4      2  Leia     Organa            3
 5      3   Han       Solo            7
 7      1   Ben       Solo            1
 8      5 Lando Calrissian            3
 11     3   Han       Solo            4
 12     3   Han       Solo            4
 13     1  Luke  Skywalker            9

Is there an automated way to do this? My data set is large (more than 3 million rows), so checking it manually would be impossible.

I imagine we would look at all the names and surnames with the same birth and then check whether there are singular outliers that differ by only one letter or have the order of two letters switched (Luke vs Lkue). When we find an outlier like that, we replace it.
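The "differs by one letter or has two letters switched" idea corresponds to a small edit distance. As a minimal sketch (not part of the original question), the stringdist package's optimal string alignment method ("osa") counts a substitution or a swap of two adjacent characters as one edit each. The helper name is_typo_candidate and its max_edits threshold are hypothetical, chosen only for illustration:

```r
library(stringdist)

# Optimal string alignment distance: insertions, deletions, substitutions
# and swaps of two adjacent characters each cost one edit.
d_sub  <- stringdist("Han",  "Ham",  method = "osa")  # one substituted letter
d_swap <- stringdist("Luke", "Lkue", method = "osa")  # two adjacent letters swapped
d_far  <- stringdist("Luke", "Leia", method = "osa")  # genuinely different names

print(c(substitution = d_sub, swap = d_swap, different = d_far))

# Hypothetical rule: two spellings are typo candidates of each other
# when they are at most max_edits edits apart.
is_typo_candidate <- function(a, b, max_edits = 1) {
  stringdist(a, b, method = "osa") <= max_edits
}

is_typo_candidate("Han", "Ham")    # TRUE
is_typo_candidate("Luke", "Leia")  # FALSE
```

A fixed threshold like this is easy to reason about, but it treats short and long names alike, which is one reason a normalised distance may be preferable.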

Solution

Here is one way to find the typos. First, define the data frame you mention in the question:

# Example data from the question, including the two typos ("Ham", "Wkywalker")
my_df <- data.frame(BIRTH = c(1, 1, 2, 3, 1, 5, 3, 3, 1),
                    NAME = c("Luke", "Luke", "Leia", "Han", "Ben", "Lando", "Han", "Ham", "Luke"),
                    SURNAME = c("Skywalker", "Skywalker", "Organa", "Solo", "Solo", "Calrissian", "Solo", "Solo", "Wkywalker"),
                    random_value = c(1, 2, 3, 7, 1, 3, 4, 4, 9))

Second, make a new column combining all the entries you want to match on:

my_df$birth_and_names <- do.call(paste, c(my_df[c("BIRTH", "NAME", "SURNAME")], sep = " ")) 

Third, define a distance matrix based upon string distance, using the package stringdist:

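library(stringdist)
# Pairwise Jaro-Winkler distances between all combined strings
dist.matrix <- stringdistmatrix(my_df$birth_and_names, my_df$birth_and_names, method = 'jw', p = 0.1)
# Label rows and columns so the dendrogram leaves are readable
rownames(dist.matrix) <- my_df$birth_and_names
colnames(dist.matrix) <- my_df$birth_and_names
dist.matrix <- as.dist(dist.matrix)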
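To get a feel for what the distance matrix contains, here is a small sketch comparing three pairs of combined strings with the same method and p as above. In stringdist, the "jw" distance lies between 0 and 1, and a non-zero p shrinks the distance for strings that share a prefix (up to four characters). Because BIRTH is pasted first, this appears to favour records with the same birth value; that reading is an interpretation, not something the answer states.

```r
library(stringdist)

# Jaro-Winkler distance with the same prefix factor as in the answer (p = 0.1).
d_same  <- stringdist("3 Han Solo", "3 Han Solo", method = "jw", p = 0.1)
d_typo  <- stringdist("3 Han Solo", "3 Ham Solo", method = "jw", p = 0.1)
d_other <- stringdist("3 Han Solo", "1 Ben Solo", method = "jw", p = 0.1)

# Identical strings are at distance 0; the typo should be far closer
# to the correct spelling than a different person is.
print(c(same = d_same, typo = d_typo, other = d_other))
```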

Fourth, cluster and display the results as a dendrogram.

clusts<-hclust(dist.matrix,method="ward.D2")
plot(clusts)

See the dendrogram here:

Where exactly you set the threshold for automatically combining similar results is up to you and depends on the problem. There is the usual trade-off between false positives (distinct people merged into one group) and false negatives (typos left as separate groups).

For this example, cutting at a distance of 0.2 seems appropriate, so:

my_df$LikelyGroup<-cutree(clusts,h=0.2)

where my_df$LikelyGroup is now a column of identifiers with one number per individual, even if some of their entries are misspelled.
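One way to choose the cut height is to look at how the number of groups changes across a few candidate values and to inspect the members of each group. The sketch below rebuilds the combined strings from the example by hand; the candidate heights are illustrative values only, not recommendations.

```r
library(stringdist)

# Combined "BIRTH NAME SURNAME" strings for the example data.
keys <- c("1 Luke Skywalker", "1 Luke Skywalker", "2 Leia Organa",
          "3 Han Solo", "1 Ben Solo", "5 Lando Calrissian",
          "3 Han Solo", "3 Ham Solo", "1 Luke Wkywalker")

# Called with a single vector, stringdistmatrix() returns a dist object directly.
clusts <- hclust(stringdistmatrix(keys, method = "jw", p = 0.1), method = "ward.D2")

# Number of groups at several candidate cut heights (illustrative values).
heights <- c(0.05, 0.2, 0.5, 1)
n_groups <- sapply(heights, function(h) length(unique(cutree(clusts, h = h))))
print(data.frame(height = heights, n_groups = n_groups))

# Which strings end up together at the height used in the answer.
groups <- cutree(clusts, h = 0.2)
print(split(keys, groups))
```

A lower height keeps more spellings apart (more missed typos), while a higher one merges more records (more distinct people merged), so checking a sample of groups by eye at the chosen height is probably worthwhile.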

Now to name the groups, find the mode for each name/birthday column:

library(dplyr)

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

my_df<-my_df%>%
  group_by(LikelyGroup)%>%
  mutate(Group_Birth=Mode(BIRTH),
         Group_Name=Mode(NAME),
         Group_Surname=Mode(SURNAME))
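The Mode helper returns the most frequent value and, on a tie, the value that appears first. That matches the question's requirement that either spelling is acceptable as long as the group is consistent. A short illustration with made-up inputs:

```r
# Same helper as in the answer: most frequent value, first one on ties.
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

Mode(c("Han", "Han", "Ham"))  # "Han": the majority spelling wins
Mode(c("Ham", "Han"))         # "Ham": on a tie, the first value seen is returned
Mode(c(3, 3, 1))              # 3: works for numeric columns such as BIRTH too
```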

The resulting my_df:

 BIRTH|NAME |SURNAME   | random_value| LikelyGroup| Group_Birth|Group_Name |Group_Surname 
------|-----|----------|-------------|------------|------------|-----------|--------------
     1|Luke |Skywalker |            1|           1|           1|Luke       |Skywalker     
     1|Luke |Skywalker |            2|           1|           1|Luke       |Skywalker     
     2|Leia |Organa    |            3|           2|           2|Leia       |Organa        
     3|Han  |Solo      |            7|           3|           3|Han        |Solo          
     1|Ben  |Solo      |            1|           4|           1|Ben        |Solo          
     5|Lando|Calrissian|            3|           5|           5|Lando      |Calrissian    
     3|Han  |Solo      |            4|           3|           3|Han        |Solo          
     3|Ham  |Solo      |            4|           3|           3|Han        |Solo          
     1|Luke |Wkywalker |            9|           1|           1|Luke       |Skywalker     
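The answer stores the corrected values in new Group_* columns. To get the exact shape asked for in the question, the original columns can be overwritten instead. The following is a sketch of that variation, with the LikelyGroup values copied from the table above rather than recomputed; the name cleaned is just an illustrative choice.

```r
library(dplyr)

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

my_df <- data.frame(
  BIRTH = c(1, 1, 2, 3, 1, 5, 3, 3, 1),
  NAME = c("Luke", "Luke", "Leia", "Han", "Ben", "Lando", "Han", "Ham", "Luke"),
  SURNAME = c("Skywalker", "Skywalker", "Organa", "Solo", "Solo",
              "Calrissian", "Solo", "Solo", "Wkywalker"),
  random_value = c(1, 2, 3, 7, 1, 3, 4, 4, 9),
  LikelyGroup = c(1, 1, 2, 3, 4, 5, 3, 3, 1)  # copied from the output table
)

# Overwrite NAME and SURNAME with the most common spelling in each group,
# then drop the helper column.
cleaned <- my_df %>%
  group_by(LikelyGroup) %>%
  mutate(NAME = Mode(NAME), SURNAME = Mode(SURNAME)) %>%
  ungroup() %>%
  select(BIRTH, NAME, SURNAME, random_value)

print(cleaned)
```

Keeping the original my_df untouched and writing to a new object makes it easy to compare before and after and to review which rows were changed.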

See gist at https://gist.github.com/gdmcdonald/9135ec8f7e903a0735a0b16d8cb97297
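A note on scale: a full pairwise distance matrix for n rows has n^2 entries, so for the 3 million rows mentioned in the question it would have about 9 x 10^12 entries and is unlikely to fit in memory. One possible workaround, in the spirit of the question's own idea, is to cluster only within each BIRTH value. This is a sketch that assumes BIRTH itself contains no typos, which is a stronger assumption than the answer makes; cluster_block is a hypothetical helper name.

```r
library(stringdist)

my_df <- data.frame(
  BIRTH = c(1, 1, 2, 3, 1, 5, 3, 3, 1),
  NAME = c("Luke", "Luke", "Leia", "Han", "Ben", "Lando", "Han", "Ham", "Luke"),
  SURNAME = c("Skywalker", "Skywalker", "Organa", "Solo", "Solo",
              "Calrissian", "Solo", "Solo", "Wkywalker")
)
my_df$full_name <- paste(my_df$NAME, my_df$SURNAME)

# Hypothetical helper: cluster the names inside one BIRTH block and
# return a group number per row of that block.
cluster_block <- function(block_names, h = 0.2) {
  if (length(block_names) < 2) return(rep(1L, length(block_names)))  # hclust needs >= 2 items
  d <- stringdistmatrix(block_names, method = "jw", p = 0.1)  # dist object
  cutree(hclust(d, method = "ward.D2"), h = h)
}

# Apply the helper per BIRTH value; ave() keeps the original row order.
local_group <- ave(my_df$full_name, my_df$BIRTH, FUN = cluster_block)

# Group ids are only unique within a block, so prefix them with BIRTH.
my_df$LikelyGroup <- paste(my_df$BIRTH, local_group, sep = "_")
print(my_df[, c("BIRTH", "full_name", "LikelyGroup")])
```

Each block then needs a matrix only as large as the number of rows sharing that birth value. Deduplicating identical name strings within a block before computing distances would likely shrink the work further.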
