How to standardize multiple values of a column?


Problem description


I need to derive the values in VAR1_STRUCTURED without manually inputting all possible VAR1 values, since I have 50,000 observations, which means 50,000 possible cases.

    Var1                   Var1_Structured
    125 Hollywood St.      125 Hollywood St.
    125 Hllywood St.       125 Hollywood St.
    125 Hollywood St       125 Hollywood St.
    Target Store           Target Store
    Trget Stre             Target Store
    Target. Store          Target Store
    T argetStore           Target Store
    Walmart                Walmart
    Walmart Inc.           Walmart
    Wal marte              Walmart
    

and there are many more values below...

Solution

Your question is really imprecise. Please follow @RiggsFolly's suggestions and read the references on how to ask a good question.

Also, as suggested by @DuduMarkovitz, you should start by simplifying the problem and cleaning your data.
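
As a minimal base-R sketch of that cleaning step (the helper name clean_text is mine, and the rules are only examples; adapt them to your data):

```r
# Minimal cleaning sketch (illustrative rules only):
clean_text <- function(x) {
  x <- tolower(x)                   # case-fold
  x <- gsub("[[:punct:]]", " ", x)  # drop punctuation, e.g. the "." in "Target."
  x <- gsub("\\s+", " ", x)         # collapse repeated whitespace
  trimws(x)
}

clean_text(c("125 Hollywood St.", "Target. Store", "Walmart Inc."))
# [1] "125 hollywood st" "target store"     "walmart inc"
```

Normalizing case and punctuation before computing distances already collapses pairs like "125 Hollywood St." and "125 Hollywood St".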

Once you are satisfied with the results, you can then proceed to identify a group for each Var1 entry (this will help down the road when performing further analysis/manipulation on similar entries). This could be done in many different ways, but as mentioned by @GordonLinoff, one possibility is the Levenshtein distance.

Note: for 50K entries the result won't be 100% accurate, as it will not always categorize the terms into the appropriate group, but it should considerably reduce the manual effort.

In R, you can do this using adist():

    Compute the approximate string distance between character vectors. The distance is a generalized Levenshtein (edit) distance, giving the minimal possibly weighted number of insertions, deletions and substitutions needed to transform one string into another.

Using your example data:

    d <- adist(df$Var1)
    # add rownames (this will prove useful later on)
    rownames(d) <- df$Var1
    
    > d
    #                  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
    #125 Hollywood St.    0    1    1   16   15   16   15   15   15    15
    #125 Hllywood St.     1    0    2   15   14   15   15   14   14    14
    #125 Hollywood St     1    2    0   15   15   15   14   14   15    15
    #Target Store        16   15   15    0    2    1    2   10   10     9
    #Trget Stre          15   14   15    2    0    3    4    9   10     8
    #Target. Store       16   15   15    1    3    0    3   11   11    10
    #T argetStore        15   15   14    2    4    3    0   10   11     9
    #Walmart             15   14   14   10    9   11   10    0    5     2
    #Walmart Inc.        15   14   15   10   10   11   11    5    0     6
    #Wal marte           15   14   15    9    8   10    9    2    6     0
    

For this small sample you can see the 3 distinct groups (the clusters of low Levenshtein distance values) and could easily assign them manually, but for larger sets you will likely need a clustering algorithm.

I already pointed you in the comments to one of my previous answers showing how to do this using hclust() and Ward's minimum variance method, but I think here you would be better off using other techniques (one of my favorite resources on the topic, for a quick overview of some of the most widely used methods in R, is this detailed answer).
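
For reference, a base-R sketch of that hclust() route on a reduced toy set (the cut into k = 3 groups is hard-coded here because we can eyeball it; on real data you would have to choose k yourself, which is one reason the affinity propagation approach shown next can be more convenient):

```r
# Hierarchical clustering on an edit-distance matrix,
# cutting the tree into k = 3 groups (k chosen by eye for this toy set).
v <- c("125 Hollywood St.", "125 Hllywood St.",
       "Target Store", "Trget Stre",
       "Walmart", "Walmart Inc.")
d_small <- adist(v)
rownames(d_small) <- v

hc <- hclust(as.dist(d_small), method = "ward.D2")
groups <- cutree(hc, k = 3)
split(v, groups)  # entries listed per cluster
```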

Here's an example using affinity propagation clustering:

    library(apcluster)
    d_ap <- apcluster(negDistMat(r = 1), d)
    

You will find in the APResult object d_ap the elements associated with each cluster and the optimal number of clusters, in this case: 3.

    > d_ap@clusters
    #[[1]]
    #125 Hollywood St.  125 Hllywood St.  125 Hollywood St 
    #                1                 2                 3 
    #
    #[[2]]
    # Target Store    Trget Stre Target. Store  T argetStore 
    #            4             5             6             7 
    #
    #[[3]]
    #     Walmart Walmart Inc.    Wal marte 
    #           8            9           10 
    

You can also see a visual representation:

    > heatmap(d_ap, margins = c(10, 10))
    
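To carry those assignments forward, the cluster list can be flattened into a Var1 → group lookup table. A base-R sketch, using a plain named list in place of d_ap@clusters so the example stands alone:

```r
# Stand-in for d_ap@clusters: one named index vector per cluster.
clusters <- list(
  c("125 Hollywood St." = 1, "125 Hllywood St." = 2, "125 Hollywood St" = 3),
  c("Walmart" = 8, "Walmart Inc." = 9, "Wal marte" = 10)
)

# One row per entry, tagged with the index of the cluster it came from.
lookup <- data.frame(
  Var1  = unlist(lapply(clusters, names)),
  group = rep(seq_along(clusters), lengths(clusters)),
  row.names = NULL
)
lookup  # 6 rows: each Var1 entry with its cluster id
```

With the real d_ap object you would pass d_ap@clusters instead of the hand-built list.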

Then, you can perform further manipulations on each group. As an example, here I use hunspell to look up each separate word from Var1 in an en_US dictionary for spelling mistakes and try to find, within each group, which id has no spelling mistakes (potential_id):

    library(dplyr)
    library(tidyr)
    library(hunspell)
    
    tibble(Var1 = sapply(d_ap@clusters, names)) %>%
      unnest(.id = "group") %>%                  # one row per entry, tagged with its cluster
      group_by(group) %>%
      mutate(id = row_number()) %>%              # position of the entry within its group
      separate_rows(Var1) %>%                    # one row per word
      mutate(check = hunspell_check(Var1)) %>%   # TRUE if the word is in the en_US dictionary
      group_by(id, add = TRUE) %>%
      summarise(checked_vars = toString(Var1), 
                result_per_word = toString(check), 
                potential_id = all(check))       # TRUE only when every word checks out
    

Which gives:

    #Source: local data frame [10 x 5]
    #Groups: group [?]
    #
    #   group    id        checked_vars   result_per_word potential_id
    #   <int> <int>               <chr>             <chr>        <lgl>
    #1      1     1 125, Hollywood, St.  TRUE, TRUE, TRUE         TRUE
    #2      1     2  125, Hllywood, St. TRUE, FALSE, TRUE        FALSE
    #3      1     3  125, Hollywood, St  TRUE, TRUE, TRUE         TRUE
    #4      2     1       Target, Store        TRUE, TRUE         TRUE
    #5      2     2         Trget, Stre      FALSE, FALSE        FALSE
    #6      2     3      Target., Store        TRUE, TRUE         TRUE
    #7      2     4       T, argetStore       TRUE, FALSE        FALSE
    #8      3     1             Walmart             FALSE        FALSE
    #9      3     2       Walmart, Inc.       FALSE, TRUE        FALSE
    #10     3     3          Wal, marte      FALSE, FALSE        FALSE
    

Note: here, since we haven't performed any text processing, the results are not very conclusive, but you get the idea.
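
The last step toward the desired Var1_Structured column would be to pick one representative per group and recycle it over the group's members. A base-R sketch, taking the first entry whose potential_id flag is TRUE (all the input vectors below are made up to keep the example self-contained):

```r
# Toy inputs standing in for the clustering / spell-check results above.
Var1  <- c("Target Store", "Trget Stre", "Walmart", "Walmart Inc.")
group <- c(1, 1, 2, 2)
clean <- c(TRUE, FALSE, TRUE, FALSE)  # e.g. the potential_id flags

# Representative per group: first "clean" entry, else just the first entry.
rep_per_group <- tapply(seq_along(Var1), group, function(i) {
  ok <- i[clean[i]]
  Var1[if (length(ok) > 0) ok[1] else i[1]]
})

Var1_Structured <- unname(rep_per_group[as.character(group)])
Var1_Structured
# [1] "Target Store" "Target Store" "Walmart"      "Walmart"
```

How you choose the representative (spell-check, frequency, shortest string, a manual review of one row per group) is up to you; the recycling step stays the same.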


Data

    df <- tibble::tribble(
      ~Var1,                   
      "125 Hollywood St.",      
      "125 Hllywood St.",       
      "125 Hollywood St",       
      "Target Store",           
      "Trget Stre",             
      "Target. Store",          
      "T argetStore",           
      "Walmart",                
      "Walmart Inc.",           
      "Wal marte" 
    )
    

