How to standardize multiple values of a column?
Problem description
I need to derive the values in VAR1_STRUCTURED without manually entering every possible VAR1 value, since I have 50,000 observations, which means up to 50,000 possible cases.
Var1                Var1_Structured
125 Hollywood St.   125 Hollywood St.
125 Hllywood St.    125 Hollywood St.
125 Hollywood St    125 Hollywood St.
Target Store        Target Store
Trget Stre          Target Store
Target. Store       Target Store
T argetStore        Target Store
Walmart             Walmart
Walmart Inc.        Walmart
Wal marte           Walmart
and there are a lot more values below...
Solution
Your question is really imprecise. Please follow @RiggsFolly's suggestions and read the references on how to ask a good question.
Also, as suggested by @DuduMarkovitz, you should start by simplifying the problem and cleaning your data. A few resources to get you started:
- Basic Text Processing Tutorial by Matt Deny
- Handling and Processing Strings in R by Gaston Sanchez
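As a rough idea of what such cleaning might involve, here is a minimal base R sketch. The helper name clean_text and the particular steps (lowercasing, stripping punctuation, collapsing whitespace) are assumptions for illustration and are not taken from the tutorials above; real data will probably need more rules.

```r
# Hypothetical helper (not from the linked tutorials): normalise free-text entries
clean_text <- function(x) {
  x <- tolower(x)                        # ignore case differences
  x <- gsub("[[:punct:]]", " ", x)       # replace punctuation with a space
  x <- gsub("[[:space:]]+", " ", x)      # collapse runs of whitespace
  trimws(x)                              # drop leading/trailing whitespace
}

clean_text(c("125 Hollywood St.", "Target. Store", "Walmart Inc."))
# [1] "125 hollywood st" "target store"     "walmart inc"
```

Cleaning first tends to shrink the distances between variants of the same entry, which should make the grouping step easier.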
Once you are satisfied with the results, you could then proceed to identify a group for each Var1 entry (this will help you down the road to perform further analysis/manipulations on similar entries). This could be done in many different ways but, as mentioned by @GordonLinoff, one possibility is the Levenshtein distance.
Note: for 50K entries, the result won't be 100% accurate, as it will not always put the terms in the appropriate group, but this should considerably reduce the manual effort.
In R, you could do this using adist(). From its documentation:
Compute the approximate string distance between character vectors. The distance is a generalized Levenshtein (edit) distance, giving the minimal possibly weighted number of insertions, deletions and substitutions needed to transform one string into another.
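To get a feel for the metric before applying it to a whole column, here is a small sketch on single pairs of strings. adist() ships with base R (package utils); the example strings other than yours are my own.

```r
# Each call returns a 1 x 1 matrix holding the edit distance
adist("kitten", "sitting")             # 3: two substitutions and one insertion
adist("Trget Stre", "Target Store")    # 2: insert "a" and insert "o"

# ignore.case = TRUE treats upper and lower case as identical
adist("WALMART", "walmart", ignore.case = TRUE)   # 0
```

A small distance relative to the string length suggests two entries are variants of the same value.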
Using your example data:
d <- adist(df$Var1)
# add rownames (this will prove useful later on)
rownames(d) <- df$Var1
> d
#                  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#125 Hollywood St.    0    1    1   16   15   16   15   15   15    15
#125 Hllywood St.     1    0    2   15   14   15   15   14   14    14
#125 Hollywood St     1    2    0   15   15   15   14   14   15    15
#Target Store        16   15   15    0    2    1    2   10   10     9
#Trget Stre          15   14   15    2    0    3    4    9   10     8
#Target. Store       16   15   15    1    3    0    3   11   11    10
#T argetStore        15   15   14    2    4    3    0   10   11     9
#Walmart             15   14   14   10    9   11   10    0    5     2
#Walmart Inc.        15   14   15   10   10   11   11    5    0     6
#Wal marte           15   14   15    9    8   10    9    2    6     0
For this small sample, you can see the 3 distinct groups (the clusters of low Levenshtein distance values) and could easily assign them manually, but for larger sets, you will likely need a clustering algorithm.
I already pointed you in the comments to one of my previous answers showing how to do this using hclust() and Ward's minimum variance method, but I think here you would be better off using other techniques (one of my favorite resources on the topic, for a quick overview of some of the most widely used methods in R, is this detailed answer).
Here's an example using affinity propagation clustering:
library(apcluster)
d_ap <- apcluster(negDistMat(r = 1), d)
You will find in the APResult object d_ap the elements associated with each cluster and the optimal number of clusters, in this case: 3.
> d_ap@clusters
#[[1]]
#125 Hollywood St.  125 Hllywood St.  125 Hollywood St
#                1                 2                 3
#
#[[2]]
# Target Store    Trget Stre Target. Store  T argetStore
#            4             5             6             7
#
#[[3]]
#     Walmart Walmart Inc.    Wal marte
#           8            9           10
You can also see a visual representation:
> heatmap(d_ap, margins = c(10, 10))
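For comparison, the hclust() route mentioned earlier could look roughly like the sketch below. This is not the code from my previous answer, just a minimal base R illustration; the choice of k = 3 is an assumption that only makes sense because the sample obviously has three groups, and for real data you would have to pick a cut height or k yourself.

```r
# Sample data repeated here so the snippet runs on its own
vars <- c("125 Hollywood St.", "125 Hllywood St.", "125 Hollywood St",
          "Target Store", "Trget Stre", "Target. Store", "T argetStore",
          "Walmart", "Walmart Inc.", "Wal marte")

d <- adist(vars)
rownames(d) <- vars

# hclust() expects a "dist" object, so convert the square matrix first
hc <- hclust(as.dist(d), method = "ward.D2")   # Ward's minimum variance method

# Cut the tree into k groups (k = 3 is assumed for this sample)
groups <- cutree(hc, k = 3)
split(vars, groups)
```

Unlike affinity propagation, this approach does not choose the number of clusters for you, which is one reason it may be less convenient with 50K entries.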
Then, you can perform further manipulations for each group. As an example, here I use hunspell to look up each separate word from Var1 in an en_US dictionary for spelling mistakes and try to find, within each group, which id has no spelling mistakes (potential_id):
library(dplyr)
library(tidyr)
library(hunspell)
# one row per cluster, then one row per entry with its cluster number
tibble(Var1 = sapply(d_ap@clusters, names)) %>%
  unnest(.id = "group") %>%
  group_by(group) %>%
  mutate(id = row_number()) %>%
  # one row per word, each checked against the dictionary
  separate_rows(Var1) %>%
  mutate(check = hunspell_check(Var1)) %>%
  # newer dplyr versions spell this argument .add = TRUE
  group_by(id, add = TRUE) %>%
  summarise(checked_vars = toString(Var1),
            result_per_word = toString(check),
            potential_id = all(check))
Which gives:
#Source: local data frame [10 x 5]
#Groups: group [?]
#
#   group    id checked_vars        result_per_word   potential_id
#   <int> <int> <chr>               <chr>             <lgl>
#1      1     1 125, Hollywood, St. TRUE, TRUE, TRUE  TRUE
#2      1     2 125, Hllywood, St.  TRUE, FALSE, TRUE FALSE
#3      1     3 125, Hollywood, St  TRUE, TRUE, TRUE  TRUE
#4      2     1 Target, Store       TRUE, TRUE        TRUE
#5      2     2 Trget, Stre         FALSE, FALSE      FALSE
#6      2     3 Target., Store      TRUE, TRUE        TRUE
#7      2     4 T, argetStore       TRUE, FALSE       FALSE
#8      3     1 Walmart             FALSE             FALSE
#9      3     2 Walmart, Inc.       FALSE, TRUE       FALSE
#10     3     3 Wal, marte          FALSE, FALSE      FALSE
Note: since we haven't performed any text processing here, the results are not very conclusive, but you get the idea.
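If the spell check is not conclusive enough, another option (my own suggestion here, not something the steps above already do) is to pick, within each group, the entry with the smallest total edit distance to the other members and use it as the standardized value. The sketch below uses base R only; the group vector is copied by hand from the d_ap@clusters output above, and the helper name pick_medoid is hypothetical.

```r
vars <- c("125 Hollywood St.", "125 Hllywood St.", "125 Hollywood St",
          "Target Store", "Trget Stre", "Target. Store", "T argetStore",
          "Walmart", "Walmart Inc.", "Wal marte")

# Group memberships copied from the d_ap@clusters output (assumed, hard-coded)
groups <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3)

d <- adist(vars)

# Hypothetical helper: the group member closest (in total) to all other members
pick_medoid <- function(idx) {
  vars[idx][which.min(rowSums(d[idx, idx, drop = FALSE]))]
}

# One representative per group, named "1", "2", "3"
canonical <- vapply(split(seq_along(vars), groups), pick_medoid, character(1))

result <- data.frame(Var1 = vars,
                     Var1_Structured = unname(canonical[as.character(groups)]),
                     stringsAsFactors = FALSE)
result
```

On this sample the representatives come out as "125 Hollywood St.", "Target Store" and "Walmart", but on real data the central entry can itself be a misspelling, so a manual review of the chosen values is still advisable.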
Data
df <- tibble::tribble(
  ~Var1,
  "125 Hollywood St.",
  "125 Hllywood St.",
  "125 Hollywood St",
  "Target Store",
  "Trget Stre",
  "Target. Store",
  "T argetStore",
  "Walmart",
  "Walmart Inc.",
  "Wal marte"
)