Approximate String Matching - Machine Learning

Problem description

I have a requirement where my source data is in HDFS, and one field contains the users' skills. The source file holds all kinds of skills attributed to a user, e.g. MANAGEMENT, JAVA, HADOOP, PIG, SQL, PERFORMANCE TUNING, C, BUSINESS CONSULTING, SALES, etc.

I need to build a machine learning algorithm to detect spelling mistakes in these skills. For example, the column might contain SALS instead of SALES, or HADOOP might be misspelt as HADUP. I want to standardise these values.

How can I go about doing this? I don't know machine learning, but I am willing to learn it and code it. I am comfortable working in Python.

Any suggestions on how to approach this? It would be great if you could pitch in with ideas!

Recommended answer

There are typically two parts to such a problem: figuring out which items are likely in error, and then fixing those.

If you assume that the majority of items are spelled correctly, then finding the likely errors is pretty easy. Fixing the errors is a lot harder to automate, and it's probably impossible to do it 100% correctly in any reasonable length of time. But you might find that if you do a good job of finding the errors, fixing them manually is no big deal.

To find the errors, I would suggest that you make a list of the skills along with a count of how many times each skill is referenced in the entire data set. When you're done you'll have a list like:

MANAGEMENT, 22
JAVA, 298
HADOOP, 12
HADUP, 1
SALES, 200
SALS, 1

etc. Each skill is listed along with the number of users who possess that skill.
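The counting step could be sketched roughly as follows. This is only an illustration: it assumes the skills field has already been read out of HDFS into Python as one comma-separated string per user, and the names `count_skills` and `sample_records` are made up for the example.

```python
from collections import Counter

def count_skills(records):
    """Count how many user records mention each skill.

    `records` is an iterable of comma-separated skill strings, one per user
    (an assumed input format, not something the source data guarantees).
    """
    counts = Counter()
    for record in records:
        # Normalise case and whitespace; the set keeps a skill that a user
        # lists twice from being counted twice.
        skills = {s.strip().upper() for s in record.split(",") if s.strip()}
        counts.update(skills)
    return counts

# Hypothetical sample data standing in for the real skills field.
sample_records = [
    "MANAGEMENT, JAVA, SALES",
    "java, hadoop , SALES",
    "JAVA, HADUP, SALS",
]

skill_counts = count_skills(sample_records)
for skill, n in skill_counts.most_common():
    print(f"{skill}, {n}")
```

Upper-casing and trimming before counting keeps `java` and `JAVA ` from showing up as separate, rare-looking skills.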

Now, sort those by frequency and choose a threshold. Say you choose to examine more closely anything that has a frequency of 3 or less. The idea is that items that are used a very small number of times relative to other items are probably misspellings.
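A minimal sketch of that thresholding step, using the counts from the example list above. The function name `rare_skills` and the default threshold of 3 are just illustrative choices; the right cutoff depends on your data.

```python
def rare_skills(skill_counts, threshold=3):
    """Return (skill, count) pairs at or below `threshold`, rarest first."""
    suspects = [(skill, n) for skill, n in skill_counts.items() if n <= threshold]
    # Sort by count, then alphabetically, so the output order is stable.
    return sorted(suspects, key=lambda item: (item[1], item[0]))

# Counts taken from the example list in the answer.
skill_counts = {
    "MANAGEMENT": 22,
    "JAVA": 298,
    "HADOOP": 12,
    "HADUP": 1,
    "SALES": 200,
    "SALS": 1,
}

suspects = rare_skills(skill_counts, threshold=3)
print(suspects)  # [('HADUP', 1), ('SALS', 1)]
```

Keep in mind that a rare skill is only a suspect: a legitimately unusual skill will land in this list too, which is why the next step is a closer look rather than an automatic fix.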

Once you've identified the terms you want to examine more closely, you can decide whether to automate the change or do it manually. When I had to do this, I got my list of likely misspellings and manually created a file that had the misspelling and the correction. For example:

SALS,SALES
HADUP,HADOOP
PREFORMANCE,PERFORMANCE

There were a couple hundred, but manually creating the file was a whole lot faster than writing a program to figure out what the correct spelling should be.

Then I loaded that file and went through my user records, making the replacements as required.
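Loading the corrections file and applying it might look something like this. It is a sketch under a few assumptions: the file holds one `MISSPELLING,CORRECTION` pair per line as shown above, each user record is a comma-separated string, and a correction applies only when it matches a whole skill entry. The helper names are hypothetical, and an in-memory `io.StringIO` stands in for the real file.

```python
import csv
import io

def load_corrections(lines):
    """Build a {misspelling: correction} dict from 'WRONG,RIGHT' rows."""
    corrections = {}
    for row in csv.reader(lines):
        if len(row) == 2:  # skip blank or malformed rows
            wrong, right = (cell.strip().upper() for cell in row)
            corrections[wrong] = right
    return corrections

def fix_record(record, corrections):
    """Replace known misspellings in one comma-separated skills string."""
    skills = [s.strip().upper() for s in record.split(",") if s.strip()]
    return ", ".join(corrections.get(s, s) for s in skills)

# Stand-in for the manually created corrections file.
corrections_file = io.StringIO("SALS,SALES\nHADUP,HADOOP\nPREFORMANCE,PERFORMANCE\n")
corrections = load_corrections(corrections_file)

fixed = fix_record("JAVA, HADUP, SALS", corrections)
print(fixed)  # JAVA, HADOOP, SALES
```

Because the lookup is on whole entries, a multi-word skill such as `PREFORMANCE TUNING` would need either its own row in the file or a word-by-word variant of `fix_record`.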

The big time saver is finding the likely candidates for replacement. After that, fixing them is almost an afterthought.

That is, unless you really want to spend months on a research project. Then you can knock yourself out playing with edit distance algorithms, phonetic algorithms, and other stuff that might figure out that "edicit" and "etiquette" are supposed to be the same word.
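If you do want a taste of the automated route, Python's standard library has `difflib.get_close_matches`, which ranks candidates by a similarity ratio (not a true edit distance). The sketch below uses it only to suggest corrections for a human to review; the `suggest_corrections` name, the word lists, and the 0.7 cutoff are assumptions chosen for illustration.

```python
import difflib

def suggest_corrections(suspects, known_good, cutoff=0.7):
    """Map each suspect to its closest frequent skill, or None if nothing is close."""
    suggestions = {}
    for word in suspects:
        matches = difflib.get_close_matches(word, known_good, n=1, cutoff=cutoff)
        suggestions[word] = matches[0] if matches else None
    return suggestions

# Frequent skills are treated as correctly spelled (an assumption).
known_good = ["MANAGEMENT", "JAVA", "HADOOP", "SALES", "ETIQUETTE"]
suspects = ["SALS", "HADUP", "EDICIT"]

suggestions = suggest_corrections(suspects, known_good)
for word, guess in suggestions.items():
    print(f"{word} -> {guess}")
```

Simple typos like SALS and HADUP get matched, while EDICIT finds no match for ETIQUETTE at this cutoff. That gap is the kind of case where phonetic approaches and a lot more effort come in.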
