How to solve the dilemma of storing human names in MySQL while keeping both discriminability and a search for similar names?


Question




I was given the beautiful task ;-) of designing some tables in a MySQL database that should hold human names.

Criteria:

  1. I only have the full names. (There is no separation into, e.g., first name, surname and so on.)
  2. The storage should be diacritic sensitive. (The following names stand for different persons.)

    • "Voss" and "Voß".
    • "Joel" and "Joël".
    • "franc" and "Franc" and "Fránc".
  3. A search should return all names similar to the search string. E.g., a search for "franc" should return ["franc", "Franc", "Fránc"] and so on. (It would be awesome if the search returned not only the diacritic-insensitive matches but also similar-sounding names, or names that partially match the search string.)

I thought of using the collation utf8_bin for the column (declared as unique) in which I will store the names. This would satisfy point 2, but it hurts point 3. Declaring the name column as unique with the collation utf8_unicode_ci satisfies point 3, but it hurts point 2.

So my question is: is there a way to solve this task while respecting all the criteria? And since I don't want to reinvent the wheel: is there an elegant way to handle human names (and searches for them) in databases? (Sadly, I have no way of splitting the names into first names, surnames and optional middle names.)

Edit:

The number of names is around a million (~1,000,000) entries. And if it matters: I am using Python as the scripting language to populate the database and query the data later on.

Solution

What helps is to decompose the full name into component "name words" and store a phonetic encoding (metaphone or one of the many other choices) for each of them. You only need the notion of name words, without categorizing each one as first, middle or last, which is fine because those categories don't work well across cultures anyway. You can still use positional order information later in ranking if you want, so that searching for "Paul Carl" matches "Paul Karl" better than it matches "Carl Paul". Be aware of ambiguous punctuation, which may require storing multiple versions of some name words. For instance, Bre-Anna Heim would be broken into the name words "bre", "anna", "breanna" and "heim". Sometimes the dash is irrelevant, as in Bre-Anna, but sometimes it is not, as in Sally-June. Bre-Anna never uses just Bre or Anna, but Sally-June may sometimes go by just Sally or just June. It's hard to know which, so cover both possibilities.
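The decomposition described above can be sketched in Python roughly as follows. This is only an illustration under some assumptions: the function names (`fold`, `name_words`, `soundex`, `encode`) are made up for this example, only the hyphen is treated as ambiguous punctuation, and classic Soundex stands in for metaphone because it fits in a few lines of standard-library code.

```python
import unicodedata


def fold(text):
    """Case-fold and strip diacritics, e.g. 'Fránc' and 'Franc' both become 'franc'."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def name_words(full_name):
    """Split a full name into folded name words.

    A hyphenated token contributes its parts and also the joined form,
    so both readings of the hyphen are covered.
    """
    words = []
    for token in fold(full_name).split():
        parts = [p for p in token.split("-") if p]
        variants = parts + (["".join(parts)] if len(parts) > 1 else [])
        for word in variants:
            if word not in words:
                words.append(word)
    return words


# Soundex digit for each consonant; vowels, h, w and y carry no digit.
_SOUNDEX = {
    ch: str(digit)
    for digit, group in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1)
    for ch in group
}


def soundex(word):
    """Classic four-character Soundex code (a stand-in for metaphone)."""
    letters = [ch for ch in fold(word) if "a" <= ch <= "z"]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = _SOUNDEX.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate repeated digits
        digit = _SOUNDEX.get(ch, "")
        if digit and digit != previous:
            code += digit
        previous = digit  # a vowel resets this, so a repeat after it counts again
    return (code + "000")[:4]


def encode(full_name):
    """Pairs of (name word, phonetic code), in order of appearance."""
    return [(word, soundex(word)) for word in name_words(full_name)]


print(name_words("Bre-Anna Heim"))
print(encode("Maria Angeles Rodrigus"))
```

One caveat with this stand-in: Soundex keeps the first letter, so "Carl" and "Karl" get different codes. An encoding that maps a hard C and a K to the same symbol, as metaphone is designed to, would be needed for the "Paul Carl" / "Paul Karl" example to match.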

You can write your query against this by similarly decomposing and phonetically encoding the full name you're searching for. Your query can return, say, those full names that have two or more phonetic matches among their component name words (or one, if there is only one name word in the search or the source). This gives you a subset of full names to consider further. You could come up with a simple ranking of them, or even run something like a distance matching algorithm on this subset, which would be too expensive computationally to run against the entire million names. When I say distance matching, I mean edit-distance algorithms such as Levenshtein distance.
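As a rough sketch of the distance-matching step, the following Python computes Levenshtein distance with the usual dynamic-programming recurrence and uses it to order a small candidate list. The helper names and the lowercase-only normalization are assumptions made for this example; in practice the candidates would be the subset returned by the phonetic lookup.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def rank(query, candidates):
    """Order candidate full names by edit distance to the query (closest first)."""
    q = query.lower()
    return sorted(candidates, key=lambda name: (levenshtein(q, name.lower()), name))


# Hypothetical candidate subset, as if already narrowed by phonetic matching.
print(rank("Paul Carl", ["Carl Paul", "Paul Karl"]))
```

Each comparison costs time proportional to the product of the two string lengths, which is why it is reserved for the narrowed subset rather than all million rows.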

(edit) The reasoning for this is handling cases like the following name: Maria de los Angeles Gomez-Rodriguez. One data entry person may just enter Maria Gomez. Another might enter Maria Gomez Rodriguez. Yet another might enter Maria Angeles Rodrigus.
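To trace this example through a word-level lookup, here is a small sketch that uses Python's built-in `sqlite3` module as a stand-in for MySQL. Everything in it is an assumption for illustration: the table and column names, the second person ("Mario Gonzales"), and the use of hand-computed Soundex codes in place of whichever phonetic encoding is actually chosen. The case where the stored name itself has only one name word is left out for brevity.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- full_name is stored exactly as entered; in MySQL this column would
    -- carry a binary collation so that 'Voss' and 'Voß' stay distinct.
    CREATE TABLE person (id INTEGER PRIMARY KEY, full_name TEXT UNIQUE);
    -- One row per name word; position is kept for later ranking.
    CREATE TABLE name_word (person_id INTEGER, position INTEGER, code TEXT);
    CREATE INDEX name_word_code ON name_word (code);
""")

# Hand-computed Soundex codes for each name word (hypothetical sample data).
people = {
    1: ("Maria de los Angeles Gomez-Rodriguez",
        # maria, de, los, angeles, gomez, rodriguez, gomezrodriguez
        ["M600", "D000", "L200", "A524", "G520", "R362", "G526"]),
    2: ("Mario Gonzales", ["M600", "G524"]),
}
for person_id, (full_name, codes) in people.items():
    db.execute("INSERT INTO person VALUES (?, ?)", (person_id, full_name))
    db.executemany(
        "INSERT INTO name_word VALUES (?, ?, ?)",
        [(person_id, position, code) for position, code in enumerate(codes)],
    )


def candidates(query_codes):
    """Full names sharing at least two distinct codes with the query
    (or one, when the query has a single name word)."""
    needed = 1 if len(query_codes) == 1 else 2
    marks = ",".join("?" * len(query_codes))
    rows = db.execute(
        f"""SELECT p.full_name
            FROM person p JOIN name_word w ON w.person_id = p.id
            WHERE w.code IN ({marks})
            GROUP BY p.id
            HAVING COUNT(DISTINCT w.code) >= ?
            ORDER BY p.id""",
        [*query_codes, needed],
    ).fetchall()
    return [row[0] for row in rows]


print(candidates(["M600", "G520"]))          # codes for "Maria Gomez"
print(candidates(["M600", "A524", "R362"]))  # codes for "Maria Angeles Rodrigus"
```

Both data-entry variants share at least two codes with the full stored name, while "Mario Gonzales" shares only one with either query and is filtered out.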
