Searching Arabic names while ignoring the differences between "أ" and "ا" in MySQL


Problem description

I am storing Arabic names in my database. In Arabic, some letters can be written in several forms, such as "ا", "أ", and "آ", which all represent the same letter. The same applies to "ه" and "ة".

I need to search the database for names while ignoring the differences between "ا", "أ", and "آ", as well as the differences between "ه" and "ة".

So, for example, when a user enters "اسامة" in the search box, it should return "أسامة", "اسامة", "أسامه", "اسامه", etc. As another example, a search for "فايز" or "فائز" should return both.

How can I do that with a MySQL query? How can I search for similar names rather than only identical ones?

I tried the LIKE keyword, but it does not work:

select * from employee WHERE fname like "%أسامة%" and mname="علي" and lname="الجاسم"

Recommended answer

Update: I rewrote my answer in case someone needs a solution and stumbles upon this question.

I know of 3 possible solutions to this problem:

  1. Create a custom collation
  2. Add a normalized field
  3. Use a regular expression in the query

I wrote a tutorial showing how to apply these solutions to MySQL. I will try to summarize the steps here.

1. Create a custom collation

You can create a custom collation that makes MySQL treat these characters as one. A custom collation is added to a file called Index.xml, located in the character sets directory. You can find the location of that directory with the following statement:

SHOW VARIABLES LIKE 'character_sets_dir';

Navigate to the directory, back up the file, open it, scroll to the element <charset name="utf8">, and add the following XML:

<charset name="utf8">
.
.
.
  <collation name="utf8_arabic_ci" id="1029">
   <rules>
     <reset>\u0627</reset> <!-- Alef 'ا' -->
     <i>\u0623</i>        <!-- Alef With Hamza Above 'أ' -->
     <i>\u0625</i>        <!-- Alef With Hamza Below 'إ' -->
     <i>\u0622</i>        <!-- Alef With Madda Above 'آ' -->
   </rules>
   <rules>
     <reset>\u0629</reset> <!-- Teh Marbuta 'ة' -->
     <i>\u0647</i>        <!-- Heh 'ه' -->
   </rules>
   <rules>
     <reset>\u0000</reset> <!-- Ignore Tashkil -->
     <i>\u064E</i>        <!-- Fatha 'َ' -->
     <i>\u064F</i>        <!-- Damma 'ُ' -->
     <i>\u0650</i>        <!-- Kasra 'ِ' -->
     <i>\u0651</i>        <!-- Shadda 'ّ' -->
     <i>\u0652</i>        <!-- Sukun 'ْ' -->
     <i>\u064B</i>        <!-- Fathatan 'ً' -->
     <i>\u064C</i>        <!-- Dammatan 'ٌ' -->
     <i>\u064D</i>        <!-- Kasratan 'ٍ' -->
   </rules>
 </collation>
</charset>

This XML says that the collation belongs to the utf8 character set. I chose to name it utf8_arabic_ci and gave it the ID 1029; custom collation IDs are in the range 1024-2047. The collation rules tell MySQL to treat all forms of Alef as the same character, to treat Teh Marbuta and Heh as the same character, and to ignore tashkil completely. You can add more rules if you like. Refer to the MySQL documentation for more information about custom collations.
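If you extend the rules by hand, a small helper can keep the code points and their comments consistent. The following is a minimal Python sketch and not part of MySQL: `ldml_rules` and its arguments are hypothetical names, and it only prints a `<rules>` fragment in the same shape as the XML above, which you would still paste into Index.xml yourself.

```python
import unicodedata


def ldml_rules(base, variants):
    """Build a <rules> block that makes every variant compare equal to `base`.

    `base` and each entry of `variants` are Unicode code points (integers).
    """
    def note(cp):
        # Official Unicode character name, used only as a readable comment.
        return f"<!-- {unicodedata.name(chr(cp), 'UNKNOWN')} -->"

    lines = ["<rules>", f"  <reset>\\u{base:04X}</reset> {note(base)}"]
    lines += [f"  <i>\\u{cp:04X}</i> {note(cp)}" for cp in variants]
    lines.append("</rules>")
    return "\n".join(lines)


# Alef (U+0627) with its hamza and madda forms, as in the collation above.
print(ldml_rules(0x0627, [0x0623, 0x0625, 0x0622]))
```

Generating the fragment from code points avoids copy-paste slips such as listing the same code point twice for two different marks.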

Now restart MySQL and change the collation of the column to the new collation with a query like:

ALTER TABLE persons MODIFY name VARCHAR(50) 
CHARACTER SET 'utf8' COLLATE 'utf8_arabic_ci';

You should now be able to search for 'اسامة' and get 'اسامة', 'أسامه', 'أسَامَة', etc.

2. Add a normalized field

This solution requires adding a new field to the table that holds a 'normalized' copy of the name. This is an example of a normalized Arabic first-name field:

id normalized_name name
1  احمد            احمد
2  أحمد            احمد
3  أسامه          اسامة
4  أسامة          اسامة
5  اسامه          اسامة
6  اسَامه          اسامة

This normalized field can be created by adding a new column to the table and filling it with the result of a 'normalizing' function, which replaces the different variations of a character with a single form and also removes tashkil. To make the search work, normalize the search term the same way, query the normalized field, and display the original field. Something like the following:

SELECT name FROM persons WHERE normalized_name = "اسامة";
 
+--------------+
| name         |
+--------------+
| أسامه        |
| أسامة        |
| اسامه        |
| اسَامه        |
+--------------+
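The answer does not show the 'normalizing' function itself, so here is a minimal Python sketch of one, assuming the same equivalences as the collation rules in solution 1. The name `normalize_name` and the mapping tables are hypothetical; the function would be applied both when filling normalized_name and to the user's search term before querying.

```python
import re

# Assumed folding table, mirroring the collation rules from solution 1.
_FOLD = str.maketrans({
    "\u0623": "\u0627",  # Alef with Hamza above -> bare Alef
    "\u0625": "\u0627",  # Alef with Hamza below -> bare Alef
    "\u0622": "\u0627",  # Alef with Madda above -> bare Alef
    "\u0647": "\u0629",  # Heh -> Teh Marbuta (folded everywhere, like the collation)
})

# Tashkil marks U+064B (Fathatan) through U+0652 (Sukun).
_TASHKIL = re.compile("[\u064B-\u0652]")


def normalize_name(name):
    """Remove tashkil, then fold Alef and Heh variants to a single form."""
    return _TASHKIL.sub("", name).translate(_FOLD)


# The four stored variants from the result table should collapse to one value.
variants = ["أسامه", "أسامة", "اسامه", "اسَامه"]
print({normalize_name(v) for v in variants})
```

Because the stored value and the search term go through the same function, the exact form chosen for each letter does not matter as long as it is applied consistently, and the normalized column can carry an ordinary index.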

3. Use a regular expression in the query

I don't recommend this solution: you lose the advantage of indexes, which costs performance, and generating the regex patterns is hard. But you might find it useful for testing or for special queries.

You can use REGEXP or its synonym RLIKE in MySQL queries. For example, to find the name 'أحمد' with any of the variations of Alef, you would use a pattern like:

SELECT name FROM clients WHERE name REGEXP '(ا|أ|إ|آ)حمد';

This should return the required result. All you have to do is write a function that generates this pattern from the search string. Here is an example function, but keep in mind that it is just an example and won't work in all cases:

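// Add all your patterns and replacements to these arrays.
// The /u modifier makes PCRE treat the pattern and the subject as UTF-8.
$patterns     = array( "/(ا|أ|إ|آ)/u", "/(ه|ة)/u" );
// Alternation groups are used instead of [..|..] classes, where "|" would be a literal character.
$replacements = array( "(ا|أ|إ|آ)",   "(ة|ه)" );
$search_string = "اسامة"; // example input
$query_string = preg_replace($patterns, $replacements, $search_string);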

This should work for Alef, Teh Marbuta, and Heh, but it won't handle tashkil.
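To cover tashkil as well, the generated pattern can allow optional marks after every letter. The following is a minimal Python sketch under stated assumptions: `build_pattern` and the constants are hypothetical names, Python 3.7 or later is assumed so that `re.escape` leaves Arabic letters unchanged, and the pattern is only checked here against Python's `re` module. It uses plain alternation rather than character classes because older MySQL versions match REGEXP byte by byte; how a given MySQL version treats the result still needs to be tested.

```python
import re

ALEF_FORMS = "\u0627\u0623\u0625\u0622"  # bare Alef, Hamza above, Hamza below, Madda
HEH_FORMS = "\u0629\u0647"               # Teh Marbuta, Heh
TASHKIL = [chr(cp) for cp in range(0x064B, 0x0653)]  # U+064B .. U+0652


def _alt(chars):
    """Return an alternation group such as (a|b|c) for the given characters."""
    return "(" + "|".join(chars) + ")"


def build_pattern(search):
    """Build a regex that tolerates Alef/Heh variants and optional tashkil."""
    optional_marks = _alt(TASHKIL) + "*"
    parts = []
    for ch in search:
        if ch in TASHKIL:
            continue  # marks typed by the user are dropped from the pattern
        if ch in ALEF_FORMS:
            parts.append(_alt(ALEF_FORMS))
        elif ch in HEH_FORMS:
            parts.append(_alt(HEH_FORMS))
        else:
            parts.append(re.escape(ch))
        parts.append(optional_marks)  # any number of marks may follow a letter
    return "".join(parts)


pattern = build_pattern("اسامة")
print(bool(re.fullmatch(pattern, "أسَامَه")))
```

The pattern grows quickly with the length of the name, which is one more reason to keep this approach for testing and one-off queries.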

Conclusion

I think adding a custom collation is the best solution for most cases, but you might not be able to edit the charset files (for example, if you are on shared hosting). In that case, adding a normalized field is the solution. You might also find regex patterns useful in some cases.
