如何在MediaWiki数据库中进行重音和不区分大小写的搜索? [英] How to do an accent and case-insensitive search in MediaWiki database?

查看:94
本文介绍了如何在MediaWiki数据库中进行重音和不区分大小写的搜索?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

假设我在Wiki(MediaWiki 1.19.4)中拥有这些页面标题:

Let's pretend that I have these page titles in my wiki (MediaWiki 1.19.4):

SOMETHIng
Sómethìng
SomêthÏng
SÒmetHínG

如果用户搜索something,我希望所有4页都作为结果返回.

If a user searches something I want that all 4 pages are returned as the result.

目前,我唯一想到的就是该查询(MySQL Percona 5.5.30-30.2):

At the moment the only thing I could think of is this query (MySQL Percona 5.5.30-30.2):

SELECT page_title
FROM page
WHERE page_title LIKE '%something%' COLLATE utf8_general_ci

仅返回SOMETHIng.

我必须走在正确的道路上,因为如果我搜索sóméthíngSÓMÉTHÍNG,则得到SOMETHIng作为结果.我该如何修改查询,以便获得预期的其他结果?由于page表仅包含〜2K行,因此此处的性能并不重要.

I must be on the right path, because if I search sóméthíng OR SÓMÉTHÍNG, I get SOMETHIng as the result. How could I modify the query so I get the other results as expected? Performance is not critical here since the page table contains only ~2K rows.

这是具有相关位的表定义:

This is the table definition with the relevant bits:

CREATE TABLE page (
    (...)
    page_title VARCHAR(255) NOT NULL DEFAULT '' COLLATE latin1_bin,
    (...)
    UNIQUE INDEX name_title (page_namespace, page_title),
)

不得修改表定义 ,因为这是MediaWiki和AFAIK的库存安装,其代码希望该字段以这种方式定义(即,以二进制数据形式存储的unicode).

The table definition must not be modified, since this is a stock installation of MediaWiki and AFAIK its code expects this field being defined that way (i.e. unicode stored as binary data).

推荐答案

我找到了完美的解决方案,无需修改或创建表.它可能会影响性能(我没有测试),但是正如我在问题中指出的那样,它是一个约2K的行表,因此没什么大不了的.

I found the perfect solution, no modyfing or creating tables. It might have performance implications (I didn't test), but as I stated in my question, it's a ~2K rows table so it shouldn't matter much.

问题的根源是 MediaWiki将以UTF8编码的文本存储在latin1编码的表中.对MediaWiki而言,这无关紧要,因为它知道了它,并且始终会使用正确的字符集查询数据库并执行其操作,本质上使用MySQL作为愚蠢的容器.这样做是因为MySQL中对UTF8的支持显然不足以满足其需求(请参见MediaWiki的DefaultSettings.php变量$wgDBmysql5中的注释).

The root of the problem is that MediaWiki stores UTF8-encoded text in latin1-encoded tables. It doesn't matter much to MediaWiki since it's aware of it and it'll always query the database with the correct charset and do its thing, essentially using MySQL as a dumb bit container. It does this because apparently UTF8 support in MySQL is not adequate for its needs (see comments in MediaWiki's DefaultSettings.php, variable $wgDBmysql5).

当您希望数据库本身能够执行支持UTF8的操作时,就会出现问题(就像我想在问题中做的那样).您将无法执行此操作,因为据MySQL所知,它没有存储UTF8编码的文本(尽管如前段所述).

The problem appears when you want the database itself to be able to perform UTF8-aware operations (like I wanted to do in my question). You won't be able to do that because as far as MySQL knows, it's not storing UTF8-encoded text (although it is, as explained in the previous paragraph).

有一个明显的解决方案:将要使用的列强制转换为UTF8,例如CONVERT(col_name USING utf8).这里的问题是MySQL试图提供危险的帮助:它认为col_name正在存储latin1编码的文本,并且它将每个字节转换(而不是编码)成其UTF8等效项,您将以双重编码的UTF8结尾,这显然是错误的.

There's an obvious solution for this: cast to UTF8 the column you want to use, something like this CONVERT(col_name USING utf8). The problem here is that MySQL is trying to be dangerously helpful: it thinks that col_name is storing latin1-encoded text and it will translate (not encode) each byte into its UTF8 equivalent, and you will end with double-encoded UTF8, which is obviously wrong.

如何避免MySQL如此出色和有用?只需在转换为UTF8之前先广播到BINARY 这样,MySQL不会承担任何责任,并且将完全按照要求进行:将这些比特编码为UTF8.确切的语法是CONVERT(CAST(col_name AS BINARY) USING utf8).

How to avoid MySQL being so nice and helpful? Just cast to BINARY before doing the conversion to UTF8! That way MySQL won't assume anything and will do exactly as asked: encoding this bunch of bits into UTF8. The exact syntax is CONVERT(CAST(col_name AS BINARY) USING utf8).

这是我现在的最后一个查询:

So this is my final query now:

SELECT CONVERT(CAST(page_title AS BINARY) USING utf8)
FROM page
WHERE
    CONVERT(CAST(page_title AS BINARY) USING utf8)
        LIKE '%keyword_here%'
            COLLATE utf8_spanish_ci

现在,如果我搜索somethingsôMëthîNG或任何变体形式,我将获得所有结果!

Now if I search something or sôMëthîNG or any variation, I get all the results!

请注意,我使用utf8_spanish_ci是因为我希望搜索将ñn区别开来,而不是áa区别开来.根据您的用例使用其他排序规则(此处是完整的列表).

Please note that I used utf8_spanish_ci because I want the search to differentiate ñ from n but not á from a. Use a different collation according to your use case (here is a complete list).

相关链接:

  • MySQL: Converting an incorrect latin1 column to utf8
  • How to MySQL work "case insensitive" and "accent insensitive" in UTF-8
  • Collations available in MySQL 5.5

这篇关于如何在MediaWiki数据库中进行重音和不区分大小写的搜索?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆