Unicode support for SQLite Full Text Search in Android

Question

Scroll to the end to skip the explanation.

In my Android app, I want to use non-English Unicode text strings to search for matches in text documents/fields that are stored in a SQLite database. I've learned (so I thought) that what I need to do is implement a Full Text Search with fts3/fts4, so that is what I have been working on learning for the past couple days. FTS is supported by Android, as is shown in the documentation Storing and Searching for Data and in the blog post Android Quick Tip: Using SQLite FTS Tables.

Everything was looking good, but then I read the March 2012 blog post The sorry state of SQLite full text search on Android, which said

The first step when building a full text search index is to break down the textual content into words, aka tokens. Those tokens are then entered into a special index which lets SQLite perform very fast searches based on a token (or a set of tokens).

SQLite has two built-in tokenizers, and they both only consider tokens consisting of US ASCII characters. All other, non-US ASCII characters are considered whitespace.

After that I also found this StackOverflow answer by @CL. (who, based on tags and reputation, appears to be an expert on SQLite) replying to a question about matching Vietnamese letters with different diacritics:

You must create the FTS table with a tokenizer that can handle Unicode characters, i.e., ICU or UNICODE61.

Please note that these tokenizers might not be available on all Android versions, and that the Android API does not expose any functions for adding user-defined tokenizers.

This 2011 SO answer seems to confirm that Android does not support tokenizers beyond the two basic simple and porter ones.

This is 2015. Are there any updates to this situation? I need to have the full text search supported for everyone using my app, not just people with new phones (even if the newest Android version does support it now).

I find it hard to believe that FTS does not work at all with Unicode. The documentation for the simple tokenizer says

A term is a contiguous sequence of eligible characters, where eligible characters are all alphanumeric characters and all characters with Unicode codepoint values greater than or equal to 128. All other characters are discarded when splitting a document into terms. Their only contribution is to separate adjacent terms. (emphasis added)
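To make the quoted rule concrete, here is a rough plain-Java emulation of how the simple tokenizer splits text into terms. It is only a sketch: `SimpleTokenizerSketch` and its methods are hypothetical names, not part of SQLite or Android. The ASCII-only lowercasing comes from the same SQLite documentation rather than from the sentence quoted above, and SQLite itself works on UTF-8 bytes, which this code approximates with code points.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not an SQLite or Android API.
final class SimpleTokenizerSketch {

    // Eligible characters: ASCII letters and digits, plus every code point >= 128.
    static boolean isEligible(int cp) {
        return cp >= 128
                || (cp >= '0' && cp <= '9')
                || (cp >= 'A' && cp <= 'Z')
                || (cp >= 'a' && cp <= 'z');
    }

    // Splits text into terms; every ineligible character only separates terms.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isEligible(cp)) {
                // Only ASCII upper case is folded; non-ASCII letters are kept as-is.
                if (cp >= 'A' && cp <= 'Z') {
                    cp += 'a' - 'A';
                }
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                terms.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            terms.add(current.toString());
        }
        return terms;
    }

    public static void main(String[] args) {
        // Input contains E-acute (U+00C9), e-acute (U+00E9) and a music note (U+266B).
        System.out.println(tokenize("Hello, \u00C9cole-caf\u00E9 \u266B"));
    }
}
```

Under these assumptions, the ASCII word is lowercased, the accented words keep their case and diacritics, and the music note becomes a term of its own because it counts as an eligible character.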

That gives me hope that some basic Unicode functionality could still be supported in Android, even if things like capitalization and diacritics (and various other equivalent letter forms that have different Unicode code points) are not supported.

Can I use SQLite FTS in Android with non-English Unicode text (codepoints > 128) if I am only using literal Unicode string tokens separated by spaces? (That is, I am searching for exact strings that occur in the text.)


  • The unicode61 tokenizer is available in SQLite version 3.7.13 and later. This tokenizer supports "full unicode case folding" and "recognizes unicode space and punctuation characters." Android Lollipop (API 21+) uses SQLite 3.8.

Recommended answer

Unicode characters are handled like 'normal' letters, so you can use them in FTS data and search terms. (Prefix searches should work, too.)

The problem is that Unicode characters are not normalized, i.e., all characters are treated as letters (even if they actually are punctuation (―†), or other non-letter characters (☺♫)), and that upper/lowercase are not merged, and that diacritics are not removed.
If you want to handle those cases correctly, you have to do these normalizations manually before you insert the documents into the database, and before you use the search terms.
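One possible way to do that manual normalization is sketched below in plain Java. `FtsNormalizerSketch` and `normalize` are hypothetical names, and the policy is an assumption rather than something FTS prescribes: decompose to NFD, drop non-spacing combining marks, lowercase, and turn every non-letter, non-digit character into a space. The same function would be applied both to the text stored in the FTS table and to each search term.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration only; the normalization policy is an assumption.
final class FtsNormalizerSketch {

    static String normalize(String input) {
        // NFD splits accented letters into a base letter followed by combining marks.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); ) {
            int cp = decomposed.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.getType(cp) == Character.NON_SPACING_MARK) {
                continue; // remove the diacritic, keep the base letter
            }
            if (Character.isLetterOrDigit(cp)) {
                // Locale-independent case mapping to merge upper and lower case.
                out.appendCodePoint(Character.toLowerCase(cp));
            } else {
                // Punctuation and symbols become separators instead of "letters".
                out.append(' ');
            }
        }
        // Collapse the runs of spaces produced above.
        return out.toString().trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        // Input: E-acute, a horizontal bar (U+2015), e-acute, a music note and "!".
        System.out.println(normalize("\u00C9cole \u2015 Caf\u00E9\u266B!"));
    }
}
```

Two caveats on this sketch: letters that have no canonical decomposition, such as Đ (U+0110), are only lowercased, and dropping all combining marks is wrong for scripts where those marks carry meaning, so the rule would need adjusting for such languages. Because it also strips FTS query syntax such as `*` and quotes, it should be run on individual search terms before the MATCH expression is assembled.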
