Unicode support for non-English characters with SQLite Full Text Search in Android


Question




Scroll to the end to skip the explanation.

Background

In my Android app, I want to use non-English Unicode text strings to search for matches in text documents/fields that are stored in a SQLite database. I learned (or so I thought) that what I need to do is implement a Full Text Search with fts3/fts4, so that is what I have been working on learning for the past couple of days. FTS is supported by Android, as shown in the documentation Storing and Searching for Data and in the blog post Android Quick Tip: Using SQLite FTS Tables.

Problem

Everything was looking good, but then I read the March 2012 blog post The sorry state of SQLite full text search on Android, which said

The first step when building a full text search index is to break down the textual content into words, aka tokens. Those tokens are then entered into a special index which lets SQLite perform very fast searches based on a token (or a set of tokens).

SQLite has two built-in tokenizers, and they both only consider tokens consisting of US ASCII characters. All other, non-US ASCII characters are considered whitespace.

After that I also found this StackOverflow answer (http://stackoverflow.com/a/17399384/3681880) by @CL. (who, based on tags and reputation, appears to be an expert on SQLite) replying to a question about matching Vietnamese letters with different diacritics:

You must create the FTS table with a tokenizer that can handle Unicode characters, i.e., ICU or UNICODE61.

Please note that these tokenizers might not be available on all Android versions, and that the Android API does not expose any functions for adding user-defined tokenizers.

This 2011 SO answer seems to confirm that Android does not support tokenizers beyond the two basic simple and porter ones.

This is 2015. Are there any updates to this situation? I need to have the full text search supported for everyone using my app, not just people with new phones (even if the newest Android version does support it now).

Potential partial solution?

I find it hard to believe that FTS does not work at all with Unicode. The documentation for the simple tokenizer says

A term is a contiguous sequence of eligible characters, where eligible characters are all alphanumeric characters and all characters with Unicode codepoint values greater than or equal to 128. All other characters are discarded when splitting a document into terms. Their only contribution is to separate adjacent terms. (emphasis added)

That gives me hope that some basic Unicode functionality could still be supported in Android, even if things like capitalization and diacritics (and various other equivalent letter forms that have different Unicode code points) are not supported.
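To make the quoted rule concrete, here is a minimal sketch in plain Java that imitates it. It is not SQLite's implementation: the class and method names are hypothetical, and the ASCII-only case folding is an assumption taken from the same SQLite FTS3 documentation, which says that upper case characters within the ASCII range are converted to lower case during tokenization.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that imitates the documented rule of the "simple"
// tokenizer. It is an illustration only, not SQLite's own code.
class SimpleTokenizerSketch {

    // Eligible: ASCII letters and digits, plus every code point >= 128.
    static boolean isEligible(int cp) {
        return cp >= 128
                || (cp >= '0' && cp <= '9')
                || (cp >= 'A' && cp <= 'Z')
                || (cp >= 'a' && cp <= 'z');
    }

    // Splits text into terms; ineligible characters only separate terms.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isEligible(cp)) {
                // Assumption: only ASCII upper case is folded to lower case;
                // non-ASCII characters are kept exactly as they are.
                if (cp >= 'A' && cp <= 'Z') {
                    cp += 'a' - 'A';
                }
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                terms.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            terms.add(current.toString());
        }
        return terms;
    }
}
```

Under this rule, a string such as "Hello, ☺world" would yield the terms "hello" and "☺world": the comma and the space separate terms, while the non-ASCII symbol is kept as if it were a letter.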

My Main Question

Can I use SQLite FTS in Android with non-English Unicode text (code points ≥ 128) if I am only using literal Unicode string tokens separated by spaces? (That is, I am searching for exact strings that occur in the text.)

Updates

Solution

Unicode characters are handled like 'normal' letters, so you can use them in FTS data and search terms. (Prefix searches should work, too.)

The problem is that Unicode characters are not normalized: every non-ASCII character is treated as a letter, even if it is actually punctuation (―†) or another non-letter character (☺♫); upper- and lowercase forms are not merged; and diacritics are not removed.

If you want to handle those cases correctly, you have to do these normalizations manually, both before you insert the documents into the database and before you use the search terms.
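One possible way to do this manual normalization is sketched below using java.text.Normalizer from the Java standard library. The class name and the exact rules are assumptions made for illustration, not something the answer prescribes: the sketch decomposes the text (NFD), drops characters from the Combining Diacritical Marks block (U+0300 to U+036F), lowercases letters, and turns punctuation and symbols into spaces. Letters that have no canonical decomposition (for example đ) pass through unchanged, so a real application may need extra mappings for its language.

```java
import java.text.Normalizer;

// Hypothetical helper: apply the SAME function to document text before
// inserting it into the FTS table and to search terms before querying.
class FtsTextNormalizer {

    static String normalize(String text) {
        // NFD splits accented letters into base letter + combining marks.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); ) {
            int cp = decomposed.codePointAt(i);
            i += Character.charCount(cp);

            // Drop diacritics such as acute, circumflex, or dot below.
            if (Character.UnicodeBlock.of(cp)
                    == Character.UnicodeBlock.COMBINING_DIACRITICAL_MARKS) {
                continue;
            }

            int type = Character.getType(cp);
            boolean isMark = type == Character.NON_SPACING_MARK
                    || type == Character.COMBINING_SPACING_MARK
                    || type == Character.ENCLOSING_MARK;

            if (Character.isLetterOrDigit(cp) || isMark) {
                // Keep letters, digits, and marks that other scripts rely on;
                // merge upper and lower case.
                out.appendCodePoint(Character.toLowerCase(cp));
            } else {
                // Punctuation and symbols become separators.
                out.append(' ');
            }
        }
        // Collapse runs of spaces so the stored text stays tidy.
        return out.toString().trim().replaceAll(" +", " ");
    }
}
```

With this sketch, both "Việt Nam" and "VIET NAM" normalize to "viet nam", so a search for either form would match a document stored in either form. If the original text has to be shown to the user, it would need to be kept in a separate column, since the normalized text loses information.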
