Chinese language in Cassandra

Question

I stored Chinese characters in Cassandra, and the data appears to have been entered correctly, as shown below:

SELECT * FROM user;

 user_id | user_name    | user_phone
---------+--------------+-------------
      23 |      uSer23, | 12345678910
       5 |       uSer5^ | 12345678910
      28 |     uSer28名 | 12345678910
      10 |      uSer10- | 12345678910
      16 |      uSer16{ | 12345678910
      13 |      uSer13= | 12345678910
      30 |   uSer30一些 | 12345678910
      11 |      uSer11_ | 12345678910
       1 |       uSer1@ | 12345678910
      19 |      uSer19" | 12345678910
       8 |       uSer8( | 12345678910
       0 |       uSer0! | 12345678910
       2 |       uSer2# | 12345678910
       4 |       uSer4% | 12345678910
      18 |      uSer18[ | 12345678910
      15 |      uSer15} | 12345678910
      22 |      uSer22< | 12345678910
      27 |      uSer27/ | 12345678910
      20 |      uSer20: | 12345678910
       7 |       uSer7* | 12345678910
       6 |       uSer6& | 12345678910
      29 |     uSer29称 | 12345678910
       9 |       uSer9) | 12345678910
      14 |      uSer14| | 12345678910
      26 |      uSer26? | 12345678910
      21 |      uSer21; | 12345678910
      17 |      uSer17] | 12345678910
      31 | uSer31区中文 | 12345678910
      24 |      uSer24> | 12345678910
      25 |      uSer25. | 12345678910
      12 |      uSer12+ | 12345678910
       3 |       uSer3$ | 12345678910

I created an index on the 'user_name' column as follows:

CREATE CUSTOM INDEX user_nontoken_idx ON QCS.user (user_name) 
  USING 'org.apache.cassandra.index.sasi.SASIIndex' 
  WITH OPTIONS = {'mode': 'CONTAINS', 'analyzer_class': 
    'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
    'case_sensitive': 'false'}; 

When I search using one of those Chinese characters, the search succeeds:

SELECT * FROM user WHERE user_name LIKE '%称%';

How does this actually work? How is Cassandra able to store Chinese?

Recommended answer

By default, text in Cassandra is represented as UTF-8, as was mentioned in the comments.
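As a small illustration of why UTF-8 makes this a non-issue, the Python sketch below encodes one of the names from the table in the question and decodes it back. It only demonstrates the encoding itself and does not talk to Cassandra; the variable names are made up for the example.

```python
# Illustrative only: shows how a mixed ASCII/Chinese string looks as UTF-8 bytes.
name = "uSer29称"

encoded = name.encode("utf-8")     # the byte form a UTF-8 text value takes
decoded = encoded.decode("utf-8")  # decoding the bytes restores the original string

print(len(name))        # number of characters (code points)
print(len(encoded))     # number of bytes: each ASCII char is 1 byte, 称 is 3 bytes
print(decoded == name)  # the round trip is lossless
```

The Chinese character simply occupies more bytes than an ASCII one; nothing else about the value is special.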

For your question, the main work is done by SASI, which takes the data from the text column and applies the analyzer to it. In most cases, the analyzer treats Chinese characters like any other characters. If you plan to index free-text columns, you may want to look at StandardAnalyzer. For user names or similar values, NonTokenizingAnalyzer is probably the better choice.
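To make that concrete, here is a rough mental model in Python of what the query in the question does. It is a conceptual sketch, not how SASI is implemented: the function names normalize and like_contains are hypothetical, and it assumes that 'case_sensitive': 'false' behaves like lowercasing both the stored value and the search term.

```python
# Conceptual model only; this is NOT SASI's actual implementation.

def normalize(value: str, case_sensitive: bool = False) -> str:
    """Keep the whole value as a single term (no tokenizing);
    lowercase it when matching is case-insensitive (an assumption)."""
    return value if case_sensitive else value.lower()


def like_contains(column_value: str, needle: str, case_sensitive: bool = False) -> bool:
    """Approximate LIKE '%needle%' against one column value."""
    return normalize(needle, case_sensitive) in normalize(column_value, case_sensitive)


# A few user_name values taken from the table in the question.
names = ["uSer28名", "uSer29称", "uSer30一些", "uSer31区中文", "uSer1@"]

matches = [n for n in names if like_contains(n, "称")]
print(matches)
```

Lowercasing leaves Chinese characters unchanged, so in this model they take part in the substring match exactly like any other character.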
