Chinese language in Cassandra

Problem description

I stored Chinese characters in Cassandra, and the data appears to be entered correctly, as shown below:

SELECT * FROM user;

 user_id | user_name    | user_phone
---------+--------------+-------------
      23 |      uSer23, | 12345678910
       5 |       uSer5^ | 12345678910
      28 |     uSer28名 | 12345678910
      10 |      uSer10- | 12345678910
      16 |      uSer16{ | 12345678910
      13 |      uSer13= | 12345678910
      30 |   uSer30一些 | 12345678910
      11 |      uSer11_ | 12345678910
       1 |       uSer1@ | 12345678910
      19 |      uSer19" | 12345678910
       8 |       uSer8( | 12345678910
       0 |       uSer0! | 12345678910
       2 |       uSer2# | 12345678910
       4 |       uSer4% | 12345678910
      18 |      uSer18[ | 12345678910
      15 |      uSer15} | 12345678910
      22 |      uSer22< | 12345678910
      27 |      uSer27/ | 12345678910
      20 |      uSer20: | 12345678910
       7 |       uSer7* | 12345678910
       6 |       uSer6& | 12345678910
      29 |     uSer29称 | 12345678910
       9 |       uSer9) | 12345678910
      14 |      uSer14| | 12345678910
      26 |      uSer26? | 12345678910
      21 |      uSer21; | 12345678910
      17 |      uSer17] | 12345678910
      31 | uSer31区中文 | 12345678910
      24 |      uSer24> | 12345678910
      25 |      uSer25. | 12345678910
      12 |      uSer12+ | 12345678910
       3 |       uSer3$ | 12345678910

I created an index on the 'user_name' column as follows:

CREATE CUSTOM INDEX user_nontoken_idx ON QCS.user (user_name) 
  USING 'org.apache.cassandra.index.sasi.SASIIndex' 
  WITH OPTIONS = {'mode': 'CONTAINS', 'analyzer_class': 
    'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
    'case_sensitive': 'false'}; 

When I search using those Chinese characters, the search succeeds:

SELECT * FROM user WHERE user_name LIKE '%称%';

How does this actually work? How is Cassandra able to store Chinese?

Recommended answer

By default, text in Cassandra is represented as UTF-8, as was mentioned in a comment.
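
To make the UTF-8 point concrete, here is a small Python sketch (not Cassandra code) that encodes one of the user names from the question. It only illustrates the encoding itself; the variable names are chosen for this example.

```python
# Illustration only: how a string containing a Chinese character
# looks once it is encoded as UTF-8, the encoding used for text values.
name = "uSer29称"

encoded = name.encode("utf-8")

# The string has 7 characters, but the Chinese character takes
# 3 bytes in UTF-8, so the encoded form is 9 bytes long.
print(len(name), len(encoded))

# The last three bytes are the UTF-8 encoding of '称' (U+79F0).
print(encoded[-3:].hex())

# Decoding the bytes gives back the original string unchanged.
assert encoded.decode("utf-8") == name
```

Because ASCII and Chinese characters are both just valid UTF-8 sequences, a text column can hold them side by side without any special configuration.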

As for your question, the main work is done by SASI, which takes the data from the text column and applies the analyzer to it. In most cases the analyzer treats Chinese characters like any other characters. If you plan to index free-text columns, you may want to look at StandardAnalyzer, but for user names or similar values, NonTokenizingAnalyzer is probably the better choice.
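
As a rough mental model of what a case-insensitive, non-tokenizing CONTAINS search does, the Python sketch below treats each column value as a single string and checks for a substring after lowercasing. This is a conceptual illustration only, not how SASI is implemented internally; the function names `normalize` and `contains_match` are hypothetical.

```python
def normalize(value: str) -> str:
    # Assumption: case-insensitive matching is modeled with lower().
    # Chinese characters have no upper/lower case, so they pass through unchanged.
    return value.lower()


def contains_match(column_value: str, pattern: str) -> bool:
    # Models LIKE '%pattern%' against the whole, untokenized value.
    return normalize(pattern) in normalize(column_value)


# A few user names taken from the table in the question.
user_names = ["uSer28名", "uSer29称", "uSer30一些", "uSer31区中文", "uSer1@"]

matches = [n for n in user_names if contains_match(n, "称")]
print(matches)
```

In this model the Chinese character is matched exactly like any ASCII character would be, which is consistent with the search in the question succeeding.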
