SOLR Dropping Emoji Miscellaneous Characters

Question

It looks like SOLR is considering what should be valid Unicode characters as invalid, and dropping them.

I "proved" this by turning on query debug to see what the parser was doing with my query. Here's an example:

Query = 'ァ☀' (\u30a1\u2600)
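For anyone wanting to reproduce the test, here is a minimal sketch of building such a debug request. The host, port, and core name (`localhost:8983`, `mycore`) and the helper `build_debug_url` are placeholders invented for illustration, not details from the question; `debugQuery=true` is the standard Solr parameter that adds the parser output to the response.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: host, port and core name are placeholders.
SOLR_SELECT_URL = "http://localhost:8983/solr/mycore/select"


def build_debug_url(query, base_url=SOLR_SELECT_URL):
    """Return a select URL for `query` with parser debugging switched on."""
    params = {
        "q": query,
        "debugQuery": "true",  # ask Solr to include the parsed query in the response
        "wt": "json",
    }
    # urlencode percent-encodes str values as UTF-8 by default,
    # so the two characters go over the wire as their UTF-8 bytes.
    return base_url + "?" + urlencode(params)


url = build_debug_url("\u30a1\u2600")
print(url)
```

Inspecting the encoded URL is a quick way to rule out a client-side encoding problem before looking at the server: if both characters appear as UTF-8 percent-escapes in `q`, they were sent intact.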

Here is what SOLR did with it:

'debug':{
    'rawquerystring':u'\u30a1\u2600',
    'querystring':u'\u30a1\u2600',
    'parsedquery':u'(+DisjunctionMaxQuery((text:\u30a1)))/no_coord',
    'parsedquery_toString':u'+(text:\u30a1)',

As you can see, it was OK with 'ァ', but it ATE the "Black Sun" character.

I haven't tried ALL of the Block, but I've confirmed it also doesn't like ⛿ (\u26ff) and ♖ (\u2656).

I'm using SOLR with Jetty, so the various Tomcat character-encoding issues shouldn't apply.

Recommended answer

This very likely has more to do with the Analyzer. I don't see anything that specifies the treatment of those sorts of characters exactly, but they are probably being treated very much like punctuation by the StandardAnalyzer (or whatever Analyzer you may be using), and so will not be present in the final query. StandardAnalyzer implements the rules set forth in UAX-29, Unicode Text Segmentation, in order to separate input into tokens.
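To see why the two characters might be treated differently, it can help to look at their Unicode properties. The sketch below uses Python's `unicodedata` module; the helper names `describe` and `looks_like_word_char` are made up for illustration. Note the assumption: the general category is only a rough stand-in for the UAX-29 Word_Break property that the tokenizer actually consults, so this approximates the behaviour rather than reproducing Lucene's logic.

```python
import unicodedata


def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    return (
        "U+%04X" % ord(ch),
        unicodedata.name(ch, "<unnamed>"),
        unicodedata.category(ch),
    )


def looks_like_word_char(ch):
    # Assumption: letters (L*) and numbers (N*) are what a word-oriented
    # tokenizer keeps; symbols (S*) and punctuation (P*) are discarded.
    # This is a crude proxy, not the real UAX-29 rule set.
    return unicodedata.category(ch)[0] in ("L", "N")


# The characters mentioned in the question.
for ch in "\u30a1\u2600\u26ff\u2656":
    print(describe(ch), looks_like_word_char(ch))
```

Under this approximation 'ァ' is a letter (category `Lo`), while ☀, ⛿ and ♖ are all symbols (category `So`), which matches the behaviour seen in the debug output. If those characters need to be searchable, a field type whose analyzer does not discard symbols (for example one based on a whitespace tokenizer) may be worth testing.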
