Setting codec/searching Elasticsearch for unicode values from Python

Problem Description

This issue is probably due to my noobishness to ELK, Python, and Unicode.

I have an index containing logstash-digested logs, including a field 'req_host', which contains a host name. Using elasticsearch-py, I'm pulling that host name out of the record, and using it to search in another index. However, if the hostname contains multibyte characters, it fails with a UnicodeDecodeError. Exactly the same query works fine when I enter it from the command line with 'curl -XGET'. The unicode character is a lowercase 'a' with a diaeresis (two dots). The UTF-8 value is C3 A4, and the unicode code point seems to be 00E4 (the language is Swedish).
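For context, the lookup described above is roughly the following (a minimal sketch; the way the logstash record is fetched here is hypothetical, and the second search reuses the same index as the hardwired test script shown further down):

 # Rough sketch of the lookup described above. How the record is fetched is
 # hypothetical; the real hardwired test script appears further down.
 import elasticsearch

 es = elasticsearch.Elasticsearch()

 # Pull one logstash-digested record and read the host name from its source.
 hit = es.search(index="logstash-2015.01.30", doc_type="logs", size=1)
 hostname = hit["hits"]["hits"][0]["_source"]["req_host"]

 # Build the follow-up query as a JSON string, the way the failing code does;
 # the search call fails with the UnicodeDecodeError described above when
 # hostname contains multibyte characters.
 query = u'{ "query": { "match": { "req_host": "%s" }}}' % hostname
 result = es.search(index="logstash-2015.01.30", doc_type="logs", body=query)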

These curl commands work just fine from the command line:

 curl -XGET 'http://localhost:9200/logstash-2015.01.30/logs/_search?pretty=1' -d ' { "query" : {"match" :{"req_host" : "www.utkl\u00E4dningskl\u00E4derna.se" }}}'
 curl -XGET 'http://localhost:9200/logstash-2015.01.30/logs/_search?pretty=1' -d ' { "query" : {"match" :{"req_host" : "www.utklädningskläderna.se" }}}'

They find and return the record.

(The second line shows how the hostname appears in the log I pull it from, with the lowercase 'a' with a diaeresis appearing in two places.)

I've written a very short Python script to show the problem: It uses hardwired queries, printing them and their type, then trying to use them in a search.

 #!/usr/bin/python
 # -*- coding: utf-8 -*-

 import json
 import elasticsearch

 es = elasticsearch.Elasticsearch()

 if __name__=="__main__":
   #uq = u'{ "query": { "match": { "req_host": "www.utklädningskläderna.se" }}}'           # raw utf-8 characters. does not work
   #uq = u'{ "query": { "match": { "req_host": "www.utkl\u00E4dningskl\u00E4derna.se" }}}' # quoted unicode characters. does not work
   #uq = u'{ "query": { "match": { "req_host": "www.utkl\uC3A4dningskl\uC3A4derna.se" }}}' # quoted utf-8 characters. does not work
   uq = u'{ "query": { "match": { "req_host": "www.facebook.com" }}}'                     # non-unicode. works fine
   print "uq", type(uq), uq
   result = es.search(index="logstash-2015.01.30",doc_type="logs",timeout=1000,body=uq);
   if result["hits"]["total"] == 0:
     print "nothing found"
   else:
     print "found some"

If I run it as shown, with the 'facebook' query, it's fine - the output is:

$python testutf8b.py
uq <type 'unicode'> { "query": { "match": { "req_host": "www.facebook.com" }}}
found some

Note that the query string 'uq' is unicode.

But if I use the other three strings, which include the Unicode characters, it blows up. For example, with the second line, I get:

$python testutf8b.py
uq <type 'unicode'> { "query": { "match": { "req_host": "www.utklädningskläderna.se" }}}
Traceback (most recent call last):
   File "testutf8b.py", line 15, in <module>
    result = es.search(index="logstash-2015.01.30",doc_type="logs",timeout=1000,body=uq);
  File "build/bdist.linux-x86_64/egg/elasticsearch/client/utils.py", line 68, in _wrapped
  File "build/bdist.linux-x86_64/egg/elasticsearch/client/__init__.py", line 497, in search
  File "build/bdist.linux-x86_64/egg/elasticsearch/transport.py", line 307, in perform_request
  File "build/bdist.linux-x86_64/egg/elasticsearch/connection/http_urllib3.py", line 82, in perform_request
elasticsearch.exceptions.ConnectionError: ConnectionError('ascii' codec can't decode byte 0xc3 in position 45: ordinal not in range(128)) caused by: UnicodeDecodeError('ascii' codec can't decode byte 0xc3 in position 45: ordinal not in range(128))
$

Again, note that the query string is a unicode string (yes, the source code line is the one with the \u00E4 characters).

I'd really like to resolve this. I've tried various combinations of uq = uq.encode("utf=8") and uq = uq.decode("utf=8"), but it doesn't seem to help. I'm starting to wonder if there's an issue in the elasticsearch-py library.
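For reference, those attempts were presumably along these lines (a hypothetical reconstruction, written with the standard codec name 'utf-8'). As the traceback shows, the error is raised inside elasticsearch-py's HTTP transport (http_urllib3.perform_request), not while building or re-encoding the string:

 # Hypothetical reconstruction of the encode/decode attempts mentioned above.
 uq = u'{ "query": { "match": { "req_host": "www.utkl\u00E4dningskl\u00E4derna.se" }}}'

 uq_bytes = uq.encode("utf-8")       # unicode -> UTF-8 byte string (str in Python 2)
 uq_text = uq_bytes.decode("utf-8")  # back to a unicode string

 # Whichever form is passed as body=, it still goes through the client's
 # transport layer, which is where the 'ascii' codec error above is raised.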

Thanks!

pt

PS: This is under CentOS 7, using ES 1.5.0. The logs were digested into ES under a slightly older version, using logstash-1.4.2.

Recommended Answer

Basically, you don't need to pass body as a string. Use native Python data structures, or transform them on the fly. Give it a try:

>>> import elasticsearch
>>> es = elasticsearch.Elasticsearch()
>>> es.index(index='unicode-index', body={'host': u'www.utklädningskläderna.se'}, doc_type='log')

{u'_id': u'AUyGJuFMy0qdfghJ6KwJ',
 u'_index': u'unicode-index',
 u'_type': u'log',
 u'_version': 1,
 u'created': True}

>>> es.search(index='unicode-index', body={}, doc_type='log')

{u'_shards': {u'failed': 0, u'successful': 5, u'total': 5},
 u'hits': {u'hits': [{u'_id': u'AUyBTz5CsiBSSvubLioQ',
    u'_index': u'unicode-index',
    u'_score': 1.0,
    u'_source': {u'host': u'www.utkl\xe4dningskl\xe4derna.se'},
    u'_type': u'log'}],
  u'max_score': 1.0,
  u'total': 1},
 u'timed_out': False,
 u'took': 5}

>>> es.search(index='unicode-index', body={'query': {'match': {'host': u'www.utklädningskläderna.se'}}}, doc_type='log')

{u'_shards': {u'failed': 0, u'successful': 5, u'total': 5},
 u'hits': {u'hits': [{u'_id': u'AUyBTz5CsiBSSvubLioQ',
    u'_index': u'unicode-index',
    u'_score': 0.30685282,
    u'_source': {u'host': u'www.utkl\xe4dningskl\xe4derna.se'},
    u'_type': u'log'}],
  u'max_score': 0.30685282,
  u'total': 1},
 u'timed_out': False,
 u'took': 122}

>>> import json

>>> body={'query': {'match': {'host': u'www.utklädningskläderna.se'}}}

>>> es.search(index='unicode-index', body=body, doc_type='log')

{u'_shards': {u'failed': 0, u'successful': 5, u'total': 5},
 u'hits': {u'hits': [{u'_id': u'AUyBTz5CsiBSSvubLioQ',
    u'_index': u'unicode-index',
    u'_score': 0.30685282,
    u'_source': {u'host': u'www.utkl\xe4dningskl\xe4derna.se'},
    u'_type': u'log'}],
  u'max_score': 0.30685282,
  u'total': 1},
 u'timed_out': False,
 u'took': 4}

>>> es.search(index='unicode-index', body=json.dumps(body), doc_type='log')

{u'_shards': {u'failed': 0, u'successful': 5, u'total': 5},
 u'hits': {u'hits': [{u'_id': u'AUyBTz5CsiBSSvubLioQ',
    u'_index': u'unicode-index',
    u'_score': 0.30685282,
    u'_source': {u'host': u'www.utkl\xe4dningskl\xe4derna.se'},
    u'_type': u'log'}],
  u'max_score': 0.30685282,
  u'total': 1},
 u'timed_out': False,
 u'took': 5}

>>> json.dumps(body)
'{"query": {"match": {"host": "www.utkl\\u00e4dningskl\\u00e4derna.se"}}}'
