Best practice to update a column of all documents in Elasticsearch

Question

I'm developing a log analysis system. The input is a set of log files. I have an external Python program that reads the log files and decides whether each record (line), or the log file as a whole, is "normal" or "malicious". I want to use the Elasticsearch Update API to attach my Python program's result ("normal" or "malicious") to Elasticsearch by adding a new field called result, so that I can see the result clearly in the Kibana UI.

Simply put, my Python code and Elasticsearch each take the log files as input. Now I want to push the results from my Python code into Elasticsearch. What's the best way to do it?

I can think of several approaches:

  1. Elasticsearch automatically assigns an ID (_id) to each document. If I can find out how Elasticsearch calculates _id, my Python code can compute it on its own and update the corresponding document via _id. The problem is that the official Elasticsearch documentation doesn't describe the algorithm it uses to generate _id.

  2. Add an ID (such as the line number) to the log files myself. Both my program and Elasticsearch would know this ID, so my program could use it for updates. The downside is that my program would have to search for this ID on every update, because it's an ordinary field rather than the built-in _id, so performance would be poor.

  3. Have my Python code fetch the logs from Elasticsearch instead of reading the log files directly. But this makes the system fragile, because Elasticsearch becomes a critical point; right now I only want Elasticsearch to be a log viewer.

So the first solution looks ideal from my current point of view. But I'm not sure whether there is a better way to do it?

Answer

If possible, restructure your application so that instead of dumping plain text to a log file, you write structured log information directly to something like Elasticsearch. Thank me later.
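As a rough illustration, here is a minimal sketch of that idea, assuming the official elasticsearch-py client (8.x keyword arguments) and a hypothetical logs index: instead of appending a plain-text line, the application indexes a structured document with the verdict already attached.

```python
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # adjust for your cluster

def log_event(message: str, result: str) -> None:
    """Index one structured log event instead of appending a plain-text line."""
    es.index(
        index="logs",  # hypothetical index name
        document={
            "@timestamp": datetime.now(timezone.utc).isoformat(),
            "message": message,
            "result": result,  # "normal" or "malicious", attached at write time
        },
    )

log_event("GET /admin.php from 10.0.0.5", "malicious")
```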

That isn't always feasible (e.g. if you don't control the log source), so here are a few opinions on your proposed solutions.

  1. This feels super brittle. Elasticsearch does not derive _id from the properties of a particular document; it's generated from internal state (the existing _id values it has stored and, I believe, a random seed) rather than from the document's contents. Even if it could work, relying on an undocumented property is a good way to shoot yourself in the foot when dealing with a team that makes breaking changes even to its documented behavior as often as Elasticsearch does.

  2. This one actually isn't so bad. Elasticsearch supports manually choosing the ID of a document (see the first sketch after this list). Even if it didn't, it performs quite well on bulk terms queries and wouldn't be as much of a bottleneck as you might think. If you really have so much data that this could break your application, then Elasticsearch might not be the best tool.

  3. This solution is great. It's super extensible and doesn't depend in a complicated way on how the log file is structured, how you've chosen to index it in Elasticsearch, or how you're reading it with Python. You simply fetch a document, and if it needs updating, you update it.

Elasticsearch isn't really a worse point of failure here than before (if ES goes down, your app is in trouble under any of these solutions) -- you're just doing twice as many queries (a read plus a write). If a factor of two kills your application, you either need a better solution to the problem (i.e. avoid Elasticsearch) or you need to throw more hardware at it. ES supports all kinds of sharding configurations, and you can build a robust cluster on the cheap.
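Here is a minimal sketch of that second option, assuming the elasticsearch-py client (8.x keyword arguments) and a hypothetical file:line string as the document ID; the later update then becomes a direct lookup rather than a search.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
INDEX = "logs"  # hypothetical index name

# At ingest time: derive a deterministic id, e.g. from source file and line number.
doc_id = "app.log:42"
es.index(index=INDEX, id=doc_id, document={"message": "GET /admin.php from 10.0.0.5"})

# Later, the analyzer attaches its verdict by id -- a direct lookup, not a search.
es.update(index=INDEX, id=doc_id, doc={"result": "malicious"})
```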

One question, though: why do you have logs in Elasticsearch that need to be updated with this particular normal/malicious property in the first place? If you're the one putting them into ES, just tag them appropriately before you store them, which avoids the extra read that's bothering you. If that's not an option, you'll probably still want to read from ES directly to pull the logs into Python anyway, to avoid the large overhead of parsing the original log file a second time.
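If you do read the existing documents back out of Elasticsearch, a rough sketch of that round trip might look like the following, assuming elasticsearch-py's helpers module, a hypothetical logs index, and a classify() placeholder standing in for your analyzer: scan the unlabelled documents, classify each one, and write the verdicts back in bulk.

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")
INDEX = "logs"  # hypothetical index name

def classify(message: str) -> str:
    """Stand-in for the real analyzer: return 'normal' or 'malicious'."""
    return "malicious" if "admin.php" in message else "normal"

def verdicts():
    # Stream every document that has no verdict yet.
    unlabelled = {"query": {"bool": {"must_not": {"exists": {"field": "result"}}}}}
    for hit in helpers.scan(es, index=INDEX, query=unlabelled):
        yield {
            "_op_type": "update",
            "_index": INDEX,
            "_id": hit["_id"],
            "doc": {"result": classify(hit["_source"]["message"])},
        }

helpers.bulk(es, verdicts())  # one read pass, one bulk write pass
```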

If this is a one-time hotfix to existing ES data while you roll out the normal/malicious tagging, then don't worry about the factor-of-two cost. Just throttle the query if you're concerned about bringing down the cluster. The hotfix will finish eventually, and probably faster than if we keep deliberating about the best option.
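One simple way to throttle such a backfill is to push the updates in small batches with a pause between them. A hedged sketch (reusing the hypothetical verdicts() update generator from the previous sketch) might look like this.

```python
import time
from typing import Any, Dict, Iterable
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def throttled_bulk(actions: Iterable[Dict[str, Any]],
                   chunk_size: int = 500, pause: float = 1.0) -> None:
    """Apply bulk actions in small batches with a pause between them,
    so a one-off backfill never saturates the cluster."""
    batch = []
    for action in actions:
        batch.append(action)
        if len(batch) >= chunk_size:
            helpers.bulk(es, batch)
            batch = []
            time.sleep(pause)  # crude client-side throttle
    if batch:
        helpers.bulk(es, batch)

# e.g. throttled_bulk(verdicts()) with the update generator from the previous sketch
```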
