Support Vector Machine for Java?

Question

I'd like to write a "smart monitor" in Java that sends out an alert any time it detects oncoming performance issues. My Java app is writing data in a structured format to a log file:

<datetime> | <java-method> | <seconds-to-execute>

So, for example, if I had a Widget#doSomething(String) method that took 812ms to execute, it would be logged as:

2013-03-24 11:39:21 | Widget#doSomething(String) | 812
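
A line in this format can be split into its three fields with a few lines of Java. The following is only a sketch: the `LogEntry` class and its field names are made up for illustration, and the duration is stored in whatever unit the log uses (the format names the column seconds-to-execute, while the example value 812 is described as milliseconds).

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/** One parsed log line. Class and field names are illustrative, not from the original app. */
final class LogEntry {
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    final LocalDateTime timestamp;
    final String method;
    final long duration; // in the unit the log uses

    LogEntry(LocalDateTime timestamp, String method, long duration) {
        this.timestamp = timestamp;
        this.method = method;
        this.duration = duration;
    }

    /** Parses "<datetime> | <java-method> | <duration>"; throws on malformed input. */
    static LogEntry parse(String line) {
        // Split on the pipe separator, tolerating any whitespace around it.
        String[] parts = line.trim().split("\\s*\\|\\s*");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected 3 fields: " + line);
        }
        return new LogEntry(
                LocalDateTime.parse(parts[0], FORMAT),
                parts[1],
                Long.parseLong(parts[2]));
    }
}
```

Calling `LogEntry.parse` on the sample line above yields the method name `Widget#doSomething(String)` and a duration of 812.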

As performance starts to degrade (such as during a major collection, during peak loads, or if the system is just slowing to a crawl), method execution timings start to slow down, so the right-most column starts to show huge numbers (sometimes 20 to 40 seconds to execute a single method).

In college - for a machine learning exercise - I wrote what my professor called a linear dichotomizer that took simple test data (the height, weight and gender of a person) and "learned" how to categorize a person as male or female based on their height/weight. Then, once it had all its training data, we fed it new data to see how accurately it could determine gender.

I think the multivariate version of a linear dichotomizer is something called a support vector machine (SVM). If I'm wrong, then please clarify and I'll change the title of my question to something more appropriate. Regardless, I need this app to do the following things:


    • Run in a "test mode" where I feed it the structured log file from my main Java app (the one I wish to monitor) and it takes each log entry (as shown above) and uses it for test data
      • Only the java-method and seconds-to-execute columns are important as inputs/test data; I don't care about the datetime

      It's important to note that the seconds-to-execute column is not the only important factor here, as I've seen horrible timings for certain methods during periods of awesome performance, and really great timings for other methods at times when the server seemed like it was about to die and push daisies. So obviously certain methods are "weighted"/more important to performance than others.


      • Googling for "linear dichotomizer" or "support vector machines" turns up some really scary, highly-academic, ultra-cerebral white papers that I just don't have the mental energy (nor time) to consume, unless they truly are my only options; so I ask: is there a layman's introduction to this stuff, or a great site/article/tutorial for building such a system in Java?
      • Are there any solid/stable open source Java libraries? I was only able to find jlibsvm and svmlearn but the former looks to be in a pure beta state and the latter seems to only support binary decisions (like my old linear dichotomizer). I know there's Mahout but that sits on top of Hadoop, and I don't think I have enough data to warrant the time and mental energy into setting up my own Hadoop cluster.

      Thanks in advance!

      Recommended answer

      The "smart monitor" you describe is a time-series classification problem.

      There are many classification algorithms. They all basically take a matrix, where the rows are observations and the columns are "features" that somehow describe the observation, and a label vector with one entry per row, each valued either 0 or 1. In your problem an observation might be a one-minute sample, and your label vector will be valued 1 for the time periods that are experiencing performance issues and 0 otherwise.

      Implicit in this definition is the need to resample your data (using the mode/median/mean if necessary) such that each observation covers an evenly sized interval, such as a second, a minute, or an hour.
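
The resampling step might look roughly like the sketch below, which groups raw samples into fixed-width time buckets and takes the median of each. The `Resampler` class, the use of epoch seconds as timestamps, and the choice of the median are all assumptions made for illustration. Buckets with no samples are simply absent from the result, so a real monitor would still need to decide how to fill such gaps.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: resamples irregular timings onto an even time grid. */
final class Resampler {

    /**
     * Groups (epochSecond, duration) samples into buckets of bucketSeconds and
     * returns each bucket's median, keyed by bucket start time in ascending order.
     */
    static TreeMap<Long, Double> medianPerBucket(
            long[] epochSeconds, double[] durations, long bucketSeconds) {
        TreeMap<Long, List<Double>> buckets = new TreeMap<>();
        for (int i = 0; i < epochSeconds.length; i++) {
            // Round the timestamp down to the start of its bucket.
            long start = Math.floorDiv(epochSeconds[i], bucketSeconds) * bucketSeconds;
            buckets.computeIfAbsent(start, k -> new ArrayList<>()).add(durations[i]);
        }
        TreeMap<Long, Double> result = new TreeMap<>();
        for (Map.Entry<Long, List<Double>> e : buckets.entrySet()) {
            List<Double> values = e.getValue();
            Collections.sort(values);
            int n = values.size();
            // Middle value for odd counts, mean of the two middle values for even counts.
            double median = (n % 2 == 1)
                    ? values.get(n / 2)
                    : (values.get(n / 2 - 1) + values.get(n / 2)) / 2.0;
            result.put(e.getKey(), median);
        }
        return result;
    }
}
```

With `bucketSeconds` set to 60, each entry of the returned map is one per-minute observation.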

      Generating features is the crucial part. I'd probably start with two kinds of feature: the raw values, and the (once) differenced values between observations x_i and x_i-1. We'll define each of these for a lag of 2, which technically makes 4 features. No feature may look into the future, and each feature must represent the same thing for every observation.

      For example, consider this time series of length 10:

      [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
      

      If we want to produce a set of features using a lag of two intervals into the past, then the first two elements of the time series are considered a burn-in sample. We can't use the observations associated with them to train our algorithm.

      The raw values, 8 rows by 2 columns, would be:

      [[ 1.,  0.],
       [ 2.,  1.],
       [ 3.,  2.],
       [ 4.,  3.],
       [ 5.,  4.],
       [ 6.,  5.],
       [ 7.,  6.],
       [ 8.,  7.]]
      

      And the differenced values:

      [[ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.]]
      

      These get column stacked. Differencing costs one more burn-in observation, so trim the raw values to the same number of rows before stacking. There are many additional features you could explore. Rolling mean would be my next pick.
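
The lag features described above can be sketched in plain Java as follows. The `LagFeatures` class and its column ordering are assumptions for illustration. The sketch drops `lags + 1` leading observations, one more than the raw-value matrix shown above, so that every row has a defined differenced value.

```java
/** Hypothetical helper: builds lagged raw and differenced features for a series. */
final class LagFeatures {

    /**
     * Returns one row per usable observation t, with columns
     * [x[t-1], ..., x[t-lags], x[t-1]-x[t-2], ..., x[t-lags]-x[t-lags-1]].
     * Every column looks only at values before t, never at t itself or later.
     */
    static double[][] build(double[] x, int lags) {
        int burnIn = lags + 1; // differencing needs one extra past value
        int rows = x.length - burnIn;
        if (lags < 1 || rows < 1) {
            throw new IllegalArgumentException("Series too short for lag " + lags);
        }
        double[][] features = new double[rows][2 * lags];
        for (int r = 0; r < rows; r++) {
            int t = r + burnIn;
            for (int k = 1; k <= lags; k++) {
                features[r][k - 1] = x[t - k];                       // raw value at lag k
                features[r][lags + k - 1] = x[t - k] - x[t - k - 1]; // differenced value at lag k
            }
        }
        return features;
    }
}
```

For the example series 0 through 9 with `lags = 2`, this produces 7 rows of 4 columns, starting with `[2, 1, 1, 1]` and ending with `[8, 7, 1, 1]`.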

      If you want to predict further into the future, then your training data should lag further behind your label vector.

      If performance isn't satisfactory, then try adding more features by choosing a rolling mean over a bigger window, or by adding lags further back in the past. A clever trick to improve the performance of time-series algorithms is to include the value of the prediction for the previous time interval.
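
A rolling-mean feature that respects the "no looking into the future" rule could be sketched like this. The `RollingMean` class is hypothetical, and using NaN to mark the burn-in positions is just one possible convention.

```java
/** Hypothetical helper: trailing rolling mean that never uses current or future values. */
final class RollingMean {

    /**
     * out[t] is the mean of x[t-window] .. x[t-1]. Positions with fewer than
     * `window` past values are set to NaN and should be treated as burn-in.
     */
    static double[] trailing(double[] x, int window) {
        if (window < 1) {
            throw new IllegalArgumentException("window must be at least 1");
        }
        double[] out = new double[x.length];
        double sum = 0.0;
        for (int t = 0; t < x.length; t++) {
            // Here `sum` holds the last `window` values before t (fewer during burn-in).
            out[t] = (t >= window) ? sum / window : Double.NaN;
            sum += x[t];
            if (t >= window) {
                sum -= x[t - window]; // slide the window forward by one
            }
        }
        return out;
    }
}
```

Widening `window` gives the "bigger window" variant mentioned above; the resulting column can be stacked next to the lag features once the NaN rows are trimmed.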

      Fit your classifier on some early part of the data, then observe its accuracy over a later part of the data. There are many metrics for classifiers you can use. If you choose a classifier that outputs probabilities instead of hard 1/0 labels, your options broaden further. (As do the uses of your classifier.)

      Precision and recall are intuitive performance metrics of classifiers.
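
For 0/1 labels both metrics reduce to simple counting: precision is the fraction of raised alerts that were real problems, and recall is the fraction of real problems that raised an alert. A minimal sketch follows; the `ClassifierMetrics` class is hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not a standard.

```java
/** Hypothetical helper: precision and recall for 0/1 label vectors. */
final class ClassifierMetrics {

    /** Returns {truePositives, falsePositives, falseNegatives}. */
    static int[] counts(int[] actual, int[] predicted) {
        if (actual.length != predicted.length) {
            throw new IllegalArgumentException("Length mismatch");
        }
        int tp = 0, fp = 0, fn = 0;
        for (int i = 0; i < actual.length; i++) {
            if (predicted[i] == 1 && actual[i] == 1) tp++;
            else if (predicted[i] == 1 && actual[i] == 0) fp++;
            else if (predicted[i] == 0 && actual[i] == 1) fn++;
        }
        return new int[] {tp, fp, fn};
    }

    /** Of all intervals flagged as problems, the fraction that really were problems. */
    static double precision(int[] actual, int[] predicted) {
        int[] c = counts(actual, predicted);
        int flagged = c[0] + c[1];
        return flagged == 0 ? 0.0 : (double) c[0] / flagged;
    }

    /** Of all real problem intervals, the fraction that were flagged. */
    static double recall(int[] actual, int[] predicted) {
        int[] c = counts(actual, predicted);
        int real = c[0] + c[2];
        return real == 0 ? 0.0 : (double) c[0] / real;
    }
}
```

For a monitor, low precision means too many false alarms and low recall means missed incidents, so it is worth looking at both rather than at accuracy alone.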

      Train on the first (earlier) half of your data and test on the second (later) half.

      As far as algorithms go, I'd look into logistic regression. I'd only look elsewhere if the performance isn't satisfactory and you've exhausted feature extraction options.
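
To give a feel for what logistic regression does, here is a bare-bones version trained by batch gradient descent on the first `trainEnd` rows only, leaving the later rows for testing. This is a plain-Java sketch and not the API of Mallet or any other library. The `LogisticSketch` class, the learning rate, and the epoch count are illustrative assumptions, and it presumes the features have been scaled to a similar range.

```java
/** Hypothetical minimal logistic regression; a library implementation is preferable in practice. */
final class LogisticSketch {

    /** Probability that `row` belongs to class 1; the last weight is the bias term. */
    static double predict(double[] w, double[] row) {
        double z = w[w.length - 1];
        for (int j = 0; j < row.length; j++) {
            z += w[j] * row[j];
        }
        return 1.0 / (1.0 + Math.exp(-z)); // sigmoid
    }

    /** Fits weights on rows [0, trainEnd) by batch gradient descent on the log loss. */
    static double[] fit(double[][] x, int[] y, int trainEnd, double learningRate, int epochs) {
        int d = x[0].length;
        double[] w = new double[d + 1];
        for (int epoch = 0; epoch < epochs; epoch++) {
            double[] grad = new double[d + 1];
            for (int i = 0; i < trainEnd; i++) {
                double err = predict(w, x[i]) - y[i];
                for (int j = 0; j < d; j++) {
                    grad[j] += err * x[i][j];
                }
                grad[d] += err; // bias gradient
            }
            for (int j = 0; j <= d; j++) {
                w[j] -= learningRate * grad[j] / trainEnd;
            }
        }
        return w;
    }
}
```

Because `predict` returns a probability, the alert threshold can be tuned (for example, above or below 0.5) to trade precision against recall on the later, held-out part of the data.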

      Mallet appears to be a good library for the task. See this bit of the docs.

      I recently discovered JSAT, which looks promising.

      There are more specific approaches to time-series classification that explicitly take into account the sequential nature of the observations and labels. What I've described above is a general-purpose adaptation of classification to time series.
