Calculating Mutual Information For Selecting a Training Set in Java

Question

Scenario

I am attempting to implement supervised learning over a data set within a Java GUI application. The user will be given a list of items or 'reports' to inspect and will label them based on a set of available labels. Once the supervised learning is complete, the labelled instances will then be given to a learning algorithm. This will attempt to order the rest of the items by how likely it is that the user will want to view them.

To get the most from the user's time, I want to pre-select the reports that will provide the most information about the entire collection of reports, and have the user label them. As I understand it, to calculate this it would be necessary to find the sum of all the mutual information values for each report, and order the reports by that value. The reports labelled during supervised learning will then be used to form a Bayesian network to find the probability of a binary value for each remaining report.

Example

Here, an artificial example may help to explain, and may clear up confusion where I've undoubtedly used the wrong terminology :-) Consider an example where the application displays news stories to the user. It chooses which news stories to display first based on the preferences the user has shown. Features of a news story which have a correlation are country of origin, category and date. So if a user labels a single news story from Scotland as interesting, it tells the machine learner that there's an increased chance other news stories from Scotland will be interesting to the user. The same goes for a category such as Sport, or a date such as December 12th 2004.

This preference could be calculated by choosing any order for all news stories (e.g. by category, by date) or by ordering them randomly, then calculating the preference as the user goes along. What I would like to do is to get a kind of "head start" on that ordering by having the user look at a small number of specific news stories and say whether they're interested in them (the supervised learning part). To choose which stories to show the user, I have to consider the entire collection of stories. This is where Mutual Information comes in. For each story I want to know how much it can tell me about all the other stories once it is classified by the user. For example, if there is a large number of stories originating from Scotland, I want to get the user to classify (at least) one of them. The same applies to other correlated features such as category or date. The goal is to find examples of reports which, when classified, provide the most information about the other reports.

The Problem

Because my math is a bit rusty, and I'm new to machine learning I'm having some trouble converting the definition of Mutual Information to an implementation in Java. Wikipedia describes the equation for Mutual Information as:
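
For two discrete random variables X and Y this is

I(X; Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log\left(\frac{p(x, y)}{p(x)\,p(y)}\right)

where p(x, y) is the joint probability that X = x and Y = y, and p(x) and p(y) are the marginal probabilities.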

However, I'm unsure if this can actually be used when nothing has been classified, and the learning algorithm has not calculated anything yet.

As in my example, say I had a large number of new, unlabelled instances of this class:

public class NewsStory {
    private String countryOfOrigin;
    private String category;
    private Date date;
    // constructor, etc.
}

In my specific scenario, the correlation between fields/features is based on an exact match, so, for instance, a difference of one day and a difference of ten years between dates count equally as a mismatch.

The factors for correlation (e.g. is date more strongly correlated than category?) are not necessarily equal, but they can be predefined and constant. Does this mean that the result of the function p(x,y) is the predefined value, or am I mixing up terms?

The Question (finally)

How can I go about implementing the mutual information calculation given this (fake) example of news stories? Libraries, javadoc, code examples etc. are all welcome information. Also, if this approach is fundamentally flawed, explaining why that is the case would be just as valuable an answer.

PS. I am aware of libraries such as Weka and Apache Mahout, so just mentioning them is not really useful for me. I'm still searching through documentation and examples for both these libraries looking for stuff on Mutual Information specifically. What would really help me is pointing to resources (code examples, javadoc) where these libraries help with mutual information.

Answer

I am guessing that your problem is something like...

"Given a list of unlabeled examples, sort the list by how much the predictive accuracy of the model would improve if the user labelled the example and added it to the training set."

If this is the case, I don't think mutual information is the right thing to use because you can't calculate MI between two instances. The definition of MI is in terms of random variables and an individual instance isn't a random variable, it's just a value.

The features and the class label can be thought of as random variables. That is, they have a distribution of values over the whole data set. You can calculate the mutual information between two features, to see how 'redundant' one feature is given the other one, or between a feature and the class label, to get an idea of how much that feature might help prediction. This is how people usually use mutual information in a supervised learning problem.

I think ferdystschenko's suggestion that you look at active learning methods is a good one.

In response to Grundlefleck's comment, I'll go a bit deeper into terminology by using his idea of a Java object analogy...

Collectively, we have used the terms 'instance', 'thing', 'report' and 'example' to refer to the object being classified. Let's think of these things as instances of a Java class (I've left out the boilerplate constructor):


class Example
{ String f1;
  String f2;
}

Example e1 = new Example("foo", "bar");
Example e2 = new Example("foo", "baz");

The usual terminology in machine learning is that e1 is an example, that all examples have two features f1 and f2 and that for e1, f1 takes the value 'foo' and f2 takes the value 'bar'. A collection of examples is called a data set.

Take all the values of f1 for all the examples in the data set. This is a list of strings, and it can also be thought of as a distribution: we can treat the feature as a random variable, with each value in the list being a sample drawn from that random variable. So we can, for example, calculate the MI between f1 and f2. The pseudocode would be something like:


N = total number of examples
mi = 0
for each value x taken by f1:
{  for each value y taken by f2:
   {  p_xy = (number of examples where f1=x and f2=y) / N
      p_x  = (number of examples where f1=x) / N
      p_y  = (number of examples where f2=y) / N
      if p_xy > 0:
         mi += p_xy * log(p_xy / (p_x * p_y))
   }
}
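
As a rough, self-contained Java sketch of that pseudocode (the MutualInformation class and compute method are just illustrative names of my own), the MI between two parallel lists of feature values can be estimated by counting joint and marginal frequencies. The same routine can be applied between a feature column and the class-label column, to score how much a feature tells you about the label:

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MutualInformation {

    // I(F1; F2) between two parallel lists of feature values, estimating
    // p(x), p(y) and p(x, y) from counts, as in the pseudocode above.
    public static double compute(List<String> f1, List<String> f2) {
        int n = f1.size();
        Set<String> xValues = new HashSet<>(f1);   // distinct values of f1
        Set<String> yValues = new HashSet<>(f2);   // distinct values of f2

        double mi = 0.0;
        for (String x : xValues) {
            for (String y : yValues) {
                int cx = 0, cy = 0, cxy = 0;
                for (int i = 0; i < n; i++) {
                    boolean matchX = f1.get(i).equals(x);
                    boolean matchY = f2.get(i).equals(y);
                    if (matchX) cx++;
                    if (matchY) cy++;
                    if (matchX && matchY) cxy++;
                }
                if (cxy == 0) continue;            // 0 * log(0) is treated as 0
                double pxy = cxy / (double) n;
                double px  = cx  / (double) n;
                double py  = cy  / (double) n;
                mi += pxy * Math.log(pxy / (px * py));
            }
        }
        return mi;                                 // in nats; divide by Math.log(2) for bits
    }

    public static void main(String[] args) {
        // Toy usage: MI between a 'category' feature and an interested/not-interested label.
        List<String> category = Arrays.asList("Sport", "Sport", "Politics", "Politics");
        List<String> label    = Arrays.asList("yes",   "yes",   "no",       "no");
        System.out.println(compute(category, label)); // ~0.693 nats, i.e. 1 bit
    }
}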

However, you can't calculate MI between e1 and e2; it's just not defined that way.
