ai: Determining what tests to run to get most useful data


Question

This is for http://cssfingerprint.com

I have a system (see the about page on the site for details) where:

  • I need to output a ranked list, with confidences, of categories that match a particular feature vector
  • the binary feature vectors are a list of site IDs & whether this session detected a hit
  • feature vectors are, for a given categorization, somewhat noisy (sites will decay out of history, and people will visit sites they don't normally visit)
  • categories are a large, non-closed set (user IDs)
  • my total feature space is approximately 50 million items (URLs)
  • for any given test, I can only query approx. 0.2% of that space
  • I can only make the decision of what to query, based on results so far, ~10-30 times, and must do so in <~100ms (though I can take much longer to do post-processing, relevant aggregation, etc)
  • getting the AI's probability ranking of categories based on results so far is mildly expensive; ideally the decision will depend mostly on a few cheap SQL queries
  • I have training data that can say authoritatively that any two feature vectors are the same category but not that they are different (people sometimes forget their codes and use new ones, thereby making a new user id)


I need an algorithm to determine what features (sites) are most likely to have a high ROI to query (i.e. to better discriminate between plausible-so-far categories [users], and to increase certainty that it's any given one).


This needs to balance exploitation (testing based on prior test data) against exploration (testing things that haven't been tested enough to know how they perform).


There's another question that deals with a priori ranking; this one is specifically about a posteriori ranking based on results gathered so far.


Right now, I have little enough data that I can just always test everything that anyone else has ever gotten a hit for, but eventually that won't be the case, at which point this problem will need to be solved.


I imagine that this is a fairly standard problem in AI - having a cheap heuristic for what expensive queries to make - but it wasn't covered in my AI class, so I don't actually know whether there's a standard answer. So, relevant reading that's not too math-heavy would be helpful, as well as suggestions for particular algorithms.


What's a good way to approach this problem?

Answer


If you know nothing about the features you have not yet sampled, then you have little to go on when deciding whether to explore or exploit your data. If you can express your ROI as a single number after each query, then there is a principled way of making this choice by keeping track of upper confidence bounds. See the paper Finite-time Analysis of the Multiarmed Bandit Problem.
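The UCB1 rule from that paper can be sketched as follows. This is a minimal illustration, not your production query planner: it assumes each candidate site is treated as a bandit arm with a per-arm pull count and mean reward, and the reward definition (e.g. hit rate, or how well a site discriminates among plausible users) is a placeholder you would supply yourself.

```python
import math

def ucb1_select(stats, total_pulls):
    """Pick the arm (candidate site) with the highest UCB1 score.

    stats: dict mapping arm id -> (pull_count, mean_reward), where
           mean_reward is assumed to be scaled into [0, 1].
    total_pulls: total number of queries made so far across all arms.
    """
    best_arm, best_score = None, float("-inf")
    for arm, (n, mean) in stats.items():
        if n == 0:
            # Never-sampled arms are tried first: pure exploration.
            return arm
        # Mean reward plus an exploration bonus that shrinks as the
        # arm accumulates samples relative to the total.
        score = mean + math.sqrt(2 * math.log(total_pulls) / n)
        if score > best_score:
            best_arm, best_score = arm, score
    return best_arm

# Hypothetical stats: site "a" looks good, "b" looks bad, "c" is untested.
stats = {"a": (10, 0.9), "b": (10, 0.1), "c": (0, 0.0)}
print(ucb1_select(stats, 20))  # "c": untested arms get queried first
```

The square-root bonus is what balances the two goals in the question: an arm with few pulls gets a large bonus (exploration), while a heavily sampled arm is scored almost entirely by its observed mean (exploitation), and the selection itself is cheap enough to run inside a tight per-query budget.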

