ai: Determining what tests to run to get most useful data


Question

This is for http://cssfingerprint.com

I have a system (see the about page on the site for details) where:

  • I need to output a ranked list, with confidences, of categories that match a particular feature vector
  • the binary feature vectors are a list of site IDs & whether this session detected a hit
  • feature vectors are, for a given categorization, somewhat noisy (sites will decay out of history, and people will visit sites they don't normally visit)
  • categories are a large, non-closed set (user IDs)
  • my total feature space is approximately 50 million items (URLs)
  • for any given test, I can only query approx. 0.2% of that space
  • I can only make the decision of what to query, based on results so far, ~10-30 times, and must do so in <~100ms (though I can take much longer to do post-processing, relevant aggregation, etc)
  • getting the AI's probability ranking of categories based on results so far is mildly expensive; ideally the decision will depend mostly on a few cheap SQL queries
  • I have training data that can say authoritatively that any two feature vectors are the same category but not that they are different (people sometimes forget their codes and use new ones, thereby making a new user id)


I need an algorithm to determine what features (sites) are most likely to have a high ROI to query (i.e. to better discriminate between plausible-so-far categories [users], and to increase certainty that it's any given one).


This needs to balance exploitation (testing based on prior test data) against exploration (testing things that haven't been tested enough to know how they perform).


There's another question that deals with a priori ranking; this one is specifically about a posteriori ranking based on results gathered so far.


Right now, I have little enough data that I can just always test everything that anyone else has ever gotten a hit for, but eventually that won't be the case, at which point this problem will need to be solved.


I imagine that this is a fairly standard problem in AI - having a cheap heuristic for what expensive queries to make - but it wasn't covered in my AI class, so I don't actually know whether there's a standard answer. So, relevant reading that's not too math-heavy would be helpful, as well as suggestions for particular algorithms.


What's a good way to approach this problem?

Answer


If you know nothing about the features you have not yet sampled, then you have little to go on when deciding whether to explore or exploit your data. If you can express your ROI as a single number after each query, then there is a principled way of making this choice by keeping track of upper confidence bounds. See the paper Finite-time Analysis of the Multiarmed Bandit Problem.
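The UCB1 rule from that paper can be sketched as follows. This is a minimal illustration, not your production query planner: it assumes each candidate site is treated as a bandit arm with a per-arm pull count and mean reward, and the reward definition (e.g. hit rate, or how well a site discriminates among plausible users) is a placeholder you would supply yourself.

```python
import math

def ucb1_select(stats, total_pulls):
    """Pick the arm (candidate site) with the highest UCB1 score.

    stats: dict mapping arm id -> (pull_count, mean_reward), where
           mean_reward is assumed to be scaled into [0, 1].
    total_pulls: total number of queries made so far across all arms.
    """
    best_arm, best_score = None, float("-inf")
    for arm, (n, mean) in stats.items():
        if n == 0:
            # Never-sampled arms are tried first: pure exploration.
            return arm
        # Mean reward plus an exploration bonus that shrinks as the
        # arm accumulates samples relative to the total.
        score = mean + math.sqrt(2 * math.log(total_pulls) / n)
        if score > best_score:
            best_arm, best_score = arm, score
    return best_arm

# Hypothetical stats: site "a" looks good, "b" looks bad, "c" is untested.
stats = {"a": (10, 0.9), "b": (10, 0.1), "c": (0, 0.0)}
print(ucb1_select(stats, 20))  # "c": untested arms get queried first
```

The square-root bonus is what balances the two goals in the question: an arm with few pulls gets a large bonus (exploration), while a heavily sampled arm is scored almost entirely by its observed mean (exploitation), and the selection itself is cheap enough to run inside a tight per-query budget.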

