创建适合数据集的高效函数 [英] Creating an efficient function to fit a dataset

查看:77
本文介绍了创建适合数据集的高效函数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

基本上,我有一个4字节输入及其对应的4字节输出的大型数据集(最多可以得到100,000-150,000个值).不能保证输入是唯一的(这并不是一个真正的问题,因为我认为我可以生成伪随机数来对输入进行加法运算或异或运算,以使它们成为唯一的),但是并不能保证输出结果是唯一的两者都是唯一的(因此两组不同的输入可能具有相同的输出).

Basically I have a large (could get as large as 100,000-150,000 values) data set of 4-byte inputs and their corresponding 4-byte outputs. The inputs aren't guaranteed to be unique (which isn't really a problem because I figure I can generate pseudo-random numbers to add or xor the inputs with so that they do become unique), but the outputs aren't guaranteed to be unique either (so two different sets of inputs might have the same output).

我正在尝试创建一个对我的数据集中的值进行有效建模的函数.我不需要它进行有效的插值,甚至根本不需要插值(这就是说,我永远不会向它提供此静态数据集中未包含的输入).但是,它确实需要尽可能高效.我研究了插值法,发现它确实不符合我的需求.例如,大量的值意味着样条曲线插值将不起作用,因为它会为每个间隔创建一个多项式.

I'm trying to create a function that effectively models the values in my data-set. I don't need it to interpolate efficiently, or even at all (by this I mean that I'm never going to feed it an input that isn't contained in this static data-set). However it does need to be as efficient as possible. I've looked into interpolation and found that it doesn't really fit what I'm looking for. For example, the large number of values means that spline interpolation won't do since it creates a polynomial per interval.

另外,根据我的理解,多项式插值在计算上会太昂贵(n值表示多项式可以包含高达pow(x,n-1)的项.对于x = 4字节数且n = 100,000只是不可行).我已经尝试在线查找一段时间了,但是我对数学的了解不是很强,并且一定不知道要搜索的正确术语,因为到目前为止我还没有遇到类似的事情.

Also, from my understanding polynomial interpolation would be way too computationally expensive (n values means that the polynomial could include terms as high as pow(x,n-1). For x= a 4-byte number and n=100,000 it's just not feasible). I've tried looking online for a while now, but I'm not very strong with math and must not know the right terms to search with because I haven't come across anything similar so far.

我可以看到这并不是一个完整的编程问题,我事先表示歉意.我不是在寻找确切的解决方案,甚至不是完整的答案.我只需要指向需要阅读的主题的指针,这样我就可以自行解决此问题.谢谢!

I can see that this is not completely (to put it mildly) a programming question and I apologize in advance. I'm not looking for the exact solution or even a complete answer. I just need pointers on the topics that I would need to read up on so I can solve this problem on my own. Thanks!

TL; DR-我需要插值的一种变体,它只需要适合最初给定的数据点,但是计算效率很高.

TL;DR - I need a variant of interpolation that only needs to fit the initially given data-points, but which is computationally efficient.

一些澄清-我确实需要输出准确,而不是近似值.这是对我当前正在做的一些研究工作的一种优化,我需要在没有程序中实际输出字节的情况下实现此查找.目前我还不能说太多,但是我要说的是,出于我的工作目的,加密(或压缩或其他任何形式的混淆处理)不是隐藏表格的一种选择.我需要一个数学函数,只要它可以访问输入,就可以重新创建输出.我希望这能使事情变得更清楚.

Some clarification - I do need the output to be exact and not an approximation. This is sort of an optimization of some research work I'm currently doing and I need to have this look-up implemented without the actual bytes of the outputs being present in my program. I can't really say a whole lot about it at the moment, but I will say that for the purposes of my work, encryption (or compression or any other other form of obfuscation) is not an option to hide the table. I need a mathematical function that can recreate the output so long as it has access to the input. I hope that clears things up a bit.

推荐答案

这是一个主意.让您的函数成为所有4字节整数上线性函数的和(mod 2 32 ),这是一个分段线性函数,其分段取决于第一位的值;另一个分段线性函数,其分段取决于第一位的值取决于前两位的值,依此类推.

Here is one idea. Make your function be the sum (mod 232) of a linear function over all 4-byte integers, a piecewise linear function whose pieces depend on the value of the first bit, another piecewise linear function whose pieces depend on the value of the first two bits, and so on.

实际输出值无处显示,您必须将线性项加在一起才能得到它们.也没有直接记录您拥有哪些输入值. (有人可以推断出有关这些输入值的某些信息,而不是其实际值.)

The actual output values appear nowhere, you have to add together linear terms to get them. There is also no direct record of which input values you have. (Someone could conclude something about those input values, but not their actual values.)

您可以将所需的各种系数存储在哈希中.您在哈希中找不到的所有查询都假定为0.

The various coefficients you need can be stored in a hash. Any lookups you do which are not found in the hash are assumed to be 0.

如果在开始相当有效地编码之前在数据集中添加一定数量的随机噪声",将很难分辨出输入值是什么,甚至很难分辨出输出近似值是什么输入.

If you add a certain amount of random "noise" to your dataset before starting to encode it fairly efficiently, it would be hard to tell what your input values are, and very hard to tell what the outputs are even approximately without knowing the inputs.

这篇关于创建适合数据集的高效函数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆