Multi-Column Integer Ordering


Problem description

I don't know if I've chosen the appropriate title for this question (if not, please change it accordingly) but consider the following simplified table structure I'm working with:

----------------------------------------------
|  date  |  i  |  j  |  k  |  x  |  y  |  z  |
----------------------------------------------
| 100209 |  1  |  2  |  3  |  4  |  5  |  6  |
----------------------------------------------
| 100210 |  2  |  3  |  4  |  5  |  6  |  7  |
----------------------------------------------
| 100211 |  0  |  1  |  2  |  3  |  4  |  5  |
----------------------------------------------
| 100212 |  1  |  2  |  3  |  4  |  5  |  6  |
----------------------------------------------
| 100213 |  6  |  5  |  4  |  3  |  2  |  1  |
----------------------------------------------

i, j, k, x, y, z are all unrelated integers / floats; they all represent different factors and can have very different orders of magnitude (i can range from 1 - 10 while j can range from 100 - 1000).

I'm trying to select dates that share similar conditions: given a set of i, j, k, x, y, z values, I need to return all results ordered by closeness of all values as a whole. For instance, if i = 1, j = 2, k = 3, x = 4, y = 5 and z = 6, the query should return the following dates in this order:

  1. 100209
  2. 100212
  3. 100210
  4. 100211
  5. 100213

I'm not sure if this is relevant or not to the question, but some values (i, j, k) mean more is better while other values (x, y, z) mean the opposite: less is better.

How should I build such a query? Is this possible with SQL alone?


@Pentium10:

I'll try to answer your comment the best way I can. Here is a sample of my data:

---------------------------------------------------------------------------------
  date  |  temperature  |  humidity  |  pressure  |  windSpeed  |  moonDistance  
---------------------------------------------------------------------------------
 090206 |  7            |  87        |  998.8     |  3          |  363953        
---------------------------------------------------------------------------------
 ...... |  ...          |  ...       |  ....      |  ...        |  ......        
---------------------------------------------------------------------------------
 100206 |  10           |  86        |  1024      |  2          |  386342        
---------------------------------------------------------------------------------
 100207 |  9            |  90        |  1015      |  1          |  391750        
---------------------------------------------------------------------------------
 100208 |  13           |  90        |  1005      |  2          |  396392        
---------------------------------------------------------------------------------
 100209 |  12           |  89        |  1008      |  2          |  400157        
---------------------------------------------------------------------------------
 100210 |  11           |  92        |  1007      |  3          |  403012        
---------------------------------------------------------------------------------
 100211 |  6            |  86        |  1012      |  2          |  404984        
---------------------------------------------------------------------------------
 100212 |  6            |  61        |  1010      |  3          |  406135        
---------------------------------------------------------------------------------
 100213 |  7            |  57        |  1010      |  2          |  406542        
---------------------------------------------------------------------------------

My table structure has more columns and thousands of rows, but hopefully this will be enough to make my point clear. I'm not going to attempt to order these values like I did in my previous example because I would probably get it wrong, but I basically need to do two types of queries with this data:

  1. show me all dates, ordered by their resemblance to the conditions I provide
  2. show me all dates, ordered by their resemblance to the conditions observed on date X

I understand that the second query can easily be achieved by using the first one, but my problem lies in sorting by resemblance using several columns; that's what I meant by "ordered by closeness of all values as a whole". If I were dealing with only one column, it would be a lot easier to order by likeness, but my head really starts spinning when dealing with several columns.

The goal is to be able to produce results like this:

Today is really similar to d/m/yy, very similar to d/m/yy, somewhat similar to d/m/yy, ...

In my case I'm working with weather and atmospheric data, but if it helps I guess you can think of this problem in terms of employees (having attendance, hoursPerWeek and monthlySalary columns), ordered by how closely they resemble attendance = 100%, hoursPerWeek = 40 and monthlySalary = $5000, for instance.

PS: Now that I've given this employees example, I'm not really sure anymore whether it can be compared to the weather example I'm working with, since with the employees table you can compute a rating (rating = monthlySalary / hoursPerWeek * attendance, for instance) and kind of weight the columns, but I don't think the same can be done with the weather table. Input is appreciated anyhow.

PS2: I'm not sure if I expressed myself well enough; if you still have doubts, please let me know.


Bounty

There have been some good suggestions so far; however, none of them truly solves my problem. I'm setting up a bounty to hopefully gather even more possible solutions to this problem. Thanks.

Solution

The problem you seem to have is that each column has a different scale, so you can't easily combine them. This can be solved using a technique called whitening (in this simple per-column form it is also known as standardization or z-score normalization). It involves calculating the mean and standard deviation of each column (you could do this in one SQL statement) and then rescaling each column's values when selecting:

colSortPos = (colValue-colMean) / colStdev
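As a rough illustration of the rescaling formula above, here is a minimal Python sketch. The function names `column_stats` and `standardize` are made up for this example, the rows are a few of the sample values from the question, and the use of the population standard deviation (`pstdev`) is an assumption; the sample standard deviation would work just as well for ranking purposes.

```python
from statistics import mean, pstdev

# A few sample rows from the question: (date, temperature, humidity, pressure)
rows = [
    ("100206", 10, 86, 1024),
    ("100207", 9, 90, 1015),
    ("100208", 13, 90, 1005),
    ("100209", 12, 89, 1008),
]

def column_stats(rows):
    """Return one (mean, stdev) pair per value column (hypothetical helper)."""
    columns = list(zip(*[r[1:] for r in rows]))
    return [(mean(col), pstdev(col)) for col in columns]

def standardize(values, stats):
    """Apply (value - mean) / stdev to each value; constant columns map to 0."""
    return [(v - m) / s if s else 0.0 for v, (m, s) in zip(values, stats)]

stats = column_stats(rows)
scaled = [standardize(r[1:], stats) for r in rows]

for (date, *_), z in zip(rows, scaled):
    print(date, [round(x, 2) for x in z])
```

After this step every column has mean 0 and standard deviation 1, so temperature, humidity and pressure contribute on a comparable scale.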

Rescaling this way centers each column around 0, so that a value one standard deviation away from the mean maps to +/- 1. The trick then is to combine the columns so that similar dates end up together. This is not a 2-dimensional problem, so you need to think multidimensionally. My suggestion is to take the Euclidean distance as your sort order.

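-- @iMean, @iStdDev, etc. hold each column's mean and standard deviation, computed beforehand.
-- As written, this ranks rows by their standardized distance from the column means;
-- to rank by closeness to a given set of values, substitute those values for the means.
SELECT
    date,
    i,
    j,
    k,
    SQRT( POW((i-@iMean)/@iStdDev, 2) + POW((j-@jMean)/@jStdDev, 2) + POW((k-@kMean)/@kStdDev, 2) )
AS
    sort_order
FROM
    `table`  -- placeholder table name; quoted because TABLE is a reserved word in MySQL
ORDER BY
    sort_order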
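To show how the same idea answers the question's second query type (rank all dates by resemblance to the conditions observed on date X), here is a small Python sketch using the sample rows from the question. The function name `rank_by_closeness` is hypothetical, and treating every column as equally important is an assumption; per-column weights could be multiplied into each term.

```python
import math
from statistics import pstdev

# Sample rows from the question:
# (date, temperature, humidity, pressure, windSpeed, moonDistance)
ROWS = [
    ("090206", 7, 87, 998.8, 3, 363953),
    ("100206", 10, 86, 1024, 2, 386342),
    ("100207", 9, 90, 1015, 1, 391750),
    ("100208", 13, 90, 1005, 2, 396392),
    ("100209", 12, 89, 1008, 2, 400157),
    ("100210", 11, 92, 1007, 3, 403012),
    ("100211", 6, 86, 1012, 2, 404984),
    ("100212", 6, 61, 1010, 3, 406135),
    ("100213", 7, 57, 1010, 2, 406542),
]

def rank_by_closeness(rows, target):
    """Return (date, distance) pairs sorted by standardized Euclidean distance."""
    columns = list(zip(*[r[1:] for r in rows]))
    # Fall back to 1.0 so a constant column does not cause a division by zero.
    stdevs = [pstdev(col) or 1.0 for col in columns]
    ranked = []
    for date, *values in rows:
        # The column mean cancels out: (v - mean)/sd - (t - mean)/sd == (v - t)/sd
        dist = math.sqrt(
            sum(((v - t) / sd) ** 2 for v, t, sd in zip(values, target, stdevs))
        )
        ranked.append((date, dist))
    return sorted(ranked, key=lambda pair: pair[1])

# Rank every date against the conditions observed on 100213.
target = next(r[1:] for r in ROWS if r[0] == "100213")
ranking = rank_by_closeness(ROWS, target)

for date, dist in ranking:
    print(date, round(dist, 3))
```

Passing an arbitrary tuple of conditions as `target` covers the first query type in the same way. The distances could then be bucketed into labels such as "really similar" or "somewhat similar", although the thresholds for those labels would have to be chosen by hand.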

The only problem with this is that it projects your problem onto a 1-dimensional space, which may make you miss some correlations. To work around this, I suggest using a clustering technique like K-means, which is pretty simple to implement and really fast. This will let you group your dates into the k clusters that display the most similarity [ http://en.wikipedia.org/wiki/K-means_clustering ]. If you have the raw data and want to experiment with these (and other) techniques, I suggest trying the Weka toolkit [ http://www.cs.waikato.ac.nz/ml/weka/ ].
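For readers who want to see what K-means involves before reaching for a toolkit, below is a bare-bones sketch of Lloyd's algorithm in plain Python. It is only an illustration: the function name `kmeans`, the seed, and the six already-standardized (temperature, humidity) points are invented for this example and do not come from the question's data.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm; returns one cluster index per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the input points
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels

# Hypothetical standardized (temperature, humidity) values forming two obvious groups.
points = [
    (-1.0, -1.1), (-0.9, -1.0), (-1.1, -0.9),
    (1.0, 1.1), (0.9, 1.0), (1.1, 0.9),
]
labels = kmeans(points, k=2)
print(labels)
```

Dates that receive the same label fall into the same cluster. The columns should be standardized first, for the same reason as in the distance query: otherwise the column with the largest raw scale dominates the result.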
