MYSQL 按 HAVING 距离排序但无法分组? [英] MYSQL sorting by HAVING distance but not able to group?

查看:44
本文介绍了MYSQL 按 HAVING 距离排序但无法分组?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有这个查询

SELECT zip, 
( 3959 * acos( cos( radians(34.12520) ) * cos( radians( zip_info.latitude ) ) * cos(radians( zip_info.longitude ) - radians(-118.29200) ) + sin( radians(34.12520) ) * sin( radians( zip_info.latitude ) ) ) ) AS distance, 
user_info.*, office_locations.* 

FROM zip_info 

RIGHT JOIN office_locations ON office_locations.zipcode = zip_info.zip 

RIGHT JOIN user_info ON office_locations.doctor_id = user_info.id 

WHERE user_info.status='yes' 

HAVING distance < 50 ORDER BY distance ASC

输出

距离|医生_id |等

distance | doctor_id | etc.

7 --------------- 5 ------- 等等

7 --------------- 5 ------- etc

8 --------------- 4 ------- 等等

8 --------------- 4 ------- etc

34 --------------- 4 ------- 等等

34 --------------- 4 ------- etc

49 --------------- 5 ------- 等等

49 --------------- 5 ------- etc

当我选择 30 或更小的距离时,它也显示前两个结果,这很好.

When I select a distance of 30 or less, it shows the top two results as well, which is good.

问题:我不想为每个doctor_id显示一个以上的结果,所以我做了一个GROUP BY user_info.doctor_id,当距离小于50时不显示结果.出于某种原因,它希望将所有结果组,否则它将不起作用.有小费吗?你还有什么需要帮助我的吗?

The Problem : I do not want to show more than one result per doctor_id so I do a GROUP BY user_info.doctor_id, which shows no results when distance is less than 50. For some reason it wants to have all the results to group otherwise it won't work. Any tips? Anything else you need to help me out?

所以我想要的是

距离|医生_id |等

distance | doctor_id | etc.

7 --------------- 5 ------- 等等

7 --------------- 5 ------- etc

8 --------------- 4 ------- 等等

8 --------------- 4 ------- etc

即使它想给我所有 4 行的结果,我只想对它们进行分组,以便只显示每个唯一 user_info.doctor_id 距离最小的那些.请记住,距离是一个虚拟的不存在的表格.

Even though it wants to give me all 4 rows for results, I just want to group them so only the ones with smallest distance per unique user_info.doctor_id show up. Keep in mind distance is a virtual non existent table.

根据 llion 的查询,结果如下:

Based on llion's query here are the results:

 (concat(user_info.id))     zip     distance    id
          1                 NULL    6.6643992   1 

它只给出一个结果,为了让它起作用,我不得不再次将 AND 更改为 HAVING 距离.

It only gives one result, and in order to get it to work, I had to change the AND to HAVING distance again.

推荐答案

我不相信 GROUP BY 会给你想要的结果.不幸的是,MySQL 不支持分析函数(这是我们在 Oracle 或 SQL Server 中解决此问题的方法.)

I don't believe a GROUP BY is going to give you the result you want. And unfortunately, MySQL does not support analytic functions (which is how we would solve this problem in Oracle or SQL Server.)

通过使用用户定义的变量,可以模拟一些基本的分析函数.

It's possible to emulate some rudimentary analytic functions, by making use of user-defined variables.

在这种情况下,我们要模拟:

In this case, we want to emulate:

ROW_NUMBER() OVER(PARTITION BY doctor_id ORDER BY distance ASC) AS seq

因此,从原始查询开始,我更改了 ORDER BY,使其首先按 doctor_id 排序,然后按计算出的 distance 排序.(在我们知道这些距离之前,我们不知道哪个最近".)

So, starting with the original query, I changed the ORDER BY so that it sorts on doctor_id first, and then on the calculated distance. (Until we know those distances, we don't know which one is "closest".)

有了这个排序结果,我们基本上为每个doctor_id编号"了行,最接近的一个为1,第二个最接近的为2,依此类推.当我们得到一个新的doctor_id时,我们从最接近的为1重新开始.

With this sorted result, we basically "number" the rows for each doctor_id, the closest one as 1, the second closest as 2, and so on. When we get a new doctor_id, we start again with the closest as 1.

为了实现这一点,我们使用了用户定义的变量.我们使用一个来分配行号(变量名是@i,返回的列有别名seq).我们用来记住"前一行的doctor_id的另一个变量,因此我们可以检测到doctor_id中的中断",因此我们可以知道何时再次从1重新开始行编号.

To accomplish this, we make use of user-defined variables. We use one for assigning the row number (the variable name is @i, and returned column has the alias seq). The other variable we use to "remember" the doctor_id from the previous row, so we can detect a "break" in the doctor_id, so we can know when to restart the row numbering at 1 again.

这是查询:

SELECT z.*
, @i := CASE WHEN z.doctor_id = @prev_doctor_id THEN @i + 1 ELSE 1 END AS seq
, @prev_doctor_id := z.doctor_id AS prev_doctor_id
FROM
(

  /* original query, ordered by doctor_id and then by distance */
  SELECT zip, 
  ( 3959 * acos( cos( radians(34.12520) ) * cos( radians( zip_info.latitude ) ) * cos(radians( zip_info.longitude ) - radians(-118.29200) ) + sin( radians(34.12520) ) * sin( radians( zip_info.latitude ) ) ) ) AS distance, 
  user_info.*, office_locations.* 
  FROM zip_info 
  RIGHT JOIN office_locations ON office_locations.zipcode = zip_info.zip 
  RIGHT JOIN user_info ON office_locations.doctor_id = user_info.id 
  WHERE user_info.status='yes' 
  ORDER BY user_info.doctor_id ASC, distance ASC

) z JOIN (SELECT @i := 0, @prev_doctor_id := NULL) i
HAVING seq = 1 ORDER BY z.distance

<小时>

我假设原始查询正在返回您需要的结果集,它只是有太多行,并且您想消除除最近的"(具有最小值距离的行)之外的所有行每个医生_id.


I'm making an assumption that the original query is returning the result set you need, it just has too many rows, and you want to eliminate all but the "closest" (the row with the minimum value of distance) for each doctor_id.

我已将您的原始查询包装在另一个查询中;我对原始查询所做的唯一更改是按doctor_id 然后按距离对结果进行排序,并删除HAVING distance <50 子句.(如果您只想返回小于 50 的距离,那么继续并将该子句留在那里.不清楚这是您的意图,还是为了将每个医生 ID 的行限制为一个而指定的.)

I've wrapped your original query in another query; the only changes I made to the original query was to order the results by doctor_id and then by distance, and to remove the HAVING distance < 50 clause. (If you only want to return distances less than 50, then go ahead and leave that clause there. It wasn't clear whether that was your intent, or whether that was specified in an attempt to limit rows to one per doctor_id.)

需要注意的几个问题:

替换查询返回两个额外的列;这些在结果集中并不是真正需要的,除了作为生成结果集的手段.(可以将整个 SELECT 再次包装在另一个 SELECT 中以省略这些列,但这确实比它的价值更混乱.我只会检索这些列,并且知道我可以忽略它们.)

The replacement query returns two additional columns; these aren't really needed in the result set, except as means to generate the result set. (It's possible to wrap this whole SELECT again in another SELECT to omit those columns, but that is really more messy than it's worth. I would just retrieve the columns, and know that I can ignore them.)

另一个问题是在内部查询中使用 .* 有点危险,因为我们确实需要保证该查询返回的列名是唯一的.(即使列名现在是不同的,向其中一个表添加一列可能会在查询中引入模棱两可"的列异常.最好避免这种情况,这可以通过替换 轻松解决.* 带有要返回的列列表,并为任何重复"列名指定别名.(在外部查询中使用 z.* 不是问题,只要我们能控制 z 返回的列.)

The other issue is that the use of the .* in the inner query is a bit dangerous, in that we really need to guarantee that the column names returned by that query are unique. (Even if the column names are distinct right now, the addition of a column to one of those tables could introduce an "ambiguous" column exception in the query. It's best to avoid that, and that's easily addressed by replacing the .* with the list of columns to be returned, and specifying an alias for any "duplicate" column name. (The use of the z.* in the outer query is not a concern, as long as we are in control of the columns returned by z.)

附录:

我注意到 GROUP BY 不会为您提供所需的结果集.虽然可以通过使用 GROUP BY 的查询获取结果集,但返回 CORRECT 结果集的语句将是乏味的.您可以指定 MIN(distance) ... GROUP BY doctor_id,这将使您获得最小距离,但不能保证 SELECT 列表中的其他非聚合表达式来自距离最小的行,而不是其他行.(MySQL 在 GROUP BY 和聚合方面非常自由.为了让 MySQL 引擎更加谨慎(并与其他关系数据库引擎保持一致),SET sql_mode = ONLY_FULL_GROUP_BY

I noted that a GROUP BY wasn't going to give you the result set you needed. While it would be possible to get the result set with a query using GROUP BY, a statement that returns the CORRECT result set would be tedious. You could specify MIN(distance) ... GROUP BY doctor_id, and that would get you the smallest distance, BUT there is no guarantee that the other non-aggregate expressions in the SELECT list would be from the row with the minimum distance, and not some other row. (MySQL is dangerously liberal in regards to GROUP BY and aggregates. To get the MySQL engine to be more cautious (and in line with other relational database engines), SET sql_mode = ONLY_FULL_GROUP_BY

附录 2:

Darious 报告的性能问题某些查询需要 7 秒."

Performance Issues reported by Darious "some queries take 7 seconds."

为了加快速度,您可能希望缓存函数的结果.基本上,建立一个查找表.例如

To speed things up, you probably want to cache the results of the function. Basically, build a lookup table. e.g.

CREATE TABLE office_location_distance
( office_location_id INT UNSIGNED NOT NULL COMMENT 'PK, FK to office_location.id'
, zipcode_id         INT UNSIGNED NOT NULL COMMENT 'PK, FK to zipcode.id'
, gc_distance        DECIMAL(18,2)         COMMENT 'calculated gc distance, in miles'
, PRIMARY KEY (office_location_id, zipcode_id)
, KEY (zipcode_id, gc_distance, office_location_id)
, CONSTRAINT distance_lookup_office_FK
  FOREIGN KEY (office_location_id) REFERENCES office_location(id)
  ON UPDATE CASCADE ON DELETE CASCADE
, CONSTRAINT distance_lookup_zipcode_FK
  FOREIGN KEY (zipcode_id) REFERENCES zipcode(id)
  ON UPDATE CASCADE ON DELETE CASCADE
) ENGINE=InnoDB

这只是一个想法.(我希望您正在搜索距特定邮政编码的 office_location 距离,因此 (zipcode, gc_distance, office_location_id) 上的索引是您的查询需要的覆盖索引.(我会避免将计算出的距离存储为 FLOAT,因为较差使用 FLOAT 数据类型查询性能)

That's just an idea. (I expect that you are searching for office_location distance from a particular zipcode, so the index on (zipcode, gc_distance, office_location_id) is the covering index your query would need. (I would avoid storing the calculated distance as a FLOAT, due to poor query performance with FLOAT datatype)

INSERT INTO office_location_distance (office_location_id, zipcode_id, gc_distance)
SELECT d.office_location_id
     , d.zipcode_id
     , d.gc_distance
  FROM (
         SELECT l.id AS office_location_id
              , z.id AS zipcode_id
              , ROUND( <glorious_great_circle_calculation> ,2) AS gc_distance
           FROM office_location l
          CROSS
           JOIN zipcode z
          ORDER BY 1,3
       ) d
ON DUPLICATE KEY UPDATE gc_distance = VALUES(gc_distance)

通过缓存和索引函数结果,您的查询应该会更快.

With the function results cached and indexed, your queries should be much faster.

SELECT d.gc_distance, o.*
  FROM office_location o
  JOIN office_location_distance d ON d.office_location_id = o.id
 WHERE d.zipcode_id = 63101
   AND d.gc_distance <= 100.00
 ORDER BY d.zipcode_id, d.gc_distance

我对在 INSERT/UPDATE 上添加 HAVING 谓词到缓存表犹豫不决;(如果您的纬度/经度有误,并且计算出了 100 英里以下的错误距离;在纬度/经度之后的后续运行是固定的,距离计算为 1000 英里......如果该行从查询中排除,那么缓存表中的现有行将不会得到更新.(您可以清除缓存表,但这并不是真正必要的,这只是数据库和日志的大量额外工作.如果维护查询的结果集太大,它可以分解为针对每个邮政编码或每个 office_location 迭代运行.)

I am hesitant about adding a HAVING predicate on the INSERT/UPDATE to the cache table; (if you had a wrong latitude/longitude, and had calculated an erroneous distance under 100 miles; a subsequent run after the lat/long is fixed and the distance works out to 1000 miles... if the row is excluded from the query, then existing row in the cache table won't get updated. (You could clear the cache table, but that's not really necessary, that's just a lot of extra work for the database and logs. If the result set of the maintenance query is too large, it could be broken down to run iteratively for each zipcode, or each office_location.)

另一方面,如果您对某个值上的任何距离不感兴趣,您可以添加 HAVING gc_distance < 谓词,并大大减少缓存表的大小.

On the other hand, if you aren't interested in any distances over a certain value, you could add the HAVING gc_distance < predicate, and cut down the size of the cache table considerably.

这篇关于MYSQL 按 HAVING 距离排序但无法分组?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆