用于大量数据的自定义排序算法 [英] Custom sorting algorithm for a large amount of data

查看:181
本文介绍了用于大量数据的自定义排序算法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有大量的数据需要根据搜索查询以特定的方式进行排序,但我不确定采取哪种最佳方法。

I have a large amount of data that I need to sort in a specific way based on a search query, but I'm not sure of the best approach to take.

我要排序的数据是按学校分组的课程列表。每个课程由一所学校教授。每个学校可能属于任何数量的伙伴关系,这代表了一些学校之间的关系。用户可以按课程名称搜索任意数量的课程。

The data I'm trying to sort is a list of courses, grouped by schools. Each course is taught by one school. Each school may belong to any number of "partnerships", which represents a relationship between a number of schools. A user can search for any number of courses by the name of the course.

我需要按如下方式对数据进行排序:

I need to sort the data as follows:


  • 课程按学校分组,每页显示10所学校。

  • Courses are grouped by school, with 10 schools appearing per page.

可以提供的学校用户搜索的每个课程应该首先列在列表中。

Schools that can provide every course that the user has searched for should appear first in the list.

在这些结果之后,属于可以容纳所有课程的合作伙伴关系的学校搜索的用户应该彼此相邻。

After these results, schools that belong to a partnership that can accommodate all of the courses the user searched for should appear next to each other.

以下是一个示例:


  • A 教会历史,法语和英语课程。

  • B 教导法语和数学。

  • C 教导历史。

  • B C 共同合作。

  • D 教导历史。

  • A teaches History, French and English courses.
  • B teaches French and Mathematics.
  • C teaches History.
  • B and C are in a partnership together.
  • D teaches History.

用户搜索历史和法语。

A 应该首先出现在e结果,其历史和法语课程,因为它可以提供用户正在寻找的两个课程。

A should appear first in the results, with its History and French courses, as it can provide both of the courses the user is looking for.

B ,接下来是 C ,相关课程将在之后列出,因为合作伙伴关系可以提供用户所需的两门课程。

B, followed by C appears next, with the relevant course it teaches listed after it, as the partnership can provide both of the user's required courses.

D 接下来出现,因为它只提供1个相关课程。

D Appears next as it only provides 1 relevant course.

数据存储在几个表中的Microsoft SQL Server数据库中。这是一个简化的架构:

The data is stored in an Microsoft SQL Server database across a few tables. Here is a simplified schema:

课程:


  • int id

  • varchar name

  • int schoolId

学校


  • int id

  • varchar name

合作伙伴:


  • int id

  • varchar partnershipName

SchoolPartnership:


  • int id

  • int schoolId

  • int partnershipId

有超过10万门课程和大约300所学校。我不知道如何在SQL中指定的方式排序课程,我认为这是我最大的问题。我只需要每页显示10个结果,但是由于我无法在SQL查询中进行排序,所以我必须提取整个结果集,然后在PHP中手动排序,然后我才能将结果设置为10个结果。

There are over 100000 courses and around 300 schools. I don't know of a way to sort the courses as specified in SQL, which I think is my biggest problem. I only need to display 10 results per page, but as I can't do the sorting in the SQL query, I have to extract the entire result set and sort it manually in PHP before I can cut the result set down to 10 results.

我正在使用Doctrine 2使用多个连接在单个查询中提取需要的数据,将结果作为数组进行保湿。那么这个计划就是在PHP中操纵这个大量的记录,使其顺序正确。由于这个数组的大小,我担心这个排序过程会很慢,所以我正在寻找如何使这个更快的建议,方法是:

I'm currently extracting the data I need in a single query with multiple joins using Doctrine 2, hydrating the results as an array. Then the plan is to manipulate this big array of records in PHP to get it into the correct order. Due to the size of this array, I'm worried that this sorting process is going to be very slow, so I'm looking for advice on how to make this quicker, either by:


  • 处理SQL查询中的排序。

  • 建议如何在诸如Solr之类的搜索引擎中实现所描述的算法(我有一点经验的基础知识,但不执行复杂的排序)。

  • 建议如何最好地执行排序在PHP,如果其他两个选项是不可行的。

编辑:

在这方面取得了一些进展,谢谢(特别是@Neil)。我已经单独提出了一个问题(子查询上的分组MAX()),

I've made some good progress on this, thanks (particularly @Neil). I've opened a separate question up ( Groupwise MAX() on a subquery ), which contains some of my progress so far.

推荐答案

通过匹配课程的数量找到学校很简单:

To find schools by the number of matching courses is simple:

SELECT schoolId, COUNT(*) AS schoolCount
  FROM Courses
  WHERE name IN ('History', 'French')
  GROUP BY schoolId

如果这是你需要的,你可以 ORDER BY schoolCount DESC 按照您要求的顺序获取它们。

If this was all you needed, you could ORDER BY schoolCount DESC to get them in the order you want.

要找到与匹配课程的合作伙伴关系,您首先需要找到至少有一所学校的合作伙伴关系:

To find partnerships with matching courses, you first need to find the partnerships that have the course at at least one school:

SELECT partnershipId, COUNT(DISTINCT name) AS partnershipCount
  FROM SchoolPartnership
  INNER JOIN Courses ON Course.schoolId = SchoolPartnership.schoolId
  WHERE name IN ('History', 'French')
  GROUP BY partnershipId

请注意, DISTINCT 是需要的,因为我们不在意合伙中有多少所学校有这样的课程。如果您没有 DISTINCT ,则可以使用子选择:

Note that DISTINCT is needed because we don't care how many schools in the partnership have that course. If you don't have DISTINCT then you could use a subselect instead:

SELECT partnershipId, COUNT(*) AS partnershipCount
  FROM (
    SELECT DISTINCT partnershipId, name
      FROM SchoolPartnership
      INNER JOIN Courses ON Course.schoolId = SchoolPartnership.schoolId
      WHERE name IN ('History', 'French'))
  GROUP BY partnershipId

然后,您可以在与SchoolPartnership的加入中使用上面的第一个和最后一个查询作为子选择,以按照伙伴关系匹配和校准的降序对学校进行排序。 (请注意,我认为所有学校都是至少有一所学校的合作伙伴)。我认为最终的查询将如下所示:

You can then use the first and last query above as subselects in a join with SchoolPartnership to order the schools in descending order of partnershipMatches and schoolMatches. (Note that I assume that all schools are in a partnership of at least one school.) I think the final query will look like this:

SELECT SchoolMatches.schoolID
  FROM (
    SELECT schoolId, COUNT(*) AS schoolCount
      FROM Courses
      WHERE name IN ('History', 'French')
      GROUP BY schoolId
  ) SchoolMatches
  JOIN SchoolPartnership ON SchoolMatches.schoolID = SchoolPartnership.schoolID
  JOIN (
    SELECT partnershipId, COUNT(DISTINCT name) AS partnershipCount
      FROM SchoolPartnership
      INNER JOIN Courses ON Course.schoolId = SchoolPartnership.schoolId
      WHERE name IN ('History', 'French')
      GROUP BY partnershipId
   ) PartnershipMatches ON SchoolPartnership.schoolId = PartnershipMatches.schoolId
   ORDER BY PartnershipMatches.partnershipCount DESC, SchoolMatches.SchoolCount DESC

这篇关于用于大量数据的自定义排序算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆