如何优化/简化Django对象的堆排序? (不能使用模块和数据库排序) [英] How to optimize/simplify heapsorting django objects? (can't use modules and db sorts)

查看:79
本文介绍了如何优化/简化Django对象的堆排序? (不能使用模块和数据库排序)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要为我作为django实习考试而得到的作业寻求帮助.我必须用兔子和它们的胡萝卜制作和想象的api.每只兔子都应该有许多胡萝卜,但是必须将api设计为允许轻松添加其他种类的蔬菜.我拒绝了每种蔬菜的整数字段,而是选择了具有蔬菜类型和值的蔬菜对象.

I have to ask for some help with an assignment I got as a test for a django internship. I had to make and imaginary api with rabbits and their carrots. Each rabbit was supposed to have a number of carrots, but the api had to be designed to allow for easy addition of other kind of vegetable. I rejected integer field for each vegetable and instead went for vegetable object with type and value of vegetable.

问题是,任务还包括列出按胡萝卜分类的兔子,降落的兔子.他们要我实现堆排序,不允许数据库排序,没有外部库.尽管我对此没有任何问题,但他们在考虑到时间限制时遇到了麻烦-将20 000只兔子在30秒以内分类,理想情况下是5秒.并且已经用了200只兔子花了5秒(只需排序并序列化为json).

Problem is, the assignment also included listing the rabbits sorted by carrots, descending. They wanted me to implement heapsort, no database sort was allowed, no external libs. While i had no problem with that, I am having trouble with time constraints they thought of - for 20 000 rabbits to be sorted in under 30 seconds, ideally 5 seconds. And it already takes 5 seconds with 200 rabbits(just sorting and serializing to json).

我创建一个查询集,该查询集仅包含带有胡萝卜"蔬菜的兔子.然后,我将其强制进入普通列表并在其上运行heapsort函数.

I make a queryset that has only rabbits with "carrots" vegetables. Then I force it into normal list and run heapsort function on it.

我需要如何将其更改为更快?可能吗?如果有人帮助我,我将非常高兴.预先谢谢您!

How would I need to change it to be faster? Is it even possible? I will be very happy if someone helps even a bit. Thank You in advance!

我的模特:

class Bunny(models.Model):
    """Bunny model for bunny usage"""
    def __str__(self):
        return self.name + " " + str(list(self.vegetables.all()))

    name = models.CharField("Name", max_length=50)
    userAccount = models.ForeignKey(User, on_delete=models.CASCADE)

    def getVegetable(self, vegetableType):
        for vegetable in self.vegetables.all():
            if vegetable.vegetableType == vegetableType:
                return vegetable
        return False


class Vegetable(models.Model):
    """Vegetable model for storing vegetable counts"""
    def __str__(self):
        return self.vegetableType + ":" + str(self.value)

    vegetableType = models.CharField(max_length=30, choices=vegetableChoices)
    value = models.PositiveIntegerField(default=0, validators=[MinValueValidator(0)])
    bunny = models.ForeignKey(Bunny, related_name="vegetables", on_delete=models.CASCADE)

我的堆排序函数:

def heapsort(bunnies, vegetableType):
    """Heapsort function for bunnies, works in place, descending"""

    for start in range((len(bunnies) - 2) // 2, -1, -1):
        siftdown(bunnies, start, len(bunnies) - 1, vegetableType)

    for end in range(len(bunnies) - 1, 0, -1):
        bunnies[end], bunnies[0] = bunnies[0], bunnies[end]
        siftdown(bunnies, 0, end - 1, vegetableType)
    return bunnies


def siftdown(bunnies, start, end, vegetableType):
    """helper function for heapsort"""
    root = start
    while True:
        child = root * 2 + 1
        if child > end:
            break
        if child + 1 <= end and bunnies[child].vegetables.get(vegetableType=vegetableType).value > bunnies[
                    child + 1].vegetables.get(vegetableType=vegetableType).value:
            child += 1
        if bunnies[root].vegetables.get(vegetableType=vegetableType).value > bunnies[child].vegetables.get(
                vegetableType=vegetableType).value:
            bunnies[root], bunnies[child] = bunnies[child], bunnies[root]
            root = child
        else:
            break

还有他们要求的性能测试(我不知道有更好的方法.只创建兔子需要很长时间)

And also the performance test they asked for(I do not know of a better way. Just creating bunnies takes a long time)

def test_20000_rabbits_performance(self):
    print("Creating bunnies")
    register20000Bunnies()

    print("Created bunnies")
    timestart = time()

    url = reverse("api:list", args=["carrots"])

    response = self.client.get(url)
    timeMeasured = time() - timestart
    print("Sorted. Took: " + str(timeMeasured))

    self.assertEqual(response.status_code, status.HTTP_200_OK)

我的观点:

@api_view(["GET"])
def bunnyList(request, vegetableType):
""" Displays heap-sorted list of bunnies, in decreasing order.
    Takes word after list ("/list/xxx") as argument to determine
    which vegetable list to display"""
    if vegetableType in vegetablesChoices:
        bunnies =
    Bunny.objects.filter(vegetables__vegetableType=vegetableType)
        bunnies = list(bunnies)  # force into normal list

        if len(bunnies) == 0:
            return Response({"No bunnies": "there is %d bunnies with this vegetable" % len(bunnies)},
                        status=status.HTTP_204_NO_CONTENT)

        heapsort(bunnies, vegetableType)
        serialized = BunnySerializerPartial(bunnies, many=True)
        return Response(serialized.data, status=status.HTTP_200_OK)
    else:
        raise serializers.ValidationError("No such vegetable. Available are: " + ", ".join(vegetablesChoices))

现在检查一下,当前排序需要1202秒……我的机器是2核1.86GHz,但是还是.

just checked now, currently it takes 1202 seconds to sort... My machine is 2 core 1.86GHz, but still.

Edit2,新代码:

Edit2, new code:

@api_view(["GET"])
def bunnyList(request, vegetableType):
""" Displays heap-sorted list of bunnies, in decreasing order.
    Takes word after list ("/list/xxx") as argument to determine
    which vegetable list to display"""
if vegetableType in vegetablesChoices:
    vegetables =  Vegetable.objects.filter(vegetableType=vegetableType).select_related('bunny')
    vegetables = list(vegetables)

    if len(vegetables) == 0:
        return Response({"No bunnies": "there is 0 bunnies with this vegetable"},
                        status=status.HTTP_204_NO_CONTENT)

    heapsort(vegetables)

    bunnies = [vegetable.bunny for vegetable in vegetables]
    serialized = BunnySerializerPartial(bunnies, many=True)
    return Response(serialized.data, status=status.HTTP_200_OK)
else:
    raise serializers.ValidationError("No such vegetable. Available are: " + ", ".join(vegetablesChoices))

更新的堆排序:

def heapsort(vegetables):
"""Heapsort function for vegetables, works in place, descending"""

for start in range((len(vegetables) - 2) // 2, -1, -1):
    siftdown(vegetables, start, len(vegetables) - 1)

for end in range(len(vegetables) - 1, 0, -1):
    vegetables[end], vegetables[0] = vegetables[0], vegetables[end]
    siftdown(vegetables, 0, end - 1)
return vegetables


def siftdown(vegetables, start, end):
"""helper function for heapsort"""
root = start
while True:
    child = root * 2 + 1
    if child > end:
        break
    if child + 1 <= end and vegetables[child].value > vegetables[child+1].value:
        child += 1
    if vegetables[root].value > vegetables[child].value:
        vegetables[root], vegetables[child] = vegetables[child], vegetables[root]
        root = child
    else:
        break

我的序列化器:

class BunnySerializerPartial(serializers.ModelSerializer):
"""Used in list view, mirrors BunnySerializerFull but without account details"""
    vegetables = VegetableSerializer(many=True)

    class Meta:
        model = Bunny
        fields = ("name", "vegetables")


class VegetableSerializer(serializers.ModelSerializer):
"""Used for displaying vegetables, for example in list view"""
    class Meta:
        model = Vegetable
        fields = ("vegetableType", "value")

并从工具栏查询:

SELECT ••• FROM "zajaczkowskiBoardApi_vegetable" INNER JOIN "zajaczkowskiBoardApi_bunny" ON ("zajaczkowskiBoardApi_vegetable"."bunny_id" = "zajaczkowskiBoardApi_bunny"."id") WHERE "zajaczkowskiBoardApi_vegetable"."vegetableType" = '''carrots'''


SELECT ••• FROM "zajaczkowskiBoardApi_vegetable" WHERE "zajaczkowskiBoardApi_vegetable"."bunny_id" = '141'

第二次重复2万次

推荐答案

这是经典的N + 1查询问题.您执行单个查询以获取所有兔子,但是接着继续对每个兔子执行bunnies[child].vegetables.get(vegetableType=vegetableType),这将对每个兔子执行附加查询,从而执行附加数据库往返.因此,您对N个兔子执行1个查询,再对N个查询进行大约一次查询,以获取所有蔬菜(因此为N + 1).

This is the classic N+1 queries problem. You perform a single query to fetch all the bunnies, but then you go on to do bunnies[child].vegetables.get(vegetableType=vegetableType) for each bunny, which performs an additional query, and thus an additional database roundtrip, for each bunny. So you perform 1 query for N bunnies, plus around N queries to get all the vegetables (hence the N+1).

数据库往返是Web开发人员可以使用的最昂贵的资源之一.虽然比较花费的时间大约为纳秒,但数据库往返花费的时间则为毫秒.进行约20K次查询,这很快就会花费几分钟.

Database roundtrips are one of the most expensive resource available to web developers. While comparisons take somewhere in the order of nanoseconds, a database roundtrip takes in the order of milliseconds. Do ~20K queries and this will soon add up to take several minutes.

快速的解决方案是使用prefetch_related('vegetables'),而专门使用bunny.getVegetable('carrot')来获得胡萝卜. prefetch_related()将执行单个查询以获取 all 兔子的所有蔬菜并缓存它们,因此在getVegetables()中迭代self.vegetables.all()不会执行任何其他查询.

The quick solution is to use prefetch_related('vegetables') and exclusively use bunny.getVegetable('carrot') to get the carrot. prefetch_related() will perform a single query to get all the vegetables for all bunnies and cache them, so iterating self.vegetables.all() in getVegetables() won't perform any additional queries.

不过,还有更好的解决方案.在这种情况下,似乎每个兔子最多应具有特定vegetableType的1个Vegetable对象.如果在数据库级别实施此操作,则当有人决定向兔子添加第二个'carrot'类型的Vegetable时,您不必担心排序算法中的错误.相反,数据库将首先阻止他们执行此操作.为此,您需要一个unique_together约束:

There are better solutions, though. In this case, it seems that each bunny should have at most 1 Vegetable object of a specific vegetableType. If you enforce this at the database level, you won't have to worry about errors in your sorting algorithm when someone decides to add a second Vegetable of type 'carrot' to a bunny. Instead, the database will stop them from doing that in the first place. To do this, you need a unique_together constraint:

class Vegetable(models.Model):
    ...
    class Meta:
        unique_together = [
            ('vegetableType', 'bunny'),
        ]

然后,您可以获取所有类型为"carrot"和 加入 相关的兔子.现在您只有一个查询:

Then, rather than fetching all bunnies and prefetching all related vegetables, you can fetch all vegetables of type "carrot" and join the related bunnies. Now you will only have a single query:

carrots = Vegetable.objects.filter(vegetableType='carrot').select_related('bunny')

由于vegetableTypebunny的组合是唯一的,因此您不会得到任何重复的兔子,并且仍然会得到所有带有一些胡萝卜的兔子.

Since the combination of vegetableType and bunny is unique, you won't get any duplicate bunnies, and you will still get all bunnies that have some carrots.

当然,您必须调整算法以适用于蔬菜,而不是兔子.

Of course you'd have to adapt your algorithm to work with the vegetables rather than the bunnies.

这篇关于如何优化/简化Django对象的堆排序? (不能使用模块和数据库排序)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆