Django全文搜索优化-Postgres [英] Django Full Text Search Optimization - Postgres

查看:73
本文介绍了Django全文搜索优化-Postgres的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试利用Django(v2.1)和Postgres(9.5)创建用于地址自动完成功能的全文搜索,但是该性能目前不适合自动完成功能,我不知道了解我获得的性能结果背后的逻辑。对于信息表是相当大的,有1400万行。

I am trying to create a Full Text Search for an address autocomplete feature, leveraging Django (v2.1) and Postgres (9.5), but the performance is not suitable for an auto-complete at the moment and I don't get the logic behind the performance results I get. For info the table is quite sizeable, with 14 million rows.

我的模型:

from django.db import models
from postgres_copy import CopyManager
from django.contrib.postgres.indexes import GinIndex

class Addresses(models.Model):
date_update = models.DateTimeField(auto_now=True, null=True)
longitude = models.DecimalField(max_digits=9, decimal_places=6 , null=True)
latitude = models.DecimalField(max_digits=9, decimal_places=6 , null=True)
number = models.CharField(max_length=16, null=True, default='')
street = models.CharField(max_length=60, null=True, default='')
unit = models.CharField(max_length=50, null=True, default='')
city = models.CharField(max_length=50, null=True, default='')
district = models.CharField(max_length=10, null=True, default='')
region = models.CharField(max_length=5, null=True, default='')
postcode = models.CharField(max_length=5, null=True, default='')
addr_id = models.CharField(max_length=20, unique=True)
addr_hash = models.CharField(max_length=20, unique=True)
objects = CopyManager()

class Meta:
    indexes = [
        GinIndex(fields=['number', 'street', 'unit', 'city', 'region', 'postcode'], name='search_idx')
    ]

我创建了一个小测试,用于根据搜索中的单词数检查性能:

I created a little test to check performance based on number of words in the search:

    search_vector = SearchVector('number', 'street', 'unit', 'city', 'region', 'postcode')

    searchtext1 = "north"
    searchtext2 = "north bondi"
    searchtext3 = "north bondi blair"
    searchtext4 = "north bondi blair street 2026"

    print('Test1: 1 word')
    start_time = time.time()
    result = AddressesAustralia.objects.annotate(search=search_vector).filter(search=searchtext1)[:10]
    #print(len(result))
    time_exec = str(timedelta(seconds=time.time() - start_time))
    print(time_exec)
    print(' ')

    #print(AddressesAustralia.objects.annotate(search=search_vector).explain(verbose=True))

    print('Test2: 2 words')
    start_time = time.time()
    result = AddressesAustralia.objects.annotate(search=search_vector).filter(search=searchtext2)[:10]
    #print(len(result))
    time_exec = str(timedelta(seconds=time.time() - start_time))
    print(time_exec)
    print(' ')

    print('Test3: 3 words')
    start_time = time.time()
    result = AddressesAustralia.objects.annotate(search=search_vector).filter(search=searchtext3)[:10]
    #print(len(result))
    time_exec = str(timedelta(seconds=time.time() - start_time))
    print(time_exec)
    print(' ')

    print('Test4: 5 words')
    start_time = time.time()
    result = AddressesAustralia.objects.annotate(search=search_vector).filter(search=searchtext4)[:10]
    #print(len(result))
    time_exec = str(timedelta(seconds=time.time() - start_time))
    print(time_exec)
    print(' ')

我得到以下结果,这似乎很正确:

I get the following results, which seems quite correct:

Test1: 1 word
0:00:00.001841

Test2: 2 words
0:00:00.001422

Test3: 3 words
0:00:00.001574

Test4: 5 words
0:00:00.001360

但是,如果我取消注释print(len(result))行,则会得到以下结果:

However if I uncomment the print(len(results)) lines, I get the following results:

Test1: 1 word
10
0:00:00.046392

Test2: 2 words
10
0:00:06.544732

Test3: 3 words
10
0:01:12.367157

Test4: 5 words
10
0:01:17.786596

这显然不适合自动完成功能。

This is obviously not suitable for an autocomplete feature.

有人可以解释为什么对查询集结果执行操作时需要更长的执行时间吗?看来数据库检索总是很快速,但是遍历结果要花时间,这对我来说没有意义,因为我将结果限制为10,返回的queryset总是大小相同。

Could someone be able to explain why the execution takes longer when performing an operation on the queryset result? It seems the Database retrieval is always fast but then going through the results takes time, which does not make sense to me as because I'm limiting the results to 10, the queryset returned is always the same size.

此外,尽管我创建了GIN索引,但似乎未使用该索引。看来它已经正确创建了:

Also, although I have created a GIN index, this index does not seem to be used. It seems it has been created correctly:

=# \d public_data_au_addresses
                                   Table 
"public.public_data_au_addresses"
Column    |           Type           | Collation | Nullable |                            
Default                            
-------------+--------------------------+-----------+----------+------ 
---------------------------------------------------------
id          | integer                  |           | not null | 
nextval('public_data_au_addresses_id_seq'::regclass)
date_update | timestamp with time zone |           |          | 
longitude   | numeric(9,6)             |           |          | 
latitude    | numeric(9,6)             |           |          | 
number      | character varying(16)    |           |          | 
street      | character varying(60)    |           |          | 
unit        | character varying(50)    |           |          | 
city        | character varying(50)    |           |          | 
district    | character varying(10)    |           |          | 
region      | character varying(5)     |           |          | 
postcode    | character varying(5)     |           |          | 
addr_id     | character varying(20)    |           | not null | 
addr_hash   | character varying(20)    |           | not null | 
Indexes:
"public_data_au_addresses_pkey" PRIMARY KEY, btree (id)
"public_data_au_addresses_addr_hash_key" UNIQUE CONSTRAINT, btree (addr_hash)
"public_data_au_addresses_addr_id_key" UNIQUE CONSTRAINT, btree (addr_id)
"public_data_au_addresses_addr_hash_e8c67a89_like" btree (addr_hash varchar_pattern_ops)
"public_data_au_addresses_addr_id_9ee00c76_like" btree (addr_id varchar_pattern_ops)
"search_idx" gin (number, street, unit, city, region, postcode)

在查询中运行explain()方法时,我会得到:

When I run the explain() method on my query I get that:

Test1: 1 word
Limit  (cost=0.00..1110.60 rows=10 width=140)
->  Seq Scan on public_data_au_addresses  (cost=0.00..8081472.41 rows=72767 width=140)
    Filter: (to_tsvector((((((((((((COALESCE(number, ''::character varying))::text || ' '::text) || (COALESCE(street, ''::character varying))::text) || ' '::text) || (COALESCE(unit, ''::character varying))::text) || ' '::text) || (COALESCE(city, ''::character varying))::text) || ' '::text) || (COALESCE(region, ''::character varying))::text) || ' '::text) || (COALESCE(postcode, ''::character varying))::text)) @@ plainto_tsquery('north'::text))

因此它仍然显示顺序扫描而不是使用索引扫描。有人知道如何解决或调试吗?

So it still shows a Sequential scan instead of using an Index scan. Does anyone know how to fix or debug that?

GIN索引仍然可以通过如此多的字段进行搜索仍然有效吗?

Would a GIN index still be efficient with so many fields to search on anyway?

最后,还有谁知道我可以如何改进代码以进一步提高性能?

And lastly, does anyone know how I can improve the code to improve the performance further?

谢谢!
问候

Thank you! Regards

我试图按照下面Paolo的建议创建搜索向量,

I have tried to create a search vector as suggested by Paolo below, but it seems the search is still sequential and not leveraging the GIN index.

class AddressesQuerySet(CopyQuerySet):

    def update_search_vector(self):
        return self.update(search_vector=SearchVector('number', 'street', 'unit', 'city', 'region', 'postcode', config='english'))


class AddressesAustralia(models.Model):
    date_update = models.DateTimeField(auto_now=True, null=True)
    longitude = models.DecimalField(max_digits=9, decimal_places=6 , null=True)
    latitude = models.DecimalField(max_digits=9, decimal_places=6 , null=True)
    number = models.CharField(max_length=16, null=True, default='')
    street = models.CharField(max_length=60, null=True, default='')
    unit = models.CharField(max_length=50, null=True, default='')
    city = models.CharField(max_length=50, null=True, default='')
    district = models.CharField(max_length=10, null=True, default='')
    region = models.CharField(max_length=5, null=True, default='')
    postcode = models.CharField(max_length=5, null=True, default='')
    addr_id = models.CharField(max_length=20, unique=True)
    addr_hash = models.CharField(max_length=20, unique=True)
    search_vector = SearchVectorField(null=True, editable=False)

    objects = AddressesQuerySet.as_manager()

    class Meta:
        indexes = [
            GinIndex(fields=['search_vector'], name='search_vector_idx')
        ]

然后我使用更新命令更新了search_vector字段:

I have then updated the search_vector field using the update command:

AddressesAustralia.objects.update_search_vector()

然后我运行查询进行测试具有相同的搜索向量:

Then I ran a query to test with the same search vector:

class Command(BaseCommand):

    def handle(self, *args, **options):

        search_vector = SearchVector('number', 'street', 'unit', 'city', 'region', 'postcode', config='english')

        searchtext1 = "north"

        print('Test1: 1 word')
        start_time = time.time()
        result = AddressesAustralia.objects.filter(search_vector=searchtext1)[:10].explain(verbose=True)
        print(len(result))
        print(result)
        time_exec = str(timedelta(seconds=time.time() - start_time))
        print(time_exec)

我得到以下结果,仍然显示顺序搜索:

And I get the following results, still showing a sequential search:

Test1: 1 word
532
Limit  (cost=0.00..120.89 rows=10 width=235)
  Output: id, date_update, longitude, latitude, number, street, unit, city, district, region, postcode, addr_id, addr_hash, search_vector
  ->  Seq Scan on public.public_data_au_addressesaustralia  (cost=0.00..5061078.91 rows=418651 width=235)
        Output: id, date_update, longitude, latitude, number, street, unit, city, district, region, postcode, addr_id, addr_hash, search_vector
        Filter: (public_data_au_addressesaustralia.search_vector @@ plainto_tsquery('north'::text))
0:00:00.075262

我也尝试过:


  • 有和没有config =搜索向量(包括更新和查询)中的英语

  • With and without the config="english" in the search vector (both in the update and in the query)

要删除GIN索引,然后重新创建它,然后重新运行update_search_Vector

To delete the GIN index, then recreate it and then re-run the update_search_Vector

但结果仍然相同。对我做错了什么或如何进一步排除故障有任何想法吗?

But still the same results. Any idea on what I am doing wrong or how I can troubleshoot further?

推荐答案

如@knbk所建议的那样,您可以提高性能必须阅读全文搜索性能 Django 文档中的部分。

As already suggested by @knbk for performance improvement you have to read the Full-text search Performance section in the Django documentation.


如果此方法变得太慢,则可以添加 SearchVectorField 到您的模型。

在您的代码中,您可以在模型中添加带有相关GIN的搜索向量字段索引和具有用于更新字段的新方法的查询集:

In your code you can add a search vector field in your model with a related GIN index and a queryset with a new method to update the field:

from django.contrib.postgres.indexes import GinIndex
from django.contrib.postgres.search import SearchVector, SearchVectorField
from django.db import models
from postgres_copy import CopyQuerySet


class AddressesQuerySet(CopyQuerySet):

    def update_search_vector(self):
        return self.update(search_vector=SearchVector(
            'number', 'street', 'unit', 'city', 'region', 'postcode'
        ))


class Addresses(models.Model):
    date_update = models.DateTimeField(auto_now=True, null=True)
    longitude = models.DecimalField(max_digits=9, decimal_places=6, null=True)
    latitude = models.DecimalField(max_digits=9, decimal_places=6, null=True)
    number = models.CharField(max_length=16, null=True, default='')
    street = models.CharField(max_length=60, null=True, default='')
    unit = models.CharField(max_length=50, null=True, default='')
    city = models.CharField(max_length=50, null=True, default='')
    district = models.CharField(max_length=10, null=True, default='')
    region = models.CharField(max_length=5, null=True, default='')
    postcode = models.CharField(max_length=5, null=True, default='')
    addr_id = models.CharField(max_length=20, unique=True)
    addr_hash = models.CharField(max_length=20, unique=True)
    search_vector = SearchVectorField(null=True, editable=False)

    objects = AddressesQuerySet.as_manager()

    class Meta:
        indexes = [
            GinIndex(fields=['search_vector'], name='search_vector_idx')
        ]

您可以使用新的queryset方法更新新的搜索向量字段:

You can update your new search vector field using the new queryset method:

>>> Addresses.objects.update_search_vector()
UPDATE "addresses_addresses"
SET "search_vector" = to_tsvector(
  COALESCE("addresses_addresses"."number", '') || ' ' ||
  COALESCE("addresses_addresses"."street", '') || ' ' ||
  COALESCE("addresses_addresses"."unit", '') || ' ' ||
  COALESCE("addresses_addresses"."city", '') || ' ' ||
  COALESCE("addresses_addresses"."region", '') || ' ' ||
  COALESCE("addresses_addresses"."postcode", '')
)

如果执行查询并阅读了解释您可以看到您使用的GIN索引:

If you execute a query and read the explain you can see your GIN index used:

>>> print(Addresses.objects.filter(search_vector='north').values('id').explain(verbose=True))
EXPLAIN (VERBOSE true)
SELECT "addresses_addresses"."id"
FROM "addresses_addresses"
WHERE "addresses_addresses"."search_vector" @@ (plainto_tsquery('north')) = true [0.80ms]
Bitmap Heap Scan on public.addresses_addresses  (cost=12.25..16.52 rows=1 width=4)
  Output: id
  Recheck Cond: (addresses_addresses.search_vector @@ plainto_tsquery('north'::text))
  ->  Bitmap Index Scan on search_vector_idx  (cost=0.00..12.25 rows=1 width=0)
        Index Cond: (addresses_addresses.search_vector @@ plainto_tsquery('north'::text))

如果您想进一步加深,可以阅读我写的关于这一主题的文章

If you want to deepen further you can read an article that I wrote on the subject:

使用PostgreSQL在Django中进行全文搜索

"Full-Text Search in Django with PostgreSQL"

我尝试执行由Django ORM生成的SQL:
http:/ /sqlfiddle.com/#!17/f9aa9/1

I tried execute the SQL generate by Django ORM: http://sqlfiddle.com/#!17/f9aa9/1

这篇关于Django全文搜索优化-Postgres的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆