Django Model Choices: IntegerField vs CharField
Question
TL;DR: I have a table with millions of rows and I'm wondering how I should index it.
I have a Django project that uses SQL Server as the database backend.
Once one of my models reached around 14 million instances in the production environment, I realized that I was running into performance issues:
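from django.db import models
class UserEvent(models.Model):
    A_EVENT = 'A'
    B_EVENT = 'B'
    types = (
        (A_EVENT, 'Event A'),
        (B_EVENT, 'Event B'),
    )
    event_type = models.CharField(max_length=1, choices=types)
    # on_delete is mandatory since Django 2.0; CASCADE was the earlier implicit default
    contract = models.ForeignKey(Contract, on_delete=models.CASCADE)
    # field_x = (...)
    # field_y = (...)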
Many of my queries filter on this field, and they are highly inefficient because the field isn't indexed. Filtering the model by this field alone takes almost 7 seconds, while querying by an indexed foreign key causes no performance issues:
UserEvent.objects.filter(event_type=UserEvent.B_EVENT).count()
# elapsed time: 0:00:06.921287
UserEvent.objects.filter(contract_id=62).count()
# elapsed time: 0:00:00.344261
When I realized this, I also asked myself: "Shouldn't this field be a SmallIntegerField, since I only have a small set of choices, and queries on integer fields are more efficient than queries on text/varchar fields?"
So, from what I understand, I have two options*:
*I realize that a third option may exist, since indexing fields with low cardinality may not bring a significant improvement, but since my values have a [1%-99%] distribution (and I'm looking for the 1% part), indexing this field seems to be a valid option.
A) Simply index this field, and leave it as a CharField.
A_EVENT = 'A'
B_EVENT = 'B'
types = (
    (A_EVENT, 'Event A'),
    (B_EVENT, 'Event B'),
)
event_type = models.CharField(max_length=1, choices=types, db_index=True)
Pros: Simple
Cons: CharField based indexes are less efficient than integer based indexes
B) Perform a migration to turn this field into a SmallIntegerField (I don't want it to be a BooleanField, since more options may be added to the field later), and then index the field.
A_EVENT = 1
B_EVENT = 2
types = (
    (A_EVENT, 'Event A'),
    (B_EVENT, 'Event B'),
)
event_type = models.SmallIntegerField(choices=types, db_index=True)
Pros: Integer based indexes are more efficient than CharField based indexes
Cons: I have to perform a complex operation:
- A schema migration to create the new SmallIntegerField.
- A data migration to copy (and convert) the millions of instances from the old field to the new field.
- Update the project code to use the new field, or perform another schema migration to rename the new field.
- Delete the old field.
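The first two steps of the option B operation (add the new column, then copy and convert the data) can be sketched outside Django to show why batching matters on a table this size. This is only a sketch under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for SQL Server, and the table name user_event, the column name event_type_int, the CHAR_TO_INT mapping and the batch size are all hypothetical. In a real project the same idea would more likely live in a Django data migration.

```python
import sqlite3

# Hypothetical mapping from the old one-character codes to the new integer codes.
CHAR_TO_INT = {"A": 1, "B": 2}


def convert_in_batches(conn, batch_size=1000):
    """Copy event_type into event_type_int one primary-key range at a time."""
    max_id = conn.execute("SELECT COALESCE(MAX(id), 0) FROM user_event").fetchone()[0]
    for start in range(0, max_id + 1, batch_size):
        for old, new in CHAR_TO_INT.items():
            conn.execute(
                "UPDATE user_event SET event_type_int = ? "
                "WHERE event_type = ? AND id >= ? AND id < ?",
                (new, old, start, start + batch_size),
            )
        conn.commit()  # one short transaction per batch instead of one huge one


# Stand-in table with the same skew as in the question: 1% 'B', 99% 'A'.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_event (id INTEGER PRIMARY KEY, event_type TEXT)")
conn.executemany(
    "INSERT INTO user_event (event_type) VALUES (?)",
    [("B" if i % 100 == 0 else "A",) for i in range(5000)],
)

# Step 1: schema change, add the new nullable integer column.
conn.execute("ALTER TABLE user_event ADD COLUMN event_type_int SMALLINT")
# Step 2: data migration, done in batches.
convert_in_batches(conn)

# Sanity checks to run before switching the code over and dropping the old column.
unconverted = conn.execute(
    "SELECT COUNT(*) FROM user_event WHERE event_type_int IS NULL"
).fetchone()[0]
b_rows = conn.execute(
    "SELECT COUNT(*) FROM user_event WHERE event_type_int = 2"
).fetchone()[0]
```

Keeping the old column until the checks pass means the change can still be abandoned without data loss, which addresses part of the risk raised here.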
Summing up, the real question here is:
Is the performance improvement I would get from migrating the field to a SmallIntegerField worth the risk?
I'm inclined to try option A and check whether the performance improvement is adequate.
I also brought this question to StackOverflow because a more generic question arose:
Is there any case where using a CharField with Django choices is better than using a Boolean/Integer/SmallIntegerField?
This situation arose because, when defining the project models, I was inspired by this code snippet from the Django documentation:
FRESHMAN = 'FR'  # constant defined in the full documentation example
YEAR_IN_SCHOOL_CHOICES = (
    ('FR', 'Freshman'),
    ('SO', 'Sophomore'),
    ('JR', 'Junior'),
    ('SR', 'Senior'),
)
year_in_school = models.CharField(max_length=2,
                                  choices=YEAR_IN_SCHOOL_CHOICES,
                                  default=FRESHMAN)
Why do they use chars when they could use integers, since it is just a value representation that should never be displayed?
Recommended answer
Speed of count queries
UserEvent.objects.filter(event_type=UserEvent.B_EVENT).count()
# elapsed time: 0:00:06.921287
Unfortunately, queries of this nature will always be slow in databases when the table has a large number of entries.
MySQL optimizes count queries by looking at the index, provided the indexed columns are numeric. So that would be a good reason to use SmallIntegerField instead of CharField if you were on MySQL, but apparently you are not. Your mileage may vary with other databases. I am not an expert on SQL Server, but my understanding is that it's particularly poor at using indexes on COUNT(*) queries.
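Whether a given engine answers a filtered count from the index is something you can check by inspecting the query plan before and after creating the index. The sketch below does this with Python's built-in sqlite3 module purely as an illustration. SQLite is neither MySQL nor SQL Server, so the plan it prints says nothing about how those planners behave, and the table and index names are hypothetical.

```python
import sqlite3


def count_plan(conn, event_type):
    """Return (count, plan text) for a COUNT(*) filtered on event_type."""
    sql = "SELECT COUNT(*) FROM user_event WHERE event_type = ?"
    count = conn.execute(sql, (event_type,)).fetchone()[0]
    plan_rows = conn.execute("EXPLAIN QUERY PLAN " + sql, (event_type,)).fetchall()
    # The last column of each plan row is the human-readable detail string.
    plan = " | ".join(row[-1] for row in plan_rows)
    return count, plan


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_event ("
    "id INTEGER PRIMARY KEY, event_type TEXT, contract_id INTEGER)"
)
# Skewed data mirroring the question: 1% 'B' rows, 99% 'A' rows.
rows = [("B" if i % 100 == 0 else "A", i % 500) for i in range(20000)]
conn.executemany(
    "INSERT INTO user_event (event_type, contract_id) VALUES (?, ?)", rows
)

count_before, plan_before = count_plan(conn, "B")  # no index yet: table scan
conn.execute("CREATE INDEX idx_event_type ON user_event (event_type)")
count_after, plan_after = count_plan(conn, "B")    # plan now names the index

print(plan_before)
print(plan_after)
```

On SQL Server the equivalent check would be to look at the actual execution plan of the generated query, which is the only reliable way to confirm or refute the claim for your own data.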
Partitioning
You might be able to improve the overall performance of queries involving event_type by partitioning the data. Because the cardinality of the current index is poor, it's often better for the planner to do a full table scan. If the data were partitioned, only the relevant partition would need to be scanned.
Char or Smallint
Which takes up more space, char(2) or smallint? It depends on your character set. If the character set requires only one byte per character, a smallint and a char(2) take up the same amount of space. Since the field is going to have very low cardinality, using char or smallint will not make any significant difference in this case.
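The size comparison can be worked through in a few lines of Python. This only measures the encoded payload of a two-character code against a 16-bit integer. It ignores any per-row or per-column overhead a real database adds, and the three encodings are just examples of one-byte and two-byte character sets.

```python
import struct

# Bytes needed to store a two-character code under different encodings.
code = "FR"
char_sizes = {
    enc: len(code.encode(enc)) for enc in ("latin-1", "utf-8", "utf-16-le")
}

# A SMALLINT is a 16-bit integer, i.e. 2 bytes.
smallint_size = struct.calcsize("<h")

print(char_sizes)     # per-encoding size of the two-character code
print(smallint_size)  # size of a 16-bit integer
```

With a one-byte-per-character encoding the two sizes match. With a two-byte-per-character encoding the char(2) value is twice the size of the smallint, which is still a small difference per row.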