Indexing & alternatives for low-selectivity columns


Question


What is the range of tactics available for selecting records on low-selectivity columns?

An example might be an orders table where, over many years, you build up a large number of completed orders but often need to select active orders. An order might go through a lifecycle such as placed, stock-allocated, picked from warehouse, despatched to customer, invoiced and paid. An order might additionally be cancelled, held, etc. The majority of records will eventually be in the final state (e.g. paid) but you might often need to select, say, allocated orders. In this case a sequential read would be slow.

Similar questions on indexing
MySQL: low cardinality/selectivity columns = how to index ?
Do indexes suck in SQL?
What are indexes and how can I use them to optimize queries in my database?
Defining indexes: Which Columns, and Performance Impact?
and numerous others decreasingly related.

The approaches I have read about (in stackoverflow and elsewhere) include

  • Use a bitmap index
  • Use a partial index (create index x on t(c2) where c1='a')
  • Use a clustered index?
  • Don't index low selectivity columns, use sequential read
  • Partition the data (e.g. into several tables with identical schema)
  • Use a supplementary table (e.g. active_customers(customer_id))

My current DBMS doesn't support the first three options listed above and the remainder seem problematic - are there any other commonly used approaches?

Update: I've seen - index your low-selectivity column, but only ever select for high-selectivity values.

Solution

I agree with Unreason's However branch. But there are some things to know about this case.

This is called skew, and skew kills. This is a perfect use for a partial index, where you'd exclude the 95% of paid invoices and index only the more interesting and selective statuses. But you don't have that. You can horizontally partition all the rows into separate tables or partitions, but then you need to account for row migration (moving a row from one status to another), and that's expensive. The DBMS has to perform an update, a delete and an insert to change the status. If you're a high-volume system, that will hurt.
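For readers whose DBMS does have partial indexes, here is a minimal sketch of the idea using Python's bundled sqlite3 module. The table name, index name and data are made up for illustration. One SQLite-specific assumption: its planner only considers a partial index when the query's WHERE clause is recognised as implying the index condition, so the query below repeats status <> 'paid' explicitly. Other engines have their own rules.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoice (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")

# Skewed data: 95 paid invoices and 5 that are still allocated.
rows = [(i, "allocated" if i % 20 == 0 else "paid") for i in range(1, 101)]
conn.executemany("INSERT INTO invoice (id, status) VALUES (?, ?)", rows)

# Partial index: only rows that are NOT paid get an index entry,
# so the index stays small however many paid invoices accumulate.
conn.execute(
    "CREATE INDEX invoice_open_status ON invoice (status) "
    "WHERE status <> 'paid'"
)

# Repeating the index condition in the query makes the index eligible in SQLite.
query = (
    "SELECT id FROM invoice "
    "WHERE status <> 'paid' AND status = 'allocated' ORDER BY id"
)
open_allocated = [row[0] for row in conn.execute(query)]

# The chosen plan depends on the SQLite version and statistics; print it to check.
for plan_row in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(plan_row[-1])
```

Whether the planner actually picks the index is worth checking on your own data rather than assuming.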

Forget what you said about whether or not to index based on selectivity because putting an index on a rapidly changing column is also usually a bad idea. Your index will have hot blocks where all the step 1's are being removed and another where all the step 2's are being inserted and oh btw, some step 2's are being removed at the same time into step 3's. This won't scale well.

I would recommend vertically partitioning your status into a separate table (or tables).

Your invoice table will have a PK and all the columns except status.

You can handle the status in two ways. Either way, the status table will have the invoice PK value as an FK back to the invoice table, the status, and a timestamp for when the invoice entered that status. The best option is a table horizontally partitioned on status, with a partition for each possible status. Finding all or one "Placed" status will then partition prune and read only the partition it needs - which is a very small number of blocks. Because the row is so narrow, you might get 400 invoice statuses on a single block. Looking up the status of any one invoice is easy since there's a global index on the PK.

If your RDBMS doesn't support partitioning with row migration, you'll need to manage these partitions as tables and delete from one and insert into another. You'll encapsulate these movements in a transaction in a procedure, so you keep the data clean. Every invoice is in one and only one status table. The harder part is querying by invoice ID: you'll have to check every table to see where the invoice is.
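As a rough sketch of that hand-managed variant, the following uses Python's sqlite3 module with one narrow table per status. The status names, the status_<name> table naming and the helper functions are all hypothetical, and a real system would put the move into a stored procedure in its own DBMS.

```python
import sqlite3

# Hypothetical subset of the lifecycle; one narrow table per status.
STATUSES = ("placed", "allocated", "picked")


def create_schema(conn):
    conn.execute("CREATE TABLE invoice (id INTEGER PRIMARY KEY, customer TEXT NOT NULL)")
    for status in STATUSES:
        # REFERENCES is only enforced in SQLite if PRAGMA foreign_keys is ON.
        conn.execute(
            f"CREATE TABLE status_{status} ("
            "invoice_id INTEGER PRIMARY KEY REFERENCES invoice(id), "
            "entered_at TEXT NOT NULL)"
        )


def move_status(conn, invoice_id, old, new, entered_at):
    """Delete from the old status table and insert into the new one atomically."""
    if old not in STATUSES or new not in STATUSES:
        raise ValueError("unknown status")
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            f"DELETE FROM status_{old} WHERE invoice_id = ?", (invoice_id,)
        )
        if cur.rowcount != 1:
            raise LookupError(f"invoice {invoice_id} is not in status {old!r}")
        conn.execute(
            f"INSERT INTO status_{new} (invoice_id, entered_at) VALUES (?, ?)",
            (invoice_id, entered_at),
        )


def find_status(conn, invoice_id):
    """The 'harder part': probe every status table for the invoice."""
    for status in STATUSES:
        row = conn.execute(
            f"SELECT 1 FROM status_{status} WHERE invoice_id = ?", (invoice_id,)
        ).fetchone()
        if row is not None:
            return status
    return None
```

The table names are built from a fixed whitelist, never from user input, which is why string formatting is acceptable here. The lookup cost grows with the number of status tables, which is the trade-off the paragraph above describes.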

You have another choice: you can either write paid statuses or not. If it's a partitioned table, you can just delete the invoice from the invoice status table when it moves to paid. (Of course, you'll write a paid record to the history table mentioned in the bonus material.) Then you'll do an outer join to the status table, and nulls mean paid. If you almost never query for paid status, there's really no reason to make that a fast query.
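The "nulls mean paid" idea might look like this in a small sqlite3 sketch. The invoice and invoice_status tables, their columns and the mark_paid helper are illustrative assumptions, and the write to the history table is left out to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE invoice (id INTEGER PRIMARY KEY, customer TEXT NOT NULL);
CREATE TABLE invoice_status (
    invoice_id INTEGER PRIMARY KEY REFERENCES invoice(id),
    status     TEXT NOT NULL,
    entered_at TEXT NOT NULL
);
INSERT INTO invoice VALUES (1, 'acme'), (2, 'globex'), (3, 'initech');
INSERT INTO invoice_status VALUES
    (1, 'placed',    '2024-01-01'),
    (2, 'allocated', '2024-01-02'),
    (3, 'invoiced',  '2024-01-03');
""")


def mark_paid(conn, invoice_id):
    """Paid invoices simply leave the status table (history write omitted here)."""
    with conn:
        conn.execute("DELETE FROM invoice_status WHERE invoice_id = ?", (invoice_id,))


mark_paid(conn, 3)

# Outer join: an invoice with no status row is treated as paid.
statuses = conn.execute("""
    SELECT i.id, COALESCE(s.status, 'paid')
    FROM invoice AS i
    LEFT JOIN invoice_status AS s ON s.invoice_id = i.id
    ORDER BY i.id
""").fetchall()
```

The status table then only ever holds the small set of active invoices, so scanning it for any active status stays cheap.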

Bonus Material

In either case you'll want to keep track of these movements in a reporting table. Every time you update a status, you'll want to write that to a history table. Eventually you'll want to analyze what I call transit times. What's the average time from filled to paid, by month? Is that increasing as a result of the bad economy? What's the transit time from placed to filled, by month? Do the summer months take longer because of missing bodies on vacation? You get the point. By updating that column you're losing those answers, so you'll need to embed that history log into your procedures.
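A transit-time query over such a history table could be sketched as follows, again with sqlite3. The status_history table, the record_status helper and the sample dates are invented for illustration. Two assumptions are baked in: an invoice enters each status at most once, and the month is taken from the date the payment was recorded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE status_history (
        invoice_id INTEGER NOT NULL,
        status     TEXT NOT NULL,
        entered_at TEXT NOT NULL,          -- ISO date, e.g. '2024-01-11'
        PRIMARY KEY (invoice_id, status)   -- assumes one entry per status
    )
""")


def record_status(conn, invoice_id, status, entered_at):
    """Append one row per status change; call this from every status update."""
    with conn:
        conn.execute(
            "INSERT INTO status_history (invoice_id, status, entered_at) "
            "VALUES (?, ?, ?)",
            (invoice_id, status, entered_at),
        )


for event in [
    (1, "filled", "2024-01-01"), (1, "paid", "2024-01-11"),
    (2, "filled", "2024-01-05"), (2, "paid", "2024-01-09"),
    (3, "filled", "2024-01-20"), (3, "paid", "2024-02-19"),
]:
    record_status(conn, *event)

# Average days from filled to paid, grouped by the month of payment.
transit = conn.execute("""
    SELECT strftime('%Y-%m', p.entered_at) AS month,
           AVG(julianday(p.entered_at) - julianday(f.entered_at)) AS avg_days
    FROM status_history AS f
    JOIN status_history AS p ON p.invoice_id = f.invoice_id
    WHERE f.status = 'filled' AND p.status = 'paid'
    GROUP BY strftime('%Y-%m', p.entered_at)
    ORDER BY month
""").fetchall()
```

If invoices can re-enter a status (for example after being held), the primary key and the self-join would both need to change.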
