数据库索引如何工作? [英] How does database indexing work?

查看:158
本文介绍了数据库索引如何工作?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

鉴于索引是如此重要,因为您的数据集在大小上增加,有人可以解释索引在数据库不可知级别如何工作吗?



有关查询的信息索引字段,请查看如何索引数据库列



当数据存储在基于磁盘的存储设备上时,被存储为数据块。这些块被整体访问,使它们成为原子磁盘访问操作。磁盘块的结构与链表大致相同;两者都包含用于数据的部分,指向下一个节点(或块)的位置的指针,并且两者不需要被连续存储。



由于以下事实:记录数只能在一个字段上排序,我们可以声明在不排序的字段上搜索需要一个线性搜索,需要 N / 2 块访问平均),其中 N 是表跨越的块的数量。如果该字段是非关键字段(即不包含唯一条目),则必须在 N 块访问时搜索整个表空间。



而对于sorted字段,可以使用二进制搜索,它具有 log2 N 块访问。此外,由于数据是在给定非关键字字段的情况下排序的,所以一旦找到较高的值,则不需要为表的其余部分搜索重复值。



索引是一个非常有用的索引,一种对多个字段上的记录数进行排序的方法。在表中的字段上创建索引会创建另一个数据结构,该结构保存字段值以及与其相关的记录的指针。然后对该索引结构进行排序,允许对其执行二进制搜索。



索引的缺点是这些索引需要磁盘上的额外空间,因为索引使用MyISAM引擎一起存储在一个表中,如果同一个表中的许多字段都被索引,这个文件可以快速达到底层文件系统的大小限制。





首先,我们概述一个示例数据库表模式;

 
字段名数据类型磁盘大小
id(主键)无符号INT 4个字节
firstName Char )50 bytes
lastName Char(50)50 bytes
emailAddress Char(100)100 bytes



<注意:使用char代替varchar,以允许磁盘上的精确大小值。
此示例数据库包含500万行,并且未经过索引。现在将分析几个查询的性能。这些是使用 id (排序的关键字段)和使用 firstName (非键未排序字段)的查询。



示例1 - 排序与未排序的字段



我们的 r = 5,000,000 样本数据库具有固定大小的记录,记录长度为 R = 204 存储在使用默认块大小 B = 1,024 字节的MyISAM引擎的表中。表的阻塞因子将是每个磁盘块的 bfr =(B / R)= 1024/204 = 5 条记录。保存表所需的块的总数为 N =(r / bfr)= 5000000/5 = 1,000,000 块。



对id字段进行线性搜索需要平均 N / 2 = 500,000 值,因为id字段是关键字字段。但是由于id字段也被排序,因此可以进行二分查找,要求平均 log2 1000000 = 19.93 = 20 块访问。



现在 firstName 字段既不是排序也不是键字段,因此二进制搜索是不可能的,这些值也不是唯一的,因此该表将需要搜索结束以获得 N = 1,000,000 块访问。在这种情况下,索引的目的是纠正。



鉴于索引记录只包含索引字段和指向原始记录的指针,小于它指向的多字段记录。因此索引本身需要比原始表少的磁盘块,因此需要较少的块访问来迭代。 firstName 字段上的索引模式如下所示;

 
字段名称数据类型磁盘大小
firstName Char(50)50字节
(记录指针)特殊4字节

注意:MySQL中的指针长度为2,3,4或5个字节,具体取决于大小

- 索引
$ b

给定我们的示例数据库 r = 5,000,000 ,索引记录长度为 R = 54 字节,并使用默认块大小 B = 1,024 字节。索引的阻塞因子将为每个磁盘块的 bfr =(B / R)= 1024/54 = 18 记录。保存索引所需的块总数为 N =(r / bfr)= 5000000/18 = 277,778 块。



现在,使用 firstName 字段进行搜索可以利用索引来提高性能。这允许对具有 log2 277778 = 18.08 = 19 块访问的平均值的索引的二分搜索。要找到实际记录的地址,这需要进一步的块访问读取,使总计为 19 + 1 = 20 块访问,与1,000,000块



何时应该使用? / p>

由于创建索引需要额外的磁盘空间(上例中额外增加277,778个块,增加约28%),而太多的索引可能会导致文件系统大小限制,仔细考虑必须使用选择正确的字段来索引。



由于索引只用于加速在记录中搜索匹配字段,其原因在于,当进行插入或删除操作时,仅用于输出的索引字段将简单地浪费磁盘空间和处理时间,因此应当避免。还给出二分搜索的性质,数据的基数或唯一性是重要的。在基数为2的字段上建立索引会将数据拆分为一半,而基数为1,000的数据将返回大约1,000条记录。以这样低的基数,有效性被减少到线性排序,并且如果基数小于记录数的30%,则查询优化器将避免使用索引,有效地使索引浪费空间。


Given that indexing is so important as your data set increases in size, can someone explain how does indexing works at a database agnostic level?

For information on queries to index a field, check out How do I index a database column.

解决方案

Why is it needed?

When data is stored on disk based storage devices, it is stored as blocks of data. These blocks are accessed in their entirety, making them the atomic disk access operation. Disk blocks are structured in much the same way as linked lists; both contain a section for data, a pointer to the location of the next node (or block), and both need not be stored contiguously.

Due to the fact that a number of records can only be sorted on one field, we can state that searching on a field that isn’t sorted requires a Linear Search which requires N/2 block accesses (on average), where N is the number of blocks that the table spans. If that field is a non-key field (i.e. doesn’t contain unique entries) then the entire table space must be searched at N block accesses.

Whereas with a sorted field, a Binary Search may be used, this has log2 N block accesses. Also since the data is sorted given a non-key field, the rest of the table doesn’t need to be searched for duplicate values, once a higher value is found. Thus the performance increase is substantial.

What is indexing?

Indexing is a way of sorting a number of records on multiple fields. Creating an index on a field in a table creates another data structure which holds the field value, and pointer to the record it relates to. This index structure is then sorted, allowing Binary Searches to be performed on it.

The downside to indexing is that these indexes require additional space on the disk, since the indexes are stored together in a table using the MyISAM engine, this file can quickly reach the size limits of the underlying file system if many fields within the same table are indexed.

How does it work?

Firstly, let’s outline a sample database table schema;

Field name       Data type      Size on disk
id (Primary key) Unsigned INT   4 bytes
firstName        Char(50)       50 bytes
lastName         Char(50)       50 bytes
emailAddress     Char(100)      100 bytes

Note: char was used in place of varchar to allow for an accurate size on disk value. This sample database contains five million rows, and is unindexed. The performance of several queries will now be analyzed. These are a query using the id (a sorted key field) and one using the firstName (a non-key unsorted field).

Example 1 - sorted vs unsorted fields

Given our sample database of r = 5,000,000 records of a fixed size giving a record length of R = 204 bytes and they are stored in a table using the MyISAM engine which is using the default block size B = 1,024 bytes. The blocking factor of the table would be bfr = (B/R) = 1024/204 = 5 records per disk block. The total number of blocks required to hold the table is N = (r/bfr) = 5000000/5 = 1,000,000 blocks.

A linear search on the id field would require an average of N/2 = 500,000 block accesses to find a value, given that the id field is a key field. But since the id field is also sorted, a binary search can be conducted requiring an average of log2 1000000 = 19.93 = 20 block accesses. Instantly we can see this is a drastic improvement.

Now the firstName field is neither sorted nor a key field, so a binary search is impossible, nor are the values unique, and thus the table will require searching to the end for an exact N = 1,000,000 block accesses. It is this situation that indexing aims to correct.

Given that an index record contains only the indexed field and a pointer to the original record, it stands to reason that it will be smaller than the multi-field record that it points to. So the index itself requires fewer disk blocks than the original table, which therefore requires fewer block accesses to iterate through. The schema for an index on the firstName field is outlined below;

Field name       Data type      Size on disk
firstName        Char(50)       50 bytes
(record pointer) Special        4 bytes

Note: Pointers in MySQL are 2, 3, 4 or 5 bytes in length depending on the size of the table.

Example 2 - indexing

Given our sample database of r = 5,000,000 records with an index record length of R = 54 bytes and using the default block size B = 1,024 bytes. The blocking factor of the index would be bfr = (B/R) = 1024/54 = 18 records per disk block. The total number of blocks required to hold the index is N = (r/bfr) = 5000000/18 = 277,778 blocks.

Now a search using the firstName field can utilise the index to increase performance. This allows for a binary search of the index with an average of log2 277778 = 18.08 = 19 block accesses. To find the address of the actual record, which requires a further block access to read, bringing the total to 19 + 1 = 20 block accesses, a far cry from the 1,000,000 block accesses required to find a firstName match in the non-indexed table.

When should it be used?

Given that creating an index requires additional disk space (277,778 blocks extra from the above example, a ~28% increase), and that too many indexes can cause issues arising from the file systems size limits, careful thought must be used to select the correct fields to index.

Since indexes are only used to speed up the searching for a matching field within the records, it stands to reason that indexing fields used only for output would be simply a waste of disk space and processing time when doing an insert or delete operation, and thus should be avoided. Also given the nature of a binary search, the cardinality or uniqueness of the data is important. Indexing on a field with a cardinality of 2 would split the data in half, whereas a cardinality of 1,000 would return approximately 1,000 records. With such a low cardinality the effectiveness is reduced to a linear sort, and the query optimizer will avoid using the index if the cardinality is less than 30% of the record number, effectively making the index a waste of space.

这篇关于数据库索引如何工作?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆