Optimizing a generated string for storing into a database


Question

I have a 64-bit integer timestamp and a String username that are combined into one string and eventually stored in a database column. Leaving aside why I can't store them in separate columns with appropriate types, my question is how to combine them to get better performance from the underlying database. That would be SQLite, PostgreSQL or MySQL; I'm not sure yet.

I imagine they would use B-trees as indexes, so concatenating as timestamp-username would be bad: the timestamp generally always increases, and the tree would need rebalancing often. username-timestamp should be much better, but each user's keys would still increase with every new entry. I was also thinking of storing the timestamp with its bits in reverse order.
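The bit-reversal idea from the question could be sketched as follows. This is only an illustration: `reverse_bits64` is a hypothetical helper, the timestamp is assumed to be a non-negative value that fits in 64 bits, and whether scrambling the key order helps at all depends on the database engine.

```python
MASK64 = (1 << 64) - 1


def reverse_bits64(value: int) -> int:
    """Reverse the bit order of an unsigned 64-bit integer."""
    if not 0 <= value <= MASK64:
        raise ValueError("expected an unsigned 64-bit integer")
    # Render as exactly 64 binary digits, reverse the digits, parse back.
    return int(format(value, "064b")[::-1], 2)


ts = 1_700_000_000_000  # hypothetical millisecond timestamp
scrambled = reverse_bits64(ts)

# Reversal is its own inverse, so the original timestamp is recoverable.
assert reverse_bits64(scrambled) == ts
```

Because the fastest-changing low bits end up at the front, consecutive timestamps no longer sort next to each other.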

Is there anything else I can do? Some clever XOR or whatever? What would be the reasonably best schema? Data will only ever be accessed by requesting the exact generated string, with no range queries and the like.

The only requirement is a relatively fast conversion between the generated string and the source data, in both directions.
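One way such a two-way conversion might look is sketched below. The format (username, a `-` separator, then the timestamp as 16 hexadecimal digits) and the helper names `encode_key` and `decode_key` are assumptions made for illustration, not something the question specifies. The timestamp is assumed to be non-negative.

```python
def encode_key(username: str, timestamp: int) -> str:
    """Combine username and timestamp into 'username-<16 hex digits>'."""
    if not 0 <= timestamp < 1 << 64:
        raise ValueError("timestamp must fit in an unsigned 64-bit integer")
    # Fixed-width hex keeps every key for one user the same length.
    return f"{username}-{timestamp:016x}"


def decode_key(key: str) -> tuple:
    """Split at the last '-' so usernames may themselves contain '-'."""
    username, _, hex_ts = key.rpartition("-")
    return username, int(hex_ts, 16)


key = encode_key("alice", 255)
print(key)              # alice-00000000000000ff
print(decode_key(key))  # ('alice', 255)
```

Both directions are a single string operation plus one integer conversion, so the cost is small compared with a database round trip.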

UPDATE: Please, guys, I'm looking for information on what kind of string would be better to store as a primary key in a database (one of SQLite, MySQL and PostgreSQL). Perhaps the answer is that it doesn't matter, or that it depends on the DB engine. I don't have a particular problem with the schema I'm using or with the caching solution. I'm just asking whether there is any room to improve, and how. I'd appreciate some on-topic answers.

UPDATE 2: Great answers, but still not definitive for me: does an incrementing column make the B-tree index on that column unbalanced? https://stackoverflow.com/a/2362693/520567

Recommended answer

There is a contradiction in your question: you say you can't split them and store them in separate columns, but then you talk about indexing both parts separately. You can't do that without splitting them.

As I see it, you really have two choices:


  1. Store them in separate columns

  2. Hash the output to reduce the index's memory footprint

Ideally you should store them in two columns and create a composite index if you will always search for them together in the same order. It's hard to give accurate advice without more information, but generally (username, timestamp) makes logical sense if you query per user, and the reverse order if you want to query by timestamp. You could also create an index on each column if you need to search on one or the other.
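A minimal sketch of the two-column layout with a composite key, using Python's built-in `sqlite3` module. The table name `entries`, the column names, and the sample values are hypothetical; MySQL and PostgreSQL accept a similar `PRIMARY KEY (username, ts)` definition, but their index behavior is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE entries (
        username TEXT    NOT NULL,
        ts       INTEGER NOT NULL,   -- 64-bit signed integer in SQLite
        payload  TEXT,
        PRIMARY KEY (username, ts)   -- composite key, username first
    )
    """
)
conn.execute(
    "INSERT INTO entries VALUES (?, ?, ?)",
    ("alice", 1_700_000_000_000, "hello"),
)

# Exact-match lookup on both parts of the key.
row = conn.execute(
    "SELECT payload FROM entries WHERE username = ? AND ts = ?",
    ("alice", 1_700_000_000_000),
).fetchone()
print(row)

# Ask SQLite how it intends to execute the lookup.
for step in conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT payload FROM entries WHERE username = ? AND ts = ?",
    ("alice", 0),
):
    print(step)
```

The query-plan output is a quick way to check that the lookup is served by the composite key rather than by a full table scan.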

Hashing the generated string:

INSERT INTO table (crc_hash_column, value_column_name)
values (CRC32(@generated_value), @generated_value)

This reduces the indexed value to a 32-bit integer (only 4 bytes of index per row), much smaller than the index space required for the equivalent VARCHAR or CHAR column.

If you take this approach, you should take measures to handle collisions: because of the birthday paradox they will happen, and they become more likely as your dataset grows. Even with collisions, the extra filtering will still yield better performance than the alternatives, given the size of the index.
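To get a feel for how quickly collisions appear with a 32-bit hash, the standard birthday approximation can be evaluated directly. This is a rough estimate that assumes hash values are spread uniformly, which CRC32 only approximates; `collision_probability` is a hypothetical helper name.

```python
import math


def collision_probability(n_rows: int, hash_bits: int = 32) -> float:
    """Approximate chance that at least two of n_rows values share a hash."""
    buckets = 2 ** hash_bits
    # Birthday approximation: 1 - exp(-n(n-1) / (2 * buckets)).
    return 1.0 - math.exp(-n_rows * (n_rows - 1) / (2 * buckets))


for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{n:>9} rows -> {collision_probability(n):.4f}")
```

Under this approximation, the chance of at least one collision reaches about 50% at roughly 77,000 rows, so the query must always compare the full value as well as the hash.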

SELECT * FROM table
WHERE crc_hash_column = CRC32(@search_value)
AND value_column_name = @search_value
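The `CRC32()` SQL function used above is a MySQL built-in; SQLite does not ship one. A sketch of the same hash-column pattern with the hash computed in the application, using Python's `zlib.crc32` and `sqlite3`, might look like this. The table `items`, its columns, and the helpers `crc32_of`, `insert` and `lookup` are all hypothetical names.

```python
import sqlite3
import zlib


def crc32_of(text: str) -> int:
    """CRC32 of the UTF-8 bytes, as an unsigned 32-bit integer."""
    return zlib.crc32(text.encode("utf-8"))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (crc_hash INTEGER NOT NULL, value TEXT NOT NULL)"
)
# Index only the small integer hash, not the full string.
conn.execute("CREATE INDEX idx_items_crc ON items (crc_hash)")


def insert(value: str) -> None:
    conn.execute("INSERT INTO items VALUES (?, ?)", (crc32_of(value), value))


def lookup(value: str):
    # The hash narrows the search via the index; the value comparison
    # discards any row that merely collides on the hash.
    return conn.execute(
        "SELECT value FROM items WHERE crc_hash = ? AND value = ?",
        (crc32_of(value), value),
    ).fetchone()


insert("alice-0000018bcfe56800")
print(lookup("alice-0000018bcfe56800"))
print(lookup("bob-0000000000000000"))  # not stored, so None
```

Keeping the comparison on the full value in the `WHERE` clause is what makes the lookup correct even when two strings share a hash.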

Using the hash costs a few more CPU cycles, but a CRC32 hash is very quick. Even though you have to rehash on every search, this extra work is tiny compared with the benefit over indexing larger amounts of data.

Generally I would prefer the first option, but it's almost impossible to recommend one without knowing your use case.

You should profile both options and see whether they meet your requirements.
