SQL performance searching for long strings


Question


I need to store user agent strings in a database for tracking and comparing customer behavior and sales performance between different browsers. A pretty plain user agent string is around 100 characters long. It was decided to use a varchar(1024) for holding the useragent data in the database. (I know this is overkill, but that's the idea; it's supposed to accommodate useragent data for years to come, and some devices, toolbars, and applications are already pushing 500 characters in length.) The table holding these strings will be normalized (each distinct user agent string will only be stored once) and treated like a cache so we don't have to interpret user agents over and over again.

The typical use case is:

• User comes to our site, is detected as being a new visitor
• New session information is created for this user
• Determine if we need to analyze the user agent string or if we have a valid analysis on file for it
• If we have it, great; if not, analyze it (currently, we plan on calling a 3rd party API)
• Store the pertinent information (browser name, version, OS, etc.) in a join table tied to the existing user session information and pointing to the cache entry

Note: I have a tendency to say 'searching' for the user agent string in the database because it's not a simple look up. But to be clear, the queries are going to use '=' operators, not regexes or LIKE % syntax.

So the speed of looking up the user agent string is paramount. I've explored a few methods of making sure it will have good performance. Indexing the whole column is right out for size reasons. A partial index isn't such a good idea either because most user agents have the distinguishing information at the end; the partial index would have to be fairly long to make it worthwhile, by which point its size is causing problems.
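The "distinguishing information at the end" problem is easy to see with two illustrative UA strings (hypothetical examples): a Chrome UA and the same UA with an Edge suffix share their entire first ~115 characters, so a prefix index shorter than that cannot tell the two rows apart. A quick Python check:

```python
import os.path

# Two illustrative user agent strings: the second is the first plus an "Edg/..." suffix.
ua_chrome = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
             "(KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36")
ua_edge = ua_chrome + " Edg/96.0.1054.62"

# A prefix index must cover more than the shared prefix to distinguish the rows.
shared = len(os.path.commonprefix([ua_chrome, ua_edge]))
print(shared)  # the entire Chrome string: well over 100 characters
```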

So it comes down to a hash function. My thought is to hash the user agent string in web server code and run the select looking for the hash value in the database. I feel like this would minimize the load on the database server (as opposed to having it compute the hash), especially since, if the hash isn't found, the code would turn around and ask the database to compute the hash again on the insert.
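A minimal sketch of that flow (Python with SQLite standing in for MySQL; the table and column names here follow the schema in the answer below but are only illustrative): the app computes the digest once and reuses it for both the SELECT and, on a miss, the INSERT, so the database never hashes anything.

```python
import hashlib
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE agent (agent_hash TEXT PRIMARY KEY, agent TEXT NOT NULL)")

def cache_agent(db, ua):
    """Return the hash key for ua, inserting it into the cache table on a miss."""
    h = hashlib.sha1(ua.encode("utf-8")).hexdigest()  # hashed in app code, not in SQL
    hit = db.execute("SELECT 1 FROM agent WHERE agent_hash = ?", (h,)).fetchone()
    if hit is None:
        # Reuse the already-computed hash instead of asking the DB to compute it again.
        db.execute("INSERT INTO agent (agent_hash, agent) VALUES (?, ?)", (h, ua))
    return h

key = cache_agent(db, "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/95.0")
```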

Hashing to an integer value would offer the best performance at the risk of higher collisions. I'm expecting to see thousands or tens of thousands of user agents at the most; even 100,000 user agents would fit reasonably well into a 2^32-sized integer space with very few collisions, which could be disambiguated by the webservice with minimal performance impact. Even if you think an integer hash isn't such a good idea, using a fixed-length hex digest (32 characters for MD5, 40 for SHA-1) should be much faster for selects than the raw string, right?
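The "very few collisions" claim can be sanity-checked with the birthday approximation: for n keys drawn uniformly into a space of size m, the expected number of colliding pairs is roughly n(n−1)/2m. A sketch (taking the first four bytes of an MD5 digest is just one arbitrary way to derive a 32-bit integer key):

```python
import hashlib

def ua_hash32(ua: str) -> int:
    """Derive a 32-bit integer key from the first four bytes of an MD5 digest."""
    return int.from_bytes(hashlib.md5(ua.encode("utf-8")).digest()[:4], "big")

# Birthday approximation: expected colliding pairs for n keys in a space of size m.
n, m = 100_000, 2 ** 32
expected_collisions = n * (n - 1) / (2 * m)
print(round(expected_collisions, 2))  # ~1.16, i.e. "very few collisions" indeed
```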

My database is MySQL with the InnoDB engine. The web code will be coming from C# at first and PHP later (after we consolidate some hosting and authentication); not that the web code should make a big difference.

Let me apologize at this point if you think this is a lame choose-my-hash-algorithm question. I'm really hoping to get some input from people who've done something similar before, and their decision process. So, the questions:

• Which hash would you use for this application?
• Would you compute the hash in code or let the db handle it?
• Is there a radically different approach for storing/searching long strings in a database?

Solution

Your idea of hashing long strings to create a token for lookups in a store (cache or database) is a good one. I have seen this done for extremely large strings in high-volume environments, and it works great.

"Which hash would you use for this application?"

• I don't think the particular hashing algorithm really matters, as you are not hashing to encrypt data; you are hashing to create a token to use as a key for looking up longer values. So the choice of hashing algorithm should be based on speed.

"Would you compute the hash in code or let the db handle it?"

• If it were my project, I would do the hashing at the app layer and then pass it through to look up within the store (cache, then database).

"Is there a radically different approach for storing/searching long strings in a database?"

• As I mentioned, I think for your specific purpose, your proposed solution is a good one.

Table recommendations (demonstrative only):

user

• id int(11) unsigned not null
• name_first varchar(100) not null

user_agent_history

• user_id int(11) unsigned not null
• agent_hash varchar(255) not null

agent

• agent_hash varchar(255) not null
• browser varchar(100) not null
• agent text not null

A few notes on the schema:

• From your OP it sounds like you need an M:M relationship between user and agent, because a user may be using Firefox at work but then switch to IE9 at home. Hence the need for the pivot table.

• The varchar(255) used for agent_hash is up for debate. MySQL suggests using a varbinary column type for storing hashes, of which there are several types.

• I would also suggest either making agent_hash a primary key, or at the very least adding a UNIQUE constraint to the column.
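Put together as runnable DDL (sketched here via Python's sqlite3 for a self-contained example; in MySQL the columns would be the INT(11) UNSIGNED / VARCHAR types listed above, and per the VARBINARY note the hash column could instead hold the raw digest bytes):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE user (
    id         INTEGER PRIMARY KEY,
    name_first TEXT NOT NULL
);
CREATE TABLE agent (
    agent_hash TEXT PRIMARY KEY,   -- the PK doubles as the suggested UNIQUE constraint
    browser    TEXT NOT NULL,
    agent      TEXT NOT NULL
);
CREATE TABLE user_agent_history (  -- pivot table carrying the M:M user/agent relation
    user_id    INTEGER NOT NULL REFERENCES user(id),
    agent_hash TEXT NOT NULL REFERENCES agent(agent_hash)
);
""")

# One user seen with two different agents (Firefox at work, IE9 at home);
# the names and hash values are purely illustrative.
db.execute("INSERT INTO user VALUES (1, 'Alice')")
db.execute("INSERT INTO agent VALUES ('h_ff', 'Firefox', 'Mozilla/5.0 ... Firefox/9.0')")
db.execute("INSERT INTO agent VALUES ('h_ie', 'IE', 'Mozilla/5.0 ... MSIE 9.0 ...')")
db.execute("INSERT INTO user_agent_history VALUES (1, 'h_ff')")
db.execute("INSERT INTO user_agent_history VALUES (1, 'h_ie')")
```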

