Full text search on encrypted data


Problem Description


Suppose I have a server storing encrypted text (end-to-end: server never sees plain text).


I want to be able to do full text search on that text.
I know this is tricky, but my idea is to use the traditional full text design ("list" and "match" tables where words are stored and matched with ids from the content table). When users submit the encrypted text, they also send a salted MD5 of the words and respective matches. The salt used is unique for each user and is recovered from their password.
(in short: the only difference is that the "list" table will contain hashed words)
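As a sketch, the client-side indexing step described above might look like this (salted MD5 as the question proposes; the function and variable names are mine, and a keyed HMAC or slow hash would resist precomputation better than plain salted MD5):

```python
import hashlib

def index_tokens(text: str, user_salt: bytes) -> set:
    """Hash each unique word of a text with the user's salt.

    Illustrative only: the question specifies salted MD5, but any
    names and parameters here are assumptions, not the actual design.
    """
    words = {w.lower() for w in text.split()}  # report each word only once
    return {hashlib.md5(user_salt + w.encode()).hexdigest() for w in words}
```

The client would submit these hashes alongside the encrypted text; the server stores them in the "list" table without ever seeing the plain words.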


Now, how vulnerable would this system be?
Note that I said "how vulnerable" instead of "how safe", because I acknowledge that it can't be totally safe.
I DO understand the tradeoff between features (full text search) and security (disclosing some information from the word index). For example, I understand that an attacker able to get the list and match tables could get information about the original, encrypted text and possibly be able to decipher some words with statistical analysis (however, being the salt unique for each user, this would need to be repeated for each user).


How serious would this threat be? And would there be any other serious threats?


DISCLAIMER
What I'm trying to build (with the help of a cryptographer for the actual implementation; right now I'm just trying to understand whether this is possible at all) is a consumer-grade product that will deal with confidential, yet not totally secret, data.
My goal is just to provide something safe enough that it would be easier for an attacker to try stealing users' passwords (e.g. by breaching the clients - they are consumers, after all) than to spend a huge amount of time and computing power brute-forcing the index or running complicated statistical analysis.




  1. As you noted, the other solutions are not viable. Storing all the data inside the client means that users cannot access their data from other clients. Server-side encryption would work, but then we wouldn't be able to give users the added security of client-side encryption.
    The only "true alternative" is simply not to implement search: while it is not a required feature, it's very important to me/us.


The salt will be protected in exactly the same way as the users' decryption key (the one used to decrypt the stored texts). Thus, if someone were able to capture the salt, he or she would likely be able to capture the key as well, creating a much bigger issue.
To be precise, the key and the salt will be stored encrypted on the server. The client decrypts them locally with the user's password and keeps them in memory; the server never sees the decrypted key and salt. Users can then change passwords and only need to re-encrypt the key and the salt, not all the stored texts. This is a pretty standard approach in the industry, to my knowledge.
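The password-change flow works because the password only protects a wrapping key. A minimal sketch of the derivation step (PBKDF2 parameters and all names are my assumptions, not from the post):

```python
import hashlib

def derive_wrapping_key(password: str, kdf_salt: bytes) -> bytes:
    # Derived client-side from the user's password; used only to
    # encrypt/decrypt the stored content key and index salt, so a
    # password change re-wraps those two small values instead of
    # re-encrypting every stored text. Iteration count is illustrative.
    return hashlib.pbkdf2_hmac("sha256", password.encode(), kdf_salt, 200_000)
```

The server only ever stores the KDF salt and the wrapped (encrypted) key and index salt; the derived key itself never leaves the client.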


Actually, the design of the database will be as follows (reporting relevant entries only). This design is like the one you proposed in your comment. It disallows proximity searches (not very relevant to us) and makes frequency information less accurate.



  • Table content, containing all encrypted texts. Columns are content.id and content.text.
  • Table words, containing the list of all hashes. Columns are words.id and words.hash.
  • Table match, that matches texts with hashes/words (in a one-to-many relationship). Columns are match.content_id and match.word_id.
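A hypothetical, minimal version of that schema in SQLite, with a lookup that resolves a hashed search term to content ids (table and column names follow the bullets above; everything else is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE content (id INTEGER PRIMARY KEY, text BLOB);         -- encrypted blobs
CREATE TABLE words   (id INTEGER PRIMARY KEY, hash TEXT UNIQUE);  -- salted word hashes
CREATE TABLE "match" (content_id INTEGER REFERENCES content(id),  -- one-to-many links
                      word_id    INTEGER REFERENCES words(id));
""")

def search(conn, term_hash):
    # Return the ids of encrypted texts containing the hashed term.
    rows = conn.execute(
        'SELECT m.content_id FROM "match" m '
        'JOIN words w ON w.id = m.word_id WHERE w.hash = ?',
        (term_hash,))
    return [r[0] for r in rows]
```

Note that `match` is quoted because MATCH is an SQL keyword; the server can run this lookup knowing only the hash, never the word.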


We would have to implement features like removing stopwords ourselves, sure. That is not a big issue (and will, of course, be done on the client). In any case, such lists have always been of limited utility for international (i.e. non-English-speaking) users.
We expect the lookup/insert ratio to be pretty high (i.e. many lookups, but rare inserts and mostly in bulk).


Decrypting the whole hash database is certainly possible, but requires a brute force attack.
Suppose the salt is kept safe (as per point 2 above). If the salt is long enough (you cited 32 bits... but why not 320? - just an example) that would take A LOT of time.


To conclude... you confirmed my doubts about the possible risk of frequency analysis. However, I feel this risk is not that high. Can you confirm that?
Indeed, first of all, the salt would be unique for each user, which means users must be attacked one at a time.
Second, by recording each word only once per text (no matter how many times it appears), frequency analysis becomes less reliable.
Third, frequency analysis on hashed words doesn't sound nearly as effective as frequency analysis on a Caesar shift, for example. There are 250,000 words in English alone (and, again, not all our users will be English speakers), and even if some words are more common than others, I believe this attack would be hard to pull off anyway.


PS: The data we'll be storing is messages, like instant messages. These are short, contain a lot of abbreviations, slang, etc. And every person has a different style in writing texts, further reducing the risk (in my opinion) of frequency attacks.

Answer


TL;DR: If this needs to be secure enough that it requires per-user end-to-end encryption: Don't do it.


Too long for a comment, so here goes - if I understand correctly:


  1. You have data submitted by users that is already encrypted (encrypted client-side, so the database cannot work with it).

  2. You want this content to be searchable by the user (you yourself know nothing about it - an encrypted blob of text is of no use).

  3. Your proposed solution for this is to also store a hashed list of the submitted words (or paragraphs)

So a data record would look like the following:


  • Column 1: the encrypted blob of data

  • Column 2: (space-)separated, hashed, ordered individual words from the encrypted text above


Then to search you just hash the search terms and treat the hashed terms as words to search the paragraph(s) of "text" in column 2. This will definitely work - just consider searching nonsense text with nonsense search terms. You would even still be able to do some proximity ranking of terms with this approach.
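A toy version of that column-2 scheme (salted MD5 as in the question; function names are mine). Because the hashed words keep their order, proximity ranking remains possible:

```python
import hashlib

def hash_word(word: str, salt: bytes) -> str:
    return hashlib.md5(salt + word.lower().encode()).hexdigest()

def make_column2(text: str, salt: bytes) -> str:
    # Ordered, space-separated hashed words standing in for the text.
    return " ".join(hash_word(w, salt) for w in text.split())

def matches(column2: str, query: str, salt: bytes) -> bool:
    # Hash the search terms and look for them among the hashed "words".
    hashed = column2.split()
    return all(hash_word(w, salt) in hashed for w in query.split())
```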

Concerns:


  1. A column of individually hashed words as text will be incredibly weak compared with the encrypted text. You are dramatically weakening your solution, since not only is there a limited set of words to draw from, but the resulting text is also vulnerable to word-frequency analysis, and so on.

  2. If you do this, store a salt separately that has nothing to do with the password. If the salt were captured, rainbow tables would be easy to create (the inputs are dictionary words only).

  3. You lose many of the benefits of FTS, such as ignoring words like "the" - if you need that, you will have to re-implement it yourself (i.e. strip those words on the client before submitting data/search terms).
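For the third concern, stopword filtering would have to move to the client. A minimal sketch (the stopword list is illustrative and would need per-language curation):

```python
# Hypothetical client-side stopword filter, applied before hashing,
# since the server can no longer strip common words itself.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in"}

def strip_stopwords(words):
    return [w for w in words if w.lower() not in STOPWORDS]
```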


Other approaches that you imply are not acceptable/workable:


  1. Doing the search client-side (all of the data would have to be on the client)

  2. Centralized encryption at the database, using its built-in functionality


I understand the argument: your approach gives the user sole access to their data (i.e. you cannot see/decrypt it). I would argue that this hashed approach weakens the data sufficiently that you could reasonably work out a user's data (that is, you have lowered the effort required to the point where it is very plausible you could decrypt a user's information without any knowledge of their keys/salts). I wouldn't lower the bar quite so far as to call this mere obfuscation, but you should really think through how significant this is.


If you are sure that weakening your system to implement searching like this makes sense, and another approach is not sufficient, one thing that could help is to store the hashes of the words in the text as a list of uniquely occurring words only (i.e. no frequency or proximity information would be available). This would reduce the attack surface of your implementation a little, but it would also lose the benefits you imply you want by describing the approach as FTS. You could get very fast results this way, though, as the hashed words essentially become tags attached to all the records that include them. Search lookups could then become very fast (at the expense of your inserts).


*Just to be clear - I would want to be REALLY sure my business needs demanded something like this before I implemented it...

Edit:


A quick example of the issues - say I know you are using 32-bit salts and are hashing common words like "the". 2^32 possible salts = 4 billion possible salts (that is, not that many if you only need to hash a handful of words for the initial attack). Assuming the salt is appended or prepended, that is still only 8 billion entries to pre-calculate. Even with less common words, you would not need to create too many such lists to ensure you get hits (and if that is not the case, your data is not worth searching).


Then look up the highest-frequency hashes for a given block of text in each of the pre-calculated salt tables and use a match to see if it correctly decrypts other words in the text. Once you have a plausible candidate, generate the 250,000-word English-language rainbow table for that salt and decrypt the text.


I would guess you could decrypt the hashed data in the system in hours to days with access to the database.
