设计支持海量数据的存储和查询系统 [英] design a system supporting massive data storage and query

查看:119
本文介绍了设计支持海量数据的存储和查询系统的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我是问面试官设计一个系统来存储千兆字节的数据和系统也有支持某种形式的查询。

I was asked by the interviewer to design a system to store gigabytes of data and the system also has to support some kind of query.

说明:

有在一个IDC生成的记录巨量,每个记录是由一个网址,一个IP其访问的网址,时间,当访问发生。记录大概可以表述为这样的结构,但我不知道我应该选择重的数据类型为present他们:

There are massive amount of records generated in an IDC, each record is composed of a url, an IP which visits the url, and the time when the visit occurs. The record can probably be stated as a struct like this, but I'm not sure which data type should I pick to represent them:

struct Record {
    url;  //char *
    IP;   //int?
    visit_time;   //time_t or simply a number?
}

要求:

设计一个系统来存储100十亿条记录,而且该系统得支持2种查询至少有:

Design a system to store 100 billion records, and also the system gotta support 2 kinds of query at least:

首先,给定一个时间周期(T1,T2)和一个IP地址,查询多少网址该IP已经访问了给定的周期。

First, given a time period (t1, t2) and a IP, query how many urls this IP has visited in the given period.

二,给定一个时间段(T1,T2)和URL,查询多少次这样的网址已被访问。

Second, given a time period (t1, t2) and a url, query how many times this url has been visited.

我迷迷糊糊的,这里是我的愚蠢的解决方案:

I was stumbled, and here is my stupid solution:

分析:

由于每个查询是在一个特定的时间段进行的 ,这样:

because every query is performed upon a given period of time, so:

1 创建集,把所有访问一次进组,并保持一套按照从旧到最新时间的值进行排序。

1.Create a set, put all visit time into the set, and keep the set ordered according to the time's value from older to latest.

2.创建一个哈希表的使用哈希(visit_time)为重点,,这个哈希表被调用时,哈希表,然后点击的每个节点在一个特定的水桶有2个三分球分别指向另外2哈希表。

2.Create a hash table using hash(visit_time) as the key, this hash table is called time-hash-table, then each node in a specific bucket has 2 pointers pointing to another 2 hash-tables respectively.

3,另外2哈希表将是一个 IP哈希表 URL哈希表

IP哈希表使用哈希(IP)为重点,并在同一个IP哈希表的所有IP地址具有相同的访问时间;

ip-hash-table uses hash(ip) as the key and all the ips in the same ip-hash-table have the same visit-time;

URL哈希表使用散列(URL)为重点,并在相同的URL,哈希表中的所有网址具有相同的访问时间。

url-hash-table uses hash(url) as the key and all the urls in the same url-hash-table have the same visit-time.

举一个图如下:

time_hastbl
  []
  []
  []-->[visit_time_i]-->[visit_time_j]...[visit_time_p]-->NIL
  []                     |          |
  []               ip_hastbl       url_hastbl
                      []               []
                      :                :
                      []               []
                      []               []

因此​​,在执行查询时(T1,T2):

So, when doing the query upon (t1, t2):

  1. 您设定的时间最接近的比赛,让我们说,与之匹配的是(T1',T2'),则所有的有效浏览时间将陷入组从T1开始'到t2的一部分;

  1. find the closest match from the time set, let's say the match is (t1', t2'), then all the valid visit time will fall into the part of set starting from t1' to t2';

每次访问时间t的设定[T1':T2']的时候,做散列(t)和发现T的ip_hastbl或url_hastbl,然后计算和记录多少次给定的IP地址或URL出现。

for each visit-time t in the time set[t1':t2'], do hash(t) and find t's ip_hastbl or url_hastbl, then count and log how many times the given ip or url appears.

问题:

1.My解决方案是愚蠢的,希望你能给我一个解决方案。

1.My solution is stupid, hope you can give me another solution.

2.有关于如何存储大量的记录在磁盘上,任何意见?我想到了B树,但如何使用它,或适用于这个系统?

2.with respect to how to store the massive records on disk, any advice? I thought of B-tree, but how to use it or is B-tree applicable in this system?

推荐答案

我相信面试官期待一个基于分布式计算解决方案,尤其百十亿记录参与的时候。随着分布式计算的有限的知识,我有,我建议你看看分布式哈希表的map-reduce (并行查询处理)

I believe the interviewer was expecting a distributed computing based solution, esp when "100 billion records" are involved. With the limited knowledge of Distributed Computing I have, I would suggest you to look into Distributed Hash Table and map-reduce (for parallel query processing)

这篇关于设计支持海量数据的存储和查询系统的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆