How to store millions of statistics records efficiently?

Problem description

We have about 1.7 million products in our eshop, and we want to keep a record of how many views each product had over a 1-year period, recording the views at least every 2 hours. The question is: what structure should we use for this task?

Right now we have tried keeping stats for the last 30 days in records that have 2 columns, classified_id and stats, where stats is a stripped-down JSON-like string in the format date:views,date:views... For example, a record would look like

345422,{051216:23212,051217:64233} where 051216,051217=mm/dd/yy and 23212,64233=number of views
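To make the current layout concrete, here is a minimal sketch of what such a table could look like in Postgres; the table and column names are assumptions based on the description above, and the stats column is just the stripped date:views string:

-- Hypothetical sketch of the current 30-day layout (names assumed).
-- Summing views across many products means fetching these strings
-- and parsing them client-side.
CREATE TABLE classified_stats (
    classified_id bigint PRIMARY KEY,
    stats         text NOT NULL  -- e.g. '{051216:23212,051217:64233}'
);

INSERT INTO classified_stats VALUES
    (345422, '{051216:23212,051217:64233}');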

This of course is kind of stupid if you want to go 1 year back, since if you want to get the sum of views of, say, 1000 products, you need to fetch something like 30 MB from the database and calculate it yourself.

The other way we are thinking of going right now is to just have a massive table with 3 columns, classified_id, date, views, and store each recording in its own row. This of course will result in a huge table with hundreds of millions of rows; for example, if we have 1.8 million classifieds and keep records 24/7 for one year, every 2 hours, we need

1,800,000 * 365 * 12 = 7,884,000,000 (billions with a B) rows, which, while way inside the theoretical limit of Postgres, makes me imagine that queries on it (say, for updating the views), even with the correct indexes, will take some time.
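A minimal sketch of that row-per-sample design, assuming a timestamp column for the 2-hour samples (names are illustrative, not from the original):

-- Hypothetical sketch of the proposed 3-column design.
CREATE TABLE classified_views (
    classified_id bigint      NOT NULL,
    ts            timestamptz NOT NULL,  -- one row per ~2-hour sample
    views         integer     NOT NULL
);

-- Sum of views for a set of products over an arbitrary period:
SELECT classified_id, sum(views) AS total_views
FROM classified_views
WHERE classified_id IN (345422, 345423)
  AND ts >= now() - interval '1 year'
GROUP BY classified_id;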

Any suggestions? I can't even imagine how Google Analytics stores its stats...

Recommended answer

This number is not as high as you think. In my current job we store metrics data for websites, and the total number of rows we have is much higher. And in a previous job I worked with a pg database which collected metrics from a mobile network, and it collected ~2 billion records per day. So do not be afraid of billions of records.

You will definitely need to partition the data, most probably by day. With this amount of data you may find indexes quite useless; it depends on the plans you will see in the EXPLAIN command output. For example, that telco app did not use any indexes at all, because they would just slow down the whole engine.
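As a rough illustration only: day-based range partitioning could look like the sketch below, using declarative partitioning (PostgreSQL 10+). The setup described in this answer may well have used the older inheritance/trigger-based partitioning instead, and the table and partition names are assumptions.

-- Partitioned variant of the classified_views sketch above (would replace it).
CREATE TABLE classified_views (
    classified_id bigint      NOT NULL,
    ts            timestamptz NOT NULL,
    views         integer     NOT NULL
) PARTITION BY RANGE (ts);

-- One partition per day; in practice a scheduled job would create these.
CREATE TABLE classified_views_2016_05_12 PARTITION OF classified_views
    FOR VALUES FROM ('2016-05-12') TO ('2016-05-13');

-- Check the plan to decide whether an index actually helps:
EXPLAIN
SELECT sum(views)
FROM classified_views
WHERE ts >= '2016-05-12' AND ts < '2016-05-13';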

Another question is how quickly you need responses to queries, and which steps of granularity (sums over hours/days/weeks, etc.) you will allow users to query. You may even need to make some aggregations for granularities like week, month, or quarter.
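For example, a weekly rollup could be maintained in a separate table like the sketch below; the table and column names are illustrative and build on the classified_views sketch above:

-- Hypothetical pre-aggregation to weekly granularity.
CREATE TABLE classified_views_weekly (
    classified_id bigint NOT NULL,
    week_start    date   NOT NULL,
    views         bigint NOT NULL,
    PRIMARY KEY (classified_id, week_start)
);

-- Roll up the most recent complete week:
INSERT INTO classified_views_weekly (classified_id, week_start, views)
SELECT classified_id,
       date_trunc('week', ts)::date AS week_start,
       sum(views)
FROM classified_views
WHERE ts >= date_trunc('week', now()) - interval '1 week'
  AND ts <  date_trunc('week', now())
GROUP BY classified_id, week_start;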

Added:

Those ~2 billion records per day in that telco app took ~290 GB per day. That meant inserting ~23,000 records per second using bulk inserts with the COPY command, where every bulk was several thousand records. The raw data were partitioned by minutes. To avoid disk waits, the db had 4 tablespaces on 4 different disks/arrays, and the partitions were distributed over them. PostgreSQL was able to handle it all without any problems. So you should think about a proper HW configuration too.
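A rough sketch of what such bulk loading and a multi-disk tablespace layout might look like; the file path, tablespace names, and CSV file are placeholders, not details from the original setup:

-- Bulk-load one batch of samples (server-side file, placeholder path):
COPY classified_views (classified_id, ts, views)
FROM '/var/tmp/views_batch.csv' WITH (FORMAT csv);

-- Tablespaces on different disks/arrays, with partitions spread over them:
CREATE TABLESPACE stats_disk1 LOCATION '/mnt/disk1/pgdata';
CREATE TABLESPACE stats_disk2 LOCATION '/mnt/disk2/pgdata';

CREATE TABLE classified_views_2016_05_13 PARTITION OF classified_views
    FOR VALUES FROM ('2016-05-13') TO ('2016-05-14')
    TABLESPACE stats_disk1;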

It is also a good idea to move the pg_xlog directory to a separate disk or array; not just a different filesystem, it all must be separate HW. SSDs I can recommend only in arrays with proper error checking; lately we had problems with a corrupted database on a single SSD.
