Counting all rows with specific columns and grouping by week

Question

I've been trying for some time to write a query that counts, per day, all rows in a table whose group column has a certain id, and then rolls those counts up into weekly values based on the UNIX timestamp column. I have a medium-sized dataset with 37 million rows, and have been trying to run the following kind of query:

SELECT DATE(timestamp), COUNT(*) FROM `table` WHERE ( date(timestamp) 
between "YYYY-MM-DD" and "YYYY-MM-DD" and column_group_id=X ) 
group by week(date(startdate)) 

However, I'm getting weird results: the query doesn't group the counts correctly and shows values that are too large in the count column (I verified that the values are wrong by querying very small, specific datasets).

If I group by date(startdate) instead, the row counts match on a per-day basis, but I'd like to combine these daily row counts into weekly counts. How can this be done? The data is needed in this format:

2006-01-01 | 5 
2006-01-08 | 10

so that the day is the first column and the number of rows for that week is the second.

Recommended answer

Your query is non-deterministic, so it is not surprising that you are getting unexpected results. By this I mean you could run the query on the same data 5 times and get 5 different result sets. You are selecting DATE(timestamp) but grouping by WEEK(DATE(startdate)), so for each startdate week the query returns the date from whichever row it happens to read first, and rows can be read in any order.

Consider the following 2 rows (with the timestamp shown in date format for ease of reading):

TimeStamp       StartDate
20120601        20120601
20120701        20120601

Your query groups by WEEK(StartDate), which is 22 for 2012-06-01 under MySQL's default week mode. Since both rows evaluate to the same value, you would expect your results to have 1 row with a count of 2.

However, DATE(Timestamp) is also in the select list, and since it is neither aggregated nor part of the GROUP BY, the query has no rule for deciding which Timestamp to return: '20120601' or '20120701'. So even on this small data set you could get:

TimeStamp       COUNT
20120601        2

or you could just as well get:

TimeStamp       COUNT
20120701        2

If you add more data to the dataset like so:

TimeStamp       StartDate
20120601        20120601
20120701        20120601
20120701        20120701

you could get either of the following:

TimeStamp       COUNT
20120601        2
20120701        1

TimeStamp       COUNT
20120701        2
20120701        1
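The three example rows can be loaded into a scratch table to try this out. This is only a sketch: the table name week_demo and its column names are made up for illustration, and which of the two result sets comes back depends on the server and its settings.

```sql
-- Hypothetical scratch table mirroring the three example rows.
CREATE TEMPORARY TABLE week_demo (
    ts        DATE NOT NULL,   -- stands in for the TimeStamp column
    startdate DATE NOT NULL    -- stands in for the StartDate column
);

INSERT INTO week_demo (ts, startdate) VALUES
    ('2012-06-01', '2012-06-01'),
    ('2012-07-01', '2012-06-01'),
    ('2012-07-01', '2012-07-01');

-- Same shape as the query in the question: ts is selected but is neither
-- grouped nor aggregated, so the server may pick either ts value per group.
-- With the ONLY_FULL_GROUP_BY SQL mode enabled, MySQL rejects this query
-- with an error instead of picking a value.
SELECT ts, COUNT(*) AS cnt
FROM week_demo
GROUP BY WEEK(startdate);
```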

You can see how, with 37,000,000 rows, you will quickly get results that you do not expect and cannot predict!
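One possible way to make a query of this shape deterministic is to aggregate the selected date and to filter and group on the same column. The sketch below assumes the table and column names from the question, that `timestamp` is a DATETIME/TIMESTAMP column, and placeholder values for the date range and id. It uses YEARWEEK rather than WEEK so that the same week number in different years does not end up in one group.

```sql
-- Sketch only: date range and id are placeholders, not values from the question.
SELECT
    MIN(DATE(`timestamp`)) AS FirstDay,   -- earliest day that has rows in the week
    COUNT(*)               AS RowsInWeek
FROM `table`
WHERE `timestamp` >= '2006-01-01'         -- placeholder start (inclusive)
  AND `timestamp` <  '2006-03-01'         -- placeholder end (exclusive)
  AND column_group_id = 1                 -- placeholder id
GROUP BY YEARWEEK(`timestamp`)
ORDER BY FirstDay;
```

Note that FirstDay here is the first day with data in each week, which is not necessarily the calendar start of that week.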

Edit

Since it looks like you want the week start in your results while grouping by week, you could use the following to get the week start (replace CURRENT_TIMESTAMP with whichever column you want). DAYOFWEEK returns 1 for Sunday, so the expression gives the Sunday that begins the week:

-- DATE() strips the time part so every row in the same week yields the same value
SELECT  DATE_ADD(DATE(CURRENT_TIMESTAMP), INTERVAL 1 - DAYOFWEEK(CURRENT_TIMESTAMP) DAY) AS WeekStart

You can then group by this date as well to get weekly results, which avoids the problem of having things in your select list that aren't in your GROUP BY.
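Put together with the query from the question, this might look roughly like the sketch below. It assumes the table and column names from the question, that `timestamp` is a DATETIME/TIMESTAMP column (if it stores UNIX epoch seconds, wrap it in FROM_UNIXTIME first), and placeholder values for the date range and id.

```sql
-- Sketch only: date range and id are placeholders, not values from the question.
SELECT
    DATE_ADD(DATE(`timestamp`),
             INTERVAL 1 - DAYOFWEEK(`timestamp`) DAY) AS WeekStart,  -- Sunday of that week
    COUNT(*) AS RowsInWeek
FROM `table`
WHERE `timestamp` >= '2006-01-01'      -- placeholder start (inclusive)
  AND `timestamp` <  '2006-03-01'      -- placeholder end (exclusive)
  AND column_group_id = 1              -- placeholder id
GROUP BY WeekStart                     -- MySQL allows grouping by a select alias
ORDER BY WeekStart;
```

Filtering on the raw column with a half-open range, rather than on date(timestamp), is a design choice that lets MySQL use an index on that column if one exists. The filter and the grouping also use the same column here, unlike the original query, which filtered on timestamp but grouped on startdate.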
