24Nov/097
group date
A non-rigorous, non-scientific, totally off-the-cuff test of which function to pick when you need to group by year and month.
I populated a table with 262K rows of random dates, and then ran
SELECT %s, COUNT(*) FROM table GROUP BY %s ORDER BY NULL
with various functions, which should all result in the same grouping. I repeated each query five times and show the average time, using three different column types DATE, DATETIME, and TIMESTAMP.
| expression | DATE | DATETIME | TIMESTAMP |
|---|---|---|---|
| EXTRACT(YEAR_MONTH FROM d) | 0.362 | 0.369 | 0.581 |
| LAST_DAY(d) | 0.374 | 0.389 | 0.582 |
| DATE_SUB(d, INTERVAL DAY(d) DAY) | 0.429 | 0.882 | 1.452 |
| d - INTERVAL DAY(d) DAY | 0.429 | 0.887 | 1.535 |
| SUBSTRING(d, 1, 7) | 0.454 | 0.488 | 0.719 |
| YEAR(d), MONTH(d) | 1.046 | 1.126 | 2.045 |
| MONTH(d), YEAR(d) | 1.116 | 1.196 | 2.112 |
| LEFT(d, 7) | 1.307 | 1.405 | 2.123 |
| DATE_FORMAT(d, '%Y%m') | 1.480 | 1.565 | 2.112 |
| DATE_FORMAT(d, '%Y-%m') | 1.514 | 1.615 | 2.564 |
| DATE_FORMAT(d, '%m%Y') | 1.517 | 1.604 | 2.420 |
| DATE_FORMAT(d, '%m-%Y') | 1.562 | 1.656 | 2.465 |
| MONTHNAME(d), YEAR(d) | 1.613 | 1.713 | 2.812 |
| YEAR(d), MONTHNAME(d) | 1.663 | 1.766 | 2.873 |
And just in case you want to extract a full date (which really only makes sense for datetime and timestamp):
| expression | DATE | DATETIME | TIMESTAMP |
|---|---|---|---|
| DATE(d) | 0.357 | 0.374 | 0.591 |
| EXTRACT(YEAR_MONTH FROM d), EXTRACT(DAY FROM d) | 0.377 | 0.407 | 0.730 |
| EXTRACT(DAY FROM d), EXTRACT(YEAR_MONTH FROM d) | 0.395 | 0.422 | 0.751 |
| MONTH(d), YEAR(d), DAY(d) | 0.395 | 0.426 | 0.870 |
| YEAR(d), DAY(d), MONTH(d) | 0.398 | 0.440 | 0.862 |
| YEAR(d), MONTH(d), DAY(d) | 0.406 | 0.441 | 0.867 |
| DAY(d), YEAR(d), MONTH(d) | 0.409 | 0.444 | 0.859 |
| LEFT(d, 10) | 0.437 | 0.487 | 0.728 |
| SUBSTRING_INDEX(d, ' ', 1) | 0.439 | 0.475 | 0.743 |
| DAY(d), MONTH(d), YEAR(d) | 0.441 | 0.477 | 0.901 |
| MONTH(d), DAY(d), YEAR(d) | 0.442 | 0.472 | 0.914 |
| SUBSTRING(d, 1, 10) | 0.460 | 0.496 | 0.729 |
| DATE_FORMAT(d, '%d%Y%m') | 0.539 | 0.571 | 0.852 |
| DATE_FORMAT(d, '%m%Y%d') | 0.542 | 0.570 | 0.841 |
| DATE_FORMAT(d, '%m%d%Y') | 0.543 | 0.572 | 0.846 |
| DATE_FORMAT(d, '%Y%m%d') | 0.544 | 0.570 | 0.842 |
| DATE_FORMAT(d, '%Y%d%m') | 0.544 | 0.572 | 0.843 |
| DATE_FORMAT(d, '%d-%Y-%d') | 0.547 | 0.574 | 0.848 |
| DATE_FORMAT(d, '%d%m%Y') | 0.549 | 0.573 | 0.842 |
| DATE_FORMAT(d, '%m-%Y-%d') | 0.552 | 0.583 | 0.869 |
| DATE_FORMAT(d, '%Y-%m-%d') | 0.558 | 0.583 | 0.854 |
| DATE_FORMAT(d, '%Y-%d-%m') | 0.569 | 0.593 | 0.867 |
| d - INTERVAL HOUR(d) HOUR - INTERVAL MINUTE(d) MINUTE - INTERVAL SECOND(d) SECOND | 0.573 | 0.653 | 1.249 |
| DATE_FORMAT(d, '%d-%m-%Y') | 0.598 | 0.637 | 0.920 |
| DATE_FORMAT(d, '%m-%d-%Y') | 0.601 | 0.633 | 0.904 |
November 24th, 2009 - 09:19
Out of interest how does ‘substring(d,1,7)’ compare on your test data?
Thats what I tend to use.
November 24th, 2009 - 09:28
I would be interested to see the same test repeated grouping by year-month when storing times with timestamp and datetime – since this seems to be the common field users will store dates in.
You could also try using ORDER BY NULL to remove any time that might be spent sorting. This might produce greater differences between the results.
November 24th, 2009 - 10:54
Updated the post with those suggestions. Note that the SUBSTRING() approach is not so hot for the DATE field, but much better than most others for DATETIME and TIMESTAMP.
November 24th, 2009 - 13:07
Thanks for the update!
November 25th, 2009 - 05:12
(ack – your wordpress cut my comment off at the first less than sign.. reposting)
I primarily group by date on my datetime/timestamp fields. I compared some times of my own and found the speeds:
DATE(created) quicker than substring(created,1,10)
substring(created,1,10) is the same as TO_DAYS(created)
DATE_FORMAT(created, ‘%Y-%m-%d’) is slower than substring(created,1,10)
I was already using DATE(created) so I don’t get a performance boost today. Thanks for the idea though.
The next time I have to create a table for data that is primarily going to be DATE aggregated, I am creating separate date and time fields and only indexing the date.
November 25th, 2009 - 09:40
Marques: Grouping by a simple indexed column will almost certainly be faster than grouping by any function call.
You might consider storing a datetime and a date. It repeats data, which probably violates some academic normal form, but then you can group by the date and still do math against the datetime without having to concatenate the date and the time. It’s not very often that I find a need for just the time part.
January 8th, 2010 - 15:02
here is a good website for helping you format dates using the mysql date_format function.