HyperLogLog in Pure SQL

Features About Jobs Blog Sign Up Sign In The High-Performance SQL Blog Subscribe Thank you HyperLogLog in Pure SQL 29 May 2014 Earlier we showed how to make count distinct 50x faster with subqueries . Using probabilistic counting we'll make count distinct even faster, trading a little accuracy for the increase in speed. We'll optimize a very simple query, which calculates the daily distinct sessions for 5,000,000 gameplays (~150,000/day): select date ( created_at ), count ( distinct session_id ) from gameplays The original query takes 162.2s. The HyperLogLog version is 5.1x faster (31.5s) with a 3.7% error, and uses a small fraction of the RAM. Databases often implement count(distinct) in two ways: When there are few distinct elements, the database makes a hashset in RAM and then counts the keys. When there there are too many elements to fit in R...

Linked on 2014-05-30 18:46:16 | Similar Links