Evaluate and compare methods for efficient solution to “count-distinct problem”

Master Thesis

Evaluate and compare methods for efficient solution to “count-distinct problem”

Background

Count-distinct problem is about calculating the number of distinct elements in a multiset.

Calculating the exact cardinality of a multiset requires an amount of memory proportional to the cardinality, which is impractical for very large data sets. Probabilistic cardinality estimators, such as the HyperLogLog algorithm, use significantly less memory than this, at the cost of obtaining only an approximation of the cardinality.

Assignment

Compare and evaluate existing algorithms and explore the relative trade-offs between accuracy, memory, and computational requirements of different algorithms. Demonstrate the choice of good algorithms by using a representative dataset from Agama’s datastore.

Implement a solution for some relevant metrics in the Agama system, for example “unique active devices”.

Apply

To learn more about this master thesis or to apply, contact:

Aner Gusic
aner.gusic@agama.tv

Related reading:

1. Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. 2008. Why go logarithmic if we can go linear?: Towards effective distinct counting of search traffic. In Proceedings of the 11th international conference on Extending database technology: Advances in database technology (EDBT ’08)

2. Peter Clifford, Ioana A COSMA. (2012), A Statistical Analysis of Probabilistic Counting Algorithms. Scandinavian Journal of Statistics, 39: 1-14.

To learn more about this master thesis or to apply, contact:

Aner Gusic
aner.gusic@agama.tv

Want to know more?

Would you like to know more about Agama and our story?