The Majority Event In Streaming Data
An interesting (and idiosyncratic) streaming algorithm
--
Imagine a possibly unbounded stream of incoming symbols. At a particular point, say there is some symbol that has occurred more than 50% of the time till then. We call it a majority symbol. It is obviously unique if it exists.
This post discusses an efficient streaming algorithm that “half” solves this problem. If the stream has a majority symbol, it finds it. (This is equivalent to saying that if the algorithm outputs NO, the stream for sure doesn’t have a majority symbol.) If the stream doesn’t have one, unfortunately, it can return a bad output. This definitely limits its usefulness.
Well then, given its idiosyncratic behavior, why even discuss this algorithm? Because of its attractive properties in streaming settings. It hardly needs any memory and is ‘blindingly fast’. It also uses an interesting trick.
This post will be of interest to you if you are new to ultra-lightweight streaming algorithms and data structures (aka data sketches) and wish to see an example with an interesting trick. Or if you have a use case that can benefit from it despite its idiosyncrasies.
The Algorithm
First, some terminology. We will express the symbols that have arrived in a stream as a string of lower-case characters. For example, the string aba denotes that the symbols a, b, and a arrived in that order into the stream. When we say “majority symbol of the string” we mean the majority symbol of the stream after the last character has arrived.
Key Problem Property
The algorithm relies on this key property of the problem. Say the string has a majority symbol, m. Say it is possible to express the string Z as XY, where the prefix substring X does not have a majority symbol. Then Y must have a majority symbol and it must be m.
The reasoning goes like this. Since X does not have a majority symbol, m occurs no more than 50% of the time in X. But since m must occur more than 50% of the time in XY, to make this happen, it must also occur more than 50% of the time in Y.
Key Algorithm Idea
The algorithm uses an interesting idea involving canceling majority symbol occurrences with non-majority symbol ones.
To describe it, let’s formalize it a bit. Assume we somehow know the identity of the…