Speaker: Trong Nguyen, ECpE Graduate Student
Advisor: Srikanta Tirthapura
Title: Stratified Random Sampling from Streaming Data
Abstract: Stratified random sampling is a widely used sampling technique for approximate query processing. We consider stratified random sampling on continuously arriving data streams, and make the following contributions. We present a lower bound that shows that any streaming algorithm for stratified random sampling must have, in the worst case, a variance that is Ω(r ) factor away from the optimal, where r is the number of strata. We present S-VOILA, a streaming algorithm for stratified random sampling that is locally variance-optimal. Results from experiments on real and synthetic data show that S-VOILA results in a variance that is typically close to an optimal offline algorithm, which was given the entire input beforehand. We also present a variance-optimal offline algorithm VOILA for stratified random sampling. VOILA is a strict generalization of the well-known Neyman allocation, which is optimal only under the assumption that each stratum is abundant, i.e. has a large number of data points to choose from. Experiments show that VOILA can have significantly smaller variance (1.4x to 50x) than Neyman allocation on real-world data.