We had a lot to celebrate recently. Last year was the 300th anniversary of Jacob Bernoulli’s Ars Conjectandi. In this book he consolidated central ideas in probability theory, including the very first version of the law of large numbers. It was also the 250th anniversary of Bayes’ theorem, named after Thomas Bayes (1701–1761), who first suggested using the theorem to update beliefs.
Fast forward.
The enthusiasm around Big Data hinges on the use of statistics to provide relevant and meaningful analysis of ever-larger data sets. Statistical science has produced excellent machine learning tools and methods that go beyond the classification and ranking of data. Today, we try to quantify explicitly the uncertainty in whatever we conclude from a data set, be it a prediction or a scientific inference. Our work today rests firmly on the shoulders of Bernoulli and Bayes, who laid these foundations a few hundred years ago.
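Big data will give us increasingly precise answers. That is a direct consequence of Bernoulli’s work: by the law of large numbers, an estimate based on more observations tends to lie closer to the quantity the data actually measure. However, a precise answer is not necessarily a correct one, so we have to be aware of issues such as ‘selection bias’, ‘regression to the mean’, and ‘over-interpretation of associations’. Particularly in our field, it’s important to examine very carefully the underlying science that explains the data.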
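The difference between precision and correctness can be made concrete with a small simulation. The sketch below is purely illustrative: the true rate of 0.3, the names `TRUE_RATE`, `estimate`, `summarize`, and the `keep_failure_prob` parameter (the chance that a "failure" makes it into the data set, a crude stand-in for selection bias) are all assumptions made up for this example, not anything taken from a real study.

```python
import random
import statistics

TRUE_RATE = 0.3  # hypothetical true proportion in the population


def estimate(sample_size, rng, keep_failure_prob=1.0):
    """Estimate the rate from `sample_size` recorded observations.

    Successes are always recorded. Failures are recorded only with
    probability `keep_failure_prob`; values below 1 mimic selection bias.
    """
    hits = 0
    recorded = 0
    while recorded < sample_size:
        success = rng.random() < TRUE_RATE
        if success or rng.random() < keep_failure_prob:
            recorded += 1
            hits += success
    return hits / sample_size


def summarize(sample_size, keep_failure_prob=1.0, repeats=100, seed=0):
    """Return (mean, spread) of the estimate over repeated samples."""
    rng = random.Random(seed)
    estimates = [estimate(sample_size, rng, keep_failure_prob)
                 for _ in range(repeats)]
    return statistics.fmean(estimates), statistics.pstdev(estimates)


if __name__ == "__main__":
    for n in (100, 1_000, 10_000):
        fair_mean, fair_spread = summarize(n)
        biased_mean, biased_spread = summarize(n, keep_failure_prob=0.5)
        print(f"n={n:>6}  fair: {fair_mean:.3f} +/- {fair_spread:.4f}  "
              f"biased: {biased_mean:.3f} +/- {biased_spread:.4f}")
```

Under these assumptions the spread of both estimates shrinks as n grows, which is Bernoulli's result at work. The biased estimate, however, settles just as tightly around the wrong value (about 0.46 here, since 0.3 / (0.3 + 0.7 × 0.5) ≈ 0.46). More data makes the answer more precise, not more correct.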
As data sets continue to get bigger, the problem of potential false findings grows rapidly, because the number of associations that can be tested grows much faster than the number of variables collected. To protect ourselves from that, we have to use statistical methods to quantify the uncertainty associated with our results. A careful and considered approach to applying statistical methods is the best way out of this dilemma. Every variable in a study must be examined for completeness and for consistency in how it is coded, and the assumptions of each statistical routine must be validated.
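To see why false findings pile up, consider testing many variables that are pure noise. The sketch below is an illustration under stated assumptions, not a recipe: each hypothetical variable is 50 draws from a standard normal distribution, each is tested with a simple two-sided z-test at the 5% level, and the Bonferroni correction stands in for "a statistical method that accounts for the number of tests". The function names `null_p_value` and `false_findings` are invented for this example.

```python
import random
from statistics import NormalDist, fmean


def null_p_value(rng, n=50):
    """Two-sided z-test p-value for a pure-noise variable (true mean 0, sd 1)."""
    sample = [rng.gauss(0, 1) for _ in range(n)]
    z = fmean(sample) * n ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_findings(num_variables, alpha=0.05, seed=1):
    """Count 'significant' results when no variable has any real effect.

    Returns (naive_count, bonferroni_count).
    """
    rng = random.Random(seed)
    p_values = [null_p_value(rng) for _ in range(num_variables)]
    naive = sum(p < alpha for p in p_values)
    # Bonferroni: divide the significance level by the number of tests.
    bonferroni = sum(p < alpha / num_variables for p in p_values)
    return naive, bonferroni


if __name__ == "__main__":
    for m in (20, 200, 2_000):
        naive, corrected = false_findings(m)
        # Chance of at least one false finding among m independent tests.
        at_least_one = 1 - (1 - 0.05) ** m
        print(f"{m:>5} noise variables: {naive} naive false findings, "
              f"{corrected} after Bonferroni, "
              f"P(at least one, uncorrected) = {at_least_one:.3f}")
```

Every "finding" in this simulation is false by construction. Without correction, roughly 5% of the tests come out significant, so the count of spurious results keeps growing with the number of variables examined. Bonferroni is only one, rather conservative, way of keeping this in check; the broader point is that the uncertainty has to be quantified with the number of comparisons in mind.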