False Positives in Big Data Analytics

· Andreas Scherer · Big Picture

We had a lot to celebrate recently. Last year was the 300th anniversary of Jacob Bernoulli’s Ars Conjectandi. In this book he consolidated central ideas in probability theory, such as the very first version of the law of large numbers. It was also the 250th anniversary of Bayes’ theorem, named after Thomas Bayes (1701–1761), who first suggested using the theorem to update beliefs.

Fast forward.

Thomas Bayes

The enthusiasm around Big Data hinges on the use of statistics to provide relevant and meaningful analysis of ever-larger data sets. Statistical science has produced excellent machine learning tools and methods that go beyond just classification and ranking of data sets. Today, we try to explicitly quantify the uncertainty of what can be concluded from a data set, be it a prediction or a scientific inference. Our work today rests firmly on the shoulders of Bernoulli and Bayes, who laid these foundations a few hundred years ago.

Big data will give us increasingly precise answers. That is a direct consequence of Bernoulli’s work. However, we have to be aware of issues such as ‘selection bias’, ‘regression to the mean’, and ‘over-interpretation of associations’. Particularly in our field, it is important to examine very carefully the underlying science that explains the data.
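To make the link between data volume and precision concrete, here is a minimal sketch of the law of large numbers using simulated fair coin flips. The function name `sample_mean_error` and the fixed seed are illustrative assumptions, not anything taken from Bernoulli's text or from a particular library.

```python
import random


def sample_mean_error(num_flips: int, seed: int = 0) -> float:
    """Distance between the observed share of heads and the true value 0.5.

    Hypothetical helper for illustration: simulates `num_flips` fair coin
    flips with a seeded generator so the run is reproducible.
    """
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(num_flips))
    return abs(heads / num_flips - 0.5)


if __name__ == "__main__":
    # The error tends to shrink roughly like 1 / sqrt(n) as n grows,
    # although any single run can fluctuate.
    for n in (100, 10_000, 1_000_000):
        print(f"n = {n:>9,}: |share of heads - 0.5| = {sample_mean_error(n):.5f}")
```

More data narrows the estimate, but it does nothing about the biases listed above: a biased sample simply converges, very precisely, on the wrong answer.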

As data sets continue to get bigger, the problem of potential false findings grows exponentially. To protect ourselves from that, we have to use statistical methods to quantify the uncertainty associated with our results. A careful and considered approach to applying statistical methods is the best way out of this dilemma. Every variable in a study must be examined for completeness and consistency in how it is coded, and the assumptions of each statistical routine must be validated.
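A small simulation can illustrate why false findings pile up as more variables are tested. The sketch below assumes independent tests in which every null hypothesis is true, so each p-value is uniform on [0, 1]. The function names are hypothetical, and the Bonferroni correction is shown only as one standard safeguard, not as the method this article prescribes.

```python
import random


def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent tests
    when every null hypothesis is true: 1 - (1 - alpha) ** num_tests."""
    return 1.0 - (1.0 - alpha) ** num_tests


def simulate_false_positives(num_tests: int, alpha: float = 0.05, seed: int = 0):
    """Count 'significant' results when there is nothing to find.

    Returns (naive_hits, bonferroni_hits). Illustrative only: p-values are
    drawn uniformly, which is what a true null hypothesis produces.
    """
    rng = random.Random(seed)
    p_values = [rng.random() for _ in range(num_tests)]
    naive_hits = sum(p < alpha for p in p_values)
    # Bonferroni: compare each p-value against alpha divided by the test count.
    bonferroni_hits = sum(p < alpha / num_tests for p in p_values)
    return naive_hits, bonferroni_hits


if __name__ == "__main__":
    for m in (1, 10, 100, 10_000):
        naive, corrected = simulate_false_positives(m)
        print(
            f"{m:>6} tests: P(at least one false positive) = "
            f"{familywise_error_rate(m):.3f}, "
            f"naive hits = {naive}, Bonferroni hits = {corrected}"
        )
```

With a 5% threshold, roughly one in twenty tests of pure noise comes out "significant", so the count of spurious hits scales with the number of variables examined unless the threshold is adjusted.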


About Andreas Scherer

Dr. Andreas Scherer is CEO of Golden Helix. The company has been delivering industry-leading bioinformatics solutions for the advancement of life science research and translational medicine for over a decade. Its innovative technologies and analytic services empower scientists and healthcare professionals at all levels to derive meaning from the rapidly increasing volumes of genomic data produced from next-generation sequencing. With its solutions, hundreds of the world’s hospitals and testing labs are able to harness the full potential of genomics to identify the cause of disease, develop genomic diagnostics, and advance the quest for personalized medicine. Golden Helix products and services have been cited in thousands of peer-reviewed publications. Golden Helix is also on the Inc. 5000 list of the fastest-growing private companies in the US. Dr. Scherer is also Managing Partner of Salto Partners, Inc., a management consulting firm headquartered in Nevada. He has extensive experience successfully managing growth as well as orchestrating complex turnaround situations. His company, Salto Partners, advises on business strategy, financing, sales, and operations. Its clients operate in the high-tech and life sciences space. Dr. Scherer holds a Ph.D. in computer science from the University of Hagen, Germany, and a Master of Computer Science from the University of Dortmund, Germany. He is author and co-author of over 20 international publications and has written books on project management, the Internet, and artificial intelligence. His latest book, “Be Fast Or Be Gone”, is a prizewinner in the 2012 Eric Hoffer Book Awards competition and has been named a finalist in the 2012 Next Generation Indie Book Awards.
