Data science is having an identity crisis.
Indications of this crisis have been around for years. For instance, the inaugural issue of Harvard Data Science Review found it easier to define what data science is not than what it is (Meng, 2019). This confusion hasn’t cleared up. In fact, a case can be made that it has gotten worse. As Meng noted years ago (2019), most of us have some knowledge about other kinds of scientists. But what is a data scientist and what exactly do they do?
The history of data science is deeply rooted in statistics. As far back as 1962, one of the most influential statisticians of the 20th century, John Tukey, was calling for recognition of a new science focused on learning from data. Subsequent work by the statistics community, particularly Jeff Wu (Donoho, 2015) and William Cleveland (2001), formally proposed the name “data science” and suggested that academic statistics expand its boundaries (Donoho, 2015). Yet the ensuing years have seen a significant influence from computer science, calls for data science to be recognized as a unique discipline distinct from statistics, and a fundamental reckoning with whether data science is a science at all.
The expansion of the probabilistic and inferential traditions of statistics, along with the algorithmic, programming, and system-design concerns of computer science, has led to a modern view of data science as an interdisciplinary field, which Blei and Smyth (2017) affectionately refer to as “the child of statistics and computer science”. Wing and colleagues (2018) see its defining characteristic as this: data science is not just about methods, but also about the use of those methods in the context of a domain. This interplay between domain and methods makes data science not merely the sum of its parts, but a distinct field with its own focus.
Yet there remains the fundamental question of the name itself. Wing’s probing question (2020), “Is there a problem unique to data science that one can convincingly argue would not be addressed or asked by any of its constituent disciplines, e.g., computer science and statistics?” is a crucial litmus test for whether data science should be considered a science. Some questions emerging from data science may feel novel (Wing, 2020); however, even these often reduce to applications of existing disciplines (statistics, computer science, optimization theory) rather than indicating a fundamentally new science.