We Must Recognize That A Lot Of Data Science Comes From Teams Not Individuals
Society loves heroes. Unfortunately, teams can rarely be heroes. Complex stories of discovery and innovation involving teams of hundreds of collaborators all working together are ultimately distilled down to one or two figureheads who become the public face and “heroes” responsible for the team’s success. This has become increasingly true in the world of data science, in which collaborative interdisciplinary teams are the norm, yet the accolades and adulation are typically heaped on single individuals arbitrarily chosen by the press and outside public to represent those teams. What will it take for the public and press to begin seeing data science as the teamwork it typically is?
Science has long been perceived as the work of solitary geniuses toiling away in isolation, working steadily at a problem until a burst of inspiration overcomes the obstacles before them and leads to a breakthrough discovery that changes the very course of humankind. Such is the Disney-like narrative that guides how we see the scientific process, and it has in turn influenced how the press and, increasingly, social media portray scientific and technological innovation.
As data science has spread into nearly every imaginable field, the work of programmers, analysts, visualization specialists and practitioners of myriad other disciplines comes together in modern data science teams tackling society’s grand challenges. Rather than recognize the contributions of all these individuals as a cohesive team in which success owes itself to everyone’s efforts, we still describe data science in terms of the solitary data scientist pecking away at a laptop solving the world’s challenges by themselves.
Media coverage of data science discoveries typically discards this idea of collaboration and replaces it with the one or two individuals who become the “heroes” of an innovation that was in reality the work of hundreds.
While the low-resourced startup, NGO or open source researcher might very well be a data science group of one, even they typically rely upon a vast landscape of datasets and open and commercial algorithms to conduct their analyses. As with larger data science groups, those datasets and tools are rarely acknowledged.
This unfortunately can be especially detrimental to those from underrepresented groups or those whose contributions were focused on infrastructure rather than public-facing areas like visualization. The system administrators who kept the data center humming or the architects who wrote the simulation algorithms upon which the project ran rarely receive the attention of those who crafted the final beautiful scientific visualizations that became the face of the project or who contextualized the findings for publication.
Rarely do data science teams create all of their data, write all of their algorithms and code their entire workflow. Datasets, algorithms, code and tools originate from myriad places, yet few warrant even a footnote in the acknowledgements section of the final paper or press release, let alone an actual mention in the article text itself.
In some cases, this is to maintain a competitive advantage: a team with a particularly successful application of an off-the-shelf algorithm might not wish to tip competitors off regarding the precise tool that was used or its configuration.
In most cases, however, it is simply that researchers in the data disciplines increasingly come from curricula that don’t emphasize the notion of credit and acknowledgement.
While young historians are relentlessly drilled on the critical importance of citing every fact they reference back to its primary source, computer scientists are rarely taught the art of citation.
In fact, the rise of vast repositories of open code like Stack Overflow and GitHub has led new generations of coders to precisely the opposite behavior: copying and pasting snippets of code without worrying about where they came from or who wrote them, and certainly without bothering to grant any acknowledgement to the original authors.
Not only does such action deprive the authors of those datasets, algorithms and software of credit, this lack of acknowledgement also greatly complicates the replication process.
Putting this all together, scientific advances have long had leaders and liaisons acting as their public faces, but we must do better at recognizing all of the people in their shadows and the incredible array of collaborative contributions that power today’s data-driven discoveries. In the end, by helping the press and public to better understand how data science works and the collaboration that typically underlies it, we can help better tell the story of innovation and help everyone’s contributions be recognized.