The possibilities and limits of using data to predict scientific discoveries

Published: Feb. 3, 2017 • By Trent Knoss

Amidst the vast and varied ecosystem of modern science, the emerging interdisciplinary field known as the “science of science” is exploring a difficult, but provocative, question: In the age of data science, are future discoveries now predictable?

, CU Boulder researcher Aaron Clauset and co-authors Daniel Larremore () and Roberta Sinatra (Central European University) examine the possibilities and limits of using massive data sets of scientific papers and information on scientific careers to study the social processes that underlie discoveries.

“There is more interest than ever in quantifying scientific behavior,” said Aaron Clauset, an assistant professor in CU Boulder’s Department of Computer Science and a faculty member in the . “The question is: Can we use the abundant data on the scientific process in order to make better predictions about scientific discoveries, which could improve funding decisions, peer review and hiring decisions?”

Historically, scientific discoveries have fallen on a spectrum between highly expected (such as the Higgs boson, which evidence pointed to years in advance) and entirely unexpected (such as penicillin, which arrived with minimal preceding research). Predicting such advances has value to scientists (when choosing a research field), funding agencies (who want to allocate dollars effectively), hiring committees (who want to hire successful faculty) and taxpayers (who fund a large percentage of research projects).

The recent proliferation of bibliographic databases such as Google Scholar, Web of Science, PubMed, ORCID and others has given researchers new tools by which to examine various aspects of the scientific community as a whole, such as the number of citations a given article receives or how many journal articles a given researcher publishes. But, do such metrics make some kinds of discoveries easier to predict?

Feedback loops

One problem with using such data to make predictions is the likelihood that the scientific community and the various incentives for scientists may currently be structured in a way that creates self-reinforcing feedback loops in which future opportunity depends on being lucky, undermining the potential for other less-heralded projects to advance science.

“We tend to reward and reinvest in people and subjects that have paid off in the past, but there’s no guarantee they will continue to do so. This can create a kind of purifying selection,” said Clauset, who is also an external faculty member at the Santa Fe Institute. “Ecology teaches us that the most robust systems in the face of uncertainty are diverse systems. We may be killing the golden goose of scientific discovery very slowly by focusing on minutiae at the expense of variety.”

Clauset’s data also questions the conventional academic narrative that scientists achieve an early productivity peak followed by a long and slow decline. In , he and his co-authors analyzed over 200,000 publications from 2,453 tenure-track faculty in all 205 PhD-granting computer science departments in the U.S. and Canada. They found the conventional pattern accurately described only one-third of faculty while the remaining two-thirds exhibited a wide variety of productivity patterns over the course of their careers.

Sleeping beauties

Another insight into the unpredictability of scientific advances comes from so-called “sleeping beauties.” While bibliographic data illuminate that some aspects of scientific impact are predictable, the broad existence of “sleeping beauty” papers, which lay dormant for years before a sudden uptick in relevance, implies that some aspects of discovery may be fundamentally unpredictable. A notable example is a now-famous 1935 Albert Einstein paper on quantum mechanics that was only modestly cited for several decades before fairly recently becoming one of the most important papers in quantum mechanics.

“This suggests that there’s another scale to consider, one in which we need to zoom out even farther to understand how these various scientific fields and subfields are interacting with one another,” said Clauset.

The article also states that while publication data is useful in some ways, citations are fundamentally lagging indicators, which only look backward at the past, and thus may have limited utility for predicting the future.

Looking forward, Clauset and his co-authors suggest that better predictions could be made using data sets on scientific preprints, workshop papers, conference presentations and rejected grant proposals. Such databases, should they ever become available, might provide additional trends and insights that are not being captured currently by better illustrating how the frontier of scientific discovery is moving.

Overall, the authors state, the limits of data in predicting future advances point to the importance of maintaining a wide-ranging scientific community.

“We would be wise to hedge our bets by building a diverse ecosystem of scientists and approaches to science rather than focus on predicting individual discoveries,” said Clauset.

A workbench in a chemistry laboratory. Photo: Jean-Pierre / Wikipedia
