Want to create an image of velociraptors working on a skyscraper, in the style of the 1932 photograph “Lunch Atop a Skyscraper”? Use DALL-E. Want an imaginary standup comedy set by Peter Thiel, Elon Musk, and Larry Page? Use GPT-3. Want to deeply understand COVID-19 research and have your questions answered based on evidence? Learn how to do a Boolean search, read scientific papers, and maybe get a PhD, because there are no generative AI models trained on the vast body of scientific research publications. If there were, getting evidence-backed, plain-language answers to scientific questions would be among the simplest benefits. Generative AI for science could help reverse the deceleration of innovation in science by making it easier and cheaper to find new ideas. Such models could also provide data-backed warnings about therapeutic hypotheses that are bound to fail, counterbalancing human bias and averting billion-dollar, decades-long blind alleys. Finally, such models could combat the reproducibility crisis by mapping, weighing, and contextualizing research results, providing a trustworthiness score.
So why don’t we have a DALL-E or GPT-3 for science? The reason is that although scientific research is the world’s most valuable content, it is also the world’s least accessible and understandable content. I’ll explain what it would take to unlock scientific data at scale to make generative AI for science possible, and how it would transform the way we engage with research.
What makes scientific research data challenging
Research publications are some of the most important repositories of content and information ever created. They tie ideas and findings together across time and disciplines, and they are preserved forever by a network of libraries. They are backed by evidence, analysis, expert insight, and statistical relationships. They are extremely valuable, yet they are largely hidden from the web and used very inefficiently. The web is rife with cute, cuddly cat videos but largely devoid of cutting-edge cancer research. Take the Web of Science, one of the most comprehensive indexes of scientific knowledge: it has been around for decades, yet most readers have probably never heard of it, let alone interacted with it. Most of us don’t have access to research papers, and even when we do, they’re dense, hard to understand, and packaged as PDFs, a format designed for printing, not for the web.
Because scientific papers are not easily accessible, we can’t easily use the data to train generative models like GPT-3 or DALL-E. Can you imagine if a researcher could propose an experiment and an AI model could instantly tell them whether it had been done before (and, better yet, give them the result)? Then, once they had data from a novel experiment, the AI could suggest a follow-up experiment based on the result. Finally, imagine the time that could be saved if researchers could upload their results and the AI model could draft the resulting manuscript for them. The closest we’ve ever come to a DALL-E of science is Google Scholar, but it’s not a sustainable or scalable solution. IBM Watson also set out to achieve much of what I describe here, but most of that work preceded recent advances in large language models and didn’t utilize appropriate or sufficient data to match the marketing hype.
For the kind of value unlock I’m describing, we need long-term investment, commitment, and vision. As proposed recently in Future, we need to treat scientific publications as substrates to be combined and analyzed at scale. Once we remove the barriers, we will be able to use science to feed data-hungry generative AI models. These models have immense potential to accelerate science and increase scientific literacy: they could be trained to generate new scientific ideas, help scientists manage and navigate the vast scientific literature, identify flawed or even falsified research, and synthesize and translate complex research findings into plain language.
How do we get a DALL-E or GPT-3 for science?
If you’re in tech, showing a friend outputs from generative AI models like DALL-E or GPT-3 is like showing them magic. These tools represent the next generation of the web. They derive from the synthesis of massive amounts of information, beyond a simple linkage, to create tools with generative capacity. So how can we create a similarly magical experience in science, where anyone can ask a question of the scientific literature in plain language and get an understandable answer backed by evidence? How can we help researchers create, develop, refine, and test their hypotheses? How can we potentially avoid wasting billions of dollars on failing hypotheses in Alzheimer’s research and erroneous connections between genetics and depression?
The answers to these questions might sound like science fiction, but there is proof that we can do amazing, seemingly unthinkable things when scientific work is treated as more than the sum of its parts. Indeed, the nearly 200,000 protein structures in the Protein Data Bank gave AlphaFold the ability to accurately predict protein structures, a feat recently extended to every protein ever documented (over 200 million!). Leveraging research papers in a manner similar to protein structures would be a natural next step.
Decompose papers into their minimal components
Research papers are full of valuable information, including figures, charts, statistical relationships, and references to other papers. Breaking papers down into these components and using them at scale could help us train machines for different kinds of science-related tasks, prompts, and queries. Simple questions might be answerable with training on a single component type, but more complex questions or prompts would require incorporating multiple component types and an understanding of how they relate to each other.
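One way to picture this decomposition is as a structured record per paper. The sketch below is purely illustrative — the field names and schema are hypothetical, not a published standard — but it shows how figures, statistical claims, and references could become machine-usable components rather than text trapped in a PDF:

```python
from dataclasses import dataclass, field

# Hypothetical schema for a decomposed research paper.
# Field names are illustrative, not an established standard.

@dataclass
class StatClaim:
    entities: tuple   # e.g. ("gene X", "disease Y")
    relation: str     # e.g. "increases risk of"
    p_value: float    # reported significance level

@dataclass
class PaperComponents:
    title: str
    abstract: str
    figures: list = field(default_factory=list)     # figure captions
    claims: list = field(default_factory=list)      # StatClaim records
    references: list = field(default_factory=list)  # cited paper IDs

# A single paper, decomposed into components a model could train on
paper = PaperComponents(
    title="Example study",
    abstract="We examine a hypothetical gene-disease link.",
    claims=[StatClaim(("gene X", "disease Y"), "increases risk of", 0.01)],
    references=["doi:10.0000/example"],
)
```

A model trained only on abstracts could answer simple lookup questions; a model trained on records like these, with the links between claims and references preserved, could begin to answer the complex prompts below.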
Some examples of complex potential prompts are:
“Tell me why this hypothesis is wrong”
“Tell me why my treatment idea won’t work”
“Generate a new treatment idea”
“What evidence is there to support social policy X?”
“Who has published the most reliable research in this field?”
“Write me a scientific paper based on my data”
Some groups are making headway on this vision. For example, Elicit applies GPT-3 to millions of paper titles and abstracts to help answer researchers’ questions, kind of like Alexa, but for science. System extracts statistical relationships between entities, showing how different concepts are linked. Primer doesn’t focus on research papers per se, but it does work with arXiv and provides a dashboard of information that corporations and governments use to synthesize and understand large amounts of data from many sources.
Access all the components
Unfortunately, these groups rely primarily on titles and abstracts, not full texts, since roughly five out of six articles are not freely or easily accessible. For groups like Web of Science and Google that do have the data or the papers, the licenses and scope of use are limited or undefined. In Google’s case, it is unclear why there have been no publicly announced efforts to train AI models on the full-text scientific research in Google Scholar. Amazingly, this didn’t change even in the midst of the COVID-19 pandemic, which brought the world to a standstill. The Google AI team stepped up, prototyping a way for the public to ask questions about COVID-19. But, and here’s the kicker, they did so using only open access papers from PubMed, not Google Scholar.
Groups have advocated for decades for access to papers, and for the ability to use them for more than just reading them one at a time. I have personally worked on this for nearly a decade myself, launching an open access publishing platform called The Winnower during the last year of my PhD, and then working to build the article of the future at another startup called Authorea. While neither of those initiatives fully panned out the way I wanted, they led me to my current work at scite, which has, at least partially, solved the access issue by working directly with publishers.
Connect the components and define relationships
Our aim at scite is to introduce the next generation of citations, called Smart Citations, which show how and why any article, researcher, journal, or topic has been cited and more generally discussed in the literature. By working with publishers, we extract sentences directly from full-text articles where references are cited in-text. These sentences offer qualitative insight into how papers were cited by newer work. It’s a bit like Rotten Tomatoes for research.
This requires access to full-text articles and cooperation with publishers so that we can use machine learning to extract and analyze citation statements at scale. Because there were enough open access articles to get started, we were able to build a proof of concept, and one by one we demonstrated to publishers the increased discoverability of articles indexed in our system, providing them with better metrics for more responsible research assessment. What we saw as expert statements, they saw as previews of their articles. Publishers have now signed on en masse, and we have indexed over 1.1 billion Smart Citations from more than half of all articles published.
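To make the extraction step concrete, here is a toy sketch of pulling citation statements out of full text. It is deliberately naive — a single regex for bracketed numeric references and a crude sentence splitter — whereas a production pipeline (including scite’s) must handle many citation styles, PDF parsing, and ML-based classification of how each citation is used:

```python
import re

def extract_citation_statements(full_text, marker_pattern=r"\[(\d+)\]"):
    """Find sentences that cite a reference in-text and record which
    reference each sentence points to. Toy sketch: assumes bracketed
    numeric citations like [1] and clean sentence punctuation."""
    # Naive sentence split on ., !, or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", full_text)
    statements = []
    for sent in sentences:
        for ref in re.findall(marker_pattern, sent):
            statements.append({"reference": int(ref), "statement": sent})
    return statements

text = ("Prior work reported a strong effect [1]. "
        "We failed to replicate this finding [1]. "
        "Our method builds on transformers [2].")
statements = extract_citation_statements(text)
# -> three citation statements: two citing reference 1, one citing reference 2
```

Even this crude version illustrates the key point: once statements are tied to the references they discuss, a second model can classify each one (supporting, contrasting, or merely mentioning), which is what turns raw citation counts into qualitative signal.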
Use relational data to train AI models
The components and relationships extracted from papers could be used to train new large language models for research. GPT-3, while very powerful, was not built for science and does poorly at answering questions you might see on the SAT. When GPT-2 (a predecessor of GPT-3) was fine-tuned on millions of research papers, it outperformed plain GPT-2 on specific knowledge tasks. This highlights that the data used to train a model is exceedingly important.
Some groups have recently used GPT-3 to write academic papers, and while this is impressive, the facts or arguments such papers purport to show could be very wrong. If the model can’t get simple SAT-style questions right, can we trust it to write a full paper? SCIgen, which predates GPT-3 by nearly 20 years, showed that generating papers that look real is relatively easy. The SCIgen system, while much simpler, generated papers that were accepted to various conferences. We need a model that doesn’t just look scientific but is scientific, and that requires a system to verify claims for machines and humans. Meta recently introduced a system for verifying Wikipedia citations, something some publishers have vocally wished they had for scholarly publications.
Again, one key blocker to bringing this system to fruition is the lack of access to papers and the resources to create it. Where papers or information do become available at scale, we see tools and new models flourish. The Google Patents team used 100 million patents to train a system to help with patent analysis, effectively a GooglePatentBERT. Others have introduced models like BioBERT and SciBERT, and despite being trained on only about 1% of scientific texts, in only specific subject domains, they are impressive at scholarly tasks, including in our citation classification system at scite.
More recently, the ScholarBERT model was released, which effectively does use all of the scientific literature to train BERT. Its creators overcame the access issue but are notably mum on how, simply describing their use as “non-consumptive.” This approach might open the door for others to use articles without express permission from publishers, and it could be an important step toward a DALL-E of science. Surprisingly, however, ScholarBERT did worse at various specialized knowledge tasks than smaller science language models like SciBERT.
Importantly, BERT-style models are much smaller in scale than large language models like GPT-3, and they don’t allow the kind of generic prompting and in-context learning that has powered much of the GPT-3 hype. The question remains: what if we used the same data that trained ScholarBERT to train a scaled-up generative model like GPT-3? And what if we could show where the machine’s answers were sourced, perhaps tying them directly to the literature (as Smart Citations do)?
Fortunately, papers are becoming more open and machines are becoming more powerful. We can now begin using the data contained within papers and connected repositories to train machines to answer questions and synthesize new ideas based on research. That could be transformative for healthcare, policy, technology, and everything around us. Imagine how research and workflows across all disciplines would change if we searched not just for document titles but for answers.
Liberating the world’s scientific knowledge from the twin barriers of accessibility and understandability would help drive the transition from a web focused on clicks, views, likes, and attention to one focused on evidence, data, and veracity. Pharma is clearly incentivized to bring this to fruition, hence the growing number of startups identifying potential drug targets with AI. But I believe the public, governments, and anyone who uses Google might be willing to forgo free search in exchange for trust and time savings. The world desperately needs such a system, and it needs it fast.