Daphne Koller is the founder and CEO of insitro. Prior to insitro, she served as Chief Computing Officer at Calico. She was also a Professor in the Computer Science department at Stanford University before making a mid-career transition to co-found Coursera.  

In this interview, she shares some of the challenges with creating accurate machine learning models from biomedical data and how she’s tackling them. We also examine the success of AlphaFold, why it happened now, and whether we can expect to see similar leapfrogs in other areas of biomedical ML. But first, we begin by understanding what pulled her away from a successful career in academia and get her insights on how to thrive after the PI-to-industry transition.


FUTURE: You were a Professor at Stanford focusing on AI research and then you transitioned into industry when you founded Coursera in 2012. What induced you to make that move?

DAPHNE KOLLER: I had been experiencing an increasing sense of urgency to make a difference in the world more directly, rather than by proxy via students or papers. I tried to do that with some of my work, for example on cancer histopathology, and found that it didn’t really translate into impact from within an academic environment.

Then the work that I had initiated at Stanford on technology-assisted education blossomed into the launch of those first massive open online courses. We saw the impact these were having, not only in the number of people participating, 100,000-plus per course, but also in seeing participants from every country, every age group, and every walk of life. I felt like I needed to see it through and not just assume that someone else would carry the baton forward.

Some of the least successful transitions I’ve seen from academia into industry are people who keep the PI mindset.

So I took what was supposed to be a two-year leave of absence. But then I really enjoyed it, and I didn’t feel the company was in a position where I could leave it. So Stanford forced me to make a choice, and I did, and haven’t really looked back.

For someone who might be going through this kind of transition right now, is there anything that you wish you’d known at the time?

I think it’s important for people to understand just how different life is in industry compared to academia. One difference is structural: there are certain things one does at a company that are a lot less free-form than in academia. But maybe more fundamental than that are two other things.

One is that being in a company is really a team sport. It’s not about what you do and what your accomplishments are. It’s about what the company as a whole, especially in a startup, is able to accomplish by working as a team. One needs to be willing to put one’s ego aside. Some of the least successful transitions I’ve seen from academia into industry are people who keep the principal investigator (PI) mindset.

At insitro, we have a fundamental core value, which is that we engage with each other openly, constructively, and with respect. And all of those words matter.

The other big difference is building for durability rather than for the quick win. In academia, you do work and then it gets published as a paper. You get a lot of visibility for that work. But other than the manuscript, there is often no durable artifact that persists. The code you write, if you’re lucky, gets deposited on GitHub, but it’s rarely intended for reuse by anybody else, including the person who wrote it. The data set may be deposited in some repository, but you don’t really think about it as a durable artifact that you expect others to build on.

In industry, there’s no point to that. There’s no such thing as a quick win. Sure, you can do proofs of concept and the like. But ultimately, what you’re building has to enable other people to contribute on top of it. So you need to be thinking about how to build something robust enough to withstand the test of time and allow other people to make use of it.

You’ve been really successful at building a culture that fosters cross-functional collaboration, bridging tech and biology. What’s your philosophy on how to build that?

You have to be very deliberate about culture no matter what. Culture is one of those things that, if you let it evolve organically, will often degenerate toward the worst, especially as you grow and bring in new people who haven’t fully understood what you’re doing and will introduce their own color on it. That will often end up either diffusing the culture or even dragging it in the wrong direction.

So, you have to be really deliberate in instilling the culture, in hiring towards it, in rewarding it, in performance assessments and other ways, in highlighting examples of that and in instilling organizational structures that make it easier to do the right thing than the wrong thing. 

What would it take to have a similar leapfrog in other areas? I would say two things. One is substantial amounts of high-quality data. . . The other is having a really well-defined question and a way to assess whether you’re solving the problem.

At insitro, we have a fundamental core value, which is that we engage with each other openly, constructively, and with respect. And all of those words matter.

The ‘engagement’ means that we actually talk to each other, rather than sitting siloed in our little teams. ‘Openly’ means that we have to be open to both expressing our ignorance and asking naive questions. It also means that when someone from a discipline outside our own makes a naive suggestion, we don’t dismiss it. Maybe it’s a good idea. Oftentimes it is.

‘Constructively’ means that all of these discussions need to be done with the view of making the outcome better rather than being the smartest person in the room. And ‘respect’ means you have to have deep respect for the expertise and the value that each person provides to the endeavor, irrespective of their role, their background, or their level. 

So, I think it’s hiring the right people: life scientists who want to understand how to do data science on the data they produce, or machine learning data scientists who want to work really closely with the life scientists to make sure that the machine learning they’re doing is not some kind of abstraction but has real value to patients. And some people who speak both languages, because they are really critical in serving as translators.

Maybe you can serve as that kind of translator for me for a moment. Arguably the biggest scientific advance of the past year was the development of really accurate AI-powered protein folding predictions. This was a huge benchmark for establishing the potential to use AI to tackle hard biological questions. What do you think changed, that machine learning methods are now beginning to make traction in biology and biopharma?

If I had to put a finger on the single biggest contributor to the success of AlphaFold, it’s data availability. 

It is certainly the case that the machine learning methods that were employed were very thoughtful and very sophisticated. That’s an area where the field as a whole has made tremendous progress on multiple different types of problems in natural language and speech and images. 

AlphaFold draws on a number of those advances as well as on the many years of insight and thought that went into the more traditional algorithms for protein folding. They use a lot of the same tricks, but they didn’t incorporate those in some hand-coded way into an algorithm, as had typically been done. Instead, they were used as a substrate in designing a machine learning model that incorporated those insights but basically learned the model specifics from the data. But that’s where it comes down to the data.

What would it take to have a similar leapfrog in other areas? I would say two things. One is substantial amounts of high-quality data. In this case, it is the sequences and the structures into which they fold, which is the result of a huge community effort to crystallize protein structures, measure them, and deposit them in a public way.

The other is having a really well-defined question and a way to assess whether you’re solving the problem. That’s what allows the machine to optimize. And when you think about some of the other critical problems that we have in biology and in drug discovery, neither of these is true. Let’s take the example of predicting which small molecule is going to modulate a protein, which is a next step beyond the protein folding problem. How much data do we have in the public domain? Not very much, and it’s largely of poor quality. In many cases, it’s very inconsistently measured. There are not a lot of gold-standard data sets that one could use to assess progress. And that is a problem where there is at least a well-defined question.

We design an experiment specifically to feed a machine learning model. When you do that, it turns out that the experimental design is actually quite different from the experimental design you do when you’re trying to do scientific discovery.

If you think about an even higher-level question, such as: does modulating this gene have clinical benefit for patients? There’s no clear database where it is recorded that modulating this target helped this patient population. And there’s no well-defined ground truth.

So, how do you design a machine learning model and how do you assess that it’s doing better as you continue to optimize the model architecture? That’s really at the heart of this: a lack of data and the lack of a well-defined problem where you could really assess progress.

Are there any specific disease areas that are amenable to machine learning today? Are there specific problems that seem sufficiently well defined, or diseases where we have the data to do that?

So, let me clarify. When I said these are hard, I didn’t say they were impossibly hard or should not be tackled. It is really important to answer whether intervening in a particular gene is actually going to modulate a disease. And it creates an interesting question for anyone tackling it: how do I create a proxy data set that allows us to answer the question?

At insitro, we address this by looking at two complementary forms of data. One is human genetics, where nature has intervened in a gene and then we can see what clinical impact that had. The other is in a human cell-based system where we can actually intervene in a gene and see what happens.

The question is, how do you take those two forms of data, neither of which provides exactly the information you would like to have, and put them together to provide input to the right machine learning model? And how do you define the problem the machine learning algorithm is trying to solve? But that is one way to address the problem of creating a proxy data set.

The other way to get around this problem of using a proxy data set is that biology, chemistry, and the life sciences have provided us in the last two years with a number of methods that enable the creation of biological and chemical data at scale. 

There also needs to be more standardization on how experiments are done, and more sharing of core methods and protocols.

What we’re doing at insitro, and I think others are starting to do as well, is to create data with the specific purpose not of scientific hypothesis discovery or validation, but of feeding machine learning methods.

That is, we design an experiment specifically to feed a machine learning model. When you do that, it turns out that the experimental design is actually quite different from the experimental design that you do when you’re trying to do scientific discovery.

Getting at the issue of needing large amounts of good data: it’s pretty widely acknowledged that the large human genetic datasets that are widely used aren’t representative of the genetic diversity of the general population. So, how is it possible to avoid bias in building an AI platform from biased data? How do you ensure that you discover and develop drugs that work for the general population, knowing that limitation in the data?

Yeah, so I think that’s a really important question, and it’s useful to tease it apart. If what we were trying to do was build a predictive model going from genetics to phenotype, then obviously a model trained on Caucasians won’t give good predictions for African Americans.

But if what you’re doing is uncovering core biological processes that drive disease, then ultimately we’re all human beings, and it’s the same set of biologies that are typically dysfunctional in disease, regardless of who has the disease. Now, the proportion of biological mechanism X versus biological mechanism Y causing the disease could be different, because each of us has a predisposition to a particular set of mutations based on our own genetics and the background from which we come. But if someone else has the same mutations as me, those mutations still cause disease in the same way. It’s just that they might occur at a lower frequency in their population than they do in mine. So it certainly biases the set of discoveries that you can make, but it doesn’t usually shift the validity of those discoveries.

In many cases, it’s actually better to have a single standard even if it’s imperfect than to have a million standards that are inconsistent with each other.

Now, I absolutely think that we will want to expand our genetic diversity as we try to interrogate diseases that are considerably more common in certain populations, or if we want to uncover new mechanisms for diseases that are more common in one population than another. That way, we can make sure we find enough examples of those mechanisms so that we have drugs that also hit them.

Coming back to an earlier point you made about academics not being incentivized to build for durability: a lot of the open-source software in the bioscience community tends to come out of academic labs, but then, for the reasons you mentioned, it’s not super well maintained. Interestingly, at insitro you open-sourced your data science tool redun last year. Given that companies build for durability, as you said, could companies releasing their code and workflows as open source be a potential solution to this lack of durability in bioscience software tools?

I think it’s certainly a big part of it. We were very proud to release redun, because it’s such a broadly useful tool that helps address what is, to my mind, one of the biggest gaps: how to do reproducible science. It effectively keeps track of the version that you used for every single step of your process. So if you need to reproduce a result you generated earlier, you know how you did it. It keeps track of data provenance.
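
To make that concrete, here is a minimal, hypothetical sketch of a provenance-tracked pipeline, assuming redun's publicly documented task decorator and File type. The step names, file paths, and logic are illustrative inventions, not insitro's actual workflows.

```python
# workflow.py -- a minimal sketch, assuming redun's @task decorator and File
# type; the step names and file paths below are hypothetical illustrations.
from redun import task, File

redun_namespace = "example_pipeline"


@task()
def normalize(raw: File) -> File:
    # Hypothetical preprocessing step. redun hashes each task's code and
    # arguments, so rerunning an unchanged step reuses the cached result.
    out_path = "normalized.csv"
    with open(raw.path) as fin, open(out_path, "w") as fout:
        for line in fin:
            fout.write(line.strip().lower() + "\n")
    return File(out_path)


@task()
def summarize(normalized: File) -> File:
    # Hypothetical analysis step; which inputs and which code version
    # produced this output is recorded in redun's call graph.
    out_path = "summary.txt"
    with open(normalized.path) as fin, open(out_path, "w") as fout:
        fout.write(f"rows: {sum(1 for _ in fin)}\n")
    return File(out_path)


@task()
def main(raw_path: str = "assay_results.csv") -> File:
    # Composing tasks defines the workflow graph that the scheduler
    # executes and records for provenance.
    return summarize(normalize(File(raw_path)))
```

Running this with the redun command-line scheduler (redun run workflow.py main) and browsing the recorded history (redun log) is roughly how the version tracking and data provenance described above get surfaced in practice.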

It’s not going to be the full solution, though, because there’s so much variability right now in the data types we generate, how they’re generated, and how to even think about them. There also needs to be more standardization on how experiments are done, and more sharing of core methods and protocols. That too will help with the reproducibility problem.

We also don’t do enough as a community to really establish a consistent set of best practices and standards for things that everybody does. That would be hugely helpful, not only in making science more reproducible but also in creating a data repository that is much more useful for machine learning. 

I mean, if you have a bunch of data collected from separate experiments under separate sets of conditions, and you try to put them together and run machine learning on that, it’s just going to go crazy. It’s going to overfit on things that have nothing to do with the underlying biology, because those are going to be much stronger and more predictive signals than the biology you’re trying to interrogate. So I think we need to be doing better as a community to enable reproducible science.
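
As a toy illustration of that failure mode, here is a purely synthetic sketch in Python (my own construction, not insitro's data or methods): the experimental batch is perfectly confounded with the disease label, so a standard classifier learns the batch signature and falls apart on a new batch where only the biology differs.

```python
# Toy, synthetic demonstration of batch confounding: the "technical" features
# carry a strong batch-specific offset, while the "biology" feature carries a
# weak true signal. All values and sizes here are arbitrary illustrations.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)


def make_samples(n, label, batch_shift):
    # One weak biology feature tied to the label, plus ten features with a
    # strong batch-specific offset unrelated to the label.
    biology = label + 0.5 * rng.normal(size=(n, 1))
    technical = batch_shift + rng.normal(size=(n, 10))
    return np.hstack([biology, technical])


# Training data: all disease samples were run in batch A, all controls in
# batch B, so the technical offset is perfectly confounded with the label.
X_train = np.vstack([make_samples(200, 1, +3.0), make_samples(200, 0, -3.0)])
y_train = np.hstack([np.ones(200), np.zeros(200)])

# Evaluation data: a new batch with no offset, where only the biology differs.
X_new = np.vstack([make_samples(200, 1, 0.0), make_samples(200, 0, 0.0)])
y_new = np.hstack([np.ones(200), np.zeros(200)])

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("pooled-training accuracy:", clf.score(X_train, y_train))  # near perfect
print("new-batch accuracy:", clf.score(X_new, y_new))            # far lower
</code>
```

If batches were instead randomized across labels, or the batch effect standardized away, the model would be forced to rely on the biological signal, which is the point of the experimental standardization discussed here.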

How do you see the solutions for reproducibility, as you described, coming together? 

I think some communities have done a better job than others at creating a set of standards that most people follow. For example, the statistical genetics community has a consensus set of tools used to call variants, measure linkage disequilibrium, and measure association, as well as an agreed-upon definition of genome-wide significance.

Each of these decisions can be second-guessed, and there are improvements that could be made to all of them. But in many cases, it’s actually better to have a single standard, even if it’s imperfect, than to have a million standards that are inconsistent with each other.