Why Applying Machine Learning to Biology is Hard – But Worth It

Jimmy Lin is CSO of Freenome, which is developing blood-based tests for early cancer detection, starting with colon cancer. He is a pioneer in developing computational approaches to extract insights from large-scale genomic data, having spearheaded the computational analyses of the first genome-wide sequencing studies in multiple cancer types.

Lin talked to Future about the challenges of executing on a company mission to marry machine learning approaches and biological data. He explains the three types of people you need to hire to build a balanced techbio company, the traps you should avoid, how to tell when the marriage of two fields is or isn’t working, and the nuances of adapting biological studies and machine learning to each other.

FUTURE: As in many disciplines, there is a lot of excitement around the potential to apply machine learning to bio. But progress has seemed more hard-won. Is there something different about biomolecular data compared to the types of data that are typically used with machine learning?

JIMMY LIN: Traditional machine learning data are very broad and shallow. The types of problems machine learning often solves are ones that humans can solve in a nanosecond, such as image recognition. To teach a computer to recognize the image of a cat you’d have billions upon billions of images to train on, but each image is relatively limited in its data content. Biological data are usually the reverse. We don’t have billions of individuals. We’re lucky to get thousands. But for each individual, we have billions and billions of data points. We have smaller numbers of very deep data.

At the same time, biological questions are less often the problems that humans can solve. We’re doing things that even world experts in the field aren’t able to do. So, the nature of the problems is very different, and it requires new thinking about how we approach this.

Do the approaches need to be built from scratch for biomolecular data, or can you adapt existing methods?

There are ways you can take this deep information and featurize it so that you can take advantage of the existing tools, whether it’s statistical learning or deep learning methods. It’s not a direct copy-paste, but there are a lot of ways that you can transfer many of the machine learning methods and apply them to biological problems even if it’s not a direct one-to-one map.

Digging into the data issue some more, with biological data there’s a lot of variability–there’s biological noise, there’s experimental noise. What’s the best way to approach generating machine-learning-ready biomedical data?

That’s a great question. From the very beginning, Freenome has taken into consideration how to generate the best data suited for machine learning. Throughout the entire process from study design, to sample collection, to running the assays, to data analysis, there needs to be care in every step to be able to optimize for machine learning, especially when you have so many more features than samples. It’s the classical big-p little-n problem.

First and foremost, we have designed our study to minimize confounders. A lot of companies have relied on historical datasets and have done a lot of work to try to minimize cohort effects and remove confounders. But is that really the best way to do it? Well, no, the best way to do it is a prospective study where you control for the confounders upfront. This is why, even in our discovery efforts, we decided to do a large multisite prospective trial that collects gold-standard data upfront, as in our AI-EMERGE trial.

Fortunately we have investors who believed in us enough to allow us to generate these data. That was actually a big risk to take because these studies are very expensive.

Then once you get the data, what do you do with it?

Well, you need to train all the sites in a consistent manner, and control for confounders from all the different sites so the patients look as similar as possible. And then once you run the samples, you need to think through how to minimize batch effects, such as by putting the right mix of samples on different machines at the right proportions.
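To make the idea of "the right mix of samples" concrete, here is a minimal sketch of one way to spread sample groups evenly across instrument runs. It is an illustration only, not Freenome's procedure; the function name `assign_batches`, the case/control labels, and the batch count are all hypothetical.

```python
import random
from collections import defaultdict

def assign_batches(samples, n_batches, seed=0):
    """Spread each label group as evenly as possible across batches.

    samples: list of (sample_id, label) pairs (hypothetical format).
    Returns a list of n_batches lists of (sample_id, label) pairs.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample_id, label in samples:
        by_label[label].append(sample_id)

    batches = [[] for _ in range(n_batches)]
    slot = 0  # running counter so batch sizes stay balanced across labels
    for label in sorted(by_label):
        ids = by_label[label]
        rng.shuffle(ids)  # randomize which sample lands in which batch
        for sample_id in ids:
            batches[slot % n_batches].append((sample_id, label))
            slot += 1
    return batches

# Hypothetical cohort: 12 cases and 24 controls, run on 4 machines.
samples = [(f"case_{i}", "case") for i in range(12)]
samples += [(f"ctrl_{i}", "control") for i in range(24)]
batches = assign_batches(samples, n_batches=4)
```

Because every run then carries the same case-to-control ratio, a machine-specific shift is less likely to be mistaken for a disease signal. Real studies would typically balance on more variables than one label, such as collection site or age.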

This is very difficult when you’re doing multiomics because the machines that analyze one class of biomolecules may take hundreds of samples in one run, whereas the machines that analyze another class of biomolecules may take only a few. On top of that, you want to remove human error. So, we introduced automation pretty much upfront, at the stage of just generating training data.

Also, when you have billions of data points per person it becomes very, very easy to potentially overfit. So we make sure our training is generalizable to the populations that we ultimately want to apply it to, with the right statistical corrections and many successive train and test holdout sets.
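The overfitting risk with many more features than samples can be illustrated with a small simulation. The sketch below is an assumption-laden toy, not the analysis described in the interview: it generates pure-noise binary features and random labels, picks the single feature that best matches the labels on a small training set, and then scores that same feature on held-out samples. All names and sizes are hypothetical.

```python
import random

def accuracy(feature_column, labels, flip):
    """Fraction of samples where the (optionally inverted) binary feature equals the label."""
    hits = sum((value ^ flip) == label for value, label in zip(feature_column, labels))
    return hits / len(labels)

def noise_selection_demo(n_train=20, n_holdout=200, n_features=2000, seed=0):
    """Select the best of many noise features on training data, then score it on a holdout."""
    rng = random.Random(seed)
    n = n_train + n_holdout
    labels = [rng.randint(0, 1) for _ in range(n)]
    # features[j][i] is the value of noise feature j for sample i
    features = [[rng.randint(0, 1) for _ in range(n)] for _ in range(n_features)]
    train_labels, holdout_labels = labels[:n_train], labels[n_train:]

    best = None
    for column in features:
        for flip in (0, 1):  # allow the feature or its inverse as the predictor
            acc = accuracy(column[:n_train], train_labels, flip)
            if best is None or acc > best[0]:
                best = (acc, column, flip)

    train_acc, column, flip = best
    holdout_acc = accuracy(column[n_train:], holdout_labels, flip)
    return train_acc, holdout_acc

train_acc, holdout_acc = noise_selection_demo()
print(f"train accuracy: {train_acc:.2f}, holdout accuracy: {holdout_acc:.2f}")
```

With thousands of candidate features and only twenty training samples, some noise feature will look predictive by chance, while its holdout accuracy stays close to a coin flip. That gap is what a held-out test set is meant to expose.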

Combining machine learning with biomolecular data is something a lot of biotech companies are trying to do, but oftentimes there’s a lot of vagueness about how they’ll do this. What do you view as an essential feature of effectively integrating them?

At Freenome we are melding machine learning and multiomics. In order to do that, you need to do both well. The key here is you need to have strong expertise in both of them, and then be able to speak the language of both. You need to be bilingual.

There are lots of companies that are experts in one and then sprinkle in a layer of the other. For example, there are tech companies that decide they want to jump into bio, but all they do is hire a handful of wet lab scientists. On the other hand, there are biology companies that hire some machine learning scientists, then they’ll declare that they are an AI/ML company now.

What you really need is deep bench strength in both. You need a deep biological understanding of the system, of the different assays, of the features of the knowledge space. But you also need to have a deep understanding of machine learning, data science, computational methods, and statistical learning, and have the platforms to apply that.

That’s really challenging because those two areas are often very siloed. When you’re thinking about the people that you’re hiring for the company, how do you create bridges between these two different domains?

I think there are sort of three types of people you want to hire to bridge between tech and bio. The first two are your standard ones, the domain experts in machine learning or biology. But they also need to be open and willing to learn about the other domain, or even better, have had exposure and experience working in these additional domains.

For machine learning experts, we choose people who are not just there to develop the latest algorithm, but who want to take the latest algorithms and apply them to biological questions.

Biology is messy. Not only do we not have all the methods to measure the different analytes, but we are discovering new biomolecules and features continually. There are also a lot of confounding factors and noise one needs to take into consideration. These problems are generally more complex than the standard machine learning problems, where the problem and knowledge space is much better defined. ML experts wanting to apply their craft in biology need to have humility to learn about the complexity that exists within biology and be willing to work with less than optimal conditions and differences in data availability.

The flip side is hiring biologists who think of their problems in terms of larger-scale quantitative data generation, design studies to optimize signal-to-noise ratios, and are aware of the caveats of confounders and generalizability. It is more than just being able to speak and think in the language of code. Many of our biologists already code and have a good statistical background, and are willing and wanting to grow into these areas. In fact, at Freenome, we actually have training programs for biologists who want to learn more about coding to be able to develop their statistical reasoning.

What is even more important is that study design, and the questions we are able to ask, look different when designed in the context of big data and ML.

What’s the third type?

The third type of person to hire is the hardest one to find. These are the bridgers – people who have worked fluently in both of these areas. There are very few places and labs in the world that are right at this intersection. Getting the people who can translate and bridge both areas is very, very important. But you don’t want to build a company of only bridgers because often these people are not the experts on one area or the other, due to what they do. They are often more general in their understanding. However, they provide the critical work of bringing the two fields together.

So, having all three groups of people is important. If you have only one of the domain expert specialists, you’ll only be strong in one area. Or, if you don’t have the bridge builders, then you have silos of people who won’t be able to talk to each other. Optimally, teams should include each of these three types of people to allow for a deep understanding of both ML and biology as well as providing effective synergy of both these fields.

Do you see differences in how specialists in tech or computation attack problems versus how biologists approach problems?

Yeah. At one extreme, we definitely have people who come from a statistical and quantitative background and they speak in code and equations. We need to help them to take those equations and explain them in a clear way so that a general audience can understand.

Biologists have great imagination because they work with things that are invisible. They use a lot of illustrations in presentations to help visualize what is happening molecularly, and they have great intuition about mechanisms and complexity. A lot of this thinking is more qualitative. This provides a different way of thinking and communicating.

So, how people communicate is going to be very, very different. The key is – we sort of jokingly say – we need to communicate in a way that even your grandma can understand.

It requires true mastery of your knowledge to be able to simplify it so that even a novice can understand. I think it’s actually great training for someone to learn to communicate very hard concepts outside of the normal shortcuts, jargon, and technical language.

What has inspired your particular viewpoint on how to marry machine learning and biology?

So, the problem isn’t new, but rather the latest iteration of an age-old problem. When the fields of computational biology and bioinformatics were first created, the same problem existed. Computer scientists, statisticians, data scientists, or even physicists joined the field of biology and brought their quantitative thinking to the field. At the same time, biologists had to start modeling beyond characterizing genes as up-regulated and down-regulated, and start to approach the data more quantitatively. The digitization of biological data has now just grown exponentially in scale. The problem is more acute and expansive in scope, but the fundamental challenges remain the same.

What do you view as either the success metrics or red flags that tell you whether or not the marriage is working?

If you look at companies that are trying to combine fields, you can very quickly see how much they invest into one side or the other. So, if it’s a company where 90% of the people are lab scientists, and then they just hired one or two machine learning scientists and they’re calling themselves an ML company, then that’s probably more of an afterthought.

Is there one take-home lesson that you have learned in this whole process of marrying biology and machine learning?

I think intellectual humility, especially coming from the tech side. With something like solving for search, for example, all the information is already in a text form that you can easily access, and you know what you’re looking for. So, it becomes a solvable problem, right? The problem with biology is that we don’t even know what datasets we are looking for, or whether we even have the right flashlight to shine on the right areas.

So, sometimes when tech experts jump into bio they fall into a trap of oversimplification. Let’s say, as an example, for next generation sequencing they might say, “Wow. We can sequence DNA. Why don’t we just sequence lots and lots of DNA? It becomes a data problem, and then we solve biology.”

But the problem is that DNA is one of dozens of different analytes in the body. There’s RNA, protein, post-translational modifications, different compartments such as extracellular vesicles, and differences in time, space, cell type, among others. We need to understand the possibilities as well as the limitations of each data modality we use.

While it may be hard to believe, biology is still a field in its infancy. We just sequenced a human genome a little over two decades ago. Most of the time, we can’t access individual biological signals so we are still taking measurements that are a conglomerate or average across a lot of signals. We are just starting to measure one cell at a time. There’s still a lot to do and this is why it’s an exciting time to go into biology.

But with that infancy comes great potential to solve problems that will have huge impacts on human health and wellbeing. It’s a pretty amazing time because we’re opening new frontiers of biology.

What kinds of frontiers? Is there an area of biology or medicine where you are most excited to see computation applied?

Yeah – everything! But let me think. In cancer, I believe that within our generation the new therapies and early detection efforts that are coming out will transform cancer into a chronic disease that’s no longer so scary, like we’ve done for HIV. And we can probably use very similar types of methods to look at disease detection and prevention more generally. The key thing I’m excited about is that we can start detecting whether the disease is already there before symptoms.

Outside of cancer diagnostics, what’s also really cool is the transition to building with biology instead of just reading and writing. I’m excited about the areas of synthetic biology where we’re using biology as a technology, whether it’s CRISPR or synthetic peptides or synthetic nucleotides. Leveraging biology as a tool creates expansive possibilities to completely transform traditional resource generating industries, from agriculture to energy. This is truly an amazing time to be a biologist!

Posted October 5, 2022
