The artificial intelligence field has evolved dramatically since the deep learning revolution kicked off in 2012, and Richard Socher has been around for all of it. He earned his PhD from Stanford working on NLP (natural language processing) before co-founding an AI startup called MetaMind in 2013. He then spent several years leading the AI team at Salesforce (after it acquired MetaMind) before tackling the search space with his new startup, You.com, in 2021.

In this interview, Socher discusses a number of topics, including: how things have changed for AI startups in the last decade; the differences between doing AI in startups, enterprises, and academia; and how new machine learning techniques, such as transformer models, empower companies to build advanced products with a fraction of the resources they would have needed in the past.

FUTURE: It seems like a common move is for AI researchers – students and professors – to move from academia into startups, like you did. What are some key differences between those two worlds today?

RICHARD SOCHER: In academia, people still push forward to try to make progress toward new areas where AI can have an impact, and some of them hope to make progress toward AGI (artificial general intelligence). I think two exciting examples of novel, high-impact areas are in the protein space – sequences of proteins or amino acids – and in economics. The latter is so important for the world, but really hasn’t seen as much impact from AI as I think it should.

At the same time, for startups, if you have a lot of data and you have a process that is mostly dependent on the data that you're already seeing, you basically can just say, "We know how it works." You have a radiology image, and you try to identify, "Is this bone broken or not?" Or you have a head CT scan and you try to identify, "Is there an intracranial hemorrhage or brain bleed?" Or you're classifying different kinds of cancer from pathology images. All of these applications essentially follow a relatively well-established sequence: identifying a problem and collecting data for it; training a large neural network on it; and then optimizing and automating part or all of that process.
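That identify → collect → train → automate sequence can be sketched in a few lines. The sketch below is purely illustrative – a toy nearest-centroid classifier over made-up feature vectors, not a real radiology model:

```python
# Minimal sketch of the supervised pipeline described above:
# collect labeled data, train a model, then automate predictions.
# The "images" here are synthetic feature vectors, purely illustrative.

def train_centroids(samples, labels):
    """Compute one mean feature vector (centroid) per class."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(centroids, x):
    """Assign x to the class with the nearest centroid."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, x))
    return min(centroids, key=lambda y: dist(centroids[y]))

# Toy "radiology" features: [density, edge_sharpness]
data = [([0.9, 0.8], "fracture"), ([0.8, 0.9], "fracture"),
        ([0.1, 0.2], "normal"),   ([0.2, 0.1], "normal")]
model = train_centroids([x for x, _ in data], [y for _, y in data])
print(predict(model, [0.85, 0.85]))  # → fracture
```

A production system would swap the centroid model for a large neural network, but the surrounding process – label, train, predict, automate – is exactly the sequence Socher describes.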

And with that well-proven approach, you can actually have a lot of impact. It’s similar to what we’ve seen with electricity: Once we had the basics of electricity figured out, you could have a lot of impact by just giving it to a town that had only oil lamps and fire before.

This is possible in part because a lot of interesting and important ideas have been developed over the last 10 years. Things that would have been impossible – like having an AI write a reasonably long text – are now possible. One major change is that not just images, but all data, is essentially vectors. Everything is a list of numbers, and that list of numbers can be given as input to a large neural network, on top of which you can train almost anything you want. There are lots of interesting and important algorithmic improvements, too – not to mention more data and more computing power – but that main idea of end-to-end learning was the big one that changed a lot of things.

Vectors in NLP

In natural language processing, word embedding is a term used for the representation of words for text analysis, typically in the form of a real-valued vector that encodes the meaning of the word such that the words that are closer in the vector space are expected to be similar in meaning. ~ Wikipedia
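As a toy illustration of that definition, here are hand-made vectors (real embeddings are learned from data, e.g. by word2vec or GloVe) compared with cosine similarity:

```python
import math

def cosine(u, v):
    """Cosine similarity: near 1.0 means similar direction/meaning."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Hand-made toy embeddings; a trained model would learn these from text.
vec = {
    "beach":   [0.9, 0.1, 0.8],
    "desert":  [0.8, 0.2, 0.7],
    "invoice": [0.1, 0.9, 0.0],
}
print(cosine(vec["beach"], vec["desert"]))   # high: related meanings
print(cosine(vec["beach"], vec["invoice"]))  # low: unrelated
```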

What about the transition from startups to large enterprises? It seems like a double-edged sword, with probably more budget but also more restrictions …

The two are different along so many dimensions. I’ll just mention two examples.

AI-tooling startups are successful in the B2B space if they find one part of the process that every other company might have to spend one or two developers on, and they build a product around that process that costs, say, a quarter of a developer. So, a lot of startups now in the AI tooling space are taking the less pleasant, less fun bits, and helping developers do those things.

The best way to do this is probably developing an experience where the companies using the product can still feel like they are building and controlling the AI, but really they found a partner to label their data. They also found partners to audit the data for bias; collect the data in the first place; implement the model via Hugging Face; scale the model analytics as they train it via Weights & Biases; and deploy the model via ZenML.

In the end, they’re dependent on 10 to 15 external systems, but they were able to train AI much more quickly, much more scalably, and much more accurately than if they had to try to reinvent 95 percent of the tooling around a particular AI model. It’s been really interesting for startups to identify these various things that exist already, but they don’t exist in a super-professional way where a strong team is focused on that particular aspect.

At a larger enterprise company like Salesforce, you're mostly thinking about what really moves the needle for a lot of different customers. How can you help those customers with their datasets that are already in your system, in a way where they still feel like they have – and actually do have – control? That's non-trivial to do because at Salesforce, for example, trust was our No. 1 value. You couldn't just take everyone's data and train something on it, because they own their data and they're paying for the storage. And, so, you need to also work together with customers to try to get their AI projects off the ground.

Once we had the basics of electricity figured out, you could have a lot of impact by just giving it to a town that had only oil lamps and fire before.

So for an enterprise software vendor, the concerns are that customers are paying a lot of money, and you can’t throw a wrench in the works in the name of experimenting with a new feature?

That’s part of it. But maybe more importantly, you have to make sure that it’s trusted, it’s easy to use, it scales across all these different use cases, and the cost of the service is still relatively low. If you’re a platform company like Salesforce, you also have to not just build one classifier, but you have to enable all your customers to build their own classifiers, which comes with all kinds of interesting and hard technical challenges, as well.

How does having an enterprise budget change things?

The biggest difference is that the larger you are as a company, the further you can and should look into the future, do more interesting research work, and actually have a stronger overlap with the academic world. Because you might get disrupted in two or three years, and you have enough runway to be able to think four or five years into the future. So you need to anticipate a little bit of what’s going to happen then.

So as an AI researcher in a large company, you have more of the luxury of thinking longer-term and building something, whereas in a startup, of course, you need to build something that people want right now. And it needs to be really, really good. And you need to be able to ship it in a reasonable timeframe. That’s the big difference: The vast majority of startups are working on applications and applied AI, rather than basic research; larger companies can do both.

You mentioned a lot of what we might call horizontal applications when you were talking about B2B startups. Why do you think those are proving successful today, when that wasn’t always the case?

There have always been very useful vertical AI applications, but then there was a short phase where we thought maybe horizontal platforms could work. However, the early AI platform startups took on too many different tasks at once.

For instance, at MetaMind, we built technologies where you could just drag and drop some text or images into your web browser, and then you'd have a fully scalable system that classifies those documents. In some ways, it was pretty magical because this was all pre-TensorFlow and pre-PyTorch. You had to implement all of these neural networks and all their gory details from scratch, with very few abstractions and dev tools around them. That's changed significantly.

We built all of these things at MetaMind – labeling, error analysis, deployment, modeling, analytics of how it's training. What's interesting now is that each of these separate pieces is worth more than MetaMind ever was, in terms of the companies in that space doing just one of these things.

I think most companies and developers want to feel like they’re in control of the AI, but they’re OK to give up a bunch of separate smaller parts of that stack that aren’t actually that exciting to code up. So, in a funny way, there’s a little bit of a balance between what’s fun to implement and what makes everyone feel like they’re in control. As a vendor in the machine-learning tooling space, you must not take away too much of the control.

Transformer Models

A transformer is a deep learning model that adopts the mechanism of self-attention, differentially weighting the significance of each part of the input data. It is used primarily in the fields of natural language processing and computer vision. Transformers are designed to handle sequential input data, such as natural language, for tasks such as translation and text summarization. ~ Wikipedia
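A minimal sketch of that self-attention mechanism, with the query/key/value projections left as the identity to keep it short (a real transformer learns separate projection matrices and adds multiple heads, positional information, and feed-forward layers):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(X):
    """Single-head self-attention with identity Q/K/V projections,
    i.e. Q = K = V = X, to keep the sketch minimal."""
    d = len(X[0])
    out = []
    for q in X:  # each position attends to every position
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)  # how much each position matters
        out.append([sum(w * v[i] for w, v in zip(weights, X))
                    for i in range(d)])
    return out

# Three token vectors; each output row is a weighted mix of all rows.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in self_attention(X):
    print([round(v, 3) for v in row])
```

Every step here is a matrix multiply or a softmax, which is exactly why the architecture maps so well onto GPUs, a point Socher returns to below.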

How has the evolution of networks and models changed how someone might think about starting a company or building an AI product?

I don’t actually think the particular model changes much about how people would start companies. But I think there are certain models that are currently more efficient because they deal better with the hardware that we have. We’re not really brain-inspired or theory-inspired or principles-inspired – we’re GPU-inspired. We’re mostly inspired by what works well on a GPU.

The current popular model, transformers, is very efficient for GPUs and can be trained very efficiently. And if we had different computing architectures, then it might be LSTMs still, or maybe even recursive neural networks. There are all kinds of different models as encoders of vectors that will come and go.

That does change things a little bit for hardware startups. They look at Nvidia, and some other larger companies, and say, “Well, there has got to be some way to get a slice of that pie.” So, we’re going to see some innovation. At the same time, it is really hard for them to scale because for most major use cases, they have to offer their special hardware within one of the large cloud providers.

And then, of course, the whole stack of AI development has matured so much in the last 8 years. Back then, if you wanted it to be fast you had to implement everything in C++ from scratch, which was incredibly slow going. It took people a long time to get up to speed and learn. Nowadays, all of that complexity can be abstracted away, and you can use products like the ones we discussed before that make it so much faster, more convenient, and easier to build high-quality AI systems.

[Back in 2013] you had to implement all of these neural networks and all their gory details from scratch, with very few abstractions and dev tools around them. That's changed significantly.

But algorithmic advances do make a difference, right? For example, You.com is big on privacy, and it seems like one reason you can prioritize that is the ability to do more with less data.

That’s a great question, and it’s absolutely true. I think if we had wanted to build a search engine company 5 or 10 years ago, it would have been insanely hard and basically impossible to compete with Google because we would have needed hundreds of people and huge amounts of training data to build the ranking systems that we’re building. Now, with a very small – albeit extremely smart and capable, but very small – team, we’re actually able to build a ranking system that ranks any arbitrary intent and query that you type in the search engine, and provides the right sets of apps and the right sets of sources for those.

And the only reason a small company like You.com can compete with a large company like Google is the progress we've seen in AI – in particular, in so-called unsupervised and transfer learning. The idea here is that you can train very large neural networks on unsupervised text – basically all of Wikipedia, Common Crawl, and as much web text as you can find – while keeping in mind that not everything on the web is great for training AI.

Unsupervised models are trained with these very simple objectives, such as predicting the next word in a sentence. For example, “I went to Southern California and enjoyed the …” If you know a lot about language and the world, you’ll realize a good next word might be “beach,” “desert,” or any of the things you might enjoy in Southern California. But you need a lot of knowledge to be able to predict what that word is. By training a model to keep predicting the next word in these very long sequences of millions and billions of words, it actually starts to incorporate all of that knowledge.
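The next-word objective can be demonstrated with a deliberately tiny model – bigram counts over the example sentence – standing in for the billion-parameter networks Socher describes:

```python
from collections import Counter, defaultdict

# Language modeling needs no labels: the "label" for each word is
# simply the word that follows it in raw text.
corpus = ("i went to southern california and enjoyed the beach . "
          "i went to southern california and enjoyed the desert .").split()

follows = defaultdict(Counter)
for w, nxt in zip(corpus, corpus[1:]):
    follows[w][nxt] += 1  # count which words follow which

def predict_next(word):
    """Most frequent continuation seen in the training text."""
    return follows[word].most_common(1)[0][0]

print(predict_next("enjoyed"))  # → the
print(predict_next("the"))      # "beach" or "desert" – each seen once
```

A large neural network replaces these raw counts with learned representations, which is what lets it absorb the world knowledge needed to guess "beach" after a sentence about Southern California.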

It’s unsupervised because no one needs to sit there and label what the next word is. You just take Wikipedia, and you get a lot of words in the right sequence.

That’s been an incredibly powerful idea that has essentially enabled NLP models that are very large, but then can be modified just a little bit to do what you want them to do. And they will generalize much more beyond the specific, small labeled data that you have because they have a sense of world knowledge; they know things like “best Thai restaurants near me” is very similar to “best Southeast Asian restaurants in my area.” Even though we’ve never had that particular phrase in our training data, our neural networks and our ranking systems can actually do this because they know those phrases are similar.
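A sketch of why that generalization works, using hypothetical hand-made word vectors (a real system learns them from unlabeled text): the two restaurant queries share almost no surface words, yet their averaged vectors point the same way.

```python
import math

# Hypothetical toy word vectors; "thai" and "asian" point in a similar
# direction even though the strings share nothing.
vec = {
    "best": [0.2, 0.1], "thai": [0.9, 0.1], "asian": [0.85, 0.15],
    "restaurants": [0.3, 0.8], "near": [0.1, 0.3], "me": [0.1, 0.2],
    "in": [0.1, 0.3], "my": [0.1, 0.2], "area": [0.1, 0.3],
    "plumbing": [0.0, 0.95], "supplies": [0.2, 0.9],
}

def embed(phrase):
    """Phrase vector = mean of its word vectors (a crude baseline)."""
    words = [vec[w] for w in phrase.split() if w in vec]
    return [sum(v[i] for v in words) / len(words) for i in range(2)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

a = embed("best thai restaurants near me")
b = embed("best asian restaurants in my area")
c = embed("plumbing supplies near me")
print(cosine(a, b) > cosine(a, c))  # → True: the restaurant queries match
```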

We’re not really brain-inspired or theory-inspired or principles-inspired – we’re GPU-inspired. We’re mostly inspired by what works well on a GPU.

Speaking of search: One big thing I noticed about You.com is the way it summarizes results. How much of that is solely a UI/UX decision that could have been implemented by anyone at any time, and how much is also a function of advances in machine learning where you’re able to treat results in a different way?

Although it doesn’t sound that cool, summarization is actually one of the hardest AI tasks, especially in natural language processing. And it’s hard for a lot of interesting reasons. One, it’s very personalized. Like, if I know what you (the recipient of the summary) know, I can give you much better and more accurate results for that summary.

If, for instance, you don’t know what a word vector is, then it’s very hard to understand transformers. So you would first need to get a primer on word vectors in order to understand transformer networks for NLP. But if you already know what a transformer is, then a summary of a research paper might be very short. It could just say, “They’re training it on language modeling instead of machine translation, and that is a better objective function.”

And I think summarization is an important technology trend that more and more people in the next couple of years will appreciate as they see their time disappearing. When your time is valuable, you want simple tools to help you get stuff done. Instead, we get sucked into engagement loops from companies whose business model is often advertisements. They don’t want to help you get things done; they want to help you look at more content in order to show you more ads.

We want to push against that. Summarization is a big part of that, to help you search less and do more, or search less and code more. We have apps that have code snippets you can just copy and paste, and that is often the right summary. If you’re searching “How do I sort a dictionary in Python,” the right answer isn’t a long sequence of text. It’s just a code snippet, and that’s it. Or when we show you a paper, there’s a link to a GitHub repo that implements an open-source version of that paper.
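For reference, that dictionary-sorting answer really is just a snippet:

```python
# Sorting a dict by key and by value – the kind of answer that is
# best served as a snippet rather than a long sequence of text.
d = {"banana": 3, "apple": 1, "cherry": 2}

by_key = dict(sorted(d.items()))
by_value = dict(sorted(d.items(), key=lambda kv: kv[1]))

print(by_key)    # {'apple': 1, 'banana': 3, 'cherry': 2}
print(by_value)  # {'apple': 1, 'cherry': 2, 'banana': 3}
```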

I think the next generation of search engines is fundamentally based on different values, but also different business models.

I think most companies and developers want to feel like they’re in control of the AI, but they’re OK to give up a bunch of separate smaller parts of that stack that aren’t actually that exciting to code up.

Given all the advances we’ve discussed, if you were giving advice to someone looking to get into the AI space right now, what would be the things to do or the skills to learn?

That highly depends on their age, their skill sets, their time commitment, and which part of the field they want to be in. If you’re young and you really want to set up your career toward that, you still have to learn the basics of programming, math, statistics, some probability, a lot of linear algebra, and things like that.

And then if you are a practitioner and you want to get into it, there are a ton of really exciting new online classes, videos, and platforms. There’s so much material out there now. Even the Stanford CS224N NLP lectures are out there, so you can go quite deep if you want to. That’s what I’d encourage folks to do.

Once you’ve done that, the next level is to just get your hands dirty and program something, play around with these models. Think about what kinds of processes and tasks people are currently doing manually, or sometimes maybe mechanically, but still requiring a human to oversee. Could you automate those and build something unique?

If we had wanted to build a search engine company 5 or 10 years ago, it would have been insanely hard and basically impossible to compete with Google because we would have needed hundreds of people and huge amounts of training data.

How far can you get just using, say, cloud APIs and various levels of abstraction versus having to really get a meaningful understanding of how this stuff works?

It all depends on your background. If you picked up a math background at some point during your higher education, then you can very quickly understand some of the fundamentals and skip directly to hacking on actual models without re-implementing them all from scratch. But the more you rely on abstractions, the harder it can be to do something truly novel, or to understand how to solve performance issues and bugs.

However, there are a lot of use cases where you don’t have to do anything novel. You might want to automate a sprinkler system, so you’re just interested in answering: “Is there a person standing here? Yes or no.” And if there isn’t, turn on the sprinkler system. You don’t need to invent anything novel for that. You just need to do all the right standard steps and use good tools for an image classifier.
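The control logic really is that small once the classifier is treated as a black box; `person_in_frame` below is a hypothetical stub standing in for a trained detector:

```python
# The "is there a person? yes or no" gate described above, with the
# classifier stubbed out. In practice person_in_frame would wrap an
# off-the-shelf image classifier; the surrounding logic stays this simple.

def person_in_frame(frame):
    """Hypothetical stand-in for a trained person detector.
    Here a frame is just a dict carrying a precomputed score."""
    return frame["person_score"] > 0.5

def sprinkler_on(frame):
    # Turn the sprinklers on only when nobody is standing there.
    return not person_in_frame(frame)

print(sprinkler_on({"person_score": 0.9}))  # → False (someone is there)
print(sprinkler_on({"person_score": 0.1}))  # → True
```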

But the abstractions are still leaky and they’re not perfect. So, the more important the application – the more important it is to your company, affected users, or your career – the more you still want to have experts that understand these systems deeply. Experts that know how to fix certain errors or performance issues, and also folks who think through how that AI system might impact people. Only then can you really automate certain processes in a way that is safe and maximizes positive impact.