As machine learning technology has matured and moved from research curiosity to something industrial-grade, the methods and infrastructure needed to support large-scale machine learning have also evolved. Taking advantage of these advances presents both opportunities and risks for startups – almost all of which are leveraging machine learning in one way or another as they compete for a piece of their respective markets.

The journey to this point began a little more than 9 years ago, when the deep learning revolution was kicked off by a 2012 submission, called AlexNet, to the annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a computer vision contest run by the research community. In this submission, a team of three (Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton) used a technique known as a convolutional neural network to understand the content of photos. They won the competition hands down – beating all others by a significant margin – and did it with a system trained on a $700 computer graphics card used to play video games.

The world of machine learning was forever changed. Within a year, startups began springing up to replicate AlexNet. My previous company, AlchemyAPI (acquired by IBM in 2015), released one of the first commercial versions of this work with our AlchemyVision computer-vision API back in 2013. Other startups founded around this time include DeepMind (acquired by Google), MetaMind (acquired by Salesforce), and Clarifai, among many others. Academia also shifted dramatically, with many experts moving almost overnight from skepticism about artificial intelligence to wholehearted embrace of deep learning.

Fast forward to 2022: Neural networks have changed every aspect of machine intelligence in software systems we all use daily, from recognizing our speech to recommending what’s in our news feed (for better or for worse). Today’s systems still employ neural networks – but at a vastly different scale. Recent systems for understanding and generating human language, such as OpenAI’s GPT-3, were trained on supercomputer-scale resources: thousands of GPUs (each costing $10,000 or more) woven into a complex fabric of high-speed network interconnects and data-storage infrastructure. While 2012’s state-of-the-art systems could be trained on a $700 video game card, today’s state-of-the-art systems – often referred to as foundation models – likely require tens of millions of dollars in computation to train.

The emergence of these massive-scale, high-cost foundation models brings opportunities, risks, and limitations for startups and others that want to innovate in artificial intelligence and machine learning. Smaller entities likely can’t compete with Google, Facebook, or OpenAI on the bleeding edge of research, but they can build on the work of these giants, including foundation models, to kickstart development of their own machine-learning-powered applications.

Pre-trained networks give smaller teams a leg up

Neural networks such as AlexNet were originally trained from scratch for every task – something doable when networks required a few weeks of training on a single piece of gaming hardware, but much more difficult as network sizes, compute resources, and training data volumes began to scale by orders of magnitude. This led to the popularization of an approach known as pre-training, whereby a neural network is first trained on a large general-purpose dataset using significant computational resources, and then fine-tuned for the task at hand using a much smaller amount of data and compute.
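To make the pattern concrete, here is a minimal sketch of the pre-train/fine-tune workflow using the open-source Hugging Face transformers library. The base model (“bert-base-uncased”), the example dataset (IMDB movie reviews), and the hyperparameters are illustrative choices, not a prescription; the point is that the expensive pre-training step has already been done by someone else.

```python
# A minimal sketch of fine-tuning a pre-trained network with Hugging Face
# transformers. Model, dataset, and hyperparameters are illustrative only.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# A few thousand labeled examples -- a tiny fraction of the data used for pre-training.
dataset = load_dataset("imdb")["train"].shuffle(seed=42).select(range(2000))
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=dataset,
)
trainer.train()  # minutes to hours on one GPU, versus weeks of pre-training from scratch
```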

The use of pre-trained networks has exploded in recent years as the industrialization of machine learning has taken over many fields (such as language or speech processing) and as the amount of data available for training has dramatically increased. Pre-training allows a startup, for example, to build a product with far less data and compute than would be needed when starting from scratch. The approach is also popular in academia, where researchers can quickly fine-tune a pre-trained network for a new task and then publish the results.

For certain task domains – including understanding or generating written text, recognizing the content of photos or videos, and audio processing – pre-training has continued to evolve with the emergence of foundation models such as BERT, GPT, DALL-E, CLIP, and others. These models are pre-trained on large general-purpose datasets (often on the order of billions of training examples) and are being released as open source by well-funded AI labs such as those at Google, Microsoft, and OpenAI.
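As one illustration, here is a minimal sketch of using an open-source foundation model (OpenAI’s CLIP, loaded through the transformers library) for zero-shot image classification. The image path and candidate labels are placeholders; the notable part is that no task-specific training happens at all.

```python
# A sketch of zero-shot image classification with the open-source CLIP model,
# loaded via transformers. The image file and labels are placeholders.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # any local image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=1)  # similarity -> probabilities
print(dict(zip(labels, probs[0].tolist())))
```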

The rate of innovation in commercialized machine learning applications and the democratizing effect of these foundation models cannot be overstated. They have been a boon for those working in the field who don’t have a spare supercomputer lying around, allowing startups, researchers, and others to quickly get up to speed on the latest machine learning approaches without having to spend the time and resources needed to train these models from scratch.

The risks of foundation models: size, cost, and outsourced innovation

However, not all is rosy in the land of pre-trained foundation models, and there are several risks associated with their increasing use.

One risk associated with foundation models is their ever-increasing scale. Neural networks such as Google’s T5-11b (open sourced in 2019) already require a cluster of expensive GPUs simply to load and make predictions. Fine-tuning these systems requires even more resources. More recent models created in 2021-2022 by Google, Microsoft, and OpenAI are often so large that these companies are not releasing them as open source – they now cost tens of millions of dollars to create and are increasingly viewed as significant IP investments, even for these large companies.

However, even if these latest models were open sourced, simply loading these networks for making predictions (“inference,” in machine learning parlance) involves spinning up more resources than many startups and academic researchers can readily access. OpenAI’s GPT-3, for example, requires a significant number of GPUs simply to load. Even using modern compute clouds such as Amazon Web Services, this would involve provisioning dozens of Amazon’s most expensive GPU machines into a high-performance computing cluster.
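A rough back-of-the-envelope calculation shows why. The sketch below estimates the memory needed just to hold the weights of these models; the parameter counts are public figures, while the 40 GB per-GPU figure and the choice of precision are assumptions, and activation memory plus framework overhead are ignored entirely.

```python
# Rough arithmetic behind the "too big to load" problem. Only the weights are
# counted; real deployments need additional memory for activations and overhead.
import math

GPU_MEM_GB = 40  # assumed capacity of one datacenter-class GPU

for name, params in [("T5-11B", 11e9), ("GPT-3 (175B)", 175e9)]:
    for precision, nbytes in [("fp32", 4), ("fp16", 2)]:
        weights_gb = params * nbytes / 1e9
        gpus = math.ceil(weights_gb / GPU_MEM_GB)
        print(f"{name} @ {precision}: ~{weights_gb:,.0f} GB of weights, "
              f">= {gpus} GPU(s) just to hold them")
```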

Dataset alignment can also be a challenge for those using foundation models. Pre-training on a large general-purpose dataset is no guarantee that the network will be able to perform a new task on proprietary data. The network may be so lacking in context, or so biased by its pre-training, that even fine-tuning may not readily resolve the issue.

For example, GPT-2, a popular foundation model in the natural language processing space, was originally announced in early 2019 and was thus trained on data collected on or before that date. Think about everything that has happened since 2019 – pandemic, anyone? The original GPT-2 model will surely know what a pandemic is, but it will lack the detailed context around COVID-19 and its variants that has emerged in the years since.

To illustrate this point, here is GPT-2 trying to complete the sentence “COVID-19 is a …”:

GPT-2 (2019): “COVID-19 is a high capacity LED-emitter that displays information about the size and state of the battery.”

By comparison, GPT-J, an open-source language model released in 2021, completes the sentence as follows:

GPT-J (2021): “COVID-19 is a novel coronavirus that mainly affects the respiratory system resulting in a disease that has a wide variety of clinical manifestations.”

Pretty dramatic difference, right? Dataset alignment and recency of training data can matter immensely depending on the use case. Any startup leveraging foundation models in its machine learning efforts should pay close attention to these types of issues.
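For anyone who wants to try this themselves, completions like the ones above can be reproduced with the open-source GPT-2 weights in a few lines; a minimal sketch using the transformers text-generation pipeline is below. Sampling is stochastic, so individual outputs will vary from run to run.

```python
# A minimal sketch of prompting the open-source GPT-2 model via the
# transformers text-generation pipeline. Sampled outputs vary between runs.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("COVID-19 is a", max_length=40, do_sample=True, num_return_sequences=1)
print(result[0]["generated_text"])

# GPT-J is also openly available as "EleutherAI/gpt-j-6B", but its roughly
# 24 GB of fp32 weights demand far more memory than a typical laptop has.
```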

Cloud APIs are easier, but outsourcing isn’t free 

Companies such as OpenAI, Microsoft, and Nvidia have seen the scale challenges and are responding with cloud APIs that enable running inference and fine-tuning of large-scale models on their hosted infrastructure. And, of course, every major cloud provider now offers a suite of machine learning services as well as, in some cases, custom processors designed specifically for these workloads. This can provide a limited pressure-relief valve for startups, researchers, and even individual hobbyists by offloading the compute and infrastructure challenges to a larger company.
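The appeal is that all the caller manages is an HTTP request and an API key, while the model itself lives on the provider’s hardware. The sketch below uses OpenAI’s public completions endpoint as an example, with a placeholder prompt and the “text-davinci-002” model name; other providers expose a similar pattern.

```python
# A minimal sketch of hosted inference over a cloud API: the model runs on the
# provider's infrastructure, and the caller only sends a prompt and an API key.
# Endpoint and model name follow OpenAI's public completions API as an example.
import os
import requests

response = requests.post(
    "https://api.openai.com/v1/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={"model": "text-davinci-002", "prompt": "COVID-19 is a", "max_tokens": 32},
    timeout=30,
)
print(response.json()["choices"][0]["text"])
```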

This approach has its own risks, however. Not being able to host your own model means relying on centralized entities for both training and inference. This can create externalized risks in building production-ready machine learning applications: Network outages, concurrency or rate limits on APIs, or simply changes in policy by the hosting company could all lead to significant operational impact. Additionally, the potential for IP leakage may make some uncomfortable when sensitive labeled datasets (some of which might be covered by regulations such as HIPAA) must be sent to cloud providers for fine-tuning or inference.

From a bottom-line perspective, the COGS (cost of goods sold) impact of calling these APIs can also be a concern for those relying on cloud providers for their machine learning needs. Pricing models vary by provider, but needless to say, the cost of API calls, data storage, and cloud instances will scale along with your usage. Many companies that use cloud APIs for machine learning today may eventually attempt to transition to self-hosted or self-trained models to gain more control over their machine learning pipelines and eliminate externalized risks and costs.
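A quick sketch of how that spend grows makes the point; the price, token count, and request volume below are entirely hypothetical placeholders rather than any provider’s actual rates.

```python
# A back-of-the-envelope COGS estimate for a product built on a per-token API.
# All numbers are hypothetical; the takeaway is that spend scales linearly with usage.
PRICE_PER_1K_TOKENS = 0.02   # hypothetical $ per 1,000 tokens
TOKENS_PER_REQUEST = 1_000   # prompt plus completion, rough average
REQUESTS_PER_DAY = 50_000

daily_cost = REQUESTS_PER_DAY * (TOKENS_PER_REQUEST / 1_000) * PRICE_PER_1K_TOKENS
print(f"~${daily_cost:,.0f} per day, ~${daily_cost * 30:,.0f} per month at this volume")
```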

The opportunities and risks around using hosted and pre-trained models have led many companies to leverage cloud APIs in the “experimentation phase” to kickstart product development – the period when a company is trying to find product-market fit for its offering. Leveraging cloud APIs can allow a company to quickly get its product up and running at scale without having to invest in expensive infrastructure, model training, or data collection. Cloud machine learning services and hosted pre-trained models from providers such as Google, IBM, Microsoft, and OpenAI now power thousands of startups and academic research projects.

Once a company has found product-market fit, it often transitions to self-hosted or self-trained models in order to gain more control over data, process, and intellectual property. This transition can be difficult, as the company needs to scale its infrastructure to match the demands of the model, as well as manage the costs associated with data collection, annotation, and storage. Companies are raising increasingly large amounts of investor capital to make this transition.

My latest startup, Hyperia, recently made such a transition. Early on, we experimented with cloud APIs as we worked to understand the content of business meetings and customer voice conversations. But eventually we jumped into the deep end of the pool, spinning up large-scale data collection and model training efforts to build our own proprietary speech and language engines. For many business models, such an evolution is simply unavoidable if one is to achieve positive unit economics and market differentiation.

Be strategic and keep an eye on the big AI labs

Foundation models are one of the latest disruptive trends in machine learning, but they will not be the last. While companies continue to build ever-larger machine learning supercomputers (Facebook’s latest includes more than 16,000 GPUs), researchers are busy developing new techniques to reduce the computational costs associated with training and hosting state-of-the-art neural networks. Google’s latest LaMDA model leverages a number of innovations to train more efficiently than GPT-3, and techniques such as model distillation and noisy student training are being rapidly developed by the research community to reduce model size. 
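Knowledge distillation, for example, trains a small “student” network to match the softened output distribution of a large “teacher,” shrinking the model that ultimately has to be served. Below is a minimal sketch of the standard distillation loss, written in PyTorch; the temperature and loss weighting are typical choices rather than values from any particular paper.

```python
# A minimal sketch of knowledge distillation: the student is trained against a
# blend of the teacher's softened predictions and the ground-truth labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: student mimics the teacher's full probability distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: the usual cross-entropy against ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```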

These innovations and others mean startups can continue to innovate – but it’s important to keep one’s eyes open as the landscape continues to change. Things to keep in mind include:

  • Cloud APIs can definitely accelerate a company’s path to product-market fit, but often bring their own problems long-term. It’s important to have a strategic exit plan so these APIs do not control your product destiny.
  • Foundation models can vastly speed up your machine learning efforts and reduce overall training and data collection costs, but being aware of the limitations of these systems (e.g., recency of training data) is important.
  • Keep tabs on what is coming out of the big corporate AI labs (Google, Microsoft, IBM, Baidu, Facebook, OpenAI, etc.). Machine learning is changing at an extremely rapid pace, with new techniques, models, and datasets being released every month. These releases often arrive at unexpected times and can have a dramatic impact on your company’s machine learning efforts, provided you can adapt quickly.

Ultimately, the future of machine learning and its impact on startups and technology companies is uncertain, but one thing is clear: Companies that understand what’s available and make smart decisions about using it will be in a much better position to succeed than those just looking for a quick AI fix.