This is an edited excerpt from Trustworthy AI: A Business Guide for Navigating Trust and Ethics in AI by Beena Ammanath (Wiley, March 2022). Ammanath is executive director of the Global Deloitte AI Institute and leads Trustworthy & Ethical Technology at Deloitte. She has held leadership positions in artificial intelligence and data science at multiple companies, and is the founder of Humans For AI, an organization dedicated to increasing diversity in AI.

With AI model training, datasets are a proxy for the real world. Models are trained on one dataset and tested against another, and if the results are similar, the expectation is that the model's functions will translate to the operational environment. What works in the lab should work consistently in the real world, but for how long? Perfect operating scenarios are rare in AI, and real-world data is messy and complex. This has led to what leading AI researcher Andrew Ng called a “proof-of-concept-to-production gap,” where models train as desired but fail once they are deployed. It is partly a problem of robustness and reliability.

When outputs are inconsistently accurate and become worse over time, the result is uncertainty. Data scientists are challenged to build provably robust, consistently accurate AI models in the face of changing real-world data. As information shifts, the algorithm can meander away from its intended behavior, with small changes in input cascading into large shifts in function.

To be sure, not all tools operate in environments prone to dramatic change, and not all AI models present the same levels of risk and consequence if they become inaccurate or undependable. The task for enterprises as they grow their AI footprint is to weigh robustness and reliability as a component of their AI strategy and align the processes, people, and technologies that can manage and correct for errors in a dynamic environment.

To that end, we start with some of the primary concepts in the area of robust and reliable AI.

Robust vs brittle AI

The International Organization for Standardization defines AI robustness as the “ability of an AI system to maintain its level of performance under any circumstances.” In a robust model, the training error rate, testing error rate, and operational error rate are all nearly the same. And when unexpected data is encountered in operation or when the model is operating in less-than-ideal conditions, the robust AI tool continues to deliver accurate outputs.

For example, if a model can identify every image of an airplane in a training dataset and is proven to perform at a high level on testing data, then the model should be able to identify airplane pictures in any dataset, even if it has not encountered them previously. But how does the airplane-identifying model perform if a plane is pink, photographed at dusk, missing a wing or viewed at an angle? Does its performance degrade, and if so, at what point is the model no longer viable?
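The comparison of training, testing, and operational error rates described above can be sketched in a few lines. This is a minimal illustration, not a standard metric: the function names, the example rates, and the 0.05 tolerance are all assumptions chosen for the example.

```python
def error_rate(predictions, labels):
    """Fraction of predictions that do not match their labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal length")
    wrong = sum(1 for p, y in zip(predictions, labels) if p != y)
    return wrong / len(labels)


def robustness_gap(train_err, test_err, operational_err):
    """Largest spread between the three error rates; near zero suggests robustness."""
    rates = (train_err, test_err, operational_err)
    return max(rates) - min(rates)


# Hypothetical error rates, for illustration only.
gap = robustness_gap(train_err=0.02, test_err=0.03, operational_err=0.11)
TOLERANCE = 0.05  # assumed tolerance; a real value depends on the use case
print(f"gap={gap:.2f}, looks_robust={gap <= TOLERANCE}")
```

A widening gap between the lab figures and the operational figure is one simple signal that a model is behaving more brittly in the field than testing suggested.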

When small changes in the environment lead to large changes in functionality and accuracy, a model is considered inelastic or “brittle.” Brittleness is a known concept in software engineering, and it is apt for AI as well. Ultimately, all AI models are brittle to some degree. The different kinds of AI tools we use are specific to their function and their application. AI does only what we train it to do.

There is another component to this. Those deploying and managing AI must weigh how changing real-world data leads to degrading model accuracy over time. In the phenomenon of “model drift,” the predictive accuracy of an AI tool decreases as the underlying variables that inform the model change. Signals and data sources that were once trusted can become unreliable. Unexpected malfunctions in a network can lead to changes in data flows.

An AI that plays chess is likely to remain robust over time, as the rules of chess and the moves the AI will encounter are predictable and static. Conversely, a natural language processing (NLP) chatbot operates in the fluid landscape of speech patterns, colloquial language, incorrect grammar and syntax, and a variety of changing factors. With machine learning, unexpected data or incorrect computations can lead a model astray, and what begins as a robust tool deteriorates to brittleness, unless corrective tactics are employed.

Developing reliable AI

The European Commission’s Joint Research Centre notes that assessing reliability requires consideration of performance and vulnerability. Reliable AI performs as expected even given inputs that were not included in training data, what are called out-of-distribution (OOD) inputs. These are data points that are different from the training set, and reliable AI must be able to detect whether data is OOD. One challenge is that for some models, OOD inputs can be classified with high confidence, meaning the AI tool is ostensibly reliable when in fact it is not.
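One simple way to flag possible OOD inputs is to check them against the training data directly rather than trusting the model's own confidence. The sketch below is a basic heuristic, not a method named in the source: it flags any input whose features fall far outside the range seen in training. The function names and the cutoff of 4 standard deviations are assumptions.

```python
import statistics


def fit_feature_stats(training_rows):
    """Per-feature mean and sample standard deviation from the training data."""
    columns = list(zip(*training_rows))
    return [(statistics.mean(c), statistics.stdev(c)) for c in columns]


def is_out_of_distribution(row, stats, max_z=4.0):
    """True if any feature lies more than max_z standard deviations from its training mean."""
    for value, (mean, std) in zip(row, stats):
        if std == 0:
            # A feature that never varied in training: any other value is novel.
            if value != mean:
                return True
            continue
        if abs(value - mean) / std > max_z:
            return True
    return False


# Hypothetical two-feature training set.
train = [(1.0, 10.0), (1.2, 11.0), (0.9, 9.5), (1.1, 10.5), (1.0, 10.2)]
stats = fit_feature_stats(train)
print(is_out_of_distribution((1.05, 10.1), stats))  # close to the training data
print(is_out_of_distribution((1.05, 40.0), stats))  # second feature far outside it
```

Because the check looks only at the inputs, it can raise a flag even when the model itself reports high confidence. Per-feature checks like this miss unusual combinations of otherwise ordinary values, so they are a starting point at best.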

Take an autonomous delivery robot. Its navigation AI is optimized to find the most direct path to its destination. The training dataset has all the example data the AI needs to recognize sidewalks, roads, crosswalks, curbs, pedestrians, and every other variable except railroad tracks intersecting a pathway. In operation, the robot identifies rail tracks in its path, and while they are OOD, the AI computes high confidence that the tracks are just a new kind of footpath, which it follows to expedite its delivery. Clearly, the AI has gone astray due to an OOD input. If the robot is not hit by a train, the uneventful trip validates its conclusion that “this is a viable path,” and it may look for other rail tracks to use. The operators may be none the wiser until a train comes along.

Reliable AI is accurate in the face of any novel input. This is different from average performance. A model that offers good average performance may still yield occasional outputs with significant consequences, hampering reliability. If an AI tool is accurate 80% of the time, is it a trustworthy model? A related matter is resilience to vulnerabilities, be they natural outcomes from operation or the result of adversarial exploits.
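The difference between average performance and reliability becomes visible when accuracy is broken out by operating condition. The sketch below uses made-up evaluation results and hypothetical slice names ("daylight" and "dusk", echoing the airplane example earlier) to show how a strong overall figure can hide a weak slice.

```python
from collections import defaultdict


def accuracy_by_slice(records):
    """records: iterable of (slice_name, is_correct). Returns {slice_name: accuracy}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for slice_name, is_correct in records:
        totals[slice_name] += 1
        correct[slice_name] += int(is_correct)
    return {name: correct[name] / totals[name] for name in totals}


# Hypothetical evaluation results: 90 daylight images and 10 dusk images.
records = (
    [("daylight", True)] * 88 + [("daylight", False)] * 2
    + [("dusk", True)] * 2 + [("dusk", False)] * 8
)

overall = sum(ok for _, ok in records) / len(records)
per_slice = accuracy_by_slice(records)
worst = min(per_slice, key=per_slice.get)
print(f"overall={overall:.2f}, worst slice={worst} at {per_slice[worst]:.2f}")
```

In this invented data the model is right 90% of the time overall but only 20% of the time at dusk, which is the kind of gap an average alone would not reveal.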

Lessons in data reliability

A model is only as good as the training and testing data used to develop it. Without confidence that the data represents the real world, there is little assurance that the model will deliver accurate outputs in the operational environment. For the U.S. Government Accountability Office, data reliability hinges on:

  • Applicability – Does the data provide valid measures of relevant qualities?
  • Completeness – To what degree is the dataset populated across all attributes?
  • Accuracy – Does the data reflect the real world from which the dataset was gathered?

These are cross-cutting components of trustworthy data, as well as AI. Datasets need to be sufficiently curated and in some cases labeled or even supplemented with synthetic data, which can compensate for missing data points or fill in for protected information that cannot (or should not) be used in training. Data must also be scrubbed for latent bias, which skews model training and leads to undesirable outputs or predictions.
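Of the three criteria listed above, completeness is the most straightforward to measure. A minimal sketch, assuming records arrive as dictionaries and that `None` or an empty string means a value is missing; the field names are invented for the example.

```python
def completeness_by_attribute(rows, attributes):
    """Share of rows with a non-missing value for each attribute.

    Assumption: None, an empty string, or an absent key counts as missing.
    """
    if not rows:
        raise ValueError("rows must not be empty")
    report = {}
    for attr in attributes:
        filled = sum(1 for row in rows if row.get(attr) not in (None, ""))
        report[attr] = filled / len(rows)
    return report


# Hypothetical records; field names are illustrative.
rows = [
    {"sensor_id": "a1", "temperature": 21.5, "location": "dock"},
    {"sensor_id": "a2", "temperature": None, "location": "dock"},
    {"sensor_id": "a3", "temperature": 19.0, "location": ""},
    {"sensor_id": "a4", "temperature": 20.1},
]
report = completeness_by_attribute(rows, ["sensor_id", "temperature", "location"])
for attr, share in report.items():
    print(f"{attr}: {share:.0%} populated")
```

Applicability and accuracy are harder to reduce to a single number and generally call for human judgment about what the data is meant to represent.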

As with the AI tool itself, real-world operational data needs to be monitored for shifting trends and emerging data science needs. For example, a model conducting sentiment analysis may be trained to score sentiment across a dozen variables, but after deployment, the AI team identifies additional variables that must be accounted for when assessing drift and retraining the model.

Like reliability, data applicability is not static. Likewise, data accuracy might fluctuate based on how well sensors perform, whether there are latency or availability issues, or any of the known factors that can hamper data reliability.

Leading practices in building robust and reliable AI

Whether a model is hampered by unfamiliar data, perturbed by a malicious actor, or drifting from accuracy, organizations should embed within their AI initiatives the capacity to evaluate the risk of deployment, track performance against intended specifications, gauge (if not measure) robustness, and maintain processes to fix failing or drifting models as their reliability degrades. Because reliability flows out of robustness, many of the same activities support both. They include:

Benchmarks for reliability

Even while model training is ongoing, identify and define which benchmarks are most valuable for tracking and measuring reliability. The benchmarks might include how the AI system performs relative to human performance, which is particularly apt given that deep learning models attempt to mimic human cognition.

Perform data audits

As a component of testing, review data reliability assessments, corrective actions, and data samples from training. Engage data stakeholders (e.g., IT leaders, legal experts, ethicists) to explore the data quality and reliability. AI models require datasets that reflect the real world, so as a component of data audits, investigate the degree to which datasets are balanced, unbiased, applicable, and complete.

Monitor reliability over time

Reliability evolves throughout the AI lifecycle. When the model output or prediction diverges from what is expected, catalog the data for analysis and investigation. The types of data often used in this analysis are time-to-event (how long until the model diverged), degradation data (information surrounding how the model degrades), and recurrent events data (errors that occur more than once).

Uncertainty estimates

Insight breeds confidence. To give deeper visibility into how AI is functioning, there are tools emerging that permit the model to report the degree of uncertainty alongside a prediction or output. This moves toward trust in robust systems. If a model reports high uncertainty, that is valuable insight for the human operator or another networked AI. Uncertainty estimates can flag a drifting model, highlight changes in data, or provide awareness that an adversarial example entered the data stream.
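One common way to attach an uncertainty score to a prediction is to run several independently trained models and measure how spread out their averaged class probabilities are. The sketch below assumes such an ensemble exists and uses normalized entropy as the score. The source does not name a specific technique, so the approach and the function names are illustrative choices.

```python
import math


def mean_probabilities(member_probs):
    """Average class probabilities across ensemble members."""
    n = len(member_probs)
    return [sum(col) / n for col in zip(*member_probs)]


def normalized_entropy(probs):
    """Entropy of a probability vector scaled to [0, 1]; 1 means maximally uncertain."""
    if len(probs) < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))


def predict_with_uncertainty(member_probs):
    """Return (predicted class index, uncertainty score) for one input."""
    avg = mean_probabilities(member_probs)
    return avg.index(max(avg)), normalized_entropy(avg)


# Hypothetical outputs from three ensemble members for two different inputs.
agree = [[0.97, 0.02, 0.01], [0.95, 0.03, 0.02], [0.96, 0.03, 0.01]]
disagree = [[0.90, 0.05, 0.05], [0.05, 0.90, 0.05], [0.05, 0.05, 0.90]]

for name, probs in (("agree", agree), ("disagree", disagree)):
    label, uncertainty = predict_with_uncertainty(probs)
    print(f"{name}: class={label}, uncertainty={uncertainty:.2f}")
```

When the members agree, the score stays low. When each member is individually confident but they contradict one another, the score approaches 1, which is the kind of signal a human operator or downstream system could act on.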

Managing drift

Operators can assess drift by comparing the model’s inputs and outputs during live deployment with inputs and outputs in a reference set. Similarity is measured on a pairwise basis between test and training data inputs, with a segmentation carried out on the outputs. By maintaining a close understanding of how inputs and outputs are changing relative to the reference set, human operators are positioned to take corrective steps (e.g., retrain the model).
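As a simplified stand-in for the pairwise comparison described above, the sketch below uses the Population Stability Index (PSI), a widely used summary of how far a live feature distribution has moved from a reference distribution. The bin count, the small floor value, and the example data are assumptions for illustration.

```python
import math


def population_stability_index(reference, live, bins=10, eps=1e-6):
    """PSI between two samples of one numeric feature, binned on the reference range."""
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference values must not all be identical")
    width = (hi - lo) / bins

    def shares(values):
        counts = [0] * bins
        for v in values:
            # Clamp values outside the reference range into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm below is defined.
        return [max(c / len(values), eps) for c in counts]

    ref_shares, live_shares = shares(reference), shares(live)
    return sum((l - r) * math.log(l / r) for r, l in zip(ref_shares, live_shares))


# Hypothetical reference inputs and two live samples.
reference = [i / 100 for i in range(100)]
unchanged = list(reference)
shifted = [v + 0.5 for v in reference]

print(f"unchanged: {population_stability_index(reference, unchanged):.3f}")
print(f"shifted:   {population_stability_index(reference, shifted):.3f}")
```

A commonly cited rule of thumb treats a PSI above roughly 0.25 as a significant shift, though any cutoff should be set with the specific model and its risk level in mind. The same calculation can be applied to model outputs as well as inputs.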

Continuous learning

Establish continuous learning workflows to monitor model performance against predefined acceptable thresholds. These thresholds might include measures of how resilient the system accuracy remains in the face of small perturbations, as well as safety constraints for the system and the environment in which it is operating. As a part of this, maintain a data version control framework to enable auditability, transparency, and reproducibility of the AI model.
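A threshold check of this kind might look like the following sketch, which measures accuracy on clean inputs and again on slightly perturbed copies, then flags the model for review if either predefined limit is breached. The thresholds, the noise scale, and the toy model are all assumptions.

```python
import random


def accuracy(model, inputs, labels):
    """Share of inputs for which the model's output matches the label."""
    return sum(model(x) == y for x, y in zip(inputs, labels)) / len(labels)


def perturbed(inputs, scale, rng):
    """Add small uniform noise in [-scale, scale] to every feature."""
    return [[v + rng.uniform(-scale, scale) for v in x] for x in inputs]


def check_thresholds(model, inputs, labels, min_accuracy=0.9, max_drop=0.05,
                     noise_scale=0.1, seed=0):
    """Compare clean and perturbed accuracy against assumed acceptance thresholds."""
    rng = random.Random(seed)
    clean = accuracy(model, inputs, labels)
    noisy = accuracy(model, perturbed(inputs, noise_scale, rng), labels)
    return {
        "clean_accuracy": clean,
        "perturbed_accuracy": noisy,
        "needs_review": clean < min_accuracy or (clean - noisy) > max_drop,
    }


def toy_model(x):
    """Hypothetical classifier: 1 if the two features sum to more than 1.0."""
    return int(x[0] + x[1] > 1.0)


# Hypothetical labeled examples that sit well away from the decision boundary.
inputs = [[0.0, 0.0], [2.0, 2.0], [0.1, 0.2], [1.5, 1.5]]
labels = [0, 1, 0, 1]
print(check_thresholds(toy_model, inputs, labels))
```

Running a check like this on a schedule and storing each result alongside the data version it was computed on would support the auditability and reproducibility goals mentioned above.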

Ongoing testing

Develop a testing regime that includes variability (e.g., changes in the system or training data) to evaluate whether the AI is robust enough to function as intended. The frequency at which models are checked for robustness and accuracy should depend on the priority of the model and how often the model is updated. High-risk, regularly updated models might best be checked daily (with a human verifying outputs). Slower-changing, low-priority models could be checked on a longer timeline, in some cases using an API for automatic assessments of functionality. The results of these checks should prompt investigation and resolution of any exceptions, discrepancies, and unintended outcomes.

Explore alternative approaches

Given that robustness and generalizability are areas of active research, new tools, designs, and tactics will continue to emerge and advance the field. These are likely to be technical approaches, and the organization’s data science professionals are positioned to explore how new ideas can support deployed AI, as well as model development. For example, “Lipschitz constrained models” have bounded derivatives that can help neural networks become more robust against adversarial examples. Most simply, they promote and can certify that small perturbations in input lead only to small changes in output.
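The idea can be illustrated with the simplest possible case: a single linear function with a scalar output. For such a function, the change in output is never larger than the length of the weight vector multiplied by the size of the change in input, so capping the weight vector's length caps the function's sensitivity. The sketch below is a toy example under that assumption, with hypothetical function names. Real Lipschitz-constrained networks apply comparable constraints across every layer.

```python
import math


def l2_norm(vector):
    """Euclidean length of a vector."""
    return math.sqrt(sum(v * v for v in vector))


def constrain_weights(weights, max_lipschitz=1.0):
    """Rescale weights so f(x) = w . x has a Lipschitz constant of at most max_lipschitz."""
    norm = l2_norm(weights)
    if norm <= max_lipschitz:
        return list(weights)
    return [w * max_lipschitz / norm for w in weights]


def linear_model(weights, x):
    """Scalar output of a linear function with no bias term."""
    return sum(w * v for w, v in zip(weights, x))


# Hypothetical weights whose original length is 5.0, rescaled to length 1.0.
weights = constrain_weights([3.0, 4.0], max_lipschitz=1.0)
x = [1.0, 2.0]
x_perturbed = [1.01, 1.98]

input_change = l2_norm([a - b for a, b in zip(x, x_perturbed)])
output_change = abs(linear_model(weights, x) - linear_model(weights, x_perturbed))
print(f"input change={input_change:.4f}, output change={output_change:.4f}")
```

With the constraint in place, the output change cannot exceed the input change, which is the certifiable guarantee that a small perturbation produces only a small effect.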