2022 Oct 20;13(1):6039.
doi: 10.1038/s41467-022-33128-9.

Technology readiness levels for machine learning systems


Alexander Lavin et al. Nat Commun.

Abstract

The development and deployment of machine learning systems can be executed easily with modern tools, but the process is typically rushed and treated as a means to an end. Lack of diligence can lead to technical debt, scope creep, misaligned objectives, model misuse and failures, and expensive consequences. Engineering systems, on the other hand, follow well-defined processes and testing standards to streamline development for high-quality, reliable results; the extreme is spacecraft systems, with mission-critical measures and robustness throughout the process. Drawing on experience in both spacecraft engineering and machine learning (from research through product, across domain areas), we have developed a proven systems engineering approach for machine learning and artificial intelligence: the Machine Learning Technology Readiness Levels (MLTRL) framework defines a principled process to ensure robust, reliable, and responsible systems while being streamlined for machine learning workflows, including key distinctions from traditional software engineering. It also provides a lingua franca for people across teams and organizations to work collaboratively on machine learning and artificial intelligence technologies. Here we describe the framework and elucidate it with use-cases ranging from physics research to computer vision apps to medical diagnostics.


Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Fig. 1. MLTRL spans research (red) through prototyping (orange), productization (yellow), and deployment (green).
Most ML workflows prescribe an isolated, linear process of data processing, training, testing, and serving a model. Those workflows fail to define how ML development must iterate over that basic process to become more mature and robust, and how to integrate with a much larger system of software, hardware, data, and people. MLTRL also continues beyond deployment: monitoring and feedback cycles are important for continuous reliability and improvement over the product lifetime.
Fig. 2
Fig. 2. The maturity of each ML technology is tracked via TRL Cards, which we describe in the “Methods” section.
Here is an example reflecting a neuropathology machine vision use-case, detailed in the “Discussion” section. Note this is a subset of a full TRL Card, which in reality lives as a full document in an internal wiki. Notice the card clearly communicates the data sources, versions, and assumptions. This helps mitigate invalid assumptions about performance and generalizability when moving from R&D to production and promotes the use of real-world data earlier in the project lifecycle. We recommend documenting datasets thoroughly with semantic versioning and tools such as datasheets for datasets, and following data accountability best practices as they evolve (see ref. 81).
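The key fields a TRL Card communicates (data sources, versions, assumptions, current level) can be captured as structured data. The following is a minimal, hypothetical sketch of such a record; the field names and values are illustrative assumptions, not part of the MLTRL specification, which treats a TRL Card as a full wiki document.

```python
from dataclasses import dataclass, field

@dataclass
class TRLCard:
    """Illustrative subset of a TRL Card; field names are hypothetical."""
    project: str
    current_level: int
    data_sources: list      # datasets used, with semantic versions
    assumptions: list       # stated assumptions about data and performance
    dataset_version: str    # semantic version of the training dataset
    reviewers: list = field(default_factory=list)

# Example loosely modeled on the neuropathology machine vision use-case.
card = TRLCard(
    project="neuropathology-vision",
    current_level=4,
    data_sources=["annotated-slide-images v1.2.0"],
    assumptions=["staining protocol is consistent across source labs"],
    dataset_version="1.2.0",
)
print(card.current_level)  # 4
```

Recording the dataset version explicitly on the card is what supports the semantic-versioning practice recommended above: a reviewer can see at a glance which data a reported result depends on.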
Fig. 3
Fig. 3. In the left diagram we show a discovery switchback (dashed) from 3 to 2, and an embedded switchback (solid) from 4 to 2.
The difference is that the former is circumstantial while the latter is predefined in the process. The other embedded switchback we define in the main MLTRL process is from Level 9 to 4, shown in Fig. 4. While it is true that the majority of ML projects start at a reasonable readiness out of the box, e.g. Level 4, this can make it challenging and problematic to switch back to R&D levels that the team has not encountered and may not be equipped for. In the right diagram we show a common review switchback from Level 5 to 4 (staying in the prototyping phase, orange), and a switchback (faded) that should not be implemented because the prior level was not explicitly done; Level 2 is squarely in the research pipeline (red).
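The validity rule described here, that a switchback should only target a lower level the team has actually worked through, can be sketched as a simple check. This is an illustrative encoding under our own assumptions; the function and its rule are not defined by MLTRL itself.

```python
# Hypothetical sketch of the switchback validity rule from Figs. 3-4.
def can_switch_back(current_level, target_level, levels_completed):
    """A switchback must move to a lower readiness level that the
    team has explicitly worked through; cycling back to an unvisited
    R&D level (the faded arrow in Fig. 3) is flagged as problematic."""
    if target_level >= current_level:
        return False  # switchbacks only move to lower levels
    return target_level in levels_completed

# Review switchback from Level 5 to 4 (team completed Levels 4-5): allowed.
print(can_switch_back(5, 4, {4, 5}))  # True

# Project started out of the box at Level 4; switching back to research
# Level 2, which was never explicitly done: flagged as problematic.
print(can_switch_back(4, 2, {4}))     # False
```

Embedded switchbacks (e.g. Level 9 to 4) are predefined in the process, so in practice they would be whitelisted in addition to this completed-levels check.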
Fig. 4
Fig. 4. Most ML/AI projects live at Levels 3–9 of MLTRL and are not concerned with fundamental R&D; that is, they use entirely existing methods and implementations, and even pretrained models.
In the left diagram (a subset of the Fig. 1 pipeline, same colors), the arrows show a common development pattern with MLTRL in industry: projects go back to the ML toolbox to develop new features (dashed line), and frequent, incremental improvements often involve jumping back a couple of levels to Level 7 (the main systems integration stage). At Levels 7 and 8 we stress the need for tests that run use-case-specific critical scenarios and data-slices, which are highlighted by a proper risk-quantification matrix. Reviews at these levels commonly catch gaps or oversights in the test and validation scenarios, resulting in frequent cycles back to Level 7 from 8. Cycling back to lower levels is not just a late-stage mechanism in MLTRL; rather, "switchbacks" occur throughout the process. Cycling back to Level 7 from 8 for more tests is an example of a review switchback, while the solid line from Level 9 to 7 is an embedded switchback, where MLTRL defines certain conditions that require cycling back levels (see the "Methods" section and throughout the text). In the right diagram, we show the more common approach in industry (without using our framework), which skips essential technology transition stages (gray): ML engineers push straight through to deployment, ignoring important productization and systems integration factors. This will be discussed in more detail in the "Methods" section.
Fig. 5
Fig. 5. Computer vision pipeline for an automated recycling application (a), which contains multiple ML models, user input, and image data from various sources.
Complicated logic such as this can mask ML model performance lags and failures, and also emphasizes the need for the R&D-to-product handoff described in MLTRL. Additional emphasis is placed on ML tests that consider the mix of real-world data with user annotations (b, right) and synthetic data generated by Unity AI's Perception tool and structured domain randomization (b, left).
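The data-slice tests stressed at Levels 7–8 can be sketched as a check that evaluates the model separately on each critical slice (e.g. real user-annotated images versus synthetic Perception images) rather than on aggregate accuracy alone. The slice names, toy predictions, and threshold below are illustrative assumptions, not values from the recycling application.

```python
# Hypothetical data-slice check: aggregate metrics can mask a lagging
# slice, so each critical slice is held to the threshold independently.
def accuracy(preds, labels):
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)

def check_slices(results, threshold=0.9):
    """results: {slice_name: (preds, labels)}; True means the slice passes."""
    return {name: accuracy(p, y) >= threshold
            for name, (p, y) in results.items()}

results = {
    "real_user_annotated": ([1, 1, 0, 1], [1, 1, 0, 1]),   # 4/4 correct
    "synthetic_perception": ([1, 0, 0, 1], [1, 1, 0, 1]),  # 3/4 correct
}
print(check_slices(results))
# {'real_user_annotated': True, 'synthetic_perception': False}
```

Here the aggregate accuracy (7/8) would pass a 0.85 bar, but the per-slice check surfaces the lagging synthetic slice, which is exactly the kind of gap a Level 7–8 review is meant to catch.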

References

    1. Henderson, P. et al. Deep reinforcement learning that matters. In Proc. AAAI Conference on Artificial Intelligence (2018).
    2. de la Tour, A., Portincaso, M., Blank, K. & Goeldel, N. The Dawn of the Deep Tech Ecosystem. Technical Report (The Boston Consulting Group, 2019).
    3. NASA. The NASA Systems Engineering Handbook (NASA, 2003).
    4. United States Department of Defense. Defense Acquisition Guidebook (U.S. Department of Defense, 2004).
    5. Leslie, D. Understanding artificial intelligence ethics and safety: A guide for the responsible design and implementation of AI systems in the public sector. The Alan Turing Institute. 10.5281/zenodo.3240529 (2019).