The difference between training an AI and using one in real conditions can be surprising. Discover the notion of “domain of use”.

Validating an AI internally and using it in a real environment can be two very different things. AI is often confronted with the gap between common assumptions and reality1. The performance of an AI algorithm on a test set can be misleading and lead to surprises, or even disappointment, once the algorithm is put into real conditions. To better anticipate how performance will behave in real conditions, it is important to first clarify what an AI “domain of use” is.

The lifecycle of an AI algorithm

To illustrate the problem, we can follow the lifecycle of an AI algorithm (and more specifically a machine learning algorithm).

It begins with the conceptualization of the project and the analysis of a need. Then comes the identification, collection and preparation of data in order to build a data set. The next step is the choice, construction and training of a test algorithm to verify the feasibility and expected performance of the project on a small scale. If that step is validated, it is time to develop and train the final algorithm.

Finally, the algorithm and the results it produces must be evaluated2. It is this final step that often reveals that using AI in real conditions is harder than anticipated, usually because of the differences between the modeling done for training and the reality of use. These differences in fact originate in the construction of the training set. The notion of an AI “domain of use” plays a fundamental role during the training set building phase, and we will describe what it consists of.

The concept of domain of use in AI

The concept of domain of use describes the set of situations in which an AI algorithm is supposed to be used. The performance of the algorithm must be in line with its purpose and context of use. But defining a domain of use on which a system engineer can build requirements presupposes delimiting the boundaries of that domain.

In some cases those domains are wide open and can vary greatly. So what are the acceptable limits? For a driverless car, how do we define the domain of “being capable of driving on a French road”? To model a domain, one relies on “parameters” that describe it and put bounds on it. Those parameters can then be instantiated to form a particular setting, and the set of these settings helps define the data set on which the AI will be trained and evaluated. For example, the parameter “rainy day” can be expressed by a setting describing “water droplets on the camera lens”: the surface covered and the number and size of the droplets can be adjusted by different settings. It can also be described by a lower range of image brightness due to cloud cover.
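To make the idea concrete, here is a minimal sketch, in Python, of how a “rainy day” parameter could be expanded into concrete settings; the parameter names and value ranges (droplet count, lens coverage, brightness) are invented for illustration, not taken from any real system.

```python
from dataclasses import dataclass
import random

@dataclass
class RainSetting:
    """One concrete instantiation of the 'rainy day' parameter (illustrative only)."""
    droplet_count: int        # number of water droplets on the camera lens
    droplet_radius_px: float  # average droplet size in pixels
    lens_coverage: float      # fraction of the lens surface covered (0..1)
    brightness: float         # relative image brightness, lowered by cloud cover (0..1)

def sample_rain_setting(rng: random.Random) -> RainSetting:
    """Draw one plausible setting from assumed bounds on each parameter."""
    return RainSetting(
        droplet_count=rng.randint(0, 200),
        droplet_radius_px=rng.uniform(1.0, 15.0),
        lens_coverage=rng.uniform(0.0, 0.4),
        brightness=rng.uniform(0.3, 0.8),
    )

rng = random.Random(42)
for setting in (sample_rain_setting(rng) for _ in range(5)):
    print(setting)  # each setting describes one concrete situation to cover in the data set
```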

The aim is for the algorithm to be trained on a set containing enough images with representative settings. However, the parameters are numerous (sometimes uncountable) and can be non-numerical. For instance, a chatbot may have to interact correctly with users who are furious, frustrated, unhappy, ironic… Under these conditions, how do we define a context of use that allows AI designers to meet a requirement covering users ranging from “happy” to “furious”?
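One possible way to handle such a non-numerical parameter, sketched below under the assumption that dialogues have been labelled with a mood (the labels and scale are hypothetical), is to enumerate the moods along a rough “happy to furious” axis and measure how well the training corpus covers each level.

```python
from collections import Counter

# Assumed ordering of user moods; a real project would define its own scale.
MOOD_SCALE = ["happy", "neutral", "frustrated", "unhappy", "ironic", "furious"]

def mood_coverage(labelled_dialogues: list[tuple[str, str]]) -> dict[str, float]:
    """Fraction of dialogues per mood level, to spot under-represented moods."""
    counts = Counter(mood for _, mood in labelled_dialogues)
    total = max(len(labelled_dialogues), 1)
    return {mood: counts.get(mood, 0) / total for mood in MOOD_SCALE}

corpus = [("where is my parcel??", "furious"), ("thanks a lot!", "happy"),
          ("great, broken again", "ironic"), ("ok", "neutral")]
print(mood_coverage(corpus))  # reveals which moods the training set under-represents
```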

To achieve this, the first step is to collect existing data and to identify parameters that the AI designer can use to build the training set. Self-driving car algorithms based on image recognition, for example, need images of cars and obstacles in various environments, described through parameters such as road width, weather conditions, road luminosity, temperature, etc. In our example, the possible parameters for describing the concept of “driving on a road in France” are very numerous and can be complex. AI designers therefore have to choose the parameters they find most relevant and appropriate for describing a specific concept. However, describing a specific environment can lead to an exponential number of possible settings: the more parameters used in the training set, the more accurately the environment is represented, but the harder the set becomes to build. One must keep in mind that a training set is fundamentally finite but strives to be as representative as possible.
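A back-of-the-envelope sketch makes this combinatorial growth concrete; the handful of coarse values per parameter below are invented for illustration.

```python
from itertools import product

# Illustrative, deliberately coarse values for four of the parameters mentioned above.
road_width_m  = [3.0, 3.5, 4.0, 6.0]
weather       = ["clear", "rain", "fog", "snow"]
luminosity    = ["night", "dawn", "overcast", "full sun"]
temperature_c = [-10, 0, 15, 35]

combinations = list(product(road_width_m, weather, luminosity, temperature_c))
print(len(combinations))  # 4 * 4 * 4 * 4 = 256 settings for only four coarse parameters

# Every extra parameter multiplies the count again; with realistic, finer-grained
# values the number of settings quickly exceeds what any data set can cover.
```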

The importance of AI validation

One main issue concerning the domain of use is underspecification. It is difficult to detect because an AI can appear good enough on a test set that itself does not sufficiently cover the real conditions. When machine learning pipelines are underspecified and validation is too weak, the resulting AIs may not perform well enough under the real conditions of the domain of use. Several examples illustrate this problem: self-driving vehicles in English cities, Google’s medical AI for eye analyses, and more.

To begin with, there can be quality differences in the data due to differences in the acquisition system (for example, a different camera), which leads to distortions in the data processing: angles can differ, and so can image resolution. That was the problem with Google’s eye-analysis AI once it was used in real conditions. The root cause was that nurses did not have enough time to produce the very high quality images the system expected. Since the algorithm had been trained on high quality images, it included a function that refused to analyze low quality ones. This led to a huge loss of time for the nurses, because exams had to be redone and because producing high quality images took longer than expected3.
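The failure mode can be pictured with a minimal sketch; the sharpness scores and the threshold rule below are assumptions for illustration, not Google’s actual criteria. A quality gate calibrated only on clinic-grade images ends up rejecting most images captured in the field.

```python
import statistics

# Hypothetical sharpness scores (higher = sharper) for two acquisition contexts.
lab_images_sharpness   = [0.92, 0.88, 0.95, 0.90, 0.93]   # images used for training/validation
field_images_sharpness = [0.55, 0.61, 0.48, 0.70, 0.52]   # images taken under time pressure

# Quality gate calibrated on lab data only: reject anything 20% below the lab mean.
threshold = 0.8 * statistics.mean(lab_images_sharpness)

rejected = [s for s in field_images_sharpness if s < threshold]
print(f"threshold={threshold:.2f}, "
      f"rejected {len(rejected)}/{len(field_images_sharpness)} field images")
# Most real-world images are refused and exams must be redone: the gate was
# specified for the lab domain, not for the actual domain of use.
```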

Parameters can also be hard to anticipate: leaves falling on the road were interpreted as moving objects by self-driving cars tested in England4. It is not easy to know how an algorithm will behave when the data differs from what it is used to, because an AI is trained on past and present data and struggles to adapt to something new. Even though generalization is the whole point of AI algorithms, it becomes difficult when conditions vary too greatly, as presented in a previous article.

A study by Google researchers pointed out that even when algorithms receive the same training, their performance can differ greatly and cannot be anticipated. Out of a pool of 50 models differing only in their random starting values, some performed better on blurry images, others on pixelated or contrast-altered ones, and some had better overall results than others5. This, too, was due to underspecification.
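The finding can be illustrated with a toy experiment; this is a sketch using scikit-learn on synthetic data, not the study’s actual setup, and Gaussian noise stands in for the blur and contrast perturbations.

```python
# Train several models that differ only in their random seed, then compare them
# on an in-distribution test set and on a perturbed "stress test".
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
X_stress = X_test + rng.normal(scale=1.5, size=X_test.shape)  # perturbed test set

for seed in range(5):  # the study used 50 models; 5 keeps the sketch fast
    model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=seed)
    model.fit(X_train, y_train)
    print(f"seed={seed}: test acc={model.score(X_test, y_test):.3f}, "
          f"stress acc={model.score(X_stress, y_test):.3f}")
# Test accuracies tend to be close, while stress accuracies spread out:
# the training pipeline underdetermines behavior outside the test distribution.
```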

Conclusion

In a nutshell, the domain of use of an open-world application is delicate to express precisely.
Algorithms can have trouble adapting to real conditions that vary too much from the domain of use implicitly assumed by the AI design. The main issue is the underspecification of algorithms.

The parameters of the domain (when defined) and their variation have to be sufficiently representative of the domain of use. Underspecification can lead to project failure. This is why, from the starting point to the validation phase, it is crucial to ensure that the domain of use is properly handled. Once that is done, robustness assessment and explainability of the decisions can improve the chances of mitigating failures in the real world6.

Explainability helps point out which parameters the algorithm actually relies on, and so makes it possible to restrict the parameters to the ones that really matter. Validating an AI and its robustness in the best possible way allows it to withstand the variability of the domain of use and to perform as well as possible on the tasks to be carried out.
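As a sketch of what this can look like in practice: permutation importance is one common explainability tool for ranking parameters by how much the model relies on them. The feature names below are hypothetical, and the synthetic data is only there to make the example self-contained.

```python
# Rank hypothetical domain parameters by how much a trained model relies on them.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

feature_names = ["road_width", "rain_intensity", "luminosity", "temperature"]
X, y = make_regression(n_samples=500, n_features=4, n_informative=2, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

for name, importance in sorted(zip(feature_names, result.importances_mean),
                               key=lambda pair: -pair[1]):
    print(f"{name}: {importance:.3f}")
# Parameters with near-zero importance are candidates to drop or monitor less,
# focusing the domain-of-use description on what the model really uses.
```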

1. Artificial Intelligence and Machine Learning – Hype vs Reality
2. How to build a machine learning model in 7 steps | TechTarget
3. Google’s medical AI was super accurate in a lab. Real life was a different story. | MIT Technology Review
4. Inside the UK government’s weird and wacky self-driving car trials | WIRED UK
5. The way we train AI is fundamentally flawed | MIT Technology Review
6. Robustness and Explainability of Artificial Intelligence