Pneumonia is an infection affecting part of the lungs. This disease is the leading cause of death in children under 5 years of age according to the World Health Organization (WHO). In fact, approximately 1.4 million children die of pneumonia each year.
To detect this disease, doctors perform a physical examination on patients with symptoms. They then perform a chest x-ray and make a diagnosis. However, in third-world countries, the tools needed to diagnose this disease are scarce and diagnoses are therefore often inaccurate.
The objective of this case study is therefore to help doctors reach a faster and more accurate diagnosis. Today, data collection is standardized, but diagnostic applications remain rare in the medical field. The aim is to develop an automated system based on artificial intelligence, in which Saimple, a tool developed by Numalis, lets health professionals understand what drove the model's clinical decision.
This case study uses a dataset published on Kaggle:
https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia
This dataset contains about 6,000 chest radiographs of children under 5 years old. Each image is labeled with one of two classes: "NORMAL" or "PNEUMONIA".
Note that on a radiograph, lucent (clear) areas appear black and opaque areas appear white.
Pneumonia is detected on a chest x-ray when an abnormal accumulation of fluid in the lungs is visible. To identify it, look for areas of opacity, specifically in the lung parenchyma (the functional tissue of the lung). For example, in the x-rays below, the one on the right shows fluid accumulated in one of the lungs, while the one on the left does not.
Before starting the analysis, it is important to visualize the proportion of images in each class, for each subset of the data.
The above graphs show that the "PNEUMONIA" class is over-represented in the training and test sets. This seems consistent: a patient who is sent for a chest X-ray usually already shows symptoms, which raises the probability that they are sick. However, training the model well requires a balanced training set, i.e., the same number of images in each class.
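The class proportions discussed above can be tallied with a few lines of Python. This is a minimal sketch, not the code used in the case study: it assumes each subset is a folder containing one subfolder per class (for example `train/NORMAL` and `train/PNEUMONIA`), and the function names `count_images` and `images_needed_to_balance` are hypothetical.

```python
from collections import Counter
from pathlib import Path

# Assumption: images are stored as JPEG or PNG files.
IMAGE_SUFFIXES = {".jpeg", ".jpg", ".png"}


def count_images(split_dir):
    """Count the image files in each class subfolder of one subset."""
    counts = Counter()
    for class_dir in sorted(Path(split_dir).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1
                for f in class_dir.iterdir()
                if f.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts


def images_needed_to_balance(counts):
    """Number of extra images each class needs to match the largest class."""
    target = max(counts.values(), default=0)
    return {name: target - n for name, n in counts.items()}
```

Calling `images_needed_to_balance(count_images("train"))`, where `"train"` stands for wherever the training subset is stored, gives the number of images that would have to be generated for the underrepresented class.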
To balance the dataset, we use data augmentation, an effective technique for increasing the number of images in the underrepresented class: several images from that class are selected and transformed to create new images.
The transformations can include:
- Rotating the image randomly,
- Resizing vertically or horizontally,
- Cropping the image,
- Zooming,
- Filling in pixels (i.e. filling in the missing areas of the resized image).
However, in reality, the images must meet medical quality criteria:
- Symmetry,
- Penetration,
- Centering,
- Deep inspiration and clearance of the scapulae.
Because a rotated radiograph would no longer meet these criteria, image rotation is excluded from the data augmentation.
The four X-rays below are an example of data augmentation.
From a single image, four new ones were generated by cropping and then filling in the pixels of the missing areas.
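One possible reading of this crop-and-fill step is sketched below with NumPy. It is an illustration under stated assumptions, not the augmentation pipeline actually used: the function name `crop_and_fill`, the `max_fraction` parameter, and the choice to fill missing areas by repeating the nearest edge pixels are all hypothetical. The image is assumed to be a 2-D grayscale array, and no rotation is applied.

```python
import numpy as np


def crop_and_fill(image, max_fraction=0.1, rng=None):
    """Crop a random border off a 2-D image, then pad it back to its
    original size by repeating the nearest remaining edge pixels."""
    if not 0 <= max_fraction < 0.5:
        raise ValueError("max_fraction must be in [0, 0.5)")
    rng = np.random.default_rng() if rng is None else rng
    height, width = image.shape
    max_dy = int(height * max_fraction)
    max_dx = int(width * max_fraction)
    # Random number of rows/columns to remove on each side.
    top, bottom = rng.integers(0, max_dy + 1, size=2)
    left, right = rng.integers(0, max_dx + 1, size=2)
    cropped = image[top:height - bottom, left:width - right]
    # Fill the removed border so the output keeps the input shape.
    return np.pad(cropped, ((top, bottom), (left, right)), mode="edge")


def augment(image, n_variants=4, seed=None):
    """Generate several crop-and-fill variants of a single image."""
    rng = np.random.default_rng(seed)
    return [crop_and_fill(image, rng=rng) for _ in range(n_variants)]
```

With `n_variants=4`, `augment` produces four new images from one radiograph, each with the same shape as the original.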
For this case study, the prediction model is a convolutional neural network, a type of neural network designed to process image data. The model outputs the probability that the image belongs to a class ("NORMAL" or "PNEUMONIA").
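To make the link between the output probability and the predicted class concrete, here is a small sketch. It assumes that the model returns a single value interpreted as the probability of "PNEUMONIA" and that a threshold of 0.5 separates the two classes; neither detail is stated in the case study, and `predict_label` is a hypothetical helper.

```python
CLASS_NAMES = ("NORMAL", "PNEUMONIA")


def predict_label(p_pneumonia, threshold=0.5):
    """Map the model's output probability to a class name.

    Returns the predicted class and the probability assigned to it.
    Assumption: p_pneumonia is the probability of the PNEUMONIA class.
    """
    if not 0.0 <= p_pneumonia <= 1.0:
        raise ValueError("p_pneumonia must be a probability in [0, 1]")
    if p_pneumonia >= threshold:
        return CLASS_NAMES[1], p_pneumonia
    return CLASS_NAMES[0], 1.0 - p_pneumonia
```

In a clinical setting the threshold could be lowered to reduce missed pneumonia cases, at the cost of more false alarms.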
Both radiographs were correctly classified: the radiograph of the healthy patient was predicted as "normal" and the radiograph of the patient with pneumonia was predicted as "pneumonia". A question then arises: which features of the image allowed the model to predict correctly?
Saimple is a neural network analysis tool that automatically measures and extracts robustness and explainability elements from models. In this use case, Saimple identifies the areas of the image that mattered most to the model's prediction, making that prediction more understandable for health professionals.
Before the Saimple tool can be used, the input images of the model must be resized. As can be seen below, the radiograph has been resized to standardize the image size, and this transformation can be easily visualized.
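The resizing step can be illustrated with a nearest-neighbour sketch in NumPy. This is only an assumption about how such a step might look: the case study does not state the target size or the interpolation method, and `resize_nearest` is a hypothetical helper. An image library with proper interpolation would be a natural choice in practice.

```python
import numpy as np


def resize_nearest(image, new_height, new_width):
    """Resize an image array (H x W or H x W x C) to a fixed size by
    picking, for each output pixel, the nearest source pixel."""
    height, width = image.shape[:2]
    # Source row/column index for every output row/column.
    rows = np.arange(new_height) * height // new_height
    cols = np.arange(new_width) * width // new_width
    return image[rows[:, None], cols]
```

Applying the same target size to every radiograph, for example `resize_nearest(xray, 224, 224)` with an illustrative 224 x 224 size, gives the model inputs of identical shape.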