DNA decrypted thanks to AI

DNA sequencing is now a routine technique in biology laboratories. However, DNA still holds many secrets that AI could help uncover, particularly in medicine.

A particularly complex genetic code

DNA contains all the genetic information of a living being, from hair, eye and skin color to predisposition to certain diseases. These hereditary traits are passed from one generation to the next through genes, of which humans have roughly 20,000 to 25,000.

DNA can be modeled as a sentence in which each nucleotide is assigned a letter (A, C, G, T), so that it can be analyzed with natural language processing (NLP)¹. However, unlike an ordinary text, the human genome is made up of more than six billion letters². In this long text, genes are the interesting “words”, because they are the ones that encode effects on individuals.
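To make the sentence analogy concrete, here is a minimal sketch of one common way to turn a DNA string into "words": cutting it into overlapping fragments of k letters (k-mers) that an NLP-style model could then count or embed. The function names, the choice of k = 3 and the short example sequence are illustrative assumptions, not taken from the work cited above.

```python
from collections import Counter


def kmer_tokens(sequence: str, k: int = 3) -> list[str]:
    """Split a DNA string into overlapping k-letter 'words' (k-mers)."""
    sequence = sequence.upper()
    if any(base not in "ACGT" for base in sequence):
        raise ValueError("sequence must contain only A, C, G, T")
    # Slide a window of width k along the sequence, one letter at a time.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmer_counts(sequence: str, k: int = 3) -> Counter:
    """Count how often each k-mer occurs, like a word-frequency table."""
    return Counter(kmer_tokens(sequence, k))


if __name__ == "__main__":
    example = "ATGCGATGA"  # made-up toy sequence, not real genomic data
    print(kmer_tokens(example))
    print(kmer_counts(example).most_common(1))
```

At the scale of a real genome, the same idea would have to be applied in a streaming fashion rather than on a single in-memory string.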

However, recent research points to an unexpected role for non-coding regions of the genome in the emergence of hereditary genetic diseases. Non-coding regions make up about 98% of the genome’s sequence and are not translated into proteins. For that reason, geneticists long assumed, wrongly, that these regions carried no useful information. It now appears that they could be the main cause of altered gene expression, which can lead to diseases such as cancer³.

The impact of these regions on the genome is still unclear, and studying them is difficult for several reasons. Non-coding regions vary widely in characteristics such as their size and their location in the genome. In some cases, their action remains local and affects only neighboring genes. In other cases, they can target very distant genes⁴.

It is therefore difficult to examine the entire genome and pick out the parts that genuinely influence the development of genetic diseases. The analysis has to be automated with algorithms. AI is useful here for two reasons: it can process the vast amount of data that DNA represents, and it can detect patterns in non-coding regions, sometimes imperceptible to humans, that may eventually cause genetic diseases.

Applications and limits of AI

For Pierre Tambourin, former director of the Genopole, AI offers significant possibilities for analyzing the relationship between heredity and the appearance of genetic diseases. These contributions could help several scientific fields, including paleopathology. This field examines, among other things, degenerative conditions caused by bacteria that were observed in ancient populations and are still found in certain individuals today.

For example, a team of researchers from the University of Paris-Saclay uses a neural network to trace the “demographic history” of bacterial populations. To do this, the neural network relies on a sample of the genetic differences that exist between members of a population⁵.

If we find certain similarities between two populations, we can infer that bacteria may have spread from one to the other, for instance through migrations or colonizations. In this way, we could potentially learn more about the genetics of human populations from the history of human migrations and successive rounds of natural selection.
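As a rough illustration of what "genetic differences between members of a population" can look like as data, the sketch below counts the positions at which aligned sequences differ, then averages those counts within one population and between two populations. This is a deliberately simple summary statistic, not the neural network approach of the Paris-Saclay team; the function names and the short sequences are made up for the example.

```python
from itertools import combinations, product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def mean_within(population: list[str]) -> float:
    """Average pairwise difference among members of one population."""
    pairs = list(combinations(population, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)


def mean_between(pop_a: list[str], pop_b: list[str]) -> float:
    """Average difference between members of two different populations."""
    pairs = list(product(pop_a, pop_b))
    if not pairs:
        raise ValueError("both populations must be non-empty")
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical toy samples, not real bacterial genomes.
    pop_a = ["ACGTAC", "ACGTAT", "ACGTAC"]
    pop_b = ["ACGAAC", "ACGAAT"]
    print(mean_within(pop_a))
    print(mean_between(pop_a, pop_b))
```

A between-population average close to the within-population averages would suggest that the two samples are genetically similar. A demographic model, such as the neural network mentioned above, would be needed to turn that kind of signal into a history of spread.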

By studying the evolution of genetic diseases throughout history, it would be possible to identify early warning signs. Algorithms would make it possible to predict the predisposition of certain people to develop these diseases. The challenge of applying AI to DNA analysis is to perfect and automate these methods of predicting genetic diseases. However, as technologies develop, new challenges also appear.

Indeed, genome sequencing can run into problems, for example when the sample is of poor quality because of its age, or when sequencing errors occur. AI algorithms will therefore have to cope with heterogeneous samples containing genomes of varying quality.

Finally, the occurrence of genetic diseases is not solely related to inherited genetic traits. For example, an individual’s phenotypic traits are the result of a complex interaction between inherited genetic factors and exposure to a particular environment. Both these traits and individuals’ behavior contribute significantly to the risk of disease⁶.

Predicting an individual’s pathology from hereditary genetic risks alone is therefore not sufficient. It will be necessary to specify which other factors the AI must take into account in its analysis in order to improve its predictions.

Conclusion

The use of AI is the next step in the advancement of genome analysis methods. It is a promising tool for studying the evolution of our DNA and the hereditary character of certain genetic diseases. Eventually, AI could make it possible to study the predisposition of populations to certain genetic diseases.

However, this method still needs to be perfected, especially to be able to take into account both genetic and non-genetic criteria in its analysis.