Semantically Rich Local Dataset Generation for Explainable AI in Genomics

Barbosa, Pedro; Savisaar, Rosina; Fonseca, Alcides

doi:10.1145/3638529.3653990

arXiv:2407.02984 (cs)

[Submitted on 3 Jul 2024 (v1), last revised 17 Jul 2024 (this version, v3)]

Abstract: Black-box deep learning models trained on genomic sequences excel at predicting the outcomes of different gene regulatory mechanisms. Interpreting these models may therefore provide novel insights into the underlying biology, supporting downstream biomedical applications. Because these models are so complex, interpretable surrogate models can only be built for local explanations (e.g., of a single instance). However, building such a surrogate requires generating a dataset in the neighborhood of the input, which must maintain syntactic similarity to the original data while introducing semantic variability in the model's predictions. This task is challenging due to the complex sequence-to-function relationship of DNA.
We propose using Genetic Programming to generate these datasets by evolving sequence perturbations that contribute to the datasets' semantic diversity. Our custom, domain-guided individual representation effectively constrains syntactic similarity, and we provide two alternative fitness functions that promote diversity at no additional computational cost. Applied to the RNA splicing domain, our approach quickly achieves good diversity and significantly outperforms a random baseline in exploring the search space, as shown on a short proof-of-concept RNA sequence. Furthermore, we assess its generalizability and demonstrate scalability to larger sequences, resulting in a ~30% improvement over the baseline.
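
To make the idea of a local dataset concrete, the following is a minimal sketch and not the authors' implementation. It swaps in a plain evolutionary loop for Genetic Programming, and every name in it is hypothetical: `toy_model` stands in for a black-box predictor, an individual is a short list of `(position, base)` substitutions, and the rarity-based fitness is an assumption of this sketch rather than either of the paper's two fitness functions. Capping the number of substitutions keeps each variant syntactically close to the input, while the fitness favors variants whose predictions land in sparsely populated score bins.

```python
import random

ALPHABET = "ACGT"

def toy_model(seq):
    """Hypothetical stand-in for a black-box predictor; returns a score in [0, 1]."""
    gc = sum(base in "GC" for base in seq) / len(seq)
    motif_bonus = 0.3 if "GT" in seq else 0.0  # purely illustrative motif effect
    return min(1.0, 0.7 * gc + motif_bonus)

def apply_perturbations(seq, perturbations):
    """Apply a list of (position, base) substitutions to seq."""
    chars = list(seq)
    for pos, base in perturbations:
        chars[pos] = base
    return "".join(chars)

def random_individual(seq, max_edits, rng):
    """An individual is a list of 1..max_edits substitutions."""
    n = rng.randint(1, max_edits)
    return [(rng.randrange(len(seq)), rng.choice(ALPHABET)) for _ in range(n)]

def mutate(ind, seq, max_edits, rng):
    """Either add one substitution (if allowed) or replace an existing one."""
    child = list(ind)
    new_edit = (rng.randrange(len(seq)), rng.choice(ALPHABET))
    if len(child) < max_edits and rng.random() < 0.5:
        child.append(new_edit)
    else:
        child[rng.randrange(len(child))] = new_edit
    return child

def bin_index(score, n_bins):
    """Map a score in [0, 1] to one of n_bins equal-width bins."""
    return min(int(score * n_bins), n_bins - 1)

def generate_local_dataset(seq, n_bins=10, max_edits=3, pop_size=30,
                           generations=40, seed=0):
    """Evolve perturbations of seq; return {variant: predicted score}."""
    rng = random.Random(seed)
    archive = {}
    population = [random_individual(seq, max_edits, rng) for _ in range(pop_size)]
    for _ in range(generations):
        scored = []
        for ind in population:
            variant = apply_perturbations(seq, ind)
            score = toy_model(variant)
            archive[variant] = score
            scored.append((ind, score))
        # Count archived variants per prediction bin.
        counts = [0] * n_bins
        for s in archive.values():
            counts[bin_index(s, n_bins)] += 1
        # Assumed fitness: individuals landing in rarer bins rank first.
        scored.sort(key=lambda pair: counts[bin_index(pair[1], n_bins)])
        parents = [ind for ind, _ in scored[: pop_size // 2]]
        children = [mutate(rng.choice(parents), seq, max_edits, rng)
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return archive

original = "ACGTACGTACGTACGTACGT"
dataset = generate_local_dataset(original)
```

The resulting `dataset` maps each perturbed sequence to its predicted score and could serve as training data for an interpretable surrogate fitted around `original`. A real application would replace `toy_model` with the trained model and would likely need a richer, domain-aware perturbation representation than single-base substitutions.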

Subjects:	Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Genomics (q-bio.GN)
Cite as:	arXiv:2407.02984 [cs.LG]
	(or arXiv:2407.02984v3 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2407.02984
Related DOI:	https://doi.org/10.1145/3638529.3653990

Submission history

From: Alcides Fonseca
[v1] Wed, 3 Jul 2024 10:31:30 UTC (2,914 KB)
[v2] Fri, 5 Jul 2024 10:48:27 UTC (2,917 KB)
[v3] Wed, 17 Jul 2024 09:30:42 UTC (2,917 KB)
