Clustering strings with mutations using an expectation-maximization algorithm

Mathematical Biology Seminar
Wednesday, October 2, 2019 - 11:00am for 1 hour (actually 50 minutes)
Skiles 006
Afaf Saaidi – Georgia Tech
Christine Heitsch

An expectation-maximization (EM) algorithm is a powerful clustering method, best known for fitting Gaussian mixture distributions. When the probability density function is not known in advance, an EM algorithm estimates the "best" function, that is, the one that maximizes the likelihood of the data under the model. We present an EM algorithm that addresses the problem of clustering "mutated" substrings of similar parent strings so that each substring is correctly assigned to its parent string. This problem is motivated by the simultaneous reading of similar RNA sequences, which produces various substrings of each sequence that may be mutated; that is, a substring may have some letters changed during the reading process. Because the original RNA sequences are similar, a substring can easily be assigned to the wrong original sequence. We describe our EM algorithm and present a test on a simulated benchmark showing that our method yields a better assignment of the substrings than previous methods. We conclude by discussing how this assignment problem applies to RNA structure prediction.
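
For readers new to EM, the assignment problem in the abstract can be illustrated with a toy sketch. This is not the speaker's algorithm, only a minimal mixture-model EM under strong simplifying assumptions: the parent strings are known and of equal length, each read comes with a known start position and lies fully inside the parents, the alphabet has four letters, and every letter mutates independently with the same unknown probability eps. The function name em_assign and all of its parameters are hypothetical.

```python
import math


def em_assign(reads, parents, n_iter=50, eps=0.05):
    """Soft-assign reads to parents with a toy EM loop (illustrative only).

    reads   -- list of (start, substring) pairs; start is assumed known
    parents -- list of equal-length parent strings, assumed known
    eps     -- initial guess for the per-letter mutation probability
    Returns (resp, weights, eps); resp[i][j] estimates P(read i came from parent j).
    """
    k = len(parents)
    weights = [1.0 / k] * k
    # Mismatch count between each read and the aligned window of each parent.
    mism = [
        [sum(a != b for a, b in zip(seq, p[start:start + len(seq)])) for p in parents]
        for start, seq in reads
    ]
    total_letters = sum(len(seq) for _, seq in reads)
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each parent for each read,
        # computed in log space and normalized for numerical stability.
        resp = []
        for (_, seq), row in zip(reads, mism):
            logp = [
                math.log(weights[j])
                + d * math.log(eps / 3)  # a mutation picks one of 3 other letters
                + (len(seq) - d) * math.log(1 - eps)
                for j, d in enumerate(row)
            ]
            top = max(logp)
            unnorm = [math.exp(v - top) for v in logp]
            norm = sum(unnorm)
            resp.append([u / norm for u in unnorm])
        # M-step: re-estimate the mixture weights and the mutation rate,
        # clamped away from 0 so the logarithms above stay defined.
        weights = [max(sum(r[j] for r in resp) / len(reads), 1e-12) for j in range(k)]
        expected_mism = sum(r[j] * row[j] for r, row in zip(resp, mism) for j in range(k))
        eps = min(max(expected_mism / total_letters, 1e-6), 0.5)
    return resp, weights, eps


# Made-up example: two similar parents and a few reads, one of them mutated.
parents = ["ACGTACGTACGT", "ACGTTCGTACGA"]
reads = [(2, "GTACGT"), (2, "GTTCGT"), (6, "GTACGA"), (2, "GTACGA")]
resp, weights, eps = em_assign(reads, parents)
for (start, seq), r in zip(reads, resp):
    print(start, seq, [round(x, 3) for x in r])
```

Each row of resp is a soft assignment; taking the largest entry gives a hard assignment of the read to a parent. A read that matches both parents equally well, such as one covering only a region where the parents agree, simply receives the current mixture weights, which is the ambiguity the abstract refers to.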