Predicting Transcription Factor Binding Activity with Deep Neural Networks
Zachary Rothenberg, Ameet Soni
Transcription factors are proteins that play an essential role in the regulatory processes of biological systems. Understanding their binding behavior has therefore become an important research goal in the genetics community. In the past decade, new laboratory techniques have been developed to collect data on the binding activity of transcription factors at a much larger scale than previously possible. This has led to the creation of large binding site datasets for numerous transcription factors across a variety of cell types in the human body, motivating an examination of the statistical models that can be trained on these data to predict binding sites. Such models could improve our understanding of transcription factor activity without requiring expensive lab work.
Previous work has used these data to identify small DNA motifs (5-10 nucleotides long) that are informative about the binding behavior of a chosen transcription factor. More recent methods have begun to explore training convolutional neural networks for this task. This family of models has seen a wave of success across several problem domains in the past five years, driven by the creation of datasets large enough to allow for their effective training. They work by scanning for many different small patterns in the data and combining them into a hierarchical representation for classification. Their application to transcription factor binding site prediction has proven fruitful, and they are now state of the art for the task. However, these approaches have, for the most part, used shallow networks that examine only short contiguous DNA segments when classifying. In this research we examine the application of deep networks to this task, with the expectation that they will learn a deeper hierarchy that captures more long-range interactions in the binding behavior. In addition, such networks have proven effective at incorporating additional data into their predictions, allowing us to use information such as chromatin accessibility to further improve classification accuracy. We show that increasing model depth has a positive effect on performance, motivating further investigation into deep architectures for this task.
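To make the architectural idea concrete, the sketch below shows one way a deep convolutional network of this kind could be structured in PyTorch: early convolutional layers scan one-hot encoded DNA for short motif-like patterns, repeated convolution and pooling blocks build a hierarchy with a progressively wider receptive field, and chromatin accessibility is incorporated as an extra input channel alongside the four bases. This is a minimal illustration under assumed settings (window length, number of blocks, filter counts, kernel sizes), not the specific network evaluated in this work.

```python
import torch
import torch.nn as nn


class DeepTFBindingNet(nn.Module):
    """Illustrative deep CNN for transcription factor binding prediction.

    Input: (batch, channels, seq_len) with 4 one-hot DNA channels plus an
    optional per-base chromatin accessibility channel. All hyperparameters
    here are assumptions for the sake of example.
    """

    def __init__(self, in_channels=5, n_conv_blocks=4):
        super().__init__()
        blocks, channels = [], in_channels
        for i in range(n_conv_blocks):
            out_ch = 64 * (i + 1)
            blocks += [
                # Convolution scans for short local patterns (motif-like features).
                nn.Conv1d(channels, out_ch, kernel_size=8, padding=4),
                nn.ReLU(),
                # Pooling widens the receptive field so deeper layers can
                # combine features across longer-range sequence context.
                nn.MaxPool1d(4),
            ]
            channels = out_ch
        self.features = nn.Sequential(*blocks)
        self.classifier = nn.Sequential(
            # Global pooling makes the classifier independent of exact input length.
            nn.AdaptiveMaxPool1d(1),
            nn.Flatten(),
            nn.Linear(channels, 1),  # single logit: bound vs. unbound
        )

    def forward(self, x):
        return self.classifier(self.features(x))


# Example usage: a batch of 8 windows of 1000 bp, 4 one-hot base channels
# plus 1 accessibility channel (random tensors stand in for real data).
model = DeepTFBindingNet(in_channels=5, n_conv_blocks=4)
logits = model(torch.randn(8, 5, 1000))  # shape (8, 1)
```

Stacking additional convolution/pooling blocks in this pattern is the sense in which "deeper" networks can capture longer-range interactions: each added block lets the representation integrate features over a wider stretch of sequence before classification.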