Computational Biology & Data Mining


Our group develops and applies methods that integrate data from different levels of molecular biology to investigate biological questions, including the function of genes and proteins and the mechanisms that control cell identity or cause disease. Our projects often overlap in both the resources and the methods they use. For example, we develop data mining methods that associate keywords with therapeutic drugs, which we can then apply to the interpretation of gene expression profiles. In a different project, we created phylogenetic analyses of protein families that we can then use to study the evolution of the human protein interaction network. Because they are carried out within the same group, our projects benefit from and complement each other.

Study of protein interaction networks

One of our recent lines of work on protein-protein interactions (PPIs) concerns the topological properties of PPI networks. The topologies of many networks, including the PPI network, can be modelled using hyperbolic geometry, whose mathematical properties naturally lead to the emergence of network scale invariance and strong clustering. Such networks can be generated by adding new nodes and connecting them according to two variables: node popularity and node similarity. To create such a network in hyperbolic space, nodes can appear sequentially at random positions in a disk with a radius proportional to the number of nodes already generated, and they can then be connected if they are close (modelling similarity). Older nodes in this model are closer to the centre of the disk and to the rest of the nodes in the network, thus attracting more connections (modelling popularity). We will use such maps to predict links between proteins and to study network evolution (Figure 1).

Diagrammatic representation of a human protein interactome

Figure 1. The protein clock. Each dot in this human protein interactome corresponds to one human protein. Proteins with many interactions have been mapped closer to the centre, and proteins that interact are close to one another. Proteins were then clustered and coloured according to those clusters. Enriched functions in each cluster are indicated as Gene Ontology terms (BP: Biological Process, CC: Cellular Component, MF: Molecular Function) and KEGG pathway terms. An interactive version is also available online.
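The popularity-similarity growth process described in this section can be illustrated with a short Python sketch. It is not the code used to produce Figure 1. It assumes one commonly used formulation in which the node born at time t receives the radial coordinate ln t and a uniformly random angle, and then links to the m existing nodes at the smallest hyperbolic distance. The function names, the parameter m_links and the choice of ln t are assumptions made for illustration.

```python
import math
import random


def hyperbolic_distance(r1, theta1, r2, theta2):
    """Distance between two points given in polar coordinates on the hyperbolic plane."""
    # Angular separation folded into the range [0, pi].
    dtheta = math.pi - abs(math.pi - abs(theta1 - theta2))
    arg = (math.cosh(r1) * math.cosh(r2)
           - math.sinh(r1) * math.sinh(r2) * math.cos(dtheta))
    # Clamp to 1.0 to guard against rounding errors just below the acosh domain.
    return math.acosh(max(arg, 1.0))


def grow_network(n_nodes, m_links, seed=0):
    """Grow a network node by node; returns (coords, edges).

    coords[i] is the (r, theta) position of node i, and edges is a set of
    (older_node, newer_node) index pairs.
    """
    rng = random.Random(seed)
    coords = []
    edges = set()
    for t in range(1, n_nodes + 1):
        # Assumption: radial coordinate ln(t), so older nodes sit nearer the centre.
        r = math.log(t)
        theta = rng.uniform(0.0, 2.0 * math.pi)
        new_index = len(coords)
        # Rank the existing nodes by hyperbolic distance to the newcomer.
        nearest = sorted(
            range(new_index),
            key=lambda i: hyperbolic_distance(r, theta, *coords[i]),
        )
        # Similarity: connect to the closest nodes. Popularity emerges because
        # central (older) nodes are close to almost every new node.
        for i in nearest[:m_links]:
            edges.add((i, new_index))
        coords.append((r, theta))
    return coords, edges


if __name__ == "__main__":
    coords, edges = grow_network(n_nodes=500, m_links=2, seed=42)
    degree = [0] * len(coords)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    print("edges:", len(edges))
    print("degrees of the five oldest nodes:", degree[:5])
    print("degrees of the five youngest nodes:", degree[-5:])
```

Mapping a real interactome works in the opposite direction: coordinates have to be inferred from the observed interactions, which this sketch does not attempt.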

Protein sequence and structure analysis

We studied the increasing redundancy in the new protein sequences deposited in sequence databases using a pragmatic clustering method that puts together sequences of similar lengths. We proposed how to use this clustering to identify taxonomic ranges that need further genome sequencing, and to direct experimental studies to proteins from large uncharacterised families. We implemented the method as a web tool (FASTA Herder). We further developed this method to account for fragments and made available various applications for browsing and querying these clusters by various properties (FastaHerder2). We also implemented a web service that clusters and annotates the results of a BLAST search (CABRA). A related resource that we created is orthoFind, which facilitates the interpretation of the results of a protein sequence similarity search by evaluating the homologues of the query protein in terms of orthologues and paralogues and complementing this with reports on domains and functions.
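To make the idea of length-aware clustering concrete, the following Python sketch groups sequences greedily: a sequence joins a cluster only if its length is close to that of the cluster representative and the two sequences are similar. This is not the FASTA Herder algorithm. The thresholds max_len_diff and min_similarity are hypothetical, and difflib.SequenceMatcher from the standard library stands in for a proper sequence alignment.

```python
from difflib import SequenceMatcher


def cluster_by_length_and_similarity(seqs, max_len_diff=0.1, min_similarity=0.8):
    """Greedy clustering of sequences with similar length and similar content.

    seqs maps identifiers to sequences. Returns a list of clusters, each a
    list of identifiers whose first element is the cluster representative.
    """
    clusters = []
    # Process longer sequences first so that representatives are the longest members.
    for name, seq in sorted(seqs.items(), key=lambda item: -len(item[1])):
        for cluster in clusters:
            rep = seqs[cluster[0]]
            # Hypothetical length criterion: at most 10% shorter than the representative.
            length_ok = abs(len(rep) - len(seq)) <= max_len_diff * len(rep)
            if not length_ok:
                continue
            # Stand-in similarity score in [0, 1]; a real tool would use an alignment.
            score = SequenceMatcher(None, rep, seq, autojunk=False).ratio()
            if score >= min_similarity:
                cluster.append(name)
                break
        else:
            # No existing cluster accepted the sequence: it founds a new one.
            clusters.append([name])
    return clusters


if __name__ == "__main__":
    example = {
        "a": "MKTAYIAKQRQISFVKSHFSRQ",
        "b": "MKTAYIAKQRQISFVKSHFSRA",  # one substitution relative to "a"
        "c": "MKTAYIAKQR",              # fragment of "a", much shorter
        "d": "GGGGSSSSPPPPLLLLWWWWCC",  # same length as "a", unrelated
    }
    for cluster in cluster_by_length_and_similarity(example):
        print(cluster)
```

With the length criterion in place, the fragment "c" does not join the cluster of its full-length parent, which is the kind of case that a fragment-aware extension would need to handle separately.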

Genomic analysis and gene expression

Topologically associating domains (TADs) are genomic self-interacting regions containing multiple genes. To investigate their function and evolution, we have studied the position of pairs of paralogous genes with respect to TADs. We observed significantly more pairs within TADs than expected. Since most paralogous gene pairs are formed by tandem duplication, we propose that there is selective pressure to keep paralogues in the same TAD. Paralogues can have related functions and might require common regulatory mechanisms. Our results support the idea that TADs may provide such mechanisms.
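The comparison between observed and expected numbers of gene pairs within TADs can be sketched as a permutation test. The Python example below is a minimal illustration under strong assumptions, not the analysis we performed: it handles a single chromosome, represents TADs as non-overlapping half-open intervals, and draws random gene pairs as the null model. A realistic null model would likely also need to preserve the linear distance between the genes of each pair. All function and variable names are hypothetical.

```python
import bisect
import random


def tad_index(position, tad_starts, tad_ends):
    """Index of the TAD [start, end) containing position, or None if outside all TADs."""
    i = bisect.bisect_right(tad_starts, position) - 1
    if i >= 0 and position < tad_ends[i]:
        return i
    return None


def count_pairs_same_tad(pairs, gene_pos, tad_starts, tad_ends):
    """Number of gene pairs whose two members fall inside the same TAD."""
    count = 0
    for gene_a, gene_b in pairs:
        tad_a = tad_index(gene_pos[gene_a], tad_starts, tad_ends)
        tad_b = tad_index(gene_pos[gene_b], tad_starts, tad_ends)
        if tad_a is not None and tad_a == tad_b:
            count += 1
    return count


def permutation_p_value(pairs, gene_pos, tads, n_perm=1000, seed=0):
    """Return (observed count, empirical one-sided p-value).

    The null distribution comes from the same number of randomly drawn
    gene pairs (a naive null model that ignores genomic distance).
    """
    rng = random.Random(seed)
    tads = sorted(tads)
    starts = [start for start, _ in tads]
    ends = [end for _, end in tads]
    observed = count_pairs_same_tad(pairs, gene_pos, starts, ends)
    genes = sorted(gene_pos)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        random_pairs = [tuple(rng.sample(genes, 2)) for _ in pairs]
        if count_pairs_same_tad(random_pairs, gene_pos, starts, ends) >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the empirical p-value is never exactly zero.
    return observed, (at_least_as_extreme + 1) / (n_perm + 1)


if __name__ == "__main__":
    tads = [(0, 100), (100, 200), (200, 300)]
    gene_pos = {"g0": 10, "g1": 20, "g2": 110, "g3": 120,
                "g4": 210, "g5": 220, "g6": 250, "g7": 150}
    paralogue_pairs = [("g0", "g1"), ("g2", "g3"), ("g4", "g5")]
    observed, p_value = permutation_p_value(paralogue_pairs, gene_pos, tads)
    print("pairs in the same TAD:", observed, "empirical p-value:", p_value)
```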

Data and text mining

In this topic, our current efforts are directed towards improving the functional characterisation of large lists of candidate genes derived from high-throughput biological experiments. Enrichment analysis of gene-related annotations such as Gene Ontology terms, molecular pathways or diseases is already well established. However, methods offering disease enrichment analysis on gene sets are less developed and are based on individual gene-disease associations from manually curated or experimental data that do not cover all diseases discussed in the literature. We developed Gene Set 2 Diseases as an alternative tool to identify gene-disease associations, by deriving these directly and automatically from the biomedical literature.
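As an illustration of enrichment analysis on a gene set, the Python sketch below scores each disease with a one-sided hypergeometric test, a standard choice for this kind of analysis. It is not a description of how Gene Set 2 Diseases computes its results. The mapping disease_to_genes is a hypothetical input that stands for any source of gene-disease associations, including ones mined from the literature, and no multiple-testing correction is applied.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_annotated, n_selected, n_overlap):
    """P(X >= n_overlap) for X ~ Hypergeometric.

    n_selected genes are drawn without replacement from n_background genes,
    of which n_annotated are associated with the disease.
    """
    upper = min(n_annotated, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


def disease_enrichment(gene_set, disease_to_genes, background):
    """Return (disease, overlap size, p-value) tuples sorted by p-value.

    Genes outside the background are ignored, both in the query set and
    in the disease annotations.
    """
    background = set(background)
    gene_set = set(gene_set) & background
    results = []
    for disease, genes in disease_to_genes.items():
        annotated = set(genes) & background
        overlap = gene_set & annotated
        p_value = hypergeom_enrichment_p(
            len(background), len(annotated), len(gene_set), len(overlap)
        )
        results.append((disease, len(overlap), p_value))
    return sorted(results, key=lambda item: item[2])


if __name__ == "__main__":
    background = [f"g{i}" for i in range(20)]
    # Hypothetical associations, for illustration only.
    disease_to_genes = {
        "disease A": ["g0", "g1", "g2", "g3", "g4"],
        "disease B": ["g10", "g11", "g12", "g13", "g14"],
    }
    query = ["g0", "g1", "g2", "g3", "g4"]
    for disease, overlap, p_value in disease_enrichment(query, disease_to_genes, background):
        print(disease, overlap, p_value)
```

When many diseases are tested at once, the resulting p-values would normally be adjusted for multiple testing before interpretation.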

Future directions

After developing clustered datasets of proteins that can be annotated for taxonomic distribution and sequence features, we now exploit these for predictive approaches. We will study the evolution and function of homorepeats (e.g. polyQ, polyP), analysing their position within protein sequences, their relation to protein functions, their time of emergence in evolution, and their co-occurrence with other homorepeats. We are also developing computational methods to improve the prediction of pseudogenes and to identify the functions of transcripts arising from pseudogenes and from other non-protein-coding RNA sequences. Finally, we want to take advantage of the increasingly large ChIP-seq datasets that should provide information on the combinations of epigenetic marks and transcription factors that result in gene expression; however, the heterogeneity and size of the data complicate this task. Towards this goal, we are working on the extraction and integration of ChIP-seq data from different databases and we plan to create algorithms that measure similarity between ChIP-seq datasets.
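As a simple illustration of the homorepeat analysis, the following Python sketch locates uninterrupted runs of a single amino acid in a protein sequence and reports their residue, start position and length. The minimum run length of five residues is a hypothetical threshold, and only pure repeats are detected; definitions that tolerate interrupting residues would need a different approach.

```python
import re


def find_homorepeats(sequence, min_length=5):
    """Find uninterrupted single-residue runs of at least min_length.

    Returns a list of (residue, start, length) tuples, where start is the
    0-based position of the first residue of the run.
    """
    if min_length < 1:
        raise ValueError("min_length must be at least 1")
    # One residue followed by at least (min_length - 1) copies of itself.
    pattern = re.compile(r"([A-Z])\1{%d,}" % (min_length - 1))
    return [
        (match.group(1), match.start(), match.end() - match.start())
        for match in pattern.finditer(sequence)
    ]


if __name__ == "__main__":
    protein = "MAQQQQQQQSPPPPPPK"
    for residue, start, length in find_homorepeats(protein):
        # Relative position along the protein, between 0 and 1.
        relative = start / len(protein)
        print(f"poly{residue}: start={start}, length={length}, relative={relative:.2f}")
```

Collecting these tuples across a proteome would give the raw material for studying where homorepeats occur within proteins and which ones co-occur.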