Computational Biology & Data Mining

Our group develops and applies computational methods for the study of gene and protein function, with an emphasis on molecules related to human disease. Recently, our main research topics have been protein interaction networks, gene expression, and data and text mining of the biomedical literature. The results of our work are often distributed as software or web tools.

Study of protein interaction networks

We created HIPPIE, a database that integrates protein-protein interaction (PPI) data from several source databases and scores each interaction according to the attached experimental information. We recently showed that measured interactions are more reliable when the interacting proteins are expressed in the same tissue or cellular location and are involved in similar processes. These properties help to focus on relevant PPI subnetworks; we implemented mechanisms to do this in HIPPIE and illustrated how to apply them to the study of biological cases.
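The idea of narrowing a scored PPI network to a relevant subnetwork can be sketched as follows. This is only an illustration under stated assumptions, not HIPPIE's implementation or API: the `Interaction` type, the `filter_interactions` function, the 0.7 score cutoff and the example proteins, scores and tissue annotations are all hypothetical.

```python
from typing import Dict, List, NamedTuple, Set


class Interaction(NamedTuple):
    """One scored protein-protein interaction (hypothetical record layout)."""
    protein_a: str
    protein_b: str
    score: float  # assumed confidence score in [0, 1]


def filter_interactions(
    interactions: List[Interaction],
    tissue_expression: Dict[str, Set[str]],
    tissue: str,
    min_score: float = 0.7,  # illustrative cutoff, not a recommended value
) -> List[Interaction]:
    """Keep interactions above a score cutoff whose partners are both
    annotated as expressed in the given tissue."""
    expressed = tissue_expression.get(tissue, set())
    return [
        i for i in interactions
        if i.score >= min_score
        and i.protein_a in expressed
        and i.protein_b in expressed
    ]


# Made-up example data, for illustration only.
interactions = [
    Interaction("P1", "P2", 0.9),
    Interaction("P1", "P3", 0.4),   # below the score cutoff
    Interaction("P2", "P4", 0.8),   # P4 is not expressed in the tissue
]
tissue_expression = {"brain": {"P1", "P2", "P3"}}

subnetwork = filter_interactions(interactions, tissue_expression, "brain")
```

In this toy case only the P1-P2 interaction passes both the score and the co-expression filter.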

Expanded polyglutamine (polyQ) stretches are observed in the proteins of patients with different neurodegenerative diseases and are thought to trigger disease. We integrated genomic, phylogenetic, protein interaction network and functional information to provide further evidence that polyQ tracts modulate protein interactions (see Figure). PolyQ expansion results in a gain of abnormal interactions, leading to pathological effects such as protein aggregation. Furthermore, we characterized proteins that interact with ataxin-1, which has a polyQ tract whose abnormal expansion results in disease. Interactors that enhance the toxicity of mutant ataxin-1 are enriched in coiled-coil regions. We propose that, while interactions of coiled-coil proteins with polyQ regions might increase aggregation, blocking these interactions, or interacting with other regions, might reduce aggregates and neurodegeneration.

Protein sequence analysis

We have updated a neural network method (ARD2) that identifies repeats such as HEAT and Armadillo, which form similar structures composed of alpha-helices. We evaluated the phylogenetic distribution of these repeats. The results point to multiple likely events of independent emergence in distant taxa and to an increased frequency in organisms of high cellular complexity, such as eukarya in general, and cyanobacteria and planctomycetes within prokarya.

In general, proteins perform their functions in specific subcellular locations (e.g., cytoplasmic, nuclear, extracellular). It was hypothesized that proteins adapt to their physicochemical environment through mutations in residues that are exposed to the medium and not closely implicated in function, and that the composition of the exposed residues of a protein can therefore be used to predict its subcellular location. We studied the predictive power of amino acid composition over variable ranges from buried to exposed and found that buried and exposed amino acids carry complementary information about subcellular location. An optimized two-step predictor (NYCE), trained with vectors of amino acid composition in different ranges of exposure and using a Support Vector Machine followed by a Neural Network, reaches an accuracy of 62% when predicting nuclear, nucleocytoplasmic, cytoplasmic or extracellular location of eukaryotic proteins.
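The feature vectors described above, amino acid composition computed separately for different ranges of exposure, can be sketched as follows. This is not the NYCE code: the function names, the exposure cutoffs and the assumption that a relative exposure value per residue is already available (for example from a structure or a predictor) are all illustrative. The downstream Support Vector Machine and Neural Network steps are not shown.

```python
from typing import List, Sequence, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(residues: Sequence[str]) -> List[float]:
    """Fraction of each standard amino acid among the given residues.
    Non-standard symbols are ignored; an empty selection gives all zeros."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    total = 0
    for residue in residues:
        if residue in counts:
            counts[residue] += 1
            total += 1
    return [counts[aa] / total if total else 0.0 for aa in AMINO_ACIDS]


def exposure_features(
    sequence: str,
    exposure: Sequence[float],
    ranges: Sequence[Tuple[float, float]],
) -> List[float]:
    """Concatenate one 20-dimensional composition vector per exposure range.
    A residue belongs to a range when low <= exposure < high."""
    if len(sequence) != len(exposure):
        raise ValueError("sequence and exposure must have the same length")
    features: List[float] = []
    for low, high in ranges:
        selected = [aa for aa, e in zip(sequence, exposure) if low <= e < high]
        features.extend(composition(selected))
    return features


# Toy example with hypothetical exposure values and two ranges
# (buried: [0, 0.5), exposed: [0.5, 1.01)).
features = exposure_features(
    "ACDA", [0.1, 0.6, 0.9, 0.2], [(0.0, 0.5), (0.5, 1.01)]
)
```

Each protein then yields one fixed-length vector (here 40 values) that a classifier can be trained on.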

Transcript regulation and prediction

We are developing computational methods to predict ncRNAs and to evaluate experimental results that identify ncRNAs. This includes the study of transcripts arising from pseudogenes and the characterization of miRNA function based on the overlap of their targets with those of particular transcriptional repressors, as derived from ChIP-seq experiments.
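One common way to judge whether two target sets overlap more than expected by chance is a one-sided hypergeometric test. The sketch below illustrates that general technique; the text does not state which statistic the group uses, so the choice of test, the function name `overlap_pvalue` and the example gene sets are assumptions.

```python
from math import comb
from typing import Set


def overlap_pvalue(targets_a: Set[str], targets_b: Set[str],
                   universe_size: int) -> float:
    """Probability of observing an overlap at least as large as the one
    between targets_a and targets_b if targets_b were drawn at random,
    without replacement, from a universe of universe_size genes."""
    n_a, n_b = len(targets_a), len(targets_b)
    if universe_size < max(n_a, n_b):
        raise ValueError("universe must be at least as large as each set")
    observed = len(targets_a & targets_b)
    total = comb(universe_size, n_b)
    # Sum the hypergeometric probabilities for overlaps >= observed.
    tail = sum(
        comb(n_a, k) * comb(universe_size - n_a, n_b - k)
        for k in range(observed, min(n_a, n_b) + 1)
    )
    return tail / total


# Hypothetical gene sets in a universe of 10 genes.
mirna_targets = {"g1", "g2", "g3"}
repressor_targets = {"g1", "g2", "g3"}
p_value = overlap_pvalue(mirna_targets, repressor_targets, universe_size=10)
```

The result depends strongly on the choice of universe (for example, all genes that could be detected as targets in both experiments), so that choice should be made explicitly.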

Analysis of gene expression

We also develop methods for the study of DNA microarray data, for example to detect chromosomal regions with gross deletions or duplications from their abnormally low or high overall gene expression. We have implemented a method (CAFE) as an R package to perform this analysis and visualize the results.
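The underlying idea, that a run of neighbouring genes with consistently high or low expression may indicate a duplicated or deleted region, can be sketched with a simple sliding window. This is a Python illustration only and not CAFE's algorithm or interface (CAFE is an R package); the function name, window size, threshold and the assumption that the input holds log ratios of sample versus reference expression, ordered by chromosomal position, are all hypothetical.

```python
from statistics import mean
from typing import List, Sequence, Tuple


def flag_regions(
    log_ratios: Sequence[float],
    window: int = 5,         # number of consecutive genes per window (illustrative)
    threshold: float = 1.0,  # cutoff on the mean log ratio (illustrative)
) -> List[Tuple[int, int, str]]:
    """Return (start, end, label) for each window of consecutive genes whose
    mean log ratio is at least +threshold ("gain") or at most -threshold
    ("loss"). Indices refer to gene order; end is exclusive."""
    flagged = []
    for start in range(len(log_ratios) - window + 1):
        window_mean = mean(log_ratios[start:start + window])
        if window_mean >= threshold:
            flagged.append((start, start + window, "gain"))
        elif window_mean <= -threshold:
            flagged.append((start, start + window, "loss"))
    return flagged


# Made-up log ratios for eight genes along one chromosome.
example = [0.0, 0.1, -0.1, 1.2, 1.1, 1.3, 0.0, -0.1]
regions = flag_regions(example, window=3, threshold=1.0)
```

A real analysis would also need a statistical test against the rest of the genome rather than a fixed threshold.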

Data and text mining

We develop web tools that help researchers explore the biomedical literature to understand the molecular basis of disease. Recently, we developed the Alkemio web server to rank thousands of chemicals by their relevance to any topic according to their associated bibliography (in PubMed). We are also working on methods to mine the biomedical literature for relevant manuscripts. In this respect, we recently examined how the references cited by a manuscript improve document retrieval, and how to associate experts with particular topics using the author lists of manuscripts in PubMed. We provide the latter analysis as a tool for finding peer reviewers appropriate for a given manuscript (peer2ref).