Miami computational biologist Iddo Friedberg helps channel the flood of data from genome research to deduce the function of proteins

We live in the post-genomic era, when DNA sequence data is growing exponentially. Soon, having your entire genome sequenced will be an affordable and possibly routine biomedical diagnostic tool. However, says Iddo Friedberg, Miami University computational biologist, “for most of the genes that we identify, we have no idea of their biological functions. They are like words in a foreign language, waiting to be deciphered.”

Friedberg works in the new bioinformatics research field of computational prediction of protein and gene function. He and colleagues Predrag Radivojac, associate professor of computer science and informatics, Indiana University, Bloomington, and Sean Mooney, associate professor, Buck Institute for Research on Aging, are the leaders of CAFA (Critical Assessment of Function Annotation).

CAFA is a new community-wide experiment to assess the performance of the multitude of methodologies developed by research groups worldwide to “help channel the flood of data from genome research to deduce the function of proteins,” Friedberg explained.

Thirty research groups participated in the first CAFA, presenting a total of 54 methods. The results are published in an article in Nature Methods co-authored by all the participating groups, with Friedberg and Radivojac as lead authors.

The paper was published Jan. 28 in Nature Methods.

Concurrently, 15 articles edited by Friedberg and Radivojac were published in BMC Bioinformatics. These articles are companions to the Nature Methods article and describe the top-ranking methods in depth.

The essential cell protein PNPASE shuttles RNA into mitochondria, the energy-producing “power plant” of the cell. Human PNPASE, the function of which was discovered in 2010 by researchers from UCLA, was a case study for the CAFA experiment.

Image courtesy Maureen Heaster, UCLA

The accurate annotation of protein function is key to understanding life at the molecular level and has great biochemical and pharmaceutical implications, explain the study authors; however, with its inherent difficulty and expense, experimental characterization of function cannot scale up to accommodate the vast amount of sequence data already available.

Friedberg and Radivojac explain:

The computational annotation of protein function has therefore emerged as a problem at the forefront of computational and molecular biology.

Recently, the availability of genomic-level sequence information for thousands of species, coupled with massive high-throughput experimental data, has created new opportunities as well as challenges for function prediction.

Research groups worldwide have developed many methodologies, many of them based on comparing unsolved sequences with databases of proteins whose functions are known. Other methods mine the scientific literature associated with some of these proteins, and still others combine sophisticated machine-learning algorithms with an understanding of biological processes to decipher what these proteins do, said Friedberg.

“Indeed, we may have already identified a protein that is an ideal drug target for cancer, but it is lost in the myriad of data labeled as ‘function unknown.’”

In CAFA, researchers participated in blind-test experiments in which they predicted the function of protein sequences for which the functions are already known but haven't yet been made publicly available. Independent assessors then judged their performance.

The participating groups came from leading universities in North America, Europe, Asia and Australia.

The growth of biological databases (through 2009; growth has increased even more since then). The red line is the growth of protein sequences deposited in TrEMBL, a comprehensive protein sequence database. The blue line illustrates the growth of proteins in TrEMBL whose function is known, or at least can be predicted with some reasonable accuracy. The green line is the growth in the proteins whose 3D structure has been solved. Note the logarithmically increasing gap between what we know (blue) and what we do not know (red).

Image courtesy of Predrag Radivojac

"We have discovered great enthusiasm and community spirit," said Friedberg, who has been organizing Automated Function Prediction (AFP) meetings internationally since 2005.

"This is despite the competitive environment in which research groups want their methods to perform better than their peers' methods. Overall, throughout CAFA there was a highly collegial spirit, and a willingness to share information and science.

Everyone recognized that this is an important endeavor, and that only by a group effort can we move the field forward and learn to harness the deluge of genomic data, turning it into useful information."

"For the first time we have broad insight into what works, where improvement is needed, and how we should move the field forward.

We will continue running CAFA in the future, as we are confident it will only help generate better methods to understand the information locked in our genomes, and those of other organisms," Friedberg said.

CAFA was funded by grants from the National Institutes of Health and by the Department of Energy, section of Biological and Environmental Research.

Written by Susan Meikle with contributions from Iddo Friedberg.