Sunday, July 14, 2013

How Not to Do Science

Many reputable scientists are convinced that most of our genome is junk. However, there are still a few holdouts, and one of the most prominent is John Mattick. He believes that most of our genome is made up of thousands of genes for regulatory noncoding RNA. These RNAs (about 100 of them for every single protein-coding gene) are mostly involved in subtle controls of the levels of protein in human cells. (I'm not making this up. See: John Mattick on the Importance of Non-coding RNA)

It was a reasonable hypothesis at one point in time.

How do you evaluate a hypothesis in science? Well, one of the things you should always try to do is falsify your hypothesis. Let's see how that works ...

  1. The RNAs should be conserved. FALSE
  2. The RNAs should be abundant (>1 copy per cell). FALSE
  3. There should be dozens of well-studied specific examples. FALSE
  4. The hypothesis should account for variations in genome size. FALSE
  5. The hypothesis should be consistent with other data, such as that on genetic load. FALSE
  6. The hypothesis should be consistent with what we already know about the regulation of gene expression. FALSE
  7. You should be able to refute existing hypotheses, such as transcription errors. FALSE
Normally, you would abandon a hypothesis that had such a bad track record, but true believers aren't about to do that. So what's next? Maybe these regulatory RNAs don't show sequence conservation, but perhaps their secondary structures are conserved. In other words, these RNAs originated as functional RNAs with a secondary structure, but over the course of time all traces of sequence conservation have been lost and only the "conserved" secondary structure remains.[1] The Mattick lab looked at the "conservation" of secondary structure as an indicator of function using the latest algorithms (Smith et al., 2013). Here's how they describe their attempts to prove their hypothesis in light of conflicting data ...
The majority of the human genome is dynamically transcribed into RNA, most of which does not code for proteins (1–4). The once common presumption that most non–protein-coding sequences are nonfunctional for the organism is being adjusted to the increasing evidence that noncoding RNAs (ncRNAs) represent a previously unappreciated layer of gene expression essential for the epigenetic regulation of differentiation and development (5–8). Yet despite an exponential accumulation of transcriptomic data and the recent dissemination of genome-wide data from the ENCODE consortium (9), limited functional data have fuelled discourse on the amount of functionally pertinent genomic sequence in higher eukaryotes (1, 10–12). What is incontrovertible, however, is that evolutionary conservation of structural components over an adequate evolutionary distance is a direct property of purifying (negative) selection and, consequently, a sufficient indicator of biological function. The majority of studies investigating the prevalence of purifying selection in mammalian genomes are predicated on measuring nucleotide substitution rates, which are then rated against a statistical threshold trained from a set of genomic loci arguably qualified as neutrally evolving (13, 14). Conversely, lack of conservation does not impute lack of function, as variation underlies natural selection. Given that the molecular function of ncRNA may at least be partially conveyed through secondary or tertiary structures, mining evolutionary data for evidence of such features promises to increase the resolution of functional genomic annotations.
Here's what they found ...
When applied to consistency-based multiple genome alignments of 35 mammals, our approach confidently identifies >4 million evolutionarily constrained RNA structures using a conservative sensitivity threshold that entails historically low false discovery rates for such analyses (5–22%). These predictions comprise 13.6% of the human genome, 88% of which fall outside any known sequence-constrained element, suggesting that a large proportion of the mammalian genome is functional.
Apparently 13.6% of the human genome is a "large proportion." Taken at face value, however, the Mattick lab has now shown that the vast majority of transcribed sequences don't show any of the characteristics of functional RNA, including conservation of secondary structure. Of course, that's not the conclusion they emphasize in their paper.
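The arithmetic behind those percentages can be laid out as a minimal Python sketch. The three input figures come from the abstract quoted above; the function name `genome_breakdown` and the choice to treat the reported false discovery rate as a flat discount on the fraction of bases are assumptions made here for illustration, not something the paper states.

```python
# Figures quoted from the abstract of Smith et al. (2013).
STRUCTURED_FRACTION = 0.136    # share of the human genome in predicted RNA structures
OUTSIDE_KNOWN_FRACTION = 0.88  # share of those predictions outside known sequence-constrained elements
FDR_RANGE = (0.05, 0.22)       # reported range of false discovery rates


def genome_breakdown(structured=STRUCTURED_FRACTION,
                     outside=OUTSIDE_KNOWN_FRACTION,
                     fdr_range=FDR_RANGE):
    """Return back-of-the-envelope genome fractions (hypothetical helper)."""
    fdr_low, fdr_high = fdr_range
    return {
        # Predicted structures that do not overlap known constrained elements.
        "novel": structured * outside,
        # Everything with no predicted conserved structure at all.
        "no_predicted_structure": 1.0 - structured,
        # ASSUMPTION: the FDR is applied as a flat discount on bases,
        # which is a simplification of how it is actually measured.
        "fdr_adjusted_low": structured * (1.0 - fdr_high),
        "fdr_adjusted_high": structured * (1.0 - fdr_low),
    }


for name, value in genome_breakdown().items():
    print(f"{name}: {value:.1%}")
```

Under those assumptions, roughly 12% of the genome falls in newly predicted structures, somewhere between about 11% and 13% survives the false discovery discount, and about 86% has no predicted conserved structure.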

Why not?

[1] I can't imagine how this would happen, can you? You'd almost have to have selection AGAINST sequence conservation.

Smith, M.A., Gesell, T., Stadler, P.F. and Mattick, J.S. (2013) Widespread purifying selection on RNA structure in mammals. Nucleic Acids Research advance access July 11, 2013 [doi: 10.1093/nar/gkt596]