Cufflinks 2.0.0 released

This release substantially improves the accuracy and robustness of differential analysis with Cuffdiff. The update also resolves several user-reported issues and bugs, and adds several requested features. Due to the large number of enhancements and fixes, users are encouraged to treat this as a beta release. A manuscript describing the algorithmic improvements to the software is in preparation. Changes include:

  • Cuffdiff now reports estimated counts assigned to each gene and transcript, along with count variances due to uncertainty and cross-replicate variability. See the manual for more details on the new count tracking file format. Count tracking is not yet available in Cufflinks (e.g. with the -G option), but this functionality will be ported over in a future release.
  • Cuffdiff now tracks and records per-replicate FPKMs and counts. See the manual for more details on the new replicate tracking files. A version of CummeRbund that exposes this information in many plot types will be forthcoming.
  • Cuffdiff reports replicate and run metadata as part of each run.
  • Some users reported a high FAIL rate on gene and transcript quantification. This has been resolved according to a battery of tests using real and simulated data. The root cause was that in conditions with substantial overdispersion across replicates, the FPKM variance-covariance matrices produced by the Cuffdiff model were not always positive-definite. Cuffdiff detected this and marked those genes as having unreliable confidence intervals. Prior to 2.0.0, the model contained a heuristic approximation of the covariances between assigned fragment counts (which are necessary for calculating the variance on each gene’s expression level), and this approximation produced poorly conditioned matrices. We have replaced the heuristic approximation with a direct sampling approach, in effect “simulating” the assignment of fragments to each isoform many times for each gene. By simulating fragment generation and assignment to each transcript, we reconstruct variance-covariance matrices for assigned fragment counts that are always properly conditioned. This sampling approach also produces more accurate estimates of variance and covariance, improving the accuracy of transcript- and gene-level differential analysis. Users should expect more accurate quantification and shorter, more conservative lists of differentially expressed genes and transcripts.
  • After substantial performance testing, we have determined that the false discovery rate of Jensen-Shannon-based tests (differential splicing, CDS switching, and promoter switching) can be unacceptably high when used with fewer than three replicates in the conditions being compared. Cuffdiff now refrains from performing significance tests when one of the conditions involved has fewer than three replicates. You can change this behavior with the new --min-reps-for-js-test option. Cuffdiff still produces splicing.diff, cds.diff, and promoters.diff regardless of how many replicates you have. These files will include the JS distance scores, but none of the genes will be marked significant if you have fewer than the required number of replicates.
  • Cufflinks and Cuffdiff can now be told to ignore fragments that map to the genome more than a specified number of times using the --max-frag-multihits option. By default, Cufflinks and Cuffdiff still consider all fragments in the alignment file.
  • Cufflinks by default doesn’t report assembled transfrags that are built mostly from multiply mapping reads. This behavior can now be controlled or disabled with the new --max-multiread-fraction option.
  • Cufflinks by default fills small gaps in coverage when assembling transcripts. Gaps smaller than 50bp are filled and the transfrags joined. This behavior can be controlled or disabled with the new --overlap-radius option.
  • Before testing for differential expression or regulation of genes and transcripts, Cuffdiff now checks that the variance model for the gene or transcript in question is a good fit. This behavior can be controlled or disabled with the --min-outlier-p option. See the relevant section in the “How Cufflinks works” page for more on this.
  • A few bugs in the bias correction code and isoform deconvolution routines have been fixed, improving transcript-level expression accuracy.
  • Positional bias correction was reducing accuracy on certain datasets in some genes, so we have changed the default bias correction algorithm to model sequence-specific bias only. Bias correction is still disabled by default, and positional bias correction is still available as an optional mode.
  • Several minor issues related to library size normalization have been fixed.
  • A new library size normalization mode based on the geometric mean has been added, and is now the default for Cuffdiff. This method was introduced by Anders and Huber (Genome Biology, 2010).
  • Cuffdiff now uses the Eigen linear algebra libraries. Eigen is a fast package for matrix operations and makes good use of vector registers in modern processors, speeding up some of the numerical routines used during abundance estimation.
  • Cuffdiff now requires Boost version 1.47 or later.
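
The direct sampling idea mentioned above for fragment count covariances can be illustrated with a small toy simulation. This is only a sketch under stated assumptions, not Cuffdiff's actual model: it assumes a fixed number of fragments per gene and fixed assignment probabilities per isoform, and the function name simulate_assignment_covariance is hypothetical. It assigns fragments to isoforms at random many times and computes the empirical covariance of the resulting counts. A covariance matrix estimated this way from samples is positive semi-definite by construction, although in this toy the matrix is singular because the total count is held fixed.

```python
import random

def simulate_assignment_covariance(probs, n_fragments, n_draws, seed=0):
    """Estimate the covariance of per-isoform fragment counts by simulation.

    probs       -- assumed probability that a fragment belongs to each isoform
    n_fragments -- assumed (fixed) number of fragments for the gene
    n_draws     -- number of simulated assignments
    """
    rng = random.Random(seed)
    k = len(probs)
    draws = []
    for _ in range(n_draws):
        # Assign every fragment to one isoform and tally the counts.
        counts = [0] * k
        for isoform in rng.choices(range(k), weights=probs, k=n_fragments):
            counts[isoform] += 1
        draws.append(counts)

    # Empirical mean and (unbiased) covariance of the simulated counts.
    means = [sum(d[i] for d in draws) / n_draws for i in range(k)]
    cov = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            cov[i][j] = sum(
                (d[i] - means[i]) * (d[j] - means[j]) for d in draws
            ) / (n_draws - 1)
    return cov

# Two isoforms sharing 200 fragments: the counts are negatively correlated.
cov = simulate_assignment_covariance([0.7, 0.3], n_fragments=200, n_draws=500)
```

With only two isoforms and a fixed total, every fragment gained by one isoform is lost by the other, so the off-diagonal entry is the negative of the variance.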
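
For readers unfamiliar with the JS distance scores reported in splicing.diff, cds.diff, and promoters.diff, the following sketch computes a Jensen-Shannon distance between two relative-abundance distributions. It assumes the distance is the square root of the Jensen-Shannon divergence and uses base-2 logarithms so that the result lies between 0 and 1; the function names are illustrative and the exact computation inside Cuffdiff may differ.

```python
import math

def entropy(dist):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence between p and q."""
    # Mixture distribution halfway between the two inputs.
    m = [(a + b) / 2 for a, b in zip(p, q)]
    divergence = entropy(m) - (entropy(p) + entropy(q)) / 2
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(divergence, 0.0))

# Relative isoform abundances of one gene in two conditions (made-up values).
condition_a = [0.8, 0.2]
condition_b = [0.3, 0.7]
distance = js_distance(condition_a, condition_b)
```

Identical distributions give a distance of 0, and distributions with no isoform in common give a distance of 1.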
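
The geometric-mean library size normalization can be sketched as follows. This is a minimal illustration of the median-of-ratios method described by Anders and Huber, not Cuffdiff's own code: the function name size_factors and the input layout are assumptions made for the example. Each library's scaling factor is the median, over genes, of the ratio between that library's count and the gene's geometric mean across all libraries.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors.

    counts[g][s] is the fragment count for gene g in sample (library) s.
    Genes with a zero count in any sample are skipped, because their
    geometric mean is zero and the ratio is undefined.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if any(c <= 0 for c in row):
            continue
        # Log of the geometric mean of this gene across samples.
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for s, c in enumerate(row):
            log_ratios[s].append(math.log(c) - log_geo_mean)
    # The median log ratio per sample, converted back to a scale factor.
    return [math.exp(median(r)) for r in log_ratios]

# Second library sequenced twice as deeply as the first (made-up counts).
counts = [[10, 20], [100, 200], [50, 100], [0, 5]]
factors = size_factors(counts)
```

Dividing each library's counts by its factor puts the libraries on a common scale. Using the median makes the estimate less sensitive to a handful of very highly expressed genes than a normalization based on total counts.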