Bioinformatics. 2018 May 1; 34(9): 1488–1497.
Published online 2017 Dec 11. https://doi.org/10.1093/bioinformatics/btx790
PMCID: PMC6454425
PMID: 29236961

Outlier detection for improved differential splicing quantification from RNA-Seq experiments with replicates

Bonnie Berger, Associate Editor

Abstract

Motivation

A key component in many RNA-Seq-based studies is contrasting multiple replicates from different experimental conditions. In this setup, replicates play a key role as they make it possible to capture the underlying biological variability inherent to the compared conditions, as well as experimental variability. However, what constitutes a 'bad' replicate is not necessarily well defined. Consequently, researchers might discard valuable data or downstream analysis may be hampered by failed experiments.

Results

Here we develop a probability model to weigh a given RNA-Seq sample as a representative of an experimental condition when performing alternative splicing analysis. We demonstrate that this model detects outlier samples which are consistently and significantly different compared with other samples from the same condition. Moreover, we show that instead of discarding such samples, the proposed weighting scheme can be used to downweight samples and specific splicing variations suspected as outliers, gaining statistical power. These weights can then be used for differential splicing (DS) analysis, where the resulting algorithm offers a generalization of the MAJIQ algorithm. Using both synthetic and real-life data, we perform an extensive evaluation of the improved MAJIQ algorithm in different scenarios involving perturbed samples, mislabeled samples, same-condition groups, and different levels of coverage, showing it compares favorably to other tools. Overall, this work offers an outlier detection algorithm that can be combined with any splicing pipeline, a generalized and improved version of MAJIQ for DS detection, and evaluation metrics with matching code and data for DS algorithms.

Availability and implementation

Software and data are accessible via majiq.biociphers.org/norton_et_al_2017/.

Contact

[email protected].

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Alternative splicing (AS), the process by which segments of pre-mRNA can be arranged in different subsets to yield distinct mature transcripts, is a major contributor to transcriptome complexity. In humans, over 90% of multi-exon genes are alternatively spliced, and most of these exhibit splicing variations which are tissue- or condition-dependent (Pan et al., 2008). This key role of AS in transcriptome complexity, combined with the fact that aberrant splicing is commonly associated with disease state (Wang and Cooper, 2007), has led to various research efforts. These include efforts to accurately map transcriptome complexity and identify splicing variations between different cellular conditions, across developmental stages, or between cohorts of patients and controls.

Detection of splicing variations and the mapping of transcriptome complexity have been greatly facilitated by the development of high-throughput technologies to sequence transcripts, collectively known as RNA-Seq. Briefly, RNA from the cells of interest, typically poly-A selected or ribo-depleted, is sheared to a specific size range, amplified, and sequenced. In most technologies used today, the resulting sequence reads are typically around 100 bp, with read number varying greatly, from around 20 to 200 million reads. The shortness of the reads, their sparsity, and various experimental biases make inference about changes in RNA splicing a challenging computational problem (Alamancos et al., 2014). To aid RNA-Seq based analysis, many studies include samples which are biological replicates of the conditions being studied. Replicates are a key component in helping researchers distinguish between the biological variability of interest (e.g. differences between two tissues) and variability associated with experimental or technical factors. However, what constitutes a 'good' replicate or an outlier experiment is not always clear. Intuitively, in the context discussed here, an outlier is a sample which exhibits disproportionately large deviations in exon inclusion levels compared with other samples from the same experimental condition (biological replicates). An outlier could be the result of a failed experiment or of some hidden deviation in sample identity (e.g. a different tissue source). Remarkably, despite the obvious importance of the question of what constitutes an outlier, this question has been mostly ignored in the literature. Instead, researchers are left to define outliers based on heuristics which may not be accurate or may carry unconscious biases. For example, in a recent work focused on good practices for RNA-Seq data analysis, the authors recommend performing PCA analysis to see if samples from the same condition 'cluster together' but admit 'no clear standard exists for biological replicates' (Conesa et al., 2016). As we demonstrate here, the presence of outliers can have deleterious effects on algorithms that aim to detect differential splicing (DS) between groups of replicates from different conditions. Thus, an important contribution of this work is to formalize what constitutes an outlier replicate and to suggest a method which researchers could use to detect outliers before applying any DS algorithm.

A second contribution of this work is in assessing the effect of outliers on DS algorithms and, more broadly, producing a general framework with matching datasets for assessing the performance of DS algorithms. Generally, DS algorithms can be divided into two classes. The first, which includes tools such as RSEM (Li and Dewey, 2011) and Cuffdiff (Trapnell et al., 2012), aims to quantify full gene isoforms, typically by assuming a known transcriptome and assigning the observed reads to the various gene isoforms in the given transcriptome database. The second class of algorithms, which includes rMATS (Shen et al., 2014) and DEXSeq (Anders et al., 2012), works at the exon level, detecting differential inclusion of exons or exonic fragments. Some algorithms, such as SUPPA (Entizne et al., 2017), can be considered a hybrid as they collapse isoform abundance estimates from other algorithms such as Salmon (Patro et al., 2017) to compute relative exon inclusion levels. Previous works showed that for the task of DS quantification, algorithms that work at the exon level generally perform better, since they solve a simpler task and are less sensitive to isoform definitions or RNA-Seq biases within samples or along full isoforms (Liu et al., 2014). Thus, for the comparative analysis section of this paper, we focus on the second class of algorithms, specifically the commonly used DEXSeq and rMATS, as well as the recently updated SUPPA, all of which support replicates. Using these algorithms and the one we propose (MAJIQ-lw), we detail here several metrics to assess DS algorithm performance based on biological or technical replicates. We also develop a synthetic dataset where each sample is generated so as to model a matching real sample.

A third contribution of this work is in developing MAJIQ-lw, an adaptation of our recently published MAJIQ algorithm (Vaquero-Garcia et al., 2016) to automatically detect and downweight samples and splicing variations suspected to be outliers. MAJIQ offers a combined solution to detect, quantify, and visualize DS between experimental conditions with or without replicates. Besides the details of its statistical model and its visualization capabilities, two key features distinguish MAJIQ from the DS algorithms mentioned above. First, MAJIQ allows users to supplement previous transcriptome annotation with reliably detected de novo junctions, which results in an over 30% increase in detection of differential variations and robustness to transcriptome definitions (Vaquero-Garcia et al., 2016). Second, MAJIQ uses a novel formulation for transcript variations. It does not quantify whole gene isoforms as the first class of algorithms described, nor does it use only the previously defined AS 'types' (e.g. cassette exons) as the second class of algorithms. Instead, MAJIQ defines a more general concept of 'local splicing variations', or LSVs. Briefly, LSVs are defined as splits in a gene splice graph where a reference exon is spliced together with other segments downstream or upstream of it (see Fig. 1a). Importantly, the formulation of LSVs enables MAJIQ to capture all previously defined types of AS but also many other variations which are more complex (see Fig. 1b and c). Similar to classical AS events, LSVs are quantified by the fraction of inclusion of each of the junctions, termed percent spliced in (PSI, $\psi \in [0,1]$; see Fig. 1d). However, despite its novel features and capabilities, the original MAJIQ algorithm was not built to handle outlier replicates. Indeed, we show here that such outliers can significantly affect MAJIQ's performance. In contrast, the new generalization of MAJIQ, denoted MAJIQ-lw, offers a DS algorithm which is not only robust to outliers but can also salvage useful information from outlier samples. In order to demonstrate this improved performance, we perform extensive comparative analysis of the new MAJIQ against the original algorithm as well as the other algorithms listed earlier.

Fig. 1. Illustration of LSVs. (a) LSVs are splits in a gene splice graph where multiple edges spawn out from (single-source, 5' split) or merge into (single-target, 3' join) a reference exon. (b) The LSV definition captures classical binary splicing event types. (c) The LSV definition can capture complex variations that involve more than two junctions. (d) A sample from MAJIQ's output for a single-target LSV involving exon 8 of the Clta gene in mouse cerebellum. MAJIQ quantifies the LSV shown on the left by the marginal inclusion level ($\psi_j \in [0,1]$) of each possible junction $j$, estimating a posterior distribution over those (right, in matching order). Similarly, when comparing two conditions, a distribution over $\Delta\psi_j \in [-1,1]$ is derived (not shown)

2 Materials and methods

2.1 Outlier weight model

Let $\psi_{i,j}^t$ be the inclusion ratio of junction $j$ of LSV $i$ within experiment $t$, such that $\sum_{j=1}^{J} \psi_{i,j}^t = 1$, with $i \in [1 \ldots L]$, $t \in \mathcal{T} = [1 \ldots T]$, and $j \in \mathcal{J} = [1 \ldots J]$. Notably, we never observe $\psi_{i,j}^t$ directly, only junction spanning reads drawn according to it. Consequently, the MAJIQ model does not infer $\psi_{i,j}^t$ as a point estimate but rather as a posterior beta distribution, denoted $P(\Psi_{i,j}^t)$ or, with slight abuse of notation, simply $\Psi_{i,j}^t$ (Vaquero-Garcia et al., 2016). Importantly for the discussion here, MAJIQ assumes that all the experiments are replicates of the same biological condition, denoted $\mathcal{T}$ (e.g. tissue type, treatment, disease state), such that they share underlying (hidden) junction inclusion levels, denoted $\psi_{i,j}^{\mathcal{T}}\ \forall i,j,\ t \in \mathcal{T}$. Under this model, an outlier $s \in [1 \ldots T]$ is an experiment which significantly deviates from $\psi_{i,j}^{\mathcal{T}}$ for a disproportionately large set of LSVs.

In order to make the above definition formal we first need to specify a way to estimate $\psi_{i,j}^{\mathcal{T}}$, or at least a distribution over its possible values. We use the median over $\{\Psi_{i,j}^t\}_{t=1}^{T}$, denoted $\hat{\Psi}_{i,j}^{\mathcal{T}}$, as a proxy for $\Psi_{i,j}^{\mathcal{T}}$ and quantify the degree to which an experiment's $\Psi_{i,j}^t$ deviates from $\hat{\Psi}_{i,j}^{\mathcal{T}}$ using the L1 divergence:

$$d_1(\Psi_{i,j}^t, \hat{\Psi}_{i,j}^{\mathcal{T}}) = \frac{1}{2} \int_0^1 \left| P(\psi_{i,j}^t) - P(\hat{\psi}_{i,j}^{\mathcal{T}}) \right| \, d\psi.$$
(1)
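
For intuition, this divergence can be computed by direct numerical integration. The short Python sketch below does so for two Beta posteriors; the function name, grid resolution, and example parameters are illustrative rather than taken from MAJIQ's implementation.

```python
# A minimal sketch of Eq. (1): the L1 divergence between two Beta posteriors,
# computed by numerical integration over a psi grid. Names and grid size are
# illustrative, not taken from MAJIQ's code.
import numpy as np
from scipy.stats import beta

def l1_divergence(a1, b1, a2, b2, grid_points=2000):
    # Integrate over the open interval to avoid infinite densities at 0 or 1
    psi = np.linspace(1e-6, 1.0 - 1e-6, grid_points)
    diff = np.abs(beta.pdf(psi, a1, b1) - beta.pdf(psi, a2, b2))
    return 0.5 * float(np.sum(diff) * (psi[1] - psi[0]))  # d1 lies in [0, 1]

# Example: a sample posterior far from the group-median posterior
print(l1_divergence(10, 30, 25, 15))  # well-separated posteriors -> close to 1
```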

This allows us to define the deviation of sample $t$ for LSV $i$ as $\varphi_{t,i} = \max_j d_1(\Psi_{i,j}^t, \hat{\Psi}_{i,j}^{\mathcal{T}})$. According to our model, when all experiments are true biological replicates they should all have a similar distribution of $\varphi_{t,i}$ over $i \in [1 \ldots L]$. Outliers, however, will have a distribution skewed towards higher values, reflecting an accumulation of LSVs for which $\varphi_{t,i}$ is high. This skew towards high $\varphi_{t,i}$ values is illustrated in Figure 2 by the blue line (outlier) compared with the red and green lines (true replicates). To capture such skewness, we find $K_t = |\{i : \varphi_{t,i} > \tau\}|$, i.e. the number of highly-disagreeing LSVs for each experiment $t$, where $\tau$ is an arbitrary threshold for strong disagreement. In the experiments described below we used $\tau = 0.75$, but we find that results were robust to changes in the value of $\tau$ as long as it was reasonably high (see blue line in Fig. 2). Theoretically, if all experiments are equally 'well-behaved', we expect $K_t$ to follow a binomial distribution $K_t \sim \mathrm{Binom}(K_{\mathcal{T}}, p)$, where $p = \frac{1}{T}$ is the equal probability for every sample to 'misbehave' and $K_{\mathcal{T}}$ is the number of LSVs that have a highly disagreeing experiment, $K_{\mathcal{T}} = |\{i : \exists t, \varphi_{t,i} > \tau\}|$. In practice, we find the distribution of $K_t$ to have higher dispersion, which we model using a Beta Binomial such that $p \sim \mathrm{Beta}\!\left(\frac{\alpha}{T}, \alpha\left(1 - \frac{1}{T}\right)\right)$. We set $\alpha = 5$ and observe that outlier detection and the consequent local weights model described next were robust to the setting of $\alpha$, even when considering GTEx samples from different individuals (see Supplementary Figures). Finally, we set the weight $\rho_t \in [0,1]$ representing our belief that sample $t$ belongs to condition $\mathcal{T}$ using the following ratio test: we compute the probability mass for the tail of the beta binomial probability $P_{BB}$ for the observed number of highly disagreeing LSVs $K_t$, divided by the tail from the expected $\frac{K_{\mathcal{T}}}{T}$ under the null hypothesis:

$$\rho_t = \min\left(1, \frac{P_{BB}(X > K_t)}{P_{BB}\!\left(X > \frac{K_{\mathcal{T}}}{T}\right)}\right).$$
(2)

Fig. 2. Per-experiment empirical PDF over LSVs' L1 divergences from the group median. The L1 divergence ($\varphi_{t,i}$) is defined in Section 2.1. Each solid line represents a different RNA-Seq experiment ($t$). The gray dashed line represents a replicate model as described in Section 2.2. (a) A swap case of three 'replicates' from Zhang et al. (2014), with two liver samples (green, red) and an outlier, 'mislabeled', muscle sample (blue). (b) Same as in (a) but a less extreme case where fewer LSVs have extreme divergence from the median. Here the blue outlier was produced by synthetically perturbing 10% of the LSVs by 10% in a cerebellum sample at the same coverage level (see Section 2.5). In both scenarios most LSVs have low divergence ($\varphi_{t,i} < 0.2$) but a small fraction of LSVs in the blue outlier have $\varphi_{t,i} > 0.5$

This formulation has the desired characteristic that $\rho_t = 1$ for 'well behaved' experiments for which $K_t$ is less than the expected $\frac{K_{\mathcal{T}}}{T}$, while other experiments are penalized by their deviation from this expected value in proportion to the tail of the null distribution. In the new MAJIQ implementation users can now invoke the option to compute these weights for groups of replicates to detect possible outliers. We note that since the LSVs in the experiments are expected to be quantified in any case, the computational overhead associated with these weight computations is negligible.
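
For illustration, the full computation of Eq. (2) from a matrix of divergences can be sketched as follows; the variable names and the use of scipy.stats.betabinom are our own choices, not MAJIQ's internals.

```python
# A sketch of the global outlier weight (Eq. 2), assuming a precomputed
# matrix phi of per-LSV divergences phi[t, i] (Section 2.1). Helper names
# are illustrative; MAJIQ's actual implementation may differ.
import numpy as np
from scipy.stats import betabinom

def global_weights(phi, tau=0.75, alpha=5.0):
    """phi: (T, L) array of divergences; returns the weights rho[t]."""
    T = phi.shape[0]
    disagree = phi > tau                         # strong disagreement per (t, i)
    K_t = disagree.sum(axis=1)                   # highly-disagreeing LSVs per experiment
    K_T = int(disagree.any(axis=0).sum())        # union over all experiments
    if K_T == 0:                                 # no strong disagreement anywhere
        return np.ones(T)
    # Null model: K_t ~ BetaBinom(K_T, a, b) with p ~ Beta(alpha/T, alpha(1-1/T))
    a, b = alpha / T, alpha * (1.0 - 1.0 / T)
    tail_obs = betabinom.sf(K_t, K_T, a, b)      # P(X > K_t)
    tail_exp = betabinom.sf(K_T / T, K_T, a, b)  # P(X > K_T / T)
    return np.minimum(1.0, tail_obs / tail_exp)
```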

In addition, the above global per-experiment weight $\rho_t$ can be plugged into MAJIQ to produce a generalized version of the previous MAJIQ algorithm:

$$P\!\left(\Psi_{i,j}^{\mathcal{T}} \mid \{R_{i,j}^{t,\mathrm{inc}}\}_{t \in \mathcal{T}}, \{R_{i,j}^{t,\mathrm{exc}}\}_{t \in \mathcal{T}}\right) = \mathrm{Beta}\!\left(\sum_{t \in \mathcal{T}} \rho_t R_{i,j}^{t,\mathrm{inc}},\ \sum_{t \in \mathcal{T}} \rho_t R_{i,j}^{t,\mathrm{exc}}\right),$$
(3)

where $R_{i,j}^{t,\mathrm{inc}}, R_{i,j}^{t,\mathrm{exc}}$ represent estimated read rates supporting inclusion or exclusion of a specific junction. Details of how these read rates are derived from observed reads are given in (Vaquero-Garcia et al., 2016). Also, we note these weights and the above computation are done per condition $\mathcal{T}$ and do not change when two conditions are compared to compute $\Delta\Psi$. In the evaluations that follow, we refer to the resulting MAJIQ model with global weights per experiment as MAJIQ-gw, while the previous MAJIQ model with no weighting scheme is denoted MAJIQ-nw.
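
As a minimal sketch of Eq. (3), the pooling step amounts to a weighted sum of read rates per junction; the snippet below assumes the read rates (including any prior pseudo-counts) have already been estimated.

```python
# A sketch of the weighted group posterior of Eq. (3). Each experiment's
# inclusion/exclusion read rates are scaled by its weight before pooling
# into a single Beta posterior. Prior pseudo-counts, which MAJIQ's read-rate
# estimation would add, are assumed folded into r_inc and r_exc here.
import numpy as np
from scipy.stats import beta

def weighted_psi_posterior(r_inc, r_exc, w):
    """r_inc, r_exc: (T,) read rates per experiment; w: (T,) weights rho_t."""
    a = float(np.dot(w, r_inc))
    b = float(np.dot(w, r_exc))
    return beta(a, b)  # posterior distribution over psi for the condition

# Example: a downweighted third sample (rho ~ 0.1) contributes little
post = weighted_psi_posterior([20, 18, 90], [10, 12, 5], [1.0, 1.0, 0.1])
print(post.mean())
```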

2.2 Local weights

The previous section described MAJIQ-gw as a way to globally weigh experiments using $\rho_t$ so that outliers can be effectively detected and suppressed. However, by doing this we run the risk of discarding valuable information: even if a sample is a bona fide outlier, it is quite likely that this is the result of a subset of LSVs, while the majority of the LSVs are good representatives of the condition of interest $\mathcal{T}$. This phenomenon is nicely captured in Figure 2. To balance the benefits of outlier removal with the possible loss in statistical power, we develop a local weight model ($\nu_{t,i} \in [0,1]\ \forall t,i$) based on the distribution of $\varphi_{t,i}$ we expect to find in 'well-behaved' replicates.

First, we estimate the null distribution of 'well-behaved' replicates by weighting the per-experiment densities of $\varphi_{t,i}$ by $\rho_t$. This distribution is denoted 'Replicate Model' in Figure 2, and represented by the light gray line. Next, for each $t, i$, we estimate the probability of the matching LSV's L1 divergence falling within a neighborhood of $\varphi_{t,i}$ under the distribution for experiment $t$, compared with the same probability under the null model. Explicitly, we define

$$\nu_{t,i} = \min\left(1, \frac{\bar{P}(|X - \varphi_{t,i}| < \varepsilon)}{P_t(|X - \varphi_{t,i}| < \varepsilon)}\right),$$
(4)

where $\varepsilon$ is a reasonable neighborhood parameter (default 0.05), $\bar{P}$ is the null distribution of $\varphi_{t,i}$ (Replicate Model) such that $\bar{P}(\varphi) = Z^{-1} \sum_{t \in \mathcal{T}} \rho_t P_t(\varphi)$, $P_t$ is the distribution of $\varphi_{t,i}$ for experiment $t$, and $Z = \sum_{t \in \mathcal{T}} \rho_t$. Plugging $\nu_{t,i}$ into (3) instead of $\rho_t$ gives us the local weight model, which we denote MAJIQ-lw.
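
An empirical version of Eq. (4) can be sketched as follows, with both $P_t$ and $\bar{P}$ estimated by simple neighborhood counts over the divergence matrix; this density estimator is an illustrative simplification, not the paper's prescribed one.

```python
# An empirical sketch of the local weight (Eq. 4). P_t and the replicate
# model P-bar are estimated by the fraction of divergences falling in the
# eps-neighborhood of phi[t, i]; this simple estimator is our own choice.
import numpy as np

def local_weights(phi, rho, eps=0.05):
    """phi: (T, L) divergences; rho: (T,) global weights; returns nu (T, L)."""
    T, L = phi.shape
    Z = rho.sum()
    nu = np.ones((T, L))
    for t in range(T):
        for i in range(L):
            in_nbhd = np.abs(phi - phi[t, i]) < eps        # (T, L) mask
            p_t = in_nbhd[t].mean()                        # P_t(|X - phi| < eps)
            p_bar = float(np.dot(rho, in_nbhd.mean(axis=1))) / Z
            nu[t, i] = min(1.0, p_bar / max(p_t, 1e-12))
    return nu
```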

2.3 Performance evaluation metrics

When using synthetic RNA-Seq samples one can use standard measures of performance such as false discovery rate (FDR) or false negative rate (FNR). For real data, there is an inherent challenge in assessing the accuracy of DS methods since the underlying true values are rarely known. This problem is compounded by the fact that different DS methods define transcriptome variations differently (exonic segments, classical AS events, LSVs) and use vastly different criteria to detect those. DEXSeq, for example, analyzes expression of exonic segments and assigns a P-value to expression changes in those. The other methods discussed here instead quantify the relative inclusion of RNA segments (PSI), yet still vastly differ in the underlying model and statistical criteria. One issue regarding comparison of different AS units is possible redundancy or overlap. For example, two overlapping single-source and single-target LSVs can capture the same differentially spliced cassette exon reported by other methods (see illustration in Fig. 1b). To avoid such cases of 'double counting' we impose an additional constraint on MAJIQ's output where LSVs are not allowed to overlap. However, this constraint does not resolve the issue that a single LSV detected by MAJIQ may contain several 'classic' binary AS events reported separately by other methods. Here we suggest circumventing the above challenges of different AS definitions and statistical criteria by focusing instead on rank-based statistics derived using multiple biological and technical replicates. These statistics are applied to each method's own units of variation, assessing how reproducible each algorithm's significantly changing events are and how likely they are to be related to changes between the conditions of interest.

We start by describing the reproducibility ratio (RR) measure, following (Vaquero-Garcia et al., 2016). The RR measure is similar in spirit to the irreproducible discovery rate, which has been used extensively to evaluate ChIP-Seq peak calling methods (Li et al., 2011) and, more recently, methods for detecting cancer driver mutations (Tokheim et al., 2016). Conceptually, RR measures the proportion of high-ranked events (e.g. ChIP-Seq peaks or, in our case, differentially spliced events) that are also observed in a second, independent iteration of the same experimental procedure using biological or technical replicates. To compute RR, an algorithm A is run on a set of samples $S_1$ comprising two groups of replicates from two conditions (e.g. liver and cerebellum tissues). Algorithm A reports a set of AS events $S$ it detects, such that a subset $S^* \subseteq S$ is identified as differentially spliced, with $N_A = |S^*|$ denoting the size of that subset. This set $S^*$ can then be ranked by the events' relative significance or score under A. Next, we run A on a second 'test' set ($S_2$). Then, for any $n \le N_A$ ranked subset of $S^*$ we compute the size ($R_A(n) = n' \le n$) of the subset of those $n$ events which were again scored as significant by A on $S_2$ and ranked among the $n$ highest ranking events. We note that in (Vaquero-Garcia et al., 2016) the reproducibility graphs plotted $\frac{n}{N_A}$ against $\frac{n'}{N_A}$, with perfect reproducibility corresponding to a 45° line. However, we found this setup was unsuitable for the algorithms evaluated here as these varied greatly in their $N_A$ due to different definitions of AS events and different statistical criteria (see above and Section 3). We therefore opted to plot $RR(n) = \frac{n'}{n}$ against $n$, a statistic which captures the RR for any given $n$ and is agnostic to the value of $N_A$.
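
Computationally, $RR(n)$ reduces to comparing ranked event lists, as in the sketch below; the event identifiers and the assumption of tie-free rankings are illustrative.

```python
# A sketch of the reproducibility ratio RR(n) (Section 2.3). rank1 holds the
# N_A significant events from run 1 ranked by significance; rank2 holds the
# ranked significant events from the independent test set S2.
def rr_curve(rank1, rank2):
    """Return [RR(1), ..., RR(N_A)] where RR(n) = n'/n."""
    rr = []
    for n in range(1, len(rank1) + 1):
        top1, top2 = set(rank1[:n]), set(rank2[:n])
        rr.append(len(top1 & top2) / n)  # n' = reproduced events among top n
    return rr

# Example: perfect agreement on the top 2, divergence afterwards
print(rr_curve(["e1", "e2", "e3"], ["e2", "e1", "e9"]))  # [0.0, 1.0, 0.667]
```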

Although useful, the $RR_A$ statistic may not be sufficient to assess an algorithm's performance. First, both $RR_A$ and $N_A$ are not inherent characteristics of an algorithm but rather a combination of an algorithm and a specific dataset. Second, reproducibility alone is not a guarantee of accuracy, as algorithms can be highly reproducible yet maintain a strong bias. Thus, in order to better assess the accuracy of DS methods we combine RR with two additional tests. First, we assess how many DS events are reported by each algorithm when comparing groups of samples that are either biological or technical replicates from the same experimental condition. Biological replicates were obtained from the original studies we used, detailed below. To obtain technical replicates, we randomly partition the original paired-end FASTQ files and remap each read subset. This procedure results in a very strict definition of technical replicates, which are experiments by the same lab, protocol, biological sample, machine and lane. Presumably, true variation in any specific sample would be canceled out when comparing two groups of replicates. However, since we cannot completely rule out true variations between two groups of technical or biological replicates, we denote the reported DS events as putative FP, or PFP. Finally, we compute for each algorithm the intra-to-inter ratio (IIR) statistic, which is the ratio between the number of PFP and the number of DS events reported when comparing equivalent groups but from different conditions, i.e. $IIR = \frac{N_{PFP}}{N_A}$. The IIR is similar in spirit to the FDR in the sense that it estimates the fraction of reported varying events between two conditions of interest whose variation may not be associated with the difference between the experimental conditions being compared. As we show in the next section, in practice this ratio, which we derive from real data, is indicative of the FDR estimates we get from our synthetic dataset.

Finally, to further assess accuracy, we used RT-PCR triplicate experiments from previous studies (Vaquero-Garcia et al., 2016). We note this measure also carries limitations, as it is constrained by the total number of events quantified, possible selection biases, and limitations of the experimental procedure. We discuss some of those limitations in Section 4. Nonetheless, carefully executed RT-PCR provides valuable independent experimental validation and is considered the gold standard in the RNA field.

2.4 Data and software versions

In order to perform some of the evaluations described below, datasets with at least six replicates are required. Most of the results described in the next section were derived using RNA-Seq experiments from a recent study by Zhang et al. (2014), which includes samples from twelve mouse tissues (heart, hypothalamus, adrenal, brown and white fat, cerebellum (Cer), liver (Liv), muscle (Mus), lung, aorta, brainstem and kidney) across four circadian time points, repeated twice, with 60–120 million reads per experiment. When comparing groups of samples in the analysis described below, time points were matched between the groups to avoid time as a confounding factor. The high coverage allowed us to also compare groups of technical replicates by randomly partitioning files into subsets of 50 and 25% of the original reads. To show our comparative results are not specific to the data of Zhang et al. (2014), we also repeated much of the analysis on a different dataset from the Mouse Genome Project (Keane et al., 2011), which includes six tissues (heart, hippocampus, liver, thymus, spleen, and lung) with six biological replicates each, at 18–30 million reads per experiment (see Supplementary Material).

For the comparative analysis we used DEXSeq version 1.18.4, rMATS-turbo version 0.1, SUPPA from public GitHub source pulled on 21 June 2017, and MAJIQ version 1.0.6.

2.5 Synthetically perturbed and mislabeled samples

To observe the impact of an outlier sample in a controlled fashion, we used a real biological replicate and perturbed it to create a synthetic new pseudo-replicate outlier. Three parameters were used to control the sample's perturbation: $\theta \in [0,1]$ determined the fraction of LSVs randomly selected to be perturbed, $\delta \in [0,1]$ controlled the shift of $E[\Psi]$, and $\gamma \ge 0$ set the relative coverage level in the perturbed sample. We note that when $\gamma = 1$ and $\theta = 0$ or $\delta = 0$, the synthetic perturbation does not alter $E[\Psi]$ for any LSV. We measure the effect of variations in $\theta$, $\delta$, and $\gamma$ on $\rho_t$ and RR by applying the above $\Psi$ perturbation to one replicate in set $S_1$. A more detailed description of the perturbation is given in the Supplementary Material.
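
The sketch below conveys the roles of the three parameters on a single sample; it is a hypothetical simplification, and the exact perturbation scheme is the one detailed in the Supplementary Material.

```python
# A hypothetical sketch of the Section 2.5 perturbation: a fraction theta of
# LSVs gets its expected PSI shifted by up to delta, and read coverage is
# scaled by gamma. The exact scheme is in the Supplementary Material.
import numpy as np

def perturb_sample(psi, reads, theta, delta, gamma, seed=0):
    """psi: (L,) expected inclusion per LSV; reads: (L,) read counts."""
    rng = np.random.default_rng(seed)
    psi = np.asarray(psi, dtype=float).copy()
    chosen = rng.random(psi.shape[0]) < theta            # LSVs to perturb
    shift = delta * rng.choice([-1.0, 1.0], size=int(chosen.sum()))
    psi[chosen] = np.clip(psi[chosen] + shift, 0.0, 1.0)
    scaled = np.round(np.asarray(reads) * gamma).astype(int)  # coverage level
    return psi, scaled
```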

We also explore the effects of mislabeling a sample as an extreme case of an outlier. We simulate this by swapping out one replicate with a sample from a different condition within the same dataset. Specifically, in the analysis described below we swapped a cerebellum sample for a muscle sample.

2.6 Synthetic dataset

Synthetic datasets are commonly used to evaluate algorithms, including those for DS analysis, under controlled settings. Some works use actual sequencing of synthetically generated samples with specific transcripts spiked in at different concentrations. Using spiked transcripts helps avoid making simplifying assumptions on factors such as read distribution, but the sequenced samples may be very different from real-life samples. This limitation, along with the associated technical complexity and costs, leads most works that employ synthetic datasets to resort to in silico generated datasets. However, these datasets typically involve strong simplifying assumptions about expression levels, distribution of variation in and between the groups compared, read errors, coverage levels, the underlying known transcriptome, and variants that involve only predefined 'classical' AS event types such as cassette exons.

In order to create more realistic synthetic data for DS analysis, we again employed real data from biological replicates, and developed an LSV-based approach for in silico RNA-Seq generation. Specifically, given a set of real RNA-Seq samples, we create a matching synthetic dataset using the following procedure. First, we estimate each gene's coverage in each sample using standard FPKM. Next, we identify for each gene $g$ the most complex LSV $\ell(g)$ detected for it, using a user-configurable conservative threshold over STAR's junction spanning reads. Then, we randomly select a set of transcripts $G_g$ from a predefined transcriptome (e.g. Ensembl), each of which includes one of $\ell(g)$'s junctions. Finally, we assign each of $G_g$'s transcripts a relative abundance as a fraction of $g$'s total FPKM, based on the ratio of junction spanning reads that match each of $\ell(g)$'s junctions. With the resulting transcriptome and expression levels at hand, we then simulate reads using BEERS (Grant et al., 2011). More details can be found in the Supplementary Material.
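
As an illustration of the abundance-assignment step, the sketch below splits a gene's FPKM across the selected transcripts in proportion to their matching junction-spanning reads; all names are hypothetical, not from the actual pipeline.

```python
# An illustrative sketch of the abundance-assignment step in Section 2.6:
# a gene's total FPKM is split across the selected transcripts G_g in
# proportion to the junction-spanning reads matching each junction of its
# most complex LSV. Names are hypothetical.
def assign_abundances(gene_fpkm, junction_reads, transcripts):
    """junction_reads[j]: reads for junction j; transcripts[j]: the transcript
    chosen to carry junction j. Returns FPKM per transcript."""
    total = sum(junction_reads)
    return {tx: gene_fpkm * r / total
            for tx, r in zip(transcripts, junction_reads)}

# Example: 3 junctions with 30/50/20 reads split a 100-FPKM gene
print(assign_abundances(100.0, [30, 50, 20], ["tx1", "tx2", "tx3"]))
```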

Several key points are important to make regarding the above synthetic data generation scheme. First, expression levels and transcriptome variations are not based on a synthetic model but on real data. Second, the LSV formulation used here offers a way to define a lower bound on transcriptome complexity in each sample without resorting to assumptions about specific variations such as cassette exons. Third, we stress that the LSV detection and quantification is based directly on STAR junction-spanning reads and raw read ratios to avoid biasing estimations towards any specific algorithm. Finally, since each sample is modeled after a real experiment, no algorithm-specific assumptions are made about group-specific variations.

3 Results

To demonstrate the robustness of both the global (MAJIQ-gw) and local (MAJIQ-lw) weighting algorithms, we applied the synthetic perturbation procedure in Section 2.5 to one experiment in a group of three cerebellum samples from (Zhang et al., 2014), varying each of the three hyperparameters $\theta, \delta, \gamma$ while holding the other two fixed. We compared these new generalizations of MAJIQ to both the previous algorithm (MAJIQ-nw) run with the outlier sample and to a case where we assume some heuristic (e.g. PCA) was able to detect and remove the outlier before the old algorithm was run (MAJIQ-rm). Figure 3 shows the effect of varying each hyperparameter on the global weight associated with that experiment ($\log_{10}\rho_t$, left column), the number ($N_A$, middle column) of LSVs reported as differentially spliced with high confidence ($P(|\Delta\Psi| > 0.2) > 0.95$), and the RR ($RR_A(N_A)$, right column). At $\delta = 0.6, \gamma = 1$ (a–c, top row), the outlier's weight $\rho_t$ scales linearly in log scale with the fraction of LSVs perturbed ($\theta$), such that perturbing 10% is sufficient to drop $\rho_t$ below 0.1. Consequently, the original MAJIQ-nw detects ~900 additional events which are likely to be false positives, as reproducibility drops down to ~60%, while the $N_A$ and RR for both weighting algorithms (MAJIQ-gw and MAJIQ-lw) remain stable (Fig. 3b and c). Notably, the power of MAJIQ-lw to detect changing LSVs under control conditions of no outliers ($\theta = 0.0$) is approximately that of the old MAJIQ-nw and slightly exceeds that of MAJIQ-gw, possibly due to the heterogeneous nature of the replicates (different circadian time points). When outlier LSVs are added ($\theta > 0.0$), MAJIQ-gw and even more so MAJIQ-lw maintain robust RR while being able to identify more changing LSVs ($N_A$) compared with simply removing the sample as in MAJIQ-rm. At $\theta = 0.3, \gamma = 1$ (Fig. 3d–f, middle row), increasing $\delta$ initially causes the weight on the outlier to decrease towards a nonzero infimum. For larger $\delta > 0.5$, the original $N_{\text{MAJIQ-nw}}$ increases over 3-fold with a corresponding 45% drop in RR, indicating these are likely false positives. At $\theta = 0.3, \delta = 0.6$ (Fig. 3g–i, bottom row), decreasing $\gamma$ towards 0 causes the outlier weight to shrink below $10^{-8}$, suggesting that the algorithm is sensitive to a sample with very low read counts. Indeed, with very few reads the estimated $\Psi$ distribution does not vary significantly from the prior in most cases, and the sample has minimal effect on the posterior for the group as it contributes only a few reads. This is reflected in the fact that the original MAJIQ-nw is not affected either at such low coverage levels. Unsurprisingly, increasing the read rates to 150% (Fig. 3g–i at the x-axis value $\gamma = 1.5$) does not significantly affect the weight, which remains very low. It does, however, increase the unreliability of the original MAJIQ, more than doubling $N_{\text{MAJIQ-nw}}$ while nearly halving $RR_{\text{MAJIQ-nw}}$, again indicating a high FPR. In summary, we find that in all synthetic perturbation schemes we tested, both MAJIQ-gw and MAJIQ-lw were able to remain resistant to the perturbations. Both new algorithms, and MAJIQ-lw in particular, exhibited improved statistical power and detected more changing LSVs than MAJIQ-rm while maintaining the same level of reproducibility. In contrast, the previous MAJIQ-nw detected many more changing LSVs which were FP, as indicated by a sharp drop in RR.

Fig. 3. Synthetic perturbation of tissue replicates. In each test, three cerebella were compared against three livers from Zhang et al. (2014), but one of the cerebella was perturbed. The perturbation procedure is detailed in Section 2.5. MAJIQ-nw is the previous algorithm, equivalent to fixed weights ($\rho_t = 1$). MAJIQ-rm is a control case where we assume some heuristic (e.g. PCA) was able to detect the outlier and remove it before executing the previous fixed-weights MAJIQ. (a, d, g) Effect on $\rho_t$ for the perturbed 'outlier' (blue) and unperturbed replicates (Rep1, 2 in green and orange). (b, e, h) Effect on the number $N_A$ of events detected to have $P(|\Delta\Psi| > 0.20) > 0.95$ between the cerebellum and liver samples. (c, f, i) Effect on the $RR(N)$, defined in Section 2.3. (a–c) $\theta$, the fraction of LSVs perturbed, is varied between 0 and 0.5. At 0, the effect is the same as having no perturbation at all. (d–f) $\delta$, the maximal amount by which $\Psi$ is perturbed, is varied between 0 and 1. $\delta = 0$ is equivalent to no perturbation. (g–i) $\gamma$, the 'read scaling factor', is varied between 0 and 1.5. When this factor is 0, it is functionally equivalent to a global weight of 0

Given the above results, we decided to focus subsequent comparative analysis on MAJIQ with local weights (MAJIQ-lw) and the original algorithm (MAJIQ-nw). Using the procedure described in Section 2.6 we created a synthetic dataset based on real RNA-Seq samples from (Keane et al., 2011), involving mouse liver and hippocampus with 5–6 biological replicates per tissue. Following a common criterion in the RNA splicing field, we used $E[|\Delta\Psi|] \ge 20\%$ and $E[|\Delta\Psi|] \le 5\%$ to define the sets of events that are changing (CHG) and not changing (NO_CHG), respectively, between conditions. Any events which may be considered borderline ($20\% > E[|\Delta\Psi|] > 5\%$, denoted GREY) were removed from this analysis. We also tested another common threshold, $E[|\Delta\Psi|] \ge 10\%$, for defining changing events, with similar results (data not shown). We tested two configurations involving naive or realistic read simulation, and a third configuration where an outlier (swapped sample) was introduced. All tests included three samples in each group (see Supplementary Figures for a test involving the full set of five and six replicates). Each algorithm was evaluated based on its own definition of splicing variation. Figure 4 summarizes the results of this evaluation, with the outlier mostly affecting FNR and sensitivity. In all scenarios MAJIQ compared favorably, with MAJIQ-lw capable of detecting more true events with reduced FNR when an outlier was introduced.

Fig. 4. Evaluation using 'realistic' synthetic datasets. Each synthetic sample is created to match a real sample in terms of gene expression and a lower bound on transcriptome complexity. Three datasets were created: 'Naive', with naive read simulation (uniform coverage, no errors, biases or indels); 'Realistic', with more realistic read generation; 'Realistic + swap', where one sample is an outlier. All datasets involve three biological replicates per group. Each method was evaluated using its own definition of AS events, which means the number of events is not comparable between methods. CHG are changing events ($E[|\Delta\Psi|] \ge 20\%$), NO_CHG are non-changing events ($E[|\Delta\Psi|] \le 5\%$), and GREY are events for which $20\% > E[|\Delta\Psi|] > 5\%$ and on which the algorithm is not evaluated. DEXSeq is not included here as it is not able to return dPSI values

Next, we turned to evaluating the algorithms' performance on real data with and without a sample swap. In all cases, we used three cerebellum versus three liver samples for the validation set $S_2$. Set $S_1$ included either three cerebellum samples compared with three liver samples (control, no swap) or two cerebellum and one muscle sample compared with three livers (swap case). We chose to evaluate these tissues specifically because of the large number of reproducibly changing LSVs between them (Vaquero-Garcia et al., 2016), though similar results were observed for other tissues as well as for a different dataset from (Keane et al., 2011) (see Supplementary Figures). To produce the reproducibility curves discussed below, we used the following procedure. For MAJIQ, $N_A$ was defined as the set of events for which $P(|\Delta\Psi| \ge 0.2) > 0.95$, and events were ranked by $|E[\Delta\Psi]|$, as in (Vaquero-Garcia et al., 2016). rMATS was run with an equivalent threshold of 0.2, and for rMATS, SUPPA, and DEXSeq events were ranked by their respective P-values, with the significance threshold for all three methods set to the recommended P-value ≤ 0.05. We also tried ranking the events by $|E[\Delta\Psi]|$ (rMATS, SUPPA) and log2 fold change (DEXSeq), and plotting the RR graph without thresholding at P-value ≤ 0.05, which gave similar results (see Supplementary Figures). Finally, since DEXSeq is not geared for intron retention (IR), we excluded those events (see Supplementary Figures including IR).

Figure 5a and matching Supplementary Figures summarize the RR results for the control (no swap) and outlier (swap) cases. As noted in Section 2.3, one immediate observation is the huge difference in the number of events reported as significantly changing ($N_A$), a consequence of the vastly different statistical criteria and event definitions used by the various algorithms. Specifically, MAJIQ, which captures de novo junctions and may combine multiple classical AS events into a single complex LSV, detects 22 766 events and 15 336 non-overlapping ones, while rMATS reports 18 732 events and SUPPA, which was run with RefSeq according to the authors' recommendation, only detects 7770 events. rMATS and MAJIQ report roughly 9% of the events as differentially spliced, compared with 27% by SUPPA. DEXSeq, which measures expression of exonic segments, naturally reports many more units detected (75 871) and finds about 40% of those to be differentially included (see Supplementary Fig. S2e for details).

Fig. 5. Evaluation using real data. (a) RR plots for detection of DS events between cerebellum and liver, with (Swap, dark line) or without (Control, faded line) a mislabeled muscle sample. The end of each line marks the point in the graph matching the number of events reported as significantly changing ($RR(N_A)$, see main text). Because DEXSeq reports more than 3000 significantly changing exonic segments, we present its extended RR curve in Supplementary Figures. (b) Number of events reported by each method as significantly changing between two groups of biological replicates from the same condition (light color: liver, dark color: cerebellum). See Supplementary Material for equivalent plots when comparing groups of technical replicates. (c) The IIR, representing the ratio between the number of DS events reported when comparing biological or technical replicates ($N_{PFP}$) and the number of events reported when comparing similarly sized groups but from different conditions ($N_A$; see Section 2.3 for details). See Supplementary Figures for equivalent plots when $N_{PFP}$ is derived using technical replicates. (d) The same control and swap experimental setup as in (a), with accuracy assessed using 50 RT-PCR experiments from (Vaquero-Garcia et al., 2016). Values represent the fraction of variation explained ($R^2$) and the number of events detected in parentheses. DEXSeq is not included here since it does not output $\Psi$ values. Scatterplots are presented in Supplementary Figures. Mean and SD were derived by swapping out each of the cerebellum samples. (e, f) Repeating (a) and (d) but with subsets of the FASTQ files to test the effect of read coverage

As shown in Figure 5a, $RR(n)$ was typically noisy for low $n$, which is to be expected, but varied greatly between algorithms and between runs with or without an outlier. MAJIQ consistently exhibited higher reproducibility compared with the other algorithms regardless of $n$, with MAJIQ-lw and MAJIQ-nw practically tied in the control case with no outlier ($RR(N_A) = 83\%$, overlapping light purple and light blue curves). However, when an outlier was introduced, the reproducibility of the previous MAJIQ-nw dropped to 80% (dark blue), while MAJIQ-lw identified more differential LSVs (1410 versus 1426) with higher reproducibility (82%, dark purple). Importantly, the number of reported events did not change much for MAJIQ-lw when the outlier was introduced, dropping by only 7% (1428 versus 1328), while maintaining similar reproducibility. In comparison, other algorithms were much more affected by the introduction of the outlier. Their numbers of events $N_A$ dropped by between 40% (rMATS, DEXSeq) and 56% (SUPPA), while reproducibility dropped by between 3% (SUPPA) and 6% (rMATS, DEXSeq) at $RR(N_A)$. However, the $RR(N_A)$ numbers by themselves do not capture the entire picture. DEXSeq maintained an RR of around 70% for the control case with $n < 2000$ compared with its $RR(N_A)$ of only 60%, and all three algorithms suffered a drop of ~15–30% in reproducibility due to the outlier for the most high-ranking events.

Next, we assessed the number of events reported by each method as DS between groups of biological or technical replicates from the same two conditions (liver, cerebellum) used in Figure 5a. Figure 5b and c show the resulting putative false positives ($N_{PFP}$) and IIR ($IIR = \frac{N_{PFP}}{N_A}$) for each algorithm for biological replicates. See Supplementary Figures for a similar plot using technical replicates, which results in lower IIR, as expected. Overall, MAJIQ maintained significantly lower $N_{PFP}$ and IIR, with DEXSeq exhibiting a particularly high IIR of over 20%.

Using 50 DS events quantified by Vaquero-Garcia et al. (2016) for the same two tissues from Zhang et al. (2014), we quantified the correlation between the RNA-Seq based $E[\Delta\Psi]$ and the RT-PCR quantifications, with and without an outlier sample. Figure 5d summarizes the accuracy of the various algorithms, measured as the fraction of variation explained ($R^2$). We found MAJIQ-lw, MAJIQ-nw, and rMATS gave similar $R^2$ values in the control case, with MAJIQ-lw performing the best in the outlier (swap) case. The lower number of events detected by SUPPA may be due to its usage of RefSeq-only transcripts, as recommended by its authors, but may also reflect selection bias, as (Vaquero-Garcia et al., 2016) originally selected events detectable by both rMATS and MAJIQ in order to compare these methods. DEXSeq was not included in this analysis since it is not able to produce PSI or dPSI estimates.

Finally, we tested the effect of read coverage levels on the various algorithms. In general, all methods take into account the number of reads that hit an AS event when assessing its statistical significance. Consequently, we do not expect to see a dramatic effect on the reproducibility of events deemed significantly changing (RR), but rather a significant reduction in the number of events detected as differentially spliced ($N_A$). Unlike other methods, MAJIQ relies solely on junction-spanning reads and is therefore likely to suffer more in detection power if its default parameters are not adjusted accordingly. We therefore tested two different MAJIQ execution configurations. The first is MAJIQ's default settings, which are particularly conservative. These settings were designed for high reproducibility, assuming multiple replicates with an average coverage of ~60 million reads and a read length of 100 bases, as in (Zhang et al., 2014). The second configuration, denoted 'relaxed', drops the quantifiable filter restrictions to allow the detection of junctions in a single experiment with as few as two reads mapping to two or more unique positions across the junction. We note that the other algorithms we analyzed do not have such coverage-specific user-adjustable parameters, as these do not rely solely on junction-spanning reads and, as we show, are less sensitive to changes in coverage.

By randomly sub-sampling the FASTQ files from (Zhang et al., 2014), read coverage was varied from the original average of 80 million reads (denoted 100%) through 40 million (50%) to 20 million (25%). The RR and detection power of each method evaluated are reported in Figure 5e, and correlations with RT-PCR quantifications in Figure 5f. In general, all methods lose power to detect differentially spliced events as coverage drops. As expected, MAJIQ run with default parameters is the most sensitive to sparse coverage, with detected events dropping by up to 75%. When run with the relaxed parameters, MAJIQ-lw loses 50%, while rMATS, SUPPA, and DEXSeq lose 37, 32 and 48%, respectively. However, MAJIQ's conservative settings pay off in consistently high reproducibility of 79–83%, compared with 61–70% for MAJIQ-lw with relaxed settings, and generally lower values for the other algorithms. Similarly, when evaluating IIR, we found MAJIQ-lw's relaxed settings slightly increased the IIR, by about 4%, while detecting many more events overall (see Supplementary Figure). When evaluating the effect of sparse read coverage using the 50 RT-PCR experiments (Fig. 5f), we find a similar picture. It is worth noting that these 50 events were selected by (Vaquero-Garcia et al., 2016) to cover a wide range of read coverage levels across splicing events. Here, SUPPA and rMATS maintain the same number of detected events for all coverage levels and similar $R^2$ values. For MAJIQ, both the default and the relaxed settings maintain high $R^2$ values which compare favorably to the other methods, but the default settings lose 62% of the events (19 versus 50).

4 Discussion

In this article, we developed a new model to automatically detect and downweight outliers in RNA-Seq datasets with replicates for DS analysis. The problem of detecting outliers in batches of biological replicates has not received much attention in the literature. As a result, researchers may include outliers that might affect their results, or discard samples before publication based on some heuristic such as PCA (Conesa et al., 2016). Using such a heuristic to discard samples may also reflect unconscious bias or cause good samples to be lost. Instead, we proposed a model for what constitutes an outlier sample in splicing analysis. This model is supported by our software, MAJIQ, so users can evaluate possible outliers regardless of which pipeline they use.

In addition, we developed an outlier weighting scheme and incorporated it into MAJIQ's model either at the global, per-sample level (MAJIQ-gw) or locally at the LSV level (MAJIQ-lw). Using both synthetic data perturbation and real data, we demonstrated that both MAJIQ-gw and MAJIQ-lw can maintain high levels of reproducibility when confronted with outliers. Furthermore, even when a sample is an outlier, using the local weight model (MAJIQ-lw) can help gain detection power by retaining LSVs which are 'well-behaved'. In addition to sensitivity to outliers, we showed that low read coverage can also affect MAJIQ's detection power, as it relies on junction spanning reads. However, by adjusting MAJIQ's default parameters, the new MAJIQ-lw was able to maintain high detection power with a moderate price in terms of reproducibility and correlation to RT-PCR, comparing favorably to other algorithms. Given that there is a natural trade-off between detection power and reproducibility, we recommend users decide between default and relaxed parameters based on their needs and dataset specifics.

Another contribution of this work is in developing a framework to tackle the challenge of evaluating DS algorithms based on both real and synthetic data. This work is certainly not the first to tackle this challenge, but most previous work focused on synthetic data using simplifying assumptions about binary AS events or transcript expression levels (Drewe et al., 2013; Entizne et al., 2017; Shen et al., 2014). Instead, we utilize biological replicates and create technical replicates by dividing the original FASTQ files. Using replicates, we assess algorithms via RR plots for DS events between different conditions and by computing the IIR. The latter allows us to assess putative false positives with respect to DS between conditions and was in good qualitative agreement with the FDR values we obtained from the synthetic dataset. We also use an extensive set of RT-PCR experiments from Vaquero-Garcia et al. (2016) to compute the $R^2$ correlation to the algorithms' $E[\Delta\Psi]$ estimations from the matching RNA-Seq data. RT-PCR experiments are the gold standard in the RNA field and are indispensable for validation. In comparison, qPCR, used in some other works (Drewe et al., 2013), can help validate significant splicing variations but is generally not considered suitable for quantitative assessment of splicing changes. RT-PCR can serve to measure PSI values as done here, but it must be executed and quantified carefully for this. For example, experiments need to be executed in triplicates with careful reading of the gel bands. Also, PCR primers need to be carefully mapped to each method's quantified event for valid comparisons, and past studies have mostly focused on classical AS event types rather than the more complex ones (Vaquero-Garcia et al., 2016). In practice, much of the RT-PCR data available in publications has been produced only for validating significant splicing changes and as such may not be suitable for quantitative assessment of DS algorithms. Given these limitations, we do not recommend automatic pulling of all available RT-PCR quantifications across publications for assessing DS algorithms. Nonetheless, developing good quality sets of RT-PCR validations is important for DS algorithm assessment and we hope the work presented here will help achieve this goal.

Although the new version of MAJIQ compared favorably to other algorithms in the experiments surveyed here, users should keep in mind that different datasets may suffer from other types of noise or biases than those tested here. It is therefore advisable for potential users to always test DS algorithms using the evaluation criteria introduced here, including reproducibility plots, same condition group comparisons, and RT-PCR. To aid such evaluations, we made all the scripts used for this work available at majiq.biociphers.org. Similarly, we also made all datasets, including the synthetic data, available for developers to test against.

Another important point to consider is that the methods tested here differ greatly in the set of features they offer. MAJIQ is the only one that offers the ability to detect complex splicing variations involving more than two alternative junctions, coupled with interactive visualization and genome browser connectivity. It is also capable of supplementing a given transcriptome annotation with reliable de novo junctions detected in the RNA-Seq data, making it less dependent on the exact annotation. Although useful even for normal tissues (Vaquero-Garcia et al., 2016), this feature is particularly relevant for disease studies, cases where uncharacteristic splicing is expected, and species with poorly annotated transcriptomes. Although some software offers detection of de novo junctions between known splice sites, MAJIQ is able to detect completely novel exons and junctions.

Notably, MAJIQ's detection of novel exons and junctions comes with a price of algorithmic complexity. MAJIQ first runs a builder which constructs de novo splice graphs per gene, then quantifies the detected LSVs in a second step. Importantly, the builder can be executed once and then used in multiple comparisons. This construction complicates comparisons of time and memory to the other algorithms, making them highly context specific. For example, the analysis presented here involved 15 RNA-Seq samples, divided into groups of three samples each, and a total of five pairwise comparisons. For this, MAJIQ used ~200 MB of memory per node, making it suitable for running on anything from a laptop to an HPC cluster. Running MAJIQ with 16 processes on a machine with 32 Intel Xeon 2.4 GHz cores and 64 GB of memory took about 3 h for the builder and another 40 min for the pairwise comparisons (each of which can be run in parallel). SUPPA and rMATS-turbo are significantly faster, taking a total of ~35 min (SUPPA) and 1 h (rMATS-turbo) with a similar amount of memory consumption. As for DEXSeq, we found it to be significantly slower and memory heavy, taking up to 1 GB of memory and over 6 h to finish. In addition, the fact that DEXSeq does not report PSI or dPSI makes it less suitable for quantitative splicing-focused research.

We note the algorithms surveyed here are by no means a complete list of DS methods and other algorithms may offer specific desirable features. For example, VAST-TOOLS (Irimia et al., 2014) offers sensitive detection of micro-exons, and DiffSplice (Hu et al., 2013) offers detection of AS modules in each gene’s splice graph.

There are several important directions in which this work can be extended. First, MAJIQ can be further improved both in terms of memory consumption and running time. More importantly, all the algorithms compared here were designed for datasets with small sets of biological replicates. Although we saw that outlier detection can work even for non-biological replicates (see the analysis of $\rho_t$ using GTEx samples in Supplementary Figures), we note that large heterogeneous datasets, such as those created in cancer studies or GTEx (GTEx Consortium, 2013), are likely to benefit from different statistical models. Finally, MAJIQ's improved quantifications can be used to subsequently derive new models for splicing codes and splicing predictions given genetic variations (Barash et al., 2010, 2013; Xiong et al., 2015). Such improvements form a promising path for future algorithm development.

Supplementary Material

Supplementary Data

Acknowledgements

We would like to thank an anonymous ISMB 2017 reviewer for helpful suggestions, as well as Matthew R. Gazzara and Anupama Jha from the BioCiphers Lab for helpful comments and suggestions.

Funding

This work has been supported in part by the Penn Institute for Biomedical Informatics Pilot [Grant R01 AG046544 to Y.B.].

Conflict of Interest: none declared.

References

