What is RNA-seq analysis?

The reward of standardizing analysis protocols as well as RNA-seq data will be to endow the research community with powerful instruments for understanding the complexity of transcription and, in turn, to facilitate the development of personalized expression-based biomarker panels for use at every stage of the therapeutic pathway.

The main goal of this Research Topic is to dissect the RNA-seq process, from data production and validation to the analysis and extraction of new knowledge, elucidating weaknesses and opportunities and proposing new approaches and protocols.

To this end, both in silico data analyses and in vitro experiments can contribute to improving protocols and, in turn, help RNA-seq become a mature technology.

Longer reads improve mappability and transcript identification [5, 6]. The best sequencing option depends on the analysis goals. The cheaper, short single-end (SE) reads are normally sufficient for studies of gene expression levels in well-annotated organisms, whereas longer and paired-end (PE) reads are preferable for characterizing poorly annotated transcriptomes.

Another important factor is sequencing depth or library size, which is the number of sequenced reads for a given sample. More transcripts will be detected and their quantification will be more precise as the sample is sequenced to a deeper level [ 1 ].

Nevertheless, optimal sequencing depth again depends on the aims of the experiment. While some authors argue that as few as five million mapped reads are sufficient to accurately quantify medium to highly expressed genes in most eukaryotic transcriptomes, others sequence up to 100 million reads to precisely quantify genes and transcripts that have low expression levels [7].

When studying single cells, which have limited sample complexity, quantification is often carried out with just one million reads but may be done reliably for highly expressed genes with as few as 50,000 reads [8]; even 20,000 reads have been used to differentiate cell types in splenic tissue [9].

Moreover, optimal library size depends on the complexity of the targeted transcriptome. Experimental results suggest that deep sequencing improves quantification and identification but might also result in the detection of transcriptional noise and off-target transcripts [10]. Saturation curves can be used to assess the improvement in transcriptome coverage to be expected at a given sequencing depth [10].
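To make the saturation-curve idea concrete, the following is a minimal sketch that simulates a pool of read-to-gene assignments rather than using a real alignment file; in practice one would subsample actual alignments at increasing depths and count the genes passing a detection threshold. The gene count, abundance model, and threshold are assumptions for illustration only.

```python
# A minimal saturation-curve sketch over a simulated transcriptome.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical transcriptome: 20,000 genes with log-normally distributed
# abundance, so a few genes soak up most reads (as in real libraries).
n_genes = 20_000
abundance = rng.lognormal(mean=0.0, sigma=2.0, size=n_genes)
probs = abundance / abundance.sum()

def genes_detected(depth, min_reads=5):
    """Draw `depth` reads and count genes receiving >= min_reads reads."""
    counts = rng.multinomial(depth, probs)
    return int((counts >= min_reads).sum())

for depth in [1e6, 5e6, 10e6, 25e6, 50e6]:
    print(f"{int(depth):>10,} reads -> {genes_detected(int(depth)):,} genes detected")
```

The diminishing gain in detected genes at higher depths is the saturation behavior the text describes.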

Finally, a crucial design factor is the number of replicates. The number of replicates that should be included in an RNA-seq experiment depends on both the amount of technical variability in the RNA-seq procedures and the biological variability of the system under study, as well as on the desired statistical power, that is, the capacity for detecting statistically significant differences in gene expression between experimental groups.

These two aspects are part of power analysis calculations. Adequate planning of sequencing experiments so as to avoid technical biases is as important as good experimental design, especially when the experiment involves a large number of samples that need to be processed in several batches. In this case, including controls, randomizing sample processing, and smart management of sequencing runs are crucial to obtaining error-free data.

The actual analysis of RNA-seq data has as many variations as there are applications of the technology. In this section, we address all of the major analysis steps for a typical RNA-seq experiment, which involve quality control, read alignment with and without a reference genome, obtaining metrics for gene and transcript expression, and approaches for detecting differential gene expression. We also discuss analysis options for applications of RNA-seq involving alternative splicing, fusion transcripts and small RNA expression.

Finally, we review useful packages for data visualization. The acquisition of RNA-seq data consists of several steps: obtaining raw reads, read alignment, and quantification. At each of these steps, specific checks should be applied to monitor the quality of the data.

Quality control for the raw reads involves the analysis of sequence quality, GC content, the presence of adaptors, overrepresented k-mers, and duplicated reads in order to detect sequencing errors, PCR artifacts, or contamination. Acceptable duplication, k-mer, or GC content levels are experiment- and organism-specific, but these values should be homogeneous for samples in the same experiment.

Software tools such as the FASTX-Toolkit [ 13 ] and Trimmomatic [ 14 ] can be used to discard low-quality reads, trim adaptor sequences, and eliminate poor-quality bases. Reads are typically mapped to either a genome or a transcriptome, as will be discussed later. An important mapping quality parameter is the percentage of mapped reads, which is a global indicator of the overall sequencing accuracy and of the presence of contaminating DNA. When reads are mapped against the transcriptome, we expect slightly lower total mapping percentages because reads coming from unannotated transcripts will be lost, and significantly more multi-mapping reads because of reads falling onto exons that are shared by different transcript isoforms of the same gene.
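As an illustration of the raw-read checks described above, here is a minimal sketch (not a substitute for dedicated tools such as FastQC) that computes mean GC content and the duplicate-read fraction from a FASTQ file; the file name is a placeholder.

```python
# Basic FASTQ QC: mean GC content and duplicate-read fraction.
import gzip
from collections import Counter

def basic_fastq_qc(path):
    opener = gzip.open if path.endswith(".gz") else open
    seq_counts = Counter()
    gc, total = 0, 0
    with opener(path, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:                 # sequence lines in 4-line FASTQ records
                seq = line.strip()
                seq_counts[seq] += 1
                gc += seq.count("G") + seq.count("C")
                total += len(seq)
    n_reads = sum(seq_counts.values())
    dup_fraction = 1 - len(seq_counts) / n_reads
    return gc / total, dup_fraction

gc_content, dup = basic_fastq_qc("sample_R1.fastq.gz")   # placeholder file name
print(f"GC content: {gc_content:.1%}, duplicate reads: {dup:.1%}")
```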

Other important parameters are the uniformity of read coverage on exons and the mapped strand. Once actual transcript quantification values have been calculated, they should be checked for GC content and gene length biases so that corrective normalization methods can be applied if necessary. If the reference transcriptome is well annotated, researchers can also analyze the biotype composition of the sample, which is indicative of the quality of the RNA purification step.

The quality-control steps described above involve individual samples. In addition, it is also crucial to assess the global quality of the RNA-seq dataset by checking on the reproducibility among replicates and for possible batch effects.

If gene expression differences exist among experimental conditions, it should be expected that biological replicates of the same condition will cluster together in a principal component analysis (PCA).
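A minimal sketch of such a replicate check, assuming a tab-separated genes-by-samples matrix of normalized counts; the file name is a placeholder, and the log transform is one common choice, not the only one.

```python
# PCA of samples to check that biological replicates cluster together.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

counts = pd.read_csv("normalized_counts.tsv", sep="\t", index_col=0)  # genes x samples
X = np.log2(counts.T + 1)          # samples x genes, log-transformed

pca = PCA(n_components=2)
coords = pca.fit_transform(X)

for sample, (pc1, pc2) in zip(counts.columns, coords):
    print(f"{sample}: PC1={pc1:.2f}, PC2={pc2:.2f}")
print("explained variance ratios:", pca.explained_variance_ratio_)
```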

When a reference genome is available, RNA-seq analysis will normally involve the mapping of the reads onto the reference genome or transcriptome to infer which transcripts are expressed. Mapping solely to the reference transcriptome of a known species precludes the discovery of new, unannotated transcripts and focuses the analysis on quantification alone.

By contrast, if the organism does not have a sequenced genome, then the analysis path is first to assemble reads into longer contigs and then to treat these contigs as the expressed transcriptome to which reads are mapped back again for quantification.

In either case, read coverage can be used to quantify transcript expression level. A basic choice is whether transcript identification and quantification are done sequentially or simultaneously. Two alternatives are possible when a reference sequence is available: mapping to the genome or mapping to the annotated transcriptome. Regardless of whether a genome or transcriptome reference is used, reads may map uniquely (assigned to only one position in the reference) or be multi-mapped reads (multireads).

Genomic multireads are primarily due to repetitive sequences or shared domains of paralogous genes. They normally account for a significant fraction of the mapping output when mapped onto the genome and should not be discarded.

When the reference is the transcriptome, multi-mapping arises even more often because a read that would have been uniquely mapped on the genome would map equally well to all gene isoforms in the transcriptome that share the exon. In either case — genome or transcriptome mapping — transcript identification and quantification become important challenges for alternatively expressed genes.

[Figure: Read mapping and transcript identification strategies. Three basic strategies for regular RNA-seq analysis: (a) reads are mapped to an annotated genome, (novel) transcript discovery and quantification can proceed with or without an annotation file, and novel transcripts are then functionally annotated; (b) reads are mapped to the reference transcriptome, so transcript identification and quantification can occur simultaneously; (c) reads are assembled de novo and, for quantification, mapped back to the novel reference transcriptome, with further analysis proceeding as in (b) followed by the functional annotation of the novel transcripts as in (a). Representative software that can be used at each analysis step is indicated in bold text.]

Identifying novel transcripts using the short reads provided by Illumina technology is one of the most challenging tasks in RNA-seq.

Short reads rarely span across several splice junctions and thus make it difficult to directly infer all full-length transcripts. In any case, PE reads and higher coverage help to reconstruct lowly expressed transcripts, and replicates are essential to resolve false-positive calls (that is, mapping artifacts or contaminations) at the low end of signal detection. Several methods, such as Cufflinks [23], iReckon [24], SLIDE [25] and StringTie [26], incorporate existing annotations by adding them to the possible list of isoforms.

Montebello [ 27 ] couples isoform discovery and quantification using a likelihood-based Monte Carlo algorithm to boost performance. Gene-finding tools such as Augustus [ 28 ] can incorporate RNA-seq data to better annotate protein-coding transcripts, but perform worse on non-coding transcripts [ 29 ].

In general, accurate transcript reconstruction from short reads is difficult, and methods typically show substantial disagreement [29]. When a reference genome is not available or is incomplete, RNA-seq reads can be assembled de novo. PE strand-specific sequencing and long reads are preferred because they are more informative [33]. Although it is impossible to assemble lowly expressed transcripts that lack enough coverage for a reliable assembly, too many reads are also problematic because they lead to potential misassembly and increased runtimes.

Therefore, in silico reduction of the number of reads is recommended for deeply sequenced samples [33]. For comparative analyses across samples, it is advisable to combine all reads from multiple samples into a single input in order to obtain a consolidated set of contigs (transcripts), followed by mapping back of the short reads for expression estimation [33].

Either with a reference or de novo, the complete reconstruction of transcriptomes using short-read Illumina technology remains a challenging problem, and in many cases de novo assembly results in tens or hundreds of contigs accounting for fragmented transcripts. The most common application of RNA-seq is to estimate gene and transcript expression. This application is primarily based on the number of reads that map to each transcript sequence, although there are algorithms such as Sailfish that rely on k -mer counting in reads without the need for mapping [ 34 ].

The simplest approach to quantification is to aggregate raw counts of mapped reads using programs such as HTSeq-count [ 35 ] or featureCounts [ 36 ].

This gene-level (rather than transcript-level) quantification approach utilizes a gene transfer format (GTF) file [37] containing the genome coordinates of exons and genes, and often discards multireads.
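The counting logic can be illustrated with a toy sketch, assuming alignments have already been reduced to (read, gene) pairs; like the behavior described above, it discards reads assigned to more than one gene rather than counting them. This is an illustration only, not the actual implementation of HTSeq-count or featureCounts.

```python
# Toy gene-level counting that discards multireads.
from collections import defaultdict, Counter

def count_reads(assignments):
    """assignments: iterable of (read_id, gene_id) pairs."""
    genes_per_read = defaultdict(set)
    for read_id, gene_id in assignments:
        genes_per_read[read_id].add(gene_id)
    counts = Counter()
    for read_id, genes in genes_per_read.items():
        if len(genes) == 1:                # keep uniquely assigned reads only
            counts[genes.pop()] += 1
    return counts

example = [("r1", "GENE_A"), ("r2", "GENE_A"), ("r2", "GENE_B"), ("r3", "GENE_B")]
print(count_reads(example))    # Counter({'GENE_A': 1, 'GENE_B': 1}); r2 is discarded
```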

Raw read counts alone are not sufficient to compare expression levels among samples, as these values are affected by factors such as transcript length, total number of reads, and sequencing biases. The measure RPKM (reads per kilobase of exon model per million reads) [1] is a within-sample normalization method that removes the feature-length and library-size effects. This measure and its subsequent derivatives, FPKM (fragments per kilobase of exon model per million mapped reads), a within-sample normalized transcript expression measure analogous to RPKM, and TPM (transcripts per million), are the most frequently reported RNA-seq gene expression values.
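The two measures can be written down in a few lines; the following is a minimal sketch for a single sample, with counts and exon-model lengths as arrays. Note how TPM length-normalizes first and then scales, so the values always sum to one million.

```python
# Within-sample expression measures: RPKM and TPM.
import numpy as np

def rpkm(counts, lengths_bp):
    """Reads per kilobase of exon model per million mapped reads."""
    per_million = counts.sum() / 1e6
    return counts / (lengths_bp / 1e3) / per_million

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then rescale so
    the values sum to one million (the composition-aware denominator)."""
    rate = counts / lengths_bp
    return rate / rate.sum() * 1e6

counts = np.array([500, 500, 1000])
lengths = np.array([1_000, 2_000, 1_000])
print(rpkm(counts, lengths))    # depends on library size
print(tpm(counts, lengths))     # always sums to 1e6
```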

The dichotomy of within-sample and between-sample comparisons has led to a lot of confusion in the literature. Correcting for gene length is not necessary when comparing changes in gene expression within the same gene across samples, but it is necessary for correctly ranking gene expression levels within the sample to account for the fact that longer genes accumulate more reads. Furthermore, programs such as Cufflinks that estimate gene length from the data can find significant differences in gene length between samples that cannot be ignored.

TPMs, which effectively normalize for the differences in transcript composition in the denominator rather than simply dividing by the number of reads in the library, are considered more comparable between samples of different origins and composition, but can still suffer from some biases.

These biases must be addressed with normalization techniques such as TMM (trimmed mean of M-values). Cufflinks [39] estimates transcript expression from a mapping to the genome obtained with mappers such as TopHat, using an expectation-maximization approach that estimates transcript abundances. This approach takes into account biases such as the non-uniform read distribution along the gene length.

Cufflinks was designed to take advantage of PE reads, and may use GTF information to identify expressed transcripts, or can infer transcripts de novo from the mapping data alone. Such methods allocate multi-mapping reads among transcripts and output within-sample normalized values corrected for sequencing biases [35, 41, 43]. NURD [44] provides an efficient way of estimating transcript expression from SE reads with a low memory and computing cost.

Differential expression analysis requires that gene expression values be compared among samples. RPKM, FPKM, and TPM normalize away the most important factor for comparing samples, which is sequencing depth, whether directly or by accounting for the number of transcripts, which can differ significantly between samples. These approaches rely on normalizing methods that are based on total or effective counts, and tend to perform poorly when samples have heterogeneous transcript distributions, that is, when highly and differentially expressed features can skew the count distribution [45, 46].
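TMM, mentioned above, is one remedy for such composition skew: it computes a scaling factor from a trimmed mean of per-gene log-fold changes against a reference sample, so a minority of highly expressed genes cannot dominate. The sketch below is a simplified illustration of that idea only; the edgeR implementation additionally trims on absolute abundance and uses precision weights.

```python
# Simplified TMM-style scaling factor (illustration only).
import numpy as np

def tmm_factor(sample, reference, trim=0.3):
    """Scaling factor for `sample` relative to `reference` (raw counts)."""
    keep = (sample > 0) & (reference > 0)        # genes observed in both
    p_s = sample[keep] / sample.sum()
    p_r = reference[keep] / reference.sum()
    m = np.log2(p_s / p_r)                       # per-gene log-fold changes
    lo, hi = np.quantile(m, [trim, 1 - trim])
    trimmed = m[(m >= lo) & (m <= hi)]           # drop the extreme genes
    return 2 ** trimmed.mean()

rng = np.random.default_rng(1)
ref = rng.poisson(50, size=5_000)
samp = rng.poisson(50, size=5_000) * 2           # same composition, 2x depth
samp[:10] *= 100                                 # a few hugely DE genes
print(tmm_factor(samp, ref))                     # < 1: corrects the composition skew
```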

Additional factors that interfere with intra-sample comparisons include changes in transcript length across samples or conditions [50], positional biases in coverage along the transcript (which are accounted for in Cufflinks), average fragment size [43], and the GC content of genes (corrected in the EDASeq package [21]).

The NOISeq R package [ 20 ] contains a wide variety of diagnostic plots to identify sources of biases in RNA-seq data and to apply appropriate normalization procedures in each case.

Finally, despite these sample-specific normalization methods, batch effects may still be present in the data. They can be corrected with batch-removal approaches which, although initially developed for microarray data, have been shown to work well with normalized RNA-seq data (STATegra project, unpublished).

As RNA-seq quantification is based on read counts that are absolutely or probabilistically assigned to transcripts, the first approaches to compute differential expression used discrete probability distributions, such as the Poisson or negative binomial [48, 54]. The negative binomial distribution (also known as the gamma-Poisson distribution) is a generalization of the Poisson distribution, allowing for additional variance (called overdispersion) beyond the variance expected from randomly sampling from a pool of molecules; this extra variance is characteristic of RNA-seq data.
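A quick simulation makes the overdispersion visible, assuming the common parameterization Var = mu + alpha * mu^2 and converting to numpy's (n, p) parameters; the values chosen are arbitrary.

```python
# Poisson vs negative binomial: same mean, very different variance.
import numpy as np

rng = np.random.default_rng(42)
mu, alpha, n_reps = 100.0, 0.2, 200_000

poisson = rng.poisson(mu, n_reps)

# numpy parametrizes the NB by (n, p); convert from (mu, alpha):
n = 1 / alpha
p = n / (n + mu)
nbinom = rng.negative_binomial(n, p, n_reps)

print("Poisson:  mean %.1f  var %.1f" % (poisson.mean(), poisson.var()))
print("NegBinom: mean %.1f  var %.1f (expected %.1f)"
      % (nbinom.mean(), nbinom.var(), mu + alpha * mu**2))
```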

However, the use of discrete distributions is not required for accurate analysis of differential expression as long as the sampling variance of small read counts is taken into account (most important for experiments with small numbers of replicates). Methods for transforming normalized counts of RNA-seq reads while learning the variance structure of the data have been shown to perform well in comparison to the discrete distribution approaches described above [55, 56].

Moreover, after extensive normalization (including TMM and batch removal), the data might have lost their discrete nature and be more akin to a continuous distribution. Some methods, such as the popular edgeR [57], take raw read counts as input and introduce possible bias sources into the statistical model to perform an integrated normalization as well as a differential expression analysis.

In other methods, the differential expression requires the data to be previously normalized to remove all possible biases. DESeq2, like edgeR, uses the negative binomial as the reference distribution and provides its own normalization approach [ 48 , 58 ]. Other approaches include data transformation methods that take into account the sampling variance of small read counts and create discrete gene expression distributions that can be analyzed by regular linear models [ 55 ].

Finally, non-parametric approaches such as NOISeq [10] or SAMseq [61] make minimal assumptions about the data and estimate the null distribution for inferential analysis from the actual data alone. For small-scale studies that compare two samples with no or few replicates, the estimation of the negative binomial distribution can be noisy. In such cases, simpler methods based on the Poisson distribution, such as DEGseq [62], or on empirical distributions (NOISeq [10]) can be an alternative, although it should be strongly stressed that, in the absence of biological replication, no population inference can be made and hence any p value calculation is invalid.

Methods that analyze RNA-seq data without replicates therefore only have exploratory value. Considering the drop in price of sequencing, we recommend that RNA-seq experiments have a minimum of three biological replicates when sample availability is not limiting to allow all of the differential expression methods to leverage reproducibility between replicates.

Recent independent comparison studies have demonstrated that the choice of the method or even the version of a software package can markedly affect the outcome of the analysis and that no single method is likely to perform favorably for all datasets [ 56 , 63 , 64 ] Box 4. We therefore recommend thoroughly documenting the settings and version numbers of programs used and considering the repetition of important analyses using more than one package.

Transcript-level differential expression analysis can potentially detect changes in the expression of transcript isoforms from the same gene, and specific algorithms for alternative splicing-focused analysis using RNA-seq have been proposed. These methods fall into two major categories. The first approach integrates isoform expression estimation with the detection of differential expression to reveal changes in the proportion of each isoform within the total gene expression.

One such early method, BASIS, used a hierarchical Bayesian model to directly infer differentially expressed transcript isoforms [ 65 ]. CuffDiff2 estimates isoform expression first and then compares their differences.

By integrating the two steps, the uncertainty in the first step is taken into consideration when performing the statistical analysis to look for differential isoform expression [ 66 ]. The flow difference metric FDM uses aligned cumulative transcript graphs from mapped exon reads and junction reads to infer isoforms and the Jensen-Shannon divergence to measure the difference [ 67 ]. Recently, Shi and Jiang [ 68 ] proposed a new method, rSeqDiff, that uses a hierarchical likelihood ratio test to detect differential gene expression without splicing change and differential isoform expression simultaneously.

All these approaches are generally hampered by the intrinsic limitations of short-read sequencing for accurate identification at the isoform level, as discussed in the RNA-seq Genome Annotation Assessment Project paper [30]. The second, exon-based approach skips the estimation of isoform expression; it is based on the premise that differences in isoform expression can be tracked in the signals of exons and their junctions. DEXseq [69] and DSGSeq [70] adopt a similar idea to detect differentially spliced genes by testing for significant differences in read counts on exons and junctions of the genes.

DiffSplice uses alignment graphs to identify alternative splicing modules (ASMs) and identifies differential splicing using signals of the ASMs [73]. The advantage of exon or junction methods is their greater accuracy in identifying individual alternative splicing events. Exon-based methods are appropriate if the focus of the study is not on whole isoforms but on the inclusion and exclusion of specific exons and the functional protein domains (or regulatory features, in the case of untranslated region exons) that they contain.

Turning to the visualization of RNA-seq data: some visualization tools are specifically designed for visualizing multiple RNA-seq samples, such as RNAseqViewer [79], which provides flexible ways to display the read abundances on exons, transcripts, and junctions. Introns can be hidden to better display signals on the exons, and heatmaps can help the visual comparison of signals on multiple samples (Figure S1b, c in Additional file 1).

Some of the software packages for differential gene expression analysis such as DESeq2 or DEXseq in Bioconductor have functions to enable the visualization of results, whereas others have been developed for visualization-exclusive purposes, such as CummeRbund for CuffDiff [ 66 ] or Sashimi plots, which can be used to visualize differentially spliced exons [ 80 ].

The advantage of Sashimi plots is that their display of junction reads is more intuitive and aesthetically pleasing when the number of samples is small (Figure S1d in Additional file 1). Sashimi, structure, and hive plots for splicing quantitative trait loci (sQTL) can be obtained using SplicePlot [81]. Splice graphs can be produced using SpliceSeq [82], and SplicingViewer [83] plots splice junctions and alternative splicing events. TraV [84] is a visualization tool that integrates data analysis, but its analytical methods are not applicable to large genomes.

Owing to the complexity of transcriptomes, efficient display of multiple layers of information is still a challenge. All of the tools are evolving rapidly and we can expect more comprehensive tools with desirable features to be available soon. Users should visualize changes in read coverage for genes that are deemed important or interesting on the basis of their analysis results to evaluate the robustness of their conclusions.

The discovery of fused genes that can arise from chromosomal rearrangements is analogous to novel isoform discovery, with the added challenge of a much larger search space as we can no longer assume that the transcript segments are co-linear on a single chromosome. Artifacts are common even using state-of-the-art tools, which necessitates post-processing using heuristic filters [ 85 ]. Artifacts primarily result from misalignment of read sequences due to polymorphisms, homology, and sequencing errors.

Families of homologous genes, and highly polymorphic genes such as the HLA genes, produce reads that cannot be easily mapped uniquely to their location of origin in the reference genome. For genes with very high expression, the small but non-negligible sequencing error rate of RNA-seq will produce reads that map incorrectly to homologous loci.

Filtering highly polymorphic genes and pairs of homologous genes is recommended [ 86 , 87 ]. Also recommended is the filtering of highly expressed genes that are unlikely to be involved in gene fusions, such as ribosomal RNA [ 86 ].

Finally, a low ratio of chimeric to wild-type reads in the vicinity of the fusion boundary may indicate spurious mis-mapping of reads from a highly expressed gene (the transcript allele fraction described by Yoshihara et al.). Given successful prediction of chimeric sequences, the next step is the prioritization of gene fusions that have biological impact over more expected forms of genomic variation. Examples of expected variation include immunoglobulin (IG) rearrangements in tumor samples infiltrated by immune cells, transiently expressed transposons and nuclear mitochondrial DNA, and read-through chimeras produced by co-transcription of adjacent genes [88].

Care must be taken with filtering in order not to lose events of interest. For example, removing all fusions involving an IG gene may remove real IG fusions in lymphomas and other blood disorders; filtering fusions for which both genes are from the IG locus is preferred [ 88 ]. Transiently expressed genomic breakpoint sequences that are associated with real gene fusions often overlap transposons; these should be filtered unless they are associated with additional fusion isoforms from the same gene pair [ 89 ].

Read-through chimeras are easily identified as predictions involving alternative splicing between adjacent genes. Where possible, fusions should be filtered by their presence in a set of control datasets [ 87 ]. When control datasets are not available, artifacts can be identified by their presence in a large number of unrelated datasets, after excluding the possibility that they represent true recurrent fusions [ 90 , 91 ].
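The filters discussed above can be combined into a simple post-processing pass; in the sketch below, the candidate fields (gene names, read counts, a read-through flag) and the 1% ratio threshold are hypothetical stand-ins for whatever a particular fusion caller outputs.

```python
# Hedged sketch of heuristic post-filters for fusion candidates.
def passes_filters(candidate, control_fusions, ig_genes, rrna_genes):
    g1, g2 = candidate["gene1"], candidate["gene2"]
    # Drop fusions also seen in control datasets.
    if (g1, g2) in control_fusions:
        return False
    # Drop only fusions where BOTH partners lie in the IG locus,
    # keeping potentially real IG fusions in lymphomas.
    if g1 in ig_genes and g2 in ig_genes:
        return False
    # Drop highly expressed genes unlikely to fuse, such as rRNA.
    if g1 in rrna_genes or g2 in rrna_genes:
        return False
    # Drop read-through chimeras between adjacent genes.
    if candidate.get("adjacent_readthrough", False):
        return False
    # Require a minimal chimeric-to-wild-type read ratio at the boundary.
    ratio = candidate["chimeric_reads"] / max(candidate["wildtype_reads"], 1)
    return ratio >= 0.01

candidate = {"gene1": "BCR", "gene2": "ABL1", "chimeric_reads": 40,
             "wildtype_reads": 400, "adjacent_readthrough": False}
print(passes_filters(candidate, set(), set(), set()))   # True
```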

Strong fusion-sequence predictions are characterized by distinct subsequences that each align with high specificity to one of the fused genes. As alignment specificity is highly correlated with sequence length, a strong prediction sequence is longer, with longer subsequences from each gene. Longer reads and larger insert sizes produce longer predicted sequences; thus, we recommend PE RNA-seq data with larger insert size over SE datasets or datasets with short insert size.

Another indicator of prediction strength is splicing. For most known fusions, the genomic breakpoint is located in an intron of each gene [ 92 ] and the fusion boundary coincides with a splice site within each gene.

Furthermore, fusion isoforms generally follow the splicing patterns of wild-type genes. Thus, high confidence predictions have fusion boundaries coincident with exon boundaries and exons matching wild-type exons [ 91 ]. Fusion discovery tools often incorporate some of the aforementioned ideas to rank fusion predictions [ 93 , 94 ], though most studies apply additional custom heuristic filters to produce a list of high-quality fusion candidates [ 90 , 91 , 95 ].

Next-generation sequencing represents an increasingly popular method to address questions concerning the biological roles of small RNAs (sRNAs). Ligated adaptor sequences are first trimmed and the resulting read-length distribution is computed. In animals, there are usually peaks at 22 and 23 nucleotides, whereas in plants there are peaks at 21 and 24 nucleotides for redundant reads. Tools such as miRTools 2.0 can then be used to identify and profile the different sRNA species. The downside of this technique is the increased burden of data analysis to achieve the same accuracy that would be achieved with a richer input.
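The read-length profile mentioned above is straightforward to compute after adapter trimming; a minimal sketch over a FASTQ file follows, with the file name as a placeholder.

```python
# Read-length distribution of adapter-trimmed sRNA reads.
import gzip
from collections import Counter

def length_distribution(path):
    opener = gzip.open if path.endswith(".gz") else open
    lengths = Counter()
    with opener(path, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:                 # FASTQ sequence lines
                lengths[len(line.strip())] += 1
    return lengths

dist = length_distribution("trimmed_srna.fastq.gz")   # placeholder file name
for length in sorted(dist):
    print(f"{length:>3} nt: {dist[length]}")
```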

Downstream of sequencing, FASTQ data must be validated and processed to distill raw reads into a quantitative measure of gene expression. Reads are usually subjected to adapter removal, aligned against a reference genome, and grouped by functional unit (e.g., gene, transcript, or exon). Subsequent analyses can vary dramatically according to the application.

In the simplest setting, the goal is to discover the subset of genes responsible for the phenotypic differences between two populations.

In other cases, one may want to build a co-expression or reverse-expression network in order to find interacting genes or a pathway related to a certain phenotype. Other applications involve the discovery of unknown cell types, the organization of cell types into homogeneous families, and the identification of new molecules.

This Research Topic is divided into three main sections: five articles cover the RNA-seq workflow, four papers discuss the most recent frontier of single-cell RNA sequencing, and the last four contributions report on case studies related to tumor profiling and plant science. In the first part, we attempted to analyze the RNA-seq process from experimental design to analysis and extraction of new knowledge by highlighting the key choices of state-of-the-art workflows.

Although we have mainly focused on computational aspects, we believe that this Research Topic will also interest life-science readers who intend to become independent in the analysis of their own data. Two papers in this section describe new methods: one for the identification of differentially expressed genes and one for the prediction of circRNA coding ability.

Although conceptually similar to sequencing cells in bulk, the single-cell resolution of this technique introduces substantial noise that requires ad hoc analysis methods. Much of this section is dedicated to introducing basic single-cell RNA sequencing concepts, from laboratory protocols to the most common analyses. In particular, the problems of assessing the results of clustering cell types and the reproducibility of differential expression experiments are discussed.

Finally, this section concludes with the description of a new method to infer missing counts caused by poor sequencing coverage. The last part of the Research Topic is dedicated to four case studies: three concerning tumors and one application in plant science.

As the reaction moves forward, the high concentration of rRNA constructs relative to their cDNA partners drives the reaction even further, until rRNA and other abundant constructs are effectively depleted from the sample. Step 4: after enzymatic depletion of highly abundant molecules, the remaining constructs are the target RNA molecules. Because this reaction relies on molecular kinetics, the higher the input for the reaction, the faster the reaction proceeds, leading to an inverse relationship between input and incubation times.

This probe-free approach is beneficial because no organism-specific panels need to be purchased separately: one universal kit can handle them all. This is especially beneficial for projects involving non-model organisms, which previously required the development of organism-specific probe panels.

Another benefit of an enzymatic approach is the reduction in depletion bias. Probe-based approaches only deplete constructs that bind to probes, whereas the enzymatic approach depletes the most abundant constructs first and most efficiently. Once our cDNA has been synthesized and our transcripts of interest are no longer crowded out by rRNAs and overly abundant constructs, it is time for adaptors to be ligated onto the cDNA.

Adaptors are short, synthetic oligonucleotides that are attached to the ends of cDNA strands. Adaptors serve two main functions: binding transcripts to the flow cell for sequencing and providing priming sites for the sequencing polymerases.

The adaptor sequences are complementary to the sequences that the fragments are covalently bound to in the sequencing flow cell. The flow cell is a glass slide with lanes coated in a lawn of the two different types of oligos complementary to our adaptor sequences.

This allows our transcripts to transiently bind to the flow cell for sequencing. The second function of adaptors is to serve as priming sites for the polymerases used in sequencing.

After adaptors are ligated to the cDNA molecules, many library preparations undergo a process of indexing, in which a unique barcode sequence is added to each sample. This barcode allows the transcripts to be identified during the sequencing process after samples are pooled. Pooling is a process that involves mixing numerous different samples together at known concentrations so they can be added to the flow cell and sequenced simultaneously.

Pooling samples is often done to save time and money. After adaptor ligation and indexing, samples are ready for sequencing! Step 1: the process of adaptor ligation and indexing involves the addition of the synthetic oligonucleotides to our target cDNA molecules. Step 2: an adaptor with the unique barcode is ligated to the cDNA target.

Commonly used Illumina adaptors are designated P5 and P7. Step 3: the other adaptor is added to the other end of the cDNA molecule.
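Conceptually, demultiplexing a pooled run is a lookup from index sequence to sample; the toy sketch below assumes the index read appears after the last colon of the FASTQ header (one common convention) and uses hypothetical barcodes and file names.

```python
# Toy demultiplexer: tally reads per sample by index barcode.
import gzip
from collections import Counter

barcodes = {"ACGTACGT": "sample_A", "TGCATGCA": "sample_B"}   # hypothetical

def demultiplex(path):
    opener = gzip.open if path.endswith(".gz") else open
    tally = Counter()
    with opener(path, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 0:                          # FASTQ header lines
                index = line.strip().rsplit(":", 1)[-1]
                tally[barcodes.get(index, "undetermined")] += 1
    return tally

print(demultiplex("pooled_run.fastq.gz"))           # placeholder file name
```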

These amplified libraries are then quantified to determine their concentration. The concentrations of the libraries are then normalized to ensure that the libraries are sequenced evenly and that no single library is overrepresented during the sequencing process.
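That normalization step is essentially a molarity calculation; the sketch below uses the standard approximation of ~660 g/mol per base pair of double-stranded DNA to convert each library's concentration to nanomolarity and derive equimolar pooling volumes. All values are hypothetical.

```python
# Equimolar pooling: convert ng/ul to nM, then compute volumes.
def molarity_nM(conc_ng_per_ul, mean_fragment_bp):
    # dsDNA molecular weight ~ 660 g/mol per base pair
    return conc_ng_per_ul * 1e6 / (660 * mean_fragment_bp)

libraries = {"lib1": (10.0, 350), "lib2": (25.0, 420)}   # (ng/ul, mean bp)
fmol_each = 20.0                                          # equal molar input

for name, (conc, size) in libraries.items():
    nM = molarity_nM(conc, size)                          # 1 nM == 1 fmol/ul
    print(f"{name}: {nM:.1f} nM -> take {fmol_each / nM:.2f} ul")
```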

There are a few different technologies for sequencing, such as Sanger sequencing, and higher-throughput options such as pyrosequencing, Ion Torrent, and nanopore sequencing; the workflow described here uses Illumina sequencing by synthesis.

There are two parts of sequencing by synthesis, which are cluster generation followed by the actual sequencing process. Our samples are now each indexed, meaning they have a unique barcode tag that allows us to identify the samples after multiple samples are pooled together.

The pooled samples are added to a flow cell in the sequencer. Step 1: the adaptorized transcripts hybridize to the complementary oligos of the lawn so that they are bound to the flow cell. The flow cell oligo serves as the primer for a polymerase to create a complement of the hybridized fragment. Then the double-stranded molecule is denatured, and the original template is washed away, leaving only the newly synthesized strand bound directly to the flow cell.

Step 2: the strand now folds over, and the adaptor region hybridizes to the other kind of oligo on the flow cell; a polymerase uses the new oligo as a primer to create a complementary strand again.

Step 3: there is now a double-stranded bridge of complementary strands.


