COSMIC - Overview
Welcome to the COSMIC Help Pages.
The Catalogue Of Somatic Mutations In Cancer (COSMIC) is a comprehensive database of somatic mutations. The dataset can be examined in the following ways:
- COSMIC Website: cancer.sanger.ac.uk/cosmic
- COSMIC Cell Lines Project: cancer.sanger.ac.uk/cell_lines
- COSMIC Genome Browser: /cosmic/jbrowse/genome
- COSMIC Data Download: /cosmic/download
- COSMIC Cell Lines Data Download: cancer.sanger.ac.uk/cell_lines/download
COSMIC - Search
The search box on COSMIC's home page provides access to web pages where the data set can be examined with the help of various graphical and tabular views. Along with this, the home page provides information about the latest updates in COSMIC with current statistics and links to the additional CGP (Cancer Genome Project) resources.
Search
COSMIC can be searched in several ways. For example, by
- Gene name or HUGO synonym (eg BRAF or B-raf)
- Tissue or cancer type such as 'lung' or 'colon' (classified in COSMIC as 'large intestine')
- Mutation description eg the common KRAS mutation "c.35G>A" (CDS syntax) or "p.G12D" (amino acid syntax)
- Combined gene and mutation description eg "KRAS p.G12D"
- Sample name such as 'COLO-829' or a Cosmic Sample Id eg '687448'
After searching, the results are listed by category with a table showing the number of hits and a panel underneath with a tab for each category. The listings in each category have links to the relevant overview pages in the COSMIC website.
COSMIC - Counts
A sample is a cell line or single piece of tumour examined through one or more genes for mutations. These experiments can happen in a number of ways, but usually involve sequencing. The name of the sample is defined by the data source. Usually cell lines have recognised names (which we capture) such as 'HCC38', or 'PC-3'. Names of primary tumours are often more abstract, sometimes numeric ('1','2'...), and often completely absent, in which case they are assigned a 6 or 7-digit name reflecting their database ID. Multiple instances of the same sample name can exist as separate entries, indicating that it was unclear during curation that these samples were identical, apart from their name. This is especially acute for cell lines, where the same sample name can indicate very different biological material, for instance the name PC-3 is used for cell lines from 3 different tissues.
A number of tumours can be examined from a single cancer patient, and a number of samples can be examined from each of these tumours. Each sample has its own name and ID. Their identical ancestry is indicated how?
Sample counting
To account for the duplication of probably identical samples during curation, we attempt to combine samples with identical names and disease descriptions. For instance these two PC-3's will be counted as one (in mutation frequency calculations) since it's likely they're the same thing, just curated from different papers:
Sample id     Name   Primary site(s)
COSS1028650   PC-3   prostate
COSS1028702   PC-3   prostate
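The combining rule above can be sketched in a few lines of Python. This is only an illustration, not COSMIC's actual implementation: the `Sample` record and the `count_distinct_samples` helper are hypothetical names, and "identical" is assumed here to mean an exact match on name and primary site.

```python
from collections import namedtuple

# Hypothetical record type for illustration; field names are assumptions.
Sample = namedtuple("Sample", ["sample_id", "name", "primary_site"])


def count_distinct_samples(samples):
    """Count samples, treating entries that share a name and primary site as one."""
    # Assumption: exact string match on (name, primary_site) defines a duplicate.
    keys = {(s.name, s.primary_site) for s in samples}
    return len(keys)


samples = [
    Sample("COSS1028650", "PC-3", "prostate"),
    Sample("COSS1028702", "PC-3", "prostate"),
]
distinct = count_distinct_samples(samples)  # the two PC-3 entries collapse into one
```

A sample with the same name but a different primary site would remain a separate entry under this rule, which matches the caution above about one cell line name covering different biological material.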
Mutation Frequency
The mutation frequency of a gene or tissue on the COSMIC webpages is a simple division of the number of samples with observed mutations, over the number of samples examined, from our curations. There are two different contexts for this data, between the published literature and the Cancer Genome Consortium data. The Cancer Genome Consortium data can be considered fully objective, where every gene has been fully sequenced through every sample. However, for the genes with full literature curation ('classic' genes), the % frequencies will reflect the samples and mutations as they are published. Since it is more difficult to publish studies which find no mutations, it is likely these frequencies are less accurate, simply representing the best current knowledge.
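As a minimal sketch of the division described above (the function name and the validation checks are assumptions made for illustration, not part of COSMIC):

```python
def mutation_frequency(mutated_samples, examined_samples):
    """Return the mutation frequency as a percentage of examined samples."""
    if examined_samples <= 0:
        raise ValueError("at least one sample must have been examined")
    if not 0 <= mutated_samples <= examined_samples:
        raise ValueError("mutated samples must be between 0 and the number examined")
    return 100.0 * mutated_samples / examined_samples


# Example with made-up numbers: 25 mutated samples out of 200 examined.
frequency = mutation_frequency(25, 200)
```

Both counts would be taken after duplicate samples have been combined, so that a sample curated from two papers is not counted twice.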
COSMIC - Data Types
Aberrant Gene Expression
Gene expression level 3 data has been downloaded from the publicly accessible TCGA portal. The platform codes currently used to produce the COSMIC gene expression values are:
- IlluminaGA_RNASeqV2
- IlluminaHiSeq_RNASeqV2
- AgilentG4502A_07_2
- AgilentG4502A_07_3
For the Agilent platforms, the two samples (one from the target sample and the other from a reference sample) are labeled with Cy3 and Cy5, mixed, and then hybridized to a single microarray. The relative intensities of Cy3 and Cy5 are then used in ratio-based analysis to identify over-expressed and under-expressed genes [https://wiki.nci.nih.gov/pages/viewpage.action?pageId=72942598].
For the RNASeqV2 platforms, the files used are rsem.genes.normalized_results, which contain Level 3 expression data produced using MapSplice for the alignment and RSEM for the quantitation [https://wiki.nci.nih.gov/display/TCGA/RNASeq+Version+2].
Analysis
The mean and sample standard deviation of the gene expression values have been calculated from the tumour samples that are diploid for each corresponding gene, platform and study. Based on these mean and STDEV values we have calculated the standard scores (z_score) for gene expression for each corresponding gene, platform and study.
Qualitative merging of results
Results are merged qualitatively per study (project_code) across analysis platforms. To display whether a gene is over or under expressed, a threshold of plus or minus 2 STDEV was selected. Where a sample has been analysed with more than one platform for a given study and gene, we display 'over' or 'under' only if the scores from all platforms are above or below the threshold. If the platforms do not agree, nothing is displayed. The z_score column in the gene_regulation table is the z_score (serving as an indicative score) taken from the gene_expression table, from platforms in the following order of preference:
- IlluminaHiSeq_RNASeqV2
- IlluminaGA_RNASeqV2
- AgilentG4502A_07_3
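The scoring and merging steps described above could look roughly like the following Python sketch. All function names are hypothetical, the strict inequalities at the 2 STDEV threshold are an assumption (the text does not say whether a score exactly at the threshold counts), and the reference values are assumed to be the expression values of the diploid tumour samples for one gene, platform and study.

```python
from statistics import mean, stdev

THRESHOLD = 2.0  # plus or minus 2 standard deviations, as stated in the text

# Order of preference for the indicative z_score, as listed in the text.
PLATFORM_PREFERENCE = [
    "IlluminaHiSeq_RNASeqV2",
    "IlluminaGA_RNASeqV2",
    "AgilentG4502A_07_3",
]


def z_score(value, reference_values):
    """Standard score of `value` against the reference (diploid tumour) values."""
    return (value - mean(reference_values)) / stdev(reference_values)  # sample SD (n - 1)


def call(z):
    """Classify one platform's z-score (strict inequality is an assumption)."""
    if z > THRESHOLD:
        return "over"
    if z < -THRESHOLD:
        return "under"
    return "normal"


def merged_call(z_by_platform):
    """Return 'over' or 'under' only if every platform agrees, otherwise None."""
    calls = {call(z) for z in z_by_platform.values()}
    if calls == {"over"}:
        return "over"
    if calls == {"under"}:
        return "under"
    return None


def preferred_z_score(z_by_platform):
    """Pick the indicative z-score from the most preferred platform available."""
    for platform in PLATFORM_PREFERENCE:
        if platform in z_by_platform:
            return z_by_platform[platform]
    return None
```

For example, a sample scoring 2.5 on IlluminaHiSeq_RNASeqV2 and 3.1 on AgilentG4502A_07_3 would be displayed as 'over' with an indicative z_score of 2.5, whereas scores of 2.5 and 1.0 would not be displayed at all.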
Copy Number Variants
For Cancer Genome Project data (including the Cell Lines Project), copy number analysis was carried out using the Affymetrix SNP6.0 array in conjunction with a bespoke algorithm, PICNIC (Predicting Integral Copy Numbers In Cancer) [http://www.sanger.ac.uk/resources/software/picnic].
Where available, copy number data from TCGA and ICGC have been included in COSMIC (for samples already present in the database, ie samples with mutations). All TCGA data included in COSMIC has been reanalysed using ASCAT [http://www.crick.ac.uk/peter-van-loo/software/ASCAT].
Definition of Minor Allele and Copy Number in tables:
- Minor Allele: the number of copies of the least frequent allele, eg if ABB, minor allele = A (1 copy) and major allele = B (2 copies)
- Copy Number: the sum of the major and minor allele counts, eg if ABB, copy number = 3
Definition of Gain and Loss:
We have introduced filtering thresholds to only display CNVs which are high level amplifications, homozygous deletions, or where there has been 'substantial loss' within an otherwise duplicated genome. We also use a higher threshold for amplification if genome duplication has occurred. We use average ploidy > 2.7 to define genome duplication.
1. ICGC samples
- Gain: as defined in the original data
- Loss: as defined in the original data
2. TCGA samples reanalysed with ASCAT and CGP Cell Lines exomes analysed with PICNIC
- Gain: (average genome ploidy <= 2.7 AND total copy number >= 5) OR (average genome ploidy > 2.7 AND total copy number >= 9)
- Loss: (average genome ploidy <= 2.7 AND total copy number = 0) OR (average genome ploidy > 2.7 AND total copy number < (average genome ploidy - 2.7))
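The gain and loss thresholds for TCGA and CGP Cell Lines samples can be expressed as a small decision function. The sketch below is illustrative only: the function names are hypothetical, and it assumes that AND binds more tightly than OR in the definitions above.

```python
PLOIDY_CUTOFF = 2.7  # average ploidy above this value defines genome duplication


def total_copy_number(major_allele_copies, minor_allele_copies):
    """Copy number is the sum of the major and minor allele counts."""
    return major_allele_copies + minor_allele_copies


def classify_cnv(average_ploidy, copy_number):
    """Return 'gain', 'loss', or None if the CNV would not be displayed."""
    if average_ploidy > PLOIDY_CUTOFF:
        # Duplicated genome: higher amplification threshold, 'substantial loss' rule.
        if copy_number >= 9:
            return "gain"
        if copy_number < average_ploidy - PLOIDY_CUTOFF:
            return "loss"
    else:
        if copy_number >= 5:
            return "gain"
        if copy_number == 0:
            return "loss"
    return None


# The ABB example from the text: major allele B (2 copies), minor allele A (1 copy).
abb_copy_number = total_copy_number(2, 1)
```

With these rules, a copy number of 5 is reported as a gain in a sample with average ploidy 2.0 but is not displayed in a sample with average ploidy 3.5.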
Substitutions
A substitution replaces a single nucleotide with another. Substitutions are annotated using syntax derived from HGVS nomenclature recommendations [http://varnomen.hgvs.org/].
AA Mutation
The change that has occurred in the peptide sequence as a result of the mutation. Syntax is based on the recommendations made by the Human Genome Variation Society. The mutation type is shown in brackets after the mutation string. A description of each type can be found below under Mutation Types.
CDS Mutation
The change that has occurred in the nucleotide sequence as a result of the mutation. Syntax is identical to the method used for the peptide sequence. The mutation type is used to describe the type of mutation that has occurred.
Mutation Types:
- Nonsense: A substitution mutation resulting in a termination codon, foreshortening the translated peptide.
- Missense: A substitution mutation resulting in an alternate codon, altering the amino acid at this position only.
- Coding silent: A synonymous substitution mutation which encodes the same amino acid as the wild type codon.
- Intronic: A substitution mutation outside the coding domains. No interpretation is made as to its effect on splice sites or nearby regulatory regions.
- Complex: A compound mutation which may involve multiple insertions, deletions and substitutions.
- Unknown: A mutation with no detailed information available.
Insertions/Deletions
Insertions and Deletions are annotated using syntax derived from HGVS nomenclature recommendations [http://varnomen.hgvs.org/].
Insertion
An insertion of novel sequence into the gene.
- In frame: An insertion of nucleotides which does not affect the gene's translation frame, leaving the downstream peptide sequence intact.
- Frameshift: An insertion of novel sequence which alters the translation frame, changing the downstream peptide sequence (often resulting in premature termination).
Deletion
A deletion of a portion of the gene's sequence.
- In frame: A deletion of nucleotides which does not affect the gene's translation frame, leaving the downstream peptide sequence intact.
- Frameshift: A deletion of nucleotides which alters the translation frame, changing the downstream peptide sequence (often resulting in premature termination).
Structural Variants
The accurate description and annotation of structural variants can be complex. This is due to the different resolutions at which variants are reported, from traditional cytogenetic coordinates down to the actual base pair positions. Furthermore, multiple rearrangements in a single area of the genome can make cataloguing and interpreting their effects challenging.
The Rearrangement Overview page describes the one or more breakpoints which make up a structural variant. A breakpoint is defined as a region or point where the sample sequence has altered from the reference sequence. Minimum interpretation is made of this data. One variant event can consist of one or multiple breakpoints.
The Syntax (shown above the table) gives a detailed description of the variant and its location (e.g. chr11:g.36585231_76606618del, a deletion of roughly 40Mb on chromosome 11). Syntax is based on HGVS mutation nomenclature recommendations [http://www.hgvs.org/rec.html].
In the table of breakpoints, the columns are as follows:
- Mutation ID (COST): Unique identifier for the variant.
- Mutation Description: A short textual description of the variant (e.g. tandem duplication, deletion, translocation).
- Order: For a structural variant involving multiple breakpoints, the predicted order along the chromosome(s) is given (otherwise '0').
- Chromosome From: The chromosome where the first variant/breakpoint occurs.
- Breakpoint From: Genomic coordinate of the start of the variant/breakpoint (or range if base position not known). The icons next to the coordinate are links to the COSMIC Genome Browser and Ensembl.
- Strand: Orientation of the break relative to the reference sequence.
- Chromosome To: The chromosome where the final variant/breakpoint occurs.
- Breakpoint To: Genomic coordinate of the last variant/breakpoint (or range if base position not known). The icons next to the coordinate are links to the COSMIC Genome Browser and Ensembl.
- Strand: Orientation of the break relative to the reference sequence.
- Non Templated Inserted Seq: Sequence (if any) which is inserted at the breakpoint. The sequence is not encoded.
A controlled ontology of "Mutation Descriptions" is available below.
Mutation Description Ontology
In order to help with the interpretation of structural variants in COSMIC, each variant is assigned a Mutation Description and Syntax. The assignment involves an interpretation of the data and of the currently known breakpoints in the region. If not all breakpoints have been characterised then the mutation may not be fully characterised. Below is a description of the Mutation Description Ontology with associated Syntax.
1. Tandem Duplication
A Tandem Duplication is characterised by a duplication of a segment of the genome which is adjacent to the original sequence. The syntax takes the following format:
chr2:g.124629221_125036287dup
where chr2: denotes the chromosome involved, g. indicates that genomic coordinates are used, 124629221_125036287 gives the start and end of the variant, and dup indicates a tandem duplication. For a tandem duplication the breakpoint is characterised by upstream sequence mapping downstream to where it should map on the genome. So in this case position 125036287 is mapping before 124629221, which is the signature of a tandem duplication.
2. Deletion
The syntax takes the following format:
chr11:g.36585231_76606618del
where chr11: denotes the chromosome involved, g. indicates genomic coordinates, 36585231 is the deletion start point, 76606618 is the deletion end point, and del indicates a deletion event. For a deletion the breakpoint is characterised by 2 distant points in the genome being next to each other. In this example position 36585230 is next to 76606619 in the genome. The region between these points is assumed to be deleted. The coordinates of the deletion are +1 and -1 as the breakpoint gives the last observed nucleotides, so the range of the deletion is from 36585231 to 76606618.
3. Inversion
An inversion indicates the reversal of a piece of genome sequence. The syntax takes the following format:
chr1:g.115340245_115346449inv
where chr1: denotes the chromosome involved, g. indicates that genomic coordinates are used, 115340245_115346449 gives the range of the inversion, and inv indicates an inversion. Two breakpoints can be detected for this mutation although only one is required to fully characterise the mutation.
4. Translocation
A Translocation is characterised by the fusion of 2 chromosomes. The syntax takes the following format:
chr8:g.63669858_chr14:22298219trans[?]
where chr8:g.63669858 denotes the breakpoint on one chromosome and chr14:22298219 the breakpoint on the other chromosome, trans indicates a translocation event, and the bracketed value indicates any change in copy number associated with the mutation ([?] indicates not known). The strand information is often given in the syntax to describe which end of each chromosome actually forms the translocation.
5. Complex Substitution
A Complex Substitution is defined as a region which has been deleted and replaced with another region of the genome. The syntax takes the following format:
chr8:g.55512043_63659930>chr13:22017510_22017585
where chr8: denotes the chromosome involved, g. indicates genomic coordinates, 55512043_63659930 indicates the region deleted, > represents 'replaced with', and chr13:22017510_22017585 indicates the region inserted.
6. Complex Amplicon
A Complex Amplicon is a region of a genome which has been amplified and undergone multiple rearrangements. Due to the complexity of these regions the amplicon breakpoints are listed but no interpretation is made of the data. The syntax gives the range of the amplicon where the multiple rearrangements are occurring. An example is:
chr8:g.(61857345-?_129022677+?)[(10-40)]
where chr8: denotes the chromosome involved, g. indicates genomic coordinates, 61857345-?_129022677+? indicates the range of the amplicon, with -? and +? indicating that the precise positions of the start (-?) and end (+?) are not currently known, and [(10-40)] indicates the approximate copy number of this region, between 10 and 40 copies in this case.
7. Amplicon Breakpoint(s)
An amplicon breakpoint is defined as a breakpoint within an amplified region with unknown boundaries, so accurate interpretation of the mutation cannot be made. In these cases the breakpoint is simply described. The syntax takes the following format:
chr14:g.28412748_chr14:28419493bkpt[4]
where chr14: denotes the chromosome involved, g. indicates genomic coordinates, 28412748 is the end of the sequence to the left of the breakpoint, 28419493 is the sequence coordinate to the right of the breakpoint, bkpt indicates a breakpoint, and [4] is the approximate copy number in the area.
Sequence Fragment(s)
Structural variants can have additional sequence from elsewhere in the genome. For example:
chr8:g.64123513inschr12:7418993_7419327inschr12:8232312_8232333_chr12:7072996trans[(8-13)]
is a translocation with 2 additional fragments from chromosome 12, one is 21 base pairs and the other 335 base pairs.
Copy Number Information
Approximate Copy Number data is given when the variant is non-diploid and this information is available. The mutation description is prefixed with "amplified" or "amplicon" if there is variation in copy number. For example:
chr8:g.63669858_chr14:22298219trans[11-26]
denotes a translocation with a copy number increase of approximately 11-26. A value of [2] would indicate diploid (normal).
Strand Information
In certain situations it is important to provide strand information to describe a variant. The HGVS "o" identifier is used to denote 'opposite strand'. For example:
chr1:g.58958334_chr12:o69893440bkpt
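To make the simpler syntaxes concrete, the following Python sketch parses the single-chromosome forms (dup, del, inv) and builds a deletion string from a pair of breakpoint coordinates using the +1/-1 rule described for deletions. It is a hedged illustration rather than a COSMIC tool: the function names and the regular expression are assumptions, the inclusive length calculation is an assumption, and the more complex forms (translocations, amplicons, inserted fragments, strand markers) are deliberately not handled.

```python
import re

# Matches only the simple single-chromosome forms, e.g. chr2:g.124629221_125036287dup
SIMPLE_SV = re.compile(
    r"^chr(?P<chrom>[0-9A-Za-z]+):g\.(?P<start>\d+)_(?P<end>\d+)(?P<kind>dup|del|inv)$"
)


def parse_simple_sv(syntax):
    """Parse a dup/del/inv syntax string; return None for anything else."""
    match = SIMPLE_SV.match(syntax)
    if match is None:
        return None
    start, end = int(match["start"]), int(match["end"])
    return {
        "chrom": match["chrom"],
        "start": start,
        "end": end,
        "kind": match["kind"],
        "length": end - start + 1,  # assumption: both coordinates are inclusive
    }


def deletion_from_breakpoints(chrom, last_base_before, first_base_after):
    """Build deletion syntax from the last observed bases either side of the break."""
    return f"chr{chrom}:g.{last_base_before + 1}_{first_base_after - 1}del"


# The worked deletion example: position 36585230 lies next to 76606619.
deletion = deletion_from_breakpoints("11", 36585230, 76606619)
parsed = parse_simple_sv(deletion)
```

Here `deletion` is the string chr11:g.36585231_76606618del, and a translocation string such as chr8:g.63669858_chr14:22298219trans[?] is returned as None because it falls outside the simple forms.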
Non-Coding Variants
Non-coding variants are usually defined by whole genome screening and occur either in the intronic regions of genes or in intergenic regions of the genome. They are annotated using syntax derived from HGVS nomenclature recommendations [http://www.hgvs.org/mutnomen/]. The 'g.' format of the syntax denotes genomic coordinates, eg chr19:g.34210730C>T which is a C to T substitution at nucleotide 34,210,730 on chromosome 19.
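As a small illustration of the 'g.' format, the sketch below splits a genomic substitution string into its parts. The function name and the regular expression are assumptions made for this example, and only single-base substitutions of the form shown above are handled.

```python
import re

# Matches single-base genomic substitutions such as chr19:g.34210730C>T
GENOMIC_SUBSTITUTION = re.compile(
    r"^chr(?P<chrom>[0-9A-Za-z]+):g\.(?P<position>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def parse_genomic_substitution(syntax):
    """Return chromosome, position, reference and alternate base, or None."""
    match = GENOMIC_SUBSTITUTION.match(syntax)
    if match is None:
        return None
    return {
        "chrom": match["chrom"],
        "position": int(match["position"]),
        "ref": match["ref"],
        "alt": match["alt"],
    }


variant = parse_genomic_substitution("chr19:g.34210730C>T")
```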
Gene Fusions
Many papers determine fusions between genes (translocations) using expression technologies, such as RT-PCR. A number of these studies have identified more than one transcript per sample, some finding over four different products between the same gene pair in one tumour. This implies significant alternative splicing of the mRNAs expressed from the fused gene pair.
In order to simplify this data for display and navigation, we have inferred the position of the genomic breakpoint from the experimental data whilst maintaining the original results. To do this, it has been assumed that each sample's breakpoint lies between the most 3' expressed exon of the 5' gene and the most 5' exon of the 3' gene, from the mRNAs reported in that sample.
Inferred breakpoints are determined using the rule above, and the 'Observed mRNAs' are the expressed products actually reported as the result of experimental procedures. A single inferred breakpoint can allow the expression of a number of gene fusion mRNA variants, as above. In addition, a single observed mRNA can, between samples, be derived from a number of different breakpoints.
Syntax
The syntax describes the portions of mRNA PRESENT (in HGVS "r." format) from each gene, which allows representation of UTR sequences. It is a one line syntax built from the following elements, in order:
Element              Meaning
Gene name 1          HUGO
{                    new symbol associating mRNA sequence with gene name
Accession number 1   Genbank
}                    new symbol associating mRNA sequence with gene name
:                    separates gene identifier from coordinates
r.                   syntax defining mRNA portion present of first gene
_                    denotes a join of sequences
Gene name 2          HUGO
{                    new symbol associating mRNA sequence with gene name
Accession number 2   Genbank
}                    new symbol associating mRNA sequence with gene name
:                    separates gene identifier from coordinates
r.                   syntax defining mRNA portion present of second gene
Here are 2 examples:
1. Standard Fusion
TMPRSS2 from exon 1 (UTR) to ERG exon 2 (inclusive).
TMPRSS2{NM_005656.2}:r.1_71_ERG{NM_004449.3}:r.38_3097
TMPRSS2 from intron after exon 1 to intron before exon 2, intronic breakpoints known (374bp downstream of TMPRSS2 exon 1 and 54bp upstream of ERG exon 2).
TMPRSS2{NM_005656.2}:r.1_71+374_ERG{NM_004449.3}:r.38-54_3097
TMPRSS2 from intron after exon 5 to intron before ERG exon 3, intronic breakpoints NOT known (but remarked on in the paper).
TMPRSS2{NM_005656.2}:r.1_71+?_ERG{NM_004449.3}:r.38-?_3097
2. Fusion to the complementary strand (flipped fusion)
TMPRSS2 present in sense orientation, ERG in the antisense.
TMPRSS2{NM_005656.2}:r.1_71_oERG{NM_004449.3}:r.38_3097
Again, if the intronic co-ordinates are known.
TMPRSS2{NM_005656.2}:r.1_71+374_oERG{NM_004449.3}:r.35-54_3097
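The one line fusion syntax can be split into its two partners with a regular expression. The sketch below is an illustration under stated assumptions, not a COSMIC parser: the function name and pattern are hypothetical, gene symbols are assumed to start with an upper-case letter (so that a leading "o" can be read as the opposite-strand marker), and only two-partner fusions of the forms shown in the examples are accepted.

```python
import re

# One fusion partner: optional "o" (opposite strand), gene symbol, {accession},
# then an r. range whose ends may carry an intronic offset such as +374, -54, +? or -?
POSITION = r"\d+(?:[+-](?:\d+|\?))?"
PARTNER = re.compile(
    r"(?P<opposite>o)?(?P<gene>[A-Z][A-Za-z0-9-]*)\{(?P<accession>[^}]+)\}"
    r":r\.(?P<start>" + POSITION + r")_(?P<end>" + POSITION + r")"
)


def parse_fusion(syntax):
    """Split a two-gene fusion syntax string into a list of two partner dicts."""
    matches = list(PARTNER.finditer(syntax))
    # The two partners joined by "_" must reproduce the input exactly.
    if len(matches) != 2 or "_".join(m.group(0) for m in matches) != syntax:
        raise ValueError(f"unrecognised fusion syntax: {syntax!r}")
    return [
        {
            "gene": m["gene"],
            "accession": m["accession"],
            "start": m["start"],
            "end": m["end"],
            "opposite_strand": m["opposite"] is not None,
        }
        for m in matches
    ]


five_prime, three_prime = parse_fusion(
    "TMPRSS2{NM_005656.2}:r.1_71+374_oERG{NM_004449.3}:r.38-54_3097"
)
```

In this example `five_prime` describes TMPRSS2 with end position "71+374", and `three_prime` describes ERG on the opposite strand with start position "38-54". Positions are kept as strings so that unknown offsets such as "71+?" are preserved.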