Abstract
Free full text
Evolutionary classification of CRISPR–Cas systems: a burst of class 2 and derived variants
Abstract
The number and diversity of known CRISPR–Cas systems have substantially increased in recent years. Here, we provide an updated evolutionary classification of CRISPR–Cas systems and cas genes, with an emphasis on the major developments that have occurred since the publication of the latest classification, in 2015. The new classification includes 2 classes, 6 types and 33 subtypes, compared with 5 types and 16 subtypes in 2015. A key development is the ongoing discovery of multiple, novel class 2 CRISPR–Cas systems, which now include 3 types and 17 subtypes. A second major novelty is the discovery of numerous derived CRISPR–Cas variants, often associated with mobile genetic elements that lack the nucleases required for interference. Some of these variants are involved in RNA-guided transposition, whereas others are predicted to perform functions distinct from adaptive immunity that remain to be characterized experimentally. The third highlight is the discovery of numerous families of ancillary CRISPR-linked genes, often implicated in signal transduction. Together, these findings substantially clarify the functional diversity and evolutionary history of CRISPR–Cas.
CRISPR–Cas systems, which are best known as key components of a new generation of genome-engineering tools1,2, naturally function as adaptive immunity mechanisms in bacteria and archaea. The CRISPR–Cas immune response consists of three main stages: adaptation, expression and interference. At the adaptation stage, a distinct complex of Cas proteins binds to a target DNA, often after recognizing a distinct, short motif known as a protospacer-adjacent motif (PAM), and cleaves out a portion of the target DNA, the protospacer. After duplication of the repeat at the 5ʹ end of the CRISPR array, the adaptation complex inserts the protospacer DNA into the array, so that it becomes a spacer. Some CRISPR–Cas systems employ an alternative mechanism of adaptation — namely, spacer acquisition from RNA, via reverse transcription by a reverse transcriptase encoded at the CRISPR–cas locus.
At the expression stage, the CRISPR array is typically transcribed as a single transcript — the pre-CRISPR RNA (pre-crRNA) — that is processed into mature CRISPR RNAs (crRNAs), each containing the spacer sequence and parts of the flanking repeats. In different CRISPR–Cas variants, the pre-crRNA processing is mediated by a distinct subunit of a multiprotein Cas complex, by a single, multidomain Cas protein, or by non-Cas host RNases.
At the interference stage, the crRNA, which typically remains bound to the processing complex (protein), serves as a guide to recognize the protospacer (or a closely similar sequence) in the invading genome of a virus or plasmid, which is then cleaved and inactivated by a Cas nuclease (or nucleases) that either is part of the effector or is recruited at the interference stage. The above summary is a brief, oversimplified description of the CRISPR–Cas functionality that inevitably omits many details. These can be found in recent reviews on different aspects of CRISPR–Cas biology3–9.
Similar to other biological defence mechanisms, archaeal and bacterial CRISPR–Cas systems show a remarkable diversity of Cas protein sequences, gene compositions and architectures of the genomic loci3,5,10–15. Our knowledge of this diversity is continuously expanding through the screening of ever-growing genomic and metagenomic databases. To keep pace with such expansion, a robust classification of CRISPR–Cas systems based on their evolutionary relationships is essential for the progress of CRISPR research, but this presents formidable challenges, owing to the lack of universal markers and the fast evolution of the CRISPR–cas loci16. Therefore, the two previous CRISPR–Cas classifications, published in Nature Reviews Microbiology in 2011 and 2015, employed a multipronged approach that combined comparisons of the gene compositions of CRISPR–Cas systems and their loci architectures with sequence similarity-based clustering and phylogenetic analysis of conserved Cas proteins, such as Cas1 (REFS17,18). The 2015 classification included 5 types and 16 subtypes, as well as introducing the major division of CRISPR–Cas systems into two classes that radically differ with respect to the architectures of their effector modules involved in crRNA processing and interference. The class 1 systems have effector modules composed of multiple Cas proteins, some of which form crRNA-binding complexes (such as the Cascade complex in type I systems) that, with contributions from additional Cas proteins, mediate pre-crRNA processing and interference. By contrast, class 2 systems encompass a single, multidomain crRNA-binding protein (such as Cas9 in type II systems) that combines all activities required for interference and, in some variants, also those involved in pre-crRNA processing (BOX 1).
Since the publication of the 2015 classification, there have been at least three major developments in the study of the diversity of CRISPR–Cas systems. First, driven partly by the interest in new tools for genome engineering, dedicated efforts have been undertaken to predict and experimentally validate additional class 2 systems19–28. As a result, the RNA-targeting type VI and multiple previously unknown subtypes of type V CRISPR–Cas systems have been discovered. Moreover, it has been shown that type V systems repeatedly evolved from transposon-encoded TnpB nucleases, yielding a large pool of type V variants, many of which can be expected to eventually become separate subtypes20,29,30. The second key development was the discovery of several class 1 and class 2 CRISPR–Cas variants that appear to lack targeted cleavage activity and thus likely perform functions distinct from adaptive immunity8,31,32. Such derived CRISPR–Cas systems include type IV, several variants of type I and at least one type V system variant, and these are often encoded within mobile genetic elements29,30,33. Recently, the involvement of two of these derived CRISPR–Cas variants, encoded by Tn7-like transposons, in crRNA-dependent DNA transposition has been demonstrated experimentally34,35. Although the origin of some of these derived forms from particular class 1 subtypes is readily identifiable, their placement in the CRISPR–Cas classification scheme remains problematic. The third important finding involves the identification of numerous gene families that are associated with specific variants of CRISPR–Cas systems, particularly of type III systems, and are implicated in signal transduction and regulatory roles8,31,32,36,37.
In this article, we reassess and update the classification of CRISPR–Cas systems, using the previously developed strategies along with analysis of the modular structure of bipartite networks of gene sharing. Special emphasis is put on classification of the quickly proliferating class 2 variants. The new class 2 classification now includes 3 types and 17 subtypes, compared with 2 types and 4 subtypes in the 2015 version, and opens the door for many more subtypes of type V systems to be identified. Although experimental study of the recently discovered class 2 variants is only in its initial phase, it is already clear that their properties are highly diverse and are difficult to predict from Cas protein sequences alone. Therefore, robust classification and systematic study of class 2 variants are essential for understanding their functionality in microorganisms and for the development of versatile genome-edit ing tools. In addition, several distinct class 1 variants, including three new subtypes, have been identified, bringing the total number of CRISPR–Cas subtypes to 33. We describe the current state and prospects of the classification and nomenclature of CRISPR–Cas systems and cas genes and, additionally, outline the emerging scenario of CRISPR–Cas evolution.
The classification approach
No genes are shared by all CRISPR–Cas systems, ruling out the possibility of a straightforward, comprehensive phylogenetic classification analogous to that employed for cellular life forms. Instead, a multipronged computational strategy has been adopted that includes the identification of signature genes for CRISPR–Cas types and subtypes, comparison of gene repertoires and genomic loci organizations, and sequence similarity-based clustering and phylogenetic analysis of the genes that are conserved in different subsets of CRISPR–Cas systems. Experimental data have also been taken into consideration, when available16–18,38 (BOX 2).
Briefly, in this work, 566 amino acid sequence profiles (see Supplementary Methods), representing all variants of the 13 core cas genes, several still-uncharacterized components of their effector complexes, and reliably identified known ancillary genes, were compared to the protein sequences that are annotated in the 13,116 complete archaeal and bacterial genomes available at the NCBI as of March 1, 2019, using position-specific iterated BLAST39. This search, followed by extensive manual curation, resulted in the identification of 7,915 CRISPR–cas loci (Supplementary Dataset 1) that were fit into the previously developed classification, on the basis of the presence of the respective signature Cas proteins, sequence similarity between Cas proteins, the phylogenies of the most highly conserved Cas proteins (including Cas1 as well as the effector proteins for individual types and subtypes) and conservation of the locus organization. Loci that did not meet the criteria for inclusion in any of the previously identified subtypes were assigned to new subtypes (Supplementary Tables 1 and 2, Supplementary Dataset 1). The updated collection of Cas protein family profiles (Supplementary Dataset 2) is a resource for the identification of CRISPR–Cas systems in sequenced genomes and metagenomes.
Here, we additionally employed bipartite network analysis40,41 (Supplementary Dataset 3) for the identification of cohesive modules in the network that reflect both shared gene content and Cas protein sequence conservation, as well as for helping identify distinct CRISPR–Cas subgroups that might have subfunctionalized or neofunctionalized.
Functional modules and core genes
All cas genes can be subdivided into four distinct, although partially overlapping, functional modules (BOX 1)18,42. The adaptation module includes the gene encoding the key enzyme involved in spacer insertion (the Cas1 integrase) and the structural subunit of the adaptation complex Cas2, as well as the Cas4 nuclease in several CRISPR–Cas subtypes, the Csn2 protein in subtype II-A and reverse transcriptase in many type III systems. The expression processing module is responsible for pre-crRN A processing. In most class 1 systems, Cas6 is the enzyme that is directly responsible for processing. In type II systems, processing is catalysed by the bacterial RNase III (a non-Cas protein), whereas in many type V and apparently all type VI systems, the large effector Cas protein contains a distinct catalytic centre responsible for processing. The interference or effector module is involved in target recognition and nucleic acid cleavage. In class 1 CRISPR–Cas systems, the effector module consists of multiple Cas proteins — namely, Cas3 (sometimes fused to Cas2), Cas5–Cas8, Cas10 and Cas11, in different combinations, depending on the type and subtype (see BOX 1). By contrast, in class 2 systems, the effector module is represented by a single, large protein — Cas9, Cas12 or Cas13. The signal transduction or ancillary module is a diffuse collection of CRISPR-linked genes, most of which have roles in CRISPR–Cas systems that are, at best, tentatively predicted. However, for type III systems an essential signal transduction pathway has been characterized. This pathway involves activation of the Csm6 (or Csx1) higher eukaryotes and prokaryotes nucleotide-binding (HEPN) RNase by cyclic oligoA, which is synthesized by the Cas10 polymerase and binds the CRISPR-associated Rossmann fold (CARF) domain of Csm6 or Csx1 (REFS36,37,43).
Comparative genomic analyses have revealed partial independence of the adaptation and effector modules of CRISPR–Cas systems that, especially in the case of type III systems, appear to have recombined on many independent occasions44–46. As a result, the topology of the phylogenetic tree of Cas1 shows only limited agreement with the CRISPR–Cas classification (Supplementary Dataset 4).
The classification of CRISPR–Cas systems is based primarily on Cas protein composition differences and sequence divergence between the effector modules16,18. The class 1 effector complexes involved in pre-crRNA processing and target recognition have similar organizations between types I, III and IV, although the sequence conservation among these three types is minimal47,48. The backbone of the effector complexes in all three class 1 types is formed by the distantly related RNA recognition motif (RRM) domain-containing proteins of the repeat-associated mysterious proteins (RAMPs) Cas5 and Cas7, the latter typically present in multiple copies. In most class 1 CRISPR–Cas systems, the third RAMP, Cas6, is the dedicated RNase responsible for pre-crRNA processing and may or may not be physically associated with the effector complex. The RAMPs are characterized by extreme sequence divergence, so that the sequences of Cas5, Cas6 and Cas7 from different subtypes could be linked only by using the most sensitive methods for profile–profile sequence comparison or by direct comparison of protein structures. The large subunits of the effector complexes of type I and III systems, Cas8 and Cas10, respectively, occupy analogous positions in the complexes but show no sequence similarity and, at best, only remote structural similarity. Whether or not Cas8 and Cas10 are homologous remains an open question; if they are, the divergence is extreme, rendering any sequence of structural similarity effectively beyond detection49. Moreover, the Cas8 sequences show no detectable similarity even between some of the class 1 subtypes, such that the respective variants can serve as subtype signatures (Supplementary Table 2). The small subunit of the class 1 effector complexes (Cas11) shows no statistically significant sequence similarity between type I and III systems, but the structural similarity between the Cas11 proteins, as well as between Cas11 and the C-terminal α-helical domain of Cas10, strongly suggests that these are highly diverged homologues47,50. In type I systems, a key, standalone (although in some variants fused with Cas2) component of the effector module is Cas3, a large protein that typically consists of fused helicase and HD nuclease domains and is directly responsible for the target DNA cleavage. Type III systems differ fundamentally, with the HD nuclease being fused to Cas10, the large subunit of the complex involved in transcription-dependent cleavage of the target DNA.
Type IV CRISPR–Cas systems are highly derived variants that typically lack adaptation modules as well as the nucleases required for interference. Moreover, only Cas5 and Cas7 proteins are readily identifiable in type IV loci by sequence similarity with their counterparts in other types. Recent comparisons of the structures of the effector complexes of type IV and I systems have identified the type IV counterpart of the large subunit of the effector complex48, suggesting that type IV systems could be highly diverged type I or type III derivatives.
The class 2 effector modules are single large proteins, with their domain architectures clearly differentiating type II, V and VI systems (BOX 1; and see the discussion below)20. The types and subtypes within class 2 differ substantially with respect to the mechanisms of pre-crRNA processing51–54. In type VI and subtype V-A systems, the large effector protein also encompasses the pre-crRNA processing RNase activity55–57, whereas in type II and several type V subtypes, this processing activity is typically relegated to a non-Cas enzyme, RNase III. In the latter cases, the effector module includes an additional RNA molecule, the transactivating CRISPR (tracr) RNA, which forms stable duplexes with the partially complementary direct repeat of the pre-crRNA. After cleavage of the RNA duplex by RNase III, the mature guide RNA — that is, the crRNA–tracrRNA complex — remains stably bound to the effectors, allowing for specific DNA interference51–54.
The set of Cas1 to Cas13 proteins that comprise the adaptation and effector modules define the types and subtypes and thus represent the core of class 2 CRISPR–Cas systems. This core is accompanied by numerous ancillary proteins that are more loosely associated with CRISPR–Cas. The repertoire of the ancillary genes has recently expanded drastically, in large part through the use of dedicated computational protocols for the systematic detection of CRISPR-linked genes31,32. We discuss these ancillary genes in a later section, after describing the current state of the CRISPR–Cas classification.
In addition to the distinctions between the cas gene composition and the sequences and structures of Cas proteins, the types and subtypes of CRISPR–Cas systems can be, to some extent, differentiated by the distinct sequence and structural features of the repeats themselves58,59. However, the correspondence is incomplete, such that the branches in the cluster dendrogram of CRISPR–Cas collate multiple subtypes59.
CRISPR–Cas classification
Class 1 and its derivatives.
The classification of class 1 CRISPR–Cas systems, which include types I, III and IV, has remained relatively stable compared with the 2015 version18 (FIG. 1). The 2015 class 1 classification scheme included 12 subtypes that can be distinguished by sequence similarity clustering of effector proteins, as well as by comparison of loci organizations and the sequences of repeats. In the updated scheme, four subtypes are added — subtypes III-E, III-F, IV-B and IV-C. In addition, given the experimental demonstration of new spacer incorporation by subtype I-U sys tems60, this subtype was reclassified as subtype I-G.
Subtype III-E, identified in 14 contigs from the NCBI non-redundant nucleotide sequence database that appear to come from 8 bacterial species (Supplementary Fig. 1, Supplementary Table 2, Supplementary Dataset 5), is characterized by a unique fusion of several Cas7 proteins and a putative Csm2-like small subunit (Cas11), such that the crRNA-binding part of the effector module is compressed within a single, large multidomain protein. In this respect, subtype III-E resembles class 2 CRISPR–Cas systems, although the domain composition and sequence analysis unequivocally place it within type III of class 1, and moreover, a specific relationship with subtype III-D could be traced (Supplementary Fig. 1). The multidomain subtype III-E effector is predicted to cleave pre-crRNA, given the conservation of aspartate residues that are known to be involved in RNA cleavage in the homologous Csm4 protein, and might also contribute to target RNA cleavage (Supplementary Fig. 2). The subtype III-E loci often include a putative ancillary gene encoding a large protein that contains a CHAT domain, a caspase family protease that is, typically, involved in programmed cell death61, fused to tetratricopeptide repeats (TPRs) (FIG. 1). The presence of this ancillary protein suggests subfunctionalization or neofunctionalization of subtype III-E systems and their potential involvement in complex defence pathways (FIG. 1, Supplementary Fig. 1).
The subtype III-F systems have been identified previously62 and are included in the 2015 CRISPR–Cas census18, but they were not classified as a distinct subtype, because their number was too small. Now that this variant has been found in 12 additional genomes, it has become apparent that they qualify as a separate subtype (Supplementary Fig. 3, Supplementary Table 2). The Cas7 and Cas5 subunits, as well as the large subunit of the subtype III-F effector complex, show a distant but substantial similarity with the corresponding components of other type III subtypes, whereas the putative small subunit does not show any similarity to Cas11. Unlike all other type III systems, subtype III-F contains only one Cas7-like protein. The HD domain fused to the Cas10-like large subunit retains all catalytic residues and therefore is predicted to cleave the target DNA. However, the cyclase or polymerase domain of the Cas10-like subunit is inactivated, as indicated by amino acid substitutions in the catalytic site, and furthermore, the subtype III-F loci lack any genes encoding CARF domain proteins. Thus, this type III subtype clearly does not function via cyclic oligoA signalling, as has been shown for subtype III-A and implied for the rest of the type III systems containing an active Cas10 polymerase36,37,43.
In the 2015 classification, subtype IV-B was reported as a variant that, unlike subtype IV-A systems, lacks the dinG gene but contains a distinct version of the predicted small subunit of the effector complex; furthermore, most of the subtype IV-B loci encompass the ancillary gene cysH32. These systems have been discovered on plasmids from numerous, diverse bacteria30, and accordingly, the variant was upgraded to a subtype. Subtype IV-C loci were also detected but not formally classified in the 2015 census. Now this type IV variant has been identified in nine contigs, mostly from thermophilic microorganisms (Supplementary Fig. 4, Supplementary Table 2), its classification as a distinct subtype also appears justified. The Cas7 and Cas5 homologues of the subtype IV-C systems show statistically significant similarity to the corresponding proteins of subtype IV-A and IV-B systems, whereas the putative large and small subunits of the effector complex lack any detectable similarity with their counterparts from any other CRISPR–Cas system. Notably, unlike in the other two type IV subtypes, the putative large subunit of subtype IV-C contains an HD nuclease domain, suggesting that it cleaves the target DNA. The order of the HD nuclease motifs is the same as in Cas3 in type I systems, but different from that in the HD nuclease domains fused to Cas10 in most of the type III systems, apparently as a result of a circular permutation occurring during the evolution of the CRISPR-associated HD nucleases.
Several additional, distinct variants of class 1 could become subtypes when more taxonomic diversity and/or more structural and experimental data become available. Among such cases, a distinct type III system variant found in archaea of the order Sulfolobales is represented by the loci YN1551_RS11700 to YN1551_RS11720 from Sulfolobus islandicus (Supplementary Dataset 1). This variant features extremely diverged Cas10 and Cas5 homologues and a unique, uncharacterized predicted component of the effector complex, Csx26. Another distinct type III system, so far found only in the archaeon Ignisphaera aggregans (loci Igag_0607 to Igag_0623), includes several proteins that are not similar to any known Cas or ancillary proteins.
A variety of derived, apparently defective variants of type I systems have been discovered, such as the ‘minimal’ subtype I-F and subtype I-B systems, which are encoded by distinct families of Tn7-like transposons30,33. These variants lack the helicase-nuclease Cas3 that is required for interference63 and therefore are predicted to perform functions distinct from adaptive immunity. A hypothesis has been proposed that these minimal type I variants mediate guide-RNA-dependent transposition30,33, and recently, such activity has been demonstrated experimentally35. Defective CRISPR–Cas systems have also been reported in preliminary studies to be encoded by some of the recently discovered giant phages, where their roles remain to be deciphered64. An analogous interference-deficient derivative of subtype I-E CRISPR–Cas systems was detected in the genomes of many bacteria of the genus Streptomyces32. This variant is not associated with any detectable mobile genetic elements but is tightly linked to a gene encoding a STAND superfamily NTPase65, suggesting involvement of these interference-deficient CRISPR–Cas systems in signal transduction and possibly in dormancy induction or programmed cell death. The differences in the Cas protein compositions between these minimal CRISPR–Cas variants and fully functional type I systems potentially could be used as an argument for classification of the defective variants into separate subtypes. However, Cas protein sequence comparison and phylogenetic analysis unequivocally demonstrate the origins of these variants in subtypes I-F, I-E and I-B, respectively30,32,33. Therefore, we propose to keep them within their respective subtypes as distinct variants — denoted, for example, I-F1, I-F2 and so forth (FIG. 1).
Apart from the newly identified subtype IV-C, most of the type IV systems are also defective CRISPR–Cas forms that lack the nucleases involved in target cleavage and thus resemble the transposon-encoded variants with respect to organization and, perhaps, functionality. Indeed, the distinctive biological features of type IV systems are their apparent (nearly) exclusive localization on plasmids, integrated conjugating elements and prophages30. Furthermore, preliminary data suggest that multiple spacers targeting heterologous plasmids have been detected in type IV CRISPR arrays, suggesting that one of the functions of type IV systems is inter-plasmid competition66.
Some derived variants are so distant from the canonical organization that their status as CRISPR–Cas systems appears questionable. A case in point is a recently described locus found in many Haloarchaea that only retain highly divergent forms of Cas5 and Cas7 (haloarchaeal RAMPs, or HRAMPs), along with an uncharacterized conserved protein and various nucleases67 (Supplementary Fig. 5A). The search of Asgard archaea genomes68 performed in the course of this work also revealed highly derived CRISPR–Cas variants that resemble HRAMPs in terms of their Cas protein composition and encompass an unusual large protein containing a diverged Cas1 domain, along with distinct variants of Cas5 (a fusion with an HD nuclease) and Cas7, as well as additional nucleases (Supplementary Fig. 5B). The functions of these extremely derived systems are unknown, and given the lack of adjacent CRISPR arrays, it is not even clear whether their activity is guide-RNA dependent. If these systems are shown to function via a CRISPR–Cas-like mechanism, they might qualify as distinct types, given the drastic reduction of the Cas protein repertoire.
Thus, the formation of derived variants that lack the interference capacity and are likely to perform functions distinct from adaptive immunity is a pervasive trend in the evolution of CRISPR–Cas. Additional highly divergent CRISPR–Cas derivatives are likely to be discovered, and their experimental characterization is likely to become a major research direction.
The expanding class 2.
Class 2 CRISPR–Cas systems include types II, V and VI. The distinguishing feature of these types is that their effector complexes consist of a single, large, multidomain protein, such as Cas9 in type II. Thanks to focused efforts on the computational discovery of new class 2 systems, partly in the quest for potential new genome-editing tools, this class has undergone a drastic expansion since the 2015 classification11,20–23. From 2 types and 4 subtypes in 2015, class 2 expanded to 3 types and 17 subtypes (FIG. 2). The new discoveries include multiple, diverse variants of type V as well as type VI systems, the first and so far the only variety of CRISPR–Cas systems that exclusively cleaves RNA.
Type V systems fundamentally differ from type II by the domain architecture of their effector proteins. The type II effectors (Cas9) contain two nuclease domains that are each responsible for the cleavage of one strand of the target DNA, with the HNH nuclease inserted inside the RuvC-like nuclease domain sequence51. By contrast, the type V effectors (Cas12) only contain a RuvC-like domain that cleaves both strands69,70. Type VI effectors (Cas13) are unrelated to the effectors of type II and V systems, contain two HEPN domains and apparently target transcripts of invading DNA genomes. Cas13 proteins also display collateral, nonspecific RNase activity that is triggered by target recognition and induces dormancy in virus-infected bacteria71.
The assignment of subtypes within type II, V and VI systems is a challenge because of the uniform domain architecture of the respective effector proteins. The current practice (which, admittedly, involves a degree of arbitrariness) is to establish a new subtype for variants that do not show statistically significant sequence similarity to any of the already-established subtypes in BLAST searches39; the presence of additional accessory genes is also taken into consideration. This approach has so far resulted in the identification of 3 subtypes of type II systems, 10 subtypes of type V systems and 4 subtypes of type VI systems with typical, large effector proteins (FIG. 2).
In addition, a heterogeneous assemblage of putative type V variants with smaller RuvC-like domain-containing proteins, provisionally classified as subtype V-U, have been discovered20 (Supplementary Fig. 6). The putative subtype V-U effectors show high sequence similarity to TnpB proteins (predicted RuvC-like nucleases) encoded by IS605-like transposons and are thought to be intermediates on the evolutionary path from TnpB to fully fledged type V effectors. CRISPR–Cas systems evolved from different groups of TnpB on multiple, independent occasions, as has been shown by phylogenetic analysis of the TnpB family20. Recently, the interference activity of four subtype V-U effectors was validated experimentally, and as a result, one of these variants has been upgraded to a separate subtype, V-F 22,23. Notably, these newly characterized CRISPR–Cas variants show major differences in interference specificity compared with the previously characterized type V effectors and with one another. The subtype V-F effector, Cas12f (originally denoted Cas14), has been shown to cleave single-stranded DNA (ssDNA)22, although double-stranded DNA cleavage activity has subsequently been reported in a preliminary study as well72, whereas Cas12g is an RNA-guided RNase that also possesses collateral RNase and ssDNase activities23. These findings emphasize the remarkable functional diversity of CRISPR–Cas systems, which remains to be fully characterized through the discovery and study of new subtypes. Different variants within subtype V-F (currently, variants V-F1–V-F3) appear to originate from different groups of tnpB genes, as indicated by the phylogenetic analysis of the TnpB family20 (Supplementary Dataset 4). Nevertheless, given the highly significant sequence similarity between these effector proteins, they are all currently classified within a single subtype.
One of the former V-U variants, V-U5, contains an apparently inactivated RuvC-like nuclease domain, as indicated by the replacement of essential catalytic residues, and is encoded by cyanobacterial Tn7-like transposons30. The prediction that this variant evolved to function in transposons analogously to the defective type I systems — that is, by mediating guide RNA-dependent transposition — has recently been experimentally validated (and the subtype has accordingly been upgraded to subtype V-K) 34.
It is expected that the remaining subtype V-U variants will be classified into the already created or into additional subtypes as they are experimentally characterized. Furthermore, in all likelihood, multiple subtypes of type V systems that independently originated from TnpB nucleases remain to be discovered, and consequently, the number of recognized subtypes will grow further.
The origin of type VI systems is much less clear than the derivation of type V systems from TnpB. The HEPN RNase domain is widespread in various defence systems — in particular, as the toxin components of numerous toxin–antitoxin modules, which are likely to be the ultimate ancestors of CRISPR-associated HEPN domains7,73. Given that the presence of two HEPN domains is a unique signature of type VI effectors (Cas13), it is appealing to surmise that these effectors evolved from a common ancestor after duplication of the HEPN domain. However, the two HEPN domains in each of the Cas13 proteins are only distantly related to each other, and phylogenetic analysis results appear not to be compatible with the duplication scenario (Supplementary Fig. 7). In the phylogenetic tree of the HEPN family, the N-terminal and C-terminal HEPN domains form distinct branches, pointing to a common ancestor with two HEPN domains. This ancestral cas13 gene might have evolved by recombination between two genes encoding distinct HEPN-containing proteins and, possibly, a distinct family of toxin components of abortive infection modules73. Type VI systems appear to be far less diverse than type V systems, but the discovery of new subtypes remains possible. For example, we identified a distinct type VI system variant in Brachyspira species with a two-HEPN effector that shows no significant similarity to the Cas13 sequences from the four current subtypes (Supplementary Fig. 8). Presently, we refrain from calling it a new subtype because of its narrow spread in bacteria, but as the genomic database grows, this will be a strong candidate.
A bipartite gene-sharing network
In addition to the classification approaches outlined above, we performed a quantitative analysis of a bipartite network in which CRISPR–cas loci are connected through shared genes (Supplementary Fig. 9). To identify clusters of tightly connected loci that share overlapping gene sets, we applied a previously described consensus-clustering approach that combines bipartite modularity maximization and hierarchical clustering, followed by significance-based filtering of the results40. By highlighting distinct sets of genes and loci that are mutually associated, the identification of modules in the gene-sharing network could contribute to both CRISPR–Cas classification and functional prediction.
Altogether, 126 modules were identified in the CRISPR–Cas network, which can be roughly assigned to four categories: modules sharing distinct ancillary gene sets (category 1); derived variants characteristic of specific bacterial or archaeal lineages (category 2); mixed modules that apparently result from recombinational shuffling among CRISPR–cas loci that typically share closely related adaptation genes but have distinct effector genes (category 3); and modules that lack any of the above distinctive features but include highly diverged Cas proteins (category 4) (Supplementary Fig. 9, Supplementary Dataset 3). The recently characterized minimal variant of subtype I-F (I-F3) associated with Tn7-like transposons, a remarkable case of CRISPR–Cas neofunctionalization (module 16), is an example from category 1. Cyanobacteria-specific modules 65 and 98, which consist of distinct variants of subtype III-B, exemplify category 2. A case of previously described gene shuffling in Methanosarcina species62,74 is captured in module 84, which belongs in category 3. Most of the identified modules include CRISPR–cas loci that belong to the same subtype. The exceptions are modules that combine two or three subtypes of type I (modules 10 and 101) or type V (module 126) systems that share overlapping gene compositions. More notably, three modules (46, 93 and 108) join loci of types I and III systems, apparently reflecting recombinational events. Only a few relatively rare, low-abundance subtypes are represented by a single module. Most of the subtypes are divided into multiple modules, with subtypes I-E and I-B showing the highest heterogeneity (14 and 13 modules, respectively). This reflects the functional and evolutionary plasticity of these subtypes, which conceivably underlie their high abundance in current genomic databases (see below).
The fine-grained modules produced by bipartite network analysis could be useful for the identification of distinct functional variants of CRISPR–Cas systems that might be obscured by the conservative assignment of subtypes and variants. Moreover, this approach could provide a fast and straightforward way to assign new CRISPR–cas loci to predefined types and subtypes for which related loci have already been identified. In support of this possibility, the present bipartite network analysis was able to correctly assign most of the incomplete CRISPR–cas loci to the types and subtypes where they belong. To delineate coarse-grained modules that would facilitate the classification of novel CRISPR–Cas systems in an unsupervised way, more sophisticated multiresolution approaches will be required.
Distribution of CRISPR–Cas systems
The CRISPR–Cas systems are non-uniformly distributed among bacterial and archaeal phyla. We present a census of CRISPR–cas loci in the current collection of complete bacterial and archaeal genomes. Analysis of 13,116 complete genomes showed that CRISPR–cas loci are represented in a substantial majority of archaea (276 of 324 genomes (85.2%)), including almost all hyperthermophiles (89 of 92 genomes (96.7%)), but only in ~40% of bacteria (5,412 of 12,792 genomes (42.3%)) (FIG. 3, Supplementary Dataset 6). Clear trends are observed in the distributions of specific CRISPR–Cas classes, types and subtypes. In particular, class 2 remains nearly exclusive to bacteria. The absence of class 2 in archaea, at least in part, can be explained by the absence of RNase III, the pan-bacterial enzyme that is responsible for pre-crRNA processing in type II and some subtypes of type V systems — that is, in most of the class 2 systems46,75. By contrast, the genomes of Crenarchaeota are substantially enriched for type III systems of class 1. Overall, and in most groups of bacteria and archaea, class 1 is far more abundant than class 2. However, there are notable exceptions — for example, Tenericutes bacteria, in which only class 2 systems have been identified so far (FIG. 3). Some groups of bacteria, such as Chlamydia species (FIG. 3) or the recently discovered candidate phyla radiation, which appears to consist mostly of symbiotic microorganisms, are nearly devoid of CRISPR–Cas systems76–78. Conversely, the majority of type VI systems — and in particular, all instances of the most abundant sub type VI-B — have been identified in bacterial genomes of the phyla Bacteroidetes and Fusobacteria (FIG. 3).
The biological underpinnings of the non-uniform phyletic spread of CRISPR–Cas systems remain to be elucidated. Considering the high horizontal mobility of CRISPR–cas loci, it appears likely that their loss or retention in prokaryotic genomes depends on the trade-off between the fitness cost, which is determined mostly by autoimmunity and the curtailment of horizontal gene transfer, and the benefits of defence conferred by adaptive immunity79–84. These benefits most likely depend on the abundance and diversity of viruses in specific habitats, as well as on the biology of host–parasite interactions in specific groups of microorganisms85,86. The evolutionary dynamics that determine the distribution of CRISPR–Cas among bacteria and archaea can be expected to become one of the major directions in CRISPR–Cas research in the next few years. In particular, these dynamics might depend, to a large extent, on the interactions between CRISPR–Cas and DNA repair mechanisms, such as the double-strand break repair systems87.
Core and ancillary cas genes
The components of the adaptation and effector modules comprise the suite of core Cas proteins. The core Cas proteins in the widespread CRISPR–Cas types and subtypes are well characterized, although the discovery of novel class 2 effector proteins continues to gradually expand the core gene repertoire. Furthermore, in the course of the systematic search for new CRISPR-linked proteins, many highly diverged variants of the core proteins have been identified32.
By contrast, the list of the (predicted) ancillary CRISPR-linked proteins has greatly expanded as a result of dedicated searches of CRISPR–Cas genomic neighbourhoods31,32. For the great majority of these proteins, no experimental data are available yet, but computational analysis of their domain architectures points to multiple connections to various signal transduction pathways, as well as membrane association or functional links to membrane transport processes for many CRISPR–Cas systems — particularly those of type III systems, which drastically stand out in the complexity of their gene repertoire among all CRISPR–Cas forms (FIG. 4). Several accessory proteins — for example, those in subtypes VI-B and VI-D — have been directly shown to modulate the activity of their respective effectors25,26. Furthermore, some of the genes that are currently classified as ancillary are actually represented in numerous CRISPR–Cas systems and could perform major roles in the immune response. The most obvious example is Csm6, a HEPN-domain RNase that is a component of the majority of subtype III-A CRISPR–Cas systems and is activated by the signal transduction pathway initiated by cyclic oligoA produced by the Cas10 polymerase30,31. Systematic experimental characterization of the roles of accessory proteins in CRISPR–Cas functions will undoubtedly be another key research area in the study of CRISPR–Cas biology for years to come.
The discovery of new class 2 subtypes and numerous accessory proteins poses obvious problems for the systematic nomenclature of CRISPR-linked genes. So far, a conservative approach has been adopted, under which the cas designation is reserved for core genes, or more precisely, families of homologous core genes (Supplementary Table 1). The numbered cas gene names were originally assigned to the 11 most common genes among diverse CRISPR–Cas systems, and subsequently, cas12 and cas13 — the effectors of type V and type VI systems, respectively — have been added. Currently, the cas names are reserved for type-specific effector genes, whereas subtypes are specified by suffixes — for example, cas12a, cas12b, cas12c and so forth. The recent designation as Cas14 of small type V effector proteins related to those in subtype V-U systems22 does not conform with this criterion. We believe that the appropriate name for these proteins should be Cas12f 23, given that Cas12 is supposed to apply to all type V system effectors. Obviously, under this approach the number of cas genes cannot be expected to increase substantially, because both the discovery of new types and the identification of new core genes for already established types are rare. The ancillary genes continue to be known under their legacy names or as csx followed by a number, although a systematic nomenclature might be considered in the future.
Origins and evolution of CRISPR–Cas
Comparative analysis of CRISPR–Cas systems — in particular, the newly discovered class 2 subtypes — provides for the reconstruction, at least in outline, of a nearly complete scenario of CRISPR–Cas evolution (FIG. 5). A striking feature of the evolutionary history of CRISPR– Cas is the repeated recruitment of genes from different mobile genetic elements for various functions in adaptive immunity7,29. Thus, the adaptation module, along with the CRISPR repeats themselves, appears to originate from an immobilized transposon of the casposon family, so named because these elements employ a Cas1 homologue as the transposase88–90. The casposon could have contributed not only cas1 but also the cas4 gene, encoding another nuclease that is involved in PAM selection during adaptation in many CRISPR–Cas systems60,91–93, given that Cas4 homologues are among the cargo genes in some casposons.
The effector module of type III systems appears to be the best candidate for the ancestral state, given their widespread (especially in archaea) and complex gene composition, as well as the fact that, in most of the type III variants, the large subunit of the effector complex (Cas10) is an active enzyme, a cyclic oligoA polymerase7. The effector moiety of CRISPR–Cas could have started as a putative signalling system that has been identified in several bacteria and consists of a small-sized, ‘minimal’ Cas10 homologue and a homologue of Csm6 with fused CARF and HEPN domains7,94 (FIG. 5). This system is predicted to function analogously to the signal transduction pathway in type III CRISPR–Cas systems — namely, by synthesizing cyclic oligoA (most likely in response to stress) that is then bound by the CARF domain and allosterically activates the RNase activity of the HEPN domain36,37. Indiscriminate RNA cleavage by the HEPN domain would induce dormancy or programmed cell death. The putative ancestral system remains to be studied experimentally, but even without such validation, it resembles an abortive infection (Abi) module. Indeed, recently the HEPN-containing Csm6 protein of subtype III-A systems has been shown to act as a toxin causing growth arrest of the host cell95, which is compatible with the origin of the type III effector module from an Abi system. Similar to the known Abis10,96, the ancestor of the effector module is likely to be subject to extensive horizontal gene transfer and might, effectively, possess features of a mobile genetic element.
Thus, different types of mobile genetic elements seem to have given rise to both the adaptation and the effector parts of class 1 CRISPR–Cas systems. The subsequent evolution of the effector module would have involved serial duplication of the RRM domain of the Cas10 homologue and the capture of additional proteins — in particular, the target-cleaving HD nuclease7. The key event in the evolution of type I systems was the capture of the helicase-nuclease Cas3 and the replacement of the oligoA polymerase Cas10 with the enzymatically inactive Cas8 as the large subunit of the effector complex. Whether the latter event involved extreme divergence following the inactivation of Cas10 or the capture of an unrelated protein remains uncertain.
The origin of type IV systems remains uncertain, but the recent discovery of subtype IV-C systems, with the large subunit fused to an HD domain, together with the observation that both the Cas5 and Cas7 components of type IV systems share a greater sequence similarity with their counterparts from type III than with those from type I systems, suggests that type IV could have evolved from type III. These observations are compatible with the lack of association of the IV-C systems with any known mobile genetic elements. Similar lines of evidence could point to subtype I-D systems as a potential evolutionary intermediate between type III and type I systems. The structure of both the subtype I-D effector complex and the Cas10d protein should shed more light on the origin of type I systems. The origin of the HRAMP system, a highly derived CRISPR-less class 1 variant, is unclear as well, but both its Cas5 and Cas7 components are more similar to the respective proteins of type III than to those of type I systems, suggesting a route of evolution parallel to that of type IV systems67.
In class 2, the effectors of different subtypes of type V and, possibly, type II systems appear to have evolved, on multiple independent occasions, from TnpB nucleases encoded by yet another class of mobile genetic element, the IS605-like transposons20. Type II systems apparently evolved from a distinct variety of TnpB (denoted IscB) that contains an HNH nuclease domain inserted into the RuvC-like domain97. The type VI system effectors (Cas13) seem to originate from HEPN-containing components of an Abi module7,20 (FIG. 5). The functional analogy between Cas13a and Abi has been recently validated by experiments that have demonstrated growth arrest of phage-infected bacteria that is dependent on Cas13a activity71. A recurrent trend in the evolution of CRISPR–Cas effectors is the accretion of additional proteins (in class 1) or domains (in class 2), on top of the core nuclease domains, providing for the flexibility required to accommodate the crRNA and the target DNA or RNA7.
Another general trend in CRISPR–Cas evolution is the spawning of defective variants, many of which are appropriated by mobile genetic elements30,33. The defective forms of CRISPR–Cas systems are predicted to perform various functions that require target recognition but not cleavage. A striking case of such functionality is the crRNA-dependent, site-specific transposition that has recently been demonstrated for the transposon-encoded derived variants of subtype I-F and subtype V-K systems34,35.
Concluding remarks
Because the most abundant types and subtypes of CRISPR–Cas systems are now known, the overall structure of the current classification is likely to stand the test of time. However, the discovery of comparatively rare but functionally and evolutionarily interesting and informative variants has not stopped and, in all likelihood, will continue, especially as diverse environments are explored by methods of metagenomics and single-cell genomics. Some of these variants are distinct enough to become new subtypes, but so far, no new types have been identified after the discovery of type VI. According to the currently adopted criteria, to qualify as a new type, a CRISPR–Cas variant has to encompass an effector module unrelated (or extremely distantly related) to those of the known types. Other types might remain to be discovered, but it is becoming increasingly clear that, if such additional types exist, they are rare and/or highly specialized. Investigation of the numerous ancillary components of CRISPR–Cas is starting to uncover multiple connections between CRISPR–Cas and various functionally distinct systems of bacterial and archaeal cells, particularly those involved in different forms of signal transduction.
In summary, the diversity of the identified CRISPR– Cas systems has substantially increased over the last four years, thanks to a combination of computational and experimental approaches. Notably, the new varieties could be classified into distinct types and subtypes by using several criteria. Arguably, this granularity stems from punctuated evolution, whereby the diversification of emerging subtypes slows down after an initial period of rapid innovation. Notwithstanding the apparent distinctness of the subtypes, the increasing diversity of CRISPR–Cas creates further challenges to classification and nomenclature and calls for the development of robust classification criteria. The delineation of types and, to a large extent, subtypes will likely remain qualitative, given the paucity of shared components. However, the classification of variants within subtypes, some of which might qualify as separate subtypes, can be quantified — for example, by using bipartite network analysis, as shown here. On the whole, we believe that the classification of CRISPR–Cas systems has entered the era of consolidation and refinement. Experimental characterization of CRISPR–Cas functions still lags behind predictions produced by computational analysis. It is our hope that the updated classification will facilitate experimental studies and promote new directions.
Acknowledgements
K.S.M., Y.I.W., J.I., S.A.S. and E.V.K. are supported through the Intramural Research Program of the US National Institutes of Health; F.J.M.M. was supported by grants BIO2014–53029-P (Ministerio de Ciencia, Innovación y Universidades, Spain), and 291815 Era-Net ANIHWA (7th Framework Programme, European Commission) and PROMETEO/2017/129 (Conselleria d’Educació, Investigació, Cultura i Esport, Generalitat Valenciana, Spain); S.A.S. was supported by RFBR (research project 18–34-00012) and a Systems Biology Fellowship from Philip Morris Sales and Marketing; S.M. was funded by funding from the Natural Sciences and Engineering Research Council of Canada (Discovery program) and holds a Tier 1 Canada Research Chair in Bacteriophages.
Footnotes
Competing interests
The authors declare no competing interests.
Supplementary information
Supplementary information is available for this paper at https://doi.org/10.1038/s41579-019-0299-x.
References
This work demonstrates the relationships between the effectors of different types and subtypes of class 2 CRISPR–Cas systems and nucleases encoded by mobile genetic elements. On the basis of sequence comparison and phylogenetic analysis of Cas12 (type V effectors) and TnpB nucleases encoded by transposons, a scenario of independent recruitment of distinct TnpB variants, giving rise to different type V subtypes, is proposed.
This work describes the metagenomic discovery of two new subtypes of type V CRISPR–Cas systems and experimental validation of their activity.
This work experimentally validates the enzymatic activity of small predicted effectors that have been assigned to subtype V-U by Shmakov et al. (2017) and are here reclassified as subtype V-F. It shows that these enzymes differ substantially from the previously characterized large type II and type V effectors and catalyse both crRNA-specific and non-specific cleavage of single-stranded DNA.
This article reports the experimental characterization of CRISPR–Cas subtypes V-C, V-G, V-H and V-I. Whereas Cas12c, Cas12h and Cas12i proteins all demonstrate RNA-guided double-stranded DNA interference similar to that in previously described CRISPR–Cas effectors, Cas12g is shown to function as an RNase with collateral RNase and single-strand DNase activities.
This study demonstrates RNA targeting by the smallest known type VI effector, Cas13d, and shows that the accessory WYL domain-containing protein stimulates this activity.
Along with Shmakov et al. (2018), this study describes a computational approach to predict proteins that are functionally linked to CRISPR–Cas systems and applies this approach to type III systems.
Along with Shah et al. (2019), this article describes a computational approach for the systematic prediction of proteins that are functionally linked to CRISPR–Cas systems (‘CRISPRicity’ protocol) and applies that approach to all CRISPR–Cas types and subtypes.
This study describes, for the first time, defective CRISPR–Cas systems encoded in Tn7-like transposons and predicts their function in RNA-guided transposition.
This work validates the prediction made in Shmakov et al. (2017), by showing that V-U5 variant effector proteins, which are inactivated TnpB homologues encoded in Tn7-like transposons, form a complex with the transposase subunit and enable crRNA-dependent transposition.
This work complements Strecker et al. (2019) by experimentally validating the prediction made in Peters et al. (2017) that interference-deficient subtype I-F CRISPR–Cas systems encoded in Tn7-like transposons enable crRNA-dependent transposition.
Along with Niewoehner et al. (2017), this article describes the signalling pathway involved in the function of type III CRISPR–Cas systems, which involves the synthesis of cyclic oligoA molecules by Cas10, binding of these signalling molecules to the CARF domain of Csm6 and activation of the second domain of Casm6, the HEPN nuclease that catalyses promiscuous RNA cleavage.
This work reveals the molecular details of the involvement of Cas4, an ancillary protein that cooperates with Cas1 and Cas2 in several CRISPR–Cas subtypes, in the process of adaptation.
This work demonstrates that indiscriminate RNA cleavage by the HEPN RNase domain of the Csm6 protein of type III CRISPR–Cas systems induces growth arrest in the host bacteria, providing a backup defence mechanism.
This works expands the characterization of the signalling pathway in type III CRISPR–Cas sequences by showing that a distinct variety of CARF domain cleaves the cyclic oligoA molecules produced by Cas10 and thus regulates the pathway.
Full text links
Read article at publisher's site: https://doi.org/10.1038/s41579-019-0299-x
Read article for free, from open access legal sources, via Unpaywall: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8905525
Citations & impact
Impact metrics
Citations of article over time
Alternative metrics
Article citations
Golden Gate Cloning of Synthetic CRISPR RNA Spacer Sequences.
Methods Mol Biol, 2850:297-306, 01 Jan 2025
Cited by: 0 articles | PMID: 39363078
Metagenomic characterization of viruses and mobile genetic elements associated with the DPANN archaeal superphylum.
Nat Microbiol, 24 Oct 2024
Cited by: 0 articles | PMID: 39448846
An alternative mechanism for recruiting Cas2/3 in a phage-encoded CRISPR-Cas system.
Nat Chem Biol, 20(11):1404-1405, 01 Nov 2024
Cited by: 0 articles | PMID: 38977790
Structural variation of types IV-A1- and IV-A3-mediated CRISPR interference.
Nat Commun, 15(1):9306, 29 Oct 2024
Cited by: 0 articles | PMID: 39468082 | PMCID: PMC11519345
First comparative genomics analysis of Corynebacterium auriscanis.
Mem Inst Oswaldo Cruz, 119:e240156, 25 Oct 2024
Cited by: 0 articles | PMID: 39476150 | PMCID: PMC11508509
Go to all (898) article citations
Data
Similar Articles
To arrive at the top five similar articles we use a word-weighted algorithm to compare words from the Title and Abstract of each citation.
An updated evolutionary classification of CRISPR-Cas systems.
Nat Rev Microbiol, 13(11):722-736, 28 Sep 2015
Cited by: 1208 articles | PMID: 26411297 | PMCID: PMC5426118
Review Free full text in Europe PMC
CRISPR-Cas systems: beyond adaptive immunity.
Nat Rev Microbiol, 12(5):317-326, 07 Apr 2014
Cited by: 163 articles | PMID: 24704746
Review
Diversity, classification and evolution of CRISPR-Cas systems.
Curr Opin Microbiol, 37:67-78, 09 Jun 2017
Cited by: 622 articles | PMID: 28605718 | PMCID: PMC5776717
Review Free full text in Europe PMC
Diversity and evolution of class 2 CRISPR-Cas systems.
Nat Rev Microbiol, 15(3):169-182, 23 Jan 2017
Cited by: 461 articles | PMID: 28111461 | PMCID: PMC5851899