Abstract
Free full text
Chromosome-scale genome assembly of Astragalus membranaceus using PacBio and Hi-C technologies
Abstract
Astragalus membranaceus (Fisch.) Bge (AM) is a medicinal herb plant belonging to the Leguminosae family. In this study, we present a chromosome-scale genome assembly of AM, aiming to enhance the molecular biology and functional studies of Astragali Radix. The genome size of AM is about 1.43Gb, with a contig N50 value of 1.67Mb. A total of 98.16% of the assembly anchored to 9 pseudochromosomes using Hi-C technology. The assembly completeness was estimated to be 97.27% using BUSCO with the long terminal repeat assembly index (LAI) of 16.22 and quality value (QV) of 48.58. Additionally, the genome contained 67.98% repetitive sequences. Genome annotation predicted 29,914 protein-coding genes, including 73 genes involved in the flavonoid biosynthetic pathway and 2,048 transcription factors. The high-quality genome assembly and gene annotation resources will greatly facilitate future functional genomic studies in Leguminosae species.
Background & Summary
Astragalus membranaceus (Fisch.) Bge (AM) is a widely used medicinal plants worldwide1. Its dried roots are known as Astragali Radix possessing hepatoprotective, diuretic, tonic and expectorant activities and play roles in anti-aging, anti-tumor, anti-neurodegeneration, and regulating blood glucose and immunity in Chinese medicine2. Flavonoids are one of the main active compounds in AM. Flavonoids have diverse biological activities and play numerous roles in the interaction between plants and the environment, such as resisting diseases and insect pests, preventing ultraviolet burns, attracting insects to pollinate, etc3. Recently, the genome of Astragalus mongholicus (AMM), another authorized plant source of Astragali Radix, has been reported4,5. It’s widely believed that the morphology and function of AM and AMM are highly divergent, and the latter species was more heterozygous. Based on metabolomics in the present study6, a total of 53 chemical markers was identified for the discrimination of AMM and AM. Among them, the contents of 36 components including 14 flavonoids in AM were significantly higher than those in AMM. AM may own stronger pharmacological activities than AMM.
To further understand the underlying molecular mechanism of flavonoid biosynthesis, we performed a chromosome-level genome sequencing of AM (2n=18) using a combined PacBio reads and Hi-C scaffolding technology (Fig. 1). The assembled AM genome had a total length of 1.43Gb, with a contig N50 of 1.67Mb and a complete BUSCO score of 97.27%. A total of 1.40Gb (98.16%) of the sequences was anchored to the 9 pseudochromosomes (Fig. 2). Genome annotation predicted 29,914 protein-coding genes and 972.44Mb (67.98%) repetitive sequences. Moreover, 73 genes associated with the flavonoid biosynthetic pathway (Fig. 3) and 2,048 transcription factors (TFs) have been identified. The chromosome-scale genome of AM provides a genetic basis for exploring key genes and molecular regulatory mechanisms involved in the biosynthesis of important compounds, while also serves as a valuable resource for comparative genomic analysis between AM and AMM.
Methods
Plant materials and sequencing
The plant material used for de novo genome assembly was a seven-year-old AM plant grown in Jinzhong, China. After the collection of vigorously growing leaves, they were immediately snap-frozen in liquid nitrogen. The frozen leaves were then stored at −80°C in the laboratory until DNA extraction could be performed. Genomic DNA was extracted using DNeasy Plant Maxi kit (Qiagen, German). A short-fragmented library was prepared with an insert size of 350bp and sequenced using BGISEQ, resulting in 150bp paired-end reads. Two libraries were prepared following the manufacturer’s instructions from Pacific Biosciences, with an insert size of approximately 20kb. These libraries were sequenced using PacBio Sequel platforms to generate continuous long reads. For chromosomal conformational capture (Hi-C) sequencing, libraries generated using DpnII restriction enzymes were prepared according to previously described methods7, and subsequently sequenced on the BGISEQ platform. RNA-seq libraries from root, leaf, and stem tissues during the fruit growth period were constructed using the NEBNext® Ultra™ RNA Library Prep Kit for Illumina® (NEB, Ipswich, MA, USA) following the manufacturer’s protocol8. Then cDNA libraries were sequenced using a BGISEQ instrument, yielding 150bp paired-end reads.
In summary, 156.2Gb of paired-end next-generation sequencing reads (~109.2X), 196.4Gb of PacBio subreads (~196.4X; the N50 length of subreads was larger than 22kb), and 285.6Gb of Hi-C data (~199.7X) were obtained (Table 1).
Table 1
Reads Number | Total length (Gb) | Genome depth | N50 length of reads (bp) | ||
---|---|---|---|---|---|
NGS raw reads | 1,041,321,120 | 156.2 | 109.2 | 150 | |
NGS clean reads | 1,032,333,412 | 154.6 | 108.1 | 150 | |
PacBio subreads | cell 1 | 9,476,555 | 143.6 | 100.4 | 22,132 |
cell 2 | 9,504,443 | 137.2 | 95.9 | 22,078 | |
Hi-C raw data | 1,842,342,896 | 285.6 | 199.7 | 150 | |
Hi-C clean data | 1,801,451,914 | 271.2 | 189.7 | 150 |
Genome survey
K-mer frequency distribution is a prevalent genomic survey technique. A K-mer is a sequence of K nucleotides extracted from sequencing data. With a read length of L, this method generates L-K+1 K-mers. The 17-mer is a common choice for genome size estimation due to its capacity to cover a vast number of combinations (4^17), suitable for various species such as willow (338.93 MB)9, Dalbergia odorifera (653.45Mb)10, camel (2.01–2.05Gb)11, and gecko (2.55Gb)12. Here, we counted 17-bp K-mers using Jellyfish (v 2.2.10)13 and estimated genome characteristics with GenomeScope (v2.0)14. The estimated genome size was 1.43Gb with a heterozygosity rate of 1.01% (Table 2). This assessment closely matches the results obtained via flow cytometry, which indicated a genome size of 1.52 Gb15.
Table 2
K | K-mer Number | Genome Size (Gb) | Repeat (%) | Heterozygous Ratio (%) | Used Bases (bp) | Depth (X) |
---|---|---|---|---|---|---|
17 | 137,801,908,923 | 1.43 | 82.1 | 1.01 | 154,569,325,184 | 108.09 |
Genome assembly
Based on PacBio CLR data, Canu16, FALCON17, and MECAT218 have become widely used software in the field of genome assembly. Research by Nie et al. in 202419 demonstrated the high accuracy of these software packages in genome assembly. Notably, FALCON, endorsed by PacBio, has played a pivotal role in numerous high-quality plant genome projects. For instance, FALCON was utilized in the barley genome20, the maize Mo17 projetc21, the Asian rice genome research22, and the coffee genome study23, showcasing its effectiveness in facilitating efficient genome assembly. Here, the contig of the AM genome was assembled using Falcon (v2.0.5) assembler, with parameters as follows: -v -B48 -D250 -M24 -h600 -e.75 -l3000 -s1000 -k18 -w6 -T8–output_multi–min_idt 0.75–min_cov 4–max_n_read 200–n_core 8. After the Falcon assembly, the genome was polished by the command-line SMRT Link (v4.0.0) following the Reference Guide (https://programs.pacificbiosciences.com/l/1652/2017-02-01/3rzxn6/184345/SMRT_Tools_Reference_Guide__v4.0.0_.pdf). To enhance the contiguity of the genome and reduce errors, NGS short reads were through Pilon (v1.22)24. Finally, TrimDup, a component of the Rabbit Genome Assembler (https://github.com/gigascience/rabbit-genome-assembler), was applied to eliminate redundant sequences using a percentage of 0.3.
To anchor contigs onto pseudochromosomes, we used BWA (v 0.7.12)25 to align the Hi-C clean data to the assembled contigs. Low-quality reads were filtered out using the HiC-Pro pipeline26 with default parameters. The remaining valid reads were employed to anchor chromosomes with Juicer27 and 3d-dna pipeline28. Finally, the chromosome assemblies were cut into 500kb bins of equal lengths and the interaction signals generated by the valid mapped read pairs between each bin were visualized in a heat map.
A genome assembly spanning 1.43Gb was generated (Fig. 1; Table 3), which was close to the genome size of AMM (1.43Gb vs 1.47Gb) and the estimated genome size. The contig N50 value of AM genome was 1.67Mb, which is comparable to the recently published genome of the closely related legume Astragalus sinicus29 (1.67Mb vs 1.5Mb). Approximately 1.40Gb (98.16%) of the sequences were successfully anchored to the 9 pseudochromosomes (Table 3).
Table 3
Item | Astragalus membranaceus (AM) |
---|---|
Size of assembly (Gb) | 1.43 |
Contig N50 (Mb) | 1.67 |
Chromosome number | 9 |
Anchored pseudo-chromosomes | 98.16% |
GC content | 38.40% |
Genome complete BUSCOs | 97.27% |
Long terminal repeat assembly index | 16.22 |
Quality value | 48.58 |
Repetitive sequences | 67.98% |
Number of protein-coding genes | 29,914 |
Annotation of repetitive sequences
Tandem repeats and interspersed repeats were identified using the method described in Qu et al.30. Approximately 67.98% of the assembled genome was classified as repetitive sequences, with interspersed repeats making up 65.99% of them (Table 4). Among the repetitive sequences, the most prevalent elements were long terminal repeats (LTRs), which accounted for 60.66% of the genome size.
Table 4
Type | Repbase TEs | TE protiens | De novo | *Combined TEs | ||||
---|---|---|---|---|---|---|---|---|
Length (bp) | % in genome | Length (bp) | % in genome | Length (bp) | % in genome | Length (bp) | % in genome | |
DNA | 31,435,958 | 2.20 | 16,519,868 | 1.15 | 50,391,304 | 3.52 | 72,929,673 | 5.10 |
LINE | 10,593,042 | 0.74 | 8,014,844 | 0.56 | 9,810,845 | 0.69 | 22,455,984 | 1.57 |
SINE | 129,143 | 0.01 | 0 | 0.00 | 208,537 | 0.01 | 336,632 | 0.02 |
LTR | 233,807,308 | 16.34 | 232,969,052 | 16.29 | 840,527,902 | 58.76 | 867,737,646 | 60.66 |
Other | 3,875 | 0.00 | 0 | 0.00 | 15,661 | 0.00 | 19,536 | 0.00 |
†Unclassified | 0 | 0.00 | 0 | 0.00 | 2,177,624 | 0.15 | 2,177,624 | 0.15 |
Total | 274,854,677 | 19.21 | 257,495,852 | 18.00 | 897,005,122 | 62.71 | 943,956,995 | 65.99 |
Note: This statistical table does not contain Tandem Repeats, some elements may partly include another element domain.
*Combined: the non-redundant consensus of all repeat prediction/classification methods employed.
†Unclassified: the predicted repeats that cannot be classified by RepeatMasker;
LINE, long interspersed nuclear elements; SINE, short interspersed nuclear elements; LTR, long terminal repeat.
Protein-coding genes prediction and functional annotation
Protein-coding genes were annotated using a similar method as described in Fang et al.31. To facilitate genome annotation of AM assembly, RNA sequencing of root, stem, and leaf samples was conducted and resulted in a total of 72.18Gb clean reads (Table 5). For transcriptome-based prediction, RNA-seq clean reads were assembled using Trinity (v 2.15.1)32 with the following parameters: ‘–max_memory 200G–CPU 40–min_contig_length 200–genome_guided_bam merged_sorted.bam–full_cleanup–min_kmer_cov 3–min_glue 3–bfly_opts ‘-V 5–edge-thr=0.1–stderr’–genome_guided_max_intron 10000–genome_guided_min_coverage 2’. This generated 245,216 transcripts with an N50 of 1,997bp. The assembled transcripts were aligned to the AM assembly using Program to Assemble Spliced Alignment (PASA) (v 2.4.1)33, and gene structures were generated from valid transcript alignments. Additionally, RNA-seq clean reads were also mapped to the AM assembly using Hisat2 (v 2.0.1)34. Stringtie (v 1.2.2)35 and TransDecoder (v 5.7.1) (https://github.com/TransDecoder/TransDecoder) were employed to assemble the transcripts and identify candidate coding regions into gene models. For homology-based method, homologous genomes and gene sets, including A. membranaceus var. mongholicus (AMM)5, Cicer arietinum (GenBank accession: GCA_026016865.1)36, Medicago truncatula (GenBank accession: GCA_000219495.2)37, Trifolium pratense (GenBank accession: GCA_949352195.3)38, Glycine max (ZH13-T2T)39, and Arabidopsis thaliana (Col-PEK1.5)40, were downloaded and used as queries to search against the AM assembly utilizing GeMoMa (v 1.9)41 approach. Genes with a coding sequence (CDS) length less than 150bp were filtered out, along with single-exon genes lacking annotation of protein domains. Additionally, genes not anchored to chromosome sequences and lacking annotation of protein domains were also excluded. Finally, the generated gene models were refined with PASA (v 2.4.1) to obtain untranslated regions and information on alternative splicing variation by using Trinity assembled transcripts and isoforms from full-length transcriptomes of leaf and root tissues42. Following the method described in Bi et al.43, the integrated gene set was translated into amino-acid sequences and annotated. As a result, 29,828 genes (99.71% of the total) were successfully annotated.
Table 5
Sample | Total CleanReads | Clean data | Clean Q20% (fq1;fq2) | GC_rate (%) | Uniquely mapped reads | Total MappingRatio | Uniquely MappingRatio | SRA accession |
---|---|---|---|---|---|---|---|---|
leaf_1 | 42,134,094 | 6,320,114,100 | 95.71;94.67 | 44.04 | 28,298,792 | 71.76% | 67.16% | SRR27790544 |
leaf_2 | 42,618,542 | 6,392,781,300 | 95.65;94.67 | 43.83 | 31,733,374 | 79.87% | 74.46% | SRR27790543 |
leaf_3 | 42,362,198 | 6,354,329,700 | 95.68;94.83 | 43.19 | 33,032,872 | 83.56% | 77.98% | SRR27790542 |
root_1 | 42,365,348 | 6,354,802,200 | 95.62;94.51 | 42.61 | 36,156,538 | 91.68% | 85.34% | SRR27790541 |
root_2 | 42,176,852 | 6,326,527,800 | 95.61;94.73 | 42.59 | 35,176,690 | 90.26% | 83.40% | SRR27790540 |
root_3 | 42,326,320 | 6,348,948,000 | 95.15;94.31 | 42.89 | 34,986,928 | 88.98% | 82.66% | SRR27790539 |
stem_1 | 42,113,002 | 6,316,950,300 | 95.63;94.54 | 42.57 | 36,319,666 | 91.70% | 86.24% | SRR27790538 |
stem_2 | 42,255,416 | 6,338,312,400 | 95.66;94.71 | 42.57 | 35,921,772 | 90.41% | 85.01% | SRR27790547 |
stem_3 | 42,317,456 | 6,347,618,400 | 95.66;94.26 | 43.19 | 34,638,542 | 87.17% | 81.85% | SRR27790546 |
Overall, we predicted 29,914 protein-coding genes, with average lengths of 4,752bp for genes, 622bp for introns, and 1,306bp for coding sequences. We downloaded the genes related to the flavonoid biosynthetic pathway in the AMM genome and identified genes associated with the flavonoid biosynthetic pathway in the AM genome using the OrthoFinder method44. OrthoFinder is an accurate and comprehensive tool used for identifying and comparing homologous genomics among biological species. As a result, 73 genes associated with the flavonoid biosynthetic pathway in the AM genome were obtained. Homologous sequences were aligned by MAFFT (v 7.505)45, and the alignment was then processed with TrimAL (v 1.4.1)46 to remove poorly aligned positions. Subsequently, the phylogenetic tree was generated using iqtree2 (v 2.0.6)47 with parameters of “-b 1000” and visualized using Evolview48 (Fig. 3a). Using the method described in Li et al.49, a total of 2,048 transcription factors (TFs) were identified (Fig. 3b). In brief, the plant TF domain profile (https://planttfdb.gao-lab.org/)50 was searched against the AM protein data using the hmmsearch tool implemented in HMMER (v 3.1b2) (http://hmmer.org/). Proteins exhibiting a TF domain match with an E-value of 1E-5 or lower were chosen.
Genomic variations between AM and AMM
By applying the analytical tool MCScan (Python version)51, we conducted an in-depth identification of homologous regions between the AM and AMM genomes, with a threshold set to include at least ten genes. Our research findings revealed a total of 22,160 pairs of orthologous genes shared between the two genomes, with the AM genome containing 21,727 pairs (accounting for 72.63% of the total), and the AMM genome containing 21,474 pairs (accounting for 77.06% of the total) (Fig. 2a). The amino acid sequence lengths of the orthologous gene pairs within these collinear regions showed a significant positive correlation (Fig. 2b), further confirming their homology. Additionally, we observed two potential chromosomal rearrangement events. Firstly, a chromosomal fusion event occurred in the AMM linkage group Chr7, which connected the two chromosomes from the AM genome, namely Chr7 and Chr8. Secondly, a chromosomal fusion also took place in the AM linkage group Chr9, involving the two chromosomes from the AMM genome, Chr8 and Chr9. The specific causes of these fusion events, their timing, and how they affect the traits of the organism are important issues that require further in-depth exploration in future research.
Single nucleotide polymorphisms (SNPs) and small insertions/deletions (InDels) were identified using a similar method as previously reported52. Briefly, genome alignment between the AM and AMM assemblies was performed with the NUCmer program of MUMmer4 (v4.0.0)53 using the parameter settings “–mum -g 1000 -c 90 -l 40”. The delta-filter program was used to obtain alignment blocks with the parameter setting “-1 -l 5000”. The show-snps program was used to detect SNPs and InDels with the settings “-Clr -x 1 -T”. Finally, a total of 4,902,056 SNPs and 903,918 InDels were identified (Fig. 2c,d). These variations serve as resources for further research.
Data Records
The DNA and RNA sequence reads of AM have been deposited in the Sequence Read Archive (SRA) with accession numbers SRP48693054 under project number PRJNA1067739. The genome assembly has been deposited at GenBank under the WGS accession GCA_039519185.155. Additionally, the genome assembly, along with files for gene structure annotation, repeat predictions and gene functional annotation, variation information including SNP and InDels between AM and AMM genomes were deposited in Figshare56.
Technical Validation
Genome assembly and gene prediction quality assessment
The quality and accuracy of the AM assembly were assessed through the following analyses. Firstly, the Hi-C interaction map showed a strong intrachromosomal interactive signal along the diagonal (Fig. 4). Secondly, the distribution of CG depth indicated that there was no apparent contamination in the assembled sequences (Fig. 5). Thirdly, the AM assembly presented an LTR assembly index of 16.22 and a BUSCO score of 97.27%, indicating its high completeness (Table 3). In addition, evaluation using Merqury showed a QV of 48.58, suggesting high accuracy at the base-pair level. Lastly, 99.56% of the DNA next-generation sequencing reads were mapped to the AM genome assembly, whereas an equally impressive 99.25% of the error-corrected PacBio data could also be mapped to the assembly. Notably, the genome coverage achieved from the error-corrected PacBio data reached 99.40%, and the depth of each window remained consistent without significant fluctuations (Fig. 6).
We compared the length distribution of genes among the AMM5, C. arietinum36, and G. max39, and found similar patterns (Fig. 7). Meanwhile, 85.39% of the RNA-seq data were aligned to the predicted exons and only 2.5% located in intergenic region (Fig. 8). The BUSCO analysis showed that 96.59% (single-copy gene: 88.97%, duplicated gene: 7.62%) of 1,614 embryophyta single-copy orthologs were successfully identified as complete, while 1.12% were fragmented and 2.29% were missing in the assembly (Table 6). The 29,828 (99.71%) gene models were successfully annotated in diverse databases, such as NR, SwissProt, KEGG, KOG, TrEMBL and Interpro (Table 7). Taken together, all these results provide strong evidence that a high-quality AM genome has been obtained.
Table 6
Database:embryophyta_odb10 | Number of BUSCOs | Percentage |
---|---|---|
Complete (C) | 1,559 | 96.59% |
Complete and single-copy (S) | 1,436 | 88.97% |
Complete and duplicated (D) | 123 | 7.62% |
Fragmented (F) | 18 | 1.12% |
Missing (M) | 37 | 2.29% |
Total BUSCO groups searched | 1,614 | 1 |
Table 7
Type | Gene number | Percentage | |
---|---|---|---|
Total | 29,914 | 100% | |
NR | 29,606 | 98.97% | |
SwissProt | 24,179 | 80.83% | |
KEGG | 24,103 | 80.57% | |
KOG | 23,581 | 78.83% | |
TrEMBL | 29,594 | 98.93% | |
Interpro | All | 29,607 | 98.97% |
GO | 19,069 | 63.75% | |
Annotated | 29,828 | 99.71% | |
Unannotated | 86 | 0.29% |
Acknowledgements
This work was supported by Project of the “Modernization Research of Traditional Chinese Medicine” Key Research and Development Program of the Ministry of Science and Technology (No. 2019YFC1710800), Project of the Shanxi Collaborative Innovation Center of Astragali Radix Resource Industrialization and Industrial Internationalization (No. HQXTCXZX2016-005 and No. HQXTCXZX2016-016) and Key Research and Development (R&D) project of Shanxi Province (No.201603D3111001).
Author contributions
H.J.F., Z.C., R.Z., C.G.M. and Q.S.L. conceived the study. H.J.F. collected and prepared the samples. Z.Y.W. and X.K.Y. performed bioinformatics analysis. Z.C. and H.J.F. wrote the manuscript with significant contributions from X.K.Y., A.K.L. and H.F.S. All authors read and approved the final manuscript.
Code availability
No specific code or script was used in this work. Commands used for data processing were all executed according to the manuals and protocols of the corresponding software.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally: Huijie Fan, Zhi Chai, Xukui Yang.
Contributor Information
Huijie Fan, Email: nc.ude.mctxs@eijiuhnaf.
Zhi Chai, Email: nc.ude.mctxs@ihziahc.
Qingshan Li, Email: moc.361@2102sqlxs.
Cungen Ma, Email: nc.ude.mctxs@negnucam.
Ran Zhou, Email: moc.uhos@85ruohz.
Supplementary information
The online version contains supplementary material available at 10.1038/s41597-024-03852-6.
References
Articles from Scientific Data are provided here courtesy of Nature Publishing Group
Citations & impact
This article has not been cited yet.
Impact metrics
Alternative metrics
Discover the attention surrounding your research
https://www.altmetric.com/details/169186708
Data
Data behind the article
This data has been text mined from the article, or deposited into data resources.
Data Citations
- (1 citation) DOI - 10.6084/m9.figshare.25100393.v3
BioProject
- (1 citation) BioProject - PRJNA1067739
GCA - NCBI genaome assembly (4)
- (1 citation) GCA - GCA_026016865.1
- (1 citation) GCA - GCA_949352195.3
- (1 citation) GCA - GCA_039519185.1
- (1 citation) GCA - GCA_000219495.2
Nucleotide Sequences (Showing 10 of 10)
- (1 citation) ENA - SRR27790542
- (1 citation) ENA - SRR27790541
- (1 citation) ENA - SRR27790544
- (1 citation) ENA - SRR27790543
- (1 citation) ENA - SRR27790540
- (1 citation) ENA - SRR27790539
- (1 citation) ENA - SRP486930
- (1 citation) ENA - SRR27790538
- (1 citation) ENA - SRR27790546
- (1 citation) ENA - SRR27790547
Show less
Similar Articles
To arrive at the top five similar articles we use a word-weighted algorithm to compare words from the Title and Abstract of each citation.
Chromosome level genome assembly of endangered medicinal plant Anisodus tanguticus.
Sci Data, 11(1):161, 02 Feb 2024
Cited by: 0 articles | PMID: 38307894 | PMCID: PMC10837431
Chromosome-level genome assembly and annotation of xerophyte secretohalophyte Reaumuria soongarica.
Sci Data, 11(1):812, 22 Jul 2024
Cited by: 1 article | PMID: 39039100 | PMCID: PMC11263558
Chromosome-scale genome assembly and annotation of Cotoneaster glaucophyllus.
Sci Data, 11(1):406, 22 Apr 2024
Cited by: 1 article | PMID: 38649372 | PMCID: PMC11035681
Chromosome-level genome assembly of watershield (Brasenia schreberi).
Sci Data, 10(1):467, 19 Jul 2023
Cited by: 3 articles | PMID: 37468511 | PMCID: PMC10356934