Chromosome-scale genome assembly of Astragalus membranaceus using PacBio and Hi-C technologies.

Fan H; Chai Z; Yang X; Liu A; Sun H; Wu Z; Li Q; Ma C; Zhou R

doi:10.1038/s41597-024-03852-6

Affiliations

1. College of Basic Medical Sciences, Shanxi University of Chinese Medicine, Jinzhong, 030619, China. Authors: Fan H¹, Chai Z¹, Ma C¹, Zhou R¹
2. BGI Genomics, Shenzhen, 518083, China. Authors: Yang X², Wu Z²
3. Department of Life Sciences, Changzhi University, Changzhi, 046011, China. Author: Liu A³
4. College of Chemistry and Chemical Engineering, Shanxi University, Taiyuan, 030006, China. Author: Sun H⁴
5. School of Chinese Materia Medica, Shanxi University of Chinese Medicine, Jinzhong, 030619, China. Author: Li Q⁵

Scientific Data, 02 Oct 2024, 11(1):1071
https://doi.org/10.1038/s41597-024-03852-6 PMID: 39358417 PMCID: PMC11446949

Abstract

Astragalus membranaceus (Fisch.) Bge (AM) is a medicinal herb belonging to the Leguminosae family. In this study, we present a chromosome-scale genome assembly of AM, aiming to enhance the molecular biology and functional studies of Astragali Radix. The genome size of AM is about 1.43 Gb, with a contig N50 value of 1.67 Mb. A total of 98.16% of the assembly was anchored to 9 pseudochromosomes using Hi-C technology. The assembly completeness was estimated to be 97.27% using BUSCO, with a long terminal repeat assembly index (LAI) of 16.22 and a quality value (QV) of 48.58. Additionally, the genome contained 67.98% repetitive sequences. Genome annotation predicted 29,914 protein-coding genes, including 73 genes involved in the flavonoid biosynthetic pathway and 2,048 transcription factors. The high-quality genome assembly and gene annotation resources will greatly facilitate future functional genomic studies in Leguminosae species.

Huijie Fan,#¹ Zhi Chai,#¹ Xukui Yang,#² Ake Liu,³ Haifeng Sun,⁴ Zhangyan Wu,² Qingshan Li,⁵,⁶ Cungen Ma,¹ and Ran Zhou¹

Associated Data

Data Citations

NCBI GenBank. https://identifiers.org/ncbi/insdc.gca:GCA_026016865.1 (2022).
NCBI GenBank. https://identifiers.org/ncbi/insdc.gca:GCA_000219495.2 (2014).
NCBI GenBank. https://identifiers.org/ncbi/insdc.gca:GCA_949352195.3 (2023).
National Genomics Data Center. https://ngdc.cncb.ac.cn/gwh/Assembly/66216/show (2023).
AraShare. https://www.arashare.cn//static/uploads/Col-PEK1.5_assembly_and_annotation.tar.gz (2023).
NCBI Sequence Read Archive. https://identifiers.org/ncbi/insdc.sra:SRP486930 (2024).
Fan, H. Astragalus membranaceus isolate JZ-2020, whole genome shotgun sequencing project. GenBank. https://identifiers.org/ncbi/insdc.gca:GCA_039519185.1 (2024).
Fan, H. Genome Assembly and Annotation of Astragalus membranaceus (Fisch.) Bge (AM). figshare. Dataset. 10.6084/m9.figshare.25100393.v3 (2024).

Supplementary Materials

Supplementary Fig. 2: 41597_2024_3852_MOESM1_ESM.pdf (651K)
Supplementary Fig. 1: 41597_2024_3852_MOESM2_ESM.pdf (220K)

Subject terms: Plant genetics, Genomics

Background & Summary

Astragalus membranaceus (Fisch.) Bge (AM) is a widely used medicinal plant worldwide¹. Its dried roots, known as Astragali Radix, possess hepatoprotective, diuretic, tonic and expectorant activities and are used in Chinese medicine for their anti-aging, anti-tumor and anti-neurodegeneration effects and for regulating blood glucose and immunity². Flavonoids are among the main active compounds in AM. They have diverse biological activities and play numerous roles in the interaction between plants and the environment, such as resisting diseases and insect pests, protecting against ultraviolet damage, and attracting pollinating insects³. Recently, the genome of Astragalus mongholicus (AMM), another authorized plant source of Astragali Radix, has been reported⁴,⁵. It is widely believed that AM and AMM differ markedly in morphology and function, and that AMM is the more heterozygous of the two. In a previous metabolomics study⁶, a total of 53 chemical markers were identified for the discrimination of AMM and AM. Among them, the contents of 36 components, including 14 flavonoids, were significantly higher in AM than in AMM. AM may therefore have stronger pharmacological activities than AMM.

To further understand the underlying molecular mechanism of flavonoid biosynthesis, we generated a chromosome-level genome assembly of AM (2n=18) by combining PacBio long reads with Hi-C scaffolding (Fig. 1). The assembled AM genome had a total length of 1.43Gb, with a contig N50 of 1.67Mb and a complete BUSCO score of 97.27%. A total of 1.40Gb (98.16%) of the sequences was anchored to the 9 pseudochromosomes (Fig. 2). Genome annotation predicted 29,914 protein-coding genes and 972.44Mb (67.98%) of repetitive sequences. Moreover, 73 genes associated with the flavonoid biosynthetic pathway (Fig. 3) and 2,048 transcription factors (TFs) were identified. The chromosome-scale genome of AM provides a genetic basis for exploring key genes and molecular regulatory mechanisms involved in the biosynthesis of important compounds, while also serving as a valuable resource for comparative genomic analysis between AM and AMM.

Fig. 1

Circos plot of the AM genome. The plot includes the following components, arranged from inside to outside: (I) Collinear regions within the AM assembly; (II) GC content in non-overlapping 1-Mb windows; (III) Percentage of repeats in 1-Mb sliding windows; (IV) Gene density in 1-Mb sliding windows; (V) Length of each pseudochromosome in megabases (Mb).

Fig. 2

Comparative genomic analysis between AM and AMM. (a) The syntenic regions. The analysis reveals intricate relationships between AM and AMM in their genomes. (b) AM protein length plotted against the orthologous protein length for AMM. (c) The density plot of SNPs between AM assembly and AMM assembly. (d) The density plot of Indels between AM assembly and AMM assembly.

Fig. 3

The genes involved in the biosynthesis of flavonoids and the TFs in the AM genome. (a) The phylogenetic tree of genes involved in the flavonoid biosynthetic pathway. Genes with IDs highlighted in gold originate from AM, those highlighted in blue from AMM, and those in red from M. truncatula. (b) The distribution of TF families in the AM genome. Only TF families containing 10 or more genes are shown.

Methods

Plant materials and sequencing

The plant material used for de novo genome assembly was a seven-year-old AM plant grown in Jinzhong, China. Vigorously growing leaves were collected, immediately snap-frozen in liquid nitrogen, and stored at −80°C in the laboratory until DNA extraction. Genomic DNA was extracted using the DNeasy Plant Maxi kit (Qiagen, Germany). A short-fragment library with an insert size of 350bp was prepared and sequenced on the BGISEQ platform, resulting in 150bp paired-end reads. Two libraries with an insert size of approximately 20kb were prepared following the manufacturer’s instructions from Pacific Biosciences. These libraries were sequenced on the PacBio Sequel platform to generate continuous long reads. For chromosome conformation capture (Hi-C) sequencing, libraries generated using the DpnII restriction enzyme were prepared according to previously described methods⁷ and subsequently sequenced on the BGISEQ platform. RNA-seq libraries from root, leaf, and stem tissues during the fruit growth period were constructed using the NEBNext® Ultra™ RNA Library Prep Kit for Illumina® (NEB, Ipswich, MA, USA) following the manufacturer’s protocol⁸. The cDNA libraries were then sequenced on a BGISEQ instrument, yielding 150bp paired-end reads.

In summary, 156.2Gb of paired-end next-generation sequencing reads (~109.2X), 280.8Gb of PacBio subreads from two cells (~196.4X; the N50 length of subreads was larger than 22kb), and 285.6Gb of Hi-C data (~199.7X) were obtained (Table 1).

Table 1

Summary of sequencing data of AM genome.

Data type	Cell	Reads number	Total length (Gb)	Genome depth (X)	N50 length of reads (bp)
NGS raw reads	-	1,041,321,120	156.2	109.2	150
NGS clean reads	-	1,032,333,412	154.6	108.1	150
PacBio subreads	cell 1	9,476,555	143.6	100.4	22,132
PacBio subreads	cell 2	9,504,443	137.2	95.9	22,078
Hi-C raw data	-	1,842,342,896	285.6	199.7	150
Hi-C clean data	-	1,801,451,914	271.2	189.7	150
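
The genome depth values in Table 1 appear to follow from dividing each total length by the genome size. The following is a minimal sketch of that arithmetic, assuming the 1.43 Gb estimate is the denominator; the function name `fold_coverage` is illustrative and does not come from the original work.

```python
# Fold coverage (sequencing depth) = total sequenced bases / genome size.
# Assumption: the 1.43 Gb genome size reported in the text is the denominator.
GENOME_SIZE_GB = 1.43


def fold_coverage(total_gb: float, genome_gb: float = GENOME_SIZE_GB) -> float:
    """Return the fold coverage, rounded to one decimal place."""
    return round(total_gb / genome_gb, 1)


# Total lengths (Gb) copied from Table 1.
table1_total_gb = {
    "NGS raw reads": 156.2,
    "NGS clean reads": 154.6,
    "PacBio subreads, cell 1": 143.6,
    "PacBio subreads, cell 2": 137.2,
    "Hi-C raw data": 285.6,
    "Hi-C clean data": 271.2,
}

for name, total in table1_total_gb.items():
    print(f"{name}: {fold_coverage(total)}X")
```

Under this assumption the computed depths agree with the table, and the two PacBio cells together (280.8 Gb) correspond to roughly 196X.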

Genome survey

K-mer frequency analysis is a common genome survey technique. A K-mer is a sequence of K nucleotides extracted from sequencing data; a read of length L yields L-K+1 K-mers. The 17-mer is a common choice for genome size estimation because it covers a vast number of combinations (4^17), and it has been applied to various species such as willow (338.93Mb)⁹, Dalbergia odorifera (653.45Mb)¹⁰, camel (2.01–2.05Gb)¹¹, and gecko (2.55Gb)¹². Here, we counted 17-bp K-mers using Jellyfish (v 2.2.10)¹³ and estimated genome characteristics with GenomeScope (v2.0)¹⁴. The estimated genome size was 1.43Gb with a heterozygosity rate of 1.01% (Table 2). This estimate is close to the genome size of 1.52Gb obtained by flow cytometry¹⁵.

Table 2

Estimation of genome and repeat fragment size, and heterozygosity of AM.

K	K-mer Number	Genome Size (Gb)	Repeat (%)	Heterozygous Ratio (%)	Used Bases (bp)	Depth (X)
17	137,801,908,923	1.43	82.1	1.01	154,569,325,184	108.09
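
The relation between the values in Table 2 can be illustrated with a short calculation. This is a simplified sketch, not the GenomeScope model itself: it assumes 150 bp reads, converts the base depth to a K-mer depth by the factor (L-K+1)/L, and divides the K-mer number by that depth. All function names are illustrative.

```python
# Simplified K-mer based genome size estimate (not the GenomeScope model).
# Assumptions: uniform 150 bp reads; K-mer depth = base depth * (L - K + 1) / L.

def kmers_per_read(read_len: int, k: int) -> int:
    """Number of K-mers obtained from one read of length read_len."""
    return read_len - k + 1


def kmer_depth(base_depth: float, read_len: int, k: int) -> float:
    """Convert base-level depth into the expected K-mer depth."""
    return base_depth * kmers_per_read(read_len, k) / read_len


def genome_size(total_kmers: int, depth: float) -> float:
    """Estimated genome size in bases: total K-mers / K-mer depth."""
    return total_kmers / depth


K = 17
READ_LEN = 150                      # paired-end read length from the text
TOTAL_KMERS = 137_801_908_923       # Table 2
BASE_DEPTH = 108.09                 # Table 2

print("possible 17-mers:", 4 ** K)
size = genome_size(TOTAL_KMERS, kmer_depth(BASE_DEPTH, READ_LEN, K))
print(f"estimated genome size: {size / 1e9:.2f} Gb")
```

With the Table 2 inputs, this approximation gives a value close to the reported 1.43 Gb.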

Genome assembly

Canu¹⁶, FALCON¹⁷, and MECAT2¹⁸ are widely used assemblers for PacBio CLR data. Nie et al.¹⁹ demonstrated the high accuracy of these software packages in genome assembly. Notably, FALCON, endorsed by PacBio, has played a pivotal role in numerous high-quality plant genome projects. For instance, FALCON was used in the barley genome²⁰, the maize Mo17 project²¹, the Asian rice genome research²², and the coffee genome study²³, showing its effectiveness for genome assembly. Here, the contigs of the AM genome were assembled using the FALCON (v2.0.5) assembler with the following parameters: -v -B48 -D250 -M24 -h600 -e.75 -l3000 -s1000 -k18 -w6 -T8 --output_multi --min_idt 0.75 --min_cov 4 --max_n_read 200 --n_core 8. After the FALCON assembly, the genome was polished with the command-line tools of SMRT Link (v4.0.0) following the Reference Guide (https://programs.pacificbiosciences.com/l/1652/2017-02-01/3rzxn6/184345/SMRT_Tools_Reference_Guide__v4.0.0_.pdf). To enhance the contiguity of the genome and reduce errors, NGS short reads were used to polish the assembly with Pilon (v1.22)²⁴. Finally, TrimDup, a component of the Rabbit Genome Assembler (https://github.com/gigascience/rabbit-genome-assembler), was applied to eliminate redundant sequences using a percentage of 0.3.

To anchor contigs onto pseudochromosomes, we used BWA (v 0.7.12)²⁵ to align the Hi-C clean data to the assembled contigs. Low-quality reads were filtered out using the HiC-Pro pipeline²⁶ with default parameters. The remaining valid reads were used to anchor contigs to chromosomes with Juicer²⁷ and the 3D-DNA pipeline²⁸. Finally, the chromosome assemblies were divided into bins of 500kb, and the interaction signals generated by the valid mapped read pairs between each pair of bins were visualized in a heat map.
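
The binning step behind the heat map can be sketched as follows. This is an illustrative toy example rather than the Juicer or 3D-DNA implementation; the read-pair tuples, chromosome names and function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple

BIN_SIZE = 500_000  # 500 kb bins, as stated in the text

Locus = Tuple[str, int]          # (chromosome, 0-based position)
Bin = Tuple[str, int]            # (chromosome, bin index)


def to_bin(locus: Locus, bin_size: int = BIN_SIZE) -> Bin:
    """Map a genomic position to its fixed-size bin."""
    chrom, pos = locus
    return chrom, pos // bin_size


def contact_counts(pairs: Iterable[Tuple[Locus, Locus]],
                   bin_size: int = BIN_SIZE) -> Counter:
    """Count valid read pairs per unordered pair of bins."""
    counts: Counter = Counter()
    for left, right in pairs:
        key = tuple(sorted((to_bin(left, bin_size), to_bin(right, bin_size))))
        counts[key] += 1
    return counts


# Hypothetical valid read pairs (both mates mapped).
toy_pairs = [
    (("Chr1", 120_000), ("Chr1", 480_000)),     # same bin
    (("Chr1", 510_000), ("Chr1", 100_000)),     # adjacent bins
    (("Chr1", 20_000), ("Chr1", 900_000)),      # adjacent bins
    (("Chr2", 1_600_000), ("Chr1", 50_000)),    # inter-chromosomal
]

for bins, n in sorted(contact_counts(toy_pairs).items()):
    print(bins, n)
```

Each count corresponds to one cell of the symmetric contact matrix; a correct chromosome-scale assembly is expected to concentrate counts near the diagonal.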

A genome assembly spanning 1.43Gb was generated (Fig. 1; Table 3), which is close to both the genome size of AMM (1.43Gb vs 1.47Gb) and the estimated genome size. The contig N50 value of the AM genome was 1.67Mb, which is comparable to that of the recently published genome of the closely related legume Astragalus sinicus²⁹ (1.67Mb vs 1.5Mb). Approximately 1.40Gb (98.16%) of the sequences were successfully anchored to the 9 pseudochromosomes (Table 3).

Table 3

Characteristics of the genome assembly in Astragalus membranaceus.

Item	Astragalus membranaceus (AM)
Size of assembly (Gb)	1.43
Contig N50 (Mb)	1.67
Chromosome number	9
Anchored pseudo-chromosomes	98.16%
GC content	38.40%
Genome complete BUSCOs	97.27%
Long terminal repeat assembly index	16.22
Quality value	48.58
Repetitive sequences	67.98%
Number of protein-coding genes	29,914
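
For readers unfamiliar with the contig N50 statistic reported in Table 3, the following sketch shows the standard definition: the length of the shortest contig in the set of longest contigs that together cover at least half of the assembly. The contig lengths used here are toy values, not data from this study.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    if total == 0:
        raise ValueError("no contig lengths supplied")
    running = 0
    for length in ordered:
        running += length
        # Stop at the first contig where the cumulative length
        # reaches half of the total assembly size.
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable")


# Toy contig lengths (arbitrary units).
toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print("N50 =", n50(toy_contigs))
```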

Annotation of repetitive sequences

Tandem repeats and interspersed repeats were identified using the method described in Qu et al.³⁰. Approximately 67.98% of the assembled genome was classified as repetitive sequences, with interspersed repeats accounting for 65.99% of the genome (Table 4). Among the repetitive sequences, the most prevalent elements were long terminal repeats (LTRs), which accounted for 60.66% of the genome size.

Table 4

Statistics of interspersed repeats in AM assembly.

Type	Repbase TEs length (bp)	Repbase TEs (% in genome)	TE proteins length (bp)	TE proteins (% in genome)	De novo length (bp)	De novo (% in genome)	*Combined TEs length (bp)	*Combined TEs (% in genome)
DNA	31,435,958	2.20	16,519,868	1.15	50,391,304	3.52	72,929,673	5.10
LINE	10,593,042	0.74	8,014,844	0.56	9,810,845	0.69	22,455,984	1.57
SINE	129,143	0.01	0	0.00	208,537	0.01	336,632	0.02
LTR	233,807,308	16.34	232,969,052	16.29	840,527,902	58.76	867,737,646	60.66
Other	3,875	0.00	0	0.00	15,661	0.00	19,536	0.00
†Unclassified	0	0.00	0	0.00	2,177,624	0.15	2,177,624	0.15
Total	274,854,677	19.21	257,495,852	18.00	897,005,122	62.71	943,956,995	65.99

Note: This table does not include tandem repeats; some elements may partly include the domain of another element.

*Combined: the non-redundant consensus of all repeat prediction/classification methods employed.

†Unclassified: the predicted repeats that cannot be classified by RepeatMasker.

LINE, long interspersed nuclear elements; SINE, short interspersed nuclear elements; LTR, long terminal repeat.

Protein-coding genes prediction and functional annotation

Protein-coding genes were annotated using a method similar to that described in Fang et al.³¹. To facilitate genome annotation of the AM assembly, RNA sequencing of root, stem, and leaf samples was conducted and resulted in a total of 72.18Gb clean reads (Table 5). For transcriptome-based prediction, RNA-seq clean reads were assembled using Trinity (v 2.15.1)³² with the following parameters: ‘--max_memory 200G --CPU 40 --min_contig_length 200 --genome_guided_bam merged_sorted.bam --full_cleanup --min_kmer_cov 3 --min_glue 3 --bfly_opts “-V 5 --edge-thr=0.1 --stderr” --genome_guided_max_intron 10000 --genome_guided_min_coverage 2’. This generated 245,216 transcripts with an N50 of 1,997bp. The assembled transcripts were aligned to the AM assembly using the Program to Assemble Spliced Alignments (PASA) (v 2.4.1)³³, and gene structures were generated from valid transcript alignments. Additionally, RNA-seq clean reads were mapped to the AM assembly using Hisat2 (v 2.0.1)³⁴. Stringtie (v 1.2.2)³⁵ and TransDecoder (v 5.7.1) (https://github.com/TransDecoder/TransDecoder) were employed to assemble the transcripts and to identify candidate coding regions for gene models. For the homology-based method, homologous genomes and gene sets, including A. membranaceus var. mongholicus (AMM)⁵, Cicer arietinum (GenBank accession: GCA_026016865.1)³⁶, Medicago truncatula (GenBank accession: GCA_000219495.2)³⁷, Trifolium pratense (GenBank accession: GCA_949352195.3)³⁸, Glycine max (ZH13-T2T)³⁹, and Arabidopsis thaliana (Col-PEK1.5)⁴⁰, were downloaded and used as queries to search against the AM assembly with GeMoMa (v 1.9)⁴¹. Genes with a coding sequence (CDS) length of less than 150bp were filtered out, along with single-exon genes lacking annotated protein domains. Additionally, genes that were not anchored to chromosome sequences and lacked annotated protein domains were excluded.

Finally, the generated gene models were refined with PASA (v 2.4.1) to obtain untranslated regions and information on alternative splicing, using Trinity-assembled transcripts and isoforms from full-length transcriptomes of leaf and root tissues⁴². Following the method described in Bi et al.⁴³, the integrated gene set was translated into amino-acid sequences and annotated. As a result, 29,828 genes (99.71% of the total) were successfully annotated.
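
The three filtering rules described above can be expressed as a simple predicate. This is a minimal sketch under stated assumptions: the `GeneModel` record and its field names are hypothetical, and the actual filtering in the study may have been carried out with other tools.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GeneModel:
    """Hypothetical summary of one predicted gene (field names are illustrative)."""
    gene_id: str
    cds_length: int      # CDS length in bp
    exon_count: int
    has_domain: bool     # at least one annotated protein domain
    on_chromosome: bool  # anchored to one of the pseudochromosomes


def keep_gene(gene: GeneModel, min_cds: int = 150) -> bool:
    """Apply the three exclusion rules stated in the text."""
    if gene.cds_length < min_cds:
        return False                      # rule 1: short CDS
    if gene.exon_count == 1 and not gene.has_domain:
        return False                      # rule 2: single exon, no domain
    if not gene.on_chromosome and not gene.has_domain:
        return False                      # rule 3: unanchored, no domain
    return True


def filter_genes(genes: List[GeneModel]) -> List[GeneModel]:
    return [g for g in genes if keep_gene(g)]


toy_genes = [
    GeneModel("g1", 1200, 5, True, True),    # kept
    GeneModel("g2", 120, 2, True, True),     # CDS too short
    GeneModel("g3", 900, 1, False, True),    # single exon without domain
    GeneModel("g4", 900, 3, False, False),   # unanchored without domain
    GeneModel("g5", 900, 1, True, False),    # kept: domain support
]
print([g.gene_id for g in filter_genes(toy_genes)])
```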

Table 5

Summary of RNAseq sequencing data of AM genome.

Sample	Total CleanReads	Clean data	Clean Q20% (fq1;fq2)	GC_rate (%)	Uniquely mapped reads	Total MappingRatio	Uniquely MappingRatio	SRA accession
leaf_1	42,134,094	6,320,114,100	95.71;94.67	44.04	28,298,792	71.76%	67.16%	SRR27790544
leaf_2	42,618,542	6,392,781,300	95.65;94.67	43.83	31,733,374	79.87%	74.46%	SRR27790543
leaf_3	42,362,198	6,354,329,700	95.68;94.83	43.19	33,032,872	83.56%	77.98%	SRR27790542
root_1	42,365,348	6,354,802,200	95.62;94.51	42.61	36,156,538	91.68%	85.34%	SRR27790541
root_2	42,176,852	6,326,527,800	95.61;94.73	42.59	35,176,690	90.26%	83.40%	SRR27790540
root_3	42,326,320	6,348,948,000	95.15;94.31	42.89	34,986,928	88.98%	82.66%	SRR27790539
stem_1	42,113,002	6,316,950,300	95.63;94.54	42.57	36,319,666	91.70%	86.24%	SRR27790538
stem_2	42,255,416	6,338,312,400	95.66;94.71	42.57	35,921,772	90.41%	85.01%	SRR27790547
stem_3	42,317,456	6,347,618,400	95.66;94.26	43.19	34,638,542	87.17%	81.85%	SRR27790546

Overall, we predicted 29,914 protein-coding genes, with average lengths of 4,752bp for genes, 622bp for introns, and 1,306bp for coding sequences. We downloaded the genes related to the flavonoid biosynthetic pathway in the AMM genome and identified the corresponding genes in the AM genome using OrthoFinder⁴⁴, an accurate and comprehensive tool for identifying and comparing homologous genes among species. As a result, 73 genes associated with the flavonoid biosynthetic pathway in the AM genome were obtained. Homologous sequences were aligned with MAFFT (v 7.505)⁴⁵, and the alignment was then processed with TrimAL (v 1.4.1)⁴⁶ to remove poorly aligned positions. Subsequently, the phylogenetic tree was generated using iqtree2 (v 2.0.6)⁴⁷ with the parameter “-b 1000” and visualized using Evolview⁴⁸ (Fig. 3a). Using the method described in Li et al.⁴⁹, a total of 2,048 transcription factors (TFs) were identified (Fig. 3b). In brief, the plant TF domain profiles (https://planttfdb.gao-lab.org/)⁵⁰ were searched against the AM protein data using the hmmsearch tool implemented in HMMER (v 3.1b2) (http://hmmer.org/). Proteins with a TF domain match at an E-value of 1E-5 or lower were retained.
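
The E-value filter on the hmmsearch results can be sketched as below. The sketch assumes that hmmsearch was run with the `--tblout` option, whose whitespace-delimited rows list the target (protein) name, target accession, query (profile) name, query accession and full-sequence E-value in the first five columns; the protein and profile names in the example are hypothetical.

```python
from typing import Dict, Iterable, Set


def tf_hits(tblout_lines: Iterable[str],
            max_evalue: float = 1e-5) -> Dict[str, Set[str]]:
    """Map each protein to the TF domain profiles it matches at or below max_evalue."""
    hits: Dict[str, Set[str]] = {}
    for line in tblout_lines:
        if not line.strip() or line.startswith("#"):
            continue                       # skip comments and blank lines
        fields = line.split()
        protein, profile = fields[0], fields[2]
        evalue = float(fields[4])          # full-sequence E-value
        if evalue <= max_evalue:
            hits.setdefault(protein, set()).add(profile)
    return hits


# Hypothetical rows in --tblout layout (trailing columns shortened).
example = [
    "# target name  accession  query name  accession  E-value  score  bias",
    "AM_g0001 - MYB_DNA-binding - 1.2e-20 70.1 0.1",
    "AM_g0002 - WRKY - 3.0e-02 10.5 0.0",
    "AM_g0003 - bHLH - 1e-05 25.0 0.0",
]
for protein, profiles in tf_hits(example).items():
    print(protein, sorted(profiles))
```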

Genomic variations between AM and AMM

Using MCScan (Python version)⁵¹, we identified homologous regions between the AM and AMM genomes, requiring at least ten genes per collinear block. A total of 22,160 pairs of orthologous genes were shared between the two genomes, involving 21,727 genes in the AM genome (72.63% of its total) and 21,474 genes in the AMM genome (77.06% of its total) (Fig. 2a). The amino acid sequence lengths of the orthologous gene pairs within these collinear regions showed a significant positive correlation (Fig. 2b), further supporting their homology. Additionally, we observed two potential chromosomal rearrangement events. First, AMM linkage group Chr7 corresponds to a fusion of two chromosomes of the AM genome, Chr7 and Chr8. Second, AM linkage group Chr9 corresponds to a fusion of two chromosomes of the AMM genome, Chr8 and Chr9. The causes and timing of these fusion events, and their effects on the traits of the organism, require further exploration in future research.

Single nucleotide polymorphisms (SNPs) and small insertions/deletions (InDels) were identified using a method similar to that previously reported⁵². Briefly, genome alignment between the AM and AMM assemblies was performed with the NUCmer program of MUMmer4 (v4.0.0)⁵³ using the parameter settings “--mum -g 1000 -c 90 -l 40”. The delta-filter program was used to obtain alignment blocks with the parameter setting “-1 -l 5000”. The show-snps program was used to detect SNPs and InDels with the settings “-Clr -x 1 -T”. In total, 4,902,056 SNPs and 903,918 InDels were identified (Fig. 2c,d). These variations serve as resources for further research.
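
Separating SNPs from InDels in the tab-delimited show-snps output can be sketched as follows. The sketch assumes the usual layout in which the first three columns are the reference position, the reference base and the query base, with "." marking the absent base of an insertion or deletion. It counts rows only; how adjacent single-base InDel rows were merged into events in the original analysis is not stated, so that step is not reproduced here.

```python
from typing import Iterable, Tuple


def count_variants(rows: Iterable[str]) -> Tuple[int, int]:
    """Return (snp_rows, indel_rows) from tab-delimited show-snps output."""
    snps = indels = 0
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        # Header lines do not start with an integer position; skip them.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        ref_base, query_base = fields[1], fields[2]
        if ref_base == "." or query_base == ".":
            indels += 1        # one side is missing: insertion or deletion
        else:
            snps += 1          # both sides present but different
    return snps, indels


# Hypothetical rows: position, ref base, query base, query position.
example_rows = [
    "[P1]\t[SUB]\t[SUB]\t[P2]",
    "1042\tA\tG\t1040",
    "2310\t.\tT\t2309",
    "2311\t.\tC\t2310",
    "5120\tC\t.\t5121",
    "7001\tT\tC\t7003",
]
print(count_variants(example_rows))
```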

Data Records

The DNA and RNA sequence reads of AM have been deposited in the Sequence Read Archive (SRA) with accession number SRP486930⁵⁴ under project number PRJNA1067739. The genome assembly has been deposited at GenBank under the WGS accession GCA_039519185.1⁵⁵. Additionally, the genome assembly, together with files for gene structure annotation, repeat predictions, gene functional annotation, and variation information (SNPs and InDels between the AM and AMM genomes), was deposited in Figshare⁵⁶.

Technical Validation

Genome assembly and gene prediction quality assessment

The quality and accuracy of the AM assembly were assessed through the following analyses. Firstly, the Hi-C interaction map showed a strong intrachromosomal interaction signal along the diagonal (Fig. 4). Secondly, the distribution of GC content against sequencing depth indicated that there was no apparent contamination in the assembled sequences (Fig. 5). Thirdly, the AM assembly presented an LTR assembly index of 16.22 and a BUSCO score of 97.27%, indicating its high completeness (Table 3). In addition, evaluation using Merqury showed a QV of 48.58, suggesting high accuracy at the base-pair level. Lastly, 99.56% of the DNA next-generation sequencing reads and 99.25% of the error-corrected PacBio data were mapped to the AM genome assembly. Notably, the genome coverage achieved by the error-corrected PacBio data reached 99.40%, and the depth of each window remained consistent without significant fluctuations (Fig. 6).
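
To interpret the reported QV, it may help to convert it into a per-base error probability. The sketch below assumes the Phred-scaled definition QV = -10 log10(E) and is only an illustration of the arithmetic; the function names are not from the original work.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Per-base error probability implied by a Phred-scaled quality value."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Phred-scaled quality value for a given per-base error probability."""
    return -10 * math.log10(error_rate)


error = qv_to_error_rate(48.58)
print(f"error rate: {error:.3e}")
print(f"about one error per {1 / error:,.0f} bases")
```

Under this definition, a QV of 48.58 corresponds to an error probability on the order of 1.4 × 10^-5 per base.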

Fig. 4

Hi-C interaction heat map of the assembled chromosomes. The abscissa and ordinate represent the order of each bin on the corresponding chromosome group. The color scale indicates the intensity of interaction from yellow (low) to red (high).

Fig. 5

The relationship between GC content and sequencing depth based on the alignment of PacBio data.

Fig. 6

Depth of error-corrected PacBio long reads mapped across the 9 chromosomes of the AM genome.

We compared the length distributions of gene elements among AM, AMM⁵, C. arietinum³⁶, and G. max³⁹, and found similar patterns (Fig. 7). Meanwhile, 85.39% of the RNA-seq data were aligned to the predicted exons and only 2.5% were located in intergenic regions (Fig. 8). The BUSCO analysis of the predicted gene set showed that 96.59% (single-copy: 88.97%, duplicated: 7.62%) of the 1,614 embryophyta single-copy orthologs were identified as complete, while 1.12% were fragmented and 2.29% were missing (Table 6). A total of 29,828 gene models (99.71%) were successfully annotated in diverse databases, such as NR, SwissProt, KEGG, KOG, TrEMBL and Interpro (Table 7). Taken together, these results provide strong evidence that a high-quality AM genome has been obtained.

Fig. 7

The composition of gene elements in the AM genome compared to the genomes of other species.

Fig. 8

RNA-seq clean data verified the accuracy of protein-coding gene prediction.

Table 6

Integrity assessment of predicted coding genes in AM assembly.

Database:embryophyta_odb10	Number of BUSCOs	Percentage
Complete (C)	1,559	96.59%
Complete and single-copy (S)	1,436	88.97%
Complete and duplicated (D)	123	7.62%
Fragmented (F)	18	1.12%
Missing (M)	37	2.29%
Total BUSCO groups searched	1,614	100%
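
The percentages in Table 6 follow directly from the BUSCO counts. A short consistency check, using only the counts listed in the table (the dictionary keys are illustrative labels):

```python
# BUSCO counts copied from Table 6 (embryophyta_odb10).
TOTAL = 1614
counts = {
    "complete": 1559,
    "single_copy": 1436,
    "duplicated": 123,
    "fragmented": 18,
    "missing": 37,
}


def percent(count: int, total: int = TOTAL) -> float:
    """Percentage of BUSCO groups, rounded to two decimals."""
    return round(100 * count / total, 2)


# Complete = single-copy + duplicated; all categories must sum to the total.
assert counts["single_copy"] + counts["duplicated"] == counts["complete"]
assert counts["complete"] + counts["fragmented"] + counts["missing"] == TOTAL

for name, value in counts.items():
    print(f"{name}: {value} ({percent(value)}%)")
```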

Table 7

Number of functional annotations for predicted genes in AM assembly.

Type		Gene number	Percentage
Total		29,914	100%
NR		29,606	98.97%
SwissProt		24,179	80.83%
KEGG		24,103	80.57%
KOG		23,581	78.83%
TrEMBL		29,594	98.93%
Interpro	All	29,607	98.97%
Interpro	GO	19,069	63.75%
Annotated		29,828	99.71%
Unannotated		86	0.29%

Acknowledgements

This work was supported by the Project of the “Modernization Research of Traditional Chinese Medicine” Key Research and Development Program of the Ministry of Science and Technology (No. 2019YFC1710800), the Project of the Shanxi Collaborative Innovation Center of Astragali Radix Resource Industrialization and Industrial Internationalization (No. HQXTCXZX2016-005 and No. HQXTCXZX2016-016) and the Key Research and Development (R&D) project of Shanxi Province (No. 201603D3111001).

Author contributions

H.J.F., Z.C., R.Z., C.G.M. and Q.S.L. conceived the study. H.J.F. collected and prepared the samples. Z.Y.W. and X.K.Y. performed bioinformatics analysis. Z.C. and H.J.F. wrote the manuscript with significant contributions from X.K.Y., A.K.L. and H.F.S. All authors read and approved the final manuscript.

Code availability

No specific code or script was used in this work. Commands used for data processing were all executed according to the manuals and protocols of the corresponding software.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

These authors contributed equally: Huijie Fan, Zhi Chai, Xukui Yang.

Contributor Information

Huijie Fan, Email: fanhuijie@sxtcm.edu.cn.

Zhi Chai, Email: chaizhi@sxtcm.edu.cn.

Qingshan Li, Email: sxlqs2012@163.com.

Cungen Ma, Email: macungen@sxtcm.edu.cn.

Ran Zhou, Email: zhour58@sohu.com.

Supplementary information

The online version contains supplementary material available at 10.1038/s41597-024-03852-6.

References

1. Fu, J. et al. Review of the botanical characteristics, phytochemistry, and pharmacology of Astragalus membranaceus (Huangqi). Phytotherapy research: PTR28, 1275–1283 (2014). [Abstract] [Google Scholar]

2. Zheng, Y. et al. A Review of the Pharmacological Action of Astragalus Polysaccharide. Frontiers in pharmacology.11, 349 (2020). [Europe PMC free article] [Abstract] [Google Scholar]

3. Chen, J. et al. Global transcriptome analysis profiles metabolic pathways in traditional herb Astragalus membranaceus Bge. var. mongolicus (Bge.) Hsiao. BMC genomics16, 1–20 (2015). [Europe PMC free article] [Abstract] [Google Scholar]

4. Chen, Y. et al. A reference-grade genome assembly for Astragalus mongholicus and insights into the biosynthesis and high accumulation of triterpenoids and flavonoids in its roots. Plant Communications 4 (2022). [Europe PMC free article] [Abstract]

5. Global Pharmacopoeia Genome Databasehttp://www.gpgenome.com/species/109 (2022).

6. Wang, Y. et al. Chemical Discrimination of Astragalus mongholicus and Astragalus membranaceus Based on Metabolomics Using UHPLC-ESI-Q-TOF-MS/MS Approach. Molecules (Basel, Switzerland)24, E4064 (2019). [Europe PMC free article] [Abstract] [Google Scholar]

7. Belton, J.-M. et al. Hi-C: a comprehensive technique to capture the conformation of genomes. Methods58(3), 268–76 (2012). [Europe PMC free article] [Abstract] [Google Scholar]

8. Bian, X. et al. Regulatory role of non-coding RNA in ginseng rusty root symptom tissue. Scientific reports11, 9211 (2021). [Europe PMC free article] [Abstract] [Google Scholar]

9. He, X. et al. The whole-genome assembly of an endangered Salicaceae species: Chosenia arbutifolia (Pall.) A. Skv. GigaScience 11 (2022). [Europe PMC free article] [Abstract]

10. Hong, Z. et al. The chromosome-level draft genome of Dalbergia odorifera. Gigascience 9.8 (2020). [Europe PMC free article] [Abstract]

11. Wu, H. et al. Camelid genomes reveal evolution and adaptation to desert environments. Nature communications 5.1 (2014). [Abstract]

12. Liu, Y. et al. Gekko japonicus genome reveals evolution of adhesive toe pads and tail regeneration. Nature communications 6.1 (2015). [Europe PMC free article] [Abstract]

13. Marçais, G. & Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics27, 764–770 (2011). [Europe PMC free article] [Abstract] [Google Scholar]

14. Ranallo-Benavidez, T. R., Jaron, K. S. & Schatz, M. C. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nature communications11, 1432 (2020). [Europe PMC free article] [Abstract] [Google Scholar]

15. Fan, H. J. et al. Study of Genome Size of Medicinal Plant Astragali Radix. Chinese Journal of Basic Medicine In Traditional, 25(09), 1299–1302. (in Chinese with English abstract) (2019).

16. Koren, S. et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome research27.5, 722–736 (2017). [Europe PMC free article] [Abstract] [Google Scholar]

17. Pendleton, M. et al. Assembly and diploid architecture of an individual human genome via single-molecule technologies. Nature Methods12, 780–786 (2015). [Europe PMC free article] [Abstract] [Google Scholar]

18. Xiao, C. L. et al. MECAT: fast mapping, error correction, and de novo assembly for single-molecule sequencing reads. Nature Methods14.11, 1072–1074 (2017). [Abstract] [Google Scholar]

19. Nie, F. et al. De novo diploid genome assembly using long noisy reads. Nature Communications15(1), 2964 (2024). [Europe PMC free article] [Abstract] [Google Scholar]

20. Zeng, X. et al. An improved high-quality genome assembly and annotation of Tibetan hulless barley. Scientific Data7(1), 139 (2020). [Europe PMC free article] [Abstract] [Google Scholar]

21. Wang, B. et al. De novo genome assembly and analyses of 12 founder inbred lines provide insights into maize heterosis. Nature Genetics55.2, 312–323 (2023). [Abstract] [Google Scholar]

22. Zhou, Y. et al. Pan-genome inversion index reveals evolutionary insights into the subpopulation structure of Asian rice. Nat Commun14, 1567 (2023). [Europe PMC free article] [Abstract] [Google Scholar]

23. Salojärvi, J. et al. The genome and population genomics of allopolyploid Coffea arabica reveal the diversification history of modern coffee cultivars. Nat Genet56, 721–731 (2024). [Europe PMC free article] [Abstract] [Google Scholar]

24. Walker, B. J., Abeel, T., Shea, T., Priest, M. & Earl, A. M. Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement. PLoS ONE 9, e112963 (2014).

25. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).

26. Servant, N. et al. HiC-Pro: An optimized and flexible pipeline for Hi-C data processing. Genome Biology 16 (2015).

27. Durand, N. et al. Juicer Provides a One-Click System for Analyzing Loop-Resolution Hi-C Experiments. Cell Systems 3, 95–98 (2016).

28. Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, eaal3327 (2017).

29. Chang, D. et al. The chromosome-level genome assembly of Astragalus sinicus and comparative genomic analyses provide new resources and insights for understanding legume-rhizobial interactions. Plant Communications 3, 100263 (2022).

30. Qu, C. et al. Comparative genomic analyses reveal the genetic basis of the yellow-seed trait in Brassica napus. Nature Communications 14, 5194 (2023).

31. Fang, X. et al. The sequence and analysis of a Chinese pig genome. GigaScience 1, 16 (2012).

32. Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnology 29(7), 644–652 (2011).

33. Haas, B. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Research 31, 5654–5666 (2003).

34. Kim, D., Langmead, B. & Salzberg, S. L. HISAT: a fast spliced aligner with low memory requirements. Nature Methods 12, 357–360 (2015).

35. Kovaka, S. et al. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biology 20, 278 (2019).

36. NCBI GenBank https://identifiers.org/ncbi/insdc.gca:GCA_026016865.1 (2022).

37. NCBI GenBank https://identifiers.org/ncbi/insdc.gca:GCA_000219495.2 (2014).

38. NCBI GenBank https://identifiers.org/ncbi/insdc.gca:GCA_949352195.3 (2023).

39. National Genomics Data Center https://ngdc.cncb.ac.cn/gwh/Assembly/66216/show (2023).

40. AraShare https://www.arashare.cn//static/uploads/Col-PEK1.5_assembly_and_annotation.tar.gz (2023).

41. Keilwagen, J., Hartung, F. & Grau, J. GeMoMa: Homology-Based Gene Prediction Utilizing Intron Position Conservation and RNA-seq Data. In: Kollmar, M. (eds) Gene Prediction. Methods in Molecular Biology, vol 1962 (2019).

42. Li, J. et al. Long read reference genome-free reconstruction of a full-length transcriptome from Astragalus membranaceus reveals transcript variants involved in bioactive compound biosynthesis. Cell Discovery 3 (2017).

43. Bi, Q. et al. The phased chromosome-scale genome of yellowhorn sheds light on the mechanism of petal color change. Horticultural Plant Journal (2023).

44. Emms, D. M. & Kelly, S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biology 20, 238 (2019).

45. Nakamura, T., Yamada, K. D., Tomii, K. & Katoh, K. Parallelization of MAFFT for large-scale multiple sequence alignments. Bioinformatics 34, 2490–2492 (2018).

46. Capella-Gutiérrez, S., Silla-Martínez, J. M. & Gabaldón, T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25, 1972–1973 (2009).

47. Minh, B. Q. et al. IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Molecular Biology and Evolution 37, 1530–1534 (2020).

48. Subramanian, B., Gao, S., Lercher, M. J., Hu, S. & Chen, W.-H. Evolview v3: a webserver for visualization, annotation, and management of phylogenetic trees. Nucleic Acids Research 47, W270–W275 (2019).

49. Li, D. et al. A high-quality genome assembly of the eggplant provides insights into the molecular basis of disease resistance and chlorogenic acid synthesis. Molecular Ecology Resources 21, 1274–1286 (2021).

50. Jin, J., Zhang, H., Kong, L., Gao, G. & Luo, J. PlantTFDB 3.0: a portal for the functional and evolutionary study of plant transcription factors. Nucleic Acids Research 42, D1182–D1187 (2014).

51. Wang, Y. et al. MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Research 40, e49 (2012).

52. Li, T. et al. Genome assembly of KA105, a new resource for maize molecular breeding and genomic research. The Crop Journal (2023).

53. Marçais, G. et al. MUMmer4: A fast and versatile genome alignment system. PLoS Computational Biology 14 (2018).

54. NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRP486930 (2024).

55. Fan, H. Astragalus membranaceus isolate JZ-2020, whole genome shotgun sequencing project. GenBank https://identifiers.org/ncbi/insdc.gca:GCA_039519185.1 (2024).

56. Fan, H. Genome Assembly and Annotation of Astragalus membranaceus (Fisch.) Bge (AM). figshare. Dataset. https://doi.org/10.6084/m9.figshare.25100393.v3 (2024).

Articles from Scientific Data are provided here courtesy of Nature Publishing Group

Data

Data behind the article

This data has been text mined from the article, or deposited into data resources.

Data Citations

(1 citation) DOI - 10.6084/m9.figshare.25100393.v3

BioProject

(1 citation) BioProject - PRJNA1067739

GCA - NCBI genome assembly (4)

Nucleotide Sequences (Showing 10 of 10)

(1 citation) ENA - SRR27790542
(1 citation) ENA - SRR27790541
(1 citation) ENA - SRR27790544
(1 citation) ENA - SRR27790543
(1 citation) ENA - SRR27790540
(1 citation) ENA - SRR27790539
(1 citation) ENA - SRP486930
(1 citation) ENA - SRR27790538
(1 citation) ENA - SRR27790546
(1 citation) ENA - SRR27790547
