CSD Subsets
What Is a CSD Subset?
A CSD subset is a list of CSD entries that have certain characteristics, a particular application, or are the best representative of each unique polymorph in the CSD. It is possible to quickly and effectively search just the compounds within a subset of interest rather than the whole CSD. The subsets also help to demonstrate the benefits of links between databases and other information sources, and are examples of how the CCDC’s values of collaboration and community help progress structural science.
The process of adding CSD entries to subsets is continually being improved to ensure that all the entries applicable to a given subset are being captured.
What Are the Benefits of Using the CSD Subsets?
CSD Subsets allow you to search and analyse only structures that are relevant to your work, reducing the need to separately filter out structures that aren’t relevant to a particular problem area, and increasing speed and efficiency. The CSD Subsets that are “Best Representative Lists” (described in the table below) provide a way to perform an analysis with only one example of a particular crystal form and thus reduce the chances of biases that could be introduced if multiple determinations are used in an analysis.
How Can I Find a CSD Subset?
Subsets are updated with every data release to CCDC’s Desktop software and can be used with ConQuest, Mercury or the CSD Python API.
ConQuest:
The subsets are organized under the View Databases tab of ConQuest through dropdowns.
To restrict the results of a ConQuest search to a subset:
- Build the query as usual and hit Search. Then hit the Select Subset button in the Search Setup dialogue box.
- Select Entries in a pre-defined hitlist.
- Click on the Choose a subset button underneath the words Restrict on pre-defined hitlists and select the required subset.
Mercury:
The subsets are organized under the CSD-Core tab of Mercury through dropdowns.
CSD Python API:
The CSD Python API contains the class ccdc.io.Subsets; this class provides a simple way to access pre-defined CSD subsets. The returned reader object is the same as if the Reader class had been initialized with the associated GCD file directly. You’ll see prompts to autofill your code with the correct subset name.
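To illustrate what restricting a search to a subset amounts to, here is a minimal sketch in plain Python. It does not call the CSD Python API. It assumes that a subset can be represented as a text list of refcodes, one per line (which is how we understand a GCD file to be laid out), and that the search has already produced a list of hit refcodes. The function names and the refcodes are hypothetical and used for illustration only.

```python
import io


def read_refcodes(handle):
    """Read one refcode per line, ignoring blank lines and surrounding whitespace."""
    return {line.strip().upper() for line in handle if line.strip()}


def restrict_to_subset(hits, subset_refcodes):
    """Keep only the hits that belong to the subset, preserving the hit order."""
    return [refcode for refcode in hits if refcode.upper() in subset_refcodes]


# Hypothetical subset list, standing in for a refcode file on disk.
subset_file = io.StringIO("ABCDEF\nGHIJKL01\n\nMNOPQR\n")
subset = read_refcodes(subset_file)

# Hypothetical hits returned by a search over the whole database.
hits = ["GHIJKL01", "STUVWX", "MNOPQR"]
restricted = restrict_to_subset(hits, subset)
print(restricted)
```

Holding the subset as a set keeps each membership check cheap, which matters when a subset has hundreds of thousands of entries.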
Current CSD Subsets
Find out more about the CSD Best Representative Lists
Best Hydrogen List (Over 750 thousand):
A list of the CSD entries that are among the most reliable crystal structures in the CSD and that best represent each unique polymorph in terms of H atoms having been located, particularly via neutron studies.
Best Low Temperature List (Over 750 thousand):
A list of the CSD entries that are among the most reliable crystal structures in the CSD and that best represent each unique polymorph by having been determined at the lowest temperature (reducing the impact of thermal motion).
Best R factor List (Over 750 thousand):
A list of the CSD entries that are among the most reliable crystal structures in the CSD and that best represent each unique polymorph by having the lowest R-value.
Best Room Temperature List (Over 750 thousand):
A list of the CSD entries that are among the most reliable crystal structures in the CSD and that best represent each unique polymorph by having been determined at room temperature.
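The idea behind a best representative list, keeping one determination per unique polymorph according to a chosen criterion, can be sketched as follows. This is an illustration only and not the CCDC's selection procedure: the record fields, the polymorph labels and the data are hypothetical, the reliability screening mentioned above is left out, and the grouping assumes that determinations of the same compound share the first six characters of their refcode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Determination:
    refcode: str
    polymorph: str        # hypothetical polymorph label
    r_factor: float       # R-value in percent
    temperature_k: float  # measurement temperature in kelvin


def best_per_polymorph(determinations, key):
    """Return the refcode with the smallest key value for each (family, polymorph) group."""
    best = {}
    for det in determinations:
        group = (det.refcode[:6], det.polymorph)
        if group not in best or key(det) < key(best[group]):
            best[group] = det
    return sorted(det.refcode for det in best.values())


# Hypothetical determinations of one compound with two polymorphs.
data = [
    Determination("ABCDEF", "form I", 5.2, 100.0),
    Determination("ABCDEF01", "form I", 3.1, 295.0),
    Determination("ABCDEF02", "form II", 4.0, 150.0),
]

best_r_factor = best_per_polymorph(data, key=lambda d: d.r_factor)
best_low_temp = best_per_polymorph(data, key=lambda d: d.temperature_k)
print(best_r_factor)
print(best_low_temp)
```

Changing only the key function changes which determination represents each polymorph, which is why the lists can contain different entries for the same crystal form.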
CSD Drug Subsets
CSD COVID-19 Subset (345 entries):
A list of the CSD entries that have been reported as small molecule drug candidates for the treatment of COVID-19.
CSD Drug Subset (Over 16 thousand):
A list of the CSD entries that contain an approved drug molecule (as defined by the DrugBank approved drug list). Includes hydrates, solvates, salts and metal complexes.
Single Component Drug Subset (Over 2.5 thousand):
A list of the CSD entries where the approved drug molecule (as defined by the DrugBank approved drug list) is the only non-disordered fully modelled molecule in the crystal structure.
CSD MOF Subsets
CSD MOF Subset (Over 128 thousand):
A list of the CSD entries that are metal-organic framework (MOF) structures. Because definitions of a MOF vary significantly, this subset is intended to be as widely applicable as possible, including all metal centres (rather than only transition metals) and 1D, 2D and 3D frameworks.
Non-disordered MOF Subset (Over 94 thousand):
A list of the CSD entries that are non-disordered (structures in the CSD are normally classified as disordered if there is any non-hydrogen disorder – including unmodelled molecules common in MOF-like frameworks). This subset includes structures where the framework is non-disordered, but may still contain disorder relating to solvent molecules (unmodelled disorder as part of a framework is not excluded in this subset).
1D/2D/3D MOF subsets (Over 36/26/30 thousand respectively):
A list of the CSD entries from the non-disordered MOF subset where the main polymeric framework of the structure is 1-dimensional/2-dimensional/3-dimensional.
Teaching Subset
945 entries.
A list of the CSD entries that have been selected to enhance chemistry learning including: molecules typically used in standard chemical texts to exemplify the core concepts and principles taught in the undergraduate chemistry curriculum, examples of all major chemical functional groups as well as a wide range of broader chemical classes, and a diverse range of molecular geometries.
ADPs Available
Over 1 million entries.
A list of the CSD entries for which at least one atom has anisotropic displacement parameter information available in the CSD. Ellipsoids are visible for these structures when being viewed in Mercury.
CSD Pesticides
Over 1.1 thousand entries.
A list of the CSD entries where a molecule in the crystal structure is found in the Pesticide Properties DataBase (PPDB), developed by the Agriculture & Environment Research Unit (AERU) at the University of Hertfordshire. The matches do not take into account stereochemistry, so some entries in the subset may correspond to enantiomers or stereoisomers of the PPDB entry. Links directly to the PPDB are available via our Access Structures and WebCSD services, with reciprocal links to the CSD in the PPDB database.
Electron Diffraction Subset
Over 500 entries.
A list of the CSD entries that have been identified (either through CIF information or through the associated publication) as having been solved via electron-diffraction techniques. Electron diffraction is an emerging field that enables the structure solution of crystalline powders that do not easily form measurable single crystals (as is the case for many known APIs).
High Pressure Subset
Over 4.7 thousand entries.
A list of the CSD entries that have been identified (either through CIF information or through the associated publication) as having been measured at 0.1 gigapascals (GPa) or above. These structures are highly relevant for research into MOFs for gas storage.
Hydrate Subset
Over 162 thousand entries.
A list of the CSD entries that contain an H2O component in their formula, where the H can be H or D. Finding the type of molecules that form hydrates is important in de-risking new drug candidates. This subset originated from the work of our Crystal Form Consortium (CFC), a collaboration between industry and the CCDC to provide structural informatics tools and approaches for the rational design of the solid form.
Minimal Disorder Subset
Over 38 thousand entries.
A list of the CSD entries whose disorder classification is solely due to the use of SQUEEZE or MASK routines to exclude disordered components within a structure. The remaining modelled molecules do not exhibit disorder and could therefore be used for conformational or statistical analysis alongside non-disordered structures.
Polymorphic Subset
Over 44 thousand entries.
A list of the CSD entries that have been identified as polymorphic and contain information in their polymorph field, together with the entries in the same refcode family as a structure that is polymorphic.
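The family rule in this description can be sketched in a few lines of Python. The sketch assumes that a refcode family is identified by the first six characters of the refcode (a six-letter code optionally followed by two digits), and that an entry counts as flagged when its polymorph field is non-empty. The function names, refcodes and field values are hypothetical.

```python
from collections import defaultdict


def refcode_family(refcode):
    """Return the family identifier, taken here as the first six characters."""
    return refcode[:6]


def polymorphic_subset(polymorph_field):
    """polymorph_field maps refcode -> polymorph note (empty string if none).

    Returns every refcode whose family contains at least one entry with a note.
    """
    families = defaultdict(list)
    for refcode in polymorph_field:
        families[refcode_family(refcode)].append(refcode)
    flagged = {refcode_family(rc) for rc, note in polymorph_field.items() if note}
    return sorted(rc for family in flagged for rc in families[family])


# Hypothetical entries: only ABCDEF01 and MNOPQR carry a polymorph note.
entries = {
    "ABCDEF": "",
    "ABCDEF01": "polymorph II",
    "GHIJKL": "",
    "MNOPQR": "form I",
}
subset = polymorphic_subset(entries)
print(subset)
```

Here ABCDEF is included even though its own polymorph field is empty, because it shares a family with ABCDEF01.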
Significant Disorder Subset
Over 58 thousand entries.
A list of the CSD entries with unresolved or symmetry disorder. For these entries it is generally not possible to obtain reliable data relating to chemistry (e.g. bond lengths) or crystal properties (e.g. void space or density). Excluding this subset from query results may help to avoid unexpected results when performing statistical analysis.