Deciphering the RRM-RNA recognition code: A computational analysis.

Roca-Martínez J; Dhondge H; Sattler M; Vranken WF

doi:10.1371/journal.pcbi.1010859

Affiliations

1. Interuniversity Institute of Bioinformatics in Brussels, VUB/ULB, Brussels, Belgium (Roca-Martínez J, Vranken WF)
2. Université de Lorraine, CNRS, Inria, LORIA, Nancy, France (Dhondge H)
3. Institute of Structural Biology, Molecular Targets and Therapeutics Center, Helmholtz Munich, Neuherberg, Germany (Sattler M)

Plos Computational Biology, 23 Jan 2023, 19(1):e1010859
https://doi.org/10.1371/journal.pcbi.1010859 PMID: 36689472 PMCID: PMC9894542

This article is in the Europe PMC Open access subset. Refer to the copyright information in the article for licensing details.

Abstract

RNA recognition motifs (RRM) are the most prevalent class of RNA binding domains in eucaryotes. Their RNA binding preferences have been investigated for almost two decades, and even though some RRM domains are now very well described, their RNA recognition code has remained elusive. An increasing number of experimental structures of RRM-RNA complexes has become available in recent years. Here, we perform an in-depth computational analysis to derive an RNA recognition code for canonical RRMs. We present and validate a computational scoring method to estimate the binding between an RRM and a single stranded RNA, based on structural data from a carefully curated multiple sequence alignment, which can predict RRM binding RNA sequence motifs based on the RRM protein sequence. Given the importance and prevalence of RRMs in humans and other species, this tool could help design RNA binding motifs with uses in medical or synthetic biology applications, leading towards the de novo design of RRMs with specific RNA recognition.

Joel Roca-Martínez, Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft,¹,² Hrishikesh Dhondge, Data curation, Investigation, Software, Validation, Writing – review & editing,³ Michael Sattler, Conceptualization, Validation, Writing – review & editing,⁴,⁵ and Wim F. Vranken, Conceptualization, Funding acquisition, Project administration, Resources, Supervision, Writing – original draft, Writing – review & editing¹,²,¤,*

Shi-Jie Chen, Editor

Associated Data

Supplementary Materials: S1 Fig: Schematic procedure for the similarity score calculation between two RNAs based on their binding with the RRM. The two RNAs are aligned in all possible combinations using a sliding window, with the number of matches and unique positions each nucleotide interacts with in the RRM sequence counted for the 4 positions aligned in this example. The ratios between the matches and unique positions are added and then averaged for all the positions by dividing them by the length of the alignment.
(PDF)
pcbi.1010859.s001.pdf (148K)
S2 Fig: Variation on the number of clusters (A) and number of entries in the biggest cluster (B) depending on the chosen cutoffs for similarity score (X-axis, scores from 0 to 1) and percentage of entries (Y-axis from 0% to 100%) that should have an equal or higher similarity score with the rest of the cluster. The chosen cutoff for the cluster generation is highlighted in yellow.
(PDF)
pcbi.1010859.s002.pdf (105K)
S3 Fig: Superimposed RRM-RNA complexes with the lowest similarity score in cluster 0, PDB Id. 6g90 (chain B, green) and PDB Id. 3nnh (chain B, orange). The aligned nucleotides used for the RMSD calculations are labelled.
(PDF)
pcbi.1010859.s003.pdf (111K)
S4 Fig: Scores for the conserved aromatic positions in RNP2 (β1–3) and RNP1 (β3–3, β3–5) in contact with their respective RNA positions (phenylalanine and tyrosine scores are highlighted in bright green). (PDF)
pcbi.1010859.s004.pdf (207K)
S1 Table: PFAM identifiers and related metadata of the selected RRM families for the analysis. (PDF)
pcbi.1010859.s005.pdf (107K)
S2 Table: Identifiers of the 19 selected structures to use in PROMALS3D. (PDF)
pcbi.1010859.s006.pdf (110K)
S3 Table: Experimental Kd values correlated with the RRMScorer scores for the WT RNA and the 36 RNA mutants tested against MSI1 RRM1, using a window size of three nucleotides. (PDF)
pcbi.1010859.s007.pdf (83K)
S1 File: S1–S4 Equations: Basis for the development of the RRMScorer equation. Our method is based on the information difference between the occurrence of two events, in our case the information of how often a specific nucleotide interacts with a specific residue, $I(N_{i}; R_{j})$, and the information when that same nucleotide interacts with any other residue, $I(n - N_{i}; R_{j})$ (S1 Equation). The individual terms are developed in S2 and S3 Equations where: $f_{N_{i}, R_{j}}$ is the number of occurrences for a specific contact between nucleotide i and residue j; $f_{R_{j}}$ the number of times residue j is in that position; $f_{N_{i}}$ the number of times nucleotide i is in that position; $f_{n - N_{i}, R_{j}}$ the number of times nucleotide i interacts with any residue but residue j; $f_{n - N_{i}}$ the number of nucleotides other than nucleotide i in that position; R the total number of residues in the dataset. As it is a difference between the two logarithms, the common terms that account for the number of specific residues in position j ($f_{R_{j}}$) and the total number of residues in the dataset (R) disappear from the equation. After the simplification we obtain S4 Equation (same as Eq 2 in the main text).
(PDF)
pcbi.1010859.s008.pdf (122K)
S1 Dataset: Complete list of the RRM structures retrieved from the PDB, with the RRM sequence and the sequence range for both PDB and UniProt. (FASTA)
pcbi.1010859.s009.fasta (173K)
S2 Dataset: List including the identifiers for the RRM-RNA complexes interacting with three or more nucleotides. The identifier name is organized as follows: <UniProt ID>_<RRM number>_<PDB ID>_<protein chain>_<PDB numbering>_<UniProt numbering>_<Internal numbering for mapping>_<RNA chain>.
(TXT)
pcbi.1010859.s010.txt (11K)
S3 Dataset: Contacts list for all the RRM-RNA complexes. Each line corresponds to a specific contact and it is organized as follows: <PDB ID>_<Protein chain>, <PDB residue number>_<Residue one letter code>_<Nucleotide one letter code>_<PDB nucleotide number>_<RNA chain>.
(TXT)
pcbi.1010859.s011.txt (255K)
S4 Dataset: Fasta file with the reduced RRM dataset after applying a 99% sequence identity threshold. (FASTA)
pcbi.1010859.s012.fasta (35K)
S5 Dataset: RRM sequence alignment in fasta format for the 347 RRM selected sequences. Generated using PROMALS3D. (FASTA)
pcbi.1010859.s013.fasta (97K)
S6 Dataset: RRM sequence alignment in fasta format for the 271 RRM-RNA complexes. Generated via HMM from the master alignment.
(FASTA)
pcbi.1010859.s014.fasta (75K)
S7 Dataset: RNA binding similarity matrix for the 271 RRM-RNA complexes. Values close to 1 refer to similar binding modes while values close to 0 correspond to different binding modes. A score of 0 is given when an RRM-RNA complex is compared with itself. The matrix is available in CSV format.
(CSV)
pcbi.1010859.s015.csv (1.2M)
S8 Dataset: Fasta file with the RNA alignment for the 187 RRM-RNA complexes included in cluster 0. (FASTA)
pcbi.1010859.s016.fasta (8.7K)
S9 Dataset: CSV file with the internal validation results for the 187 RRM complexes in cluster 0. (CSV)
pcbi.1010859.s017.csv (12K)
S10 Dataset: JSON file with the protein identifiers, RRM domain, bits value from RNAcompete data and RRMScorer predictions. (JSON)
pcbi.1010859.s018.json (41M)
Attachment: Submitted filename: response_to_reviewers.pdf
pcbi.1010859.s019.pdf (225K)

Data Availability Statement: All the datasets and code required to run RRMScorer are publicly available at https://bitbucket.org/bio2byte/rrmscorer/src/master/.

Author summary

The interactions between proteins and RNAs are key to many different biological processes and crucial for the proper functioning of the cells. The RNA recognition motif is the most prevalent protein carrying such functions in eucaryotes, so understanding how this protein motif interacts with different RNA sequences is of immediate relevance. However, a general recognition code between this motif and the RNA is not known yet. We have performed a computational analysis to understand how the recognition process works, identify the main binding mode between the protein and the RNA and which are the most relevant amino acids involved in the recognition. This analysis allowed us to build a predictor that estimates the binding between an RNA recognition motif and any RNA sequence. We have named this method RRMScorer and it only needs the sequence of both the protein and the RNA to predict the results. We hope that this new tool will be useful to identify new potential RNA targets or to design new protein mutants to modify their RNA binding capabilities.

Introduction

The RNA recognition motif (RRM) is a well-studied RNA-binding domain that is prevalent throughout organisms, but especially so in eucaryotes, where it plays crucial roles in many aspects of post-transcriptional gene regulation [1]. A single RRM domain is approximately 90 residues long, with a very conserved topology of two α-helices packed on an antiparallel β-sheet. The four β-strands and the loops connecting the secondary structure elements (loop 1 connecting β1-α1, loop 3 connecting β2-β3 and loop 5 connecting α2-β4) often serve as the main RNA binding interface [2] (Fig 1A). Even though the core 3D fold of the RRM is very conserved, its amino acid sequence has evolved extensively to specifically bind different RNA sequences [3], thereby enabling RRMs to regulate a wide range of biological functions [2]. Most commonly, the RRM binds single stranded RNA (ssRNA), and in some cases single stranded DNA [4,5] or structured RNA motifs [6]. Examples are available where protein-protein interactions between an RRM and another RRM, or a non-RRM protein, can modulate the RRM binding capabilities [7–9]. The RRM fold has also evolved to mediate protein-protein interactions with limited or no RNA binding capability, e.g. U2AF Homology Motif (UHM) domains that recognize peptidic UHM Ligand Motifs (ULMs) [10,11].

Fig 1

A) Cartoon representation of the Sex-lethal RRM1 protein in complex with polyU (PDB Id. 1b7f). The main secondary structure elements are labelled and coloured in pale green (β-strands), dark salmon (⍺-helices) and light blue (loops and terminal regions). B) Schematic representation of the RRM depicting the main positions of the canonical RNA binding interface with single circles and highlighting in red the most significant ones in terms of RNA interactions prevalence. The same colour code is used and the light green and dark green circles in the β-strands correspond to the exposed and buried residue sidechains, respectively.

Many efforts have been made to understand the RNA recognition mechanism, and despite the identification of some RNA consensus sequences for several RRM domains [12–19], many RRMs have no consensus sequence identified yet. The variable RNA binding modes and variations in RRM subfamilies [1] complicate the identification of a general code for RRM-RNA recognition, which has remained a challenge for many years [20]. The increasing number of structures of RRM-RNA complexes now available allows a more detailed and general analysis of the residue-nucleotide preferences of RRMs. RNA binding preference predictions for RNA binding proteins (RBP) have already been a research focus during the last decade, where deep learning methods are taking advantage of the huge amount of information available for certain RBP families [21,22]. One of the main limitations of these methods is that often only the RNA information is considered [18,23], which is inadequate for RRMs, where a few amino acid changes on the RNA binding interface can completely change RNA specificity [24,25]. Knowledge-based potentials have also been widely used to study protein/nucleic acid interactions [26,27], with some specific applications on protein-RNA recognition [28]. One of the main limitations of these potentials is that they rely on docking models and a detailed calculation of all the atomic interactions, and therefore have a strong dependence on the precise structural data that the potential was based on [29]. The recent release of RoseTTAFoldNA [30] also promises a huge advance in the field, providing high accuracy models with atomic resolution for protein/nucleic-acid complexes, which is extremely useful for proteins where it is clear which RNA the protein binds, but that is not always the case. Therefore, a fast and interpretable method that works at the sequence level and that is applicable to genome scale studies to identify possible interactions, or that can be used in computational screening in protein design, is not available.

Here we introduce RRMScorer, a scoring method that overcomes these limitations by carefully combining information on protein and RNA, structure and sequence data, whilst minimising bias in both areas. A comprehensive analysis of these data reveals RRM-RNA interaction preferences, which are used to generate a score that informs how likely a particular RRM-RNA interaction is. We focus on the canonical binding mode characterized by the binding of the RNA to the surface of the central β-sheet of the RRM fold (Fig 1A), which comprises aromatic residues in the β3 and β1 strands that mediate non-sequence-specific stacking interactions with RNA bases. These residues are part of the highly conserved ribonucleoprotein domain 1 (RNP1) and RNP2, in the β3 and β1 strands, respectively, and are a hallmark of the canonical RRM fold. Even within this canonical binding mode, a wide range of different RNA sequences can be recognized. Indeed, some of the well-described human RRMs such as HuR, U1A and PTB (UniProt Ids. Q15717, P09012, P26599) recognize very different RNA sequences [19] whilst sharing the canonical binding mode.

This meta-analysis of available RRM-RNA information brings us closer to understanding how this versatile motif works. RRMScorer provides a novel means to decipher a general recognition code for canonical RRMs by using a completely different approach from other RBP predictors, where complex deep-learning networks are trained using high-throughput RNA binding data to determine the consensus for different RRM domains [21,22]. Instead, our method relies on a carefully curated alignment for all the structural information available, that is then translated into a single score that estimates how likely it is that any residue in a specific position of the RRM interacts with any nucleotide of the RNA. One of the key points of RRMScorer is therefore its interpretability, enabling tracking of the residue-nucleotide contacts that lead to good or bad overall scores for an RRM-RNA complex. This approach brings us closer to successfully designing novel RRMs that are specific for different ssRNA targets. Due to the wide range of processes RRMs are involved in, designing such RRMs would be relevant in both therapeutic and synthetic biology fields, for example through creating novel means for post-transcriptional regulation through RRMs, as well as for discovery of in vivo RNA targets of RRMs with as yet unknown function and interactions.

Materials and methods

Source data

To gather the RRM data available we started from the Pfam database [31], where 19 families containing RRMs were identified (S1 Table) and further validated through visual inspection of representative structures for each family. All the available entries on the Protein Data Bank belonging to those families were retrieved and split into the separate RRM domains. The structures range from proteins containing a single RRM domain to larger proteins containing up to four RRM domains. In this way, 1259 RRM domain structures were identified in the complete RRM dataset (S1 Dataset), from which the domains in complex with RNA were extracted. 271 entries remained after removing the complex structures where fewer than three nucleotides are recognized (Fig 2). The complete RRM dataset and the RRM-RNA complex dataset (S2 Dataset) are available from the supplementary information folder in the Bitbucket repository. All entries are identified by their UniProt code, RRM number, PDB Id. and chain, starting and ending positions of the RRM by PDB and UniProt, and UniProt starting and ending positions matching the sequence included in the file. The latter numbering is required to include some extra residues at both the C- and N-terminal regions that might still be relevant for RNA binding.

Fig 2

Data flow diagram for the RRM structural and sequence data to generate the master alignment, use it to align the RRM-RNA complexes and cluster them depending on their binding mode so the RNAs can be aligned.

The different Datasets that have been made available are named and labelled with their corresponding number of entries.

From these RRM-RNA complexes, all protein-RNA interactions were computed using an in-house script and stored in a text file also available from the Bitbucket repository (S3 Dataset). An amino acid residue and nucleotide were considered to interact if any atom from the residue and the nucleotide were less than 5 Å away from each other. This is a broadly used interaction definition to keep strong interactions such as hydrogen bonds or electrostatic interactions, while still accounting for hydrophobic interactions that can occur at distances of 3.8–5.0 Å [32]. The analysis resulted in 13387 identified amino acid to nucleotide interactions.
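The contact definition above can be illustrated with a minimal sketch. This is not the in-house script mentioned in the text; the function name `in_contact` and the toy coordinates are hypothetical, and only the 5 Å any-atom criterion comes from the description.

```python
from itertools import product
from math import dist

CUTOFF = 5.0  # Å, the interaction distance threshold stated in the text


def in_contact(residue_atoms, nucleotide_atoms, cutoff=CUTOFF):
    """Return True if any residue atom is closer than `cutoff` to any nucleotide atom."""
    return any(dist(a, b) < cutoff for a, b in product(residue_atoms, nucleotide_atoms))


# Toy coordinates (hypothetical, not taken from any PDB entry)
phe = [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0)]
ura = [(4.0, 3.0, 0.0), (9.0, 9.0, 9.0)]
far = [(20.0, 0.0, 0.0)]

print(in_contact(phe, ura))  # closest atom pair is about 3.97 Å apart
print(in_contact(phe, far))
```

For full structures a spatial index (for example a k-d tree) would avoid the all-against-all comparison, but the criterion itself stays the same.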

Data cleaning and alignment procedure

For the analysis, it was necessary to reduce the bias in the original complete RRM dataset. The protein sequences were extracted from the 1259 RRM domain structures (S1 Dataset) and to eliminate nearly identical RRM sequences in this set a sequence identity threshold of 99% was applied, after which 356 sequences remained in the reduced RRM dataset (S4 Dataset). This set still showed a strong bias towards certain RRM families, especially the RRM_1 Pfam family (PF00076), with 314 entries out of the 356 belonging to this family. To overcome this bias a set of 19 representative RRM domains was defined using CD-HIT [33] with a 30% sequence identity cut-off, while validating that we were selecting entries from different Pfam families, even though larger families such as RRM_1 are still repeated. This representative RRM set (S2 Table) contains very diverse RRM sequences for which the structure is available, and served as the core for generating the master multiple sequence alignment (MSA) (Fig 2).

The sequence and structure information from the reduced RRM dataset was fed into PROMALS3D [34] to generate the alignment (Fig 2). Besides using structure and sequence, this tool employs secondary structure predictions to generate the MSA. For our study, we used the structures of the previously selected pool of 19 representative RRMs, and the sequences for the remaining 336 domains, totalling 356 RRM sequences. After manual checks to avoid gaps in the MSA for the 6 main secondary structures (β1-α1-β2-β3-α2-β4) due to some unusually long β-strands and/or α-helices, 347 RRMs remained in the clean alignment.

The clean alignment was further enhanced to improve the often poor alignments for the loops and terminal regions. Based on the principle that for amino acids in the loop regions the most important characteristic is how they are connected to the fixed secondary structure elements, these regions were ‘squeezed’ so all the gaps are placed in the middle of the loop regions, or at the extremes for the N and C terminal regions, in case of shorter loops or terminal regions, respectively (Fig 3). This alignment, hereinafter referred to as the master alignment, is then used to generate an HMM useful to quickly align other RRM sequences, such as the RRM-RNA structures dataset (Fig 2). The master alignment (S5 Dataset) and the alignment for the RRM-RNA structures (S6 Dataset) are available from the supplementary information folder in the Bitbucket repository.
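The 'squeezing' step can be sketched as follows. This is an illustrative reading of the description, not the authors' code: the function names are hypothetical, and splitting the loop residues evenly between the two flanking elements is an assumption.

```python
def squeeze_loop(segment, gap="-"):
    """Keep residue order but move every gap to the middle of an aligned loop segment.

    Assumption: residues are split as evenly as possible between the two
    flanking secondary structure elements.
    """
    residues = segment.replace(gap, "")
    n_gaps = len(segment) - len(residues)
    half = (len(residues) + 1) // 2  # residues that stay next to the preceding element
    return residues[:half] + gap * n_gaps + residues[half:]


def squeeze_terminus(segment, n_terminal, gap="-"):
    """Move every gap to the outer end of an N- or C-terminal region."""
    residues = segment.replace(gap, "")
    pad = gap * (len(segment) - len(residues))
    return pad + residues if n_terminal else residues + pad


print(squeeze_loop("G-KD--S"))                       # GK---DS
print(squeeze_terminus("M-AS-", n_terminal=True))    # --MAS
print(squeeze_terminus("K-R-", n_terminal=False))    # KR--
```

After this step, residues directly attached to a secondary structure element always occupy the same alignment column, whatever the loop length.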

Fig 3

Schematic representation of the sequence alignment procedure to improve the alignment in the loop and terminal regions.

RRM-RNA complexes similarity matrix

To identify and cluster the entries where RNA molecules bind in a similar orientation to the RRMs, we developed a method to compute a pairwise similarity score based on the amino acid positions of the RRM that recognize the nucleotides of the RNAs (S1 Fig). This was calculated for the 271 RRM-RNA structures available (S7 Dataset). Some entries contain long RNAs (up to 3000 nucleotides), for which the RNA sequences were truncated to keep only the part that binds to the RRM. All RNA sequences were compared pairwise, by sliding them with respect to each other and checking whether their nucleotides bind similar amino acid sequence positions in the RRM master MSA. For each pair of aligned nucleotides, the number of matching positions (positions of the RRM that both nucleotides interact with) is divided by the number of unique positions, i.e., the total number of different positions that those two nucleotides bind. The value of this ratio for each aligned nucleotide is added and then divided by the length of the alignment, providing a score from 0 (completely different binding mode) to 1 (the same binding mode) (Eq 1).

$$\text{Similarity score} = \frac{\sum_{i=1}^{n} \frac{N_{\text{Matching positions}}}{N_{\text{Unique positions}}}}{N_{\text{Aligned nucleotides}}} \qquad (1)$$

For each compared RNA pair, only the alignment with the highest similarity score is retained. A similarity matrix was then constructed from these best scores that includes all the RRM-RNA complexes, and which is later used to identify the different binding modes. The matrix is available as a CSV file from the supplementary information folder in the Bitbucket repository (S7 Dataset).
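The sliding-window comparison can be sketched as below. Each RNA is represented as a list of sets, one set per nucleotide, holding the master-alignment positions that nucleotide contacts. The function names and the `min_overlap` parameter are assumptions made for illustration; the text does not state a minimum overlap.

```python
def alignment_score(a, b):
    """Eq 1 for one alignment: mean over aligned nucleotides of matching / unique positions."""
    ratios = [len(x & y) / len(x | y) if (x | y) else 0.0 for x, y in zip(a, b)]
    return sum(ratios) / len(ratios)


def best_similarity(rna_a, rna_b, min_overlap=2):
    """Slide rna_b along rna_a and keep the best-scoring alignment.

    Returns (score, offset), where offset is the index in rna_a that is
    aligned with the first nucleotide of rna_b. `min_overlap` is a
    hypothetical guard against trivial single-nucleotide alignments.
    """
    best_score, best_offset = 0.0, None
    for offset in range(-len(rna_b) + 1, len(rna_a)):
        start_a, start_b = max(offset, 0), max(-offset, 0)
        length = min(len(rna_a) - start_a, len(rna_b) - start_b)
        if length < min_overlap:
            continue
        score = alignment_score(rna_a[start_a:start_a + length],
                                rna_b[start_b:start_b + length])
        if best_offset is None or score > best_score:
            best_score, best_offset = score, offset
    return best_score, best_offset


# Toy contact sets (hypothetical master-alignment column numbers)
rna_a = [{10, 12}, {12, 45}, {45, 47}, {80}]
rna_b = [{12, 45}, {45, 47, 49}]
score, offset = best_similarity(rna_a, rna_b)
print(round(score, 4), offset)  # 0.8333 1
```

In the toy case the best alignment pairs the second and third nucleotides of `rna_a` with `rna_b`: the ratios are 2/2 and 2/3, giving (1 + 2/3) / 2.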

To select a homogeneous cluster, we grouped all the entries that have a minimum score of 0.25 with at least 25% of the complexes in the cluster. We explored the effect of changing the cutoffs on the cluster generation (S2 Fig), and selected a combination of values that balances the variability and the similarity within the clusters, so allowing further meaningful analysis of the complexes. The first cluster defined, hereinafter referred to as cluster 0, retained 187 entries and it was the largest found (Fig 2) using those cutoffs. Other smaller clusters were generated, but in this work, we will focus on the canonical binding mode represented by cluster 0.
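One possible reading of this clustering rule is sketched below. The exact procedure is not spelled out in the text, so the greedy growth from a seed entry, the function names and the toy matrix are all assumptions; only the two cutoffs (score 0.25, 25% of the cluster) come from the description.

```python
def fits_cluster(entry, members, sim, min_score=0.25, min_fraction=0.25):
    """True if `entry` scores >= min_score with at least min_fraction of the members."""
    others = [m for m in members if m != entry]
    if not others:
        return True
    hits = sum(sim[entry][m] >= min_score for m in others)
    return hits / len(others) >= min_fraction


def grow_cluster(seed, entries, sim, **cutoffs):
    """Greedy sketch: keep adding entries that fit the current cluster until nothing changes."""
    cluster = [seed]
    changed = True
    while changed:
        changed = False
        for entry in entries:
            if entry not in cluster and fits_cluster(entry, cluster, sim, **cutoffs):
                cluster.append(entry)
                changed = True
    return cluster


# Toy symmetric similarity matrix (hypothetical values)
sim = {
    "a": {"b": 0.60, "c": 0.30, "d": 0.05},
    "b": {"a": 0.60, "c": 0.10, "d": 0.05},
    "c": {"a": 0.30, "b": 0.10, "d": 0.05},
    "d": {"a": 0.05, "b": 0.05, "c": 0.05},
}
print(grow_cluster("a", ["a", "b", "c", "d"], sim))  # ['a', 'b', 'c']
```

Entry `c` is accepted although it is dissimilar to `b`, because it passes the score cutoff with half of the current members; this is the kind of within-cluster variability that the percentage cutoff tolerates.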

After visual inspection of some of the entries within cluster 0 we verified that the RNAs were bound similarly, and that we essentially captured the canonical binding mode [1]. The pair of entries yielding the lowest similarity score was PDB Id. 6g90 (chain B–RRM1) and PDB Id. 3nnh (chain B–RRM1), with a similarity score of 0.083. Based on their best alignment from the score calculation, a UC RNA fragment from 6g90 aligns with a GU RNA fragment from 3nnh. To verify that they still shared a similar binding mode, both complexes were superimposed (S3 Fig) and the RMSD was calculated using the heavy atoms in the sugar for the four aligned nucleotides, obtaining a value of 1.98 Å.

RNA sequence alignment

To conduct the RNA alignment, we used the same method as for the RNA binding modes identification, based on the amino acid sequence positions of the RRM that the nucleotides of the RNA are in contact with. When comparing two RNAs, the sliding window position that generates the highest score corresponds to the best possible alignment for those sequences. To align the 187 RNAs included in cluster 0, we selected the medoid, the entry with the highest similarity scores with respect to all other entries (PDB Id. 3hhn, chain D). For the 186 remaining entries we found the best alignment against the medoid and generated the RNA MSA. Gaps were added at the 5’ or 3’ ends of the RNA sequences when required to ensure that nucleotides in the same position were properly aligned. This is necessary due to the different length of the RNA fragments considered, which ranges from 3 to 11 nucleotides. The FASTA file with all the RNAs aligned from cluster 0 is available from the Bitbucket repository (S8 Dataset).
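A minimal sketch of the medoid selection and gap padding follows. It assumes that each RNA already has a best offset relative to the medoid (as produced by the sliding-window comparison); the function names and toy sequences are hypothetical and do not correspond to the actual RNAs in cluster 0.

```python
def medoid(sim):
    """Entry whose summed similarity to all other entries is highest."""
    return max(sim, key=lambda e: sum(s for other, s in sim[e].items() if other != e))


def build_rna_msa(medoid_seq, others, gap="-"):
    """Pad RNAs with 5' and 3' gaps so that they line up with the medoid.

    others: {name: (sequence, offset)}, where offset is the medoid index
    aligned with the first nucleotide of the sequence (may be negative).
    """
    placed = {"medoid": (medoid_seq, 0), **others}
    shift = -min(off for _, off in placed.values())
    width = max(off + shift + len(seq) for seq, off in placed.values())
    return {name: (gap * (off + shift) + seq).ljust(width, gap)
            for name, (seq, off) in placed.items()}


sim = {"a": {"b": 0.6, "c": 0.3}, "b": {"a": 0.6, "c": 0.1}, "c": {"a": 0.3, "b": 0.1}}
print(medoid(sim))  # a

msa = build_rna_msa("ACGUA", {"x": ("GU", 2), "y": ("UAC", -1)})
for name, row in msa.items():
    print(name, row)
# medoid -ACGUA
# x      ---GU-
# y      UAC---
```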

RRM-RNA scoring

The RRM-RNA scoring method we have developed, RRMScorer, is purely based on statistics derived from the carefully curated multiple sequence alignment. It is an adaptation of the GOR method [35,36], which converts statistical information into a probabilistic framework and was originally used for secondary structure prediction. The GOR method applies information theory principles to calculate the information difference of the occurrence of two events. In the GOR method, the two information components calculated are 1) a specific amino acid residue type being in a particular secondary structure element and 2) that same amino acid residue type being in any other secondary structure element. The two components are calculated as a logarithm, with the background amino acid information present in both, and as the difference between them is calculated, the amino acid residue type occurrence disappears from the equation. RRMScorer relies on the same information difference equation to calculate which nucleotide-residue contacts are preferred for specific amino acid positions in an RRM. The amino acid contribution, similar to the original GOR equation, disappears from the equation and the obtained terms result in Eq 2 below. The full development of the equation is available in Supplementary Material (S1 File). This approach is suitable for the limited amount of data and residue-level information that is currently available, as it generalises the information and avoids overinterpretation of specific RRM-RNA interactions. The method can score RNA fragments up to 5 nucleotides long, because those are the positions for which we have sufficient information to statistically analyse.

$$I(\Delta N_{i}; R_{j}) = \log\left(\frac{f_{N_{i}, R_{j}}}{f_{n - N_{i}, R_{j}}}\right) + \log\left(\frac{f_{n - N_{i}}}{f_{N_{i}}}\right) \qquad (2)$$

The scores are computed for each residue-nucleotide interacting position individually. The result is the sum of two terms; the first term computes the logarithm of the ratio between the number of times a nucleotide in position i (from the RNA sequence alignment that accounts for how it binds the RRM) has been observed interacting with an amino acid residue in position j ($f_{N_{i}, R_{j}}$) (based on the master alignment), over the number of times that the nucleotide interacts with any other amino acid residue ($f_{n - N_{i}, R_{j}}$). For example, the number of times an adenine in position 1 is observed interacting with an arginine in position β1–1 is divided by the number of times adenines in position 1 interact with any other amino acid residue in position β1–1. This value is then corrected by the second term, which computes the ratio between the number of times another nucleotide is observed in position i ($f_{n - N_{i}}$) versus the number of times the selected nucleotide is observed in that position ($f_{N_{i}}$). Following the previous example, this is the number of times any nucleotide except adenine is observed in RNA position 1 divided by the number of times an adenine is observed in that position.
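As a worked illustration of Eq 2, the sketch below evaluates the score from the four counts. The function name, the natural logarithm, the handling of zero counts and the toy numbers are assumptions; the text does not specify the log base or how unobserved contacts are treated.

```python
from math import log


def eq2_score(f_n_r, f_other_r, f_other, f_n):
    """Evaluate Eq 2 from four counts.

    f_n_r     : f_{N_i, R_j}, the numerator of the first term
    f_other_r : f_{n - N_i, R_j}, the denominator of the first term
    f_other   : f_{n - N_i}, other nucleotides observed at RNA position i
    f_n       : f_{N_i}, the selected nucleotide observed at RNA position i

    Assumption: natural logarithm; None is returned when a count is zero,
    because the logarithm is then undefined.
    """
    if min(f_n_r, f_other_r, f_other, f_n) <= 0:
        return None
    return log(f_n_r / f_other_r) + log(f_other / f_n)


# Hypothetical counts: log(12 / 4) + log(30 / 20) = log(3) + log(1.5) = log(4.5)
score = eq2_score(f_n_r=12, f_other_r=4, f_other=30, f_n=20)
print(round(score, 4))  # 1.5041
```

The second term acts as a background correction: a nucleotide that is simply frequent at a position (large $f_{N_{i}}$) receives a lower score than a rare nucleotide with the same contact ratio.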

RRMScorer takes the information from the interactions observed in all available RRM-RNA complexes in cluster 0 (training set). However, selecting all available interactions for the scoring would bias this approach to the amino acids and bound RNAs for the most studied RRMs, which are overrepresented in our dataset. To generalize the approach as much as possible, we first selected a subset of nucleotide and residue positions that interact with each other in at least 20% of the RRM-RNA complexes having different UniProt identifiers in cluster 0, so limiting the analysis to key binding positions. This subset retrieves all the RRM positions already well known for their importance with respect to RNA binding, such as solvent-exposed RNP1 and RNP2 residues. In total, 30 unique interactions from 20 RRM sequence positions to 5 RNA sequence positions were considered. A data frame was then generated for each of the interactions that incorporates the scores between all the residues and nucleotides in that position, as calculated from Eq 2. To get the final binding score between a target RRM sequence and a target RNA sequence, we take the average of the 30 calculated values. The matrices were rendered and coloured for all the selected interactions (e.g., S4 Fig), and are available from the Bitbucket repository under the supplementary information folder. For simplicity, we only included in the matrix amino acid residues that interact with the RNA. In summary, higher final binding scores indicate a higher overall probability that key amino acids from the target RRM have been observed to be in contact with the nucleotides from the target RNA.
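The averaging step can be sketched as follows, using three toy interactions instead of the 30 selected ones. The data layout, the function name and the score values are hypothetical, and skipping residue-nucleotide pairs without a tabulated score is an assumption rather than documented RRMScorer behaviour.

```python
def binding_score(tables, rrm_residues, rna):
    """Average the per-interaction scores for one RRM / RNA pair.

    tables       : {(rna_position, rrm_position): {(nucleotide, residue): score}}
    rrm_residues : {rrm_position: residue} taken from the aligned target RRM
    rna          : target RNA fragment (string)
    Returns (mean score, per-interaction contributions).
    """
    contributions = {}
    for (rna_pos, rrm_pos), table in tables.items():
        key = (rna[rna_pos], rrm_residues.get(rrm_pos))
        if key in table:  # assumption: pairs without a score are skipped
            contributions[(rna_pos, rrm_pos)] = table[key]
    if not contributions:
        return None, contributions
    return sum(contributions.values()) / len(contributions), contributions


# Toy score tables (hypothetical values and position labels)
tables = {
    (0, "b1-1"): {("A", "R"): 0.8, ("U", "R"): -0.4},
    (1, "b3-3"): {("U", "F"): 1.2},
    (2, "b3-5"): {("G", "F"): 0.1, ("G", "Y"): -0.5},
}
rrm = {"b1-1": "R", "b3-3": "F", "b3-5": "Y"}

total, parts = binding_score(tables, rrm, "AUG")
print(round(total, 3))  # 0.5
print(parts)
```

Returning the individual contributions alongside the mean reflects the interpretability argument made in the Introduction: each term can be traced back to one residue-nucleotide contact.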

Score internal validation

To cross-validate the scores in as unbiased a manner as possible we followed several steps. First, we removed the complex for which we are calculating the score from the training set. With the remaining entries we calculated the residue-nucleotide preferences for the 30 selected interactions, then calculated the final binding score for the entry being scored, given the amino acid sequence of the RRM, and nucleotide sequence of the RNA. This constitutes the single ‘actual binding’ score for that entry. All the 187 entries used to generate the RRMScorer were taken as the training set, as we have experimental proof that the RNAs bind to their respective RRMs.

A randomized test set was generated to validate that RRMScorer can identify true binders. For each of the training entries a randomized entry was generated in which the RRM amino acid sequence is retained but each nucleotide is replaced by a random nucleotide picked from the same position in the RNA alignment, taken from a complex with a different protein (UniProt Id). With this approach we expect to select nucleotides less likely to bind the tested protein, even though it is still possible to pick the same nucleotide as in the original sequence.
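The randomized entries can be sketched as below. The function name, the data layout and the fallback used when no other protein has a nucleotide in a column are assumptions; the UniProt-based exclusion and the per-column sampling follow the description.

```python
import random


def random_decoy(target, entries, rng):
    """Build a randomized RNA for `target` from the aligned RNAs of other proteins.

    entries: {complex_name: (uniprot_id, aligned_rna)}; all RNAs share one length.
    For each non-gap column of the target RNA, a nucleotide is drawn from the
    same column of complexes with a different UniProt Id.
    """
    uniprot, rna = entries[target]
    decoy = []
    for col, nt in enumerate(rna):
        if nt == "-":
            decoy.append("-")
            continue
        pool = [seq[col] for uid, seq in entries.values()
                if uid != uniprot and seq[col] != "-"]
        # Assumption: keep the original nucleotide when no alternative exists.
        decoy.append(rng.choice(pool) if pool else nt)
    return "".join(decoy)


# Toy aligned RNAs (hypothetical identifiers and sequences)
entries = {
    "c1": ("P1", "-UAU"),
    "c2": ("P1", "GUAU"),
    "c3": ("P2", "-GCA"),
    "c4": ("P3", "AGCC"),
}
decoy = random_decoy("c1", entries, random.Random(0))
print(decoy)
```

Complex `c2` is never sampled for `c1` because both belong to protein `P1`, so the decoy for `c1` can only be `-GCA` or `-GCC` in this toy set.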

Validation with RNAcompete data

RNAcompete is an in vitro technique to quickly analyse the RNA binding preference of RNA binding proteins (RBP) [19]. The experimental RNA preferences for all the available RRMs were downloaded from the CISBP-RNA Database [37], resulting in information for 171 different proteins. The RNA preferences are presented as a matrix of frequencies observed for each nucleotide along 7 different positions (8 positions in a few particular cases). We reconstructed all the possible sequence combinations for those 7-mers and computed the associated bits the same way it is done for classic sequence logos [38], capturing the information content of each nucleotide in a particular position.

As RRMScorer can score RNA fragments of up to 5 nucleotides, we kept the 5-mer with the highest average bits value from the RNAcompete 7-mers. Considering that the RNAcompete assay uses an RNA pool comprising ~240,000 short fragments (30–41 nucleotides), which guarantees that any 9-mer is repeated at least 16 times, we can assume that any 5-mer sequence is also present multiple times in the pool. There are therefore 1024 (4⁵) theoretically possible 5-mers, for each of which we can extract the associated bits value for each of the 171 RRM-containing proteins.
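One possible reading of this processing step is sketched below. It assumes the frequency matrix is given as one dictionary per position, computes letter heights as in a classic sequence logo (frequency times information content, without small-sample correction), and keeps for every 5-mer its best average height over all windows of the motif. The function names are hypothetical and the exact procedure used for the article may differ.

```python
import math
from itertools import product

NUCLEOTIDES = "ACGU"

def bits_matrix(pfm):
    """Per-position letter heights in bits, as in a sequence logo.

    pfm is a list of dicts (one per position) mapping nucleotide -> frequency.
    Height = frequency * (2 - Shannon entropy); the small-sample correction
    used by some logo tools is omitted in this sketch.
    """
    heights = []
    for column in pfm:
        entropy = -sum(f * math.log2(f) for f in column.values() if f > 0)
        info = 2.0 - entropy
        heights.append({nt: column.get(nt, 0.0) * info for nt in NUCLEOTIDES})
    return heights

def kmer_bits(pfm, k=5):
    """Best average bits value of every k-mer over all windows of the motif."""
    heights = bits_matrix(pfm)
    best = {}
    for kmer in map("".join, product(NUCLEOTIDES, repeat=k)):
        best[kmer] = max(
            sum(heights[start + i][nt] for i, nt in enumerate(kmer)) / k
            for start in range(len(heights) - k + 1)
        )
    return best
```

For a 7-position matrix this evaluates three windows for each of the 1024 possible 5-mers.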

The 171 proteins in the dataset are divided into 3 categories based on their RBP composition. This must be done to properly align the sequences to the master alignment and analyse the resulting scores for each individual domain. There are 47 proteins that consist of a single RRM, 118 that have multiple RRMs, and the remaining 6 have at least one RRM in conjunction with other RBPs such as KH domains or zinc fingers. The single RRMs are simply aligned to the master alignment (HMM) and the scores are calculated for all the possible RNA 5-mers (1024 sequences). For proteins with more than one RRM, we cut out the sequence of each individual RRM and computed the scores for the RNA 5-mers, keeping the highest score to correlate with the bits data. All the available data for each RRM domain, including the average bits values and the RRMScorer predictions, are available from the supplementary information folder in the Bitbucket repository (S10 Dataset).

Score confidence

The confidence of a score is calculated from the final binding score between the RRM and the RNA in the complex being scored, and how it compares with the scores from the training and randomized sets. The scores for both sets were fitted with a Gaussian kernel density estimator (KDE) using Scikit-learn [39]. We can then compute the likelihood for the final binding score of an RRM-RNA complex to fall in either the training or the randomized region. The ratio between those probabilities is calculated, normalized from 0 to 1 and provided as the confidence score. Low values mean that the RRM-RNA complex does not show very favourable contacts according to our dataset, which places it in the predominantly randomized region. Correspondingly, high values (near 1) mean that the contacts (or absence thereof) were observed very often in our training set, and we can be confident about the prediction.
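The idea behind the confidence value can be sketched as follows. The article states that Scikit-learn was used; to keep the example dependency-free, this sketch swaps in a small hand-written one-dimensional Gaussian KDE. The bandwidth, the normalization p_train / (p_train + p_random) and the function names are assumptions, since the text does not specify them.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a 1D Gaussian kernel density estimate as a function."""
    norm = len(samples) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                   for s in samples) / norm

    return density

def confidence(score, train_scores, random_scores, bandwidth=0.1):
    """Relative likelihood of the training region, scaled to [0, 1].

    Uses p_train / (p_train + p_random); the exact normalization and the
    bandwidth used by RRMScorer are not stated in the text, so both are
    assumptions of this sketch.
    """
    p_train = gaussian_kde(train_scores, bandwidth)(score)
    p_random = gaussian_kde(random_scores, bandwidth)(score)
    total = p_train + p_random
    return p_train / total if total > 0 else 0.5
```

With this normalization, a score deep inside the training distribution approaches 1 and a score deep inside the randomized distribution approaches 0.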

Results

RRM alignment analysis and representation

The RRM master alignment (S5 Dataset) was validated by checking the number of gaps in the main secondary structure elements, and comparing the sequence logos with prior information about the β3-strand and β1-strand, also known as ribonucleoprotein domain 1 (RNP1) and RNP2 [1], respectively. The alignment for the RRM domains from the RRM-RNA complexes (S2 Dataset) was similarly validated, and the interactions previously extracted for all the complexes (S3 Dataset) were mapped onto the alignment to analyse the RNA contacts along the protein structure. The frequency of the contacts and gaps is calculated as a percentage for each of the positions in the alignment of the 271 RRM-RNA complexes (Fig 4). As expected, most of the contacts occur in the β-strands, the 1st, 3rd and 5th loop, and the C-terminal region, which further validates the RRM-RNA alignment. The gaps are concentrated in the middle of the loops because of the alignment curation (see Materials and methods).

Fig 4

Percentage of contacts (blue line) and gaps (grey dashed line) for all the positions of the 271 bound RRM alignment.

The β-strands and α-helices are depicted and labelled in green and red, respectively.

To better identify the different RRM sequence positions from the RRM master alignment, we created a cartoon representation of the RRM showing the most relevant sequence positions in relation to the conserved structural features (Fig 1B). For simplicity, only the positions of the canonical RNA binding interface are depicted by individual spheres. The position labelling follows a grid-based system, e.g., position β1–1 in Fig 1B. With such a representation it is easier to refer to the different RRM positions, which becomes particularly useful when comparing different RRMs or when analysing to which positions the RNAs bind. The light-green and dark-green colours for the β-sheet positions indicate exposed or buried residues, respectively. Positions with a significant number of observed RNA interactions (contacts observed in more than 20% of the proteins with different UniProt identifiers) are highlighted in red. As expected, the exposed positions of the β-strands are the ones most often interacting with RNA. These positions include the well-known conserved aromatic residues in β1–3, β3–3 and β3–5 that usually anchor the RNA by pi stacking (positions labelled in Fig 1B) [1].
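The selection of frequently contacted positions can be sketched as below. The function name and the tuple layout are hypothetical. Each UniProt identifier is counted once per interaction regardless of how many structures it has, and the threshold is applied as "at least 20%", following the wording in Materials and methods.

```python
from collections import defaultdict

def select_key_interactions(contacts, threshold=0.20):
    """Keep interactions seen in at least `threshold` of distinct proteins.

    contacts is an iterable of (uniprot_id, rrm_position, rna_position)
    tuples, one per observed contact in any complex. Each UniProt id counts
    once per interaction, however many structures it has. Proteins without
    any recorded contact are not counted (an assumption of this sketch).
    """
    proteins = set()
    seen_in = defaultdict(set)
    for uniprot, rrm_pos, rna_pos in contacts:
        proteins.add(uniprot)
        seen_in[(rrm_pos, rna_pos)].add(uniprot)
    return sorted(pair for pair, ids in seen_in.items()
                  if len(ids) / len(proteins) >= threshold)
```

Counting distinct UniProt identifiers rather than structures prevents a heavily studied protein from pushing one of its own contacts over the threshold.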

Selecting the RNA binding mode and aligning the RNAs

To connect the RRM alignment with RNA recognition, the RNA sequences were aligned to compute how often a residue in a specific RRM position interacts with a nucleotide in a specific sequence position of the RNA. Considering that the RNA can adopt different conformations upon binding the RRM, we clustered the structures of the RRM-RNA complexes in subgroups with comparable positions of the bound RNAs with respect to the protein.

For the 187 complexes grouped in cluster 0, the RNAs bind to the RRMs in a very similar way, which allows the RNA sequences to be aligned in relation to the RRM. In general, the number of interacting nucleotides in the complexes is variable, but most of the RRMs interact with 2 to 5 different nucleotides [3], even though some RRM domains with non-canonical binding can interact with up to 8 nucleotides [40,41]. Based on this information, we analysed the fraction of gaps for the different positions in the RNA alignment. Gaps must be included to properly align RNAs based on the positions of the RRM they interact with (e.g., when aligning the UAGU and GUAGU RNA motifs, and assuming they bind their respective RRMs in the same way, a gap must be added to the 5’ end of the shorter motif so that it aligns with the longer one). After analysing the aligned complexes, we observed that the 5 positions with the fewest gaps (Fig 5 top) are recognized by the central two β-strands (β1 and β3, Fig 1) of the RRM. We defined this region as the core of the RNA alignment, and focused on it for further analysis. The nucleotide conservation for these positions was also determined (Fig 5 bottom). Notably, although we are restricted to the canonical binding mode, a wide range of RNA sequences is covered. The only position where less sequence variability was captured is the 1st position of the RNA alignment. This is mostly due to a higher percentage of gaps and to a bias towards two RRM proteins, SNRPA (UniProt Id. P09012) and U2AF (UniProt Id. P26368), with 69 and 23 entries in cluster 0 respectively, which often bind a uracil in this position.
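The gap bookkeeping for the UAGU and GUAGU example can be written out as a short sketch. The function name is hypothetical and the two-sequence alignment is only a toy case.

```python
def gap_fractions(aligned_rnas, gap="-"):
    """Fraction of gaps in each column of equal-length aligned RNAs."""
    n = len(aligned_rnas)
    return [sum(rna[col] == gap for rna in aligned_rnas) / n
            for col in range(len(aligned_rnas[0]))]

# Toy alignment: UAGU is padded at its 5' end so that it lines up with GUAGU,
# assuming both motifs contact the same RRM positions.
aligned = ["GUAGU", "-UAGU"]
fractions = gap_fractions(aligned)
```

Here only the first column contains a gap, which mirrors the observation that the first position of the RNA alignment has the highest gap percentage.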

Fig 5

Gap fraction for the 5 core RNA positions in the alignment used for the scoring (top) and nucleotide conservation (bottom). Sequence logo generated with WebLogo [38].

RRM-RNA scoring

The RRM-RNA scoring method we developed, RRMScorer, predicts how likely it is for a given RNA sequence to bind a target RRM sequence. It is specifically developed for the canonical binding mode (captured by the RRM-RNA complexes in cluster 0) and it uses both the RRM and RNA alignments to generate a score to estimate RNA binding. The score is individually calculated for the 30 selected RRM-RNA contacts and then averaged. Positive contact scores indicate that a specific amino acid-nucleotide interaction is likely to be encountered, while negative scores have the opposite meaning. Scores close to 0 mean that there is no clear preference, e.g., the conserved aromatic residues in RNP2 (RRM position β1–3) and RNP1 (RRM positions β3–3 and β3–5) are barely specific for any nucleotide (S4 Fig), which is consistent with the fact that pi stacking interactions are not nucleotide-specific.

Note that the current analysis of RRMScorer is restricted to RRMs in cluster 0 and assumes that the RNA binding mode does not change. Significant protein sequence variations might change the RNA binding mode and will require further analysis. Moreover, due to the limited size of the training set, not all possible amino acid-nucleotide contacts are sampled, and thus the scoring of RRM-RNA complexes with interactions that have not been observed before is less reliable. The unbiased number of observed contacts in the training set that is used to calculate the scores is also shown in the preference matrices (e.g., S4 Fig), below each of the scores. This value is the sum of the contact conservation normalised per UniProt entry, e.g., if a specific contact is observed in 8 out of the 10 available structures for the same UniProt entry, it contributes 0.8 to the unbiased number of observed contacts. Following this procedure, we avoid the bias towards overrepresented protein structures in the dataset. When this value is absent, no such residue-nucleotide contact was observed in our training set for those positions. For simplicity, residues that do not contact any nucleotide in a specific position are not displayed.
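The per-UniProt normalisation described above amounts to the following sketch, where the function name and input layout are hypothetical. Each UniProt entry contributes the fraction of its structures in which the contact is observed.

```python
from collections import defaultdict

def unbiased_contact_count(structures):
    """Sum, over UniProt entries, the fraction of structures with the contact.

    structures is a list of (uniprot_id, has_contact) pairs, one per
    structure, for a single residue-nucleotide pair at one interaction.
    """
    total = defaultdict(int)
    with_contact = defaultdict(int)
    for uniprot, has_contact in structures:
        total[uniprot] += 1
        with_contact[uniprot] += bool(has_contact)
    return sum(with_contact[u] / total[u] for u in total)

# A protein with the contact in 8 of 10 structures contributes 0.8;
# a protein with a single structure showing the contact contributes 1.0.
toy = [("P09012", True)] * 8 + [("P09012", False)] * 2 + [("Q61474", True)]
count = unbiased_contact_count(toy)
```

In this toy case ten structures of one protein and one structure of another together count as 1.8 observations rather than 9.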

Scoring validation

To validate RRMScorer, its performance was assessed on different independent experimental datasets to certify that the method is capable of predicting the binding capabilities of the RRMs.

Internal validation

We performed an internal validation by computing the scores for the entries in cluster 0, referred to as the training set, and comparing them with the scores from a set of randomly generated RNA sequences, referred to as the randomized set. The scores were calculated as explained in the score internal validation section of Materials and methods, removing each entry from the dataset before computing the matrices used to score it. Following this procedure, we ensure that we are not biasing the scoring.

The distributions of the scores for the training and randomized entries (187 entries in each set) are compared and plotted (Fig 6). A separation between the sets is observed, confirming that the scores can discriminate, with a certain confidence, between the RNA sequences that would bind to an RRM and the ones that would likely not bind. The internal validation results with the training and randomized scores are available from the Bitbucket repository under the supplementary information folder (S9 Dataset).

Fig 6

Score distribution for the training set (scores from experimentally solved RRM-RNA complexes) and randomized set (randomly generated RNAs).

Validation with RNAcompete data

Using the RNAcompete data we were able to evaluate the RRMScorer method with a large set of experimental RNA binding preferences. The data were processed to translate the RNA binding preferences into bits values that we can then compare with our predictions (S10 Dataset). RNA fragments with a higher average bits value should be better binders for that particular RRM, and consequently should correlate with higher scores from our method.

The correlations between the scores and bits values are shown for the three different categories (Fig 7). While we can make a one-to-one connection between the nucleotide preferences and the single RRMs, in the case of multiple RRMs, and especially when other RBPs are present, the observed preferences might be a combination of the different specificities of each individual domain. In general, the highest-affinity domain interaction will dominate the results, and domain-specific contributions are usually not known. This is reflected in the different correlations of the medians observed for the single RRM, multiple RRM and multiple RBP categories, with respective Pearson correlation coefficients of 0.78, 0.30 and 0.07. The clear correlation between the single RRM bits values and the RRMScorer predictions further validates our method, making it particularly useful for genome-scale studies and large-scale screenings of RNA candidates.

Fig 7

Correlation between the bits values derived from the RNAcompete data and the scores obtained with RRMScorer.

For a clearer depiction of the single RRM (orange), multiple RRM (yellow) and multiple RBP (green) categories, each box corresponds to the distribution of the bits values for a range of 0.1 in the score axis.
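A correlation of binned medians of this kind can be sketched as follows, assuming that the bits values are grouped into score bins of width 0.1 and that the Pearson coefficient is computed between the bin positions and the bin medians. The function names are hypothetical and the exact binning used for Fig 7 is not specified in the text.

```python
import math
from collections import defaultdict
from statistics import median

def binned_medians(scores, bits, width=0.1):
    """Median bits value for each score bin of the given width.

    Scores that sit exactly on a bin edge may fall in the neighbouring bin
    because of floating-point rounding.
    """
    bins = defaultdict(list)
    for score, bit in zip(scores, bits):
        bins[math.floor(score / width)].append(bit)
    # Report each bin by its lower edge.
    return [(index * width, median(values))
            for index, values in sorted(bins.items())]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

Correlating medians rather than individual points reduces the influence of the very uneven number of 5-mers per score bin.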

Validation with Musashi-1

Musashi-1 (MSI1) is a protein containing two tandem RRM domains that is involved in post-transcriptional regulation processes, controlling target mRNA transit and translation [42]. It is expressed in several species, and in humans its malfunction is often associated with cancer development [43]. As a relevant RRM use case, we compared the experimental information available about the RNA binding affinity with the predictions from RRMScorer. Zearfoss et al. analysed several RNA mutants to define the Musashi RNA binding specificity [12]. They mutated all the positions of a 12-nucleotide long RNA containing a motif previously identified by SELEX [13], and determined that MSI1 RRM1 (UniProt Id. Q61474) specifically recognizes a core motif of 3 nucleotides, UAG. If this core region is mutated, the binding affinity drops, while mutating other positions of the RNA does not have a substantial impact on the binding.

The authors determined the binding affinities between the RNA mutants and MSI1 RRM1 using fluorescence polarization assays. The mouse variant was used in the experiments, which is identical to the human RRM1 variant. We then used RRMScorer, but removed the MSI1 RRM1 entry from the training set to make the test case as agnostic as possible. For this validation we used a window size of 3 nucleotides to calculate the scores, as the authors report that this is the length that MSI1 specifically recognizes. We correlated the experimental results with the 3-mer fragment yielding the highest score for each of the 36 RNA mutants (Fig 8). A table with the scores and affinity data is available from the supplementary information file (S3 Table).

Fig 8

Experimental Kd values-scores correlation for the 36 RNA mutants tested against MSI1 RRM1, using a window size of three nucleotides as it is the length that this RRM specifically recognizes [12].

The correlation shows that RRMScorer successfully distinguishes the high-affinity RNAs containing the UAG core motif (bottom right corner), which are in the low nM range (40–200 nM), from the ones without the conserved three-nucleotide motif, whose affinities drop to the μM range (1000–2500 nM). The guanine in the UAG motif is the most relevant nucleotide with respect to the binding affinity: when it is mutated to any other nucleotide, the resulting interaction is on average 37-fold weaker than with the WT RNA (S3 Table). Using the PDB entry 2rs2 for MSI1 RRM1 as a reference, two residues are involved in the specific recognition of this guanine, K21 and F65, which interact with it through their respective sidechains (Fig 9A). The corresponding positions in the RRM alignment for K21 and F65 are β1–1 and β3–3, respectively (Fig 1B and S6 Dataset), and the guanine from the UAG motif corresponds to the RNA position 4 in the RNA alignment (Fig 5 and S8 Dataset). In agreement with the experimental observations, a lysine in position β1–1 only shows a positive score for guanine (score of 0.48), and negative scores when binding any other nucleotide (-0.84, -1.64 and -0.58 for adenine, cytosine and uracil, respectively, Fig 9B). On the other hand, a phenylalanine in position β3–3 does not show a preference for any nucleotide (Fig 9C). This was expected, as pi stacking interactions are not nucleotide-specific.

Fig 9

A) Cartoon representation of the MSI1 RRM1 protein in complex with GUAGU (G1 not shown, PDB Id. 2rs2). The protein backbone is shown in light blue and heavy atoms are shown in red (O atoms), blue (N atoms), orange (P atoms), light blue (C atoms of RRM) and sea blue (C atoms of RNA). The RNA nucleotides and the residues involved in the specific recognition of G4 (K21 and F65) are shown as sticks and spheres. B) and C) Score matrices for the alignment positions matching the interactions between K21 and F65 with G4, respectively (highlighted in bright green).

By computing the scores for all sliding windows in the RNA to find the highest possible score, we also predict which part of the RNA is more likely to bind the RRM. This is particularly useful to spot likely binding regions in long RNAs, together with the confidence of the prediction (Fig 10). The 5-nucleotide window with the highest score is clearly separated from the other RNA fragments and predicted with high confidence, and it agrees with the SELEX consensus defined for this protein, (G/A)U₁₋₃AGU [12,13].
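The sliding-window search can be sketched as below. The scoring function is passed in as a parameter and stands in for the RRMScorer scoring of one fragment against a fixed RRM; `toy_score` is a made-up placeholder that only rewards positional matches to GUAGU and is not the RRMScorer score. The example RNA is invented.

```python
def best_window(rna, score_fn, window=5):
    """Score every window of the RNA and return (start, fragment, score)
    for the highest-scoring one. score_fn stands in for the RRMScorer
    scoring of one aligned fragment against a fixed RRM. The RNA must be
    at least `window` nucleotides long."""
    windows = [(start, rna[start:start + window])
               for start in range(len(rna) - window + 1)]
    return max(((start, frag, score_fn(frag)) for start, frag in windows),
               key=lambda item: item[2])

def toy_score(fragment):
    """Hypothetical scorer: rewards positional matches to GUAGU."""
    matches = sum(a == b for a, b in zip(fragment, "GUAGU"))
    return matches / len(fragment) - 1.0

# Invented 12-nucleotide RNA containing one GUAGU fragment.
start, fragment, score = best_window("ACGCGUAGUAGC", toy_score)
```

The returned start index locates the predicted binding region within the longer RNA.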

Fig 10

Scores for the 5-nucleotide sliding windows of the WT RNA tested by N. Ruth Zearfoss et al.

The bars are coloured according to the score confidence, which ranges from 0 (no confidence) to 1 (maximum confidence). To ease the interpretation of the results, we add 0.89 to each of the scores obtained, as this is the value that best separates the training and randomized regions according to the receiver operating characteristic (ROC) curve. Therefore, positive scores correspond to likely binders while negative scores correspond to RNA fragments less likely to bind the RRM.

Validation with SRSF1

The complex of the human prototypical SR protein SRSF1 RRM1 with RNA was not available when we generated the master alignment. Therefore, it was not used to develop RRMScorer and it can be used as an external validation test case. The structure was recently released as part of a work by A. Cléry et al. [24] where the authors identified that this RRM has a strong preference to bind cytosines (PDB Id. 6hpj). They also engineered a variant to gain the ability to bind uridines, mutating a single residue in the β4 strand from glutamate to asparagine (E87N).

We wanted to assess whether our method can separate the RNAs that bind from the ones that do not bind, and whether it can capture the difference in binding of the engineered variant. The closest RRM sequence in cluster 0 has 51% sequence identity (SRSF3 RRM1), but a visual inspection of the complex shows a similar binding mode, which is crucial for the reliability of RRMScorer. The SRSF1 sequence is then aligned with the rest of the entries in the cluster so we can compute the scores for this entry and the mutated variant. For this step, only the HMM was used, not any structural information, which our approach does not require.

The authors tested the binding of WT SRSF1 RRM1 to poly-A, poly-G, poly-U and poly-C by NMR, with only the latter binding the protein, as validated through chemical shift perturbations in ¹H-¹⁵N HSQC spectra. Our scores show the same trend: poly-A, poly-G and poly-U yield scores of -0.95, -1.00 and -0.91 respectively (Table 1), which places them in the predominantly randomized region (Fig 6), while poly-C gets a score of -0.59, which clearly falls in the training set region or likely-binder region. Notably, the engineered variant E87N increases the poly-U score from -0.91 to -0.82 (Table 1), which is a considerable change given that only one of the 20 RRM positions used for the scoring is mutated. This score now falls in the likely-binder region (Fig 6) and again agrees with the NMR assays, in which the authors determined that the RRM1 mutant can bind uridines. The corresponding alignment position for E87 is β4–3 (Fig 1A) and it interacts with a cytosine in position 4 based on our RNA alignment for cluster 0 (Fig 5). A close look at this interaction (Fig 11A) shows how the carboxyl group of the glutamate sidechain interacts with the amino group of the cytosine, while not being able to recognize a uracil in this position, as stated by the authors [24]. The single point mutation E87N that enables this RRM to bind a uracil is in agreement with our scores (Fig 11B): an asparagine is the only residue showing a positive score for uracil among the residues observed in our training dataset.

Fig 11

A) Cartoon representation of the SRSF1 RRM1 protein in complex with AACAAA (PDB Id. 6hpj). The protein backbone is shown in pale green and heavy atoms are shown in red (O atoms), blue (N atoms), orange (P atoms), pale green (C atoms of RRM) and pale orange (C atoms of RNA). The RNA nucleotides and the residue involved in the specific recognition of C3 (E87) are shown as sticks and spheres. The interaction between E87 and C3 is shown as a yellow dashed line. B) Score matrix for the respective alignment position of the interaction between E87 and C3. The change in score for the E87N mutation performed by Cléry et al. is highlighted in bright green.

Table 1

Scores for the 4 tested RNAs by A. Cléry et al. coloured in green for the RNAs that bind the target on their NMR assays and in red for the RNAs that do not bind. The symbols reflect the score change after the E87N mutation.

RNA	SRSF1 (WT)	SRSF1 (E87N)
PolyA	-0.95	-1.02 ↓
PolyG	-1.00	-0.98 ≈
PolyU	-0.91	-0.82 ↑
PolyC	-0.59	-0.69 ↓
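As a worked illustration, the 0.89 shift described in the Fig 10 caption can be applied to the raw scores in Table 1. Treating a positive shifted score as "likely binder" for these values is an assumption of this sketch rather than a step described for Table 1, and the names `TABLE_1` and `likely_binders` are hypothetical.

```python
OFFSET = 0.89  # shift quoted in the Fig 10 caption

# Raw scores copied from Table 1.
TABLE_1 = {
    "PolyA": {"WT": -0.95, "E87N": -1.02},
    "PolyG": {"WT": -1.00, "E87N": -0.98},
    "PolyU": {"WT": -0.91, "E87N": -0.82},
    "PolyC": {"WT": -0.59, "E87N": -0.69},
}

def likely_binders(table, variant, offset=OFFSET):
    """RNAs whose shifted score is positive for the given protein variant."""
    return [rna for rna, scores in table.items()
            if scores[variant] + offset > 0]

wt_binders = likely_binders(TABLE_1, "WT")
mutant_binders = likely_binders(TABLE_1, "E87N")
```

Under this reading only poly-C is classified as a likely binder for the WT protein, while poly-U joins poly-C for the E87N variant, in line with the NMR observations summarised above.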

Discussion

We developed RRMScorer, a novel scoring method to estimate RRM-RNA binding from sequence information only. RRMScorer provides scores for the probability that a given RNA sequence binds to an RRM protein and was validated on both computational and experimental data. RRMScorer does not predict precise binding affinities, but rather provides relative scores to compare different RNA sequences in relation to a particular RRM, or to compare the effect of different amino acid mutations in a particular RRM. We also focus solely on the canonical binding mode; scores predicted for RRM proteins not belonging to the analysed cluster are therefore less reliable. Notably, our method proved successful with SRSF1 RRM1, which falls within the canonical binding mode [24] but was not included in our original analysis. RRMScorer is also consistent with the large-scale RNAcompete experimental data, especially for single RRMs, where there is a one-to-one connection between the RRM and the obtained RNA preferences; this does not hold for proteins with multiple RRMs, where several RRMs contribute to the RNA preferences. The performance of our method makes it useful for genome-scale studies that aim to find good RNA candidates for a specific RRM. This could then be coupled with state-of-the-art methods to predict the structure of the RNA-RRM complex, such as RoseTTAFoldNA [30].

The generation of the RRM master alignment is one of the key and non-trivial steps of this work. While the RRM fold is highly conserved, its sequence has experienced extensive changes across evolution to modify RRM sensitivity and specificity towards different RNA molecules. This has led to the broad range of functions that this protein motif performs. Considering the low sequence identity within the set, alignment methods purely based on sequence were not successful. Purely structure-based methods did not produce the expected results either, for example due to the different length of some of the beta-strands many gaps were included within some of them. The combination of sequence and structure is essential, with in our case PROMALS3D [34] generating the best alignment from the tools we tried. In addition, the rearrangement of the loop residues in the alignment with respect to the secondary structure elements was essential to better capture the RRM amino acid preferences in relation to RNA binding.

Our alignment captures as much sequence variability as possible, while avoiding bias towards overrepresented families and keeping a high-quality alignment. This allows us to extract reliable preferences between the protein residues and the nucleotides. We decided to build our scoring method on statistics derived from the master alignment and available structures to get a better grip on the data. Other methods based on machine learning would have provided relevant insights as well, but their interpretation would be less straightforward, and likely less generic. The limited number of available RRM-RNA complexes is also an impediment for most machine learning algorithms. With our statistical approach we can score any RRM-RNA complex while easily tracking the individual scores of each residue-nucleotide pair. This information can be useful to rationally design new RRMs.

Still, limited data availability is the main reason why our method is restricted to the canonical binding mode. We generated clusters for other binding modes but the number of complexes available was very limited, which makes it difficult to define well-grounded binding preferences. Finding validation sets was also challenging, because large changes in either the protein or the RNA sequence can completely change the RNA binding mode, thereby invalidating the use of our method. From the protein structure side, data availability is no longer a limitation after the AlphaFold Protein Structure Database release [44]. Even though it does not solve the RNA recognition problem [45], current challenges purely based on protein structure, such as assessing the preferred RNA binding mode of an RRM, might be solved soon, although the current inability of such methods to cover dynamics and multiple conformations remains a bottleneck to be solved. The more generic approach we present here might therefore be more applicable at this point in time; by ‘averaging’ the limited available information we reduce complexity but enable interpretability. The amino acid representations extracted by unsupervised language models [46] might in this context also be able to provide generalisations of amino acid behaviour that are applicable for improved prediction of RRM-RNA binding.

Supporting information

S1 Fig

Schematic procedure for the similarity score calculation between two RNAs based on their binding with the RRM.

The two RNAs are aligned in all possible combinations using a sliding window, with the number of matches and unique positions each nucleotide interacts with in the RRM sequence counted for the 4 positions aligned in this example. The ratios between the matches and unique positions are added and then averaged for all the positions by dividing them by the length of the alignment.

(PDF)

S2 Fig

Variation on the number of clusters (A) and number of entries in the biggest cluster (B) depending on the chosen cutoffs for similarity score (X-axis, scores from 0 to 1) and percentage of entries (Y-axis from 0% to 100%) that should have an equal or higher similarity score with the rest of the cluster. The chosen cutoff for the cluster generation is highlighted in yellow.

(PDF)

S3 Fig

Superimposed RRM-RNA complexes with the lowest similarity score in cluster 0, PDB Id. 6g90 (chain B, green) and PDB Id. 3nnh (chain B, orange).

The aligned nucleotides used for the RMSD calculations are labelled.

(PDF)

S4 Fig

Scores for the conserved aromatic positions in RNP2 (β1–3) and RNP1 (β3–3, β3–5) in contact with their respective RNA positions (phenylalanine and tyrosine scores are highlighted in bright green).

(PDF)

S1 Table

PFAM identifiers and related metadata of the selected RRM families for the analysis.

(PDF)

S2 Table

Identifiers of the 19 selected structures to use in PROMALS3D.

(PDF)

S3 Table

Experimental Kd values correlated with the RRMScorer scores for the WT RNA and the 36 RNA mutants tested against MSI1 RRM1, using a window size of three nucleotides.

(PDF)

S1 File

S1–S4 Equations: Basis for the development of the RRMScorer equation.

Our method is based on the information difference between the occurrence of two events, in our case the information of how often a specific nucleotide interacts with a specific residue, $I(N_{i}; R_{j})$, and the information when that same nucleotide interacts with any other residue, $I(n - N_{i}; R_{j})$ (S1 Equation). The individual terms are developed in S2 and S3 Equations where: $f_{N_{i}, R_{j}}$ is the number of occurrences for a specific contact between nucleotide i and residue j; $f_{R_{j}}$ the number of times residue j is in that position; $f_{N_{i}}$ the number of times nucleotide i is in that position; $f_{n - N_{i}, R_{j}}$ the number of times nucleotide i interacts with any residue but residue j; $f_{n - N_{i}}$ the number of nucleotides other than nucleotide i in that position; R the total number of residues in the dataset. As it is a difference between the two logarithms, the common terms that account for the number of specific residues in position j ($f_{R_{j}}$) and the total number of residues in the dataset (R) disappear from the equation. After the simplification we obtain S4 Equation (the same as Eq 2 in the main text).

(PDF)

S1 Dataset

Complete list of the RRM structures retrieved from the PDB, with the RRM sequence and the sequence range for both PDB and UniProt.

(FASTA)

S2 Dataset

List including the identifiers for the RRM-RNA complexes interacting with three or more nucleotides.

The identifier name is organized as follows: <UniProt ID>_<RRM number>_<PDB ID>_<protein chain>_<PDB numbering>_<UniProt numbering>_<Internal numbering for mapping>_<RNA chain>.

(TXT)

S3 Dataset

Contacts list for all the RRM-RNA complexes.

Each line corresponds to a specific contact and it is organized as follows: <PDB ID>_<Protein chain>, <PDB residue number>_<Residue one letter code>_<Nucleotide one letter code>_<PDB nucleotide number>_<RNA chain>.

(TXT)

S4 Dataset

Fasta file with the reduced RRM dataset after applying a 99% sequence identity threshold.

(FASTA)

S5 Dataset

RRM sequence alignment in fasta format for the 347 RRM selected sequences. Generated using PROMALS3D.

(FASTA)

S6 Dataset

RRM sequence alignment in FASTA format for the 271 RRM-RNA complexes.

Generated via HMM from the master alignment.

(FASTA)

S7 Dataset

RNA binding similarity matrix for the 271 RRM-RNA complexes.

Values close to 1 indicate similar binding modes, while values close to 0 indicate different binding modes. A score of 0 is given when an RRM-RNA complex is compared with itself. The matrix is available in CSV format.

(CSV)

S8 Dataset

FASTA file with the RNA alignment for the 187 RRM-RNA complexes included in cluster 0.

(FASTA)

S9 Dataset

CSV file with the internal validation results for the 187 RRM complexes in cluster 0.

(CSV)

S10 Dataset

JSON file with the protein identifiers, RRM domain, bits value from RNAcompete data and RRMScorer predictions.

(JSON)

Acknowledgments

We thank Dr David Bickel for his help on protein structure visualization and Adrián Diaz for the IT support.

Funding Statement

J.R-M., H.D., M.S., and W.V. were supported by the Marie Skłodowska-Curie Innovative Training Network (MSCA-ITN) RNAct, funded by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 813239 (https://ec.europa.eu/info/research-and-innovation/funding/funding-opportunities/funding-programmes-and-open-calls/horizon-2020_en). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Data Availability

All the datasets and code required to run RRMScorer are publicly available at https://bitbucket.org/bio2byte/rrmscorer/src/master/.

References

1. Cléry A, Blatter M, Allain FH-T. RNA recognition motifs: boring? Not quite. Current Opinion in Structural Biology. 2008;18: 290–298. doi:10.1016/j.sbi.2008.04.002

2. Maris C, Dominguez C. The RNA recognition motif, a plastic RNA-binding platform to regulate post-transcriptional gene expression. FEBS Journal. 2005; 14. doi:10.1111/j.1742-4658.2005.04653.x

3. Tsai YS, Gomez SM, Wang Z. Prevalent RNA recognition motif duplication in the human genome. RNA. 2014;20: 702–712. doi:10.1261/rna.044081.113

4. Soubise B, Jiang Y, Douet-Guilbert N, Troadec M-B. RBM22, a Key Player of Pre-mRNA Splicing and Gene Expression Regulation, Is Altered in Cancer. Cancers (Basel). 2022;14: 643. doi:10.3390/cancers14030643

5. Ding J, Hayashi MK, Zhang Y, Manche L, Krainer AR, Xu R-M. Crystal structure of the two-RRM domain of hnRNP A1 (UP1) complexed with single-stranded telomeric DNA. Genes Dev. 1999;13: 1102–1115. doi:10.1101/gad.13.9.1102

6. Skowyra ML, Rapoport TA. Mechanism of PEX5-mediated protein import into peroxisomes. bioRxiv; 2022. p. 2022.05.31.494222. doi:10.1101/2022.05.31.494222

7. Sagae T, Yokogawa M, Sawazaki R, Ishii Y, Hosoda N, Hoshino S-I, et al. Paip2A inhibits translation by competitively binding to the RNA recognition motifs of PABPC1 and promoting its dissociation from the poly(A) tail. J Biol Chem. 2022;298: 101844. doi:10.1016/j.jbc.2022.101844

8. Hennig J, Militti C, Popowicz GM, Wang I, Sonntag M, Geerlof A, et al. Structural basis for the assembly of the Sxl–Unr translation regulatory complex. Nature. 2014;515: 287–290. doi:10.1038/nature13693

9. Voith von Voithenberg L, Sánchez-Rico C, Kang H-S, Madl T, Zanier K, Barth A, et al. Recognition of the 3’ splice site RNA by the U2AF heterodimer involves a dynamic population shift. Proc Natl Acad Sci U S A. 2016;113: E7169–E7175. doi:10.1073/pnas.1605873113

10. Corsini L, Bonnal S, Bonna S, Basquin J, Hothorn M, Scheffzek K, et al. U2AF-homology motif interactions are required for alternative splicing regulation by SPF45. Nat Struct Mol Biol. 2007;14: 620–629. doi:10.1038/nsmb1260

11. Kielkopf CL, Lücke S, Green MR. U2AF homology motifs: protein recognition in the RRM world. Genes Dev. 2004;18: 1513–1526. doi:10.1101/gad.1206204

12. Zearfoss NR, Deveau LM, Clingman CC, Schmidt E, Johnson ES, Massi F, et al. A Conserved Three-nucleotide Core Motif Defines Musashi RNA Binding Specificity. J Biol Chem. 2014;289: 35530–35541. doi:10.1074/jbc.M114.597112

13. Imai T, Tokunaga A, Yoshida T, Hashimoto M, Mikoshiba K, Weinmaster G, et al. The neural RNA-binding protein Musashi1 translationally regulates mammalian numb gene expression by interacting with its mRNA. Mol Cell Biol. 2001;21: 3888–3900. doi:10.1128/MCB.21.12.3888-3900.2001

14. Loerch S, Kielkopf CL. Dividing and Conquering the Family of RNA Recognition Motifs: A Representative Case Based on hnRNP L. J Mol Biol. 2015;427: 2997–3000. doi:10.1016/j.jmb.2015.06.009

15. Tacke R, Manley JL. The human splicing factors ASF/SF2 and SC35 possess distinct, functionally significant RNA binding specificities. EMBO J. 1995;14: 3540–3551. doi:10.1002/j.1460-2075.1995.tb07360.x

16. Miyazaki S, Sato Y, Asano T, Nagamura Y, Nonomura K-I. Rice MEL2, the RNA recognition motif (RRM) protein, binds in vitro to meiosis-expressed genes containing U-rich RNA consensus sequences in the 3′-UTR. Plant Mol Biol. 2015;89: 293–307. doi:10.1007/s11103-015-0369-z

17. Liu Y, Liu J, Wang Z, He JJ. Tip110 binding to U6 small nuclear RNA and its participation in pre-mRNA splicing. Cell Biosci. 2015;5: 40. doi:10.1186/s13578-015-0032-z

18. Kuang S, Wang L. Identification and analysis of consensus RNA motifs binding to the genome regulator CTCF. NAR Genom Bioinform. 2020;2: lqaa031. doi:10.1093/nargab/lqaa031

19. Ray D, Kazan H, Chan ET, Peña Castillo L, Chaudhry S, Talukder S, et al. Rapid and systematic analysis of the RNA recognition specificities of RNA-binding proteins. Nat Biotechnol. 2009;27: 667–670. doi:10.1038/nbt.1550

20. Auweter SD, Oberstrass FC, Allain FH-T. Sequence-specific binding of single-stranded RNA: is there a code for recognition? Nucleic Acids Res. 2006;34: 4943–4959. doi:10.1093/nar/gkl620

21. Alipanahi B, Delong A, Weirauch MT, Frey BJ. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol. 2015;33: 831–838. doi:10.1038/nbt.3300

22. Pan X, Rijnbeek P, Yan J, Shen H-B. Prediction of RNA-protein sequence and structure binding preferences using deep convolutional and recurrent neural networks. BMC Genomics. 2018;19: 511. doi:10.1186/s12864-018-4889-1

23. Wei J, Chen S, Zong L, Gao X, Li Y. Protein–RNA interaction prediction with deep learning: structure matters. Briefings in Bioinformatics. 2022;23: bbab540. doi:10.1093/bib/bbab540

24. Cléry A, Krepl M, Nguyen CKX, Moursy A, Jorjani H, Katsantoni M, et al. Structure of SRSF1 RRM1 bound to RNA reveals an unexpected bimodal mode of interaction and explains its involvement in SMN1 exon7 splicing. Nat Commun. 2021;12: 428. doi:10.1038/s41467-020-20481-w

25. Chen H-J, Topp SD, Hui HS, Zacco E, Katarya M, McLoughlin C, et al. RRM adjacent TARDBP mutations disrupt RNA binding and enhance TDP-43 proteinopathy. Brain. 2019;142: 3753–3770. doi:10.1093/brain/awz313

26. Poole AM, Ranganathan R. Knowledge-based potentials in protein design. Current Opinion in Structural Biology. 2006;16: 508–513. doi:10.1016/j.sbi.2006.06.013

27. Donald JE, Chen WW, Shakhnovich EI. Energetics of protein–DNA interactions. Nucleic Acids Research. 2007;35: 1039–1047. doi:10.1093/nar/gkl1103

28. Tuszynska I, Bujnicki JM. DARS-RNP and QUASI-RNP: New statistical potentials for protein-RNA docking. BMC Bioinformatics. 2011;12: 348. doi:10.1186/1471-2105-12-348

29. Fornes O, Garcia-Garcia J, Bonet J, Oliva B. Chapter Four - On the Use of Knowledge-Based Potentials for the Evaluation of Models of Protein–Protein, Protein–DNA, and Protein–RNA Interactions. In: Donev R, editor. Advances in Protein Chemistry and Structural Biology. Academic Press; 2014. pp. 77–120. doi:10.1016/B978-0-12-800168-4.00004-4

30. Baek M, McHugh R, Anishchenko I, Baker D, DiMaio F. Accurate prediction of nucleic acid and protein-nucleic acid complexes using RoseTTAFoldNA. bioRxiv; 2022. p. 2022.09.09.507333. doi:10.1101/2022.09.09.507333

31. Mistry J, Chuguransky S, Williams L, Qureshi M, Salazar GA, Sonnhammer ELL, et al. Pfam: The protein families database in 2021. Nucleic Acids Research. 2021;49: D412–D419. doi:10.1093/nar/gkaa913

32. Corley M, Burns MC, Yeo GW. How RNA-Binding Proteins Interact with RNA: Molecules and Mechanisms. Mol Cell. 2020;78: 9–29. doi:10.1016/j.molcel.2020.03.011

33. Huang Y, Niu B, Gao Y, Fu L, Li W. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics. 2010;26: 680–682. doi:10.1093/bioinformatics/btq003

34. Pei J, Kim B-H, Grishin NV. PROMALS3D: a tool for multiple protein sequence and structure alignments. Nucleic Acids Res. 2008;36: 2295–2300. doi:10.1093/nar/gkn072

35. Kouza M, Faraggi E, Kolinski A, Kloczkowski A. The GOR Method of Protein Secondary Structure Prediction and Its Application as a Protein Aggregation Prediction Tool. Methods Mol Biol. 2017;1484: 7–24. doi:10.1007/978-1-4939-6406-2_2

36. Garnier J, Gibrat JF, Robson B. GOR method for predicting protein secondary structure from amino acid sequence. Methods Enzymol. 1996;266: 540–553. doi:10.1016/s0076-6879(96)66034-0

37. Ray D, Kazan H, Cook KB, Weirauch MT, Najafabadi HS, Li X, et al. A compendium of RNA-binding motifs for decoding gene regulation. Nature. 2013;499: 172–177. doi:10.1038/nature12311

38. Crooks GE, Hon G, Chandonia J-M, Brenner SE. WebLogo: a sequence logo generator. Genome Res. 2004;14: 1188–1190. doi:10.1101/gr.849004

39. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. JMLR.: 6. doi:10.48550/arXiv.1201.0490

40. Lunde BM, Moore C, Varani G. RNA-binding proteins: modular design for efficient function. Nat Rev Mol Cell Biol. 2007;8: 479–490. doi:10.1038/nrm2178

41. Price SR, Evans PR, Nagai K. Crystal structure of the spliceosomal U2B"-U2A’ protein complex bound to a fragment of U2 small nuclear RNA. Nature. 1998;394: 645–650. doi:10.1038/29234

42. Bley N, Hmedat A, Müller S, Rolnik R, Rausch A, Lederer M, et al. Musashi-1-A Stemness RBP for Cancer Therapy? Biology (Basel). 2021;10: 407. doi:10.3390/biology10050407

43. Glazer RI, Vo DT, Penalva LOF. Musashi1: an RBP with versatile functions in normal and cancer stem cells. Front Biosci (Landmark Ed). 2012;17: 54–64. doi:10.2741/3915

44. Varadi M, Anyango S, Deshpande M, Nair S, Natassia C, Yordanova G, et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Research. 2022;50: D439–D444. doi:10.1093/nar/gkab1061

45. Akdel M, Pires DEV, Pardo EP, Jänes J, Zalevsky AO, Mészáros B, et al. A structural biology community assessment of AlphaFold 2 applications. bioRxiv; 2021. p. 2021.09.26.461876. doi:10.1101/2021.09.26.461876

46. Rao R, Bhattacharya N, Thomas N, Duan Y, Chen X, Canny J, et al. Evaluating Protein Transfer Learning with TAPE. bioRxiv; 2019. p. 676825. doi:10.1101/676825

Published online 2023 Jan 23. 10.1371/journal.pcbi.1010859.r001

Decision Letter 0

Shi-Jie Chen, Academic Editor and Nir Ben-Tal, Section Editor

24 Sep 2022

Dear Prof. Dr. Vranken,

Thank you very much for submitting your manuscript "Deciphering the RRM-RNA recognition code: A computational analysis" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Shi-Jie Chen

Academic Editor

PLOS Computational Biology

Nir Ben-Tal

Section Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Roca-Martinez and coworkers have performed a computational analysis of RNA recognition motifs (RRMs) and RRM-RNA complexes in an effort to develop a scoring method for predicting and evaluating the probability of interaction between canonical RRMs and single-stranded RNA. The authors use available sequence and structure information to obtain individual scoring matrices for commonly observed interacting positions in RRM and RNA sequences (identified through multiple-sequence alignments), describing the preference of different nucleobase types to interact with different residue types. While the question of understanding the physicochemical underpinnings of RNA-protein interactions and predicting and sculpting their sequence determinants is extremely timely and important, the comments below should be addressed in detail before the suitability of the manuscript for publication can be adequately assessed.

Major comments

1. The description of the RRM-RNA scoring approach (p. 6), the very heart of the manuscript, is unclear and sloppy. Equation 2 is not consistent with the text and a proper explanation of the symbols and indices used is missing (denominator in the first term different from text, fn not explained, index i in denominator in the first term different from index J in the text etc.). Also, the explanations are given in an incomplete way e.g. the sentence “…is related to the number of times adenines interact with any other amino acid residue in position beta1-1” is missing the crucial qualifier “adenines at position 1”. This makes it hard to comprehend how the scores were actually calculated.

2. More importantly, the motivation and the physical foundation of the scoring function is not adequately explained. Centrally, the scores do not consider the frequency of amino-acid residues observed at a specific position (see e.g. the second term in Eq. 2), making it not symmetric when considering nucleotides and residues, respectively. In the example given on p. 6, the score should depend on the frequency of arginines interacting with adenines as well as with other nucleotides, but this is not included.

3. Also, the scoring function shares resemblance with the standard quasi-chemical approach for defining knowledge-based potentials (Miyazawa, S. and Jernigan, R.L., Macromolecules, 18, 534-552 (1985)), but with important differences. Namely, the authors here normalize the number of occurrences of a given event (e.g. presence of a nucleotide at a given site interacting with a given residue) by the number of all events other than that event (e.g the number of interactions of that nucleotide with all the other residues, except the one in question) and not the total number of all events (e.g. the number of interactions of that nucleotide with all residues). Why is this? The authors are motivated by the GOR method for analyzing secondary structural propensities, but it is not clear that the same formalism is applicable here – namely, in the GOR method one analyzes the linkage between an object (amino acid) and its property, while here one analyzes the propensity of two objects (amino acid and nucleotide) to co-occur in the same context (i.e. contact). This is related to the asymmetry discussed in point 1.

4. The rather extensive literature on contact-based statistical potentials for nucleic-acid/protein interactions should be adequately cited and discussed (see, for example, Donald et al. Nucleic Acids Res., 2007, 35, 1039–1047 or Tuszynska et al. BMC Bioinformatics, 2011, 12, 348, and others).

5. The authors refer to their randomized test set as a negative test set (p. 7). As there is no guarantee that many members of this set are not actual binders – the naming should be changed to something like “background set” or “randomized set”, but certainly not “negative set”. More critically, randomization was only done on the side of the RNA sequences (change of 1 nucleotide in the sequence) and not on the side of RRM sequences – this relates to the asymmetry of the whole approach as discussed above and must be properly defended.

6. Defining clusters as all complexes that have a certain similarity score with at least 25% of complexes in the cluster is quite low as a cutoff (p. 5). Of course, if one increases the cutoff, one risks not having sufficient samples for adequate statistics. The authors should defend the choice of their cutoff by providing quantitative evidence that it does not overly impact the scores i.e. the qualitative features of their method.

7. For the validation of their scoring method the authors analyze two experimentally studied examples, while extensive data on RRM binding motifs obtained by different experimental methods exist and are not used. See, for example, the RNAcompete results (Ray, D., Kazan, H., Cook, K. et al, A compendium of RNA-binding motifs for decoding gene regulation, Nature, 499, 172–177 (2013)) or the Attract database (PMID: 27055826). The authors should validate their results on as extensive a set of experimental data as possible.

Minor comments

1. On p. 9, the authors state that “The unbiased number of observed contacts in the training set that is used to calculate the scores is also shown in the preference matrices (Figure 8 B,C), below each of the scores”. However, these numbers are not integers, so it is unclear what they actually refer to.

2. Numbers in Figure 2 are not fully consistent with the text (1263 instead of 1259, 20 instead of 19; p. 3 and p. 4).

3. It is stated that the "alignment for the RRM-RNA structures" is available in Dataset S6 (p. 4). However, Dataset S6 contains the RRM-RNA similarity matrix.

4. In the caption of Table 1 (p. 28), the authors state that “The symbols reflect the score change after the E87N mutation”, however there are no symbols.

Reviewer #2: The paper describes the construction of a statistics-based potential for scoring RRM domain interactions with specific RNA sequences. The authors use a structural model of binding to identify contacting residues and then create sequence-based interaction statistics. The authors acknowledge that the method is limited to already known binding modes of RNA to RRM domains, or ones very close to those. The authors validate the approach on a left-out training set as well as a few novel cases. The problem is still open, but there is progress. I think the paper can be published. It would be interesting to compare the approach with an AF-like approach for modeling protein-RNA interactions (see the link below). Also, can protein-RNA models from this approach be used to create an alignment? https://www.biorxiv.org/content/10.1101/2022.09.09.507333v1.full.pdf

Reviewer #3: The manuscript by Roca-Martinez et al. on RNA-recognition motifs discusses a scoring method to estimate binding between an RRM and a single-stranded RNA, and the method aims to predict RRM-binding RNA sequence motifs based on the RRM protein sequence. The authors adopt a simpler statistical approach over the deep learning methods employed in several existing methods, for better interpretability. Interesting results on the discrimination of high-affinity RNAs with the UAG core motif from lower-affinity RNAs are reported.

While the reported method serves a useful purpose towards the overall task of solving the problem of deciphering the RNA recognition code of RRMs, there are a number of significant issues:

1. The score described by Equation 2 is not explained and the physical underpinning cannot be found. The two log terms summed are the same as the product of two ratios. But it is not clear what this means, and why does it make sense? Does it model some empirical binding? Conservation? Not clear. Furthermore, why does the denominator in the first term come to be $f_{{n - N}_{i}, R_{J}}$? This is not understandable.

2. There are numerous places where the method development depends on visual inspection. This raises a serious issue of reproducibility.

3. The model appears to be rather restrictive and works only for cluster 0 and no binding mode change can occur.

4. The negative test should be strengthened and should include other entries if possible.

Other issues:

1. p. 4. "the number of unique positions between both nucleotides": it is not clear if the authors meant positions only in A, only in B, or both.

2. Will the results be sensitive to the specific threshold of 5 Å?

Minor issues

1. The figures seem to jump around in order, which makes it difficult to go back and forth.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology, see: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Published online 2023 Jan 23. 10.1371/journal.pcbi.1010859.r002

Author response to Decision Letter 0

9 Nov 2022

Attachment

Submitted filename: response_to_reviewers.pdf

Published online 2023 Jan 23. 10.1371/journal.pcbi.1010859.r003

Decision Letter 1

Shi-Jie Chen, Academic Editor and Nir Ben-Tal, Section Editor

7 Jan 2023

Dear PhD Vranken,

We are pleased to inform you that your manuscript 'Deciphering the RRM-RNA recognition code: A computational analysis' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Shi-Jie Chen

Academic Editor

PLOS Computational Biology

Nir Ben-Tal

Section Editor

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors have significantly revised the manuscript and have adequately addressed all of my concerns from the first round.

Reviewer #3: The authors have made changes to improve the manuscript. However, a number of important issues have not been addressed adequately.

I still find it difficult to understand the origin of Eqn 2, and the physical basis remains unclear. Furthermore, while I appreciate why the authors are using visual inspection for verifications, the issue of reproducibility largely remains.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #3: No

Published online 2023 Jan 23. 10.1371/journal.pcbi.1010859.r004

Acceptance letter

Shi-Jie Chen, Academic Editor and Nir Ben-Tal, Section Editor

17 Jan 2023

PCOMPBIOL-D-22-01066R1

Deciphering the RRM-RNA recognition code: A computational analysis

Dear Dr Vranken,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Zsofia Freund

Articles from PLOS Computational Biology are provided here courtesy of PLOS

Full text links

Read article at publisher's site: https://doi.org/10.1371/journal.pcbi.1010859

Read article for free, from open access legal sources, via Unpaywall: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1010859&type=printable

Citations & impact


Article citations

Large-scale structure-informed multiple sequence alignment of proteins with SIMSApiper.
Crauwels C, Heidig SL, Díaz A, Vranken WF
Bioinformatics, 40(5):btae276, 01 May 2024
Cited by: 0 articles | PMID: 38648741 | PMCID: PMC11099654

Analysis of the inter-domain orientation of tandem RRM domains with diverse linkers: connecting experimental with AlphaFold2 predicted models.
Roca-Martínez J, Kang HS, Sattler M, Vranken W
NAR Genom Bioinform, 6(1):lqae002, 29 Jan 2024
Cited by: 1 article | PMID: 38288375 | PMCID: PMC10823583

Origin of ribonucleotide recognition motifs through ligand mimicry at early earth.
Mozumdar D, Roy RN
RNA Biol, 21(1):107-121, 01 Jan 2024
Cited by: 0 articles | PMID: 39526332 | PMCID: PMC11556283
Review

Research Progress on the Structural and Functional Roles of hnRNPs in Muscle Development.
Li Z, Wei H, Hu D, Li X, Guo Y, Ding X, Guo H, Zhang L
Biomolecules, 13(10):1434, 22 Sep 2023
Cited by: 1 article | PMID: 37892116 | PMCID: PMC10604023
Review

CroMaSt: a workflow for assessing protein domain classification by cross-mapping of structural instances between domain databases and structural alignment.
Dhondge H, Chauvot de Beauchêne I, Devignes MD
Bioinform Adv, 3(1):vbad081, 27 Jun 2023
Cited by: 0 articles | PMID: 37431435 | PMCID: PMC10329740


Data

Data behind the article

This data has been text mined from the article, or deposited into data resources.

BioStudies: supplemental material and supporting data

http://www.ebi.ac.uk/biostudies/studies/S-EPMC9894542?xr=true

Pfam

(1 citation) Pfam - PF00076

Protein structures in PDBe

(1 citation) PDBe - 2rs2

Funding

Funders who supported this work.

H2020 Marie Skłodowska-Curie Actions (1)

Grant ID: 813239
3 publications
