Index


  1. Cite us
  2. Description of the algorithm
  3. Comparison Options
  4. Input format
  5. Parameters
  6. Datasets
  7. Output

Cite us

Back to the top

Please cite the following paper(s) if you use BEAGLE 2.0 in your research:

  • Ballesio F., Teofani A., Carrino C., Catalano M., Nicolaeasa M.L., Ausiello G., Helmer-Citterich M., Gherardini P.F.(2024) BEAGLE 2.0: a web server for RNA secondary structure similarity detection leveraging SHAPE-directed RNA structure determination submitted


Description of the algorithm

Back to the top

The BEAGLE (BEar Alignment Global and Local) web-server performs pairwise alignments of RNA secondary structure. The method exploits a new encoding for RNA secondary structure (BEAR) and a substitution matrix for RNA structural elements (MBR) (Mattei et al., 2014). The BEAR encoding allows to include structural information within a string of characters where each character of the encoding stores the information about the type and length of the secondary structure elements the nucleotide belongs to (Fig. 1).

Figure 1. The BEAR encoding. (A) The BEAR structural alphabet. Different sets of characters are associated with the different RNA basic structures(loop, internal loop, stem and bulge on the right side of a stem, and bulge on the left side, denoted here as L, I, S, BL and BR, respectively), with different characters used for basic structures of different length. (B) RNA hairpin with the constituent substructures (loop, stem, bulges and internal loop) highlighted in different colors. On the top right, the BEAR characters corresponding to each substructure, shown with the same colors. On the bottom right, the hairpin RNA sequence is shown associated with its dot-bracket and its BEAR secondary structure descriptions. (C) Conversion into BEAR of an RNA secondary structure. An RNA secondary structure, extracted from Rfam, is shown, containing four non-branching structures depicted in boxes. The resulting BEAR conversion of the non-branching and branching structures is shown in blue below the secondary structure. A ‘:’ character is assigned to the remaining nucleotides that do not belong to non-branching or branching structures (reported in black).

Transition rates between secondary structure elements were computed on a set of evolutionally related BEAR-encoded RNAs (Fig.2).

Figure 2. Graphical representation of the MBR. This figure shows a subset of rows/columns of the MBR matrix, using a color-coding to show substitution rate patterns: color scale represents log-odds scores from lower (blue) to higher (red). Rows and columns are elements of RNA secondary structure of different length and every cell stores the log-odds value for the substitution of one element with another element. The cells in the principal diagonal always have the highest value in the respective row and column. Substitutions between elements belonging to the same group (i.e. stems, loops and interior loops) display higher log-odds values than substitutions between elements belonging to different groups. The ‘. . . ’ notation indicates that some rows/columns were omitted from the graphical matrix representation.

The BEAR encoding uses an alphabet of 83 characters so the size of the MBR is 83x83. The total number of possible pairs is 3486 (83*(83+1)/2) among which 221 has a score higher than 1, corresponding to pairs occurring more than expected. The plot below shows the distribution of the score in the matrix (Fig. 3).


The BEAGLE method implements a modified version of the the Needleman-Wunsch algorithm for global alignment and the Smith-Waterman algorithm for local alignment, for the comparison of BEAR-encoded RNA secondary structures, using the MBR (or any other user-provided substitution matrix for BEAR characters) to guide the alignment.


Comparison options

Back to the top
BEAGLE offers three kinds of comparisons:
  1. One set
  2. This option accepts one set of RNAs as input and the server will perform all the possible pairwise alignments among those RNAs. Maximum 300 RNAs accepted in input. The input sequences must be supplied using the textarea.

  3. Two sets
  4. This option accepts two sets of RNAs, namely the query and target set. The sequences in the query set will be aligned to the sequences in the target set. Each RNA in the query set will be aligned with each RNA in the target set producing n x m alignments where n is the cardinality of the query set and m is the cardinality of the target set. The input sequences must be supplied using the two textareas.

  5. Search
  6. Using this option, an input RNA will be compared to one of the pre-compiled RNA datasets. For more information about the available datasets see the datasets section.


Input format

Back to the top

All the comparison options required the input sequences to be supplied using the textarea in the home page.

Note: The minimum accepted sequence length, for each comparison method, is 10 nucleotides!

The input sequences are accepted in FASTA format:
-The line containing the name and/or the description of the sequence starts with a ">";
-The words following the ">" are interpreted as the RNA id;
-The following line reports the RNA nucleotide sequence; -The subsequent line characters are interpreted as secondary structure information (Optional)

or

FASTB format:
-The line containing the name and/or the description of the sequence starts with a ">";
-The words following the ">" are interpreted as the RNA id;
- The following line reports the RNA nucleotide sequence;
-The subsequent line characters are interpreted as secondary structure information in the BEAR alphabet.

The IUPAC notation is accepted for nucleotides (case-insensitive).
The secondary structure must be supplied using dot-bracket notation; only '( . )' characters will be accepted by the program.

Example of a well formatted input file:
>X06054.1/711637
GGGCCCGUCGUCUAGCCUGGUUAGGACGCUGCCCUGACGCGGCAGAAAUCCUGGGUUCAAGUCCCAGCGGGCCCA
In this case the secondary strucure for the sequence will be computed on the fly using RNAfold (Vienna package), with the minimum free energy prediction method.

or
>X06054.1/711637
GGGCCCGUCGUCUAGCCUGGUUAGGACGCUGCCCUGACGCGGCAGAAAUCCUGGGUUCAAGUCCCAGCGGGCCCA
(((((((..((((..........)))).(((((.......))))).....(((((.......)))))))))))).
or
>X06054.1/711637
GGGCCCGUCGUCUAGCCUGGUUAGGACGCUGCCCUGACGCGGCAGAAAUCCUGGGUUCAAGUCCCAGCGGGCCCA
(((((((..((((..........)))).(((((.......))))).....(((((.......)))))))))))).
GGGGGGG::ddddssssssssssdddd:eeeeepppppppeeeee:::::eeeeepppppppeeeeeGGGGGGG:
The input may contain many sequences e.g. :
>X06054.1/711637
GGGCCCGUCGUCUAGCCUGGUUAGGACGCUGCCCUGACGCGGCAGAAAUCCUGGGUUCAAGUCCCAGCGGGCCCA
(((((((..((((..........)))).(((((.......))))).....(((((.......)))))))))))).
GGGGGGG::ddddssssssssssdddd:eeeeepppppppeeeee:::::eeeeepppppppeeeeeGGGGGGG:
>AP000063.1/5917959095
GCGGGGGUGCCCGAGCCUGGCCAAAGGGGUCGGGCUCAGGACCCGAUGGCGUAGGCCUGCGUGGGUUCAAAUCCCACCCCCCGCA
(((((((..(((.............))).(((((.......)))))..............(((((.......)))))))))))).
>AP000989.1/7327973354
GCGGCCGUCGUCUAGUCUGGAUUAGGACGCUGGCCUUCCAAGCCAGUAAUCCCGGGUUCAAAUCCCGGCGGCCGCA
(((((((..((((...........)))).(((((.......))))).....(((((.......)))))))))))).
>AE006696.1/291218
GCCGCCGUAGCUCAGCCCGGGAGAGCGCCCGGCUGAAGACCGGGUUGUCCGGGGUUCAAGUCCCCGCGGCGGCA
(((((((..((((.........)))).(((((.......))))).....(((((.......)))))))))))).



Parameters

Back to the top

GAP INSERTION
Cost of starting a gap in the alignment

GAP EXTENSION
Cost of extending an alignment gap.

SEQUENCE BONUS
Extra score for aligning two identical nucleotides.

GLOBAL/LOCAL
Allows the user to choose between global and local alignment

DATASETS (only for "Search" comparison option)
Allows the user to choose one of the pre-compiled datasets. Graphical output will not be available for this option.



Datasets

Back to the top

The searchable secondary structure datasets were divided into three groups: those derived from icSHAPE, NONCODE, and Rfam data. The first dataset, based on experimental data, is the most reliable. The NONCODE dataset was included to allow research across a wider range of species. The Rfam dataset, which was also used in the initial version of Beagle, has been updated to Rfam seed version 15.0 and relies on the conservation of positions within Rfam families for its reliability. All the comparison options required the input sequences to be supplied using the textarea in the home page.

Accession ID Organism Cell line Transcript type Reference Number of RNAs Library strategy Condition Reagent
GSE145805 Homo sampiens HEK293 PolyA+ Sun et al., 2021 73,069 icSHAPE in vivo NAI-N3
GSE145805 Homo sampiens H9 PolyA+ Sun et al., 2021 72,697 icSHAPE in vivo NAI-N3
GSE145805 Homo sampiens K562 PolyA+ Sun et al., 2021 66,042 icSHAPE in vivo NAI-N3
GSE145805 Homo sampiens HeLa PolyA+ Sun et al., 2021 61,066 icSHAPE in vivo NAI-N3
GSE145805 Homo sampiens HepG2 PolyA+ Sun et al., 2021 48,491 icSHAPE in vivo NAI-N3
GSE145805 mEs HepG2 PolyA+ Sun et al., 2021 26,756 icSHAPE in vivo NAI-N3
GSE146952 Homo sapiens HEK293t small RNAs Luo et al. 2021 299 SHAPE-MaP in vivo NAI-N3
GSE120724 Danio rerio Fertilized eggs PolyA+ Shi et al. 2020 7,033 icSHAPE in vivo NAI-N3
GSE120724 Danio rerio 1 cell (0,4 hpf) PolyA+ Shi et al. 2020 7,033 icSHAPE in vivo NAI-N3
GSE120724 Danio rerio 4 cells (1 hpf) PolyA+ Shi et al. 2020 5,013 icSHAPE in vivo NAI-N3
GSE120724 Danio rerio 64 cells (2 hpf) PolyA+ Shi et al. 2020 5,138 icSHAPE in vivo NAI-N3
GSE120724 Danio rerio Sphere (4 hpf) PolyA+ Shi et al. 2020 5,024 icSHAPE in vivo NAI-N3
GSE120724 Danio rerio Shield (6 hpf) PolyA+ Shi et al. 2020 4,879 icSHAPE in vivo NAI-N3
GSE154563 Escherichia coli (K-12 DH10B) K-12 DH10B rRNA 16s, 23s Marinus et al., 2021 2 SHAPE-MaP in vivo 2A3
GSE154563 Bacillus subtilis (168) K-12 DH10B rRNA 16s, 23s Marinus et al., 2021 2 SHAPE-MaP in vivo 2A3
GSE106483 Dengue Virus (EDEN-2402) Vero Genome Huber et al., 2019 1 SHAPE-MaP in vivo NAI
GSE106483 Dengue Virus (EDEN-3295) Vero Genome Huber et al., 2019 1 SHAPE-MaP in vivo NAI
GSE106483 Dengue Virus (EDEN-2270) Vero Genome Huber et al., 2019 1 SHAPE-MaP in vivo NAI
PRJEB28648 Zika virus (MR766) Huh-7 Genome P. Li et al. 2018 1 icSHAPE in vivo NAI-N3
PRJEB28648 Zika virus (PRVABC 59) Huh-7 Genome P. Li et al. 2018 1 icSHAPE in vivo NAI-N3
GSE189259 SARS-CoV-1 Vero E6 Genome Morandi et al, 2022 1 SHAPE-MaP in vivo 2A3
GSE189259 SARS-CoV-2 Vero E6 Genome Morandi et al, 2022 1 SHAPE-MaP in vivo 2A3
GSE153984 HCoV-HK U5 Huh 7.5.1 UTRs Lan, T.C.T. et al, 2022 2 icSHAPE in vivo NAI-N3
GSE153984 HCoV-HK U9 Huh 7.5.1 UTRs Lan, T.C.T. et al, 2022 2 icSHAPE in vivo NAI-N3
GSE153984 HCoV-HK U1 Huh 7.5.1 UTRs Lan, T.C.T. et al, 2022 2 icSHAPE in vivo NAI-N3
GSE153984 HCoV-NL 63 Huh 7.5.1 UTRs Lan, T.C.T. et al, 2022 2 icSHAPE in vivo NAI-N3
GSE153984 MERS-CoV Huh 7.5.1 UTRs Lan, T.C.T. et al, 2022 2 icSHAPE in vivo NAI-N3


To enhance clarity and transparency regarding sequencing coverage and quality control, we have included this table that provides a summary of the coverage thresholds and quality control measures implemented in the original studies, as well as the filtering criteria applied, allowing users to easily verify how SHAPE data were processed.


Accession ID Reference Coverage Threshold Quality control measures
GSE145805 Sun et al., 2021 >50,000 transcripts in human;
>30,000 transcripts in mouse
Background correction with DMSO-treated samples, entropy-based filtering and winsorization. The scores of each peak were normalized to a scale of [0, 1]
GSE153984 Lan, T.C.T. et al. 2022 >100,000 reads per region (sliding window of 500 nt) Background correction with DMSO-treated samples, winsorization and removal of low-confidence positions. The scores of each peak were normalized to a scale of [0, 1]
GSE146952 Sun et al., 2021 >500 per nucleotide Background correction with DMSO-treated samples, removal of low-confidence positions and structural score computation using mutation rates
GSE120724 Marinus et al. 2021 Coverage threshold not explicitly stated Background correction with DMSO-treated samples and removal of low-confidence positions (Phread < 20)
GSE106483 Huber et.al 2019 >100 reads per base on 99.99% of all bases across eight viruses Background correction with DMSO-treated samples, Shannon entropy-based filtering, correlation-based validation of replicates (Pearson > 0.81) and comparison with known structural elements
PRJEB28648 Li et al. 2018 Coverage threshold not explicitly stated Background correction with DMSO-treated samples, icSHAPE data normalization using a global median-based approach and cross-validation using independent biological replicates (Pearson correlation R ≥ 0.89)
GSE189259 Morandi et al., 2022 Coverage threshold not explicitly stated Background correction with DMSO-treated samples, filtering of low-complexity regions using the Gini coefficient, normalization of SHAPE reactivities and validation using COMRADES-supported RNA–RNA interactions


Output

Back to the top

The results page reports a table containing all the computed pairwise alignments .Each row of the table contains the two input RNA ids and alignment statistics such as sequence and structural identity percentages and the structural similarity percentage. Moreover, also two measures for the statistical significance of the alignments are reported: p-value and z-score. Results can be sorted according to one of the previous parameters by clicking the selected column header. by default the results are displayed sorted by z-score in ascending order.

Click the "Details" button (blue bloxes in the figure above) to open a dedicated page showing the identifiers of that specific pair of RNAs along with the corresponding sequence and structure color-coded alignment . The color code helps the user to identify ungapped structural regions between the two RNAs (see figure below). It is possible to switch the color-conding on and off by clicking on the “Toggle Colors” button.

The purpose of the colors is to help the user to identify common un-gapped sub-structures between the RNAs. Different colors are associated to different sub-structures. These colors are not to be intended as conservation measures.

Description of the output fields:

The NW W (Needleman and Wunsch): Needleman-Wunsch alignment score for the RNA primary sequences
The SeqIdentity y (Sequence identity percentage) is computed as the fraction of paired nucleotide bases.
The Str Id (structural identity percentage) is computed as the fraction of paired bases encoded with an identical BEAR character.
The Str Sim (structural similarity percentage) is computed as the fraction of paired bases encoded with two different BEAR characters belonging to the same RNA structural element (e.g. two different characters encoding for stems with different lengths).
The p-value and z-score are computed using as background the distribution of the scores obtained aligning unrelated RNA sequences (as detailed in the Supplementary Material of Mattei et al. , sumbitted). We suggest to consider the z-score as the reference statistic measure and consider as significant all the alignments having a z-score higher than 3.

By clicking on Download Results, it is possible to download all the pairwise alignments. The exported file will be formatted in a FASTA-like format as follow:

-The first line containing starts with a ">" followed by the ID of the first sequence, the ID of the second and the alignment scores, the sequence identity percentage, the structural identity percentage, the structural similarity percentage, P-value and Z-score. The fields are divided by '|'
-Next line represents the aligned nucleotide sequence of the first RNA
-Next line represents the aligned secondary structure of the first RNA
-Next line represents the aligned BEAR characters of the first RNA
-Next line represents the aligned nucleotide sequence of the second RNA
-Next line represents the aligned secondary structure of the second RNA
-Next line represents the aligned BEAR characters of the second RNA

Example:

Input1:

Input2:

Output: