Help

RegulomeDB is a database that provides functional context for variants or regions of interest and serves as a tool to prioritize functionally important single nucleotide variants (SNVs) located within the non-coding regions of the human genome. RegulomeDB queries a variant by intersecting its position with genomic intervals that were identified as functionally active regions by computational analysis of functional genomic assays such as TF ChIP-seq and DNase-seq (from the ENCODE database), as well as with footprints and QTL data.
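
The intersection step described above can be sketched as follows. This is only a minimal illustration, assuming 0-based half-open intervals as in BED; the `Interval` type, the `overlapping` helper, and the toy peaks are hypothetical and are not taken from RegulomeDB's code.

```python
from typing import List, NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED convention, assumed here)
    end: int    # exclusive
    label: str  # free-text description of the annotation


def overlapping(chrom: str, pos: int, intervals: List[Interval]) -> List[Interval]:
    """Return every interval on `chrom` that contains the 0-based position `pos`."""
    return [iv for iv in intervals if iv.chrom == chrom and iv.start <= pos < iv.end]


# Toy annotations, invented for illustration only.
peaks = [
    Interval("chr1", 100, 200, "TF ChIP-seq peak (toy)"),
    Interval("chr1", 150, 180, "DNase-seq peak (toy)"),
    Interval("chr2", 100, 200, "TF ChIP-seq peak (toy)"),
]

hits = overlapping("chr1", 160, peaks)
for iv in hits:
    print(iv.label)
```

A linear scan is enough for a handful of intervals; a real service would more plausibly use an index or interval tree, which is outside the scope of this sketch.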

All the source data used in RegulomeDB v2.1 can be found on the ENCODE website using these two links under the Data button at the top of the page: Experiments and Annotations. RegulomeDB also incorporates these hits into prediction scores, providing a way to estimate how likely each variant is to be of real functional significance.

Querying variants with RegulomeDB

Users can submit queries to the RegulomeDB database in the following formats (Note: one can toggle between the hg19 and GRCh38 coordinates using the toggle button above the search box):

  1. rsIDs assigned by dbSNP (e.g. rs190509934).
  2. Single nucleotide positions: expressed in BED format, i.e. chrom:chromStart-chromEnd.
  3. Chromosomal regions: expressed in BED format, i.e. chrom:chromStart-chromEnd. In this case all the common dbSNPs with a minor allele frequency >1% in this region will be queried and returned.
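
As a rough illustration of the three query formats listed above, the sketch below classifies a query string. The `parse_query` helper and its regular expressions are hypothetical. It also assumes that a single nucleotide position is written as a 1-bp interval (chromEnd = chromStart + 1), which follows BED conventions but is not stated explicitly here.

```python
import re

# Hypothetical patterns; RegulomeDB's actual parser may accept other forms.
RSID_RE = re.compile(r"^rs\d+$")
REGION_RE = re.compile(r"^(chr(?:\d{1,2}|X|Y|M)):(\d+)-(\d+)$")


def parse_query(text: str) -> dict:
    """Classify a query as an rsID, a single position, or a region."""
    text = text.strip()
    if RSID_RE.match(text):
        return {"type": "rsid", "rsid": text}
    match = REGION_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognized query: {text!r}")
    chrom, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    if end <= start:
        raise ValueError(f"chromEnd must be greater than chromStart: {text!r}")
    # Assumption: a 1-bp interval denotes a single nucleotide position.
    kind = "position" if end - start == 1 else "region"
    return {"type": kind, "chrom": chrom, "start": start, "end": end}


for query in ["rs190509934", "chr1:100-101", "chr1:100-200"]:
    print(query, "->", parse_query(query)["type"])
```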

After supplying a list of search queries in the search box, and upon clicking the search button below, users are redirected to a summary table representing prediction scores for all the given query variants (see the explanation of scores in FAQ).

Users can download the search result table using the two download buttons at the top of the page, either in BED format or as a tab-separated file. One may also continue to explore each result individually by clicking on one of the rows in the table. Upon clicking on any variant of interest, users are redirected to the results page, which is subdivided into six sections as seen in the next screenshots. The data sources for each of the sections are explained in FAQ.

The top-most section of any variant page provides a summary of the results, including key information such as the rsID of the variant, the number of peaks found intersecting that variant, its prediction rank and score values, and the allele frequency of that rsID in different populations as reported by data sources such as gnomAD, 1000 Genomes, TOPMed, and others.

Each of the six sections can be expanded individually by clicking on it.

  • TF binding sites (ChIP-seq): This page provides further information about the TF (transcription factor) ChIP peaks that intersected the variant of interest.
    • The first section on this page is a bar plot showing the number of peaks that intersected the variant of interest, grouped by TF target. The number on each bar is the number of biosamples in which that TF target had a peak containing the variant position.
    • Below the bar chart, a table provides further metadata for all the assays that produced the peak files: the intersecting peak location (chromosome start and end), the biosample (along with the organ) used in each TF ChIP-seq assay, the ENCODE file ids (ENCFF ids) that were the source of the peak information, and their corresponding dataset ids (ENCSR ids). The ENCODE file and dataset accessions in this table are hyperlinked to the corresponding file and dataset objects on the ENCODE website for further metadata exploration.
  • Chromatin accessibility: This section provides the users with a bar plot graphical representation showing the number of times the variant was found to be within peaks called from chromatin accessibility assays using each biosample.
    • Each of the bars on the bar plot can be further expanded to view the underlying data table by clicking the title to the left of each bar.
    • Just like on the ChIP data page, users can click on the hyperlinked ENCFF (file) ids or ENCSR (dataset) ids to reach the corresponding ENCODE pages, which show further metadata for the file or dataset.
    • Note: when more than one DNase peak is reported for the same biosample, the peaks are not necessarily redundant. The DNase-seq samples can be derived from different donors and different treatment conditions. The exact underlying metadata can be explored by following the dataset links to the ENCODE portal.
  • TF motifs & footprints: This page provides information about the position weight matrices (PWMs) that represent TF motifs and match the sequence overlapping the variant of interest, as well as about footprints that intersect the variant of interest.
    • We provide a list of biosamples that were the source files for DNase-seq peak files used in the TRACE pipeline for predicting the footprints. (See how TF motifs and footprints are computed in FAQ.)
    • The biosamples list is hyperlinked to the corresponding ENCODE annotation filesets that contain the TRACE output files in the bed format. The ENCODE page also provides information about the exact chromatin accessibility file used for the TRACE pipeline.
    • Similarly the PWM file (when available) is also listed as a hyperlinked ENCFF id above the biosamples list box and can be further explored on the ENCODE website.
    • The exact genome reference region that overlaps with all the output motifs is represented on the top section along with a “boxed” letter that represents the variant of interest.
    • Each query could match a footprint (sometimes with no significant PWM score), or match the PWM itself outside of a footprint.
  • eQTLs & caQTLs (chromatin accessibility QTLs): The tables in this section list the eQTL and caQTL studies in which the query variant was found to be associated with gene expression levels or chromatin accessibility.
    • The caQTL data comes from curated publications, viewable on the ENCODE portal.
    • The eQTL data comes from the GTEx project (source of GTEx_Analysis_v8_eQTL.tar file) and has also been uploaded on the ENCODE portal.
    • The corresponding ENCODE file ids and their corresponding dataset ids are also listed on the table and hyperlinked for further exploration.
    • The biosample information and population ethnicity information (when available) are also listed on the caQTL table and correspond to the original biosample information used for that study in the publication.
    • Example: rs75982468 has both biosample and population information that comes from the publication listed here: PMID:30650056.
  • Chromatin states:
    • This section shows predicted chromatin states from chromHMM.
    • The variant positions are intersected with those chromatin states and displayed on an interactive human body map as well as in a tabular representation.
    • The body map is colored by the most active state among all biosamples in each organ. Thus, it shows the users a pictorial representation of candidate organs where the query variant is likely to be functional and within different categories of regulatory elements.
    • For example, if the variant is within an active enhancer region for that biosample, it might lead to changes in the gene expression that is regulated by that enhancer.
    • Users can use the body map diagram to filter down the search results to display only a few organs of interest. Users can also filter the search results using the list of biosamples or the various chromatin states that are listed on the panels next to the body map.
    • The tabular view below provides further details on the biosample, classification, organ as well as the source ENCODE datasets and files (hyperlinked to ENCODE for further metadata exploration).
  • Genome browser: Users can explore the genes near the variant (shown as a yellow highlight on the browser tracks). The browser shows the tracks from TF ChIP-seq and DNase-seq assays that have peaks overlapping the variant.
    • Users can use the “Refine your search” interface located above the browser tracks to further narrow down the list of tracks as needed.
    • This interface allows users to select from a variety of faceting options. For example, users can filter the tracks displayed in the browser by file type (bigWig or bigBed), dataset type (ChIP-seq or DNase-seq), organ or cell type, biosample, and the targets used in the respective ChIP-seq assays.
    • Users can also expand the track information section using the expand button on the lower right corner of each track. This expanded view allows the users to see the underlying ENCODE file and ENCODE dataset (both of which are hyperlinked to the respective ENCODE pages).

FAQ

Which reference genome is used?

You can switch between assemblies GRCh38 and hg19 through the toggle bar above the search box. However, the GRCh38 query contains the most recent datasets and we recommend using the GRCh38 version over the hg19 version.

What does the RegulomeDB ranking score represent?

The ranking score reflects the supporting evidence available for a particular location or variant id. In general, the more supporting data is available, the more likely the variant is to be functional and the better its rank, with 1 being the best (most supported) rank and 7 the worst.

Score  Supporting data
1a     eQTL/caQTL + TF binding + matched TF motif + matched Footprint + chromatin accessibility peak
1b     eQTL/caQTL + TF binding + any motif + Footprint + chromatin accessibility peak
1c     eQTL/caQTL + TF binding + matched TF motif + chromatin accessibility peak
1d     eQTL/caQTL + TF binding + any motif + chromatin accessibility peak
1e     eQTL/caQTL + TF binding + matched TF motif
1f     eQTL/caQTL + TF binding / chromatin accessibility peak
2a     TF binding + matched TF motif + matched Footprint + chromatin accessibility peak
2b     TF binding + any motif + Footprint + chromatin accessibility peak
2c     TF binding + matched TF motif + chromatin accessibility peak
3a     TF binding + any motif + chromatin accessibility peak
3b     TF binding + matched TF motif
4      TF binding + chromatin accessibility peak
5      TF binding or chromatin accessibility peak
6      Motif hit
7      Other
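
One way to read the ranking table above is as an ordered list of rules in which the first satisfied rule wins. The sketch below encodes that literal reading. It is an assumption about how the categories combine, not RegulomeDB's implementation, and the evidence flag names (`qtl`, `tf_binding`, `accessibility`, `any_motif`, `matched_motif`, `any_footprint`, `matched_footprint`) are hypothetical.

```python
# Each rule: (rank label, list of alternative evidence sets that satisfy it).
# Rules are checked top to bottom, mirroring the order of the table.
RULES = [
    ("1a", [{"qtl", "tf_binding", "matched_motif", "matched_footprint", "accessibility"}]),
    ("1b", [{"qtl", "tf_binding", "any_motif", "any_footprint", "accessibility"}]),
    ("1c", [{"qtl", "tf_binding", "matched_motif", "accessibility"}]),
    ("1d", [{"qtl", "tf_binding", "any_motif", "accessibility"}]),
    ("1e", [{"qtl", "tf_binding", "matched_motif"}]),
    ("1f", [{"qtl", "tf_binding"}, {"qtl", "accessibility"}]),
    ("2a", [{"tf_binding", "matched_motif", "matched_footprint", "accessibility"}]),
    ("2b", [{"tf_binding", "any_motif", "any_footprint", "accessibility"}]),
    ("2c", [{"tf_binding", "matched_motif", "accessibility"}]),
    ("3a", [{"tf_binding", "any_motif", "accessibility"}]),
    ("3b", [{"tf_binding", "matched_motif"}]),
    ("4", [{"tf_binding", "accessibility"}]),
    ("5", [{"tf_binding"}, {"accessibility"}]),
    ("6", [{"any_motif"}]),
]


def rank(evidence) -> str:
    """Return the first rank whose required evidence is all present, else '7'."""
    ev = set(evidence)
    # Assumption: a matched motif or footprint also counts as "any" motif or footprint.
    if "matched_motif" in ev:
        ev.add("any_motif")
    if "matched_footprint" in ev:
        ev.add("any_footprint")
    for label, alternatives in RULES:
        if any(required <= ev for required in alternatives):
            return label
    return "7"


print(rank({"tf_binding", "accessibility"}))
print(rank({"qtl", "accessibility"}))
```

Keeping the rules as data rather than nested conditionals makes it easy to compare the sketch line by line against the table.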

How to interpret the RegulomeDB probability score?

The RegulomeDB probability score ranges from 0 to 1, with 1 meaning the variant is most likely to be a regulatory variant. The score is calculated by a random forest model, TURF, trained on allele-specific TF binding SNVs. The version used here is simplified: it includes only the binary features from functional genomic evidence that are used in the heuristic ranking, plus numeric features from the information content of matched PWMs. We will include the whole feature set in a future release.

There is an overall positive correlation between the ranking scores and the probability scores, but there are some exceptions because 1) additional features are used when predicting probability scores, and 2) the features used in probability scoring are weighted differently from those in ranking scoring.

What data sources does RegulomeDB use for each genomic annotation?

RegulomeDB currently queries variants against genomic annotations from the following data types:

TF binding sites
Peaks from TF (transcription factor) ChIP-seq assays, called by the uniform pipeline from the latest release of the ENCODE project.

Chromatin states
Chromatin states in 833 biosamples were called from chromHMM in EpiMap and were directly retrieved from the ENCODE portal.

Chromatin accessibility peaks
Peaks from DNase-seq assays, called by the uniform pipeline from the latest release of the ENCODE project.

TF motifs
PWM matching positions from 746 motifs in JASPAR 2020 CORE collection for vertebrates.

Footprints
Footprints were predicted with signals from 642 DNase-seq experiments and 591 TF motifs by the TRACE pipeline.

eQTLs
The eQTLs from the GTEx project across 49 human tissues.

caQTLs
The chromatin accessibility QTLs (caQTLs) from 9 publications.

How are TF motifs and footprints computed?

For TF motifs, we downloaded the PWMs (position weight matrices) of 746 non-redundant TF motifs from the JASPAR 2020 CORE collection. The k-mers matching each TF motif were called by TFM P-value with a threshold of 4^-8 for each PWM. Bowtie was then used to map the k-mers to the genome to determine the final PWM matching positions for the TF motifs.
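
To make the idea of a k-mer "matching" a PWM concrete, the sketch below scores k-mers against a toy matrix using log-odds scores. It shows only the scoring step. The matrix, the pseudocount, and the uniform background are assumptions chosen for illustration, and the code is neither TFM P-value nor the RegulomeDB pipeline.

```python
import math

# Toy position frequency matrix (hypothetical counts), one dict per motif position.
PFM = [
    {"A": 8, "C": 1, "G": 0, "T": 1},
    {"A": 0, "C": 9, "G": 1, "T": 0},
    {"A": 1, "C": 0, "G": 9, "T": 0},
    {"A": 0, "C": 1, "G": 0, "T": 9},
]


def to_log_odds(pfm, pseudocount=0.5, background=0.25):
    """Convert counts to log2 odds against a uniform background (assumed)."""
    pwm = []
    for counts in pfm:
        total = sum(counts.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2(((counts[base] + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def score(pwm, kmer):
    """Sum the per-position log-odds for a k-mer of the same length as the PWM."""
    if len(kmer) != len(pwm):
        raise ValueError("k-mer length must equal motif length")
    return sum(column[base] for column, base in zip(pwm, kmer.upper()))


pwm = to_log_odds(PFM)
for kmer in ["ACGT", "TTTT"]:
    print(kmer, round(score(pwm, kmer), 2))
```

A k-mer would count as a match when its score exceeds the score cutoff that corresponds to the chosen P-value threshold; deriving that cutoff is the part handled by a tool such as TFM P-value and is not reproduced here.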

Footprints were predicted with the signals from DNase-seq experiments and the PWMs of TF motifs by the TRACE pipeline. TRACE is a computational method that incorporates signals from chromatin accessibility assays and PWMs within a multivariate hidden Markov model to detect footprint regions with matching motifs.

Note that TF motifs and footprints are two separate genomic annotations. TF motifs are called entirely from the DNA sequence, while footprints also take signals from chromatin accessibility experiments into account and place less weight on the sequence.

Can I download precalculated scores from RegulomeDB?

We currently have RegulomeDB rank scores available for common SNVs (Single Nucleotide Variants) in NCBI dbSNP Build 153. You can download the file here: regulomedb_dbsnp153_common_snv.tsv.

What version of dbSNP is RegulomeDB querying?

RegulomeDB is currently querying build 153 of dbSNP. See NCBI for additional information about dbSNP 153.

Why is there no data for my chromosomal region?

Entering a chromosomal region will identify all common SNPs (with a minor allele frequency > 1%) in that region. These SNPs are then used to query RegulomeDB. If there are no common SNPs in the queried region, no data can be returned.
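
The behavior described above can be sketched as a simple filter. The record layout, the `common_snps_in_region` helper, and the toy SNP identifiers below are hypothetical; the sketch assumes 0-based half-open region coordinates and a strict "greater than 1%" comparison.

```python
def common_snps_in_region(snps, chrom, start, end, min_maf=0.01):
    """Keep SNPs inside [start, end) on `chrom` whose minor allele frequency exceeds `min_maf`."""
    return [
        snp for snp in snps
        if snp["chrom"] == chrom and start <= snp["pos"] < end and snp["maf"] > min_maf
    ]


# Toy records, invented for illustration only (not real dbSNP entries).
snps = [
    {"id": "toy_snp_1", "chrom": "chr1", "pos": 120, "maf": 0.15},
    {"id": "toy_snp_2", "chrom": "chr1", "pos": 150, "maf": 0.004},  # too rare
    {"id": "toy_snp_3", "chrom": "chr1", "pos": 400, "maf": 0.30},   # outside region
]

selected = common_snps_in_region(snps, "chr1", 100, 200)
print([snp["id"] for snp in selected])

# A region containing only rare variants yields nothing to query.
print(common_snps_in_region(snps, "chr1", 140, 160))
```

The second call illustrates why a region can return no data even though it contains known variants.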

To cite RegulomeDB:
Dong, S., Zhao, N., Spragins, E., Kagda, M. S., Li, M., Assis, P. R., Jolanki, O., Luo, Y., Michael Cherry, J., Boyle, A. P., & Hitz, B. C. (2022). Annotating and prioritizing human non-coding variants with RegulomeDB. In bioRxiv.
Boyle AP, Hong EL, Hariharan M, Cheng Y, Schaub MA, Kasowski M, Karczewski KJ, Park J, Hitz BC, Weng S, Cherry JM, Snyder M. Annotation of functional variation in personal genomes using RegulomeDB. Genome Research 2012, 22(9):1790-1797. PMID: 22955989.

To contact RegulomeDB:
regulomedb@mailman.stanford.edu