# fanta.bio

> fanta.bio (Functional genome ANnotations with Transcriptional Activities) is a database that collects functional annotations of genomes for studying gene regulation, with a primary focus on cis-regulatory elements (CREs) such as promoters and enhancers.

## Database Overview

fanta.bio provides comprehensive functional genome annotations focused on cis-regulatory elements (CREs). CREs, including promoters and enhancers, are identified based on their transcription signatures. Both promoters and enhancers produce specific sets of RNAs, such as mRNA, lncRNA, uaRNA (upstream antisense RNA), and eRNA (enhancer RNA).

The identification methodology builds upon the pioneering work of the FANTOM5 project, applying advanced approaches to an expanded dataset. The database additionally collects relevant resources like genome binding sites of transcription factors and genome variations across individuals.

## Data Access Interfaces

- **Web Interface**: The most user-friendly way to explore the database at [https://fanta.bio/](https://fanta.bio/)
- **REST API**: Programmatic access at [https://api.fanta.bio/](https://api.fanta.bio/). Interactive Scalar UI at [https://api.fanta.bio/docs](https://api.fanta.bio/docs); OpenAPI 3.1 spec at [https://api.fanta.bio/openapi.json](https://api.fanta.bio/openapi.json). Endpoints cover gene, SNP, and CRE search/lookup; cross-database analysis (shared regulators, regulatory summaries); a unified `/v0/search` that auto-routes by query shape; expression and ChIP-Atlas TF binding data.
- **MCP Server**: Model Context Protocol server at [https://mcp.fanta.bio/mcp](https://mcp.fanta.bio/mcp) (Streamable HTTP, JSON-RPC 2.0). Lets AI assistants like Claude query fanta.bio directly. Exposes 19 tools (search, lookup, relationships, expression, TF binding, cross-database analysis), 5 prompt templates (`find_gene_tfs`, `compare_genes_regulation`, `region_regulatory_analysis`, `disease_snp_analysis`, `explore_cre_expression`), and 2 resources (`organisms://list`, `schema://database`).
- **Documentation**: Full developer and user docs at [https://docs.fanta.bio/](https://docs.fanta.bio/) covering the website, API, and MCP integration.
- **UCSC Genome Browser**: Access via [track hub](https://genome-asia.ucsc.edu/cgi-bin/hgTracks?hubUrl=https://data.fanta.bio/hub/v1.1.0-2409/trackhub/hub.txt) for visualization alongside other genomic datasets
- **Download Archive**: Available at [https://data.fanta.bio/](https://data.fanta.bio/) for local data analysis
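
The MCP server speaks JSON-RPC 2.0, so requests to it are plain JSON messages. The following is a minimal Python sketch that only builds two such messages without sending them. The method names `tools/list` and `resources/read` come from the Model Context Protocol itself, and `jsonrpc_request` is a hypothetical helper. A real session would first go through the protocol's initialization handshake, which is omitted here.

```python
import json
from itertools import count

MCP_ENDPOINT = "https://mcp.fanta.bio/mcp"  # endpoint listed above
_ids = count(1)  # simple increasing request ids


def jsonrpc_request(method, params=None):
    """Serialize one JSON-RPC 2.0 request (hypothetical helper)."""
    message = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        message["params"] = params
    return json.dumps(message)


# Ask the server which tools it exposes.
list_tools = jsonrpc_request("tools/list")

# Read one of the two resources named above.
read_organisms = jsonrpc_request("resources/read", {"uri": "organisms://list"})

print(list_tools)
print(read_organisms)
```

In practice an MCP client library would handle the transport and handshake; the sketch only shows the shape of the messages.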

## Search API and URL Structures

fanta.bio supports URL-based searching that can be used programmatically to access specific data. The following URL patterns enable direct access to search results and record pages:

### Basic Search
- **Main search page**: https://fanta.bio/
- **Basic search URL structure**: `https://fanta.bio/search?q={QUERY}&organism={ORGANISM}`
  - Example (search for TP53): `https://fanta.bio/search?q=TP53&organism=human`
  - Example (search for Sox2): `https://fanta.bio/search?q=Sox2&organism=mouse`
  - Organism options: `human`, `mouse`, or `any`
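
The basic search URL can be assembled programmatically. The following is a minimal Python sketch based only on the URL pattern above; `basic_search_url` is a hypothetical helper, not part of any official client.

```python
from urllib.parse import urlencode

BASE = "https://fanta.bio/search"
ORGANISMS = {"human", "mouse", "any"}  # options listed above


def basic_search_url(query, organism="any"):
    """Build a basic search URL (hypothetical helper)."""
    if organism not in ORGANISMS:
        raise ValueError(f"organism must be one of {sorted(ORGANISMS)}")
    return f"{BASE}?{urlencode({'q': query, 'organism': organism})}"


print(basic_search_url("TP53", "human"))
print(basic_search_url("Sox2", "mouse"))
```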

### UCSC Browser Search
- **URL structure**: `https://genome-asia.ucsc.edu/cgi-bin/hgTracks?hubUrl=https://data.fanta.bio/hub/v1.1.0-2409/trackhub/hub.txt&genome={GENOME}&position={POSITION}`
  - Example (human, chr17:7668402-7687538): `https://genome-asia.ucsc.edu/cgi-bin/hgTracks?hubUrl=https://data.fanta.bio/hub/v1.1.0-2409/trackhub/hub.txt&genome=hg38&position=chr17:7668402-7687538`
  - Genome options: `hg38` (human), `mm10` (mouse)
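
A track hub link for a given region can be built the same way. This sketch assumes only the URL pattern and genome names shown above; `ucsc_hub_url` is a hypothetical helper. The characters `:` and `/` are left unescaped so the result matches the documented example.

```python
from urllib.parse import urlencode

UCSC = "https://genome-asia.ucsc.edu/cgi-bin/hgTracks"
HUB = "https://data.fanta.bio/hub/v1.1.0-2409/trackhub/hub.txt"
GENOMES = {"human": "hg38", "mouse": "mm10"}


def ucsc_hub_url(organism, chrom, start, end):
    """Build a UCSC Genome Browser URL that loads the fanta.bio track hub."""
    params = {
        "hubUrl": HUB,
        "genome": GENOMES[organism],
        "position": f"{chrom}:{start}-{end}",
    }
    # safe=":/" keeps the hub URL and position readable, as in the example above
    return f"{UCSC}?{urlencode(params, safe=':/')}"


print(ucsc_hub_url("human", "chr17", 7668402, 7687538))
```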

### Advanced Search
- **Advanced search page**: [https://fanta.bio/search/cre-advanced](https://fanta.bio/search/cre-advanced)
- **URL structure**: `https://fanta.bio/search/cre-advanced?q={QUERY}&organism={ORGANISM}&bound_tf={TF_NAME}`
  - Example (CREs with CTCF binding): `https://fanta.bio/search/cre-advanced?q=&organism=human&bound_tf=CTCF`
  - Example (CREs with p53 binding): `https://fanta.bio/search/cre-advanced?q=&organism=human&bound_tf=TP53`
  - Supports combinations of parameters for precise filtering
  - Search terms allow exact or partial matches, wildcards, and logical combinations

### Gene Neighbor Search
- **Gene search page**: [https://fanta.bio/search/gene](https://fanta.bio/search/gene)
- **URL structure**: `https://fanta.bio/search/gene?q={GENE_NAME_OR_SYMBOL}&organism={ORGANISM}`
  - Example (TP53 gene): `https://fanta.bio/search/gene?q=TP53&organism=human`
  - Example (Sox2 gene): `https://fanta.bio/search/gene?q=Sox2&organism=mouse`
  - Returns CREs within 10kb of the gene (distance=0 means CRE is inside the gene)
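
The distance rule can be illustrated with a small interval calculation. This is a sketch under stated assumptions: intervals are half-open `(start, end)` pairs on the same chromosome, and the 10kb boundary is treated as inclusive. The database's exact coordinate convention is not specified here, and both function names are hypothetical.

```python
def interval_distance(a_start, a_end, b_start, b_end):
    """Gap in bp between two intervals; 0 when they overlap or touch."""
    return max(0, a_start - b_end, b_start - a_end)


def is_neighbor(cre, gene, window=10_000):
    """True when the CRE lies within `window` bp of the gene."""
    return interval_distance(*cre, *gene) <= window


gene = (7_668_402, 7_687_538)
print(interval_distance(7_670_000, 7_670_200, *gene))  # CRE inside the gene
print(is_neighbor((7_690_000, 7_690_200), gene))       # CRE downstream of the gene
```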

### GWAS SNP Search
- **SNP search page**: [https://fanta.bio/search/snp](https://fanta.bio/search/snp)
- **URL structure**: `https://fanta.bio/search/snp?q={TRAIT_OR_SNP_ID}&organism={ORGANISM}`
  - Example (diabetes trait): `https://fanta.bio/search/snp?q=diabetes&organism=human`
  - Example (specific SNP): `https://fanta.bio/search/snp?q=rs12345&organism=human`
  - Returns SNPs related to the query and counts of nearby CREs (within 10kb)

### CRE Record Access
- **Direct CRE record URL structure**: `https://fanta.bio/cre/{CRE_ID}`
  - Example: `https://fanta.bio/cre/hg38_cre_12345`
  - Example: `https://fanta.bio/cre/mm10_cre_67890`
  - Provides access to all tabs: Annotation, Bound TFs, Variations, Expression Table

### Result Formats
- **CSV Export**: Add `&format=csv` to any search URL to download results in CSV format
  - Example: `https://fanta.bio/search?q=TP53&organism=human&format=csv`
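
A search URL can be converted to its CSV form and then read with the standard library. In this sketch, `as_csv_url` and `fetch_rows` are hypothetical helpers. `fetch_rows` assumes the response is UTF-8 text with a header row; the column names are not documented here, so they are taken from whatever the header contains.

```python
import csv
import io
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
from urllib.request import urlopen


def as_csv_url(search_url):
    """Return the same search URL with format=csv set exactly once."""
    parts = urlsplit(search_url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "format"
    ]
    query.append(("format", "csv"))
    return urlunsplit(parts._replace(query=urlencode(query)))


def fetch_rows(search_url):
    """Download the CSV export and return one dict per row (needs network access)."""
    with urlopen(as_csv_url(search_url)) as response:
        text = response.read().decode("utf-8")  # assumption: UTF-8 encoded
    return list(csv.DictReader(io.StringIO(text)))


print(as_csv_url("https://fanta.bio/search?q=TP53&organism=human"))
```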

## CRE Identification Methodology

CRE peaks are identified from experimental measurements of transcription start sites (TSSs) across diverse biological samples.

CAGE (Cap Analysis of Gene Expression) sequences the 5'-ends of capped RNAs, capturing precise transcription start sites across the genome. The CAGE data used in fanta.bio come from:
- [FANTOM5](https://fantom.gsc.riken.jp/5/) project
- [FANTOM6](https://fantom.gsc.riken.jp/6/) project
- Public repositories: [SRA](https://www.ncbi.nlm.nih.gov/sra)/[ENA](https://www.ebi.ac.uk/ena)/[DRA](https://www.ddbj.nig.ac.jp/dra)

The identification pipeline employs a method based on transcription divergence (Kawaji et al., in preparation). This method analyzes the bidirectional transcription patterns that are characteristic of active regulatory elements.

CREs are categorized into two primary groups:
- **Promoter Level Activity (PLA)**: Color-coded from red to blue in the browser view, indicating transcription direction
  - Red: Forward direction (+1)
  - Blue: Reverse direction (-1)
  - Intermediate colors: Directionalities in-between
  - Directionality is defined as: (ForwardCounts - ReverseCounts) / (ForwardCounts + ReverseCounts)
- **Enhancer Level Activity (ELA)**: Marked in yellow in the browser view
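
The directionality formula above translates directly into code. This is a minimal Python sketch; the handling of a region with no counts on either strand is an assumption, since the text does not define that case.

```python
def directionality(forward_counts, reverse_counts):
    """(ForwardCounts - ReverseCounts) / (ForwardCounts + ReverseCounts)."""
    total = forward_counts + reverse_counts
    if total == 0:
        # Assumption: directionality is undefined without any counts.
        raise ValueError("no counts on either strand")
    return (forward_counts - reverse_counts) / total


print(directionality(30, 10))  # mostly forward
print(directionality(0, 5))    # fully reverse
```

Values near +1 correspond to the red end of the browser color scale and values near -1 to the blue end.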

CRE genomic coordinates are provided in [BED9+ format](https://genome.ucsc.edu/FAQ/FAQformat.html#format1), with thickStart/thickEnd marking the core region bounded by the highest signals on the forward and reverse strands.
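
A BED9+ line can be split into the nine standard BED fields plus any additional columns. In this sketch the sample line is invented for illustration, the meaning of the extra columns is not assumed, and `parse_bed9_plus` is a hypothetical helper that expects an integer score column.

```python
from typing import NamedTuple


class Bed9(NamedTuple):
    chrom: str
    start: int
    end: int
    name: str
    score: int
    strand: str
    thick_start: int
    thick_end: int
    item_rgb: str
    extra: tuple  # any columns beyond the standard nine


def parse_bed9_plus(line):
    """Parse one tab-separated BED9+ line."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 9:
        raise ValueError("expected at least 9 tab-separated fields")
    return Bed9(f[0], int(f[1]), int(f[2]), f[3], int(f[4]), f[5],
                int(f[6]), int(f[7]), f[8], tuple(f[9:]))


# Hypothetical record, not taken from the database.
sample = "chr17\t7687400\t7687600\thg38_cre_12345\t0\t+\t7687480\t7687520\t255,0,0\textra"
record = parse_bed9_plus(sample)
core_length = record.thick_end - record.thick_start
print(record.name, core_length)
```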

## CRE Activity Measurement

Cell-dependent gene regulation requires activation of specific CRE sets. The database quantifies CRE activities by measuring transcription outputs per cell or tissue type. RNA 5'-ends within CRE regions are:
1. Counted
2. Normalized as CPM (counts per million) to adjust for sequencing depth
3. Scaled by the RLE method ([Anders et al. 2013](https://doi.org/10.1038/nprot.2013.099)) for meaningful sample comparisons
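
The steps above can be sketched in plain Python. This is a simplified illustration, not the database's actual pipeline: it computes CPM per sample and then applies a median-of-ratios (RLE-style) factor calculated on the CPM values, using only features that are positive in every sample. The function names and the order in which the factor is applied are assumptions.

```python
import math
from statistics import median


def cpm(counts):
    """Scale raw counts so that each sample sums to one million."""
    total = sum(counts)
    return [c * 1_000_000 / total for c in counts]


def rle_factors(samples):
    """Per-sample median ratio to the per-feature geometric mean."""
    reference = {}
    for i in range(len(samples[0])):
        values = [s[i] for s in samples]
        if all(v > 0 for v in values):  # skip features with zeros
            reference[i] = math.exp(sum(math.log(v) for v in values) / len(values))
    return [median(s[i] / g for i, g in reference.items()) for s in samples]


def normalize(raw_samples):
    """CPM per sample, then divide by the RLE factor of that sample."""
    cpms = [cpm(s) for s in raw_samples]
    return [[v / f for v in s] for s, f in zip(cpms, rle_factors(cpms))]


# Two hypothetical samples; the second was sequenced twice as deeply.
raw = [[10, 20, 30], [20, 40, 60]]
print(normalize(raw))
```

Because the two samples differ only in depth, their normalized values come out identical, which is the behavior the normalization is meant to provide.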

This approach allows quantitative comparison of CRE activity across samples, cell types, and experimental conditions.

## Data Collection and Processing Pipeline

The fanta.bio database construction workflow involves several key steps:

1. **Data Collection**:
   - CAGE data from FANTOM5, FANTOM6, and public repositories
   - ChIP-seq data from ChIP-Atlas
   - Genome variation data from TogoVar (human) and MoG+ (mouse)

2. **CRE Identification**:
   - Detection of CAGE peaks indicating transcription start sites
   - Analysis of bidirectional transcription patterns
   - Classification of peaks as PLA or ELA based on activity levels
   - Annotation with genome coordinates in BED9+ format

3. **Association with Genomic Features**:
   - Identification of nearby transcripts (within 500bp)
   - Association with gene identifiers (Ensembl, RefSeq, GenBank)
   - Comparison with other TSS databases (FANTOM5 CAGE peaks, refTSS)
   - Assessment of overlap with known enhancers (FANTOM5 enhancers, SCREEN cCREs)

4. **Integration with TF Binding Data**:
   - Mapping of ChIP-seq peaks to CRE regions
   - Determination of TF binding (50% overlap criterion, Q-score > 1000)
   - Association of binding data with experimental metadata

5. **Variation Data Integration**:
   - Mapping of genome variations to CRE regions
   - Annotation with variation metadata and functional predictions

6. **Expression Quantification**:
   - Counting of CAGE tags within CRE regions
   - Normalization and scaling for cross-sample comparison
   - Organization of expression data by sample and condition

## Associated Datasets

### ChIP-seq Data
- Sourced from [ChIP-Atlas](https://chip-atlas.org/)
- Includes ChIP-seq peaks in "TF and Others" categories
- Selected data derived from cell lines matching transcriptome data
- Search by TF name in advanced search: `https://fanta.bio/search/cre-advanced?bound_tf={TF_NAME}`
- Peak determination criteria: 50% overlap between CRE region and TF binding region
- Quality threshold: Q-score > 1000 (-10 * Log10[MACS2 Q-value])
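
The two criteria above can be expressed as a small check. This sketch makes assumptions that the text does not settle: the overlap fraction is measured relative to the CRE length, intervals are half-open `(start, end)` pairs, and the Q-score cutoff is strict. All function names are hypothetical.

```python
import math


def q_score(q_value):
    """Q-score as defined above: -10 * log10(MACS2 Q-value)."""
    return -10 * math.log10(q_value)


def overlap_fraction(cre, peak):
    """Fraction of the CRE covered by the peak (assumed denominator: CRE length)."""
    overlap = max(0, min(cre[1], peak[1]) - max(cre[0], peak[0]))
    return overlap / (cre[1] - cre[0])


def is_bound(cre, peak, peak_q_score, min_overlap=0.5, min_q_score=1000):
    """Apply the 50% overlap criterion and the Q-score > 1000 cutoff."""
    return peak_q_score > min_q_score and overlap_fraction(cre, peak) >= min_overlap


cre = (100, 300)
peak = (200, 500)
print(overlap_fraction(cre, peak))
print(is_bound(cre, peak, peak_q_score=1200))
```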

### Human Genome Variation Data
- [TogoVar](https://togovar.org/) serves as the primary data source
- Focuses on Japanese genetic variations but includes non-Japanese data via dbSNP
- Additional resources accessible via UCSC Genome Browser:
  - [ClinVar](https://www.ncbi.nlm.nih.gov/clinvar/) for clinical variant interpretation
  - [gnomAD](https://gnomad.broadinstitute.org/) for population frequency data
  - [TCGA Pan-cancer mutations](https://portal.gdc.cancer.gov/) for cancer-related variants
  - [dbSNP](https://www.ncbi.nlm.nih.gov/snp/) for general variant information
- Available in the "TogoVar Variations" tab of each human CRE record

### Mouse Genome Variation Data
- [MoG+](https://molossinus.brc.riken.jp/) provides variations across mouse subspecies/strains
- Additional data via UCSC Genome Browser:
  - [Mouse Genomes Project](https://www.sanger.ac.uk/data/mouse-genomes-project/) for variations across laboratory mouse strains
  - [dbSNP](https://www.ncbi.nlm.nih.gov/snp/) for general variant information
- Linked from the "Annotation" tab of each mouse CRE record

### GWAS Data Integration
- GWAS SNP data collected from [GWAS Catalog](https://www.ebi.ac.uk/gwas/home)
- Trait annotations used as searchable terms in the SNP search function
- CREs within 10kb of GWAS SNPs are associated with those variants
- Enables discovery of potential regulatory elements implicated in specific traits or diseases

## User Interface Guide

### Search Functionality
- **Basic Search**: Search by CRE ID, CRE Name, TFs, external identifiers, and organism
  - URL: `https://fanta.bio/search?q={QUERY}&organism={ORGANISM}`
  - Results displayed as a table of CREs, downloadable as CSV
  - Each CRE entry links to its detailed record page

- **UCSC Browser Search**: Search by keywords or genomic coordinates
  - URL: `https://genome-asia.ucsc.edu/cgi-bin/hgTracks?hubUrl=https://data.fanta.bio/hub/v1.1.0-2409/trackhub/hub.txt&genome={GENOME}&position={POSITION}`
  - Results displayed in UCSC Genome Browser interface
  - CRE visualization color-coded by type and directionality:
    - PLA: Red to blue indicating direction
    - ELA: Yellow

- **Advanced Search**: Combine various search parameters for more specific results
  - URL: `https://fanta.bio/search/cre-advanced?q={QUERY}&organism={ORGANISM}&bound_tf={TF_NAME}`
  - Supports advanced filtering with multiple parameters
  - Accepts exact matches, partial matches, wildcards, and logical combinations

- **Gene Neighbor Search**: Find CREs near specific genes (within 10kb)
  - URL: `https://fanta.bio/search/gene?q={GENE_NAME_OR_SYMBOL}&organism={ORGANISM}`
  - Target items: Gene name, Gene Symbol, Synonyms
  - Results listed by Gencode Gene IDs
  - Selecting a gene ID shows associated CREs with distance information
  - Distance = 0 indicates CRE is inside the gene region

- **GWAS SNP Search**: Locate CREs near SNPs associated with specific traits
  - URL: `https://fanta.bio/search/snp?q={TRAIT_OR_SNP_ID}&organism={ORGANISM}`
  - Accepts trait terms derived from GWAS Catalog annotations
  - Results shown as SNP list with ID, Organism, Position, Trait, and CRE Count
  - Selecting a SNP ID shows list of nearby CREs (within 10kb)

### CRE Record Pages
Each CRE record at `https://fanta.bio/cre/{CRE_ID}` provides detailed information organized in tabs:

1. **Annotation**: Basic information, genome coordinates, and nearest transcript data
   - Includes Ensembl transcript ID/RefSeq ID/GenBank accession
   - NCBI Gene ID, HGNC ID/MGI ID, UniProt ID
   - Gene Name/Symbol/Synonym from HGNC/MGI
   - Distance from detected TSS to 3'- or 5'-end of transcript
   - Overlap information with FANTOM5 CAGE peaks and refTSS
   - Information on overlap with FANTOM5 enhancers and SCREEN cCREs
   - Exact genomic coordinates in standard genomic format

2. **Bound TFs (ChIP-Atlas)**: Transcription factors experimentally shown to bind to the CRE region
   - Lists TFs with Max Qscore (-10 * Log10[MACS2 Q-value])
   - Includes experiment information (SRA ID) and Qscores for antigens
   - TF binding determined by 50% overlap (peak cutoff: Q-score > 1000)
   - A hidden tab to the left of each antigen name contains all experimental details
   - Search for specific TF binding with "Advanced" search feature

3. **Variations**: Genome variation data specific to the organism
   - Human: "TogoVar Variations" tab with detailed variation data from TogoVar
   - Mouse: Link to the corresponding region in MoG+ from the "Annotation" tab
   - Provides insights into potential functional impacts of genetic variation on CREs

4. **Expression Table**: CRE expression values across different samples
   - Shows normalized expression values (CPM, scaled by RLE method)
   - Lists expression levels in each cell or tissue type sample
   - Facilitates understanding of cell-dependent gene regulation
   - Helps identify cell-specific or tissue-specific regulatory activity

## CRE Naming and Identification Conventions

fanta.bio uses a systematic approach for naming and identifying CREs:

- **CRE ID Format**: `{genome_assembly}_cre_{numeric_id}`
  - Examples: `hg38_cre_12345`, `mm10_cre_67890`
  - Genome assembly: hg38 (human), mm10 (mouse)
  - Numeric ID: Unique identifier within the assembly

- **CRE Classifications**:
  - PLA (Promoter Level Activity): Likely functioning as promoters
  - ELA (Enhancer Level Activity): Likely functioning as enhancers

- **Directionality Scoring**:
  - Formula: (ForwardCounts - ReverseCounts) / (ForwardCounts + ReverseCounts)
  - Range: -1 (fully reverse) to +1 (fully forward)
  - Visual representation: Color spectrum from blue (-1) to red (+1)
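
The ID format above is regular enough to validate with a pattern. This is a minimal Python sketch; `parse_cre_id` and `cre_record_url` are hypothetical helpers, and the pattern accepts only the two assemblies listed here.

```python
import re

CRE_ID = re.compile(r"^(hg38|mm10)_cre_(\d+)$")
ORGANISM = {"hg38": "human", "mm10": "mouse"}


def parse_cre_id(cre_id):
    """Split a CRE ID into (assembly, organism, numeric id)."""
    match = CRE_ID.match(cre_id)
    if match is None:
        raise ValueError(f"not a CRE ID: {cre_id!r}")
    assembly, number = match.groups()
    return assembly, ORGANISM[assembly], int(number)


def cre_record_url(cre_id):
    """Build the record page URL after validating the ID."""
    parse_cre_id(cre_id)
    return f"https://fanta.bio/cre/{cre_id}"


print(parse_cre_id("hg38_cre_12345"))
print(cre_record_url("mm10_cre_67890"))
```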

## Combining Search Results
fanta.bio enables combining search results through URL parameters:
- Multiple parameters can be combined with the `&` character
- Example (human CREs with CTCF binding and TP53 in description): `https://fanta.bio/search/cre-advanced?q=TP53&organism=human&bound_tf=CTCF`
- Example (retrieving CSV format): `https://fanta.bio/search?q=Sox2&organism=mouse&format=csv`
- Parameters are processed using AND logic by default

### Advanced Search Syntax
The advanced search supports sophisticated query construction:
- Exact match: `"exact phrase"`
- Partial match: `partial`
- Wildcard: `gene*` (matches gene1, gene2, etc.)
- Logical AND: `term1 AND term2`
- Logical OR: `term1 OR term2`
- Logical NOT: `NOT term`
- Grouping: `(term1 OR term2) AND term3`
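
Queries that use this syntax contain spaces and parentheses, so they should be URL-encoded before being placed in the `q` parameter. This sketch uses standard form encoding and assumes the site decodes it in the usual way; `advanced_search_url` is a hypothetical helper and the query terms are illustrative only.

```python
from urllib.parse import urlencode

ADVANCED = "https://fanta.bio/search/cre-advanced"


def advanced_search_url(query="", organism="human", bound_tf=None):
    """Build an advanced search URL with an encoded query string."""
    params = {"q": query, "organism": organism}
    if bound_tf:
        params["bound_tf"] = bound_tf
    return f"{ADVANCED}?{urlencode(params)}"


print(advanced_search_url(bound_tf="CTCF"))
print(advanced_search_url("(TP53 OR MDM2) AND promoter"))
```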

## Data Usage Examples

### Identifying Cell-Specific Regulatory Elements
1. Search for a gene of interest using Gene Neighbor Search
2. Examine the expression patterns of associated CREs across cell types
3. Identify CREs with cell-specific expression patterns
4. Investigate TF binding patterns at these CREs

### Finding Potentially Disease-Associated Regulatory Variants
1. Use GWAS SNP Search to find SNPs associated with a disease/trait
2. Examine CREs near the identified SNPs
3. Investigate variation data within these CREs using the Variations tab
4. Analyze TF binding that might be affected by variations

### Studying Transcription Factor Regulatory Networks
1. Use Advanced Search to find CREs bound by a specific TF
2. Identify genes associated with these CREs
3. Examine expression patterns to determine when the TF is active
4. Find other TFs that co-bind to these regions

## Affiliations and Licensing

- fanta.bio is affiliated with [INTRARED](https://www.intrared.org/), serving as a member database
- All data is distributed under the [CC-BY 4.0 license](http://creativecommons.org/licenses/by/4.0/)
- Citation: Nobusada T et al., Update of the FANTOM web resource: enhancement for studying noncoding genomes. *Nucleic Acids Res.* **53** (D1), D419–D424 (2025). doi: [10.1093/nar/gkae1047](https://doi.org/10.1093/nar/gkae1047)

## Database Versioning

- Current version: v1.1.0 (released 2024)
- Track hub URL: https://data.fanta.bio/hub/v1.1.0-2409/trackhub/hub.txt
- Future updates will be announced on the fanta.bio website

## Team and Acknowledgements

The database is maintained by three collaborating laboratories:
- Laboratory for Large-Scale Biomedical Data Technology at [RIKEN IMS](https://www.ims.riken.jp/english/) (led by Dr. Kasukawa)
- Integrated Bioresource Information Division at [RIKEN BRC](https://web.brc.riken.jp/en/) (led by Dr. Masuya)
- Research Center for Genome & Medical Sciences at [TMiMS](https://www.igakuken.or.jp/english/) (led by Dr. Kawaji)

The project acknowledges contributions from:
- [ChIP-Atlas](https://chip-atlas.org/): A data-mining suite for exploring epigenomic landscapes
- [MoG+](https://molossinus.brc.riken.jp/): A database of genomic variations across mouse subspecies for biomedical research
- [TogoVar](https://togovar.org/): A comprehensive Japanese genetic variation database
- [UCSC Genome Browser](https://genome.ucsc.edu/) and [its Asian mirror](https://genome-asia.ucsc.edu/) for providing the genome browser interface

fanta.bio is supported by [JST](https://www.jst.go.jp/) [NBDC](https://biosciencedbc.jp/) Grant Number JPMJND2202 under the [Database Integration Coordination Program (DICP)](https://biosciencedbc.jp/en/funding/program/dicp/).

## Contact Information

For questions or assistance: help@fanta.bio

## Key References

- Mitsuhashi N, Toyo-Oka L, Katayama T, Kawashima M, Kawashima S, Miyazaki K, Takagi T. TogoVar: A comprehensive Japanese genetic variation database. Hum Genome Var. 2022 Dec 12;9(1):44. doi: 10.1038/s41439-022-00222-9. PMID: 36509753; PMCID: PMC9744889.
- Takada T, Fukuta K, Usuda D, Kushida T, Kondo S, Kawamoto S, Yoshiki A, Obata Y, Fujiyama A, Toyoda A, Noguchi H, Shiroishi T, Masuya H. MoG+: a database of genomic variations across three mouse subspecies for biomedical research. Mamm Genome. 2022 Mar;33(1):31-43. doi: 10.1007/s00335-021-09933-w. Epub 2021 Nov 15. PMID: 34782917; PMCID: PMC8913468.
- Zou Z, Ohta T, Miura F, Oki S. ChIP-Atlas 2021 update: a data-mining suite for exploring epigenomic landscapes by fully integrating ChIP-seq, ATAC-seq and Bisulfite-seq data. Nucleic Acids Res. 2022 Jul 5;50(W1):W175-W182. doi: 10.1093/nar/gkac199. PMID: 35325188; PMCID: PMC9252733.
- Abugessaisa I, Ramilowski JA, Lizio M, Severin J, Hasegawa A, Harshbarger J, Kondo A, Noguchi S, Yip CW, Ooi JLC, et al. FANTOM enters 20th year: expansion of transcriptomic atlases and functional annotation of non-coding RNAs. Nucleic Acids Res. 2021 Jan 8;49(D1):D892-D898. doi: 10.1093/nar/gkaa1054. PMID: 33211864; PMCID: PMC7779024.
- Andersson R, Gebhard C, Miguel-Escalada I, Hoof I, Bornholdt J, Boyd M, Chen Y, Zhao X, Schmidl C, Suzuki T, et al. An atlas of active enhancers across human cell types and tissues. Nature. 2014 Mar 27;507(7493):455-461. doi: 10.1038/nature12787. PMID: 24670763; PMCID: PMC5215096.
- Forrest AR, Kawaji H, Rehli M, Baillie JK, de Hoon MJ, Haberle V, Lassmann T, Kulakovskiy IV, Lizio M, Itoh M, et al. A promoter-level mammalian expression atlas. Nature. 2014 Mar 27;507(7493):462-70. doi: 10.1038/nature13182. PMID: 24670764; PMCID: PMC4529748.
- Anders S, McCarthy DJ, Chen Y, Okoniewski M, Smyth GK, Huber W, Robinson MD. Count-based differential expression analysis of RNA sequencing data using R and Bioconductor. Nat Protoc. 2013 Sep;8(9):1765-86. doi: 10.1038/nprot.2013.099. PMID: 23975260.