Frequently Asked Questions

Question: Which version of the human genome was used to construct the alternative splicing database?
Answer: SpliceMiner was converted over to the database built for our newer suite of splicing utilities, SpliceCenter. The latest database uses human genome build 37.1, version 41 of RefSeq, and version 178 of GenBank transcripts. See here for a description of the database contents and links to documents describing the database build process in detail.

Question: For which organisms are splice variants available?
Answer: We support nine model organisms including human, mouse, and rat.

Question: How does the interactive query screen work?
Answer: The interactive query page is designed to support manual investigation of specific genes or loci. Enter a gene symbol, a genomic start/stop position, or a probe sequence, and set the radio button to indicate the type of query. The results are displayed on the same page (you may need to scroll down). Each unique splice variant of the gene is displayed. Variants are identified by accession number, and you may click an accession number to jump to the NCBI transcript record for that variant. Exons are drawn as thick boxes, to scale by length. Introns are drawn as thin connections between exons and are NOT representative of genomic length. Occasionally a small break appears in an exon where there is a gap or mismatch in the alignment of the exon to the genome. Dark blue exon sections are coding regions and light blue sections are untranslated regions (3' and 5' UTRs). Splice variants drawn in yellow are predicted nonsense-mediated decay targets. The last line in the graphic shows a composite of all known splice variants, which reveals the subexon structure of the gene.



If the query is a genomic coordinate query or a probe sequence query, a red line will indicate the position of the probe or coordinates. It is possible for coordinate queries to return multiple genes but the system will limit the results to the first 10 genes in the specified coordinate location.

Question: Why do some exons have a small break in them?
Answer: Occasionally a transcript will not align exactly with the genome. If there is a gap or a mismatch in the alignment, a thin break will appear in an exon.

Question: What is indicated by the different color tones in exons?
Answer: The lighter sections of exons are UTRs and the darker sections are the protein coding regions.

Question: Why are some transcripts shown in a different color scheme?
Answer: SpliceMiner identifies transcripts that are likely to be targeted for nonsense-mediated decay (NMD). Transcripts with a stop codon more than 50 bases upstream of the 3'-most exon splice site are likely to be NMD targets. Transcripts that meet this criterion are drawn in an alternate color scheme (yellow in the interactive display).
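
The 50-base rule described in this answer can be sketched in Python as follows. This is an illustrative check only, not SpliceMiner's actual implementation. The function name, the use of 1-based transcript coordinates, and measuring from the last base of the stop codon are all assumptions made for the example.

```python
def is_nmd_candidate(exon_lengths, stop_codon_end, min_distance=50):
    """Return True if the stop codon lies more than min_distance bases
    upstream of the 3'-most exon-exon junction.

    exon_lengths   -- exon lengths in transcript order (5' to 3')
    stop_codon_end -- 1-based transcript position of the last base of
                      the stop codon (an assumed convention)
    """
    if len(exon_lengths) < 2:
        # A single-exon transcript has no splice junction, so the rule
        # cannot apply.
        return False
    # Transcript position of the last base before the final exon,
    # i.e. the 3'-most exon-exon junction.
    last_junction = sum(exon_lengths[:-1])
    return last_junction - stop_codon_end > min_distance
```

For a transcript with exons of 100, 200, and 150 bases, the last junction falls at transcript position 300, so a stop codon ending at position 200 would be flagged and one ending at position 280 would not.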

Question: What gene symbols are used by SpliceMiner?
Answer: EVDB uses HGNC symbols for human genes. The SpliceMiner interactive and batch queries also support common aliases for genes. For a more complete set of tools that enables the user to translate between disparate ids for the same gene, please use our MatchMiner resource.

Question: Why don't the links on the interactive results page work for me?
Answer: The links to NCBI information on transcripts open in a new browser window. If your browser is blocking pop-ups, you may need to allow pop-ups from our site or try ctrl-click. This is also true for the Gene List link above.

Question: How does the batch query interface work?
Answer: The batch query page is designed for high-volume queries and, like the interactive page, supports gene symbol, genomic coordinate, or probe sequence queries. Users may upload a file containing queries or may cut/paste queries into the text area on the page. Remember to set the radio button to indicate the type of queries that you are submitting and the method of submission (file vs. text area). If the batch has 20 items or fewer, the results will be returned to the browser. If the batch has more than 20 items, it will be processed as a background task and you will be notified by email when the results are available. The notification email will contain a link for downloading results. The results will be available for 3 days, after which they will be deleted.

The results from a batch query are designed for automated processing. The results file is a tab delimited text file. A header line indicates the contents of each column. Each row contains a query identifier that may be used to associate results with the original queries. For gene symbol queries, all of the splice variants and each exon of the variants will be returned. For genomic coordinate or probe sequence queries, only the specific variant/exon associated with the probe or coordinate range will be returned.
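
As a minimal sketch of the automated processing described above, the Python function below reads a tab-delimited results file and groups rows by query identifier. It assumes the query identifier is in the first column (as in the Perl sample further down this page). The function name and the column names used in the usage note are illustrative, not taken from an actual results file.

```python
import csv
import io
from collections import defaultdict


def group_results_by_query(tsv_text, query_column=None):
    """Parse tab-delimited batch results and group rows by query id.

    The first line is treated as the header line. If query_column is
    None, the first column is assumed to hold the query identifier.
    Returns a dict mapping query id -> list of row dicts.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    if not reader.fieldnames:
        # Empty input: no header line, so there is nothing to group.
        return {}
    key = query_column or reader.fieldnames[0]
    grouped = defaultdict(list)
    for row in reader:
        grouped[row[key]].append(row)
    return dict(grouped)
```

Given text whose header is "Query", "Gene", "Exon" and which has two rows for ACP1 and one for BRCA1, the function returns a dictionary with two keys, holding two rows and one row respectively.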

For probe sequence queries, the results indicate the location of exons (Exon Start / Exon Stop) and the location of the probe (Probe Start / Probe Stop) in both genomic and transcript coordinates. There will often be multiple rows returned for a given probe because probes are usually positioned on exons that are present in several variants. In the example above, probe 36593_at:187:625 aligns with exon 17 of the EXT2 gene. Two variants, U64511 and NM_207122, include this exon, so two result rows are returned for this probe. Sometimes multiple result rows are returned because the probe is a perfect match to more than one gene. In the example above, probe 1576_g_at:394:551 aligns with exon 31 of gene ABCB4 and exon 28 of gene ABCB1.

Some probes may span an exon-exon boundary ('junction' probes). These probes can be identified by the same number appearing two or more times in the 'Exon Set' column. The 'Exon Set' column provides a unique identifier for each match between a probe and a transcript. Multiple records with the same Exon Set number indicate that a probe crosses an exon boundary. In these cases two records are returned, one for each exon that matches a portion of the probe. In the example above, the probe tst_junction_p1 crosses between exons 4 and 5 of ABL1. When probes cross exon boundaries, the Probe Start (Chr) and Probe Stop (Chr) values can be used to determine the position of the probe within each exon.

Probes with matches to multiple genes are flagged with a value of 1 in the Degen Flag column and may be excluded entirely from query results by selecting the 'Non-Degenerate (Non-Cross Hybridizing) Matches Only' filter on the batch query screen.
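
The two checks described above (repeated Exon Set numbers for junction probes, and the Degen Flag for multi-gene matches) could be applied to parsed result rows roughly as follows. This is a sketch under assumptions: rows are dictionaries keyed by column header, the 'Exon Set' and 'Degen Flag' names come from the text above, and the 'Query' column name and both function names are hypothetical. Exon Set numbers are paired with the query id in case they are not unique across a whole file.

```python
from collections import Counter


def junction_exon_sets(rows, query_key="Query", exon_set_key="Exon Set"):
    """Return sorted (query id, exon set) pairs that occur in two or
    more rows, which marks a probe crossing an exon boundary."""
    counts = Counter((row[query_key], row[exon_set_key]) for row in rows)
    return sorted(pair for pair, n in counts.items() if n >= 2)


def non_degenerate(rows, flag_key="Degen Flag"):
    """Drop rows whose Degen Flag is 1 (probe matches several genes)."""
    return [row for row in rows if str(row[flag_key]).strip() != "1"]
```

Filtering after download, as in non_degenerate, is an alternative to selecting the filter on the batch query screen. It keeps the flagged rows available in the original file.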

Question: What are the file formats for batch queries?
Answer: Upload files can be plain text or a single plain text file in a zip file. Zip files must have a .zip file extension.
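
A zip upload that satisfies the rules above (one plain text file inside, .zip extension) can be prepared with Python's standard zipfile module. This is a minimal sketch. The function name is hypothetical, and the strict lower-case check on the extension is an assumption about how the server tests file names.

```python
import os
import zipfile


def zip_single_query_file(text_path, zip_path):
    """Create zip_path containing text_path as its only member."""
    if not zip_path.endswith(".zip"):
        raise ValueError("zip file name must end in .zip")
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # arcname drops directory components so the archive holds a
        # single flat file rather than a nested path.
        archive.write(text_path, arcname=os.path.basename(text_path))
    return zip_path
```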

Each type of batch query has its own file format. For batch gene queries, the file should have a single gene symbol per line (HGNC symbols for human genes). For example:

ACP1
ACP2
ACP5
ACP6
BRCA1

Here is a sample gene batch file: genebatch.txt

For genomic coordinate queries, each line should have Chromosome Start_Position Stop_Position. The values may be separated by spaces or tabs. For example:

2   254943     254953
19 1154942   1154960
1  144343986 144343999
11 47222862  47222882

Here is a sample coordinate batch file: posbatch.txt
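
Before uploading a coordinate batch, each line can be checked against the three-field layout described above. The Python sketch below is one way to do that. The function name is hypothetical, and rejecting lines where the start exceeds the stop is an assumption, since the expected ordering is not stated here.

```python
def parse_coordinate_line(line):
    """Split 'Chromosome Start_Position Stop_Position' into a tuple
    (chromosome, start, stop) with integer positions."""
    fields = line.split()  # splits on any run of spaces or tabs
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    chrom, start, stop = fields[0], int(fields[1]), int(fields[2])
    if start > stop:
        # Assumed rule: the start position should not exceed the stop.
        raise ValueError(f"start is greater than stop: {line!r}")
    return chrom, start, stop
```

The chromosome is kept as a string so that names such as X or Y pass through unchanged.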

For sequence queries, the file should be in FASTA format. The FASTA headers can be Affymetrix style or have a unique probe id as the first word in the header. For example:

>probe:HG-U95E:67024_at:215:9; Interrogation_Position=45; Antisense;
TTTGATTTATCACATTTCTGGAGCA
>probe:HG-U95E:67024_at:40:19; Interrogation_Position=47; Antisense;
TGATTTATCACATTTCTGGAGCAAG

OR

>1537:530:351
GTTTTCAGGAAACTTGTAACCGATC
>1537:92:313
TCACAGGAGCTGCTCTCATGGACAA

Here is a sample sequence batch file: probe.txt
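
To match probe results back to the original FASTA file, it can help to read the probe ids out of the headers. The Python sketch below handles the two header styles shown above. How the server derives an id from an Affymetrix-style header is not documented here. Taking the probe set name and the two numbers that follow it (for example 67024_at:215:9) is an assumption, based on the probe ids quoted in the batch results discussion. Both function names are hypothetical.

```python
def header_to_probe_id(header):
    """Derive a probe id from a FASTA header line (leading '>' included)."""
    words = header[1:].split()
    first_word = words[0].rstrip(";") if words else ""
    if first_word.startswith("probe:"):
        # Affymetrix style: probe:<array>:<probe set>:<number>:<number>
        # Assumed id: everything after the array name.
        return ":".join(first_word.split(":")[2:])
    # Otherwise the first word of the header is the probe id.
    return first_word


def read_probe_fasta(text):
    """Return a list of (probe_id, sequence) pairs from FASTA text."""
    records = []
    probe_id, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if probe_id is not None:
                records.append((probe_id, "".join(chunks)))
            probe_id, chunks = header_to_probe_id(line), []
        else:
            chunks.append(line)
    if probe_id is not None:
        records.append((probe_id, "".join(chunks)))
    return records
```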

For protein queries, the file should be in FASTA format. The FASTA headers should have a unique sequence id as the first word in the header. For example:

>Prot_Seq_1
SVLFVCLGNIC
>Prot_Seq_2
qshgssacsqphgsvtqsqg

Question: On the batch query form, what is meant by 'Remove Degenerate (Cross Hybridizing) Matches'?
Answer: This filter can be used on batch sequence searches to filter out results for probes that match more than one gene. If a sequence has a perfect match to two or more distinct genes and this filter is selected, all results for the probe will be excluded.

Question: On the Batch Query, what is meant by Result Type of Splice Variants or Exon / Subexons?
Answer: The default batch query returns results for each splice variant that matches the query. See the "How does the batch query interface work?" question above for an example of the batch Splice Variant results. If, for example, you submit microarray probe sequences, the results will show each splice variant targeted by the probe.

The other type of result that can be returned by the batch query utility is a single exon/subexon match for each query. Instead of returning one row for each variant matched by the query, a single row is returned that indicates the exon targeted by the query. If, for example, the query is a set of microarray probes, the results will show the gene and exon targeted by each probe. Our database also includes a subexon designation for exons that have multiple isoforms. Some exons have alternate acceptor/donor sites, and 3' exons can have multiple poly-A sites. Subexons are designated with incrementing decimals (1.1, 1.2, 1.3, etc.). Subexon mapping allows investigators to group probes into sets that target the smallest splice units identified in known splice variants.
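
One way to picture subexons is to cut the overlapping isoforms of an exon at every distinct start and stop, which yields the smallest units shared or not shared between variants. The Python sketch below illustrates that idea only. It is not the database build procedure. The function name, inclusive coordinates, plus-strand numbering, and labelling a single-isoform exon as N.1 are all assumptions.

```python
def split_into_subexons(exon_number, isoforms):
    """Split overlapping isoforms of one exon into labelled subexons.

    isoforms -- list of (start, stop) pairs, inclusive coordinates
    Returns a list of (label, start, stop) tuples such as ("3.1", 100, 200).
    """
    # Each isoform start opens a segment; the base after each stop does too.
    cuts = sorted({s for s, _ in isoforms} | {e + 1 for _, e in isoforms})
    segments = []
    for left, right in zip(cuts, cuts[1:]):
        seg_start, seg_stop = left, right - 1
        # Keep only segments that lie inside at least one isoform.
        if any(s <= seg_start and seg_stop <= e for s, e in isoforms):
            segments.append((seg_start, seg_stop))
    return [
        (f"{exon_number}.{i}", start, stop)
        for i, (start, stop) in enumerate(segments, start=1)
    ]
```

For an exon 3 with isoforms spanning 100-200 and 100-250 (an alternate donor site), this yields subexon 3.1 at 100-200 and subexon 3.2 at 201-250, so probes in the 201-250 region can be grouped separately from those in the shared region.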

Question: What corrections were made to the NCBI data?
Answer: Probe queries in SpliceMiner require an exact mapping of transcripts to chromosomal coordinates. In a small percentage of NCBI records, there is an internal inconsistency between the genomic and transcript coordinates. This most often occurs when there are insertions or deletions in the transcript relative to the genome sequence. In cases where the length of an exon in transcript coordinates was not equal to the length of the exon in chromosomal coordinates, a refinement was performed by re-aligning the exon to the genomic sequence using BLAT. Exon fragment records were then created to indicate the exact alignment of each section of the transcript to the genome. The corrected exon records enable SpliceMiner to accurately determine the chromosomal position of probe sequences. Corrected exon records are identified by a 1 in the Coord-Correction column of batch results.
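
The inconsistency test described above (exon length in transcript coordinates versus exon length in chromosomal coordinates) amounts to a simple comparison. The Python sketch below shows it under the assumption that both coordinate pairs are inclusive. The function name is hypothetical, and the BLAT re-alignment step itself is not shown.

```python
def needs_coordinate_correction(chr_start, chr_stop, trans_start, trans_stop):
    """Return True when an exon's length differs between chromosomal
    and transcript coordinates (both ranges assumed inclusive)."""
    # abs() allows chromosomal coordinates listed in either order,
    # as may happen for minus-strand genes.
    chrom_length = abs(chr_stop - chr_start) + 1
    transcript_length = trans_stop - trans_start + 1
    return chrom_length != transcript_length
```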

Question: How can I integrate SpliceMiner into an automated pipeline?
Answer: Web requests can be made programmatically to retrieve splice variant data. Here is a sample program that demonstrates SpliceMiner integration (download script):

#!/usr/bin/perl -w

# File: evvsamp.pl
# This is a sample program that shows how to make calls to the SpliceMiner
# webserver in order to integrate splice variant information into
# a genomic data processing pipeline.

# Author: Michael Ryan
# Date: March 24, 2006

use strict;
use warnings;
use LWP::Simple;

# URL is set to the web address of the SpliceMiner Website
# query type can be: 'Gene' or 'chrom_position' or 'Probe'
# value would then be the gene symbol, chromosome coordinates or probes sequence
# For example:
# 'Gene' 'ACP1' OR
# 'chrom_position' '2 254956 254966' OR
# 'Probe' 'GCGCAGAGGCGCCGAGACACCGCGGCGTTC'
my $URL = 'http://discover.nci.nih.gov/spliceminer/Batch?';
my $queryType = 'Gene';
my $value = 'ACP1';
my $organism = '9606';

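# Execute the query. LWP::Simple's get returns undef if the request fails.
my $page = get($URL . 'queryBy=' . $queryType . '&organism=' . $organism . '&text=' . $value);
die "No response from $URL\n" unless defined $page;

# Parse the results. Skip the header line.
my @lines = split(/\n/,$page);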
for (my $i = 1; $i < @lines; $i++)
{
   # Results are tab-delimited
   my @columns = split(/\t/,$lines[$i]);
   my $query = $columns[0];
   my $gene_symbol = $columns[1];
   my $chrom = $columns[2];
   my $strand = $columns[3];
   my $accession = $columns[4];
   my $exon_num = $columns[5];
   my $chr_start = $columns[6];
   my $chr_stop = $columns[7];
   my $trans_start = $columns[8];
   my $trans_stop = $columns[9];
   print $query . "|" . $gene_symbol . "|" . $exon_num . "|";
   print $chrom . "|" . $strand . "|" . $chr_start . "|";
   print $chr_stop . "|" . $accession . "|" . $trans_start . "|";
   print $trans_stop . "\n";
}
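
For pipelines written in Python, the same request can be sketched as follows. The base URL and the parameter names (queryBy, organism, text) are taken from the Perl sample above. The function names, the timeout, and the UTF-8 decoding are assumptions, and the server address may have changed since the sample was written.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address copied from the Perl sample above.
BASE_URL = "http://discover.nci.nih.gov/spliceminer/Batch"


def build_query_url(query_type, value, organism="9606"):
    """Build the request URL; urlencode escapes spaces and other
    characters that are not safe in a query string."""
    params = {"queryBy": query_type, "organism": organism, "text": value}
    return BASE_URL + "?" + urlencode(params)


def fetch_results(url, timeout=30):
    """Download the result page as text (UTF-8 decoding is assumed)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_batch_results(page):
    """Split tab-delimited result text into lists of column values,
    skipping the header line and any blank lines."""
    lines = page.splitlines()
    return [line.split("\t") for line in lines[1:] if line.strip()]
```

A typical call would be parse_batch_results(fetch_results(build_query_url('Gene', 'ACP1'))). Each returned row follows the column order used in the Perl sample, starting with the query identifier.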

Question: What is the probe coverage report?
Answer: The probe coverage report is a microarray-specific application of the EVDB data. A probe sequence file in FASTA format is required to generate the report. The report has two sections for each gene covered by the chip. The first shows which exons are covered by probes and the second shows the probe coverage for each splice variant.

The information in the coverage report could be used to enhance probe level expression analysis. It can also be used to identify chip design issues such as variants that are missed by the probes. The flags column indicates probe issues. Probes with 'H' in the flags column are at risk of cross hybridization because they match more than one gene. Probes with 'S' in the flags column cross a splice junction. In the second section, 'P' indicates that the exon is covered by a probe. '-' indicates that the variant does not contain the exon.
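
The flag and coverage symbols described above can be reproduced from parsed results with a few lines of Python. This is a sketch of the logic only, not the report generator. The function names are hypothetical, and the '.' marker for an exon that is present in a variant but has no probe is an assumption, since the report's symbol for that case is not described here.

```python
def probe_flags(gene_count, crosses_junction):
    """Build the flags string for one probe: 'H' if it matches more
    than one gene, 'S' if it crosses a splice junction."""
    flags = ""
    if gene_count > 1:
        flags += "H"
    if crosses_junction:
        flags += "S"
    return flags


def coverage_row(variant_exons, probed_exons, all_exons, uncovered="."):
    """Return one coverage string for a splice variant.

    'P' -- exon is in the variant and covered by a probe
    '-' -- variant does not contain the exon
    uncovered -- exon is in the variant but has no probe (assumed marker)
    """
    cells = []
    for exon in all_exons:
        if exon not in variant_exons:
            cells.append("-")
        elif exon in probed_exons:
            cells.append("P")
        else:
            cells.append(uncovered)
    return "".join(cells)
```

A variant whose row contains no 'P' at all is one that the chip's probes miss entirely, which is the kind of design issue the report is meant to expose.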


We would like to hear from you. You can reach the team via email.

SpliceMiner and EVDB were originally developed jointly by the Genomics and Bioinformatics Group (GBG) of LMP, NCI, NIH and George Mason University, Department of Bioinformatics and Computational Biology. It is now maintained and under continuing development by GBG.
