CRAIG Web Server Documentation
  1. Introduction and Basic Options
  2. Sequence format
  3. Handling masked input
  4. Output from CRAIG

Introduction and Basic Options

This is a brief user's manual for the CRAIG web server. The basic options for the program are:
  1. The organism model. Human is currently the only model available, but it should work reasonably well for sequences from other vertebrates, as shown in some of the predictions.
  2. The strand on which prediction is performed (default: both strands simultaneously).
  3. The type of genes allowed at prediction time, which is controlled by the gene model used: either partial or complete (default: partial).

There are two ways to input your sequences for prediction: you can upload a local file containing multiple sequences in FASTA format, or paste a single sequence into the window.

The execution time of the program is roughly linear in the sequence length.

Sequence format

DNA sequences must be in FASTA format, as in this example:
>U77349 ignored text
GAATTCCAAGAATGTTAAGGAAATGGTCGCAGAGTTGACAAGTTGTGACTTCTGTTAGGAATAAAAGAAAAGTGATGGTC
ACAGGGGGTCAAGAAGATGACTATAAAGGAGAAACCAAAGACAGGGATGCTGCTTAAATGGAGCAAGGCTAATTGAATTA
AGGAATTTGGCATTTGCATTACAAGTAATCATTTTGTTCTCTGTCCACAGAATCAAAGGAAATGGAAGACAGTAATATGT
TACCTCAGTTCATCCATGGCATACTATCAACATCTCATTCTCTATTTCCAAGAAGTATCCAAGAGCTTGATGAGGGGGCC
ACCACACCGTATGACTATGATGATGGTGAACCTTGTCATAAAACCAGTGTGAAGCAAATTGGAGCTTGGATCCTGCCCCC
ACTCTACTCCCTGGTATTCATCTTTGGTTTTGTGGGCAACATGTTGGTCATTATAATTCTGATAAGCTGTAAAAAGCTGA
AGAGCATGACTGATATCTACCTGTTCAACCTGGCCATCTCTGACCTGCTCTTCCTGCTCACACTCCCATTCTGGGCTCAC
TATGCTGCAAATGAGTGGGTCTTTGGGAATATAATGTGCAAATTATTCACAGGGCTTTATCACATTGGGTATTTTGGTGG
AATCTTCTTCATTATCCTCCTGACAATTGATAGATATTTGGCTATTGTCCATGCTGTCTTTGCTTTAAAAGCCAGGACAG
TTACCTTTGGGGTAATAACAAGTGTAGTCACTTGGGTGGTGGCTGTGTTTGCCTCTCTACCAGGAATCATATTTACTAAA
TCTGAACAAGAAGATGATCAGCATACTTGTGGCCCTTATTTTCCAACAATCTGGAAGAATTTCCAAACAATAATGAGGAA
TATCTTGAGTTTGATCCTGCCCCTACTTGTCATGGTCATCTGCTACTCAGGAATCCTCCACACCCTGTTTCGCTGTAGGA
ATGAGAAAAAGAGGCATAGGGCTGTGAGGCTCATCTTTGCCATCATGATTGTCTACTTTCTCTTCTGGACTCCATACAAT
ATTGTTCTCTTCCTGACCACCTTCCAGGAATTCTTGGGAATGAGTAACTGTGTGGTTGACATGCACTTAGACCAGGCCAT
GCAGGTGACAGAGACTCTTGGAATGACACACTGCTGCGTTAATCCTATCATTTATGCCTTTGTTGGTGAGAAGTTCCGAA
GGTATCTCTCCATATTTTTCAGAAAGCACATTGCCAAAAATCTCTGCAAACAATGCCCAGTTTTCTATAGGGAGACAGCA
GACCGAGTGAGCTCAACATTTACCCCTTCTACTGGGGAGCAAGAAGTCTCAGTTGGGTTGTAAAGTAAGTAGCAGTCCCC
CTTTT

Letters can be upper or lower case. Spaces and other non-letter characters in the sequence are ignored. The letter U (u) is translated to T (t). Allowed letters are A, T, G, C, M, R, W, S, Y, K, B, D, H, V, N and X, and their lower-case equivalents. When pasting a sequence into the window, the maximum allowed length is 200 Kb.
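
The rules above can be sketched in a few lines of Python, for example to check a sequence locally before submitting it. This is only an illustration, not the server's own code: the names normalize_sequence and fits_paste_limit are hypothetical, reading 200 Kb as 200,000 bases is an assumption, and rejecting disallowed letters with an error is also an assumption, since the manual does not say how the server treats them. Case is preserved.

```python
# Letters accepted by the server, per the list above (case-insensitive).
ALLOWED = set("ATGCMRWSYKBDHVNX")

# Assumption: "200 Kb" is read as 200,000 bases.
MAX_PASTED_LENGTH = 200_000


def normalize_sequence(raw: str) -> str:
    """Apply the stated rules to raw sequence text (FASTA header excluded)."""
    out = []
    for ch in raw:
        if not ch.isalpha():
            continue  # spaces, digits and other non-letters are ignored
        if ch == "U":
            ch = "T"  # U is translated to T
        elif ch == "u":
            ch = "t"  # u is translated to t
        if ch.upper() not in ALLOWED:
            # Assumption: the manual does not say how disallowed letters
            # are handled, so this sketch simply rejects them.
            raise ValueError(f"letter not allowed: {ch!r}")
        out.append(ch)
    return "".join(out)


def fits_paste_limit(seq: str) -> bool:
    """True if the sequence is short enough to paste into the window."""
    return len(seq) <= MAX_PASTED_LENGTH
```

For example, normalize_sequence("acgu ACGU 12") drops the spaces and digits and returns "acgtACGT".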

Handling masked input

When using masked sequences, make sure that masked regions are in lower case; most maskers offer this type of output as an option (for example, the -xsmall option of RepeatMasker). The reason for this requirement is that CRAIG uses masking content as an additional feature related to exonic segments only, and needs the original sequence for prediction without any post-processing.
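
As a quick sanity check before submitting, one might confirm that a sequence really is masked in lower case by listing its lower-case runs. The sketch below is a hypothetical helper, not part of CRAIG; the function names and the 1-based inclusive coordinates are choices made for this illustration.

```python
import re
from typing import List, Tuple


def masked_intervals(seq: str) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (start, end) pairs, one per lower-case run."""
    return [(m.start() + 1, m.end()) for m in re.finditer(r"[a-z]+", seq)]


def masked_fraction(seq: str) -> float:
    """Fraction of the sequence that is lower-case (masked)."""
    if not seq:
        return 0.0
    return sum(1 for ch in seq if ch.islower()) / len(seq)
```

A sequence with no lower-case runs at all, or one that is entirely lower case, is a sign that the masker output is not in the expected form.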

Output from CRAIG

The output of the program is in GTF format, an extended version of the GFF format.

The following is an example of CRAIG's output:

MUSREGII        CRAIG   start_codon     863     865     .       +       0     gene_id "PRED.MUSREGII.0-001"; transcript_id "PRED.MUSREGII.0-001.0";
MUSREGII        CRAIG   CDS     863     926     .       +       0       gene_id "PRED.MUSREGII.0-001"; transcript_id "PRED.MUSREGII.0-001.0";
MUSREGII        CRAIG   CDS     1486    1625    .       +       2       gene_id "PRED.MUSREGII.0-001"; transcript_id "PRED.MUSREGII.0-001.0";
MUSREGII        CRAIG   CDS     1850    1987    .       +       0       gene_id "PRED.MUSREGII.0-001"; transcript_id "PRED.MUSREGII.0-001.0";
MUSREGII        CRAIG   CDS     3281    3355    .       +       0       gene_id "PRED.MUSREGII.0-001"; transcript_id "PRED.MUSREGII.0-001.0";
MUSREGII        CRAIG   stop_codon      3356    3358    .       +       0      gene_id "PRED.MUSREGII.0-001"; transcript_id "PRED.MUSREGII.0-001.0";
Each line is tab-separated and the columns are defined in the following way:
  1. Sequence identifier
  2. Program name
  3. Prediction : Either CDS, start_codon or stop_codon.
  4. Beginning
  5. End
  6. Score : between 0 and 1.
  7. Strand : + for direct and - for complementary
  8. Frame : for exons only, it is the position of the donor in the frame.
  9. gene_id "id"; : unique gene identifier.
  10. transcript_id "id"; : unique transcript identifier.
The program also outputs some comment lines, which are preceded by '#'.
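
The column layout above can be read with a short Python sketch such as the following. It is a minimal illustration, not an official parser: the names Feature and parse_gtf are hypothetical, and it assumes the first eight columns are tab-separated, that a score of "." means no score is reported, and that the gene and transcript identifiers follow as key "value"; pairs.

```python
import re
from typing import Dict, Iterator, NamedTuple, Optional


class Feature(NamedTuple):
    seqid: str
    source: str
    feature: str            # CDS, start_codon or stop_codon
    start: int
    end: int
    score: Optional[float]  # None when the column holds "."
    strand: str
    frame: str
    attributes: Dict[str, str]


# Matches pairs such as: gene_id "PRED.MUSREGII.0-001";
ATTRIBUTE = re.compile(r'(\S+)\s+"([^"]*)"\s*;')


def parse_gtf(text: str) -> Iterator[Feature]:
    """Yield one Feature per prediction line, skipping '#' comment lines."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split off the first eight columns; the rest holds the identifiers.
        fields = [f.strip() for f in line.split("\t", 8)]
        if len(fields) != 9:
            raise ValueError(f"expected tab-separated columns: {line!r}")
        seqid, source, feature, start, end, score, strand, frame, attrs = fields
        yield Feature(
            seqid, source, feature, int(start), int(end),
            None if score == "." else float(score),
            strand, frame, dict(ATTRIBUTE.findall(attrs)),
        )
```

Grouping the resulting features by their transcript_id attribute gives the CDS segments of each predicted transcript.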