Format of RTPS files (2004-9-28)

Terms

[TU (Transcriptional Unit)]
Genomic (discontinuous) regions from which one mature mRNA is derived. If multiple transcripts overlap on the genome, their TUs are merged into one.
[TK (Transcript frameworK)]
A group of transcripts that map to the same genomic region, lie on the same strand, and share some exon structure. Members may have different transcriptional start sites (TSS) and polyadenylation signals (PAS).
[RTS (Representative Transcript Set)]
A set of representative transcripts that are selected from every TK. The number of transcripts in the RTS equals the number of TKs.
[RPS (Representative Protein Set)]
A set of representative proteins that are selected from every TK with translated sequences
[SP (Splicing Pattern) clusters]
Grouped transcripts based on splicing patterns
[IT (Identical Transcript) clusters]
Grouped transcripts that are identical
[IP (Identical Protein) clusters]
Grouped proteins that are identical
[VTS (Variant-based representative Transcript Set)]
A set of transcripts that are selected from every SP cluster
[VPS (Variant-based representative Protein Set)]
A set of proteins that are selected from every SP cluster that has translated sequences
[ITS (representative Identical Transcript Set)]
A set of transcripts that are selected from every IT cluster
[IPS (representative Identical Protein Set)]
A set of proteins that are selected from every IP cluster
[CTS (Consensus Transcript Set)]
A set of consensus transcripts that are generated from genomic sequences in TK

ID format

[TU ID]
integer
[TK ID]
integer
[SP ID]
(TK ID) '.' integer
[IT ID]
(TK ID) '.' integer
[IP ID]
(TK ID) '.' integer
[RTS ID]
'T' ('A' or 'B') (TK ID)
[RPS ID]
'PA' (TK ID)
[VTS ID]
'T' ('A' or 'B') (IT ID)
[VPS ID]
'PA' (IP ID)
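
The ID grammar above can be checked mechanically. The following is a minimal sketch, not part of the RTPS distribution: the function name `tk_of` and the example IDs are hypothetical, and the patterns only encode what the list above states (the meaning of 'A' versus 'B' is not interpreted).

```python
import re

# SP / IT / IP ID: (TK ID) '.' integer
CLUSTER_ID = re.compile(r"^(?P<tk>\d+)\.(?P<n>\d+)$")
# RTS / VTS ID: 'T' ('A' or 'B') followed by a TK ID or a dotted cluster ID
TRANSCRIPT_SET_ID = re.compile(r"^T(?P<kind>[AB])(?P<rest>\d+(?:\.\d+)?)$")
# RPS / VPS ID: 'PA' followed by a TK ID or a dotted cluster ID
PROTEIN_SET_ID = re.compile(r"^PA(?P<rest>\d+(?:\.\d+)?)$")


def tk_of(identifier: str) -> int:
    """Return the TK ID embedded in an SP/IT/IP, RTS/VTS, or RPS/VPS ID."""
    for pattern in (CLUSTER_ID, TRANSCRIPT_SET_ID, PROTEIN_SET_ID):
        m = pattern.match(identifier)
        if m:
            groups = m.groupdict()
            body = groups.get("tk") or groups["rest"]
            # The TK ID is always the integer before the first '.'
            return int(body.split(".")[0])
    raise ValueError(f"unrecognized ID: {identifier!r}")
```

For example, `tk_of("TB1234.3")` and `tk_of("PA1234")` both yield 1234 under these assumptions.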
[References to external databases]

(DB name) '|' (ID/Accession number in the DB)

List of DB names

'GB'
DDBJ/EMBL/GenBank DNA sequences
'GP'
GenPept sequences (Translated peptides from DDBJ/EMBL/GenBank sequences)
'REFSEQ'
NCBI RefSeq sequences
'ENSEMBL'
Ensembl predicted transcripts/proteins
'RIKEN'
RIKEN Phase2 sequences ('seqid' is used as accession number)
'LocusLink'
NCBI LocusLink database
'UniGene'
NCBI UniGene database
'SWISSPROT'
SWISSPROT protein database
'TrEMBL'
TrEMBL protein database
'MGD'
MGD database in the Jackson Laboratory
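
A database reference can be split at the '|' separator and checked against the DB names listed above. This is an illustrative sketch; `parse_dbref` is a hypothetical helper and the accession in the usage note is made up.

```python
from typing import Tuple

# DB names exactly as listed in this document
KNOWN_DBS = {
    "GB", "GP", "REFSEQ", "ENSEMBL", "RIKEN",
    "LocusLink", "UniGene", "SWISSPROT", "TrEMBL", "MGD",
}


def parse_dbref(ref: str) -> Tuple[str, str]:
    """Split '(DB name)|(ID/accession)' into its two parts."""
    db, sep, accession = ref.partition("|")
    if not sep or not accession:
        raise ValueError(f"not a database reference: {ref!r}")
    if db not in KNOWN_DBS:
        raise ValueError(f"unknown DB name: {db!r}")
    return db, accession
```

With this sketch, `parse_dbref("GB|AK000001")` gives `("GB", "AK000001")`.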
[Transcript ID (TPacc)]
[CDS/Longest ORF region]
(region: location format in DDBJ/EMBL/GenBank features) ' ' '+' (codon start: '1','2' or '3') ' ' (frameshift positions if exists: 'pos-X', 'pos+N')

Formats of files

[RELEASE]
Release file containing the build date, the number of sequences, and the list of public databases
[rts.fasta]
FASTA file of RTS
[rts.txt]
RTS information in tab-delimited file
  1. RTS ID
  2. Database reference of representative transcript
  3. Definition in GenBank/RefSeq
  4. LocusLink ID
  5. Definition in LocusLink
  6. UniGene ID
  7. Definition in UniGene
  8. MGI Gene Marker ID
  9. Definition in MGD
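
The nine columns above map naturally onto a named record. The sketch below assumes the file has no header line and that missing values appear as empty fields; neither is stated in this document. `SetRecord` and `read_set_table` are hypothetical names.

```python
import csv
from typing import Iterator, NamedTuple, TextIO


class SetRecord(NamedTuple):
    set_id: str                # 1. RTS ID
    dbref: str                 # 2. database reference of the representative
    definition: str            # 3. definition in GenBank/RefSeq
    locuslink_id: str          # 4.
    locuslink_definition: str  # 5.
    unigene_id: str            # 6.
    unigene_definition: str    # 7.
    mgi_id: str                # 8. MGI Gene Marker ID
    mgd_definition: str        # 9.


def read_set_table(handle: TextIO) -> Iterator[SetRecord]:
    """Yield one SetRecord per tab-delimited line."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 9:
            raise ValueError(f"expected 9 columns, got {len(row)}")
        yield SetRecord(*row)
```

Typical use would be `for rec in read_set_table(open("rts.txt")): ...`.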
[rts.das]

Genomic coordinates of RTS, in the format used by LDAS.

Format of annotation section (1 exon=1 line)

  1. 'RTS'
  2. RTS ID
  3. 'exon'
  4. 'similarity' or 'predicted'
  5. Chromosome reference
  6. Start position on the genome
  7. End position on the genome
  8. strand ('+' or '-')
  9. '.'
  10. %-identity
  11. Start position on the transcript
  12. End position on the transcript
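
One exon line of the annotation section can be read into a typed record as follows. This sketch assumes the 12 columns are separated by tabs, which this document does not state explicitly; `ExonLine` and `parse_exon_line` are hypothetical names and the sample values in the usage note are invented.

```python
from typing import NamedTuple


class ExonLine(NamedTuple):
    source: str      # column 1, e.g. 'RTS'
    set_id: str      # column 2
    method: str      # column 4: 'similarity' or 'predicted'
    chrom: str       # column 5
    gstart: int      # column 6
    gend: int        # column 7
    strand: str      # column 8
    identity: float  # column 10
    tstart: int      # column 11
    tend: int        # column 12


def parse_exon_line(line: str) -> ExonLine:
    """Parse one 12-column exon line (assumed tab-delimited)."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 12 or f[2] != "exon":
        raise ValueError(f"not an exon line: {line!r}")
    return ExonLine(f[0], f[1], f[3], f[4], int(f[5]), int(f[6]),
                    f[7], float(f[9]), int(f[10]), int(f[11]))
```

For instance, parsing `"RTS\tTA1\texon\tsimilarity\tchr1\t1000\t1200\t+\t.\t99.5\t1\t201"` gives a record whose `gstart` is 1000 and whose `identity` is 99.5.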
[rps.fasta]
FASTA file of RPS
[rps.txt]
RPS information in tab-delimited file
  1. RPS ID
  2. Database reference of representative protein
  3. Definition in GenBank/RefSeq
  4. LocusLink ID
  5. Definition in LocusLink
  6. UniGene ID
  7. Definition in UniGene
  8. MGI Gene Marker ID
  9. Definition in MGD
[allseq.txt]
Transcript/protein sequences in TKs
  1. TK ID
  2. RTS ID
  3. Transcript ID of representative transcript
  4. RPS ID
  5. Transcript ID of representative protein
  6. Transcript IDs of all transcripts in TK [delimiter: ' ' (single space)]
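
Because column 6 uses a single space as its internal delimiter, the columns themselves are presumably separated by tabs; that is an assumption of the sketch below, as are the function name and the sample IDs.

```python
from typing import Dict, List, Union


def parse_allseq_line(line: str) -> Dict[str, Union[str, List[str]]]:
    """Parse one line of allseq.txt (columns assumed tab-delimited)."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 6:
        raise ValueError(f"expected 6 columns, got {len(f)}")
    return {
        "tk_id": f[0],
        "rts_id": f[1],
        "rts_tpacc": f[2],
        "rps_id": f[3],
        "rps_tpacc": f[4],
        # Column 6: transcript IDs separated by single spaces
        "members": f[5].split(" ") if f[5] else [],
    }
```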
[rtps.dat]
Information about TK/RTS/RPS (this file will become obsolete)
[rtps.yaml]
Information about TK/RTS/RPS. YAML format
'TKID':
TK ID
'RTS':
Information about a representative transcript
'RTS_ID':
RTS ID
'RTS_acc':
Accession number of the representative transcript (database reference)
'RTS_TPacc':
Transcript ID of the representative transcript
'RPS':
Information about a representative protein
'RPS_ID':
RPS ID
'RPS_acc':
Accession number of the representative protein (database reference)
'RPS_TPacc':
Transcript ID of the representative protein
'DESC':
gene names (YAML-Sequence)
'text':
gene name
'dbref':
database reference from which the gene name is retrieved
'SYMBOL':
gene symbol (YAML-Sequence)
'text':
gene symbol
'dbref':
database reference from which the gene symbol is retrieved
'SYNONYM':
synonym (YAML-Sequence)
'text':
synonyms
'dbref':
database reference from which the synonyms are retrieved
'GO':
Gene Ontology (YAML-Sequence)
'goid':
Gene Ontology ID
'evidence':
Evidence code
'dbref':
database reference from which the assignments are retrieved
'aspect':
'F'= molecular function, 'C'=cellular component, 'P'=biological process
'Transcripts':
Information about all transcripts (non-EST) in TK (YAML-Sequence)
'TPacc':
Transcript ID
'rank':
Representative rank of the transcript (1=representative, 2..n=non-representative)
'status':
'Not-Mapped'=Not mapped transcript, 'OK'=Non-overlapped transcript
'ntlen':
Length of the transcript (bp)
'aalen':
Length of the translated sequence (aa)
'lorflen':
Length of the longest ORF (bp), given when no CDS information is present. The lower bound is 100 bp.
'CDS':
CDS region in the transcript
'LORF':
Longest ORF region in the transcript
'strain':
Strain of the source library
'tissue':
Tissue name of the source library
'stage':
Developmental stage of the source library
'map_chr':
mapped chromosome number
'map_strand':
mapped strand
'map_gstart':
Start position on the genome
'map_gstop':
End position on the genome
'map_gcdsstart':
Start position of the mapped CDS on the genome
'map_gcdsstop':
End position of the mapped CDS on the genome
'map_tstart':
Start position on the transcript
'map_tstop':
End position on the transcript
'map_tcdsstart':
Start position of the mapped CDS on the transcript
'map_tcdsstop':
End position of the mapped CDS on the transcript
'dbref':
Database references about the transcript
'ESTs':
Information about all ESTs in TK, present if ESTs are used in this RTPS build (YAML-Sequence). The format of each entry is the same as in 'Transcripts'.
'DBREFS':
Database reference about the TK (YAML-Sequence)
'Antisense':
TK ID of antisense (with exon overlap) TKs (YAML-Sequence)
'Antisense_intronic':
TK ID of antisense (without exon overlap) TKs (YAML-Sequence)
'Overlap':
TK ID of sense-overlapped (with exon overlap) TKs (YAML-Sequence)
'Overlap_intronic':
TK ID of sense-overlapped (without exon overlap) TKs (YAML-Sequence)
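
The 'rank' key makes it possible to recover the representative transcript from a record. The sketch below assumes the record has already been loaded into nested Python dicts and lists by a YAML parser of your choice, and that each entry of 'Transcripts' is a mapping holding the keys listed above. The exact nesting is an assumption, since it is not spelled out here, and the function name is hypothetical.

```python
from typing import Any, Dict, Optional


def representative_transcript(record: Dict[str, Any]) -> Optional[str]:
    """Return the TPacc of the rank-1 transcript of one TK record, if any."""
    for transcript in record.get("Transcripts") or []:
        # rank 1 = representative, 2..n = non-representative
        if int(transcript["rank"]) == 1:
            return transcript["TPacc"]
    return None
```

Applied to a record such as `{"TKID": 12, "Transcripts": [{"TPacc": "t2", "rank": 2}, {"TPacc": "t1", "rank": 1}]}`, the function returns `"t1"`.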
[cts.das]
Genomic coordinates of CTS, in the format used by LDAS.
  1. 'TK'
  2. TK ID
  3. 'exon'
  4. 'consensus'
  5. Chromosome reference
  6. Start position on the genome
  7. End position on the genome
  8. strand ('+' or '-')
  9. '.'
  10. %-identity
  11. Start position on the transcript
  12. End position on the transcript
[vts.fasta]
FASTA file of VTS
[vts.txt]
VTS information in tab-delimited file
  1. VTS ID
  2. Database reference of representative transcript
  3. Definition in GenBank/RefSeq
  4. LocusLink ID
  5. Definition in LocusLink
  6. UniGene ID
  7. Definition in UniGene
  8. MGI Gene Marker ID
  9. Definition in MGD
[vts.das]
Genomic coordinates of VTS, in the format used by LDAS.
  1. 'VTS'
  2. VTS ID
  3. 'exon'
  4. 'similarity' or 'predicted'
  5. Chromosome reference
  6. Start position on the genome
  7. End position on the genome
  8. strand ('+' or '-')
  9. '.'
  10. %-identity
  11. Start position on the transcript
  12. End position on the transcript
[vps.fasta]
FASTA file of VPS
[vps.txt]
VPS information in tab-delimited file
  1. VPS ID
  2. Database reference of representative protein
  3. Definition in GenBank/RefSeq
  4. LocusLink ID
  5. Definition in LocusLink
  6. UniGene ID
  7. Definition in UniGene
  8. MGI Gene Marker ID
  9. Definition in MGD
[allseq_sp.txt]
Results of splicing pattern-based cluster
  1. SP ID
  2. VTS ID
  3. Transcript ID of representative transcript
  4. VPS ID
  5. Transcript ID of representative protein
  6. Transcript IDs of all transcripts in variant group [delimiter: ' ']
[its.fasta]
FASTA file of ITS
[its.txt]
ITS information in tab-delimited file
  1. ITS ID
  2. Database reference of representative transcript
  3. Definition in GenBank/RefSeq
  4. LocusLink ID
  5. Definition in LocusLink
  6. UniGene ID
  7. Definition in UniGene
  8. MGI Gene Marker ID
  9. Definition in MGD
[its.das]
Genomic coordinates of ITS, in the format used by LDAS.
  1. 'ITS'
  2. ITS ID
  3. 'exon'
  4. 'similarity' or 'predicted'
  5. Chromosome reference
  6. Start position on the genome
  7. End position on the genome
  8. strand ('+' or '-')
  9. '.'
  10. %-identity
  11. Start position on the transcript
  12. End position on the transcript
[allseq_it.txt]
Results of identical transcript-based clustering
  1. IT ID
  2. ITS ID
  3. ''
  4. ''
  5. Transcript ID of representative protein
  6. Transcript IDs of all transcripts in variant group [delimiter: ' ']
[ips.fasta]
FASTA file of IPS
[ips.txt]
IPS information in tab-delimited file
  1. IPS ID
  2. Database reference of representative protein
  3. Definition in GenBank/RefSeq
  4. LocusLink ID
  5. Definition in LocusLink
  6. UniGene ID
  7. Definition in UniGene
  8. MGI Gene Marker ID
  9. Definition in MGD
[allseq_ip.txt]
Results of identical protein-based clustering
  1. IP ID
  2. ''
  3. ''
  4. IPS ID
  5. Transcript ID of representative protein
  6. Transcript IDs of all transcripts in variant group [delimiter: ' ']
[excluded_transcripts.txt]
A list of excluded transcripts in RTPS build.
  1. Accession number of excluded transcript
  2. Reason
    'Immune'
    Unclassified T-cell receptor or immunoglobulin transcript
    'scFv'
    transcript derived from scFv
    'No info'
    There is no information to group into TKs (Not mapped & not recorded in any public database)
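
A quick tally of exclusion reasons can be useful when checking a build. This sketch assumes the two columns are tab-delimited, which is not stated above; `count_exclusion_reasons` is a hypothetical name.

```python
from collections import Counter
from typing import Iterable


def count_exclusion_reasons(lines: Iterable[str]) -> Counter:
    """Count how many transcripts were excluded for each reason."""
    counts: Counter = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Column 1: accession, column 2: reason ('Immune', 'scFv', 'No info')
        _accession, reason = line.split("\t", 1)
        counts[reason] += 1
    return counts
```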
[gene_association.rtps]
GO assignments to TKs. See http://www.geneontology.org/GO.annotation.html#file
[annotation.rtps]
Annotation assigned to TKs (description, symbol). Tab-delimited text
  1. TK ID
  2. qualifier ("description" or "symbol")
  3. description or symbol
  4. database reference
  5. evidence (="RTPS pipeline")
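
The five columns above can be grouped per TK. The following is a sketch under the assumption that the file has no header line; `annotations_by_tk` is a hypothetical helper and the sample values in the test are invented.

```python
from typing import Dict, Iterable, List, Tuple

Annotations = Dict[str, Dict[str, List[Tuple[str, str]]]]


def annotations_by_tk(lines: Iterable[str]) -> Annotations:
    """Group (text, database reference) pairs by TK ID and qualifier."""
    out: Annotations = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        tk, qualifier, text, dbref, _evidence = line.split("\t")
        if qualifier not in ("description", "symbol"):
            raise ValueError(f"unexpected qualifier: {qualifier!r}")
        entry = out.setdefault(tk, {"description": [], "symbol": []})
        entry[qualifier].append((text, dbref))
    return out
```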
[alltrans.das]
Genomic coordinates of all transcripts, in the format used by LDAS.
[allest.das]
Genomic coordinates of all ESTs, in the format used by LDAS.
[alltrans.gff]
Genomic coordinates of all transcripts, in GFF version 3 format. Attributes are as follows:
[Target, Gap] same as the GFF3 specification
[transcript_orientation] '-' if the direction of the transcript/EST is opposite to that of the mRNA
[tu] TU ID
[tk] TK ID
[sp] SP ID
[it] IT ID
[ip] IP ID
[donor] 2 bps in donor site
[acceptor] 2 bps in acceptor site
[annotated_CDS] '1': CDS is annotated, '0': no CDS
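
In GFF3, attributes live in the ninth column as semicolon-separated `tag=value` pairs with URL-style escaping. A minimal sketch for pulling the cluster IDs out of that column follows; `parse_attributes` is a hypothetical name and the attribute string in the usage note is invented.

```python
from typing import Dict
from urllib.parse import unquote


def parse_attributes(column9: str) -> Dict[str, str]:
    """Split a GFF3 attribute column into a tag -> value mapping."""
    attrs: Dict[str, str] = {}
    for item in column9.strip().split(";"):
        if not item:
            continue
        key, _, value = item.partition("=")
        # GFF3 percent-encodes reserved characters in tags and values
        attrs[unquote(key)] = unquote(value)
    return attrs
```

For example, `parse_attributes("tu=12;tk=34;sp=34.1;donor=GT;acceptor=AG;annotated_CDS=1")["sp"]` is `"34.1"`.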
[seq_clusters.txt]
Correspondence between each transcript/EST and their belonging TU/TK/SP/IT/IP clusters. Tab-delimited text
  1. Database reference of a transcript/EST
  2. TU ID
  3. TK ID
  4. SP ID
  5. IT ID
  6. IP ID
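
The six columns above are enough to rebuild the membership of any cluster level. As a sketch, the helper below (a hypothetical name) collects the database references belonging to each TK; it assumes every line carries all six tab-separated fields, possibly empty.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def members_by_tk(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each TK ID to the transcripts/ESTs assigned to it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        dbref, _tu, tk, _sp, _it, _ip = line.split("\t")
        groups[tk].append(dbref)
    return dict(groups)
```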
[tu_clusters.txt, tk_clusters.txt, sp_clusters.txt, it_clusters.txt, ip_clusters.txt]

Associations between each cluster of one type (TU/TK/SP/IT/IP) and the clusters of the other types. Tab-delimited text

  1. TU/TK/SP/IT/IP ID

  2-5. Cluster (TU/TK/SP/IT/IP) IDs associated with (1)

[allest.gff]
Genomic coordinates of all ESTs, in GFF version 3 format.
[alltrans_lowermap.txt]
Genome-mapped positions of all transcripts with lower scores. The format is the same as that of 'alltrans.das'.
[allest_lowermap.txt]
Genome-mapped positions of all ESTs with lower scores. The format is the same as that of 'allest.das'.
[allcds.txt]
CDS locations in each transcript. Tab-delimited text
  1. database reference of the transcript
  2. source
  3. type (=CDS)
  4. CDS start in the transcript
  5. CDS stop in the transcript
  6. '.'
  7. strand
  8. phase
  9. '1': 5'-end is truncated, '0' : 5'-end is complete
  10. '1': 3'-end is truncated, '0' : 3'-end is complete
  11. Gap (CIGAR-alignment)
  12. database reference of the translated protein sequence
  13. CDS location written in the DDBJ/EMBL/GenBank FT format
  14. Length of the translated protein sequence
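
The truncation flags in columns 9 and 10 make it easy to select transcripts with a complete CDS. The sketch below parses one line into a dictionary; the function names are hypothetical, and the sample values used in the test (including the source value) are invented.

```python
from typing import Any, Dict


def parse_allcds_line(line: str) -> Dict[str, Any]:
    """Parse one 14-column line of allcds.txt."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 14:
        raise ValueError(f"expected 14 columns, got {len(f)}")
    return {
        "dbref": f[0],
        "source": f[1],
        "start": int(f[3]),
        "stop": int(f[4]),
        "strand": f[6],
        "phase": f[7],
        "five_prime_truncated": f[8] == "1",
        "three_prime_truncated": f[9] == "1",
        "gap": f[10],
        "protein_dbref": f[11],
        "location": f[12],
        "protein_length": int(f[13]),
    }


def is_complete(cds: Dict[str, Any]) -> bool:
    """True when neither end of the CDS is flagged as truncated."""
    return not (cds["five_prime_truncated"] or cds["three_prime_truncated"])
```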
[allcds.gff]
CDS locations in each transcript. GFF format
[protein_id] database reference of the translated protein sequence
[Gap] same as the GFF3 specification
[location] CDS location written in the DDBJ/EMBL/GenBank FT format
[tu] TU ID
[tk] TK ID
[sp] SP ID
[it] IT ID
[ip] IP ID

changelog