=begin

= Format of RTPS files (2004-9-28)

== Terms

:[TU (Transcriptional Unit)]
  Genomic (discontinuous) regions from which one mature mRNA is
  derived. If multiple transcripts overlap on the genome, these TUs are
  merged into one.

:[TK (Transcript frameworK)]
  A group of transcripts that map to the same genomic region, are on
  the same strand, and share some exon structure.
  Members may have different transcriptional start sites (TSS)
  and polyadenylation signals (PAS).

:[RTS (Representative Transcript Set)]
  A set of representative transcripts, one selected from every TK.
  The number of transcripts in RTS equals the number of TKs.

:[RPS (Representative Protein Set)]
  A set of representative proteins that are selected from every TK
  with translated sequences

:[SP (Splicing Pattern) clusters]
  Grouped transcripts based on splicing patterns

:[IT (Identical Transcript) clusters]
  Grouped transcripts that are identical

:[IP (Identical Protein) clusters]
  Grouped proteins that are identical

:[VTS (Variant-based representative Transcript Set)]
  A set of transcripts that are selected from every SP cluster

:[VPS (Variant-based representative Protein Set)]
  A set of proteins that are selected from every SP cluster that
  has translated sequences

:[ITS (representative Identical Transcript Set)]
  A set of transcripts that are selected from every IT cluster

:[IPS (representative Identical Protein Set)]
  A set of proteins that are selected from every IP cluster

:[CTS (Consensus Transcript Set)]
  A set of consensus transcripts that are generated from genomic
  sequences in TK

== ID format

:[TU ID]
  integer
:[TK ID]
  integer
:[SP ID]
  (TK ID) '.' integer
:[IT ID]
  (TK ID) '.' integer
:[IP ID]
  (TK ID) '.' integer
:[RTS ID]
  'T' ('A' or 'B') (TK ID)
:[RPS ID]
  'PA' (TK ID)
:[VTS ID]
  'T' ('A' or 'B') (IT ID)
:[VPS ID]
  'PA' (IP ID)
:[References to external databases]
  (DB name) '|' (ID/Accession number in the DB)

  List of DB names
  :'GB'
    DDBJ/EMBL/GenBank DNA sequences
  :'GP'
    GenPept sequences (Translated peptides from DDBJ/EMBL/GenBank sequences)
  :'REFSEQ'
    NCBI RefSeq sequences
  :'ENSEMBL'
    Ensembl predicted transcripts/proteins
  :'RIKEN'
    RIKEN Phase2 sequences ('seqid' is used as accession number)
  :'LocusLink'
    NCBI LocusLink database
  :'UniGene'
    NCBI UniGene database
  :'SWISSPROT'
    SWISSPROT protein database
  :'TrEMBL'
    TrEMBL protein database
  :'MGD'
    MGD database in the Jackson Laboratory
:[Transcript ID (TPacc)]
  * without translated sequences
    (transcript DB name) '|' (Acc No. of transcript)

  * with translated sequences
    (transcript DB name) '|' (Acc No. of transcript) '|' (Protein DB
    name) '|' (Acc No. of protein)
:[CDS/Longest ORF region]
  (region: location format in DDBJ/EMBL/GenBank features)
  ' ' '+' (codon start: '1','2' or '3')
  ' ' (frameshift positions, if any: 'pos-X', 'pos+N')
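
The database reference and Transcript ID (TPacc) formats above can be
split mechanically on '|'. The following is a minimal sketch, not part of
the RTPS distribution: the names DbRef, TranscriptId, parse_dbref and
parse_tpacc are hypothetical, and the accession numbers in the usage
note are made up for illustration.

```python
from typing import NamedTuple, Optional


class DbRef(NamedTuple):
    db: str         # DB name, e.g. 'GB' or 'REFSEQ'
    accession: str  # ID/accession number within that DB


class TranscriptId(NamedTuple):
    transcript: DbRef
    protein: Optional[DbRef]  # None when there is no translated sequence


def parse_dbref(text: str) -> DbRef:
    """Split '(DB name)|(accession)' at the first '|'."""
    db, sep, accession = text.partition("|")
    if not sep or not db or not accession:
        raise ValueError(f"not a database reference: {text!r}")
    return DbRef(db, accession)


def parse_tpacc(text: str) -> TranscriptId:
    """Parse a TPacc with two fields (no translation) or four fields."""
    fields = text.split("|")
    if len(fields) == 2:
        return TranscriptId(DbRef(fields[0], fields[1]), None)
    if len(fields) == 4:
        return TranscriptId(DbRef(fields[0], fields[1]),
                            DbRef(fields[2], fields[3]))
    raise ValueError(f"not a Transcript ID: {text!r}")
```

For example, parse_tpacc("GB|AK000001|GP|BAA00001.1") yields a transcript
reference in 'GB' and a protein reference in 'GP', while
parse_tpacc("GB|AK000001") leaves the protein part as None.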

== Formats of files

:[(({RELEASE}))]
  Release file containing the build date, the number of sequences, and
  the list of public databases

:[(({rts.fasta}))]
  FASTA file of RTS

:[(({rts.txt}))]
  RTS information in tab-delimited file
  (1) RTS ID
  (2) Database reference of representative transcript
  (3) Definition in GenBank/RefSeq
  (4) LocusLink ID
  (5) Definition in LocusLink
  (6) UniGene ID
  (7) Definition in UniGene
  (8) MGI Gene Marker ID
  (9) Definition in MGD

:[(({rts.das}))]
  Genomic coordinates of RTS, in the format used by LDAS
  * [reference] section
     Information about genome sequences
  * [annotation] section
     Annotation on the genome (mapped positions of transcripts)
  Format of the annotation section (1 exon = 1 line)
  (1) 'RTS'
  (2) RTS ID
  (3) 'exon'
  (4) 'similarity' or 'predicted'
  (5) Chromosome reference
  (6) Start position on the genome
  (7) End position on the genome
  (8) strand ('+' or '-')
  (9) '.'
  (10) %-identity
  (11) Start position on the transcript
  (12) End position on the transcript
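
One line of the [annotation] section can be turned into a typed record
as follows. This is a sketch under assumptions not stated above: that
the twelve columns are tab-separated, that positions are integers, and
that %-identity is a decimal number. ExonRow and parse_das_exon are
hypothetical names.

```python
from typing import NamedTuple


class ExonRow(NamedTuple):
    set_name: str    # column 1, e.g. 'RTS'
    seq_id: str      # column 2, e.g. an RTS ID
    feature: str     # column 3, 'exon'
    method: str      # column 4, 'similarity' or 'predicted'
    chrom: str       # column 5
    g_start: int     # column 6
    g_end: int       # column 7
    strand: str      # column 8, '+' or '-'
    identity: float  # column 10 (column 9 is always '.')
    t_start: int     # column 11
    t_end: int       # column 12


def parse_das_exon(line: str) -> ExonRow:
    """Parse one exon line; column 9 ('.') is skipped."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 12:
        raise ValueError(f"expected 12 columns, got {len(f)}")
    if f[7] not in ("+", "-"):
        raise ValueError(f"unexpected strand: {f[7]!r}")
    return ExonRow(f[0], f[1], f[2], f[3], f[4],
                   int(f[5]), int(f[6]), f[7],
                   float(f[9]), int(f[10]), int(f[11]))
```

Because each line describes a single exon, the exons of one transcript
can be collected by grouping the parsed rows on seq_id.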

:[(({rps.fasta}))]
  FASTA file of RPS

:[(({rps.txt}))]
  RPS information in tab-delimited file
  (1) RPS ID
  (2) Database reference of representative protein
  (3) Definition in GenBank/RefSeq
  (4) LocusLink ID
  (5) Definition in LocusLink
  (6) UniGene ID
  (7) Definition in UniGene
  (8) MGI Gene Marker ID
  (9) Definition in MGD

:[(({allseq.txt}))]
  Transcript/protein sequences in TKs
  (1) TK ID
  (2) RTS ID
  (3) Transcript ID of representative transcript
  (4) RPS ID
  (5) Transcript ID of representative protein
  (6) Transcript IDs of all transcripts in TK [delimiter: ' ' (single space)]
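
Column (6) of allseq.txt is itself a list, so a reader needs two levels
of splitting. The sketch below assumes that the six columns are
tab-separated (only the single-space delimiter inside column 6 is stated
above); parse_allseq_line and the key names are hypothetical.

```python
def parse_allseq_line(line: str) -> dict:
    """Split the six columns, then split column 6 on single spaces."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 6:
        raise ValueError(f"expected 6 columns, got {len(f)}")
    return {
        "tk_id": f[0],
        "rts_id": f[1],
        "rts_tpacc": f[2],
        "rps_id": f[3],
        "rps_tpacc": f[4],
        # An empty column gives an empty list rather than [''].
        "members": f[5].split(" ") if f[5] else [],
    }
```

Each entry of "members" is a Transcript ID (TPacc), so it may contain
two or four '|'-separated fields.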

:[(({rtps.dat}))]
  Information about TK/RTS/RPS
  (This file will become obsolete)

:[(({rtps.yaml}))]
  Information about TK/RTS/RPS.  YAML format

  :'TKID':
   TK ID
  :'RTS':
   Information about a representative transcript
    :'RTS_ID':
      RTS ID
    :'RTS_acc':
      Accession number of the representative transcript (database reference)
    :'RTS_TPacc':
      Transcript ID of the representative transcript
  :'RPS':
   Information about a representative protein
    :'RPS_ID':
      RPS ID
    :'RPS_acc':
      Accession number of the representative protein (database reference)
    :'RPS_TPacc':
      Transcript ID of the representative protein
  :'DESC':
    gene names (YAML-Sequence)
    :'text':
      gene name
    :'dbref':
      database reference from which the gene name is retrieved
  :'SYMBOL':
    gene symbol (YAML-Sequence)
    :'text':
      gene symbol
    :'dbref':
      database reference from which the gene symbol is retrieved
  :'SYNONYM':
    synonym (YAML-Sequence)
    :'text':
      synonyms
    :'dbref':
      database reference from which the synonyms are retrieved
  :'GO':
    Gene Ontology (YAML-Sequence)
    :'goid':
       Gene Ontology ID
    :'evidence':
       Evidence code
    :'dbref':
       database reference from which the assignments are retrieved
    :'aspect':
       'F'=molecular function, 'C'=cellular component,
       'P'=biological process
  :'Transcripts':
    Information about all transcripts (non-EST) in TK (YAML-Sequence)
    :'TPacc':
      Transcript ID
    :'rank':
      Representative rank of the transcript
      (1=representative, 2..n=non-representative)
    :'status':
      `Not-Mapped'=Not mapped transcript,
      `OK'=Non-overlapped transcript
    :'ntlen':
      Length of the transcript (bp)
    :'aalen':
      Length of the translated sequence (aa)
    :'lorflen':
      Length of the longest ORF (bp), given when CDS information is
      not present. The lower bound is 100 bp.
    :'CDS':
      CDS region in the transcript
    :'LORF':
      Longest ORF region in the transcript
    :'strain':
      Strain of the source library
    :'tissue':
      Tissue name of the source library
    :'stage':
      Developmental stage of the source library
    :'map_chr':
      mapped chromosome number
    :'map_strand':
      mapped strand
    :'map_gstart':
      Start position on the genome
    :'map_gstop':
      End position on the genome
    :'map_gcdsstart':
      Start position of the mapped CDS on the genome
    :'map_gcdsstop':
      End position of the mapped CDS on the genome
    :'map_tstart':
      Start position on the transcript
    :'map_tstop':
      End position on the transcript
    :'map_tcdsstart':
      Start position of the mapped CDS on the transcript
    :'map_tcdsstop':
      End position of the mapped CDS on the transcript
    :'dbref':
      Database references about the transcript
  :'ESTs':
    Information about all ESTs in TK (if ESTs are used in this RTPS
    build) (YAML-Sequence). The format of each entry is the same as
    in 'Transcripts'.
  :'DBREFS':
    Database reference about the TK (YAML-Sequence)
  :'Antisense':
    TK ID of antisense (with exon overlap) TKs (YAML-Sequence)
  :'Antisense_intronic':
    TK ID of antisense (without exon overlap) TKs (YAML-Sequence)
  :'Overlap':
    TK ID of sense-overlapped (with exon overlap) TKs (YAML-Sequence)
  :'Overlap_intronic':
    TK ID of sense-overlapped (without exon overlap) TKs (YAML-Sequence)
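
The 'rank' attribute described above identifies the representative
transcript within a TK record. The sketch below assumes that one record
from rtps.yaml has already been loaded into plain Python dicts and lists
(the YAML parsing step is not shown), and that 'rank' can be read as an
integer. The function name and the sample record are hypothetical.

```python
from typing import Optional


def representative_tpacc(record: dict) -> Optional[str]:
    """Return the TPacc of the rank-1 entry in 'Transcripts', if any."""
    for transcript in record.get("Transcripts") or []:
        if int(transcript["rank"]) == 1:
            return transcript["TPacc"]
    return None


# Made-up record that follows the key names documented above.
example = {
    "TKID": 12,
    "Transcripts": [
        {"TPacc": "GB|AK000002", "rank": 2, "status": "OK"},
        {"TPacc": "GB|AK000001|GP|BAA00001.1", "rank": 1, "status": "OK"},
    ],
}
```

With the sample record, representative_tpacc(example) picks the second
list entry because its rank is 1, regardless of list order.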


:[(({cts.das}))]
  Genomic coordinates of CTS, in the format used by LDAS
  (1) 'TK'
  (2) TK ID
  (3) 'exon'
  (4) 'consensus'
  (5) Chromosome reference
  (6) Start position on the genome
  (7) End position on the genome
  (8) strand ('+' or '-')
  (9) '.'
  (10) %-identity
  (11) Start position on the transcript
  (12) End position on the transcript

:[(({vts.fasta}))]
  FASTA file of VTS

:[(({vts.txt}))]
  VTS information in tab-delimited file
  (1) VTS ID
  (2) Database reference of representative transcript
  (3) Definition in GenBank/RefSeq
  (4) LocusLink ID
  (5) Definition in LocusLink
  (6) UniGene ID
  (7) Definition in UniGene
  (8) MGI Gene Marker ID
  (9) Definition in MGD

:[(({vts.das}))]
  Genomic coordinates of VTS, in the format used by LDAS
  (1) 'VTS'
  (2) VTS ID
  (3) 'exon'
  (4) 'similarity' or 'predicted'
  (5) Chromosome reference
  (6) Start position on the genome
  (7) End position on the genome
  (8) strand ('+' or '-')
  (9) '.'
  (10) %-identity
  (11) Start position on the transcript
  (12) End position on the transcript

:[(({vps.fasta}))]
  FASTA file of VPS

:[(({vps.txt}))]
  VPS information in tab-delimited file
  (1) VPS ID
  (2) Database reference of representative protein
  (3) Definition in GenBank/RefSeq
  (4) LocusLink ID
  (5) Definition in LocusLink
  (6) UniGene ID
  (7) Definition in UniGene
  (8) MGI Gene Marker ID
  (9) Definition in MGD

:[(({allseq_sp.txt}))]
  Results of splicing pattern-based clustering
  (1) SP ID
  (2) VTS ID
  (3) Transcript ID of representative transcript
  (4) VPS ID
  (5) Transcript ID of representative protein
  (6) Transcript IDs of all transcripts in variant group [delimiter: ' ']

#:[(({vtps.dat}))]
#  Information about variant-based cluster/VTS/VPS
#  (It will be obsoleted)

:[(({its.fasta}))]
  FASTA file of ITS

:[(({its.txt}))]
  ITS information in tab-delimited file
  (1) ITS ID
  (2) Database reference of representative transcript
  (3) Definition in GenBank/RefSeq
  (4) LocusLink ID
  (5) Definition in LocusLink
  (6) UniGene ID
  (7) Definition in UniGene
  (8) MGI Gene Marker ID
  (9) Definition in MGD

:[(({its.das}))]
  Genomic coordinates of ITS, in the format used by LDAS
  (1) 'ITS'
  (2) ITS ID
  (3) 'exon'
  (4) 'similarity' or 'predicted'
  (5) Chromosome reference
  (6) Start position on the genome
  (7) End position on the genome
  (8) strand ('+' or '-')
  (9) '.'
  (10) %-identity
  (11) Start position on the transcript
  (12) End position on the transcript

:[(({allseq_it.txt}))]
  Results of identical transcript-based clustering
  (1) IT ID
  (2) ITS ID
  (3) ''
  (4) ''
  (5) Transcript ID of representative protein
  (6) Transcript IDs of all transcripts in variant group [delimiter: ' ']

:[(({ips.fasta}))]
  FASTA file of IPS

:[(({ips.txt}))]
  IPS information in tab-delimited file
  (1) IPS ID
  (2) Database reference of representative protein
  (3) Definition in GenBank/RefSeq
  (4) LocusLink ID
  (5) Definition in LocusLink
  (6) UniGene ID
  (7) Definition in UniGene
  (8) MGI Gene Marker ID
  (9) Definition in MGD

:[(({allseq_ip.txt}))]
  Results of identical protein-based clustering
  (1) IP ID
  (2) ''
  (3) ''
  (4) IPS ID
  (5) Transcript ID of representative protein
  (6) Transcript IDs of all transcripts in variant group [delimiter: ' ']

:[(({excluded_transcripts.txt}))]
  A list of transcripts excluded from the RTPS build.
  (1) Accession number of excluded transcript
  (2) Reason
       :'Immune'
         Unclassified T-cell receptor or immunoglobulin transcript
       :'scFv'
         Transcript derived from scFv
       :'No info'
         No information is available for grouping the transcript into
         a TK (not mapped and not recorded in any public database)

#:[(({vtps.yaml}))]
#  Information about variant-based cluster/VTS/VPS. YAML
#  format. Its content is similar to "rtps.yaml"

:[(({gene_association.rtps}))]
  GO assignments to TKs.
  See http://www.geneontology.org/GO.annotation.html#file

#:[(({gene_association.vtps}))]
#  GO assignments to variant-based cluster
#  See http://www.geneontology.org/GO.annotation.html#file

:[(({annotation.rtps}))]
  Annotation assigned to TKs (description, symbol). Tab-delimited text
  (1) TK ID
  (2) qualifier ("description" or "symbol")
  (3) description or symbol
  (4) database reference
  (5) evidence (="RTPS pipeline")

:[(({alltrans.das}))]
  Genomic coordinates of all transcripts, in the format used by LDAS

:[(({allest.das}))]
  Genomic coordinates of all ESTs, in the format used by LDAS

:[(({alltrans.gff}))]
  Genomic coordinates of all transcripts, in GFF ver.3 format.
  Attributes are as follows:
  :[Target, Gap] same as the GFF3 specification
  :[transcript_orientation] '-' if the direction of the transcript/EST
   is opposite to that of the mRNA
  :[tu] TU ID
  :[tk] TK ID
  :[sp] SP ID
  :[it] IT ID
  :[ip] IP ID
  :[donor] 2 bps in donor site
  :[acceptor] 2 bps in acceptor site
  :[annotated_CDS] '1': CDS is annotated, '0': no CDS

:[(({seq_clusters.txt}))]
  Correspondence between each transcript/EST and the TU/TK/SP/IT/IP
  clusters it belongs to. Tab-delimited text
  (1) Database reference of a transcript/EST
  (2) TU ID
  (3) TK ID
  (4) SP ID
  (5) IT ID
  (6) IP ID
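
Because seq_clusters.txt has one line per sequence, cluster membership
can be recovered by grouping on one of the ID columns. The sketch below
groups by TK ID; group_by_tk is a hypothetical name, the sample rows in
the usage note are made up, and an empty field is assumed where a
sequence has no ID of a given kind.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_tk(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each TK ID to the database references of its sequences."""
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate blank lines
        fields = line.split("\t")
        if len(fields) != 6:
            raise ValueError(f"expected 6 columns, got {len(fields)}")
        dbref, _tu, tk, _sp, _it, _ip = fields
        groups[tk].append(dbref)
    return dict(groups)
```

For instance, two rows that both carry TK ID 10 end up in one list
under the key "10". Grouping on another column only requires changing
which field is used as the key.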

:[(({tu_clusters.txt, tk_clusters.txt, sp_clusters.txt, it_clusters.txt, ip_clusters.txt}))]
  Associations between each cluster of one type (TU/TK/SP/IT/IP) and
  the clusters of the other types. Tab-delimited text
  (1) TU/TK/SP/IT/IP ID
  (2-5) Cluster (TU/TK/SP/IT/IP) IDs associated with (1)

:[(({allest.gff}))]
  Genomic coordinates of all ESTs, in GFF ver.3 format

:[(({alltrans_lowermap.txt}))]
  Genome-mapped positions of all transcripts with lower scores.
  Its format is the same as that of 'alltrans.das'.

:[(({allest_lowermap.txt}))]
  Genome-mapped positions of all ESTs with lower scores.
  Its format is the same as that of 'allest.das'.

:[(({allcds.txt}))]
  CDS locations in each transcript.  Tab-delimited text
  (1)  database reference of the transcript
  (2)  source
  (3)  type (=CDS)
  (4)  CDS start in the transcript
  (5)  CDS stop  in the transcript
  (6)  '.'
  (7)  strand
  (8)  phase
  (9)  '1': 5'-end is truncated,  '0' : 5'-end is complete
  (10) '1': 3'-end is truncated,  '0' : 3'-end is complete
  (11) Gap (CIGAR-alignment)
  (12) database reference of the translated protein sequence
  (13) CDS location written in the DDBJ/EMBL/GenBank FT format
  (14) Length of the translated protein sequence

:[(({allcds.gff}))]
  CDS locations in each transcript.  GFF format
  :[protein_id] database reference of the translated protein sequence
  :[Gap] same as the GFF3 specification
  :[location] CDS location written in the DDBJ/EMBL/GenBank FT format
  :[tu] TU ID
  :[tk] TK ID
  :[sp] SP ID
  :[it] IT ID
  :[ip] IP ID


== changelog

* 2004-10-16
  :[allcds.txt, allcds.gff]
   New files
  :[seq_clusters.txt, {tu,tk,it,ip,sp}_clusters.txt]
   New files

* 2004-9-28
  :(generic)
    * New clusters: "IT" and "IP"
    * New IDs: "IT ID" and "IP ID"
    * New representative sets: "ITS" and "IPS"
  :[its.fasta, its.txt, its.das, ips.fasta, ips.txt]
   New files
  :[allvseq.txt, vtps.yaml, vtps.dat, gene_association.vtps, annotation.vtps]
   Files were discontinued
  :[allseq_sp.txt, allseq_it.txt, allseq_ip.txt]
   New files
  :[cts.das]
   Format change

* 2004-9-20
  :(generic)
   Change "Variant ID" to "SP ID"
  :[annotation.rtps, annotation.vtps]
   New files

* 2004-9-9
  :(generic)
    'chr' prefix was added to the beginning of chromosome references
  :[allseq.txt]
    TK ID column was added.
  :[allvseq.txt]
    Variant ID column was added.
  :[alltrans.gff allest.gff]
    New files
  :[alltrans.das allest.das alltrans_lowermap.txt allest_lowermap.txt]
    allseq.das was split into two files: alltrans.das and allest.das
  :[rts.txt rps.txt vts.txt vps.txt]
    Change accession number to database reference in the second column
  :[rts.fasta rps.fasta vts.fasta vps.fasta]
    Header format was changed.
  :[rtps.yaml]
    * New attribute 'aspect' in 'GO' for expressing 'F', 'P' or 'C'
    * Add an 'ESTs' attribute
  :[gene_association.rtps, gene_association.vtps]
    New files for describing GO assignments


=end