GFF3 Format

This section describes the representation of a protein-coding gene in GFF3. To illustrate how a canonical gene is represented, consider Figure 1 (figure1.png). The figure shows a gene named EDEN extending from position 1000 to position 9000. It encodes three alternatively spliced transcripts named EDEN.1, EDEN.2 and EDEN.3, the last of which has two alternative translational start sites leading to the generation of two protein-coding sequences.

There is also an identified transcriptional factor binding site located 50 bp upstream from the transcriptional start site of EDEN.1 and EDEN.2.

Here is how this gene should be described using GFF3:

 
 0  ##gff-version 3.2.1
 1  ##sequence-region ctg123 1 1497228
 2  ctg123 . gene            1000  9000  .  +  .  ID=gene00001;Name=EDEN
 3  ctg123 . TF_binding_site 1000  1012  .  +  .  ID=tfbs00001;Parent=gene00001
 4  ctg123 . mRNA            1050  9000  .  +  .  ID=mRNA00001;Parent=gene00001;Name=EDEN.1
 5  ctg123 . mRNA            1050  9000  .  +  .  ID=mRNA00002;Parent=gene00001;Name=EDEN.2
 6  ctg123 . mRNA            1300  9000  .  +  .  ID=mRNA00003;Parent=gene00001;Name=EDEN.3
 7  ctg123 . exon            1300  1500  .  +  .  ID=exon00001;Parent=mRNA00003
 8  ctg123 . exon            1050  1500  .  +  .  ID=exon00002;Parent=mRNA00001,mRNA00002
 9  ctg123 . exon            3000  3902  .  +  .  ID=exon00003;Parent=mRNA00001,mRNA00003
10  ctg123 . exon            5000  5500  .  +  .  ID=exon00004;Parent=mRNA00001,mRNA00002,mRNA00003
11  ctg123 . exon            7000  9000  .  +  .  ID=exon00005;Parent=mRNA00001,mRNA00002,mRNA00003
12  ctg123 . CDS             1201  1500  .  +  0  ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
13  ctg123 . CDS             3000  3902  .  +  0  ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
14  ctg123 . CDS             5000  5500  .  +  0  ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
15  ctg123 . CDS             7000  7600  .  +  0  ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
16  ctg123 . CDS             1201  1500  .  +  0  ID=cds00002;Parent=mRNA00002;Name=edenprotein.2
17  ctg123 . CDS             5000  5500  .  +  0  ID=cds00002;Parent=mRNA00002;Name=edenprotein.2
18  ctg123 . CDS             7000  7600  .  +  0  ID=cds00002;Parent=mRNA00002;Name=edenprotein.2
19  ctg123 . CDS             3301  3902  .  +  0  ID=cds00003;Parent=mRNA00003;Name=edenprotein.3
20  ctg123 . CDS             5000  5500  .  +  1  ID=cds00003;Parent=mRNA00003;Name=edenprotein.3
21  ctg123 . CDS             7000  7600  .  +  1  ID=cds00003;Parent=mRNA00003;Name=edenprotein.3
22  ctg123 . CDS             3391  3902  .  +  0  ID=cds00004;Parent=mRNA00003;Name=edenprotein.4
23  ctg123 . CDS             5000  5500  .  +  1  ID=cds00004;Parent=mRNA00003;Name=edenprotein.4
24  ctg123 . CDS             7000  7600  .  +  1  ID=cds00004;Parent=mRNA00003;Name=edenprotein.4

Lines beginning with ‘##’ are directives (sometimes called pragmas or meta-data) and provide meta-information about the document as a whole. Blank lines should be ignored by parsers, and lines beginning with a single ‘#’ are human-readable comments that parsers can also ignore. End-of-line comments (comments preceded by # at the end of, and on the same line as, a feature or directive line) are not allowed.
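The line categories just described can be told apart with a few prefix checks. The following is a minimal Python sketch, not a full parser; the function name classify_line is hypothetical, and the sample feature line uses tabs on the assumption that real GFF3 files are tab-delimited (the listing above uses spaces only for readability).

```python
def classify_line(line: str) -> str:
    """Return 'blank', 'directive', 'comment', or 'feature' for one GFF3 line."""
    if not line.strip():
        return "blank"        # blank lines are ignored by parsers
    if line.startswith("##"):
        return "directive"    # e.g. ##gff-version, ##sequence-region
    if line.startswith("#"):
        return "comment"      # human-readable comment, safe to skip
    return "feature"          # everything else is treated as a feature line


sample = [
    "##gff-version 3.2.1",
    "# a human-readable comment",
    "",
    "ctg123\t.\tgene\t1000\t9000\t.\t+\t.\tID=gene00001;Name=EDEN",
]
kinds = [classify_line(line) for line in sample]
print(kinds)
```

Because end-of-line comments are not allowed, the sketch never looks for a ‘#’ inside a feature line; a ‘#’ only matters at the start of a line.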

Line 0 gives the GFF version using the ##gff-version pragma. Line 1 indicates the boundaries of the region being annotated (a 1,497,228 bp region named “ctg123”) using the ##sequence-region pragma.
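As an illustration of how these two pragmas might be read, here is a small Python sketch. The function name parse_directives and the shape of its return value are assumptions made for this example; other directives are simply skipped.

```python
def parse_directives(lines):
    """Collect the GFF version and the sequence regions from '##' lines."""
    version = None
    regions = {}  # seqid -> (start, end)
    for line in lines:
        if not line.startswith("##"):
            continue  # not a directive
        parts = line[2:].split()
        if not parts:
            continue
        if parts[0] == "gff-version" and len(parts) >= 2:
            version = parts[1]
        elif parts[0] == "sequence-region" and len(parts) == 4:
            seqid, start, end = parts[1], int(parts[2]), int(parts[3])
            regions[seqid] = (start, end)
    return version, regions


version, regions = parse_directives([
    "##gff-version 3.2.1",
    "##sequence-region ctg123 1 1497228",
])
print(version, regions)
```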

Line 2 defines the boundaries of the gene. Column 9 of this line assigns the gene an ID of gene00001 and a human-readable name of EDEN. Because the gene is not part of a larger feature, it has no Parent.
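Column 9 is a list of tag=value pairs separated by semicolons. A minimal Python sketch of reading it follows. The helper name parse_attributes is hypothetical, and the sketch assumes that values are percent-encoded and that a comma separates multiple values of one tag.

```python
from urllib.parse import unquote


def parse_attributes(column9: str) -> dict:
    """Split a GFF3 attribute column into {tag: [value, ...]}."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        tag, _, value = pair.partition("=")
        # Split on commas first, then decode %XX escapes in each value.
        attrs[tag] = [unquote(v) for v in value.split(",")]
    return attrs


gene_attrs = parse_attributes("ID=gene00001;Name=EDEN")
print(gene_attrs)
print("Parent" in gene_attrs)  # the gene is a top-level feature
```

Storing every value as a list keeps single-valued tags such as ID and multi-valued tags such as Parent in the same shape.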

Line 3 annotates the transcriptional factor binding site. Since it is logically part of the gene, its Parent attribute is gene00001.

Lines 4–6 define this gene’s three spliced transcripts, one line for the full extent of each of the mRNAs. These features are needed as parents for the four CDSs that derive from them, and as the structural parents of the five exons in the alternative splicing set.

Lines 7–11 identify the five exons. The Parent attributes indicate which mRNAs the exons belong to. Notice that several of the exons have more than one parent; the parent IDs are separated by commas to indicate multiple parentage.
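To see what multiple parentage means in practice, the Python sketch below inverts the Parent attributes of the five exon lines to list the exons of each transcript. The attribute strings are copied from lines 7–11 of the listing; the helper name attribute is hypothetical.

```python
from collections import defaultdict

# Column 9 of the five exon lines (lines 7-11 of the listing).
exon_attributes = [
    "ID=exon00001;Parent=mRNA00003",
    "ID=exon00002;Parent=mRNA00001,mRNA00002",
    "ID=exon00003;Parent=mRNA00001,mRNA00003",
    "ID=exon00004;Parent=mRNA00001,mRNA00002,mRNA00003",
    "ID=exon00005;Parent=mRNA00001,mRNA00002,mRNA00003",
]


def attribute(column9, tag):
    """Return the comma-separated values of one tag, or [] if it is absent."""
    for pair in column9.split(";"):
        name, _, value = pair.partition("=")
        if name == tag:
            return value.split(",")
    return []


exons_by_mrna = defaultdict(list)
for column9 in exon_attributes:
    exon_id = attribute(column9, "ID")[0]
    for parent in attribute(column9, "Parent"):
        exons_by_mrna[parent].append(exon_id)

for mrna in sorted(exons_by_mrna):
    print(mrna, exons_by_mrna[mrna])
```

Each exon is written once in the file but ends up in the exon list of every transcript it names as a parent.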

Lines 12–24 denote this gene’s four CDSs. Each CDS belongs to one of the mRNAs. The CDSs cds00003 and cds00004, which correspond to alternative start codons, belong to the same mRNA.
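Each CDS is written as several lines that share one ID, and column 8 of each line holds the phase. The Python sketch below groups the CDS lines by ID and checks that the phases agree with the segment lengths. It assumes the usual reading of phase as the number of bases to skip to reach the first base of the next codon, and it handles the plus strand only; the names next_phase and phases_consistent are hypothetical.

```python
from collections import defaultdict

# (start, end, phase, ID) copied from lines 12-24 of the listing.
cds_rows = [
    (1201, 1500, 0, "cds00001"), (3000, 3902, 0, "cds00001"),
    (5000, 5500, 0, "cds00001"), (7000, 7600, 0, "cds00001"),
    (1201, 1500, 0, "cds00002"), (5000, 5500, 0, "cds00002"),
    (7000, 7600, 0, "cds00002"),
    (3301, 3902, 0, "cds00003"), (5000, 5500, 1, "cds00003"),
    (7000, 7600, 1, "cds00003"),
    (3391, 3902, 0, "cds00004"), (5000, 5500, 1, "cds00004"),
    (7000, 7600, 1, "cds00004"),
]


def next_phase(start, end, phase):
    """Phase expected on the following segment of the same CDS."""
    length = end - start + 1  # coordinates are 1-based and inclusive
    return (3 - (length - phase) % 3) % 3


def phases_consistent(parts):
    """Check each segment's phase against the one before it (plus strand)."""
    parts = sorted(parts)  # ascending coordinates run 5' to 3' on the plus strand
    return all(
        next_phase(*parts[i]) == parts[i + 1][2]
        for i in range(len(parts) - 1)
    )


segments = defaultdict(list)
for start, end, phase, cds_id in cds_rows:
    segments[cds_id].append((start, end, phase))

for cds_id in sorted(segments):
    print(cds_id, len(segments[cds_id]), phases_consistent(segments[cds_id]))
```

For example, the first segment of cds00003 is 602 bases long, which leaves two bases of an incomplete codon, so the next segment starts with phase 1, as the listing shows.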

Note that several of the features, including the gene, its mRNAs and the CDSs, have Name attributes. This attribute assigns those features a public name, but is not mandatory. The ID attributes are only mandatory for those features that have children (the gene and mRNAs), or for those that span multiple lines (the CDSs). The IDs do not have meaning outside the file in which they reside. Hence, a slightly simplified version of this file could omit the ID attributes of the TF binding site and the exons, which have no children and each occupy a single line.
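As a rough check of the ID rule, the Python sketch below works out which IDs in the example are required and which could be dropped. The rows are an abbreviated copy of the ID and Parent values in the listing, and the variable names are chosen for this example only.

```python
from collections import Counter

# (ID, parents) for each feature line, abbreviated from the listing.
rows = [
    ("gene00001", []),
    ("tfbs00001", ["gene00001"]),
    ("mRNA00001", ["gene00001"]),
    ("mRNA00002", ["gene00001"]),
    ("mRNA00003", ["gene00001"]),
    ("exon00001", ["mRNA00003"]),
    ("exon00002", ["mRNA00001", "mRNA00002"]),
    ("exon00003", ["mRNA00001", "mRNA00003"]),
    ("exon00004", ["mRNA00001", "mRNA00002", "mRNA00003"]),
    ("exon00005", ["mRNA00001", "mRNA00002", "mRNA00003"]),
]
# Each CDS ID appears once per segment line.
rows += [("cds00001", ["mRNA00001"])] * 4
rows += [("cds00002", ["mRNA00002"])] * 3
rows += [("cds00003", ["mRNA00003"])] * 3
rows += [("cds00004", ["mRNA00003"])] * 3

line_counts = Counter(feature_id for feature_id, _ in rows)
referenced = {parent for _, parents in rows for parent in parents}

# An ID is required if another feature names it as Parent,
# or if the feature spans more than one line.
required = {
    feature_id
    for feature_id, count in line_counts.items()
    if feature_id in referenced or count > 1
}
optional = sorted(set(line_counts) - required)
print(optional)
```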

Find more at this link: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
