Celera Assembler Terminology

  • An assem­bly is a set of scaf­folds com­put­ed from reads.
  • A scaf­fold is an ordered and ori­ent­ed set of one or more con­tigs with dis­tances assigned to the gaps between con­tigs. In prac­tice, each gap dis­tance is com­put­ed from mate pairs that are anchored in neigh­bor con­tigs and span the gap. A scaf­fold implies a sin­gle sequence that pos­si­bly includes gaps.
  • A con­tig con­sists of a set of reads, a lay­out that includes all the reads and leaves no gaps, a mul­ti­ple sequence align­ment of the reads, and a con­sen­sus sequence. In prac­tice con­tigs con­sist of one or more unit­igs. Note the con­sen­sus may con­tain (small) gaps spanned by reads even though the lay­out includes no (0X) gaps.
  • A unit­ig is a spe­cial kind of con­tig. Ide­al­ly, it is ful­ly con­sis­tent with all the data includ­ing reads, over­laps, and mate con­straints. In prac­tice, unit­igs can only be con­sis­tent with most of the data. Con­cep­tu­al­ly, a unit­ig is a high-con­fi­dence con­tig. Max­i­mal unit­igs should con­tain either (1) unique sequence up to repeat bound­aries, with less than a read-length of repeat on each end, or (2) near­ly the full extent of a genomic repeat.
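
To make the gap-distance idea concrete, here is a minimal Python sketch of how a gap could be estimated from the mate pairs that span it. The function name, its inputs, and the plain averaging are assumptions made for illustration; Celera Assembler's own estimator is more involved and is not described here.

```python
from statistics import mean

def estimate_gap(mates, left_len, insert_mean):
    """Estimate the gap between two neighboring contigs (illustrative only).

    mates: list of (left_start, right_end) pairs. left_start is the 0-based
        start of one read in the left contig; right_end is the end coordinate
        (exclusive) of its mate in the right contig.
    left_len: length of the left contig in bases.
    insert_mean: mean insert length of the library (assumed known).
    """
    estimates = []
    for left_start, right_end in mates:
        in_left = left_len - left_start   # insert bases lying in the left contig
        in_right = right_end              # insert bases lying in the right contig
        estimates.append(insert_mean - in_left - in_right)
    # Averaging is an assumption; a real assembler would weight by library variance.
    return mean(estimates)

gap = estimate_gap([(9000, 500), (9200, 900)], left_len=10000, insert_mean=2000)
```

A negative result means the mates imply that the two contigs should overlap, which is the situation described under negative gaps in the scaffold entry.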

A Scaffold with a Surrogate

The Cel­era Assem­bler works with frag­men­tary sequences, their detect­ed over­laps, and their given mate pairs. Often, the data are mutu­al­ly con­tra­dic­to­ry, as shown here. Yet, Cel­era Assem­bler reduces the data to a lin­ear sequence when­ev­er that is jus­ti­fied.

(A) Sequence over­laps and mate pairs sug­gest sev­er­al pos­si­ble joins. Line seg­ments rep­re­sent frag­ments, ver­ti­cal stack­ing rep­re­sents over­laps, rec­tan­gles rep­re­sent con­tigs, arrows rep­re­sent links, and every element’s thick­ness cor­re­lates to the amount of sup­port­ing data.

(B) The assem­bler reduces the graph such that one con­tra­dic­tion remains. The sequence frag­ments were reduced to con­tigs based on over­laps. The mate pairs were reduced to con­tig links of var­i­ous weights. Here, three con­tigs form a lin­ear scaf­fold but the fourth con­tig is prob­lem­at­ic.

(C) The assembler has reduced the graph to a linear sequence. Its final step was to insert the fourth contig twice. Called a multiply placed surrogate unitig, the fourth contig appears to represent an over-collapse of fragments induced by a near-perfect repeat in the genome.


Assembly
(1) A layout and associated consensus sequence(s) and/or multi-alignment(s). In other words, we use this term to speak of a tentative reconstruction of segments of the target sequence and the locations from which the reads were sampled.
Branch Point
(1) A branch point is a position on a fragment and/or chunk that is known to represent the boundary of a repetitive element. The inference one would like to make is that one side of the branch point is unique sequence and the other is repetitive, but internal repeat boundaries of micro- and mini-satellites are also detected as branch points.
Con­sen­sus Sequence (or sim­ply Con­sen­sus)
(1) Given a collection of overlapping reads that do not precisely match along their overlaps, a consensus sequence for the collection is, loosely speaking, one’s best guess at the sequence the reads were sampled from. Often people mean something more precise: mathematically, a consensus sequence is one for which the sum of the differences between the consensus sequence and each of the reads is minimal.
Contig
(1) A maximal set of reads in a layout which in aggregate cover a contiguous interval.
(2) A contiguous join of unitigs. It consists of a multiple sequence alignment of reads plus a consensus sequence, although it also has an internal unitig structure. The consensus can have short gaps representing insertions present in a minority of the underlying reads. The consensus can have regions of 0X read coverage when the consensus is due to a surrogate.
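
Definition (1) amounts to merging overlapping intervals. The sketch below, in Python, assumes each read is given as a hypothetical (name, start, end) tuple with half-open layout coordinates, and it treats any positive overlap as sufficient; a real assembler would also require a minimum overlap length and quality.

```python
def contigs_from_layout(reads):
    """Group reads into contigs: maximal sets covering a contiguous interval.

    reads: list of (name, start, end) tuples, half-open layout coordinates.
    Returns a list of (start, end, [names]) tuples, one per contig.
    """
    contigs = []
    for name, start, end in sorted(reads, key=lambda r: (r[1], r[2])):
        if contigs and start < contigs[-1][1]:
            # The read overlaps the current interval: extend that contig.
            c_start, c_end, names = contigs[-1]
            contigs[-1] = (c_start, max(c_end, end), names + [name])
        else:
            # A gap (or mere abutment) starts a new contig.
            contigs.append((start, end, [name]))
    return contigs

layout = [("r1", 0, 100), ("r2", 80, 180), ("r3", 300, 400), ("r4", 350, 420)]
result = contigs_from_layout(layout)
```

With this layout, r1 and r2 form one contig and r3 and r4 form another, because no read covers positions 180 to 300.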
Degenerate
(1) A unitig that could not be combined into any scaffold. It is like a singleton but it has more than one read. Degenerates sometimes contain high-copy plasmid sequence. Degenerates can reflect biological phenomena that undermine the assumptions of Celera Assembler’s mathematical model.
Fragment
(1) Either a guide or a read. Unfortunately, this term has a long history of different uses by different groups. In particular, one may actually be talking about inserts. Usually the intended meaning is clear from context, but when it isn’t and it’s important to understand the precise meaning, be sure to ask for clarification.
Guide (obso­lete)
(1) A read-sized sequence of the rel­e­vant genome sup­plied from an exter­nal data source, e.g. an STS mark­er, a BAC-end, or a fab­ri­cat­ed piece of a known BAC.
Insert
(1) A segment of the target genome placed into a vector and ultimately end-sequenced by us. For example, we are currently planning on sequencing the ends of a 4/1 mix of 2Kbp and 10Kbp inserts.
Layout
(1) A layout is a (partial) positioning of a set of reads with respect to each other, subject to the one constraint that every pair of reads that overlap in the layout do so in the sense defined in the overlap entry. The term layout is intended to speak specifically to the arrangement of the reads, as opposed to their mutual connectivity (as in “contig”) or the sequence(s) the set models (as in “consensus”). A layout includes the orientation of the fragments and, when reads are mate-linked, gives the estimated distance between contigs that contain each end of a mate pairing.
Mate-Pair or Mates
(1) A pair of reads taken from the two ends of a given insert.
Mul­ti Align­ment
(1) A mul­ti-align­ment of a set of over­lap­ping frag­ments is a matrix in which a row is a pos­si­bly emp­ty pre­fix of blanks, fol­lowed by the sequence of a frag­ment inter­spersed with dash­es, fol­lowed by a pos­si­bly emp­ty suf­fix of blanks. One gen­er­al­ly seeks the mul­ti-align­ment of the frag­ments that expos­es their sim­i­lar­i­ty and sup­ports the evi­dence for a par­tic­u­lar con­sen­sus sequence. Indeed, any com­pu­ta­tion that pro­duces a con­sen­sus either implic­it­ly or explic­it­ly com­putes a mul­ti-align­ment of the under­ly­ing reads.
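
As an illustration of the matrix described above, the following Python sketch derives a consensus by majority vote in each column. Rows use blanks outside a read and dashes for alignment gaps, as in the definition. Column voting is a common heuristic and only approximates the minimal-sum-of-differences definition of consensus; the function name is hypothetical.

```python
from collections import Counter

def column_consensus(rows):
    """Majority-vote consensus over a multi-alignment.

    rows: equal-length strings; ' ' marks positions outside a read and
    '-' marks an alignment gap inside a read.
    """
    width = len(rows[0])
    assert all(len(r) == width for r in rows), "rows must have equal length"
    consensus = []
    for col in range(width):
        votes = Counter(r[col] for r in rows if r[col] != " ")
        if not votes:
            continue  # no read covers this column
        base, _ = votes.most_common(1)[0]  # ties are broken arbitrarily
        if base != "-":
            consensus.append(base)  # a winning dash drops the column
    return "".join(consensus)

alignment = [
    "ACGT-ACG    ",
    "  GTTACGTA  ",
    "    -ACGTACC",
]
consensus = column_consensus(alignment)
```

In the fifth column two reads carry a dash and one carries a T, so the T is treated as an insertion in a minority of reads and is left out of the consensus.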
Overlap
(1) A pair of sequences, say A and B, overlap if there is an interval of A and an interval of B that match to within a user-specified level of similarity. For example, if the sequencing error rate is less than 2%, then a match with fewer than 4% differences constitutes an overlap. Typically, one is also implying that the segments involved constitute either a suffix/prefix pair (a “dovetail overlap”) or all of one of the two sequences (a “containment overlap”). In pictures,
   A -------------------          or    A --------------------.
         ------------------- B                 ---------- B
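
The two pictured cases can be told apart with a short Python sketch. It is a simplification under stated assumptions: it compares sequences without indels (real overlappers align with gaps), it only tests B against A in one orientation, and the names `classify_overlap`, `max_diff`, and `min_len` are hypothetical.

```python
def mismatch_fraction(x, y):
    """Fraction of differing positions between two equal-length strings."""
    return sum(c != d for c, d in zip(x, y)) / len(x)

def classify_overlap(a, b, max_diff=0.04, min_len=8):
    """Return ('containment', offset), ('dovetail', length), or None.

    max_diff mirrors the 4% figure in the text; min_len is an
    illustrative minimum overlap length.
    """
    # Containment: all of b matches an interval of a.
    if min_len <= len(b) <= len(a):
        for off in range(len(a) - len(b) + 1):
            if mismatch_fraction(a[off:off + len(b)], b) < max_diff:
                return ("containment", off)
    # Dovetail: a suffix of a matches a prefix of b; prefer the longest.
    for n in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if mismatch_fraction(a[-n:], b[:n]) < max_diff:
            return ("dovetail", n)
    return None

a = "ACGTACGTTAGCCGATAGGCTA"
dovetail = classify_overlap(a, "CGATAGGCTATTGACCA")
contained = classify_overlap(a, "TACGTTAGCC")
```

Here the last 10 bases of A equal the first 10 bases of the first B, giving a dovetail overlap, while the second B lies entirely inside A starting at offset 3.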
Read
(1) A single sequence read produced on an ABI 3700 by our internal production pipeline.
Rocks and Stones
(1) Unitigs that were used to fill a gap in a scaffold. They are usually short and repetitive. Rocks require higher confidence joins than stones. (An even lower confidence category, pebbles, was discontinued after its use in the Celera assembly of Drosophila.) Rocks and stones are “thrown” into gaps late in the scaffold building process. They are thrown in multiple iterations, with the loop count controlled by a run-time parameter.
Scaffold
(1) A maximal set of contigs in a layout that are connected together by mate-links.
(2) A linear ordering of contigs joined by mate pairs. A scaffold defines the order and orientation (DNA strand) for each component contig. There are two ways to measure scaffold length. “Scaffold bases” is the sum of contig lengths. “Scaffold span” is that plus the sum of gap lengths. Celera Assembler uses complex criteria to build scaffolds, but some generalizations apply. Every gap in a scaffold is spanned by at least two mate pairs. A gap with negative length means the sequence data and mate data disagree. Usually, negative gaps are small (20bp) and induced by low-quality sequence at the end of a read. In the FASTA representation of a scaffold, negative gaps are represented by a fixed number (20) of N’s.
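
The two length measures and the FASTA convention can be sketched in Python as follows. The function names are hypothetical. Writing a positive gap as that many N's, and treating a zero-length gap like a negative one, are assumptions for illustration; only the fixed run of 20 N's for negative gaps comes from the definition above.

```python
def scaffold_stats(contig_lengths, gap_lengths):
    """Return (scaffold_bases, scaffold_span) for one scaffold."""
    assert len(gap_lengths) == len(contig_lengths) - 1
    bases = sum(contig_lengths)       # "scaffold bases"
    span = bases + sum(gap_lengths)   # "scaffold span"; negative gaps shrink it
    return bases, span

def scaffold_sequence(contigs, gap_lengths, negative_gap_ns=20):
    """Join contig sequences, filling each gap with a run of N's."""
    assert len(gap_lengths) == len(contigs) - 1
    parts = [contigs[0]]
    for gap, contig in zip(gap_lengths, contigs[1:]):
        # Assumption: positive gaps are written at their estimated length.
        n_count = gap if gap > 0 else negative_gap_ns
        parts.append("N" * n_count)
        parts.append(contig)
    return "".join(parts)

bases, span = scaffold_stats([1000, 500, 2000], [150, -20])
seq = scaffold_sequence(["ACGT", "GGCC", "TTAA"], [3, -20])
```

Note that a scaffold with negative gaps has a FASTA sequence longer than its span, because each negative gap subtracts from the span but adds 20 N's to the sequence.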
Singleton
(1) A read that could not be assembled. Singletons can represent contamination, unique sequence with no overlap due to the fluctuation of random coverage, or sequence with so many overlaps it could not be assembled efficiently. It can happen that a mate pair has two singletons, and in some contexts these pairs are called mini-scaffolds.
Sin­gle­ton Unit­ig
(1) A unit­ig con­sist­ing of a sin­gle frag­ment.
Surrogate
(1) A unitig whose arrival rate statistic was beyond the expected range. Such unitigs are treated as collapsed repeats. Their consensus may get placed in one or more scaffolds. Some of their reads may get placed, by mates, late in the pipeline. When a repetitive unitig cannot be placed even once, it becomes a degenerate.
Unit­ig (also Chunk)
(1) A high-con­fi­dence con­tig seed. The end of a unit­ig is, by def­i­n­i­tion, a place where the over­lap data shows mul­ti­ple, mutu­al­ly con­tra­dic­to­ry, paths. Unit­igs are sup­posed to end at repeats.
(2) A uniquely assemblable subset of overlapping fragments. A unitig and/or chunk is an assembly of fragments for which there are no competing choices in terms of internal overlaps. This means that a chunk is either a correctly assembled portion of a contig or it is an overcompressed assembly of several high-fidelity copies of a repeat. Every fragment belongs to one chunk.
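
The "no competing choices" condition can be pictured as cutting an overlap graph into maximal non-branching paths. The Python sketch below assumes a simplified directed graph, given as a hypothetical `successors` dictionary, and ignores strand orientation and isolated cycles; it is not Celera Assembler's unitigger.

```python
def unitig_paths(successors):
    """Split a directed overlap graph into maximal non-branching paths.

    successors: dict mapping each read to the list of reads it overlaps
    on its right-hand side.
    """
    preds = {v: [] for v in successors}
    for u, outs in successors.items():
        for v in outs:
            preds.setdefault(v, []).append(u)

    def extends(u):
        # u continues into its successor only if neither side has a choice.
        outs = successors.get(u, [])
        return len(outs) == 1 and len(preds[outs[0]]) == 1

    paths = []
    for v in preds:
        p = preds[v]
        if len(p) == 1 and extends(p[0]):
            continue  # v is the interior or end of some other path
        path = [v]
        while extends(path[-1]):
            path.append(successors[path[-1]][0])
        paths.append(path)
    return paths

graph = {"a": ["b"], "b": ["c"], "c": ["d", "e"], "d": [], "e": []}
chunks = unitig_paths(graph)
```

Reads a, b, and c chain without ambiguity, but c has two competing right-hand overlaps, so the path ends there and d and e each start their own chunk. Every read lands in exactly one path.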
Unique Unitig
(1) A unitig with an arrival rate statistic (based on unitig length and read coverage) within the expected range. The uniqueness designation becomes important during the scaffold building stage. Only a unique unitig can seed a contig. Contigs can be extended by mates and overlaps from their unique unitigs only.
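
One way to build a statistic from unitig length and read coverage is a log-odds comparison of two Poisson models: reads arriving at the genome-wide rate (single copy) versus twice that rate (two collapsed copies). The Python sketch below shows that calculation. It is an assumption for illustration; the exact formula, scaling, and cutoffs used by Celera Assembler are not given in this glossary.

```python
import math

def arrival_log_odds(unitig_len, n_reads, total_reads, genome_len):
    """Natural-log odds that a unitig is single-copy rather than two-copy.

    Read starts are modeled as a Poisson process with rate
    total_reads / genome_len per base. With expected count m under the
    single-copy model, ln[P(k | m) / P(k | 2m)] simplifies to m - k * ln(2).
    """
    expected = unitig_len * total_reads / genome_len  # reads expected if unique
    return expected - n_reads * math.log(2)

# Hypothetical numbers: 100,000 reads over a 1 Mbp genome, 2,000 bp unitig.
unique_like = arrival_log_odds(2000, 200, 100_000, 1_000_000)
repeat_like = arrival_log_odds(2000, 400, 100_000, 1_000_000)
```

A clearly positive value supports the single-copy model, while a clearly negative value suggests a collapsed repeat of the kind the assembler treats as repetitive.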
