DAVID new version: 6.8

DAVID 6.8 (current beta release), May 2016

- The DAVID Knowledgebase completely rebuilt
  - Entrez Gene integrated as the central identifier to allow for more timely updates while still incorporating Ensembl and UniProt as integral data sources
  - New GO category (GO Direct) provides GO mappings directly annotated by the source database (no parent terms included)
  - New annotation categories
  - New list identifier systems added for list uploading and conversion
  - A few bugs fixed

Division of Medical Sciences at Harvard Medical School (DMS)

Bioinformatics and Integrative Genomics (BIG)

The Bioin­for­mat­ics and Inte­gra­tive Genomics (BIG) Pro­gram enrolls PhD stu­dents with excep­tion­al train­ing in quan­ti­ta­tive sci­ences and strong inter­est in bio­med­ical appli­ca­tions. Research areas encom­pass com­pu­ta­tion­al analy­sis and math­e­mat­i­cal mod­el­ing of data gen­er­at­ed by DNA sequence, gene expres­sion, struc­tural, pro­teomics, and metabo­lite-assay­ing tech­nolo­gies. In applied projects, they may also include inte­gra­tion of clin­i­cal and pop­u­la­tion data from elec­tron­ic health records. Both bioin­for­mat­ics and genomics are tight­ly linked to the math­e­mat­i­cal and bio­phys­i­cal mod­el­ing of com­plex bio­log­i­cal sys­tems and exper­i­men­tal val­i­da­tion of com­pu­ta­tion­al pre­dic­tions. Grad­u­ate stu­dents will con­duct orig­i­nal research in the devel­op­ment of nov­el approach­es and new tech­nolo­gies to address fun­da­men­tal bio­log­i­cal ques­tions, and they will acquire the skills to be lead­ers in the field of bioin­for­mat­ics and genomics. Stu­dents will be joint mem­bers of BIG and a “home pro­gram” cho­sen from one of the four DMS pro­grams (BBS, Immunol­o­gy, Neu­ro­science, Virol­o­gy). BIG stu­dents will fol­low the cur­ricu­lum and par­tic­i­pate in activ­i­ties of the home pro­gram, which will be sup­ple­ment­ed with BIG pro­gram­mat­ic and cur­ric­u­lar offer­ings.

Terminology

1. Fold-coverage

The theoretical “fold-coverage” of a shotgun sequencing experiment is:

fold-coverage = (number of reads × read length) / target size
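As a quick check of the formula, the arithmetic can be sketched in Python. The function name and the example numbers (one million 100 bp reads against a 5 Mb target) are illustrative assumptions, not values from these notes.

```python
def fold_coverage(num_reads: int, read_length: int, target_size: int) -> float:
    """Theoretical fold-coverage: number of reads * read length / target size."""
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    return num_reads * read_length / target_size


# Hypothetical example: 1,000,000 reads of 100 bp against a 5,000,000 bp target.
print(fold_coverage(1_000_000, 100, 5_000_000))  # 20.0
```

The result is an average: real coverage varies along the target, so some regions will be covered less than this figure suggests.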

2. Amplicon

An amplicon is a piece of DNA or RNA that is the source and/or product of natural or artificial amplification or replication events.

It can be formed using var­i­ous meth­ods includ­ing poly­merase chain reac­tions (PCR), lig­ase chain reac­tions (LCR), or nat­u­ral gene dupli­ca­tion.

3. Whole genome mapping

A Whole Genome Map is a high-resolution, ordered, whole genome restriction map generated from single DNA molecules extracted from bacteria, yeast, or other fungi. Whole Genome Mapping is a novel technology with unique capabilities in the field of microbiology, with specific applications in the areas of Comparative Genomics, Strain Typing, and Whole Genome Sequence Assembly. Whole Genome Maps are generated de novo, independent of sequence information, require no amplification or PCR steps, and provide a comprehensive view of whole genome architecture. A Whole Genome Map is displayed in the MapCode pattern, where the vertical lines indicate the locations of restriction sites and the distance between the lines represents the restriction fragment size.

4. Radiation hybrid mapping

A theory is developed to predict marker retention and conditional retention or loss in radiation hybrids. Applied to multiple pairwise analysis of a human chromosome 21 data set, this theory fits much better than proposed alternatives and gives a physical map consistent with other evidence and robust with respect to errors in typing. Radiation hybrids have great promise to provide order and physical location at two levels of resolution, spanning the techniques of linkage and restriction fragments and not limited to polymorphic loci.

5. DNA barcoding

DNA barcoding is a taxonomic method that uses a short genetic marker in an organism’s DNA to identify it as belonging to a particular species.

6. Metric space

In mathematics, a metric space is a set for which distances between all members of the set are defined. Those distances, taken together, are called a metric on the set.

7. Pseudometric space

In mathematics, a pseudometric space is a generalized metric space in which the distance between two distinct points can be zero.
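For reference, the standard metric axioms can be written out as below. The symbols d, X, x, y and z are notation introduced here rather than taken from the notes. A pseudometric keeps every line except the second: d(x, x) = 0 still holds, but d(x, y) = 0 no longer forces x = y.

```latex
\begin{align*}
d &: X \times X \to [0, \infty) \\
d(x, y) &= 0 \iff x = y \\
d(x, y) &= d(y, x) \\
d(x, z) &\le d(x, y) + d(y, z)
\end{align*}
```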

8. Pyrosequencing

Pyrosequencing is a method of DNA sequencing (determining the order of nucleotides in DNA) based on the “sequencing by synthesis” principle. It differs from Sanger sequencing in that it relies on the detection of pyrophosphate release on nucleotide incorporation, rather than chain termination with dideoxynucleotides. The sequence is determined from the light emitted when the next complementary nucleotide is incorporated: only one of the four possible nucleotides (A, T, C or G) is added and available at a time, so only one letter can be incorporated on the single-stranded template (the sequence to be determined). The intensity of the light indicates whether more than one of these “letters” occurs in a row. The previous nucleotide (one of the four possible dNTPs) is degraded before the next one is added for synthesis, so that the resulting light intensity reveals the next nucleotide(s) if the nucleotide added was the next complementary letter in the sequence. This process is repeated with each of the four letters until the DNA sequence of the single-stranded template is determined.
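The dispense-and-measure cycle described above can be illustrated with an idealized simulation. This is only a sketch: it assumes a noise-free signal exactly proportional to the number of bases incorporated, a fixed cyclic dispensation order ("TACG" is an arbitrary choice), and hypothetical function names.

```python
from itertools import cycle


def flowgram(synthesized: str, dispensation_order: str = "TACG", max_flows: int = 100):
    """Return (nucleotide, signal) pairs for an idealized pyrosequencing run.

    `synthesized` is the strand being built (the complement of the template).
    One nucleotide is dispensed per flow; the signal is the number of
    consecutive bases incorporated in that flow (0 means no light).
    """
    flows = []
    pos = 0
    for nucleotide in cycle(dispensation_order):
        if pos >= len(synthesized) or len(flows) >= max_flows:
            break
        run = 0
        while pos < len(synthesized) and synthesized[pos] == nucleotide:
            run += 1
            pos += 1
        flows.append((nucleotide, run))
    return flows


def read_flowgram(flows):
    """Reconstruct the synthesized sequence from the idealized signals."""
    return "".join(nucleotide * signal for nucleotide, signal in flows)


flows = flowgram("GGATC")
print(flows)
print(read_flowgram(flows))  # GGATC
```

In this model a homopolymer such as "GG" shows up as a single flow with signal 2, which corresponds to the statement that light intensity tells how many identical letters occur in a row.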

9. n-gram (k-mer)

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus.

Example from DNA sequencing (unit: base pair):

Sample sequence: AGCTTCGA

1-grams: …, A, G, C, T, T, C, G, A, …

2-grams: …, AG, GC, CT, TT, TC, CG, GA, …

3-grams: …, AGC, GCT, CTT, TTC, TCG, CGA, …
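The n-gram lists for the sample sequence AGCTTCGA can be reproduced with a sliding window. A minimal sketch; the function name is an assumption of this example.

```python
def kmers(sequence: str, k: int):
    """Return every contiguous substring of length k, in order of appearance."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


seq = "AGCTTCGA"
print(kmers(seq, 2))  # ['AG', 'GC', 'CT', 'TT', 'TC', 'CG', 'GA']
print(kmers(seq, 3))  # ['AGC', 'GCT', 'CTT', 'TTC', 'TCG', 'CGA']
```

A sequence of length L yields L - k + 1 k-mers, so the 8-base sample gives seven 2-mers and six 3-mers.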

10. Sequence space

In evolutionary biology, sequence space is a way of representing all possible sequences (for a protein, gene or genome).

11. k-mer distance

1. li, lj: the two sequences.

2. τ: a k-mer (a subsequence of length k);

ni(τ), nj(τ): the number of times τ occurs among the k-mers of each of the two sequences.

3. ki,j: the k-mer similarity of the two sequences.
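The formula these symbols belong to is not reproduced in the notes. As a sketch only, the code below assumes one commonly used definition of k-mer similarity: the number of k-mers shared by the two sequences, that is the sum over τ of min(ni(τ), nj(τ)), divided by the number of k-mers in the shorter sequence. The normalization and the function names are assumptions and may differ from the definition the notes were taken from.

```python
from collections import Counter


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count the occurrences n(tau) of every k-mer tau in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_similarity(seq_i: str, seq_j: str, k: int) -> float:
    """Assumed definition: sum_tau min(n_i(tau), n_j(tau)) / (min(len_i, len_j) - k + 1)."""
    denominator = min(len(seq_i), len(seq_j)) - k + 1
    if denominator <= 0:
        raise ValueError("both sequences must be at least k long")
    n_i = kmer_counts(seq_i, k)
    n_j = kmer_counts(seq_j, k)
    # Counter returns 0 for k-mers that are absent from the other sequence.
    shared = sum(min(count, n_j[tau]) for tau, count in n_i.items())
    return shared / denominator


print(kmer_similarity("AGCTTCGA", "AGCTTCGA", 3))  # 1.0
print(kmer_similarity("AAAA", "CCCC", 2))          # 0.0
```

Under this assumed definition a distance can be taken as 1 minus the similarity, so identical sequences are at distance 0.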

12. Optical map (ordered restriction map)

Optical mapping is a technique for constructing ordered, genome-wide, high-resolution restriction maps from single, stained molecules of DNA, called “optical maps”. By mapping the location of restriction enzyme sites along the unknown DNA of an organism, the spectrum of resulting DNA fragments collectively serves as a unique “fingerprint” or “barcode” for that sequence.

13. Restriction map

A restric­tion map is a map of known restric­tion sites with­in a sequence of DNA. Restric­tion map­ping requires the use of restric­tion enzymes. In mol­e­c­u­lar biol­o­gy, restric­tion maps are used as a ref­er­ence to engi­neer plas­mids or oth­er rel­a­tive­ly short pieces of DNA, and some­times for longer genomic DNA.
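For a known sequence, a restriction map amounts to the list of positions where a recognition site occurs, plus the fragment sizes between them. The sketch below assumes a linear molecule and a single enzyme, and for simplicity cuts at the start of the recognition site rather than at the enzyme's true cut offset. The example sequence is made up; GAATTC is the EcoRI recognition site.

```python
def restriction_map(sequence: str, site: str):
    """Return (site start positions, fragment lengths) for one recognition site.

    Positions are 0-based starts of each occurrence of `site`.
    """
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    boundaries = [0] + positions + [len(sequence)]
    fragments = [b - a for a, b in zip(boundaries, boundaries[1:]) if b > a]
    return positions, fragments


# Hypothetical 30 bp sequence containing two EcoRI sites.
dna = "AAGAATTCCCGGGTTTGAATTCAAACCCGG"
print(restriction_map(dna, "GAATTC"))  # ([2, 16], [2, 14, 14])
```

The fragment lengths are what an optical map or a gel measures; the positions are what the map records.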

14. Expressed sequence tag

An expressed sequence tag or EST is a short sub-sequence of a cDNA sequence. ESTs may be used to identify gene transcripts, and are instrumental in gene discovery and gene sequence determination. The identification of ESTs has proceeded rapidly, with approximately 74.2 million ESTs available in public databases (e.g. GenBank, 1 January 2013, all species).

15. Multiple Sequence Alignment

A Multiple Sequence Alignment (MSA) is a sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences is assumed to have an evolutionary relationship by which the sequences share a lineage and are descended from a common ancestor. From the resulting MSA, sequence homology can be inferred and phylogenetic analysis can be conducted to assess the sequences’ shared evolutionary origins. Visual depictions of the alignment illustrate mutation events such as point mutations (single amino acid or nucleotide changes), which appear as differing characters in a single alignment column, and insertion or deletion mutations (indels or gaps), which appear as hyphens in one or more of the sequences in the alignment. Multiple sequence alignment is often used to assess sequence conservation of protein domains, tertiary and secondary structures, and even individual amino acids or nucleotides.

16. POA (Partial Order Alignment)

Partial order alignment (POA) has been proposed as a new approach to multiple sequence alignment (MSA), which can be combined with existing methods such as progressive alignment. This is important for addressing problems both in the original version of POA (such as order sensitivity) and in standard progressive alignment programs (such as information loss in complex alignments, especially surrounding gap regions).

17. Progressive Alignment

This approach begins with the align­ment of the two most close­ly relat­ed sequences (as deter­mined by pair­wise analy­sis) and sub­se­quent­ly adds the next clos­est sequence or sequence group to this ini­tial pair [37,7]. This process con­tin­ues in an iter­a­tive fash­ion, adjust­ing the posi­tion­ing of indels in all sequences. The major short­com­ing of this approach is that a bias may be intro­duced in the infer­ence of the ordered series of motifs (homol­o­gous parts) because of an over­rep­re­sen­ta­tion of a sub­set of sequences.

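18. Ribosomal small subunit (SSU)

The small subunit is the smaller of the two ribosomal subunits. Every ribosome consists of one small subunit and one large subunit.[1] During translation, the small subunit is responsible for recognizing the information. The 70S ribosome of prokaryotic cells, the 80S ribosome in the cytoplasm of eukaryotic cells, and the mitochondrial ribosome in eukaryotic mitochondria each have a different small subunit: the 70S ribosome contains the 30S subunit, the 80S ribosome contains the 40S subunit, and the mitochondrial ribosome contains the 28S subunit.

Prokaryotes (70S ribosome)
  Large subunit: 50S (contains 5S rRNA and 23S rRNA)
  Small subunit: 30S (contains 16S rRNA)
Eukaryotes
  Cytoplasmic ribosome (80S ribosome)
    Large subunit: 60S (contains 5S rRNA, 5.8S rRNA and 28S rRNA)
    Small subunit: 40S (contains 18S rRNA)
  Mitochondrial ribosome
    Large subunit: 39S (16S rRNA, MT-RNR2)
    Small subunit: 28S (12S rRNA, MT-RNR1)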

19. Rare biosphere

The low-abundance, high-diversity group is what is now called the “Rare Biosphere”.

20. Phred quality score

Phred qual­i­ty scores were orig­i­nal­ly devel­oped by the pro­gram Phred to help in the automa­tion of DNA sequenc­ing in the Human Genome Project. Phred qual­i­ty scores are assigned to each nucleotide base call in auto­mat­ed sequencer traces.[1][2] Phred qual­i­ty scores have become wide­ly accept­ed to char­ac­ter­ize the qual­i­ty of DNA sequences, and can be used to com­pare the effi­ca­cy of dif­fer­ent sequenc­ing meth­ods. Per­haps the most impor­tant use of Phred qual­i­ty scores is the auto­mat­ic deter­mi­na­tion of accu­rate, qual­i­ty-based con­sen­sus sequences.
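The notes do not state the formula itself. The standard definition is Q = -10 · log10(P), where P is the estimated probability that the base call is wrong, so Q20 corresponds to a 1 in 100 error rate and Q30 to 1 in 1000. A minimal sketch of the conversion follows; the function names are assumptions, and the FASTQ decoder assumes the common Phred+33 ASCII encoding.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Q = -10 * log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -10 * math.log10(p)


def decode_fastq_quality(quality_string: str, offset: int = 33):
    """Decode an ASCII-encoded quality string (Phred+33 by default)."""
    return [ord(ch) - offset for ch in quality_string]


print(phred_to_error_probability(30))
print(decode_fastq_quality("!5I"))  # [0, 20, 40]
```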

21. Base calling

Base calling is the process of assigning bases (nucleobases) to chromatogram peaks. One of the best computer programs for this job is Phred, which is currently the base-calling software most widely used by both academic and commercial DNA sequencing laboratories because of its high base-calling accuracy.

22. MIAME (Minimum Information About a Microarray Experiment)

MIAME describes the minimum information about a microarray experiment that is needed to interpret the results of the experiment unambiguously and potentially to reproduce the experiment:

1. The raw data for each hybridisation.

2. The final processed data for the set of hybridisations in the experiment (study).

3. The essential sample annotation, including experimental factors and their values.

4. The experiment design, including sample data relationships.

5. Sufficient annotation of the array design.

6. Essential experimental and data processing protocols.

 

 

How to Become a Bioinformatics Professional

1 Under­stand what Bioin­for­mati­cians do.

  • Broad­ly, com­pu­ta­tion­al biol­o­gy is involved with devel­op­ing and imple­ment­ing tools in order to use and man­age bio­log­i­cal data.
  • The med­ical field is a major employ­er of Bioin­for­mati­cians, but they are also need­ed in indus­try and agri­cul­ture.

2 Stay abreast of new devel­op­ments in Bioin­for­mat­ics and biotech­nol­o­gy.

  • This high­ly tech­no­log­i­cal field is under­go­ing rapid changes.
  • The Bioin­for­mat­ics Orga­ni­za­tion offers con­tin­u­ing edu­ca­tion cours­es.

3 Become pro­fi­cient in com­put­er sci­ence.

  • This includes data­base admin­is­tra­tion and pro­gram­ming skills.
  • UNIX is cur­rent­ly the pre­ferred oper­at­ing sys­tem plat­form.
  • Be able to write pro­grams in com­put­er lan­guages such as PERL, SQL and C.
  • Learn to use genomic sequence analy­sis and mol­e­c­u­lar mod­el­ing pro­grams.

4 Study col­lege lev­el biol­o­gy.

  • Biol­o­gy cours­es should include ana­lyt­i­cal tech­niques and mol­e­c­u­lar biol­o­gy.

5 Take math cours­es, par­tic­u­lar­ly those for biol­o­gists.

  • Bio­sta­tis­tics is an impor­tant dis­ci­pline in Bioin­for­mat­ics.

6 Pur­sue high­er edu­ca­tion. Under­grad­u­ate degrees can be in biol­o­gy, com­put­er sci­ence or biotech­nol­o­gy.

  • In graduate school, find a program that combines both disciplines if possible; however, the emphasis seems to be on studying molecular biology while acquiring information technology skills.
  • Bioin­for­mat­ics or com­pu­ta­tion­al biol­o­gy pro­grams are still fair­ly new.
  • Researchers should have a doc­tor­ate in biol­o­gy, sta­tis­tics or math.

7 Learn to iden­ti­fy the right ques­tions to ask in addi­tion to the method­olo­gies to apply.

It sounds just that easy...

http://www.wikihow.com/Become-a-Bioinformatics-Professional

DNA Packaging: Nucleosomes and Chromatin


At the top right por­tion of the dia­gram, a ver­ti­cal dou­ble-end­ed arrow indi­cates that the DNA dou­ble helix strands are 2 nm apart. The strands are rep­re­sent­ed as gray rib­bons con­nect­ed by ver­ti­cal col­ored bars that are either half red/half green or half yellow/half cyan.

As the DNA strand reaches the left side of the illustration, all colors are replaced by gray. Box 1 has the text “At the simplest level, chromatin is a double-stranded helical structure of DNA.” The DNA strand turns down and goes back toward the right, still compacting along the way.

Below this is Box 2, with the text “DNA is com­plexed with his­tones to form nucle­o­somes.” Toward the cen­ter of the schemat­ic are three sets of two brown discs, each disc quar­tered, and the cylin­ders are wrapped 1.65 times by the DNA, which has now com­pact­ed into a thick gray thread shape. Each nucle­o­some con­sists of eight his­tone mol­e­cules.

To the right of the first nucle­o­some com­plex is Box 3, with the text “Each nucle­o­some con­sists of eight his­tone pro­teins around which the DNA wraps 1.65 times.” The sec­ond nucle­o­some has a ver­ti­cal red bar, about as long as the nucle­o­some is high, attached to the side of the nucle­o­some. This bar is labeled H1 his­tone. A hor­i­zon­tal, dou­ble-end­ed, black arrow indi­cates the nucle­o­some with DNA has a diam­e­ter of 11 nm. A third nucle­o­some to the right of the sec­ond is labeled “Chro­mato­some.” Above and to the right of the chro­mato­some is Box 4, with the text “A chro­mato­some con­sists of a nucle­o­some plus the H1 his­tone.”

Below this, the nucle­o­somes are fold­ed in on each oth­er to form a hol­low, tube-like fiber, where many nucle­o­somes are arranged in par­al­lel rings to form the tube’s out­er lay­er. To the right of this is a ver­ti­cal, dou­ble-end­ed, black arrow labeled 30 nm. To the right of this arrow is Box 5, with the text “The nucle­o­somes fold up to pro­duce a 30-nm fiber…” The nucle­o­some tube con­tin­ues to com­pact to form a gray spi­ral and gray squig­gles as it con­tin­ues left­ward. Above this is Box 6 with the text “… that forms loops aver­ag­ing 300 nm in length.” A black, ver­ti­cal, dou­ble-end­ed arrow is labeled 300 nm. The squig­gles com­pact fur­ther, going down and back toward the right, coil­ing like a tele­phone cord. Below this is Box 7 with the text “The 300-nm fibers are com­pressed and fold­ed to pro­duce a 250-nm-wide fiber.” A black, ver­ti­cal, dou­ble-end­ed arrow is labeled 700 nm. Two, inward-point­ing, black arrows indi­cate a gap labeled “250-nm-wide fiber.”

The­se coils con­tin­ue to the right and com­press fur­ther, form­ing a hor­i­zon­tal, X-shaped, chro­mo­some. A black, ver­ti­cal, dou­ble-end­ed arrow is labeled 1400 nm. Below this is Box 8 with the text “Tight coil­ing of the 250-nm fiber pro­duces the chro­matid of a chro­mo­some.”

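What is missing here is that the folding details between Box 6 and Box 8 are not explained clearly. In addition, the conformation of chromatin and chromosomes changes dynamically over time rather than staying fixed, so the description above is best read as a snapshot.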

http://www.nature.com/scitable/topicpage/DNA-Packaging-Nucleosomes-and-Chromatin-310

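Excel data analysis

  1. Regression analysis
  2. Line chart
  3. Quickly applying a formula to many cells:
    1. Enter the formula in one cell;
    2. Click that cell, then press Ctrl+Shift+arrow key to select all the cells the formula should apply to;
    3. Press Ctrl+D to fill the formula down and complete the calculation.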

 

Oxford Nanopore Technology — MinION

Min­ION is a portable device for mol­e­c­u­lar analy­ses that is dri­ven by nanopore tech­nol­o­gy. It is adapt­able for the analy­sis of DNA, RNA, pro­teins or small mol­e­cules with a straight­for­ward work­flow.


Scalability

The MinION can be run for minutes or days according to the experimental need. Users can adjust settings such as the speed at which the DNA passes through the nanopore. PromethION, which will soon be released into early access, is designed to be fully scalable so that users can operate between one and 48 flow cells at any one time.

Long read lengths

The Oxford Nanopore system processes the reads that are presented to it rather than generating specific read lengths. The longest read reported by a MinION user to date is more than 200 kb, but the system can process the full spectrum of read lengths.

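Algorithms around us

  1. The built-in Windows Solitaire game (NP-hard)
    Generating a deal that is guaranteed to be solvable is an NP-hard problem; the computer simply deals a random deck each time and does not guarantee that a solution exists. So when a game cannot be finished no matter what you try, it may be that this deal really is unsolvable rather than that you played badly. However, deciding whether the fault lies with you or with the deal is itself NP-hard. One could write a program to work out, at the point where we feel stuck, whether the cards revealed so far admit some new arrangement that would open up further cards to turn over.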

Keywords of Genomics

Pop­u­la­tion genet­ics

Pop­u­la­tion genet­ics is the study of the dis­tri­b­u­tion and change in fre­quen­cy of alle­les with­in pop­u­la­tions, and as such it sits firm­ly with­in the field of evo­lu­tion­ary biol­o­gy.

The main process­es of evo­lu­tion are nat­u­ral selec­tion, genet­ic drift, gene flow, muta­tion, and genet­ic recom­bi­na­tion and they form an inte­gral part of the the­o­ry that under­pins pop­u­la­tion genet­ics.

Stud­ies in this branch of biol­o­gy exam­ine such phe­nom­e­na as adap­ta­tion, spe­ci­a­tion, pop­u­la­tion sub­di­vi­sion, and pop­u­la­tion struc­ture.

Pop­u­la­tion strat­i­fi­ca­tion

Population stratification refers to differences in allele frequencies between cases and controls that are due to systematic differences in ancestry rather than to association of genes with disease.

Diploid genome

Diploid genome refers to a genome that con­tains a bal­anced set of chro­mo­somes derived equal­ly from mater­nal and pater­nal sources.

Coa­les­cent the­o­ry

Coa­les­cent the­o­ry is a ret­ro­spec­tive sto­chas­tic mod­el of pop­u­la­tion genet­ics that relates genet­ic diver­si­ty in a sam­ple to demo­graph­ic his­to­ry of the pop­u­la­tion from which it was tak­en.

That is, it is a mod­el of the effect of genet­ic drift, viewed back­wards in time, on the geneal­o­gy of antecedents.

J.Q. Liu

  1. Yak whole-genome rese­quenc­ing reveals domes­ti­ca­tion sig­na­tures and pre­his­toric pop­u­la­tion expan­sions (2015)
    1. genome vari­a­tion of wild and domes­tic yaks
    2. evo­lu­tion
  2. Genome rese­quenc­ing: 13 wild yaks and 59 domes­tic yaks

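Windows installation and configuration

If you have a computer repair shop reinstall the system, be sure to pick a good one. Reinstallations may look alike, but each shop uses its own installation image and configures some details differently. A system reinstalled by a poor shop can make your own later configuration very difficult.

  1. Reinstall the operating system
    Windows 7
  2. Reinstall hardware drivers
    1. Graphics card driver
  3. Reinstall software
    1. DirectX
  4. Runtime libraries
  5. Programming language compilers and tools
    1. Java
    2. MinGW
    3. Strawberry Perl
  6. Utilities
    1. DAEMON Tools Lite
    2. PC Hunter
    3. Xming + PuTTY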

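NCBI usage notes and tips

  1. About the location operators join and complement:
    join: the listed ranges are placed end to end to form one contiguous sequence (for example, the exons of a spliced gene);
    complement: the feature lies on the strand complementary to the one shown in the record, and is read 5'->3' along that complementary strand;
    example:
    join:
    As of 2016-04-05, I no longer seem to see join in the records I look at.
    complement:
    All the sequences I have seen under the gene feature are given as complement.

    gene complement(2872..3195)
    /gene=" lacZ' "
    Sequence:NC_000913.3 (363231..366305, complement)

  2. To search for a target sequence in a specified genome: open the genome, then enter the target sequence and start the search.
  3. For proteins, NCBI offers a view of their conserved domains (CD), called “Identify Conserved Domains”.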
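The join and complement operators from item 1 can be resolved against a sequence roughly as follows. This is a minimal sketch that assumes the usual GenBank feature-table meaning (join concatenates ranges; complement takes the reverse complement of the strand shown in the record) and handles only plain start..end ranges with 1-based inclusive coordinates. The function names are hypothetical, and partial-location markers and remote references are not supported.

```python
import re

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse, so the result reads 5'->3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def _split_top_level(text: str):
    """Split on commas that are not nested inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def extract(location: str, sequence: str) -> str:
    """Resolve a simple GenBank-style location string against `sequence`."""
    location = location.strip()
    if location.startswith("complement(") and location.endswith(")"):
        inner = location[len("complement("):-1]
        return reverse_complement(extract(inner, sequence))
    if location.startswith("join(") and location.endswith(")"):
        inner = location[len("join("):-1]
        return "".join(extract(part, sequence) for part in _split_top_level(inner))
    match = re.fullmatch(r"(\d+)\.\.(\d+)", location)
    if not match:
        raise ValueError(f"unsupported location: {location!r}")
    start, end = int(match.group(1)), int(match.group(2))
    return sequence[start - 1:end]


record = "AACCGGTT"  # made-up sequence for illustration
print(extract("2..4", record))              # ACC
print(extract("complement(2..4)", record))  # GGT
print(extract("join(1..2,7..8)", record))   # AATT
```

Applied to a real record, extract("complement(2872..3195)", sequence) would return the lacZ' feature from the example above as it reads on its own strand.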

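Windows 10 troubleshooting and optimization tips

  1. Blue screen at boot showing “BAD_SYSTEM_CONFIG_INFO” (Windows 10):
    1. Choose “Advanced options - Troubleshoot - Command Prompt”
    2. Run the command: “bcdedit /deletevalue {default} truncatememory”
  2. Blue screen while running, showing:
    1. video memory management internal: install a suitable graphics driver
    2. VIDEO TDR FAILURE (nvlddmkm.sys)
    3. video scheduler internal error
  3. When a Blue Screen of Death occurs, the system automatically records a DMP file in (Windows 10): c:/Windows/Minidump/
  4. Internet Explorer cannot open web pages (Windows 7)
    1. Choose “Internet Options - Advanced”
    2. Under “Reset Internet Explorer settings”, click Reset
  5. DLL
    1. Install: Visual C++ Redistributable Packages for Visual Studio 2013, msvcp120.dll, Microsoft Visual C++ 2010 Redistributable Package (x86)
  6. An installed program cannot be chosen as the default program for opening a file
    1. Open the Registry Editor (cmd -> regedit);
    2. Go to: HKEY_CURRENT_USER\Software\Classes\Applications
    3. Check whether the program is listed, and add it if it is not; check whether its path is correct, and fix the path if it is not.
  7. “System and compressed memory” uses too much memory
    0. cmd -> regedit;
    1. HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Control\Session Manager\Memory Management\PrefetchParameters
    2. Set the values of “EnablePrefetcher” and “EnableSuperfetch” to 2;
    3. Restart;
    Notes:
    Value 2: enable boot prefetching
    Value 3: enable application launch and boot prefetching
  8. Windows 10: turn off Cortana
    1. Click the Start button;
    2. Click Settings -> Privacy -> Speech, inking, & typing;
    3. Click “Stop getting to know me”;
  9. Windows 10: turn off Windows Search
    1. Control Panel -> Administrative Tools -> Services;
    2. Disable Windows Search.
  10. Windows 10: disable automatic updates
    1. gpedit.msc
    2. Computer Configuration - Administrative Templates - Windows Components - Windows Update
    3. “Configure Automatic Updates”, “2 = Notify before downloading and installing any updates”
  11. iTunes: error 7 (Windows error 193)
    After upgrading from Windows 7 to Windows 10, iTunes may become impossible to uninstall while the new iTunes cannot be installed. The fix is a tool developed by Microsoft for this purpose: MicrosoftProgram_Install_and_Uninstall.meta.diagcab.
    Run it to uninstall Apple Software Update, after which iTunes installs without problems.
    Two Microsoft articles are relevant: “How to fix problems with the Windows Installer service” and “Fix problems that block programs from being installed or removed”.
  12. When connecting to a dial-up network that requires authentication, the browser opens automatically and goes to this address: http://go.microsoft.com/fwlink/?LinkID=219472&clcid=0x409
    1. sfc /scannow (Windows+X, Command Prompt (Admin))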
    What is the Sfc Command?
    The sfc command is a Command Prompt command that can be used to verify and replace important Windows system files.
    System File Checker is a very useful tool to use when you suspect issues with protected Windows files like many DLL files.
    Sfc /scannow will inspect all of the important Windows files on your computer, including Windows DLL files. If System File Checker finds an issue with any of these protected files, it will replace it.
    2. run:

    findstr /c:"[SR] Cannot repair member file" %windir%\logs\cbs\cbs.log > "%userprofile%"\Desktop\sfcdetails.txt

    3. correct the error according to the information given in the file "Desktop\sfcdetails.txt";

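  13. Block an application from accessing the network through the firewall:
    1. Win+X;
    2. Control Panel, small icons, Windows Firewall;
    3. Advanced settings, Inbound Rules, New Rule (an inbound rule only blocks incoming connections; to stop the program from connecting out, create the same rule under Outbound Rules);
    4. Program, This program path, Block the connection.
    caveat: Windows Firewall must be turned on for both public and private networks. To do so, after step 2 click “Turn Windows Firewall on or off” and enable the firewall for both network types.
  14. Collapse the small icons in the lower-right corner
    1. Click the icon in the lower-left corner, click Settings, click System
    2. Click Notifications & actions
    3. Select which icons appear on the taskbar
    4. Turn off “Always show all icons in the notification area”
  15. Remove unwanted entries from the right-click menu:
    1. Win+R
    2. regedit, Enter;
    3. Edit, then search for the entry to delete
    4. (If the entry belongs to software that has already been uninstalled) delete every result the search finds. (They are more likely to be under HKEY_LOCAL_MACHINE.) Additional information:
    reg_sz:A null-terminated string. This will be either a Unicode or an ANSI string, depending on whether you use the Unicode or ANSI functions.
  16. A chm file is blank when opened
    Right-click -> Properties -> check: Unblock
  17. Enable the Administrator account:
    1. cmd: gpedit.msc
    2. Windows Settings -> Security Settings -> Local Policies -> Security Options, find “Accounts: Administrator account status”
    3. Right-click, Properties, change to Enabled.
  18. Prevent the computer from sleeping and hibernating:
    Preventing sleep is fairly easy, but the option that prevents hibernation is buried deep. As a result, after a long idle period (3 hours on my machine) the computer hibernates, which causes a lot of trouble for remote control. The steps to change this setting are:
    1. Win+I to open Settings
    2. System -> Power & sleep -> Additional power settings -> Change when the computer sleeps -> Change advanced power settings -> Sleep -> Hibernate after
    3. In the box after “Plugged in”, choose: Never.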

Python tips

  1. Install pip:
    1. Download: https://bootstrap.pypa.io/get-pip.py
    2. Install: python get-pip.py
  2. Use pip on Windows:
    python -m pip
  3. error: Microsoft Visual C++ 9.0 required (Unable to find vcvarsall.bat).
    Fix: download VCForPython27.msi.
    URL: http://www.microsoft.com/en-us/download/confirmation.aspx?id=44266

Key points of microarray analysis

  1. Biological replicates
    Five or more biological replicates are usually robust for microarray studies.
  2. qPCR validation
    Microarrays may give many false positives, so it is usually necessary to validate the differential expression observed in some of the key genes.

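KEGG usage notes

  1. The number of pathways under bta keeps growing; mixing pathway data fetched in the past with data fetched now will cause errors;
  2. When batch-downloading images generated by KEGG Mapper, network problems may leave downloads incomplete, so always verify carefully that the number of files matches and that every image is complete.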
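The check in item 2 can be automated. The sketch below assumes the downloaded images are PNG files saved as <pathway id>.png in one directory; both the naming scheme and the function names are assumptions of this example. It relies on the fact that a complete PNG file starts with the 8-byte PNG signature and ends with an empty IEND chunk, so a download cut off midway will fail the test.

```python
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# A complete PNG ends with an empty IEND chunk: length (0), type, CRC.
PNG_TRAILER = b"\x00\x00\x00\x00IEND\xaeB`\x82"


def is_complete_png(data: bytes) -> bool:
    """True if the bytes start with the PNG signature and end with IEND."""
    return data.startswith(PNG_SIGNATURE) and data.endswith(PNG_TRAILER)


def check_downloads(directory: str, expected_ids):
    """Return (missing ids, ids whose PNG looks truncated) for files named <id>.png."""
    folder = Path(directory)
    missing, truncated = [], []
    for pathway_id in expected_ids:
        path = folder / f"{pathway_id}.png"
        if not path.is_file():
            missing.append(pathway_id)
        elif not is_complete_png(path.read_bytes()):
            truncated.append(pathway_id)
    return missing, truncated
```

Calling check_downloads("maps", ids) with the list of pathway IDs that were requested returns the IDs to download again. Recording that ID list together with the fetch date also helps with item 1, since it makes clear which snapshot of bta a set of images belongs to.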

KEGG ORTHOLOGY (KO) Database

In KEGG, molecular-level functions are stored in the KO (KEGG Orthology) database and associated with ortholog groups in order to enable extension of experimental evidence in a specific organism to other organisms. Genome annotation in KEGG is ortholog annotation, assigning KO identifiers (K numbers) to individual genes in the GENES database. No updates are made to original data, such as gene names and descriptions given by RefSeq or GenBank, even if they are inconsistent with the KO assignment.

Major efforts have been initiated to associate each KO entry with experimental evidence of functionally characterized sequence data, now shown in the SEQUENCE subfield of the REFERENCE field. Furthermore, the genome-based collection of KEGG GENES (http://www.genome.jp/kegg/genes.html) has been expanded to allow individual protein data to be included in the addendum category. Eventually the KO database will cover all knowledge on functionally characterized protein sequences (see also KEGG Enzyme, http://www.genome.jp/kegg/annotation/enzyme.html).

In general, KO grouping of functional orthologs is defined in the context of KEGG molecular networks (KEGG pathway maps, BRITE hierarchies and KEGG modules), which are in fact represented as networks of nodes identified by K numbers. The relationships between KOs and corresponding molecular networks are represented in the following KO system.

[Figure: KEGG Orthology (KO) system]

The fact that functional information is associated with ortholog groups is a unique aspect of the KEGG resource. The sequence-similarity-based inference, as a generalization of a limited amount of experimental evidence, is predefined in KEGG. As implemented in BlastKOALA and other tools, the sequence similarity search against KEGG GENES is a search for the most appropriate K numbers. Once K numbers are assigned to genes in the genome, the KEGG pathway maps, BRITE hierarchies, and KEGG modules are automatically reconstructed, enabling biological interpretation of high-level functions.

DAVID/DAVID-WS tips

DAVID-WS (web service) has been developed to automate user tasks by providing stateful web services to access DAVID programmatically without the need for human interactions. [1]

DAVID-WS is made stateful by keeping the state-related input of a user operation in a session context that can be accessed by subsequent user operations within the same session. Users can add lists, change background populations, select species and categories and reset functional parameters for data analysis, as well as query all tools within the same session and format output as desired. [1]

[1] Jiao, X., Sher­man, B.T., Huang da, W., Stephens, R., Basel­er, M.W., Lane, H.C., and Lem­picki, R.A. (2012). DAVID-WS: a state­ful web ser­vice to facil­i­tate gene/protein list analy­sis. Bioin­for­mat­ics 28, 1805–1806.

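Perl notes and tips

  1. Location information
    1. Where the (child) script itself is located: /home/wangyu/
    use File::Spec;
    my $path_curf = File::Spec->rel2abs(__FILE__);
    my ($vol, $dirs, $file) = File::Spec->splitpath($path_curf);
    2. Where the (main) script was invoked from: /home/wangyu/code
    $ENV{'PWD'}
    3. Where the program has currently changed directory (chdir) to: /lustre/Work
    `pwd`
    Explanation:
    1. I call b.pl from a.pl; the main script is a.pl and the child script is b.pl;
    2. a.pl is in /home/wangyu/code/perl, b.pl is in /home/wangyu;
    3. After using chdir to switch to /lustre/Work, a.pl calls b.pl, and inside b.pl the three methods above are used to determine the paths.
  2. perl -d: turn on the debugger
  3. On Windows, specifying a path in HTML: “file:///path_to_the_file”;
  4. Before calling split on data that has been read in, remember to process it with chomp;
    otherwise the trailing newline of the input ends up in the last field.
    A practical example of the effect: 1. if a variable $var contains a newline and I put it in system "gzip -d -c $var > filename", everything after $var in that command has no effect, because the newline at $var acts as if Enter had already been pressed.
  5. Installation:
    perl -MCPAN -e shell
    install SOAP::Lite
  6. Your Perl is configured to link against libgdbm, but libgdbm.so was not found.: aptitude install libgdbm-dev
  7. Please tell me where I can find your apache src:
  8. Function Round: int($number+0.5) (valid for non-negative numbers only, because int truncates toward zero; for negative numbers use int($number-0.5))
  9. Unquoted string “..” may clash with future reserved word
    I got this warning because my filehandle name was lowercase and warnings were turned on. It is better to use uppercase filehandle names, as the Perl developers recommend.
  10. $$: the process ID of the script;
  11. Perl one-liner: modify file contents
     perl -p -i -e 's/from/to/' *.file

    -p: print each line (-n: do not print each line)
    -i: edit in place; the option takes the backup file suffix, and if -i is given without a suffix the original file is overwritten (-i.bak)
    -e: the Perl code to run; statements are separated by semicolons, so several can be given. Counter variables can be used.
    *.file: the files to modify

  12. Backing up and reinstalling installed modules
    # Information on all installed modules is stored in:
    # /home/nott/.cpan/Bundle/Snapshot_2017_03_10_00.pm
    perl -MCPAN -e autobundle

    # Reinstall
    perl -MCPAN -e 'install Bundle::Snapshot_2017_03_10_00'
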