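GCE figure notes

Figure 1

A. As the read error rate rises, more and more reads become invalid, so the error peak appears and then grows taller.

B. As more reads become erroneous, the proportion of correct reads drops, so the main peak gets lower.

C. The proportion of correct reads drops because their number decreases; this lowers their coverage depth over the genome, so the main peak shifts left.

Figure 2

A. The depth at the k-mer peak is slightly lower than the sequencing depth of the data over the genome, so the peak shifts left.

B. The k-mer peak is somewhat taller; what this means in practice is still to be determined.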
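
The left shift noted for Figure 2 is consistent with the usual relation between base depth and k-mer depth: a read of length L yields L - K + 1 k-mers, so the k-mer depth is the base depth multiplied by (L - K + 1)/L. The sketch below is not taken from GCE itself. The function names, the read length of 100 and K = 17 are illustrative assumptions, and treating sequencing errors as independent per base is a simplification.

```python
def kmer_depth(base_depth, read_length, k):
    """Expected k-mer depth for a given base (sequencing) depth."""
    return base_depth * (read_length - k + 1) / read_length


def error_free_fraction(error_rate, k):
    """Probability that a k-mer contains no sequencing error,
    assuming independent per-base errors (a simplification)."""
    return (1.0 - error_rate) ** k


if __name__ == "__main__":
    # Hypothetical data set: 50x base depth, 100 bp reads, K = 17.
    depth = kmer_depth(base_depth=50, read_length=100, k=17)
    print("k-mer depth:", depth)
    # Higher error rates leave fewer error-free k-mers, so the main peak
    # sits at a lower depth (it moves left) while the error peak grows.
    for rate in (0.0, 0.005, 0.01, 0.02):
        print(rate, round(depth * error_free_fraction(rate, 17), 2))
```

Under these assumptions, raising the error rate lowers the effective depth of correct k-mers, which matches the behaviour described for Figure 1.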

 

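Lock-free, non-blocking algorithms: CAS (Compare-And-Swap)

Definition

CAS works as follows. A thread wants to modify a variable that other threads can also access, and the new value depends on the old one. Before writing, the thread checks whether the variable's current value still equals the value it read earlier; the check and the write happen as a single atomic step. If the values match, the variable is set to the new value. If they do not, the attempt fails: the thread refreshes its remembered value from the current one and tries again.

Any discussion of multithreading has to cover locking, and CAS is a form of optimistic locking. Optimistic locking means the data is never locked; instead, a check at write time decides whether the value may be modified.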
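
The retry loop described above can be sketched in Python. Python has no built-in CAS primitive, so the hypothetical `AtomicInt` class below emulates the atomic compare-and-swap step with an internal lock; that means this demo is not itself lock-free and only illustrates the read, compare, swap-or-retry logic. All names here are illustrative assumptions.

```python
import threading


class AtomicInt:
    """Hypothetical integer cell with an emulated compare-and-swap."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()  # stands in for hardware atomicity

    def load(self):
        with self._guard:
            return self._value

    def compare_and_swap(self, expected, new):
        """Set the value to `new` only if it still equals `expected`."""
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False


def increment(counter):
    """Optimistic update: read, compute, then CAS; retry on failure."""
    while True:
        seen = counter.load()
        if counter.compare_and_swap(seen, seen + 1):
            return
        # Another thread changed the value in between; loop and re-read.


def run(n_threads=4, n_increments=1000):
    counter = AtomicInt()

    def worker():
        for _ in range(n_increments):
            increment(counter)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.load()


if __name__ == "__main__":
    print(run())
```

No increment is lost even though no thread holds a lock across its read-modify-write sequence; a failed attempt simply re-reads and retries.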

Applications:

Jellyfish

GFF3 Format

This section describes the representation of a protein-coding gene in GFF3. To illustrate how a canonical gene is represented, consider Figure 1 (figure1.png). This indicates a gene named EDEN extending from position 1000 to position 9000. It encodes three alternatively-spliced transcripts named EDEN.1, EDEN.2 and EDEN.3, the last of which has two alternative translational start sites leading to the generation of two protein coding sequences.

There is also an identified transcriptional factor binding site located 50 bp upstream from the transcriptional start site of EDEN.1 and EDEN.2.

Here is how this gene should be described using GFF3:

 
 0  ##gff-version 3.2.1
 1  ##sequence-region ctg123 1 1497228
 2  ctg123 . gene            1000  9000  .  +  .  ID=gene00001;Name=EDEN
 3  ctg123 . TF_binding_site 1000  1012  .  +  .  ID=tfbs00001;Parent=gene00001
 4  ctg123 . mRNA            1050  9000  .  +  .  ID=mRNA00001;Parent=gene00001;Name=EDEN.1
 5  ctg123 . mRNA            1050  9000  .  +  .  ID=mRNA00002;Parent=gene00001;Name=EDEN.2
 6  ctg123 . mRNA            1300  9000  .  +  .  ID=mRNA00003;Parent=gene00001;Name=EDEN.3
 7  ctg123 . exon            1300  1500  .  +  .  ID=exon00001;Parent=mRNA00003
 8  ctg123 . exon            1050  1500  .  +  .  ID=exon00002;Parent=mRNA00001,mRNA00002
 9  ctg123 . exon            3000  3902  .  +  .  ID=exon00003;Parent=mRNA00001,mRNA00003
10  ctg123 . exon            5000  5500  .  +  .  ID=exon00004;Parent=mRNA00001,mRNA00002,mRNA00003
11  ctg123 . exon            7000  9000  .  +  .  ID=exon00005;Parent=mRNA00001,mRNA00002,mRNA00003
12  ctg123 . CDS             1201  1500  .  +  0  ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
13  ctg123 . CDS             3000  3902  .  +  0  ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
14  ctg123 . CDS             5000  5500  .  +  0  ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
15  ctg123 . CDS             7000  7600  .  +  0  ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
16  ctg123 . CDS             1201  1500  .  +  0  ID=cds00002;Parent=mRNA00002;Name=edenprotein.2
17  ctg123 . CDS             5000  5500  .  +  0  ID=cds00002;Parent=mRNA00002;Name=edenprotein.2
18  ctg123 . CDS             7000  7600  .  +  0  ID=cds00002;Parent=mRNA00002;Name=edenprotein.2
19  ctg123 . CDS             3301  3902  .  +  0  ID=cds00003;Parent=mRNA00003;Name=edenprotein.3
20  ctg123 . CDS             5000  5500  .  +  1  ID=cds00003;Parent=mRNA00003;Name=edenprotein.3
21  ctg123 . CDS             7000  7600  .  +  1  ID=cds00003;Parent=mRNA00003;Name=edenprotein.3
22  ctg123 . CDS             3391  3902  .  +  0  ID=cds00004;Parent=mRNA00003;Name=edenprotein.4
23  ctg123 . CDS             5000  5500  .  +  1  ID=cds00004;Parent=mRNA00003;Name=edenprotein.4
24  ctg123 . CDS             7000  7600  .  +  1  ID=cds00004;Parent=mRNA00003;Name=edenprotein.4

Lines beginning with ‘##’ are directives (sometimes called pragmas or meta-data) and provide meta-information about the document as a whole. Blank lines should be ignored by parsers and lines beginning with a single ‘#’ are used for human-readable comments and can be ignored by parsers. End-of-line comments (comments preceded by # at the end of and on the same line as a feature or directive line) are not allowed.

Line 0 gives the GFF version using the ##gff-version pragma. Line 1 indicates the boundaries of the region being annotated (a 1,497,228 bp region named “ctg123”) using the ##sequence-region pragma.

Line 2 defines the boundaries of the gene. Column 9 of this line assigns the gene an ID of gene00001, and a human-readable name of EDEN. Because the gene is not part of a larger feature, it has no Parent.

Line 3 annotates the transcriptional factor binding site. Since it is logically part of the gene, its Parent attribute is gene00001.

Lines 4–6 define this gene’s three spliced transcripts, one line for the full extent of each of the mRNAs. These features are necessary to act as parents for the four CDSs which derive from them, as well as the structural parents of the five exons in the alternative splicing set.

Lines 7–11 identify the five exons. The Parent attributes indicate which mRNAs the exons belong to. Notice that several of the exons share the same parents, using the comma symbol to indicate multiple parentage.

Lines 12–24 denote this gene’s four CDSs. Each CDS belongs to one of the mRNAs. cds00003 and cds00004, which correspond to alternative start codons, belong to the same mRNA.

Note that several of the features, including the gene, its mRNAs and the CDSs, all have Name attributes. These attributes assign those features a public name, but are not mandatory. The ID attributes are only mandatory for those features that have children (the gene and mRNAs), or for those that span multiple lines. The IDs do not have meaning outside the file in which they reside. Hence, a slightly simplified version of this file is possible; it is shown in the specification linked below.

Find more at this link: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
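
To make the ID/Parent mechanics concrete, here is a minimal Python sketch that splits column 9 into attributes and groups features under their parents, treating commas in Parent as multiple parentage. It is an illustration under assumptions, not a full GFF3 parser: it assumes tab-separated columns (as in real GFF3 files, unlike the space-aligned listing above), ignores URL-escaping of attribute values, and the helper names `parse_attributes`, `children_by_parent` and `make_line` are made up for this example.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Split a GFF3 attributes column into {tag: [values]}."""
    attributes = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue
        tag, _, value = pair.partition("=")
        attributes[tag] = value.split(",")  # commas separate multiple values
    return attributes


def children_by_parent(lines):
    """Map each parent ID to the IDs (or feature types) of its children."""
    children = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and ## directives
        columns = line.rstrip("\n").split("\t")
        attributes = parse_attributes(columns[8])
        # Features without an ID are labelled by their type (column 3).
        label = attributes.get("ID", [columns[2]])[0]
        for parent in attributes.get("Parent", []):
            children[parent].append(label)
    return children


def make_line(feature_type, start, end, phase, attributes):
    """Build one tab-separated feature line for the demo data."""
    return "\t".join(
        ["ctg123", ".", feature_type, str(start), str(end), ".", "+", phase, attributes]
    )


EXAMPLE = [
    "##gff-version 3.2.1",
    make_line("gene", 1000, 9000, ".", "ID=gene00001;Name=EDEN"),
    make_line("mRNA", 1050, 9000, ".", "ID=mRNA00001;Parent=gene00001;Name=EDEN.1"),
    make_line("mRNA", 1050, 9000, ".", "ID=mRNA00002;Parent=gene00001;Name=EDEN.2"),
    make_line("exon", 1050, 1500, ".", "ID=exon00002;Parent=mRNA00001,mRNA00002"),
]

if __name__ == "__main__":
    for parent, kids in children_by_parent(EXAMPLE).items():
        print(parent, "->", kids)
```

The exon with two parents shows up under both mRNAs, which is the multiple-parentage behaviour described for lines 7–11.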

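The correct answer

The heart of Go is logical reasoning. Whether an answer is correct does not depend on the textbook solution; we can work through the logic ourselves and judge whether the published line is sound. Take the following problem from “围棋大全”:

Starting position (the next move is move 25):

Textbook solution (left diagram below): it assumes that after we play move 25, White answers with move 26. In fact, White can instead play at the point where Black's move 27 lands, which gives the position in the right diagram below.



So I think the right play is for move 25 to descend directly.

Which line is better can be worked out precisely; I will add that later.

A new milestone: fought to the end, lost by more than 27.5 points

Today's game was not my best. But it did mark a major breakthrough: I did not give up.

In the past, after a mistake I would get very discouraged, take the move back, and start over. Today I did not. I got past that mental hurdle, took a fresh look at the position, and kept fighting. Persisting and thinking even while losing let me learn more from a single game, and taught me how to face a strong opponent.

Finally, a summary working backward from the result: during the fight it felt as though White did not care much about the corners and was happy to jump boldly toward the center. In reality, White valued the corners highly and was keen to score there.

Also, this app's scoring feature is buggy.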

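Perl syntax in detail

  1. list: A list is an ordered collection of scalars. A list is a temporary collection of scalars and is purely data, whereas an array is a variable that stores data.
  2. array: An array is a variable that contains a list.
  3. Formal and actual parameters of Perl subroutines: Perl traditionally has no formal parameters. Variables passed to a subroutine all arrive in the @_ array, whose elements are aliases of the caller's variables, so modifying elements of @_ directly changes the corresponding variables outside the subroutine. (Related material: https://stackoverflow.com/questions/24063638/if-perl-is-call-by-reference-why-does-this-happen)
  4. A hash-related difference between perl v5.22.2 and v5.10.1: if the hash is modified partway through an each loop, 5.22.2 reports ‘Use of each() on hash after insertion without resetting hash iterator results in undefined behavior’. Do not modify a hash while iterating over it.

Sorting

  1. Sorting an array
    Sorting a one-dimensional array sorts its values and returns those values in order.
  2. Sorting a hash
    A hash can be sorted by key, which returns the keys, or by value, which also returns the keys.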
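
The two sorting cases can be sketched as follows. This is a Python analogue of the Perl idioms, not Perl code; a Python dict stands in for a Perl hash, and the sample data and function names are made up for illustration. The point it shows is the one above: sorting a hash by value still hands back keys.

```python
def sorted_values(values):
    """Sorting an array returns its values in order."""
    return sorted(values)


def keys_sorted_by_key(table):
    """Sorting a hash by key returns the keys."""
    return sorted(table)


def keys_sorted_by_value(table):
    """Sorting a hash by value still returns the keys,
    ordered by the value each key maps to."""
    return sorted(table, key=table.get)


if __name__ == "__main__":
    depth = {"chr2": 31, "chr1": 48, "chr3": 12}  # hypothetical data
    print(sorted_values([3, 1, 2]))
    print(keys_sorted_by_key(depth))
    print(keys_sorted_by_value(depth))
```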

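The difference between a dog and a stone

  1. Specific organization
    A dog's body contains cells and organic compounds; a stone contains neither.
  2. Metabolism
    A dog must actively take in energy from its surroundings to power the chemical reactions inside its body (that is, its metabolism); a stone can exist without actively taking in energy.
  3. Homeostasis and irritability
    A dog's metabolism can proceed only under certain physical and chemical conditions (temperature, pH, and so on). The dog has many regulatory mechanisms that keep these conditions relatively stable, which is called homeostasis; its ability to respond when the environment changes is called irritability. Chemical reactions can occur in a stone, but a stone persists best when no reaction changes its state, and when the environment changes a stone makes no active response, so it has no irritability.
  4. Reproduction and heredity
    A dog actively produces offspring, which is reproduction; the offspring have traits similar to their parents', which is heredity. A stone does not actively reproduce, so it has no heredity.
  5. Growth and development
    A dog grows: its cells get larger and more numerous. A dog also develops, for example through the formation of its tissues and organs, sexual maturation, and aging. A stone may get bigger, but it does not develop.
  6. Evolution and adaptation
    Dogs keep reproducing and therefore keep evolving, because producing offspring inevitably introduces mutations, from simple SNVs to SVs. These mutations give rise to new traits. Across the population, individuals whose new traits are poorly suited to the environment are eliminated and well-suited individuals survive, so the population evolves toward a better fit with that environment. A stone does not reproduce, so it does not evolve; wherever it is, it never adjusts itself to its surroundings, so it does not adapt.
  7. Movement
    A dog can and does move on its own; a stone does not.
  8. Summary
    In short, there are enormous differences between a dog and a stone, and the same differences hold broadly between living things and non-living matter.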

Perl DBI install (local)

  1. Download the packages:
    DBI, DBI::DBD, DBD::mysql
  2. Install DBI.
  3. Generate the Makefile
    DBI's Makefile.PL uses the Config module. With a locally installed perl, edit Makefile.PL and adjust the paths it obtains through $Config{"name"}.
    Once that is done, the Makefile can be generated:
    perl Makefile.PL PREFIX= PERL_LIB=
  4. Edit the Makefile
  5. make test (optional)
  6. make install

MySQL common commands

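  1. Create a new user as root
    insert into
    mysql.user(host, user, password, select_priv, insert_priv, update_priv)
    VALUES ('localhost', 'wangyu', PASSWORD('xxxxxx'), 'Y', 'Y', 'Y');
    (This relies on the password column of mysql.user and the PASSWORD() function, which exist only in older MySQL releases; newer releases create accounts with CREATE USER.)
  2. Update table rows, and make privilege changes take effect
    UPDATE table SET columnA = 'Fred' WHERE columnB = 'Wilson';
    flush privileges;
    (flush privileges is needed only after editing the grant tables, such as mysql.user, directly; ordinary row updates do not need it.)
  3. Database level
    1. List existing databases
    show databases;
    2. Grant a user privileges on a database
    GRANT privileges ON databasename.tablename TO 'username'@'host';
    3. Import a database
    source filename.sql
    4. Show the space used by a given database (MB)
    SELECT (sum(DATA_LENGTH)+sum(INDEX_LENGTH))/1024/1024 FROM information_schema.TABLES where TABLE_SCHEMA='servicer';
    5. List all tables
    show tables;
  4. Table level
    1. Print a table's contents
    select * from table;
    2. Modify a table's contents
    UPDATE table_name SET column_name = new_value WHERE column_name = some_value;
    3. Print rows matching a pattern
    select field from table where field like 'pattern';
    %: wildcard, matches 0 to N characters;
    _: wildcard, matches exactly 1 character;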
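
The two LIKE wildcards can be tried out without a MySQL server. The sketch below uses Python's built-in sqlite3 module as a stand-in, on the assumption that `%` and `_` behave the same way in SQLite's LIKE for this simple case; the table name `people`, the sample names and the helper `names_like` are made up for illustration.

```python
import sqlite3


def names_like(pattern, names=("Fred", "Frank", "Fran", "Wilson")):
    """Return the names matching a LIKE pattern, in alphabetical order."""
    connection = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        connection.execute("CREATE TABLE people (name TEXT)")
        connection.executemany(
            "INSERT INTO people (name) VALUES (?)", [(n,) for n in names]
        )
        rows = connection.execute(
            "SELECT name FROM people WHERE name LIKE ? ORDER BY name",
            (pattern,),
        ).fetchall()
        return [row[0] for row in rows]
    finally:
        connection.close()


if __name__ == "__main__":
    print(names_like("Fr%"))   # % matches any run of characters
    print(names_like("Fr__"))  # each _ matches exactly one character
```

`Fr%` matches every name starting with "Fr", while `Fr__` matches only the four-letter ones.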

Arabidopsis thaliana

Arabidopsis thaliana is a small flowering plant of the mustard family, Brassicaceae (Cruciferae). It is distributed throughout the world and was first reported in the sixteenth century by Johannes Thal. It has been used for over fifty years to study plant mutations and for classical genetic analysis. It is now used as a model organism to study different aspects of plant biology.
A. thaliana is a diploid plant with 2n = 10 chromosomes. It became the first plant to have its genome fully sequenced because it has (1) a small genome of ~120 Mb with a simple structure and few repeated sequences, (2) a short generation time of six weeks from seed germination to seed set, and (3) a large seed yield. The sequencing was done by an international collaboration collectively termed the Arabidopsis Genome Initiative (AGI). Though of no economic importance itself, it is an invaluable resource for agriculturally important crops, particularly members of the same family, which includes canola, an important source of vegetable oil. EST/mRNA alignments to the genome are available for FTP download. They are in the Splign format.

Selective Sweep

Sweeps fall into three main categories.

  1. The “classic selective sweep” or “hard selective sweep” is expected to occur when beneficial mutations are rare, but once a beneficial mutation has occurred it increases in frequency rapidly, thereby drastically reducing genetic variation in the population.
  2. A so-called “soft sweep from standing genetic variation” occurs when a previously neutral mutation that was present in a population becomes beneficial because of an environmental change. Such a mutation may be present on several genomic backgrounds so that when it rapidly increases in frequency, it doesn’t erase all genetic variation in the population.
  3. Finally, a “multiple origin soft sweep” occurs when mutations are common (for example in a large population) so that the same or similar beneficial mutations occur on different genomic backgrounds such that no single genomic background can hitchhike to high frequency.

Genetic terminology

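  1. genetic hitchhiking (遗传搭车 in Chinese): an allele changes in frequency not because it is itself under selection but because it is linked to a nearby allele that is.
  2. singleton: SNPs shared by more than one individual indicate levels of relatedness, while SNPs found only within one individual, referred to as “singletons”, indicate uniqueness.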
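
The singleton definition can be sketched in a few lines of Python. The input format (a mapping from SNP ID to the individuals carrying the variant allele), the SNP IDs and the sample names are all hypothetical and chosen only for illustration.

```python
def find_singletons(carriers_by_snp):
    """SNPs whose variant allele is carried by exactly one individual."""
    return sorted(
        snp for snp, carriers in carriers_by_snp.items()
        if len(set(carriers)) == 1
    )


def find_shared(carriers_by_snp):
    """SNPs carried by more than one individual (informative for relatedness)."""
    return sorted(
        snp for snp, carriers in carriers_by_snp.items()
        if len(set(carriers)) > 1
    )


if __name__ == "__main__":
    carriers = {  # hypothetical SNP IDs and sample names
        "snp_a": ["sample1"],
        "snp_b": ["sample1", "sample2"],
        "snp_c": ["sample3"],
    }
    print("singletons:", find_singletons(carriers))
    print("shared:", find_shared(carriers))
```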

subtle ggplot2 techniques

  1. Histograms (geom_histogram) display the counts with bars;
    frequency polygons (geom_freqpoly) display the counts with lines.
  2. How to remove the extra baseline in ggplot2 density plots (https://groups.google.com/forum/#!topic/ggplot2/I3fXMH8foEs): replace ggplot(a,aes(x=V1))+geom_density() with ggplot(a,aes(x=V1))+geom_line(stat="density")

VNC (virtual network computing) server

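  1. RealVNC
  2. TigerVNC
    1. Install the TigerVNC server on the machine to be controlled remotely
    
    vncserver :3 # start a desktop session for a regular (non-root) user
    set a password
    
    2. Install the local build of TigerVNC: https://bintray.com/tigervnc/stable/tigervnc/1.8.0#files
    3. Install and run the TigerVNC client on the machine you are working from
    
    enter ip:3
    enter the password
  3. TeamViewer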