在 GitHub 上编辑此页面

表示 ACE 文件中存档的重叠群中的比对。

问题

有时,将作为基因组或 EST 序列组装的一部分而产生的重叠群表示为比对非常有用(例如,为了在混合样品的运行中搜索潜在的 SNP,或者能够以更易于查看的方式写入重叠群)。对于使用 ACE 文件格式的序列组装,我们可以使用 Biopython 的 ACE 处理将构成重叠群的读段添加到通用比对中。’

解决方案

让我们将 Biopython 测试框架中使用的 ACE 文件中的重叠群表示为示例:Tests/Ace/contig1.ace

from Bio.Sequencing import Ace
from Bio.Align import MultipleSeqAlignment


def cut_ends(read, start, end):
    """Replace residues on either end of a sequence with gaps.

    In this case we want to cut out the sections of each read which the assembler has
    decided are not good enough to include in the contig and replace them with gaps
    so the other information about positions in the read is maintained
    """
    return (start - 1) * "-" + read[start - 1 : end] + (len(read) - end) * "-"


def pad_read(read, start, conlength):
    """Pad out either end of a read so it fits into an alignment.

    The start argument is the position of the first base of the reads sequence in
    the contig it is part of. If the start value is negative (or 0 since ACE
    files count from 1, not 0) we need to take some sequence off the start
    otherwise each end is padded to the length of the consensus with gaps.
    """
    if start < 1:
        seq = read[-1 * start + 1 :]
    else:
        seq = (start - 1) * "-" + read
    seq = seq + (conlength - len(seq)) * "-"
    return seq


# We will use the Ace parser to read individual contigs from file. Be aware
# that using this iterator can miss WA, CT, RT and WR tags (which can be
# anywhere in the file, e.g. the end). Read the file specification here:
# http://bozeman.mbt.washington.edu/consed/distributions/README.14.0.txt
# If you need these tags you'll need to use  Ace.read() (and lots of RAM).

ace_gen = Ace.parse(open("contig1.ace", "r"))
contig = next(ace_gen)  # Looking only at first contig
align = MultipleSeqAlignment([])

# Now we have started our alignment we can add sequences to it,
# we will loop through contig's reads and get quality clipping from
# .reads[readnumber].qa and the position of each read in the contig
# .af[readnumber].padded_start and use the functions above to cut and
# pad the sequences before they are added

for readn in range(len(contig.reads)):
    clipst = contig.reads[readn].qa.qual_clipping_start
    clipe = contig.reads[readn].qa.qual_clipping_end
    start = contig.af[readn].padded_start
    seq = cut_ends(contig.reads[readn].rd.sequence, clipst, clipe)
    seq = pad_read(seq, start, len(contig.sequence))
    align.add_sequence("read%i" % (readn + 1), seq)

当您打印比对或其中的序列时

>>> print(align)
Alignment with 2 rows and 856 columns
--------------------------------------------...--- read1
------GGATTGCCCTagtaacGGCGAGTGAAGCGGCAACAGCT...--- read2

>>> for read in align:
...     print(read.seq[80:159])
...
tt*gtagagggaTGCTTCTGGGTAGCGACCGGTCTAAGTTCCTCGGAACAGGACGTCATAGAGGGTGAGAATCCCGTAT
TTTGTAGAGG*ATGCTTCTGGGTAGCGACCGGTCTAAGTTCCTCGGAACAGGACGTCATAGAGGGTGAGAATCCCGTAT

您现在也可以使用 AlignIO 将比对写入您想要的任何格式。

讨论

上面的注释中给出了详细信息,简而言之,使用 Ace.parse() 读取 ACE 重叠群,启动一个通用比对,然后将重叠群中的读段添加到新的比对中。