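Cookbook

Concatenate multiple NEXUS alignment files using the Bio.Nexus module.

Problem

It is common to infer species-level phylogenies from multiple genes or proteins. Demographic (and other) processes can cause individual gene trees to deviate from the species tree, so support for the same tree topology from several genes is considered stronger evidence than an inference from a single gene (although, of course, we still need to test that each gene tells the same story).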

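This is normally handled by aligning each gene separately and then building a single "supermatrix" from the individual gene alignments. That is, you create one alignment with one row per taxon, where each row holds the concatenated aligned gene sequences for that taxon. In NEXUS files (used by the phylogenetic software PAUP*, MrBayes and others), the genes can be represented explicitly as different "character partitions" or "sets" within a data matrix that holds one long sequence per taxon. This way you can build a supermatrix and still apply a different substitution model to each gene within it, or run PAUP*'s Partition Homogeneity Test to check for significant differences in the rate or topology of each gene tree.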
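To make the supermatrix idea concrete, here is a minimal plain-Python sketch. It does not use Bio.Nexus and is not how that module is implemented; the function name `build_supermatrix`, the gene names and the toy sequences are all made up for illustration. It assumes each gene is given as a dictionary of equal-length aligned sequences, and it pads taxa that are absent from a gene with a missing-data symbol.

```python
# Plain-Python sketch of a supermatrix; it does not use or mirror Bio.Nexus.
def build_supermatrix(genes, missing="?"):
    """Concatenate per-gene alignments into one row per taxon.

    genes: list of (gene_name, {taxon: aligned_sequence}) pairs.
    Returns (rows, partitions), where partitions maps each gene name
    to its 1-based inclusive (start, end) columns in the supermatrix.
    """
    taxa = sorted({taxon for _, alignment in genes for taxon in alignment})
    rows = {taxon: "" for taxon in taxa}
    partitions = {}
    offset = 0
    for gene_name, alignment in genes:
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene_name}: sequences differ in length")
        length = lengths.pop()
        for taxon in taxa:
            # Taxa absent from this gene are padded with missing data.
            rows[taxon] += alignment.get(taxon, missing * length)
        partitions[gene_name] = (offset + 1, offset + length)
        offset += length
    return rows, partitions


# Hypothetical toy data: two genes, the second with an extra taxon.
genes = [
    ("geneA", {"t1": "ACGT", "t2": "ACGA"}),
    ("geneB", {"t1": "GG", "t2": "GC", "t3": "G-"}),
]
rows, partitions = build_supermatrix(genes)
for taxon, sequence in rows.items():
    print(taxon, sequence)
print(partitions)
```

Each gene keeps its own column range, which is the information a NEXUS character set records.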

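The Bio.Nexus module makes concatenating multiple alignments into a supermatrix relatively straightforward.

Solution

Say we have NEXUS files for three genes, btCOI.nex, btCOII.nex and btITS.nex, containing the following alignments: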

#COI
bt1 GGGGGGGGGGGG
bt2 GGGGGGGGGGGG
bt3 GGGGGGGGGGGG
#COII
bt1 AAAAAAAAAAAA
bt2 AAAAAAAAAAAA
bt3 AAAAAAAAAAAA
#ITS
bt1 -TTTTTTT
bt2 -TTTTTTT
bt3 -TTTTTTT
bt4 -TTTTTTT
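If you want to follow along without real data, the sketch below writes three small files with the alignments listed above. The data block layout is an assumption modelled on the combined file shown in this recipe, not a complete account of the NEXUS format, and the helper names `nexus_text` and `write_example_files` are hypothetical. The files go into a temporary directory; pass "." instead to place them in the working directory.

```python
import tempfile
from pathlib import Path

# Alignments copied from the listing above, keyed by the recipe's file names.
ALIGNMENTS = {
    "btCOI.nex": {t: "G" * 12 for t in ("bt1", "bt2", "bt3")},
    "btCOII.nex": {t: "A" * 12 for t in ("bt1", "bt2", "bt3")},
    "btITS.nex": {t: "-" + "T" * 7 for t in ("bt1", "bt2", "bt3", "bt4")},
}


def nexus_text(alignment):
    """Return an assumed minimal NEXUS data block for {taxon: sequence}."""
    nchar = len(next(iter(alignment.values())))
    lines = [
        "#NEXUS",
        "begin data;",
        f"    dimensions ntax={len(alignment)} nchar={nchar};",
        "    format datatype=dna missing=? gap=-;",
        "matrix",
    ]
    lines += [f"{taxon} {seq}" for taxon, seq in alignment.items()]
    lines += [";", "end;"]
    return "\n".join(lines) + "\n"


def write_example_files(directory):
    """Write one file per example alignment and return the paths."""
    paths = []
    for name, alignment in ALIGNMENTS.items():
        path = Path(directory) / name
        path.write_text(nexus_text(alignment))
        paths.append(path)
    return paths


out_dir = tempfile.mkdtemp()
written = write_example_files(out_dir)
print(written[2].read_text())
```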

We can use the Nexus module to build a supermatrix:

from Bio.Nexus import Nexus

# the combine function takes a list of tuples [(name, nexus instance)...],
# if we provide the file names in a list we can use a list comprehension to
# create these tuples

file_list = ["btCOI.nex", "btCOII.nex", "btITS.nex"]
nexi = [(fname, Nexus.Nexus(fname)) for fname in file_list]

combined = Nexus.combine(nexi)
# write through a context manager so the output file is closed afterwards
with open("btCOMBINED.nex", "w") as handle:
    combined.write_nexus_data(filename=handle)

That was easy! Let's look at the combined file:

#NEXUS
begin data;
    dimensions ntax=4 nchar=32;
    format datatype=dna missing=? gap=-;
matrix
bt1 GGGGGGGGGGGG-TTTTTTTAAAAAAAAAAAA
bt2 GGGGGGGGGGGG-TTTTTTTAAAAAAAAAAAA
bt3 GGGGGGGGGGGG-TTTTTTTAAAAAAAAAAAA
bt4 ????????????-TTTTTTT????????????
;
end;

begin sets;
charset btITS.nex = 13-20;
charset btCOI.nex = 1-12;
charset btCOII.nex = 21-32;
charpartition combined = btCOI.nex: 1-12, btITS.nex: 13-20, btCOII.nex: 21-32;
end;
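A quick sanity check on a sets block like the one above is to confirm that the character sets tile the matrix with no gaps or overlaps. The sketch below is plain Python and only an illustration: the helper names `parse_charsets` and `covers_matrix` are hypothetical, and the regular expression assumes simple `start-end` ranges rather than the full NEXUS syntax.

```python
import re

# The charset lines from the combined file above.
SETS_BLOCK = """\
charset btITS.nex = 13-20;
charset btCOI.nex = 1-12;
charset btCOII.nex = 21-32;
"""


def parse_charsets(text):
    """Return {name: (start, end)} for simple 'charset name = a-b;' lines."""
    pattern = re.compile(r"charset\s+(\S+)\s*=\s*(\d+)-(\d+);")
    return {
        name: (int(start), int(end))
        for name, start, end in pattern.findall(text)
    }


def covers_matrix(charsets, nchar):
    """True if the ranges cover columns 1..nchar with no gap or overlap."""
    expected_start = 1
    for start, end in sorted(charsets.values()):
        if start != expected_start:
            return False
        expected_start = end + 1
    return expected_start == nchar + 1


charsets = parse_charsets(SETS_BLOCK)
print(charsets)
print(covers_matrix(charsets, nchar=32))
```

The ranges add up to the `nchar=32` declared in the data block: 12 + 8 + 12 columns.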

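Perhaps too easy. The matrices have been combined and the character sets and partitions are set up, but the ITS file has a taxon (bt4) that is not in the other files. In such cases the combine function adds the taxon to the other character partitions as missing data ('?'). Sometimes this is the result you want, but having a few taxa like this is also a very good way to make a Partition Homogeneity Test run for a week. Let's write a function that tests whether the same taxa are represented in a set of Nexus instances and, if they are not, raises a useful error message (that is, one telling you what to delete from your NEXUS files if you want them to combine cleanly).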

def check_taxa(matrices):
    """Verify Nexus instances have the same taxa information.

    Checks that nexus instances in a list [(name, instance)...] have
    the same taxa, provides useful error if not and returns None if
    everything matches
    """
    first_taxa = matrices[0][1].taxlabels
    for name, matrix in matrices[1:]:
        first_only = [t for t in first_taxa if t not in matrix.taxlabels]
        new_only = [t for t in matrix.taxlabels if t not in first_taxa]
        if first_only:
            missing = ", ".join(first_only)
            msg = "%s taxa %s not in matrix %s" % (matrices[0][0], missing, name)
            raise Nexus.NexusError(msg)
        elif new_only:
            missing = ", ".join(new_only)
            msg = "%s taxa %s not in all matrices" % (name, missing)
            raise Nexus.NexusError(msg)
    return None  # will only get here if it hasn't thrown an exception


def concat(file_list, same_taxa=True):
    """Combine multiple nexus data matrices in one partitioned file.

    By default this only works if the same taxa are present in each file;
    use same_taxa=False if you are not concerned by this.
    """
    nexi = [(fname, Nexus.Nexus(fname)) for fname in file_list]
    if same_taxa:
        check_taxa(nexi)  # raises NexusError if the taxa differ
    return Nexus.combine(nexi)

Now, using our new functions:

>>> files = ["btCOI.nex", "btCOII.nex", "btITS.nex"]
# If we combine them all we should get an error and the taxon/taxa that caused it
>>> concat(files)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 5, in concat
  File "<stdin>", line 16, in check_taxa
Bio.Nexus.Nexus.NexusError: btITS.nex taxa bt4 not in all matrices

# But if we use just the first two, which do have matching taxa, it should be fine
>>> concat(files[:2]).taxlabels
['bt1', 'bt2', 'bt3']

# OK, can we still munge them together if we want to?
>>> concat(files, same_taxa=False).taxlabels
['bt1', 'bt2', 'bt3', 'bt4']
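Note that check_taxa stops at the first mismatch it finds. If you would rather see every gap at once, a set-based report is one option. The sketch below is plain Python working on lists of taxon labels; the function name `missing_taxa_report` is hypothetical. With Nexus instances you could presumably build the input from each instance's `taxlabels` attribute, as used in the recipe above.

```python
def missing_taxa_report(taxa_by_file):
    """Map each file name to the taxa it lacks relative to all files.

    taxa_by_file: {file_name: list of taxon labels}.
    Files that contain every taxon are left out of the report.
    """
    all_taxa = set().union(*taxa_by_file.values())
    report = {}
    for name, taxa in taxa_by_file.items():
        absent = sorted(all_taxa - set(taxa))
        if absent:
            report[name] = absent
    return report


# Taxon labels of the three example files in this recipe.
taxa_by_file = {
    "btCOI.nex": ["bt1", "bt2", "bt3"],
    "btCOII.nex": ["bt1", "bt2", "bt3"],
    "btITS.nex": ["bt1", "bt2", "bt3", "bt4"],
}
report = missing_taxa_report(taxa_by_file)
for name, absent in report.items():
    print(f"{name} lacks: {', '.join(absent)}")
```

For the example files this shows that bt4 is absent from both btCOI.nex and btCOII.nex, so removing bt4 from btITS.nex is the smallest change that lets all three files combine under the default same_taxa=True.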

Discussion

Details of the Nexus class are provided in the API documentation.