Wiki Documentation

History of Bio.Alphabet and its replacement

Introduction

This page is intended to help people update existing code that uses Biopython to cope with the removal of the Bio.Alphabet module in Biopython 1.78 (September 2020).

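Objects from Bio.Alphabet had two main uses:

Recording the molecule type (DNA, RNA or protein) of a sequence,
Declaring the expected characters in a sequence, alignment, motif, etc.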

Motivation

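The intended use of the alphabet objects was never clearly defined, and the two-decade-old design had flaws. In particular, the AlphabetEncoder classes (used to add information about gaps or stop symbols) were overly complex and made even determining the molecule type difficult. Reconciling multiple alphabet objects (for example during string addition) was also complicated. And although you could give a sequence a strict alphabet, such as unambiguous IUPAC DNA, this did not enforce that only the letters A, C, G and T were used.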

Code changes

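Since there was no clear proposal for how to improve or replace the existing system, it was agreed to remove it. In general, you can simply delete any explicit use of Bio.Alphabet from your code.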

Seq changes

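Seq objects no longer have an .alphabet attribute, and Seq operations are no longer type checked (for example, adding a protein sequence to a DNA sequence is not rejected). Start by removing the alphabet argument: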

# Old style
from Bio.Alphabet import generic_dna
from Bio.Seq import Seq

my_dna = Seq("ACGTTT", generic_dna)

# New style
from Bio.Seq import Seq

my_dna = Seq("ACGTTT")

See below for cases where the alphabet was used to set the molecule type for output file formats.

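The alphabet was also used to declare the gap character, which defaults to - in the various Biopython sequence and alignment parsers. If you use a different character, you now need to name it explicitly when removing gaps, by passing it to the Seq object's .replace() method.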

# Old style
from Bio.Alphabet import generic_dna, Gapped
from Bio.Seq import Seq

my_dna = Seq("ACGT=TT", Gapped(generic_dna, "="))
print(my_dna.ungap())

# New style
from Bio.Seq import Seq

my_dna = Seq("ACGT=TT")
print(my_dna.replace("=", ""))
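
If several parts of a code base relied on Gapped alphabets, the replacement shown above could be wrapped in a small helper. The following is only a minimal sketch and not part of Biopython: the name strip_gaps and its gap parameter are hypothetical, and the helper converts its input with str(), so it accepts plain strings as well as Seq objects but always returns a plain string.

```python
def strip_gaps(sequence, gap="-"):
    """Return the sequence as a plain string with every gap character removed.

    Hypothetical helper (not a Biopython function). The default "-" matches
    the gap character used by the Biopython parsers.
    """
    # str() makes this work for both str and Seq-like inputs.
    return str(sequence).replace(gap, "")


print(strip_gaps("ACGT=TT", gap="="))  # ACGTTT
print(strip_gaps("AC-GT--TT"))         # ACGTTT
```

If a Seq object is needed afterwards, the result can be wrapped again with Seq(...).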

SeqRecord changes

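Some sequence file formats require the molecule type when writing a file. This used to be recorded with a Bio.Alphabet object as the .alphabet attribute of the Seq object. It is now recorded as a string in the SeqRecord object's annotations dictionary, under the molecule_type key.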

# Old style
from Bio.Alphabet import generic_dna
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO

seq = Seq("ATGCGTGCAT", generic_dna)
record = SeqRecord(seq, id="test")
SeqIO.write(record, "test_write.gb", "genbank")

# New style
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO

seq = Seq("ATGCGTGCAT")
record = SeqRecord(seq, id="test", annotations={"molecule_type": "DNA"})
SeqIO.write(record, "test_write.gb", "genbank")

# Compatible with both pre- and post Biopython 1.78:
try:
    from Bio.Alphabet import generic_dna
except ImportError:
    generic_dna = None
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO

if generic_dna:
    # Older Biopython: pass the alphabet (newer Biopython refuses a second argument)
    seq = Seq("ATGCGTGCAT", generic_dna)
else:
    seq = Seq("ATGCGTGCAT")
record = SeqRecord(seq, id="test", annotations={"molecule_type": "DNA"})
SeqIO.write(record, "test_write.gb", "genbank")

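The Bio.SeqIO parsing functions used to accept an optional alphabet argument, which set the alphabet when it could not be determined from the file format. This is no longer possible.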

# Old style
from Bio.Alphabet import generic_dna
from Bio import SeqIO

# This file has a single record only
record = SeqIO.read("Tests/Fasta/wisteria.nu", "fasta", generic_dna)
rec_start = record[:20]
SeqIO.write(rec_start, "start_only.xml", "seqxml")

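As the following example shows, you must now set the molecule type explicitly in the record's annotations before writing the record.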

# New style
from Bio import SeqIO

# This file has a single record only
record = SeqIO.read("Tests/Fasta/wisteria.nu", "fasta")
rec_start = record[:20]
rec_start.annotations["molecule_type"] = "DNA"
SeqIO.write(rec_start, "start_only.xml", "seqxml")
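
The example above handles a single record. For files with many records, the same annotation step could be applied lazily to each record as it is parsed. The sketch below is one possible approach, not a Biopython function: with_molecule_type is a hypothetical name, and it assumes only that each record has an annotations dictionary, as SeqRecord objects do. Stand-in objects are used so the sketch runs without an input file.

```python
from types import SimpleNamespace


def with_molecule_type(records, molecule_type="DNA"):
    """Yield each record after making sure it has a molecule_type annotation.

    Hypothetical helper: an existing molecule_type value is left untouched.
    """
    for record in records:
        record.annotations.setdefault("molecule_type", molecule_type)
        yield record


# Stand-in objects for illustration only; in real code the iterable would
# come from SeqIO.parse(...).
demo = [
    SimpleNamespace(id="a", annotations={}),
    SimpleNamespace(id="b", annotations={"molecule_type": "RNA"}),
]
for rec in with_molecule_type(demo):
    print(rec.id, rec.annotations["molecule_type"])
```

With Biopython, the generator could be passed straight to the writer, for example SeqIO.write(with_molecule_type(SeqIO.parse("example.fasta", "fasta")), "example.xml", "seqxml").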

Similarly, the optional alphabet argument of the Bio.SeqIO.convert function has been replaced by an optional molecule type argument.

# Old style
from Bio.Alphabet import generic_dna
from Bio import SeqIO

SeqIO.convert("example.fasta", "fasta", "example.xml", "seqxml", generic_dna)

# New style
from Bio import SeqIO

SeqIO.convert("example.fasta", "fasta", "example.xml", "seqxml", "DNA")

Here is one way to write a backward compatible version.

# Compatible with both pre- and post Biopython 1.78:
try:
    from Bio.Alphabet import generic_dna
except ImportError:
    generic_dna = "DNA"
from Bio import SeqIO

SeqIO.convert("example.fasta", "fasta", "example.xml", "seqxml", generic_dna)

Other changes

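Code that used an alphabet to specify the list of expected letters or symbols now generally expects the valid characters as a string (for example in Bio.motifs).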
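
Because valid characters are now plain strings, your own code can check a sequence against them with ordinary Python. The following is a minimal sketch under that assumption and is not a Biopython API: the function name invalid_letters and its valid_letters parameter are hypothetical.

```python
def invalid_letters(sequence, valid_letters="ACGT"):
    """Return a sorted list of characters in sequence not in valid_letters.

    Hypothetical helper (not a Biopython function). str() lets it accept
    plain strings as well as Seq objects.
    """
    return sorted(set(str(sequence)) - set(valid_letters))


print(invalid_letters("ACGTTT"))   # []
print(invalid_letters("ACGTNN-"))  # ['-', 'N']
```

An empty list means every character is allowed. The check is case sensitive, so lower-case input needs either upper-casing first or a valid_letters string that includes the lower-case letters.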