食谱

序列清理器

描述

我想分享我的使用 Biopython 清理序列的脚本。您应该知道，分析不良数据会占用 CPU 时间，而解释不良数据的結果会占用人员时间，因此进行预处理始终很重要。

我将我的脚本称为“Sequence_cleaner”，其主要思路是删除重复序列、删除过短的序列（用户定义最小长度）以及删除包含过多未知核苷酸（N）的序列（用户定义允许的 N 百分比），最后用户可以选择是否将结果输出到文件或打印结果。

脚本

import sys
from Bio import SeqIO


def sequence_cleaner(fasta_file, min_length=0, por_n=100):
    # Create our hash table to add the sequences
    sequences = {}

    # Using the Biopython fasta parse we can read our fasta input
    for seq_record in SeqIO.parse(fasta_file, "fasta"):
        # Take the current sequence
        sequence = str(seq_record.seq).upper()
        # Check if the current sequence is according to the user parameters
        if (
            len(sequence) >= min_length
            and (float(sequence.count("N")) / float(len(sequence))) * 100 <= por_n
        ):
            # If the sequence passed in the test "is it clean?" and it isn't in the
            # hash table, the sequence and its id are going to be in the hash
            if sequence not in sequences:
                sequences[sequence] = seq_record.id
            # If it is already in the hash table, we're just gonna concatenate the ID
            # of the current sequence to another one that is already in the hash table
            else:
                sequences[sequence] += "_" + seq_record.id

    # Write the clean sequences

    # Create a file in the same directory where you ran this script
    with open("clear_" + fasta_file, "w+") as output_file:
        # Just read the hash table and write on the file as a fasta format
        for sequence in sequences:
            output_file.write(">" + sequences[sequence] + "\n" + sequence + "\n")

    print("CLEAN!!!\nPlease check clear_" + fasta_file)


userParameters = sys.argv[1:]

try:
    if len(userParameters) == 1:
        sequence_cleaner(userParameters[0])
    elif len(userParameters) == 2:
        sequence_cleaner(userParameters[0], float(userParameters[1]))
    elif len(userParameters) == 3:
        sequence_cleaner(
            userParameters[0], float(userParameters[1]), float(userParameters[2])
        )
    else:
        print("There is a problem!")
except:
    print("There is a problem!")

使用命令行，您应该运行

python sequence_cleaner.py input[1] min_length[2] min[3]

有 3 个基本参数

[1]: 您的 fasta 文件
[2]: 用户定义的最小长度。默认值为 0，这意味着您不必关心最小长度
[3]: 用户定义的允许的 N 百分比。默认值为 100，所有包含 'N' 的序列都将出现在您的输出中，如果想要输出中不包含 'N' 的序列，则将值设置为 0

例如

python sequence_cleaner.py Aip_coral.fasta 10 10

注意：如果您不关心第 2 和第 3 个参数，您将只删除重复的序列

问题、建议或改进

发送电子邮件至 genivaldo.gueiros@gmail.com