我想分享我的使用 Biopython 清理序列的脚本。您应该知道,分析不良数据会占用 CPU 时间,而解释不良数据的結果会占用人员时间,因此进行预处理始终很重要。
我将我的脚本称为“Sequence_cleaner”,其主要思路是删除重复序列、删除过短的序列(用户定义最小长度)以及删除包含过多未知核苷酸(N)的序列(用户定义允许的 N 百分比),最后用户可以选择是否将结果输出到文件或打印结果。
import sys
from Bio import SeqIO
def sequence_cleaner(fasta_file, min_length=0, por_n=100):
# Create our hash table to add the sequences
sequences = {}
# Using the Biopython fasta parse we can read our fasta input
for seq_record in SeqIO.parse(fasta_file, "fasta"):
# Take the current sequence
sequence = str(seq_record.seq).upper()
# Check if the current sequence is according to the user parameters
if (
len(sequence) >= min_length
and (float(sequence.count("N")) / float(len(sequence))) * 100 <= por_n
):
# If the sequence passed in the test "is it clean?" and it isn't in the
# hash table, the sequence and its id are going to be in the hash
if sequence not in sequences:
sequences[sequence] = seq_record.id
# If it is already in the hash table, we're just gonna concatenate the ID
# of the current sequence to another one that is already in the hash table
else:
sequences[sequence] += "_" + seq_record.id
# Write the clean sequences
# Create a file in the same directory where you ran this script
with open("clear_" + fasta_file, "w+") as output_file:
# Just read the hash table and write on the file as a fasta format
for sequence in sequences:
output_file.write(">" + sequences[sequence] + "\n" + sequence + "\n")
print("CLEAN!!!\nPlease check clear_" + fasta_file)
userParameters = sys.argv[1:]
try:
if len(userParameters) == 1:
sequence_cleaner(userParameters[0])
elif len(userParameters) == 2:
sequence_cleaner(userParameters[0], float(userParameters[1]))
elif len(userParameters) == 3:
sequence_cleaner(
userParameters[0], float(userParameters[1]), float(userParameters[2])
)
else:
print("There is a problem!")
except:
print("There is a problem!")
使用命令行,您应该运行
python sequence_cleaner.py input[1] min_length[2] min[3]
有 3 个基本参数
例如
python sequence_cleaner.py Aip_coral.fasta 10 10
注意:如果您不关心第 2 和第 3 个参数,您将只删除重复的序列
发送电子邮件至 [email protected]