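Splitting a FASTA file with Perl

I was going to write this myself, but then I found a script online that already does the job and covers most of the features I need, so I took the lazy route.

Script source: http://kirill-kryukov.com/study/tools/fasta-splitter/files/fasta-splitter-0.2.6.zip
It is also simple to use; just read the help text in the script.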
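
As a rough usage sketch, based only on the options listed in the script's own help text: the commands below assume the script has been saved as fasta-splitter.pl in the current directory, and they use a made-up four-sequence input.fasta. Both file names and the sequences are assumptions for illustration, not part of the original post.

```shell
# Build a tiny hypothetical input with four sequences (names and bases are made up).
printf '>seq1\nACGTACGT\n>seq2\nGGCC\n>seq3\nTTTTAAAA\n>seq4\nAC\n' > input.fasta

# Assumed script location; adjust to wherever the file from the zip was unpacked.
SPLITTER=fasta-splitter.pl

# Only run the splitter if the script is actually present.
if [ -f "$SPLITTER" ]; then
    # Divide into 2 parts, balancing by number of sequences.
    perl "$SPLITTER" --n-parts 2 --measure count input.fasta

    # Divide into parts holding at most 10 bp of sequence each,
    # writing the pieces into a separate directory.
    perl "$SPLITTER" --part-size 10 --measure seq --out-dir parts input.fasta
fi
```

Going by the code below, each output file name is the input base name, then the part prefix (.part- unless --part-num-prefix is given), then the part number, then the original extension, so the first command should produce input.part-1.fasta and input.part-2.fasta. One of --n-parts or --part-size has to be given, otherwise the script stops with "Splitting method is not specified".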

#!/usr/bin/env perl
#
#  FASTA Splitter  -  a script for partitioning a FASTA file into pieces
#
#  Version 0.2.6 (August 1, 2017)
#
#  Copyright (c) 2012-2017 Kirill Kryukov
#
#  This software is provided 'as-is', without any express or implied
#  warranty. In no event will the authors be held liable for any damages
#  arising from the use of this software.
#
#  Permission is granted to anyone to use this software for any purpose,
#  including commercial applications, and to alter it and redistribute it
#  freely, subject to the following restrictions:
#
#  1. The origin of this software must not be misrepresented; you must not
#     claim that you wrote the original software. If you use this software
#     in a product, an acknowledgment in the product documentation would be
#     appreciated but is not required.
#  2. Altered source versions must be plainly marked as such, and must not be
#     misrepresented as being the original software.
#  3. This notice may not be removed or altered from any source distribution.
#

use strict;
use File::Basename qw(basename);
use File::Path qw(make_path);
use Getopt::Long qw(:config pass_through);

$| = 1;

my ($script_name,$script_version,$script_date,$script_years) = ('fasta-splitter','0.2.6','2017-08-01','2012-2017');
my $start_time = time;

my @files = ();
my ($opt_n_parts,$opt_part_size,$opt_part_num_prefix,$opt_measure,$opt_line_len,$opt_eol,$out_dir,$nopad,$ver,$help);
GetOptions('n-parts=i'         => \$opt_n_parts,
           'part-size=i'       => \$opt_part_size,
           'part-num-prefix=s' => \$opt_part_num_prefix,
           'measure=s'         => \$opt_measure,
           'line-length=i'     => \$opt_line_len,
           'eol=s'             => \$opt_eol,
           'out-dir=s'         => \$out_dir,
           'nopad'             => \$nopad,
           'version'           => \$ver,
           'help'              => \$help);
for (my $i=0; $i<scalar(@ARGV); $i++)
{
    if (substr($ARGV[$i],0,1) eq '-' and $i < scalar(@ARGV)-1)
    {
        if ($ARGV[$i] eq '-n-parts-total'     ) { $opt_n_parts   = int($ARGV[++$i]); $opt_measure = 'all'; }
        if ($ARGV[$i] eq '-n-parts-sequence'  ) { $opt_n_parts   = int($ARGV[++$i]); $opt_measure = 'seq'; }
        if ($ARGV[$i] eq '-part-total-size'   ) { $opt_part_size = int($ARGV[++$i]); $opt_measure = 'all'; }
        if ($ARGV[$i] eq '-part-sequence-size') { $opt_part_size = int($ARGV[++$i]); $opt_measure = 'seq'; }
        if ($ARGV[$i] eq '-line-length') { $opt_line_len = int($ARGV[++$i]); }
        if ($ARGV[$i] eq '-eol'        ) { $opt_eol      = int($ARGV[++$i]); }
    }
    else { push @files, $ARGV[$i]; }
}

my $ver_str = "$script_name, version $script_version, $script_date\nCopyright (c) $script_years Kirill Kryukov\n";
my $help_str = qq{Usage: ${script_name} [options] <file>...
Options:
    --n-parts <N>        - Divide into <N> parts
    --part-size <N>      - Divide into parts of size <N>
    --measure (all|seq|count) - Specify whether all data, sequence length, or
                           number of sequences is used for determining part
                           sizes ('all' by default).
    --line-length        - Set output sequence line length, 0 for single line
                           (default: 60).
    --eol (dos|mac|unix) - Choose end-of-line character ('unix' by default).
    --part-num-prefix T  - Put T before part number in file names (def.: .part-)
    --out-dir            - Specify output directory.
    --nopad              - Don't pad part numbers with 0.
    --version            - Show version.
    --help               - Show help.
};

print (($ver ? $ver_str : ''), ($help ? $help_str : ''));
if (!defined($opt_n_parts) and !defined($opt_part_size) and !defined($opt_measure) and !defined($opt_line_len) and !defined($opt_eol))
{
    if (!$help and !$ver) { print $ver_str, $help_str; } exit;
}
if (!defined($opt_n_parts) and !defined($opt_part_size)) { die "Splitting method is not specified\nUse -h for help\n"; }
if (!@files) { die "File for splitting is not specified\n"; }

if (defined($opt_n_parts) and $opt_n_parts <= 0) { die "Non-positive number of parts\n"; }
if (defined($opt_part_size) and $opt_part_size <= 0) { die "Non-positive part size\n"; }
if (defined($opt_measure) and $opt_measure ne 'all' and $opt_measure ne 'seq' and $opt_measure ne 'count') { die "Unknown value of --measure option\n"; }
if (defined($opt_eol) and $opt_eol ne 'dos' and $opt_eol ne 'mac' and $opt_eol ne 'unix') { die "Unknown value of --eol option\n"; }
if (defined($out_dir))
{
    $out_dir =~ s/[\/\\]+$//;
    if (!-e $out_dir) { make_path($out_dir); }
    if (!-e $out_dir || !-d $out_dir) { die "Can't create output directory \"$out_dir\"\n"; }
    $out_dir .= '/';
}

my $n_parts = defined($opt_n_parts) ? $opt_n_parts : 0;
my $part_size = defined($opt_part_size) ? $opt_part_size : 0;
my $line_len = (defined($opt_line_len) and $opt_line_len >= 0) ? $opt_line_len : 60;
my $eol = defined($opt_eol) ? (($opt_eol eq 'dos') ? "\x0D\x0A" : ($opt_eol eq 'mac') ? "\x0D" : "\x0A") : "\x0A";
my $eol_len = length($eol);
my $measure = defined($opt_measure) ? (($opt_measure eq 'count') ? 0 : ($opt_measure eq 'seq') ? 1 : 2) : 2;
my $part_num_prefix = defined($opt_part_num_prefix) ? $opt_part_num_prefix : '.part-';
my @part_start = ();
my ($base,$ext,$num_len,$total_size);
my ($OUT,$name,$data,$written_total,$written_this_part,$part_end,$part);

foreach my $infile (@files) { split_file($infile); }

my $elapsed_time = time - $start_time;
print "All done, $elapsed_time second", (($elapsed_time==1)?'':'s'), " elapsed\n";

sub split_file
{
    my ($infile) = @_;
    if (!-e $infile or !-f $infile) { print "Can't find file \"$infile\"\n"; return; }
    print $infile;

    ($base,$ext) = (basename($infile),'');
    if ($base =~ /^(.+?)(\.[^\.]+)$/) { ($base,$ext) = ($1,$2); }

    @part_start = ();
    my ($n_seq,$total_seq_len,$n_parts_found) = (0,0,0);

    if ($part_size)
    {
        ($n_seq,$total_seq_len,$total_size,$n_parts_found) = get_file_size_and_part_boundaries($infile);
        if (!$n_parts) { print ": $n_seq sequences, $total_seq_len bp"; }
        print ' => ', ($n_parts ? 'extracting' : 'dividing into'), ' ', $n_parts_found, ' part', ($n_parts_found > 1 ? 's' : ''),
              " of <= $part_size ", ($measure ? (($measure > 1) ? 'bytes' : 'bp') : 'sequences'), "\n";
        open(my $IN,'<',$infile) or die "Error: Can't open file \"$infile\"\n";
        binmode $IN;
        $num_len = length($n_parts_found);
        $OUT = undef;
        my ($out_file,$part,$si,$buffer) = (undef,0,-1,'');
        while (<$IN>)
        {
            $_ =~ s/[\x0D\x0A]+$//;
            if (substr($_,0,1) eq '>')
            {
                if ($OUT)
                {
                    if ($line_len == 0) { if ($si >= 0) { print $OUT $eol; } }
                    elsif ($buffer ne '') { print $OUT $buffer, $eol; $buffer = ''; }
                }
                $si++;
                if ($si >= $part_start[$part+1])
                {
                    if ($OUT) { close $OUT; }
                    $part++;
                    if ($part > $n_parts_found) { last; }
                    $out_file = $out_dir . $base . $part_num_prefix . ($nopad ? $part : sprintf('%0*d',$num_len,$part)) . $ext;
                    open($OUT,'>',$out_file) or die "Can't create output file \"$out_file\"\n";
                    binmode $OUT;
                }
                print $OUT $_, $eol;
                next;
            }
            if ($line_len)
            {
                $buffer .= $_;
                while (length($buffer) >= $line_len) { print $OUT substr($buffer,0,$line_len,''), $eol; }
            }
            else { print $OUT $_; }
        }
        close $IN;
        if ($OUT)
        {
            if (!$line_len) { if ($si >= 0) { print $OUT $eol; } }
            elsif ($buffer ne '') { print $OUT $buffer, $eol; $buffer = ''; }
            close $OUT;
        }
    }
    else
    {
        ($n_seq,$total_seq_len,$total_size) = get_file_size($infile);
        print ": $n_seq sequences, $total_seq_len bp => dividing into $n_parts part", ($n_parts > 1 ? 's' : ''), " ";
        open(my $IN,'<',$infile) or die "Error: Can't open file \"$infile\"\n";
        binmode $IN;
        $num_len = length($n_parts);
        ($OUT,$name,$data,$written_total,$written_this_part,$part_end,$part) = (undef,undef,'',0,0,int($total_size / $n_parts),1);
        while(<$IN>)
        {
            $_ =~ s/[\x0D\x0A]+$//;
            if (substr($_,0,1) eq '>')
            {
                if (defined $name) { dump_seq(); }
                $name = $_; $data = ''; next;
            }
            $data .= $_;
        }
        if (defined $name) { dump_seq(); }
        close $IN;
        if ($OUT) { close $OUT; }
        print " OK\n";
    }
}

sub dump_seq
{
    my $slen = length($data);
    my $seq_size = seq_size(length($name),$slen);
    my $new_written_total = $written_total + $seq_size;
    if ( !$OUT or
         ($written_this_part and ($new_written_total > $part_end) and ($new_written_total - $part_end > $part_end - $written_total)) )
    {
        if ($OUT) { close $OUT; $part++; $part_end = int($total_size / $n_parts * $part) + 1; }

        my $part_file = $out_dir . $base . $part_num_prefix . ($nopad ? $part : sprintf('%0*d',$num_len,$part)) . $ext;

        open($OUT,'>',$part_file) or die "Error: Can't create file \"$part_file\"\n";
        binmode $OUT;
        $written_this_part = 0;
        print ".";
    }
    print $OUT $name, $eol;
    if ($line_len) { for (my $s=0; $s<$slen; $s+=$line_len) { print $OUT substr($data,$s,$line_len), $eol; } }
    else { print $OUT $data, $eol; }
    $written_this_part += $seq_size;
    $written_total += $seq_size;
}

sub get_file_size_and_part_boundaries
{
    my ($file) = @_;
    open(my $IN,'<',$file) or die "Error: Can't open file \"$file\"\n";
    binmode $IN;
    my ($nseq,$total_seq_length,$total_size,$n_parts_found,$this_part_size,$nlen,$slen,$stop) = (0,0,0,1,0,0,0,0);
    $part_start[1] = 0;
    while (<$IN>)
    {
        $_ =~ s/[\x0D\x0A]+$//;
        my $len = length($_);
        if (substr($_,0,1) eq '>')
        {
            if ($nlen)
            {
                my $seq_size = seq_size($nlen,$slen);
                if ($part_size and $this_part_size and ($this_part_size + $seq_size > $part_size))
                {
                    if ($n_parts and $n_parts_found == $n_parts) { $stop = 1; last; }
                    else { $this_part_size = $seq_size; $n_parts_found++; $part_start[$n_parts_found] = $nseq; }
                }
                else { $this_part_size += $seq_size; }
                $nseq++; $total_seq_length += $slen; $total_size += $seq_size;
            }
            ($nlen,$slen) = ($len,0); next;
        }
        if ($nlen) { $slen += $len; }
    }
    if ($nlen and !$stop)
    {
        my $seq_size = seq_size($nlen,$slen);
        if ($part_size and $this_part_size and ($this_part_size + $seq_size > $part_size))
        {
            if ($n_parts and $n_parts_found == $n_parts) { $stop = 1; }
            else { $this_part_size = $seq_size; $n_parts_found++; $part_start[$n_parts_found] = $nseq; }
        }
        if (!$stop) { $nseq++; $total_seq_length += $slen; $total_size += $seq_size; }
    }
    close $IN;
    $part_start[$n_parts_found+1] = $nseq;
    return ($nseq,$total_seq_length,$total_size,$n_parts_found);
}

sub get_file_size
{
    my ($file) = @_;
    open(my $IN,'<',$file) or die "Error: Can't open file \"$file\"\n";
    binmode $IN;
    my ($nseq,$total_seq_length,$total_size,$nlen,$slen) = (0,0,0,0,0);
    while (<$IN>)
    {
        $_ =~ s/[\x0D\x0A]+$//;
        my $len = length($_);
        if (substr($_,0,1) eq '>')
        {
            if ($nlen) { $nseq++; $total_seq_length += $slen; $total_size += seq_size($nlen,$slen); }
            ($nlen,$slen) = ($len,0); next;
        }
        if ($nlen) { $slen += $len; }
    }
    if ($nlen) { $nseq++; $total_seq_length += $slen; $total_size += seq_size($nlen,$slen); }
    close $IN;
    return ($nseq,$total_seq_length,$total_size);
}

sub seq_size
{
    my ($nlen,$slen) = @_;
    return ($measure == 0) ? 1 :
           ($measure == 1) ? $slen :
           $slen + $nlen + $eol_len*(1 + ($line_len ? int(($slen+$line_len-1)/$line_len) : 1));
}
