Friday, May 20, 2016

[Bioinfo] GATK ERROR: attempting to calculate the mismatch count against a reference string that is smaller than the read

[Bioinfo] GATK ERROR: attempting to calculate the mismatch count against a reference string that is smaller than the read

Before variant calling process, the local realignment at indel regions is known to reduce false positives of variant calls (ref1). There are a few types of errors.

ERROR MESSAGE: attempting to calculate the mismatch count against a reference string that is smaller than the read

1. Diagnose bam file using picard ValidateSamFile:

java -jar picard.jar ValidateSamFile \
     I=input.bam \
     MODE=SUMMARY

Output in the end

## HISTOGRAM    java.lang.String
Error Type      Count
ERROR:CIGAR_MAPS_OFF_REFERENCE  2      <- The reads causing problems
ERROR:MATE_NOT_FOUND    290628

2. Clear bam file using picard CleanSam:

Cleans the provided SAM/BAM, soft-clipping beyond-end-of-reference alignments and setting MAPQ to 0 for unmapped reads

java -jar picard.jar CleanSam \
     I=input.bam \
     O=filtered.bam

The filtered.bam now is able to used as input of GATK realignment.

[Bioinfo] NCBI BLAST command line

[Bioinfo] NCBI BLAST command line

1. Download executables

ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST

2. Add the paths of executable and databse to .bash_profile in home directory

vi ~/.bash_profile

enter i to get into edit mode
copy and paste the text below

PATH=$PATH:$HOME/bin:/home/chlin/bin/ncbi-blast-2.2.29+/bin
BLASTDB=/home/chlin/blast/db

press esc then : then type wq to save and quit editing the file

3. (optional) Load the path setting immidiately

sourse .bash_profile

4. Make BLAST database

makeblastdb -in <input_fa> -dbtype <db_type> -out <db_name> -title <db_title>
Examples:
(for nucleotides)

makeblastdb -in hs_chr.fa -dbtype nucl -out hs_chr -title "Human Chromosome, Ref B37.1"

(for peptides)

makeblastdb -in refseq_protein.fa -dbtype prot -out refseq_protein -title "RefSeq Protein Database"

5. BLAST

blastn -query <query.fa> -db <db_name> -out <output_name>
Example:
blastn -query test.fa -db hs_chr -out hs_chr.out

options:
-evalue 1e-10 (E value threshold)
-num_threads 8 (Number of threads (CPUs) to use in blast search.)
-perc_identity 97 (% identity cutoff)
-num_descriptions 5 (hit # threshold)

6. Parse results

Perl module (http://www.bioperl.org/wiki/HOWTO:SearchIO)
Example:

use Bio::SearchIO;
my $in = new Bio::SearchIO( -format => 'blast', -file => "hs_chr.out" );
my $n=0;
while( my $result = $in->next_result ){
    my $query = $result->query_name;
    while ( my $hit = $result->next_hit ){
        while( my $hsp = $hit->next_hsp ){
            print "Query = ", $result->query_name, "\n",
            "Num_hits = ", $result->num_hits,"\n",
            "Hit = ", $hit->name,"\n",
            "Length = ", $hsp->length('total'), "\n",
            "Percent_id = ", $hsp->percent_identity, "\n",
            "Start = ", $hsp->start('hit'), "\n",
            "End = ", $hsp->end('hit'), "\n",
            "E-Value = ", $hsp->evalue,"\n\n";
        }
    }
}

Referces:

  1. NCBI Manual (http://www.ncbi.nlm.nih.gov/books/NBK1763/)

[Linux] Delete/Remove files before certain days with certain extension

[Linux] Delete/Remove files before certain days with certain extension

1. Find files under current directory (maximum 4 subdirectories) at least 1 day ago | select files with ‘.bam’ | select files with ‘variant’ | output file list to variant_bam_removed.txt

find ./ -maxdepth 4 -mtime +1 | grep '.bam' | grep 'variant' > variant_bam_removed.txt

2. Remove files using input of a file, variant_bam_removed.txt

xargs rm < variant_bam_removed.txt

3. Append current file list to a total list

cat variant_bam_removed.txt >> removed_list.txt