Friday, April 7, 2017

Jupyter/IPython Notebook tips, tricks and shortcuts

Jupyter Notebook Shortcuts:
cmd + shift + p (Mac) or ctrl + shift + p (Windows/Linux) drop-down list of all commands and their shortcuts

Editing mode (green border and a small pencil icon in the upper-right corner)

ctrl + shift + - split the cell at the cursor position
shift + tab show the signature and docstring of the function under the cursor
[esc] enter command mode

Command mode (blue border)

m change the cell to Markdown
y change the cell to Code

dd delete the cell
z undo a cell deletion or cut (only the most recent one can be undone)
x cut the current cell
v paste the cell below


Written with StackEdit.

Monday, March 6, 2017

Machine Learning Materials

  1. For the beginner:
    Advice for applying Machine Learning (Andrew Ng)
    Related blog (Jan Hendrik Metzen): works through Advice for applying Machine Learning in Python using an IPython notebook.
  2. How to choose your model?
    Python sklearn
  3. Review of recent neural network models
    Convolutional Neural Net (Andrej Karpathy)
    The-9-Deep-Learning-Papers-You-Need-To-Know-About (Adit Deshpande)

It's time to learn TensorFlow!

Google released TensorFlow 1.0 on Feb. 15, 2017.
According to this post, TensorFlow is one of the top choices among neural network tools and the most used on GitHub.
That got me interested in learning TensorFlow.
Hopefully, I will keep posting updates on my learning progress here.

Friday, May 20, 2016

[Bioinfo] GATK ERROR: attempting to calculate the mismatch count against a reference string that is smaller than the read


Before variant calling, local realignment around indel regions is known to reduce false-positive variant calls (ref1). The GATK realignment step can fail with several types of errors; this post covers the following one.

ERROR MESSAGE: attempting to calculate the mismatch count against a reference string that is smaller than the read

1. Diagnose the BAM file with Picard ValidateSamFile:

java -jar picard.jar ValidateSamFile \
     I=input.bam \
     MODE=SUMMARY

Output at the end:

## HISTOGRAM    java.lang.String
Error Type      Count
ERROR:CIGAR_MAPS_OFF_REFERENCE  2      <- The reads causing problems

2. Clean the BAM file with Picard CleanSam:

CleanSam cleans the provided SAM/BAM file by soft-clipping alignments that extend beyond the end of the reference and setting MAPQ to 0 for unmapped reads.

java -jar picard.jar CleanSam \
     I=input.bam \
     O=filtered.bam

The resulting filtered.bam can now be used as input for GATK realignment.
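
To confirm that the cleaned file no longer contains the offending reads, ValidateSamFile can be run again and its summary checked. The helper below is only a sketch: the function name count_sam_error is made up for this post, and it assumes the summary was saved to a text file in the histogram layout shown in step 1 (error name in the first column, count in the second).

```shell
# Print the count reported for one error type in a saved
# ValidateSamFile summary (prints 0 if the error is not listed).
# Usage: count_sam_error SUMMARY_FILE ERROR_NAME
# NOTE: count_sam_error is a hypothetical helper, not part of Picard.
count_sam_error() {
    awk -v key="ERROR:$2" '$1 == key { n = $2 } END { print n + 0 }' "$1"
}
```

For example, if the summary for filtered.bam was saved as filtered.summary.txt (a hypothetical file name), count_sam_error filtered.summary.txt CIGAR_MAPS_OFF_REFERENCE should print 0 once CleanSam has soft-clipped the problem reads.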

[Bioinfo] NCBI BLAST command line


1. Download executables

2. Add the paths of the executables and the database to .bash_profile in your home directory

vi ~/.bash_profile

press i to enter insert mode
copy and paste the text below


press esc, then type :wq and press enter to save the file and quit
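
The exact lines to paste depend on where BLAST was unpacked. As an illustration only, assuming a hypothetical install directory $HOME/ncbi-blast and a hypothetical database directory $HOME/blastdb, the entries in .bash_profile might look like this (BLASTDB is the environment variable BLAST+ consults when looking for databases):

```shell
# Hypothetical locations: adjust both paths to your own setup.
BLAST_HOME="$HOME/ncbi-blast"          # assumed install directory
export PATH="$BLAST_HOME/bin:$PATH"    # lets the shell find blastn, makeblastdb, ...
export BLASTDB="$HOME/blastdb"         # directory BLAST searches for databases
```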

3. (Optional) Load the path settings immediately

source ~/.bash_profile

4. Make BLAST database

makeblastdb -in <input_fa> -dbtype <db_type> -out <db_name> -title <db_title>
(for nucleotides)

makeblastdb -in hs_chr.fa -dbtype nucl -out hs_chr -title "Human Chromosome, Ref B37.1"

(for peptides)

makeblastdb -in refseq_protein.fa -dbtype prot -out refseq_protein -title "RefSeq Protein Database"


5. Run the BLAST search

blastn -query <query.fa> -db <db_name> -out <output_name>
blastn -query test.fa -db hs_chr -out hs_chr.out

Optional parameters:
-evalue 1e-10 (E-value threshold)
-num_threads 8 (number of threads (CPUs) to use in the BLAST search)
-perc_identity 97 (percent identity cutoff)
-num_descriptions 5 (number of hits to list in the one-line descriptions)
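
These options are simply appended to the basic blastn call. As a sketch, the helper below prints the full command line so it can be checked before it is run. The function name blastn_cmd and the override variables EVALUE, THREADS, PERC_ID and NUM_DESC are made up for this post; the defaults are the values listed above.

```shell
# Print a blastn command line that includes the optional filters.
# Usage: blastn_cmd QUERY DB OUT
# NOTE: blastn_cmd and the EVALUE/THREADS/PERC_ID/NUM_DESC variables
# are hypothetical names; only the blastn flags themselves are real.
blastn_cmd() {
    printf 'blastn -query %s -db %s -out %s' "$1" "$2" "$3"
    printf ' -evalue %s -num_threads %s' "${EVALUE:-1e-10}" "${THREADS:-8}"
    printf ' -perc_identity %s -num_descriptions %s\n' "${PERC_ID:-97}" "${NUM_DESC:-5}"
}
```

Calling blastn_cmd test.fa hs_chr hs_chr.out prints the command for the example query and database used in this post; piping the output to sh runs it.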

6. Parse results

Perl module (

use strict;
use warnings;
use Bio::SearchIO;

# Parse the default (pairwise text) BLAST report written by blastn
my $in = Bio::SearchIO->new( -format => 'blast', -file => "hs_chr.out" );
while ( my $result = $in->next_result ) {        # one result per query
    while ( my $hit = $result->next_hit ) {      # one hit per database sequence
        while ( my $hsp = $hit->next_hsp ) {     # one HSP per local alignment
            print "Query = ", $result->query_name, "\n",
                "Num_hits = ", $result->num_hits, "\n",
                "Hit = ", $hit->name, "\n",
                "Length = ", $hsp->length('total'), "\n",
                "Percent_id = ", $hsp->percent_identity, "\n",
                "Start = ", $hsp->start('hit'), "\n",
                "End = ", $hsp->end('hit'), "\n",
                "E-Value = ", $hsp->evalue, "\n\n";
        }
    }
}


  1. NCBI Manual (

[Linux] Delete/Remove files before certain days with certain extension


1. Find files under the current directory (at most 4 directory levels deep) that were last modified more than 1 day ago | keep paths containing ‘.bam’ | keep paths containing ‘variant’ | write the file list to variant_bam_removed.txt

find ./ -maxdepth 4 -mtime +1 | grep '\.bam' | grep 'variant' > variant_bam_removed.txt

2. Remove the files listed in variant_bam_removed.txt

xargs rm < variant_bam_removed.txt

3. Append the current file list to a cumulative list

cat variant_bam_removed.txt >> removed_list.txt
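
The three steps above can also be wrapped in a single function. This is a sketch, not a drop-in replacement: the function name remove_old_variant_bams is made up, and it lets find do the filtering instead of grep. Note that -name '*.bam' only matches names that end in .bam, which is stricter than the grep above (it skips .bam.bai index files, for example). Reading the list line by line also copes with spaces in file names, though not with newlines.

```shell
# Delete old *.bam files whose path contains "variant" and log what was removed.
# Usage: remove_old_variant_bams ROOT_DIR DAYS LOG_FILE
# NOTE: remove_old_variant_bams is a hypothetical helper name.
remove_old_variant_bams() (
    root=$1 days=$2 log=$3
    list=$(mktemp) || exit 1
    # Step 1: collect matching files, at most 4 levels below ROOT_DIR
    find "$root" -maxdepth 4 -type f -name '*.bam' -path '*variant*' \
        -mtime +"$days" > "$list"
    # Step 2: remove each listed file
    while IFS= read -r f; do
        rm -- "$f"
    done < "$list"
    # Step 3: append this run's list to the cumulative log
    cat "$list" >> "$log"
    rm -f "$list"
)
```

The call equivalent to the commands above would be remove_old_variant_bams ./ 1 removed_list.txt. Dropping the rm loop first is an easy way to preview what would be deleted.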

Thursday, December 5, 2013

[Science] The Innate Growth Bistability and Fitness Landscapes of Antibiotic-Resistant Bacteria


When culturing bacteria, we sometimes observe that some cells become resistant to antibiotics. How can the emergence of drug resistance be predicted? What growth state are bacteria in when drugs are present? A Science article from UCSD Physics investigates these questions using both bulk and single-cell techniques, and builds a model to predict the growth rate of a resistant bacterial strain at different drug concentrations.


They observed that resistant bacteria either grow or do not grow (i.e., growth bistability) when the drug concentration is below the minimum inhibitory concentration (MIC). In addition, their model predicts the growth rate of the bacteria at different drug concentrations, as well as the concentration range in which growth bistability occurs. They suggest that the model can be used to further study the evolution of drug resistance.


As Chang-Ting Jason Lin said, the paper combines quantitative modeling from systems biology with single-cell methods to study drug resistance, which may become a hot topic.

Science. 2013 Nov 29;342(6162):1237435. doi: 10.1126/science.1237435.

The innate growth bistability and fitness landscapes of antibiotic-resistant bacteria.


Department of Physics, University of California at San Diego, La Jolla, CA 92093-0374, USA.