Simple tool to count genomic base occurrences

This is a simple tools to count the occurence of each base at a specified genomic coordinate. It is simply to illustrate how to use the vast array of existing bioinformatics tools with some basic code of your own & without the need to use R for simple projects.

Purpose: This piece of software uses a tool called mpileup which is part of the samtools suite. This tool takes a BAM file and returns a dataset containing each genomic location and the base call from each read. My program has been used to find the percentage occurrence of known mutations and/or SNPs to establish a suitable error threshold i.e. % base call error to pass QC for a mutation finding. See citations below.

Pass it a bam file. I pass multiple runs, step through each sample and out to a CSV file.

//take from command line
def file = "${runFolder}/${sampleName}/${sampleName}.bam"
def data = "samtools mpileup -d 1000000 ${file}".execute().text

Some pretty basic groovy/java code to count them up

int coord = 7579619 //take from command line
def dataArray = data.split("\n")
def position = []
dataArray.each { line ->
    if (line.contains(coord)) {
        position = line.split(/\t/)
    }
}
def aCount = 0
def cCount = 0
def gCount = 0
def tCount = 0

if (position.size() > 4 ) {
    def nucleicAcid = position[4].toString()
    nucleicAcid.each { na ->
        if (na == 'A') {
            aCount++
        }
        if (na == 'C') {
            cCount++
        }
        if (na == 'G') {
            gCount++
        }
        if (na == 'T') {
            tCount++
        }
    }
}

results.append("${sampleName}, ${aCount}, 
${cCount}, ${gCount}, ${tCount}\n")

Sbatch file for SLURM cluster

#!/bin/bash
#SBATCH --nodes=1    # note how the extra space makes this a comment
#SBATCH --ntasks=1   # note how the extra space makes this a comment
#SBATCH --mem-per-cpu=64G
#SBATCH --time=2-00:00:00     # 2 days / 48 hours
#SBATCH --output="%u-%N-%j.output"  # lets call the file "<user>-<nodeName>-<jobID>"
    #SBATCH --error="%u-%N-%j.error"  # lets call the file "<user>-<nodeName>-<jobID>"
        #SBATCH --mail-user=gareth.reid@petermac.org
        #SBATCH --mail-type=ALL
        #SBATCH --job-name="BioTools"

        #SBATCH --partition=prod

        module load java/1.8.0_171-jre
        module load samtools
        # 1
        # make any changes to $1 here:
        # RUNNAME=$1
        # RUNNAMEVAR = $1

        while read a; do
        java -cp BioinformaticsTools.jar tools.PercentageBaseCalls -r $a -p <GENOMIC_POSITION> -o /home/greid/georgeGenomicPositions2/resultsOutputfile.csv
            done < runs.csv

            # If running via a bash loop over the excel spreadsheet, pass in the
            #java -cp BioinformaticsTools.jar tools.PercentageBaseCalls -r $1 <run number> -p 7579619 -o /home/greid/resultsPosition7579619-TP53.csv

                # execute like:
                # readline loop
                # sbatch bioTools.sbatch $readline


                # 2
                # alrternatively create the loop inside this sbatch shell

                # open file
                # read line
                # execute command on line value 

Dependencies: Samtools

Full code: https://github.com/gareth-reid/RockMelonBio

Citations: https://jcp.bmj.com/content/71/1/84.info, https://www.ncbi.nlm.nih.gov/pubmed/28801348

Leave a comment