This is a simple tools to count the occurence of each base at a specified genomic coordinate. It is simply to illustrate how to use the vast array of existing bioinformatics tools with some basic code of your own & without the need to use R for simple projects.
Purpose: This piece of software uses a tool called mpileup which is part of the samtools suite. This tool takes a BAM file and returns a dataset containing each genomic location and the base call from each read. My program has been used to find the percentage occurrence of known mutations and/or SNPs to establish a suitable error threshold i.e. % base call error to pass QC for a mutation finding. See citations below.
Pass it a bam file. I pass multiple runs, step through each sample and out to a CSV file.
//take from command line
def file = "${runFolder}/${sampleName}/${sampleName}.bam"
def data = "samtools mpileup -d 1000000 ${file}".execute().text
Some pretty basic groovy/java code to count them up
int coord = 7579619 //take from command line
def dataArray = data.split("\n")
def position = []
dataArray.each { line ->
if (line.contains(coord)) {
position = line.split(/\t/)
}
}
def aCount = 0
def cCount = 0
def gCount = 0
def tCount = 0
if (position.size() > 4 ) {
def nucleicAcid = position[4].toString()
nucleicAcid.each { na ->
if (na == 'A') {
aCount++
}
if (na == 'C') {
cCount++
}
if (na == 'G') {
gCount++
}
if (na == 'T') {
tCount++
}
}
}
results.append("${sampleName}, ${aCount},
${cCount}, ${gCount}, ${tCount}\n")
Sbatch file for SLURM cluster
#!/bin/bash
#SBATCH --nodes=1 # note how the extra space makes this a comment
#SBATCH --ntasks=1 # note how the extra space makes this a comment
#SBATCH --mem-per-cpu=64G
#SBATCH --time=2-00:00:00 # 2 days / 48 hours
#SBATCH --output="%u-%N-%j.output" # lets call the file "<user>-<nodeName>-<jobID>"
#SBATCH --error="%u-%N-%j.error" # lets call the file "<user>-<nodeName>-<jobID>"
#SBATCH --mail-user=gareth.reid@petermac.org
#SBATCH --mail-type=ALL
#SBATCH --job-name="BioTools"
#SBATCH --partition=prod
module load java/1.8.0_171-jre
module load samtools
# 1
# make any changes to $1 here:
# RUNNAME=$1
# RUNNAMEVAR = $1
while read a; do
java -cp BioinformaticsTools.jar tools.PercentageBaseCalls -r $a -p <GENOMIC_POSITION> -o /home/greid/georgeGenomicPositions2/resultsOutputfile.csv
done < runs.csv
# If running via a bash loop over the excel spreadsheet, pass in the
#java -cp BioinformaticsTools.jar tools.PercentageBaseCalls -r $1 <run number> -p 7579619 -o /home/greid/resultsPosition7579619-TP53.csv
# execute like:
# readline loop
# sbatch bioTools.sbatch $readline
# 2
# alrternatively create the loop inside this sbatch shell
# open file
# read line
# execute command on line value
Dependencies: Samtools
Full code: https://github.com/gareth-reid/RockMelonBio
Citations: https://jcp.bmj.com/content/71/1/84.info, https://www.ncbi.nlm.nih.gov/pubmed/28801348