Running GATK4 in Spark Mode
Prerequisites
Prepare data files. Two types of data files are required for the GATK test: genome data files (in FASTA format) and original sequencing data files (FASTQ or BAM format).
This test case uses the following data files:
- Genome data files (in FASTA format)
The following file contains the human reference genome:
human_g1k_v37.fasta
The following two VCF files, from the 1000 Genomes Project and the Mills project, record the population indel regions detected in those projects:
- Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz
- 1000G_phase1.indels.hg19.sites.vcf.gz
The following file is a collection of nearly all publicly available population variants:
dbsnp132_20101103.vcf
Select one to three of these variant datasets based on the sequencing purpose.
- Original sequencing data (FASTQ or BAM format)
- SRR742200_1.fastq
- SRR742200_2.fastq
- Use PuTTY to log in to the server as the root user.
- Run the following commands to create an input folder:
mkdir projectDir
cd projectDir
mkdir input
- Run the following commands to decompress the case files:
gzip -d SRR742200_1.fastq.gz
gzip -d SRR742200_2.fastq.gz
gzip -d human_g1k_v37.fasta.gz
gzip -d dbsnp132_20101103.vcf.gz
- Run the following command to generate a 2-bit file using BLAT:
faToTwoBit human_g1k_v37.fasta human_g1k_v37.fasta.2bit
- Run the following command to save all data files to the input folder:
cp human_g1k_v37.fasta human_g1k_v37.fasta.2bit SRR742200_1.fastq SRR742200_2.fastq dbsnp132_20101103.vcf input
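Before continuing, it can help to confirm that every required file actually landed in the input folder. The following sketch (the file list mirrors the cp command above; the helper function name is illustrative) reports any missing files:

```shell
#!/bin/sh
# Check that all required input files are present in a directory.
# Prints a MISSING line for each absent file and a summary count.
check_inputs() {
    dir="$1"; shift
    missing=0
    for f in "$@"; do
        if [ ! -f "$dir/$f" ]; then
            echo "MISSING: $dir/$f"
            missing=$((missing + 1))
        fi
    done
    echo "$missing file(s) missing"
}

# File list taken from the cp command above.
check_inputs input \
    human_g1k_v37.fasta human_g1k_v37.fasta.2bit \
    SRR742200_1.fastq SRR742200_2.fastq dbsnp132_20101103.vcf
```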
Procedure
- Use PuTTY to log in to the server as the root user.
- Run the following command to create an index for the fasta file:
bwa index -a bwtsw human_g1k_v37.fasta
The command output is as follows:
[bwa_index] Pack FASTA... 27.89 sec
[bwa_index] Construct BWT for the packed sequence...
[BWTIncCreate] textLength=6203609478, availableWord=448508744
[BWTIncConstructFromPacked] 10 iterations done. 99999990 characters processed.
[BWTIncConstructFromPacked] 20 iterations done. 199999990 characters processed.
[BWTIncConstructFromPacked] 30 iterations done. 299999990 characters processed.
[BWTIncConstructFromPacked] 40 iterations done. 399999990 characters processed.
[BWTIncConstructFromPacked] 50 iterations done. 499999990 characters processed.
[BWTIncConstructFromPacked] 60 iterations done. 599999990 characters processed.
[BWTIncConstructFromPacked] 70 iterations done. 699999990 characters processed.
[BWTIncConstructFromPacked] 80 iterations done. 799999990 characters processed.
[BWTIncConstructFromPacked] 90 iterations done. 899999990 characters processed.
<The following output is omitted.>
- Run the following command to view the generated index files:
ls
After the preceding steps are complete, you will see five files in the current path: human_g1k_v37.fasta.amb, human_g1k_v37.fasta.ann, human_g1k_v37.fasta.bwt, human_g1k_v37.fasta.pac, and human_g1k_v37.fasta.sa. The generated index files apply to all pipelines and need to be built only once; building them takes about 1.5 hours. They are used in the subsequent alignment steps.
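The five index files can be checked programmatically. This small sketch (the function name is illustrative) derives the expected file names from the FASTA name using the standard BWA index suffixes listed above:

```shell
# Print the index file names BWA is expected to produce for a FASTA.
# The suffixes are the five standard BWA index extensions.
expected_index_files() {
    fasta="$1"
    for suffix in amb ann bwt pac sa; do
        echo "${fasta}.${suffix}"
    done
}

expected_index_files human_g1k_v37.fasta
```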
- Run the following commands to align the fastq files:
- Generate a dictionary.
gatk CreateSequenceDictionary -R human_g1k_v37.fasta -O human_g1k_v37.dict
If the command output indicates success, the execution is complete.
The generated .dict file describes the reference genome and is required by subsequent GATK steps.
- Perform the alignment.
bwa mem -M -t 96 human_g1k_v37.fasta SRR742200_1.fastq SRR742200_2.fastq > SRR7.sam
Table 1 Parameters

| Parameter | Description |
| --- | --- |
| -t | Number of threads; generally set to the number of server cores. |
If the command output indicates success, the execution is complete.
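The -t value can be derived from the machine rather than hard-coded to 96. A minimal sketch, assuming GNU coreutils' nproc is available (with a fallback when it is not):

```shell
# Choose the bwa mem thread count from the number of available cores.
# Falls back to 1 if nproc is unavailable on this system.
THREADS=$(nproc 2>/dev/null || echo 1)
echo "Using $THREADS threads"

# Example invocation (paths as in the step above):
# bwa mem -M -t "$THREADS" human_g1k_v37.fasta SRR742200_1.fastq SRR742200_2.fastq > SRR7.sam
```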
- Run the following command to re-sort the gene sequence:
By default, the SAM files generated by BWA are sorted in lexicographic (dictionary) order. They must be reordered to match the contig order of the reference genome, which is done by the Picard ReorderSam tool integrated into GATK4.
gatk ReorderSam -I SRR7.sam -O SRR7_reorder.bam -R human_g1k_v37.fasta
If the command output indicates success, the execution is complete.
- Sort the reads by genomic coordinate.
- Spark reads its input from HDFS, which is not directly visible in the local Linux file system. Run the following commands to transfer the SRR7_reorder.bam file to HDFS:
hadoop fs -mkdir /seqdata
hadoop fs -put SRR7_reorder.bam /seqdata
- Run the following commands to create the sparklog directory (if it does not already exist) and verify it:
hadoop fs -mkdir /sparklog
hadoop fs -ls /
- Run the following commands to perform the sorting:
gatk SortReadFileSpark -I hdfs://Hostname:9000/seqdata/SRR7_reorder.bam -O hdfs://Hostname:9000/seqdata/SRR7_sorted.bam -- --spark-runner SPARK --spark-master spark://Hostname:7077
If the command output indicates success, the execution is complete.
- Run the following command to copy the generated file to the data directory:
hadoop fs -get /seqdata/SRR7_sorted.bam ./
- Run the following command to view the data directory:
ll
If the command output indicates success, the execution is complete.
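In the Spark-enabled GATK commands above, "Hostname" is a placeholder for the HDFS NameNode / Spark master host. A small sketch of how the two URIs are built, so they can be set once and reused:

```shell
# Build the HDFS and Spark master URIs used by the Spark-enabled GATK
# tools. "Hostname" is the placeholder used throughout this guide;
# replace it with the actual master host name.
HOST=Hostname
HDFS_PREFIX="hdfs://${HOST}:9000"      # NameNode RPC port from the commands above
SPARK_MASTER="spark://${HOST}:7077"    # Spark standalone master port

echo "${HDFS_PREFIX}/seqdata/SRR7_reorder.bam"
echo "$SPARK_MASTER"
```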
- Run the following command to add a read group header to the BAM file:
GATK 2.0 and later versions cannot process files that lack read group headers. The header can be added with the -R option during BWA alignment or with the AddOrReplaceReadGroups tool in GATK4.
gatk AddOrReplaceReadGroups -I SRR7_sorted.bam -O SRR7_header.bam -LB lib1 -PL illumina -PU unit1 -SM 20
If the command output indicates success, the execution is complete.
- Deduplicate gene sequences.
- Run the following command to upload the SRR7_header.bam file to HDFS:
hadoop fs -put SRR7_header.bam /seqdata
- Run the following command to query HDFS:
hadoop fs -ls /seqdata
- Run the following command to perform the deduplication:
gatk MarkDuplicatesSpark -I hdfs://Hostname:9000/seqdata/SRR7_header.bam -O hdfs://Hostname:9000/seqdata/SRR7_markdup.bam -- --spark-runner SPARK --spark-master spark://Hostname:7077
If the command output indicates success, the execution is complete.
- Perform recalibration with BQSR.
- Run the following commands to upload the reference files to HDFS:
hadoop fs -put dbsnp132_20101103.vcf /seqdata
hadoop fs -put human_g1k_v37.fasta.2bit /seqdata
hadoop fs -ls /seqdata
If the command output indicates success, the execution is complete.
- Perform base quality score recalibration (BQSR).
gatk BQSRPipelineSpark -I hdfs://Hostname:9000/seqdata/SRR7_markdup.bam -R hdfs://Hostname:9000/seqdata/human_g1k_v37.fasta.2bit -O hdfs://Hostname:9000/seqdata/SRR7_bqsr.bam --known-sites hdfs://Hostname:9000/seqdata/dbsnp132_20101103.vcf --disable-sequence-dictionary-validation true -- --spark-runner SPARK --spark-master spark://Hostname:7077
If the command output indicates success, the execution is complete.
- Run the following command to perform variant calling with HaplotypeCaller and generate the variant vcf file:
gatk HaplotypeCallerSpark -R hdfs://Hostname:9000/seqdata/human_g1k_v37.fasta.2bit -I hdfs://Hostname:9000/seqdata/SRR7_bqsr.bam -O hdfs://Hostname:9000/seqdata/SRR7.snp.raw.vcf -- --spark-runner SPARK --spark-master spark://Hostname:7077
When the command completes, the variant file SRR7.snp.raw.vcf is generated in the /seqdata directory on HDFS.
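Once SRR7.snp.raw.vcf is copied back from HDFS (with hadoop fs -get, as in the sorting step), a quick sanity check is to count the variant records, i.e. every line that is not a header. A sketch using a tiny made-up VCF fragment for demonstration:

```shell
# Count variant records in a VCF: every line not starting with '#'.
count_variants() {
    grep -vc '^#' "$1"
}

# Demo on a minimal, made-up VCF fragment:
cat > demo.vcf <<'EOF'
##fileformat=VCFv4.2
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
1	10001	.	A	G	50	PASS	.
1	10005	.	T	C	48	PASS	.
EOF
count_variants demo.vcf   # prints 2
rm -f demo.vcf
```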