Bioinformatics Cheatsheet

1) Genomic Data File Types

| Format Acronym | Format Name | File Extension | Encoding | Notes |
|---|---|---|---|---|
| FASTA | - | .fa, .fasta | Text | First used by the FASTA software. Holds sequences only. Used for reference genomes. |
| FASTQ | - | .fq, .fastq | Text | Adds per-base sequence quality values to FASTA. Usually output from sequencers. |
| GenBank | - | .gb, .gbk | Text | Developed by NCBI for the GenBank project. |
| SAM | Sequence Alignment Map | .sam | Text | Read alignments against a reference, in text form. |
| BAM | Binary Alignment Map | .bam | Binary | Same information as SAM but in binary format. |
| BAI | Binary Alignment/Map Index File | .bai | Binary | A table of contents for a BAM file. |
| VCF | Variant Call Format | .vcf | Text | Information about variations from a reference. |
| GTF | Gene Transfer Format | .gtf | Text | Derived from GFF version 2. Old but still popular. |
| GFF | General Feature Format | .gff | Text | The original feature format, on which GTF is based. |
| GFF3 | General Feature Format Version 3 | .gff3 | Text | An enhancement to GFF. |
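
To make the difference between FASTA and FASTQ concrete, here is a minimal Python sketch of reading both. It is an illustration under stated assumptions, not a replacement for an established parser: the function names read_fasta and read_fastq and the toy records are hypothetical, FASTQ records are assumed to span exactly four lines, and quality characters are assumed to use the common Phred+33 encoding.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            # Sequence may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def read_fastq(handle, offset=33):
    """Yield (name, sequence, qualities) from a FASTQ text stream.

    Assumes strict four-line records and Phred+33 quality encoding.
    """
    while True:
        name = handle.readline().rstrip("\n")
        if not name:
            break
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        # Each quality character encodes one integer score.
        yield name[1:], seq, [ord(c) - offset for c in qual]


# Toy data, made up for illustration only.
fasta_text = ">seq1 toy example\nACGT\nTTGA\n>seq2\nGGCC\n"
fastq_text = "@read1\nACGT\n+\nII5!\n"

fasta_records = list(read_fasta(io.StringIO(fasta_text)))
fastq_records = list(read_fastq(io.StringIO(fastq_text)))

print(fasta_records)
print(fastq_records)
```

The FASTA reader returns only a header and a sequence, while the FASTQ reader also returns one quality score per base, which is the extra information FASTQ adds. For real files, an open file object can be passed in place of io.StringIO.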

2) Sources of Genomic Data Files

| URL | Description |
|---|---|
| https://www.ncbi.nlm.nih.gov/genome/ | Genomic data available from NIH NCBI Datasets. |
| https://www.ensembl.org/ | Ensembl is a genome browser for vertebrate genomes. |
| https://www.internationalgenome.org/data/ | The International Genome Sample Resource (IGSR) and the 1000 Genomes Project. |
| https://b1mg-project.eu/ | The Beyond 1 Million Genomes (B1MG) project is helping to create a network of genetic and clinical data across Europe. |
| https://duos.org/ | Data Use Oversight System (DUOS). |

3) Computational Pipelines

https://usegalaxy.org/

Web-based data processing platform with built-in tools.

https://nextflow.io/

Workflow management, Java and Groovy based.

https://snakemake.github.io/

Workflow management, Python based.

https://www.shinyapps.io

A hosting service for sharing your Shiny applications online, R based.