Educational Cluster Applications
Faculty who will be teaching a course are encouraged to check this list to see whether the applications they would like to use for their classes are available on the Educational Cluster. Faculty may request that additional software or codes be installed, but the request must be made before the start of the semester in which the class will be taught so that we have time to configure, build, install, and test the application in our environment. We typically do not install new applications or update existing ones during the semester, so that the environment stays consistent throughout.
ABySS, Assembly by Short Sequences, is a de novo, parallel, paired-end sequence assembler that is designed for short reads. The single-processor version is useful for assembling genomes up to 100 Mbases in size. The parallel version is implemented using MPI and is capable of assembling larger genomes.
versions available: 2.1.5, 2.2.5
ADMIXTURE is a software tool for maximum likelihood estimation of individual ancestries from multilocus SNP genotype datasets. It uses the same statistical model as STRUCTURE but calculates estimates much more rapidly using a fast numerical optimization algorithm.
versions available: 1.3.0
AGWG-merge is a version of the 3D-DNA pipeline (Dudchenko et al., Science, 2017) that was used to help generate the AaegL5 genome assembly for the mosquito Aedes aegypti.
versions available: 180806
An implementation of the inference pipeline of AlphaFold v2.0. This is a completely new model that was entered in CASP14 and published in Nature.
versions available: 2.1.1, 2.3.1
AmberTools consists of several independently developed packages that work well by themselves, and with Amber itself. The suite can also be used to carry out complete molecular dynamics simulations, with either explicit water or generalized Born solvent models. AmberTools20 consists of the following major codes: NAB/sff, antechamber and MCPB, tleap and parmed, sqm, pbsa, 3D-RISM, sander, mdgx, cpptraj and pytraj, MMPBSA.py, and amberlite.
versions available: 20, 20-mpi
Anvi’o is an open-source, community-driven analysis and visualization platform for ‘omics data. It brings together many aspects of today’s cutting-edge genomic, metagenomic, metatranscriptomic, pangenomic, and phylogenomic analysis practices to address a wide array of needs.
versions available: 7
Apricot implements submodular optimization for the purpose of summarizing massive data sets into minimally redundant subsets that are still representative of the original data. These subsets are useful for both visualizing the modalities in the data (such as in the two data sets below) and for training accurate machine learning models with just a fraction of the examples and compute.
versions available: 0.6.1
ARIA (Ambiguous Restraints for Iterative Assignment) is software for automated NOE assignment and NMR structure calculation. It speeds up and automates the assignment process through the use of an iterative structure calculation scheme. Additionally, a refinement in explicit water improves the quality of the calculated structures, validation tests help spectroscopists to judge the quality of the final structures, and the support of the CCPN data model simplifies the exchange of information with other NMR software packages.
versions available: 2.3.2
ARTIC is a pipeline and set of accompanying tools for working with viral nanopore sequencing data, generated from tiling amplicon schemes. It is designed to help run the artic bioinformatics protocols; for example the SARS-CoV-2 coronavirus protocol. There are 2 workflows baked into this pipeline, one which uses signal data (via nanopolish) and one that does not (via medaka).
versions available: 1.2.1
ATSAS is a software suite for the analysis of small-angle scattering data from biological macromolecules. Included in the ATSAS suite: BUNCH, CHROMIXS, CORAL, CRYSOL, CRYSON, DAMAVER, DAMMIF, DAMMIN, DATtools, EOM, GASBOR, GNOM, MONSA, OLIGOMER, PRIMUS, SASFLOW, SASREF, SREFLEX, SUPCOMB.
versions available: 3.0.3
AUGUSTUS is a gene prediction program for eukaryotes. It can be used as an ab initio program, which means it bases its prediction purely on the sequence.
versions available: 3.3.3, 3.4.0
The AWS Command Line Interface (AWS CLI) is an open source tool that enables you to interact with AWS services using commands in your command-line shell. With minimal configuration, the AWS CLI enables you to start running commands that implement functionality equivalent to that provided by the browser-based AWS Management Console from the command prompt in your terminal program.
versions available: 2.11.2
BamTools is a project that provides both a C++ API and a command-line toolkit for reading, writing, and manipulating BAM (genome alignment) files.
versions available: 2.4.1, 2.5.1
BayesASE is a complete bioinformatics pipeline that incorporates state-of-the-art error reduction techniques and a flexible Bayesian approach to estimating Allelic Imbalance (AI) and formally comparing levels of AI between conditions. AI indicates the presence of functional variation in cis regulatory regions. Detecting cis regulatory differences using AI is widespread, yet there is no formal statistical methodology that tests whether AI differs between conditions.
versions available: 21.1.13
Beagle is a software package for phasing genotypes and imputing ungenotyped markers. Beagle has improved memory and computational efficiency when analyzing large sequence data sets.
versions available: 5.4
BEAST 2 is a cross-platform program for Bayesian phylogenetic analysis of molecular sequences. It estimates rooted, time-measured phylogenies using strict or relaxed molecular clock models. It can be used as a method of reconstructing phylogenies but is also a framework for testing evolutionary hypotheses without conditioning on a single tree topology.
versions available: 2.6.3
Collectively, the bedtools utilities are a swiss-army knife of tools for a wide-range of genomics analysis tasks. The most widely-used tools enable genome arithmetic: that is, set theory on the genome.
versions available: 2.26.0, 2.29.0
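As a rough illustration of what "set theory on the genome" means, the sketch below intersects two lists of intervals on a single chromosome. It is plain Python written for this page, not bedtools code; the function name and the sample coordinates are hypothetical, and the inputs are assumed to be sorted and non-overlapping.

```python
def intersect_intervals(a, b):
    """Return the overlaps between two sorted lists of (start, end) intervals.

    Intervals are half-open [start, end), as in the BED format, and each
    list is assumed to be sorted and non-overlapping within itself.
    """
    overlaps = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:  # the two intervals share at least one base
            overlaps.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return overlaps


# Hypothetical example data: gene and peak coordinates on one chromosome.
genes = [(100, 200), (300, 400)]
peaks = [(150, 350), (390, 500)]
print(intersect_intervals(genes, peaks))
```

On the cluster, the bedtools utilities themselves perform this kind of operation directly on BED, GFF, VCF, and BAM files.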
Bismark is a set of tools for the time-efficient analysis of Bisulfite-Seq (BS-Seq) data. Bismark performs alignments of bisulfite-treated reads to a reference genome and cytosine methylation calls at the same time. (Requires Bowtie or Bowtie2)
versions available: 0.22.3, 0.24.0
NCBI BLAST (Basic Local Alignment Search Tool) is a suite of programs for aligning query sequences against those present in a selected target database.
versions available: 2.11.0+, 2.3.0+, 2.9.0+
Blat produces two major classes of alignments: 1) at the DNA level between two sequences that are of 95% or greater identity, but which may include large inserts, 2) at the protein or translated DNA level between sequences that are of 80% or greater identity and may also include large inserts. (v36 / 64-bit)
versions available: 36
Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters to relatively long (e.g. mammalian) genomes.
versions available: 2.2.9, 2.4.1
BRAKER2 is a pipeline for unsupervised RNA-Seq-based genome annotation with GeneMark-ET and AUGUSTUS.
versions available: 2.1.2, 2.1.5
BUSCO provides quantitative measures for the assessment of genome assembly, gene set, and transcriptome completeness, based on evolutionarily-informed expectations of gene content from near-universal single-copy orthologs selected from OrthoDB v9.
versions available: 4.0.6, 5.1.3, 5.4.7
BWA (Burrows-Wheeler Aligner) is a software package for mapping DNA sequences against a large reference genome, such as the human genome. It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM.
versions available: 0.7.12, 0.7.17
Canu is a fork of the Celera Assembler, designed for high-noise single-molecule sequencing, such as the PacBio RS II/Sequel or Oxford Nanopore MinION.
versions available: 1.8, 2.1.1
CD-HIT is a very widely used program for clustering and comparing protein or nucleotide sequences. CD-HIT is very fast and can handle extremely large databases. It significantly reduces the computational and manual effort in many sequence analysis tasks and aids in understanding the data structure and correcting the bias within a dataset.
versions available: 4.8.1
CFM-ID provides a method for accurately and efficiently identifying metabolites in spectra generated by electrospray tandem mass spectrometry (ESI-MS/MS). The program uses Competitive Fragmentation Modeling to produce a probabilistic generative model for the MS/MS fragmentation process and machine learning techniques to adapt the model parameters from data.
versions available: 2.4.3, 4.4.7
Clustal Omega is the latest addition to the Clustal family. It offers a significant increase in scalability over previous versions, allowing hundreds of thousands of sequences to be aligned in only a few hours. In addition, the quality of alignments is superior to previous versions.
versions available: 1.2.4
CMake is a cross-platform, open-source build system. CMake is a family of tools designed to build, test and package software.
versions available: 3.19.7, 3.25.0
Cufflinks assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples.
versions available: 2.2.1
dammit is a simple de novo transcriptome annotator. It was born out of the observation that: annotation is mundane and annoying; all the individual pieces of the process exist already; and, the existing solutions are overly complicated or rely on crappy non-free software. Science shouldn’t suck for the sake of sucking, so dammit attempts to make this sucky part of the process suck a little less.
versions available: 1.2
dDocent is a simple bash wrapper to QC, assemble, map, and call SNPs from almost any kind of RAD sequencing. If you have a reference already, dDocent can be used to call SNPs from almost any type of NGS data set.
versions available: 2.8.13
From demultiplexing to consensus for Nanopore amplicon data, Decona can process multiple samples in one line of code. It handles mixed samples containing multiple species from bulk and eDNA, mixed amplicons in one barcode, multiplexed barcodes, and multiple samples in one run, and it outputs Medaka-polished consensus sequences.
versions available: 1.3.1
DIAMOND is a sequence aligner for protein and translated DNA searches, designed for high performance analysis of big sequence data. The key features are: Pairwise alignment of proteins and translated DNA at 100x-10,000x speed of BLAST; Frameshift alignments for long read analysis; Low resource requirements and suitable for running on standard desktops or laptops; Various output formats, including BLAST pairwise, tabular and XML, as well as taxonomic classification.
versions available: 2.0.9
DeepMind’s software stack for physics-based simulation and Reinforcement Learning environments, using MuJoCo physics.
versions available: 0.0.408
DRAM (Distilled and Refined Annotation of Metabolism) is a tool for annotating metagenomic assembled genomes and VirSorter identified viral contigs. DRAM annotates MAGs and viral contigs using KEGG (if provided by the user), UniRef90, PFAM, dbCAN, RefSeq viral, VOGDB and the MEROPS peptidase database as well as custom user databases.
versions available: 1.4.6
DRAP is a De novo RNA-Seq Assembly Pipeline which wraps two assemblers, Trinity and Oases, in order to improve their results regarding the above-mentioned criteria.
versions available: 1.92
The Eagle software estimates haplotype phase either within a genotyped cohort or using a phased reference panel. Eagle2 uses a new, very fast HMM-based algorithm that improves speed and accuracy over existing methods via two key ideas: a new data structure based on the positional Burrows-Wheeler transform and a rapid search algorithm that explores only the most relevant paths through the HMM.
versions available: 2.4.1
versions available: 2020-12
The Ecosystem Demography Biosphere Model (ED2) is an integrated terrestrial biosphere model incorporating hydrology, land-surface biophysics, vegetation dynamics, and soil carbon and nitrogen biogeochemistry. Like its predecessor, ED, ED2 uses a set of size- and age-structured partial differential equations that track the changing structure and composition of the plant canopy.
versions available: 2.2-intel, 2.2-mpi
EMBOSS (European Molecular Biology Open Software Suite) is a software analysis package specially developed for the needs of the molecular biology (e.g. EMBnet) user community. The software automatically copes with data in a variety of formats and even allows transparent retrieval of sequence data from the web.
versions available: 6.6.0
The Eukaryotic Non-Model Transcriptome Annotation Pipeline (EnTAP) is designed to improve the accuracy, speed, and flexibility of functional gene annotation for de novo assembled transcriptomes in non-model eukaryotes. This software package addresses the fragmentation and related assembly issues that result in inflated transcript estimates and poor annotation rates. Following filters applied through assessment of true expression and frame selection, open-source tools are leveraged to functionally annotate the translated proteins.
versions available: 0.10.8
This code implements the popular RAxML search algorithm for maximum likelihood based inference of phylogenetic trees. It uses a radically new MPI parallelization approach that yields improved parallel efficiency, in particular on partitioned multi-gene or whole-genome datasets.
versions available: 3.0.17, 3.0.21
Exonerate is a generic tool for sequence alignment.
versions available: 2.4.0
FastQC is a quality control tool for high throughput sequence data. It takes a FastQ file and runs a series of tests on it to generate a comprehensive QC report. FastQC can be run either as an interactive GUI app, or in a non-interactive way (say as part of a pipeline) which will generate an HTML report for each file you process.
versions available: 0.11.9
FFmpeg is the leading multimedia framework, able to decode, encode, transcode, mux, demux, stream, filter and play pretty much anything that humans and machines have created. It supports the most obscure ancient formats up to the cutting edge. It contains libavcodec, libavutil, libavformat, libavfilter, libavdevice, libswscale and libswresample, which can be used by applications, as well as ffmpeg, ffserver, ffplay and ffprobe, which can be used by end users for transcoding, streaming and playing.
versions available: 4.2.1, 4.2.1-cuda10.2
Ansys Fluent is general-purpose computational fluid dynamics (CFD) software used to model fluid flow, heat and mass transfer, chemical reactions, and more. Fluent is also known for its efficient HPC scaling: large models can be solved on multiple processors, on either CPU or GPU.
versions available: 2022
Flye is a de novo assembler for single-molecule sequencing reads, such as those produced by PacBio and Oxford Nanopore Technologies. It is designed for a wide range of datasets, from small bacterial projects to large mammalian-scale assemblies. The package represents a complete pipeline: it takes raw PacBio / ONT reads as input and outputs polished contigs. Flye also has a special mode for metagenome assembly.
versions available: 2.9.1
Freyja is a tool to recover relative lineage abundances from mixed SARS-CoV-2 samples from a sequencing dataset (BAM aligned to the Hu-1 reference). The method uses lineage-determining mutational ‘barcodes’ derived from the UShER global phylogenetic tree as a basis set to solve the constrained (unit sum, non-negative) de-mixing problem.
versions available: 1.3
GATK (Genome Analysis Toolkit) offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.
versions available: 3.8.1, 4.1.6, 4.2.0
GeneMark, developed in 1993, was the first gene finding method recognized as an efficient and accurate tool for genome projects. GeneMark was used for annotation of the first completely sequenced bacterium, Haemophilus influenzae, and the first completely sequenced archaeon, Methanococcus jannaschii.
versions available: 4.71
GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles. It is primarily designed for biochemical molecules like proteins, lipids and nucleic acids that have a lot of complicated bonded interactions, but since GROMACS is extremely fast at calculating the nonbonded interactions, many groups are also using it for research on non-biological systems, e.g. polymers.
versions available: 2020.6, 2020.6-cuda, 2020.6-mpi, 2020.6-mpi-cuda, 2021.1, 2021.1-cuda, 2021.1-mpi, 2021.1-mpi-cuda
Guppy is a data processing toolkit that contains the Oxford Nanopore Technologies’ basecalling algorithms, and several bioinformatic post-processing features. Early downstream analysis components such as barcoding/demultiplexing, adapter trimming and alignment are contained within Guppy.
versions available: 6.0.6, 6.0.6-cuda11.2, 6.3.4, 6.3.4-cuda11.2, 6.5.7, 6.5.7-cuda11.4
Hecaton is a framework specifically designed for plant genomes that detects copy number variants (CNVs) using short paired-end Illumina reads. CNVs are called by integrating existing structural variant callers through a machine-learning model and several custom post-processing scripts.
versions available: 0.4.0, 0.5.0
The HH-suite is an open-source software package for sensitive protein sequence searching based on the pairwise alignment of hidden Markov models (HMMs). It contains HHsearch and HHblits among other programs and utilities. HHsearch takes as input a multiple sequence alignment (MSA) or profile HMM and searches a database of HMMs (e.g. PDB, Pfam, or InterPro) for homologous proteins.
versions available: 3.3.0
HiCExplorer facilitates the creation of contact matrices, correction of contacts, TAD detection, A/B compartments, merging, reordering of chromosomes, conversion from different formats including cooler, and detection of long-range contacts. Moreover, it allows the visualization of multiple contact matrices along with other types of data like genes, compartments, ChIP-seq coverage tracks (and in general any type of genomic scores), long-range contacts and the visualization of viewpoints.
versions available: 3.6
HiC-Pro is an optimized and flexible pipeline for Hi-C data processing. HiC-Pro was designed to process Hi-C data, from raw fastq files (paired-end Illumina data) to the normalized contact maps.
versions available: 2.11.1, 3.0.0
HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads (whole-genome, transcriptome, and exome sequencing data) against the general human population (as well as against a single reference genome). Based on GCSA (an extension of BWT for a graph), we designed and implemented a graph FM index (GFM), an original approach and its first implementation to the best of our knowledge.
versions available: 2.2.0, 2.2.1
HMMER is used for searching sequence databases for sequence homologs, and for making sequence alignments. It implements methods using probabilistic models called profile hidden Markov models (profile HMMs).
versions available: 3.3.2
Homopolish is a genome polisher originally developed for Nanopore and subsequently extended for PacBio CLR. It generates a high-quality genome (>Q50) for viruses, bacteria, and fungi. Nanopore/PacBio systematic errors are corrected by retrieving homologs from closely related genomes and polishing with an SVM.
versions available: 0.4
HUMAnN is a pipeline for efficiently and accurately profiling the presence/absence and abundance of microbial pathways in a community from metagenomic or metatranscriptomic sequencing data (typically millions of short DNA/RNA reads).
versions available: 0.11.2
InterPro is a database which integrates together predictive information about proteins’ function from a number of partner resources, giving an overview of the families that a protein belongs to and the domains and sites it contains.
versions available: 5.55-88.0, 5.60-92.0
iPHoP stands for integrated Phage Host Prediction. It is an automated command-line pipeline for predicting host genus of novel bacteriophages and archaeoviruses based on their genome sequences.
versions available: 1.3.0
A fast and effective stochastic algorithm to infer phylogenetic trees by maximum likelihood. IQ-TREE compares favorably to RAxML and PhyML in terms of likelihoods with similar computing time.
versions available: 1.6.12, 2.1.2
I-TASSER is an integrated package for protein structure and function predictions. For a given sequence, I-TASSER first identifies template proteins from the Protein Data Bank (PDB) by multiple threading techniques (LOMETS).
versions available: 5.1
JAGS is Just Another Gibbs Sampler. It is a program for analysis of Bayesian hierarchical models using Markov Chain Monte Carlo (MCMC) simulation not wholly unlike BUGS. JAGS was written with three aims in mind: 1) To have a cross-platform engine for the BUGS language, 2) To be extensible, allowing users to write their own functions, distributions and samplers, and 3) To be a platform for experimentation with ideas in Bayesian modelling.
versions available: 4.3.0, 4.3.1
JELLYFISH is a tool for fast, memory-efficient counting of k-mers in DNA. JELLYFISH can count k-mers using an order of magnitude less memory and an order of magnitude faster than other k-mer counting packages by using an efficient encoding of a hash table and by exploiting the ‘compare-and-swap’ CPU instruction to increase parallelism.
versions available: 2.2.6, 2.3.0
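For readers new to the term, a k-mer is a substring of length k, and counting k-mers means tallying every such substring in a sequence. The sketch below shows the idea with a plain Python dictionary-based counter; it is an illustration only and does not reproduce JELLYFISH's memory-efficient hash table or its parallelism. The function name and the sample sequence are hypothetical.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping substring of length k in a DNA sequence."""
    sequence = sequence.upper()
    # A sequence of length n contains n - k + 1 overlapping k-mers.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Hypothetical example: 3-mers of a short sequence.
counts = count_kmers("ACGTACGT", 3)
for kmer, n in sorted(counts.items()):
    print(kmer, n)
```

A simple counter like this holds every distinct k-mer in memory as a Python string, which is exactly the cost that dedicated tools are designed to reduce for genome-scale data.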
KneadData is a tool designed to perform quality control on metagenomic and metatranscriptomic sequencing data, especially data from microbiome experiments.
versions available: 0.7.2
Kraken is a system for assigning taxonomic labels to short DNA sequences, usually obtained through metagenomic studies.
versions available: 1.1.1
Kraken is a system for assigning taxonomic labels to short DNA sequences, usually obtained through metagenomic studies.
versions available: 2.1.2
LAMMPS is a classical molecular dynamics code, and an acronym for Large-scale Atomic/Molecular Massively Parallel Simulator. Packages built: ASPHERE ATC AWPMD BOCS BODY CLASS2 COLLOID COLVARS COMPRESS CORESHELL DIFFRACTION DIPOLE DRUDE EFF FEP GRANULAR H5MD KIM KSPACE LATTE MANIFOLD MANYBODY MC MGPT MISC MOFFF MOLECULE MOLFILE MPIIO OPT PERI PHONON POEMS PTM PYTHON QEQ QTB REAXFF REPLICA RIGID SHOCK SMTBQ SPH SPIN SRD TALLY UEF VORONOI
versions available: 10Mar21-cuda, 10Mar21-mpi, 23Jun22-cuda, 23Jun22-mpi
LASTZ: A tool for (1) aligning two DNA sequences, and (2) inferring appropriate scoring parameters automatically.
versions available: 1.04.03
LIGGGHTS(R)-PUBLIC is an Open Source Discrete Element Method Particle Simulation Software based on LAMMPS. LIGGGHTS (R) stands for LAMMPS improved for general granular and granular heat transfer simulations. LIGGGHTS (R) aims to improve the capabilities of LAMMPS with the goal to apply it to industrial applications.
versions available: 3.8.0
LINKS is a genomics application for scaffolding genome assemblies with long reads, such as those produced by Oxford Nanopore Technologies Ltd. It can be used to scaffold high-quality draft genome assemblies with any long sequences (e.g. ONT reads, PacBio reads, other draft genomes, etc.). It is also used to scaffold contig pairs linked by ARCS/ARKS.
versions available: 1.8.7
A genome assembly correction and scaffolding pipeline using long reads, consisting of up to three steps: 1) Tigmint cuts the draft assembly at potentially misassembled regions, 2) ntLink scaffolds the corrected assembly, and 3) ARKS optionally performs a further round of scaffolding.
versions available: 1.0.2
LoRDEC (built with GATB v1.4.1) is a program to correct sequencing errors in long reads from 3rd generation sequencing with high error rate, and is especially intended for PacBio reads. It uses a hybrid strategy, meaning that it uses two sets of reads: the reference read set, whose error rate is assumed to be small, and the PacBio read set, which is then corrected using the reference set. Typically, the reference set contains Illumina reads.
versions available: 0.9
MAFFT is a multiple alignment program for amino acid or nucleotide sequences. It offers a range of multiple alignment methods, such as L-INS-i (accurate; for alignment of <∼200 sequences) and FFT-NS-2 (fast; for alignment of <∼30,000 sequences).
versions available: 7.055woe, 7.273woe, 7.487woe
MAKER is a portable and easily configurable genome annotation pipeline. Its purpose is to allow smaller eukaryotic and prokaryotic genome projects to independently annotate their genomes and to create genome databases.
versions available: 2.31, 2.31-mpi, 3.01, 3.01-mpi
The MaSuRCA (Maryland Super Read Cabog Assembler) assembler combines the benefits of deBruijn graph and Overlap-Layout-Consensus assembly approaches. MaSuRCA supports hybrid assembly with short Illumina reads and long high error PacBio/MinION data.
versions available: 3.3.4, 4.0.1, 4.0.9
Mathematica is a software package which is ideal for communicating scientific ideas, whether this is visualization of a concept in an intro-level course, or creating a simulation of a new idea related to research.
versions available: 11.3.0, 12.3.1, 8.0
MaxSSmap is a GPU program for mapping divergent short reads to genomes with the maximum scoring subsequence. MaxSSmap aims to achieve comparable accuracy to Smith-Waterman but with faster runtimes. Similar to most programs MaxSSmap identifies a local region of the genome followed by exact alignment.
versions available: 1.0
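The "maximum scoring subsequence" is the contiguous stretch of an alignment whose per-position scores sum to the highest total. The sketch below finds it with a single linear scan (Kadane's algorithm). It is a CPU-only illustration of the concept, not MaxSSmap's GPU implementation; the +1/-1 scoring and the function names are assumptions made for the example.

```python
def match_scores(read, genome_window, match=1, mismatch=-1):
    """Score each aligned position: match if the bases agree, else mismatch."""
    return [match if r == g else mismatch for r, g in zip(read, genome_window)]


def max_scoring_subsequence(scores):
    """Return (best_score, start, end) of the highest-scoring contiguous run.

    The run covers scores[start:end]; an empty run (0, 0, 0) is returned
    when no position has a positive score.
    """
    best_score, best_start, best_end = 0, 0, 0
    current, current_start = 0, 0
    for i, s in enumerate(scores):
        if current <= 0:
            # A non-positive running total cannot help; restart here.
            current, current_start = s, i
        else:
            current += s
        if current > best_score:
            best_score, best_start, best_end = current, current_start, i + 1
    return best_score, best_start, best_end


# Hypothetical read compared against a same-length window of the genome.
scores = match_scores("ACGTTTGA", "ACGAATGA")
print(max_scoring_subsequence(scores))
```

Because the scan tolerates short runs of mismatches inside a longer matching region, this kind of score suits divergent reads.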
The MATLAB Compiler Runtime is a standalone set of shared libraries that enables the execution of compiled MATLAB applications or components on computers that do not have MATLAB installed. When used together, MATLAB, MATLAB Compiler, and the MATLAB Runtime enable you to create and distribute numerical applications or software components quickly and securely.
versions available: R2018a, R2018b, R2019b
MEGADOCK is an ultra-high-performance FFT-grid-based protein-protein docking program for heterogeneous supercomputers that takes advantage of the massively parallel CUDA architecture of NVIDIA GPUs and multiple computation nodes.
versions available: 4.1.1, 4.1.1-mpi
Megalodon is a research command line tool to extract high accuracy modified base and sequence variant calls from raw nanopore reads by anchoring the information rich basecalling neural network output to a reference genome/transcriptome.
versions available: 2.4.1
The objective of the MEGA (Molecular Evolutionary Genetics Analysis) software has been to provide tools for exploring, discovering, and analyzing DNA and protein sequences from an evolutionary perspective. MEGA is designed to facilitate extensive sequence data analysis from an evolutionary perspective using a single program package. At the same time, the overlap between the methods implemented in MEGA and those in other existing evolutionary analysis programs has been consciously avoided. This is reflected in the exclusion of the maximum likelihood method (PHYLIP) and in the absence of extensive options for the maximum parsimony method (PAUP and MacClade).
versions available: 10.2.6
Multiple Em for Motif Elicitation. MEME discovers novel, ungapped motifs (recurring, fixed-length patterns) in your sequences. MEME splits variable-length patterns into two or more separate motifs.
versions available: 5.4.1
Evaluate genome assemblies with k-mers and more. Often, genome assembly projects have Illumina whole-genome sequencing reads available for the assembled individual. The k-mer spectrum of this read set can be used to evaluate assembly quality independently, without the need for a high-quality reference. Merqury provides a set of tools for this purpose.
versions available: 1.3
MetaBAT: A robust statistical framework for reconstructing genomes from metagenomic data
versions available: 2.13, 2.15
MetaPhlAn is a computational tool for profiling the composition of microbial communities (Bacteria, Archaea, Eukaryotes and Viruses) from metagenomic shotgun sequencing data (i.e. not 16S) with species-level resolution. With the newly added StrainPhlAn module, it is now possible to perform accurate strain-level microbial profiling.
versions available: 2.8.1, 3.0.7, 4.0.3
MicrobeAnnotator uses an iterative approach to annotate microbial genomes (Bacteria, Archaea and Virus) starting from proteins predicted using your favorite ORF prediction tool, e.g. Prodigal. The iterative approach is composed of three or five main steps, depending on the flavor of MicrobeAnnotator you run.
versions available: 2.0.5
Minialign is a little bit fast and moderately accurate nucleotide sequence alignment tool designed for PacBio and Nanopore long reads. It is built on three key algorithms, minimizer-based index of the minimap overlapper, array-based seed chaining, and SIMD-parallel Smith-Waterman-Gotoh extension.
versions available: 0.4.4, 0.6.0
Miniasm is a very fast OLC-based de novo assembler for noisy long reads. It takes all-vs-all read self-mappings, typically produced by minimap, as input and outputs an assembly graph in the GFA format.
versions available: 0.3
Minimap2 is a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database.
versions available: 2.17, 2.18
MIRA is a whole genome shotgun and EST sequence assembler for Sanger, 454, Solexa (Illumina), IonTorrent data and PacBio (the latter at the moment only CCS and error-corrected CLR reads). It can be seen as a Swiss army knife of sequence assembly developed and used in the past 16 years to get assembly jobs done efficiently – and especially accurately.
versions available: 4.0.2
miRDeep2 is a software package for identification of novel and known miRNAs in deep sequencing data. Furthermore, it can be used for miRNA expression profiling across samples. Last, a new module for preprocessing of raw Illumina sequencing data produces files for downstream analysis with the miRDeep2 or quantifier module. Colorspace sequencing data is currently not supported by the preprocessing module, but support is planned.
versions available: 0.1.2
miRDeep-P2 (miRDP2) was developed to analyze microRNA (miRNA) transcriptomes in plants quickly and accurately. It is adapted from miRDeep-P (miRDP) with new strategies and an overhauled algorithm. The developers have tested miRDP2 on miRNA transcriptomes from plants of increasing genome size: Arabidopsis, rice, tomato, maize and wheat.
versions available: 1.1.4
microRNA PREdiction From small RNAseq data (miR-PREFeR) uses expression patterns of miRNA and follows the criteria for plant microRNA annotation to accurately predict plant miRNAs from one or more small RNA-Seq data samples of the same species. The developers tested miR-PREFeR on several plant species. The results show that miR-PREFeR is sensitive, accurate, and fast, and has a low memory footprint.
versions available: 0.24
The MITObim procedure (mitochondrial baiting and iterative mapping) represents a highly efficient approach to assembling novel mitochondrial genomes of non-model organisms directly from total genomic DNA derived NGS reads. Labor intensive long-range PCR steps prior to sequencing are no longer required.
versions available: 1.9.1
MrBayes is a program for Bayesian inference and model choice across a wide range of phylogenetic and evolutionary models. MrBayes uses Markov chain Monte Carlo (MCMC) methods to estimate the posterior distribution of model parameters.
versions available: 3.2.2, 3.2.7
MS-FINDER was launched as a universal program for compound ‘annotation’ that supports EI-MS (GC/MS) and MS/MS spectral mining. MS-FINDER aims to provide solutions for 1) formula predictions, 2) fragment annotations, and 3) structure elucidations by means of unknown spectra. In addition, the program can annotate your unknowns by the public spectral databases such as MassBank, LipidBlast, and GNPS.
versions available: 3.52
msprime is a population genetics simulator of ancestry and DNA sequence evolution based on tskit. msprime can simulate ancestral histories for a sample of individuals, consistent with a given demography under a range of different models and evolutionary processes. It can also simulate mutations on a given ancestral history (which can be produced by msprime ancestry simulations or other programs supporting tskit) under a variety of different models of genome sequence evolution.
versions available: 1.1.1
MultiQC is a tool to create a single report with interactive plots for multiple bioinformatics analyses across many samples. MultiQC searches a given directory for analysis logs and compiles an HTML report. It’s a general use tool, perfect for summarising the output from numerous bioinformatics tools.
versions available: 1.11
MUMmer is a modular system for the rapid whole genome alignment of finished or draft sequence. This package provides an efficient suffix tree library, seed-and-extend alignment, SNP detection, repeat detection, and visualization tools. MUMmer can also align incomplete genomes; it can easily handle the 100s or 1000s of contigs from a shotgun sequencing project, and will align them to another set of contigs or a genome using the NUCmer program included with the system.
versions available: 3.23
MUSCLE is one of the best-performing multiple alignment programs according to published benchmark tests, with accuracy and speed that are consistently better than CLUSTALW. MUSCLE can align hundreds of sequences in seconds.
versions available: 3.8.1551, 3.8.31
NAMD (2.14 x86_64 mpi) is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. Based on Charm++ parallel objects, NAMD scales to hundreds of cores for typical simulations and beyond 500,000 cores for the largest simulations.
versions available: 2.13-mcore, 2.13-mcore-cuda, 2.13-mpi, 2.14-mcore, 2.14-mcore-cuda, 2.14-mpi
NetBeans is a free, open source IDE that allows you to quickly and easily develop desktop, mobile and web applications with Java, HTML5, PHP, C/C++ and more.
versions available: 12.2
NetLogo is a programmable modeling environment for simulating natural and social phenomena. NetLogo is particularly well suited for modeling complex systems developing over time.
versions available: 6.2.0
Nextstrain is an open-source project to harness the scientific and public health potential of pathogen genome data. We provide a continually-updated view of publicly available data alongside powerful analytic and visualization tools for use by the community. Our goal is to aid epidemiological understanding and improve outbreak response.
versions available: 3.0.3
Nextflow is an incredibly powerful and flexible workflow language. Nextflow lets you run nf-core pipelines on virtually any computing environment. nf-core pipelines adhere to strict guidelines – if one works, they all will.
versions available: 2.1
versions available: 14.17.3
Open Babel is a chemical toolbox designed to speak the many languages of chemical data. It’s an open, collaborative project allowing anyone to search, convert, analyze, or store data from molecular modeling, chemistry, solid-state materials, biochemistry, or related areas.
versions available: 3.1.1
OpenFOAM is the free, open source CFD software released and developed primarily by the OpenFOAM Foundation. OpenFOAM has an extensive range of features to solve anything from complex fluid flows involving chemical reactions, turbulence and heat transfer, to acoustics, solid mechanics and electromagnetics. We offer versions from both OpenCFD Ltd and the OpenFOAM Foundation.
versions available: 2012, 7, 8, 9
ORCA is an ab initio, DFT, and semi-empirical SCF-MO package. The ORCA Input Library contains a collection of ORCA inputs that show you how to easily do various tasks using the many methods and approximations in the ORCA quantum chemistry code.
versions available: 4.2.1
The Oyster River Protocol for (eukaryotic) transcriptome assembly is an actively developed, evidence-based method for optimizing transcriptome assembly. The protocol assembles the transcriptome using a multi-kmer, multi-assembler approach, then merges those assemblies into one final assembly. Version 2.3.3u1 is an update to ORP 2.3.3 based on Anaconda3-2023.03-1, along with the following updated components: trinity 2.15.1, salmon 1.10.1, spades 3.15.5, busco 5.1.3, rcorrector 1.0.5, samtools 1.17, cd-hit 4.8.1, diamond 2.1.6.
versions available: 2.3.3, 2.3.3u1
OrthoFinder is a fast, accurate and comprehensive platform for comparative genomics. It finds orthogroups and orthologs, infers rooted gene trees for all orthogroups and identifies all of the gene duplication events in those gene trees.
versions available: 2.2.7, 2.4.0, 2.5.4
PacBio develops comprehensive solutions for scientists that propel the field of genomics, improve science and research, and create positive impact globally. This module includes several of the PacBio open source tools, including BLASR, CCS, ConsensusCore, GenomicConsensus, IsoSeq3, Lima, pbalign, pbcommand, pbcore, pbcoretools, pbbam, bam2fastx, pb-dazzler, PB Assembly, and FALCON.
versions available: 2021.4
pairtools is a simple and fast command-line framework to process sequencing data from a Hi-C experiment. pairtools processes paired-end sequence alignments and performs the following operations: detect ligation junctions (a.k.a. Hi-C pairs) in aligned paired-end sequences of Hi-C DNA molecules; sort .pairs files for downstream analyses; detect, tag and remove PCR/optical duplicates; generate extensive statistics of Hi-C datasets; select Hi-C pairs given flexibly defined criteria; restore .sam alignments from Hi-C pairs.
versions available: 0.3.0
Pangolin (Phylogenetic Assignment of Named Global Outbreak Lineages) was developed to implement the dynamic nomenclature of SARS-CoV-2 lineages, known as the Pango nomenclature. It allows a user to assign the most likely lineage (Pango lineage) to SARS-CoV-2 query sequences.
versions available: 4
GNU parallel is a shell tool for executing jobs in parallel using one or more computers. A job can be a single command or a small script that has to be run for each of the lines in the input.
versions available: 20210222
ParaView is an open-source, multi-platform data analysis and visualization application. ParaView users can quickly build visualizations to analyze their data using qualitative and quantitative techniques.
versions available: 5.11.0, 5.11.0-mpi, 5.7.1, 5.7.1-mpi
The OceanParcels project develops Parcels (Probably A Really Computationally Efficient Lagrangian Simulator), a set of Python classes and methods to create customisable particle tracking simulations using output from Ocean Circulation models. Parcels can be used to track passive and active particulates such as water, plankton, plastic and fish.
versions available: 2.1.5
Software for Long-Read Sequencing Data from PacBio. PBSuite is made up of 2 tools: PBJelly and PBHoney. PBJelly is a highly automated pipeline that aligns long sequencing reads (such as PacBio RS reads or long 454 reads in fasta format) to high-confidence draft assemblies. PBHoney is an implementation of two variant-identification approaches designed to exploit the high mappability of long reads (i.e., greater than 10,000 bp).
versions available: 15.8.24
Picard is a set of Java command line tools for manipulating high-throughput sequencing (HTS) data and formats. Picard is implemented using the HTSJDK Java library to support accessing file formats that are commonly used for high-throughput sequencing data such as SAM and VCF.
versions available: 2.18.29, 2.25.4, 2.9.2
PLINK is a free, open-source whole genome association analysis toolset, designed to perform a range of basic, large-scale analyses in a computationally efficient manner.
versions available: 1.90, 2.00
POY is a phylogenetic analysis program that supports multiple kinds of data (e.g. morphology, nucleotides, genes and gene regions, chromosomes, whole genomes, etc). POY is distinctive in that it can perform true alignment and phylogeny inference (i.e. input sequences need not be prealigned).
versions available: 5.1.2
Prokka performs rapid prokaryotic genome annotation. Whole genome annotation is the process of identifying features of interest in a set of genomic DNA sequences, and labelling them with useful information. Prokka is a software tool to annotate bacterial, archaeal and viral genomes quickly and produce standards-compliant output files.
versions available: 1.14.6
This module includes Splign. ProSplign is a global alignment tool developed by Dr. Boris Kiryutin. It produces accurate spliced alignments and computes alignments of distantly related proteins with low similarity. Extra effort is taken to locate frameshift positions. Splign is a utility for computing cDNA-to-genomic, or spliced, sequence alignments, which uses a compartmentization algorithm to identify possible gene duplications, and a refined alignment algorithm recognizing introns and splice signals.
versions available: 2.0.0
Implementation of the Pairwise Sequentially Markovian Coalescent (PSMC) model
versions available: 0.6.5
A simple pipeline for reassigning primary contigs that should be labeled as haplotigs. Purge Haplotigs helps with curating heterozygous diploid genome assemblies from third-gen long-read sequencing.
versions available: 1.1.2
PyTorch is a Python package that provides two high-level features: Tensor computation (like NumPy) with strong GPU acceleration, and Deep Neural Networks built on a tape-based autodiff system. Built with CUDA Toolkit 11.3, for GPUs with Compute Capabilities of 3.7, 6.1, 7.0, 7.5, 8.0, 8.6.
versions available: 1.10.2-cuda11.3, 1.12.0-cuda11.3, 1.9.0-cuda11.2
Quantum Espresso (QE) is an integrated suite of Open-Source computer codes for electronic-structure calculations and materials modeling at the nanoscale. It is based on density-functional theory, plane waves, and pseudopotentials.
versions available: 7.1-intel-mpi
QIIME 2 is a powerful, extensible, and decentralized microbiome analysis package with a focus on data and analysis transparency. QIIME 2 enables researchers to start an analysis with raw DNA sequence data and finish with publication-quality figures and statistical results.
versions available: 2020.2, 2021.2
QUAST (Quality Assessment Tool for Genome Assemblies) evaluates genome/metagenome assemblies by computing various metrics. The current QUAST toolkit includes the general QUAST tool for genome assemblies, MetaQUAST, the extension for metagenomic datasets, QUAST-LG, the extension for large genomes (e.g., mammalians), and Icarus, the interactive visualizer for these tools.
versions available: 5.0.2
R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Labs, by John Chambers and colleagues. R can be considered as a different implementation of S.
versions available: 3.6.0, 3.6.0-mpi, 4.0.3, 4.0.3-intel, 4.0.3-mpi, 4.1.3, 4.1.3-mpi, 4.2.2, 4.2.2-mpi
Racon is intended as a standalone consensus module to correct raw contigs generated by rapid assembly methods which do not include a consensus step. The goal of Racon is to generate genomic consensus which is of similar or better quality compared to the output generated by assembly methods which employ both error correction and consensus steps, while providing a speedup of several times compared to those methods.
versions available: 1.4.21
RagTag, the successor to RaGOO, is a command line tool for reference-guided genome assembly improvement. Currently, the two main features are misassembly correction and scaffolding. After correction and/or scaffolding, RagTag also provides utilities to update annotations or work with AGP files.
versions available: 1.1.1, 2.0.0
RAxML (Randomized Axelerated Maximum Likelihood) is a program for sequential and parallel Maximum Likelihood based inference of large phylogenetic trees.
versions available: 7.4.2, 7.4.2-mpi, 8.2.12, 8.2.12-mpi, 8.2.4, 8.2.4-mpi
REPdenovo is designed for constructing repeats directly from sequence reads. It is based on the idea of frequent k-mer assembly. REPdenovo provides many functionalities, and can generate much longer repeats than existing tools.
versions available: 0.0, 0.1.0
RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences. The output of the program is a detailed annotation of the repeats that are present in the query sequence as well as a modified version of the query sequence in which all the annotated repeats have been masked (default: replaced by Ns).
versions available: 4.0.8, 4.1.2
RepeatModeler is a de-novo repeat family identification and modeling package. At the heart of RepeatModeler are two de-novo repeat finding programs ( RECON and RepeatScout ) which employ complementary computational methods for identifying repeat element boundaries and family relationships from sequence data. RepeatModeler assists in automating the runs of RECON and RepeatScout given a genomic database and uses the output to build, refine and classify consensus models of putative interspersed repeats.
versions available: 1.0.11, 2.0.2
The purpose of the RepeatScout software is to identify repeat family sequences from genomes where hand-curated repeat databases (a la RepBase update) are not available.
versions available: 1.0.5, 1.0.6
RMBlast is a RepeatMasker compatible version of the standard NCBI blastn program. The primary difference between this distribution and the NCBI distribution is the addition of a new program ‘rmblastn’ for use with RepeatMasker and RepeatModeler.
versions available: 2.11.0, 2.6.0
RSEM is a software package for estimating gene and isoform expression levels from RNA-Seq data. The RSEM package provides a user-friendly interface, supports threads for parallel computation of the EM algorithm, single-end and paired-end read data, quality scores, variable-length reads and RSPD estimation. In addition, it provides posterior mean and 95% credibility interval estimates for expression levels.
versions available: 1.3.3
The RSeQC package provides a number of useful modules that can comprehensively evaluate high throughput sequence data, especially RNA-seq data. Some basic modules quickly inspect sequence quality, nucleotide composition bias, PCR bias and GC bias, while RNA-seq specific modules evaluate sequencing saturation, mapped reads distribution, coverage uniformity, strand specificity, transcript level RNA integrity, etc.
versions available: 4.0.0
RStudio is an integrated development environment (IDE) for R. It includes a console, syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management.
versions available: 1.3, 1.4
Samtools is a suite of programs for interacting with high-throughput sequencing data, allowing you to read/write/edit/index/view SAM/BAM/CRAM format. This module includes BCFtools, which is a set of utilities that manipulate variant calls in the Variant Call Format (VCF) and its binary counterpart BCF.
versions available: 1.10, 1.11, 1.3.1, 1.9
Selscan is a program to calculate EHH-based scans for positive selection in genomes. selscan currently implements EHH, iHS, XP-EHH, nSL, XP-nSL and iHH12. It should be run separately for each chromosome and population (or population pair for XP-EHH). selscan is ‘dumb’ with respect to ancestral/derived coding and simply expects haplotype data to be coded 0/1. Unstandardized iHS/nSL scores are thus reported as log(iHH1/iHH0) based on the coding you have provided.
versions available: 1.3.0, 2.0.0
A cross-platform and ultrafast toolkit for FASTA/Q file manipulation in Golang
versions available: 0.11.0, 0.16.1
Seqtk is a fast and lightweight tool for processing sequences in the FASTA or FASTQ format. It seamlessly parses both FASTA and FASTQ files which can also be optionally compressed by gzip.
versions available: 1.2, 1.3
ShortBRED (Short, Better Representative Extract Dataset) is a pipeline to take a set of protein sequences, reduce them to a set of unique identifying strings (‘markers’), and then search for these markers in metagenomic data and determine the presence and abundance of the protein families of interest.
versions available: 0.9.5
SIESTA, a first-principles materials simulation code using DFT, is both a method and its computer program implementation, to perform efficient electronic structure calculations and ab initio molecular dynamics simulations of molecules and solids.
versions available: 4.0.2, 4.1.5
The SIFT (sorting intolerant from tolerant) algorithm helps bridge the gap between mutations and phenotypic variations by predicting whether an amino acid substitution is deleterious. SIFT has been used in disease, mutation and genetic studies, and a protocol for its use has been previously published with Nature Protocols. This updated protocol describes SIFT 4G (SIFT for genomes), which is a faster version of SIFT that enables practical computations on reference genomes.
versions available: 2017, 2017-cuda
Singularity is a free, cross-platform and open-source computer program that performs operating-system-level virtualization also known as containerization. One of the main uses of Singularity is to bring containers and reproducibility to scientific computing and the high-performance computing (HPC) world. Singularity containers can be used to package entire scientific workflows, software and libraries, and even data.
versions available: 3.8.1
SIRIUS is a java-based software framework for the analysis of LC-MS/MS data of metabolites and other ‘small molecules of biological interest’.
versions available: 4.0.1
SLiM is an evolutionary simulation framework that combines a powerful engine for population genetic simulations with the capability of modeling arbitrarily complex evolutionary scenarios. Simulations are configured via the integrated Eidos scripting language, which allows interactive control over practically every aspect of the simulated evolutionary scenarios.
versions available: 3.7
The Snakemake workflow management system is a tool to create reproducible and scalable data analyses. Workflows are described via a human readable, Python based language. They can be seamlessly scaled to server, cluster, grid and cloud environments, without the need to modify the workflow definition.
versions available: 7.20.0
The CFSAN SNP Pipeline is a Python-based system for the production of SNP matrices from sequence data used in the phylogenetic analysis of pathogenic organisms sequenced from samples of interest to food safety. The SNP Pipeline was developed by the United States Food and Drug Administration, Center for Food Safety and Applied Nutrition.
versions available: 2.2.1
SPAdes (St. Petersburg genome assembler) is intended for both standard isolates and single-cell MDA bacteria assemblies.
versions available: 3.13.1, 3.14.1, 3.15.2
Apache Spark is a data processing framework that can quickly perform processing tasks on very large data sets, and can also distribute data processing tasks across multiple computers, either on its own or in tandem with other distributed computing tools. These two qualities are key to the worlds of big data and machine learning, which require the marshalling of massive computing power to crunch through large data stores.
versions available: 3.1.3, 3.4.1
SQANTI3 is the newest version of the SQANTI tool that merges features from SQANTI and SQANTI2, together with new additions. SQANTI3 will continue as an integrated development aiming to provide the best characterization possible for your new long read-defined transcriptome.
versions available: 1.6, 4.0
The Sequence Read Archive (SRA) stores raw sequence data from ‘next-generation’ sequencing technologies including Illumina, 454, IonTorrent, Complete Genomics, PacBio and Oxford Nanopore. In addition to raw sequence data, SRA now stores alignment information in the form of read placements on a reference sequence. Includes NCBI VDB and NGS SDK.
versions available: 2.10.5, 2.11.0
Stacks is a software pipeline for building loci from short-read sequences, such as those generated on the Illumina platform. Stacks was developed to work with restriction enzyme-based data, such as RAD-seq, for the purpose of building genetic maps and conducting population genomics and phylogeography.
versions available: 2.59
Spliced Transcripts Alignment to a Reference (STAR) is an ultrafast universal RNA-seq aligner, which was developed to align a large (>80 billion reads) ENCODE Transcriptome RNA-seq dataset.
versions available: 2.7.0c, 2.7.9a
STAR-CCM+ is much more than just a CFD solver: it is an entire engineering process for solving problems involving flow (of fluids or solids), heat transfer and stress.
versions available: 2020.2, 2021.3, 2022.1, 2023.2
STAR-Fusion is a component of the Trinity Cancer Transcriptome Analysis Toolkit (CTAT). STAR-Fusion uses the STAR aligner to identify candidate fusion transcripts supported by Illumina reads. STAR-Fusion further processes the output generated by the STAR aligner to map junction reads and spanning reads to a reference annotation set.
versions available: 1.11.1
StructRNAfinder is an automated pipeline that predicts and annotates RNA families in transcript or genome sequences. It not only displays the sequence/structural consensus alignments for each RNA family according to Rfam database, but also provides a taxonomic overview for each assigned functional RNA.
versions available: 17.03.29
Synteny and Rearrangement Identifier (SyRI). SyRI is a comprehensive tool for predicting genomic differences between related genomes using whole-genome assemblies (WGA). The assemblies are aligned using whole-genome alignment tools, and these alignments are then used as input to SyRI.
versions available: 1.4, 1.6.3
TensorFlow is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them.
versions available: 2.11.1-cuda11.7, 2.8.2-cuda11.2, 2.9.1-cuda11.2
NVIDIA TensorRT is a platform for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high-throughput for deep learning inference applications.
versions available: 126.96.36.199-cuda11.2, 188.8.131.52-cuda11.4
TopHat is a fast splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons.
versions available: 2.1.1
TransDecoder identifies candidate coding regions within transcript sequences, such as those generated by de novo RNA-Seq transcript assembly using Trinity, or constructed based on RNA-Seq alignments to the genome using Tophat and Cufflinks.
versions available: 5.5.0
TreeTime provides routines for ancestral sequence reconstruction and inference of molecular-clock phylogenies, i.e., a tree where all branches are scaled such that the positions of terminal nodes correspond to their sampling times and internal nodes are placed at the most likely time of divergence.
versions available: 0.9.4
Trimmomatic is a fast, multithreaded command line tool that can be used to trim and crop Illumina (FASTQ) data as well as to remove adapters. These adapters can pose a real problem depending on the library preparation and downstream application.
versions available: 0.38, 0.39
Trinity assembles transcript sequences from Illumina RNA-Seq data. Trinity combines three independent software modules: Inchworm, Chrysalis, and Butterfly, applied sequentially to process large volumes of RNA-seq reads.
versions available: 2.13.0, 2.14.0, 2.8.5
Trinotate is a suite for the functional annotation of transcriptomes, particularly de novo assembled transcriptomes. It uses a number of different well referenced methods for functional annotation, including homology search against sequence databases (BLAST+/SwissProt), protein domain identification (HMMER/PFAM), and comparison to currently curated annotation databases (like eggNOG, and Gene Ontology terms).
versions available: 3.2.2
Trycycler is a tool that takes as input multiple separate long-read assemblies of the same genome (e.g. from different assemblers or different read subsets) and produces a consensus long-read assembly.
versions available: 0.5.4
VCFtools is a program package designed for working with VCF files, such as those generated by the 1000 Genomes Project. The aim of VCFtools is to provide easily accessible methods for working with complex genetic variation data in the form of VCF files.
versions available: 0.1.16
The vdb program is designed to query the SARS-CoV-2 mutational landscape. It runs as a command shell in a terminal, and it allows customized searches for mutation patterns over the entire SARS-CoV-2 genome dataset or subsets thereof. These pattern searches can be for spike protein mutations or nucleotide mutations over the whole genome.
versions available: 2.7
Velvet is a sequence assembler for very short reads
versions available: 1.2.10
The ViennaRNA Package consists of a C code library and several stand-alone programs for the prediction and comparison of RNA secondary structures.
versions available: 2.4.13
VisIt is an Open Source, interactive, scalable, visualization, animation and analysis tool. Users can interactively visualize and analyze data ranging in scale from small (<10 core) desktop-sized projects to large (>10,000 core) leadership-class computing facility simulation campaigns.
versions available: 3.2.0
VMD is a molecular visualization program for displaying, animating, and analyzing large biomolecular systems using 3-D graphics and built-in scripting. (1.9.3 x86_64 64-bit, CUDA 8.0, SSE and AVX2, OpenGL)
versions available: 1.9.3-cuda8-opengl, 1.9.3-text
The Visualization Toolkit (VTK) is an open-source, freely available software system for 3D computer graphics, modeling, image processing, volume rendering, scientific visualization, and information visualization.
versions available: 8.2.0, 8.2.0-mpi
Wengan is a new, accurate, and ultra-fast genome assembler that, unlike most of the current long-reads assemblers, avoids entirely the all-vs-all read comparison. The key idea behind Wengan is that long-read alignments can be inferred by building paths on a sequence graph.
versions available: 0.2
Wise2 is a package focused on comparisons of biopolymers, commonly DNA sequence and protein sequence. Wise2 is now a rather stately bioinformatics package that has been around for a while. Its key programs are genewise, a program for aligning proteins or protein HMMs to DNA, and dynamite, a rather cranky ‘macro language’ which automates the production of dynamic programming.
versions available: 2.4.1
The Weather Research and Forecasting (WRF) Model is a next-generation mesoscale numerical weather prediction system designed to serve both atmospheric research and operational forecasting needs.
versions available: 4.0.1-intel, 4.0.1-intel-mpi, 4.3, 4.3-mpi
wtdbg is a fuzzy Bruijn graph (FBG) approach to long noisy reads assembly. wtdbg is designed to assemble huge genomes in very limited time; it requires a PowerPC with multiple cores and very big RAM (1Tb+). wtdbg can assemble a 100X human PacBio dataset within one day.
versions available: 2.3, 2.5
Yade is an extensible open-source framework for discrete numerical models, focused on the Discrete Element Method. The computation parts are written in C++ using a flexible object model, allowing independent implementation of new algorithms and interfaces. Python is used for rapid and concise scene construction, simulation control, postprocessing and debugging.
versions available: 2020.01a, 2021.01a
Compilers / Interpreters
Anaconda (python 2.7-based) is the world’s most popular Python data science platform. Anaconda, Inc. continues to lead open source projects like Anaconda, NumPy and SciPy that form the foundation of modern data science. Load this module for CPU ONLY (NON-GPU) compute jobs.
versions available: 2019.10
Anaconda (python 3.8-based) is the world’s most popular Python data science platform. Anaconda, Inc. continues to lead open source projects like Anaconda, NumPy and SciPy that form the foundation of modern data science. Load this module for CPU ONLY (NON-GPU) compute jobs.
versions available: 2020.11, 2022.10
Bazel is Google’s own build tool. Bazel has built-in support for building both client and server software, and also provides an extensible framework that you can use to develop your own build rules.
versions available: 3.1.0, 4.2.1, 5.0.0
The NVIDIA CUDA Toolkit provides a development environment for creating high performance GPU-accelerated applications. With the CUDA Toolkit, you can develop, optimize and deploy your applications on GPU-accelerated embedded systems, desktop workstations, enterprise data centers, cloud-based platforms and HPC supercomputers.
versions available: 10.2, 11.2, 11.4
DMD is the reference compiler for the D programming language. The D programming language has been said to be ‘what C++ wanted to be,’ which is a better C. D is developed with system level programming in mind, but brings to the table modern language design with a simple C-like syntax. For these reasons D makes for a good language choice for both performance code and application development.
versions available: 2.103.1
The GNU Compiler Collection includes front ends for C, C++, Objective-C, and Fortran, as well as libraries for these languages (libstdc++, libgcj,…).
versions available: 10.3.0, 11.2.0
Go is expressive, concise, clean, and efficient. Its concurrency mechanisms make it easy to write programs that get the most out of multicore and networked machines, while its novel type system enables flexible and modular program construction. Go compiles quickly to machine code yet has the convenience of garbage collection and the power of run-time reflection. It’s a fast, statically typed, compiled language that feels like a dynamically typed, interpreted language.
versions available: 1.14.2, 1.16.4
Haskell is a general-purpose, statically-typed, purely functional programming language with type inference and lazy evaluation. Designed for teaching, research and industrial applications, Haskell has pioneered a number of programming language features such as type classes, which enable type-safe operator overloading, and monadic IO. Haskell’s main implementation is the Glasgow Haskell Compiler (GHC). It is named after logician Haskell Curry.
versions available: 9.2.7, 9.6.1
The NVIDIA HPC Software Development Kit (SDK) includes the proven compilers, libraries and software tools essential to maximizing developer productivity and the performance and portability of HPC applications. The NVIDIA HPC SDK C, C++, and Fortran compilers support GPU acceleration of HPC modeling and simulation applications with standard C++ and Fortran, OpenACC directives, and CUDA. GPU-accelerated math libraries maximize performance on common HPC algorithms, and optimized communications libraries enable standards-based multi-GPU and scalable systems programming.
versions available: 21.3, 21.3-mpi
Julia is a high-level, high-performance dynamic programming language for numerical computing. It provides a sophisticated compiler, distributed parallel execution, numerical accuracy, and an extensive mathematical function library.
versions available: 1.6.0
Lua is a powerful, efficient, lightweight, embeddable scripting language. It supports procedural programming, object-oriented programming, functional programming, data-driven programming, and data description.
versions available: 5.3.5, 5.4.2
Mamba is a reimplementation of the conda package manager in C++, which uses libsolv for much faster dependency solving and allows parallel downloading of repository data and package files using multi-threading. Mamba utilizes the same command line parser, package installation and deinstallation code and transaction verification routines as conda to stay as compatible as possible.
versions available: 4.14
The Netwide Assembler, NASM, is an 80x86 and x86-64 assembler designed for portability and modularity. It supports a range of object file formats, including Linux and *BSD a.out, ELF, COFF, Mach-O, 16-bit and 32-bit OBJ (OMF) format, Win32 and Win64.
versions available: 2.14
OpenJDK (Open Java Development Kit) is a free and open-source implementation of the Java Platform, Standard Edition (Java SE). It is the result of an effort Sun Microsystems began in 2006. The implementation is licensed under the GNU General Public License (GNU GPL) version 2.
versions available: 11, 13, 15
PowerShell is a cross-platform task automation solution made up of a command-line shell, a scripting language, and a configuration management framework. PowerShell is a modern command shell that includes the best features of other popular shells. Unlike most shells that only accept and return text, PowerShell accepts and returns .NET objects.
versions available: 7.3.1
The name Scala is a contraction of ‘Scalable Language’. Scala is a pure object-oriented language: conceptually, every value is an object and every operation is a method call. The language supports advanced component architectures through classes and traits.
versions available: 2.12.13, 2.13.5
Swift is a general-purpose programming language built using a modern approach to safety, performance, and software design patterns. The goal of the Swift project is to create the best available language for uses ranging from systems programming, to mobile and desktop apps, scaling up to cloud services.
versions available: 5.6.1
YASM, an assembler and disassembler for the Intel x86 architecture, is a complete rewrite of the NASM assembler. YASM currently supports the x86 and AMD64 instruction sets, accepts NASM and GAS assembler syntaxes, outputs binary, ELF32, ELF64, 32 and 64-bit Mach-O, RDOFF2, COFF, Win32, and Win64 object formats, and generates source debugging information in STABS, DWARF 2, and CodeView 8 formats.
versions available: 1.3.0
Boost is a set of libraries for the C++ programming language that provide support for tasks and structures such as linear algebra, pseudorandom number generation, multithreading, image processing, regular expressions, and unit testing.
versions available: 1.66.0, 1.66.0-mpi, 1.76.0, 1.76.0-mpi
Climate Data Operators (CDO) is a collection of command-line operators for manipulating and analysing climate and numerical weather prediction (NWP) model data. Supported data formats are GRIB 1/2, netCDF 3/4, SERVICE, EXTRA and IEG. More than 600 operators are available.
versions available: 2.0.5, 2.0.5-intel
The NVIDIA CUDA Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks. cuDNN provides highly tuned implementations for standard routines such as forward and backward convolution, pooling, normalization, and activation layers.
versions available: 8.1.0-cuda11.2, 8.2.4-cuda11.4
Eigen is a C++ template library for linear algebra: matrices, vectors, numerical solvers, and related algorithms.
versions available: 3.4.0
FFTW is a C subroutine library for computing the discrete Fourier transform (DFT) in one or more dimensions, of arbitrary input size, and of both real and complex data (as well as of even/odd data, i.e. the discrete cosine/sine transforms or DCT/DST).
versions available: 3.3.10, 3.3.10-mpi
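For readers unfamiliar with the transform FFTW computes, the following is a minimal sketch of the forward DFT definition in plain Python. It is not FFTW's algorithm or API: FFTW uses fast O(N log N) methods, while this direct O(N^2) sum only illustrates the definition (unnormalized, negative sign in the exponent). The function name `naive_dft` is hypothetical and chosen for illustration.

```python
import cmath


def naive_dft(samples):
    """Direct evaluation of X[k] = sum_n x[n] * exp(-2*pi*i*k*n/N).

    Illustrative only: a quadratic-time reference, not how FFTW works.
    """
    n_points = len(samples)
    spectrum = []
    for k in range(n_points):
        total = 0j
        for n, x in enumerate(samples):
            angle = -2.0 * cmath.pi * k * n / n_points
            total += x * cmath.exp(1j * angle)
        spectrum.append(total)
    return spectrum


if __name__ == "__main__":
    # A constant signal puts all of its energy in the k = 0 bin.
    for value in naive_dft([1.0, 1.0, 1.0, 1.0]):
        print(f"{abs(value):.3f}")
```

A direct sum like this is mainly useful for checking small results by hand; for real workloads, the FFTW library listed above is the appropriate tool.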
Various Google codes, including gflags v2.2.2 (Google’s command-line flags library); glog v0.4.0 (C++ implementation of the Google logging module); LevelDB v1.23 (a fast key-value storage library); and Protocol Buffers v3.15.8 (Google’s language-neutral, platform-neutral, extensible mechanism for serializing structured data).
versions available: 2021
HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data.
versions available: 1.10.5, 1.10.5-intel, 1.10.5-mpi, 1.10.7, 1.10.7-intel, 1.10.7-intel-mpi, 1.10.7-mpi
Intel runtime libraries
versions available: 2019, 2020
NetCDF is a set of software libraries and self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data.
versions available: 4.6.3, 4.6.3-mpi, 4.7.2, 4.7.2-mpi, 4.8.1, 4.8.1-intel, 4.8.1-intel-mpi, 4.8.1-mpi
OpenBLAS is an optimized BLAS (Basic Linear Algebra Subprograms) library based on the GotoBLAS2 1.13 BSD version.
versions available: 0.3.13
ROOT is a modular scientific software framework. It provides all the functionalities needed to deal with big data processing, statistical analysis, visualisation and storage.
versions available: 6.12.04, 6.22.08
SuiteSparse is a suite of sparse matrix algorithms, including GraphBLAS, Mongoose, ssget, UMFPACK, CHOLMOD, SPQR, KLU and BTF, CSparse and CXSparse, spqr_rank, Factorize, SSMULT, SFMULT, and ordering methods (AMD, CAMD, COLAMD, and CCOLAMD); AMD and COLAMD appear in MATLAB.
versions available: 5.7.2, 5.9.0
MPI (Message Passing Interface)
MPICH is a high-performance and widely portable implementation of the Message Passing Interface (MPI) standard (MPI-1, MPI-2 and MPI-3).
versions available: 3.2.1, 3.3.1, 3.4.1
The Open MPI Project is an open source implementation of the Message Passing Interface (MPI) standard that is developed and maintained by a consortium of academic, research, and industry partners.
versions available: 3.1.6, 3.1.6-intel, 4.0.3, 4.0.3-intel, 4.1.0, 4.1.0-intel