Human Gut Metagenomics Functional & Taxonomic Profiling¶

Note

If you are starting out with NGLess for metagenomics profiling, consider using the predefined pipeline collection, NG-meta-profiler. This tutorial is based on deconstructing a pipeline very similar to those.

In this tutorial, we will analyse a small dataset of human gut microbial metagenomes.

Note

This tutorial is also available as a slide presentation

Download the toy dataset

First download all the tutorial data:

ngless --download-demo gut-short

This will download and expand the data to a directory called gut-short.

This is a toy dataset. It is based on real data, but the samples were trimmed so that they contains only 250k paired-end reads.

The dataset is organized in classical MOCAT style, with one sample per directory. NGLess does not require this structure, but this tutorial also demonstrates how to upgrade from your existing MOCAT-based projects.:

$ find
./igc.demo.short
./SAMN05615097.short
./SAMN05615097.short/SRR4052022.single.fq.gz
./SAMN05615097.short/SRR4052022.pair.2.fq.gz
./SAMN05615097.short/SRR4052022.pair.1.fq.gz
./SAMN05615096.short
./SAMN05615096.short/SRR4052021.pair.1.fq.gz
./SAMN05615096.short/SRR4052021.single.fq.gz
./SAMN05615096.short/SRR4052021.pair.2.fq.gz
./SAMN05615098.short
./SAMN05615098.short/SRR4052033.pair.2.fq.gz
./SAMN05615098.short/SRR4052033.pair.1.fq.gz
./SAMN05615098.short/SRR4052033.single.fq.gz
./process.ngl

The whole script we will be using is there as well (process.ngl), so you can immediately run it with:

ngless process.ngl

The rest of this tutorial is an explanation of the steps in this script.

Preliminary imports

To run ngless, we need write a script. We start with a few imports:

ngless "1.4"
import "parallel" version "1.0"
import "motus" version "0.1"
import "igc" version "0.0"

These will all be used in the tutorial.

Parallelization

We are going to process each sample separately. For this, we use the lock1 function from the parallel module (which we imported before):

samples = readlines('igc.demo.short')
sample = lock1(samples)

The readlines function reads a file and returns all lines. In this case, we are reading the tara.demo.short file, which contains the three samples (SAMEA2621229.sampled, SAMEA2621155.sampled, and SAMEA2621033.sampled).

lock1() is a slightly more complex function. It takes a list and locks one of the elements and returns it. It always chooses an element which has not been locked before, so you each time you run NGLess, you will get a different sample.

Note

When you are using lock1() you will need to run NGLess multiple times. But you can run multiple instances in parallel.

Preprocessing

First, we load the data (the FastQ files):

input = load_fastq_directory(sample)

And, now, we preprocess the data:

input = preprocess(input, keep_singles=False) using |read|:
    read = substrim(read, min_quality=25)
    if len(read) < 45:
        discard

Filter against the human genome

We want to remove reads which map to the human genome, so we first map the reads to the human genome:

mapped = map(input, reference='hg19')

hg19 is a built-in reference and the genome will be automatically download it the first time you use it. Now, we discard the matched reads:

mapped = select(mapped) using |mr|:
    mr = mr.filter(min_match_size=45, min_identity_pc=90, action={unmatch})
    if mr.flag({mapped}):
        discard

The mapped object is a set of mappedreads (i.e., the same information that is saved in a SAM/BAM file). we use the as_reads function to get back to reads:

input = as_reads(mapped)

Now, we will use the input object which has been filtered of human reads.

Profiling using the IGC

Note

This section of the tutorial uses the Integrated Gene Catalogue and requires ca. 15GiB of RAM. Skip to step 9 if your machine does not have this much memory.

After preprocessing, we map the reads to the integrated gene catalog:

mapped = map(input, reference='igc', mode_all=True)

The line above is the reason we needed to import the igc module: it made the igc reference available.

Now, we need to count the results. This function takes the result of the above and aggregates it different ways. In this case, we want to aggregate by KEGG KOs, and eggNOG OGs:

counts = count(mapped,
            features=['KEGG_ko', 'eggNOG_OG'],
            normalization={scaled})

Aggregate the results

We have done all this computation, now we need to save it somewhere. We will use the collect() function to aggregate across all the samples processed:

collect(counts,
        current=sample,
        allneeded=samples,
        ofile='igc.profiles.txt')

Taxonomic profling using mOTUS

Map the samples against the motus reference (this reference comes with the motus module we imported earlier):

mapped = map(input, reference='motus', mode_all=True)

Now call the built-in count function to summarize your reads at gene level:

counted = count(mapped, features=['gene'], multiple={dist1})

To get the final taconomic profile, we call the motus function, which takes the gene count table and performs the motus quantification. The result of this call is another table, which we can concatenate with collect():

motus_table = motus(counted)
collect(motus_table,
        current=sample,
        allneeded=samples,
        ofile='motus-counts.txt')

Run it!

This is our script. We save it to a file (process.ngl in this example) and run it from the command line:

$ ngless process.ngl

Note

You need to run this script once for each sample. However, this can be done in parallel, taking advantage of high performance computing clusters.

Full script¶

Here is the full script:

ngless "1.0"
import "parallel" version "0.6"
import "mocat" version "0.0"
import "motus" version "0.1"
import "igc" version "0.0"

samples = readlines('igc.demo.short')
sample = lock1(samples)

input = load_fastq_directory(sample)

input = preprocess(input, keep_singles=False) using |read|:
    read = substrim(read, min_quality=25)
    if len(read) < 45:
        discard

mapped = map(input, reference='hg19')

mapped = select(mapped) using |mr|:
    mr = mr.filter(min_match_size=45, min_identity_pc=90, action={unmatch})
    if mr.flag({mapped}):
        discard

input = as_reads(mapped)


mapped = map(input, reference='igc', mode_all=True)

counts = count(mapped,
            features=['KEGG_ko', 'eggNOG_OG'],
            normalization={scaled})

collect(counts,
        current=sample,
        allneeded=samples,
        ofile='igc.profiles.txt')

mapped = map(input, reference='motus', mode_all=True)

counted = count(mapped, features=['gene'], multiple={dist1})

motus_table = motus(counted)
collect(motus_table,
        current=sample,
        allneeded=samples,
        ofile='motus-counts.txt')