Human Gut Metagenomics Functional & Taxonomic Profiling

Note

This tutorial is also available as a slide presentation

In this tutorial, we will analyse a small dataset of human gut microbial metagenomes.

  1. Download the toy dataset

First download all the tutorial data:

ngless --download-demo gut-short

This will download and expand the data to a directory called gut-short.

This is a toy dataset. It is based on real data, but the samples were trimmed so that they contains only 250k paired-end reads.

The dataset is organized in classical MOCAT style. Ngless does not require this structure, but this tutorial also demonstrates how to upgrade from your existing MOCAT-based projects.:

$ find
./igc.demo.short
./SAMN05615097.short
./SAMN05615097.short/SRR4052022.single.fq.gz
./SAMN05615097.short/SRR4052022.pair.2.fq.gz
./SAMN05615097.short/SRR4052022.pair.1.fq.gz
./SAMN05615096.short
./SAMN05615096.short/SRR4052021.pair.1.fq.gz
./SAMN05615096.short/SRR4052021.single.fq.gz
./SAMN05615096.short/SRR4052021.pair.2.fq.gz
./SAMN05615098.short
./SAMN05615098.short/SRR4052033.pair.2.fq.gz
./SAMN05615098.short/SRR4052033.pair.1.fq.gz
./SAMN05615098.short/SRR4052033.single.fq.gz
./process.ngl

The whole script we will be using is there as well (process.ngl), so you can immediately run it with:

ngless process.ngl

The rest of this tutorial is an explanation of the steps in this script.

  1. Preliminary imports

To run ngless, we need write a script. We start with a few imports:

ngless "0.0"
import "parallel" version "0.0"
import "mocat" version "0.0"
import "motus" version "0.1"
import "igc" version "0.0"

These will all be used in the tutorial.

  1. Parallelization

We are going to process each sample separately. For this, we use the lock1 function from the parallel module (which we imported before):

samples = readlines('igc.demo.short')
sample = lock1(samples)

The readlines function reads a file and returns all lines. In this case, we are reading the tara.demo.short file, which contains the three samples (SAMEA2621229.sampled, SAMEA2621155.sampled, and SAMEA2621033.sampled).

lock1() is a slightly more complex function. It takes a list and locks one of the elements and returns it. It always chooses an element which has not been locked before, so you each time you run _ngless_, you will get a different sample.

  1. Preprocessing

First, we load the data (the FastQ files):

input = load_mocat_sample(sample)

And, now, we preprocess the data:

preprocess(input, keep_singles=False) using |read|:
    read = substrim(read, min_quality=25)
    if len(read) < 45:
        discard
  1. Filter against the human genome

We want to remove reads which map to the human genome, so we first map the reads to the human genome:

mapped = map(input, reference='hg19')

hg19 is a built-in reference and the genome will be automatically download it the first time you use it. Now, we discard the matched reads:

mapped = select(mapped) using |mr|:
    mr = mr.filter(min_match_size=45, min_identity_pc=90, action={unmatch})
    if mr.flag({mapped}):
        discard

The mapped object is a set of mappedreads (i.e., the same information that is saved in a SAM/BAM file). we use the as_reads function to get back to reads:

input = as_reads(mapped)

Now, we will use the input object which has been filtered of human reads.

  1. Profiling using the IGC

Note

This section of the tutorial uses the Integrated Gene Catalogue and requires ca. 15GiB of RAM. Skip to step 9 if your machine does not have this much memory.

After preprocessing, we map the reads to the integrated gene catalog:

mapped = map(input, reference='igc', mode_all=True)

The line above is the reason we needed to import the igc module: it made the igc reference available.

Now, we need to count the results. This function takes the result of the above and aggregates it different ways. In this case, we want to aggregate by KEGG KOs, and eggNOG OGs:

counts = count(mapped,
            features=['KEGG_ko', 'eggNOG_OG'],
            normalization={scaled})
  1. Aggregate the results

We have done all this computation, now we need to save it somewhere. We will use the collect() function to aggregate across all the samples processed:

collect(counts,
        current=sample,
        allneeded=samples,
        ofile='igc.profiles.txt')
  1. Taxonomic profling using mOTUS

Map the samples against the motus reference (this reference comes with the motus module we imported earlier):

mapped = map(input, reference='motus', mode_all=True)

Now call the built-in count function to summarize your reads at gene level:

counted = count(mapped, features=['gene'], multiple={dist1})

To get the final taconomic profile, we call the motus function, which takes the gene count table and performs the motus quantification. The result of this call is another table, which we can concatenate with collect():

motus_table = motus(counted)
collect(motus_table,
        current=sample,
        allneeded=samples,
        ofile='motus-counts.txt')
  1. Run it!

This is our script. We save it to a file (process.ngl in this example) and run it from the command line:

$ ngless process.ngl

You also need to run it once for each sample. However, this can be done in parallel, taking advantage of high performance computing clusters.

Full script

Here is the full script:

ngless "0.0"
import "parallel" version "0.0"
import "mocat" version "0.0"
import "motus" version "0.1"
import "igc" version "0.0"

samples = readlines('igc.demo.short')
sample = lock1(samples)

input = load_mocat_sample(sample)

preprocess(input, keep_singles=False) using |read|:
    read = substrim(read, min_quality=25)
    if len(read) < 45:
        discard

mapped = map(input, reference='hg19')

mapped = select(mapped) using |mr|:
    mr = mr.filter(min_match_size=45, min_identity_pc=90, action={unmatch})
    if mr.flag({mapped}):
        discard

input = as_reads(mapped)


mapped = map(input, reference='igc', mode_all=True)

counts = count(mapped,
            features=['KEGG_ko', 'eggNOG_OG'],
            normalization={scaled})

collect(counts,
        current=sample,
        allneeded=samples,
        ofile='igc.profiles.txt')

mapped = map(input, reference='motus', mode_all=True)

counted = count(mapped, features=['gene'], multiple={dist1})

motus_table = motus(counted)
collect(motus_table,
        current=sample,
        allneeded=samples,
        ofile='motus-counts.txt')