Annotating metagenome sequences

Figure 3.7: Basic template to illustrate sequence annotation. A StringGenerator is used to generate a label for the attribute type (CreateMGXAttributeType), which also requires job context information. The attribute type is required to create attributes, thus the node is connected to the CreateMGXAttribute node. Finally, the annotation can be saved to the project database (AnnotateAttribute node).
Image annotate_templ

Annotation of metagenome sequences requires an “attribute type” and an “attribute”. As an example, we will implement a pipeline that analyzes the GC content of metagenome sequences. We use a StringGenerator node configured to generate the string “GC” to create a label for the attribute type. As GC content is expressed as a number, we configure the CreateMGXAttributeType node to emit an “attribute type” that is both basic (i.e. not hierarchical) and numerical (Figure 3.8).

Figure 3.8: Within the configuration dialog for the CreateMGXAttributeType node, structure and type of the generated attribute values are defined.
Image createattrtype

Figure 3.9: Incomplete example; extending Figure 3.7, the ReadCSF node provides the metagenome sequences to be annotated. No actual analysis is specified yet.
Image annotate_templ2

In a second step, we use the ReadCSF node to obtain access to the individual metagenome sequences; as MGX annotates sequences individually, a connection between ReadCSF and AnnotateAttribute is required (Figure 3.9). Subsequently, we implement the actual analysis, which is provided by the GCContent node. It processes all sequences and emits the corresponding GC content for each of them. To convert these values to “attributes”, an “attribute type” is required for each value; therefore, a Repeat node is inserted between nodes 5 and 7 (Figure 3.10).
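The data flow of this step can be sketched outside the workflow editor. The following Python snippet is only an illustration, not the implementation of the GCContent or Repeat nodes: the function name `gc_content`, the example reads, and the choice to report GC content as a fraction between 0 and 1 are assumptions made for this sketch.

```python
from itertools import repeat


def gc_content(sequence: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence.

    Assumption for this sketch: the value is a fraction in [0, 1];
    an empty sequence yields 0.0.
    """
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


# Hypothetical example reads; in the workflow they are provided by ReadCSF.
sequences = ["ACGT", "GGCC", "ATAT"]

# A single attribute type label is repeated once per computed value,
# which mirrors the role of the Repeat node in the workflow.
attribute_type = "GC"
annotations = list(zip(repeat(attribute_type), map(gc_content, sequences)))
```

Each element of `annotations` pairs the same attribute type with the value computed for one sequence, which is why the single attribute type has to be repeated.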

Figure 3.10: Step 2: The GCContent node represents the actual analysis step; it is used to determine the GC content of a DNA sequence, which will then be converted into an “attribute”. Since an “attribute type” is required for each “attribute”, a Repeat node is inserted between nodes 5 and 7.
Image annotate_templ3

Finally, as an annotation always refers to a region of a sequence, we need to generate the corresponding start and end coordinates. Since GC content refers to the full sequence, we can use a ULongGenerator node configured to emit 0 (MGX uses 0-based coordinates) to generate the start coordinate; this node needs to be connected to a Repeat node to generate a series of 0s, one for each sequence.
The end coordinate is derived from the sequence's length with 1 subtracted, obtained through the GetLength and MinusOne nodes (Figure 3.11).
The GetMGXJob node will retain its red border due to missing configuration; this, however, can be ignored, as appropriate configuration will be provided by the MGX framework automatically.
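The coordinate arithmetic for a full-length annotation can be illustrated with a short Python sketch. The function name `full_length_coordinates` is hypothetical and only mirrors what the ULongGenerator, GetLength and MinusOne nodes compute together; it assumes a non-empty sequence.

```python
from typing import Tuple


def full_length_coordinates(sequence: str) -> Tuple[int, int]:
    """Return 0-based, inclusive (start, end) coordinates covering the whole sequence.

    Assumes a non-empty sequence; an empty one would yield an end of -1.
    """
    start = 0                  # constant start, as emitted by ULongGenerator
    end = len(sequence) - 1    # GetLength followed by MinusOne
    return start, end


coords = full_length_coordinates("ACGT")
```

For a sequence of four bases, the sketch yields the start coordinate 0 and the end coordinate 3.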

Figure 3.11: Completing the workflow: the ULongGenerator and GetLength nodes are added to specify coordinates for the subregion of the DNA sequence described by the “attribute”; the start coordinate is simply repeated, while 1 is subtracted from the sequence length because coordinates are 0-based.
Image annotate_templ4

Sebastian Jaenicke, 2020-04-28