 |
KNIME Extensions for Next Generation Sequencing
|
We present here nodes and workflows used for processing next generation sequencing results. Most of these nodes are not specific to NGS data and may be useful in other circumstances as well.
The workflows presented here come with sample data and are pre-executed. Rather than only showing how to use a specific node (that is meant to be described in the node help), they solve a specific NGS-related problem.
Last but not least, there is no guarantee or promise of any kind associated with any of the information here. We are very happy to discuss anything described on these pages and also welcome contributions from other KNIME NGS users/developers.
Kind regards,
Bernd (baj)
Nodes
|
Name of the node |
Description |
|
FastQReader |
Reads a FastQ file into a table. One FASTQ entry (i.e. 4 lines) is translated into one row. This node uses BioJava. |
|
FastQWriter |
Writes FastQ data out to a file. This node uses BioJava. |
|
AdapterRemoval |
Very simple (not well optimized) routine to remove adapter sequences. |
|
CountSorted |
Counts occurrences within a sorted column. It is faster than the ValueCounter and useful for counting reads from a FASTQ file once they are sorted. It also uses a minimal amount of memory. |
|
GetRegions |
Identifies regions of interest (ROIs). An ROI is defined as a chromosomal region that has no gaps. This node also produces a string of concatenated values (i.e. counts). |
|
PositionStr2Position |
Converts a string containing a chromosome name and a location separated by "_" into the chromosome name and its position. Such a chromosome name-location string can be used to identify a location with a single column. |
|
RegionOverlap |
Identifies regions that overlap. This node is usually used within a sub-workflow that divides the data set per chromosome. The first input is retained. |
|
Seq2PosIncidents |
This node splits a sequence into one nucleotide per row. |
|
Bash |
Executes commands in bash or cmd.exe (see inline documentation) |
|
CmdwInput |
Similar to the Bash node, except that it takes an input table and executes the strings within that table. |
|
JoinSorted |
Creates a full outer join of two sorted tables. |
|
OneString |
Creates a single cell of type String, Integer, or Double. It is useful when executing workflows from the command line. |
|
Wait |
Does nothing other than synchronising executions. This can also be done using the Variable Ports of existing nodes |
|
BEDGraphWriter |
Writes out BED files. |
|
SAMReader |
Reads SAM or BAM files. |
|
PileupCounts |
Counts occurrences of, e.g., matched positions in the output of the mpileup command from samtools. |
|
TableSpecs |
Retrieves simple statistics for a table and its columns. Included are column type, index, and lower and upper bound (table 1), and the number of rows and columns (table 2). |
|
AdapterRemovalAdv |
Finds an "adapter sequence" within high-throughput sequencing data and removes it. It keeps the remaining fragments and information about what caused the removal. |
|
CollectionLinePlot |
|
|
GroupByLoopStart |
Starts an iteration over one column. Divides the data set based on the values of that column such that consecutive rows with the same value are in one table. Also sets the variable "GroupedBy" to the current value. |
|
NGSConcat |
Concatenates two tables with identical table specs. |
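To illustrate why counting within a sorted column can be fast and memory-efficient, here is a minimal Python sketch of the idea behind CountSorted. It is not the node's actual implementation; the function name `count_sorted` and the sample reads are made up for illustration. Because equal values sit next to each other in a sorted column, a single pass that keeps only the current run in memory is enough.

```python
from itertools import groupby


def count_sorted(values):
    """Yield (value, count) for each run of equal consecutive values.

    Assumes the input is already sorted (or at least grouped), so no
    lookup table of all distinct values has to be kept in memory.
    """
    for value, run in groupby(values):
        yield value, sum(1 for _ in run)


# Hypothetical sorted read sequences
reads = ["AAAC", "AAAC", "ACGT", "TTGA", "TTGA", "TTGA"]
counts = list(count_sorted(reads))
print(counts)
```

If the input is not sorted, the same value is reported once per run, which is why the column needs to be sorted first.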
Workflows
|
Name of the workflow |
Description |
|
FastQ-stats |
Descriptive statistics of Illumina results in FastQ format (usually before mapping). |
|
AdapterRemoval |
Simple workflow that reads in a FastQ file, applies the adapter removal node, and calculates the sequence lengths and some statistics on them. |
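The steps of the AdapterRemoval workflow can be sketched outside KNIME roughly as follows. This is only an illustration under stated assumptions, not what the nodes do internally: it assumes four-line FASTQ records, an adapter at the 3' end that is matched exactly, and a made-up adapter sequence and sample reads. The names `read_fastq` and `remove_adapter` are hypothetical.

```python
import statistics


def read_fastq(lines):
    """Yield (identifier, sequence, quality), one tuple per 4-line FASTQ entry."""
    lines = [line.rstrip("\n") for line in lines]
    for i in range(0, len(lines) - 3, 4):
        header, sequence, _plus, quality = lines[i:i + 4]
        yield header[1:], sequence, quality  # drop the leading "@"


def remove_adapter(sequence, quality, adapter):
    """Cut the read at the first exact occurrence of the adapter.

    Assumption: the adapter sits at the 3' end, so everything from the
    match onwards is discarded. Reads without a match are unchanged.
    """
    position = sequence.find(adapter)
    if position == -1:
        return sequence, quality
    return sequence[:position], quality[:position]


# Hypothetical adapter and sample data
ADAPTER = "AGATCGGAAG"
fastq_text = """@read1
ACGTACGTAGATCGGAAGAG
+
IIIIIIIIIIIIIIIIIIII
@read2
TTGACCAGTA
+
IIIIIIIIII
"""

trimmed = [
    remove_adapter(sequence, quality, ADAPTER)
    for _identifier, sequence, quality in read_fastq(fastq_text.splitlines())
]
lengths = [len(sequence) for sequence, _quality in trimmed]
print(lengths, statistics.mean(lengths))
```

The quality string is trimmed together with the sequence so that both keep the same length, which FASTQ requires.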
Workflows showing the use of the nodes
|
Name of the workflow |
Description |
|
FASTQReader |
A single node with data from NCBI/SRA (SRR001356, Illumina sequencing of mouse brain transcript fragment library). |
|
FASTQWriter |
Simple workflow that reads in a FastQ file, then reduces the sequence and quality string to the first position and writes out the result. |
|
Count sorted |
Simple workflow that reads in a FastQ file, sorts the data by the sequence, applies both the ValueCounter and the CountSorted nodes, and then sorts by the counts. |
|
GetRegions |
Simple workflow that uses SAMReader, Seq2PosIncidents, CountSorted, PositionStr2Position, and GetRegions. |
|
RegionOverlapp |
Intersects annotation from the UCSC database with regions of interest. |
|
Bash example |
Execute something (ls) on the command line |
|
CmdwInput example |
Execute something (ls) on the command line |
|
AdapterRemoval |
Simple workflow that reads in a FastQ file, applies the adapter removal node, and calculates the sequence lengths and some statistics on them. |
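The last two steps of the GetRegions workflow (PositionStr2Position followed by GetRegions) can be pictured with a short Python sketch. It is a rough illustration rather than the nodes' implementation. Splitting at the last "_" and joining the counts with "," are assumptions made here; the function names and sample data are hypothetical.

```python
def split_position(position_str):
    """Split e.g. 'chr1_1042' into ('chr1', 1042).

    Assumption: the split happens at the LAST underscore, so chromosome
    names that themselves contain '_' stay intact.
    """
    chromosome, _, position = position_str.rpartition("_")
    return chromosome, int(position)


def get_regions(counts):
    """Merge per-position counts into regions without gaps.

    `counts` is a list of (position_str, count) pairs sorted by chromosome
    and position. Returns (chromosome, start, end, concatenated_counts).
    """
    regions = []
    for position_str, count in counts:
        chromosome, position = split_position(position_str)
        last = regions[-1] if regions else None
        if last and last[0] == chromosome and last[2] + 1 == position:
            last[2] = position           # extend the current region
            last[3].append(str(count))
        else:
            regions.append([chromosome, position, position, [str(count)]])
    return [(c, start, end, ",".join(values)) for c, start, end, values in regions]


# Hypothetical per-position counts
counts = [
    ("chr1_100", 3), ("chr1_101", 5), ("chr1_102", 4),
    ("chr1_200", 1),
    ("chr2_200", 2), ("chr2_201", 2),
]
regions = get_regions(counts)
print(regions)
```

A new region starts whenever the chromosome changes or the position is not exactly one past the previous position.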
License
The NGS nodes are released under GPLv2.