Next Generation Sequencing

Institut Pasteur - PF2

KNIME Extensions for Next Generation Sequencing

We present here nodes and workflows used for processing next generation sequencing (NGS) results. Most of these nodes are not specific to NGS data and may be useful in other contexts as well.

The workflows presented here come with sample data and are pre-executed. They are not meant merely to show how to use a specific node (that is described in the node help), but rather to solve a specific NGS-related problem.

Last but not least, there is no guarantee or promise of any kind associated with any of the information here. We are very happy to discuss anything described on these pages and also welcome contributions from other KNIME NGS users/developers.

Kind regards,

Bernd (baj)

 

Nodes

Name of the node Description
FastQReader Reads a FASTQ file into a table. Each FASTQ entry (i.e. 4 lines) is translated into one row. This node uses BioJava.
FastQWriter Writes a table out to a FASTQ file. This node uses BioJava.
AdapterRemoval Very simple (not well optimized) routine to remove adapter sequences.
CountSorted Counts occurrences within a sorted column. It is faster than the ValueCounter and useful for counting reads from a FASTQ file once they are sorted. It also uses a minimal amount of memory.
GetRegions Identifies regions of interest (ROI). A ROI is defined as a chromosomal region that has no gaps. This node also produces a string of concatenated values (i.e. counts).
PositionStr2Position Converts a string containing a chromosome name and location separated by "_" into the chromosome name and its position. Such a chromosome name-location string can be used to identify a location with a single column.
RegionOverlap Identifies regions that overlap. This node is usually used within a sub-workflow that divides the data set per chromosome. The first input is retained.
Seq2PosIncidents This node splits a sequence into one nucleotide per row.
Bash Executes commands in bash or cmd.exe (see inline documentation).
CmdwInput Similar to the Bash node, except that it takes an input table and executes the strings within that table.
JoinSorted Creates a full outer join of two sorted tables.
OneString Creates a single cell of type String, Integer, or Double. It is useful when executing workflows from the command line.
Wait Does nothing other than synchronising executions. This can also be done using the Variable Ports of existing nodes.
BEDGraphWriter Writes out BED files.
SAMReader Reads SAM or BAM files.
PileupCounts Counts occurrences of, e.g., matched positions in the output of the samtools mpileup command.
TableSpecs Retrieves simple statistics for a table and its columns. Included are column type, index, lower and upper bound (table 1), and the number of rows and columns (table 2).
AdapterRemovalAdv Finds an "adapter sequence" within high-throughput sequencing data and removes it. It leaves the fragments and information about what caused the removal.
CollectionLinePlot  
GroupByLoopStart Starts an iteration over one column. Divides the data set based on the values of that column such that consecutive rows with the same value are in one table. Also sets the variable "GroupedBy" to the current value.
NGSConcat Concatenates two tables with identical table specs.
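
To make the FastQReader and CountSorted descriptions more concrete, here is a minimal Python sketch of the two ideas: turning each 4-line FASTQ entry into one row, and counting runs of equal values in a sorted column in a single pass. It is an illustration only, not the code of the KNIME nodes (which are built on BioJava); the names `FastqRow`, `read_fastq` and `count_sorted` and the sample reads are hypothetical.

```python
from itertools import groupby, islice
from typing import Iterable, Iterator, NamedTuple, Tuple


class FastqRow(NamedTuple):
    """One table row per FASTQ entry (hypothetical column layout)."""
    identifier: str
    sequence: str
    quality: str


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRow]:
    """Translate each 4-line FASTQ entry (header, sequence, '+', quality) into one row."""
    it = iter(lines)
    while True:
        chunk = [line.rstrip("\n") for line in islice(it, 4)]
        if not chunk:
            return  # input exhausted
        if len(chunk) != 4 or not chunk[0].startswith("@") or not chunk[2].startswith("+"):
            raise ValueError(f"malformed FASTQ entry: {chunk!r}")
        yield FastqRow(chunk[0][1:], chunk[1], chunk[3])


def count_sorted(values: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Count runs of equal consecutive values in one pass.

    Only the current run is tracked, so memory use stays constant. The input
    must already be sorted; otherwise a value is reported once per run.
    """
    for value, run in groupby(values):
        yield value, sum(1 for _ in run)


# Made-up sample data: three reads, two of them with the same sequence.
fastq_lines = [
    "@r1", "ACGT", "+", "IIII",
    "@r2", "ACGT", "+", "IIII",
    "@r3", "TTGA", "+", "IIII",
]
rows = list(read_fastq(fastq_lines))
counts = list(count_sorted(sorted(row.sequence for row in rows)))
print(counts)
```

Because only neighbouring values are compared, no lookup table of all distinct sequences is needed, which is the reason a run-based counter can be cheaper than a general value counter on sorted data.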

Workflows

Name of the workflow Description
FastQ-stats Descriptive statistics of Illumina results in FASTQ format (usually before mapping).
AdapterRemoval Simple workflow that reads in a FastQ file, then applies the adapter removal node, calculates the length and some statistics on the length of the sequences.
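
The steps of the AdapterRemoval workflow (remove the adapter, compute sequence lengths, summarise the lengths) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the algorithm of the node itself: it only handles exact matches and cuts the read at the first occurrence of the adapter. The adapter sequence, the reads, and the function names are made up for the example.

```python
from statistics import mean, median
from typing import Dict, List


def remove_adapter(sequence: str, adapter: str) -> str:
    """Cut the read at the first exact occurrence of the adapter.

    Assumption: the adapter and everything after it is discarded;
    reads without the adapter are returned unchanged.
    """
    index = sequence.find(adapter)
    return sequence if index == -1 else sequence[:index]


def length_stats(sequences: List[str]) -> Dict[str, float]:
    """Descriptive statistics on the lengths of the given sequences."""
    lengths = [len(s) for s in sequences]
    return {
        "n": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
    }


ADAPTER = "TCGTATGCC"  # hypothetical adapter sequence
reads = [
    "ACGTACGTTCGTATGCCGT",  # adapter in the middle of the read
    "GGGTTTAAACCC",         # no adapter
    "TCGTATGCCAAAA",        # adapter at the very start: nothing is left
]
trimmed = [remove_adapter(read, ADAPTER) for read in reads]
stats = length_stats(trimmed)
print(trimmed, stats)
```

Reads that consist only of adapter end up with length 0, which is why looking at the length distribution after trimming is a useful sanity check.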

Workflows showing the use of the nodes

Name of the workflow Description
FASTQReader Simply one node with data from NCBI/SRA (SRR001356, Illumina sequencing of Mouse brain transcript fragment library)
FASTQWriter Simple workflow that reads in a FastQ file, then reduces the sequence and quality string to the first position and writes out the result.
Count sorted Simple workflow that reads in a FastQ file, sorts the data by sequence, applies both the ValueCounter and the CountSorted nodes, and then sorts by the counts.
GetRegions Simple workflow that uses SAMReader, Seq2PosIncidents, CountSorted, PositionStr2Position, and GetRegions.
RegionOverlap Intersects annotation from the UCSC database with regions of interest.
Bash example Executes a command (ls) on the command line.
CmdwInput example Executes a command (ls) on the command line.
AdapterRemoval Simple workflow that reads in a FastQ file, then applies the adapter removal node, calculates the length and some statistics on the length of the sequences.
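
The last two steps of the GetRegions workflow (PositionStr2Position followed by GetRegions) can be illustrated with the following Python sketch. It splits "chromosome_position" strings, merges consecutive positions on the same chromosome into gap-free regions, and concatenates the counts of each region into one string. This is a sketch of the idea, not the implementation of the nodes. Splitting at the last "_", the comma as separator for the counts, and all names and sample values are assumptions made for the example.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


def split_position(position_str: str) -> Tuple[str, int]:
    """Split e.g. 'chr1_1042' into ('chr1', 1042).

    Assumption: the split happens at the last underscore, so that
    chromosome names containing '_' are kept intact.
    """
    chromosome, position = position_str.rsplit("_", 1)
    return chromosome, int(position)


class Region(NamedTuple):
    chromosome: str
    start: int   # first position of the region
    end: int     # last position of the region (inclusive)
    counts: str  # counts of all positions, joined with ','


def get_regions(rows: Iterable[Tuple[str, int]]) -> Iterator[Region]:
    """Merge consecutive positions into regions without gaps.

    rows: (position string, count) pairs, sorted by chromosome and position.
    """
    current = None  # [chromosome, start, end, list of counts]
    for position_str, count in rows:
        chromosome, position = split_position(position_str)
        if current and current[0] == chromosome and position == current[2] + 1:
            # Directly adjacent to the open region: extend it.
            current[2] = position
            current[3].append(count)
        else:
            # Gap or new chromosome: close the open region and start a new one.
            if current:
                yield Region(current[0], current[1], current[2], ",".join(map(str, current[3])))
            current = [chromosome, position, position, [count]]
    if current:
        yield Region(current[0], current[1], current[2], ",".join(map(str, current[3])))


# Made-up per-position counts, as they might come out of a counting step.
rows = [
    ("chr1_100", 3), ("chr1_101", 5), ("chr1_102", 4),
    ("chr1_200", 1),
    ("chr2_7", 2), ("chr2_8", 2),
]
regions = list(get_regions(rows))
for region in regions:
    print(region)
```

The sketch relies on the input being sorted by chromosome and numeric position; on unsorted input, one region would be split into several.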

License

The NGS nodes are released under GPLv2.