Term Co-occurrence Heatmap Example

Term Co-occurrence Heatmap Example Workflow

This workflow starts with a list of genes, queries PubMed with each gene name in turn, and downloads the documents that contain that gene name via a loop. The Document Grabber node requires an existing, empty directory for each query in which to save the documents. These directories are created and emptied beforehand by Java Snippet nodes (change the path in the dialog of the Create Temp Dir node). The query for each gene name is passed, one at a time, to the Document Grabber node via a flow variable. The Document Grabber queries PubMed, then downloads and parses the documents, which are then represented in KNIME as DocumentCells. In addition, this node assigns the query to the resulting documents as their document category. This information can be extracted afterwards to find out which query gene retrieved which document.
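The directory preparation step can be sketched outside KNIME as follows. This is only an illustration in Python, not the Java Snippet code used in the workflow; the function name prepare_query_dirs and the choice of one subdirectory per gene name are assumptions made for the example.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_query_dirs(base_dir, genes):
    """Return {gene: directory}, where every directory exists and is empty.

    Hypothetical helper: mirrors the requirement that each query needs an
    existing, empty directory before documents are downloaded into it.
    """
    base = Path(base_dir)
    dirs = {}
    for gene in genes:
        target = base / gene
        if target.exists():
            # Discard anything left over from an earlier run.
            shutil.rmtree(target)
        target.mkdir(parents=True)
        dirs[gene] = target
    return dirs
```

Calling prepare_query_dirs(tempfile.mkdtemp(), ["BRCA1", "TP53"]) would give one empty directory per gene. Gene names that contain path separators would need to be sanitized first, which this sketch does not do.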

The Dictionary Tagger node is then used to recognize the terms of interest. The tagger assigns a specified tag (in this case “LOCATION”) to the recognized terms. Once these terms have been recognized, the BoW Creator node transforms each document into a bag of words, to which filtering can be applied. The Standard Named Entity Filter node is used to filter named entities; in this case it keeps only the terms with a “LOCATION” tag assigned.
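The tag, bag-of-words, and filter sequence can be sketched in plain Python. This is a simplified illustration, not the behaviour of the KNIME nodes: it assumes single-word dictionary entries, case-insensitive matching, and a naive tokenizer, and the function names are hypothetical.

```python
import re


def tag_terms(text, dictionary, tag="LOCATION"):
    """Return (term, tag) pairs; tag is None for terms not in the dictionary."""
    lookup = {entry.lower() for entry in dictionary}
    tokens = re.findall(r"[A-Za-z0-9]+", text)  # naive tokenizer (assumption)
    return [(tok, tag if tok.lower() in lookup else None) for tok in tokens]


def bag_of_words(doc_id, tagged_terms):
    """Return one (doc_id, term, tag) row per distinct term of the document."""
    seen = set()
    rows = []
    for term, tag in tagged_terms:
        key = term.lower()
        if key not in seen:
            seen.add(key)
            rows.append((doc_id, term, tag))
    return rows


def keep_tagged(rows, tag="LOCATION"):
    """Keep only the rows whose term carries the given tag."""
    return [row for row in rows if row[2] == tag]
```

For a document with the text "BRCA1 expression in liver and kidney" and the dictionary ["liver", "kidney"], only the rows for "liver" and "kidney" remain after keep_tagged.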

In a further step, the Category To Class node extracts the category of each document and adds it as a string in an additional column. Since the Document Grabber node assigned to each document the query by which it was found as its document category, this category is the name of a gene that occurs in the document (otherwise the document would not be in the PubMed result list). The data can now simply be grouped by the extracted category and the remaining important terms (all other terms have been filtered out) to compute the number of documents in which a given gene (query term) and an important term co-occur. The grouping is done with the GroupBy node. Afterwards, columns are filtered and the remaining columns are renamed, missing values are replaced by 0, and the co-occurrence frequencies are normalized. Finally, the R View (Local) node creates a heatmap from these frequencies.
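The grouping, zero-filling, and normalization steps can be sketched in Python as shown here. The workflow does not state which normalization it applies, so dividing by the largest count is an assumption made for illustration; the function names and the (doc_id, gene, term) row layout are likewise hypothetical.

```python
from collections import defaultdict


def cooccurrence_counts(rows):
    """Count the distinct documents for every (gene, term) pair.

    rows is an iterable of (doc_id, gene, term) tuples.
    """
    docs = defaultdict(set)
    for doc_id, gene, term in rows:
        docs[(gene, term)].add(doc_id)
    return {pair: len(ids) for pair, ids in docs.items()}


def to_matrix(counts):
    """Build a gene x term matrix; pairs that never co-occur become 0."""
    genes = sorted({gene for gene, _ in counts})
    terms = sorted({term for _, term in counts})
    matrix = [[counts.get((g, t), 0) for t in terms] for g in genes]
    return genes, terms, matrix


def normalize(matrix):
    """Scale all values to [0, 1] by the global maximum (assumed scheme)."""
    peak = max((v for row in matrix for v in row), default=0)
    if peak == 0:
        return [[0.0 for _ in row] for row in matrix]
    return [[v / peak for v in row] for row in matrix]
```

The resulting matrix has one row per query gene and one column per remaining term, which is the shape a heatmap function expects.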

This workflow requires the Text Processing plugin and the R plugin.

The heatmap of the gene-term co-occurrences.