Clustering Example Workflow
The workflow starts with a list of documents that were downloaded from PubMed, parsed beforehand, and saved as a data table. The data is available as a drop file in the corresponding drop directories.
The documents are assigned to two categories and, based on these category assignments, split into two sets. The first set consists of documents about human and AIDS, the second set of documents about mouse and cancer.
After transformation into a bag of words, the data is preprocessed by various filters. Then the most important keywords are extracted and, based on these keywords, the documents are transformed into document vectors.
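The steps in this paragraph can be sketched outside of KNIME in a few lines of plain Python. This is only an illustration under stated assumptions, not the workflow's actual nodes: the toy corpus, the stop word list, the minimum token length, and the choice of document frequency as the keyword score are all hypothetical, since the workflow description does not name the filters or the keyword extraction method it uses.

```python
import re
from collections import Counter

# Hypothetical toy corpus; the real workflow uses parsed PubMed documents.
DOCS = [
    "AIDS infection in human patients",
    "Human immune response to AIDS",
    "Cancer growth in mouse models",
    "Tumor cancer studies in the mouse",
]

# Assumed stop word list, standing in for the workflow's preprocessing filters.
STOP_WORDS = {"in", "to", "the", "of", "and"}


def bag_of_words(text):
    """Lowercase, tokenize on letters, drop stop words and tokens under 3 chars."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS and len(t) >= 3]


def top_keywords(bags, k):
    """Return the k terms found in the most documents (ties broken alphabetically).

    Document frequency is an assumed scoring choice for this sketch.
    """
    doc_freq = Counter(term for bag in bags for term in set(bag))
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


def document_vector(bag, keywords):
    """Count how often each keyword occurs in one document's bag of words."""
    counts = Counter(bag)
    return [counts[keyword] for keyword in keywords]


bags = [bag_of_words(doc) for doc in DOCS]
keywords = top_keywords(bags, 4)
vectors = [document_vector(bag, keywords) for bag in bags]

for doc, vector in zip(DOCS, vectors):
    print(vector, doc)
```

Each document ends up as one row of keyword counts, so documents from the same category produce similar rows, which is what makes distance-based clustering possible.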
The document vectors are a numerical representation of the documents and are then used for hierarchical clustering based on Manhattan and Euclidean distance measures.
This workflow requires the Text Processing and Distance Matrix plugins.
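To make the two distance measures and the clustering step concrete, here is a minimal pure-Python sketch of agglomerative hierarchical clustering. It is not the KNIME implementation: the example vectors are made up, and average linkage is an assumption, because the workflow description does not state which linkage it uses.

```python
import math

# Hypothetical document vectors (keyword counts); rows 0-1 and rows 2-3
# stand in for the two document categories.
VECTORS = [
    [2, 0, 1, 0],
    [1, 0, 2, 0],
    [0, 2, 0, 1],
    [0, 1, 0, 3],
]


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(vectors, distance, left, right):
    """Mean pairwise distance between two clusters of row indices."""
    total = sum(distance(vectors[i], vectors[j]) for i in left for j in right)
    return total / (len(left) * len(right))


def agglomerate(vectors, distance, num_clusters):
    """Merge the two closest clusters until num_clusters remain.

    Returns the final clusters and the list of merges with their heights,
    which is the information a dendrogram draws.
    """
    clusters = [[i] for i in range(len(vectors))]
    merges = []
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(vectors, distance, clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        height, i, j = best
        merges.append((clusters[i], clusters[j], height))
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(clusters), merges


for name, distance in (("Manhattan", manhattan), ("Euclidean", euclidean)):
    clusters, merges = agglomerate(VECTORS, distance, 2)
    print(name, clusters, [round(height, 3) for _, _, height in merges])
```

On this toy data both measures recover the same two groups and differ only in the merge heights. On real document vectors the two measures can produce different trees.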
Download workflow
The following pictures illustrate the dendrogram and the hierarchically clustered data points (mouse cancer in red, human AIDS in blue).
[Figure panels: Euclidean distances (left), Manhattan distances (right)]