Blog

KNIME UGM 2013 – Text Mining Workshop

From March 6th to 8th, the KNIME UGM 2013 took place in Zurich, as in previous years. Once again there was a text mining workshop, in which the functionality and usage of the KNIME Textprocessing nodes were presented. This year, restaurant reviews from TripAdvisor were analyzed and mined. As a basic sentiment analysis, words occurring in a negative or positive context were identified. Reviews were clustered and classified according to the cuisine of the restaurant, and words were visualized in tag clouds. Some of the example workflows, including the data, can be downloaded below (see attachments). The required KNIME plugins are: Textprocessing, Distance Matrix, and Itemset Mining.

Textprocessing version v2.7.1 released

Along with KNIME version 2.7.1, version 2.7.1 of the Textprocessing feature has been released, providing some bug fixes.

Bug fixes:

  • OpenNLP Tagger: External OpenNLP tagger models can be specified in the dialog properly, without errors and warnings.
  • Dictionary Tagger: Assigns tags of tag sets other than “NE” properly.
  • Document Grabber: Extracts first name of PubMed authors correctly.
  • BoW Creator: Row count, used as row id, is reset on node reset.
  • Snowball Stemmer: Is now thread-safe, meaning that several nodes can be used in parallel.
  • String to Term: Auto configures properly, using the first available string column.
  • String to Term, Term to String, Term to Structure: Handle missing values.

Textprocessing version v2.7.0 released

With the new KNIME version, a new version 2.7.0 of the Textprocessing feature has been released. It provides new node features and a new OpenNLP version (1.5.2).

New features:

  • DocumentViewer: The node view allows for the search of strings and the highlighting of matching results in the document. Tagged named entities, such as gene or compound names, can be highlighted in different colors and linked to a specified website, e.g. PubMed, Wikipedia or Google. A click on the link opens a browser window, loading the corresponding website.
  • Document Grabber: It is possible to extract meta information from PubMed, such as the chemical list or the MeSH heading list provided with each document (see node dialog).
  • OpenNLP version is updated to 1.5.2.
  • OSCAR (chemical named entity tagger) is updated to version 4.1.1 using OPSIN v.1.3.
  • OpenNLP NE Tagger: It is possible to use external OpenNLP models, which can be specified in the dialog.

The KNIME Text Processing Feature: An Introduction

This technical report explains the fundamentals of the KNIME Text Processing feature along with detailed descriptions and examples of all key node categories. It describes the philosophy and usage of the feature: which types of nodes can be applied in which order, what needs to be done, what can be done, and in which step. Furthermore, the different structures of data tables and the provided KNIME data types are explained.

Download the technical report here.

Textprocessing version v2.6.0 released

With the new KNIME version, a new version 2.6.0 of the Textprocessing feature has been released. It provides two new nodes, a few node updates, bug fixes, and API extensions.

New features:

  • Strings To Document: in order to set categories and sources of documents, a string column can be specified (one for document source and one for document category) in addition to the fixed string value.
  • N Chars: If N is set to 1 a warning will be shown, since no filtering is applied (terms always consist of at least one character).

New Nodes:

  • Term co-occurrence counter: Counts the number of co-occurrences for a given list of terms within different parts of the documents, e.g. sentences, paragraphs, sections and titles.
  • Dict Replacer (2 in ports): Replaces entire terms based on a dictionary, like the traditional Dict Replacer node. In contrast to the old node, the dictionary is provided via an input data table instead of a file.
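
The idea behind the Term co-occurrence counter listed above can be illustrated with a small, self-contained Python sketch. It is not the node's implementation: the naive sentence splitting, the lower-casing, and the function name `sentence_cooccurrences` are assumptions made purely for illustration, and only the sentence level is covered.

```python
import re
from collections import Counter
from itertools import combinations


def sentence_cooccurrences(text, terms):
    """Count, for each pair of terms, the sentences containing both terms."""
    wanted = {t.lower() for t in terms}
    counts = Counter()
    # Assumption: sentences end with '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        tokens = set(re.findall(r"\w+", sentence.lower()))
        present = sorted(wanted & tokens)
        # Every unordered pair of terms found in the same sentence co-occurs once.
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


text = "The pizza was great. The pasta was cold and the pizza was late. Great pasta!"
counts = sentence_cooccurrences(text, ["pizza", "pasta", "great"])
for pair, n in sorted(counts.items()):
    print(pair, n)
```

Counting within paragraphs, sections or titles would follow the same pattern with a different splitting step.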

Bug fixes:

  • Word Parser: Reads older Word files (Word95) as well.
  • Tag Cloud: Font size is computed properly if max value (font size) is equal to min value (font size).

Update Notes: Documentation and Examples

The Documentation section as well as the Examples section has been updated. The different steps IO, Enrichment, Preprocessing, Frequencies, and Transformation are described in more detail and figures have been updated.

The workflow examples available in the Examples section have been revised and annotated, and new descriptions have been added. In addition, two new example workflows are available. They download articles from PubMed and NY Times RSS news feeds, respectively, and mine and visualize the terms contained in the documents.

Textprocessing version v2.5.0.0031650 released

The new version v2.5.0.0031650 of the KNIME Textprocessing feature has finally been released. The feature can be used with KNIME 2.5 (or later) and is downward compatible with its earlier versions: old workflows created with earlier versions of the feature can still be loaded and used.

The new version provides new nodes and features, bug fixes, and major changes in the underlying deserialization of documents and terms.

Major changes:

  • Memory management has been improved by the usage of soft references in document and term caches.
  • DocumentCells and TermCells are now serialized in binary form in order to speed up buffering of BufferedDataTables. Old workflows created with a version <2.5.0 can still be loaded and executed. Downwards compatibility can be disabled on the KNIME->Textprocessing preference page in order to speed up deserialization.

    Note: Enable downwards compatibility (enabled by default) to read old workflows. Afterwards, save the workflow and disable downwards compatibility in order to speed up processing.

  • An extension point allows third-party TagSets to be registered, which can then be used, e.g., by a third-party named entity recognizer (tagger). The class providing the new tag set needs to implement the interface TagBuilder and be registered in the extensions section of the plugin.xml as TagSet. The process is similar to registering new node extensions.

    Note: If you extend the Textprocessing feature by implementing your own tag set (and tagger node) and want to share a workflow using this new tag set (and tagger), you must distribute the tag set implementation (and tagger node) as well.

New features:

  • Tag cloud provides an image out port, which can be used in reports. Settings such as font, image size, transparency etc. can be specified in the tag cloud dialog.
  • Document vector and Term vector can create an output data table containing sparse CollectionCells as vectors, in addition to the usual column representation. Note that processing data tables with many (>1000) columns can be slow; using CollectionCells can speed up processing.
  • Dictionary tagger has two input ports, one containing the list of documents to tag and the other containing the dictionary. The dictionary input table has to contain one string column where each cell represents one dictionary entry. Furthermore, the dialog has been improved so that the tag type as well as the tag value can be specified. External tag sets which have been registered via the TagSet extension point can be chosen as well.
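
To illustrate why a sparse representation helps for the Document vector and Term vector output mentioned above, the following Python sketch contrasts a dense vector (one entry per vocabulary term, comparable to one column per term) with a sparse one (only the non-zero entries). It does not use KNIME CollectionCells; the function names and the toy vocabulary are hypothetical.

```python
from collections import Counter


def dense_vector(tokens, vocabulary):
    """One entry per vocabulary term, zero for absent terms."""
    counts = Counter(tokens)
    return [counts.get(term, 0) for term in vocabulary]


def sparse_vector(tokens, vocabulary):
    """Only the non-zero entries, keyed by the term's index in the vocabulary."""
    index = {term: i for i, term in enumerate(vocabulary)}
    counts = Counter(t for t in tokens if t in index)
    return {index[t]: c for t, c in counts.items()}


vocabulary = ["cheese", "pasta", "pizza", "sushi", "wine"]
doc = ["pizza", "wine", "pizza"]
dense = dense_vector(doc, vocabulary)
sparse = sparse_vector(doc, vocabulary)
print(dense)
print(sparse)
```

With a vocabulary of several thousand terms, the dense form stores mostly zeros for each document, while the sparse form grows only with the number of distinct terms that actually occur in it.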

New nodes:

  • Oscar chem tagger: assigns chemical named entity tags to terms which have been recognized by the Oscar chemical named entity recognizer framework.
  • Oscar term filter: filters terms recognized as named entities by the Oscar tagger.
  • Term to Structure: converts terms which have been recognized as chemical named entities into structures (SMILES, INCHI, or CML). These structures can be further processed via nodes of e.g. the CDK feature.
  • Stanford tagger: assigns POS tags based on different models, which can be specified in the dialog. This tagger is capable of tagging German texts using the STTS tag set.
  • STTS term filter: filters terms with STTS tags assigned based on the tag value.
  • PDF Parser: parses .pdf files and creates a DocumentCell for each file. The PDFBox library is used for parsing.
  • Word Parser: parses .doc files and creates a DocumentCell for each file. The POI library is used for parsing.

Bug fixes:

  • Tagger: A tag is no longer assigned a second (or further) time when it has already been assigned to a term.
  • Flat file reader: no longer cuts off the last character.
  • Document vector: vectors for documents consisting of only one term are created properly.
  • Term vector: vectors for terms occurring in only one document are created properly.
  • Tag filter: matching terms are filtered if “ignore unmodifiable” and “filter matching” are checked.
  • Sdm and Dml parser: opening and closing tags are converted to lower case.

Textprocessing version 2.1.1.0023926 released

The new version 2.1.1.0023926 of the Textprocessing feature has finally been released. The feature is compatible with KNIME 2.1 and downward compatible with the previous Textprocessing version 2.0.1.0020407. Many things have been revised: bugs have been fixed and many new features have been implemented. Meta nodes make document pre-processing more comfortable, the integrated Snowball stemming library adds support for several languages, and the OpenNLP named entity recognizer node allows you to tag entities such as persons or organizations. Note that the "Text Processing" category, which contains the provided nodes, has been moved into the "KNIME labs" category. To update or install the feature, use the standard Eclipse update mechanism.

Bugs fixed:

  • Document comparison bug: The DocumentValueComparator previously compared only title and text. It now compares all attributes, starting with the title and ending with the text, since comparing the text is the most expensive step.
  • Publication date bug: Dates like "0/0/0" were parsed improperly. A zero date can now be assigned to documents.
  • Term vector creator: The names of the output columns were determined by a flawed enumeration. The column names (document titles) are now numbered serially.
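
The ordering used in the document comparison fix above (cheap attributes first, full text last) can be sketched in Python as follows. The `Doc` class, its field names, and `compare_docs` are hypothetical stand-ins and do not reflect the actual DocumentValueComparator code.

```python
from dataclasses import dataclass
from functools import cmp_to_key


@dataclass(frozen=True)
class Doc:
    # Hypothetical stand-in for a document with a few attributes.
    title: str
    authors: tuple
    pub_date: str
    text: str


def compare_docs(a, b):
    """Compare short attributes first; the full text is only reached on ties."""
    for field in ("title", "authors", "pub_date", "text"):
        x, y = getattr(a, field), getattr(b, field)
        if x != y:
            return -1 if x < y else 1
    return 0


a = Doc("A title", ("Smith",), "2009-01-01", "long text ...")
b = Doc("A title", ("Jones",), "2009-01-01", "long text ...")
ordered = sorted([a, b], key=cmp_to_key(compare_docs))
print([d.authors for d in ordered])
```

Two documents with the same title and text but different authors are no longer treated as equal, and the potentially long text is compared only when all other attributes match.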

New features:

  • Parser node dialogs have an "ignore hidden files" option, which excludes hidden files from parsing.
  • The Stopword filter node has built-in stop word lists (German and English). Files containing stop word lists can still be specified.
  • The Abner node allows you to specify the ABNER model to use. Based on the selected model, the tagger is able to recognize genes, proteins and cells.
  • The tags "DNA" and "RNA" have been added to the biomedical named entity tag set.
  • The root XML node of the sdml format changed from "document" to "documents", which can contain several "document" objects. This allows you to store more than one document in a valid, well-formed sdml file.

New examples:

To show how the Textprocessing feature works and how the nodes can be used and combined, two more examples are available in the example section.

  • Tag Cloud Example: Shows how to properly use the Tag Cloud node on the PubMed Human-Aids, Mouse-Cancer data.
  • Named Entity Recognition and TagCloud Example: Shows how to convert strings (from an RSS news feed) into documents and use the Named Entity Tagger combined with a TagCloud.

New nodes:

  • Document Data Extractor: Extracts specified data of a document, like title, authors, publication date etc., and returns it as output data. For each field to extract (e.g. title), a column containing its values is created in the output data table. The appropriate data cell type is chosen according to the data, e.g. the publication date column contains DateAndTime cells.
  • Sentence Extractor: Extracts all sentences of a document and creates an output table with rows containing triples of document, sentence and number of terms. This node can be used to convert the sentences of documents into string cells and additionally count the number of terms.
  • Dict Replacer: Replaces terms according to a specified dictionary. The dictionary file has to contain key and value pairs, separated by ",". Terms matching a key will be replaced by the corresponding value.
  • Term Grouper: Groups terms consisting of equal strings/words that appear several times because of different tags. For this purpose, a tag grouping policy can be specified, determining how tag conflicts are handled. Available policies are: "Delete all", "Keep all", and "Delete conflicting only".
  • Snowball Stemmer: Integrates the Snowball stemming library with all available stemmers (Porter, Portuguese, English, Spanish, Hungarian, Russian, Italian, Dutch, Swedish, Danish, Norwegian, Romanian, Turkish, French).
  • OpenNLP NE tagger: Integrates the OpenNLP named entity recognizer with the corresponding models. Named entities like date, money, organization, location, person, and time can be recognized and tagged appropriately. Additionally, a dictionary can be specified, which is used by the OpenNLP model to increase accuracy.
  • Hyphenator: Hyphenates terms according to Liang's algorithm. Language-dependent pattern files are integrated for Portuguese, Danish, French, Czech, Italian, Finnish, German, Swedish, English, and Dutch.
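
The dictionary format described for the Dict Replacer node above (key and value pairs separated by ",") can be illustrated with a short Python sketch. The helper names, the handling of blank or malformed lines, and the splitting on the first comma only are assumptions made for this example, not documented behavior of the node.

```python
def load_dictionary(lines):
    """Parse 'key,value' lines into a mapping, skipping lines without a comma."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if "," not in line:
            continue
        # Assumption: split on the first comma only, so values may contain commas.
        key, value = line.split(",", 1)
        mapping[key.strip()] = value.strip()
    return mapping


def replace_terms(terms, mapping):
    """Replace each term matching a key by its value; keep other terms as is."""
    return [mapping.get(term, term) for term in terms]


dictionary_lines = ["NYC,New York City", "LA,Los Angeles", ""]
mapping = load_dictionary(dictionary_lines)
replaced = replace_terms(["NYC", "is", "far", "from", "LA"], mapping)
print(replaced)
```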

New Meta nodes:

A couple of meta nodes have been integrated as well. You can find them in your node repository in the category "Meta". These meta nodes contain various frequently used pre-processing chains in order to make pre-processing more comfortable and save space on your workbench.

  • Simple Preprocessing: A simple pre-processing chain containing: POS tagger, BoW creator, Punctuation Erasure, N Chars Filter, Stop word Filter, Porter Stemmer
  • Extended NER Preprocessing: An extended pre-processing chain containing: POS tagger, Person recog., Location recog., Organization recog., ABNER tagger, BoW creator, Punctuation Erasure, N Chars Filter, Stopword Filter, Porter Stemmer
  • Frequencies: Contains nodes to compute tf and idf frequencies: IDF, TF(rel), TF(abs)
  • Vector Creation: Contains a keyword extractor and vector creation nodes: Keygraph Keyword Extractor, Term Vector, Document Vector
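
As a rough illustration of the quantities computed by the Frequencies meta node above, the following Python sketch derives absolute term frequency, relative term frequency, and inverse document frequency for a toy corpus. The exact formulas used by the KNIME nodes are not stated here, so the textbook definition idf = log(N / df) is an assumption, as are the function names.

```python
import math
from collections import Counter


def tf_abs(tokens):
    """Absolute term frequency: number of occurrences per term."""
    return Counter(tokens)


def tf_rel(tokens):
    """Relative term frequency: occurrences divided by document length."""
    total = len(tokens)
    if total == 0:
        return {}
    return {term: count / total for term, count in Counter(tokens).items()}


def idf(documents):
    """Inverse document frequency, assuming the textbook form log(N / df)."""
    n = len(documents)
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))  # count each term once per document
    return {term: math.log(n / count) for term, count in df.items()}


docs = [["pizza", "wine", "pizza"], ["pasta", "wine"], ["sushi"]]
abs_freq = tf_abs(docs[0])
rel_freq = tf_rel(docs[0])
inv_freq = idf(docs)
print(abs_freq["pizza"], rel_freq["pizza"], inv_freq["wine"])
```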

Textprocessing version 2.0.1.0020407 released

Textprocessing plugin version 2.0.1.0020407 has been released. To update your old plugin or install the new one, use the Eclipse update mechanism. The new release includes a performance speed-up of the preprocessing nodes, some new dialog options (such as letting preprocessing nodes ignore the unmodifiable flag of terms), two new nodes, and much more.

Additionally the documentation and the examples on this site have been revised.

New Nodes:

  • Tags to string: Converts tags of specified types to strings.