Version 2.1.1.0023926 of the Textprocessing feature has been released. The feature is compatible with KNIME 2.1 and backward compatible with the previous Textprocessing version 2.0.1.0020407. Much has been revised: bugs have been fixed and many new features have been implemented. Meta nodes make document pre-processing more convenient, the integrated Snowball stemming library adds support for several languages, and the OpenNLP named entity recognizer node allows you to tag entities such as persons or organizations. Note that the "Text Processing" category, which contains the provided nodes, has been moved into the "KNIME labs" category. To update or install the feature, use the standard Eclipse update mechanism.
Bugs fixed:
-
Document comparison bug: The DocumentValueComparator previously compared only title and text. It now compares all attributes, starting with the title and ending with the text, because comparing the text is the most expensive step.
-
Publication date bug: Dates like "0/0/0" were parsed incorrectly. A zero date can now be assigned to documents.
-
Term vector creator: The names of the output columns were determined by a flawed enumeration. The column names (document titles) are now numbered serially.
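The ordering idea behind the document comparison fix can be sketched as follows. This is a minimal illustration in Python, not the actual DocumentValueComparator code: the `Document` class, its field names, and the `compare_documents` function are hypothetical, and the publication date is modeled as a plain (year, month, day) tuple so that a zero date such as (0, 0, 0) is representable.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Document:
    """Hypothetical stand-in for a document; field names are assumptions."""
    title: str
    authors: Tuple[str, ...]
    publication_date: Tuple[int, int, int]  # (year, month, day); (0, 0, 0) = zero date
    text: str


# Cheap attributes come first, the potentially very long text comes last.
COMPARISON_ORDER = ("title", "authors", "publication_date", "text")


def compare_documents(a: Document, b: Document) -> int:
    """Return -1, 0 or 1, stopping at the first attribute that differs."""
    for name in COMPARISON_ORDER:
        left, right = getattr(a, name), getattr(b, name)
        if left != right:
            return -1 if left < right else 1
    return 0
```

With this ordering, two documents that already differ in title or publication date are told apart without ever touching the text, and the full text is only compared when every other attribute is equal.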
New features:
-
Parser node dialogs now have an "ignore hidden files" option, which skips hidden files during parsing.
-
The Stopword filter node has built-in stop word lists (German and English). Files containing stop word lists can still be specified.
-
The Abner node allows you to specify the ABNER model to use. Depending on the selected model, the tagger can recognize genes, proteins, and cells.
-
The tags "DNA" and "RNA" have been added to the biomedical named entity tag set.
-
The root XML node of the sdml format has changed from "document" to "documents", which can contain several "document" elements. This allows you to store more than one document in a valid, well-formed sdml file.
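To illustrate the sdml root node change described above, here is a small Python sketch that reads a file with a "documents" root and several "document" children. Only the two element names "documents" and "document" come from the release notes. The "title" child element, the sample content, and the function name `read_titles` are assumptions made for the example and do not reflect the real sdml schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: only the element names "documents" and "document"
# are taken from the release notes; the "title" child is an assumption.
SDML_SAMPLE = """<documents>
  <document><title>First document</title></document>
  <document><title>Second document</title></document>
</documents>"""


def read_titles(sdml_text: str) -> list:
    """Return the title of every "document" element under the new root."""
    root = ET.fromstring(sdml_text)
    if root.tag != "documents":
        # An old-style file has a single "document" as its root node.
        raise ValueError(f"expected root 'documents', found '{root.tag}'")
    return [doc.findtext("title", default="") for doc in root.findall("document")]


print(read_titles(SDML_SAMPLE))
```

Because XML allows only one root element, a file in the old layout could hold a single document. Wrapping the entries in "documents" is what makes a multi-document file well-formed.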
New examples:
To show how the Textprocessing feature works and how the nodes can be used and combined, two more examples are available in the example section.
-
Tag Cloud Example: Shows how to properly use the Tag Cloud node on the PubMed Human-Aids, Mouse-Cancer data.
-
Named Entity Recognition and TagCloud Example: Shows how to convert strings (from an RSS news feed) into documents and use a Named Entity Tagger combined with a TagCloud.
New nodes:
-
Document Data Extractor: Extracts specified data from a document, such as title, authors, and publication date, and returns it as output data. For each field to extract (e.g. title), a column containing its values is created in the output data table. The appropriate data cell type is chosen according to the data, e.g. the publication date column contains DateAndTime cells.
-
Sentence Extractor: Extracts all sentences of a document and creates an output table with rows containing triples of document, sentence and number of terms. This node can be used to convert the sentences of documents into string cells and additionally count the number of terms.
-
Dict Replacer: Replaces terms according to a specified dictionary. The dictionary file has to contain key and value pairs, separated by ",". Terms matching a key will be replaced by the corresponding value.
-
Term Grouper: Groups terms consisting of equal strings/words that appear several times because of different tags. For this purpose, a tag grouping policy can be specified that determines how tag conflicts are handled. The available policies are "Delete all", "Keep all", and "Delete conflicting only".
-
Snowball Stemmer: Integrates the Snowball stemming library with all available stemmers (Porter, Portuguese, English, Spanish, Hungarian, Russian, Italian, Dutch, Swedish, Danish, Norwegian, Romanian, Turkish, French).
-
OpenNLP NE tagger: Integrates the OpenNLP named entity recognizer with the corresponding models. Named entities such as date, money, organization, location, person, and time can be recognized and tagged appropriately. Additionally, a dictionary can be specified, which the OpenNLP model uses to increase accuracy.
-
Hyphenator: Hyphenates terms according to Liang's algorithm. Language-dependent pattern files are integrated for Portuguese, Danish, French, Czech, Italian, Finnish, German, Swedish, English, and Dutch.
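The dictionary format used by the Dict Replacer node (one key and value pair per line, separated by ",") can be illustrated with a short Python sketch. This is not the node's implementation. The function names, the whitespace trimming, and the choice to skip lines without a comma are assumptions made for the example.

```python
from typing import Dict, Iterable, List


def load_dictionary(lines: Iterable[str]) -> Dict[str, str]:
    """Parse "key,value" lines into a mapping (split at the first comma)."""
    mapping = {}
    for line in lines:
        key, separator, value = line.strip().partition(",")
        if not separator:
            continue  # assumption: lines without a comma are ignored
        mapping[key.strip()] = value.strip()
    return mapping


def replace_terms(terms: List[str], mapping: Dict[str, str]) -> List[str]:
    """Replace every term that matches a key; leave all other terms as-is."""
    return [mapping.get(term, term) for term in terms]


# Hypothetical dictionary content, as it might appear in a dictionary file.
dictionary = load_dictionary(["NYC,New York City", "UK,United Kingdom"])
print(replace_terms(["NYC", "is", "not", "in", "the", "UK"], dictionary))
```

Matching here is exact and case-sensitive, which is the simplest reading of "terms matching a key". Whether the node itself normalizes case or whitespace is not stated in these notes.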
New Meta nodes:
A couple of meta nodes have been integrated as well. You can find them in your node repository in the category "Meta". These meta nodes contain frequently used pre-processing chains to make pre-processing more convenient and save space on your workbench.
-
Simple Preprocessing: A simple pre-processing chain containing POS tagger, BoW creator, Punctuation Erasure, N Chars Filter, Stop word Filter, and Porter Stemmer.
-
Extended NER Preprocessing: An extended pre-processing chain containing POS tagger, Person recog., Location recog., Organization recog., ABNER tagger, BoW creator, Punctuation Erasure, N Chars Filter, Stopword Filter, and Porter Stemmer.
-
Frequencies: Contains nodes to compute tf and idf frequencies: IDF, TF(rel), TF(abs).
-
Vector Creation: Contains a keyword extractor and vector creation nodes: Keygraph Keyword Extractor, Term Vector, Document Vector.
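For readers unfamiliar with the three quantities named in the Frequencies meta node, the following Python sketch shows one common way to compute them. It assumes textbook definitions: TF(abs) is the raw count of a term in a document, TF(rel) is that count divided by the number of terms in the document, and IDF is log(N / df) for N documents of which df contain the term. The exact formulas used by the KNIME nodes are not stated in these notes and may differ. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def term_frequencies(terms: List[str]) -> Dict[str, Tuple[int, float]]:
    """Map each term to (absolute frequency, relative frequency)."""
    counts = Counter(terms)
    total = len(terms)
    return {term: (count, count / total) for term, count in counts.items()}


def inverse_document_frequencies(documents: List[List[str]]) -> Dict[str, float]:
    """Map each term to log(N / df), an assumed textbook IDF definition."""
    document_count = len(documents)
    document_frequency = Counter()
    for terms in documents:
        document_frequency.update(set(terms))  # count each term once per document
    return {
        term: math.log(document_count / df)
        for term, df in document_frequency.items()
    }


# Two tiny, made-up bags of words.
docs = [["gene", "protein", "gene"], ["protein", "cell"]]
print(term_frequencies(docs[0]))
print(inverse_document_frequencies(docs))
```

Under these definitions, a term that occurs in every document gets an IDF of zero, so it contributes nothing when tf and idf values are later multiplied.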