Spacy Model Selector
Spacy Model Selector – the initial node, which lets the user pick the model. The node supports the KNIME File Handling framework, so the model can be read from any local or remote source. Custom models are supported as well, so users who have their own model can use it; this feature might be interesting for advanced users or researchers.
Spacy Tokenizer – the node that converts the String format to the Document format, just like the Strings to Document node. The only difference is that it uses the Spacy model.
Example of tokenization:
Sentence: It wasn’t a waste of time if you learned something.
Tokenized sentence: [“it”, “was”, “n’t”, “a”, “waste”, “of”, “time”, “if”, “you”, “learned”, “something”, “.”]
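To make the splitting rule in this example concrete, here is a rough sketch in plain Python. It is not the Spacy tokenizer and not the node's implementation: the regular expression and the `tokenize` helper are assumptions made for illustration, and the lowercasing is there only to mirror the example output above.

```python
import re
from typing import List

# Rough approximation only (hypothetical pattern, not Spacy's rules):
# 1. "n't" (straight or curly apostrophe) becomes its own token,
# 2. the word part in front of "n't" is cut off before the "n",
# 3. any other run of word characters is one token,
# 4. every remaining non-space character (punctuation) is one token.
TOKEN_PATTERN = re.compile(r"n[’']t\b|\w+(?=n[’']t\b)|\w+|[^\w\s]")


def tokenize(text: str) -> List[str]:
    """Split text into lowercase tokens, separating the "n't" contraction."""
    return TOKEN_PATTERN.findall(text.lower())


tokens = tokenize("It wasn’t a waste of time if you learned something.")
print(tokens)
```

A real Spacy model relies on language-specific exception lists rather than a single pattern, so the sketch only covers the contraction shown in the example.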
Spacy POS Tagger
Spacy POS Tagger – this node is much the same as the POS Tagger node in KNIME. The difference is that Spacy models might have their own tag sets. This can cause some confusion, but it also gives more flexibility when a language needs a very specific POS notation.
Spacy NER – if the module is available, the node automatically assigns NER tags within the document. The tags always depend on the model, so anyone interested in which entities can be recognized should check the model description. In general, though, these are usually person, location, and organization. In KNIME there are no ready-to-go NER models available, although it is possible to train one, so this node might be beneficial for non-technical users, for those who would rather skip training, or for people who do not have enough training data.
Spacy Vectorizer – converts both Document and String data to a vector (List of Double). This node can be used together with data preprocessing, but that is optional. In KNIME there is no ready-to-go vectorization model; it is only possible to train your own word2vec model, similar to the NER case. Depending on the type of the model, the vectors can be based on a transformer model (state of the art). The vectors can then be used for ML, both supervised and unsupervised, and for visualization.
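One common way to use such vectors downstream is to compare documents by cosine similarity. The sketch below assumes the vectors arrive as plain lists of doubles; the three-dimensional vectors and the document names are made up for illustration, since real model vectors have far more dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equally sized vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


# Hypothetical document vectors, invented for this example.
doc_vectors = {
    "review_1": [0.9, 0.1, 0.0],
    "review_2": [0.8, 0.2, 0.1],
    "invoice_1": [0.0, 0.2, 0.9],
}

sim_reviews = cosine_similarity(doc_vectors["review_1"], doc_vectors["review_2"])
sim_mixed = cosine_similarity(doc_vectors["review_1"], doc_vectors["invoice_1"])
print(f"review_1 vs review_2:  {sim_reviews:.3f}")
print(f"review_1 vs invoice_1: {sim_mixed:.3f}")
```

Documents with similar content should end up with a similarity close to 1, which is what clustering or nearest-neighbour methods build on.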
Spacy Lemmatizer – converts words to their root form (lemma). In the current Text Processing extension this feature is only available for English. Lemmatization is used to process texts before classification and topic modeling.
Example of lemmatization:
- enjoyed -> enjoy
- best -> good
- deliveries -> delivery
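The simplest way to picture lemmatization is a lookup table from word forms to lemmas. The sketch below is only an illustration of that idea: the table contains nothing but the three examples above, and `LEMMA_TABLE` and `lemmatize` are hypothetical names, not part of the node or of Spacy.

```python
from typing import Dict, List

# Toy lookup table built from the three examples in the text.
LEMMA_TABLE: Dict[str, str] = {
    "enjoyed": "enjoy",
    "best": "good",
    "deliveries": "delivery",
}


def lemmatize(tokens: List[str]) -> List[str]:
    """Replace each token with its lemma; unknown tokens are left unchanged."""
    return [LEMMA_TABLE.get(token.lower(), token) for token in tokens]


lemmas = lemmatize(["We", "enjoyed", "the", "best", "deliveries"])
print(lemmas)
```

A trained model goes beyond such a table, because the right lemma can depend on the part of speech and on the surrounding words.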
Spacy Morphologizer – there are no similar nodes in the Text processing extension. The node assigns special tags (similar to POS or NER) that indicate the tense, case, plurality, etc. It might be useful for deep scientific text analysis. Potential use cases: extracting the relationships between entities in the sentence/text, and estimating the frequency of certain words and their forms in the sentence/text. Another scientific use case is stylometry analysis, where one can try to describe or identify the author by the frequency of certain words and their forms.
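The stylometry idea can be sketched as a frequency profile over morphological tags. The example assumes the tags are already available as (word, features) pairs; the feature strings are written in the Universal Dependencies style, and the sample data is invented, because the node's actual output format is not shown here.

```python
from collections import Counter
from typing import Dict, List, Tuple

TaggedToken = Tuple[str, List[str]]


def feature_profile(tagged_tokens: List[TaggedToken]) -> Dict[str, float]:
    """Return the relative frequency of each morphological feature."""
    counts = Counter(
        feature for _, features in tagged_tokens for feature in features
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {feature: count / total for feature, count in counts.items()}


# Hypothetical, hand-annotated sample; not real node output.
sample = [
    ("walked", ["Tense=Past"]),
    ("dogs", ["Number=Plur"]),
    ("ran", ["Tense=Past"]),
    ("cat", ["Number=Sing"]),
]

profile = feature_profile(sample)
print(profile)
```

Profiles computed for texts by different authors could then be compared, for instance as feature vectors for a classifier.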
Spacy Stop Words Filter
Spacy Stop Words Filter – removes words that carry little meaning in the sentence, using built-in dictionaries. Words such as prepositions, conjunctions, and articles are considered stop words.
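Dictionary-based filtering boils down to a set membership check. The sketch below uses a tiny hand-picked stop word set, which is an assumption for illustration only; the dictionaries shipped with the models are much larger and language-specific.

```python
from typing import List, Set

# Tiny hypothetical stop word list, far smaller than a real dictionary.
STOP_WORDS: Set[str] = {"a", "an", "the", "of", "in", "on", "and", "or", "if", "it"}


def remove_stop_words(tokens: List[str], stop_words: Set[str] = STOP_WORDS) -> List[str]:
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [token for token in tokens if token.lower() not in stop_words]


# The tokenized sentence from the tokenization example.
tokens = ["it", "was", "n’t", "a", "waste", "of", "time", "if",
          "you", "learned", "something", "."]
filtered = remove_stop_words(tokens)
print(filtered)
```

Which words get removed depends entirely on the dictionary, so the result with a real model may differ from this toy list.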