My current learning around Large Language Models (LLMs) has led to me running into more acronyms and terms that have either left my memory or are new to me. I decided a blog post was needed to help me keep track.
NLP
Natural Language Processing. The use of Machine Learning (ML) and Deep Learning (DL) has helped to progress NLP in recent times, especially with ML / DL being used with the large amounts of text data available via the internet.
Stop Words
Not a call to arms to stop words, but an actual classification for a type of word. Stop Words are words like “the” and “and”. Stop Words appear in sentences but are considered insignificant. When they are removed from a sentence, the sentence can usually still be read and understood. Stop Words are often removed during NLP preprocessing, and seem to be called Stop Words as they are stopped from proceeding further into the process.
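To make this concrete for myself, here is a minimal sketch of stop word removal. The `STOP_WORDS` set and the `remove_stop_words` function are my own made-up names, and the list is a tiny hand-picked one for illustration; real NLP libraries ship much longer lists.

```python
# A tiny, hand-picked stop word list (an assumption for illustration only;
# real stop word lists are much longer).
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "in", "is", "on"}


def remove_stop_words(sentence: str) -> list[str]:
    """Lower-case the sentence, split on whitespace, and drop stop words."""
    return [word for word in sentence.lower().split() if word not in STOP_WORDS]


print(remove_stop_words("The cat and the dog sat on the mat"))
# ['cat', 'dog', 'sat', 'mat']
```

The remaining words still carry the gist of the sentence, which is the point of the exercise.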
Stemming
Stemming is the act of reducing a word to its stem, or root. This is because some words have the same root meaning, even if they mean it in slightly different ways. Wikipedia gives an example of catty, catlike and cats, all three of which can be stemmed back to the word cat.
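As a rough sketch of the idea, the toy stemmer below just chops a few suffixes off the end of a word. The suffix list and the `naive_stem` name are assumptions of mine, chosen only so that the Wikipedia example works; real stemmers, such as the Porter stemmer, use far more careful rules.

```python
# Suffixes to strip, checked in this order. This list is a toy assumption,
# picked to handle the "catty / catlike / cats" example only.
SUFFIXES = ("like", "ty", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print([naive_stem(w) for w in ["catty", "catlike", "cats"]])
# ['cat', 'cat', 'cat']
```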
Lemmatization
Like stemming, Lemmatization looks to group together words. However, Lemmatization does so whilst also trying to take into account the meaning of the word and the meaning of the sentence the word is used in. For example, the lemma of better would be good.
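One way I picture the difference from stemming is as a dictionary lookup that also knows how the word is being used. The sketch below is a toy: the `LEMMAS` table, the part-of-speech labels and the `toy_lemmatize` function are all hypothetical, and a real lemmatizer works from a full dictionary rather than three entries.

```python
# A hypothetical lookup table keyed on (word, part of speech).
# Real lemmatizers use a full dictionary; this is illustration only.
LEMMAS = {
    ("better", "adj"): "good",
    ("meeting", "verb"): "meet",     # "we are meeting tomorrow"
    ("meeting", "noun"): "meeting",  # "the meeting ran late"
}


def toy_lemmatize(word: str, pos: str) -> str:
    """Look up the lemma for a word and its part of speech.

    Falls back to the lower-cased word when there is no entry.
    """
    return LEMMAS.get((word.lower(), pos), word.lower())


print(toy_lemmatize("better", "adj"))    # good
print(toy_lemmatize("meeting", "verb"))  # meet
print(toy_lemmatize("meeting", "noun"))  # meeting
```

The same word, meeting, gets a different lemma depending on its role in the sentence, which a suffix-chopping stemmer cannot do.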
Text Normalisation
The process of removing punctuation such as full stops, commas and exclamation marks. Also includes changing the case of all the text involved, e.g. to lower case.
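A minimal sketch of those two steps using only the Python standard library. The `normalise` function name is my own, and note that `string.punctuation` only covers ASCII punctuation, so curly quotes and similar characters would be left in place.

```python
import string


def normalise(text: str) -> str:
    """Remove ASCII punctuation and convert the text to lower case."""
    # str.maketrans with a third argument maps those characters to None,
    # so translate() deletes them.
    without_punctuation = text.translate(str.maketrans("", "", string.punctuation))
    return without_punctuation.lower()


print(normalise("Hello, World! This is NLP."))
# hello world this is nlp
```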
n-grams
An n-gram is a sequence of n consecutive words. A single word is a unigram, a two-word phrase is a bigram, and a three-word phrase is a trigram. n-grams allow ML to consider words in the context of their neighbours, rather than as isolated individual words.
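Here is a small sketch of how n-grams could be pulled out of a list of words by sliding a window of size n along it. The `ngrams` function is my own illustration, not taken from any particular library.

```python
def ngrams(words: list[str], n: int) -> list[tuple[str, ...]]:
    """Return every run of n consecutive words as a tuple."""
    return [tuple(words[i : i + n]) for i in range(len(words) - n + 1)]


words = "the quick brown fox".split()

print(ngrams(words, 1))
# [('the',), ('quick',), ('brown',), ('fox',)]
print(ngrams(words, 2))
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
print(ngrams(words, 3))
# [('the', 'quick', 'brown'), ('quick', 'brown', 'fox')]
```

If n is larger than the number of words, the function returns an empty list.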
Corpus
A corpus is a body of text. It’s a Latin word.