What Is Natural Language Processing?
Natural Language Processing, or NLP for short, is broadly defined as the automatic manipulation of natural language, like speech and text, by software.
Where NLP Stands
Corpora, Tokens, and Types
● All NLP methods, be they classic or modern, begin with a text dataset, also called a corpus (plural: corpora).
● A corpus usually contains raw text and any metadata associated with the text. The raw text is a sequence of characters (bytes), but most times it is useful to group those characters into contiguous units called tokens.
● The process of breaking a text down into tokens is called tokenization (a rough sketch follows this list).
● Types are unique tokens present in a corpus. The set of all types in a corpus is its vocabulary.
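A rough, minimal sketch of tokenization and types, using a naive regular-expression tokenizer on a made-up two-sentence corpus (real tokenizers such as those in spaCy or NLTK handle punctuation, contractions, and Unicode far more carefully):

```python
import re

# A toy corpus: a list of raw-text instances (made-up example sentences).
corpus = [
    "Mary, don't slap the green witch.",
    "The green witch flew away.",
]

def tokenize(text):
    # Naive tokenizer: lowercase the text, then pull out runs of letters/apostrophes.
    return re.findall(r"[a-z']+", text.lower())

tokens = [tok for doc in corpus for tok in tokenize(doc)]
types = set(tokens)  # the unique tokens are the types
print("tokens:", tokens)
print("vocabulary size:", len(types))
```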
In machine learning parlance, the text along with its metadata is called an instance or data point. The corpus, a collection of instances, is also known as a dataset.
Feature engineering
The process of understanding the linguistics of a language and applying it to solving NLP problems is called feature engineering.
Unigrams, Bigrams, Trigrams, …, N-grams
N-grams are fixed-length (n) consecutive token sequences occurring in the text.
● A bigram has two tokens, a unigram one, and a trigram three.
Generating n-grams from a text is straightforward enough, as the sketch below shows.
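A minimal sliding-window sketch; the function name n_grams is just an illustrative choice:

```python
def n_grams(tokens, n):
    # Return every consecutive window of n tokens (a sliding window over the list).
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

tokens = ["mary", "slapped", "the", "green", "witch"]
print(n_grams(tokens, 2))  # bigrams
print(n_grams(tokens, 3))  # trigrams
```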
Lemmas and Stems
Lemmas:
Lemmas are root forms of words.
Consider the verb fly.
● It can be inflected into many different words: flies, flew, flown, flying, and so on.
● fly is the lemma for all of these seemingly different words.
This reduction of tokens to their lemmas is called lemmatization.
Stemming:
Stemming is the poor man’s lemmatization. It involves the use of handcrafted rules to strip the endings of words, reducing them to a common form called stems.
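To make the contrast concrete, the sketch below stems a few inflected forms with NLTK's PorterStemmer and lemmatizes a sentence with spaCy; it assumes nltk and spacy are installed and that the small English model en_core_web_sm has already been downloaded:

```python
from nltk.stem import PorterStemmer
import spacy

# Stemming: rule-based suffix stripping; the result may not be a real word.
stemmer = PorterStemmer()
for word in ["flies", "flew", "flown", "flying"]:
    print(word, "->", stemmer.stem(word))

# Lemmatization: model/dictionary-based reduction to the true root form.
nlp = spacy.load("en_core_web_sm")  # assumes the model was downloaded beforehand
doc = nlp("He was flying home and flew back late.")
for token in doc:
    print(token.text, "->", token.lemma_)
```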
Categorizing Sentences and Documents
Common problems framed as supervised document classification include (a minimal sketch follows this list):
● assigning topic labels,
● predicting the sentiment of reviews,
● filtering spam emails,
● language identification, and
● email triaging.
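A minimal sketch of one such task (sentiment of reviews), using a bag-of-words logistic regression in scikit-learn; the reviews and labels below are made up purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up training set: review text -> sentiment label.
reviews = [
    "a wonderful, heartfelt film",
    "absolutely loved every minute",
    "dull plot and terrible acting",
    "a complete waste of time",
]
labels = ["positive", "positive", "negative", "negative"]

# Bag-of-words features feeding a linear classifier.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(reviews, labels)

print(model.predict(["loved the plot", "terrible film"]))
```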
Categorizing Words: POS Tagging
A common example of categorizing words is part-of-speech (POS) tagging.
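A short POS-tagging sketch with spaCy, assuming the en_core_web_sm model has been downloaded (python -m spacy download en_core_web_sm):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the model is already downloaded
doc = nlp("Mary slapped the green witch.")
for token in doc:
    # token.pos_ is the coarse part-of-speech tag (NOUN, VERB, ADJ, ...).
    print(token.text, "->", token.pos_)
```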
Categorizing Spans: Chunking and Named Entity Recognition
We might want to identify the noun phrases (NP) and verb phrases (VP) in text. This is called chunking or shallow parsing. Shallow parsing aims to derive higher-order units composed of the grammatical atoms, like nouns, verbs, adjectives, and so on.
A named entity is a string mention of a real-world concept like a person, location, organization, drug name, and so on.
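A minimal sketch of both tasks with spaCy (again assuming en_core_web_sm is available): noun chunks approximate shallow-parsed NPs, and doc.ents holds the recognized named entities:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the model is already downloaded

# Chunking / shallow parsing: extract noun phrases.
doc = nlp("Mary slapped the green witch.")
for chunk in doc.noun_chunks:
    print(chunk.text, "->", chunk.label_)

# Named entity recognition.
doc = nlp("Tim Cook visited the Apple offices in Cupertino.")
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
```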
Structure of Sentences
Whereas shallow parsing identifies phrasal units, the task of identifying the relationship between them is called parsing.
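A minimal dependency-parsing sketch with spaCy (same model assumption); each token is printed with its grammatical relation and the head it attaches to:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the model is already downloaded
doc = nlp("Mary slapped the green witch.")
for token in doc:
    # token.dep_ is the dependency relation linking the token to its head.
    print(f"{token.text:>8} --{token.dep_}--> {token.head.text}")
```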