What Is Natural Language Processing (NLP)?


 Natural Language Processing, or NLP for short, is broadly defined as the automatic manipulation of natural language, like speech and text, by software.

Where NLP Stands


Corpora, Tokens, and Types

All NLP methods, be they classic or modern, begin with a text dataset, also called a corpus (plural: corpora).

A corpus usually contains raw text and any metadata associated with the text. The raw text is a sequence of characters (bytes), but it is usually more useful to group those characters into contiguous units called tokens.

The process of breaking a text down into tokens is called tokenization.
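As a rough illustration of tokenization, the sketch below splits a sentence into word and punctuation tokens with a regular expression. The function name tokenize and the choice to lowercase the text are assumptions made for this example; real tokenizers handle contractions, URLs, and other languages far more carefully.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches a single character that is neither a word
    # character nor whitespace, i.e. a punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = tokenize("Mary slapped the green witch.")
print(tokens)  # ['mary', 'slapped', 'the', 'green', 'witch', '.']
```

Note that the final period becomes a token of its own, which is a design choice: some tasks keep punctuation, others drop it.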

Types are unique tokens present in a corpus. The set of all types in a corpus is its vocabulary.
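The difference between tokens and types is easy to see by counting both. The sketch below uses a made-up two-document corpus and plain whitespace splitting as a stand-in for a real tokenizer; the helper name count_types is hypothetical.

```python
from collections import Counter

def count_types(corpus):
    """Return a Counter mapping each type to its number of occurrences."""
    # Whitespace splitting is a simplification used only for this example.
    return Counter(tok for doc in corpus for tok in doc.split())

corpus = ["the cat sat on the mat", "the dog sat"]  # toy corpus, invented here
counts = count_types(corpus)
vocabulary = set(counts)

print(sum(counts.values()))  # 9 tokens in total
print(len(vocabulary))       # 6 types: the, cat, sat, on, mat, dog
```

The token "the" occurs three times but contributes only one type to the vocabulary.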

In machine learning parlance, the text along with its metadata is called an instance or data point. The corpus, a collection of instances, is also known as a dataset.

Feature Engineering

The process of understanding the linguistics of a language and applying it to solving NLP problems is called feature engineering.

Unigrams, Bigrams, Trigrams, …, N-grams

N-grams are fixed-length (n) sequences of consecutive tokens occurring in the text. A unigram has one token, a bigram two, and a trigram three.

Generating n-grams from a text is straightforward enough.
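One way to generate n-grams is to slide a window of width n over the token list. The following is a minimal sketch; the function name n_grams and the example tokens are chosen here for illustration.

```python
def n_grams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # There are len(tokens) - n + 1 window positions; if n is larger
    # than the token list, the range is empty and so is the result.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["mary", "slapped", "the", "green", "witch"]
print(n_grams(tokens, 2))
# [('mary', 'slapped'), ('slapped', 'the'), ('the', 'green'), ('green', 'witch')]
print(len(n_grams(tokens, 3)))  # 3 trigrams
```

Calling the function with n=1 returns the unigrams, one single-element tuple per token.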


Lemmas and Stems

Lemmas are root forms of words.

Consider the verb fly. It can be inflected into many different words (flew, flies, flown, flying, and so on), and fly is the lemma for all of these seemingly different words. Reducing tokens to their lemmas is called lemmatization.
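At its simplest, lemmatization can be pictured as a dictionary lookup from inflected forms to lemmas. The table below is hand-written for this example and covers only the verb fly; practical lemmatizers rely on much larger dictionaries and on morphological analysis, so treat this purely as a sketch of the idea.

```python
# Hypothetical lookup table, written by hand for this example only.
LEMMA_TABLE = {
    "flew": "fly",
    "flies": "fly",
    "flown": "fly",
    "flying": "fly",
}

def lemmatize(token):
    """Return the lemma of a token, or the lowercased token if it is unknown."""
    token = token.lower()
    return LEMMA_TABLE.get(token, token)

print([lemmatize(w) for w in ["Flew", "flies", "flown", "fly"]])
# ['fly', 'fly', 'fly', 'fly']
```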

 

Stemming is the poor man's lemmatization.

It uses handcrafted rules to strip the endings of words, reducing them to a common form called a stem.
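To make the idea of handcrafted rules concrete, here is a deliberately naive suffix-stripping stemmer. The suffix list and the minimum stem length of three characters are assumptions invented for this sketch; it is not the Porter or Snowball algorithm, which use far more elaborate rules.

```python
# Hypothetical rule list: suffixes are tried in this order.
SUFFIXES = ("ing", "ed", "es", "s")

def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word  # no rule applied

print([stem(w) for w in ["jumping", "jumped", "jumps", "flies"]])
# ['jump', 'jump', 'jump', 'fli']
```

The last result shows why stemming is the cruder technique: a stem such as "fli" need not be a real word, whereas a lemmatizer would return "fly".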


Categorizing Sentences and Documents


Supervised document classification problems include:
Assigning topic labels,
Predicting the sentiment of reviews,
Filtering spam emails,
Identifying the language of a text, and
Triaging email
 

Categorizing Words: POS Tagging

A common example of categorizing words is part-of-speech (POS) tagging, which labels each token with its grammatical category, such as noun, verb, or adjective.

Categorizing Spans: Chunking and Named Entity Recognition

We might want to identify the noun phrases (NP) and verb phrases (VP) in text. This is called chunking or shallow parsing. Shallow parsing aims to derive higher-order units composed of the grammatical atoms, like nouns, verbs, adjectives, and so on.
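A toy noun-phrase chunker shows how higher-order units can be built from POS-tagged tokens. The sketch below assumes the tokens have already been tagged (the Penn Treebank style tags here were assigned by hand) and uses one invented pattern: an optional determiner, any number of adjectives, then one or more nouns. Real chunkers are usually learned from annotated data rather than written as a single rule.

```python
NOUN_TAGS = {"NN", "NNS", "NNP"}

def noun_phrase_chunks(tagged):
    """Find spans matching: optional DT, any number of JJ, one or more nouns."""
    chunks = []
    i = 0
    while i < len(tagged):
        j = i
        if tagged[j][1] == "DT":  # optional determiner
            j += 1
        while j < len(tagged) and tagged[j][1] == "JJ":  # adjectives
            j += 1
        k = j
        while k < len(tagged) and tagged[k][1] in NOUN_TAGS:  # nouns
            k += 1
        if k > j:  # at least one noun, so tokens i..k-1 form a chunk
            chunks.append(" ".join(word for word, _ in tagged[i:k]))
            i = k
        else:
            i += 1  # no noun phrase starts here; move on
    return chunks

# Hand-tagged example sentence (tags are an assumption of this sketch).
tagged = [("Mary", "NNP"), ("slapped", "VBD"), ("the", "DT"),
          ("green", "JJ"), ("witch", "NN"), (".", ".")]
print(noun_phrase_chunks(tagged))  # ['Mary', 'the green witch']
```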

A named entity is a string mention of a real-world concept, such as a person, location, organization, or drug name.

Structure of Sentences

Whereas shallow parsing identifies phrasal units, the task of identifying the relationship between them is called parsing.
