What is a Naive Bayes Classifier?
The Naive Bayes Classifier is an algorithm based on Bayes' theorem. It is simple and easy to use, and it can handle both continuous and discrete data. It assumes that all the features in the dataset are independent of each other.
Bayes' theorem is used to find the probability of a hypothesis H given the evidence E: P(H|E) = P(E|H) * P(H) / P(E).
Why Naive Bayes?
The assumption that all the features are independent makes the Naive Bayes algorithm very fast compared to more complicated algorithms.
It can be used on high-dimensional data, for example for text classification and email spam detection.
Disadvantage of Naive Bayes:
Zero-probability problem: when we deal with test data, a word that appears in the test set might not appear in the training set. In such situations, the estimated probability becomes 0.
This can be handled by using different smoothing techniques.
Let's get into the implementation.
First we install the opendatasets library, which is used to download datasets hosted on other websites such as Kaggle. Then we download the dataset; this fetches all the files that belong to that particular dataset.
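A minimal sketch of these two steps, assuming a notebook environment where pip is available; opendatasets will ask for your Kaggle username and API key when it runs:

```python
# Install the opendatasets library once per environment.
# In a notebook cell this is typically: !pip install opendatasets
import opendatasets as od

# Download every file of the Kaggle dataset into the current working directory.
# The URL is the dataset page linked later in this article; opendatasets will
# prompt for your Kaggle credentials the first time it runs.
od.download("https://www.kaggle.com/datasets/rmisra/news-category-dataset")
```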
To access the downloaded files in Google Colab, use drive.mount.
drive.mount gives Colab access to the files stored on your Google Drive.
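A sketch of the usual Colab mount call:

```python
# Mount Google Drive inside the Colab runtime so its files appear
# under /content/drive.
from google.colab import drive

drive.mount('/content/drive')
```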
Now, let's import all the libraries required for text preprocessing.
After importing the libraries, we load the data from the downloaded file with pd.read_json(). The pandas library provides the read_json() function to read data from a JSON file; make sure you use read_json only if the data is actually in JSON format.
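A sketch of the imports and the loading step; the file path is an assumption about where the download landed, and lines=True is needed because this dataset is newline-delimited JSON (one record per line):

```python
import pandas as pd

# The path below is an assumption about where opendatasets placed the file.
data = pd.read_json(
    "news-category-dataset/News_Category_Dataset_v2.json",
    lines=True,
)

print(data.shape)
print(data.head())
```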
The loaded data is displayed below:
Let's define a function called remove_punctuation.
string.punctuation contains all the punctuation characters as a single string.
punctuationfree keeps only the characters that are not in string.punctuation.
The remove_punctuation function is then applied to data['headline'] with the help of a lambda: it strips every character that appears in string.punctuation and stores the cleaned headlines in a new column.
Have a glance at the headline_prep column.
We then apply the lower() function to convert all the values in the headline_prep column to lower case.
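A sketch of the punctuation-removal and lower-casing steps; headline_prep is an assumed name for the new column:

```python
import string

def remove_punctuation(text):
    # Keep only the characters that are not in string.punctuation.
    punctuationfree = "".join(ch for ch in text if ch not in string.punctuation)
    return punctuationfree

# Apply the function to every headline and store the result in a new column
# (the column name headline_prep is an assumption).
data['headline_prep'] = data['headline'].apply(lambda x: remove_punctuation(x))

# Convert everything to lower case.
data['headline_prep'] = data['headline_prep'].apply(lambda x: x.lower())

data['headline_prep'].head()
```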
STOPWORDS
Stopwords are the most common words in any natural language; they do not add much value to the meaning of a document.
Generally, the most common words used in text are "the", "is", "in", "for", "where", "when", "to", "at", etc.
eg: "There is a pen on the table"
There, pen, table
KEYWORDS
is, a , on, the
STOPWORDS
Pros:
- The time to train the model decreases.
- The dataset size decreases.
- Performance improves, since fewer and more meaningful tokens are left (which can increase classification accuracy).
There are different methods to remove "stopwords"
Method1: Stopwords using NLTK
Method2: Stopwords using Spacy
Method3: Stopwords using Gensim
....and many more methods.
In our code we are going to use NLTK. NLTK has stopword lists stored for 16 different languages.
The code below shows how to remove stopwords using the nltk library.
First we select the language for the stopword list (as mentioned earlier, nltk supports many languages).
Then, with the help of a lambda, we check for each word whether it is present in the stopword list or not, and keep only the words that are not.
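A sketch of the stopword-removal step with NLTK, reusing the assumed headline_prep column from earlier:

```python
import nltk
from nltk.corpus import stopwords

# The stopword lists have to be downloaded once.
nltk.download('stopwords')

# Pick the language; NLTK ships stopword lists for several languages.
stop_words = set(stopwords.words('english'))

# Keep only the words that are NOT in the stopword list.
data['headline_prep'] = data['headline_prep'].apply(
    lambda text: " ".join(word for word in text.split() if word not in stop_words)
)
```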
TOKENIZATION
Tokenization breaks text into smaller units; each of these small units is called a token.
eg: "Natural Language Processing"
['Natural ','Language' ,'Processing']
- Tokens can be words, numbers, or punctuation marks.
- Smaller units are created by locating "word boundaries".
Word boundaries are the ending point of one word and the beginning of the next word.
These tokens are considered the first step towards stemming and lemmatization.
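As a small illustration, here is how the example above could be tokenized with NLTK's WhitespaceTokenizer (the same tokenizer used later for lemmatization):

```python
from nltk.tokenize import WhitespaceTokenizer

tokenizer = WhitespaceTokenizer()

# Split the sentence at word boundaries (here: whitespace).
tokens = tokenizer.tokenize("Natural Language Processing")
print(tokens)   # ['Natural', 'Language', 'Processing']
```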
DATASET:
This dataset contains around 200k news headlines from the year 2012 to 2018 obtained from HuffPost. The model trained on this dataset could be used to identify tags for untracked news articles or to identify the type of language used in different news articles.
Categories and corresponding article counts are as follows:
- POLITICS: 32739
- WELLNESS: 17827
- ENTERTAINMENT: 16058
- TRAVEL: 9887
- STYLE & BEAUTY: 9649
- PARENTING: 8677
- HEALTHY LIVING: 6694
- QUEER VOICES: 6314
- FOOD & DRINK: 6226
- BUSINESS: 5937
- COMEDY: 5175
- SPORTS: 4884
- BLACK VOICES: 4528
- HOME & LIVING: 4195
- PARENTS: 3955
- THE WORLDPOST: 3664
- WEDDINGS: 3651
- WOMEN: 3490
- IMPACT: 3459
- DIVORCE: 3426
- CRIME: 3405
- MEDIA: 2815
- WEIRD NEWS: 2670
- GREEN: 2622
- WORLDPOST: 2579
- RELIGION: 2556
- STYLE: 2254
- SCIENCE: 2178
- WORLD NEWS: 2177
- TASTE: 2096
- TECH: 2082
- MONEY: 1707
- ARTS: 1509
- FIFTY: 1401
- GOOD NEWS: 1398
- ARTS & CULTURE: 1339
- ENVIRONMENT: 1323
- COLLEGE: 1144
- LATINO VOICES: 1129
- CULTURE & ARTS: 1030
- EDUCATION: 1004
THE DATASET IS PROVIDED BY KAGGLE: https://www.kaggle.com/datasets/rmisra/news-category-dataset
About this file
The file contains 202,372 records. Each JSON record contains the following attributes:
- category: the category the article belongs to
- headline: the headline of the article
- authors: the person who authored the article
- link: link to the post
- short_description: a short description of the article
- date: the date the article was published
LET'S GET INTO THE CODE...
- First, let's extract the data from the downloaded dataset by running the steps described above: install opendatasets, download the dataset files, and load them with read_json.
TOKENIZATION
Tokenization is the process of splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms.
Let's perform lemmatization on the text. Lemmatization finds the root form (lemma) of a given word in NLP.
Example: the lemma of the words "reading", "reads", and "read" is "read".
We use WordNetLemmatizer in order to perform lemmatization.
- To break the text into individual tokens, we use the WhitespaceTokenizer from nltk.
- We define a function named lemmatize_text() and apply it to data['headline'].
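A sketch of this lemmatization step under the same column-name assumptions as before; the WordNet data has to be downloaded once:

```python
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import WhitespaceTokenizer

nltk.download('wordnet')
nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()
tokenizer = WhitespaceTokenizer()

def lemmatize_text(text):
    # Tokenize on whitespace, lemmatize each token, and rejoin into a string.
    return " ".join(lemmatizer.lemmatize(word) for word in tokenizer.tokenize(text))

# Column name carried over from the earlier (assumed) preprocessing steps.
data['headline_prep'] = data['headline_prep'].apply(lemmatize_text)
```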
This is how the text looks after tokenization and stopword removal.
Label Encoder
The label encoder encodes target labels with values between 0 and n_classes - 1. This approach is very simple: it converts each distinct value in a column to a number.
Example: if a column has a few categories such as Tall, Medium, and Short, the label encoder assigns each unique value its own specific number.
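A sketch using scikit-learn's LabelEncoder; category_id is an assumed name for the encoded column:

```python
from sklearn.preprocessing import LabelEncoder

# Toy illustration: unique values get integer ids, assigned in sorted order.
toy = LabelEncoder()
print(toy.fit_transform(['Tall', 'Medium', 'Short']))  # -> [2 0 1]

# Encode the news categories into integer labels.
encoder = LabelEncoder()
data['category_id'] = encoder.fit_transform(data['category'])
```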
Now we divide the data into train, dev, and test sets using train_test_split.
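train_test_split only produces a two-way split, so a sketch of a train/dev/test split calls it twice; the 80/10/10 proportions and variable names are assumptions:

```python
from sklearn.model_selection import train_test_split

X = data['headline_prep']   # preprocessed headlines (assumed column name)
y = data['category_id']     # encoded labels (assumed column name)

# First carve off 20% as a temporary hold-out set...
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.2, random_state=42)

# ...then split the hold-out set in half into dev and test.
X_dev, X_test, y_dev, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42)
```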
After successfully splitting the preprocessed data, let's find the frequency of each word appearing in the training set. We store these counts in a dictionary: the word count tells you how many times a specific word occurs in a particular category. We use a defaultdict so that a key which does not exist yet is automatically initialised to a default value.
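A sketch of the per-category word counts with defaultdict, using the training split from above:

```python
from collections import defaultdict

# word_counts[category][word] -> number of times `word` appears in the
# training headlines of that category; missing keys default to 0.
word_counts = defaultdict(lambda: defaultdict(int))

for headline, label in zip(X_train, y_train):
    for word in headline.split():
        word_counts[label][word] += 1
```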
LAPLACE SMOOTHING:
Smoothing is applied when the estimated probability of a word would otherwise be zero. We add 1 to every word count, so that no word ends up with a probability of zero: P(word | category) = (count(word, category) + 1) / (total words in category + vocabulary size).
We define the group_by_label, fit, and predict functions.
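The original implementation of these functions is not shown here, so the sketch below is one possible version under the assumptions above: group_by_label groups the training headlines by category, fit computes class priors and per-category word counts, and predict scores each category with log-probabilities and Laplace (+1) smoothing.

```python
import math
from collections import defaultdict

class NaiveBayesText:
    """A minimal multinomial Naive Bayes for word-count features (a sketch)."""

    def group_by_label(self, texts, labels):
        # Collect all training texts that share the same label.
        grouped = defaultdict(list)
        for text, label in zip(texts, labels):
            grouped[label].append(text)
        return grouped

    def fit(self, texts, labels):
        grouped = self.group_by_label(texts, labels)
        n_docs = len(texts)

        self.priors = {}        # P(category)
        self.word_counts = {}   # per-category word -> count
        self.total_words = {}   # total word count per category
        self.vocab = set()

        for label, docs in grouped.items():
            self.priors[label] = len(docs) / n_docs
            counts = defaultdict(int)
            for doc in docs:
                for word in doc.split():
                    counts[word] += 1
                    self.vocab.add(word)
            self.word_counts[label] = counts
            self.total_words[label] = sum(counts.values())
        return self

    def predict(self, texts):
        vocab_size = len(self.vocab)
        predictions = []
        for text in texts:
            best_label, best_score = None, float('-inf')
            for label in self.priors:
                # Start from the log prior, then add word log-likelihoods.
                score = math.log(self.priors[label])
                for word in text.split():
                    # Laplace (+1) smoothing: unseen words never give zero probability.
                    count = self.word_counts[label].get(word, 0)
                    score += math.log((count + 1) / (self.total_words[label] + vocab_size))
                if score > best_score:
                    best_label, best_score = label, score
            predictions.append(best_label)
        return predictions


# Usage sketch with the splits from the previous step:
# model = NaiveBayesText().fit(list(X_train), list(y_train))
# dev_pred = model.predict(list(X_dev))
# accuracy = sum(p == t for p, t in zip(dev_pred, y_dev)) / len(y_dev)
```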
References:
Contributions:
- Understanding the concept and implementation of Naive Bayes.
- Documented the code based on my understanding of it.