What is a Naive Bayes Classifier?
The Naive Bayes Classifier is an algorithm based on Bayes' theorem. It is simple and easy to use, and it can handle both continuous and discrete data. It assumes that all the features in the dataset are independent of each other.
Bayes' theorem is used to find the probability of a hypothesis H given the evidence E: P(H|E) = P(E|H) * P(H) / P(E).
Why Naive Bayes?
The assumption that all the features are independent makes the Naive Bayes algorithm very fast compared to more complicated algorithms.
It can be used on high-dimensional data, for example for text classification and email spam detection.
Disadvantage of Naive Bayes:
Zero-probability problem: when we deal with test data, a word that appears in the test set might not appear in the training set. In such situations, the estimated probability becomes 0.
This can be handled by using different smoothing techniques.
Let's get into the implementation.
First we install the opendatasets library, which is used to download datasets hosted on other websites such as Kaggle. Then we download the dataset; this fetches all the files that belong to that particular dataset.
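A minimal sketch of these two steps, assuming a notebook environment where pip is available; opendatasets will ask for your Kaggle username and API key when it runs:

```python
# Install the opendatasets library once per environment.
# In a notebook cell this is typically: !pip install opendatasets
import opendatasets as od

# Download every file of the Kaggle dataset into the current working directory.
# The URL is the dataset page linked later in this article; opendatasets will
# prompt for your Kaggle credentials the first time it runs.
od.download("https://www.kaggle.com/datasets/rmisra/news-category-dataset")
```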
To access the downloaded files in Google Colab, use drive.mount.
drive.mount gives Colab access to the files stored on your Google Drive.
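A sketch of the usual Colab mount call:

```python
# Mount Google Drive inside the Colab runtime so its files appear
# under /content/drive.
from google.colab import drive

drive.mount('/content/drive')
```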
Now, let's import all the libraries required for text preprocessing.
After importing the libraries, we load the data from the downloaded file with pd.read_json(). The pandas library provides the read_json() function to read data from a JSON file; make sure you use read_json only if the data is actually in JSON format.
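A sketch of the imports and the loading step; the file path is an assumption about where the download landed, and lines=True is needed because this dataset is newline-delimited JSON (one record per line):

```python
import pandas as pd

# The path below is an assumption about where opendatasets placed the file.
data = pd.read_json(
    "news-category-dataset/News_Category_Dataset_v2.json",
    lines=True,
)

print(data.shape)
print(data.head())
```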
The loaded data is displayed below:
Let's define a function called remove_punctuation.
string.punctuation contains all the punctuation characters as a single string.
punctuationfree keeps only the characters that are not in string.punctuation.
The remove_punctuation function is then applied to data['headline'] with the help of a lambda: it strips every character that appears in string.punctuation and stores the cleaned headlines in a new column.
Have a glance at the headline_prep column.
We then apply the lower() function to convert all the values in the headline_prep column to lower case.
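A sketch of the punctuation-removal and lower-casing steps; headline_prep is an assumed name for the new column:

```python
import string

def remove_punctuation(text):
    # Keep only the characters that are not in string.punctuation.
    punctuationfree = "".join(ch for ch in text if ch not in string.punctuation)
    return punctuationfree

# Apply the function to every headline and store the result in a new column
# (the column name headline_prep is an assumption).
data['headline_prep'] = data['headline'].apply(lambda x: remove_punctuation(x))

# Convert everything to lower case.
data['headline_prep'] = data['headline_prep'].apply(lambda x: x.lower())

data['headline_prep'].head()
```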
STOPWORDS
Stopwords are the most common words in any natural language; they do not add much value to the meaning of a document.
Generally, the most common words used in text are "the", "is", "in", "for", "where", "when", "to", "at", etc.
eg: "There is a pen on the table"
There, pen, table
KEYWORDS
is, a , on, the
STOPWORDS
Pros:
- The time to train the model decreases.
- The dataset size decreases.
- Performance improves, since fewer and more meaningful tokens are left (which can increase classification accuracy).
There are different methods to remove "stopwords"
Method1: Stopwords using NLTK
Method2: Stopwords using Spacy
Method3: Stopwords using Gensim
....and many more methods.
In our code we are going to use NLTK. NLTK has stopword lists stored for 16 different languages.
The code below shows how to remove stopwords using the nltk library.
First we select the language for the stopword list (as mentioned earlier, nltk supports many languages).
Then, with the help of a lambda, we check for each word whether it is present in the stopword list or not, and keep only the words that are not.
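A sketch of the stopword-removal step with NLTK, reusing the assumed headline_prep column from earlier:

```python
import nltk
from nltk.corpus import stopwords

# The stopword lists have to be downloaded once.
nltk.download('stopwords')

# Pick the language; NLTK ships stopword lists for several languages.
stop_words = set(stopwords.words('english'))

# Keep only the words that are NOT in the stopword list.
data['headline_prep'] = data['headline_prep'].apply(
    lambda text: " ".join(word for word in text.split() if word not in stop_words)
)
```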
TOKENIZATION
Tokenization breaks text into smaller units; each of these small units is called a token.
eg: "Natural Language Processing"
['Natural ','Language' ,'Processing']
- Tokens can be words, numbers, or punctuation marks.
- Smaller units are created by locating "word boundaries".
Word boundaries are the ending point of one word and the beginning of the next word.
These tokens are considered the first step towards stemming and lemmatization.
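As a small illustration, here is how the example above could be tokenized with NLTK's WhitespaceTokenizer (the same tokenizer used later for lemmatization):

```python
from nltk.tokenize import WhitespaceTokenizer

tokenizer = WhitespaceTokenizer()

# Split the sentence at word boundaries (here: whitespace).
tokens = tokenizer.tokenize("Natural Language Processing")
print(tokens)   # ['Natural', 'Language', 'Processing']
```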
DATASET:
This dataset contains around 200k news headlines from the year 2012 to 2018 obtained from HuffPost. The model trained on this dataset could be used to identify tags for untracked news articles or to identify the type of language used in different news articles.
Categories and corresponding article counts are as follows:
- POLITICS: 32739
- WELLNESS: 17827
- ENTERTAINMENT: 16058
- TRAVEL: 9887
- STYLE & BEAUTY: 9649
- PARENTING: 8677
- HEALTHY LIVING: 6694
- QUEER VOICES: 6314
- FOOD & DRINK: 6226
- BUSINESS: 5937
- COMEDY: 5175
- SPORTS: 4884
- BLACK VOICES: 4528
- HOME & LIVING: 4195
- PARENTS: 3955
- THE WORLDPOST: 3664
- WEDDINGS: 3651
- WOMEN: 3490
- IMPACT: 3459
- DIVORCE: 3426
- CRIME: 3405
- MEDIA: 2815
- WEIRD NEWS: 2670
- GREEN: 2622
- WORLDPOST: 2579
- RELIGION: 2556
- STYLE: 2254
- SCIENCE: 2178
- WORLD NEWS: 2177
- TASTE: 2096
- TECH: 2082
- MONEY: 1707
- ARTS: 1509
- FIFTY: 1401
- GOOD NEWS: 1398
- ARTS & CULTURE: 1339
- ENVIRONMENT: 1323
- COLLEGE: 1144
- LATINO VOICES: 1129
- CULTURE & ARTS: 1030
- EDUCATION: 1004
THE DATASET IS PROVIDED BY KAGGLE: https://www.kaggle.com/datasets/rmisra/news-category-dataset
About this file
The file contains 202,372 records. Each JSON record contains the following attributes:
- category: the category the article belongs to
- headline: the headline of the article
- authors: the person who authored the article
- link: link to the post
- short_description: a short description of the article
- date: the date the article was published
LET'S GET INTO THE CODE...
- First, let's extract the data from the downloaded dataset by running the steps described above: install opendatasets, download the dataset files, and load them with read_json.
TOKENIZATION
Tokenization is the process of splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms.
Let's perform lemmatization on the text. Lemmatization finds the root form (lemma) of a given word in NLP.
Example: the lemma of the words "reading", "reads", and "read" is "read".
We use WordNetLemmatizer in order to perform lemmatization.
- To break the text into individual tokens, we use the WhitespaceTokenizer from nltk.
- We define a function named lemmatize_text() and apply it to data['headline'].
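A sketch of this lemmatization step under the same column-name assumptions as before; the WordNet data has to be downloaded once:

```python
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import WhitespaceTokenizer

nltk.download('wordnet')
nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()
tokenizer = WhitespaceTokenizer()

def lemmatize_text(text):
    # Tokenize on whitespace, lemmatize each token, and rejoin into a string.
    return " ".join(lemmatizer.lemmatize(word) for word in tokenizer.tokenize(text))

# Column name carried over from the earlier (assumed) preprocessing steps.
data['headline_prep'] = data['headline_prep'].apply(lemmatize_text)
```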
This is how the text looks after tokenization and stopword removal.
Label Encoder
The label encoder encodes target labels with values between 0 and n_classes - 1. This approach is very simple: it converts each distinct value in a column to a number.
Example: if a column has a few categories such as Tall, Medium, and Short, the label encoder assigns each unique value its own specific number.
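A sketch using scikit-learn's LabelEncoder; category_id is an assumed name for the encoded column:

```python
from sklearn.preprocessing import LabelEncoder

# Toy illustration: unique values get integer ids, assigned in sorted order.
toy = LabelEncoder()
print(toy.fit_transform(['Tall', 'Medium', 'Short']))  # -> [2 0 1]

# Encode the news categories into integer labels.
encoder = LabelEncoder()
data['category_id'] = encoder.fit_transform(data['category'])
```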
Now we divide the data into train, dev, and test sets using train_test_split.
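train_test_split only produces a two-way split, so a sketch of a train/dev/test split calls it twice; the 80/10/10 proportions and variable names are assumptions:

```python
from sklearn.model_selection import train_test_split

X = data['headline_prep']   # preprocessed headlines (assumed column name)
y = data['category_id']     # encoded labels (assumed column name)

# First carve off 20% as a temporary hold-out set...
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.2, random_state=42)

# ...then split the hold-out set in half into dev and test.
X_dev, X_test, y_dev, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42)
```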
After successfully splitting the preprocessed data, let's find the frequency of each word appearing in the training set. We store these counts in a dictionary: the word count tells you how many times a specific word occurs in a particular category. We use a defaultdict so that a key which does not exist yet is automatically initialised to a default value.
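A sketch of the per-category word counts with defaultdict, using the training split from above:

```python
from collections import defaultdict

# word_counts[category][word] -> number of times `word` appears in the
# training headlines of that category; missing keys default to 0.
word_counts = defaultdict(lambda: defaultdict(int))

for headline, label in zip(X_train, y_train):
    for word in headline.split():
        word_counts[label][word] += 1
```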
LAPLACE SMOOTHING:
Smoothing is applied when the estimated probability of a word would otherwise be zero. We add 1 to every word count, so that no word ends up with a probability of zero: P(word | category) = (count(word, category) + 1) / (total words in category + vocabulary size).
We define the group_by_label, fit, and predict functions.
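The original implementation of these functions is not shown here, so the sketch below is one possible version under the assumptions above: group_by_label groups the training headlines by category, fit computes class priors and per-category word counts, and predict scores each category with log-probabilities and Laplace (+1) smoothing.

```python
import math
from collections import defaultdict

class NaiveBayesText:
    """A minimal multinomial Naive Bayes for word-count features (a sketch)."""

    def group_by_label(self, texts, labels):
        # Collect all training texts that share the same label.
        grouped = defaultdict(list)
        for text, label in zip(texts, labels):
            grouped[label].append(text)
        return grouped

    def fit(self, texts, labels):
        grouped = self.group_by_label(texts, labels)
        n_docs = len(texts)

        self.priors = {}        # P(category)
        self.word_counts = {}   # per-category word -> count
        self.total_words = {}   # total word count per category
        self.vocab = set()

        for label, docs in grouped.items():
            self.priors[label] = len(docs) / n_docs
            counts = defaultdict(int)
            for doc in docs:
                for word in doc.split():
                    counts[word] += 1
                    self.vocab.add(word)
            self.word_counts[label] = counts
            self.total_words[label] = sum(counts.values())
        return self

    def predict(self, texts):
        vocab_size = len(self.vocab)
        predictions = []
        for text in texts:
            best_label, best_score = None, float('-inf')
            for label in self.priors:
                # Start from the log prior, then add word log-likelihoods.
                score = math.log(self.priors[label])
                for word in text.split():
                    # Laplace (+1) smoothing: unseen words never give zero probability.
                    count = self.word_counts[label].get(word, 0)
                    score += math.log((count + 1) / (self.total_words[label] + vocab_size))
                if score > best_score:
                    best_label, best_score = label, score
            predictions.append(best_label)
        return predictions


# Usage sketch with the splits from the previous step:
# model = NaiveBayesText().fit(list(X_train), list(y_train))
# dev_pred = model.predict(list(X_dev))
# accuracy = sum(p == t for p, t in zip(dev_pred, y_dev)) / len(y_dev)
```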
References:
Contributions:
- Understanding the concept and implementation of Naive Bayes.
- Documented the code based on my understanding of it.