
Naive Bayes Text Classifier using the News Category Dataset


What is a Naive Bayes Classifier?

The Naive Bayes classifier is an algorithm based on Bayes' theorem. It is simple and easy to use, and it can handle both continuous and discrete data. It assumes that the variables in the dataset are not correlated with each other.


Bayes' theorem is used to find the probability of a hypothesis given the evidence.

P(A|B) = P(B|A) · P(A) / P(B), where A is the hypothesis and B is the evidence.

Why Naive Bayes?

The assumption that all features are independent makes the Naive Bayes algorithm very fast compared to more complicated algorithms.
It works well on high-dimensional data, for example text classification and email spam detection.

Disadvantage of Naive Bayes:
ZERO probability problem: when we deal with test data, we may encounter a word that appears in the test set but not in the training set. In such situations, the estimated probability becomes 0.
This can be handled by using different smoothing techniques (for example add-one / Laplace smoothing, used later in this post).

Let's get into the implementation:

Screenshot (1).png

The above command installs the opendatasets library, which is used to download datasets hosted on other websites (such as Kaggle).

Try downloading with the commands below; they fetch all the files present in that particular dataset.

To access the downloaded files in Google Colab, use drive.mount.
drive.mount gives the notebook access to the files present on your Google Drive.

Screenshot (231).png
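Since the commands themselves live in the screenshots, here is a minimal sketch of what this step might look like in a Colab cell (the dataset URL is the Kaggle link given later in this post; the Drive path is the Colab default):

# install the opendatasets library
!pip install opendatasets

import opendatasets as od

# download the News Category dataset from Kaggle
# (you will be prompted for your Kaggle username and API key)
od.download("https://www.kaggle.com/datasets/rmisra/news-category-dataset")

# mount Google Drive so files stored there are accessible in Colab
from google.colab import drive
drive.mount('/content/drive')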

Now, let's import all the libraries required for text preprocessing.

Screenshot (233).png
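The exact imports are in the screenshot; a sketch of the libraries this kind of preprocessing typically needs looks like this:

import string
from collections import defaultdict

import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import WhitespaceTokenizer
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# download the NLTK resources used later
nltk.download('stopwords')
nltk.download('wordnet')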

After importing the libraries, we use pd.read_json to extract the data from the files we downloaded. The pandas library provides the read_json() function to read all the records from a JSON file. Make sure you use read_json only if the given data is in JSON format.
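A minimal sketch of this step, assuming the downloaded file is the JSON-lines file that ships with the Kaggle dataset (the exact file name and path may differ on your machine):

# each line of the file is one JSON record, hence lines=True
data = pd.read_json('news-category-dataset/News_Category_Dataset_v2.json', lines=True)
data.head()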

The output is displayed below:

Screenshot (234).png
Screenshot (236).png

Let's define a function called remove_punctuation.
string.punctuation contains all the punctuation characters as a single string.
punctuationfree keeps only the characters that are not in string.punctuation.

The remove_punctuation function is then applied to data['headline'] with the help of a lambda.
It strips every punctuation character from the headlines and stores the cleaned text in a new column.
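A sketch of what remove_punctuation and its application could look like (the column name headline_prep is an assumption based on the text below):

def remove_punctuation(text):
    # keep only the characters that are not in string.punctuation
    punctuationfree = "".join(ch for ch in text if ch not in string.punctuation)
    return punctuationfree

data['headline_prep'] = data['headline'].apply(lambda x: remove_punctuation(x))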
 

Have a glance at the headline_prep column.
 

Screenshot (237).png

We then apply the lower function to convert all the values in the headline_prep column to lower case.
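A one-line sketch of the lower-casing step:

data['headline_prep'] = data['headline_prep'].apply(lambda x: x.lower())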
 

STOPWORDS
 

Stopwords are the most common words in any natural language. They do not add much value to the meaning of a document.
Generally the most common words used in text are "the", "is", "in", "for", "where", "when", "to", "at", etc.

eg: "There is a pen on the table"
           Keywords: There, pen, table
           Stopwords: is, a, on, the

Pros: 

  • The time to train the model also decreases

  • Dataset size decreases.

  • Helps to improve performance, as fewer but more meaningful tokens are left (which can increase classification accuracy).

There are different methods to remove stopwords:
Method 1: Stopwords using NLTK
Method 2: Stopwords using spaCy
Method 3: Stopwords using Gensim
...and many more.
In our code we are going to use NLTK. NLTK has stopword lists stored for 16 different languages.



 

Screenshot (239).png

The above code shows how to remove stopwords using the NLTK library.
First we select the language for the stopword list (as mentioned earlier, NLTK supports many languages).
Then, with the help of a lambda, we keep only the words that are not present in the stopword list.
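A sketch of this step with NLTK's English stopword list (column name again assumed):

stop_words = set(stopwords.words('english'))

# keep only the words that are not in the stopword list
data['headline_prep'] = data['headline_prep'].apply(
    lambda x: " ".join(word for word in x.split() if word not in stop_words))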

Screenshot (240).png

TOKENIZATION


Tokenization is the process of splitting a phrase, sentence, paragraph, or an entire text document into smaller units. Each of these smaller units is called a token.
eg: "Natural Language Processing"
         ['Natural', 'Language', 'Processing']

  • Tokens can be words, numbers, or punctuation marks.

  • Smaller units are created by locating word boundaries.
    Word boundaries: the ending point of a word and the beginning of the next word.
    These tokens are the first step towards stemming and lemmatization.
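A tiny illustration of the idea (nltk.word_tokenize is used here only for demonstration; the classifier code later tokenizes on whitespace):

nltk.download('punkt')
from nltk.tokenize import word_tokenize

print(word_tokenize("Natural Language Processing"))
# ['Natural', 'Language', 'Processing']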


 

DATASET:

This dataset contains around 200k news headlines from the year 2012 to 2018 obtained from HuffPost. The model trained on this dataset could be used to identify tags for untracked news articles or to identify the type of language used in different news articles.

Categories and corresponding article counts are as follows:

 

  • POLITICS: 32739

  • WELLNESS: 17827

  • ENTERTAINMENT: 16058

  • TRAVEL: 9887

  • STYLE & BEAUTY: 9649

  • PARENTING: 8677

  • HEALTHY LIVING: 6694

  • QUEER VOICES: 6314

  • FOOD & DRINK: 6226

  • BUSINESS: 5937

  • COMEDY: 5175

  • SPORTS: 4884

  • BLACK VOICES: 4528

  • HOME & LIVING: 4195

  • PARENTS: 3955

  • THE WORLDPOST: 3664

  • WEDDINGS: 3651

  • WOMEN: 3490

  • IMPACT: 3459

  • DIVORCE: 3426

  • CRIME: 3405

  • MEDIA: 2815

  • WEIRD NEWS: 2670

  • GREEN: 2622

  • WORLDPOST: 2579

  • RELIGION: 2556

  • STYLE: 2254

  • SCIENCE: 2178

  • WORLD NEWS: 2177

  • TASTE: 2096

  • TECH: 2082

  • MONEY: 1707

  • ARTS: 1509

  • FIFTY: 1401

  • GOOD NEWS: 1398

  • ARTS & CULTURE: 1339

  • ENVIRONMENT: 1323

  • COLLEGE: 1144

  • LATINO VOICES: 1129

  • CULTURE & ARTS: 1030

  • EDUCATION: 1004

​

THE DATASET IS PROVIDED BY KAGGLE: https://www.kaggle.com/datasets/rmisra/news-category-dataset

​

About this file

The file contains 202,372 records. Each JSON record contains the following attributes:

category: Category the article belongs to

headline: Headline of the article

authors: Person who authored the article

link: Link to the post

short_description: Short description of the article

date: Date the article was published

​

LET'S GET INTO THE CODE...

​

  • First, let's try to extract the data from the downloaded dataset.

​

Try running the commands shown in the screenshots above.

TOKENIZATION

Recall that tokenization splits a phrase, sentence, paragraph, or an entire text document into smaller units: individual words or terms.

Let's perform lemmatization on the text. Lemmatization finds the root form (lemma) of a given word.

Example: the lemma of the words "reading", "reads", and "read" is "read".

We use WordNetLemmatizer in order to perform lemmatization.

  • To break the text into individual tokens we use the WhitespaceTokenizer from nltk.

  • We define a function named lemmatize_text() and apply it to the headline column (a sketch of this step is given below).
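A sketch of this step (function and column names are assumptions; the actual code is in the screenshot below):

tokenizer = WhitespaceTokenizer()
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    # split on whitespace and reduce each token to its lemma
    return [lemmatizer.lemmatize(w) for w in tokenizer.tokenize(text)]

data['headline_prep'] = data['headline_prep'].apply(lemmatize_text)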


 

Screenshot (242).png

This is how the data looks after tokenization and stopword removal.

 

Screenshot (243).png

Label Encoder

The label encoder encodes target labels with values between 0 and n_classes-1. This approach is very simple: it converts each unique value in a column to a number.

Example:
            suppose there are a few categories such as Tall, Medium, Short.

    

 

download.png

Similarly, the label encoder takes each unique category value and assigns it a specific number.
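A sketch of the label-encoding step (the new column name is an assumption):

le = LabelEncoder()
data['category_encoded'] = le.fit_transform(data['category'])

# each unique category string is now mapped to an integer 0 .. n_classes-1
print(list(le.classes_)[:5])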

Now we divide the data into train, dev, and test sets using train_test_split.
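A sketch of a possible split (the exact proportions used in the screenshot may differ):

# hold out 20% as the test set, then carve a dev set out of the remainder
X_train, X_test, y_train, y_test = train_test_split(
    data['headline_prep'], data['category_encoded'], test_size=0.2, random_state=42)
X_train, X_dev, y_train, y_dev = train_test_split(
    X_train, y_train, test_size=0.1, random_state=42)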
 

Screenshot (244).png

After successfully splitting the preprocessed data into train and test sets, let's find the frequency of each word appearing in the training set. We store these counts in a dictionary.
 

Screenshot (246).png

The word count tells you how many times a specific word appears in a particular category.
We use a defaultdict so that a key that does not exist yet is automatically initialised with a default value.
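A sketch of counting word frequencies per category with defaultdict, assuming each entry of X_train is a list of tokens as produced by lemmatize_text:

# word_counts[label][word] = how often `word` appears in headlines of `label`
word_counts = defaultdict(lambda: defaultdict(int))
label_counts = defaultdict(int)

for tokens, label in zip(X_train, y_train):
    label_counts[label] += 1
    for word in tokens:
        word_counts[label][word] += 1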

Screenshot (247).png

LAPLACE SMOOTHING:
Smoothing is applied when the estimated probability of a word would otherwise be zero.
We add 1 to the count of every word, so that no word ends up with a probability of exactly zero.
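A sketch of the smoothed probability, reusing the word_counts dictionary from the previous sketch:

vocab = {w for counts in word_counts.values() for w in counts}

def smoothed_prob(word, label):
    # add-one (Laplace) smoothing: unseen words get a small non-zero probability
    total = sum(word_counts[label].values())
    return (word_counts[label].get(word, 0) + 1) / (total + len(vocab))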

Screenshot (248).png

We define the group_by_label, fit, and predict functions (a sketch of what these might look like is given below).
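The actual implementation is in the screenshots; the following is only a sketch of how group_by_label, fit, and predict could be structured for a multinomial Naive Bayes with Laplace smoothing, assuming X is a sequence of token lists and y the encoded labels:

import math

class NaiveBayesClassifier:
    def group_by_label(self, X, y):
        # collect the token lists belonging to each class
        grouped = defaultdict(list)
        for tokens, label in zip(X, y):
            grouped[label].append(tokens)
        return grouped

    def fit(self, X, y):
        grouped = self.group_by_label(X, y)
        n_docs = len(X)
        self.priors = {c: len(docs) / n_docs for c, docs in grouped.items()}
        self.word_counts = {c: defaultdict(int) for c in grouped}
        for c, docs in grouped.items():
            for tokens in docs:
                for w in tokens:
                    self.word_counts[c][w] += 1
        self.totals = {c: sum(counts.values()) for c, counts in self.word_counts.items()}
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, X):
        predictions = []
        for tokens in X:
            best_label, best_score = None, float('-inf')
            for c in self.priors:
                # log P(c) + sum over tokens of log P(w | c), with add-one smoothing
                score = math.log(self.priors[c])
                for w in tokens:
                    score += math.log((self.word_counts[c].get(w, 0) + 1)
                                      / (self.totals[c] + len(self.vocab)))
                if score > best_score:
                    best_label, best_score = c, score
            predictions.append(best_label)
        return predictions

# usage sketch
model = NaiveBayesClassifier().fit(X_train, y_train)
dev_predictions = model.predict(X_dev)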

Screenshot (252).png

References:

Contributions:

  • Understood the concept and implementation of Naive Bayes.

  • Documented the code based on my own understanding.
