
JOB PREDICTION

What exactly is Real/Fake Job Prediction about?
                There are many job postings out there, and many people are desperately searching for jobs. Scammers take advantage of this situation and target job seekers. They try to collect valuable personal information such as address, phone number, and birth date, and may even ask for money in return for a job.
                  The dataset is provided by Kaggle. It has several attributes describing job postings and is labelled so that each posting is marked as real or fake. There are about 18k job descriptions, of which around 800 are fake. Our main goal is to find out which postings are fraudulent.

The dataset is provided in CSV format with the following columns:

1. job_id: a unique ID for each job posting.
2. title: title of the job.
3. location: geographical location of the job.
4. department: Corporate department (eg: sales).
5. salary_range: salary range of that particular job.
6. company_profile: a brief company description.
7. description: the detail description of the job.
8. requirements: requirements related to job openings.
9. benefits: enlisted offered benefits by the employer.
10. telecommuting: True for telecommuting positions.
11. has_company_logo: True if the company logo is present.
12. has_questions: True if screening questions are present.
13. employment_type: Full-time, Part-time, Contract, etc.
14. required_experience: Executive, Entry-level, Intern, etc.
15. required_education: Doctorate, Master’s degree, Bachelor’s, etc.
16. industry: Automotive, IT, Health care, Real estate, etc.
17. function: Consulting, Engineering, Research, Sales, etc.
18. fraudulent (target variable): 1 if the posting is fake, else 0.

Project Goal:
The main goal is to build a classifier that identifies whether a job posting is real or fake.

Let's get started...

Data extraction:
Let's extract the dataset provided by Kaggle: https://www.kaggle.com/datasets/shivamb/real-or-fake-fake-jobposting-prediction
- Download the dataset using opendatasets.
- Import the basic packages needed.
- Store the data with the help of pd.read_csv, as sketched below.
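A minimal sketch of this step, assuming opendatasets and pandas are installed (the CSV file name inside the downloaded folder is an assumption based on the Kaggle dataset):

import opendatasets as od
import pandas as pd

# downloads into ./real-or-fake-fake-jobposting-prediction/ and
# prompts for your Kaggle username and API key on the first run
od.download("https://www.kaggle.com/datasets/shivamb/real-or-fake-fake-jobposting-prediction")

df = pd.read_csv("real-or-fake-fake-jobposting-prediction/fake_job_postings.csv")
print(df.shape)     # roughly 18k rows and 18 columns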

Screenshot (253).png
Screenshot (255).png
Screenshot (256).png
code snippet

Checking the null values for each column:
isnull(): returns a Boolean mask indicating whether each value is null.
sum(): counts the True values, i.e. the number of nulls per column.
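A quick sketch of this check, assuming the loaded DataFrame is called df:

df.isnull().sum()     # number of missing values in each column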

- Fill the null values with blanks.
- Extract the rows whose location contains the string "US".
- Split location into separate state and city columns, as sketched below.
- Most of the data is from US-based locations (roughly 60% of the postings).
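A rough sketch of these steps (column names follow the dataset; the splitting logic here is an assumption about how the location strings look, e.g. "US, NY, New York"):

df = df.fillna(" ")                                    # replace missing values with blanks

# keep only postings whose location mentions "US"
df = df[df["location"].str.contains("US", na=False)]

# split the location string into country / state / city columns
loc = df["location"].str.split(",", n=2, expand=True)
df["country"] = loc[0].str.strip()
df["state"] = loc[1].str.strip()
df["city"] = loc[2].str.strip()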

Screenshot (258).png

Let's check how many fraudulent and real postings this dataset now contains:
9868 are real jobs and 725 are fake jobs.
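For reference, the class balance can be checked with value_counts:

df["fraudulent"].value_counts()     # 0 -> real postings, 1 -> fake postings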

Screenshot (259).png

However, an interesting trend was noted for the Boolean variable telecommuting: when both of these Boolean variables are equal to zero, there is a 92% chance that the job posting is fraudulent.

Screenshot (260).png
Screenshot (261).png

Append all the required text attributes into a new column, "text",
and drop the attributes that are no longer needed, as sketched below.
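A sketch of this step, assuming the column list described in the final-dataset summary below ("ratio" is omitted here because its computation is not shown):

text_cols = ["title", "location", "company_profile", "description", "requirements",
             "benefits", "required_experience", "required_education", "industry", "function"]

# concatenate the text-bearing columns into a single field
df["text"] = df[text_cols].astype(str).agg(" ".join, axis=1)
df["character_count"] = df["text"].str.len()

# keep only the columns used for modeling
df = df[["telecommuting", "fraudulent", "text", "character_count"]]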

Screenshot (263).png

Frequency of characters in Real and Fake job Postings

Screenshot (264).png

The final dataset before modeling contains:
telecommuting
fraudulent: whether the posting is real (0) or fake (1)
ratio: the fake-to-real job ratio based on the location
text: the concatenation of title, location, company_profile, description, requirements, benefits, required_experience, required_education, industry, and function
character_count: the length of the combined text


Algorithms used in this project are:
1. Natural Language Processing
2. Naive Bayes Algorithm
3. SGD Classifier

Natural Language Processing: NLP helps computers understand human language.
Naive Bayes: Naive Bayes calculates class probabilities based on the probability of occurrence of each feature.
SGD Classifier: implements stochastic gradient descent learning and supports different loss functions and penalties for classification.

DATA PREPROCESSING:
steps:
1. Tokenization
2. To lower
3. Stopwords
4. Lemmatization


TOKENIZATION

Tokenization is the process of splitting a phrase, a sentence, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.
eg: "Natural Language Processing"
         ['Natural', 'Language', 'Processing']

  • Tokens can be words, numbers, or punctuation marks.

  • Smaller units are created by locating "word boundaries": the ending point of one word and the beginning of the next word.
    These tokens are the first step for stemming and lemmatization.
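A minimal tokenization sketch using NLTK's word_tokenize (the punkt tokenizer data is a one-time download):

import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")                          # tokenizer models (one-time download)

tokens = word_tokenize("Natural Language Processing")
print(tokens)                                   # ['Natural', 'Language', 'Processing']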

Stop words:


Stopwords are the most common words in any natural language and do not add much value to the document.
Generally the most common words used in text are "the", "is", "in", "for", "where", "when", "to", "at", etc.
eg: "There is a pen on the table"
Keywords: There, pen, table
Stopwords: is, a, on, the

Pros:

  • The time to train the model decreases.

  • The dataset size decreases.

  • It helps to improve performance, since fewer and more meaningful tokens are left (which can increase classification accuracy).

There are different methods to remove stopwords:
Method 1: stopwords using NLTK
Method 2: stopwords using spaCy
Method 3: stopwords using Gensim
...and many more methods.
In the current code we use NLTK, which has stopword lists stored for 16 different languages, as sketched below.
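A small sketch of stopword removal with NLTK, reusing the example sentence above:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("stopwords")                      # stopword lists (one-time download)
nltk.download("punkt")

stop_words = set(stopwords.words("english"))
tokens = word_tokenize("There is a pen on the table")

# note: this check is case-sensitive, so the capitalised "There" survives the filter
keywords = [w for w in tokens if w not in stop_words]
print(keywords)                                 # ['There', 'pen', 'table']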

 

Lemmatization:

Lemmatization groups together the different inflected forms of a word so they can be analysed as a single item (the word's lemma).
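A tiny sketch with NLTK's WordNetLemmatizer (the specific lemmatizer is an assumption, chosen because NLTK is used elsewhere in this project):

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")                        # lemma dictionary (one-time download)

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("postings"))         # 'posting'
print(lemmatizer.lemmatize("better", pos="a"))  # 'good'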


To Lower:

The words are converted to lowercase.


Screenshot (265).png

Let's divide the dataset into train and test sets, as sketched below.

Screenshot (266).png
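A hedged sketch of the split (the test size, random_state, and stratification are assumptions, not necessarily the exact settings used in the screenshot):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["fraudulent"],
    test_size=0.3, random_state=42, stratify=df["fraudulent"])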

                                   Naive Bayes:

The Naive Bayes classifier is an algorithm based on Bayes' theorem. It is very simple and easy to use, and it can handle both continuous and discrete data. It assumes that the variables in the dataset are not correlated with each other.
Bayes' theorem is used to find the probability of a hypothesis given the evidence.


The assumption that all features are independent makes the Naive Bayes algorithm very fast compared to more complicated algorithms.
It works well on high-dimensional data such as text classification and email spam detection.
Disadvantage of Naive Bayes:
Zero-probability problem: when we deal with test data, we might encounter a word that is present in the test set but not in the training set. In such situations, the estimated probability becomes 0.
This can be handled by using different smoothing techniques.

Bayes' theorem provides a way of calculating the posterior probability, P(H|E), from P(H), P(E), and P(E|H). The Naive Bayes classifier assumes that the effect of the value of a predictor (E) on a given class (H) is independent of the values of the other predictors. This assumption is called class conditional independence.

P(H|E) = [P(e1|H) * P(e2|H) * ... * P(en|H) * P(H)] / P(E)


Applying Naive Bayes theorem:

P(H|E) = P(E|H) * P(H) / P(E)

where P(E) = P(E|H)*P(H) + P(E|not H)*P(not H)

Screenshot (267).png
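A minimal sketch of a Multinomial Naive Bayes text classifier on a bag-of-words representation (the vectorizer settings are assumptions):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, f1_score

nb_model = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
nb_model.fit(X_train, y_train)

nb_pred = nb_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, nb_pred))
print("F1 score:", f1_score(y_test, nb_pred))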

ACCURACY:

 

Accuracy = (TP + TN) / (TP + TN + FP + FN)

F1 Score: 
This model needs to identify both categories with the highest possible score, since misclassifying either one has a high cost.

 

F1 Score = 2 * (Precision * Recall) / (Precision + Recall)

Accuracy score: 97%          F1 score: 74%

Screenshot (273).png

Hyperparameter tuning:
To make sure that no word gets a zero predicted probability, we tune the alpha hyperparameter, which applies Laplace smoothing.
The alpha values used in the code below are 1, 0.0001, and 0.001.


 

Screenshot (274).png
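A sketch of trying the alpha values listed above, reusing the pipeline style from the Naive Bayes sketch:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, f1_score

for alpha in [1, 0.0001, 0.001]:
    model = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB(alpha=alpha))
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(f"alpha={alpha}: accuracy={accuracy_score(y_test, pred):.3f}, f1={f1_score(y_test, pred):.3f}")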

           Stochastic Gradient Descent (SGD):
 

 The gradient of the loss is estimated one sample at a time, and the model is updated along the way with a decreasing strength schedule (aka the learning rate). SGD allows minibatch (online/out-of-core) learning via the partial_fit method. For best results with the default learning rate schedule, the data should have zero mean and unit variance.
 

This implementation works with data represented as dense or sparse arrays of floating point values for the features. The model it fits can be controlled with the loss parameter; by default, it fits a linear support vector machine (SVM).
 

The regularizer is a penalty added to the loss function that shrinks model parameters towards the zero vector using either the squared euclidean norm L2 or the absolute norm L1 or a combination of both (Elastic Net). If the parameter update crosses the 0.0 value because of the regularizer, the update is truncated to 0.0 to allow for learning sparse models and achieve online feature selection.

Applying SGD CLASSIFIER:

download (1).png
Screenshot (268).png
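A sketch of the SGD classifier on TF-IDF features (the feature representation and parameters here are assumptions; with the default hinge loss this fits a linear SVM):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, f1_score

sgd_model = make_pipeline(TfidfVectorizer(stop_words="english"),
                          SGDClassifier(loss="hinge", penalty="l2", random_state=42))
sgd_model.fit(X_train, y_train)

sgd_pred = sgd_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, sgd_pred))
print("F1 score:", f1_score(y_test, sgd_pred))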

Accuracy score: 97%
F1 score: 80%

CONFUSION MATRIX:
The confusion matrix displays the following values: the categorized label, the number of data points categorized under that label, and the percentage of data represented in each category. Based on the confusion matrix, it is evident that the model identifies both real and fraudulent jobs.
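A quick sketch of computing and plotting the confusion matrix for the SGD predictions:

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_test, sgd_pred)
ConfusionMatrixDisplay(cm, display_labels=["real", "fraudulent"]).plot()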
 

Screenshot (269).png

When we compare the Naive Bayes classifier and SGD, the accuracy is almost equal at about 97%, but the F1 score for NBC is 74% and the F1 score for SGD is 80%. SGD seems to perform better than NBC.

Screenshot (271).png

GridSearchCV helps us find the best hyperparameters among the loss functions, alpha values, and penalty norms.
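A hedged sketch of such a grid search (this particular parameter grid is an assumption, not the exact grid from the screenshot):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([("tfidf", TfidfVectorizer(stop_words="english")),
                 ("sgd", SGDClassifier(random_state=42))])

param_grid = {
    "sgd__loss": ["hinge", "modified_huber"],
    "sgd__alpha": [1e-2, 1e-3, 1e-4],
    "sgd__penalty": ["l2", "l1", "elasticnet"],
}

grid = GridSearchCV(pipe, param_grid, scoring="f1", cv=3)
grid.fit(X_train, y_train)

print(grid.best_params_)
grid_pred = grid.predict(X_test)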

Screenshot (272).png

When we compare the Naive Bayes classifier and SGD again, the accuracy is still almost equal at about 97%, but the F1 score for NBC is now 81% while the F1 score for SGD is 80%, so NBC seems to perform slightly better than SGD. The accuracy of the grid-search predictions was 93%, which is comparatively lower than both NBC and SGD.

REFERENCES:

Contributions:

CHALLENGES:​

  • I tried to find the best hyperparameters using GridSearchCV and measured the accuracy of the SGD classifier with accuracy_score.

  • I tried to modify the hyperparameters of the Naive Bayes classifier, experimenting with smoothing techniques by changing the alpha values.

  • Choosing an algorithm for the text classifier was a challenge at the beginning; going through how each algorithm actually works helped me understand how it could be implemented.

  • The concept of SGD was new to me, so I went through the pros and cons of using this particular algorithm.
