Predicting Fake News using NLP and Machine Learning
The fake news dataset is one of the classic text analytics datasets available on Kaggle. It consists of the titles and text of genuine and fake articles from different authors. In this article, I walk through the entire text classification process using traditional machine learning approaches as well as deep learning. I started by downloading the dataset from Kaggle to Google Colab. Next, I read the data into a DataFrame and checked it for null values. Out of a total of 20800 rows, there are 7 null values in text, 122 in title and 503 in author, so I decided to drop those rows. For the test data, I filled the null values with a blank instead.
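The null check and the two ways of handling missing values can be sketched roughly as follows. This is a minimal illustration on a toy DataFrame, not the original notebook code: the `label` column name and the sample rows are assumptions made for the example, and the real data has 20800 rows.

```python
import pandas as pd

# Toy stand-in for the training data (invented rows, assumed column names).
train = pd.DataFrame({
    "title": ["A", None, "C", "D"],
    "author": ["x", "y", None, "z"],
    "text": ["some text", "more text", "other text", None],
    "label": [0, 1, 1, 0],
})

# Count missing values per column.
null_counts = train.isnull().sum()
print(null_counts)

# Training data: drop every row that has at least one missing value.
train_clean = train.dropna().reset_index(drop=True)

# Test data: keep all rows and replace missing values with an empty string.
test = pd.DataFrame({
    "title": ["E", None],
    "author": [None, "w"],
    "text": ["t", None],
})
test_filled = test.fillna("")
```

Dropping is cheap for the training set because so few rows are affected, while the test set has to keep every row so that each one still gets a prediction.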
Additionally, I checked the distribution of ‘Fake’ and ‘Genuine’ news in the dataset. I usually set the rcParams for all plots in the notebook when importing matplotlib.
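A rough sketch of both steps is shown here. The specific rcParams values, the `label` column name, and the mapping of 0 to ‘Genuine’ and 1 to ‘Fake’ are assumptions for illustration, and the five-row DataFrame is invented.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Notebook-wide plot defaults, set once right after importing matplotlib.
# The values here are arbitrary examples.
plt.rcParams["figure.figsize"] = (8, 5)
plt.rcParams["font.size"] = 12

# Toy labels (assumed encoding: 0 = Genuine, 1 = Fake).
train = pd.DataFrame({"label": [0, 1, 1, 0, 1]})

# Count articles per class, using readable class names.
class_counts = train["label"].map({0: "Genuine", 1: "Fake"}).value_counts()
print(class_counts)

# Bar chart of the class distribution.
ax = class_counts.plot(kind="bar", title="Class distribution")
ax.set_ylabel("Number of articles")
plt.tight_layout()
```

Setting rcParams once keeps every later figure in the notebook consistent without repeating size and font arguments in each plotting call.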
The text lengths appear to start from 0, which is concerning. When I used .describe() to see the numbers, the minimum is actually 1. So I took a look at these texts and found that they are blank. The obvious fix is to strip the texts and drop those of length zero. I counted 74 zero-length texts in total.
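The check can be sketched along these lines, assuming a `text` column and a handful of invented rows. Whitespace-only texts have a raw length of at least 1, so they only show up as zero-length after stripping.

```python
import pandas as pd

# Toy data: two of the five texts contain only whitespace.
train = pd.DataFrame({
    "text": ["Real article body", " ", "Another body", "\n", "  padded  "],
})

# Raw character lengths; whitespace-only texts still count as length >= 1.
raw_lengths = train["text"].str.len()
print(raw_lengths.describe())

# After stripping, whitespace-only texts collapse to length 0.
stripped = train["text"].str.strip()
n_blank = int((stripped.str.len() == 0).sum())
print(n_blank)
```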
I decided to start over. I would fill all NaNs with a blank, strip the texts, and then remove the zero-length ones; that should be a good starting point for preprocessing. The new code essentially handles the missing values this way. The final shape of the data is (20684, 6), that is, it contains 20684 rows, only 116 fewer than the original 20800.
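A minimal sketch of that fill, strip and drop sequence is given below. The helper name `clean_missing`, the `length` column, and the column names `id`, `title`, `author`, `text` and `label` are assumptions for illustration, and the four rows are invented.

```python
import pandas as pd

def clean_missing(df, text_cols=("title", "author", "text")):
    """Fill NaNs with '', strip whitespace, and drop rows whose text is empty.

    Hypothetical helper written for this example.
    """
    out = df.copy()
    for col in text_cols:
        # Blank out missing values, then remove surrounding whitespace.
        out[col] = out[col].fillna("").astype(str).str.strip()
    # Store the stripped text length so empty texts can be filtered out.
    out["length"] = out["text"].str.len()
    return out[out["length"] > 0].reset_index(drop=True)

# Toy data: one missing text, one whitespace-only text, one missing author.
train = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "title": ["t0", "t1", "t2", "t3"],
    "author": [None, "a1", "a2", "a3"],
    "text": ["body one", None, "   ", " body two "],
    "label": [0, 1, 1, 0],
})

cleaned = clean_missing(train)
print(cleaned.shape)
```

With this order of operations, a row with a missing author or title is kept (the field simply becomes an empty string), and only rows with no usable text are dropped.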