Blog Post 6 - Fake News Classification
In this Blog Post, we will use TensorFlow to build models that classify news articles as fake or real.
Acknowledgment
Major parts of this Blog Post assignment, including several code chunks and explanations, are based on materials by Professor Phil Chodrow.
Let’s put all import statements here at the beginning for convenience.
import pandas as pd
import string
import numpy as np
import re
import tensorflow as tf
from matplotlib import pyplot as plt
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
# for embedding viz
import plotly.express as px
import plotly.io as pio
pio.templates.default = "plotly_white"
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
§1. Acquire Training Data
In this section, we will download the data described in the following article:
Ahmed H, Traore I, Saad S. (2017) “Detection of Online Fake News Using N-Gram Analysis and Machine Learning Techniques.” In: Traore I., Woungang I., Awad A. (eds) Intelligent, Secure, and Dependable Systems in Distributed and Cloud Environments. ISDDC 2017. Lecture Notes in Computer Science, vol 10618. Springer, Cham (pp. 127-138).
Once we have loaded the articles into a pandas data frame, we can take a look at the first five rows of the dataset using df.head().
train_url = "https://github.com/PhilChodrow/PIC16b/blob/master/datasets/fake_news_train.csv?raw=true"
df = pd.read_csv(train_url)
df.head()
| | Unnamed: 0 | title | text | fake |
|---|---|---|---|---|
| 0 | 17366 | Merkel: Strong result for Austria's FPO 'big c... | German Chancellor Angela Merkel said on Monday... | 0 |
| 1 | 5634 | Trump says Pence will lead voter fraud panel | WEST PALM BEACH, Fla.President Donald Trump sa... | 0 |
| 2 | 17487 | JUST IN: SUSPECTED LEAKER and “Close Confidant... | On December 5, 2017, Circa s Sara Carter warne... | 1 |
| 3 | 12217 | Thyssenkrupp has offered help to Argentina ove... | Germany s Thyssenkrupp, has offered assistance... | 0 |
| 4 | 5535 | Trump say appeals court decision on travel ban... | President Donald Trump on Thursday called the ... | 0 |
The data set contains four columns, and each row corresponds to one article. The important columns have the following meanings:
- title: The title of the article
- text: The full article text
- fake: 0 if the article is true; 1 if the article contains fake news
Next, we will use df.isna() to flag all missing (NaN) values in the data, then call .sum() on the result to count how many NaN values are in each column, and save the counts in nan_values_summary.
nan_values_summary = df.isna().sum()
nan_values_summary
Unnamed: 0 0
title 0
text 0
fake 0
dtype: int64
Wow, it looks like we don’t have any missing values, which is good.
Next, let’s make a bar chart to visualize how many articles fall into each class (true or fake).
import seaborn as sns
# number of articles in each class (0 = true, 1 = fake)
x = df.groupby("fake").size()
fig, ax = plt.subplots(1)
ax = sns.barplot(x=x.index, y=x.values)
ax.set(title="Bar Chart of Fake News")
# remove border
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
# The approach for annotating each bar is adapted from
# https://www.geeksforgeeks.org/how-to-annotate-bars-in-barplot-with-matplotlib-in-python/
# Iterating over the bars one by one
for bar in ax.patches:
    # Using Matplotlib's annotate function and
    # passing the coordinates where the annotation shall be done
    # x-coordinate: bar.get_x() + bar.get_width() / 2
    # y-coordinate: bar.get_height()
    # offset in points between the bar top and the label: (0, 6)
    # ha and va stand for the horizontal and vertical alignment
    ax.annotate(int(bar.get_height()),
                (bar.get_x() + bar.get_width() / 2, bar.get_height()),
                ha='center', va='center', size=10, xytext=(0, 6),
                textcoords='offset points', color="firebrick")
The numbers of fake and true articles are roughly equal, so the two classes are approximately balanced.
§2. Make a Dataset
In this section, we turn our data frame into a TensorFlow Dataset using tf.data.Dataset.from_tensor_slices().
Before creating the Dataset, we need to clean the data. For instance, we should remove stopwords from the articles. Stopwords are commonly used words (“the,” “and,” “but,” “a,” “an,” or “in”) that usually carry little information.
Luckily, we don’t have to create the list of stopwords ourselves. Instead, we can use the Natural Language Toolkit (NLTK) to get the stopwords we need from the package nltk.corpus. Here are some examples of how to use nltk.
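As a minimal sketch of what stopword removal does to a single string, the snippet below filters a sentence word by word. The small `SAMPLE_STOPWORDS` set, the `remove_stopwords` helper, and the example sentence are all hypothetical and only for illustration; the post itself takes the full list from nltk.corpus.

```python
# Hypothetical, hand-written subset of English stopwords for illustration;
# the post uses the full list from nltk.corpus.stopwords instead.
SAMPLE_STOPWORDS = {"the", "and", "but", "a", "an", "in", "of", "to", "on"}


def remove_stopwords(sentence, stop=SAMPLE_STOPWORDS):
    """Drop every whitespace-separated word whose lowercase form is a stopword."""
    kept = [word for word in sentence.split() if word.lower() not in stop]
    return " ".join(kept)


# made-up sentence, not taken from the dataset
print(remove_stopwords("The senator spoke to a crowd in the city"))
# senator spoke crowd city
```

Note that the filter works on individual words, so the text has to be split before comparing against the stopword list.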
First, let’s create a function called make_dataset(), which does the following:
- Removes stopwords from the article text and title.
- Constructs and returns a tf.data.Dataset with two inputs (title, text) and one output (fake). Because we have multiple inputs, we are going to construct our Dataset from a tuple of dictionaries. The first dictionary is going to specify the different components in the predictor data (title, text), while the second dictionary is going to specify the different components of the target data (fake).
def make_dataset(panda_df):
    """
    This function will remove stopwords from the article text and title
    in the input data frame, then create a TensorFlow Dataset from the
    cleaned data frame.
    Parameters
    ----------
    panda_df: data frame with columns "title", "text", and "fake";
    Return
    ----------
    tf_dataset: TensorFlow Dataset, batched in groups of 100 rows;
    """
    # get stopwords from 'nltk.corpus' package (a set makes lookups fast)
    stop = set(stopwords.words('english'))
    # remove stopwords word by word from the article text and title;
    # work on a copy so the caller's data frame is left unchanged
    cleaned = panda_df.copy()
    for col in ["title", "text"]:
        cleaned[col] = cleaned[col].apply(
            lambda s: " ".join(word for word in s.split() if word.lower() not in stop)
        )
    # Construct and return a tf.data.Dataset with two inputs (title, text) and one output (fake)
    tf_dataset = tf.data.Dataset.from_tensor_slices(
        (
            {
                "title" : cleaned[["title"]],   # first dictionary: predictors
                "text" : cleaned[["text"]]
            },
            {
                "fake" : cleaned[["fake"]]      # second dictionary: target
            }
        )
    )
    # batch our Dataset prior to returning it
    tf_dataset = tf_dataset.batch(100)
    return tf_dataset
The call tf_dataset.batch(100) groups the data into batches of 100 rows, so each step of stochastic gradient descent processes 100 rows at a time. Batching speeds up training, although very large batches can sometimes reduce accuracy.
Now, let’s use our function make_dataset() to create the dataset.
data = make_dataset(panda_df=df)
Validation Data
This section will perform a train and validation split on our dataset. We will hold out 20% of the training dataset as validation data.
We can create a function called split_dataset() to perform the train and validation split.
def split_dataset(data, train_size):
    """
    This function will perform a train and validation split on the input dataset
    Parameters
    ----------
    data: batched TensorFlow Dataset;
    train_size: float (0.0 - 1.0); fraction of the batches used for training.
    Return
    ----------
    train: training dataset
    val: validation dataset
    """
    # shuffle the batches once; with the default reshuffle_each_iteration=True
    # the order would change every epoch and batches could move between
    # the training and validation sets
    data = data.shuffle(buffer_size=len(data), reshuffle_each_iteration=False)
    # number of batches in the training data
    n_train = int(train_size * len(data))
    # the first n_train batches are the training data
    train = data.take(n_train)
    # the remaining batches are the validation data
    val = data.skip(n_train)
    return train, val
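The take/skip pattern can be sketched in plain Python. The snippet below is only an analogy under stated assumptions: `split_batches` is a hypothetical helper, the integers stand in for batches, and list slicing plays the role of take() and skip(). The split is only clean when the shuffled order stays fixed; in tf.data, the `reshuffle_each_iteration` argument of `shuffle()` controls whether the order changes between passes.

```python
import random


def split_batches(batches, train_size, seed=0):
    """Shuffle once, then slice: the first part trains, the rest validates."""
    shuffled = list(batches)
    random.Random(seed).shuffle(shuffled)           # one fixed shuffle
    n_train = int(train_size * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]   # analogues of take() and skip()


batches = list(range(225))  # integers standing in for 225 batches
train, val = split_batches(batches, train_size=0.8)
print(len(train), len(val))    # 180 45
print(set(train) & set(val))   # set(), i.e. no batch is in both parts
```

Because the shuffle happens exactly once, every batch lands in exactly one of the two parts.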
Let’s use the function split_dataset() to split our data.
train_dataset, val_dataset = split_dataset(data, train_size=0.8)
Let’s check the length of our training and validation dataset.
len(train_dataset), len(val_dataset)
(180, 45)
We have 180 batches in our training dataset and 45 batches in our validation dataset. We used a batch size of 100 when creating our Dataset, so multiplying by 100 gives the approximate number of rows in each Dataset (the final batch may hold fewer than 100 rows).
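The relationship between rows and batches can be checked with a little arithmetic. In the sketch below, `batch_sizes` is a hypothetical helper, and the total of 22,449 rows is an assumption chosen to be consistent with the 225 batches reported above; it is not read from the data here.

```python
def batch_sizes(n_rows, batch_size):
    """Sizes of the batches produced by grouping n_rows rows in order."""
    full, remainder = divmod(n_rows, batch_size)
    return [batch_size] * full + ([remainder] if remainder else [])


n_rows = 22449  # assumed total number of rows, consistent with 225 batches
sizes = batch_sizes(n_rows, 100)
print(len(sizes))   # 225 batches in total
print(sizes[-1])    # 49 rows in the last, partial batch

# 180 training batches hold at most 180 * 100 rows; if the partial batch
# happens to be among them, the count is 179 * 100 + 49 instead.
print(180 * 100, 179 * 100 + 49)   # 18000 17949
```

This is why multiplying the number of batches by 100 gives an upper bound on the number of rows rather than an exact count.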
Base Rate
Here we will calculate the base rate of our model. Base rate refers to the accuracy of a model that always makes the same guess (for example, such a model might always say “fake news!”).
The following line of code will create an iterator called labels_iterator over the labels in train_dataset.
labels_iterator= train_dataset.unbatch().map(lambda x, label: label['fake'][0]).as_numpy_iterator()
Next, we will calculate the total number of articles and how many are labeled as fake in our training dataset.
- label 0 : The article is true
- label 1 : The article contains fake news
train_labels_list = list(labels_iterator)
num_labels = len(train_labels_list)
# label 1 marks fake news, so count the 1s
num_fake = train_labels_list.count(1)
print("Total number of labels: " + str(num_labels))
print("Total number of fake: " + str(num_fake))
print("Base Rate \u2248 " + str(round(num_fake/num_labels *100, 4)) + "%")
Total number of labels: 17949
Total number of fake: 9405
Base Rate ≈ 52.3985%
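The base-rate idea can also be sketched in plain Python: it is the accuracy obtained by always guessing the most common label. The `base_rate` helper and the toy label list below are hypothetical and only illustrate the calculation.

```python
from collections import Counter


def base_rate(labels):
    """Return the most common label and the accuracy of always guessing it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


toy_labels = [1, 0, 1, 1, 0, 1, 0, 1]  # hypothetical labels: 1 = fake, 0 = true
label, rate = base_rate(toy_labels)
print(label, rate)   # 1 0.625
```

Any trained model should beat this number to be considered useful.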
§3. Create Models
This section will create three different TensorFlow models to determine the effective way of detecting fake news.
- First, we will focus on only the title of the article.
- Second, we will focus on only the text of the article.
- Third, we will focus on both title and text of the article.
Preprocessing
Before building our model, we should do text preprocessing on our dataset.
Standardization
Standardization refers to the act of taking some text that’s “messy” in some way and making it less messy. Common standardizations include:
- Removing capitals.
- Removing punctuation.
- Removing HTML elements or other non-semantic content.
In the next cell, we will standardize the text data by converting it to lowercase and removing all punctuation.
def standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    no_punctuation = tf.strings.regex_replace(lowercase,
                                              '[%s]' % re.escape(string.punctuation), '')
    return no_punctuation
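To see what this standardization does to a concrete string, here is a plain-Python sketch that applies the same regular expression with the `re` module instead of TensorFlow ops. The name `standardize_py` and the example strings are hypothetical.

```python
import re
import string

# character class matching every ASCII punctuation mark
PUNCTUATION_PATTERN = "[%s]" % re.escape(string.punctuation)


def standardize_py(text):
    """Lowercase the text, then delete ASCII punctuation."""
    return re.sub(PUNCTUATION_PATTERN, "", text.lower())


print(standardize_py("JUST IN: Suspected leaker, says report!"))
# just in suspected leaker says report
```

One detail worth knowing: string.punctuation only covers ASCII characters, so typographic quotes such as “ and ” are left in place by this pattern.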
Vectorization
Text Vectorization refers to the process of representing text as a vector (array, tensor).
Here, we’ll replace each word by its frequency rank in the data. The vectorization layer turns our text information (title, text) into numbers in exactly this way.
# only the top 2600 distinct words will be tracked
size_vocabulary = 2600
vectorize_layer = TextVectorization(
standardize=standardization,
max_tokens=size_vocabulary, # only consider this many words
output_mode='int',
output_sequence_length=500)
We need to adapt the vectorization layer to the title and text. In the adaptation process, the vectorization layer learns which words are common in the data. Each call to adapt() rebuilds the vocabulary from scratch, so we adapt once on the titles and texts together.
vectorize_layer.adapt(train_dataset.map(
    lambda x, y: tf.concat([tf.reshape(x["title"], [-1]),
                            tf.reshape(x["text"], [-1])], axis=0)))
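The frequency-rank encoding can be sketched without TensorFlow. The helpers `build_vocabulary` and `vectorize` and the tiny corpus below are hypothetical; the sketch assumes the common convention in which index 0 is reserved for padding and index 1 for out-of-vocabulary words, so only `max_tokens - 2` real words are tracked.

```python
from collections import Counter


def build_vocabulary(texts, max_tokens):
    """Map the most frequent words to ranks starting at 2 (0 = padding, 1 = unknown)."""
    counts = Counter(word for text in texts for word in text.split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return {word: rank + 2 for rank, word in enumerate(most_common)}


def vectorize(text, vocab, sequence_length):
    """Replace words by their ranks, then truncate or pad to a fixed length."""
    ids = [vocab.get(word, 1) for word in text.split()][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


# made-up corpus: "says" appears 4 times, "trump" 3, "court" 2, "panel" 1
corpus = ["trump says court says", "trump says panel", "court says trump"]
vocab = build_vocabulary(corpus, max_tokens=5)
print(vocab)   # {'says': 2, 'trump': 3, 'court': 4}
print(vectorize("panel says trump lied", vocab, sequence_length=6))
# [1, 2, 3, 1, 0, 0]
```

Rare or unseen words ("panel", "lied") collapse to the same unknown index, which is what limiting the vocabulary size does to uncommon words.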
Inputs
We need to create two keras.Input objects for our model, one for the title and one for the text. Each row holds just one article title and just one full article text, so both inputs have shape (1,). The name of each input should match the corresponding key used when we constructed the dataset with tf.data.Dataset.from_tensor_slices() earlier. Since both inputs contain text, we set the dtype to "string".
title_input = keras.Input(
shape = (1,),
name = "title",
dtype = "string"
)
text_input = keras.Input(
shape = (1,),
name = "text",
dtype = "string"
)
Shared Layers
With the functional API, the same layer can be used in several places in a model. Shared layers are layers that are used multiple times in different parts of the model.
We will make an Embedding layer that is shared between title_input and text_input.
# Embedding for 2600 unique words mapped to 3-dimensional vectors
shared_embedding = layers.Embedding(size_vocabulary, output_dim = 3, name = "embedding" )
First Model
We will only use the article title as input in the first model.
Let’s write the pipeline for the title.
# pipeline for the first model, which only uses the title
title_features = vectorize_layer(title_input)
# reuse the shared Embedding layer to encode title inputs
title_features = shared_embedding(title_features)
# add Dropout layer to reduce overfitting
title_features = layers.Dropout(0.2)(title_features)
# GlobalAveragePooling1D averages the embeddings over the sequence (word) axis
title_features = layers.GlobalAveragePooling1D()(title_features)
# add Dropout layer to reduce overfitting
title_features = layers.Dropout(0.2)(title_features)
title_features = layers.Dense(32, activation='relu')(title_features)
title_features = layers.Dropout(0.2)(title_features)
# each data point (title) will be represented as 32 numbers
title_features = layers.Dense(32, activation='relu')(title_features)
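The first two steps of this pipeline, embedding lookup followed by global average pooling, can be sketched in plain Python. The tiny 4-token embedding table below is made up for illustration (a real Embedding layer learns its values during training), and `embed` and `global_average_pool` are hypothetical helper names.

```python
def embed(token_ids, table):
    """Look up one vector per token id."""
    return [table[i] for i in token_ids]


def global_average_pool(vectors):
    """Average the vectors over the sequence axis, one value per dimension."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# hypothetical embedding table: 4 tokens, 3 dimensions each
table = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 4.0],
]
sequence = [1, 2, 3, 2]   # a vectorized "title" of four tokens
pooled = global_average_pool(embed(sequence, table))
print(pooled)   # [2.25, 2.0, 1.5]
```

Whatever the sequence length, pooling leaves one vector with as many entries as the embedding has dimensions, which is why the Dense layers that follow see only 3 inputs.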
We should give the output layer a name that matches the key corresponding to the target data in the Dataset we will pass to the model. In our case, the name should be "fake". This is how TensorFlow knows which part of our data set to compare against the outputs!
# output layer for the first model
title_output = layers.Dense(2, name = "fake")(title_features)
Then, we can create the first model.
title_model = keras.Model(
inputs = title_input,
outputs = title_output
)
Compile the model.
title_model.compile(optimizer = "adam",
loss = losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=['accuracy']
)
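The loss used here compares integer labels against raw, unnormalized scores (logits). As a rough sketch of the underlying formula for a single example, the function below applies a softmax to the logits and takes the negative log-probability of the true class. The function name is hypothetical and the code is a plain-Python illustration, not the Keras implementation.

```python
import math


def sparse_categorical_crossentropy_from_logits(logits, label):
    """Negative log of the softmax probability assigned to the true class."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


# two equal logits mean a 50/50 guess, so the loss is log(2)
print(round(sparse_categorical_crossentropy_from_logits([0.0, 0.0], 1), 4))   # 0.6931
# a confident, correct prediction gives a loss close to 0
print(round(sparse_categorical_crossentropy_from_logits([2.0, -1.0], 0), 4))  # 0.0486
```

A loss near 0.69 therefore indicates a two-class model that is doing no better than guessing.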
Train the model.
title_history = title_model.fit(train_dataset,
validation_data=val_dataset,
epochs = 30)
Training output for all 30 epochs of the first model:
Epoch 1/30
/usr/local/lib/python3.7/dist-packages/keras/engine/functional.py:559: UserWarning: Input dict contained keys ['text'] which did not match any model input. They will be ignored by the model.
  inputs = self._flatten_to_reference_inputs(inputs)
180/180 [==============================] - 4s 11ms/step - loss: 0.6919 - accuracy: 0.5249 - val_loss: 0.6902 - val_accuracy: 0.5367
Epoch 2/30
180/180 [==============================] - 1s 7ms/step - loss: 0.6886 - accuracy: 0.5403 - val_loss: 0.6772 - val_accuracy: 0.5273
Epoch 3/30
180/180 [==============================] - 1s 5ms/step - loss: 0.5496 - accuracy: 0.7633 - val_loss: 0.3428 - val_accuracy: 0.8789
Epoch 4/30
180/180 [==============================] - 1s 5ms/step - loss: 0.3006 - accuracy: 0.8832 - val_loss: 0.2292 - val_accuracy: 0.9071
Epoch 5/30
180/180 [==============================] - 1s 5ms/step - loss: 0.2528 - accuracy: 0.8985 - val_loss: 0.1858 - val_accuracy: 0.9294
Epoch 6/30
180/180 [==============================] - 1s 5ms/step - loss: 0.2289 - accuracy: 0.9082 - val_loss: 0.1728 - val_accuracy: 0.9329
Epoch 7/30
180/180 [==============================] - 1s 5ms/step - loss: 0.2112 - accuracy: 0.9163 - val_loss: 0.1719 - val_accuracy: 0.9342
Epoch 8/30
180/180 [==============================] - 1s 5ms/step - loss: 0.2013 - accuracy: 0.9198 - val_loss: 0.1470 - val_accuracy: 0.9449
Epoch 9/30
180/180 [==============================] - 1s 5ms/step - loss: 0.1969 - accuracy: 0.9228 - val_loss: 0.1517 - val_accuracy: 0.9438
Epoch 10/30
180/180 [==============================] - 1s 5ms/step - loss: 0.1870 - accuracy: 0.9277 - val_loss: 0.1423 - val_accuracy: 0.9487
Epoch 11/30
180/180 [==============================] - 1s 6ms/step - loss: 0.1806 - accuracy: 0.9307 - val_loss: 0.1516 - val_accuracy: 0.9431
Epoch 12/30
180/180 [==============================] - 1s 5ms/step - loss: 0.1804 - accuracy: 0.9310 - val_loss: 0.1315 - val_accuracy: 0.9480
Epoch 13/30
180/180 [==============================] - 1s 6ms/step - loss: 0.1724 - accuracy: 0.9348 - val_loss: 0.1486 - val_accuracy: 0.9444
Epoch 14/30
180/180 [==============================] - 1s 5ms/step - loss: 0.1719 - accuracy: 0.9343 - val_loss: 0.1432 - val_accuracy: 0.9436
Epoch 15/30
180/180 [==============================] - 1s 5ms/step - loss: 0.1666 - accuracy: 0.9380 - val_loss: 0.1408 - val_accuracy: 0.9484
Epoch 16/30
180/180 [==============================] - 1s 5ms/step - loss: 0.1630 - accuracy: 0.9372 - val_loss: 0.1215 - val_accuracy: 0.9564
Epoch 17/30
180/180 [==============================] - 1s 5ms/step - loss: 0.1654 - accuracy: 0.9379 - val_loss: 0.1156 - val_accuracy: 0.9600
Epoch 18/30
180/180 [==============================] - 1s 5ms/step - loss: 0.1592 - accuracy: 0.9403 - val_loss: 0.1294 - val_accuracy: 0.9511
Epoch 19/30
180/180 [==============================] - 1s 5ms/step - loss: 0.1685 - accuracy: 0.9358 - val_loss: 0.1400 - val_accuracy: 0.9456
Epoch 20/30
180/180 [==============================] - 1s 6ms/step - loss: 0.1609 - accuracy: 0.9392 - val_loss: 0.1152 - val_accuracy: 0.9593
Epoch 21/30
180/180 [==============================] - 1s 5ms/step - loss: 0.1543 - accuracy: 0.9413 - val_loss: 0.1191 - val_accuracy: 0.9531
Epoch 22/30
180/180 [==============================] - 1s 5ms/step - loss: 0.1514 - accuracy: 0.9411 - val_loss: 0.1189 - val_accuracy: 0.9591
Epoch 23/30
180/180 [==============================] - 1s 5ms/step - loss: 0.1516 - accuracy: 0.9422 - val_loss: 0.1130 - val_accuracy: 0.9573
Epoch 24/30
180/180 [==============================] - 1s 5ms/step - loss: 0.1535 - accuracy: 0.9403 - val_loss: 0.1072 - val_accuracy: 0.9613
Epoch 25/30
180/180 [==============================] - 1s 5ms/step - loss: 0.1473 - accuracy: 0.9450 - val_loss: 0.0979 - val_accuracy: 0.9689
Epoch 26/30
180/180 [==============================] - 1s 5ms/step - loss: 0.1508 - accuracy: 0.9435 - val_loss: 0.0973 - val_accuracy: 0.9664
Epoch 27/30
180/180 [==============================] - 1s 5ms/step - loss: 0.1494 - accuracy: 0.9445 - val_loss: 0.1046 - val_accuracy: 0.9629
Epoch 28/30
180/180 [==============================] - 1s 5ms/step - loss: 0.1523 - accuracy: 0.9414 - val_loss: 0.1030 - val_accuracy: 0.9633
Epoch 29/30
180/180 [==============================] - 1s 5ms/step - loss: 0.1431 - accuracy: 0.9452 - val_loss: 0.1137 - val_accuracy: 0.9600
Epoch 30/30
180/180 [==============================] - 1s 6ms/step - loss: 0.1498 - accuracy: 0.9441 - val_loss: 0.1010 - val_accuracy: 0.9658
Next, we will create a function called history_plot(), which will plot the history of the accuracy on both the training and validation sets.
def history_plot(history, model_name):
    """
    This function will create the plot of accuracy on the training and validation sets
    Parameters
    ----------
    history: history object; it holds a record of the loss values and metric values during training
    model_name: string; The name for the model
    Return
    ----------
    No return value
    """
    fig, ax = plt.subplots(1, 1, figsize=(10, 5))
    ax.yaxis.set_major_locator(plt.MaxNLocator(10))
    plt.plot(history.history["accuracy"], label="training")
    plt.plot(history.history["val_accuracy"], label="validation")
    plt.gca().set(xlabel="epoch", ylabel="accuracy")
    plt.title("Training and Validation Performance of " + model_name)
    plt.legend()
Let’s plot the history of the accuracy on both the training and validation sets for the first model.
history_plot(title_history, model_name= "First Model")
The first model consistently scores between roughly 94% and 97% validation accuracy from epoch 8 onward, which is acceptable.
Second Model
We will only use the article text as input in the second model.
Let’s write the pipeline for the text.
# pipeline for the second model, which only uses the text
text_features = vectorize_layer(text_input)
# reuse the shared Embedding layer to encode text inputs
text_features = shared_embedding(text_features)
# add Dropout layer to reduce overfitting
text_features = layers.Dropout(0.2)(text_features)
# GlobalAveragePooling1D averages the embeddings over the sequence (word) axis
text_features = layers.GlobalAveragePooling1D()(text_features)
# add Dropout layer to reduce overfitting
text_features = layers.Dropout(0.2)(text_features)
text_features = layers.Dense(32, activation='relu')(text_features)
text_features = layers.Dropout(0.2)(text_features)
# each data point (text) will be represented as 32 numbers
text_features = layers.Dense(32, activation='relu')(text_features)
# output layer for the second model
text_output = layers.Dense(2, name = "fake")(text_features)
Then, we can create, compile and train the second model.
text_model = keras.Model(
inputs = [text_input],
outputs = text_output
)
text_model.compile(optimizer = "adam",
loss = losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=['accuracy']
)
text_history = text_model.fit(train_dataset,
validation_data=val_dataset,
epochs = 30)
Training output for all 30 epochs of the second model:
Epoch 1/30
/usr/local/lib/python3.7/dist-packages/keras/engine/functional.py:559: UserWarning: Input dict contained keys ['title'] which did not match any model input. They will be ignored by the model.
  inputs = self._flatten_to_reference_inputs(inputs)
180/180 [==============================] - 3s 14ms/step - loss: 0.6026 - accuracy: 0.6691 - val_loss: 0.4291 - val_accuracy: 0.8613
Epoch 2/30
180/180 [==============================] - 2s 13ms/step - loss: 0.3333 - accuracy: 0.8716 - val_loss: 0.2620 - val_accuracy: 0.8827
Epoch 3/30
180/180 [==============================] - 2s 13ms/step - loss: 0.2211 - accuracy: 0.9227 - val_loss: 0.2307 - val_accuracy: 0.8929
Epoch 4/30
180/180 [==============================] - 2s 13ms/step - loss: 0.1795 - accuracy: 0.9389 - val_loss: 0.2435 - val_accuracy: 0.8929
Epoch 5/30
180/180 [==============================] - 2s 13ms/step - loss: 0.1646 - accuracy: 0.9440 - val_loss: 0.2013 - val_accuracy: 0.9080
Epoch 6/30
180/180 [==============================] - 2s 13ms/step - loss: 0.1447 - accuracy: 0.9531 - val_loss: 0.1952 - val_accuracy: 0.9065
Epoch 7/30
180/180 [==============================] - 2s 13ms/step - loss: 0.1328 - accuracy: 0.9573 - val_loss: 0.1883 - val_accuracy: 0.9111
Epoch 8/30
180/180 [==============================] - 2s 13ms/step - loss: 0.1198 - accuracy: 0.9620 - val_loss: 0.1510 - val_accuracy: 0.9287
Epoch 9/30
180/180 [==============================] - 2s 13ms/step - loss: 0.1149 - accuracy: 0.9641 - val_loss: 0.1360 - val_accuracy: 0.9360
Epoch 10/30
180/180 [==============================] - 2s 13ms/step - loss: 0.1078 - accuracy: 0.9650 - val_loss: 0.1537 - val_accuracy: 0.9253
Epoch 11/30
180/180 [==============================] - 2s 13ms/step - loss: 0.0963 - accuracy: 0.9693 - val_loss: 0.1533 - val_accuracy: 0.9276
Epoch 12/30
180/180 [==============================] - 2s 13ms/step - loss: 0.0945 - accuracy: 0.9689 - val_loss: 0.1251 - val_accuracy: 0.9440
Epoch 13/30
180/180 [==============================] - 2s 13ms/step - loss: 0.0870 - accuracy: 0.9724 - val_loss: 0.1015 - val_accuracy: 0.9540
Epoch 14/30
180/180 [==============================] - 3s 14ms/step - loss: 0.0837 - accuracy: 0.9735 - val_loss: 0.1140 - val_accuracy: 0.9444
Epoch 15/30
180/180 [==============================] - 2s 13ms/step - loss: 0.0846 - accuracy: 0.9722 - val_loss: 0.0788 - val_accuracy: 0.9609
Epoch 16/30
180/180 [==============================] - 3s 17ms/step - loss: 0.0762 - accuracy: 0.9749 - val_loss: 0.1171 - val_accuracy: 0.9453
Epoch 17/30
180/180 [==============================] - 3s 17ms/step - loss: 0.0680 - accuracy: 0.9774 - val_loss: 0.0775 - val_accuracy: 0.9589
Epoch 18/30
180/180 [==============================] - 2s 13ms/step - loss: 0.0665 - accuracy: 0.9775 - val_loss: 0.1104 - val_accuracy: 0.9480
Epoch 19/30
180/180 [==============================] - 2s 13ms/step - loss: 0.0668 - accuracy: 0.9773 - val_loss: 0.0673 - val_accuracy: 0.9669
Epoch 20/30
180/180 [==============================] - 3s 14ms/step - loss: 0.0622 - accuracy: 0.9788 - val_loss: 0.0720 - val_accuracy: 0.9613
Epoch 21/30
180/180 [==============================] - 2s 13ms/step - loss: 0.0584 - accuracy: 0.9801 - val_loss: 0.0706 - val_accuracy: 0.9633
Epoch 22/30
180/180 [==============================] - 2s 13ms/step - loss: 0.0568 - accuracy: 0.9802 - val_loss: 0.0597 - val_accuracy: 0.9709
Epoch 23/30
180/180 [==============================] - 2s 13ms/step - loss: 0.0567 - accuracy: 0.9799 - val_loss: 0.0397 - val_accuracy: 0.9900
Epoch 24/30
180/180 [==============================] - 2s 13ms/step - loss: 0.0525 - accuracy: 0.9814 - val_loss: 0.0671 - val_accuracy: 0.9683
Epoch 25/30
180/180 [==============================] - 2s 13ms/step - loss: 0.0501 - accuracy: 0.9827 - val_loss: 0.0712 - val_accuracy: 0.9656
Epoch 26/30
180/180 [==============================] - 3s 15ms/step - loss: 0.0464 - accuracy: 0.9837 - val_loss: 0.0424 - val_accuracy: 0.9889
Epoch 27/30
180/180 [==============================] - 4s 22ms/step - loss: 0.0473 - accuracy: 0.9845 - val_loss: 0.0434 - val_accuracy: 0.9900
Epoch 28/30
180/180 [==============================] - 5s 27ms/step - loss: 0.0466 - accuracy: 0.9839 - val_loss: 0.0462 - val_accuracy: 0.9738
Epoch 29/30
180/180 [==============================] - 4s 24ms/step - loss: 0.0459 - accuracy: 0.9836 - val_loss: 0.0491 - val_accuracy: 0.9716
Epoch 30/30
180/180 [==============================] - 5s 27ms/step - loss: 0.0430 - accuracy: 0.9847 - val_loss: 0.0395 - val_accuracy: 0.9769
Let’s plot the history of the accuracy on both the training and validation sets for the second model.
history_plot(text_history, model_name="Second Model")
The second model consistently scores around 95% - 99% validation accuracy in the second half of training (from epoch 15 onward), which is not bad.
Third Model
In the third model, we will use both the article title and the article text as input.
First, we should concatenate the output of the title pipeline and the output of the text pipeline.
main = layers.concatenate([title_features, text_features], axis=1)
Then, we add one more Dense layer for the third model.
main = layers.Dense(32, activation='relu')(main)
num_class = 2
# output layer for the third model
output = layers.Dense(num_class, name = "fake")(main)
Let’s create the model.
main_model = keras.Model(
inputs = [title_input, text_input],
outputs = output
)
Let’s check the third model’s structure by using model.summary().
main_model.summary()
Model: "model_2"
| Layer (type) | Output Shape | Param # | Connected to |
|---|---|---|---|
| title (InputLayer) | [(None, 1)] | 0 | [] |
| text (InputLayer) | [(None, 1)] | 0 | [] |
| text_vectorization (TextVectorization) | (None, 500) | 0 | ['title[0][0]', 'text[0][0]'] |
| embedding (Embedding) | (None, 500, 3) | 7800 | ['text_vectorization[0][0]', 'text_vectorization[1][0]'] |
| dropout (Dropout) | (None, 500, 3) | 0 | ['embedding[0][0]'] |
| dropout_3 (Dropout) | (None, 500, 3) | 0 | ['embedding[1][0]'] |
| global_average_pooling1d (GlobalAveragePooling1D) | (None, 3) | 0 | ['dropout[0][0]'] |
| global_average_pooling1d_1 (GlobalAveragePooling1D) | (None, 3) | 0 | ['dropout_3[0][0]'] |
| dropout_1 (Dropout) | (None, 3) | 0 | ['global_average_pooling1d[0][0]'] |
| dropout_4 (Dropout) | (None, 3) | 0 | ['global_average_pooling1d_1[0][0]'] |
| dense (Dense) | (None, 32) | 128 | ['dropout_1[0][0]'] |
| dense_2 (Dense) | (None, 32) | 128 | ['dropout_4[0][0]'] |
| dropout_2 (Dropout) | (None, 32) | 0 | ['dense[0][0]'] |
| dropout_5 (Dropout) | (None, 32) | 0 | ['dense_2[0][0]'] |
| dense_1 (Dense) | (None, 32) | 1056 | ['dropout_2[0][0]'] |
| dense_3 (Dense) | (None, 32) | 1056 | ['dropout_5[0][0]'] |
| concatenate (Concatenate) | (None, 64) | 0 | ['dense_1[0][0]', 'dense_3[0][0]'] |
| dense_4 (Dense) | (None, 32) | 2080 | ['concatenate[0][0]'] |
| fake (Dense) | (None, 2) | 66 | ['dense_4[0][0]'] |
Total params: 12,314
Trainable params: 12,314
Non-trainable params: 0
The third model has over 12 thousand parameters.
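The parameter counts in the summary can be reproduced by hand. The sketch below uses two hypothetical helper functions: an embedding stores one vector per vocabulary entry, and a dense layer stores one weight per input-output pair plus one bias per output unit.

```python
def embedding_params(vocab_size, output_dim):
    """One output_dim-sized vector per vocabulary entry."""
    return vocab_size * output_dim


def dense_params(n_inputs, n_units):
    """One weight per (input, unit) pair plus one bias per unit."""
    return (n_inputs + 1) * n_units


total = (
    embedding_params(2600, 3)      # shared embedding: 7800
    + 2 * dense_params(3, 32)      # first Dense in each branch: 128 each
    + 2 * dense_params(32, 32)     # second Dense in each branch: 1056 each
    + dense_params(64, 32)         # Dense after concatenation: 2080
    + dense_params(32, 2)          # output layer "fake": 66
)
print(total)   # 12314
```

The shared embedding accounts for most of the parameters, and it is counted only once even though both branches use it.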
We can also create a visual version of the structure for the third model by using the plot_model function.
keras.utils.plot_model(main_model)
Finally, let’s compile and train the third model.
main_model.compile(optimizer = "adam",
loss = losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=['accuracy']
)
main_history = main_model.fit(train_dataset,
validation_data=val_dataset,
epochs = 30)
Training output for all 30 epochs of the third model:
Epoch 1/30
180/180 [==============================] - 5s 18ms/step - loss: 0.1592 - accuracy: 0.9398 - val_loss: 0.0293 - val_accuracy: 0.9918
Epoch 2/30
180/180 [==============================] - 3s 14ms/step - loss: 0.0472 - accuracy: 0.9835 - val_loss: 0.0314 - val_accuracy: 0.9924
Epoch 3/30
180/180 [==============================] - 3s 15ms/step - loss: 0.0413 - accuracy: 0.9873 - val_loss: 0.0253 - val_accuracy: 0.9947
Epoch 4/30
180/180 [==============================] - 3s 14ms/step - loss: 0.0405 - accuracy: 0.9865 - val_loss: 0.0167 - val_accuracy: 0.9962
Epoch 5/30
180/180 [==============================] - 3s 14ms/step - loss: 0.0386 - accuracy: 0.9871 - val_loss: 0.0178 - val_accuracy: 0.9967
Epoch 6/30
180/180 [==============================] - 3s 14ms/step - loss: 0.0370 - accuracy: 0.9877 - val_loss: 0.0242 - val_accuracy: 0.9944
Epoch 7/30
180/180 [==============================] - 3s 14ms/step - loss: 0.0335 - accuracy: 0.9887 - val_loss: 0.0379 - val_accuracy: 0.9798
Epoch 8/30
180/180 [==============================] - 3s 14ms/step - loss: 0.0336 - accuracy: 0.9891 - val_loss: 0.0470 - val_accuracy: 0.9756
Epoch 9/30
180/180 [==============================] - 3s 14ms/step - loss: 0.0303 - accuracy: 0.9893 - val_loss: 0.0134 - val_accuracy: 0.9973
Epoch 10/30
180/180 [==============================] - 3s 14ms/step - loss: 0.0290 - accuracy: 0.9909 - val_loss: 0.0292 - val_accuracy: 0.9882
Epoch 11/30
180/180 [==============================] - 3s 14ms/step - loss: 0.0280 - accuracy: 0.9908 - val_loss: 0.0345 - val_accuracy: 0.9827
Epoch 12/30
180/180 [==============================] - 3s 14ms/step - loss: 0.0289 - accuracy: 0.9902 - val_loss: 0.0344 - val_accuracy: 0.9827
Epoch 13/30
180/180 [==============================] - 3s 14ms/step - loss: 0.0245 - accuracy: 0.9928 - val_loss: 0.0160 - val_accuracy: 0.9951
Epoch 14/30
180/180 [==============================] - 3s 14ms/step - loss: 0.0247 - accuracy: 0.9913 - val_loss: 0.0196 - val_accuracy: 0.9933
Epoch 15/30
180/180 [==============================] - 3s 14ms/step - loss: 0.0251 - accuracy: 0.9914 - val_loss: 0.0290 - val_accuracy: 0.9854
Epoch 16/30
180/180 [==============================] - 3s 14ms/step - loss: 0.0270 - accuracy: 0.9920 - val_loss: 0.0509 - val_accuracy: 0.9724
Epoch 17/30
180/180 [==============================] - 3s 15ms/step - loss: 0.0263 - accuracy: 0.9918 - val_loss: 0.0345 - val_accuracy: 0.9844
Epoch 18/30
180/180 [==============================] - 3s 14ms/step - loss: 0.0260 - accuracy: 0.9908 - val_loss: 0.0440 - val_accuracy: 0.9764
Epoch 19/30
180/180 [==============================] - 3s 15ms/step - loss: 0.0252 - accuracy: 0.9915 - val_loss: 0.0171 - val_accuracy: 0.9933
Epoch 20/30
180/180 [==============================] - 3s 14ms/step - loss: 0.0232 - accuracy: 0.9925 - val_loss: 0.0246 - val_accuracy: 0.9878
Epoch 21/30
180/180 [==============================] - 3s 15ms/step - loss: 0.0255 - accuracy: 0.9911 - val_loss: 0.0294 - val_accuracy: 0.9858
Epoch 22/30
180/180 [==============================] - 3s 15ms/step - loss: 0.0204 - accuracy: 0.9935 - val_loss: 0.0225 - val_accuracy: 0.9916
Epoch 23/30
180/180 [==============================] - 3s 14ms/step - loss: 0.0203 - accuracy: 0.9925 - val_loss: 0.0286 - val_accuracy: 0.9860
Epoch 24/30
180/180 [==============================] - 3s 14ms/step - loss: 0.0199 - accuracy: 0.9935 - val_loss: 0.0201 - val_accuracy: 0.9907
Epoch 25/30
180/180 [==============================] - 4s 23ms/step - loss: 0.0205 - accuracy: 0.9928 - val_loss: 0.0151 - val_accuracy: 0.9947
Epoch 26/30
180/180 [==============================] - 3s 17ms/step - loss: 0.0216 - accuracy: 0.9924 - val_loss: 0.0506 - val_accuracy: 0.9753
Epoch 27/30
180/180 [==============================] - 3s 14ms/step - loss: 0.0223 - accuracy: 0.9925 - val_loss: 0.0241 - val_accuracy: 0.9889
Epoch 28/30
180/180 [==============================] - 3s 14ms/step - loss: 0.0224 - accuracy: 0.9915 - val_loss: 0.0547 - val_accuracy: 0.9780
Epoch 29/30
180/180 [==============================] - 3s 14ms/step - loss: 0.0224 - accuracy: 0.9911 - val_loss: 0.0551 - val_accuracy: 0.9771
Epoch 30/30
180/180 [==============================] - 3s 14ms/step - loss: 0.0212 - accuracy: 0.9924 - val_loss: 0.0303 - val_accuracy: 0.9856
Let’s plot the history of the accuracy on both the training and validation sets for the third model.
history_plot(main_history, model_name="Third Model")
The third model scores at least 97% validation accuracy in every epoch, which makes it the best of the three models. The second model, which uses only the article text, scores roughly 95% to 99% validation accuracy after epoch 14, which is also good.
As a result, using both the title and the full text of the article is the most effective way to detect fake news. However, the third model is more complicated and takes longer to train than the second model, and both models perform well. When computing power and time are limited, the second model is a reasonable choice.
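The claim about the third model's validation accuracy can be checked against the training log. The following is a minimal sketch with the val_accuracy values for epochs 15 through 30 copied by hand from the output above; the variable names are illustrative and not part of the original notebook.

```python
# val_accuracy for epochs 15..30, copied by hand from the training log above
val_accuracy = [
    0.9854, 0.9724, 0.9844, 0.9764, 0.9933, 0.9878, 0.9858, 0.9916,
    0.9860, 0.9907, 0.9947, 0.9753, 0.9889, 0.9780, 0.9771, 0.9856,
]
first_epoch = 15  # epoch number of the first entry in the list

worst = min(val_accuracy)
best = max(val_accuracy)
mean = sum(val_accuracy) / len(val_accuracy)
best_epoch = first_epoch + val_accuracy.index(best)

print(f"min={worst:.4f}  mean={mean:.4f}  max={best:.4f} (epoch {best_epoch})")
```

With the History object in hand, the same list should be available as main_history.history["val_accuracy"], assuming the model was compiled with metrics=["accuracy"].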
§4. Model Evaluation
In this section, we will download the test data and evaluate our models on unseen data.
Don’t forget that we also need to convert the test dataframe to a dataset using the function make_dataset().
test_url = "https://github.com/PhilChodrow/PIC16b/blob/master/datasets/fake_news_test.csv?raw=true"
test_df = pd.read_csv(test_url)
# convert to dataset
test_dataset = make_dataset(test_df)
Now we can check the performance of our best model (the third model) on unseen data.
main_model.evaluate(test_dataset)
225/225 [==============================] - 3s 11ms/step - loss: 0.0284 - accuracy: 0.9864
[0.02836262620985508, 0.9864136576652527]
That’s impressive: the third model predicts whether an article contains fake news with about 98.6% accuracy.
Since our second model also performs well, let’s check its performance on unseen data too.
text_model.evaluate(test_dataset)
225/225 [==============================] - 3s 14ms/step - loss: 0.0606 - accuracy: 0.9710
[0.060581598430871964, 0.9710454940795898]
The second model predicts whether an article contains fake news with about 97.1% accuracy.
Both the third and second models perform very well. Therefore, we might consider using only the article’s full text to detect fake news when time is limited.
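Accuracies this close to 100% can hide the size of the gap between two models, so it can help to compare error rates instead. A small sketch using the two accuracy values printed by evaluate() above (the variable names are illustrative):

```python
# Test accuracies reported by evaluate() above
acc_third = 0.9864136576652527    # third model (title and text)
acc_second = 0.9710454940795898   # second model

# Error rate = share of articles classified incorrectly
err_third = 1 - acc_third
err_second = 1 - acc_second
ratio = err_second / err_third

print(f"third model error rate:  {err_third:.2%}")
print(f"second model error rate: {err_second:.2%}")
print(f"second / third:          {ratio:.1f}x")
```

Seen this way, the second model makes roughly twice as many mistakes as the third, which is the price paid for the shorter training time.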
§5. Embedding Visualization
Word embeddings are often produced as intermediate stages in many machine learning algorithms. Let’s take a look at the embedding layer to see how our own model represents words in a vector space.
weights = main_model.get_layer('embedding').get_weights()[0] # get the weights from the embedding layer
vocab = vectorize_layer.get_vocabulary() # get the vocabulary from our data prep for later
Let’s check the shape of the weights.
weights.shape
(2600, 3)
We have three columns because we set output_dim = 3 in the embedding layer.
To plot the embedding in two dimensions, we must reduce the weights from three dimensions to two using principal component analysis (PCA).
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
weights = pca.fit_transform(weights)
Let’s double-check the shape of the weights.
weights.shape
(2600, 2)
Good, we have successfully reduced the weights from three dimensions to two.
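For readers curious about what PCA does under the hood, here is a rough NumPy-only sketch of the same reduction on a small random matrix. It is an illustration under stated assumptions, not scikit-learn's implementation: toy_weights is a made-up stand-in for the real embedding matrix, and the result should match PCA(n_components=2) only up to the sign of each component.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_weights = rng.normal(size=(10, 3))  # made-up stand-in for the (2600, 3) weights

# 1. Center each column so every dimension has mean zero
centered = toy_weights - toy_weights.mean(axis=0)

# 2. SVD of the centered data: the rows of vt are the principal directions,
#    ordered from largest to smallest singular value
u, s, vt = np.linalg.svd(centered, full_matrices=False)

# 3. Project the centered data onto the first two principal directions
projected = centered @ vt[:2].T

# Share of the total variance kept by the two retained components
explained = float((s[:2] ** 2).sum() / (s ** 2).sum())

print(projected.shape)
print(f"variance kept: {explained:.1%}")
```

The variance share is worth a glance on the real weights too: if it is high, little information is lost by dropping the third dimension.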
Let’s make a dataframe from our results:
embedding_df = pd.DataFrame({
    'word' : vocab,
    'x0'   : weights[:,0],  # first principal component
    'x1'   : weights[:,1],  # second principal component
})
Let’s check our embedding_df.
embedding_df
| | word | x0 | x1 |
|---|---|---|---|
| 0 | | -0.009576 | 0.025002 |
| 1 | [UNK] | -0.058210 | 0.053124 |
| 2 | the | -0.387607 | 0.042387 |
| 3 | to | 0.051087 | 0.091192 |
| 4 | of | 0.302065 | 0.076064 |
| ... | ... | ... | ... |
| 2595 | jury | -0.039374 | -0.065970 |
| 2596 | jackson | 0.302881 | -0.100250 |
| 2597 | impossible | 0.828751 | -0.256427 |
| 2598 | emerged | -0.646941 | 0.213412 |
| 2599 | complained | 0.178686 | -0.326047 |

2600 rows × 3 columns
We have a special token, [UNK], in our data frame. It stands for any unknown word, that is, any word that did not make it into the 2600-entry vocabulary. The empty string in row 0 is the padding token. TextVectorization reserves these first two positions for the special tokens, so their place at the top of the list does not reflect word frequency; the remaining words are ordered from most to least frequent.
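To make the role of the special tokens concrete, here is a small pure-Python imitation of a frequency-ordered vocabulary with two reserved slots. It is a simplified sketch and not TextVectorization's actual implementation; build_vocab and encode are hypothetical helpers written for this illustration.

```python
from collections import Counter

def build_vocab(texts, max_tokens):
    """Reserve index 0 for padding and index 1 for unknown words,
    then fill the remaining slots with the most frequent words."""
    counts = Counter(word for text in texts for word in text.split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def encode(text, vocab):
    """Map each word to its index; words outside the vocabulary become 1."""
    index = {word: i for i, word in enumerate(vocab)}
    return [index.get(word, 1) for word in text.split()]

texts = ["the senate passed the bill", "the president signed the bill"]
vocab = build_vocab(texts, max_tokens=4)
encoded = encode("the governor vetoed the bill", vocab)

print(vocab)    # ['', '[UNK]', 'the', 'bill']
print(encoded)  # [2, 1, 1, 2, 3]
```

Every word that falls outside the vocabulary, here "governor" and "vetoed", is mapped to the same index, so all such words share a single row of the embedding matrix.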
Finally, we can make the plot.
fig = px.scatter(embedding_df,
                 x = "x0",
                 y = "x1",
                 size = list(np.ones(len(embedding_df))),
                 size_max = 4,
                 hover_name = "word",
                 title='Embedding Visualization')
fig.update_layout(title_x=0.5)
fig.show()
On the left side, along the negative x0 axis, we have words such as:
- (-6.200, -0.175) : “gop” (the leftmost word)
- (-4.599, 0.192) : “watch”
- (-4.208, -0.204) : “breaking”
- (-4.091, 0.122) : “reportedly”
- (-4.084, -0.241) : “rep”
On the right side, along the positive x0 axis, we have words such as:
- (6.464, -0.206) : “trumps” (the rightmost word)
- (4.420, -0.105) : “myanmar”
- (4.210, -0.191) : “im”
- (4.071, -0.058) : “dont”
- (4.017, -0.143) : “rohingya”
I am not sure which side is more likely to indicate fake news, but there are more sensitive words on the left side, such as:
- (-3.030, -0.074) : “racist”
- (-1.912, -0.156) : “conspiracy”
- (-1.490, -0.175) : “hate”
- (-1.159, 0.137) : “shooting”
- (-0.727, -0.172) : “muslims”
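Rather than reading the extreme words off the plot by hovering, they can also be found by sorting. The sketch below uses a handful of (word, x0) pairs copied from the lists above; points is an illustrative stand-in for the full data frame.

```python
# (word, x0) pairs copied from the lists above
points = [
    ("gop", -6.200), ("watch", -4.599), ("breaking", -4.208),
    ("racist", -3.030), ("hate", -1.490),
    ("trumps", 6.464), ("myanmar", 4.420), ("im", 4.210),
]

# Sort by the x0 coordinate, from most negative to most positive
by_x0 = sorted(points, key=lambda pair: pair[1])

leftmost = [word for word, _ in by_x0[:3]]
rightmost = [word for word, _ in by_x0[-3:][::-1]]

print("leftmost: ", leftmost)   # ['gop', 'watch', 'breaking']
print("rightmost:", rightmost)  # ['trumps', 'myanmar', 'im']
```

On the full data frame, embedding_df.nsmallest(5, "x0") and embedding_df.nlargest(5, "x0") should give the same kind of ranking directly.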
Bias in Language Models
Bias is a common problem in language models, and it can sneak into ours as well. It does not take a biased modeler to create a biased model, only a modeler who is not sufficiently careful.
Here we can examine the possible biases learned by our model. Let’s check which kinds of words our model associates with feminine and masculine terms.
feminine = ["she", "her", "woman"]
masculine = ["he", "him", "man"]
highlight_1 = ["strong", "powerful", "smart", "thinking"]
highlight_2 = ["hot", "sexy", "beautiful", "shopping"]
def gender_mapper(x):
    if x in feminine:
        return 1
    elif x in masculine:
        return 4
    elif x in highlight_1:
        return 3
    elif x in highlight_2:
        return 2
    else:
        return 0
embedding_df["highlight"] = embedding_df["word"].apply(gender_mapper)
embedding_df["size"] = np.array(1.0 + 50*(embedding_df["highlight"] > 0))
fig = px.scatter(embedding_df,
                 x = "x0",
                 y = "x1",
                 color = "highlight",
                 size = list(embedding_df["size"]),
                 size_max = 20,
                 hover_name = "word",
                 title='Embedding Visualization with Highlight')
fig.update_layout(title_x=0.5)
fig.show()
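One caveat when highlighting words this way: a word only shows up in the plot if it made it into the vocabulary, so it may be worth checking which of the chosen words are actually present. A minimal sketch follows; toy_vocab is a hypothetical stand-in for vectorize_layer.get_vocabulary(), and the groups dictionary is just a convenience for looping over the four word lists.

```python
groups = {
    "feminine": ["she", "her", "woman"],
    "masculine": ["he", "him", "man"],
    "highlight_1": ["strong", "powerful", "smart", "thinking"],
    "highlight_2": ["hot", "sexy", "beautiful", "shopping"],
}

# Hypothetical stand-in for vectorize_layer.get_vocabulary()
toy_vocab = ["", "[UNK]", "the", "he", "she", "her", "man",
             "strong", "hot", "powerful"]

vocab_set = set(toy_vocab)  # set membership is faster than scanning a list

# For each group, collect the words that would never appear in the plot
missing = {
    name: [word for word in words if word not in vocab_set]
    for name, words in groups.items()
}

for name, words in missing.items():
    print(f"{name}: not in vocabulary -> {words}")
```

Any word reported as missing is silently skipped by gender_mapper, because it never occurs in the "word" column of embedding_df.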