
Word2Vec Embeddings For Sentiment Analysis Using Python!



I hope you have read the intuition behind the word2vec model by now. If not, I suggest you read the previous blog, where I briefly explain the intuition behind the word2vec model along with its architecture. In this blog, we shall get our hands dirty and implement the word2vec model to perform financial sentiment analysis. You can either go through the code snippets for reference and implement them alongside in a Google Colab notebook (link: https://colab.research.google.com/ ), or use the Colab link to my implementation attached at the end.


I will be using the same dataset that I used to implement the N-Grams model, in order to compare the performance of both models. I have left the download link below just in case.



Okie, let's get right to it!


Let's import all the required libraries for implementing the model.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('dark_background')

Reading the dataset and seeing if there are any missing values in each column.

clnames=['Sentiment','News Headline']
df=pd.read_csv('all-data.csv',encoding='ISO-8859-1',names=clnames)
print(df.head())
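
Since the text above mentions checking for missing values, here is a minimal check to go with it (assuming the same df as above); it is also worth glancing at the class balance while we are at it.

print(df.isnull().sum())               # count of missing values per column
print(df['Sentiment'].value_counts())  # how many positive/negative/neutral headlines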

We assign the first column to y and the second to x. Calling .shape prints the dimensions as (# of rows, # of columns).

Each of the resulting arrays (x and y) contains 4846 entries, one per row of the dataset.

y=df['Sentiment'].values
x=df['News Headline'].values
y.shape
x.shape

train_test_split splits the dataset into two fragments. Most of it is used for training our model, i.e. for learning the word embeddings and fitting the classifiers. The code below sets aside 40% of the dataset (0.4 * 4846 ≈ 1938 rows) for testing and keeps the rest for training. Note that the function returns numpy arrays.

from sklearn.model_selection import train_test_split
(x_train,x_test,y_train,y_test)=train_test_split(x,y,test_size=0.4)
x_train.shape
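
A small aside: the split above is random on every run. If you want a reproducible and class-balanced split, one option (a sketch using the same x and y) is:

(x_train,x_test,y_train,y_test)=train_test_split(x,y,test_size=0.4,random_state=42,stratify=y)
# random_state fixes the shuffle so results are repeatable,
# stratify=y keeps the sentiment proportions the same in train and test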

Converting these numpy arrays (x_train, y_train and their test counterparts) to pandas DataFrames and concatenating them to form the same structure as our initial dataset. An equivalent one-step construction is shown right after the snippet below.

df1=pd.DataFrame(x_train)
df1=df1.rename(columns={0:'News headline'})
df2=pd.DataFrame(y_train)
df2=df2.rename(columns={0:'Sentiment'})
df_train=pd.concat([df1,df2],axis=1)
print(df_train.head(10))
df3=pd.DataFrame(x_test)
df3=df3.rename(columns={0:'News headline'})
df4=pd.DataFrame(y_test)
df4=df4.rename(columns={0:'Sentiment'})
df_test=pd.concat([df3,df4],axis=1)
print(df_test.head(10))
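
Equivalently, each of the two DataFrames can be built in a single step; a minimal alternative assuming the same train/test arrays:

df_train = pd.DataFrame({'News headline': x_train, 'Sentiment': y_train})
df_test = pd.DataFrame({'News headline': x_test, 'Sentiment': y_test})
print(df_train.head(10))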

Data preprocessing

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import string
string.punctuation
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('stopwords')
def preprocess_text(text):
    # we convert the text to lowercase
    text = text.lower()
    #we are removing the punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # we tokenize the text as in break each sentence down to a list of words
    tokens = word_tokenize(text)

    # we remove stop words
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [token for token in tokens if token not in stop_words]

    # we lemmatize the tokens, i.e. convert words to their base form, for example
    # running -> run, walking -> walk, loves -> love, since those extra
    # letters don't add any value but can affect the model's performance
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]

    # Return the preprocessed text as a single string
    preprocessed_text = ' '.join(lemmatized_tokens)
    return preprocessed_text

# Apply the preprocessing function to the 'News headline' column
df_train['News headline'] = df_train['News headline'].apply(preprocess_text)

# Display the DataFrame with preprocessed text
print(df_train.head())
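
To see what the function actually does, here is a quick before/after on a made-up headline (the example string is just an illustration, not from the dataset):

sample = "Operating profit rose to EUR 9.4 mn from EUR 8.1 mn in 2009."
print(preprocess_text(sample))
# prints something along the lines of: operating profit rose eur 94 mn eur 81 mn 2009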

Here we generate the word embeddings and perform various operations on them. At the very end, notice that we print the averaged vector of all the words in the first sentence.

Later on, we will do the same for every sentence in the training dataset, so that each sentence is summarised by a single vector that captures its overall context.

from gensim.models import Word2Vec
model = Word2Vec(sentences=[sentence.split() for sentence in df_train['News headline'].values],vector_size=100)
model.save("word2vec.model")
import numpy as np
# Load the model from the model file
sg_w2v_model = Word2Vec.load("word2vec.model")
# Unique ID of the word
print("Index of the word 'year':")
print(sg_w2v_model.wv.key_to_index["year"])
# Total number of the words
print(len(sg_w2v_model.wv.key_to_index))
# Print the size of the word2vec vector for one word
print("Length of the vector generated for a word")
print(len(sg_w2v_model.wv['year']))
# Get the mean of the vectors for an example headline
print("Averaged vector of all word vectors in the first sentence:")
print(np.mean([sg_w2v_model.wv[word] for word in df_train['News headline'].values[0].split() if word in sg_w2v_model.wv], axis=0))
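
Since this word-averaging expression gets reused for every sentence from here on, it can be handy to wrap it in a small helper; a sketch assuming the trained sg_w2v_model and 100-dimensional vectors:

def sentence_vector(sentence, model, size=100):
    # average the Word2Vec vectors of all in-vocabulary words in a sentence
    vectors = [model.wv[word] for word in sentence.split() if word in model.wv]
    if not vectors:
        # sentence with no known words: fall back to a zero vector
        return np.zeros(size)
    return np.mean(vectors, axis=0)

print(sentence_vector(df_train['News headline'].values[0], sg_w2v_model)[:5])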

The example here shows the second most similar word to 'year'. The most similar word is given by:

most_similar_word = similar_words[0][0], which is the word itself, so we take index 1 instead.

word_vector = sg_w2v_model.wv['year']

# Find the most similar word(s) to the given word vector
similar_words = sg_w2v_model.wv.similar_by_vector(word_vector, topn=10)

# Retrieve the key (word) from the similar_words result
most_similar_word = similar_words[1][0]

print(most_similar_word)
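
Note that gensim also exposes this lookup directly, and in that form the query word itself is excluded from the results:

# topn=5 nearest neighbours of 'year' in the embedding space
print(sg_w2v_model.wv.most_similar('year', topn=5))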

In the code snippet below, we generate the averaged word embedding for each sentence and store them all in a file called 'word2veccc.csv'. We store them in a file because our dataset is fairly large, and this way the vectors can easily be accessed later on.

Open the file that is created on running this snippet to see what exactly is happening.

with open("word2veccc.csv", 'w+') as word2vec_file:
  for index, row in df_train.iterrows():
    model_vector = (np.mean([sg_w2v_model.wv[i]  for token in df_train['News headline'].values[index] for i in token.split()if i in sg_w2v_model.wv], axis=0)).tolist()
    if index == 0:
      #header representing indexing projection of vector on each axis 
       #(there are 100 axis's(humanely not possible to visualize but mathematical operations are feasible)
      header = ",".join(str(ele) for ele in range(100))
      word2vec_file.write(header)
      word2vec_file.write("\n")
    if type(model_vector) is list:
      #each element in a .csv file are separated by ','.we make use of that to 
      line1 = ",".join( [str(vector_element) for vector_element in model_vector] )
      #print(line1) to see each vector elements of type string separated by ,
    else:
      #in very rare cases we could potentially end up with np.mean being a single float value instead of an array or list
      #,so to handle that 
      line1 = ",".join([str(0) for i in range(100)])
    word2vec_file.write(line1)
    word2vec_file.write('\n')

I have fed these vectors to one classic model (a Decision Tree classifier) and an Artificial Neural Network (ANN) just to compare their performance.

from sklearn.tree import DecisionTreeClassifier
# Load from the filename
with open("word2veccc.csv", 'r') as word2vec_file:
  word2vec_df = pd.read_csv(word2vec_file)
#Initialize the model
  clf_decision_word2vec = DecisionTreeClassifier()
# Fit the model
  clf_decision_word2vec.fit(word2vec_df, df_train['Sentiment'])

We preprocess the test dataset into the averaged embeddings that the model understands and then predict.

from sklearn.metrics import accuracy_score
test_features_word2vec = []
df_test['News headline'] = df_test['News headline'].apply(preprocess_text)
for index, row in df_test.iterrows():
    # average the vectors of all in-vocabulary words in this headline
    model_vector = np.mean([sg_w2v_model.wv[word] for word in row['News headline'].split() if word in sg_w2v_model.wv], axis=0)
    if isinstance(model_vector, np.ndarray):
        test_features_word2vec.append(model_vector)
    else:
        # headline with no known words: fall back to a zero vector
        test_features_word2vec.append(np.array([0 for i in range(100)]))
test_predictions_word2vec = clf_decision_word2vec.predict(test_features_word2vec)
print([i for i in test_predictions_word2vec])
print(accuracy_score(df_test['Sentiment'], test_predictions_word2vec) * 100)
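
Accuracy alone can hide how the model behaves on each class, especially if the dataset is imbalanced, so a per-class breakdown is worth printing as well:

from sklearn.metrics import classification_report
# precision, recall and F1 for each sentiment class of the decision tree
print(classification_report(df_test['Sentiment'], test_predictions_word2vec))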

Importing the libraries needed for implementing a simple ANN.

import tensorflow as tf
from tensorflow import keras
from keras import layers
from keras.utils import to_categorical
from keras.layers import Dropout
from keras.optimizers import Adam
from sklearn.preprocessing import LabelEncoder
#encoding the sentiments using one hot encoding which will be the output of the model
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df_train['Sentiment'])
y_encoded = to_categorical(y)
print(y_encoded)
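
To see which column of the one-hot output corresponds to which sentiment, you can inspect the classes learned by the encoder (LabelEncoder sorts them alphabetically):

print(label_encoder.classes_)  # e.g. ['negative' 'neutral' 'positive']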

Initializing and compiling the neural net

ann=tf.keras.models.Sequential()
#adding 300 neurons for the hidden layer
ann.add(layers.Dense(units=300,activation='relu'))
ann.add(Dropout(rate=.2))
#3 neurons at the output since we have three sentiments
ann.add(layers.Dense(units=3,activation='softmax'))
ann.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
ann.fit(word2vec_df,y_encoded,epochs=20)
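
If you want to keep an eye on overfitting while training, you can also hold out a slice of the training data as validation; a sketch on the same model:

ann.summary()  # layer shapes and parameter counts (the model is built once fit has run)
# alternatively, re-fit with a validation split to monitor generalisation per epoch:
# ann.fit(word2vec_df, y_encoded, epochs=20, validation_split=0.1)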

Checking the performance!

from sklearn.metrics import accuracy_score

test_features_word2vec = []
df_test['News headline'] = df_test['News headline'].apply(preprocess_text)

for index, row in df_test.iterrows():
    # average the vectors of all in-vocabulary words in this headline
    model_vector = np.mean([sg_w2v_model.wv[word] for word in row['News headline'].split() if word in sg_w2v_model.wv], axis=0)
    if isinstance(model_vector, np.ndarray):
        test_features_word2vec.append(model_vector)
    else:
        # headline with no known words: fall back to a zero vector
        test_features_word2vec.append(np.array([0 for i in range(100)]))

test_features_word2vec = np.array(test_features_word2vec)

# Reshape the test features
test_features_word2vec = test_features_word2vec.reshape((len(test_features_word2vec), -1))

# Predict the sentiment labels for the test features
test_predictions_word2vec = ann.predict(test_features_word2vec)
predicted_labels = np.argmax(test_predictions_word2vec, axis=1)

# Convert the true labels to numerical values
y = label_encoder.transform(df_test['Sentiment'])
y_test_enc = to_categorical(y)

# Calculate the accuracy score
accuracy = accuracy_score(np.argmax(y_test_enc, axis=1), predicted_labels)
print("Accuracy:", accuracy * 100)

Link for my implementation:


Let me know what y'all think. Stay tuned for more!!



















 
 
 
