Visualising Word Vectors Using TF2 [Advisable]
Exploration and Visualisation of Word Vectors Using TensorFlow 2
TensorFlow 2.0 was released on September 30, 2019.
As also mentioned in Visualising Word Vectors Using TF1, the reason we are visualising FastText embeddings is that:
FastText builds embeddings from sub-word information (character n-grams), which lets us visualise and retrieve misspellings of a word or different spellings of the same word.
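To make the sub-word idea concrete, here is a minimal sketch of my own (not an original notebook cell, and assuming the same "fasttextmodel.bin" file that is loaded later): even a misspelled, out-of-vocabulary word still receives a vector built from its character n-grams, and that vector tends to lie close to the correctly spelled word.
# hypothetical sanity check: an OOV misspelling still gets a meaningful vector
import numpy as np
import fasttext
model = fasttext.load_model("fasttextmodel.bin")  # assumed model file, as used below
def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
vec_word = model.get_word_vector("language")   # in-vocabulary word
vec_typo = model.get_word_vector("langauge")   # misspelling, composed from sub-words
print(cosine(vec_word, vec_typo))              # typically a high similarity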
I couldn't find any blogs on the Internet that cover, or have updated their code for, visualising embeddings with the latest version of TensorFlow.
I did, however, take help from the issue "TF 2.0 API for using the embedding projector" raised on the TensorFlow repository, and arrived at a notebook that suits my goal, i.e. visualising FastText embeddings using TF2.
The TensorFlow version used in this notebook is version 2.
# import statements
from pathlib import PurePath
import fasttext
import numpy as np
import tensorflow as tf
from tensorflow.python.framework import ops
from tensorboard.plugins import projector
from tensorboard.plugins.projector import ProjectorConfig
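Before going further, it is worth confirming that the environment is actually running TensorFlow 2 rather than a leftover 1.x installation; this quick check is a small addition of mine, not an original notebook cell.
# sanity check: this notebook expects TensorFlow 2.x
print(tf.__version__)
assert tf.__version__.startswith("2."), "TensorFlow 2.x is required"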
#hide_output
# load pre-trained fasttext model
model = fasttext.load_model("fasttextmodel.bin")
for i, w in enumerate(model.get_words()):
print(w)
if i > 4:
break
#hide_output
# number of words in the dataset
VOCAB_SIZE = len(model.get_words())
# dimensionality of each word vector
EMBEDDING_DIM = len(model.get_word_vector(w))
# 2D numpy array initialised to store words with their vector representation
embed = np.zeros((VOCAB_SIZE, EMBEDDING_DIM))
embed.shape
# store the vector representation of each word in the 2D numpy array
for i, word in enumerate(model.get_words()):
embed[i] = model.get_word_vector(word)
embed
# path to store the words
tsv_file_path = "tensorboard/metadata.tsv"
Projection on TensorBoard 2
Steps for projection:
- Define the functions register_embedding() and save_labels_tsv() to configure the projector and to save the projector configuration files and the metadata file to the same folder.
- Initialise the path variables accordingly and call the above functions with suitable path variables, as shown in the cells below.
- Create the TensorFlow variable instead of a TensorFlow placeholder.
- Initialise a saver class object and create a checkpoint.
Differences between TF 1 and TF 2
--> We cannot call the reset_default_graph method directly from the tf library as we did in TensorFlow 1. In TF2 it is invoked as:
from tensorflow.python.framework import ops
ops.reset_default_graph()
--> No placeholder is required here. A tf.Variable is created instead, as shown below. The parameters passed are x, the array which contains the embeddings, and name, the name given to the embedding tensor (which is also used for the embedding checkpoint file).
tensor_embeddings = tf.Variable(x, name=EMBEDDINGS_TENSOR_NAME)
According to the TF2 documentation, a TensorFlow variable is defined as:
A variable maintains shared, persistent state manipulated by a program. The Variable() constructor requires an initial value for the variable, which can be a Tensor of any type and shape. This initial value defines the type and shape of the variable. After construction, the type and shape of the variable are fixed. The value can be changed using one of the assign methods.
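As a small illustration of the quoted behaviour (my own example, not from the notebook): the dtype and shape of a variable are fixed when it is constructed, while its value can still be updated with one of the assign methods.
v = tf.Variable([1.0, 2.0], name="demo_variable")  # dtype and shape fixed at construction
v.assign([3.0, 4.0])         # allowed: same shape and dtype
# v.assign([1.0, 2.0, 3.0])  # would raise an error: shape mismatch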
--> There is no concept of a session in TF2, and the saver class is only available through the compat module. To work around this, we use the tf.compat.v1 saver class to create the checkpoint: the saver object is initialised with the TensorFlow variable, and None is passed as the value of the sess parameter in saver.save().
saver = tf.compat.v1.train.Saver([tensor_embeddings])
saver.save(sess=None, global_step=STEP, save_path=EMBEDDINGS_FPATH)
ops.reset_default_graph() # clearing the default graph stack
def register_embedding(
embedding_tensor_name: str, meta_data_fname: str, log_dir: str,
) -> None:
"""
Configuring the projector to be read by the tensorboard.
Args:
embedding_tensor_name(str): embeddings file name
meta_data_fname(str): metadata file name
log_dir(str): folder where tensorboard files and the metadata file are saved
Returns:
None
"""
config = projector.ProjectorConfig()
embedding = config.embeddings.add()
embedding.tensor_name = embedding_tensor_name
embedding.metadata_path = meta_data_fname
projector.visualize_embeddings(
log_dir, config
) # storing the configuration files of projector where tensorboard files are saved
def save_labels_tsv(labels: list, filepath: str, log_dir: str,) -> None:
"""
Storing the vocabulary of words in the dataset to a file
Args:
labels: vocabulary i.e. words in the dataset
filepath: metadata file name
log_dir: "folder where tensorboard files and projector files are saved
Returns:
None
"""
with open(PurePath(log_dir, filepath), "w") as f:
for label in labels:
f.write("{}\n".format(label))
LOG_DIR = "tb2files" # folder which will contain all the tensorboard log files
# Labels i.e. the words in the dataset will be stored in this file
META_DATA_FNAME = "meta.tsv"
# name of the file which will have the embeddings stored
EMBEDDINGS_TENSOR_NAME = "embeddings"
# path for checkpoint of the saved embeddings
EMBEDDINGS_FPATH = PurePath(LOG_DIR, EMBEDDINGS_TENSOR_NAME + ".ckpt")
STEP = 0
x = embed # array containing the embeddings
y = model.get_words() # list containing the vocabulary
register_embedding(EMBEDDINGS_TENSOR_NAME, META_DATA_FNAME, LOG_DIR)
save_labels_tsv(y, META_DATA_FNAME, LOG_DIR)
tensor_embeddings = tf.Variable(
x, name=EMBEDDINGS_TENSOR_NAME
) # creation of the tensorflow variable, x: array which contains the embeddings,
# name: name of the file which will have the embeddings stored
#hide_output
saver = tf.compat.v1.train.Saver(
[tensor_embeddings]
) # Tensorflow variable passed as argument for saver object to be initialised
saver.save(
sess=None, global_step=STEP, save_path=EMBEDDINGS_FPATH
) # saving the checkpoint for the embedding files
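With the checkpoint, the projector configuration files and the metadata file all saved under LOG_DIR, TensorBoard can now be pointed at that folder and the embeddings explored in its Projector tab. The command below assumes the LOG_DIR value used above ("tb2files"):
tensorboard --logdir tb2files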