The visualisation of embeddings using Tensorflow 1 is inspired by this blog.

The above-mentioned blog gives a good theoretical description, but its code has become obsolete. The changes required for correct visualisation, together with the complete theoretical description, are given in this notebook, and in an updated form for TF2 in the accompanying blog post.

FastText builds embeddings from sub-word information (character n-grams); this approach lets us visualise and retrieve misspellings of a word, or different spellings of the same word.
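To make the sub-word idea concrete, here is a minimal sketch (assuming the same fasttextmodel.bin loaded later in this notebook) that looks up the nearest neighbours of a misspelled word; because the vector is composed from character n-grams, the misspelling still lands close to the correct spelling:

import fasttext

# assumed to be the same model file used later in this notebook
model = fasttext.load_model("fasttextmodel.bin")

# "peopel" is a misspelling and likely out of vocabulary, but fastText
# still builds a vector for it from its character n-grams, so its
# nearest neighbours typically include "people"
for score, word in model.get_nearest_neighbors("peopel", k=5):
    print(word, score)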

As we currently have the latest Tensorflow version installed, instead of downgrading it to version 1 we use the following code:

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

This gives us the behaviour of Tensorflow 1 while running Tensorflow 2.
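As a quick sanity check (a small sketch, not part of the original notebook), after disabling v2 behaviour eager execution should be off and placeholders should be available again:

import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()

# eager execution is disabled, so we are back in TF1-style graph mode
print(tf.executing_eagerly())  # False

# placeholders are available again through the compat.v1 API
x = tf.placeholder(tf.float32, shape=(None, 3), name="check")
print(x)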

#hide_output
from pathlib import PurePath

import fasttext
import numpy as np
from tensorflow.python.framework import ops
from tensorboard.plugins import projector
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()  # disable TF2 behaviour so TF1-style APIs (placeholders, sessions) are available
from tensorboard.plugins.projector import ProjectorConfig

ops.reset_default_graph()
#hide_output
model = fasttext.load_model("fasttextmodel.bin")
# directory to save files to visualise on tensorboard (assumed to already exist)
FOLDER_PATH = "tb1files"
for i, w in enumerate(model.get_words()):
    print(w)
    if i > 4:
        break
s
said
mr
</s>
people
new
#hide_output

# number of words in the dataset
VOCAB_SIZE = len(model.get_words())


# dimension of each word vector (w is the last word printed in the loop above)
EMBEDDING_DIM = len(model.get_word_vector(w))


# 2D numpy array initialised to store the vector representation of each word
embed = np.zeros((VOCAB_SIZE, EMBEDDING_DIM))
embed.shape
# store the vector representation of each word in the 2D numpy array
for i, word in enumerate(model.get_words()):
    embed[i] = model.get_word_vector(word)
embed
array([[-0.11363645,  0.00304414,  0.00589875, ...,  0.00278742,
         0.03564256, -0.10496949],
       [ 0.05821591,  0.07343163, -0.06941246, ...,  0.00737938,
         0.08668958, -0.05127012],
       [ 0.06867523, -0.02112868, -0.02132288, ...,  0.05362611,
         0.13982825,  0.04221647],
       ...,
       [ 0.16511762,  0.04439345, -0.14276202, ...,  0.02632121,
         0.03970968,  0.03706815],
       [ 0.09471416,  0.09356211,  0.00358974, ..., -0.0174412 ,
         0.13414964,  0.02268019],
       [ 0.07753251, -0.02356024, -0.05303693, ...,  0.14130574,
         0.09740689,  0.0418443 ]])
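As a quick check (a small sketch added here, not in the original notebook), any row of embed should match the vector fastText returns for the corresponding word:

# the i-th row of embed is the vector of the i-th word in the vocabulary
words = model.get_words()
assert np.allclose(embed[0], model.get_word_vector(words[0]))
assert np.allclose(embed[100], model.get_word_vector(words[100]))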
# path to store the words
tsv_file_path = FOLDER_PATH + "/metadata.tsv"
tsv_file_path
'tb1files/metadata.tsv'
with open(tsv_file_path, "w", encoding="utf-8") as f:
    for word in model.get_words():
        f.write(word + "\n")  # write each word on its own line in the metadata file
embed.shape
(10891, 300)
TENSORBOARD_FILES_PATH = FOLDER_PATH
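A common Tensorboard projector pitfall is a mismatch between the number of lines in metadata.tsv and the number of embedding rows; a quick check (a small sketch, not from the original notebook):

# the projector expects exactly one metadata line per embedding row
with open(tsv_file_path, encoding="utf-8") as f:
    n_lines = sum(1 for _ in f)
assert n_lines == VOCAB_SIZE, (n_lines, VOCAB_SIZE)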

Projection on Tensorboard 1 [Part 1]

Steps for projection [Part 1]:

  1. A placeholder is created of size VOCAB_SIZE x EMBEDDING_DIM.
  2. A global variable is created to store the placeholder values.
  3. A new tensorflow session is started and the placeholder is fed the array that holds the vocabulary and the respective embeddings.
  4. A saver object is instantiated for saving values into variables and restoring them from checkpoints, and a writer object is initialised to output the graph.

Differences between TF 1 and TF 2

--> In TF1, the default graph can be reset directly through the tensorflow library, which clears the default graph stack and resets the global default graph:

tf.reset_default_graph()

--> TF2 does not have placeholders, such as the one shown below:

X_init = tf.placeholder(tf.float32, shape=(VOCAB_SIZE, EMBEDDING_DIM), name="embedding")

This is resolved by disabling the TF2 behaviour through the imports shown earlier:

import tensorflow.compat.v1 as tf           
tf.disable_v2_behavior()

If these imports aren't used, we receive the error: AttributeError: module 'tensorflow' has no attribute 'placeholder'.


# Tensorflow Placeholders
tf.reset_default_graph()
X_init = tf.placeholder(tf.float32, shape=(VOCAB_SIZE, EMBEDDING_DIM), name="embedding")
X = tf.Variable(X_init)


# Initializer
init = tf.global_variables_initializer()


# Start Tensorflow Session
sess = tf.Session()
sess.run(init, feed_dict={X_init: embed})


# Instance of Saver (for checkpoints) and a FileWriter that outputs the graph
saver = tf.train.Saver()
writer = tf.summary.FileWriter(TENSORBOARD_FILES_PATH, sess.graph)

Projection on Tensorboard 1 [Part 2]

Steps for projection [Part 2]:

  1. Instantiate the projector configuration object.
  2. Assign the file that contains the vocabulary (metadata.tsv) to the embedding.
  3. Write the configuration file for the projector, read by tensorboard, using projector.visualize_embeddings(writer, config).
  4. Save the checkpoint and close the session.

Here both projector imports are important [already imported in the first cell], i.e.

from tensorboard.plugins import projector
from tensorboard.plugins.projector import ProjectorConfig

since the visualize_embeddings() function is defined under projector, and ProjectorConfig() is needed to create the configuration file for the projector.

If the projector is imported in the following way:

from tensorboard.plugins import projector
config = projector.ProjectorConfig()

the error received would be: AttributeError: module 'tensorboard.plugins.projector' has no attribute 'ProjectorConfig'.


# Configure a Tensorflow Projector
config = ProjectorConfig()
embedding_conf = config.embeddings.add()  # avoid shadowing the numpy array `embed`
embedding_conf.metadata_path = "metadata.tsv"

# Write a projector_config
projector.visualize_embeddings(writer, config)


# save a checkpoint
saver.save(sess, TENSORBOARD_FILES_PATH + "/model.ckpt", global_step=VOCAB_SIZE)


# close the session
sess.close()
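With the checkpoint, metadata.tsv, and the projector configuration (written as projector_config.pbtxt by visualize_embeddings) all inside tb1files, the embeddings can be explored in the Projector tab by launching tensorboard from the directory that contains tb1files:

tensorboard --logdir tb1files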