Exploration and Visualisation of Word Vectors in Chat
At Verloop, we observed that the exploration-pipeline for different intents gave us a coverage jump of about 10-11% on the part of the dataset that the machine learning model wasn't able to cover after its first prediction run. This is because we weren't previously extracting misspellings and paraphrased sentences of the intents.
Although the data used for the pipeline in this blog post is based on BBC news articles, the underlying logic carries over to chat data as well.
For example, the intent "exchange", found in chat queries to e-commerce companies like Amazon, can be misspelled and written as:
- Order hasn't been exchanged yet!
- Please ex change my product.
- Expecting exch to take place by Thursday.
The same intent can be paraphrased as:
- Change my product. It is not working.
- I had issued a replacement. What is its status?
- Order hasn't been replaced yet. Who is responsible here?
It is not feasible to use edit-distance-based, token-based, and sequence-based algorithmic approaches to cover all the misspellings as well as the paraphrased sentences of an intent in the dataset. The exploration-pipeline can do that, increasing the coverage of the intent in the dataset.
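To see why edit distance alone falls short, here is a minimal sketch in plain Python. It is only an illustration, not part of the pipeline described in this post: the `levenshtein` helper and the candidate word list are assumptions made for the example. Close misspellings of "exchange" score a small distance, but a paraphrase like "replacement" is many edits away, so no single distance threshold captures both.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


# Hypothetical candidates taken from the example sentences above.
target = "exchange"
candidates = ["exchanged", "exch", "change", "replacement"]
distances = {word: levenshtein(target, word) for word in candidates}

for word, distance in distances.items():
    print(f"{target!r} -> {word!r}: {distance} edits")
```

A threshold loose enough to admit "replacement" would also admit many unrelated words, which is the gap that word vectors are meant to close.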
The exploration-pipeline is diagrammatically represented as:
A popular idea in machine learning is to represent words as vectors. FastText vectors capture hidden information about a language, such as word analogies and semantic information.
The word analogies and semantic information help us in:
- Finding and extracting misspellings of a target word in the entire dataset.
- Finding words that commonly occur in the context (neighbourhood) of the target word.
- Finding the similarity between different words, given by the cosine of the angle between their word vectors.
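The cosine similarity and nearest-neighbour lookup described above can be sketched in plain Python as follows. The 4-dimensional `toy_vectors` are made-up numbers for illustration only (real FastText vectors have 300 dimensions and would come from a trained model), and the helper names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def nearest_neighbours(target, vectors, k=3):
    """Return the k words most similar to `target`, best match first."""
    scores = [
        (word, cosine_similarity(vectors[target], vector))
        for word, vector in vectors.items()
        if word != target
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Made-up vectors: a misspelling and a paraphrase sit close to the target,
# an unrelated word does not.
toy_vectors = {
    "exchange": [0.9, 0.1, 0.3, 0.0],
    "exch": [0.8, 0.2, 0.3, 0.1],
    "replacement": [0.7, 0.0, 0.5, 0.1],
    "thursday": [0.0, 0.9, 0.1, 0.4],
}

neighbours = nearest_neighbours("exchange", toy_vectors, k=2)
for word, score in neighbours:
    print(f"{word}: {score:.3f}")
```

With real vectors, the same ranking step is what surfaces misspellings and paraphrases of a target intent as its closest neighbours.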
Visualisation on TensorBoard gives us a 3-dimensional view of the 300-dimensional FastText word vectors. It lets us visualise and extract the top misspellings and the closest neighbouring words of a target intent, increasing the coverage of the intent in the dataset.
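One way to get word vectors into TensorBoard's Embedding Projector is to export them as two tab-separated files: one with the vectors and one with the matching labels. The sketch below assumes that route; the file names, the `export_for_projector` helper, and the toy vectors are illustrative choices, not necessarily what our pipeline uses.

```python
import os
import tempfile


def export_for_projector(vectors, out_dir):
    """Write vectors.tsv and metadata.tsv with one word per row, rows aligned."""
    os.makedirs(out_dir, exist_ok=True)
    vectors_path = os.path.join(out_dir, "vectors.tsv")
    metadata_path = os.path.join(out_dir, "metadata.tsv")
    with open(vectors_path, "w", encoding="utf-8") as vec_file, \
            open(metadata_path, "w", encoding="utf-8") as meta_file:
        for word, vector in vectors.items():
            # One row per word, dimensions separated by tabs.
            vec_file.write("\t".join(str(x) for x in vector) + "\n")
            # Single-column metadata: one label per row, no header row.
            meta_file.write(word + "\n")
    return vectors_path, metadata_path


# Made-up 4-dimensional vectors; real FastText vectors would have 300 columns.
toy_vectors = {
    "exchange": [0.9, 0.1, 0.3, 0.0],
    "exch": [0.8, 0.2, 0.3, 0.1],
    "thursday": [0.0, 0.9, 0.1, 0.4],
}

out_dir = tempfile.mkdtemp()
vectors_path, metadata_path = export_for_projector(toy_vectors, out_dir)
print("Wrote", vectors_path, "and", metadata_path)
```

Once both files are loaded, the projector reduces the vectors to three dimensions with a projection method such as PCA or t-SNE, and searching for a target word highlights its nearest neighbours.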