FastText

import re
import string

import fasttext
import pandas as pd
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS

nlp = English()

The dataset can be downloaded either:

  1. From the terminal: gsutil cp gs://dataset-uploader/bbc/bbc-text.csv .

                  OR
  2. By visiting the hosting website in a browser.


df = pd.read_csv("bbc-text.csv")  # reading the dataset
df
category text
0 tech tv future in the hands of viewers with home th...
1 business worldcom boss left books alone former worldc...
2 sport tigers wary of farrell gamble leicester say ...
3 sport yeading face newcastle in fa cup premiership s...
4 entertainment ocean s twelve raids box office ocean s twelve...
... ... ...
2220 business cars pull down us retail figures us retail sal...
2221 politics kilroy unveils immigration policy ex-chatshow ...
2222 entertainment rem announce new glasgow concert us band rem h...
2223 politics how political squabbles snowball it s become c...
2224 sport souness delight at euro progress boss graeme s...

2225 rows × 2 columns

df.drop(columns=["category"], inplace=True)  # the labels are not needed for unsupervised training
df
text
0 tv future in the hands of viewers with home th...
1 worldcom boss left books alone former worldc...
2 tigers wary of farrell gamble leicester say ...
3 yeading face newcastle in fa cup premiership s...
4 ocean s twelve raids box office ocean s twelve...
... ...
2220 cars pull down us retail figures us retail sal...
2221 kilroy unveils immigration policy ex-chatshow ...
2222 rem announce new glasgow concert us band rem h...
2223 how political squabbles snowball it s become c...
2224 souness delight at euro progress boss graeme s...

2225 rows × 1 columns

refined_string_list = []


for query in df["text"]:

    # remove punctuation from the string
    string_translate = query.translate(str.maketrans("", "", string.punctuation))

    # run the string through spaCy's English language pipeline
    spacy_doc = nlp(string_translate)

    # keep only the tokens that are not in spaCy's stopword list
    filtered_list = [
        token.text for token in spacy_doc if not nlp.vocab[token.text].is_stop
    ]

    # join the remaining tokens back into a sentence
    filtered_sentence = " ".join(filtered_list)

    # collapse multiple spaces into one
    filtered_sentence = re.sub(" +", " ", filtered_sentence)

    # append the cleaned string to the result list
    refined_string_list.append(filtered_sentence)
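The transform applied to each document can be exercised in isolation on a toy string. The small hand-rolled stopword set below is only a stand-in for spaCy's STOP_WORDS, so the example stays self-contained:

```python
import re
import string

# toy stopword set standing in for spaCy's STOP_WORDS (illustration only)
STOPWORDS = {"the", "of", "in", "a", "is", "with"}

def refine(text: str) -> str:
    # strip punctuation, tokenize on whitespace, drop stopwords, rejoin
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    kept = [word for word in no_punct.split() if word not in STOPWORDS]
    return re.sub(" +", " ", " ".join(kept))

print(refine("the future of tv, in the hands of viewers!"))
# -> future tv hands viewers
```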


refined_string_list[0]
'tv future hands viewers home theatre systems plasma highdefinition tvs digital video recorders moving living room way people watch tv radically different years time according expert panel gathered annual consumer electronics las vegas discuss new technologies impact favourite pastimes leading trend programmes content delivered viewers home networks cable satellite telecoms companies broadband service providers rooms portable devices talkedabout technologies ces digital personal video recorders dvr pvr settop boxes like s tivo uk s sky system allow people record store play pause forward wind tv programmes want essentially technology allows personalised tv builtin highdefinition tv sets big business japan slower europe lack highdefinition programming people forward wind adverts forget abiding network channel schedules putting alacarte entertainment networks cable satellite companies worried means terms advertising revenues brand identity viewer loyalty channels leads technology moment concern raised europe particularly growing uptake services like sky happens today months years time uk adam hume bbc broadcast s futurologist told bbc news website likes bbc issues lost advertising revenue pressing issue moment commercial uk broadcasters brand loyalty important talking content brands network brands said tim hanlon brand communications firm starcom mediavest reality broadband connections anybody producer content added challenge hard promote programme choice means said stacey jolna senior vice president tv guide tv group way people find content want watch simplified tv viewers means networks terms channels leaf google s book search engine future instead scheduler help people find want watch kind channel model work younger ipod generation taking control gadgets play suit panel recognised older generations comfortable familiar schedules channel brands know getting want choice hands mr hanlon suggested end kids diapers pushing buttons possible available said mr hanlon 
ultimately consumer tell market want 50 000 new gadgets technologies showcased ces enhancing tvwatching experience highdefinition tv sets new models lcd liquid crystal display tvs launched dvr capability built instead external boxes example launched humax s 26inch lcd tv 80hour tivo dvr dvd recorder s biggest satellite tv companies directtv launched branded dvr 100hours recording capability instant replay search function set pause rewind tv 90 hours microsoft chief bill gates announced preshow keynote speech partnership tivo called tivotogo means people play recorded programmes windows pcs mobile devices reflect increasing trend freeing multimedia people watch want want'
with open("refined-bbc-text.txt", "w") as f:
    for item in refined_string_list:
        f.write("%s\n" % item)
# <--TRAINING THE MODEL BASED ON FASTTEXT-->
print(fasttext.train_unsupervised.__doc__)
    Train an unsupervised model and return a model object.

    input must be a filepath. The input text does not need to be tokenized
    as per the tokenize function, but it must be preprocessed and encoded
    as UTF-8. You might want to consult standard preprocessing scripts such
    as tokenizer.perl mentioned here: http://www.statmt.org/wmt07/baseline.html

    The input field must not contain any labels or use the specified label prefix
    unless it is ok for those words to be ignored. For an example consult the
    dataset pulled by the example script word-vector-example.sh, which is
    part of the fastText repository.
    

Default values for the parameters of fasttext.train_unsupervised() are shown in [ ]:

input             # training file path (required)
model             # unsupervised fasttext model {cbow, skipgram} [skipgram]
lr                # learning rate [0.05]
dim               # size of word vectors [100]
ws                # size of the context window [5]
epoch             # number of epochs [5]
minCount          # minimal number of word occurrences [5]
minn              # min length of char ngram [3]
maxn              # max length of char ngram [6]
neg               # number of negatives sampled [5]
wordNgrams        # max length of word ngram [1]
loss              # loss function {ns, hs, softmax, ova} [ns]
bucket            # number of buckets [2000000]
lrUpdateRate      # change the rate of updates for the learning rate [100]
t                 # sampling threshold [0.0001]
verbose           # verbose [2]
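Any of the parameters above can be set explicitly. As a sketch, the same corpus could also be trained with the CBOW variant; the values below are illustrative choices, not tuned recommendations:

```python
import fasttext

# CBOW variant with a wider context window and more epochs (illustrative values)
model_cbow = fasttext.train_unsupervised(
    "refined-bbc-text.txt",
    model="cbow",   # default is "skipgram"
    dim=300,        # word-vector dimensionality
    ws=8,           # context window size
    epoch=10,       # training passes over the corpus
    minCount=3,     # ignore words seen fewer than 3 times
    thread=4,
)
```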
%%time
model = fasttext.train_unsupervised("refined-bbc-text.txt", dim=300, thread=4)
CPU times: user 2min 34s, sys: 1.14 s, total: 2min 35s
Wall time: 46.2 s
with open("tensorboard/metadata.tsv", "w") as f:
    # writing the vocabulary words of the model to a text file
    for item in model.words:
        f.write("%s\n" % item)
model.save_model("fasttextmodel.bin")  # saving the model
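Once saved, the model can be reloaded and queried. Because fastText builds vectors from character n-grams, get_word_vector also works for out-of-vocabulary words; the query word below is just an example from the corpus:

```python
import fasttext

model = fasttext.load_model("fasttextmodel.bin")

# subword-aware embedding lookup (works even for unseen words)
vec = model.get_word_vector("broadband")
print(vec.shape)  # (300,) with dim=300 as trained above

# ten nearest neighbours by cosine similarity
print(model.get_nearest_neighbors("broadband"))
```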