TF2 GloVe embedding layer

This is how to create a word embedding layer in TensorFlow 2.x using pre-trained GloVe vectors. Assume we have downloaded the GloVe data from the web, for example glove.6B.zip (6B tokens, 400K vocabulary, trained on Wikipedia 2014 and Gigaword 5). From it we can build a matrix of floats and a word mapping dictionary. The mapping transforms words into integer indices, and the matrix translates those indices into vectors of floats, one row per word. Here is the code:

import zipfile
import numpy as np
import tensorflow as tf

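def make_glove(glovezip='glove.6B.zip', vectorfile='glove.6B.100d.txt'):
    """Return (word -> index dict, float32 matrix of shape (vocab, dim))."""
    lookup = []
    vectors = []
    with zipfile.ZipFile(glovezip) as zfp, zfp.open(vectorfile) as fp:
        for line in fp:
            # Each line is a word followed by its vector components
            values = line.decode('utf8').split()
            lookup.append(values[0])
            vectors.append(np.asarray(values[1:], dtype='float32'))
    # Row i of the stacked matrix is the vector of the word with index i
    return {w: i for i, w in enumerate(lookup)}, np.vstack(vectors)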

lookup, vectors = make_glove()
vocabsize = len(lookup)       # how many words supported by this GloVe
vectorlen = vectors.shape[1]  # size of each vector
inputlen = 300                # arbitrary -- num of words per input sentence
model = tf.keras.models.Sequential([
    # This is the GloVe embedding layer, must be the first layer
    tf.keras.layers.Embedding(vocabsize, vectorlen, input_length=inputlen, weights=[vectors], trainable=False),
    # other layers below
    tf.keras.layers.Conv1D(128, 5, activation='relu'),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(100, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
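
To see what the dictionary and the matrix look like without downloading anything, here is a minimal sketch that runs the same parsing loop over a few lines in the GloVe text format. The three words and their 3-dimensional vectors are made up for illustration (they are not real GloVe values), and the names toy_file, toy_lookup and toy_vectors are hypothetical:

```python
import io
import numpy as np

# Toy data in the GloVe text format: a word followed by its vector components.
# The numbers are invented for illustration, not taken from glove.6B.
toy_file = io.BytesIO(b"the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n<unk> 0.0 0.0 0.0\n")

words = []
rows = []
for line in toy_file:
    values = line.decode('utf8').split()
    words.append(values[0])
    rows.append(np.asarray(values[1:], dtype='float32'))

toy_lookup = {w: i for i, w in enumerate(words)}  # word -> row index
toy_vectors = np.vstack(rows)                     # shape (3 words, 3 dims)

# Words become integer indices, and the indices select rows of the matrix.
# This row selection is what the embedding layer does with its weights.
indices = [toy_lookup[w] for w in ["cat", "the"]]
embedded = toy_vectors[indices]                   # shape (2 words, 3 dims)
```

With the real file the matrix has one row per vocabulary word and, for glove.6B.100d.txt, 100 columns.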

The important points about the GloVe embedding layer are that its weights are supplied as a numpy array and that the layer is marked as not trainable, so the pre-trained vectors stay fixed during training. Moreover, before we feed text to the model, we have to transform it word by word into integers (with any out-of-vocabulary word replaced by the placeholder <unk>):

import nltk  # word_tokenize needs NLTK's Punkt tokenizer data to be downloaded

def transform(text, lookup):
    default = lookup['<unk>']
    return [lookup.get(w.lower(), default) for w in nltk.tokenize.word_tokenize(text)]
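
Since the model above is built for exactly inputlen words per input, the list of indices also has to be cut or padded to that length before it goes into the model. A minimal sketch of that step follows. The helper pad_to_length is hypothetical, and padding with the <unk> index is only an assumption made here for simplicity; a dedicated padding index may be a better choice:

```python
def pad_to_length(indices, length, pad_index):
    """Truncate or right-pad a list of word indices to exactly `length` items."""
    clipped = indices[:length]
    return clipped + [pad_index] * (length - len(clipped))

# Toy vocabulary for illustration only
toy_lookup = {'the': 0, 'cat': 1, '<unk>': 2}
pad_index = toy_lookup['<unk>']  # assumption: reuse <unk> as the padding index

short = pad_to_length([1, 0], 4, pad_index)            # padded up to 4 items
long = pad_to_length([1, 0, 1, 0, 1], 4, pad_index)    # cut down to 4 items
```

For the model above, the length would be inputlen, and a batch of such lists can be turned into a numpy integer array before calling the model.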

The above TF2 model is not exactly a bag-of-words model, because the convolution layer retains information about which words appear next to each other. For a BoW model, we can add a global pooling layer:

tf.keras.layers.GlobalAveragePooling1D()

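right after the embedding layer, in place of the Conv1D and Flatten layers, which need the per-word sequence that pooling removes. For each input, it transforms the 2D output of the embedding layer (one vector per word) into a 1D output by averaging (GlobalAveragePooling1D) or taking the maximum (GlobalMaxPool1D) over the word positions, separately for each vector component.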
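
What the pooling computes for a single input can be sketched in plain numpy. The 4-word, 3-dimensional embedding output below is made up for illustration, and the snippet only mimics the arithmetic of the two Keras layers rather than calling them:

```python
import numpy as np

# Made-up embedding output for one sentence: 4 words, 3-dimensional vectors
embedded = np.array([[1.0, 0.0, 2.0],
                     [3.0, 4.0, 0.0],
                     [5.0, 2.0, 2.0],
                     [3.0, 2.0, 0.0]], dtype='float32')

# Pool over the word axis, keeping one value per vector component
avg_pooled = embedded.mean(axis=0)  # arithmetic of GlobalAveragePooling1D
max_pooled = embedded.max(axis=0)   # arithmetic of GlobalMaxPool1D

# Reversing the word order does not change the result, which is what
# makes the pooled model a bag-of-words model
same_avg = np.allclose(avg_pooled, embedded[::-1].mean(axis=0))
same_max = np.allclose(max_pooled, embedded[::-1].max(axis=0))
```

Here avg_pooled is [3, 2, 1] and max_pooled is [5, 4, 2], and both stay the same for any reordering of the four rows.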