How to use pretrained word vectors
There are several sources of pretrained word embeddings. Popular options include Stanford's GloVe and spaCy's built-in vectors (refer to the using_pretrained_vectors notebook for details). In this section, we will focus on GloVe.
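As a quick illustration of the spaCy route, the following minimal sketch accesses the vectors that ship with a medium-sized model (it assumes the en_core_web_md model, which includes 300-dimensional vectors, has been downloaded):

```python
import spacy

# Requires: python -m spacy download en_core_web_md
# The small 'sm' models ship without word vectors; 'md' and 'lg' include them.
nlp = spacy.load('en_core_web_md')

doc = nlp('king queen banana')
print(doc[0].vector.shape)        # (300,) per-token embedding
print(doc[0].similarity(doc[1]))  # cosine similarity: king vs. queen
print(doc[0].similarity(doc[2]))  # king vs. banana (should be lower)
```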
GloVe – Global vectors for word representation
GloVe (Global Vectors for Word Representation, Pennington, Socher, and Manning, 2014) is an unsupervised algorithm developed at the Stanford NLP lab that learns vector representations for words from aggregated global word-word co-occurrence statistics (see resources linked on GitHub). Vectors pretrained on the following web-scale sources are available:
- Common Crawl with 42 billion or 840 billion tokens and a vocabulary of 1.9 million or 2.2 million tokens
- Wikipedia 2014 + Gigaword 5 with 6 billion tokens and a vocabulary of 400,000 tokens
- Twitter with 2 billion tweets, 27 billion tokens, and a vocabulary of 1.2 million tokens
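Once one of these files has been downloaded, loading it takes only a few lines. The sketch below parses the plain-text format with NumPy and checks the well-known king - man + woman ≈ queen analogy; it assumes the glove.6B.100d.txt file from the Wikipedia 2014 + Gigaword 5 release sits in the working directory, and the load_glove and most_similar helpers are illustrative rather than part of any library:

```python
import numpy as np

def load_glove(path):
    """Parse a GloVe .txt file into a {word: float32 vector} dictionary."""
    embeddings = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            # Each line is: word value_1 value_2 ... value_d (space-separated)
            word, *values = line.rstrip().split(' ')
            embeddings[word] = np.asarray(values, dtype='float32')
    return embeddings

def most_similar(query_vec, embeddings, topn=4):
    """Return the topn words whose vectors are closest in cosine similarity."""
    words = list(embeddings)
    mat = np.stack([embeddings[w] for w in words])
    mat /= np.linalg.norm(mat, axis=1, keepdims=True)
    scores = mat @ (query_vec / np.linalg.norm(query_vec))
    return [words[i] for i in np.argsort(scores)[::-1][:topn]]

glove = load_glove('glove.6B.100d.txt')  # 400,000 words, 100 dims each
print(glove['king'].shape)               # (100,)

# The analogy property: king - man + woman should land near queen
target = glove['king'] - glove['man'] + glove['woman']
print(most_similar(target, glove))       # 'king' and 'queen' rank highest
```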