Word Vectors in Code

Part 1: Count-Based Word Vectors

Build a term-context matrix, reduce the dimensionality of this sparse matrix with singular value decomposition (SVD), and finally visualize the result with a scatter plot.

Given a word $w_i$ with context window size n, the context words of $w_i$ are the n words before it and the n words after it, i.e. $w_{i-n},\ldots,w_{i-1}$ and $w_{i+1},\ldots,w_{i+n}$. We then build a co-occurrence matrix M in which $M_{ij}$ is the number of times $w_j$ appears in the context window of $w_i$. (In NLP we often add START and END tokens at the beginning and end to mark the start and end of a sentence/paragraph/document.)

For example, the co-occurrence matrix with window size = 1 (in Document 1, the window-1 context of "glitters" is {"that", "is"}, so the cells M[glitters, that] and M[glitters, is] each get incremented):
Document 1: “all that glitters is not gold”
Document 2: “all is well that ends well”
[Figure: co-occurrence matrix M for the two example documents (with START/END tokens), window size 1]

1.1 Implement distinct_words [code]

def distinct_words(corpus):
    """ Determine a list of distinct words for the corpus.
        Params:
            corpus (list of list of strings): corpus of documents
        Return:
            corpus_words (list of strings): list of distinct words across the corpus, sorted (using python 'sorted' function)
            num_corpus_words (integer): number of distinct words across the corpus
    """
    corpus_words = []
    num_corpus_words = -1

    # ------------------
    # Write your implementation here.
    process_words = [y for x in corpus for y in x]   # flatten the list of documents into one token list
    corpus_words = sorted(set(process_words))        # distinct words, sorted
    num_corpus_words = len(corpus_words)
    # ------------------

    return corpus_words, num_corpus_words

——— Test code ———

test_corpus = ["START All that glitters isn't gold END".split(" "), "START All's well that ends well END".split(" ")]
test_corpus_words, num_corpus_words = distinct_words(test_corpus)
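For this test corpus the function should return 10 distinct words (Python's sorted puts the capitalized tokens first): ['All', "All's", 'END', 'START', 'ends', 'glitters', 'gold', "isn't", 'that', 'well'], with num_corpus_words equal to 10.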

♦ How to flatten a list of lists is worth noting here.
Previously, turning a list of lists into a single list meant two nested loops with append; a list comprehension does it more concisely.

# The list of lists
list_of_lists = [range(4), range(7)]

# Flatten the lists
flattened_list = [y for x in list_of_lists for y in x]
# -> [0, 1, 2, 3, 0, 1, 2, 3, 4, 5, 6]

1.2 Implement compute_co_occurrence_matrix [code]

We use numpy (np) to represent the vectors and matrices.

def compute_co_occurrence_matrix(corpus, window_size=4):
    """ Compute co-occurrence matrix for the given corpus and window_size (default of 4).

        Note: Each word in a document should be at the center of a window. Words near edges will have a smaller
        number of co-occurring words.

        For example, if we take the document "START All that glitters is not gold END" with window size of 4,
        "All" will co-occur with "START", "that", "glitters", "is", and "not".

        Params:
            corpus (list of list of strings): corpus of documents
            window_size (int): size of context window
        Return:
            M (numpy matrix of shape (number of corpus words, number of corpus words)):
                Co-occurrence matrix of word counts.
                The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.
            word2Ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.
    """
    words, num_words = distinct_words(corpus)
    M = None
    word2Ind = {}

    # ------------------
    # Write your implementation here.
    word2Ind = {word: i for i, word in enumerate(words)}   # map each word to its row/column index
    M = np.zeros((num_words, num_words))                   # |V| x |V| matrix of zeros
    for sentence in corpus:
        for i, word in enumerate(sentence):
            # Clip the window at the document boundaries to avoid indexing out of range.
            for j in range(max(i - window_size, 0), min(i + window_size + 1, len(sentence))):
                if j != i:                                  # a word is not in its own context window
                    M[word2Ind[word], word2Ind[sentence[j]]] += 1
    # ------------------

    return M, word2Ind

——— Test code ———

import numpy as np

test_corpus = ["START All that glitters isn't gold END".split(" "), "START All's well that ends well END".split(" ")]
M_test, word2Ind_test = compute_co_occurrence_matrix(test_corpus, window_size=1)

♦ enumerate() takes an iterable (such as a list, tuple, or string) and yields (index, element) pairs, so you get each item together with its position.
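A minimal illustration (the token list is just for demonstration):

tokens = ["START", "all", "that", "glitters"]
for i, word in enumerate(tokens):
    print(i, word)
# 0 START
# 1 all
# 2 that
# 3 glitters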

1.3 Implement reduce_to_k_dim [code]

The co-occurrence matrix generated above is large and sparse (many cells are 0), so we apply dimensionality reduction to drop the unimportant dimensions. Here we use singular value decomposition (SVD) to select the top k components; SVD can be viewed as a generalization of PCA (Principal Components Analysis). This post focuses only on the code.

Below we take the full SVD of a matrix A (n rows, one per word; d columns, one per context word, i.e. per dimension). The singular values on the diagonal of S are sorted in descending order and act as the weight of each dimension; truncating to the top k components gives $U_k$ (scaled by the singular values, since TruncatedSVD actually returns $U_k S_k$), which is the embedding matrix we want. The reduced co-occurrence matrix still preserves semantic relations between words, e.g. doctor and hospital will remain closer to each other than doctor and dog.
[Figure: schematic of the full SVD of A and its rank-k truncation]
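Written out (a sketch using the shapes described above, where A is $n \times d$):

$$A = U S V^\top, \qquad A_k = U_k S_k V_k^\top$$

Here $U_k$ is the first $k$ columns of $U$, $S_k$ is the $k \times k$ diagonal matrix of the $k$ largest singular values, and $V_k$ is the first $k$ columns of $V$; the rows of $U_k S_k$ (which is what reduce_to_k_dim below returns) are the k-dimensional word embeddings.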

def reduce_to_k_dim(M, k=2):
    """ Reduce a co-occurrence count matrix of dimensionality (num_corpus_words, num_corpus_words)
        to a matrix of dimensionality (num_corpus_words, k)
        Params:
            M (numpy matrix of shape (number of corpus words, number of corpus words)): co-occurrence matrix of word counts
            k (int): embedding size of each word after dimension reduction
        Return:
            M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensional word embeddings.
                In terms of the SVD from math class, this actually returns U * S
    """
    n_iters = 10  # Use this parameter in your call to `TruncatedSVD`
    M_reduced = None
    print("Running Truncated SVD over %i words..." % (M.shape[0]))

    # ------------------
    # Write your implementation here.
    svd = TruncatedSVD(n_components=k, n_iter=n_iters)
    M_reduced = svd.fit_transform(M)   # fit the SVD on M and project it down to k dimensions (returns U * S)
    # ------------------

    print("Done.")
    return M_reduced

——— Test code ———

import numpy as np
from sklearn.decomposition import TruncatedSVD

test_corpus = ["START All that glitters isn't gold END".split(" "), "START All's well that ends well END".split(" ")]
M_test, word2Ind_test = compute_co_occurrence_matrix(test_corpus, window_size=1)
M_test_reduced = reduce_to_k_dim(M_test, k=2)
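M_test_reduced should come out with shape (10, 2): one 2-dimensional embedding per distinct word in the test corpus.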

♦ In practice, running PCA or a full SVD on a large corpus is challenging because of the memory it requires. However, if k is relatively small, only the top k singular values and vectors need to be computed, which is much cheaper; this is known as truncated SVD.
♦ numpy, scipy, and scikit-learn (sklearn) all provide some implementation of SVD, but only scipy and sklearn provide truncated SVD, and only sklearn provides an efficient randomized algorithm for computing large-scale truncated SVD.
♦ Below are some notes on using sklearn.decomposition.TruncatedSVD (see the official documentation for the original text).
TruncatedSVD implements a variant of SVD that computes only the k largest singular values, where k is a user-specified parameter.
Applying truncated SVD to a term-document matrix for dimensionality reduction is also known as latent semantic analysis (LSA), because it transforms the matrix into a low-dimensional semantic space.
This estimator performs linear dimensionality reduction by means of truncated SVD. Unlike PCA, it does not center the data before computing the decomposition, which means it can work with scipy.sparse matrices efficiently. Two algorithms are supported: a fast randomized SVD solver, and a "naive" algorithm that uses ARPACK as an eigensolver on (X X.T) or (X.T X), whichever is more efficient.
class sklearn.decomposition.TruncatedSVD(n_components=2, algorithm='randomized', n_iter=5, random_state=None, tol=0.0)

  • Parameters
    n_components: int, dimensionality of the output data; default is 2. For LSA, a value of 100 is recommended.
    algorithm: string, the SVD solver to use ('arpack' or 'randomized'); default is 'randomized'. 'randomized' uses the randomized algorithm of Halko (2009); 'arpack' uses the ARPACK wrapper in SciPy (scipy.sparse.linalg.svds).
    n_iter: int, optional, number of iterations for the randomized SVD solver; default is 5. Not used when algorithm='arpack'.
    random_state: int, RandomState instance, or None, optional; default is None. If int, it is the seed used by the random number generator; if a RandomState instance, it is used as the random number generator; if None, the random number generator is the RandomState instance used by np.random.
    tol: float, optional. Tolerance for ARPACK; 0 means machine precision. Not used when algorithm='randomized'.
  • Attributes
    components_:array, shape (n_components, n_features)
    explained_variance_:array, shape (n_components,)
    explained_variance_ratio_:array, shape (n_components,)
    singular_values_:array, shape (n_components,)
  • Methods
    fit(self, X[, y]): Fit the SVD model on training data X
    fit_transform(self, X[, y]): Fit the SVD model on X and apply dimensionality reduction to X
    get_params(self[, deep]): Get the parameters of this estimator
    inverse_transform(self, X): Transform X back to its original space
    set_params(self, **params): Set the parameters of this estimator
    transform(self, X): Perform dimensionality reduction on X

Note: SVD suffers from "sign indeterminacy", i.e. the signs of svd.components_ and of the transformed output depend on the algorithm and the random state. The workaround is to fit an instance on the data once and then keep reusing that fitted instance for transform, rather than re-fitting each time.
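A minimal, self-contained sketch of this usage (the random sparse matrix below is only for illustration):

from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# A random sparse matrix standing in for a term-document / co-occurrence matrix.
X = sparse_random(100, 80, density=0.01, format='csr', random_state=42)

# Keep the top 5 components; fit once, then reuse the fitted object for further transforms.
svd = TruncatedSVD(n_components=5, n_iter=7, random_state=42)
X_reduced = svd.fit_transform(X)

print(X_reduced.shape)                       # (100, 5)
print(svd.explained_variance_ratio_.sum())   # fraction of variance retained
print(svd.singular_values_)                  # the 5 largest singular values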

1.4 Implement plot_embeddings [code]

We use Matplotlib (plt) to draw a scatter plot of the word vectors projected into 2-D space.
First, a quick look at the implementation: plt.scatter draws each point and plt.text draws the corresponding text label.

import matplotlib.pyplot as plt

def plot_embeddings(M_reduced, word2Ind, words):
    """ Plot in a scatterplot the embeddings of the words specified in the list "words".
        NOTE: do not plot all the words listed in M_reduced / word2Ind.
        Include a label next to each point.

        Params:
            M_reduced (numpy matrix of shape (number of unique words in the corpus , k)): matrix of k-dimensional word embeddings
            word2Ind (dict): dictionary that maps word to indices for matrix M
            words (list of strings): words whose embeddings we want to visualize
    """

    # ------------------
    # Write your implementation here.
    for word in words:
        x = M_reduced[word2Ind.get(word)][0]
        y = M_reduced[word2Ind.get(word)][1]
        plt.scatter(x, y, marker='x', color='red')         # draw the point
        plt.text(x + 0.005, y + 0.005, word, fontsize=9)   # label the point, slightly offset
    plt.show()
    # ------------------

——— Test code ———

# -----------------------------
# Run This Cell to Produce Your Plot
# ------------------------------
reuters_corpus = read_corpus()
M_co_occurrence, word2Ind_co_occurrence = compute_co_occurrence_matrix(reuters_corpus)
M_reduced_co_occurrence = reduce_to_k_dim(M_co_occurrence, k=2)

# Rescale (normalize) the rows to make them each of unit-length
M_lengths = np.linalg.norm(M_reduced_co_occurrence, axis=1)
M_normalized = M_reduced_co_occurrence / M_lengths[:, np.newaxis] # broadcasting

words = ['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'venezuela']
plot_embeddings(M_normalized, word2Ind_co_occurrence, words)

[Figure: scatter plot of the normalized 2-D co-occurrence embeddings for the selected words]
We can see that words with high similarity and high relatedness tend to cluster together.
read_corpus() extracts the documents in the 'crude' (crude oil) category from the Reuters (business and financial news) corpus, adds 'START' and 'END' to the beginning and end of each document, and lowercases all the words.
♦ Truncated SVD returns U*S, so we normalize the returned vectors so that they all lie around the unit circle. Note: the normalization line in the cell above uses NumPy broadcasting; broadcasting, in short, lets arrays of different shapes be combined in arithmetic by automatically expanding their dimensions so that they match.
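A small sketch of that broadcasting step (the 2×2 matrix here is made up purely for illustration):

import numpy as np

M = np.array([[3.0, 4.0],
              [1.0, 0.0]])
lengths = np.linalg.norm(M, axis=1)         # shape (2,): [5., 1.]
M_normalized = M / lengths[:, np.newaxis]   # (2, 2) / (2, 1): the column of norms is broadcast across columns
print(M_normalized)                         # each row now has unit length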

That completes the process of building the term-context co-occurrence matrix and visualizing it; the full code is available on my github.

Part 2: Prediction-Based Word Vectors

Prediction-based word vectors have become very popular in recent years. This part explores the word vectors produced by the word2vec model.
Below we download a pretrained word2vec model from gensim: 3 million word vectors, each with 300 dimensions.

def load_word2vec():
    """ Load Word2Vec Vectors
        Return:
            wv_from_bin: All 3 million embeddings, each of length 300
    """
    import gensim.downloader as api
    wv_from_bin = api.load("word2vec-google-news-300")
    vocab = list(wv_from_bin.vocab.keys())
    print("Loaded vocab size %i" % len(vocab))
    return wv_from_bin
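Assuming the download succeeds (the pretrained file is large, on the order of 1.5–2 GB, so this can take a while), the model is loaded once and reused throughout Part 2:

wv_from_bin = load_word2vec()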

2.1 Reducing dimensionality of Word2Vec Word Embeddings

We put the word2vec vectors into a matrix M and run reduce_to_k_dim from Part 1 to reduce them to 2 dimensions.

def get_matrix_of_vectors(wv_from_bin, required_words=['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'venezuela']):
    """ Put the word2vec vectors into a matrix M.
        Param:
            wv_from_bin: KeyedVectors object; the 3 million word2vec vectors loaded from file
        Return:
            M: numpy matrix shape (num words, 300) containing the vectors
            word2Ind: dictionary mapping each word to its row number in M
    """
    import random
    words = list(wv_from_bin.vocab.keys())
    print("Shuffling words ...")
    random.shuffle(words)
    words = words[:10000]                  # sample 10,000 words from the vocabulary
    print("Putting %i words into word2Ind and matrix M..." % len(words))
    word2Ind = {}
    M = []
    curInd = 0
    for w in words:
        try:
            M.append(wv_from_bin.word_vec(w))
            word2Ind[w] = curInd
            curInd += 1
        except KeyError:
            continue
    for w in required_words:               # make sure the words we want to plot are included
        try:
            M.append(wv_from_bin.word_vec(w))
            word2Ind[w] = curInd
            curInd += 1
        except KeyError:
            continue
    M = np.stack(M)
    print("Done.")
    return M, word2Ind

# -----------------------------------------------------------------
# Run Cell to Reduce 300-Dimensional Word Embeddings to k Dimensions
# Note: This may take several minutes
# -----------------------------------------------------------------
M, word2Ind = get_matrix_of_vectors(wv_from_bin)
M_reduced = reduce_to_k_dim(M, k=2)

2.2 Word2Vec Plot Analysis

Below we again visualize the vectors of the sample words.

words = ['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'venezuela']
plot_embeddings(M_reduced, word2Ind, words)

[Figure: 2-D plot of the word2vec embeddings for the selected words]

2.3 Polysemous Words
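gensim's most_similar returns, by default, the 10 vocabulary words whose vectors have the highest cosine similarity to the query word, together with those similarity scores; for a polysemous word, neighbors from several of its senses can appear in this list.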

print(wv_from_bin.most_similar("lion"))

[Output: the 10 most similar words to "lion" and their cosine similarities]

2.4 Synonyms & Antonyms

# ------------------
# Write your synonym & antonym exploration code here.

w1 = "happy"
w2 = "cheerful"
w3 = "sad"
w1_w2_dist = wv_from_bin.distance(w1, w2)
w1_w3_dist = wv_from_bin.distance(w1, w3)

print("Synonyms {}, {} have cosine distance: {}".format(w1, w2, w1_w2_dist))
print("Antonyms {}, {} have cosine distance: {}".format(w1, w3, w1_w3_dist))

# ------------------

The distance here is the cosine distance (i.e. 1 minus the cosine similarity); the results are as follows:
[Output: cosine distances for the synonym pair (happy, cheerful) and the antonym pair (happy, sad)]
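As a quick sanity check (a sketch reusing the wv_from_bin model loaded above), gensim's distance() should agree with 1 minus the cosine similarity computed by hand:

import numpy as np

v1 = wv_from_bin.word_vec("happy")
v2 = wv_from_bin.word_vec("cheerful")
cos_sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

print(1 - cos_sim)                                # cosine distance computed manually
print(wv_from_bin.distance("happy", "cheerful"))  # should match up to floating-point error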

2.5 Solving Analogies with Word Vectors

import pprint
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'king'], negative=['man']))

The results are as follows:
[Output: most_similar results for the analogy man : king :: woman : ?]
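Under the hood, most_similar with positive and negative word lists combines the (unit-normalized) input vectors with +1 and −1 weights, here roughly vec('king') − vec('man') + vec('woman'), and returns the vocabulary words closest to that combination by cosine similarity, excluding the query words themselves.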

In addition, solving analogies like this can surface the bias mentioned earlier.