从word2vec到doc2vec

此处省略开头，回归核心。。。

尽管word2vec提供了高质量的词汇向量，仍然没有有效的方法将它们结合成一个高质量的文档向量。在本篇文章中，受一个随机过程问题（中餐馆订餐过程CRP）的启发，我们讨论了一种可能的探索。基本思路是，使用CRP来驱动聚类过程，并将各种词向量在合适的聚类中合在一起。

假设我们有一个关于鸡肉订单（chicken recipe）的文档。它包含了下面的词汇：”chicken”, “pepper”,“salt”, “cheese”. 它也包含了其它的词汇：“use”, “buy”, “definitely”, “my”, “the”。 word2vec的模型将为每个单词生成一个vector。简单的，我们可以将所有词向量(word vector)合成一个文档向量（doc vector）。这将引入许多噪声。一种降噪方法是使用加权的合并，基于相应的算法，比如：IDF 或者 POS tag.

那么问题来了：当添加词汇时，是否可以更有选择性？回到鸡肉订单文档上，它不应该考虑以下的词汇：“definitely”, “use”, “my” 。基于权重的idf可以有效地减少一些停留词（”the”、”is”等等）的噪声问题。然而，对于这样的词汇：“definitely”， “overwhelming”，那么idf值将不会如你所愿那样的小。

如果我们首先将词汇聚类，像这样的词“chicken”, “pepper”将聚集到同一个类中，而像其它的词类似“junk”则希望聚到另一个类中。如果我们能区别相关的类，那么我们可以将相关类的词相量（word vector）合并，我们就可以得到一个很好的文档（doc vector）.

当然，我们可以使用通用的算法：K-means，但是大多数这些算法都需要一个距离公式。word2vec可以通过余弦相似度（cosine）很方便地进行相似判断，不一定需要采用欧氏距离。

如果我们使用余弦相似度，我们可以很快进行聚类词汇。

回到中餐馆问题，假设你来到一个中餐馆，发现已经有n张桌子，每张桌子有不同的人。另外还有一张空桌子。CRP有一个超参数 r > 0，它表示这张空桌子上可能的人数。你到了（n+1）的桌子中其中之一，桌子上存在不同数目的人（对于空桌子，数目为r）。当你到达其中的一张桌子时，那么整个过程完成。如果你决定坐在空桌子上，餐厅会自动创建一张空桌子。在这个例子中，如果下一个顾客到来时，他会在(n+2)张桌子上做选择（包括新的空桌子）

受CRP的启发，我们尝试了在CRP中，包含相似因子的的以下变量。过程大致相同：我们给定聚类的M个向量。我们去维护两个东西：聚类和（cluster sum，没有中心），聚类中的各个向量（vector）。通过各向量进行迭代。对于当前的向量V，假设我们已经有了n个聚类。现在我们去找到聚类C，它的聚类和与当前的向量相似。我们将这个分数称为 sim(V,C).

变量1: v 创建了一个新的聚类，它的概率为1/(1+n). 否则v就到聚类C中。
变量2:如果sim(V,C) > 1/(1+n)，则归到聚类C中。否则概率为1/(1+n)，它将创建一个新的聚类，概率为n/(1+n)，它将归到C。

在任意两个变量中，如果v归到一个聚类中，我们将更新聚类和，及相应的关系。

对于传统CRP，有一个明显的区别是：如果我们不到空桌子上，我们将决定去往“最相似”的桌子上。

实际上，我们将找到这些创建相似结果的变量。有个不同是，变量1趋向于更多但是单个量级更小的聚类；变量2趋向于少量，但是单个量级更大的聚类。变量2的例子如下所示：

对于chick recipe document，聚类如下：

‘cayenne’, ‘taste’, ‘rating’, ‘blue’, ‘cheese’, ‘raved’, ‘recipe’, ‘powdered’, ‘recipe’, ‘dressing’, ‘blue’, ‘spicier’, ‘spoon’, ‘cup’, ‘cheese’, ‘cheese’, ‘blue’, ‘blue’, ‘dip’, ‘bake’, ‘cheese’, ‘dip’, ‘cup’, ‘blue’, ‘adding’, ‘mix’, ‘crumbled’, ‘pepper’, ‘oven’, ‘temper’, ‘cream’, ‘bleu’, ……
‘the’, ‘a’, ‘that’, ‘in’, ‘a’, ‘use’, ‘this’, ‘if’, ‘scant’, ‘print’, ‘was’, ‘leftovers’, ‘bring’, ‘a’, ‘next’, ‘leftovers’, ‘with’, ‘people’, ‘the’, ‘made’, ‘to’, ‘the’, ‘by’, ‘because’, ‘before’, ‘the’, ‘has’, ‘as’, ‘amount’, ‘is’, ……
‘stars’, ‘big’, ‘super’, ‘page’, ‘oct’, ‘see’, ‘jack’, ‘photos’, ‘extras’, ‘see’, ‘video’, ‘one’, ‘page’, ‘f’, ‘jun’, ‘stars’, ‘night’, ‘jul’, ……

很明显地，第一个聚类最相关。接着，我们获取聚类和向量。下面是python代码，word vector通过c版本将英文Wiki语料训练得到，它将使用gensim.model.word2vec的python库获取模型文件。 c[0]表示聚类0:

>>> similar(c[0], model[“chicken”])

0.95703287846549179

>>> similar(c[0], model[“recipe”] + model[“chicken”])

0.95602993446153006

>>> similar(c[0], model[“recipe”] + model[“fish”])

0.7678791380788017

>>> similar(c[0], model[“computer”])

0.0069432409372725294

>>> similar(c[0], model[“scala”])

0.061027248018988116

看上去语义信息保存完好。我们使用doc向量是可信服的。菜单文档看起来很简单。我们可以尝试更多的挑战，比如一篇新闻文章。新闻本身是叙事型的，包含很少的“主题词”。我们尝试在这篇文章标题为：“Signals on Radar Puzzle Officials in Hunt for Malaysian Jet”的文章进行聚类。我们可以得到4个聚类：

‘have’, ‘when’, ‘time’, ‘at’, ‘when’, ‘part’, ‘from’, ‘from’, ‘in’, ‘show’, ‘may’, ‘or’, ‘now’, ‘on’, ‘in’, ‘back’, ‘be’, ‘turned’, ‘for’, ‘on’, ‘location’, ‘mainly’, ‘which’, ‘to’,, ‘also’, ‘from’, ‘including’, ‘as’, ‘to’, ‘had’, ‘was’ ……
‘radar’, ‘northwest’, ‘radar’, ‘sends’, ‘signal’, ‘signals’, ‘aircraft’, ‘data’, ‘plane’, ‘search’, ‘radar’, ‘saturated’, ‘handles’, ‘search’, ‘controlled’, ‘detection’, ‘data’, ‘nautical’, ‘patrol’, ‘detection’, ‘detected’, ‘floating’, ‘blips’, ‘plane’, ‘objects’, ‘jets’, ‘kinds’, ‘signals’, ‘air’, ‘plane’, ‘aircraft’, ‘radar’, ‘passengers’, ‘signal’, ‘plane’, ‘unidentified’, ‘aviation’, ‘pilots’, ‘ships’, ‘signals’, ‘satellite’, ‘radar’, ‘blip’, ‘signals’, ‘radar’, ‘signals’ ……
‘of’, ‘the’, ‘of’, ‘of’, ‘of’, ‘the’, ‘a’, ‘the’, ‘senior’, ‘the’, ‘the’, ‘the’, ‘the’, ‘the’, ‘the’, ‘a’, ‘the’, ‘the’, ‘the’, ‘the’, ‘the’, ‘of’, ‘the’, ‘of’, ‘a’, ‘the’, ‘the’, ‘the’, ‘the’, ‘the’, ‘the’, ‘its’, ……
‘we’, ‘authorities’, ‘prompted’, ‘reason’, ‘local’, ‘local’, ‘increasing’, ‘military’, ‘inaccurate’, ‘military’, ‘identifying’, ‘force’, ‘mistaken’, ‘expanded’, ‘significance’, ‘military’, ‘vastly’, ‘significance’, ‘force’, ‘surfaced’, ‘military’, ‘quoted’, ‘showed’, ‘military’, ‘fueled’, ‘repeatedly’, ‘acknowledged’, ‘declined’, ‘authorities’, ‘emerged’, ‘heavily’, ‘statements’, ‘announced’, ‘authorities’, ‘chief’, ‘stopped’, ‘expanding’, ‘failing’, ‘expanded’, ‘progress’, ‘recent’, ……

看起来挺不错的。注意，这是个输入为1的聚类过程，并且我们不必去指定聚类数目。这对于对延迟很敏感的服务来说很有帮助。

缺失了一环：如何找出相关的聚类？我们在这部分不必做扩展实验。可以考虑：

idf权值
POS tag。我们不必在文档中标记每个词。根据经验，word2vec趋向于在语法构成上聚在一起。我们对每个簇都抽取出一些tag。
计算聚类和总向量，与标题向量

当然，还有其它问题需要考虑：

1) 如何合并簇？基于向量间的相似度？或者簇成员间的平均相似度
2)词的最小集合，可以重构簇和向量？可以使用关键词抽取方法。

结构：google的word2vec提供了强大的词向量。我们可以以有效的方式，来使用这些vector来生成高质量的文档向量。我们尝试了一个基于CRP变种的策略，并取得了结果。当然，还有很多问题需要研究，BalabalaBala…

代码如下：

# vecs: an array of real vectors
def crp(vecs):
    clusterVec = []         # tracks sum of vectors in a cluster
    clusterIdx = []         # array of index arrays. e.g. [[1, 3, 5], [2, 4, 6]]
    ncluster = 0
    # probablity to create a new table if new customer
    # is not strongly "similar" to any existing table
    pnew = 1.0/ (1 + ncluster)
    N = len(vecs)
    rands = random.rand(N)         # N rand variables sampled from U(0, 1)

    for i in range(N):
        maxSim = -Inf
        maxIdx = 0
        v = vecs[i]
        for j in range(ncluster):
            sim = cosine_similarity(v, clusterVec[j])
            if sim < maxSim:
                maxIdx = j
                maxSim = sim
            if maxSim < pnew:
                if rands(i) < pnew:
                    clusterVec[ncluster] = v
                    clusterIdx[ncluster] = [i]
                    ncluster += 1
                    pnew = 1.0 / (1 + ncluster)
                continue
        clusterVec[maxIdx] = clusterVec[maxIdx] + v
        clusterIdx[maxIdx].append(i)

    return clusterIdx

本文译自：http://eng.kifi.com/from-word2vec-to-doc2vec-an-approach-driven-by-chinese-restaurant-process/

从word2vec到doc2vec

September 26, 2014

Netflix关于cosine相似度的讨论

Meta AdaTT介绍

SATrans介绍