北大在《AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks》提出了AutoInt，我们来看下。

摘要

CTR预估的目标是，预测一个用户在一个ad或item上的概率，它对于许多在线应用（比如在线广告和推荐系统）很关键。但存在许多挑战，因为：

1) input features（例如：user id、user age、item id、item category）通常是稀疏高维的
2) 有效的预测通常依赖于高阶组合特征（cross features），由domain experts手工处理非常耗时，很难穷举。因此，对于稀疏高维原始特征，以及它们的特征组合，发现它们的低维表示需要一些工作。

本文提出了一种有效方法：AutoInt来自动学习关于input features的高阶特征交叉。我们提出的算法非常通用，它可以被同时应用到numerical和categorical的input features上。特别的，我们会将numerical和categorical features映射到相同的低维空间中。接着，使用一个带residual connections的multihead self-attentive neural network来显式建模在低维空间中的feature interactions。整个模型可以通过end-to-end的方式有效满足大规模的原始数据。具体代码：: https://github.com/DeepGraphLearning/RecommenderSystems

3.问题定义

我们首先正义定义ctr预测问题：

定义1: CTR Prediction

假设：\(x \in R^n\)表示user u的features和item v的features的concatenation，其中：

categorical features使用one-hot encoding表示
n是concatenated features的维度

那么，CTR预测的问题的目标是：预测user u根据feature vector x在item v上的点击概率。

CTR预测的一个简单做法是：将x看成是input features，并部署类似于LR的分类器进行预测。然而，由于原始featrue vector x非常稀疏且高维，模型很容易overfit。因此，在低维连续空间中表示raw input features是可行的。另外，在其它文献中，利用高阶组合特征来生成好的预测表现很重要。特别的，我们以如下方式定义高阶组合特征：

定义2: p-order组合特征

给定input feature vector \(x \in R^n\)，一个p阶组合特征被定义成：

\[g(x_{i_1}, \cdots, x_{i_p})\]

其中：每个feature来自一个不同的field

p是feature fields的数目
\(g(\cdot)\)是non-additive combination function，比如：乘法和外积，例如：\(x_{i_1} \times x_{i_2}\)是一个关于\(x_{i_1}\)和\(x_{i_2}\)的二阶组合特征

传统的，有意义的高阶组合特征（high-order comibatorial features）可以通过domain experts进行人工构建。然而，这非常耗时，很难泛化到其它domains上。另外，手工构建所有有意义的高阶特征是不可能的。因此，我们开发了一种方法来自动发现有意义的高阶组合特征，同时将所有这些features映射到低维连续空间上，正式地，我们以如下方式定义问题：

定义3: 问题定义

给定一个input feature vector \(x \in R^n\)用于ctr预测，我们的目标是：学习一个关于x的低维表示，它会建模高阶组合特征。

4.AutoInt

4.1 总览

我们的方法会将原始稀疏高维feature vector映射到低维空间上，同时建模高阶特征交叉。如图1所示，我们提出的方法会将sparse feature vector x作为input，后跟一个embedding layer，它会将所有features（包括：categorical和numerical）投影到相同的低维空间上。接着，我们将所有fields的embeddings feed到一个新的interacting layer上，它使用一个multi-head self-attentive neural network来实现。对于每个interacting layer，高阶features通过attention机制来组合，不同类型的combinations可以使用multi-head机制进行评估，它会将features映射到不同的subspaces上。通过将多个interacting layers进行stacking，不同阶的combinatorial features可以被建模。

图片名称

图1

最终interacting layer的output是input feature的低维表示，它可以建模high-order组合特征，进一步通过一个sigmoid function用来估计ctr。接着，我们会详细介绍。

4.2 Input Layer

我们首先表示user profiles和item属性作为一个sparse vector，它是所有fields的concatenation。特别的：

\[x = [x_1; x_2; \cdots; x_M]\]

…(1)

其中：

M是总的feature fields的数目
\(x_i\)是第i个fields的feature representation

当第i个field是categorical时，\(x_i\)是一个one-hot vector（例如：在图2中的\(x_1\)）；当第i个field为numerical时，\(x_i\)是一个scalar value（例如：图2中的\(x_M\)）。

图片名称

图2

4.3 Embedding Layer

由于categorical features的feature表示是非常稀疏高维的，一种常用方式是将它们表示成低维空间（例如：world embeddings）。特别的，我们将每个categorical feature使用一个低维vector来表示：

\[e_i = V_i x_i\]

…(2)

其中：

\(V_i\)是field i的一个embedding matrix
\(x_i\)是一个one-hot vector

通常，categorical features可以是multi-valued，例如：\(x_i\)是一个multi-hot vector。以电影观看预测为例，由于有个Genre的feature field，它会描述一个电影的types，它通常是multi-valued（例如：对于电影来说”Titanic”是Drama和Romance）。为了兼容multi-valued inputs，我们进一步修改等式(2)，将multi-valued feature表示成相应feature embedding vectors的平均：

\[e_i = \frac{1}{q} V_i x_i\]

…(3)

其中:

q是样本对于第i个field的values的数目
\(x_i\)是该field的multi-hot vector表示

为了允许categorical和numerical features的特征交叉，我们在相同的低维特征空间中表示numerical features。特别的，我们将numerical feature表示成：

\[e_m = v_m x_m\]

…(4)

其中：

\(v_m\)是field m的一个embedding vector
\(x_m\)是一个scalar value

通过这么做，embedding layer的output可以是一个关于多个embedding vectors的concatenation，如图2表示。

4.4 Interacting layer

一旦numerical和categorical features在相同的低维空间中存在，我们会在该空间中建模高阶组合特征（high-order cominatorical features）。关键问题是决定：哪个features应该被组合来形成有意义的high-order features。这在传统上由domain experts完成，它们会基于经验来创建有意义的特征组合。在本paper中，我们使用一个新方法“multi-head self-attention”机制来解决该问题。

Multi-head self-attentive network已经在建模复杂关系中取得了很好的效果。例如，它在机器翻译和句子embedding上，对于建模特别的word dependency具有优越性，已经被成功应用到捕获在graph embedding中的node相似性。这里，我们会将这些最新技术进行扩展来建模不同feature fields间的相关性。

特别的，我们采用key-value attention机制来决定，哪个feature combinations是有意义的。以feature m为例，接下来我们将解释如何标识涉及feature m的多个有意义的高阶特征。我们首先定义：feature m和feature k间在一个指定attention head h下的相关性：

\[a_{m,k}^{(h)} = \frac{exp(\phi^{(h)} (e_m, e_k))}{\sum_{l=1}^M exp(\phi^{(h)} (e_m, e_l))} \\ \phi^{(h)}(e_m, e_k)= <W_{Query}^{(h)} e_m, W_{Key}^{(h)} e_k>\]

…(5)

其中，\(\phi^{(h)} (\cdot, \cdot)\)是一个attention function，它定义了feature m和k间的相似性。它可以定义成一个neural network，或者一个简单的内积，例如：\(<\cdot, \cdot>\)。在本工作中，我们使用inner product是因为它的简单和有效。等式（5）中的\(W_{Query}^{(h)}, W_{Key}^{(h)} \in R^{d' \times d}\)是transformation矩阵，它会将原始的embedding space \(R^d\)映射到一个新的空间\(R^{d'}\)中。接着，我们会在子空间h中更新feature m的表示，通过将所有由系数\(a_{m,k}^{(h)}\)指定的所有相关特征进行组合来完成：

\[\bar{e}_m^{(h)} = \sum_{k=1}^M a_{m,k}^{(h)} (W_{Value}^{(h)} e_k)\]

…(6)

其中，\(W_{Value}^{(h)} \in R^{d' \times d}\)

由于，\(\bar{e}_m^{(h)} \in R^{d'}\)是一个feature m和它相关features（在head h下）的组合，它可以表示成由我们的方法学到的一个新的组合特征。考虑气候，个维护feature不能上课测IHI而已工程i莫高窟人combinatorial features，我们可以使用多个heads来达成，它可以创建不同的subspaces并分别学习不同的feature interactions。我们以如下方式收集在所有subspaces中学到的combinatorial features：

\[\bar{e}_m = \bar{m}^{(1)} \oplus \bar{m}^{(2)} \oplus \cdots \oplus \bar{m}^{(H)}\]

…(7)

其中，\(\oplus\)是concatenation operator，其中H是total heads的数目。

图片名称

图3

为了保留之前学到的combinatorial features，包含raw individual (例如：一阶) features，我们在网络中添加标准的residual connections：

\[e_m^{Res} = ReLU(\bar{e}_m + W_{Res} e_m)\]

…(8)

其中，\(W_{Res} \in R^{d' H \times d}\)是关于didension mismatching的投影矩阵，其中，\(ReLU(z) = max(0, z)\)是一个非线性activation function。

有了这样的一个interacting layer，每个feature的表示\(e_m\)会被更新成一个新的feature representation \(e_m^{Res}\)，它是一个高阶features的表示。我们可以将多个这样的layers进行stack，前一interacting layer的output可以做为下一interacting layer的input。通过这样做，我们可以建模任意阶的combinatorical features。

4.5 Output layer

interacting layer的output是一个关于feature vectors \(\lbrace e_m^{Res} \rbrace_{m=1}^M\)的集合，其中，包含了由residual block保留的raw individual features，以及由multi-head self-attention机制学到的combinatorial features。对于最终的CTR预测，我们可以将所有进行concatenate，接着应用一个非线性投影：

\[\hat{y} = \sigma(w^T (e_1^{Res} \oplus e_2^{Res} \oplus \cdots e_M^{Res} ) + b)\]

…(9)

其中，\(w \in R^{d' H M}\)是一个列投影向量，它可以对concatenated features进行线性组合，b是bias，\(\sigma(x) = 1 / (1+e^{-x})\)会将values转化成users的点击概率上。

4.6 训练

我们的loss funciton 是log loss，它根据以下进行定义：

\[Logloss = - \frac{1}{N} \sum_{j=1}^N (y_j log(\hat{y}_j + (1-y_j) log(1-\hat{y}_j))\]

…(10)

其中，\(y_j\)和\(\hat{y}_j\)分别是user clicks的ground truth和预估的CTR，j会索引训练样本，N是训练样本总数。模型中学习的参数是：\(\lbrace V_i, v_m, W_{Query}^(h), W_{Key}^{(h)}, W_{Value}^{(h)}, W_{Res}, w, b\rbrace\)，它们会通过使用gradient descent方式对total Logloss进行最小化更新。

4.7 AutoInt分析

建模任意阶组合特征。

给定由等式(5)-(8)的feature interaction的operator，我们接着分析低阶和高阶组合特征是如何建模的。

对假设存在四个feature fields（例如：M=4）分别由\(x_1, x_2, x_3与x_4\)各自表示。在第一个interacting layer，每个独立的feature会通过attention机制（等式5）与任意其它features进行交叉，因此会使用不同的相关weightes捕获二阶特征交叉：\(g(x_1,x_2), g(x_2, x_3), g(x_3, x_4)\)，其中interaction function \(g(\cdot)\)的non-additive特性（定义2）可以通过activation function \(ReLU(\cdot)\)的non-linearity来进行保证。理想的，涉及\(x_1\)的组合特征可以被编码成第一个feature field \(e_1^{Res}\)的updated representation。由于对于其它feature fields来说是相同的源头，所有二阶特征交叉可以被编码成第一个interacting layer的output，其中attention weights会distill有用的特征组合。

接下来，我们证明了高阶特征交叉可以在第二个interacting layer中建模。给定由第一个interacting layer生成的第一个feature field \(e_1^{Res}\)的representation、以及第三个feature field \(e_3^{Res}\)，涉及\(x_1, x_2, x_3\)的三阶组合特征可以被建模，允许\(e_1^{Res}\)来attend on \(e_3^{Res}\)，因为\(e_1^{Res}\)包含了interaction \(g(x_1, x_2)\)以及\(e_3^{Res}\)包含了单个特征\(x_3\)（来自residual connection）。另外，组合特征的最大阶会随着interacting layers的数目进行指数增长。例如，四阶特征交叉\(g(x_1, x_2, x_3, x_4)\)可以通过\(e_1^{Res}\)和\(e_3^{Res}\)的组合进行捕获，它分别包含了二阶交叉\(g(x_1, x_2)\)以及\(g(x_3, x_4)\)。因此，少量的interacting layers足够去建模高阶特征交叉。

基于上述分析，我们可以看到，AutoInt会以一个hierarchical的方式使用attention机制来学习feature interactions，例如：从低阶到高阶，所有低阶特征交叉可以通过residual connections进行捎带。这是有保证且合理的，因为学习hierarchical representation已经在计算机视觉和语音处理中对于DNN来说相当有效。

空间复杂度（Space Complexity）

embedding layer，它在NN-based方法中是一个共享组件，包含了nd的参数，其中n是input feature的sparse representation的维度，d是embedding size。由于一个interacting layer包含了以下的weight matrics：\(\lbrace W_{Query}^{(h)} , W_{Key}^{(h)}, W_{Value}^{h}, W_{Res} \rbrace\)，在一个L-layer network的参数数目是\(L \times (3d d' + d'Hd)\)，它与feature fields M的数目是独立的。最终，在output layer中存在\(d' H M + 1\)个参数。只要interacting layers被关注，空间复杂度是\(O(Ldd'H)\)。注意，H和d’通常很小（例如：H=2 以及d’=32），它会使得interacting layer相当memory-efficient。

时间复杂度（TIme Complexity）

在每个interacting layer中，计算开销是two-fold的。首先，对于一个head计算attention weights会花费\(O(Mdd' + M^2 d')\)的时间。接着，在一个head下形成组合特征也会花费\(O(Md d' + M^2 d')\)的时间。由于我们有H个heads，它总共花费\(O(MHd'(M+d)))\)的时间。由于H, d, d’通常很小，所以很高效。

5.实验

略

参考

1.https://arxiv.org/pdf/1810.11921.pdf

AutoInt介绍

October 10, 2020

摘要

3.问题定义

4.AutoInt

4.1 总览

4.2 Input Layer

4.3 Embedding Layer

4.4 Interacting layer

4.5 Output layer

4.6 训练

4.7 AutoInt分析

5.实验

参考

kuaishou LiveForesighter介绍

kuaishou MomentCross介绍

kuaishou LiveStream RS介绍