Logistic Regression

Key points examined: the gradient-descent proof, i.e. deriving the cross-entropy (loss function).

This whole topic feels about 99% likely to come up.

Getting straight to the point.

The logistic regression cross-entropy function:

$$
J(\theta)=-\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)} \log \left(h_{\theta}\left(x^{(i)}\right)\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right]
$$

along with the partial derivative of $J(\theta)$ with respect to the parameters $\theta$ (used for parameter updates in optimization algorithms such as gradient descent):

$$
\frac{\partial}{\partial \theta_{j}} J(\theta)=\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{j}^{(i)}
$$

The cross-entropy function (the logistic regression loss)

Suppose there are $m$ labeled samples in total (batch size $= m$), where $\left(x^{(i)}, y^{(i)}\right)$ denotes the $i$-th sample together with its class label.

Here $x^{(i)}=\left(1, x_{1}^{(i)}, x_{2}^{(i)}, \ldots, x_{p}^{(i)}\right)^{T}$ is a $(p+1)$-dimensional vector (the leading 1 accounts for the bias term), and $y^{(i)}$ is a number indicating the class:

  • In logistic regression (yes/no questions, i.e. binary classification), $y^{(i)}$ takes the value 0 or 1;
  • In softmax regression (multi-class problems), $y^{(i)}$ is one of $1, 2, \ldots, k$, a number labeling the class (assuming $k$ classes in total).

Here we only discuss the binary case.

The sigmoid function

For an input sample $x^{(i)}=\left(1, x_{1}^{(i)}, x_{2}^{(i)}, \ldots, x_{p}^{(i)}\right)^{T}$ and model parameters $\theta=\left(\theta_{0}, \theta_{1}, \theta_{2}, \ldots, \theta_{p}\right)^{T}$, we have $\theta^{T} x^{(i)}:=\theta_{0}+\theta_{1} x_{1}^{(i)}+\cdots+\theta_{p} x_{p}^{(i)}$.

In binary problems, the sigmoid is commonly used as the hypothesis function, defined as:

$$
h_{\theta}\left(x^{(i)}\right)=\frac{1}{1+e^{-\theta^{T} x^{(i)}}}
$$

Since logistic regression is a 0/1 binary classification problem, we have

$$
\begin{array}{c}
P\left(\hat{y}^{(i)}=1 \mid x^{(i)} ; \theta\right)=h_{\theta}\left(x^{(i)}\right) \\
P\left(\hat{y}^{(i)}=0 \mid x^{(i)} ; \theta\right)=1-h_{\theta}\left(x^{(i)}\right)
\end{array}
$$
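The hypothesis and the two class probabilities above can be sketched in a few lines of numpy (the function names here are my own, not from any particular library):

```python
import numpy as np

def sigmoid(z):
    """sigmoid(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Hypothesis h_theta(x) = sigmoid(theta^T x); x carries a leading 1 for the bias."""
    return sigmoid(theta @ x)

theta = np.array([0.0, 1.0, -1.0])  # (theta_0 bias, theta_1, theta_2)
x = np.array([1.0, 2.0, 2.0])       # leading 1 is the bias feature
p1 = h(theta, x)                    # P(y = 1 | x; theta)
print(p1, 1.0 - p1)                 # theta^T x = 0 here, so both are 0.5
```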

Deriving the cross-entropy

For now, set aside the concept of entropy; a simple, intuitive argument already yields the loss function we want. Taking the logarithm of a probability preserves its monotonicity, so

$$
\begin{array}{c}
\log P\left(\hat{y}^{(i)}=1 \mid x^{(i)} ; \theta\right)=\log h_{\theta}\left(x^{(i)}\right)=\log \frac{1}{1+e^{-\theta^{T} x^{(i)}}} \\
\log P\left(\hat{y}^{(i)}=0 \mid x^{(i)} ; \theta\right)=\log \left(1-h_{\theta}\left(x^{(i)}\right)\right)=\log \frac{e^{-\theta^{T} x^{(i)}}}{1+e^{-\theta^{T} x^{(i)}}}
\end{array}
$$

Then for the $i$-th sample, the combined log-probability that the hypothesis is correct is:

$$
\begin{aligned}
& I\left\{y^{(i)}=1\right\} \log P\left(\hat{y}^{(i)}=1 \mid x^{(i)} ; \theta\right)+I\left\{y^{(i)}=0\right\} \log P\left(\hat{y}^{(i)}=0 \mid x^{(i)} ; \theta\right) \\
={}& y^{(i)} \log P\left(\hat{y}^{(i)}=1 \mid x^{(i)} ; \theta\right)+\left(1-y^{(i)}\right) \log P\left(\hat{y}^{(i)}=0 \mid x^{(i)} ; \theta\right) \\
={}& y^{(i)} \log \left(h_{\theta}\left(x^{(i)}\right)\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)
\end{aligned}
$$

Here $I\left\{y^{(i)}=1\right\}$ and $I\left\{y^{(i)}=0\right\}$ are indicator functions: the expression equals 1 when the condition inside the braces holds and 0 otherwise; we will not belabor this.

Then over all $m$ samples, the model's fit to the whole training set is:

$$
\sum_{i=1}^{m}\left[y^{(i)} \log \left(h_{\theta}\left(x^{(i)}\right)\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right]
$$

By the meaning of this correctness probability, the larger its value, the better the model expresses the data. But for parameter updates, or for judging a model's quality, we need a loss function ($Loss\ function$) or cost function ($Cost\ function$) that reflects the model's error, and we want that loss to be as small as possible. To reconcile these two opposites, we may simply take the cost function to be the negative of the combined log-probability above (averaged over the $m$ samples):

$$
J(\theta)=-\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)} \log \left(h_{\theta}\left(x^{(i)}\right)\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right]
$$

This is the famous cross-entropy loss function.

As for the concept of entropy itself, it helps to recall the information entropy $E\left[-\log p_{i}\right]=-\sum_{i=1}^{m} p_{i} \log p_{i}$.
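The loss just derived translates directly into code. A minimal numpy sketch (function name and the small `eps` guard against $\log 0$ are my own choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(theta, X, y, eps=1e-12):
    """J(theta) = -(1/m) * sum[ y*log(h) + (1-y)*log(1-h) ].

    X has shape (m, p+1) with a leading column of ones; y has shape (m,).
    eps keeps the logs finite when h is exactly 0 or 1.
    """
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))

X = np.array([[1.0, 2.0], [1.0, -1.0]])  # m=2 samples: bias column + 1 feature
y = np.array([1.0, 0.0])
theta = np.zeros(2)
J0 = cross_entropy(theta, X, y)
print(J0)  # h = 0.5 everywhere at theta = 0, so J = log 2 ≈ 0.693
```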

Differentiating the cross-entropy loss

Long and tedious, but mechanical.

Element-wise form

The cross-entropy loss is:

$$
J(\theta)=-\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)} \log \left(h_{\theta}\left(x^{(i)}\right)\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right]
$$

where:

$$
\begin{aligned}
\log h_{\theta}\left(x^{(i)}\right) &=\log \frac{1}{1+e^{-\theta^{T} x^{(i)}}}=-\log \left(1+e^{-\theta^{T} x^{(i)}}\right), \\
\log \left(1-h_{\theta}\left(x^{(i)}\right)\right) &=\log \left(1-\frac{1}{1+e^{-\theta^{T} x^{(i)}}}\right) \\
&=\log \left(\frac{e^{-\theta^{T} x^{(i)}}}{1+e^{-\theta^{T} x^{(i)}}}\right) \\
&=\log \left(e^{-\theta^{T} x^{(i)}}\right)-\log \left(1+e^{-\theta^{T} x^{(i)}}\right) \\
&=-\theta^{T} x^{(i)}-\log \left(1+e^{-\theta^{T} x^{(i)}}\right).
\end{aligned}
$$

From this we obtain:

$$
\begin{aligned}
J(\theta) & =-\frac{1}{m} \sum_{i=1}^{m}\left[-y^{(i)} \log \left(1+e^{-\theta^{T} x^{(i)}}\right)+\left(1-y^{(i)}\right)\left(-\theta^{T} x^{(i)}-\log \left(1+e^{-\theta^{T} x^{(i)}}\right)\right)\right] \\
& =-\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)} \theta^{T} x^{(i)}-\theta^{T} x^{(i)}-\log \left(1+e^{-\theta^{T} x^{(i)}}\right)\right] \\
& =-\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)} \theta^{T} x^{(i)}-\log e^{\theta^{T} x^{(i)}}-\log \left(1+e^{-\theta^{T} x^{(i)}}\right)\right] \\
& =-\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)} \theta^{T} x^{(i)}-\left(\log e^{\theta^{T} x^{(i)}}+\log \left(1+e^{-\theta^{T} x^{(i)}}\right)\right)\right] \\
& =-\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)} \theta^{T} x^{(i)}-\log \left(1+e^{\theta^{T} x^{(i)}}\right)\right]
\end{aligned}
$$
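The last line gives an equivalent, log-sum-exp style way to compute the loss. A quick numerical check (numpy; the random data and names are illustrative) that $-\frac{1}{m}\sum_i\left[y^{(i)}\theta^{T}x^{(i)}-\log\left(1+e^{\theta^{T}x^{(i)}}\right)\right]$ agrees with the original definition:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(5), rng.normal(size=(5, 2))])  # bias + 2 features
y = rng.integers(0, 2, size=5).astype(float)
theta = rng.normal(size=3)

z = X @ theta
h = sigmoid(z)
J_direct = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
J_simplified = -np.mean(y * z - np.log1p(np.exp(z)))  # simplified form above

print(J_direct, J_simplified)  # the two agree to floating-point precision
```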

Now take the partial derivative of $J(\theta)$ with respect to the $j$-th parameter component $\theta_{j}$:

$$
\begin{aligned}
\frac{\partial}{\partial \theta_{j}} J(\theta) & =\frac{\partial}{\partial \theta_{j}}\left(\frac{1}{m} \sum_{i=1}^{m}\left[\log \left(1+e^{\theta^{T} x^{(i)}}\right)-y^{(i)} \theta^{T} x^{(i)}\right]\right) \\
& =\frac{1}{m} \sum_{i=1}^{m}\left[\frac{\partial}{\partial \theta_{j}} \log \left(1+e^{\theta^{T} x^{(i)}}\right)-\frac{\partial}{\partial \theta_{j}}\left(y^{(i)} \theta^{T} x^{(i)}\right)\right] \\
& =\frac{1}{m} \sum_{i=1}^{m}\left(\frac{x_{j}^{(i)} e^{\theta^{T} x^{(i)}}}{1+e^{\theta^{T} x^{(i)}}}-y^{(i)} x_{j}^{(i)}\right) \\
& =\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{j}^{(i)}
\end{aligned}
$$

This is the derivative of the cross-entropy with respect to the parameters:

$$
\frac{\partial}{\partial \theta_{j}} J(\theta)=\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{j}^{(i)}
$$
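One way to sanity-check this derivative is a central finite-difference comparison against the loss itself (a numpy sketch; all names are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(theta, X, y):
    z = X @ theta
    return -np.mean(y * z - np.log1p(np.exp(z)))  # cross-entropy, simplified form

def grad(theta, X, y):
    """Analytic gradient: components (1/m) * sum_i (h_theta(x_i) - y_i) * x_ij."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(8), rng.normal(size=(8, 3))])
y = rng.integers(0, 2, size=8).astype(float)
theta = rng.normal(size=4)

# Central finite differences, one coordinate j at a time.
eps = 1e-6
num = np.array([
    (loss(theta + eps * e, X, y) - loss(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(4)
])
gap = np.max(np.abs(num - grad(theta, X, y)))
print(gap)  # tiny: the analytic formula matches the numerical derivative
```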

Vectorized form

Only the notation differs; the process is essentially the same, but the vector form feels cleaner.

Ignoring the factor $\frac{1}{m}$ in front of the cross-entropy:

$$
J(\theta)=-\left[y^{T} \log h_{\theta}(x)+\left(1-y^{T}\right) \log \left(1-h_{\theta}(x)\right)\right]
$$

hθ(x)=11+eθTxh_{\theta}(x)=\frac{1}{1+e^{-\theta^{T} x}} 带入, 得到:

$$
\begin{aligned}
J(\theta) & =-\left[y^{T} \log \frac{1}{1+e^{-\theta^{T} x}}+\left(1-y^{T}\right) \log \frac{e^{-\theta^{T} x}}{1+e^{-\theta^{T} x}}\right] \\
& =-\left[-y^{T} \log \left(1+e^{-\theta^{T} x}\right)+\left(1-y^{T}\right) \log e^{-\theta^{T} x}-\left(1-y^{T}\right) \log \left(1+e^{-\theta^{T} x}\right)\right] \\
& =-\left[\left(1-y^{T}\right) \log e^{-\theta^{T} x}-\log \left(1+e^{-\theta^{T} x}\right)\right] \\
& =-\left[\left(1-y^{T}\right)\left(-\theta^{T} x\right)-\log \left(1+e^{-\theta^{T} x}\right)\right]
\end{aligned}
$$

Differentiating with respect to $\theta$, the leading minus sign cancels directly:

$$
\begin{aligned}
\frac{\partial}{\partial \theta} J(\theta) & =-\frac{\partial}{\partial \theta}\left[\left(1-y^{T}\right)\left(-\theta^{T} x\right)-\log \left(1+e^{-\theta^{T} x}\right)\right] \\
& =\left(1-y^{T}\right) x-\frac{e^{-\theta^{T} x}}{1+e^{-\theta^{T} x}} x \\
& =\left(\frac{1}{1+e^{-\theta^{T} x}}-y^{T}\right) x \\
& =\left(h_{\theta}(x)-y^{T}\right) x
\end{aligned}
$$
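In code, the final line is one matrix expression. For a single sample the gradient is $(h_\theta(x)-y)\,x$; stacking samples into a design matrix $X$ of shape $m\times(p+1)$ gives $\frac{1}{m}X^{T}(h_\theta(X)-y)$. A sketch under my own naming:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One sample: gradient is (h_theta(x) - y) * x, a (p+1)-vector.
theta = np.array([0.5, -0.25, 1.0])
x = np.array([1.0, 2.0, -1.0])       # leading 1 = bias feature
y = 1.0
g_single = (sigmoid(theta @ x) - y) * x

# A batch stacks the per-sample gradients: grad J = X^T (h - y) / m.
X = np.array([x])                    # design matrix with this one row
yv = np.array([y])
g_batch = X.T @ (sigmoid(X @ theta) - yv) / len(yv)

print(np.allclose(g_single, g_batch))  # True: same gradient either way
```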

Parameter updates in gradient descent

After initializing the parameters $\theta$, repeat:

$$
\theta_{j}:=\theta_{j}-\alpha \frac{\partial}{\partial \theta_{j}} J(\theta)
$$

This is the parameter-update rule; combining it with the cross-entropy derivative obtained above yields:

$$
\theta_{j}:=\theta_{j}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{j}^{(i)}
$$

where $\alpha$ is the learning rate.
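Putting the update rule together into a full training loop (a minimal numpy sketch on synthetic data; the variable names, learning rate, and iteration count are my own choices, not prescribed by the derivation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic separable data: the label depends on the sign of x1 + x2.
rng = np.random.default_rng(42)
m = 200
features = rng.normal(size=(m, 2))
y = (features[:, 0] + features[:, 1] > 0).astype(float)
X = np.column_stack([np.ones(m), features])  # prepend the bias column

alpha = 0.5          # learning rate
theta = np.zeros(3)  # initialize parameters
for _ in range(500):
    h = sigmoid(X @ theta)
    # theta_j := theta_j - (alpha/m) * sum_i (h_i - y_i) * x_ij, vectorized:
    theta -= alpha / m * (X.T @ (h - y))

acc = np.mean((sigmoid(X @ theta) > 0.5) == (y == 1))
print(acc)  # training accuracy well above chance on this toy data
```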