The Frobenius norm is the square root of the sum of squares of all matrix elements; it is the natural extension of the vector 2-norm to matrices.
$$||A||_F = \sqrt{\sum_{i,j} a_{ij}^2}$$

How can $||A||_F$ serve as a distance metric?
One way is as follows. Let the label matrix be $B$ (usually composed of the label vectors of the samples in a batch; e.g., with 4-dimensional label vectors and batch_size 128, $B$ has shape (128, 4)). The distance is then
$$\mathcal{L}_F = ||A - B||_F$$
The nuclear norm can also be considered, although how to use it as a distance metric is unclear. It equals the sum of the singular values of $A$:
$$
\begin{aligned}
||A||_* = \operatorname{tr}\left(\sqrt{A^{T} A}\right) &= \operatorname{tr}\left(\sqrt{\left(U \Sigma V^{T}\right)^{T} U \Sigma V^{T}}\right) \\
&= \operatorname{tr}\left(\sqrt{V \Sigma^{T} U^{T} U \Sigma V^{T}}\right) \\
&= \operatorname{tr}\left(\sqrt{V \Sigma^{2} V^{T}}\right) \quad \left(\Sigma^{T}=\Sigma\right) \\
&= \operatorname{tr}\left(\sqrt{V^{T} V \Sigma^{2}}\right) \\
&= \operatorname{tr}(\Sigma)
\end{aligned}
$$
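The identity can be checked numerically. A minimal sketch (the random matrix is chosen only for illustration):

```python
import torch

# Check ||A||_* = tr(Sigma): the nuclear norm equals the sum of singular values.
A = torch.randn(4, 3)
print(torch.linalg.svdvals(A).sum())  # sum of singular values, tr(Sigma)
print(torch.norm(A, p="nuc"))         # built-in nuclear norm; same value
```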
Reference code:
```python
import torch
import numpy as np

label = np.array([[1, 2, 3, 4], [1, 2, 3, 4]])
pred = np.array([[2, 3, 4, 5], [2, 3, 4, 5]])
print("F-norm per sample")
print(torch.norm(torch.from_numpy(pred - label).type(torch.cuda.FloatTensor), p="fro", dim=-1))
print("2-norm per sample")
print(torch.norm(torch.from_numpy(pred - label).type(torch.cuda.FloatTensor), p=2, dim=-1))
print("Nuclear norm")  # how to use the nuclear norm as a loss is unclear
print(torch.norm(torch.from_numpy(pred - label).type(torch.cuda.FloatTensor), p="nuc"))
print("Average loss: sum the per-sample losses, then divide by the number of samples")
print(torch.norm(torch.from_numpy(pred - label).type(torch.cuda.FloatTensor), p="fro", dim=-1).sum() / label.shape[0])
print(torch.norm(torch.from_numpy(pred - label).type(torch.cuda.FloatTensor), p=2, dim=-1).sum() / label.shape[0])
```
Output:

```
F-norm per sample
tensor([2., 2.], device='cuda:0')
2-norm per sample
tensor([2., 2.], device='cuda:0')
Nuclear norm
tensor(2.8284, device='cuda:0')
Average loss: sum the per-sample losses, then divide by the number of samples
tensor(2., device='cuda:0')
tensor(2., device='cuda:0')

Process finished with exit code 0
```
```python
>>> # Euclidean distance between tensors of different shapes
>>> # Source: https://blog.csdn.net/sinat_24899403/article/details/119249822
>>> import torch
>>> from torch import nn
>>> a = torch.tensor([[1, 1, 1],
...                   [2, 2, 2]])
>>> b = torch.tensor([[2, 2, 2],
...                   [1, 1, 1],
...                   [2, 2, 2],
...                   [1, 1, 1],
...                   [2, 2, 2]])
>>> def pdist(a: torch.Tensor, b: torch.Tensor, p: int = 2) -> torch.Tensor:
...     return (a - b).abs().pow(p).sum(-1).pow(1 / p)
...
>>> a_ = a.unsqueeze(1)   # shape (2, 1, 3)
>>> b_ = b.unsqueeze(0)   # shape (1, 5, 3)
>>> print(pdist(a_, b_))  # broadcasting yields a (2, 5) distance matrix
tensor([[1.7321, 0.0000, 1.7321, 0.0000, 1.7321],
        [0.0000, 1.7321, 0.0000, 1.7321, 0.0000]])
```
For many-to-many comparisons with mismatched shapes, computing with numpy is fastest; a possible reason is that torch does not provide a dedicated kernel for integer data of different shapes. If the shapes match, a pairwise method can be used.
Same shape (for label prediction):
```python
>>> from torchmetrics.classification import BinaryHammingDistance
>>> b = torch.tensor([[0, 0, 1],
...                   [0, 1, 0]])
>>> c = torch.tensor([[1, 0, 1],
...                   [0, 1, 0]])
>>> metric = BinaryHammingDistance(multidim_average='samplewise')
>>> print(metric(b, c))
tensor([0.3333, 0.0000])
```
Different shapes (for retrieval):
```python
>>> from sklearn.neighbors import DistanceMetric
>>> import numpy as np
>>> import time
>>> a = np.array([[0, 0, 2]])
>>> b = np.array([[0, 0, 1],
...               [0, 1, 0]])
>>> s1 = time.time()
>>> print(DistanceMetric.get_metric("hamming").pairwise(a, b))
>>> print(f"elapsed {time.time() - s1}")
[[0.33333333 0.66666667]]
elapsed 0.0003008842468261719
>>> a_tensor = torch.as_tensor(a, dtype=torch.float)
>>> b_tensor = torch.as_tensor(b, dtype=torch.float)
>>> s2 = time.time()
>>> print(torch.cdist(a_tensor, b_tensor, p=0))  # p=0 counts differing coordinates (unnormalized Hamming)
>>> print(f"elapsed {time.time() - s2}")
tensor([[1., 2.]])
elapsed 0.0003383159637451172
>>> a_tensor = torch.as_tensor(a, dtype=torch.float).cuda()
>>> b_tensor = torch.as_tensor(b, dtype=torch.float).cuda()
>>> s3 = time.time()
>>> print(torch.cdist(a_tensor, b_tensor, p=0))
>>> print(f"elapsed {time.time() - s3}")
tensor([[1., 2.]], device='cuda:0')
elapsed 0.0015494823455810547
```
Wikipedia: The Kullback–Leibler divergence (KLD), known as relative entropy in information systems, randomness in continuous time series, and information gain in statistical model inference; it is also called information divergence. It is an asymmetric measure of the difference between two probability distributions $P$ and $Q$. KL divergence measures the expected number of extra bits required to encode samples from $P$ using a code optimized for $Q$; note the order of $P$ and $Q$. Typically, $P$ represents the true data distribution, while $Q$ represents the theoretical distribution, the estimated model distribution, or an approximation of $P$.
$$
\begin{aligned}
D_{KL}(P||Q) &= -\sum_i P(i)\ln\frac{Q(i)}{P(i)} \\
&= \sum_i P(i)\ln\frac{P(i)}{Q(i)}
\end{aligned}
$$

The value of the relative entropy is non-negative:
$$D_{KL}(P||Q)\geq 0$$
By convention, a term is taken to be zero when both probabilities are zero.
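As a quick sanity check of the definition, here is a minimal sketch; the distributions $P$ and $Q$ below are made up for illustration. Note that `F.kl_div` expects log-probabilities as its first argument:

```python
import torch
import torch.nn.functional as F

p = torch.tensor([0.4, 0.4, 0.2])  # "true" distribution P
q = torch.tensor([0.3, 0.5, 0.2])  # approximating distribution Q

# Direct implementation of the formula above.
kl_manual = (p * (p / q).log()).sum()

# F.kl_div(log Q, P) computes D_KL(P || Q).
kl_torch = F.kl_div(q.log(), p, reduction="sum")

print(kl_manual, kl_torch)  # both ≈ 0.0258
```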
Wikipedia: In information theory, the cross entropy between two probability distributions $p$ and $q$ over the same event space measures the average number of bits needed to uniquely identify an event from the set when the coding scheme is based on an "unnatural" probability distribution $q$ rather than the "true" distribution $p$.
Given two probability distributions $p$ and $q$, the cross entropy of $p$ relative to $q$ is defined as:
$$H(p,q) = E_p[-\log q] = H(p) + D_{KL}(p||q),$$

where $H(p)$ is the entropy of $p$, and $D_{KL}(p||q)$ is the KL divergence between $p$ and $q$ (also called the relative entropy of $p$ with respect to $q$).
For discrete distributions $p$ and $q$, the cross entropy can be defined as:
$$H(p,q)=-\sum_x p(x)\log q(x).$$

where the sum runs over the sample space. Keep this formula in mind, because the classification cross-entropy losses introduced next are based on it.
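The decomposition $H(p,q) = H(p) + D_{KL}(p||q)$ above can be verified numerically with the same made-up distributions:

```python
import torch

p = torch.tensor([0.4, 0.4, 0.2])
q = torch.tensor([0.3, 0.5, 0.2])

h_pq = -(p * q.log()).sum()       # cross entropy H(p, q)
h_p = -(p * p.log()).sum()        # entropy H(p)
d_kl = (p * (p / q).log()).sum()  # KL divergence D_KL(p || q)

print(h_pq, h_p + d_kl)  # the two values match
```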
Consider a binary classification task where the true label $y$ takes values in $\{0,1\}$ and $\hat{y}$ is the predicted probability of the positive class. From the definition of cross entropy, the cross-entropy loss function can be defined as follows.
$$J=-\frac{1}{N}\sum_{i=1}^{N}\left[y\log\hat{y}+(1-y)\log(1-\hat{y})\right]$$
Let the true distribution be $p(i)$ (the distribution of the true label $y$) and the model's predicted distribution be $q(i)$ (the distribution of the predicted label $\hat{y}$). The derivation is as follows,
$$
\begin{aligned}
H(p,q) &=-\sum_x p(x)\cdot \log(q(x)) \\
&=\sum_x p(x)\cdot \log\frac{1}{q(x)}\\
&=\sum_x p_{(y=1|x)} \cdot \log\frac{1}{q_{(y=1|x)}} + p_{(y=0|x)} \cdot \log\frac{1}{q_{(y=0|x)}}\\
&=\sum_x y\log\frac{1}{\hat{y}} + (1-y)\log\frac{1}{1-\hat{y}}\\
&=-\sum_x \left[y\log \hat{y} + (1-y)\log(1-\hat{y})\right]
\end{aligned}
$$

where $p_{(y=1|x)}=y$, $q_{(y=1|x)}=\hat{y}$, $p_{(y=0|x)}=1-y$, and $q_{(y=0|x)}=1-\hat{y}$. Finally, multiplying by $\frac{1}{N}$ averages over the $N$ samples.
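A minimal sketch checking the formula against PyTorch's built-in binary cross entropy (the labels and probabilities below are made up):

```python
import torch
import torch.nn.functional as F

y = torch.tensor([1., 0., 1., 1.])          # true labels
y_hat = torch.tensor([0.9, 0.2, 0.7, 0.6])  # predicted probabilities

# J = -(1/N) * sum(y*log(y_hat) + (1-y)*log(1-y_hat))
j_manual = -(y * y_hat.log() + (1 - y) * (1 - y_hat).log()).mean()

# F.binary_cross_entropy averages over samples by default.
j_torch = F.binary_cross_entropy(y_hat, y)

print(j_manual, j_torch)  # the two values match
```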
Analogous to binary cross entropy, multi-class cross entropy likewise computes the entropy of the label distribution. The extra step is a summation over classes: the cross-entropy terms are summed over each class and then over the sample space. Keeping the notation of the binary case and letting $n$ be the number of label classes, the multi-class cross-entropy loss can be written as
$$\mathcal L = -\frac{1}{N}\sum_{i=1}^N\sum_{j=1}^n y_j^{(i)}\cdot \log\hat{y}_j^{(i)}$$
where $n$ is the number of classes, i.e. the dimension of the label vector, and $N$ is the number of samples.
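A minimal sketch relating the double sum to PyTorch's `F.cross_entropy`, which takes raw logits and integer class indices (the logits below are made up):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, 0.3],
                       [0.1, 1.5, 0.2]])  # N=2 samples, n=3 classes
targets = torch.tensor([0, 1])            # true class indices

# Manual: softmax -> one-hot -> -(1/N) * sum_i sum_j y_j^(i) * log(y_hat_j^(i))
y_hat = logits.softmax(dim=1)
y = F.one_hot(targets, num_classes=3).float()
loss_manual = -(y * y_hat.log()).sum(dim=1).mean()

# Built-in: applies log-softmax internally and averages over samples.
loss_torch = F.cross_entropy(logits, targets)

print(loss_manual, loss_torch)  # the two values match
```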
Given an anchor $\boldsymbol p$, measuring similarity with Euclidean distance for example, let its positive sample be $\boldsymbol q$ and its negative sample be $\boldsymbol r$.
$$\mathcal L = \max(||\boldsymbol p-\boldsymbol q||^2-||\boldsymbol p-\boldsymbol r||^2+\epsilon,\,0)$$

Code implementation:
"""
loss class
"""
class triplet_loss ( nn . Module ):
def __init__ ( self ):
super ( triplet_loss , self ) . __init__ ()
self . margin = 0.2
def forward ( self , anchor , positive , negative ):
pos_dist = ( anchor - positive ) . pow ( 2 ) . sum ( 1 )
neg_dist = ( anchor - negative ) . pow ( 2 ) . sum ( 1 )
loss = F . relu ( pos_dist - neg_dist + self . margin )
return loss . mean () #we can also use #torch.nn.functional.pairwise_distance(anchor,positive, keep_dims=True), which #computes the euclidean distance.
"""
training part
"""
loss_fun = triplet_loss ()
optimizer = Adam ( custom_model . parameters (), lr = 0.001 )
for epoch in range ( 30 ):
total_loss = 0
for i , ( anchor , positive , negative ) in enumerate ( custom_loader ):
anchor = anchor [ 'image' ] . to ( device )
positive = positive [ 'image' ] . to ( device )
negative = negative [ 'image' ] . to ( device )
anchor_feature = custom_model ( anchor )
positive_feature = custom_model ( positive )
negative_feature = custom_model ( negative )
optimizer . zero_grad ()
loss = loss_fun ( anchor_feature , positive_feature , negative_feature )
loss . backward ()
optimizer . step ()
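For reference, PyTorch also ships `nn.TripletMarginLoss`, which implements the same hinge form; note that it uses the p-norm distance itself rather than the squared distance used in the class above (the random features below are for illustration only):

```python
import torch
from torch import nn

triplet = nn.TripletMarginLoss(margin=0.2, p=2)
anchor = torch.randn(8, 128)    # batch of anchor embeddings
positive = torch.randn(8, 128)  # embeddings of positive samples
negative = torch.randn(8, 128)  # embeddings of negative samples
print(triplet(anchor, positive, negative))
```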
Reference: https://zhuanlan.zhihu.com/p/414327252?ivk_sa=1024320u
More contrastive losses: https://lilianweng.github.io/posts/2021-05-31-contrastive/