With stochastic gradient descent, however, the loss function being optimized is usually non-convex. The choice of initial parameters is therefore quite important for a neural network: it has a direct effect on the performance of gradient-based algorithms.
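A minimal sketch of this sensitivity, using a hypothetical 1-D non-convex "loss" (real network losses are high-dimensional, but the same effect applies): two different initializations lead plain gradient descent to two different local minima.

```python
import numpy as np

# Toy non-convex loss for illustration: f(w) = sin(3w) + 0.5 * w^2
def loss(w):
    return np.sin(3 * w) + 0.5 * w ** 2

def grad(w):
    return 3 * np.cos(3 * w) + w

def gradient_descent(w0, lr=0.01, steps=1000):
    """Run plain gradient descent from initial parameter w0."""
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)
    return w

# Different starting points converge to different local minima,
# so the final loss depends on the initialization.
w_a = gradient_descent(w0=-2.0)
w_b = gradient_descent(w0=2.0)
print(w_a, loss(w_a))  # one local minimum
print(w_b, loss(w_b))  # a different, here better, local minimum
```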
ReLU is differentiable at every point except 0; the left derivative at z = 0 is 0 and the right derivative is 1. This may seem to make g ineligible for use in gradient-based optimization algorithms. In practice, however, gradient descent still performs well enough for these models to be used for machine learning tasks. This is in part because neural network training algorithms do not usually arrive at a local minimum of the cost function anyway (such points are generally never reached). Hence it is acceptable for the minima of the cost function to correspond to points with an undefined gradient.
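In software, this is handled by simply picking one of the one-sided derivatives at z = 0 (commonly the left derivative, 0). A minimal NumPy sketch of this convention:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# At z == 0 we report the left derivative, 0, instead of leaving
# the gradient undefined; most deep learning libraries do the same.
def relu_grad(z):
    return (z > 0).astype(float)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))       # [0. 0. 3.]
print(relu_grad(z))  # [0. 0. 1.]  <- gradient at exactly 0 taken as 0
```

Since an input landing exactly on z = 0 is rare in floating-point arithmetic, this arbitrary choice has essentially no effect on training.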