
Showing posts from June, 2017

Lasso (L1 penalty) vs Ridge (L2 penalty)

Ridge and Lasso are forms of regularized linear regression. The regularization can also be interpreted as a prior in a maximum a posteriori (MAP) estimation method. Ridge and Lasso use two different penalty functions: Ridge uses the L2 norm, the sum of the squares of the coefficients, while Lasso uses the L1 norm, the sum of the absolute values of the coefficients. Ridge (L2) regression can't zero coefficients out, so you either end up keeping all the coefficients in the model or none of them, whereas Lasso (L1) performs both parameter shrinkage and variable selection automatically, because it can drive the coefficients of redundant (e.g. collinear) variables to exactly zero. This means it can help select a subset of the given n variables while fitting the regression. We will continue with the difference between the L1 and L2 norms. While practicing machine learning, you may have come upon a choice between L1 and L2. Usually the two decisions are: 1) L1-norm vs L2-norm loss function; and 2) L1-regularization vs L2-regularization.
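To make the shrinkage-vs-selection contrast concrete, here is a minimal sketch, assuming scikit-learn and a synthetic dataset in which only a few features are informative; the dataset sizes and alpha values are illustrative choices, not from the post.

```python
# Minimal sketch: Ridge shrinks coefficients, Lasso shrinks AND zeroes them out.
# Assumes scikit-learn; the synthetic data and alpha=1.0 are illustrative choices.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 20 features, but only 5 actually drive the target.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

# Ridge leaves uninformative coefficients small but nonzero;
# Lasso typically drives many of them to exactly zero (variable selection).
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))
print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))
```

With the same alpha, Lasso will usually zero out most of the uninformative coefficients while Ridge keeps all twenty in the model.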

Recurrent Neural Network vs Recursive Neural Network

A recurrent neural network basically unfolds over time. It is used for sequential inputs where time is the main factor differentiating the elements of the sequence. For example, consider a recurrent neural network used for language modelling, unfolded over time: at each time step, in addition to the input at that step, it also accepts the output of the hidden layer computed at the previous time step. A recursive neural network is more like a hierarchical network: there is really no time aspect to the input, but it has to be processed hierarchically, in a tree fashion. For example, a recursive neural network can learn a parse tree of a sentence by recursively combining the outputs of the operation applied to smaller chunks of the text. Recurrent NNs are in fact recursive neural networks with a particular linear-chain structure, which makes them well suited to handling the linear, time-ordered structure of sequential input.
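As a rough illustration of the structural difference, here is a minimal NumPy sketch; the weight shapes, toy embeddings, and example tree are illustrative assumptions, not taken from the post.

```python
# Minimal sketch: a recurrent net applies the same cell along a chain in time,
# a recursive net applies the same composition function over a tree.
# All shapes and toy inputs below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d = 4                                          # hidden / embedding size
W_x = rng.normal(scale=0.1, size=(d, d))       # input-to-hidden weights
W_h = rng.normal(scale=0.1, size=(d, d))       # hidden-to-hidden (recurrent) weights
W_c = rng.normal(scale=0.1, size=(d, 2 * d))   # composition (recursive) weights

def recurrent(inputs):
    """Unfold over time: h_t = tanh(W_x x_t + W_h h_{t-1})."""
    h = np.zeros(d)
    for x_t in inputs:                         # one step per time step
        h = np.tanh(W_x @ x_t + W_h @ h)
    return h

def recursive(node):
    """Compose over a binary tree: parent = tanh(W_c [left; right])."""
    if isinstance(node, np.ndarray):           # leaf: a word embedding
        return node
    left, right = node
    return np.tanh(W_c @ np.concatenate([recursive(left), recursive(right)]))

words = [rng.normal(size=d) for _ in range(3)]      # toy "sentence" of 3 embeddings
print(recurrent(words))                             # chain: w1 -> w2 -> w3
print(recursive((words[0], (words[1], words[2]))))  # tree: (w1 (w2 w3))
```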

Autoencoders vs Sparse Coding

Sparse Coding
Sparse coding minimizes the objective $L_{sc} = \|WH - X\|_2^2 + \lambda \|H\|_1$, where the first term is the reconstruction term and the second is the sparsity term, $W$ is a matrix of bases, $H$ is a matrix of codes, and $X$ is a matrix of the data we wish to represent. $\lambda$ implements a trade-off between sparsity and reconstruction. Note that if we are given $H$, estimation of $W$ is easy via least squares.

Autoencoders
Autoencoders are a family of unsupervised neural networks. There are quite a lot of them, e.g. deep autoencoders, or those with different regularisation tricks attached, e.g. denoising, contractive, or sparse autoencoders. There even exist probabilistic ones, such as generative stochastic networks or the variational autoencoder. Their most abstract form is $D(d(e(x; \theta_r); \theta_d), x)$, but we will go along with a much simpler one for now: $L_{ae} = \|W \sigma(W^T X) - X\|^2$, where $\sigma$ is a nonlinear function such as the logistic sigmoid $\sigma(x) = \frac{1}{1 + \exp(-x)}$. Difference: 1 W
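To put the two objectives side by side, here is a minimal NumPy sketch that simply evaluates both losses on random toy matrices; the matrix shapes and the value of $\lambda$ are illustrative assumptions, not from the post.

```python
# Minimal sketch of the two objectives above, evaluated on random toy data.
# Shapes and lambda are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_features, n_bases, n_samples = 8, 4, 50
X = rng.normal(size=(n_features, n_samples))   # data matrix
W = rng.normal(size=(n_features, n_bases))     # bases (sparse coding) / weights (autoencoder)
H = rng.normal(size=(n_bases, n_samples))      # codes, optimized explicitly in sparse coding
lam = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Sparse coding: reconstruction term plus an L1 sparsity penalty on the codes H.
L_sc = np.sum((W @ H - X) ** 2) + lam * np.sum(np.abs(H))

# Tied-weight autoencoder: the code is produced by a feedforward pass sigma(W^T X)
# rather than optimized directly, and there is no explicit sparsity term here.
L_ae = np.sum((W @ sigmoid(W.T @ X) - X) ** 2)

print("sparse coding objective:", L_sc)
print("autoencoder objective:  ", L_ae)
```

The key structural point visible in the sketch is that sparse coding treats the codes $H$ as free variables to optimize, while the autoencoder computes its codes with a parametric encoder in a single forward pass.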