Support Vector Machines vs. Deep Neural Networks – from a Math POV

While some geeks like to compare SVMs with DNNs in terms of use cases and the tasks each is best suited for, people rarely differentiate the two from a mathematical point of view. This post documents the comparable aspects of these two classes of algorithms that I have learned over the past 12 months, kept here for future reference.

Linear Algebra and Statistics are two fundamental branches of modern Mathematics. Both are critical to the success of Machine Learning algorithms in recent years, even though some Mathematicians tend to dismiss Statistics as “not real math” (the same group may also consider Linear Algebra and Differential Equations “not real math”).

The objective of SVMs is to find the optimal hyperplane that separates data points of different classes. Linear algebra operations are used to find the hyperplane with the maximum margin. Data points are represented as vectors, on which operations such as addition, subtraction, and scalar multiplication can be performed. More advanced operations required in SVMs include:

  • Dot Products: Computing distances and similarities between data points in a high-dimensional feature space.  
  • Linear Transformations: Kernel functions implicitly map data into a higher-dimensional space; although the mapping can be non-linear, the underlying computation reduces to inner products in that space (the kernel trick), so the transformation never has to be performed explicitly.  
  • Matrix Operations: Solving the optimisation problem, which can be formulated using matrices (e.g., finding the Lagrange multipliers).
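The kernel trick can be illustrated with a small NumPy sketch (the points and the degree-2 polynomial kernel here are illustrative assumptions, not from the post): evaluating (x·z + 1)² gives the same value as an explicit dot product in a 6-dimensional feature space, without ever constructing that space.

```python
import numpy as np

# Two illustrative 2-D data points.
x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])

# Degree-2 polynomial kernel: k(x, z) = (x . z + 1)^2
k = (x @ z + 1.0) ** 2

# Explicit feature map phi for the same kernel (2-D input -> 6-D space):
# phi(v) = [1, sqrt(2)v1, sqrt(2)v2, v1^2, v2^2, sqrt(2)v1v2]
def phi(v):
    return np.array([1.0,
                     np.sqrt(2) * v[0], np.sqrt(2) * v[1],
                     v[0] ** 2, v[1] ** 2,
                     np.sqrt(2) * v[0] * v[1]])

# The kernel value equals the dot product in the higher-dimensional space,
# computed without ever forming the 6-D vectors.
same = np.isclose(k, phi(x) @ phi(z))
```

This is exactly why SVMs can draw non-linear boundaries while the optimisation itself stays a linear-algebra problem over inner products.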

Statistical learning theory provides the theoretical framework for SVMs, relating training error to generalisation error. The Vapnik–Chervonenkis (VC) dimension, for instance, is used to derive a probabilistic upper bound on the test error of a classification model. If the test error is much higher than the training error, the trained model is considered overfitted. Margin maximisation is the statistically motivated technique SVMs use to improve generalisation and prevent overfitting: a larger margin is widely believed to lead to better performance on data unseen by the trained model in production.

  • Regularisation: The “C” parameter in SVMs controls the trade-off between maximising the margin and minimising the training error, which in turn helps prevent overfitting. 
  • Probabilistic Output: Although the default SVM output is a class label, fitting a separate statistical model (e.g. Platt scaling) to the SVM’s decision-function scores can produce probabilistic output.
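Both points can be seen in a short scikit-learn sketch (the library choice and the synthetic dataset are my assumptions, not prescribed by the post): `C` sets the margin/error trade-off, and `probability=True` fits Platt scaling on the decision scores so the classifier can emit class probabilities.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic two-class data, purely for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# C trades margin width against training error; probability=True fits a
# logistic model (Platt scaling) on the decision-function scores.
clf = SVC(C=1.0, kernel="rbf", probability=True, random_state=0)
clf.fit(X, y)

proba = clf.predict_proba(X[:1])  # one row per sample; rows sum to 1
```

Raising `C` penalises training errors more heavily (narrower margin, higher overfitting risk); lowering it favours a wider margin at the cost of more margin violations.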

Each layer of a DNN is essentially built upon linear algebra operations. Input data (vectors or tensors) is multiplied by a weight matrix at every layer, a bias vector is added to the result, and a non-linear activation function is applied (without the non-linearity, stacked layers would collapse into a single linear map).

  • Tensor Operations: Operations on multi-dimensional arrays are critical to more complex architectures such as convolutional or recurrent neural nets.  
  • Gradient Calculation: The famous backpropagation algorithm relies heavily on the chain rule of calculus, applied to matrix operations, to calculate gradients of the loss function with respect to the weights.
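A minimal sketch of one dense layer's forward pass and its chain-rule gradients (the shapes, random data, and squared-error loss are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# One dense layer: y = W x + b, with small illustrative shapes.
x = rng.standard_normal(3)        # input vector
W = rng.standard_normal((2, 3))   # weight matrix
b = rng.standard_normal(2)        # bias vector
t = rng.standard_normal(2)        # target vector

y = W @ x + b                      # forward pass: matrix-vector product
loss = 0.5 * np.sum((y - t) ** 2)  # squared-error loss

# Backpropagation via the chain rule:
#   dL/dy = y - t
#   dL/dW = dL/dy * dy/dW = outer(dL/dy, x)
#   dL/db = dL/dy
dy = y - t
dW = np.outer(dy, x)
db = dy
```

Stacking more layers just repeats this pattern, with each layer passing `dL/dx = W.T @ dy` back to the layer below it.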

DNNs are highly complex non-linear statistical models built to learn intricate patterns from large amounts of data. Minimising a loss function (statistical in nature, e.g. cross-entropy or MSE) is a vital part of training DNNs; the loss quantifies the difference between predictions and the true labels.

  • Optimisation Algorithms: Stochastic Gradient Descent (SGD) and its variants are used extensively to train DNNs on very large datasets. By estimating the gradient from small batches rather than the full dataset, SGD provides a computationally cheaper way to iteratively update the weights using the gradients calculated during backpropagation. 
  • Regularisation Techniques: Statistical overfitting is a significant problem for DNNs as well. Dropout, weight decay (L1 and L2 regularisation), and batch normalisation are common techniques to mitigate it.
  • Probabilistic Interpretation: The output of the final layer of a DNN, when performing classification tasks, is often passed through a softmax function to provide a probability distribution over the classes. 
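These points can be tied together in a short sketch (the logits, one-hot label, and learning rate are illustrative assumptions): softmax turns the final-layer logits into a probability distribution, cross-entropy scores it against the true label, and one SGD step moves the logits along the well-known gradient `probs - target`.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # shift by the max for numerical stability
    return e / e.sum()

# Illustrative final-layer logits and a one-hot true label.
logits = np.array([2.0, 1.0, 0.1])
target = np.array([1.0, 0.0, 0.0])

probs = softmax(logits)                  # probability distribution over classes
loss = -np.sum(target * np.log(probs))   # cross-entropy loss

# One SGD step on the logits; for softmax + cross-entropy the gradient
# with respect to the logits is simply (probs - target).
lr = 0.1
logits_updated = logits - lr * (probs - target)
```

After the step, the probability assigned to the true class rises and the cross-entropy loss falls, which is exactly the behaviour SGD repeats batch after batch during training.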

Key Differences in the Application of Linear Algebra and Statistics:

| Feature | Support Vector Machines (SVMs) | Deep Neural Networks (DNNs) |
|---|---|---|
| Linear Algebra | Central to finding the optimal separating hyperplane and the kernel trick. | Fundamental building block at each layer (matrix multiplications, etc.); crucial for backpropagation. |
| Focus | Finding a single optimal linear boundary (potentially in a higher-dimensional space). | Learning hierarchical representations through multiple layers of linear transformations. |
| Statistics | Rooted in statistical learning theory: margin maximisation, regularisation. | Statistical modelling: loss-function minimisation, optimisation algorithms, regularisation. |
| Interpretability | Can be more interpretable (especially linear SVMs) via the separating hyperplane and support vectors. | Often considered “black boxes” with limited interpretability of individual weights. |
| Complexity | Generally less complex in terms of the number of parameters. | Can have millions or even billions of parameters. |
| Data Scale | Performs well on small to medium-sized datasets. | Excels on large datasets, where it can learn complex patterns. |

In a nutshell, while both SVMs and DNNs leverage linear algebra for their core computations, DNNs use linear algebra operations more extensively, as building blocks of their layered, hierarchical architecture.

Both SVMs and DNNs are grounded in statistical principles, but SVMs have stronger ties to statistical learning theory and its generalisation guarantees, while DNNs rely more on empirical risk minimisation and various regularisation techniques to manage model behaviour.