University of London / MSc Computer Science: Applied machine learning (second half)

July 10, 2023

I am currently taking the Applied machine learning module of the MSc Computer Science at the University of London.

This post records weeks 6 to 12 of the 12-week module. (Week 6 started: May 22, 2023 / Week 12 ended: July 9, 2023)

Week 6: Rule-based algorithms: decision tree and random forest #

Decision trees #

Decision tree #

Decision trees are a nonparametric model that can be used for both classification and regression tasks in machine learning. Nonparametric means the model does not assume a fixed functional form in advance; the structure of the tree is learned from the data, which makes decision trees very flexible. If we build them correctly, their size also does not have to grow as we add more features.

As the name suggests, a tree is composed of a number of branches that connect a series of nodes. At each node, one of the features of our data is evaluated, and the outcome decides which branch to follow. As we pass in our features, the tree evaluates them node by node, which also tells us something about their importance.


One of the main advantages of decision trees is how easy they are to interpret. While other machine learning models are close to what we consider black boxes, decision trees provide a graphical, intuitive way to understand what the algorithm has actually done.

Compared to other machine learning algorithms, decision trees require a lot less data to train.

They’re very flexible in the sense that we can apply them to both classification as well as regression problems.


One of the main disadvantages of decision trees is that they are quite prone to overfitting the training data.

They're also considered what we call weak learners. A single decision tree normally does not make very good predictions in and of itself. Multiple trees are often combined into what we call forests, which gives stronger ensemble models.

Pruning #

Why would we want to prune a decision tree? One of the issues with decision trees is that they tend to overfit the data. By pruning a decision tree, that is, limiting the maximum depth of the tree, we are helping to reduce overfitting and the complexity of the resulting decision tree.

A decision tree will tend to overfit the training data if we allow it to grow to its maximum depth. It overfits when it is trained until it fits every sample in the training data set.

A simple solution is to assign a maximum depth to the tree, which also combats overfitting. More generally, pruning simplifies a decision tree by removing the weakest rules in the tree.

Pruning comes in two types. There's "pre-pruning", which is also called "early stopping": we stop growing the tree before it has finished classifying the training set. "Post-pruning", on the other hand, allows the tree to classify the training set perfectly, and then prunes the tree.
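To make pre-pruning concrete, here is a minimal sketch in plain Python. It is not from the module material: the Gini impurity criterion, the function names, and the toy data are my own assumptions for illustration. It grows one tree to full depth and one tree with a maximum depth of 1 on the same noisy data.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Greedy search for the (feature, threshold) with the lowest weighted impurity."""
    best = None  # (score, feature index, threshold)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def build_tree(rows, labels, max_depth, depth=0):
    """Grow a tree; pre-pruning stops the growth once max_depth is reached."""
    majority = Counter(labels).most_common(1)[0][0]
    split = best_split(rows, labels)
    if depth >= max_depth or gini(labels) == 0.0 or split is None:
        return majority  # leaf: predict the most common label
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left], max_depth, depth + 1),
        "right": build_tree([r for r, _ in right], [y for _, y in right], max_depth, depth + 1),
    }

def predict(tree, row):
    """Follow the branches until a leaf (a plain label) is reached."""
    while isinstance(tree, dict):
        tree = tree["left"] if row[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree

def depth_of(tree):
    if not isinstance(tree, dict):
        return 0
    return 1 + max(depth_of(tree["left"]), depth_of(tree["right"]))

def accuracy(tree, rows, labels):
    return sum(predict(tree, r) == y for r, y in zip(rows, labels)) / len(labels)

# Hypothetical toy data: one feature, with noisy labels around the class boundary.
X = [[1], [2], [3], [4], [5], [6], [7], [8]]
y = [0, 0, 0, 1, 0, 1, 1, 1]

full_tree = build_tree(X, y, max_depth=10)    # effectively unlimited depth
pruned_tree = build_tree(X, y, max_depth=1)   # pre-pruned

print(depth_of(full_tree), accuracy(full_tree, X, y))
print(depth_of(pruned_tree), accuracy(pruned_tree, X, y))
```

The fully grown tree memorises the noisy points, while the depth-limited tree accepts one training error in exchange for a much simpler rule.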

Random forest #

Ensemble learning #

Ensemble learning is an approach in machine learning which tries to obtain better predictive performance by combining the predictions from multiple machine learning models.

  • Ensemble learning methods:
    • Bagging
    • Stacking
    • Boosting


Bagging involves fitting many decision trees on different samples of the same data set and then averaging the resulting predictions.


Stacking fits many different model types on the same data and then learns how to best combine the predictions of each model.


Boosting involves adding ensemble members sequentially, each correcting the predictions made by prior models, and outputs a weighted average of the predictions.

Random forest #


  1. Take the original dataset and create N bagged samples of size n
  2. Train a Decision Tree with each of the N bagged datasets as input
  3. Aggregate the results of the individual decision trees into a single output

Random Forest elements:

  • Fully grown trees
    • Each tree is grown to the largest possible extent; there's no pruning.
  • Bootstrapped data
  • Majority vote rule to make predictions
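The bagging and majority-vote parts of the steps above can be sketched in plain Python. This is a simplified illustration under my own assumptions, not the module's implementation: to keep it short, each "tree" is a one-level tree (a stump) on a single feature rather than a fully grown tree, and there is no random feature sampling. All names and the toy data are hypothetical.

```python
import random
from collections import Counter

def majority(labels):
    """Majority vote: the most common label wins."""
    return Counter(labels).most_common(1)[0][0]

def train_stump(xs, ys):
    """One-level tree on a single feature: pick the threshold with the fewest errors."""
    best = (len(ys) + 1, None, majority(ys), majority(ys))
    for t in sorted(set(xs))[:-1]:  # the largest value would leave the right side empty
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        l_lab, r_lab = majority(left), majority(right)
        errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
        if errors < best[0]:
            best = (errors, t, l_lab, r_lab)
    return best[1:]  # (threshold, left label, right label)

def stump_predict(stump, x):
    t, l_lab, r_lab = stump
    if t is None:
        return l_lab
    return l_lab if x <= t else r_lab

def train_forest(xs, ys, n_trees, rng):
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        # Step 1: a bagged sample of size n, drawn with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        # Step 2: train one tree on that bagged sample.
        forest.append(train_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest

def forest_predict(forest, x):
    # Step 3: aggregate the individual predictions by majority vote.
    return majority([stump_predict(s, x) for s in forest])

# Hypothetical toy data: label is 1 when x >= 5.
xs = list(range(10))
ys = [0 if x < 5 else 1 for x in xs]

forest = train_forest(xs, ys, n_trees=25, rng=random.Random(0))
print(forest_predict(forest, 0), forest_predict(forest, 9))
```

Using an odd number of trees avoids ties in the vote for a two-class problem.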


Random Forest has all the advantages that we discussed in the context of decision trees. It's a robust algorithm that can handle different types of data, and it requires little pre-processing of our data.

Plus, it can combat overfitting.

Another advantage is that no holdout set is required. In machine learning, you typically split data into training and test sets, as we've seen before, in order to evaluate the model's performance on observations it has not seen before. This becomes a problem when you have a small data set, or when the cost and effort of collecting more data is particularly high. With Random Forest, you can use the entire data set to both train and evaluate the model. The bagging process takes care of much of this for you, because each tree is trained on a bootstrap sample and the observations left out of that sample can be used to evaluate it.
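The reason bagging can stand in for a holdout set is that a bootstrap sample leaves some observations out. These are usually called out-of-bag observations, a term that does not appear in my notes above. A small sketch, assuming a hypothetical data set of 1000 rows identified only by their index:

```python
import random

rng = random.Random(0)
n = 1000  # hypothetical number of observations

# One bootstrap sample: n indices drawn with replacement.
in_bag = [rng.randrange(n) for _ in range(n)]

# Observations that were never drawn are "out-of-bag" for this tree
# and can serve as unseen evaluation data for it.
out_of_bag = set(range(n)) - set(in_bag)

oob_fraction = len(out_of_bag) / n
print(round(oob_fraction, 3))
```

For large n the out-of-bag fraction approaches 1/e, roughly 37%, so every tree has a sizeable set of observations it never saw during training.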


The main disadvantage is that you can't easily see how the model makes its decisions. In a sense it's somewhat of a black box. There is little transparency in the model, which is a problem if you have to explain it to a client.

Extra tree #

The name Extra Trees comes from "Extremely Randomised Trees".

Random Forests versus ExtraTrees

What really is the difference between these two algorithms?

Unlike bagging and Random Forests, which develop each decision tree from a bootstrapped sample of the training data set, the ExtraTrees algorithm fits each decision tree on the entire training data set. That's one of the main differences.

Like Random Forests, the ExtraTrees algorithm randomly samples the features at each split point of a decision tree. But unlike Random Forests, which uses a greedy algorithm to select an optimal split point, the ExtraTrees algorithm selects a split point at random. Those are the two main differences.

Random Forest or ExtraTrees?

Which one would you choose?

Random Forests uses bootstrap replicas, also called bootstrap samples. That is, it subsamples the input data with replacement, whereas ExtraTrees uses the whole original sample. Another difference is the selection of the cut points used to split the nodes. Random Forests chooses the optimum split for each candidate feature, while ExtraTrees chooses the cut point at random. However, once the split points are selected, both algorithms choose the best one among the subset of features. This means that ExtraTrees adds randomisation but is still optimised to some degree.

These differences are motivated by the reduction of both bias and variance that we discussed at the beginning of the module. On the one hand, using the whole original sample instead of a bootstrap sample reduces the bias. On the other hand, choosing the split point of each node at random helps to reduce the variance.

In terms of computational cost, the ExtraTrees algorithm is faster. The overall procedure is the same as for Random Forests, but ExtraTrees chooses the split point of each node at random, whereas Random Forests has to calculate the optimal one at each node, which takes more time to compute.
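The two ways of choosing a split can be compared in a short sketch. This is my own simplified illustration with hypothetical function names and toy data. It assumes Gini impurity as the quality measure and that the random cut point is drawn uniformly between the minimum and maximum of the feature.

```python
import random

def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def score(rows, labels, f, t):
    """Weighted impurity after splitting on feature f at threshold t (lower is better)."""
    left = [y for r, y in zip(rows, labels) if r[f] <= t]
    right = [y for r, y in zip(rows, labels) if r[f] > t]
    if not left or not right:
        return float("inf")
    return (len(left) * gini(left) + len(right) * gini(right)) / len(labels)

def random_forest_split(rows, labels, features):
    """Greedy: evaluate every observed value of every candidate feature."""
    candidates = [(f, r[f]) for f in features for r in rows]
    return min(candidates, key=lambda c: score(rows, labels, *c))

def extra_trees_split(rows, labels, features, rng):
    """Draw one random threshold per candidate feature, then keep the best of those."""
    candidates = []
    for f in features:
        values = [r[f] for r in rows]
        candidates.append((f, rng.uniform(min(values), max(values))))
    return min(candidates, key=lambda c: score(rows, labels, *c))

# Hypothetical toy data with two features; feature 0 separates the classes at 3.
rows = [[1, 7], [2, 3], [3, 9], [4, 1], [5, 8], [6, 2]]
labels = [0, 0, 0, 1, 1, 1]

greedy = random_forest_split(rows, labels, features=[0, 1])
randomised = extra_trees_split(rows, labels, features=[0, 1], rng=random.Random(0))
print(greedy, randomised)
```

The greedy search scores twelve candidate splits here, while the randomised version scores only two, which is where the speed advantage comes from.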

If you were to choose one or the other, if speed is a concern, then choose the ExtraTrees algorithm.

Week 7: Regression-based algorithms: logistic regression and neural networks #

Regression-based algorithms #

Linear regression #

Linear regression is a linear model, which means it assumes a linear relationship between the input variable x and the single output variable y. More specifically, the output y can be calculated from a linear combination of the input variables x.

y = a + (b * x)
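As a small illustration of fitting y = a + (b * x), here is a sketch using the ordinary least squares formulas for a single input variable. The function name and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares estimates of intercept a and slope b for y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    a = mean_y - b * mean_x
    return a, b

# Hypothetical data generated from y = 1 + 2 * x.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]

a, b = fit_line(xs, ys)
print(a, b)
print(a + b * 10)  # prediction for a new input x = 10
```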

Preparing the data for linear regression

We need to process our data to ensure it fits into certain assumptions made by our linear regression model.

  • Linear assumption: Linear regression assumes that the relationship between our input data and output is linear.
  • Remove noise
  • Remove collinearity
  • Gaussian distribution
  • Standardise or normalise the input variable
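The last item in the list can be sketched as follows. I am assuming that "standardise" means rescaling to zero mean and unit standard deviation, and that "normalise" means min-max scaling to the range 0 to 1. The function names and data are hypothetical.

```python
import statistics

def standardise(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def normalise(values):
    """Min-max scaling to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical input variable.
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

z = standardise(x)
m = normalise(x)
print(z)
print(m)
```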

Multiple regression

When there is more than one input variable, we call it multiple regression. Each input variable gets its own coefficient.

y = a + (b1 * x) + (b2 * z)

Logistic regression #

Logistic regression predicts whether something is true or false rather than making a continuous prediction in the form of a numeric value. Instead of a line, it fits a sigmoid function.
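A minimal sketch of how a fitted logistic regression turns an input into a true/false prediction. The coefficients a and b below are made-up values for illustration, not fitted ones, and the 0.5 decision threshold is an assumption.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, a, b):
    """Probability that the outcome is true: the linear part a + b * x passed through the sigmoid."""
    return sigmoid(a + b * x)

def classify(x, a, b, threshold=0.5):
    return predict_proba(x, a, b) >= threshold

# Hypothetical coefficients.
a, b = -4.0, 1.0

print(predict_proba(4.0, a, b))
print(classify(6.0, a, b), classify(2.0, a, b))
```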


Neural Network (MLP) #

Neural network model essentials #

An artificial neural network is a series of algorithms that aim at recognizing underlying relationships in a set of data through a process that mimics the way the human brain learns.

Neural networks layers:

  • input layer
  • hidden layer
  • output layer


All neural networks are made up of a very basic component known as the perceptron.

The perceptron is really a linear classifier: it can separate the data into one of two categories.
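To show the perceptron acting as a linear classifier, here is a small sketch that learns the logical AND function. The update rule used is the classic perceptron learning rule, which is my assumption and not something spelled out in the notes above. The names and data are hypothetical.

```python
def predict(weights, bias, x):
    """Step activation: output 1 if the weighted sum is positive, otherwise 0."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if z > 0 else 0

def train_perceptron(data, epochs=20, learning_rate=1):
    weights, bias = [0, 0], 0
    for _ in range(epochs):
        for x, target in data:
            error = target - predict(weights, bias, x)  # -1, 0 or +1
            # Nudge the weights and bias only when the prediction was wrong.
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias

# Logical AND: linearly separable, so a single perceptron can learn it.
and_gate = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]

weights, bias = train_perceptron(and_gate)
predictions = [predict(weights, bias, x) for x, _ in and_gate]
print(weights, bias, predictions)
```

A single perceptron can only draw one straight boundary, which is why hidden layers are needed for problems that are not linearly separable.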

Multi-layer Perceptron

The multilayer perceptron consists of an input layer, an output layer, and one or more hidden layers. This is probably what you would traditionally think of as a neural network.

Cost function and Gradient descent

To train a perceptron, we must know when it has made a mistake, as well as the answer that it should have given. To do this, we use a cost function, whose sole purpose is to compute this error. It quantifies the gap between the prediction of the model and our target variable Y, typically expressed as the difference or distance between the predicted value and the actual value.

The cost function is something we can estimate using an iterative process. We run the model and compare the estimated predictions against the known values of the target variable. We then go back and retrain the model, with the objective of finding a weight and bias combination that minimizes the cost function. Once the cost function is minimized, the network is making good predictions.

In order to do this, we need to optimize our weights. Gradient descent is one of the most popular algorithms for this optimization; it attempts to find a local or global minimum of a function.

Gradient descent (Japanese: 勾配降下法)
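The interplay between the cost function and gradient descent can be sketched on the simplest possible model, a line y = a + b * x. I am assuming mean squared error as the cost function. The learning rate, step count, and data are hypothetical choices.

```python
def cost(a, b, xs, ys):
    """Mean squared error between predictions a + b * x and the targets."""
    return sum((a + b * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    a, b = 0.0, 0.0  # start from an arbitrary point
    n = len(xs)
    for _ in range(steps):
        errors = [a + b * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of the cost with respect to a and b.
        grad_a = 2 * sum(errors) / n
        grad_b = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        # Move a small step against the gradient, i.e. downhill.
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b

# Hypothetical data generated from y = 1 + 2 * x.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]

cost_start = cost(0.0, 0.0, xs, ys)
a, b = gradient_descent(xs, ys)
cost_end = cost(a, b, xs, ys)
print(a, b, cost_start, cost_end)
```

If the learning rate is too large the steps overshoot and the cost grows instead of shrinking, so in practice it is tuned.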

The activation function (Japanese: 活性化関数) #

When you construct a neural network, you need to decide which activation function, or functions, you will use. This depends on the type of machine learning task you are trying to solve.

An activation function converts the signal entering a neuron into an output signal or response.

Activation functions are used to check the Y value produced by a neuron and decide whether the outside connections should consider this neuron as fired or not.

activation functions:

  • Sigmoid function
  • Hyperbolic tangent (TanH)
  • ReLU

In short, sigmoid is more frequently used in the output layer when doing binary classification, and ReLU tends to be used in the hidden layers.
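For reference, the three activation functions listed above can be written in a few lines of plain Python. These are the standard definitions; the sample inputs are arbitrary.

```python
import math

def sigmoid(z):
    """Output in (0, 1); useful as a probability in a binary output layer."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Output in (-1, 1), centred on 0."""
    return math.tanh(z)

def relu(z):
    """Passes positive values through unchanged and clips negatives to 0."""
    return max(0.0, z)

for z in (-2.0, 0.0, 3.0):
    print(z, sigmoid(z), tanh(z), relu(z))
```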

Training neural network models #

  1. Initialise the weights with values close to 0
  2. Send the first observation in the input layer, with one variable per neuron
  3. Forward propagation: neurons are activated dependent on their assigned weights.
  4. Compare the prediction with the expected value and measure the error with the cost function
  5. Back propagation: update the weights according to their responsibility in the error; the learning rate determines how much they are updated

We repeat steps one to five, updating the weights after each observation or after each batch of observations (the latter is called batch learning). When all the data has passed through the neural network in its entirety, we call this an epoch. We repeat for more epochs as needed.
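The five steps and the epoch loop can be sketched with a tiny network in plain Python. This is a teaching sketch under my own assumptions: one hidden layer with two sigmoid neurons, squared error as the cost function, weight updates after every observation, and the logical OR function as hypothetical training data.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

rng = random.Random(0)
n_in, n_hidden = 2, 2

# Step 1: initialise the weights with values close to 0.
w_hidden = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
b_hidden = [0.0] * n_hidden
w_out = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
b_out = 0.0

def forward(x):
    """Step 3: forward propagation through the hidden layer to the output."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(w_hidden, b_hidden)]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)
    return hidden, output

def mean_cost(data):
    """Step 4: squared error between prediction and expected value, averaged."""
    return sum((forward(x)[1] - y) ** 2 for x, y in data) / len(data)

data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]  # logical OR
learning_rate = 0.5

cost_before = mean_cost(data)
for epoch in range(2000):          # one epoch = one full pass over the data
    for x, y in data:              # Step 2: feed one observation at a time
        hidden, output = forward(x)
        error = output - y
        # Step 5: back propagation. Each delta is the share of the error
        # attributed to that neuron's weighted input.
        delta_out = error * output * (1 - output)
        delta_hidden = [delta_out * w_out[j] * hidden[j] * (1 - hidden[j])
                        for j in range(n_hidden)]
        for j in range(n_hidden):
            w_out[j] -= learning_rate * delta_out * hidden[j]
            for i in range(n_in):
                w_hidden[j][i] -= learning_rate * delta_hidden[j] * x[i]
            b_hidden[j] -= learning_rate * delta_hidden[j]
        b_out -= learning_rate * delta_out
cost_after = mean_cost(data)

print(cost_before, cost_after)
```

In practice a library such as TensorFlow computes these gradients automatically; writing them out by hand is only meant to show what each step does.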

Further material #

Week 8: Large-scale machine learning using TensorFlow #

Large-scale machine learning #

Machine learning vs deep learning #

What is deep learning, then? Machine learning is about obtaining some data and training various algorithms on that data. We do this by selecting and extracting a series of features that we think will be useful for the task we have in mind. Deep learning takes a slightly different approach, inspired by the structure of the human brain and built on a structure known as an artificial neural network. Deep learning is a subset of machine learning, which is itself a subset of artificial intelligence. The key difference is that organising a model as a neural network that mimics the human brain has enabled us to tackle more sophisticated pattern recognition problems.

Neural networks are often publicized in the media as being magic or the answer to everything, but they are essentially just another flavor of machine learning that needs a lot more data and parameter estimation to obtain good performance. When a simpler algorithm achieves good performance, its simplicity and transparency are a greater benefit than applying a neural network just for the sake of it.

Deep learning #

Convolutional neural networks (CNN) #

Convolutional neural networks, known as CNNs, are a class of artificial neural network most commonly applied to the analysis of visual imagery.

Generative adversarial network (GAN) #

The main purpose of generative adversarial networks (GANs) is to have a model that generates new samples from our data. They are used especially to generate images.

Recurrent neural networks (RNN) #

A recurrent neural network (RNN) is a special type of artificial neural network adapted to work with time series data, or other data that involves some sequence.

Further material #

Week 9: Real-life case studies: financial forecasting #


Week 10: Real-life case studies: computer vision #


Week 11, 12 #


Notes: useful sites for studying machine learning #