University of London / MSc Computer Science: Applied machine learning (Part 1)

May 23, 2023

I am currently taking the MSc Computer Science: Applied machine learning module at the University of London.

These are my personal study notes on the lecture content.

This post covers Weeks 1-5 of the 12-week module. (Week 1 started on 17 April 2023; Week 5 ended on 21 May 2023.)

Module overview #

Content

Machine learning is an important topic in both academia and industry these days. There has been growing interest in the practical side of machine learning and so this module focuses more on the practical techniques and methods with Python and Scikit-Learn than on the theories or statistics behind these methods. You will cover just the right amount of theory to enable you to implement these models backed-up with a good level of understanding of the core principles of machine learning.

Aims

The primary aims of this module are to:

  • gain hands-on and practical skills for machine learning based analytics tasks
  • use appropriate Python libraries and tools to analyse data
  • develop the design and programming skills that will help you to build intelligent artifacts
  • assess the performance of machine learning models
  • develop a deeper understanding of several real-life topics in applied machine learning
  • develop the practical skills necessary to pursue research in applied machine learning.

Weeks

Lectures run through Week 10; Weeks 11 and 12 are set aside for working on the final assessment.

  • Week 1: Introduction to Applied Machine Learning
  • Week 2: Preparing data
  • Week 3: Feature selection and extraction
  • Week 4: Data sampling
  • Week 5: Feature and model evaluation
  • Week 6: Rule-based algorithms: decision tree and random forest
  • Week 7: Regression-based algorithms: logistic regression and neural networks
  • Week 8: Large-scale machine learning using TensorFlow
  • Week 9: Real-life case studies: financial forecasting
  • Week 10: Real-life case studies: computer vision
  • Week 11: Preparation for final assessment
  • Week 12: Preparation for final assessment

Reading list #

Essential reading

  • Géron, A. “Hands-on machine learning with Scikit-Learn, Keras & Tensorflow”. (O’Reilly Media, Inc, 2019) 2nd edition.
  • Lage Dyndal, G., T. Arne Berntsen and S. Redse-Johansen “NATO review – autonomous military drones: no longer science fiction”, 28 July 2017.
  • Anirudh, V.K. “Top 10 Python libraries for machine learning”, Toolbox, February 2022.
  • Halevy, A., P. Norvig and F. Pereira “The unreasonable effectiveness of data”, IEEE Intelligent Systems 24(2) 2009, pp.8–12.
  • Zheng, A. and A. Casari “Feature engineering for machine learning. Principles and techniques for data scientists”. (O’Reilly Media, Inc, 2018).
  • Bruce, P., A. Bruce and P. Gedeck “Practical statistics for data scientists”. (O’Reilly Media Inc, 2017).
  • “Introduction to Resampling methods” – GeeksforGeeks, 2021.
  • Chong, T.-W. and B.-G. Lee “American Sign Language recognition using leap motion controller with machine learning approach”, Sensors 18(10) 2018.
  • Goodfellow, I., Y. Bengio and A. Courville “Deep learning”. (MIT Press, 2016) Chapter 11: Practical methodology.
  • Blumenfeld, J. “SpaceML: Rise of the Machine (Learning)”, May 6 2021.

Further reading

  • Grus, J. “Data science from scratch: first principles with Python”. (O’Reilly Media, Inc, 2019).
  • Müller, A.C. and S. Guido “Introduction to machine learning with Python: a guide for data scientists”. (O’Reilly Media, Inc, 2016).
  • Bondade, N. “The new AI toilets will scan your poop to diagnose your ailments”.
  • “The Adidas Speedfactory: Reinventing an Industry”, The Future Factory, 2018.
  • Porter, J. “Adidas to end robotic shoe production in Germany and the US: Automated factories were seen as an alternative to overseas labor”, The Verge, 2019.
  • Roh, Y., G. Heo and S.E. Whang “A survey on data collection for machine learning: a big data – AI integration perspective”, IEEE Transactions on Knowledge and Data Engineering 33(4) 2021, pp.1328–47.
  • AFP, “Facebook researchers use math for better translations”, 2019.
  • Chen, D. “Statistical Learning (II): Data Sampling and Resampling, Towards Data Science”, 2020.
  • Huang, L., X. Nguyen, M. Garofalakis et al. “In-Network PCA and anomaly detection”, Advances in Neural Information Processing Systems 2007.
  • Demircioğlu, A. “Measuring the bias of incorrect application of feature selection when using cross-validation in radiomics”, Insights Imaging 12(172) 2021.
  • Shaikh, R. “Feature selection techniques in machine learning with Python”, Towards Data Science, 2018.
  • Edmundson, A. “The Rise (and Lessons Learned) of ML Models to Personalize Content on Home (Part II)”, 2021.
  • Edmundson, A. “The Rise (and Lessons Learned) of ML Models to Personalize Content on Home (Part I)”, 2021.
  • Sezer, O., U. Gudelek and M. Ozbayoglu “Financial time series forecasting with deep learning: A systematic literature review: 2005–2019”, Applied Soft Computing 2020.
  • Chui, M., M. Harrysson, J. Manyika, R. Roberts, R. Chung, P. Nel and A. van Heteren “Applying AI for social good”, McKinsey, 28 November 2018.
  • Hao, K. AI pioneer Geoff Hinton: “Deep learning is going to be able to do everything”, 3 November 2020.
  • Srivastava, N., G. Hinton, A. Krizhevsky, I. Sutskever and R. Salakhutdinov “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, Journal of Machine Learning Research 15 2014, pp.1929–58.
  • Van Veen, F. “The Neural Zoo”, Asimov Institute, 14 September 2016.
  • Luengo-Oroz, M., K. Hoffmann Pham, J. Bullock et al. “Artificial intelligence cooperation to support the global response to COVID-19”, Nature Machine Intelligence 2 2020, pp.295–97.
  • Tomašev, N., J. Cornebise, F. Hutter et al. “AI for social good: unlocking the opportunity for positive impact”, Nature Communications 11 2020, 2468.
  • Brownlee, J. “Ensemble learning methods for deep learning neural networks”, (2019).
  • Ultraleap, “Why the best hand tracking camera is infrared” (2021).

Week 1: Introduction to Applied Machine Learning #

Welcome to applied machine learning #

Industry 4.0 #

  • Industry 1.0 (1784)
    • Mechanism and steam, weaving loom
  • Industry 2.0 (1870)
    • Assembly line, electrical energy
  • Industry 3.0 (1969)
    • Automation, computers and electronics
  • Industry 4.0 (Today)
    • Cyber Physical Systems, Internet of Things(IoT), Networks

Example

  • Car
    • Industry 2.0: Ford’s conveyor belt
    • Industry 4.0: Tesla’s Fremont factory
  • Shoes
    • Industry 2.0: Nike traditional shoes factory
    • Industry 4.0: Adidas’ Speedfactory

ML experts you need to know #

  • Geoffrey Hinton: Backpropagation (1980s)
  • Michael I Jordan: RNN (1980s)
  • Yann LeCun: CNN with backpropagation
  • Yoshua Bengio: tons of theoretical work (1996)
  • Jürgen Schmidhuber: LSTM (1992)
  • Andrew Ng: Reinforcement Learning
  • Vladimir Vapnik: SVM (1963)
  • Ian Goodfellow: GANs (2014)

In recognition of their work on deep neural networks, the 2018 Turing Award (often described as the Nobel Prize of computer science) was awarded to Geoffrey Hinton, Yann LeCun and Yoshua Bengio. The three are sometimes called the godfathers of AI, or the godfathers of deep learning.

Predictive modelling and the analytic workflow #

Predictive modelling

What is predictive modelling? In short, predictive modelling is a statistical technique used in machine learning and data mining to predict or forecast future outcomes using historical or existing data.

https://en.wikipedia.org/wiki/Predictive_modelling

In other words, it is the use of a model built from current and historical data to make predictions about the future.

There are several stages in the predictive modelling process.

  1. Define problem
    • Investigate and characterise the problem in order to better understand the goals of the project.
  2. Analyse data
    • Use descriptive statistics and visualisation to better understand the data you have available.
  3. Prepare data
    • Use data transforms in order to better expose the structure of the prediction problem to modelling algorithms.
  4. Evaluate algorithms
    • Design a test harness to evaluate a number of standard algorithms on the data and select the top few to investigate further.
  5. Present results
    • Finalise the model, make predictions and present results.
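
As a rough illustration of how these stages map onto code, here is a minimal scikit-learn sketch of my own (not from the module materials); the file name data.csv and the target column are placeholders, and the features are assumed to be numeric.

```python
# A minimal sketch of the predictive modelling stages with scikit-learn.
# Assumes a CSV of numeric features plus a "target" column (placeholder names).
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("data.csv")           # 1. Define problem / load the data
print(df.describe())                    # 2. Analyse data with descriptive statistics

X, y = df.drop(columns="target"), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

candidates = {                          # 3. Prepare data (scaling) + 4. Evaluate algorithms
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "forest": RandomForestClassifier(random_state=42),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(name, scores.mean())

best = candidates["forest"].fit(X_train, y_train)   # 5. Finalise and present results
print("test accuracy:", best.score(X_test, y_test))
```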

The analytic workflow

Part of predictive modelling is what is known as the analytic workflow. The important point is to handle the data correctly at every step, from collecting it through to getting it into the form in which it will be used.

  1. Define analytic objective
  2. Select cases
  3. Extract input cases
  4. Validate input data
  5. Repair input data
  6. Transform input data
  7. Apply analysis

When we think about this pipeline of work, if we fail to do any of these steps appropriately, then we can end up with bad data. Bad data leads to bad results; in other words, “garbage in, garbage out”.

The Machine Learning Toolbox #

Python libraries used for ML #

Machine Learning tools

Main Tools

  • NumPy
  • SciPy
  • Scikit-learn
  • TensorFlow
  • Keras

Complementary Tools

  • Jupyter notebook
  • Pandas
  • Matplotlib
  • PIL

See also: https://www.spiceworks.com/tech/artificial-intelligence/articles/top-python-machine-learning-libraries/

Finding data for ML applications #

Where do we get the data from?

Further videos material #

Week 2: Preparing data #

Data preparation #

Predictive modelling #

Predictive modelling: a concise representation of the input and target association

Predictive modelling is really the process of using known results to create, process and validate a model that we can then use to make future predictions.

The Bias-variance Trade-off

Bias and variance are errors that can occur in machine learning algorithms. When an algorithm has limited flexibility to deduce the correct observations from the data set, the result is what we call bias. On the other hand, variance occurs when the model is extremely sensitive to small fluctuations. If we add more features while building a model, we add more complexity: bias decreases, but we gain some variance. In order to construct a good model, we have to strike a trade-off between bias and variance, based on the needs of our objective task.

Bias = the gap between predictions and the true values / Variance = the spread (variability) of the predictions

In machine learning, bias is the deviation of a model's predictions from the true (correct) values, and variance is the spread of those predictions. If the model's predictions show too much bias, the model is failing to capture the relationship between inputs and outputs accurately and is underfitting. If the predictions show too much variance, the model has learned even the noise in the training data and is overfitting.

Because reducing one tends to increase the other, the relationship between bias and variance is a trade-off, and so it is known as the bias-variance trade-off.

https://atmarkit.itmedia.co.jp/ait/articles/2009/09/news025.html
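
As a hedged illustration of the trade-off (my own sketch, not from the lecture), fitting polynomials of increasing degree to noisy synthetic data shows a low-degree model underfitting (high bias) and a very high-degree model overfitting (high variance):

```python
# Illustrating the bias-variance trade-off with polynomial regression on synthetic data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=100)   # noisy nonlinear target

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    train_score = model.fit(X, y).score(X, y)              # fit on the training data
    cv_score = cross_val_score(model, X, y, cv=5).mean()   # rough generalisation estimate
    # degree 1: high bias (underfits); degree 15: high variance (overfits)
    print(f"degree={degree:2d}  train R^2={train_score:.2f}  cv R^2={cv_score:.2f}")
```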

Prediction types #

Three prediction types

  1. Decision
  2. Rankings
  3. Estimates

Example:

  • Decision
    • Whether someone will develop cancer, true or false, based on data about their age, weight and height.
    • Predict the type of weather with labels defining raining, cloudy, sunny.
    • Whether an email is spam or ham (ham being a legitimate email).
  • Rankings
    • Recommending a ranked list of videos or movies to a user, for example on YouTube or Netflix.
  • Estimates
    • A chatbot that generates a list of its top three responses together with how confident it is that each one is the correct response; a speech recognition engine works in the same way.
    • A chess engine or a self-driving car that chooses the best move or action from a list of scored candidates.

Machine learning models

Main categories of machine learning models.

  1. Supervised learning
  2. Unsupervised learning
  3. Semi-supervised learning
  4. Reinforcement learning

The unreasonable effectiveness of data #

Problems that involve interacting with humans, such as natural language understanding, have not proven to be solvable by concise, neat formulas like F = ma. Instead, the best approach appears to be to embrace the complexity of the domain and address it by harnessing the power of data: if other humans engage in the tasks and generate large amounts of unlabeled, noisy data, new algorithms can be used to build high-quality models from the data.

Data quality #

Pre-processing and missing values #

Measures for data quality

How do we judge whether the data is in a good state and of the right quality?

  • Accuracy: correct or wrong, accurate or not.
  • Completeness: not recorded, unavailable.
  • Consistency: some modified but some not, dangling.
  • Timeliness: timely update?
  • Believability: how likely is the data to be correct?
  • Interpretability: how easy is it to understand the data?

Data cleaning

Data in the real world is dirty. Example:

  • Incomplete:
    • Occupation="" (missing data)
  • Noisy:
    • Salary="-10" (an error)
  • Inconsistent:
    • Age="42" but Birthday="03/07/2010"
    • Was rating "1,2,3", now rating "A,B,C"

Missing value

Missing data may be due to:

  • equipment malfunction
  • inconsistent with other recorded data and deleted
  • data not entered
  • certain data may not be considered important at time of entry.

Missing values remedies

  • Fill in missing values manually
  • Fill in automatically - a global constant (e.g. “unknown”)
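
A minimal sketch of the automatic fill-in remedies, assuming a toy pandas DataFrame; the column names are purely illustrative:

```python
# Filling in missing values automatically with pandas and scikit-learn.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age":        [42, np.nan, 35, 29],
    "occupation": ["engineer", "", "teacher", None],
})

# Global constant for a categorical column.
df["occupation"] = df["occupation"].replace("", np.nan).fillna("unknown")

# Statistical fill for a numeric column ("median" or "most_frequent" also work).
imputer = SimpleImputer(strategy="mean")
df[["age"]] = imputer.fit_transform(df[["age"]])
print(df)
```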

The curse of dimensionality #

https://en.wikipedia.org/wiki/Curse_of_dimensionality

Adding features increases the dimensions of our data.

An increase in dimensions can add more information to the data, thereby improving the quality of the data we then use as our features. However, it can also have the opposite effect: an increase in noise and redundancy, because some of these features have very little influence on the final analysis.

Some values of our features may also be missing, which causes another issue, known as data sparsity.

Why is this a curse? As the dimensionality of our data increases, the amount of data required for any machine learning algorithm to perform well increases exponentially. The reason is that we need enough data points for every combination of feature values before a machine learning model can reliably recognise any kind of pattern.

Solutions to the curse of dimensionality

  • Manually pick a useful subset of features from all given features.
  • PCA (principal components analysis: 主成分分析) can help in reduction of number of features.

Data redundancy #

https://en.wikipedia.org/wiki/Data_redundancy

Redundancy means having multiple copies of the same data.

Which dataset of dogs would you pick?

  • 100 pictures of one dog taken over two weeks.
  • 100 pictures of different dogs taken over a year.

Week 3: Feature selection and extraction #

Feature selection and extraction #

Introduction to feature selection and extraction #

Feature selection and extraction is a stage you will find yourself spending a lot of time on, and coming back to, as you train and test machine learning algorithms. The aim is to compile a set of descriptive features that give the machine learning algorithm all the information it needs to perform well at the task we have designated. What is feature selection? In more detail, feature selection is the process of selecting a small subset of relevant attributes from our data for use in constructing a machine learning model.

Feature selection techniques are used for several reasons:

  • Simplification of models and interpretability
  • Shorter training times versus better accuracy (There is a fine balance between accuracy and optimizing the training phase)
  • To avoid the curse of dimensionality
  • Enhanced generalisation by reducing overfitting (reduction of variance)

Feature selection techniques #

Art is the elimination of the unnecessary. - Pablo Picasso

Feature selection methods are typically presented in three categories based on how they combine the selection algorithm and the model building phase.

The feature selection methods come in three types:

Filter: get the relevance of the features through univariate statistics

Evaluate features using statistical measures.

Here, the relevance of features is determined using univariate statistics. Univariate means a single variable, or a single type of data, with which we can do several things: measure the mean, median and mode, analyse the range of values, apply the standard deviation, and so on. Filter feature selection methods apply one of these statistical measures to score each feature in the dataset. The features are then ranked by score, and features with high or low scores can be selected or removed from the dataset. These methods are often univariate and consider each feature independently, or in relation to the dependent variable.

  • Percent missing values
    • One example of a filter method: look at a column of the data and use the number of records with missing values divided by the total number of records as a metric. For binary classification we can also create a binary indicator, such as true or false, to represent missing versus non-missing values. We can then review or visualise the variables with a high percentage of missing values and select the features whose percentage of missing values is low.
  • Amount of variation
    • Another approach is the amount of variation: review and remove all variables whose values show very little variation.
  • Pairwise correlations
    • There are also pairwise correlations. Many variables are often correlated with each other and are therefore considered redundant. If two variables are highly correlated, keeping only one of them lets us reduce the dimensionality without losing much information. Which one do we keep? The one that is more highly correlated with the target variable.
  • X2 (chi-squared) correlation test
    • There is also the chi-squared correlation test, a metric found in many statistical packages. The larger the chi-squared value, the more likely it is that the variables are related. For example, we might find that the number of hospitals and the number of car thefts in a city are correlated, but only because the two happen to be indirectly related. (A short scikit-learn sketch of these filter-style checks follows this list.)
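
A short sketch of these filter-style checks using scikit-learn, assuming the Iris dataset as a stand-in (my own example, not from the module):

```python
# Filter-style feature selection: score features with univariate statistics, keep the best.
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold, SelectKBest, chi2

X, y = load_iris(return_X_y=True, as_frame=True)

# Percent missing values per column (kept as a reviewable score).
print(X.isna().mean())

# Amount of variation: drop near-constant features.
X_var = VarianceThreshold(threshold=0.1).fit_transform(X)

# Pairwise correlations between features (highly correlated pairs are redundant).
print(X.corr().round(2))

# Chi-squared test: rank features against the target and keep the top two.
X_best = SelectKBest(score_func=chi2, k=2).fit_transform(X, y)
print(X_var.shape, X_best.shape)
```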

Wrapper: Use a classifier to measure feature importance

Evaluate features using a machine learning model.

Wrapper methods, as the name suggests, wrap a machine learning model: the model is fitted and evaluated on different subsets of the input features, and the subset that gives the best model performance is selected. Wrapper methods essentially treat feature selection as a search problem, in which different combinations of features are constructed, evaluated and compared with other combinations. A predictive model is used to evaluate each combination of features and assign it a score based on model accuracy. Examples include techniques we have already touched on, such as principal component analysis, cluster analysis and correlation with the target. To summarise, wrappers measure the usefulness of features based on the performance of a classifier. The problem with these methods, however, is that they are computationally expensive, because the training step has to be repeated many times. PCA is a dimensionality reduction technique that focuses on the variance in the data. Correlation with the target is another feature selection technique, in which we remove features with a very low correlation with the target variable, as mentioned earlier: variables that correlate very weakly with the target do not help the model's predictions, so they can be dropped from the candidate features.
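
As a concrete wrapper-style example (my own choice; the lecture does not name it), scikit-learn's recursive feature elimination (RFE) repeatedly fits a classifier and drops the weakest features:

```python
# Wrapper-style selection: use a classifier to search for a good feature subset.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
estimator = LogisticRegression(max_iter=5000)

# Repeatedly fit the model and remove the weakest feature until 10 remain.
selector = RFE(estimator, n_features_to_select=10, step=1).fit(X, y)
print(selector.support_)   # boolean mask of selected features
print(selector.ranking_)   # 1 = selected; higher numbers were eliminated earlier
```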

Embedded: a model is used during learning

Evaluate features while the model is being trained.

The third type of feature selection method is a group of techniques called embedded methods. Embedded feature selection methods are machine learning algorithms in their own right, and they return a model that uses only a limited number of features. The better methods learn which features contribute most to the model's accuracy while the model is actually being created. The most common type of embedded feature selection method is regularisation, which introduces additional constraints into the optimisation of a particular algorithm, essentially biasing the model towards lower complexity. Many algorithms can be turned into embedded feature selection methods with the following approach: first, choose a metric that indicates how well the model performs; then examine how the value of this metric changes when a given feature is removed and the algorithm is run again.
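
A small sketch of an embedded method, assuming L1 (lasso-style) regularisation as the example; scikit-learn's SelectFromModel keeps the features whose coefficients survive training:

```python
# Embedded selection: an L1-regularised model zeroes out weak features while it
# is being trained, and SelectFromModel keeps the surviving ones.
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

lasso_like = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
selector = SelectFromModel(lasso_like).fit(X, y)

print("kept features:", selector.get_support().sum(), "of", X.shape[1])
X_reduced = selector.transform(X)
```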

Dimensionality reduction #

Introduction to dimensionality reduction #

The number of input variables or features for a dataset is referred to as its dimensionality.

Too many input features can result in the curse of dimensionality.

Dimensionality reduction refers to techniques that reduce the number of input variables in a data set. Each time we select a feature to use in our analysis, we increase the dimensionality of the data. As more features are added, the data becomes very sparse, and the analysis suffers from the curse of dimensionality.

Fewer inputs = simple models

Essentially, fewer inputs means a far simpler model. Fewer input dimensions often mean correspondingly fewer parameters, or a simpler structure, in a machine learning model; these are referred to as the degrees of freedom. A model with too many degrees of freedom is likely to overfit the training data set and therefore may not perform well when it sees new data at prediction time. Simpler models generally generalise better, so in machine learning it is desirable to have simpler models that generalise well; in turn, data with fewer input variables is much easier to manage.

Dimensionality reduction

We can divide dimension reduction techniques into two types.

  • feature selection
    • The first group of methods selects a subset of our features from the full set of features; this is called feature selection.
  • feature extraction
    • The second group of methods extracts new features by combining the existing features we already have; this is called feature extraction.

Principal component analysis(PCA) #

is used for dimensionality reduction in machine learning.

  • PCA assumptions
    • Sample size minimum
    • Correlated features
    • Linearity
    • No extreme outliers
    • Low variance axes are noisy and discarded
  • PCA limitations
    • Model performance
    • Classification accuracy
    • Interpretability

PCA limitations

In terms of model performance, PCA can reduce model performance on datasets with little or no feature correlation, or where the assumption of linearity is not met. The information that distinguishes one class from another may lie in the low-variance components, which may in fact be discarded when we reduce the dimensionality of the feature set. As for interpretability, since each principal component is essentially a combination of the original features, we cannot explain the individual importance of each feature.
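
A minimal PCA sketch with scikit-learn, assuming the breast cancer dataset as a stand-in; it keeps enough components to explain 95% of the variance:

```python
# PCA sketch: standardise the features, then keep 95% of the explained variance.
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)    # PCA is sensitive to feature scales

pca = PCA(n_components=0.95)                    # keep 95% of the explained variance
X_reduced = pca.fit_transform(X_scaled)

print("original dimensions:", X.shape[1])
print("reduced dimensions:", X_reduced.shape[1])
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))
```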

Cluster analysis(or Clustering) #

is typically used to classify data into structures that are more easily understandable.

Its aim is to find groups of correlated features.

  • Clustering approaches:
    • K-Means clustering
    • Hierarchical clustering

K-Means

  1. Decide the number of clusters (k)
  2. Select k random points from the data as centroids
  3. Assign all the points to the nearest cluster centroid
  4. Calculate the centroid of the new clusters
  5. Repeat steps 3 and 4

Centroid = the centre (mean point) of a cluster
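
A short K-Means sketch following the steps above, assuming synthetic blob data (my own example):

```python
# K-Means with scikit-learn on synthetic blobs.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# n_clusters = k (step 1); the random centroid choice is handled internally (step 2);
# fit() iterates the assign/recompute loop until the centroids stop moving (steps 3-5).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print(kmeans.cluster_centers_)   # final centroids
print(kmeans.labels_[:10])       # cluster assignment of the first ten points
```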

Problems with K-means for clustering

Hierarchical clustering

This algorithm starts with all the data points assigned to a cluster of their own. Then essentially, the two nearest clusters are merged into the same cluster. In the end, this algorithm terminates when there's only a single cluster left at the top.

When to use feature selection and dimension reduction #

Often, feature selection and dimensionality reduction are grouped together. Both methods are used for reducing the number of features in a dataset, but there are, in fact, some important differences.

Which one do you choose? Which one is best: feature selection or dimensionality reduction? Say a dataset consists of N points of data with T columns or features. PCA essentially aims at compressing the T features, whereas clustering aims at compressing the N data points. Cluster analysis and PCA are therefore quite different. Despite this difference, there is a connection between them: cluster analysis works better with a set of input variables that are uncorrelated, and principal component variables are uncorrelated by construction. In terms of which to use, it really depends on your use case. You might start with feature selection techniques, as they are quite simple, and then, if your data is complex or noisy, consider a dimensionality reduction technique. This assumes, obviously, that you have labelled data. If you have unlabelled data, you can apply cluster analysis instead, using either k-means or hierarchical clustering.

Week 4: Data sampling #

Data sampling #

Random sampling and sample bias #

Now, as you know, data is at the core of any machine learning application, and what we really aim for is a model with low variance, where variance describes the variability in the model's predictions.

What is a sample?

A sample is a subset of data from a much larger data set known as the population.(population = 母集団)

What do we mean when we say sample is biased?

Statistical bias refers to sampling errors that are produced by the measurement or the sampling process itself. It is quite easy to introduce bias into machine learning algorithms.

The chosen sampling method can lead to sampling bias

If humans are involved in any part of the sampling process, there is a greater chance of bias in the model.

  • There is a solution:
    • You can try using larger or more diverse data sets to train your model. This is the ideal solution, but it is not always feasible.
    • Adopt a sample selection method that removes you from the process of selecting your data (see the sketch after this list).
      • Random sampling
      • Random sampling with replacement
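
A tiny sketch of the two sampling schemes with NumPy, using an integer array as a stand-in for the population:

```python
# Random sampling with and without replacement, taking the human out of the selection.
import numpy as np

rng = np.random.default_rng(42)
population = np.arange(1000)                     # stand-in for a much larger population

sample = rng.choice(population, size=100, replace=False)            # random sampling
bootstrap_sample = rng.choice(population, size=100, replace=True)   # with replacement

print(len(np.unique(sample)))            # always 100: no item drawn twice
print(len(np.unique(bootstrap_sample)))  # usually < 100: some items repeat
```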

Resampling versus bootstrapping #

There are a number of resampling methods.

  • Re-sampling approaches:
    • Train and Test Sets
    • k-fold Cross Validation
    • Leave One Out Cross Validation
    • Repeated Random Test-Train Splits

The term “resampling” also covers permutation procedures.

You may also hear the term “resampling” used synonymously with the term “bootstrapping”. In any case, the term “bootstrap” always implies that we are randomly sampling, with replacement, from the observed data set.

Bootstrap is often used as a general tool to generate statistical measures such as averages over the whole population, to generate confidence intervals as well as model parameters that we use for machine learning algorithms.

  • Key ideas of bootstrap:
    • Assess the variability of a sample statistic.
    • It is non-parametric.
    • We can estimate sample statistics where there is no appropriate method.

For now, the one thing to take away is that you will see the terms resampling and bootstrapping mentioned quite frequently, and often together. In the simple sense, with resampling, once we have taken data out to be part of our sample we do not put it back into the population, whereas with bootstrapping we do.

NOTE

  • Parametric methods: methods that can be used when we can assume that the population follows some known distribution.
  • Non-parametric methods: methods used when no such assumption about the population's distribution can be made.

Because non-parametric methods make no distributional assumptions, they can be applied to any population (including populations where parametric methods would also work). In exchange, their precision tends to be lower than that of parametric methods.

Common re-sampling approaches #

K-fold cross-validation #

  • ML steps:
    • Obtain and clean the data
    • Use the data to train the model
    • Test how the model performs

In more detail, our first step is essentially to estimate the parameters of the machine learning model. This estimation of parameters is achieved by feeding the model some training data, which means we need some data to train the model.

The second step is then to assess how well the model did, given the parameters it estimated from the data we introduced to it during the training phase.

What data do we use to test how well the model performed?

  • Use 75% of the data for training
  • Use 25% of the data for testing

How can we be sure we have a good split between train and test data?

Cross-validation allows us to split up our data into chunks and try each chunk of data one after another. It is a good approach for checking whether your model is overfitting the data; that is to say, it uses a limited sample to estimate how the model is expected to perform in general when it makes predictions on unseen data.

  • The method is:
    • Shuffle the data and split the data into k chunks.
    • Fit a model on the training set and evaluate it on the test set.
    • Keep a note of the accuracy score and discard the model.
    • Repeat k times.
    • Take an average of the accuracy scores at the end.

So, K-fold cross-validation means we divide our data into k chunks.

There are several other approaches, but k-fold cross-validation is considered a gold standard in machine learning evaluation.
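
A minimal k-fold cross-validation sketch with scikit-learn, assuming the Iris dataset and logistic regression as stand-ins:

```python
# k-fold cross-validation: shuffle, split into k chunks, fit/evaluate k times, average.
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold)

print(scores)          # one accuracy score per fold
print(scores.mean())   # the averaged estimate of model performance
```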

Leave-one-out cross-validation #

Leave-one-out cross-validation is essentially the same as k-fold cross-validation except for the fact we are working on individual rows of our data instead of chunks of rows of data representing several items.

If our data were composed of only six rows, we would hold out one of those rows for testing and use the remaining five to train the model, repeating the process so that each row is held out once.

The disadvantage is that the process can be time-consuming if our dataset is large or our model is complex. You can usually do quite well just using k-fold cross-validation with k set to 3, 5 or 10.
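
The same sketch with leave-one-out cross-validation; note that the number of model fits equals the number of rows, which is why it becomes expensive on large datasets:

```python
# Leave-one-out cross-validation: each row is held out once as the test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print(len(scores))     # one score per row in the dataset
print(scores.mean())   # proportion of held-out rows predicted correctly
```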

Repeated random test-train splits #

The important difference is that at each iteration the training and test sets are shuffled randomly.

It is really a hybrid of the traditional train-test split we have already seen and the k-fold cross-validation method. In this technique, we create random splits of the data into a training and a test set in the manner we have done before, and then we repeat this process of splitting and evaluating the algorithm multiple times, just like the cross-validation method.

We can think of it as a repeated k-fold cross-validation. This means that we're essentially repeating the k-fold cross-validation a certain number of times, and in each iteration, we're going to shuffle our data before performing the next run.
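
A short sketch of repeated random test-train splits using scikit-learn's ShuffleSplit (one reasonable way to implement the idea; the lecture does not prescribe a specific class):

```python
# Repeated random test-train splits: reshuffle and re-split the data on every iteration.
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

splitter = ShuffleSplit(n_splits=10, test_size=0.25, random_state=42)
scores = cross_val_score(model, X, y, cv=splitter)
print(scores.mean(), scores.std())
```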

What re-sampling techniques to use and when #

Generally, my advice is that k-fold cross-validation is probably one of the better techniques. It is considered the gold standard for evaluating the performance of a machine learning algorithm, particularly on unseen data. We can set k to any value, but generally speaking people tend to experiment with values of k equal to 3, 5 and 10; if you are in doubt and have sufficient data, you can apply 10-fold cross-validation in most cases.

See also: https://www.geeksforgeeks.org/introduction-to-resampling-methods/

Further material #

You have learnt a few approaches that can be applied to sample our data set. After a brief overview of the approaches, the following reading will give you a bit more detail on the topic together with some useful illustrations to help. There are also some further examples of how you can apply these practically using Python:

https://towardsdatascience.com/statistical-learning-ii-data-sampling-resampling-93a0208d6bb8

Week 5: Feature and model evaluation #

Feature selection and evaluation #

Methods for selecting features #

A recap of feature selection: there are supervised and unsupervised feature selection methods.

  • Supervised methods:
    • filter
    • wrapper
    • intrinsic (or embedded)

In terms of filter-based selection methods there are actually two categories.

  • Filter-based selection:
    • statistical methods
    • feature importance methods

Statistical methods

When we look at statistical feature selection, these are essentially a group of methods which evaluate the relationship between each of the input variables and the target variable. We only select those input variables that have the strongest relationship with the target variable.

It is common to use correlation-type statistical measures between the input and output variables. As a result, the choice of statistical measure is highly dependent upon the data types of both your input and output variables.

  • The most common measures to apply are:
    • Numerical Input, Numerical Output: This is a regression predictive modeling problem.
      • Pearson’s correlation coefficient (linear)
      • Spearman’s rank coefficient (nonlinear)
    • Numerical Input, Categorical Output: This is a classification predictive modeling problem.
      • ANOVA correlation coefficient (linear)
      • Kendall’s rank coefficient (nonlinear)
    • Categorical Input, Numerical Output: This is a regression predictive modeling problem. (This is actually a strange example of a regression problem, which you are essentially unlikely to encounter very often.)
      • ANOVA correlation coefficient (linear)
      • Kendall’s rank coefficient (nonlinear)
    • Categorical Input, Categorical Output: This is a classification predictive modeling problem.
      • Chi-squared test (contingency tables)
      • Mutual information
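
For the numerical-input, categorical-output case, here is a small sketch showing how the chosen measure plugs into SelectKBest (ANOVA F-test and mutual information; the regression equivalents f_regression and mutual_info_regression work the same way):

```python
# Matching the statistical measure to the data types with SelectKBest.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

X, y = load_iris(return_X_y=True)   # numerical inputs, categorical output

anova = SelectKBest(score_func=f_classif, k=2).fit(X, y)             # ANOVA F-test (linear)
mutual = SelectKBest(score_func=mutual_info_classif, k=2).fit(X, y)  # mutual information (nonlinear)

print(anova.scores_.round(1))
print(mutual.scores_.round(2))
```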

Information gain #

Information is anything that essentially decreases our uncertainty.

Information gain calculates the reduction in what we call entropy or surprise from transforming a dataset in some way. You can use information gain for feature selection.

Information Gain and Mutual Information

Mutual information is really another name for information gain in this context; the two can be used interchangeably.

We also mentioned entropy. Now entropy quantifies how much information is represented by a random variable.

We calculate the information gain by comparing the entropy of the dataset before and after we apply some form of transformation on it.

  • Measures the reduction in entropy when transforming our data in some way.
  • Smaller entropy suggests more purity or less surprise.
  • We compare the entropy of the data before and after we transform it.

Entropy

  • Skewed distribution = low entropy = unsurprising
  • Balanced Distribution = high entropy = surprising
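
In symbols (standard definitions, not copied from the lecture slides), the entropy of a dataset S with class proportions p_i, and the information gain from splitting it on a feature A, are:

```latex
H(S) = -\sum_{i} p_i \log_2 p_i
\qquad
IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)
```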

When information gain is used

Information gain can be used as a split criterion in most implementations of decision trees.

The process works really by first calculating the information gain for each of our variables in the dataset, and then we’ll use the variable that has the largest information gain to essentially split the dataset as part of the tree.

Feature importance #

What is feature importance

Feature importance = a measure of how much each feature contributes to predicting (classifying) the target

Feature importance refers to techniques that assign a score to each feature based on how useful it is for predicting our target variable. A high score means the feature has a larger effect on the model. In short, feature importance techniques rank our features by score.

Feature importance techniques

  • Model coefficients
  • Decision trees
  • Permutation testing

Decision-tree-based feature importance techniques that can estimate the importance of our features:

  • Random Forest
  • ExtraTrees
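
A short sketch of these techniques with scikit-learn, assuming the breast cancer dataset: impurity-based importances from Random Forest and ExtraTrees, plus permutation importance as a model-agnostic check:

```python
# Feature importance scores from tree ensembles, plus permutation testing as a check.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(random_state=42).fit(X_train, y_train)
extra = ExtraTreesClassifier(random_state=42).fit(X_train, y_train)

print(forest.feature_importances_.round(3))   # impurity-based scores, one per feature
print(extra.feature_importances_.round(3))

perm = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=42)
print(perm.importances_mean.round(3))         # permutation-testing view of importance
```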

Model selection and evaluation #

Confusion matrices #

A confusion matrix is often used to evaluate classification models. It is an n-by-n matrix, where n is the number of classes.

  • TP: True Positive (model’s prediction is correct)
  • TN: True Negative (model’s prediction is correct)
  • FP: False Positive (model’s prediction is wrong)
  • FN: False Negative (model’s prediction is wrong)

Sensitivity and Specificity

  • Sensitivity is the “true positive rate” (or recall):
    • TPR = TP / (TP + FN)
  • Specificity is the “true negative rate”; the related “false positive rate” is 1 − specificity:
    • TNR = TN / (TN + FP), FPR = FP / (FP + TN)
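
A small sketch (my own example) deriving these rates from scikit-learn's confusion_matrix on toy labels:

```python
# Deriving sensitivity (TPR) and specificity (TNR) from a binary confusion matrix.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)        # true positive rate (recall)
specificity = tn / (tn + fp)        # true negative rate
fpr = fp / (fp + tn)                # false positive rate = 1 - specificity

print(tn, fp, fn, tp)
print(sensitivity, specificity, fpr)
```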

Precision, recall, and F1 #

Precision

Precision measures how many of the items the model predicted as positive were actually positive, i.e. how often a positive prediction was the right answer.

Precision = TP / (TP + FP)

Recall

Recall measures how many of the items that were actually positive the model managed to find, i.e. how few positives it missed.

Recall = TP / (TP + FN)

F1

F1 combines precision and recall into a single measure (their harmonic mean) that captures both of these properties.

F1 = (2 * Precision * Recall) / (Precision + Recall)
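
Using the same toy labels as in the confusion matrix sketch above, the three measures can be computed directly with scikit-learn and match the formulas:

```python
# Precision, recall and F1 computed directly, matching the formulas above.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

precision = precision_score(y_true, y_pred)   # TP / (TP + FP)
recall = recall_score(y_true, y_pred)         # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                 # harmonic mean of precision and recall

print(precision, recall, f1)                  # 0.8, 0.8, 0.8 for this toy example
```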

Bootstrapping #

  • Sampling
  • Sampling with replacement

Sampling

Sampling is used in statistics and that’s really the process of selecting a subset of items from a vast collection, which we call the population. We do this to estimate a certain characteristic of the population itself.

Basic sampling implies that we do not sample with replacement: we remove items from the population, and therefore the population gets smaller.

Sampling with replacement

Sampling with replacement is an extension of this: a data point drawn as a sample from the population can reappear in future draws, because essentially we put it back into the population.

When we’re using bootstrapping or sampling with replacement, we are able to leave that data item within the population and sample it many times over.

Bootstrapping is a sampling method, involving a random process just like sampling. It is a very powerful and popular method for estimating population parameters, particularly when no other statistical method is available.

We draw our samples from our data repeatedly, and then we put them back in again. Once we have a sample, we can estimate a population parameter. That means that we can use statistics to compute the mean or any other measure. The great thing about bootstrapping is that it’s non-parametric.
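
A minimal bootstrap sketch with NumPy, assuming a synthetic observed sample: resample with replacement many times, then read a non-parametric 95% confidence interval for the mean off the percentiles:

```python
# Bootstrapping the mean: resample with replacement and use the percentile interval.
import numpy as np

rng = np.random.default_rng(42)
sample = rng.normal(loc=50, scale=10, size=200)     # the observed data set

boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(10_000)
])

ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(sample.mean(), ci_low, ci_high)   # point estimate and 95% bootstrap CI
```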

You will also come across the term “bootstrap aggregating”. Bootstrapping can be used in machine learning ensemble algorithms under this name; it is also called “bagging”, which we'll discuss next.

Bagging #

Bagging is an ensemble learning technique that helps improve the performance and accuracy of machine learning algorithms.

It’s best used to deal with the bias-variance tradeoff and reduces the variance of our prediction model. In essence, bagging avoids overfitting data and is used for both regression and classification models specifically for decision-tree algorithms.

First, consider a single model that may not perform particularly well on its own, due to high variance or high bias, depending on how we've trained it. These individual models are what we call weak models. However, when we aggregate the predictions of more than one of these average-performing models, the combination can actually help reduce bias or variance, providing better model performance.

That’s the basic idea. It’s this idea of collective intelligence or wisdom of the crowd.

The random forest algorithm is considered an extension of the bagging method, as it uses both bagging and feature randomness to create an uncorrelated forest of decision trees.
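
A short bagging sketch with scikit-learn (my own example), comparing a single decision tree, bagged trees, and a random forest on a stand-in dataset:

```python
# Bagging: train many trees on bootstrap samples and aggregate their predictions.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

single_tree = DecisionTreeClassifier(random_state=42)
bagged_trees = BaggingClassifier(single_tree, n_estimators=100, random_state=42)
forest = RandomForestClassifier(n_estimators=100, random_state=42)  # bagging + feature randomness

for name, model in [("tree", single_tree), ("bagging", bagged_trees), ("forest", forest)]:
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))
```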

Further material #