University of London / MSc Computer Science: Cloud computing (Part 2)

April 2, 2024

I am taking the MSc Computer Science: Cloud computing module at the University of London.

These are my personal study notes on the lecture content.

This post covers weeks 6–12 of the 12-week module. (Week 6 started 12 February 2024; week 12 ended 1 April 2024.)

Week 6: Introduction to distributed computing principles #

Notes

  • The lectures cover the problems that arise in distributed systems and the fault tolerance for them.

Lectures

  • Advanced Distributed Systems
    • Lecture 1: Distributed systems: Dive in
    • Lecture 2: Time and clocks
    • Lecture 3: Fault tolerance in distributed systems
    • Lecture 4: Famous problem – Byzantine fault tolerance
    • Lecture 5: Consensus on distributed systems

Synchronous vs Asynchronous

  • Synchronous
    • Each message is received within bounded time
    • Physical clocks are synchronized
    • Clock drift is known
    • Note: telephone communication is synchronous
  • Asynchronous
    • No bounds on message transmission delays
    • No bounds on process executions
    • No physical clocks
    • Clock drift is arbitrary
    • Note: postal communication is asynchronous
  • Synchronous distributed systems
    • Easier to design synchronous distributed algorithms
    • Restrictive requirements:
      • Precision on the clock synchronisation
      • Limited concurrent network usage
    • Time is bounded!
    • Bounded execution speed and time
  • Asynchronous distributed systems
    • More difficult to design asynchronous distributed algorithms
    • Open requirements:
      • No assumption for time delays
      • No precision on the clock synchronisation
      • Concurrent network usage
    • Time is not bounded!
    • Varying execution speed and time

A solution for an asynchronous distributed system is also a solution for a synchronous distributed system.

Time and clocks

  • Non-distributed systems:
    • A global clock for all processes.
  • Distributed systems:
    • Each computer has its own clock. No clock is perfect!

The usual solution is to order events with logical clocks instead of relying on physical time.
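A common remedy is the logical clock, which orders events without physical time. Below is a minimal Lamport clock sketch (the class and method names are my own illustration, not from the lecture): each process keeps a counter, advances it on local events, and jumps past a sender's timestamp on receipt.

```python
class LamportClock:
    """A minimal Lamport logical clock (illustrative sketch)."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event: advance the logical clock by one."""
        self.time += 1
        return self.time

    def send(self):
        """Attach the current logical time to an outgoing message."""
        return self.tick()

    def receive(self, msg_time):
        """On receipt, jump past the sender's timestamp."""
        self.time = max(self.time, msg_time) + 1
        return self.time


a, b = LamportClock(), LamportClock()
t = a.send()       # a's clock becomes 1
print(b.receive(t))  # b jumps to max(0, 1) + 1 = 2
```

This guarantees that if event X causally precedes event Y, then X's timestamp is smaller than Y's, even though no node ever consults a physical clock.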

Fault tolerance in distributed systems

Distributed systems are highly complex. They require mechanisms to control and cope with faults. There are two categories of failures:

  • Stop fail: Failures that crash the system
    • Node is off
    • Network fails
    • Bug failure
    • Data corruption
    • Flood in the datacentre
    • We didn’t pay the bill…
    • These nodes do not return a value
    • Can be detected by other nodes
  • Non-stop fail: Failures that don’t crash the system (a node does not work as expected, it becomes a traitor …)
    • Code is buggy, but still works (e.g. send invalid messages)
    • A disk is faulty, but still works
    • Network delays
    • Nodes send incorrect/corrupted values
    • Difficult to detect

Non-stop failures are difficult to identify. “Byzantine fault tolerance” is the idea of building a distributed system that survives non-stop-fail failures.

Consensus on distributed systems

A consensus algorithm is a process to achieve agreement on a single data value among distributed processes or systems. Consensus algorithms are designed to achieve reliability in a network of unreliable nodes, in such a way that reliable nodes agree on an action. Consensus is achieved even if: Processes fail, messages are lost or delivered out of order.

  • Consensus algorithms solve stop-fail failures.
  • Consensus algorithms do not solve “byzantine” failures, where a node is a traitor!
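As a rough illustration of why stop-fail failures are tolerable, here is a toy majority vote in Python (my own sketch; real consensus algorithms such as Paxos or Raft are far more involved): crashed nodes contribute no value, yet a live majority can still agree.

```python
from collections import Counter


def majority_value(votes):
    """Return the value agreed by a strict majority of ALL nodes, else None.

    Stop-failed (crashed) nodes are modelled as voting None: they return
    no value and are simply absent from the tally.
    """
    live = [v for v in votes if v is not None]
    if not live:
        return None
    value, count = Counter(live).most_common(1)[0]
    # Agreement requires a majority of the whole cluster, crashed nodes included.
    return value if count > len(votes) // 2 else None


# 5 nodes, 2 crashed: the 3 live nodes still reach agreement.
print(majority_value(["commit", "commit", "commit", None, None]))  # commit

# 5 nodes, 3 crashed: no strict majority is possible, so no decision.
print(majority_value(["commit", "abort", None, None, None]))  # None
```

Note what this sketch cannot handle: a Byzantine node votes a *wrong* value rather than no value, so it still counts toward the tally and can mislead the vote — which is exactly why Byzantine failures need stronger algorithms.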

Further material #

Week 7: Infrastructure as Code, CI/CD pipeline #

Notes

  • Most of this was already familiar to me, so I'll keep the lecture notes brief.
  • The lectures give an overview of IaC and CI/CD.
  • The lab is a hands-on exercise in deploying an application and automating Compute Engine provisioning with GitHub Actions and Terraform.

Lectures

  • Infrastructure as code, CI/CD pipeline
    • Lecture 1: Infrastructure as code
    • Lecture 2: Benefits of CI/CD using IaC
  • Labs
    • Lab 1: CI/CD with Terraform, GCP, GitHub actions and node.js

Infrastructure as Code

Infrastructure as Code (IaC) is the managing and provisioning of infrastructure through code instead of manual processes. We create configuration files that contain infrastructure specifications, which makes it easier to edit and distribute configurations. By codifying and documenting your configuration specifications, IaC aids configuration management and helps you to avoid undocumented, ad-hoc configuration changes. Automating infrastructure provisioning with IaC means that developers don’t need to manually provision and manage servers, operating systems, storage, etc. each time they develop or deploy an application.

Two approaches to configuration files:

  • Declarative: Defines the desired state of the system, including what resources you need and any properties they should have, and an IaC tool will configure it for you.
  • Imperative: Defines the specific commands needed to achieve the desired configuration, and those commands then need to be executed in the correct order.
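The declarative approach can be pictured as a diff between desired and current state. Here is a hypothetical Python sketch (resource names are made up; real tools such as Terraform compute a plan like this against cloud provider APIs):

```python
def plan(desired, current):
    """Diff desired state against current state and emit the actions needed.

    This mimics, very loosely, what a declarative IaC tool's "plan" step does:
    the user states WHAT should exist, and the tool derives HOW to get there.
    """
    actions = []
    for name in desired:
        if name not in current:
            actions.append(("create", name))
        elif desired[name] != current[name]:
            actions.append(("update", name))
    for name in current:
        if name not in desired:
            actions.append(("destroy", name))
    return actions


# Hypothetical resources: one VM drifted, one is missing.
desired = {"web-vm": {"machine_type": "e2-small"},
           "db-vm": {"machine_type": "e2-medium"}}
current = {"web-vm": {"machine_type": "e2-micro"}}
print(plan(desired, current))  # [('update', 'web-vm'), ('create', 'db-vm')]
```

An imperative script, by contrast, would be the output list itself — a fixed sequence of commands that must be run in the right order, with no notion of the desired end state.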

Week 8: Introduction to distributed database systems #

Notes

  • The lectures cover the differences between RDBMSs and NoSQL databases, and give an overview of Apache Cassandra, a NoSQL distributed database.
  • The labs are hands-on exercises in CRUD operations with Apache Cassandra and in replicating data across multiple Cassandra nodes (so that the database keeps working even if a node goes down).

Lectures

  • Introduction to distributed computing principles
    • Lecture 1: Distributed database systems
    • Lecture 2: RDBMSs vs NoSQL
    • Lecture 3: NoSQL characteristics
    • Lecture 4: The Bloomfilter algorithm
  • Labs
    • Lab 1: Introduction to Apache Cassandra
    • Lab 2: Create an Apache Cassandra installation on GCP

Apache Cassandra

Apache Cassandra is a column-based (also known as “wide-column store”) NoSQL distributed database. Data is stored by column rather than by row as in a conventional SQL system.
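The wide-column model can be pictured as a map from row keys to per-row column maps, where different rows need not share the same columns. This is a plain-Python illustration of the data model only, not how Cassandra physically stores data:

```python
# Each row key owns its own set of columns; rows can have different columns.
table = {
    "user:1": {"name": "Alice", "email": "alice@example.com"},
    "user:2": {"name": "Bob", "city": "London"},  # no email column -- that's fine
}


def read(row_key, column):
    """Fetch a single column of a single row, as a wide-column store would."""
    return table.get(row_key, {}).get(column)


print(read("user:2", "city"))  # London
print(read("user:1", "city"))  # None -- this row simply has no such column
```

The key contrast with a conventional SQL table is that there is no fixed schema shared by all rows: absent columns cost nothing, and reads address individual columns rather than whole rows.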

Bloom filter algorithm

A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is not present in a set. This algorithm is widely used by big key-value stores, including Apache Cassandra.

The price paid for this efficiency is that a Bloom filter is a probabilistic data structure: it tells us that the element either is definitely not in the set or may be in the set.
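A minimal Bloom filter sketch in Python (my own illustration, deriving k hash positions from SHA-256 — real implementations use faster hashes): adding an element sets k bits, and a lookup answers "definitely not present" or "possibly present".

```python
import hashlib


class BloomFilter:
    """A toy Bloom filter: k hash functions set k bits per element."""

    def __init__(self, size=1024, k=3):
        self.size, self.k = size, k
        self.bits = [False] * size

    def _positions(self, item):
        # Derive k pseudo-independent bit positions by salting the hash input.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        """False means definitely absent; True means possibly present."""
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
bf.add("cassandra")
print(bf.might_contain("cassandra"))  # True
print(bf.might_contain("hadoop"))     # almost certainly False (false positives are possible)
```

This is why Cassandra consults a Bloom filter before touching disk: a "definitely absent" answer lets it skip an SSTable entirely, and the occasional false positive only costs one unnecessary disk read, never a wrong result.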

Week 9: Introduction to distributed and big data systems (Part 1) #

Notes

  • The lectures cover what is needed to handle big data efficiently (distributed storage, parallel retrieval, and so on) and give an overview of Apache Hadoop, a software framework for large-scale data processing.
  • The lab is a hands-on introduction to using Apache Hadoop.

Lectures

  • Introduction to big data principles
    • Lecture 1: Why do we need big data systems?
    • Lecture 2: Big data systems
    • Lecture 3: What is Apache Hadoop?
  • Labs
    • Lab 1: Introduction to Apache Hadoop MapReduce
    • Lab 2: Create an Apache Hadoop installation on GCP

Apache Hadoop

Hadoop facilitates and simplifies the processing of vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. It can efficiently process petabytes of data across thousands of nodes.

Hadoop is a collection of modules:

  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
  • Hadoop MapReduce: A system for parallel processing of large data sets.
  • YARN: A framework for job scheduling and cluster resource management.

HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.

MapReduce works by breaking the processing into two phases, the map phase and the reduce phase:

  • The map phase: the input data is split into chunks and an operation is applied to each chunk.
  • The reduce phase: the map outputs are aggregated and the combined result is returned to the user.
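The two phases can be sketched with the classic word count in plain Python (an illustration of the programming model only, not the Hadoop API):

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word, 1) for word in chunk.split()]


def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


chunks = ["big data big", "data systems"]  # the input, split across "nodes"

grouped = defaultdict(list)
for chunk in chunks:                       # in Hadoop, map tasks run in parallel
    for word, count in map_phase(chunk):
        grouped[word].append(count)        # the shuffle step: group pairs by key

print(reduce_phase(grouped))  # {'big': 2, 'data': 2, 'systems': 1}
```

The point of the model is that `map_phase` sees only its own chunk and `reduce_phase` sees only one key's group at a time, so both phases parallelise across a cluster with no shared state.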

Week 10: Introduction to distributed and big data systems (Part 2) #

Notes

  • The lectures cover Apache Spark and how it differs from Apache Hadoop.
  • The lab is a hands-on introduction to using Apache Spark.

Lectures

  • Introduction to in-memory data processing
    • Lecture 1: Big data processing systems
    • Lecture 2: RDD: Resilient distributed dataset
    • Lecture 3: Building Spark applications
  • Labs
    • Lab 1: Introduction to Apache Spark
    • Lab 2: Create an Apache Spark installation on GCP

Apache Spark

Apache Spark is a framework for batch and also for streaming data analytics.

Big data streaming is a process in which big data is processed quickly in order to extract real-time insights from it. The data being processed is data in motion.

Since Hadoop is designed as a batch-processing system, a batch job for data analytics is ideal for Hadoop MapReduce; however, we cannot use Hadoop for streaming. Instead, we can use Spark for a long-running streaming job.
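The contrast can be sketched as micro-batch processing in plain Python (my own illustration, not the Spark API): a running result is updated as each small batch of data in motion arrives, instead of waiting for one large batch to complete.

```python
from collections import Counter


def process_stream(batches):
    """Update a running word count as each micro-batch arrives.

    Yielding after every batch gives a real-time view of the result,
    in contrast to a batch job that answers only once, at the end.
    """
    running = Counter()
    for batch in batches:          # each batch is a small slice of data in motion
        running.update(batch.split())
        yield dict(running)        # an up-to-date snapshot per batch


snapshots = list(process_stream(["spark streaming", "spark batch"]))
print(snapshots[0])   # {'spark': 1, 'streaming': 1}  -- insight before the stream ends
print(snapshots[-1])  # {'spark': 2, 'streaming': 1, 'batch': 1}
```

A batch job would compute only the final snapshot; the streaming job exposes every intermediate one, which is what makes real-time insights possible on an unbounded stream.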

Week 11, 12 #

The final coursework period. The assignment is to build a REST API for an article-management system using Node.js with Express.js and MongoDB: an API for creating users, posting articles, and storing comments on those articles. Essentially, backend development for a simple web app.