System & Network Administration

IMA5SC Polytech Lille - December 2017




Thomas Maurice - Production Engineer - PE Security

tmaurice@fb.com

Agenda

  • Quick intro
  • What is system administration ?...
  • ... and how does it differ from Production Engineering ?
  • What's a multi-tiered application ?
  • What is load balancing and why do you need it ?
  • Introduction to containers & µservices
  • Now what ? Orchestration !
  • Monitoring
  • Configuration management

Introduction

"Dafuq is this class and why should I care ?"

- You right now

About this module

  • Basic principles of systems administration: loadbalancing, automation, you name it
  • Real world examples of why you need that stuff. Because spoiler: things break
  • Introduction to the not so new (but new anyway) technology: containers
  • Having containers in production is nice, orchestrating them is better
  • Monitoring, have some
  • Automation and config management
  • Hopefully live demos, wish me luck

Some words about me

  • IMA8SC
  • Worked ~2 years at OVH as a devops engineer, mostly on the core container infrastructure
  • Worked (very) briefly at Criteo as a Site Reliability Engineer
  • I have been working at Facebook London since April '17, as a Production Engineer in the PE Security team

What's a "traditional" sysadmin ?

Let's define it

  • The guy responsible for servers being up
  • The guy responsible for the applications being responsive
  • The guy responsible for monitoring
  • The guy responsible for infrastructure security
  • In short: the guy you wake up at 3a.m. because something is broken

How does it differ from production engineering ?

  • Production Engineers at Facebook are hybrid software/systems engineers who ensure that Facebook's services run smoothly and have the capacity for future growth.
  • They are embedded in every one of Facebook's product and infrastructure teams, and are core participants in every significant engineering effort underway in the company.
  • Source Facebook careers

Practical skills PEs have

  • Automation, to minimize the human cost of running a given system
  • Design large scale systems
  • Troubleshooting problems, sometimes handling emergency situations
  • Transferring knowledge between team members and partner teams
  • Adaptation to various technical stacks

Setting up a scenario

  • You are the sysadmin of a communication agency
  • Your company wants a website
  • The requirements are dead simple: "We need a wordpress, use our servers. You have the afternoon"

How do you proceed ?

The naive approach


timmy@ljosalfheim ~ $ ssh root@wordpress.polytech-lille.ovh
root@wordpress ~ # apt-get update &&
    apt-get install \
        php-fpm \
        nginx-extras \
        mysql-server-5.5 \
        mysql-client-5.5
root@wordpress # vim /etc/nginx/sites-enabled/wordpress.conf
root@wordpress # vim /etc/php-fpm/conf.d/main.conf
root@wordpress # wget https://wordpress.org/latest.tar.gz -O - | tar zxv -C /var/www/
root@wordpress # systemctl restart php-fpm nginx
                    

My job here is done, I came up with a quick solution

#YOLO

Let's go into production, what can go possibly wrong ?

GG !

  • Your company's website was very successful
  • Visits increase
  • Revenues increase
  • Server load increases
  • Server eventually dies
  • Revenues drop
  • You are fired

¯\_(ツ)_/¯

How could that have been avoided ?

Multi tiered applications

Multi tiering

That's the concept of segregating software components onto dedicated machines, in order to provide scalability and security to your infrastructure.

  • A database must never ever be on an Internet facing machine
  • The application server should never directly be exposed to users
  • Use a reverse proxy instead (see the sketch below)
  • If possible, separate those two on different machines/containers
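
To make it concrete, here is a minimal nginx reverse proxy sketch (hostnames and addresses are made up): users only ever talk to nginx, the application server stays on the private network.

$ cat /etc/nginx/sites-enabled/app.conf
server {
    listen 80;
    server_name www.example.com;

    location / {
        # forward to the application tier, which is not Internet facing
        proxy_pass http://10.0.1.10:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}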

Sample architecture overview

Open question: why would you do that ?

  • You can scale each tier separately
  • You can load balance at each level
  • You isolate your backends from the internet
  • Ideally behind a firewall
  • That sounds silly ? Have a look at shodan.io

Other benefits of doing so

  • You can afford to "lose" nodes if you have many of them
  • You can put in place canarying of the nodes before a release

Loadbalancing

Load balancing improves the distribution of workloads across multiple computing resources, such as computers, a computer cluster, network links, central processing units, or disk drives. - Wikipedia

One problem, several solutions

  • DNS load balancing
  • Physical load balancing
  • Software load balancing

Reminders about DNS

  • Aims at associating hostnames with IPs
  • Decentralized protocol
  • Old protocol: RFC 882 from 1983
  • Works over UDP & TCP for zone transfer

Reminders about DNS

Reminders about DNS

DNS load balancing

  • Aim: Make a single hostname resolve to multiple hosts to spread load evenly
  • Pretty easy to set up (see the sketch below)
  • ... but still!
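
For instance, with BIND you just declare several A records for the same name and resolvers will rotate through them (hypothetical zone excerpt, documentation IPs):

$ cat /etc/bind/db.example.com
; several A records for the same name -> round robin answers
www    300    IN    A    192.0.2.10
www    300    IN    A    192.0.2.11
www    300    IN    A    192.0.2.12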

Simple round robin: NTP Pool


thomas@laptop $ dig fr.pool.ntp.org
;; ANSWER SECTION:
fr.pool.ntp.org.	518	IN	A	176.31.109.7
fr.pool.ntp.org.	518	IN	A	91.121.197.13
fr.pool.ntp.org.	518	IN	A	37.187.104.44
fr.pool.ntp.org.	518	IN	A	94.23.0.110

Advantages

  • Easy to setup
  • Spreads the load evenly across backends
  • Does not require a complex infrastructure

Pitfalls

  • Cache invalidation within a reasonable delay
  • Losing a server
  • Client side cache

But wait, there's more!

  • Geographical load balancing
  • Redirect your request to the closest server
  • "Blackout" a DC in case of outage

Random trivia about DNS

  • DNS is usually a plain text protocol over UDP
  • You can now use it over TCP and TLS, with resolvers like quad9

Real world example: Facebook


# From home, in London, UK
thomas ljosalfheim.maurice.fr ~ $ dig facebook.com +short
31.13.90.36

# From a random box, in Roubaix, FR
thomas devel.maurice.fr ~ $  dig facebook.com +short
157.240.20.35

Traceroute from LDN (UK)


$ mtr 31.13.90.36 -c 10 -r
Start: Wed Nov 15 14:33:04 2017
HOST: ljosalfheim.maurice.fr      Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- gateway                    0.0%    10    4.2   4.2   2.4   5.9   1.1
  2.|-- ???                       100.0    10    0.0   0.0   0.0   0.0   0.0
  3.|-- hari-core-2a-xe-102-0.net  0.0%    10   13.6  15.5  11.9  23.1   3.9
  4.|-- ???                       100.0    10    0.0   0.0   0.0   0.0   0.0
  5.|-- telw-ic-1-ae0-0.network.v  0.0%    10   18.2  21.0  16.4  39.5   7.0
  6.|-- ae13.pr04.lhr3.tfbnw.net   0.0%    10   18.1  20.0  17.9  23.0   1.6
  7.|-- po161.asw01.lhr6.tfbnw.ne  0.0%    10   17.7  18.8  15.8  21.5   1.7
  8.|-- po233.psw02.lhr3.tfbnw.ne  0.0%    10   15.5  19.3  15.5  26.0   3.4
  9.|-- 173.252.67.69              0.0%    10   15.6  19.8  15.6  30.4   4.2
 10.|-- edge-star-mini-shv-01-lhr  0.0%    10   21.1  23.7  19.6  30.5   3.6

Traceroute from RBX (FR)


$ mtr -c10 -r 157.240.20.35
Start: Wed Nov 15 14:33:24 2017
HOST: devel.maurice.fr            Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 10.42.0.1                  0.0%    10    0.2  45.0   0.2 447.8 141.5
  2.|-- vss-3b-6k.fr.eu            0.0%    10    1.1  42.8   0.6 347.8 109.5
  3.|-- 10.95.69.6                 0.0%    10    0.4  25.2   0.4 247.4  78.1
  4.|-- 10.95.66.56                0.0%    10    0.8  15.3   0.5 147.2  46.3
  5.|-- 10.95.64.2                 0.0%    10    3.7   6.7   1.5  48.0  14.5
  6.|-- be100-1047.fra-5-a9.de.eu  0.0%    10    8.8   8.7   8.3   8.9   0.0
  7.|-- ae1.br02.fra1.tfbnw.net    0.0%    10   11.1  10.9   9.7  15.0   1.5
  8.|-- po116.asw01.fra2.tfbnw.ne  0.0%    10    8.4   8.6   8.4   8.9   0.0
  9.|-- po212.psw01.fra3.tfbnw.ne  0.0%    10    8.5   8.5   8.4   8.7   0.0
 10.|-- 173.252.67.255             0.0%    10    8.6   9.0   8.6  10.4   0.3
 11.|-- edge-star-mini-shv-02-frt  0.0%    10    8.6   8.7   8.6   8.8   0.0

What did we observe ?

  • The request from the English box ended up in LHR (London, UK)
  • The French one ended up in FRA (Frankfurt, DE), a nearby European PoP

How do we do it at Facebook ?

Let's switch real quick to Arun Moorthy's presentation of our network infrastructure

In-datacenter load balancing mechanisms

Reminders about the OSI model

Using a "real" load balancer

  • Can be a big physical one (F5)
  • Can be a software one (nginx, haproxy, lvs, iptables)
  • Can operate at various levels (OSI level 4 or 7)
  • Can provide additional services (SSL termination, stickiness)

Their role is simple: dispatch the load to the most suitable backends

Physical load balancers

  • Very expensive (in terms of money)
  • Can handle a lot of load
  • Fast
  • Full featured (firewall, anti DDoS)

L4 load balancers

  • Fast
  • Operate at IP level
  • They don't know what protocol they load balance
  • Stickiness at IP level
  • Example: Linux Virtual Server

Linux Virtual Server's model

  • One Virtual IP
  • Several Real Servers
  • The load balancing is taken care of at the kernel level (see the sketch below)
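
A minimal sketch with ipvsadm (the VIP and real server addresses are made up): declare the virtual service, pick a scheduler, then register the real servers.

# round robin scheduling on the virtual IP, two real servers in NAT mode
root@lb # ipvsadm -A -t 10.0.0.1:80 -s rr
root@lb # ipvsadm -a -t 10.0.0.1:80 -r 192.168.0.10:80 -m
root@lb # ipvsadm -a -t 10.0.0.1:80 -r 192.168.0.11:80 -m
root@lb # ipvsadm -L -n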

How it works

L7 load balancers

  • They do applicative load balancing
  • They know the protocol underneath
  • They can perform adapted healthchecks
  • Examples: nginx, haproxy, galera...

L4 or L7 ?

  • L4 is faster, but has no visibility over the protocol
  • L7 requires more computation, but can perform smart operations

L7 LB example: haproxy

  • Dispatches HTTP requests to backends
  • Performs healthchecks, SSL offloading...
  • Ejects sick backends
  • Can do TCP & HTTP load balancing
  • Used by GitHub, Farmville, YouPorn...

How to check if a backend is healthy ?

  • Basic TCP probing
  • HTTP ping on a static route
  • HTTP ping on a monitoring route that will actually probe the whole system stack (see the haproxy sketch below)
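
Putting it together, a minimal haproxy sketch (backend names and addresses are made up) that load balances HTTP traffic and ejects backends failing their health check:

$ cat /etc/haproxy/haproxy.cfg
defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend www
    bind *:80
    default_backend web

backend web
    balance roundrobin
    # probe a monitoring route, sick backends get ejected
    option httpchk GET /health
    server web-1 10.0.1.10:8080 check
    server web-2 10.0.1.11:8080 check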

Demo time !

Cross your fingers so my demo does not break

What we did

What's the problem ?

If we lose haproxy-1, we lose everything

  • That's called a SPOF (Single Point of Failure)
  • SPOF should be avoided as much as possible

Fault tolerant systems

High availability is a characteristic of a system, which aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period - Wikipedia

What can we make HA ?

  • Network addresses
  • Services

HA != LB

Highly available IP

  • Failover IP that can be mounted on several hosts
  • Only mounted on one host at a time
  • If the host fails, the IP is remounted elsewhere
  • Failure detection is achieved via healthchecks

Sample setup

In our example, the failover IP is the user facing address

How to manage my failover IPs ?

  • keepalived
  • corosync + pacemaker
  • BGP

Keepalived is way simpler, but the corosync/pacemaker combo is way more powerful (minimal keepalived sketch below)
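
A minimal keepalived sketch (interface name, router id and VIP are made up): both load balancers run it, the node with the highest priority holds the failover IP and VRRP moves it when that node stops answering.

$ cat /etc/keepalived/keepalived.conf
vrrp_instance VI_1 {
    interface eth0
    state MASTER            # BACKUP on the standby node
    virtual_router_id 51
    priority 100            # lower value on the standby node
    virtual_ipaddress {
        203.0.113.10/32     # the user facing failover IP
    }
}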

Note on cloud environments

Software like VMware or OpenStack can take care of these issues for you.

Demo animation

Pitfalls

  • Be careful about your corosync quorum to avoid split brain situations
  • NEVER ever mount the same IP twice, you are going to have a bad time

Make your services able to bind to the address


root@prod # sysctl -w net.ipv4.ip_nonlocal_bind=1

This kernel option enables listening on an address that is not currently mounted on the host

Or loopback mount


root@prod # ip address add 10.0.0.1/32 dev lo
root@prod # ip address show lo
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet 10.0.0.1/32 scope global lo
       valid_lft forever preferred_lft forever

Example of services

Stateless HA services: API workers

  • No need to synchronize
  • Work pretty much out of the box
  • Scale at will

Data consumers (tailers)

  • Can consume different "partitions" of the same topic (in Kafka terminology)
  • Each partition is independent
  • Scale by adding more workers and more partitions

Databases

  • Way more complicated to make HA
  • Usual approach: Master-Slave architecture
  • Shared storage or not
  • STONITH

Usual workflow for databases

  • The slave follows the master, each write is replicated
  • The slave is used only for reads
  • If the master dies, the slave is promoted
  • The slave is now the new master

Beware of data corruption

If you use shared storage (NFS, Ceph, whatever), implement STONITH

This works for databases, but also for anything using shared storage

STONITH

Fault tolerant & distributed application design

Simple Master - Slave design

  • You have a master process
  • You have slave processes
  • The slaves follow the master's orders
  • Example: intensive computation

What happens if the master dies ?

Nothing: the whole process is stuck and the slaves have no coordination

How can we make it resilient ?

Can't we just scale up the masters ?

Not really, unless you implement some kind of distributed decision making algorithm. Otherwise:

  • The two masters could send conflicting orders to slaves
  • In case of conflict, who is right ?
  • How would the slaves know to whom they should report status ?

Never assume the network will work

  • The masters are necessarily on different nodes, to prevent a physical failure from taking everything down
  • If a network partition occurs between two master groups, who is right ?
  • Are the masters aware of each other ?

So, what do we do ?

  • Have a leading master
  • Have a quorum of backup masters
  • If the leading master dies, the quorum votes for the new one
  • If the quorum is not reached, no operation is performed

It is better to have no leading master than a split brain situation which leads to an inconsistent cluster state

How can master processes synchronize ?

Use distributed, strongly consistent lock systems

What is that ?

  • A software that maintains a representation of the state of the cluster
  • Each change to the state has to be agreed upon by a quorum of nodes
  • The change is accepted once it has been committed to disk by a quorum of nodes
  • You can see that as a distributed log file
  • Example: Chubby, Zookeeper, Consul, Etcd...

Example: Zookeeper

  • Apache Foundation project written in Java
  • Offers a data tree structure to store data
  • Heavily used by the Apache Mesos ecosystem
  • Disk IO sensitive

Example leader election using Consul

  • All the processes are started
  • They all try to grab a particular handle
  • The first one to have it is the master
  • The others continuously try to grab it in case the master disconnects (see the sketch below)
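
As a sketch, the consul lock helper does exactly that (the KV path and script name are made up): it blocks until it acquires the lock, then runs the command it is given, so only the lock holder runs the master process.

# run this on every candidate node; only one of them gets the lock
$ consul lock service/myapp/leader ./start-master.sh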

Demo !

That's all for today, questions ?

Part 2: Working with containers

Now what ?

  • We have a service which is HA
  • We have load balancing
  • But we still have other problems...
  • We have several developer teams and a very limited set of machines

How about mutualizing servers ?

  • That means: using them to their full potential
  • Servers are expensive, you may want to run several apps on them
  • Make the applications run on the same machine

Do you see any problems with that ?

  • Noisy neighbour phenomenon: application A could prevent application B from working properly by stealing its resources.
  • Lots of apps: lots of potential security breaches
  • In short: isolation

What are "resources" ?

The system prerequisites an application needs in order to work properly

Examples

  • CPU time
  • RAM
  • Network bandwidth
  • IP addresses
  • Disk space
  • Disk I/O

If the application cannot have access to the resources it needs, it may work in a degraded mode (or not work at all) and offer a degraded service to end users

Examples

  • A Minecraft server is RAM sensitive
  • A Kafka broker is IO sensitive
  • A fileserver is bandwidth sensitive

We must find a way to share these resources

Applicative dependencies issues

  • Some apps may have conflicting dependencies (Python <3)
  • Some apps may depend on libs unavailable on the host system (any dynamically linked language)
  • Admins cannot just install every interpreter ever on all the machines

Security concerns

  • What if an app is compromised: the whole machine is
  • ... And all the other apps that run on it

So, is mutualizing a bad idea ?

Nope !

Let's use containers !

A container is an isolation mechanism at OS level. All containers share the same kernel and thus are way lighter than virtual machines. Containers are isolated at several levels, such as PID, FS, network, UID and so on, using kernel namespaces

What does it look like ?

Container != VM

Saying a container is a light VM is like saying a folder is a light partition

Advantages of containers over VMs

  • No instruction translation from the guest to the host
  • Native performance
  • Almost instant provisioning: just launch a process
  • Mutualize one kernel for all the containers

Some history

  • 1979: Introducing chroot in Unix v7
  • 2000: BSD jails
  • 2001: Linux VServer. Introducing resource shares
  • 2004: Solaris zones
  • 2008: Linux Containers (LXC). cgroup-based isolation and user namespaces
  • 2013: Docker. cgroups-based complete ecosystem
  • 2016: Rocket (backed by CoreOS). Standard implementation of OCI

Isolation example 1: FS


$ sudo debootstrap stable ~/chroot http://httpredir.debian.org/debian/
... long output ...
$ sudo chroot ~/chroot /bin/bash
root@ljosalfheim:/# ls
bin  boot  dev	etc  home  lib	lib64  media  mnt  opt	proc  root  run  sbin  srv  sys  tmp  usr  var
root@ljosalfheim:/# pwd
/
root@ljosalfheim:/# ps faux
Error, do this: mount -t proc proc /proc
root@ljosalfheim:/# mount -t proc proc /proc
root@ljosalfheim:/# ps faux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         2  0.0  0.0      0     0 ?        S    06:49   0:00 [kthreadd]
root         3  0.0  0.0      0     0 ?        S    06:49   0:00  \_ [ksoftirqd/0]
.....

Isolation example 2: NET


$ sudo ip netns add polytech
$ sudo ip netns list
polytech
$ sudo ip link add veth0 type veth peer name veth1
$ ip link
...
6: veth1@veth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 42:ff:6e:ae:7b:02 brd ff:ff:ff:ff:ff:ff
7: veth0@veth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether da:41:24:c2:55:96 brd ff:ff:ff:ff:ff:ff

Configure the interface inside the netns


$ sudo ip link set veth1 netns polytech
$ sudo ip netns exec polytech ifconfig veth1 10.1.1.1/24 up
$ sudo ip netns exec polytech /bin/bash
# ip address
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
6: veth1@if7: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state LOWERLAYERDOWN group default qlen 1000
    link/ether 42:ff:6e:ae:7b:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.1.1.1/24 brd 10.1.1.255 scope global veth1
       valid_lft forever preferred_lft forever

Let's setup a bridge


# these commands are run in the host's netns
$ sudo ip link add br0 type bridge
$ sudo ip link set master br0 dev veth0
$ sudo ip address add 10.1.1.254/24 dev br0
$ sudo ip link set up dev br0
$ sudo ip link set up dev veth0

Does it ping ?


$ ping 10.1.1.254 # This is run inside the netns
PING 10.1.1.254 (10.1.1.254) 56(84) bytes of data.
64 bytes from 10.1.1.254: icmp_seq=1 ttl=64 time=0.127 ms
64 bytes from 10.1.1.254: icmp_seq=2 ttl=64 time=0.033 ms

We are able to ping the host from inside the new network namespace

Can we go further and ping Facebook ?

Yes we can !


# ping facebook.com
ping: facebook.com: Name or service not known

No we can't... What is missing ?

Routes and NAT !


## This is run on the host
$ iptables -I FORWARD -i br0 -o wlp4s0 -j ACCEPT
$ iptables -I FORWARD -i wlp4s0 -o br0 -m state --state RELATED,ESTABLISHED -j ACCEPT
$ iptables -t nat -I POSTROUTING -s 10.1.1.0/24 -j MASQUERADE
## This is run inside the netns
$ ip route
10.1.1.0/24 dev veth1  proto kernel  scope link  src 10.1.1.1
$ ip route add 10.1.1.254 dev veth1 scope link
$ ip route add default via 10.1.1.254 dev veth1

And now ?


$ ping facebook.com
PING facebook.com (31.13.90.36) 56(84) bytes of data.
64 bytes from facebook.com (31.13.90.36): icmp_seq=1 ttl=247 time=26.5 ms
64 bytes from facebook.com (31.13.90.36): icmp_seq=2 ttl=247 time=25.9 ms

NetNS have other fun properties

  • Own routing table
  • Own firewalling rules
  • Meaning: endless possibilities
  • Pitfall: The networking stack is terribly handled by Docker

Process isolation

  • The container will not be able to see the processes running on the host
  • The host will be able to see the processes running in the container
  • The process launched in the container will be PID 1

UserID shift

  • A shift will be applied to all the userids
  • From the container point of view, root will be 0
  • From the host point of view, root will be 0 + SHIFT
  • That provides security in case of escape
  • That provides security regarding host file access

Resource share

  • Uses cgroups
  • Work on cgroup was started by Google in 2006
  • Allows resource limiting, prioritization, isolation, accounting and control over processes
  • Simple interface via a virtual FS: /sys/fs/cgroup/
  • What to remember ? cgroups enable you to limit which resources processes can access (quick example below)
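
A quick sketch with the cgroup v1 memory controller (the group name is arbitrary): create a group, cap it at 256MB, then move the current shell into it; every child process inherits the limit.

root@prod # mkdir /sys/fs/cgroup/memory/demo
root@prod # echo $((256 * 1024 * 1024)) > /sys/fs/cgroup/memory/demo/memory.limit_in_bytes
root@prod # echo $$ > /sys/fs/cgroup/memory/demo/cgroup.procs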

Today we are going to talk about Docker

Docker is pretty trendy these days, but the principles it relies on are way older than you and me

What is docker ?

  • A container engine created in 2013
  • It has a full featured ecosystem
  • Implements network isolation
  • Implements FS isolation
  • Implements process isolation
  • Since 1.10 implements user id shift

Docker: Build. Ship. Run. Any app. Anywhere

The next part is heavily inspired from Balthazar Rouberol's presentation. Thanks mate <3

Build

  • Creates a base image that is reproducible and immutable
  • The image contains the application and all its dependencies
  • Once the image is built, it cannot be changed
  • You end up with the same image in dev, staging and prod environments
  • An image can be used as a base image for another one

Docker images

  • They are stacked layers, you can view them as binary diffs
  • They are built using a Dockerfile
  • Each Dockerfile instruction is a layer of the final image
  • Each layer is a diff with the layer before
  • Layers are immutable

Example Dockerfile


FROM debian:stretch
MAINTAINER Thomas Maurice <thomas@maurice.fr>

RUN apt-get update && \
    apt-get upgrade -y && \
    apt-get install -y \
        python \
        python-pip \
        openvpn && \
    pip install j2cli && rm -rf /var/cache/apt

COPY run.sh /
COPY openvpn.conf.j2 /etc/openvpn/

ENTRYPOINT ["/run.sh"]

  • We've created an OpenVPN Docker image
  • Based on Debian Stretch
  • At startup /run.sh will be executed

Ship

  • Docker provides a way to store images: registries
  • Public and private registries
  • Open source images :)
  • Registry access control (read/write per user/group)

Run

  • Downloads the base image
  • Creates a writable layer on top of the image layers
  • Creates cgroups and launches the container (example below)
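
With the OpenVPN image from the Dockerfile above, the build/ship/run cycle boils down to something like this (image and registry names are made up):

$ docker build -t registry.example.com/ops/openvpn:1.0 .
$ docker push registry.example.com/ops/openvpn:1.0    # ship it to a registry
$ docker run -d --name vpn registry.example.com/ops/openvpn:1.0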

Container != Image

  • An image is immutable, not the container !
  • A container is an image plus a writable layer

Any App

  • Statically compiled Go binary
  • Elasticsearch (Java)
  • Python API
  • Random bash script
  • C++ rendering software
  • Basically whatever, since the deps are embedded

Anywhere

  • Linux
  • Windows
  • MacOS
  • Dedicated server
  • Virtual machine

Summary: Architecture

Why is Docker a revolution ?

  • It enables developers to be sure of the behaviour of their apps
  • It eases the deployment process for the admins
  • It plays well with CI/CD processes
  • It heavily encourages mutualization

But

Good practices

  • You usually want to run only one process per container
  • Use Dockerfiles to build your images
  • Rebuild them on a regular basis, to patch CVEs
  • Use CI/CD to perform automated and reproducible builds
  • Isolation does not mean you don't have to test your application
  • Use image tags to ease rollback
  • Use a secret management system to manage your credentials (i.e. Vault)

What if you had to manage all these containers on your own ?

Container orchestration

What is an orchestrator ?

  • A piece of software responsible for dispatching containers across a cluster of slave machines
  • They can be either sophisticated (Mesos/Kubernetes/Tupperware) or easier to use (Docker Swarm)

Examples

  • Docker Swarm
  • Kubernetes (k8s)
  • Apache Mesos
  • Tupperware, the one we use at Facebook

Apache Mesos

  • Used by Apple, Twitter and OVH, for instance
  • Reported to scale at up to 10K slave nodes
  • I will talk mainly about this one because I have a solid experience with it

How does it work ?

Mesos is split into 4 important parts:

  • Slave
  • Master
  • Scheduler
  • Executor

All of these components communicate via APIs

Mesos Slave

  • Accounts for the resources it has available
  • Offers them to the master
  • Launches tasks when requested to
  • Launches tasks using a given Executor
  • Sends reports to the master

Mesos Master

  • Aggregates the offers of the slaves
  • Sends offers to schedulers
  • If a framework accepts an offer, it asks a slave to spawn a task
  • One master is elected using a ZooKeeper quorum
  • The Mesos Master is highly available

Scheduler

  • Responsible for starting tasks on the cluster
  • Receives offers from the master, and asks it to start tasks accordingly
  • May implement a constraint system
  • Has direct interaction with the user (developer/admin)
  • May be developed for a particular application (ElasticSearch, Hadoop...)
  • Responsible for restarting tasks if they die

Executor

  • Binary usually associated to a scheduler
  • Designed to launch the tasks requested by this framework
  • May be customized to implement some business logic
  • Mesos provides by default some executors (command, docker for instance)

As long as you follow the docs, you can code your own executor

Overview

Some frameworks examples

  • Marathon for long running tasks
  • Chronos for distributed cron-like jobs
  • ElasticSearch for managing ES clusters

Zoom on Marathon

Marathon

  • Backed by Mesosphere
  • Written in Scala
  • Natively HA using Zookeeper like Mesos
  • Specialized in long running jobs

Why would you use it ?

  • It is highly available
  • It offers advanced features (redeploy strategies, health checks)
  • It implements a powerful constraint system
  • It is stable
  • It offers both a nice UI, and a well documented REST API

Example features

  • Allows one-click-scaling
  • One-cURL-scaling too, obviously (see the sketch below)
  • Can spread containers on the hosts for resilience
  • Can kill and restart unhealthy tasks
  • Performs task reconciliation in case of netsplit
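
For instance, deploying then scaling an nginx app through Marathon's REST API could look like this (hypothetical Marathon endpoint, minimal app definition):

$ curl -X POST http://marathon.example.com:8080/v2/apps \
    -H "Content-Type: application/json" \
    -d '{
      "id": "/web/nginx",
      "cpus": 0.5,
      "mem": 128,
      "instances": 2,
      "container": {
        "type": "DOCKER",
        "docker": {"image": "nginx", "network": "BRIDGE",
                   "portMappings": [{"containerPort": 80, "hostPort": 0}]}
      }
    }'

# one-cURL-scaling: just bump the number of instances
$ curl -X PUT http://marathon.example.com:8080/v2/apps/web/nginx \
    -H "Content-Type: application/json" \
    -d '{"instances": 5}'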

Demo time !

Let's boot a Mesos cluster on my laptop shall we ? :)

What's the problem with our nginxes ?

The ports are assigned randomly :(

How can we fix that ?

Let's use a load balancer !

marathon-lb

  • Grabs the tasks' port mappings using Marathon's API
  • Generates HAproxy configuration
  • Reloads the configuration of the HAproxy

What happens if we periodically reload the load balancer ?

  • Active connections are broken
  • New connections are refused during the reload

This is not acceptable for a production environment !

Graceful reload

  • Use HAproxy soft finish
  • Use iptables SYN drop
  • Or use a qdisc-based approach, like Yelp does
  • Or use the new HAproxy features that allow you to do that properly :)

Soft finish

  • The current HAproxy stops listening on its ports
  • It finishes processing all the active connections
  • A new one is spawned using the new configuration

But the new connections are still refused during the time of the reload

iptables SYN drop

  • Marathon-lb regenerates the configuration
  • It drops all the SYNs on HAproxy ports
  • It reloads the configuration
  • It removes the iptables rule (commands sketched below)
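
A sketch of the trick, assuming HAproxy listens on port 80: drop inbound SYNs, reload, then remove the rule; clients retransmit their SYN and connect as soon as the rule is gone.

root@lb # iptables -I INPUT -p tcp --dport 80 --syn -j DROP
root@lb # haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid \
              -sf $(cat /var/run/haproxy.pid)
root@lb # iptables -D INPUT -p tcp --dport 80 --syn -j DROP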

TCP is designed to retransmit a SYN after one second if the first one is dropped

No connections are dropped !

But we have a one second delay on incoming requests

Using qdisc

  • The incoming SYNs are sent to a buffer
  • Before the reload we lock the buffer, keeping the packets in
  • We reload
  • After the reload, we let the SYNs flow again
  • It took like 20ms

Microservice architectures

How software was usually built

  • Monolithic software
  • Deployed on one machine
  • Vertical scaling
  • Painful upgrades

But containers introduced a new way of building stuff

Introducing microservices

Microservices

  • Split each logical function into a separate component
  • Each component can be developed separately
  • Each component can be upgraded separately
  • Components communicate via APIs or message queues
  • Each component can be written in different languages according to your needs

It plays well with containers

  • Each component is a separate image
  • Each component can be maintained by a separate team
  • Each component can be written in a different language
  • Each component can be scaled accordingly

Interfacing components together

  • You can use REST APIs to speak HTTP(S hopefully)
  • You can also use RPC !

What is RPC ?

  • Remote Procedure Call
  • It is basically a contract between the server and the client, which agree on the data format they will use
  • These come with code generation tools to generate your client libraries
  • The interface specification is usually human friendly

Example: Thrift


service MultiplicationService
{
        i32 multiply(1:i32 n1, 2:i32 n2),
}

If I want to use it in a simple Python script


# I will get a thrift client with my destination ip:port
result = client.multiply(42, 2)
# And that's as simple as that

Other advantages

  • You do not have to worry about client code
  • Generated code is usually backward-compatible, and if not, super easy to rebuild
  • RPC frameworks usually support middleware which makes implementing retries seamless
  • They also allow you to abstract away all the secure communication issues

But separate components means new potential issue

8 fallacies of distributed computing

  • The network is reliable.
  • Latency is zero.
  • Bandwidth is infinite.
  • The network is secure.
  • Topology doesn't change.
  • There is one administrator.
  • Transport cost is zero.
  • The network is homogeneous.

The network is reliable

  • Sometimes network equipment fails
  • For geographically-wide networks, you may experience fiber cuts
  • The network can be overloaded (see 3.)
  • That means that you have to implement retries in your code, and handle these errors (see the sketch below)
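
A tiny shell sketch of the idea (the URL is made up): retry a call a few times with a growing delay instead of failing on the first network error.

# retry with exponential backoff instead of giving up on the first error
for attempt in 1 2 3 4 5; do
    curl -sf https://api.example.com/health && break
    sleep $((2 ** attempt))
done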

Latency is zero

  • Your packets have to travel to get from one machine to another
  • For instance, the latency between London and New York is about 90ms
  • The latency is never constant, because your packets can travel through various paths
  • That means that you have to bring your latency sensitive components close together geographically

Bandwidth is infinite

  • It is not. Your network equipment has an upper-bound capacity
  • You have to take it into account when designing bandwidth-hungry systems (think video streaming)

The network is secure

  • It is not. Even if your "owned" network is, when your packets are outside they can be intercepted by anyone who has a physical access to network devices (ISPs, Governments, ...)
  • You need to implement transport security !
  • This specific point is going to be developed more extensively later on :)

Topology does not change

  • Stuff can move around, for instance your database server may be switching IP at some point
  • That means: do not hardcode anything that depends on the present state of the world in your applications

There is one administrator

  • There is never only one person in charge
  • You have to socialize what you do/want to do with other people working with you

Transport cost is zero

  • Everything has a cost, transport cost can be one of the following: latency, size and computing
  • Take it into account, be aware of it, and make tradeoffs

The network is homogeneous

  • It is not. London does not have the same connectivity as Lens for instance
  • If you do not take that into account and base choices on this assumption, your end users may have an unpleasant experience

Some other issues you may encounter

    Database crash due to massive data accesses

    • Implement caching
    • Use relevant cache TTL
    • Have a cache invalidation policy

    Worker overload due to input overload

    • Use a worker-queue system
    • Scale up/Autoscale according to your load
    • For anything that is not realtime processing, use asynchronous processing as much as possible

    Transport security

    What problem are we trying to solve ?

    • The client wants to be sure of the server's identity
    • The server may want to be sure of the client's identity
    • Both definitely want the data exchange to be private

    How do we achieve that ?

    Mutual authentication: x509 digital certificates

    • They are composed of two parts: a certificate, and a private key
    • The certificate contains information about the server (or the client), like IP addresses, domain names that we can validate
    • It also contains the public key you will use to encrypt messages with
    • It has an expiration date
    • And a whole bunch of other metadata
    
    $ openssl s_client -connect facebook.com:443
    CONNECTED(00000003)
    depth=2 C = US, O = DigiCert Inc, OU = www.digicert.com, CN = DigiCert High Assurance EV Root CA
    verify return:1
    depth=1 C = US, O = DigiCert Inc, OU = www.digicert.com, CN = DigiCert SHA2 High Assurance Server CA
    verify return:1
    depth=0 C = US, ST = California, L = Menlo Park, O = "Facebook, Inc.", CN = *.facebook.com
    verify return:1
    ---
    Certificate chain
     0 s:/C=US/ST=California/L=Menlo Park/O=Facebook, Inc./CN=*.facebook.com
       i:/C=US/O=DigiCert Inc/OU=www.digicert.com/CN=DigiCert SHA2 High Assurance Server CA
     1 s:/C=US/O=DigiCert Inc/OU=www.digicert.com/CN=DigiCert SHA2 High Assurance Server CA
       i:/C=US/O=DigiCert Inc/OU=www.digicert.com/CN=DigiCert High Assurance EV Root CA
    

    What does it mean ?

    • It basically tells you the identity of who you are talking to
    • C = US, the organization owning the certificate is US based
    • ST = CA, more precisely in California
    • L = Menlo Park, that should be obvious by now
    • O = "Facebook, Inc", ... ?
    • CN = *.facebook.com, this is the common name of the certificate, basically all the domains for which this one is valid

    Yeah but I can fake that

    
    openssl req -newkey rsa:1024 -nodes -keyout /dev/null -x509 \
        -days 365 -out /dev/stdout | \
        openssl x509 -noout -text
    Certificate:
        Data:
            Version: 3 (0x2)
            Serial Number:
                ec:51:da:68:79:80:79:74
        Signature Algorithm: sha256WithRSAEncryption
            Issuer: C = US, ST = CA, L = Menlo Park, O = "Facebook, Inc", CN = *.facebook.com
            Validity
                Not Before: Nov 21 18:36:34 2017 GMT
                Not After : Nov 21 18:36:34 2018 GMT
            Subject: C = US, ST = CA, L = Menlo Park, O = "Facebook, Inc", CN = *.facebook.com
    
    

    Having an identity is cool, trusting it is better

    • Let's go back a few slides and look at the actual Facebook certificate
    • The certificate embeds a chain of trust
    • Meaning that each certificate has been signed by the previous one in the chain...
    • ...up to a root one that you can trust, and is installed by default on your machine

    This way the client can know for sure it is speaking to Facebook !

    This mechanism is also valid for whenever the server wants to authenticate the client

    A pitfall: if your system's certificate store is compromised, an attacker could impersonate the CA

    TLS handshake - source IBM

    Summary

    • Using TLS allows your clients to identify your servers
    • It also allows your servers to identify your clients
    • It makes sure the communication is secure and trusted within your network

    Monitoring and alerting

    Why do you need monitoring ?

    • Have an idea of how your system behaves at any point in time
    • Be able to track capacity needs for your system
    • Be able to track performance regressions
    • Be able to detect and fix an issue before it impacts your end users

    What can you monitor ?

    • The CPU and RAM consumption of a given process on a machine
    • The network ingress/egress on your whole infrastructure
    • Trends across time for all of the above
    • Proportions of 5xx errors on your web traffic

    Any variation on each of these may or may not indicate a problem

    • A 10% CPU increase during peak hour is probably fine
    • A 75% network egress drop on your video streaming service is probably not
    • An overall 5% rise in the proportion of 5xx errors on your web machines may follow the introduction of a new bug in prod

    How do you pick metrics that are relevant for your monitoring needs ?

    • There is one rule: there is no general rule
    • Your best judgment is your friend here, pick those that are relevant to what defines your application's behaviour.

    Examples

    • If you are running a latency sensitive web application, you want to monitor the time a request takes to complete
    • If you are running a processing pipeline, you want to check on the quantity of data processed by unit of time
    • If you are running a file sharing service, keeping track of your network traffic is probably a good idea
    • If your service calls other services, you may want to log how many calls it is performing and their respective performances

    How to do that ?

    • Parse logs and transform them into metrics (httpd logs -> counters of return codes)
    • Even better, instrument your applications to report metrics
    • Example: Prometheus (sample scrape output below)
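
    As an illustration (service, port and numbers are made up), an application instrumented with a Prometheus client library just exposes its counters and histograms over HTTP, and the monitoring system scrapes them periodically:

    $ curl -s http://localhost:8000/metrics
    # HELP http_requests_total Total HTTP requests served
    # TYPE http_requests_total counter
    http_requests_total{code="200"} 10423
    http_requests_total{code="500"} 17
    # HELP request_latency_seconds Request latency
    # TYPE request_latency_seconds histogram
    request_latency_seconds_bucket{le="0.1"} 9876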

    How do I want to track my metrics ?

    • Do you want to track them at host level ?...
    • ... or at aggregated levels ? (i.e. per cluster/service/...)

    How should I aggregate my timeseries ?

    • Min ?
    • Max ?
    • Average ?
    • pXX ?

    pXX is usually a good start

    • Averaging can hide outliers and hide specific problems
    • pXX is usually cool because it can give you how the XX percentile of your metrics behave and you can compare them
    • Example: the p99 latency is the latency that 99% of your requests will be under, so it actually tells you that your slowest 1% of requests take longer than this value

    Alerting

    • Alerting is the action of pinging an actual human to fix an issue
    • Do not alert unless something is actually broken
    • Alert and page people only when there is a real user or business impact
    • Otherwise you end up in a situation where the human will ignore the alert
    • Alerts should be actionable by a human

    That's all for now, questions ? :)

    Configuration management

    Dafuq is that again ?

    Remember configuring your wordpress earlier yesterday

    That was manual, wasn't it ?

    Now imagine

    • Doing that on 1000 machines
    • ... Without committing any mistakes

    Yeah but Thomas, you know, I could just put it in a script and run it manually in a for loop over SSH

    Just in case you did not get it, that would make me angery

    You *could* technically do that

    • But how about upgrades ?
    • How about rerunning your script on some nodes that failed ?
    • How about the very concept of maintainability ?

    A bash script does not qualify as an appropriate solution

    Configuration management to the rescue !

    What is this ?

    • It is a way for you to manage your hosts' configs
    • Install and setup packages
    • Drop config files
    • Control services' restarts
    • Basically tweak any aspect of the system

    Ideally it has the following characteristics

    • It uses "code" to express that
    • It is mostly idempotent
    • It uses text files, so it can be versioned easily

    There are 3 main actors here

    • Puppet
    • Chef
    • Ansible

    Puppet

    • Written in Ruby
    • Configs using a DSL
    • Master - Agent architecture

    Chef

    • Written in Ruby
    • Configs using plain old Ruby code
    • Master - Agent architecture

    Ansible

    • Written in Python
    • Configs using YAML
    • Masterless and agentless !

    We are going to talk about ansible

    Ansible

    • Written in Python
    • Configs using YAML
    • Masterless and agentless !
    • Configuration instructions are stored in playbooks
    • Playbooks are a set of roles
    • Roles have a set of tasks and handlers, and possibly files and templates

    What does it look like ?

    
    $ cat base.yml
    - hosts: all
      roles:
        - root_user
        - base
        - ssh
        - ntp
        - bash
        - iptables
        - vim
        - tmux
        - ssl
        - telegraf
    

    What does it look like ?

    
    $ cat roles/ssh/tasks/main.yml
    ---
    - name: "Install ssh"
      apt: name=openssh-server
    - name: "Install mosh"
      apt: name=mosh
    - name: "Service"
      service: name=ssh enabled=yes state=started
    - name: "Config sshd_config"
      copy: src=sshd_config dest=/etc/ssh/sshd_config
      notify: "Restart ssh"
    
    

    What does it look like ?

    
    $ cat roles/ssh/handlers/main.yml
    ---
    - name: "Restart ssh"
      service: name=ssh state=restarted
    

    What did we just do ?

    Install openssh and configure it
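
    To apply it, point ansible-playbook at an inventory of hosts (the inventory path here is made up); it connects over SSH, no agent required:

    $ ansible-playbook -i inventories/production base.yml
    # add --check --diff first for a dry run of what would change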

    Why is it cool ?

    • It is structured and readable
    • I can easily edit the config
    • I can run it several times in a row without side effects
    • The same config is going to be applied to all the nodes :)

    Practical work now !