Rust, signals, and rowing machines

Recently I have started using Rust again. I had a brief interaction with it a few years ago and found it lacking (pre-1.0). This time around has been far better!

I started working on a program to interact with my rowing machine. It is a WaterRower with an S4 monitor. This comes with a serial interface on which the rower will spit out loads of raw data. The plan is to take that output and do fun things with it. Logging workouts, analyzing data, and possibly even a basic game for it.

I created a Rust project for interacting with the rowing machine. It is very basic right now, mostly helping me get my brain around the, admittedly ugly, Rust syntax. In v0.1.0, the idea is to have one thread for interacting with the serial device, one thread for displaying the data, and the main thread for keeping track of these other threads.

To initiate the connection with the WaterRower and have it actually start sending data over the serial port, I have to send it a very specific line, a UTF-8 encoded string that reads “USB\r\n”. It will continue to spit out data until I send it another string, “EXIT\r\n”. If I do not send that last string, the rower’s serial buffer will fill up and it will reset and crash, maybe not in that order. Either way, it is less than ideal.
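
For reference, the whole handshake can be exercised from a shell before any Rust is involved. The device node and raw-mode settings below are assumptions for how the S4 shows up on my machine and may differ on yours:

# stty -F /dev/ttyACM0 raw -echo
# printf 'USB\r\n' > /dev/ttyACM0
# timeout 5 cat /dev/ttyACM0
# printf 'EXIT\r\n' > /dev/ttyACM0

The timeout just gives a few seconds of raw packets to look at; the final EXIT line is the part that must never be skipped.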

The problem is, when I send Ctrl+C or otherwise kill the program from Linux, it doesn’t properly exit. This is where signals come in. Pressing Ctrl+C actually sends the process a signal, SIGINT, and killing the process sends SIGTERM (by default). Ideally, I want to handle these signals gracefully inside my program. Unfortunately, there is no good way to handle signals in Rust from what I have seen.

The best way I have found involves a global bool. Here is a small snippet of the basic structure I used to handle signals. I have annotated the code, so consider it part of the post.

extern crate nix;

use nix::sys::signal;
use std::sync::atomic::{AtomicBool, Ordering, ATOMIC_BOOL_INIT};
use std::thread;
use std::time::Duration;

// define EXIT_NOW bool (ATOMIC_BOOL_INIT is false)
static EXIT_NOW: AtomicBool = ATOMIC_BOOL_INIT;

// define what we do when we receive a signal
extern fn early_exit(_: i32) {
    println!("Caught signal, exiting!");
    // set EXIT_NOW bool to true
    EXIT_NOW.store(true, Ordering::Relaxed);
}

fn main() {
    // define an action to take (the key here is 'signal::SigHandler::Handler(early_exit)',
    //    early_exit being the function we defined above)
    let sig_action = signal::SigAction::new(signal::SigHandler::Handler(early_exit),
                                            signal::SaFlags::empty(),
                                            signal::SigSet::empty());
    // use sig_action for SIGINT
    unsafe { signal::sigaction(signal::SIGINT, &sig_action).expect("failed to set SIGINT handler"); }
    // use sig_action for SIGTERM
    unsafe { signal::sigaction(signal::SIGTERM, &sig_action).expect("failed to set SIGTERM handler"); }

    // spawn a new thread
    let handle = thread::spawn(move || {
        // loop forever
        loop {
            // check if bool, EXIT_NOW, is true
            if EXIT_NOW.load(Ordering::Relaxed) {
                println!("Cleaning up");
                break;
            }
            println!("Doing something...");
            thread::sleep(Duration::from_secs(1));
        } // end loop
    });

    // block until thread ends (until user sends signal)
    handle.join().expect("worker thread panicked");
}

Notice how we had to declare the sigaction calls as unsafe? That is because they are “unsafe” in the Rust world: they let me do things for which Rust cannot guarantee memory safety. It is still completely possible to use this to write code that does _not_ have any race conditions, but it does make it more difficult. Depending on what the thread is doing, you may need a special EXIT_NOW check in each thread you spawn to ensure a proper exit.

If anyone has a better way to handle signals in Rust, now or in the future, please let me know! I am not super happy about the way EXIT_NOW is declared and used, but it is safe enough. I am always looking for better ways though!

All in all, I am fairly happy with Rust so far. It hasn’t prevented me from doing anything, and it has already stopped me from making a few mistakes that would have resulted in race-condition bugs that could have been very difficult to track down.

bonding, bridging, and port density

Today is the day. You finally got 10Gbe networking! Boy is it expensive though. The price-per-port is very high, $500-$1000 on the low end. Add to that, “production-grade” things usually need to be highly available. This raises the question: how can we best take advantage of this costly networking?

Enter bonding, also known as NIC teaming, channel teaming, or link aggregation. Bonding can solve the issue of high availability with several different modes of operation. The most common ones are active-backup and LACP. As far as taking advantage of ports, active-backup is the worst: it provides only high availability, leaving an entire port unused until a failure happens. Given the price-per-port, this is not ideal. A better solution is LACP, since it will use both interfaces to send and receive. But that requires switch support and carries protocol overhead, which means bonding 2 interfaces with LACP does not yield 2x the bandwidth.

For me, the solutions above were less than ideal. I have a home lab with a small 8 port 10Gbe switch (XS708E) and every port counts. I have three servers with two 10Gbe nics each (Intel X540-T2). I simply can’t afford (figuratively and literally) to let a single interface sit unused with active-backup bonding, and LACP support on the switch was difficult to use and configure. This led me to using all of my ports without high availability, without bonding of any kind. But, I had an idea….

I have figured out a way to have each interface essentially in two active-backup bonds. This allows me to use each interface 100% without affecting the other unless an interface has failed, at which point the traffic is merged together into a single physical port. It looks like this:

[diagram: fun-networking-madness01 — each physical interface (eth2, eth3) is the primary of its own active-backup bond, and each bond’s backup member is a veth whose peer sits in the other bridge]

In this example I have two physical interfaces, eth2 and eth3. The rest of the interfaces we will be creating now.

I start by creating two bridges:

# ip l a br0 type bridge
# ip l a br1 type bridge

I then create two bonds:

# ip l a bond0 type bond
# ip l a bond1 type bond

I then have to create my veth pairs for linking the bonds to the bridges.

# ip l a veth00 type veth peer name veth01
# ip l a veth10 type veth peer name veth11

Finally, we need to plug in all the veth pairs and add our interfaces to the bond.

# ifenslave bond0 eth2 veth00
# ifenslave bond1 eth3 veth10
# brctl addif br0 bond0
# brctl addif br0 veth11
# brctl addif br1 bond1
# brctl addif br1 veth01

And there you have it. Fancy networking. You’ll want to address the bridges as if they are the physical interfaces. In my case, br0 == eth2, and br1 == eth3. Things to check on if you are having issues (example commands follow the list):

  • bonds are in active-backup mode
  • all bonds, interfaces, bridges, and veth pairs are in state UP
  • the physical interface is set as the primary interface in each bond
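
Here is roughly how I check those three items, assuming the bonding driver’s standard proc/sysfs layout: the first command shows the mode and the currently active slave, the echo lines set the primary, and the loop brings everything UP.

# cat /proc/net/bonding/bond0
# echo eth2 > /sys/class/net/bond0/bonding/primary
# echo eth3 > /sys/class/net/bond1/bonding/primary
# for i in eth2 eth3 veth00 veth01 veth10 veth11 bond0 bond1 br0 br1; do ip l s dev $i up; done

If a bond came up in the default balance-rr mode, the easiest fix is to delete it and recreate it with `ip l a bond0 type bond mode active-backup` before enslaving anything.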

I am not sure whether I will stick with this configuration in the long run, but that is what I have been running for a little while now and it is working as well as I could hope. There is no noticeable delay when either interface drops and the failover occurs as it would in a normal bonding setup (polling defaults to 500ms). There is also no measurable performance degradation when failing over and having to traverse two bridges. Overall, I am very happy with the entire setup.

Bonus: Here are the commands used to implement this in openvswitch. One caveat is that I could not find a way to enforce a default or primary bond member, so you must run the `ovs-appctl` command to set the active slave in the bond after a failover. I solved this with a simple timer/cron job in systemd set to run every minute (a sketch of that unit follows the OVS commands below). Not ideal, but certainly functional.

# ovs-vsctl add-br br0
# ovs-vsctl add-br br1
# ovs-vsctl add-bond br0 bond0 eth2 veth00 -- set port bond0 bond_mode=active-backup -- set interface veth00 type=patch options:peer=veth01
# ovs-vsctl add-bond br1 bond1 eth3 veth10 -- set port bond1 bond_mode=active-backup -- set interface veth10 type=patch options:peer=veth11
# ovs-vsctl add-port br0 veth11 -- set interface veth11 type=patch options:peer=veth10
# ovs-vsctl add-port br1 veth01 -- set interface veth01 type=patch options:peer=veth00
# ovs-appctl bond/set-active-slave bond0 eth2
# ovs-appctl bond/set-active-slave bond1 eth3
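
Something like the following unit pair does the once-a-minute re-pinning; the unit names are arbitrary, and the leading '-' on each ExecStart lets the unit succeed even while an interface is down.

# cat << EOF > /etc/systemd/system/ovs-bond-primary.service
[Unit]
Description=Re-pin OVS active-backup bonds to their primary members

[Service]
Type=oneshot
ExecStart=-/usr/bin/ovs-appctl bond/set-active-slave bond0 eth2
ExecStart=-/usr/bin/ovs-appctl bond/set-active-slave bond1 eth3
EOF
# cat << EOF > /etc/systemd/system/ovs-bond-primary.timer
[Unit]
Description=Run ovs-bond-primary.service every minute

[Timer]
OnCalendar=minutely

[Install]
WantedBy=timers.target
EOF
# systemctl daemon-reload
# systemctl enable --now ovs-bond-primary.timer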

letsencrypt, haproxy, and auto-renewal

If you have not heard about letsencrypt, it is an amazing, and free, certificate authority. It provides free (as in beer) ssl certificates for anyone who can prove they own the domain. There is a little helper utility called certbot that you can use to get a cert. The concept is pretty simple; a quick breakdown looks like this:

  1. Client says to Server “I want a cert for x.y.z domain”
  2. Server says: prove you own the domain by serving the file “/.well-known/acme-challenge/1234567890abcdef” from the x.y.z domain
  3. Client sets up the file as requested
  4. Server verifies file exists
  5. Server issues certificate for x.y.z domain

I have glossed over the massive amount of research and security involved in doing all of this, but that is the general concept. Of note, the certificate is only valid for 3 months and cannot be a wildcard cert.

Now let’s talk about the issue. Once you receive your certificate, you use it on your webserver or application, which is now consuming the ports letsencrypt expects to use, namely 80 and 443. How do you renew the certificate without stopping the services? Enter haproxy. If you happen to be loadbalancing through haproxy, you are in luck! You can host your site _and_ still do proper renewals with no downtime. The way it works is quite simple: haproxy can check certain things about the request and trigger conditions based on that. In this case we will be testing whether the URI begins with “/.well-known/acme-challenge”. If it does, we know to forward that to our certbot client. Here is how the haproxy config looks.

frontend ssl_redirector
    bind 1.1.1.1:443 ssl crt /etc/haproxy/ssl/
    http-request del-header X-Forwarded-Proto
    http-request set-header X-Forwarded-Proto https if { ssl_fc }

    # Check if this is a letsencrypt request based on URI
    acl letsencrypt-request path_beg -i /.well-known/acme-challenge/
    # Send to letsencrypt-backend if it is a letsencrypt-request
    use_backend letsencrypt_backend if letsencrypt-request

    default_backend website_backend

frontend http_redirect
    bind 1.1.1.1:80
    # Check if this is a letsencrypt request based on URI
    acl letsencrypt-request path_beg -i /.well-known/acme-challenge/

    # Redirect to HTTPS if this is not a letsencrypt-request
    redirect scheme https code 301 if !letsencrypt-request
    # Send to letsencrypt-backend if it is a letsencrypt-request
    use_backend letsencrypt_backend if letsencrypt-request

backend letsencrypt_backend
    server letsencrypt 127.0.0.1:49494

backend website_backend
    server server01 192.168.1.1:80
    server server02 192.168.1.2:80

Let’s go through each section. The first section, ‘ssl_redirector’, listens on public ip 1.1.1.1 and port 443. It has all the certs it can serve in /etc/haproxy/ssl/. It sets the X-Forwarded-Proto header to https (some applications may require this). The next part is the meat of the issue we are solving: the acl checks if the path begins with “/.well-known/acme-challenge/” and, if it does, sends the request to the “backend letsencrypt_backend” section. All of this is the same for the ‘http_redirect’ section. If it doesn’t detect anything letsencrypt related, it forwards the request to one of the ‘website_backend’ servers like normal.

So that’s it. Haproxy will now detect and forward letsencrypt requests to a server located at “127.0.0.1:49494”. Now it’s time to set up that server.

I wrote a little bash script to do a cert renewal using a docker container I created. The docker container is samyaple/certbot. It is based on the github repo SamYaple/certbot. It builds automatically in DockerHub when I push changes to the github repo, which is pretty sweet. That’s a subject for another time, though. The script I use for autorenewing is here. I’ve left comments throughout the script to explain why certain code gets run.

#!/bin/bash
# cert_renewal.sh

set -o errexit

FQDN=$1

# This should only run when fetching a new cert
function http_failback {
    docker run --rm -v /etc/letsencrypt:/etc/letsencrypt -p 127.0.0.1:49494:49494 samyaple/certbot:v0.8.1 --standalone --standalone-supported-challenges http-01 --http-01-port 49494 -d ${FQDN}
}

function fetch_certs {
    # If SNI fails, fail back to http authorization
    docker run --rm -v /etc/letsencrypt:/etc/letsencrypt -p 127.0.0.1:49494:49494 samyaple/certbot:v0.8.1 --standalone --standalone-supported-challenges tls-sni-01 --tls-sni-01-port 49494 -d ${FQDN} || http_failback
}

function install_certs {
    if [[ -e "/etc/letsencrypt/live/${FQDN}/fullchain.pem" ]]; then
        cat /etc/letsencrypt/live/${FQDN}/{fullchain.pem,privkey.pem} > /etc/haproxy/ssl/${FQDN}.pem
    fi
}

fetch_certs
install_certs

systemctl reload haproxy

You execute this script with the parameter of your domain name and you are golden. It should create/renew your ssl cert and then reload haproxy. This can be made into a cronjob or simply run every 2 months to ensure renewal before the cert expires.

`./cert_renewal.sh test.example.com` will produce a test.example.com.pem file that haproxy will be able to use. Magic!
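
To automate it, a cron entry along these lines works; the script location, log file, and every-two-months schedule are just examples (this one runs at 03:00 on the first day of every other month):

# cat << EOF > /etc/cron.d/cert-renewal
0 3 1 */2 * root /usr/local/bin/cert_renewal.sh test.example.com >> /var/log/cert_renewal.log 2>&1
EOF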

bcache, partitions, and DKMS

bcache is a fantastic way to speed up your system. The idea is simple: you have a fast-but-small SSD and a slow-but-large HDD and you want to cache the HDD with the SSD for better performance. This is what bcache was designed for and it does its job very, very well.

It has been in the kernel since 3.10 and according to the author, Kent Overstreet, it has been stable since 2012. I have personally been using it since it was announced as “stable” and haven’t ever had any data corruption issues with it. I even wrote a how-to on it a few years back. I highly recommend it and I am actually a bit shocked it hasn’t gotten more attention from the community at large. My guess is that the reason for that is it requires you to reformat your disk (or use a tool to convert the disk by shuffling metadata blocks) and this turns some people off of the idea.

One of the pain points I just ran into (at the time of this writing) is that bcache devices are not set up to allow partitions. This is because each bcache device is allocated only 1 minor number when it is created in the kernel. Normally this would be ok; you could use tools like kpartx to generate a target for that partition. But in this case I needed a true partition that was predictable and autogenerated by the kernel, for use with some existing tooling that expected to be able to take a raw block device, create a partition table, and then address the partitions on it. That tool was ceph-disk, while playing around with the new bluestore OSD backend in Jewel.

The fix for this only-allocates-one-minor-number issue is really quite simple, two lines of code. The issue is compiling the new module and using it. I long ago stopped rolling my own kernel and nowadays I only do so when I am testing a new feature (like bcachefs). That leaves me with a situation where even if I did patch this partition “issue” upstream, I wouldn’t be able to consume that on a stable kernel for a few years.

DKMS saves the day! You have likely had some experience with DKMS in the past, perhaps when compiling zfs.ko or some proprietary graphics module. It is normally a smooth procedure and you don’t even notice you are actually compiling anything. Let’s start by installing dkms and the headers for your target kernel.

# apt-get install dkms linux-headers-$(uname -r)

Thanks to DKMS we can build a new, updated bcache module with the small changes needed to support partitions and use it with your preferred packages and stable kernel. The process was pleasantly simple. Since bcache is an in-tree linux kernel module we need to start by grabbing the source for your running kernel. On debian based systems this can be done with the following command:

# apt-get install linux-source

Now you will see the source tarball for your particular kernel in /usr/src/. We need to extract that. After extraction we need to copy the bcache source files to a new directory for dkms to use. The version number here is made up, in this case you should be able to use anything. I chose the kernel version I was working with.

# cd /usr/src
# tar xvf linux-source-3.16.tar.xz
# mkdir bcache-3.16
# cp -av linux-source-3.16/drivers/md/bcache/* bcache-3.16/

At this point we have copied all we need out of the kernel source tree for this particular module. If you are adapting these instructions for a different module then you may need additional files from other locations. Now that we have the source we can go ahead and make our changes to the code. Here is a patch of the two lines I have changed. Like I said, a very simple code change.

# diff -up a/super.c b/super.c
--- a/super.c   2016-03-31 21:03:25.189901913 +0000
+++ b/super.c   2016-03-31 21:03:08.205513288 +0000
@@ -780,9 +780,10 @@ static int bcache_device_init(struct bca
        minor = ida_simple_get(&bcache_minor, 0, MINORMASK + 16, GFP_KERNEL);
        if (minor < 0)
                return minor;
+        minor = minor * 16;
 
        if (!(d->bio_split = bioset_create(4, offsetof(struct bbio, bio))) ||
-           !(d->disk = alloc_disk(1))) {
+           !(d->disk = alloc_disk(16))) {
                ida_simple_remove(&bcache_minor, minor);
                return -ENOMEM;
        }

Now we need to create a dkms.conf file and use dkms to build and install our new module. Finally, we update our initramfs to pull in the new module on boot.

# cat << EOF > bcache-3.16/dkms.conf
PACKAGE_NAME="bcache"
PACKAGE_VERSION="3.16"
BUILT_MODULE_NAME[0]="bcache"
DEST_MODULE_LOCATION[0]="/updates"
AUTOINSTALL="yes"
EOF
# dkms add -m bcache -v 3.16
# dkms build -m bcache -v 3.16
# dkms install -m bcache -v 3.16
# update-initramfs -u -k all
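
Before rebooting, it is worth making sure the DKMS build is the copy that will actually get loaded:

# dkms status
# modinfo -n bcache

dkms status should report the bcache module as installed for your kernel, and modinfo -n should point at a module somewhere under /lib/modules/$(uname -r)/updates/ rather than the in-tree kernel/drivers/md/ location.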

And there you have it. We are now using our version of bcache, with a slight twist, without a custom kernel! Here are the fruits of our labor, bcache0p1:

# ls -lh /dev/bcache*
brw-rw---- 1 root disk 254,  0 Mar 31 20:17 /dev/bcache0
brw-rw---- 1 root disk 254,  1 Mar 31 20:17 /dev/bcache0p1
brw-rw---- 1 root disk 254, 16 Mar 31 20:17 /dev/bcache16
brw-rw---- 1 root disk 254, 17 Mar 31 20:17 /dev/bcache16p1
brw-rw---- 1 root disk 254, 32 Mar 31 20:17 /dev/bcache32
brw-rw---- 1 root disk 254, 33 Mar 31 20:17 /dev/bcache32p1

OpenStack Neutron, OpenVSwitch, and Jumbo frames

MTU has always been a touchy subject in Neutron. Who manages it? Should instances have info about the underlying infrastructure? Should this all be on the operator to configure properly? Luckily these questions appear to have been answered for the most part in the Mitaka release of OpenStack. This bug for OpenVSwitch (and this one for linuxbridge) more or less solve this issue for us.

Now we can configure MTU in Neutron and Neutron will be intelligent about how to use it. My end goal is to have an infrastructure with 9000 mtu, while the instances themselves can live with the standard 1500 mtu. To achieve that I did have to pull in one patch early, though it looks like it will make it into Neutron’s mitaka-rc2 release. The patch applies the global_physnet_mtu value to br-int and br-tun so the operator doesn’t have to. Beyond that, it was all just a matter of Neutron config options, which is fantastic!

Here are the changes I had to make to properly get Neutron using my larger 9000 MTU without my intervention.

# /etc/neutron/neutron.conf
[DEFAULT]
global_physnet_mtu = 9000

# /etc/neutron/plugins/ml2/ml2_conf.ini
[ml2]
physical_network_mtus = physnet1:9000
# The default value for path_mtu is 1500, if you want your instances to have
# larger mtus you should adjust this to <= global_physnet_mtu
path_mtu = 1500

That was it! My interfaces were properly configured.

160: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UNKNOWN group default 
    link/ether a0:36:9f:67:32:c6 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::a236:9fff:fe67:32c6/64 scope link tentative 
       valid_lft forever preferred_lft forever
161: br-int: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UNKNOWN group default 
    link/ether 66:fd:95:70:37:4a brd ff:ff:ff:ff:ff:ff
    inet6 fe80::64fd:95ff:fe70:374a/64 scope link tentative 
       valid_lft forever preferred_lft forever
162: br-tun: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UNKNOWN group default 
    link/ether 76:37:eb:d7:3b:48 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::7437:ebff:fed7:3b48/64 scope link tentative 
       valid_lft forever preferred_lft forever
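
A quick way to confirm jumbo frames actually make it end to end is to ping another host on the 9000 mtu network with the don’t-fragment bit set and a payload just under the limit (8972 bytes of payload plus 28 bytes of IP/ICMP headers equals 9000); the address here is just an example:

# ping -M do -s 8972 -c 3 192.0.2.11

If anything in the path is still at 1500, the ping fails instead of silently fragmenting.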

A shoutout to Sean Collins (sc68cal) for fixing the br-int/br-tun issue and Armando Migliaccio (armax) for targeting the bug for inclusion in mitaka-rc2 if all goes well!

Glance, Ceph, and raw images

If you are a user of Glance and Ceph right now, you know that using raw images is a requirement for Copy-on-Write cloning (CoW) to the nova or cinder pools. Without that, a thick conversion must occur, which wastes time, bandwidth, and IO. The problem is that raw images can take up quite a bit of space even if it is _all_ zeros, because raw images cannot be sparse images. This becomes even more visible when doing a non-CoW clone from Cinder or Nova to Glance, whose volumes can be quite large.

Enter fstrim. For the uninitiated, OpenStack and fstrim.

Basically the idea is we want to tell that image to drop all of the unused space and thus reclaim that space on the Ceph cluster itself. Please note, this only works on images that do not have any CoW snapshots on them. So if you do have anything booted/cloned from that image you’ll need to flatten or remove those instances/volumes.

We will start with a “small” 80GiB raw image and upload that to Glance.

# ls -l 
total 82416868
-rw-r--r-- 1 root root 85899345920 Mar 21 15:34 ubuntu-trusty-80GB.raw
# openstack image create --disk-format raw --container-format bare --file ubuntu-trusty-80GB.raw ubuntu-trusty-80GB
+------------------+----------------------------------------------------------------------------------------------------------+
| Field            | Value                                                                                                    |
+------------------+----------------------------------------------------------------------------------------------------------+
| checksum         | 9ca30159fcb4bb48fbdf876493d11677                                                                         |
| container_format | bare                                                                                                     |
| created_at       | 2016-03-21T15:38:50Z                                                                                     |
| disk_format      | raw                                                                                                      |
| file             | /v2/images/bb371f84-2a61-47a0-ab22-f4dbf8467070/file                                                     |
| id               | bb371f84-2a61-47a0-ab22-f4dbf8467070                                                                     |
| min_disk         | 0                                                                                                        |
| min_ram          | 0                                                                                                        |
| name             | ubuntu-trusty-80GB                                                                                       |
| owner            | 762565b94f314ec6b370d978db902a78                                                                         |
| properties       | direct_url='rbd://20d283ba-5a51-49b0-9be7-9220bcc9afd0/glance/bb371f84-2a61-47a0-ab22-f4dbf8467070/snap' |
| protected        | False                                                                                                    |
| schema           | /v2/schemas/image                                                                                        |
| size             | 85899345920                                                                                              |
| status           | active                                                                                                   |
| tags             |                                                                                                          |
| updated_at       | 2016-03-21T15:53:54Z                                                                                     |
| virtual_size     | None                                                                                                     |
| visibility       | private                                                                                                  |
+------------------+----------------------------------------------------------------------------------------------------------+

NOTE: If you are running with a cache tier you will need to evict the cache to get proper output from the commands below. Eviction is not required for this process to work, only for the object counts to be accurate.

# rados -p glance-cache cache-flush-evict-all

Now if we look at Ceph we should see it consuming 80GiB of space as well.

# rbd -p glance info bb371f84-2a61-47a0-ab22-f4dbf8467070
rbd image 'bb371f84-2a61-47a0-ab22-f4dbf8467070':
 size 81920 MB in 10240 objects
 order 23 (8192 kB objects)
 block_name_prefix: rbd_data.10661586915
 format: 2
 features: layering, striping
 flags: 
 stripe unit: 8192 kB
 stripe count: 1
# rados -p glance ls | grep rbd_data.10661586915 | wc -l
10240

Sure enough, 10240 * 8MiB per object equals 80GiB. So Ceph is clearly using 80GiB of data on this one image, but it doesn’t have to be!

First step is to remove the snapshot associated with the glance image. This is the reason you cannot have any CoW clones based on this image.

# rbd -p glance snap unprotect bb371f84-2a61-47a0-ab22-f4dbf8467070@snap                                                                                                                                                                                                                    
# rbd -p glance snap rm bb371f84-2a61-47a0-ab22-f4dbf8467070@snap

The process to reduce this usage is to map the rbd to a linux host and mount the filesystem. At that point we can run fstrim on the filesystem(s) and tell Ceph it is ok to free up some space. Finally, we need to re-snapshot it so that moving forward Glance is CoW cloning the sparse rbd. Those steps are as follows:

# rbd -p glance snap unprotect bb371f84-2a61-47a0-ab22-f4dbf8467070@snap                                                                                                                                                                                                                    
# rbd -p glance snap rm bb371f84-2a61-47a0-ab22-f4dbf8467070@snap                                                                                                                                                                                                                           
# rbd -p glance map bb371f84-2a61-47a0-ab22-f4dbf8467070
/dev/rbd0
# mount /dev/rbd0p1 /mnt/
# df -h /mnt/
Filesystem      Size  Used Avail Use% Mounted on
/dev/rbd0p1      79G  780M   74G   2% /mnt
# fstrim /mnt/
# umount /mnt/
# /usr/bin/rbd unmap /dev/rbd0

If you are running a cache tier now would be the time to evict your cache again.

# rbd -p glance info bb371f84-2a61-47a0-ab22-f4dbf8467070                                                                                                                                                                                                                 
rbd image 'bb371f84-2a61-47a0-ab22-f4dbf8467070':
        size 81920 MB in 10240 objects
        order 23 (8192 kB objects)
        block_name_prefix: rbd_data.10661586915
        format: 2
        features: layering, striping
        flags: 
        stripe unit: 8192 kB
        stripe count: 1
# rados -p glance ls | grep rbd_data.10661586915 | wc -l
895

And there it is! Now we are only using 895 * 8MiB objects (~7GiB). That is a 90% reduction in usage. Now why are we using 7GiB and not the 780M the df output shows? fstrim is a quick tool, not a perfect tool. If you want even more efficiency you can use zerofree, or write out a large file full of zeros on the filesystem and delete it before running fstrim. That will further reduce the size of this RBD.
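
For reference, the crude zero-fill variant slots in right before the fstrim/umount steps above; dd exits with “No space left on device” once the filesystem is full, which is expected here:

# dd if=/dev/zero of=/mnt/zerofill bs=1M
# rm /mnt/zerofill
# sync
# fstrim /mnt/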

Finally, to make this image usable to glance again you need to recreate and protect the snapshot we removed previously.

# rbd -p glance snap create bb371f84-2a61-47a0-ab22-f4dbf8467070@snap
# rbd -p glance snap protect bb371f84-2a61-47a0-ab22-f4dbf8467070@snap

All in all, this can be done fairly quickly and reduces usage a great deal in some cases. If you have glance and ceph and large images, this may be something to consider doing.

Deploying OpenStack Mitaka with Kolla, Docker, and Ansible

It’s true, Mitaka is not quite released yet. That said, these instructions haven’t changed since Liberty and will stay relevant once Mitaka is officially tagged.

The requirements and steps to build Kolla images are provided at docs.openstack.org. Those have already been done and the Docker images exist in my private registry.

A bit about my environment before we begin.

3 identical custom servers with the following specs:

These servers are interconnected at 10Gb using a Netgear XS708E switch. I have one 10Gb interface (eth3) dedicated to VM traffic for Neutron. The other is in a bond (bond0) with one of my 1Gb nics for HA.

I will be deploying ceph, haproxy, keepalived, rabbitmq, mariadb w/ galera, and memcached alongside the other OpenStack services with Kolla. To start, we need to do some prep work on the physical disks so Kolla picks them up in the ceph bootstrap process. This would also be the same procedure needed to add new disks in the future.

The disks I will be using are /dev/sde, /dev/sdf, and /dev/sdg, with external journals on my pcie ssd located at /dev/nvme0n1. I will also be setting up an OSD on the ssd to use as a cache tier with ceph.

In order for the bootstrap process to tie the appropriate devices together, we use GPT partition names. For /dev/sde I create a fresh partition table with a new partition labeled KOLLA_CEPH_OSD_BOOTSTRAP_1. This explicit naming process is so Kolla never, ever messes with a disk it shouldn’t.

# parted /dev/sde -s -- mklabel gpt mkpart KOLLA_CEPH_OSD_BOOTSTRAP_1 1 -1
root@ubuntu1:~# parted /dev/sde print
Model: ATA ST4000DM000-1F21 (scsi)
Disk /dev/sde: 4001GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Disk Flags:

Number Start End Size File system Name Flags
1 1049kB 4001GB 4001GB btrfs KOLLA_CEPH_OSD_BOOTSTRAP_1

The same method is used for /dev/sdf and /dev/sdg, but with labels KOLLA_CEPH_OSD_BOOTSTRAP_2 and KOLLA_CEPH_OSD_BOOTSTRAP_3 respectively. Now we have to set up the external journals for each of those OSDs (you can co-locate the journals as well by using the label KOLLA_CEPH_OSD_BOOTSTRAP).

The external journal labels are simply the bootstrap label with ‘_J’ appended. For example, the journal for /dev/sde would be KOLLA_CEPH_OSD_BOOTSTRAP_1_J. Once those labels are in place, the Kolla bootstrap process will happily set up ceph on those disks. If you mess up any of the labels, all that will happen is the Kolla bootstrap won’t pick up those disks, and you can rerun the playbooks after correcting the issue.
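
For example, carving out the 10GB journal for the first spinning OSD on the NVMe device looks something like this (the start/end offsets simply match where that partition landed on my disk, partition 4 in the listing below):

# parted /dev/nvme0n1 -s -- mkpart KOLLA_CEPH_OSD_BOOTSTRAP_1_J 340GB 350GB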

The final look of my disks with the cache tier osd and journals is as follows:

Model: ATA ST4000DM000-1F21 (scsi)
Disk /dev/sde: 4001GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Disk Flags: 

Number Start End Size File system Name Flags
 1 1049kB 4001GB 4001GB btrfs KOLLA_CEPH_OSD_BOOTSTRAP_1

Model: ATA ST4000DM000-1F21 (scsi)
Disk /dev/sdf: 4001GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Disk Flags: 

Number Start End Size File system Name Flags
 1 1049kB 4001GB 4001GB btrfs KOLLA_CEPH_OSD_BOOTSTRAP_2

Model: ATA ST4000DM000-1F21 (scsi)
Disk /dev/sdg: 4001GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Disk Flags: 

Number Start End Size File system Name Flags
 1 1049kB 4001GB 4001GB btrfs KOLLA_CEPH_OSD_BOOTSTRAP_3

Model: Unknown (unknown)
Disk /dev/nvme0n1: 400GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags: 

Number Start End Size File system Name Flags
 1 1049kB 100GB 100GB btrfs docker
 2 100GB 330GB 230GB KOLLA_CEPH_OSD_CACHE_BOOTSTRAP_1
 3 330GB 340GB 9999MB KOLLA_CEPH_OSD_CACHE_BOOTSTRAP_1_J
 4 340GB 350GB 10.0GB KOLLA_CEPH_OSD_BOOTSTRAP_1_J
 5 350GB 360GB 10.0GB KOLLA_CEPH_OSD_BOOTSTRAP_2_J
 6 360GB 370GB 9999MB KOLLA_CEPH_OSD_BOOTSTRAP_3_J
 7 370GB 380GB 10.0GB KOLLA_CEPH_OSD_BOOTSTRAP_4_J
 8 380GB 390GB 10.0GB KOLLA_CEPH_OSD_BOOTSTRAP_5_J
 9 390GB 400GB 10.1GB KOLLA_CEPH_OSD_BOOTSTRAP_6_J

Next up is configuring my inventory. Normally, you won’t need to configure more than the first 4 sections of your inventory if you have copied it from ansible/inventory/multinode. And that is all I have changed in this case as well. My inventory for my three hosts, ubuntu1 ubuntu2 and ubuntu3, is as follows:

# /etc/kolla/inventory
[control]
ubuntu[1:3]

[network]
ubuntu[1:3]

[compute]
ubuntu[1:3]

[storage]
ubuntu[1:3]

...snip...

Once that was finished I modified my globals.yml for my environment. The final result is below with all the options I configured (comment sections removed for brevity).

---
config_strategy: "COPY_ALWAYS"
kolla_base_distro: "ubuntu"
kolla_install_type: "source"
kolla_internal_vip_address: "192.0.2.10"
kolla_internal_fqdn: "openstack-int.example.com"
kolla_external_vip_address: "203.0.113.5"
kolla_external_fqdn: "openstack.example.com"
kolla_external_vip_interface: "bond0.10"
kolla_enable_tls_external: "yes"
kolla_external_fqdn_cert: "/etc/kolla/haproxy.pem"
docker_registry: "registry.example.com:8182"
network_interface: "bond0.10"
tunnel_interface: "bond0.200"
neutron_external_interface: "eth3"
openstack_logging_debug: "True"
enable_ceph: "yes"
enable_cinder: "yes"
ceph_enable_cache: "yes"
enable_ceph_rgw: "yes"
ceph_osd_filesystem: "btrfs"
ceph_osd_mount_options: "defaults,compress=lzo,noatime"
ceph_cinder_pool_name: "cinder"
ceph_cinder_backup_pool_name: "cinder-backup"
ceph_glance_pool_name: "glance"
ceph_nova_pool_name: "nova"

And finally, the /etc/kolla/passwords.yml file. This contains, you guessed it, passwords. At the time of this writing it has very bad defaults of “password” as the password. By the time of the Mitaka release this patch will have merged and you will be able to run kolla-genpwd to populate this file with random passwords and uuids.
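
Once that lands, populating the file should be a single command; I believe it reads and fills in /etc/kolla/passwords.yml in place by default:

# kolla-genpwd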

Once all of that was completed, I ran the pull playbooks to fetch all of the proper images to the proper hosts with the following command:

# time kolla-ansible -i /etc/kolla/inventory pull
Pulling Docker images : ansible-playbook -i /etc/kolla/inventory -e @/etc/kolla/globals.yml -e @/etc/kolla/passwords.yml -e action=pull /root/kolla/ansible/site.yml 

PLAY [ceph-mon;ceph-osd;ceph-rgw] ********************************************* 

GATHERING FACTS *************************************************************** 
ok: [ubuntu1]
ok: [ubuntu2]
ok: [ubuntu3]

TASK: [common | Pulling kolla-toolbox image] ********************************** 
changed: [ubuntu1]
changed: [ubuntu3]
changed: [ubuntu2]
...snip...

PLAY RECAP ******************************************************************** 
ubuntu1 : ok=55 changed=36 unreachable=0 failed=0 
ubuntu2 : ok=55 changed=36 unreachable=0 failed=0 
ubuntu3 : ok=55 changed=36 unreachable=0 failed=0 

real 5m2.662s
user 0m8.068s
sys 0m2.780s

After the images were pulled I ran the actual OpenStack deployment where the magic happens. After this point it was all automated (including all the ceph cache tier and galera clustering) and I didn’t have to touch a thing!

# time ~/kolla/tools/kolla-ansible -i /etc/kolla/inventory deploy
Deploying Playbooks : ansible-playbook -i /etc/kolla/inventory -e @/etc/kolla/globals.yml -e @/etc/kolla/passwords.yml -e action=deploy /root/kolla/ansible/site.yml 

PLAY [ceph-mon;ceph-osd;ceph-rgw] ********************************************* 

GATHERING FACTS *************************************************************** 
ok: [ubuntu1]
ok: [ubuntu3]
ok: [ubuntu2]

TASK: [common | Ensuring config directories exist] **************************** 
changed: [ubuntu1] => (item=heka)
changed: [ubuntu2] => (item=heka)
changed: [ubuntu3] => (item=heka)
changed: [ubuntu1] => (item=cron)
changed: [ubuntu2] => (item=cron)
changed: [ubuntu3] => (item=cron)
changed: [ubuntu1] => (item=cron/logrotate)
changed: [ubuntu2] => (item=cron/logrotate)
changed: [ubuntu3] => (item=cron/logrotate)
...snip...

PLAY RECAP ******************************************************************** 
ubuntu1 : ok=344 changed=146 unreachable=0 failed=0 
ubuntu2 : ok=341 changed=144 unreachable=0 failed=0 
ubuntu3 : ok=341 changed=143 unreachable=0 failed=0 

real 7m32.476s
user 0m48.436s
sys 0m9.584s

And that’s it! OpenStack is deployed and good to go. In my case, I could access horizon at openstack.example.com with full ssl set up, thanks to the haproxy.pem I supplied. With a 7 and a half minute run time, it is hard to beat the speed of this deployment tool.

Bonus: Docker running containers on ubuntu1 host

# docker ps -a
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
b3a0400b2502 registry.example.com:8182/kollaglue/ubuntu-source-horizon:2.0.0 "kolla_start" 8 minutes ago Up 8 minutes horizon
bded510db134 registry.example.com:8182/kollaglue/ubuntu-source-heat-engine:2.0.0 "kolla_start" 8 minutes ago Up 8 minutes heat_engine
388f4c6c1cd3 registry.example.com:8182/kollaglue/ubuntu-source-heat-api-cfn:2.0.0 "kolla_start" 8 minutes ago Up 8 minutes heat_api_cfn
6d73a6aba1e5 registry.example.com:8182/kollaglue/ubuntu-source-heat-api:2.0.0 "kolla_start" 8 minutes ago Up 8 minutes heat_api
8648565cdc50 registry.example.com:8182/kollaglue/ubuntu-source-cinder-backup:2.0.0 "kolla_start" 8 minutes ago Up 8 minutes cinder_backup
73cb05710d46 registry.example.com:8182/kollaglue/ubuntu-source-cinder-volume:2.0.0 "kolla_start" 8 minutes ago Up 8 minutes cinder_volume
92c4b7890bb7 registry.example.com:8182/kollaglue/ubuntu-source-cinder-scheduler:2.0.0 "kolla_start" 8 minutes ago Up 8 minutes cinder_scheduler
fec67e07a216 registry.example.com:8182/kollaglue/ubuntu-source-cinder-api:2.0.0 "kolla_start" 8 minutes ago Up 8 minutes cinder_api
d22abb2f75fb registry.example.com:8182/kollaglue/ubuntu-source-neutron-metadata-agent:2.0.0 "kolla_start" 9 minutes ago Up 9 minutes neutron_metadata_agent
12cd372d0804 registry.example.com:8182/kollaglue/ubuntu-source-neutron-l3-agent:2.0.0 "kolla_start" 9 minutes ago Up 9 minutes neutron_l3_agent
6ada0dd5eff6 registry.example.com:8182/kollaglue/ubuntu-source-neutron-dhcp-agent:2.0.0 "kolla_start" 9 minutes ago Up 9 minutes neutron_dhcp_agent
cd89ac90384a registry.example.com:8182/kollaglue/ubuntu-source-neutron-openvswitch-agent:2.0.0 "kolla_start" 9 minutes ago Up 9 minutes neutron_openvswitch_agent
4eac98222be5 registry.example.com:8182/kollaglue/ubuntu-source-neutron-server:2.0.0 "kolla_start" 9 minutes ago Up 9 minutes neutron_server
1f44c676f39d registry.example.com:8182/kollaglue/ubuntu-source-openvswitch-vswitchd:2.0.0 "kolla_start" 9 minutes ago Up 9 minutes openvswitch_vswitchd
609adb430b0f registry.example.com:8182/kollaglue/ubuntu-source-openvswitch-db-server:2.0.0 "kolla_start" 9 minutes ago Up 9 minutes openvswitch_db
96881dbecf8a registry.example.com:8182/kollaglue/ubuntu-source-nova-compute:2.0.0 "kolla_start" 9 minutes ago Up 9 minutes nova_compute
9c3d58d59f3d registry.example.com:8182/kollaglue/ubuntu-source-nova-libvirt:2.0.0 "kolla_start" 9 minutes ago Up 9 minutes nova_libvirt
ab09c12c0d4d registry.example.com:8182/kollaglue/ubuntu-source-nova-conductor:2.0.0 "kolla_start" 10 minutes ago Up 10 minutes nova_conductor
0d381b7f3757 registry.example.com:8182/kollaglue/ubuntu-source-nova-scheduler:2.0.0 "kolla_start" 10 minutes ago Up 10 minutes nova_scheduler
58bc728e30ef registry.example.com:8182/kollaglue/ubuntu-source-nova-novncproxy:2.0.0 "kolla_start" 10 minutes ago Up 10 minutes nova_novncproxy
c49c7703bbf0 registry.example.com:8182/kollaglue/ubuntu-source-nova-consoleauth:2.0.0 "kolla_start" 10 minutes ago Up 10 minutes nova_consoleauth
799b7da9fac3 registry.example.com:8182/kollaglue/ubuntu-source-nova-api:2.0.0 "kolla_start" 10 minutes ago Up 10 minutes nova_api
fd367be42634 registry.example.com:8182/kollaglue/ubuntu-source-glance-api:2.0.0 "kolla_start" 10 minutes ago Up 10 minutes glance_api
34c69911d5bc registry.example.com:8182/kollaglue/ubuntu-source-glance-registry:2.0.0 "kolla_start" 10 minutes ago Up 10 minutes glance_registry
6adc4580aab3 registry.example.com:8182/kollaglue/ubuntu-source-keystone:2.0.0 "kolla_start" 11 minutes ago Up 11 minutes keystone
38e57a6b8405 registry.example.com:8182/kollaglue/ubuntu-source-rabbitmq:2.0.0 "kolla_start" 11 minutes ago Up 11 minutes rabbitmq
4e5662f74414 registry.example.com:8182/kollaglue/ubuntu-source-mariadb:2.0.0 "kolla_start" 12 minutes ago Up 12 minutes mariadb
52d766774cab registry.example.com:8182/kollaglue/ubuntu-source-memcached:2.0.0 "kolla_start" 13 minutes ago Up 13 minutes memcached
02c793ecff9f registry.example.com:8182/kollaglue/ubuntu-source-keepalived:2.0.0 "kolla_start" 13 minutes ago Up 13 minutes keepalived
feaeb72eaca5 registry.example.com:8182/kollaglue/ubuntu-source-haproxy:2.0.0 "kolla_start" 13 minutes ago Up 13 minutes haproxy
806c4d9f9db8 registry.example.com:8182/kollaglue/ubuntu-source-ceph-rgw:2.0.0 "kolla_start" 13 minutes ago Up 13 minutes ceph_rgw
fe1ddb781fef registry.example.com:8182/kollaglue/ubuntu-source-ceph-osd:2.0.0 "kolla_start" 13 minutes ago Up 13 minutes ceph_osd_9
02d64b83b197 registry.example.com:8182/kollaglue/ubuntu-source-ceph-osd:2.0.0 "kolla_start" 13 minutes ago Up 13 minutes ceph_osd_7
82d705e92421 registry.example.com:8182/kollaglue/ubuntu-source-ceph-osd:2.0.0 "kolla_start" 13 minutes ago Up 13 minutes ceph_osd_5
dea36b30c249 registry.example.com:8182/kollaglue/ubuntu-source-ceph-osd:2.0.0 "kolla_start" 13 minutes ago Up 13 minutes ceph_osd_1
c7c65ad2f377 registry.example.com:8182/kollaglue/ubuntu-source-ceph-mon:2.0.0 "kolla_start" 15 minutes ago Up 15 minutes ceph_mon
407bcb0a393f registry.example.com:8182/kollaglue/ubuntu-source-cron:2.0.0 "kolla_start" 15 minutes ago Up 15 minutes cron
b696b905ac23 registry.example.com:8182/kollaglue/ubuntu-source-kolla-toolbox:2.0.0 "/bin/sleep infinity" 15 minutes ago Up 15 minutes kolla_toolbox
ceca142fb3be registry.example.com:8182/kollaglue/ubuntu-source-heka:2.0.0 "kolla_start" 15 minutes ago Up 15 minutes heka