Glance, Ceph, and raw images

If you use Glance with Ceph today, you know that raw images are a requirement for copy-on-write (CoW) cloning into the Nova or Cinder pools. Without them, a full (thick) copy of the image has to be made, which wastes time, bandwidth, and IO. The problem is that raw images can take up quite a bit of space even if they are _all_ zeros, because raw images cannot be sparse. This becomes even more visible when doing a non-CoW clone from Cinder or Nova to Glance, whose volumes can be quite large.
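
As background, the CoW path depends on the rbd backend being wired up in Glance and on Glance exposing the image location to Cinder and Nova. A rough sketch of the relevant glance-api.conf settings (pool, user, and paths here are assumptions; adjust for your deployment) looks like this:

[DEFAULT]
# Expose the rbd location so Cinder/Nova can clone instead of downloading
show_image_direct_url = True

[glance_store]
stores = rbd
default_store = rbd
rbd_store_pool = glance
rbd_store_user = glance
rbd_store_ceph_conf = /etc/ceph/ceph.conf
# 8 MiB chunks, matching the 8192 kB objects seen later in this post
rbd_store_chunk_size = 8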

Enter fstrim. For the uninitiated, see OpenStack and fstrim.

The basic idea is to tell the image to drop all of its unused space and thus reclaim that space on the Ceph cluster itself. Please note that this only works on images that do not have any CoW clones based on them, so if you have anything booted or cloned from the image you will need to flatten or remove those instances/volumes first.
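
If you are not sure whether anything still clones from an image, rbd can list the children of its snapshot. A quick check along these lines (substitute the image's UUID for the placeholder) tells you whether the snapshot is safe to remove:

# rbd -p glance children <image-uuid>@snap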

We will start with a “small” 80GiB raw image and upload that to Glance.
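
(If you want to reproduce something similar from a stock cloud image, a sketch along these lines should work; the qcow2 filename is an assumption, and the resize only grows the raw file's apparent size rather than writing 80GiB of data locally.)

# qemu-img convert -f qcow2 -O raw trusty-server-cloudimg-amd64-disk1.img ubuntu-trusty-80GB.raw
# qemu-img resize ubuntu-trusty-80GB.raw 80G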

# ls -l 
total 82416868
-rw-r--r-- 1 root root 85899345920 Mar 21 15:34 ubuntu-trusty-80GB.raw
# openstack image create --disk-format raw --container-format bare --file ubuntu-trusty-80GB.raw ubuntu-trusty-80GB
+------------------+----------------------------------------------------------------------------------------------------------+
| Field            | Value                                                                                                    |
+------------------+----------------------------------------------------------------------------------------------------------+
| checksum         | 9ca30159fcb4bb48fbdf876493d11677                                                                         |
| container_format | bare                                                                                                     |
| created_at       | 2016-03-21T15:38:50Z                                                                                     |
| disk_format      | raw                                                                                                      |
| file             | /v2/images/bb371f84-2a61-47a0-ab22-f4dbf8467070/file                                                     |
| id               | bb371f84-2a61-47a0-ab22-f4dbf8467070                                                                     |
| min_disk         | 0                                                                                                        |
| min_ram          | 0                                                                                                        |
| name             | ubuntu-trusty-80GB                                                                                       |
| owner            | 762565b94f314ec6b370d978db902a78                                                                         |
| properties       | direct_url='rbd://20d283ba-5a51-49b0-9be7-9220bcc9afd0/glance/bb371f84-2a61-47a0-ab22-f4dbf8467070/snap' |
| protected        | False                                                                                                    |
| schema           | /v2/schemas/image                                                                                        |
| size             | 85899345920                                                                                              |
| status           | active                                                                                                   |
| tags             |                                                                                                          |
| updated_at       | 2016-03-21T15:53:54Z                                                                                     |
| virtual_size     | None                                                                                                     |
| visibility       | private                                                                                                  |
+------------------+----------------------------------------------------------------------------------------------------------+

NOTE: If you are running with a cache tier you will need to evict the cache to get accurate output from the commands below. Evicting is not required for the trim itself to work, only for getting valid numbers out of those commands.

rados -p glance-cache cache-flush-evict-all
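
If you are not sure whether a cache tier is in play at all, the OSD map shows the tiering relationships; a base pool with a tier lists tier/read_tier/write_tier fields, and the cache pool shows its cache_mode:

# ceph osd dump | grep -E 'tier|cache_mode'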

Now if we look at Ceph we should see it consuming 80GiB of space as well.

# rbd -p glance info bb371f84-2a61-47a0-ab22-f4dbf8467070
rbd image 'bb371f84-2a61-47a0-ab22-f4dbf8467070':
 size 81920 MB in 10240 objects
 order 23 (8192 kB objects)
 block_name_prefix: rbd_data.10661586915
 format: 2
 features: layering, striping
 flags: 
 stripe unit: 8192 kB
 stripe count: 1
# rados -p glance ls | grep rbd_data.10661586915 | wc -l
10240

Sure enough, 10240 * 8MiB per object equals 80GiB. So Ceph is clearly using 80GiB of data on this one image, but it doesn’t have to be!
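
If you would rather script that sanity check, the same arithmetic can be done in the shell using the block_name_prefix from the rbd info output above:

# objects=$(rados -p glance ls | grep -c rbd_data.10661586915)
# echo "$(( objects * 8 / 1024 )) GiB"
80 GiB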

The first step is to remove the snapshot associated with the Glance image. This step is why you cannot have any CoW clones based on the image: the snapshot they clone from has to go away.

# rbd -p glance snap unprotect bb371f84-2a61-47a0-ab22-f4dbf8467070@snap
# rbd -p glance snap rm bb371f84-2a61-47a0-ab22-f4dbf8467070@snap

The process to reduce the usage is to map the RBD to a Linux host and mount the filesystem. At that point we can run fstrim on the filesystem(s) to tell Ceph it is OK to free up the unused space. Finally, we re-snapshot the image so that, going forward, Glance is CoW cloning the now-sparse RBD. With the snapshot already removed above, the remaining steps are as follows:

# rbd -p glance map bb371f84-2a61-47a0-ab22-f4dbf8467070
/dev/rbd0
# mount /dev/rbd0p1 /mnt/
# df -h /mnt/
Filesystem      Size  Used Avail Use% Mounted on
/dev/rbd0p1      79G  780M   74G   2% /mnt
# fstrim /mnt/
# umount /mnt/
# rbd unmap /dev/rbd0
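
As an aside, fstrim has a verbose flag that reports how many bytes it discarded; running it in place of the plain fstrim above gives a rough idea of what was reclaimed before you even look at Ceph:

# fstrim -v /mnt/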

If you are running a cache tier, now would be the time to evict your cache again.

# rbd -p glance info bb371f84-2a61-47a0-ab22-f4dbf8467070                                                                                                                                                                                                                 
rbd image 'bb371f84-2a61-47a0-ab22-f4dbf8467070':
        size 81920 MB in 10240 objects
        order 23 (8192 kB objects)
        block_name_prefix: rbd_data.10661586915
        format: 2
        features: layering, striping
        flags: 
        stripe unit: 8192 kB
        stripe count: 1
# rados -p glance ls | grep rbd_data.10661586915 | wc -l
895

And there it is! Now we are only using 895 objects * 8MiB (~7GiB). That is over a 90% reduction in usage. So why are we using ~7GiB and not the 780M that df reports? fstrim is a quick tool, not a perfect one. If you want even more efficiency you can use zerofree, or write out a large file full of zeros on the filesystem and delete it before running fstrim; that will further reduce the size of this RBD.
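
If you want to try the zero-fill route, a rough sketch would be the following, run while the filesystem is still mounted at /mnt in the sequence above. The filename is arbitrary, and dd is expected to fail with "No space left on device" once the filesystem fills up:

# dd if=/dev/zero of=/mnt/zerofill bs=1M conv=fsync
# rm /mnt/zerofill
# sync
# fstrim /mnt/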

Finally, to make the image usable by Glance again, you need to recreate and protect the snapshot we removed earlier.

# rbd -p glance snap create bb371f84-2a61-47a0-ab22-f4dbf8467070@snap
# rbd -p glance snap protect bb371f84-2a61-47a0-ab22-f4dbf8467070@snap
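
As a quick sanity check that the image is cloneable again, the snapshot should show up when listing the image's snapshots, and any new instance or volume cloned from it should report the image as its parent in rbd info:

# rbd -p glance snap ls bb371f84-2a61-47a0-ab22-f4dbf8467070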

All in all, this can be done fairly quickly and in some cases reduces usage a great deal. If you are running Glance and Ceph with large images, this may be something to consider doing.
