...making Linux just a little more fun!
Linux 2.4.x had the Logical Volume Manager (LVM) and other multi-disk/multi-partition block device constructs. These have been enhanced by the Device Mapper in Linux 2.6.x. Here is a one line summary:
You can choose any sequence of blocks on a sequence of block devices and create a new block device some of whose blocks are identified with the blocks you chose earlier.
That'll take a while to chew on. Meanwhile here are some ways you can use the device mapper:
In Unix, everything is a file. Even a block device like
/dev/hda2 which is meant to be read in
“chunks” called blocks, can be read
byte-by-byte like a file. The loop device allows us to reverse this
asymmetry and treat any file like a block device. Activate loop
devices for your Linux with modprobe loop (as root) if
necessary.
To demonstrate this without risking serious damage to useful files, we will only use empty files. First of all, create an empty file like so:
dd if=/dev/zero of=/tmp/store1 bs=1024 seek=2047 count=1
This creates a file full of nothing and 2 Megabytes in size. Now we
make it into a block device:
losetup /dev/loop1 /tmp/store1
We then operate with this block device just as we would with any
other block device:
blockdev --getsize /dev/loop1
mke2fs /dev/loop1
mount /dev/loop1 /mnt
You can now use /mnt just as you would any
other file-system - the changes will be written to
/tmp/store1. When you get tired of playing with the
loop blocks, you put them away with commands like losetup -d
/dev/loop1.
We will use loop devices like /dev/loop1,
/dev/loop2 and so on as the building block devices in
what follows.
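The size arithmetic behind the dd command above can be checked without root. The following is only a sketch: the helper names make_store and size_in_sectors and the scratch directory are my own inventions, not part of any tool. It builds the same 2 MB file and derives the sector count that blockdev --getsize would report once the file is attached to a loop device.

```shell
#!/bin/sh
# Sketch: recreate the 2 MB backing file in a scratch directory and
# confirm its size. The article itself uses /tmp/store1.

make_store() {
    # $1 = file to create, $2 = size in KiB.
    # Seeking to block (size - 1) and writing a single block of zeros
    # leaves everything before it as a "hole", so no real data is written.
    dd if=/dev/zero of="$1" bs=1024 seek=$(( $2 - 1 )) count=1 2>/dev/null
}

size_in_sectors() {
    # blockdev --getsize counts 512-byte sectors; compute the same
    # figure from the length of the file in bytes.
    bytes=$(wc -c < "$1")
    echo $(( bytes / 512 ))
}

scratch=$(mktemp -d)
make_store "$scratch/store1" 2048
size_in_sectors "$scratch/store1"    # 2048 KiB / 512 bytes = 4096
rm -rf "$scratch"
```

All the sizes in the tables that follow are counted in these 512-byte sectors.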
...said the device mapper to the block device. If it is not
already activated, load the device mapper for your Linux with
modprobe dm-mod (as root). The device mapper can take
any block device under its wing with a command like
echo 0 $(blockdev --getsize /dev/loop1) linear /dev/loop1 0 | \
dmsetup create new
This creates a “new” block device
/dev/mapper/new; but this is not really new data.
Reading from this block device returns exactly the same
result as reading directly from /dev/loop1; similarly
with writing to this block device. Looks a lot like the same old
blah in a new block device! So you could get rid of this block
device by dmsetup remove new.
Of course, you can do things differently. For example, you can
take only half of /dev/loop1 as your block device:
SIZE1=$(blockdev --getsize /dev/loop1)
echo 0 $(($SIZE1 / 2)) linear /dev/loop1 0 | \
dmsetup create half
The remaining half (which could be the bigger “half” if
/dev/loop1 is odd-sized!) is then also available for
use. You could use it in combination with /dev/loop2
to create another block device:
REST1=$((SIZE1 - $SIZE1 / 2))
echo 0 $REST1 linear /dev/loop1 $((SIZE1 / 2)) > /tmp/table1
echo $REST1 $(blockdev --getsize /dev/loop2) \
linear /dev/loop2 0 >> /tmp/table1
dmsetup create onenahalf /tmp/table1
Let us try to understand this example and what each of the three
numbers on each line of /tmp/table1 means. The first
number is the starting sector of the map described, the second
number is the number of sectors in the map. The word
linear is followed by the name of the original device
that the map refers to; this is followed by the sector number of
the first sector (of this original device) which is assigned by
this map. Read that again!
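If reading it again does not help, generating a table may. Here is a minimal sketch, assuming you only want to join whole devices end to end; the function name linear_table is hypothetical. It keeps a running total so that each line starts where the previous one stopped.

```shell
#!/bin/sh
# linear_table: print a device-mapper table that concatenates whole
# devices. Arguments come in pairs: DEVICE SIZE_IN_SECTORS.
linear_table() {
    start=0
    while [ $# -ge 2 ]; do
        # Fields: start-in-new-device  length  target  device  offset-in-device
        echo "$start $2 linear $1 0"
        start=$(( start + $2 ))
        shift 2
    done
}

# Example with two 4096-sector devices (the sizes are assumed here;
# on a real system take them from blockdev --getsize).
linear_table /dev/loop1 4096 /dev/loop2 4096
```

The second line it prints begins at sector 4096, exactly where the first map ends. As root, the output could be piped into dmsetup create in the same way as the echo commands above.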
So you can slice and splice your disks as you like - but there is a small cost, of course. All operations to these new block devices go through the device mapper rather than directly to the underlying hardware. With efficient table management in the kernel, this overhead should not slow down things perceptibly.
Notice how I slipped in (clever me!) the use of
“tables” that contain the mapped device descriptions.
If you are planning to use mapped devices a lot and don't want to
forget your settings, such tables are the way to go. Don't worry -
you can always get the table of any device like
/dev/mapper/new by
dmsetup table new
In the output, the original block device will appear as
major:minor, so you will have to figure out what the
device is actually called if you need the table in human readable
form. (Hint: Try
ls -l /dev | grep "$major, *$minor"
or something very like it.) Don't forget to run
dmsetup remove half
dmsetup remove onenahalf
when you are through.
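The hint about translating major:minor back into a name can be made a little more exact. This is a sketch which assumes the usual ls -l layout for device nodes, where the size column is replaced by "MAJOR, MINOR"; the function name devname_from_listing is made up for this example.

```shell
#!/bin/sh
# devname_from_listing MAJOR MINOR
# Reads "ls -l /dev" style output on stdin and prints the names of the
# block devices whose device numbers match.
devname_from_listing() {
    awk -v major="$1" -v minor="$2" '
        # Block devices start with "b"; field 5 is "MAJOR," (with the
        # trailing comma) and field 6 is MINOR.
        /^b/ && $5 == (major ",") && $6 == minor { print $NF }
    '
}
```

For a table entry that mentions 7:1 you would run ls -l /dev | devname_from_listing 7 1, which on most systems names loop1.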
Perhaps you are one of those people who own multiple disks configured so that reading n bytes from one of them is slower than reading n/2 bytes from two of them; this may happen because your disk controller is capable of multi-disk operations in parallel or because you have multiple disk controllers. The device mapper can help you to speed up your operations.
SIZE=$(( $(blockdev --getsize /dev/loop1) + \
$(blockdev --getsize /dev/loop2) ))
echo 0 $SIZE striped 2 16 /dev/loop1 0 /dev/loop2 0 | \
dmsetup create tiger
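Now reads/writes from /dev/mapper/tiger will alternate
(in 16 sector chunks) between the two devices; you will also have
combined the disks into one as in the linear case. Note that each
device contributes an equal share of the striped device, so this
example assumes that /dev/loop1 and /dev/loop2 are the same size and
that the total is a multiple of the chunk size times the number of
stripes.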
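To see how the alternation works, here is a small sketch of the arithmetic that takes a sector on the striped device to a device and a sector on that device. The function name stripe_map is hypothetical, and the numbering of stripes from 0 is just a convention for this example (stripe 0 is /dev/loop1 in the table above).

```shell
#!/bin/sh
# stripe_map SECTOR CHUNK NSTRIPES
# Prints "STRIPE SECTOR_ON_THAT_DEVICE" for a sector of the striped device.
stripe_map() {
    sector=$1
    chunk=$2
    n=$3
    chunkno=$(( sector / chunk ))    # which chunk of the striped device
    stripe=$(( chunkno % n ))        # chunks go round-robin over the devices
    # Every n-th chunk lands on the same device, packed one after another.
    devsector=$(( (chunkno / n) * chunk + sector % chunk ))
    echo "$stripe $devsector"
}

stripe_map 0 16 2     # prints "0 0"
stripe_map 16 16 2    # prints "1 0"
stripe_map 40 16 2    # prints "0 24"
```

Sectors 0 to 15 sit at the start of the first device, sectors 16 to 31 at the start of the second, and sector 40 is back on the first device, 8 sectors into its second chunk.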
There may be a number of reasons why you may want to stop all writes to your block device but not want the system to come to a grinding halt.
Load the snapshot target with modprobe dm-snapshot if necessary.
Let us start then with a device which is managed by the device mapper. For example it could be created by
SIZE1=$(blockdev --getsize /dev/loop1)
SIZE2=$(blockdev --getsize /dev/loop2)
cat > /tmp/table2 <<EOF
0 $SIZE1 linear /dev/loop1 0
$SIZE1 $SIZE2 linear /dev/loop2 0
EOF
dmsetup create base /tmp/table2
Now assume that you have put a file system on this device with a
command like mke2fs /dev/mapper/base; and suppose you
have begun using this file system at /mnt with the
command mount /dev/mapper/base /mnt.
We will now take a “snapshot” of this file-system - in slow motion! The following steps have to be run quite quickly (say with a script) on a running system where this file-system is being changed actively.
First of all you create a duplicate of this device. This is not
just for safety - we will be changing the meaning of
/dev/mapper/base without telling the file-system!
dmsetup table base | dmsetup create basedup
Next we prepare our COW (copy-on-write) block device by making sure
the first 8 (or whatever you decide is your chunk size) sectors are
zeroed.
CHUNK=8
dd if=/dev/zero of=/dev/loop3 bs=512 count=$CHUNK
Now we suspend all I/O (reads/writes) to the base
device. This is the critical step for a running system. The kernel
will have to put to sleep all processes that attempt to read from
or write to this device; so we want to be sure we can resume soon.
dmsetup suspend base && TIME=$(date)
The next step is to use the COW to clone the device:
echo 0 $(blockdev --getsize /dev/mapper/basedup) \
snapshot /dev/mapper/basedup /dev/loop3 p $CHUNK | \
dmsetup create top
What this says is that from now on reading from
/dev/mapper/top will return the data from
/dev/mapper/basedup unless you write
“on top” of the original data. Writes to
top will actually be written on
/dev/loop3 in chunks of size 8 sectors. If you have
used multiple transparent plastic sheets one on top of the other
(or “Layers” in GIMP) the effect is similar - what is
written on top obscures what is below but wherever nothing is
written on top you see clearly what is written on the lower layer.
In particular, we can now make sure that all changes to the underlying block devices are “volatile.” If we execute the following commands (we'll bookmark this as 'Point A' for later use) -
dmsetup table top | dmsetup load base
dmsetup resume base
we will have replaced the file-system under /mnt with
another one where all changes actually go to
/dev/loop3. When we dismantle this setup,
/dev/loop1 and /dev/loop2 will be in
exactly the state that they were in at time
$TIME.
If /dev/loop1 and /dev/loop2 are on
non-writable physical media (such as a CDROM), whereas
/dev/loop3 is on a writable one (such as RAM or hard
disk), then we have created a writable file-system out of a
read-only one!
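Since the steps up to Point A are best run in quick succession, one option is to print them first and run them only after reading them through. The sketch below does just that; the function name snapshot_plan is hypothetical, and the device names basedup and top simply follow the text above. It executes nothing itself.

```shell
#!/bin/sh
# snapshot_plan BASE COW CHUNK
# Prints (does not run) the command sequence that leads to Point A.
snapshot_plan() {
    base=$1
    cow=$2
    chunk=$3
    # The escaped \$( ... ) is left for the shell that finally runs the plan.
    cat <<EOF
dmsetup table $base | dmsetup create ${base}dup
dd if=/dev/zero of=$cow bs=512 count=$chunk
dmsetup suspend $base
echo 0 \$(blockdev --getsize /dev/mapper/${base}dup) snapshot /dev/mapper/${base}dup $cow p $chunk | dmsetup create top
dmsetup table top | dmsetup load $base
dmsetup resume $base
EOF
}

snapshot_plan base /dev/loop3 8
```

Once the printed commands look right, the same output could be fed to a root shell with snapshot_plan base /dev/loop3 8 | sh -e, so that the device stays suspended only for as long as the middle steps take. The plan does not record $TIME; add that yourself if you need it.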
This solves the last problem in our list above - but what about
the first two? To tackle the second problem we must have some way
of comparing the new file-system with the older one. If you try to
mount /dev/mapper/basedup somewhere in order to do this,
you will find that Linux (the kernel!) refuses to let you do this.
Instead we can create yet another device:
echo 0 $(blockdev --getsize /dev/mapper/basedup) \
snapshot-origin /dev/mapper/basedup | \
dmsetup create origin
You can now mount /dev/mapper/origin somewhere (say
/tmp/orig) and compare the original file system with
the current one with a command like
diff -qur /tmp/orig /mnt
What happens if you write to /tmp/orig? Check it out
and you'll be mystified for a moment.
The analogy of plastic sheets breaks down here! All writes to
/tmp/orig go directly to the underlying device
basedup but are negated on
/dev/loop3 so as to become invisible to reads from
/mnt. Similarly, reads from /tmp/orig
ignore whatever changes were made by writing to /mnt.
In other words the original file system has been forked
(and orthogonally at that!) and /dev/loop3 actually
stores both negative and positive data in order to achieve this. No
plastic sheet can be made to work like this!
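A toy model may make the fork easier to picture. In the sketch below, which is only an illustration and not how the kernel stores anything, each "device" is a directory with one small file per chunk: base stands in for basedup and cow for /dev/loop3. All function and directory names are made up for this example.

```shell
#!/bin/sh
# Toy model of the snapshot/snapshot-origin pair, one file per chunk.
model=$(mktemp -d)
mkdir "$model/base" "$model/cow"
echo "old-0" > "$model/base/0"
echo "old-1" > "$model/base/1"

read_snapshot() {    # what a read through "top" sees for chunk $1
    if [ -e "$model/cow/$1" ]; then
        cat "$model/cow/$1"
    else
        cat "$model/base/$1"
    fi
}

read_origin() {      # what a read through "origin" sees for chunk $1
    cat "$model/base/$1"
}

write_snapshot() {   # write chunk $1 through "top": only the COW store changes
    echo "$2" > "$model/cow/$1"
}

write_origin() {     # write chunk $1 through "origin"
    # First save the old contents in the COW store (unless the snapshot
    # already has its own copy), then overwrite the base.
    [ -e "$model/cow/$1" ] || cp "$model/base/$1" "$model/cow/$1"
    echo "$2" > "$model/base/$1"
}

write_origin 0 "new-0"
write_snapshot 1 "snap-1"
read_origin 0; read_snapshot 0    # prints new-0, then old-0
read_origin 1; read_snapshot 1    # prints old-1, then snap-1
```

The saved copy of chunk 0 is the "negative" data and the new contents of chunk 1 are the "positive" data; both live in the same COW store. Remove the scratch directory with rm -rf "$model" when you are done.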
To see why this is useful, let us see how it solves the problem of backups. What we want is to get a “snapshot” view of the file-system but we want to continue using the original system. So in this case we should not run the commands at point A above. Instead we run the commands here, at point B:
dmsetup table origin | dmsetup load base
dmsetup resume base
Now all writes to /mnt will go onto the original
device, but these changes are negated on
/dev/mapper/top. So if we mount the latter device at
(say) /tmp/snap, then we can read a snapshot of the
files at time $TIME from this directory. A command
like
cd /tmp/snap
find . -xdev | cpio -o -H new > "backup-at-$TIME"
will provide a snapshot backup of the file-system at time
$TIME.
We could also have taken such a snapshot at Point A with the commands
cd /tmp/orig
find . -xdev | cpio -o -H new > "backup-at-$TIME"
The main difference is that the changes to
/dev/mapper/top are volatile! There is no way to
easily dismantle the structure created under (A) without losing all
the changes made. In the backup context you want to retain
the changes; at Point B you run
dmsetup suspend base
dmsetup remove top
dmsetup remove origin
dmsetup table basedup | dmsetup load base
dmsetup resume base
and you are back to business as usual. If you were to run this at
Point A the results would be quite
unpredictable! What would be the status of all those open files on
/dev/mapper/top? A number of hung processes would be
the most likely outcome - even some kernel threads could hang - and
then perhaps break!
Say you have a laptop or CD which carries some valuable data - valuable not just to you but to anyone who has it. (When, Oh! When will I ever get my hands on such data). In this case backups are no good. What you want is to protect this data from theft. Assuming you believe in the strength of current encryption techniques you could protect it by encrypting the relevant file. This approach has some serious problems:
Load the crypt target with modprobe dm-crypt if necessary. Also activate some
encryption and hashing mechanism by commands like modprobe
md5 and modprobe aes if necessary.
First of all you need to generate and store your secret key. If you use AES as indicated above then you can use a key of 16, 24 or 32 bytes; a 16-byte key can be generated by a command like
dd if=/dev/random bs=16 count=1 | \
od --width=16 -t x2 | head -1 | \
cut -f2- -d' ' | tr -d ' ' > /tmp/my_secret_key
Of course, you should probably not output your secret key to such a
file - there are safer ways of storing it. For example, encrypt it with
gpg or openssl and
then store it on a USB stick or a device that never leaves
you.
You can now set up the encrypted device:
echo 0 $(blockdev --getsize /dev/loop1) \
crypt aes-plain $(cat /tmp/my_secret_key) 0 /dev/loop1 0 | \
dmsetup create mydata
You can then make a file-system on this block device with
mke2fs /dev/mapper/mydata and store data on it
after mounting it somewhere with mount /dev/mapper/mydata
/mnt. All the data written to /mnt will then be
transparently encrypted before storing it in
/dev/loop1. When you are through you unmount the
device and dismantle it as before:
umount /mnt
dmsetup remove mydata
The next time you want to use the device you can set it up with the
same command as above (providing you supply the secret key in
/tmp/my_secret_key). Of course, you shouldn't run
mke2fs on the device a second time unless you want to
erase all that valuable data!
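A mistyped or truncated key would silently give you a different (and useless) view of the device, so it may be worth checking the key before handing it to dmsetup. The following sketch makes two assumptions of its own: it reads from /dev/urandom rather than /dev/random so that it never blocks, and it prints one hex pair per byte with od -t x1. The function names gen_key and valid_aes_key are hypothetical.

```shell
#!/bin/sh
# gen_key NBYTES: print NBYTES random bytes as one hex string.
gen_key() {
    # -An drops the address column, -tx1 prints each byte as two hex digits.
    od -An -tx1 -N "$1" /dev/urandom | tr -d ' \t\n'
}

# valid_aes_key HEX: succeed if HEX is 32, 48 or 64 hex digits,
# that is a 16, 24 or 32 byte AES key.
valid_aes_key() {
    case $1 in
        ''|*[!0-9a-fA-F]*) return 1 ;;
    esac
    case ${#1} in
        32|48|64) return 0 ;;
        *) return 1 ;;
    esac
}

key=$(gen_key 32)
valid_aes_key "$key" && echo "key has ${#key} hex digits"
```

The check deliberately prints only the length of the key, never the key itself.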
All the steps given above can be carried out on any block device(s) in place of the loop devices that were used. However, when the block device is the root device then life gets a little more complex. (Roots generally are complex).
First of all we need to put the root device under the control of
the device mapper; this is best done with an initial RAM disk (or
initrd). Even after this is done, we need to be
careful if we are trying to run some of the above commands for the
root file system on a “live” system. In particular, it
is not advisable to suspend I/O on the root file system without
deep introspection! After all this means that all processes that
make a read/write call to the root file system will be put to
sleep.
Here is one way around the problem. Create a temporary file system
mount -t tmpfs tmpfs /mnt
To this file system copy all the files that are necessary in order
to perform the changes - in particular, you need
/sbin/dmsetup, /bin/sh, the
/dev files and all shared libraries that these
programs depend on. Then you run chroot /mnt. After
this you can run a script or (if you type quickly and
without errors!) a sequence of commands that will suspend the root
device map and make relevant changes to it - for example, to take a
snapshot. Don't forget to resume the root device before exiting the
chroot.
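Collecting the shared libraries by hand is tedious, so here is a sketch of one way to do it with ldd. The function name copy_with_libs is hypothetical, the parsing assumes the usual glibc-style ldd output, and the /dev files still have to be copied separately.

```shell
#!/bin/sh
# copy_with_libs PROGRAM DEST
# Copies PROGRAM and the shared libraries reported by ldd into the tree
# rooted at DEST, keeping their original paths.
copy_with_libs() {
    prog=$1
    dest=$2
    {
        echo "$prog"
        # ldd prints "name => /path (addr)" for most libraries and
        # "/path (addr)" for the dynamic loader itself.
        ldd "$prog" 2>/dev/null | awk '
            $2 == "=>" && $3 ~ /^\// { print $3 }
            $1 ~ /^\//               { print $1 }'
    } | while read -r f; do
        mkdir -p "$dest$(dirname "$f")"
        cp -L "$f" "$dest$f"    # -L copies the file a symlink points to
    done
}
```

With the tmpfs mounted on /mnt as above, copy_with_libs /bin/sh /mnt followed by copy_with_libs /sbin/dmsetup /mnt would populate the temporary root. It is worth testing chroot /mnt /bin/sh before suspending anything.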
Given the complexity of the various operations, it is probably best to produce a shell script or even a C program that carries out the tasks. Luckily, the latter has already been implemented - the Linux Logical Volume Manager version 2 does carry out most of the tasks described above quite “automagically.” Setup and use of encryption is greatly simplified by the cryptsetup program. Why then did I write this article?
I originally came upon dmsetup while
trying to create a read-only root file system for a
“live” CDROM. Unfortunately, the LVM2 tools are not
useful as they only look at the use of snapshots for backups -
clearly they don't care for COWs! The only resource that I found
for this was the Red Hat
mailing list archives. There are now tools which come with live
CDs that make use of dmsetup; for example I came
across a page
which explains how Ubuntu does it.
Of course, using dmsetup allowed me to get as
“close to the metal” as is possible without writing
real programs...
Kapil Hari Paranjape has been a "hack"-er since his punch-card days.
Specifically, this means that he has never written a "real" program.
He has merely tinkered with programs written by others. After playing
with Minix in 1990-91 he thought of writing his first program - a
"genuine" *nix kernel for the x86 class of machines. Luckily for him a
certain L. Torvalds got there first - thereby saving him the trouble
(once again) of actually writing code. In eternal gratitude he has spent
a lot of time tinkering with and promoting Linux and GNU since those
days - much to the dismay of many around him who think he should
concentrate on mathematical research - which is his paying job. The
interplay between actual running programs, what can be computed in
principle and what can be shown to exist continues to fascinate him.