December 15, 2010

Day 15 - Down the 'ls' Rabbit Hole

This article was written by Adam Fletcher (@adamfblahblah)

From ls(1) to the kernel and back again.

Too often sysadmins are afraid to dive into the source code of our core utilities to see how they really work. We're happy to edit our scripts but we don't do the same with our command line utilities, libraries, and kernel. Today we're going to do some source diving in those core components. We'll answer the age-old interview question, "What happens when you type ls at the command line and press enter?" The answer to this question has infinite depth, so I'll leave out some detail, but I'll capture the essence of what is going, and I'll show the source in each component as we go. The pedants in the crowd may find much to gripe about but hopefully they'll do so by posting further detail in the comments.

Requirements

It'll be helpful if you install the source on your machine for the software we'll be looking at. Below are the commands I used to get the source for the needed packages on Ubuntu 9.10, and similar packages are available for your Linux distribution.

apt-get install linux-source 
apt-get source coreutils 
apt-get source bash
apt-get source libc6
apt-get install manpages-dev

I'm using linux-source version 2.6.31.22.35, coreutils (for the code to ls) version 7.4-2ubuntu1, bash version 3.5.21, and libc6 version 2.10.1-0ubuntu18, and finally manpages-dev to get the programmer's man pages.

Starting Out - strace & bash

One of the most useful tools in the sysadmin's arsenal is strace, a command that will show you most of the standard library and system calls a program makes while it executes. We'll use this tool extensively to figure out what code we are looking for in each component.

Let's start by strace'ing bash when it runs ls. To do so, we'll start a new instance of bash under strace. Note that I'll be cutting the output of strace down a lot in the post for readability.

adamf@kid-charlemagne:~/foo$ strace bash
execve("/bin/bash", ["bash"], [/* 30 vars */]) = 0

[... wow that's a lot of output ...]

write(2, "adamf@kid-charlemagne:~/foo$ ", 29adamf@kid-charlemagne:~/foo$ ) = 29
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
read(0,

... and that's where the output stops. If you're new to strace the key to reading it is to make liberal use of man pages to figure out what each library call does. Be aware that the relevant pages you want are in section 2 of the man pages, so you'll need to do man 2 read to find the page on read; this is because many of the system functions have the same name as regular commands that are found in chapter 1 of the man pages.

The read call is waiting for input on file descriptor 0, which is standard input. So we type ls and hit enter (you'll see more read & write calls as you type).

There's a lot of output, but we know we want to see ls related output, so let's do the simple thing and look at the lines that have ls in them:

stat("/usr/local/sbin/ls", 0x7fff03f1fd60) = -1 ENOENT (No such file or directory)
stat("/usr/local/bin/ls", 0x7fff03f1fd60) = -1 ENOENT (No such file or directory)
stat("/usr/sbin/ls", 0x7fff03f1fd60)    = -1 ENOENT (No such file or directory)
stat("/usr/bin/ls", 0x7fff03f1fd60)     = -1 ENOENT (No such file or directory)
stat("/sbin/ls", 0x7fff03f1fd60)        = -1 ENOENT (No such file or directory)
stat("/bin/ls", {st_mode=S_IFREG|0755, st_size=114032, ...}) = 0
stat("/bin/ls", {st_mode=S_IFREG|0755, st_size=114032, ...}) = 0

If we man 2 stat we see that stat returns information about a file if it can find it, and an error if it can't (much more on stat later). In this case what bash is doing is searching my $PATH environment variable in hopes of finding an executable file with the name ls. Bash will stat every directory in my $PATH, and if it can't find the file it returns command not found. In this case, Bash found ls in /bin, and then that's the last we see of the string ls in our output.

We don't see ls in our output anymore because once Bash knows it can execute the program it spawns a child process to execute that program, and we haven't told strace to follow children of the command it is tracing. It's the next few lines of strace that give this spawning away:

pipe([3, 4])                            = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f2c853217c0) = 30125

If we man 2 pipe and man 2 clone we see that bash is creating a pipe (two file descriptors that can be read and written to; this way a shell can link commands input and output together when you give the shell a | character) and clone'ing itself so that there are two copies of bash running. Remember, every UNIX process is a child of another process, and a brand new process starts out as a copy of its parent. So when does ls actually happen? Let's strace ls and find out!

adamf@kid-charlemagne:~/foo$ strace ls
execve("/bin/ls", ["ls"], [/* 30 vars */]) = 0

That first line is the key. execve is the library call to load and run a new executable. Once execve runs we're actually ls (well, the loader runs first, but that's another article). Interestingly, the call to execve is in the bash source code, not the ls source code. Let's find it in the bash code:

adamf@kid-charlemagne:/usr/src/bash-4.0/bash-4.0$ find . | xargs grep -n "execve ("
./builtins/exec.def:201:  shell_execve (command, args, env);
./execute_cmd.c:4323:   5) execve ()
./execute_cmd.c:4466:      exit (shell_execve (command, args, export_env));
./execute_cmd.c:4577:  return (shell_execve (execname, args, env));
./execute_cmd.c:4653:/* Call execve (), handling interpreting shell scripts, and handling
./execute_cmd.c:4656:shell_execve (command, args, env)
./execute_cmd.c:4665:  execve (command, args, env);

If we look at line 4323 in execute_cmd.c we see this helpful comment:

/* Execute a simple command that is hopefully defined in a disk file
somewhere.

1) fork ()
2) connect pipes
3) look up the command
4) do redirections
5) execve ()
6) If the execve failed, see if the file has executable mode set.
If so, and it isn't a directory, then execute its contents as
a shell script.
[...]
*/

And looking at line 4665 we do see the call to execve. Take a look at the code around execve - it's a bunch of error handling but nothing too hard to understand. What's interesting is what is not there; the code exists only to handle errors and nothing to handle success. That is because execve will only return if there's a failure, which makes sense - a successful call to execve means we're running something completely different!

Look around execute_cmd.c at the code around calls to shell_execve and you'll see that that code is fairly straightforward.

Inside ls(1)

Let's look at what ls is doing by creating a single file in our directory and ls'ing that file under strace.

adamf@kid-charlemagne:~/foo$ touch bar
adamf@kid-charlemagne:~/foo$ strace ls bar
execve("/bin/ls", ["ls", "bar"], [/* 30 vars */]) = 0

Interesting! We can see that bar is now being passed to our execve call. Let's keep looking at the strace output to find bar:

stat("bar", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
lstat("bar", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f467abbe000
write(1, "bar\n", 4bar
)                    = 4

Right at the end of the strace output we see bar a few times. It looks like bar gets passed to stat, lstat, and write. Working backwards, we can man 2 write to figure out that write sends data to a file descriptor, in this case standard out, which is our screen. So the call to write is just ls printing out bar. The next two library calls, stat and lstat, share a man page, with the difference between the commands being that lstat will get information on a symbolic link while stat will only get information on a file. Let's look in in the ls source code for these calls to see why ls does both lstat and stat:

adamf@kid-charlemagne:/usr/src/coreutils-7.4/src$ grep -n "stat (" ls.c
967:      assert (0 <= stat (Name, &sb));       \ 
2437:      ? fstat (fd, &dir_stat)
2438:      : stat (name, &dir_stat)) < 0)
2721:     err = stat (absolute_name, &f->stat);
2730:         err = stat (absolute_name, &f->stat);
2749:     err = lstat (absolute_name, &f->stat);
2837:         && stat (linkname, &linkstats) == 0)

That call to lstat stands out amongst the other calls, and so it is a pretty good guess that lstat happens for some exceptional reason that programmer would notate with a comment. Looking at line 2749 in ls.c we see an interesting comment a few lines above:

         /* stat failed because of ENOENT, maybe indicating a dangling
             symlink.  Or stat succeeded, ABSOLUTE_NAME does not refer to a
             directory, and --dereference-command-line-symlink-to-dir is
             in effect.  Fall through so that we call lstat instead.  */
        }

    default: /* DEREF_NEVER */
      err = lstat (absolute_name, &f->stat);
      do_deref = false;
      break;
    }

That comment means that if we're not talking about a directory and stat has already succeeded, we need to see if we are looking at a symlink. We can see that this is true by ls'ing a directory under strace:

adamf@kid-charlemagne:~/foo$ strace ls /home/adamf/foo/
[...]
stat("/home/adamf/foo/", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/home/adamf/foo/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
fcntl(3, F_GETFD)                       = 0x1 (flags FD_CLOEXEC)
getdents(3, /* 3 entries */, 32768)     = 72
getdents(3, /* 0 entries */, 32768)     = 0
close(3)                                = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f873dda4000
write(1, "bar\n", 4bar

Note that there was no call to lstat this time.

Where We Are Going There Is No strace

It is time to bid our friend strace a fond farewell as he doesn't have what it takes to show us what stat is doing. For that we need to look into the standard library, or as it is commonly known, libc.

The libc code provides a common API for UNIX programs, and a portion of that API is the system calls. These are functions that provide a way for a programmer to ask the kernel for a resource that is managed by the kernel, including the resource we're interested in: the filesystem. The code we'd like to look at is for the system call stat. However, because kernels are very dependent on the hardware architecture they run on, and libc needs to talk to the kernel, much of the code you'll find in the libc source organized by architecture. This makes finding the code for stat tricky; if we look in io/stat.c we see basically a single line of code that calls a function called __xstat. If we find . -name xstat.c we'll see that we want ./sysdeps/unix/sysv/linux/i386/xstat.c, which is the implementation of stat for Linux on i386.

The code in xstat.c that isn't a reference to a C #include looks like:

return INLINE_SYSCALL (stat, 2, CHECK_STRING (name), CHECK_1 ((struct kernel_stat *) buf));

And:

result = INLINE_SYSCALL (stat64, 2, CHECK_STRING (name), __ptrvalue (&buf64));

Reading the comments in the code we can see that stat64is for 64-bit platforms. We'll stick to 32-bits for now, but either way we need to figure out what INLINE_SYSCALL is. A convention in C programming is that FUNCTIONS IN ALL CAPS are pre-processor macros, which means you can typically find out what those macros are by grep'ing for define <macroname>:

adamf@kid-charlemagne:/usr/src/eglibc-2.10.1/sysdeps/unix/sysv/linux/i386$ grep -n "define INLINE_SYSCALL" *
sysdep.h:348:#define INLINE_SYSCALL(name, nr, args...) \

At first, the code we find at line 348 in sysdep.h looks confusing:

#define INLINE_SYSCALL(name, nr, args...) \ 
({                                                                          \ 
    unsigned int resultvar = INTERNAL_SYSCALL (name, , nr, args);             \ 
    if (__builtin_expect (INTERNAL_SYSCALL_ERROR_P (resultvar, ), 0))         \ 
    {                                                                       \ 
        __set_errno (INTERNAL_SYSCALL_ERRNO (resultvar, ));                   \ 
        resultvar = 0xffffffff;                                               \ 
    }                                                                       \ 
    (int) resultvar; })

Looking at the code the call to INTERNAL_SYSCALL stands out - it appears that all INLINE_SYSCALL is doing is calling INTERNAL_SYSCALL. Conveniently we can scroll down in sysdep.h to find the definition of INTERNAL_SYSCALL:

/* Define a macro which expands inline into the wrapper code for a system
call.  This use is for internal calls that do not need to handle errors
normally.  It will never touch errno.  This returns just what the kernel
gave back.

The _NCS variant allows non-constant syscall numbers but it is not
possible to use more than four parameters.  */
#undef INTERNAL_SYSCALL
#ifdef I386_USE_SYSENTER
# ifdef SHARED
#  define INTERNAL_SYSCALL(name, err, nr, args...) \

... but it appears to define INTERNAL_SYSCALL a few times, and I'm not sure which one is actually used.

A good practice in a situation like this is to stop looking at the code and instead take some time to understand the concept the code is trying to implement. Googling for something like i386 system calls linux gets us a to (Implementing A System Call On i386 Linux)[http://tldp.org/HOWTO/html_single/Implement-Sys-Call-Linux-2.6-i386/] which says:

A system call executes in the kernel mode. Every system call has a number
associated with it. This number is passed to the kernel and that's how the
kernel knows which system call was made. When a user program issues a system
call, it is actually calling a library routine. The library routine issues a
trap to the Linux operating system by executing INT 0x80 assembly instruction.
It also passes the system call number to the kernel using the EAX register. The
arguments of the system call are also passed to the kernel using other
registers (EBX, ECX, etc.). The kernel executes the system call and returns the
result to the user program using a register. If the system call needs to supply
the user program with large amounts of data, it will use another mechanism
(e.g., copy_to_user call).

Okay, so I think the implementation of INTERNAL_SYSCALL we'll want will have 0x80 in it and some assembly code that puts stuff in the eax register (newer x86 machines can use sysenter instead of int 0x80 to make syscalls). Line 419 in sysdep.h does the trick:

# define INTERNAL_SYSCALL(name, err, nr, args...) \ 
({                                                                          \ 
    register unsigned int resultvar;                                          \ 
    EXTRAVAR_##nr                                                             \ 
    asm volatile (                                                            \ 
    LOADARGS_##nr                                                             \ 
    "movl %1, %%eax\n\t"                                                      \ 
    "int $0x80\n\t"                                                           \ 
    RESTOREARGS_##nr                                                          \ 
    : "=a" (resultvar)                                                        \ 
    : "i" (__NR_##name) ASMFMT_##nr(args) : "memory", "cc");                  \ 
    (int) resultvar; })

If we go back to xstat.c we see that the name we pass to INTERNAL_SYSCALL is stat, and in the code above the name argument will expand from __NR_##name to __NR_stat. The web page we found describing syscalls says that syscalls are represented by a number, so there has to be some piece of code that turns __NR_stat into a number. However, when I grep through all of the libc6 source I can't find any definition of __NR_stat for i386.

It turns out that the code that translates __NR_stat into a number is inside the Linux kernel:

adamf@kid-charlemagne:/usr/src/linux-source-2.6.31$ find . | grep x86 | xargs grep -n "define __NR_stat"
./arch/x86/include/asm/unistd_64.h:23:#define __NR_stat             4
./arch/x86/include/asm/unistd_64.h:309:#define __NR_statfs              137
./arch/x86/include/asm/unistd_32.h:107:#define __NR_statfs       99
./arch/x86/include/asm/unistd_32.h:114:#define __NR_stat        106
./arch/x86/include/asm/unistd_32.h:203:#define __NR_stat64      195
./arch/x86/include/asm/unistd_32.h:276:#define __NR_statfs64        268

The Amulet Of Yendor: Inside The Kernel

The syscall number definitions being inside the kernel makes sense, as the kernel is the owner of the syscall API and as such will have the final say on what numbers get assigned to each syscall. As we're running on 32-bit Linux, it appears the syscall number that libc is going to put in eax is 106.

The table in unistd_32.h is great (look at all those syscalls!) but it doesn't tell us where the code for handling a call to stat actually lives in the kernel. find is our friend again:

adamf@kid-charlemagne:/usr/src/linux-source-2.6.31$ find . -name stat.c
./fs/stat.c
./fs/proc/stat.c

Well that was easy. Opening up fs/stat.c we find what we're looking for:

SYSCALL_DEFINE2(stat, char __user *, filename, struct __old_kernel_stat __user *, statbuf)
{
        struct kstat stat;
        int error;

        error = vfs_stat(filename, &stat);
        if (error)
                return error;

        return cp_old_stat(&stat, statbuf);
}

Looks like this just a wrapper around vfs_stat, which is also in stat.c and is a wrapper around vfs_statat, which again is in stat.c and is wrapper around two functions, user_path_at() and vfs_getattr(). We'll ignore user_path_at() for now (it figures out if the file exists) and instead follow vfs_getattr():

int vfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
{
        struct inode *inode = dentry->d_inode;
        int retval;

        retval = security_inode_getattr(mnt, dentry);
        if (retval)
                return retval;

        if (inode->i_op->getattr)
                return inode->i_op->getattr(mnt, dentry, stat);

        generic_fillattr(inode, stat);
        return 0;
}

One thing that is helpful to do in a case like is to look back at any documentation I have on the function whose implementation I'm tracking down, which in this case is the library call to stat. Back to man 2 stat we see:

   All of these system calls return a stat structure, which contains the following fields:

       struct stat {
           dev_t     st_dev;     /* ID of device containing file */
           ino_t     st_ino;     /* inode number */
           mode_t    st_mode;    /* protection */
           nlink_t   st_nlink;   /* number of hard links */
           uid_t     st_uid;     /* user ID of owner */
           gid_t     st_gid;     /* group ID of owner */
           dev_t     st_rdev;    /* device ID (if special file) */
           off_t     st_size;    /* total size, in bytes */
           blksize_t st_blksize; /* blocksize for file system I/O */
           blkcnt_t  st_blocks;  /* number of 512B blocks allocated */
           time_t    st_atime;   /* time of last access */
           time_t    st_mtime;   /* time of last modification */
           time_t    st_ctime;   /* time of last status change */
       };

So our vfs_getattr function is trying to fill out these fields, which must be in the struct kstat *stat argument to vfs_getattr. vfs_getattr tries to fill out the stat struct in two ways:

    if (inode->i_op->getattr)
            return inode->i_op->getattr(mnt, dentry, stat);

    generic_fillattr(inode, stat);

In the first attempt to fill stat, vfs_getattr checks to see if this inode struct has a special function defined to fill the stat structure. Each inode has an i_op struct which can have a getattr function, if needed. This getattr function is not defined in fs.h but rather is defined by the specific file system the inode is on. This makes good sense as it allows the application programmer to call libc's stat without caring if the underlying file system is ext2, ext3, NTFS, NFS, etc. This abstraction layer is called the 'Virtual File System' and is why the syscall above is prefixed with 'vfs'.

Some filesystems, like NFS, implement a specific getattr handler, but the filesystem I'm running (ext3) does not. In the case where there is no special getattr function defined vfs_getattr will call generic_fillattr (helpfully defined in stat.c) which simply copies the relevant data from the inode struct to the stat struct:

void generic_fillattr(struct inode *inode, struct kstat *stat)
{
        stat->dev = inode->i_sb->s_dev;
        stat->ino = inode->i_ino;
        stat->mode = inode->i_mode;
        stat->nlink = inode->i_nlink;
        stat->uid = inode->i_uid;
        stat->gid = inode->i_gid;
        stat->rdev = inode->i_rdev;
        stat->atime = inode->i_atime;
        stat->mtime = inode->i_mtime;
        stat->ctime = inode->i_ctime;
        stat->size = i_size_read(inode);
        stat->blocks = inode->i_blocks;
        stat->blksize = (1 << inode->i_blkbits);
}

If you squint a little bit at this struct you'll see all the fields you can get out of a single ls command! Our adventure into the kernel has yielded fruit.

Just One More Turn...

If you'd like to keep going, the next thing to figure out is how the inode struct gets populated (hint: ext3_iget) and updated, and from there figure out how the kernel reads that data from the block device (and then how the block device talks to the disk controller, and how the disk controller finds the data on the disk, and so on).

I hope this has been instructive. Digging through the actual source code to a program isn't as easy as reading a summary of how something works, but it is more rewarding and you'll know how the program actually works. Don't be intimidated by an unknown language or concept! We found our way through the internals of the kernel with strace, find and grep, tools a sysadmin uses every day.

Other Resources

6 comments:

JB said...

Fun article. Thanks.

Nilesh said...

Great article! Opened the door for a topic which I always feared. Kudos!

adamfblahblah said...

User Rhombold on reddit pointed out that moden Linux no longer use the int 0x80 method to make syscalls, they instead use the method described in section 4.6 of this page: http://www.win.tue.nl/~aeb/linux/lk/lk-4.html

The way I show above is for older architectures. I'll pretend I left finding the new method of making syscalls out as an exercise for the reader, rather than just not finding that part of the code.

Ricardo said...

On the BSD land you can mess around with truss, ktrace and kdump.

vishay said...

As a student who recently took an Os class this article was great. thanks

Unix commands Tutorial said...

Great article. Helped me a lot.

Sponsored by Puppet Labs