[OpenAFS-devel] mount-point inode-number inconsistencies with openafs-1.4.1

Jeffrey Hutzelman jhutz@cmu.edu
Thu, 01 Jun 2006 11:34:19 -0400



On Thursday, June 01, 2006 04:36:21 PM +0200 Alexander Bergolth 
<leo@strike.wu-wien.ac.at> wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 05/30/2006 07:42 PM, chas williams - CONTRACTOR wrote:
>> In message <447C75A0.60003@strike.wu-wien.ac.at>,Alexander Bergolth
>> writes:
>>
>>> When determining the inode-numbers of the mount-points and using
>>> relative path names, different inode numbers are shown when called for
>>> the first time. On subsequent calls the same inode numbers are shown (!)
>>> until I do a "pwd", then the behavior is reset and the next call prints
>>> different inode numbers again:
>>
>> linux doesn't handle having a directory inode mounted twice in the
>> same filesystem very well.  the linux vfs operates on pathnames
>> more than inodes, so there needs to be only one dcache entry per
>> directory inode.  since you have two paths to the same inode,
>> we need to pick which dcache entry to keep current.  in 1.4.1
>> this is now the latest dcache entry (it cured a different bug
>> in the vfs filesystem) instead of the "first found" dcache entry.
>>
>> when a new dcache entry is chosen, the inode number is updated.
>> i believe the inode number is based on the mount point so this is
>> going to lead to different inode numbers.
>>
>>> -------------------- snip! --------------------
>>> $ ls -id1 . backup backup/backup
>>> 278020792 .
>>> 193265714 backup
>>> 193265714 backup/backup
>>
>> here you switch from the original path to a new path, so
>> the inode number changes.
>
> Hmm - I don't get it...
> The working directory is /afs/wu-wien.ac.at/home/edvz/skamrada and I'm
> referencing ., backup and backup/backup, so that's 3 paths (and 3 mount
> points), isn't it?
>
> /afs/wu-wien.ac.at/home/edvz/skamrada/logs
> /afs/wu-wien.ac.at/home/edvz/skamrada/logs/backup
> /afs/wu-wien.ac.at/home/edvz/skamrada/logs/backup/backup

Not entirely.  The inode numbers assigned to files and directories in AFS 
are derived from the FID (volume, vnode, uniquifier) of each file or 
directory.  But volume roots are handled specially - in order for readdir 
to work correctly on a directory containing mount points, the inode number 
assigned to a volume root directory is based on the FID of the mount point 
used to reach that volume, not that of the directory itself.  Each time you 
traverse a mount point, the inode number for the resulting volume root 
directory changes to reflect the FID of the mount point you just used to 
reach it.

In your example, there are actually only two mount points, not three, 
because logs/backup and logs/backup/backup are the same mount point.  They 
are both the entry with the name 'backup' in the root directory of the 
volume user.skamrada.log, and thus have the same FID.  So, the inode number 
is recomputed when you traverse logs/backup/backup, but it doesn't appear 
to change because the computed value is the same.
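A toy sketch of why the numbers come out equal, in Python. The hash below is purely illustrative (the real OpenAFS client derives the inode differently, though likewise as a function of the FID), and all names and values are made up:

```python
from collections import namedtuple

# An AFS FID: (volume, vnode, uniquifier).  All values here are made up.
FID = namedtuple("FID", "volume vnode uniquifier")

def inode_for(fid):
    # Illustrative mixing of the FID fields into one 32-bit number; the
    # real OpenAFS computation differs, but is likewise FID-derived.
    return (fid.volume * 2654435761 ^ fid.vnode << 6 ^ fid.uniquifier) & 0xFFFFFFFF

# logs/backup and logs/backup/backup are the same directory entry
# ('backup' in the root of user.skamrada.log), hence the same FID,
# hence the same computed inode number:
mtpt_via_logs   = FID(volume=536870999, vnode=3, uniquifier=1)
mtpt_via_backup = FID(volume=536870999, vnode=3, uniquifier=1)
assert inode_for(mtpt_via_logs) == inode_for(mtpt_via_backup)
```

A mount point with a different FID (a different entry, or one in a different volume) would generally yield a different inode number, which is what you see when a new dcache entry is chosen.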


>> referencing . still stays on the current path.

Yes, because that doesn't involve traversing a mount point.


>>> $ pwd
>>> /afs/wu-wien.ac.at/home/edvz/skamrada/logs
>>> $ ls -id1 . backup backup/backup
>>> 278020792 .
>>
>> ah, you found a "new" (different) path to the same volume.
>> we switch back.  but again, you reference the other path
>> and we switch the inode back.
>
> Why does pwd cause this?

Because of the way it works, which is something like this:

(0) start with an empty path
(1) stat "." to find out its inode number
(2) readdir ".." looking for an entry with a matching number
    - if no entry is found, give up
    - if the entry is named ".", we are done
    - otherwise, add the name we found to the front of the path
(3) chdir to ".."
(4) repeat
(5) when done, chdir back to the original directory
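The steps above can be sketched in Python. This is a simplified illustration of the classic user-space walk-upward algorithm, not the actual coreutils code:

```python
import os

def slow_getcwd():
    """Reconstruct the working directory by walking upward, matching
    the inode of "." against entries read from ".." at each level."""
    path = []
    here = os.stat(".")
    saved = os.open(".", os.O_RDONLY)  # so we can chdir back at the end
    try:
        while True:
            parent = os.stat("..")
            if os.path.samestat(here, parent):
                break  # "." and ".." are the same inode: we reached the root
            # readdir ".." looking for an entry whose inode matches "."
            name = None
            for entry in os.scandir(".."):
                st = entry.stat(follow_symlinks=False)
                if st.st_ino == here.st_ino and st.st_dev == here.st_dev:
                    name = entry.name
                    break
            if name is None:
                raise OSError("no matching entry found in '..'")
            path.insert(0, name)  # add the name to the front of the path
            os.chdir("..")
            here = parent
    finally:
        os.fchdir(saved)  # step (5): chdir back to the original directory
        os.close(saved)
    return "/" + "/".join(path)
```

Note the dependence on the inode number of "." matching what readdir reports in the parent; that is exactly the assumption the AFS mount-point behavior violates.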

One side-effect of this is to traverse the "real" mount point, which gets 
you the real inode number again.  It also results in the volume moving 
around in the dentry tree, which can confuse the pwd algorithm into 
failing, if someone else is using that volume in a way that prevents the 
dentry from being moved at the right time.

Now, you didn't see this on previous Linux releases because FC5 introduced 
a new version of coreutils (which provides the 'pwd' program) that 
contains a bug: coreutils uses its own getcwd() routine, which does 
something like what is described above, instead of using the one provided 
by the system library.  It's unclear to me why they did this; 
perhaps the upstream maintainers thought their approach was more efficient, 
or perhaps it was just an oversight that occurred in the process of copying 
the getcwd() code from glibc (which IMHO was pretty dumb, since it 
essentially means the two pieces of code will now be maintained 
independently).

The reason this is important is that on Linux, the getcwd() provided by 
glibc is implemented by making a system call, which walks upward along the 
dentry tree collecting the names of each entry.  Not only is this method 
considerably more efficient than the algorithm described above, it also 
always produces correct results.


> The problem is that we are not able to control what users are doing. Of
> course these mount-point loops are not desirable, but the problem is that
> one user may add such a mount point and render other users' applications
> that traverse the filesystem (like find) unusable.

It is generally not a safe idea to use 'find' to traverse parts of the 
filesystem where other people might do dangerous or malicious things.  For 
example, using find to manipulate ACLs on space managed by someone else is 
dangerous, because they might insert a mount point which causes you to 
traverse into a volume you didn't intend to change.


-- Jeffrey T. Hutzelman (N3NHS) <jhutz+@cmu.edu>
   Sr. Research Systems Programmer
   School of Computer Science - Research Computing Facility
   Carnegie Mellon University - Pittsburgh, PA