From: Spencer Shepler (spencer.shepler@sun.com)
Date: 02/22/05-02:39:37 AM Z
Date: Tue, 22 Feb 2005 00:39:37 -0800
From: Spencer Shepler <spencer.shepler@sun.com>
Message-ID: <20050222083937.GF130156@jurassic.eng.sun.com>
Subject: [nfsv4] Re: stable storage for server restart

I won't go through point by point for Solaris but the approach is
similar to what Andy describes for Linux.

The Solaris server has a single directory with a file entry for each
clientid. The filenames are ip-address/clientid with the contents
being the full client supplied clientid. This offers a "quick" way
to determine which clients are active with a server. The same
type of "move state directory on server recovery" methodology is used
and the methodology will survive the cascading failure issue.

Spencer

On Mon, William A.(Andy) Adamson wrote:
> rick@snowhite.cis.uoguelph.ca said:
> > (I cc'd nfsv4@ietf.org, just in case anyone not on the linux list is
> > interested. Apologies in advance for cluttering up your email. Also, I'd be
> > interested in hearing any thoughts others have on the design.)
>
> i'm also interested in thoughts on our design for the linux server which is
> similar.
>
> instead of appending a recovery file we use the 'file system as a data base'
> approach, populating a (configurable) recovery directory with one directory
> per clientid. the clientid directory name is the md5 hash of the client
> supplied client id which can be of length 1024. the md5 hash is calculated at
> SETCLIENTID, and we return CLIENTID_IN_USE for md5 cache hits, which should be
> negligible. no upcalls are involved, all operations are done in-kernel.
>
> > - at server startup (before any Compounds are performed), the log is
> > read and an in-memory structure is created, indicating what clients
> > can reclaim state
>
> we read the recovery directory, populating in-memory structures indicating
> which clients can reclaim state.
>
> > - during reclaim, the in-memory structure is used to check for Grace and
> > is marked for successful reclaims, per client
>
> we do the same.
>
> > * - at the end of the grace period, the file is truncated to 0 length and
> > a new append log is written from the in-memory structure, with one record
> > for each client that successfully reclaimed some state
>
> at the end of the grace period, we remove the clientid directories from the
> recovery directory for those clients that did not reclaim state.
>
> > - then normal, non-grace operation starts...
> > - records are appended to the log when a client acquires the first state
> > (first Open) after a SetClientID and when state is revoked for a
> > client (I do not support revocation of only some state for a client)
> > (nb: The first Open records are only done for clients that didn't
> > successfully reclaim during grace.)
>
> normal, non-grace operation starts:
> - we add a clientid directory to the recovery directory when a client makes
> their first successful open confirm. we remove a clientid directory from the
> list whenever either its lease expires or admin action removes the client
> state.
>
> > - I had to lock the other nfsd threads out when updating stable storage. The
> > reason was:
>
> for us this is all auto-magically handled by existing directory operations
> (mkdir,rmdir), and our nfs state lock.
>
> > - I needed to record the revocation before issuing conflicting lock state,
> > and I wanted to avoid races between multiple clients trying to acquire
> > conflicting locks while the write(s) to disk were in progress. Since
> > revocation is a rare event, I didn't see this as a serious performance
> > hit.
> > - For the case of first Open, the record indicates successful lock
> > state acquisition. If another client acquires a conflicting lock
> > while the disk write(s) for the log are in progress, there would be
> > a record indicating that the client had successfully acquired state
> > although the lock failed, due to a conflict. Is this actually a
> > problem? I'm not convinced it is, but my code "plays it safe" for now.
> > I could see this being a significant performance hit, if lots of new
> > clients did SetClientIDs followed by Opens at the same time. (Ones
> > that haven't already reclaimed locks at server restart.)
> > - The other two cases (when server first starts up and at end-of-grace)
> > only occur once per server reboot and only add a little time to the
> > grace period.
> >
> > I think the weakest part of this design is that, if the server crashes again
> > while at "*", the append only log is not complete (possibly empty). This will
> > result in clients not being allowed to reclaim, that otherwise should be able
> > to (ie. no entry->no reclaim->NFS4ERR_NOGRACE for all reclaim requests).
>
> when we crash at *, our recovery directory has all the clients that have just
> successfully reclaimed state, plus potentially some that could have reclaimed
> state, but didn't during the last grace period. so, i feel ok about crashing
> at * and recovering with the data in the recovery directory.
>
> > The append log will grow, but I only see a problem if clients go hogwild with
> > SetClientIDs. It does get truncated when the server restarts, so a sysadmin
> > can just reboot when it gets too big:-)
>
> our recovery directory only holds active clients; this is a strength of our
> design.
>
> > It also doesn't support the notion of only some state for a client being
> > revoked. (I've looked at that one a bit and it seems to get quite
> > challenging. Maybe someday I'll come up with a simple scheme I'm convinced
> > works for that case.)
>
> another strength of this design is that the clientid directories can easily be
> populated with files containing additional info. for example, we plan on adding
> a file to hold SETCLIENTID principal info.
>
> -->Andy
>
> _______________________________________________
> NFSv4 mailing list
> NFSv4@linux-nfs.org
> http://linux-nfs.org/cgi-bin/mailman/listinfo/nfsv4

_______________________________________________
nfsv4 mailing list
nfsv4@ietf.org
https://www1.ietf.org/mailman/listinfo/nfsv4
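
The recovery-directory scheme Andy describes for the Linux server can be
modelled in user space roughly as below. This is only a sketch under stated
assumptions, not the in-kernel code: the class RecoveryDir, its method names,
and the helper client_dir_name are hypothetical names chosen for illustration,
and the nfs state lock that serialises these operations on the real server is
left out. It shows the md5-named directory per clientid, the startup scan, the
end-of-grace cleanup, and why a second crash before the end of grace loses
nothing.

```python
import hashlib
import os
import shutil
import tempfile


def client_dir_name(client_id: bytes) -> str:
    # Directory name is the MD5 hex digest of the client-supplied id
    # (the id itself may be up to 1024 bytes long).
    return hashlib.md5(client_id).hexdigest()


class RecoveryDir:
    """Hypothetical user-space model of a per-client recovery directory."""

    def __init__(self, root: str) -> None:
        self.root = root
        os.makedirs(root, exist_ok=True)
        # Server startup: each directory on disk names a client that may reclaim.
        self.reclaimable = set(os.listdir(root))
        self.reclaimed = set()

    def record_client(self, client_id: bytes) -> None:
        # Normal operation: called on the client's first successful open confirm.
        path = os.path.join(self.root, client_dir_name(client_id))
        os.makedirs(path, exist_ok=True)

    def remove_client(self, client_id: bytes) -> None:
        # Lease expiry or admin action: forget the client on stable storage.
        path = os.path.join(self.root, client_dir_name(client_id))
        shutil.rmtree(path, ignore_errors=True)

    def reclaim(self, client_id: bytes) -> bool:
        # Grace period: only clients found at startup may reclaim state.
        name = client_dir_name(client_id)
        if name not in self.reclaimable:
            return False
        self.reclaimed.add(name)
        return True

    def end_grace(self) -> None:
        # End of grace: remove directories of clients that did not reclaim.
        for name in self.reclaimable - self.reclaimed:
            shutil.rmtree(os.path.join(self.root, name), ignore_errors=True)
        self.reclaimable = set()


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    first = RecoveryDir(root)
    first.record_client(b"client-A")
    first.record_client(b"client-B")

    second = RecoveryDir(root)   # first restart
    second.reclaim(b"client-A")  # server crashes here, before end_grace()

    third = RecoveryDir(root)    # second restart: both clients are still listed
    print(third.reclaim(b"client-A"), third.reclaim(b"client-B"))
    shutil.rmtree(root)
```

In this model a crash before end_grace() leaves every directory in place, so
the next startup scan still lists both the clients that had reclaimed and
those that had not, which matches the behaviour Andy describes for a crash
at "*".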
This archive was generated by hypermail 2.1.2 : 03/04/05-02:13:55 AM Z CST