Posted on

File based archive

FileArchive

Last time I provided some background for a tool I call an ‘Archive’. I presented the source for the interface and I said I would follow up with a concrete implementation in the next post. That’s what I’ll talk about this time.

 

Background

When I first envisioned an Archive I was really thinking about storing BLOBs to disk files more than anything else. This was back in the early 1990s, a time when storing BLOBs inside a database wasn’t the norm. The usual route, back then, was to write BLOB files to a disk folder and then store paths to those files in a database table.

That arrangement works fairly well when the number of files stored stays below 20,000 or so, per folder. When the number of files moves higher though, it tends to slow the system down. Eventually, the sheer number of files causes reliability issues with the disk drive. I’ve seen it happen on more than one occasion.

Why does that happen? Well, for one thing, most file systems store hidden index information for each folder. Those indexes keep track of files as well as child folders. When the number of files in a single folder goes too high, the index information becomes incredibly large. I’ve seen indexes that were gigabytes in size. In fact, I’ve seen them become so large that the drive itself can’t be de-fragmented. That’s when problems start to arise and I/O times start to drag.

Implementation

So, the best way around that whole issue is to spread the files out over multiple folders. The trick is to create enough folders to make the process work smoothly without creating any more than is strictly necessary. That’s something my FileArchive class handles nicely. It manages a tree of sub-folders and ensures that no single folder contains more files than it should. It uses an enumeration to decide what strategy to employ. The code for the enumeration is shown here:
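A minimal sketch of what such an enumeration might look like. Only the TINY and HUGE members are named in this post; the type name ManagedTreeDepth (borrowed from the property of the same name), the SMALL and LARGE members, the numeric values and the idea of two hex digits per folder level are all assumptions made for illustration.

```csharp
using System;

// Hypothetical reconstruction: each value is the number of folder levels
// the archive creates under its root folder.
public enum ManagedTreeDepth
{
    TINY = 1,   // named in the post
    SMALL = 2,  // assumed
    LARGE = 3,  // assumed
    HUGE = 4    // named in the post
}

public static class ManagedTreeDepthInfo
{
    // Assumes each level is named with two hex digits (256 possible
    // folders per level), so the leaf folder count is 256^depth.
    public static long MaxFolders(ManagedTreeDepth depth)
    {
        long total = 1;
        for (int i = 0; i < (int)depth; i++)
        {
            total *= 256;
        }
        return total;
    }
}
```

Under those assumptions, a shallow tree keeps the folder count small for modest collections, while a deep tree spreads millions of files thinly enough that no single folder index grows out of hand.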

This approach allows a developer to decide whether they want more files in fewer folders, or fewer files spread out over more folders. As an example, the TINY strategy is a great choice for systems that store fewer files, such as a program to manage recipe files locally. On the other hand, the HUGE strategy is more appropriate for creating an online content management system, where the number of files could total in the hundreds of thousands, or even millions.

The code that creates the folder paths (and file names) is an extension class. It has one method for generating folder names and another for generating file names. Both methods are driven by the enumeration above. Here is the code for the extension:

 

There are two methods on the extension class: ToFolderPath takes a GUID and converts it into a relative path of folders. ToFileName converts a GUID into a file name, but it also accepts a parameter for a version number. This allows us to store multiple versions of files in our archive.
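The two extension methods might be sketched roughly as follows. The method names ToFolderPath and ToFileName come from the post; everything else is an assumption: the enum members other than TINY and HUGE, the use of two hex digits of the GUID per folder level, and the `<guid>.<version>.bin` file name layout.

```csharp
using System;
using System.IO;

// Assumed shape of the strategy enumeration (levels of sub-folders).
public enum ManagedTreeDepth { TINY = 1, SMALL = 2, LARGE = 3, HUGE = 4 }

public static class GuidExtensions
{
    // Builds a relative folder path from the leading hex digits of the
    // GUID, two digits per level, e.g. "01/23" for a depth of 2.
    public static string ToFolderPath(this Guid id, ManagedTreeDepth depth)
    {
        string hex = id.ToString("N"); // 32 hex digits, no dashes
        var parts = new string[(int)depth];
        for (int i = 0; i < parts.Length; i++)
        {
            parts[i] = hex.Substring(i * 2, 2);
        }
        return Path.Combine(parts);
    }

    // Builds a file name that starts with the GUID and carries the
    // version number, so all versions share a common prefix.
    public static string ToFileName(this Guid id, int version = 0)
    {
        return id.ToString("N") + "." + version + ".bin";
    }
}
```

Because GUIDs are effectively random, taking the leading digits spreads entries fairly evenly across the folder tree without any bookkeeping.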

 

Let’s begin the discussion of the FileArchive class by showing the source here:

The first thing to note is that we create an instance of TxnJournal and keep a reference to that object for the life of the archive. I blogged about that class not long ago. It’s the class I use to provide transaction support. I use it here, inside FileArchive, for the same purpose.

The properties are:

  • RootFolder, which is a location on the disk where the archive can create folders and files.
  • ManagedTreeDepth, which is the enumeration I discussed earlier in this post. It specifies what strategy to use when managing the folder tree.

There are two constructors, one without parameters and a second that allows the two property values to be set at creation time.

The IArchive interface is implemented with 4 methods: 2 ‘Put’ overloads, a ‘Get’ method and a ‘Delete’ method. Let’s go through those now.

The first Put method is for creating new archive entries. It accepts a Stream parameter and returns a GUID. Inside the method, there is a check for an active transaction. If one is not found, the stream contents are simply copied to a destination stream and the identifier is returned. The destination stream is created using the Stream method (I’ll cover that shortly). If an active transaction is found, then the stream copy takes place inside an action, on behalf of the TxnJournal object. If the transaction is rolled back, the stream copy is undone using another action. That rollback operation is handled by the TxnJournal class. We don’t have to do anything more than provide the two actions.
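The shape of that first Put method can be sketched as below. This is not the original listing: the `addOperation` delegate is a hypothetical stand-in for the journal's AddOperation method, the flat folder layout and the `.0.bin` suffix are assumptions, and the real class routes the file creation through its private Stream method.

```csharp
using System;
using System.IO;
using System.Transactions;

public static class PutSketch
{
    // Stores a new entry and returns its identifier. When there is no
    // ambient transaction the copy happens immediately; otherwise the
    // copy and its undo are handed over as a pair of actions.
    public static Guid Put(string folder, Stream source,
                           Action<Action, Action> addOperation)
    {
        Guid id = Guid.NewGuid();
        string path = Path.Combine(folder, id.ToString("N") + ".0.bin");

        Action write = () =>
        {
            Directory.CreateDirectory(folder);
            using (var destination = new FileStream(
                path, FileMode.CreateNew, FileAccess.Write))
            {
                source.CopyTo(destination);
            }
        };

        // Undoing a Put simply removes the file that was written.
        Action undo = () => File.Delete(path);

        if (Transaction.Current == null)
        {
            write();
        }
        else
        {
            addOperation(write, undo);
        }
        return id;
    }
}
```

Keeping the write and its undo side by side in one method is the point of the design: the rollback logic lives next to the work it reverses.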

The second Put method accepts a GUID and a Stream. This variant is for creating a new version of an existing archive entry. The method is almost identical to the first Put method, except that we have to find out how many versions of this archive entry already exist. We do that with a call to VersionCount (I’ll cover that shortly). Once the version number is discovered, it is passed into the Stream method to create the proper destination stream. The only other difference here is that we return the version index, rather than the GUID.

The Get method accepts a GUID and a version number. This method is for reading archive entries from the disk. The method is very short, since it relies on the Stream method for most of the actual work. No bits are copied here. Instead, this method simply returns a reference to the destination stream.

The Delete method accepts a GUID. This method removes archive entries from the disk. It does so by iterating through the collection of versions and removing each file in turn. Notice that the action has a rollback action that undoes all the deletes, in case the transaction is rolled back.

 

There are 4 private methods that do most of the work for this class. They are: Stream, Versions, VersionCount and FilePath. Let’s start with the Stream method, which is used to get a reference to a destination stream. If the underlying file doesn’t exist it is created. If it does exist, a read-only stream is opened for that file and returned. The path to the stream is created using the FilePath method (I’ll cover that shortly).
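The create-or-open behaviour of the Stream method could look something like this sketch. The class and method names here are hypothetical, and creating the folder path at the moment the file is first written is an assumption about the original.

```csharp
using System.IO;

public static class ArchiveStreams
{
    // Returns a read-only stream when the file already exists,
    // otherwise creates the file (and any missing folders) and
    // returns a writable stream for the caller to copy into.
    public static Stream Open(string path)
    {
        if (File.Exists(path))
        {
            return new FileStream(
                path, FileMode.Open, FileAccess.Read, FileShare.Read);
        }

        string folder = Path.GetDirectoryName(path);
        if (!string.IsNullOrEmpty(folder))
        {
            Directory.CreateDirectory(folder);
        }
        return new FileStream(
            path, FileMode.CreateNew, FileAccess.Write, FileShare.None);
    }
}
```

Opening existing files read-only means an archived version can never be overwritten by accident; new content always goes into a new version file.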

The Versions method is used to gather a list of all the versions for a specific archive entry. Typically, most archive entries contain a single version but some will contain more. All versions are stored in the same folder as the original, which means more files per folder. This is a design trade-off that allows us to get around the need to manage meta-data for the various versions. It means folders might contain more files than they would without versions, but, as I say, most archive entries won’t contain more than a single version, so it’s not really a big deal. The method uses the FilePath method to look for all files whose name starts with the GUID. That set represents all the versions, and it is returned to the caller.

The VersionCount method is much like the Versions method, except it only returns a count, rather than a collection of file paths.
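Finding versions by file name prefix might be sketched like this. The `<guid>.<version>.bin` naming is an assumption carried over for illustration, and the folder is passed in directly rather than computed by FilePath as in the real class.

```csharp
using System;
using System.IO;

public static class VersionLookup
{
    // Returns the paths of every file in the folder whose name starts
    // with the GUID. No ordering is guaranteed.
    public static string[] Versions(string folder, Guid id)
    {
        if (!Directory.Exists(folder))
        {
            return new string[0];
        }
        return Directory.GetFiles(folder, id.ToString("N") + ".*");
    }

    // Same lookup, but only the number of versions is returned.
    public static int VersionCount(string folder, Guid id)
    {
        return Versions(folder, id).Length;
    }
}
```

With this scheme the file system itself is the version index, which is why no separate meta-data store is needed.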

The FilePath method is where the folder tree is created, using the strategy specified in the ManagedTreeDepth property. The path is built up using the RootFolder property along with the folder path and file name gathered from the GUID extension class. That path is eventually created when the destination stream is created.

 

Using the file archive couldn’t be simpler. Here is a quick demo:

Using the archive in a transaction is almost as simple:

So that’s my original Archive class. I’ve since made other versions that work with various databases. It’s nice to be able to switch from disk-based to database storage strategies just by implementing a different Archive object. In fact, using IOC and a factory, it is even possible to decide what sort of archive to create at runtime, based on things like the size of the BLOB being stored or the MIME type of the data. That makes it possible to store small files on one disk and large files on another, or even to store movies on a separate volume from images, or PDFs, or whatever.
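One way to picture that runtime decision is a small rule-based selector. Everything in this sketch is hypothetical (the ArchiveSelector name, the When and Select methods, the rule signature); it is generic over the archive type so that it stands alone, and in practice TArchive would be IArchive with the instances supplied by the IOC container.

```csharp
using System;
using System.Collections.Generic;

public sealed class ArchiveSelector<TArchive>
{
    private readonly List<(Func<long, string, bool> Matches, TArchive Archive)> _rules
        = new List<(Func<long, string, bool> Matches, TArchive Archive)>();
    private readonly TArchive _fallback;

    public ArchiveSelector(TArchive fallback)
    {
        _fallback = fallback;
    }

    // Registers a rule; rules are evaluated in registration order.
    public ArchiveSelector<TArchive> When(
        Func<long, string, bool> matches, TArchive archive)
    {
        _rules.Add((matches, archive));
        return this;
    }

    // Picks the first archive whose rule accepts the BLOB size (in
    // bytes) and MIME type, or the fallback when nothing matches.
    public TArchive Select(long sizeInBytes, string mimeType)
    {
        foreach (var rule in _rules)
        {
            if (rule.Matches(sizeInBytes, mimeType))
            {
                return rule.Archive;
            }
        }
        return _fallback;
    }
}
```

A caller might register one rule for MIME types starting with "video/" and another for anything over a size threshold, leaving a default archive for everything else.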

I hope everyone enjoys the code.

Martin


The Archive abstraction

Once, while watching valet parking attendants at work, I got an idea for a data storage abstraction …

 

(Yes, I’m aware that I’m a huge nerd)

 

Anyway, I envisioned the idea working like ‘valet parking for binary data’. In other words, you’d give it some data and it would give you back an identifier. Later, you’d give it an identifier and it would give you back your original data.

The more I thought about it, the more I liked the whole idea. Using an Archive, the code performing the actual data storage/retrieval is isolated from the code that owns the data. That kind of isolation is a good thing. The approach is also easier to unit test than equivalent code that works with files or a database directly.

I decided to call my idea an ‘Archive’ … mostly because all the good names were already taken, LOL. I tinkered with the implementation until I finally hit upon something I liked. After using the code in several projects I decided it was worth tossing into my bag of tricks. When .NET came along this was one of the first abstractions I ported over from C++.

Here is the interface that defines all Archives in my library:

As the code above shows, an Archive is clean and simple, having only 4 methods to it. In addition to the Get and Put methods there is also a Delete method for getting rid of data once it’s no longer needed.  Notice there is no support for queries, since an Archive doesn’t maintain or use meta-data. There is no support for iterating over the collection of archived items either. Such an iteration smells of a database and there are already a bazillion different abstractions out there for databases. Archives don’t require databases, but they are frequently used in conjunction with a database – where the database is used to maintain meta-data and the Archive is used to maintain the BLOBs, or data. Each object works together but each one also focuses on a single responsibility. This makes the code smaller, simpler, easier to use and easier to change, down the road.
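The interface listing is missing from this copy of the post, so here is a rough sketch pieced together from the descriptions in the follow-up post: two Put overloads, a Get and a Delete. The exact signatures, the zero-based version index and the InMemoryArchive class are assumptions; the in-memory variant is included only to show how little an implementation needs and how it can stand in for a real archive in unit tests.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Assumed shape of the interface, based on the prose descriptions.
public interface IArchive
{
    Guid Put(Stream data);            // store new data, get an identifier
    int Put(Guid id, Stream data);    // add a version, get its index
    Stream Get(Guid id, int version); // read a specific version back
    void Delete(Guid id);             // remove the entry and its versions
}

// Hypothetical test double that keeps everything in memory.
public sealed class InMemoryArchive : IArchive
{
    private readonly Dictionary<Guid, List<byte[]>> _entries =
        new Dictionary<Guid, List<byte[]>>();

    public Guid Put(Stream data)
    {
        Guid id = Guid.NewGuid();
        _entries[id] = new List<byte[]> { ReadAll(data) };
        return id;
    }

    public int Put(Guid id, Stream data)
    {
        // Throws KeyNotFoundException for an unknown identifier.
        List<byte[]> versions = _entries[id];
        versions.Add(ReadAll(data));
        return versions.Count - 1;
    }

    public Stream Get(Guid id, int version)
    {
        return new MemoryStream(_entries[id][version], writable: false);
    }

    public void Delete(Guid id)
    {
        _entries.Remove(id);
    }

    private static byte[] ReadAll(Stream source)
    {
        using (var buffer = new MemoryStream())
        {
            source.CopyTo(buffer);
            return buffer.ToArray();
        }
    }
}
```

Code that owns the data only ever sees IArchive, so swapping the in-memory version for a file-based or database-backed one requires no changes on the calling side.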

So how do I use this interface? In my next post I’ll lay out my first Archive implementation. It’s file based and capable of safely storing crazy numbers of files on a file-system. I’ve used this file-based Archive in many, many projects and it’s never let me down.

Until next time then.


Drop in C# Transaction Support

I ran across a cool blog entry dealing with the IEnlistmentNotification interface, which is Microsoft’s method of allowing code to wire itself into an ongoing transaction.

Blog entry is here: http://www.chinhdo.com/20080825/transactional-file-manager/

Chin’s code can be downloaded here: http://transactionalfilemgr.codeplex.com/

 

The IEnlistmentNotification interface is actually pretty slick but as usual, Microsoft seems to have skipped out on providing any real examples of how to use it. Chin Do accepted that challenge and came up with what I think is a pretty darned good solution. He wrote about it at length and even included some example code.

All in all, I thought Chin’s approach was good but it didn’t quite meet my needs. His approach spread the logic for the transaction out into multiple “operation” classes. I wanted my solution to be a little less intrusive than that. His approach also made it difficult to return values from a transaction. In fact, his example has methods that all return void.

 

My first thought was to get rid of Chin’s ‘operation’ classes and replace them with a pair of actions: one for the transaction operation itself and another for the rollback. I figured I could hold a list of those actions internally and obviate the need for much of the code in Chin’s library.

I started with a class that implements the IEnlistmentNotification interface. So far, so good, lol. Here is my listing:

My transaction class includes a single embedded class, _TxnOp, that stands in place of all the ‘operation’ classes in Chin’s approach. The class contains two action delegates, one for the transaction operation and another for the rollback. My thinking was, by using delegates, I could keep the logic for a transaction operation very close to the logic for the rollback. I’ll demonstrate that in a bit, when I show my usage demonstration.

My transaction class contains a list of these _TxnOp objects. There is a method named AddOperation that adds a new instance of _TxnOp to that list at runtime.

The list is really the key to the whole class, because it gives us the ability to undo any work we’ve performed in a transaction, should a rollback be required. A rollback would be performed by the current Transaction object and would include any other objects that indicated a desire to enlist in that transaction. We indicate such a desire in the constructor, by calling the Transaction.Current.EnlistVolatile method. That tells .NET to include our object in the current transaction.

Granted, it’s not durable transaction support. I’m still looking into how to add that little gem without causing too many ripples in my existing code.

 

So, when .NET wants to roll back the current transaction, which we have indicated we wish to participate in, it calls the IEnlistmentNotification.Rollback method. When that method is called, we walk the list backwards and call the Rollback action for each item in the list, effectively undoing everything we’ve done in the transaction.
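Since the listing itself is missing from this copy, the mechanism described above can be sketched roughly as follows. The names TxnJournal, _TxnOp and AddOperation come from the post; the member layout, running the operation inside AddOperation, the null check in the constructor and the JournalDemo class are assumptions. It needs a reference to System.Transactions.

```csharp
using System;
using System.Collections.Generic;
using System.Transactions;

public sealed class TxnJournal : IEnlistmentNotification
{
    // One entry per unit of work: the operation and its undo.
    private sealed class _TxnOp
    {
        public Action Operation;
        public Action Rollback;
    }

    private readonly List<_TxnOp> _ops = new List<_TxnOp>();

    public TxnJournal()
    {
        // Volatile enlistment in the ambient transaction, if any.
        if (Transaction.Current != null)
        {
            Transaction.Current.EnlistVolatile(this, EnlistmentOptions.None);
        }
    }

    // Runs the operation, then records it so it can be undone later.
    public void AddOperation(Action operation, Action rollback)
    {
        operation();
        _ops.Add(new _TxnOp { Operation = operation, Rollback = rollback });
    }

    public void Prepare(PreparingEnlistment preparingEnlistment)
    {
        preparingEnlistment.Prepared();
    }

    public void Commit(Enlistment enlistment)
    {
        _ops.Clear();
        enlistment.Done();
    }

    public void Rollback(Enlistment enlistment)
    {
        // Undo in reverse order, most recent operation first.
        for (int i = _ops.Count - 1; i >= 0; i--)
        {
            _ops[i].Rollback();
        }
        _ops.Clear();
        enlistment.Done();
    }

    public void InDoubt(Enlistment enlistment)
    {
        enlistment.Done();
    }
}

public static class JournalDemo
{
    // Adds two items inside a scope that is never completed, so the
    // journal is asked to undo both when the scope is disposed.
    public static List<string> Run()
    {
        var items = new List<string>();
        using (var scope = new TransactionScope())
        {
            var journal = new TxnJournal();
            journal.AddOperation(() => items.Add("first"), () => items.Remove("first"));
            journal.AddOperation(() => items.Add("second"), () => items.Remove("second"));
            // scope.Complete() is deliberately not called.
        }
        return items;
    }
}
```

Because the operation and rollback are plain closures, a method that uses the journal can capture a result in a local variable and return it normally, which is awkward with separate operation classes.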

 

How to use this class? Well, I’ve whipped up a really quick example here:

Notice that my sample class doesn’t have to implement IEnlistmentNotification directly – that’s what TxnJournal is for. Simply create an instance of the TxnJournal and almost everything else is automatic. When implementing your method, simply put the logic into an action, and the rollback logic into a second action. My class, and IEnlistmentNotification, will take care of the rest. So, use my sample class like this:

I’m still performing some testing but so far this approach looks solid.  When I get around to blogging about my archive objects I’ll bring the TxnJournal class up again, since that’s how I’m now adding transaction support for that code. Hopefully I’ll have the time to post about that soon.

In the meantime, this is an alternative approach for working with IEnlistmentNotification. I hope you enjoy it.

Thank you Chin Do, for taking the time to blog about this.

~ Martin


Hello again …


Hello there!

I’ve recently moved CODEGATOR.COM to a new hosting service and resurrected my on-again, off-again technical blog. Back in 2004, when I first created this site, I envisioned it as a way for me to keep track of anything .NET related that I didn’t want to lose track of. Today, in 2015, my designs for CODEGATOR.COM are still the same. So expect occasional technical posts, scattered bits of C# code and long periods of silence when my personal life (and family) pull me away.

 

Unless otherwise noted, all the code posted on CODEGATOR.COM is published under the GNU GPL V3 license, a copy of which may be found here.

 

Please feel free to use my code in your projects. If you come up with anything awesome drop me a line and let me know. I’m always interested in how my code is used by others.

 

~ Martin Cook.