NodeJS events and recursion (readdir)

September 16, 2010 in How-To

Notice: This is confusing. The code is confusing and the concepts are confusing. But, the results are very easy to use. This is also a nice pattern to follow for when you need to do something similar.

The reason I built this function is as follows: I was working in node.js. I was processing an html request. I needed to get two lists of files from different folders, and return both lists for the same request. One of the lists of files needed to be recursive (children folders, and children of children …).

The good ol’ way of doing this (like in C or Java) would be to write a single function that would return a collection of files for a given folder. That function would call itself when it found a folder. Finally, when everything finished (the nested recursive calls all returned), the initial call to the function would return ALL the results. Remember, that this will not work in NodeJS using the awesome async calls.

So, what is the concept? How do we design this in NodeJS/JavaScript using async calls? The good ol’ function signature would look like the following:

var data = readDirectory(path)

But, that will not work here. We know that we will have to call “readdir”, which returns immediately. We do get to pass a function to readdir that will get called/ran when readdir is finally finished. We want our new function to act just like readdir. The signature for our new function should end up being something like the following (Note: I’ve left out error handling to make this easier to read):

readDirectory(path, fnc)
fnc = a function that takes a single parameter, which will be the entire data returned.
readDirectory(".", function(data) { ...work here... });

When the entire directory (path) has been read in, we’ll want the “data” to be returned something like the following:

data = []; // an array of objects. one object per file or folder in the path
data[N].name = file/folder name
data[N].stat = the "stat" for the file/folder
data[N].children = if exists, another "data" array of objects containing the children
Note: data[n].stat.isDirectory() will return true is data[N].name is a folder

Now that we know what the signature looks like, and what data we are getting back, we can design how this new super awesome function is going to actually get it’s work done. Here is the outline: (Note: to reduce indentation, most nested function stays at the same indentation level)

  1. call “readdir”, and return
  2. the function that readdir calls back continues…
  3. we now have an array of filenames
  4. iterate over the filenames
  5. for each filename, run the file system “stat” function. Which returns immediately
  6. the function that stat calls back continues…
  7. count the number of stat call backs, so we know the last file that calls back
  8. when the last file has called our call back function, we continue…
  9. at this point, we have the filename and the stats for the file
  10. create an object, and push it onto the data array that will get returned
  11. IF this filename is a folder then do the following:
    1. call ourself (readDirectory) with the path + filename
    2. when readDirectory calls back, we continue…
    3. set the .children attribute of the object to the data returned by readDirectory
    4. count the number of children folders processed
    5. when the last child folder has been processed then continue…
    6. run the original callback function with the original returned data
  12. if we have processed all files AND we didn’t find any folders then continue…
  13. run the callback function with the data we’ve gathered

Whew. Did that make any sense at all? The actual code is below, if you want to compare the design to the code.
One of the major pain points for me during the creation of this function was: Accessing a specific instance of an object from within a callback. The problem is that the closure saves a reference to the variable (not the variable contents). So, if you are creating a new object (“obj”) in a look, and wanting to use that object within a callback function (like I’ve done here) — things don’t work very nicely. The closure allows you to access the “obj” object, but you get the last value that “obj” contained — not the value it contained when the callback was created. SO, a quick-n-easy way to save the instance of the variable is to create and run a function. In the code below, see the line that starts “(function(obj2) { …})(obj);” What this does is to create a function, which expects a single argument that it will call “obj2″. Then by wrapping the function in parenthesis, and adding parenthesis at the end, we run the function right now, and pass in the *value* of obj to that function. That function then calls readDirectory with a callback (which creates a closure, which gives access to “obj2″). That sounds complicated and messy — it isn’t too bad.
The other minor pain is knowing when all the callbacks have finished for something. Something like running “stat” for every file in a folder. The easy way is to remember the number of times you called “stat” (in the code below it is in the variable “count”). And then, simply decrement the count each time the callback function runs. When we reach zero, we know we are at the last callback. Since all these “stat” functions run async, they may callback in a random order. So, you can’t just check when the *last* file has called back.

I hope this helps someone deal with this rather confusing issue.

The “less complex version” follows. The “more complex version” is at the end of this post.

/**
 * read a directory (recursively deep)
 * data[] = an object for each element in the directory
 *		.name = item's name (file or folder name)
 *		.stat = item's stat (.stat.isDirectory() == true IF a folder)
 *		.children = another data[] for the children
 */
readDirectory = function(path, next) {
  // queue up a "readdir" file system call (and return)
  fs.readdir(path, function(err, files) {
    var count = files.length;  // count of files in the current folder
    var countFolders = 0;
    var data = [];
    // iterate over each file in the dir
    files.forEach(function (name) {
      // queue up a "stat" file system call for every file (and return)
      fs.stat(path + "/" + name, function(err, stat) {
        var obj = {};
        obj.name = name;
        obj.stat = stat;
        data.push(obj);
        if (stat.isDirectory()) {
          countFolders += 1;
          // perform "readDirectory" on each child folder (which queues up a
          //    "readdir" and returns)
          (function(obj2) {
            // obj2 = the "obj" object
            readDirectory(path + "/" + name, function(data2) {
              // entire child folder info is in "data2" (1 fewer child
              //    folders to wait to be processed)
              countFolders -= 1;
              obj2.children = data2;
              if (countFolders <= 0) {
                // sub-folders found. This was the last sub-folder
                //    to processes.
                next(data);
              } else {
                // more children folders to be processed. do nothing here.
              }
            });
          })(obj);
        }
        // 1 more file has been processed.
        count -= 1;
        if (count <= 0) {
          // all files have been processed.
            if (countFolders <= 0) {
              // no sub-folders were found. DONE. call "next"
              next(data);		// no sub-folders found
            } else {
              // children folders were found. do nothing here.
            }
          }
        });
    });
  });
};

Note: please see here for how to save variables for the next event-loop (The creation of a function, and running it).


/**
 * read a directory (recursively deep)
 * data[] = an object for each element in the directory
 *		.name = item's name (file or folder name)
 *		.stat = item's stat (.stat.isDirectory() == true IF a folder)
 *		.children = another data[] for the children
 * filter = an object with various filter settings:
 *		.depth		= max directory recursion depth to travel
 *						(0 or missing means: infinite)
 *						(1 means: only the folder passed in)
 *		.hidden		= true means: process hidden files and folders (defaults to false)
 *		.callback	= callback function: callback(name, path, filter) -- returns truthy to keep the file
 *
 *
 * @param path		= path to directory to read (".", ".\apps")
 * @param callback	= function to callback to: callback(err, data)
 * @param [filter]	= (optional) filter object
 */
exports.readDirectory = function(path, callback, filter) {
	if (filter) {
		// process filter. are we too deep yet?
		if (!filter.depthAt) filter.depthAt = 1;				// initialize what depth we are at
		if (filter.depth && filter.depth < filter.depthAt) {
			callback(undefined, []);							// we are too deep. return "nothing found"
			return;
		}
	}
	// queue up a "readdir" file system call (and return)
	fs.readdir(path, function(err, files) {
		if (err) {
			callback(err);
			return;
		}
		var doHidden = false;				// true means: process hidden files and folders
		if (filter && filter.hidden) {
			doHidden = true;				// filter requests to process hidden files and folders
		}
		var count = 0;						// count the number of "stat" calls queued up
		var countFolders = 0;				// count the number of "folders" calls queued up
		var data = [];						// the data to return

		// iterate over each file in the dir
		files.forEach(function (name) {
			// ignore files that start with a "." UNLESS requested to process hidden files and folders
			if (doHidden || name.indexOf(".") !== 0) {
				// queue up a "stat" file system call for every file (and return)
				count += 1;
				fs.stat(path + "/" + name, function(err, stat) {
					if (err) {
						callback(err);
						return;
					}
					var processFile = true;
					if (filter && filter.callback) {
						processFile = filter.callback(name, stat, filter);
					}
					if (processFile) {
						var obj = {};
						obj.name = name;
						obj.stat = stat;
						data.push(obj);
						if (stat.isDirectory()) {
							countFolders += 1;
							// perform "readDirectory" on each child folder (which queues up a readdir and returns)
							(function(obj2) {
								// obj2 = the "obj" object
								exports.readDirectory(path + "/" + name, function(err, data2) {
									if (err) {
										callback(err);
										return;
									}
									// entire child folder info is in "data2" (1 fewer child folders to wait to be processed)
									countFolders -= 1;
									obj2.children = data2;
									if (countFolders <= 0) {
										// sub-folders found. This was the last sub-folder to processes.
										callback(undefined, data);		// callback w/ data
									} else {
										// more children folders to be processed. do nothing here.
									}
								});
							})(obj);
						}
					}
					// 1 more file has been processed (or skipped)
					count -= 1;
					if (count <= 0) {
						// all files have been processed.
						if (countFolders <= 0) {
							// no sub-folders were found. DONE. no sub-folders found
							callback(undefined, data);		// callback w/ data
						} else {
							// children folders were found. do nothing here (we are waiting for the children to callback)
						}
					}
				});
			}
		});
		if (count <= 0) {	// if no "stat" calls started, then this was an empty folder
			callback(undefined, []);		// callback w/ empty
		}
	});
};

1 response to NodeJS events and recursion (readdir)

  1. Dude, your code is cut off. Otherwise it is very helpful. Thanks!