Introduction to programming for the Apache API

by Sameer Parekh

The Apache Group designed the Apache web server with modularity in mind. When the Apache Group rewrote the server core for the 0.8.x release of Apache, they built an extensible module API into the core in order to provide a consistent interface for adding functionality. They separated the bulk of the server's operations into a set of modules, so that the server core would contain only a minimal set of operations.

The group designed the module structure with a number of motivations. First, the Apache Group is seriously concerned with server performance. By abstracting most of the server's operations into separate modules, the Apache Group made it possible for server administrators to easily remove modules containing functionality they don't need, improving performance for their specific application. Second, third-party developers can easily develop for Apache using the extensible module API, adding to its general functionality. Apache grew from a series of patches to the then-popular NCSA HTTPd server. With a module API, functionality can now be added to the server without an ugly set of patches.

Finally, in addition to providing incredible flexibility, an API allows web engineers to develop applications under this API which previously required the slow CGI system. Netscape Communications Corporation has done some benchmark tests and found that using a server API over the CGI interface provides a X% performance improvement.

In this article we will provide you with an introduction to programming for the Apache Server API. We will dissect the existing config_log_module, which gives web server administrators a configurable alternative to the standard HTTPD Common Log Format. The config_log_module lets server administrators create custom log lines using a "printf" style configuration directive. The "LogFormat" directive specifies the exact format of the log line. For example:

LogFormat "%h %l %u %t \"%r\" %s %b"

This LogFormat directive emulates the standard Common Log Format. The initial comments in the mod_log_config.c source file describe all the format directives that LogFormat accepts.
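
To make the "printf" style idea concrete, here is a minimal sketch of how such a format string might be expanded. It is not the code from mod_log_config.c: the log_fields structure and the expand_log_format function are hypothetical names invented for this illustration, and only a handful of directives are handled.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the data a logger has available. */
typedef struct {
    const char *remote_host;   /* %h */
    const char *remote_user;   /* %u */
    const char *request_line;  /* %r */
    int status;                /* %s */
    long bytes_sent;           /* %b */
} log_fields;

/* Expand a LogFormat-like string into buf (at most len bytes, always
 * NUL-terminated). Handles only %h %u %r %s %b and %%; any other
 * directive is copied through unchanged. */
void expand_log_format(const char *fmt, const log_fields *f,
                       char *buf, size_t len)
{
    size_t used = 0;

    if (len == 0) return;
    buf[0] = '\0';

    while (*fmt != '\0' && used + 1 < len) {
        char piece[512];

        if (fmt[0] != '%' || fmt[1] == '\0') {
            /* Ordinary character: copy it through. */
            piece[0] = *fmt++;
            piece[1] = '\0';
        } else {
            switch (fmt[1]) {
            case 'h': snprintf(piece, sizeof piece, "%s", f->remote_host); break;
            case 'u': snprintf(piece, sizeof piece, "%s", f->remote_user); break;
            case 'r': snprintf(piece, sizeof piece, "%s", f->request_line); break;
            case 's': snprintf(piece, sizeof piece, "%d", f->status); break;
            case 'b': snprintf(piece, sizeof piece, "%ld", f->bytes_sent); break;
            case '%': snprintf(piece, sizeof piece, "%%"); break;
            default:  snprintf(piece, sizeof piece, "%%%c", fmt[1]); break;
            }
            fmt += 2;
        }
        /* Append the piece, truncating if the output buffer is full. */
        snprintf(buf + used, len - used, "%s", piece);
        used = strlen(buf);
    }
}
```

The real module parses the format string once at configuration time and stores the parsed result, rather than re-scanning the string for every request as this sketch does.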

The core data structure in a module is the 'module' structure. When building a module, the application developer defines this structure and fills it with pointers to the functions which the server should call in order to invoke the module's operations. The module structure for mod_log_config.c is as follows:

module config_log_module = {
   STANDARD_MODULE_STUFF,
   init_config_log,		/* initializer */
   NULL,			/* create per-dir config */
   NULL,			/* merge per-dir config */
   make_config_log_state,	/* server config */
   NULL,			/* merge server config */
   config_log_cmds,		/* command table */
   NULL,			/* handlers */
   NULL,			/* filename translation */
   NULL,			/* check_user_id */
   NULL,			/* check auth */
   NULL,			/* check access */
   NULL,			/* type_checker */
   NULL,			/* fixups */
   config_log_transaction	/* logger */
};
The NULL entries in this table refer to portions of the server API which the config_log_module does not use. We will not describe those functions in this article.
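
The idea of a table of function pointers with unused slots left NULL can be sketched in a few lines. The types and names below (toy_request, toy_module, run_loggers) are hypothetical and far smaller than the real structures; the sketch only shows how a core might skip modules that do not implement a given phase.

```c
#include <stddef.h>

/* Hypothetical, much-reduced request and module types for illustration. */
typedef struct {
    int status;
    int times_logged;
} toy_request;

typedef struct {
    const char *name;
    int (*fixups)(toy_request *r);   /* may be NULL */
    int (*logger)(toy_request *r);   /* may be NULL */
} toy_module;

/* A trivial logger hook that just counts how often it was called. */
static int count_log(toy_request *r)
{
    r->times_logged++;
    return 0;
}

/* One module implements only the logger; the other implements nothing. */
toy_module toy_log_module  = { "toy_log",  NULL, count_log };
toy_module toy_noop_module = { "toy_noop", NULL, NULL };

/* Run the logger phase: call every non-NULL logger hook in turn. */
void run_loggers(toy_module **modules, size_t n, toy_request *r)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (modules[i]->logger != NULL)
            modules[i]->logger(r);
    }
}
```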

The LogFormat directive is defined in the "command table", "config_log_cmds", which is as follows:

command_rec config_log_cmds[] = {
{ "TransferLog", set_config_log, NULL, RSRC_CONF, TAKE1,
    "the filename of the access log" },
{ "LogFormat", log_format, NULL, RSRC_CONF, TAKE1,
      "a log format string (see docs)" },
{ NULL }
};
Each entry in this table is a "command_rec". The table is a NULL-terminated array of command_rec structures, one for each configuration directive. Each command_rec has the following fields: the name of the config directive, the function which processes the config directive, a pointer to extra data, the AllowOverride bits for this configuration option (we will not describe AllowOverride here), a description of the configuration format, and a description string, for use in the case of syntax errors.

The config_log_cmds table contains two directives: TransferLog, which names the file in which the log is stored, and LogFormat, which sets the format of the lines written to that log. By specifying TAKE1 as the format of the configuration option, the module directs the Apache configuration core to look for one and only one argument following the configuration directive. Other possible settings for the configuration format include TAKE2 and FLAG, which mean to look for two arguments or to accept the directive as an on/off switch, respectively. We will only use the TAKE1 format in this article.
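
As a rough illustration of how a configuration core could use such a table, the following sketch looks a directive up by name and checks the number of arguments against its declared format. Everything here (toy_command, the TOY_ constants, find_command, arg_count_ok) is a hypothetical simplification, not the Apache implementation, and the sketch compares directive names exactly.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical counterparts of TAKE1, TAKE2 and FLAG. */
enum toy_args_how { TOY_TAKE1, TOY_TAKE2, TOY_FLAG };

/* Hypothetical, reduced command record: no handler or override bits. */
typedef struct {
    const char *name;
    enum toy_args_how args_how;
    const char *errmsg;
} toy_command;

/* A NULL-terminated command table, mirroring the two directives above. */
const toy_command toy_cmds[] = {
    { "TransferLog", TOY_TAKE1, "the filename of the access log" },
    { "LogFormat",   TOY_TAKE1, "a log format string (see docs)" },
    { NULL, TOY_TAKE1, NULL }
};

/* Walk the table until the NULL name; return the entry or NULL. */
const toy_command *find_command(const toy_command *table, const char *name)
{
    for (; table->name != NULL; table++) {
        if (strcmp(table->name, name) == 0)
            return table;
    }
    return NULL;
}

/* Return 1 if argc arguments are acceptable for this directive, else 0. */
int arg_count_ok(const toy_command *cmd, int argc)
{
    switch (cmd->args_how) {
    case TOY_TAKE1: return argc == 1;
    case TOY_TAKE2: return argc == 2;
    case TOY_FLAG:  return argc == 1;   /* a single on/off word */
    }
    return 0;
}
```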

Now that we have seen how the Apache configuration core understands the module-specific configuration, we will look at how the core processes and stores the configuration data internally so that the module may access this data when necessary.

The config_log_module stores its module-specific configuration options in a structure. Modules can define for themselves how they store their configuration options. Some modules, which only need one option, may just use a simple null-terminated string rather than a C structure. The config_log_module structure, known as the config_log_state, is typedef'ed as follows:

typedef struct {
    char *fname;
    array_header *format;
    int log_fd;
} config_log_state;
The Apache API requires one function in order to properly allocate the memory for the configuration structure. The comments for the module structure define this function as the "server config" function. (There also exists a "per-dir config" function, which is not used by config_log_module.)

The config_log_module uses the make_config_log_state() function to allocate memory for the data structure:

void *make_config_log_state (pool *p, server_rec *s)
{
    config_log_state *cls =
      (config_log_state *)palloc (p, sizeof (config_log_state));

    cls->fname = NULL;
    cls->format = NULL;
    cls->log_fd = -1;

    return (void *)cls;
}
The make_config_log_state function takes as arguments a pointer to the "Apache memory pool" and a pointer to the server-wide configuration structure. Apache uses an internal memory allocation system to prevent memory leaks, which we will not describe in detail here.

make_config_log_state simply allocates enough memory for the module configuration data structure, initializes its fields to empty values (NULL pointers and a file descriptor of -1), and returns a pointer to the newly allocated memory. Note that the memory allocation uses "palloc", which is Apache's internal memory allocation function. A module should never use "malloc" to allocate memory. All memory allocations should be made using Apache's set of "pool" memory allocation functions. (Apache internally takes care of deallocating such memory, which is why there is no "pfree".)
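
The reason a pool needs no per-allocation free can be shown with a toy allocator. This is only a sketch of the general idea and not how Apache implements its pools; toy_pool, toy_palloc and toy_pool_clear are hypothetical names, and unlike a module, the toy allocator itself sits on top of malloc.

```c
#include <stdlib.h>

/* Hypothetical block header. The padding members exist only to keep the
 * payload that follows the header suitably aligned. */
typedef union toy_block {
    union toy_block *next;   /* link to the previously allocated block */
    long double pad_ld;
    long long pad_ll;
} toy_block;

typedef struct {
    toy_block *blocks;       /* every block handed out so far */
    size_t count;            /* number of live blocks */
} toy_pool;

void toy_pool_init(toy_pool *p)
{
    p->blocks = NULL;
    p->count = 0;
}

/* Allocate size bytes from the pool; returns NULL if malloc fails. */
void *toy_palloc(toy_pool *p, size_t size)
{
    toy_block *b = malloc(sizeof(toy_block) + size);

    if (b == NULL) return NULL;
    b->next = p->blocks;
    p->blocks = b;
    p->count++;
    return (void *)(b + 1);  /* payload starts right after the header */
}

/* Release every block in one step; callers never free individual blocks. */
void toy_pool_clear(toy_pool *p)
{
    while (p->blocks != NULL) {
        toy_block *next = p->blocks->next;
        free(p->blocks);
        p->blocks = next;
    }
    p->count = 0;
}
```

Because the pool remembers everything it handed out, clearing it once is enough to release all of it, which is what makes leaks from forgotten frees impossible in this scheme.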

Once the memory for all the module's configuration structures is allocated, the server parses the configuration files and calls the functions named in the command table for that module. Setting the LogFormat, for example, is done with the log_format function:

char *log_format (cmd_parms *cmd, void *dummy, char *arg)
{
    char *err_string = NULL;
    config_log_state *cls = get_module_config (cmd->server->module_config,
					       &config_log_module);
  
    cls->format = parse_log_string (cmd->pool, arg, &err_string);
    return err_string;
}
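The second argument to the function, which would carry the module's per-directory configuration, isn't used, because the config_log_module has no per-directory configuration. Therefore we call it "dummy" in the function definition/prototype. As LogFormat is a "TAKE1" configuration directive, its single argument arrives in "arg".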

The function first uses the standard "get_module_config" function to extract from the server core the data structure which was allocated and initialized for this module by the make_config_log_state function. It then calls the "parse_log_string" function to parse the format string that was passed to LogFormat, provided to the function in "arg", into its component parts. The parsed result is assigned to the "format" field of the data structure, and parse_log_string sets an error message if the directive is badly formatted. The function then returns a NULL pointer on success or, if an error occurred, a pointer to a character string containing an error message, which is then printed to stderr.
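
One plausible way to picture the lookup performed by get_module_config is a vector of void pointers with one slot per module. The sketch below assumes that layout; toy_module_id, toy_config_vector and the two accessor functions are hypothetical names used only for illustration.

```c
#include <stddef.h>

#define TOY_MAX_MODULES 8

/* Hypothetical: each module is identified by a small index. */
typedef struct {
    int module_index;
} toy_module_id;

/* Hypothetical config vector: one opaque pointer per module. */
typedef struct {
    void *slots[TOY_MAX_MODULES];
} toy_config_vector;

/* Store a module's private configuration in its slot. */
void toy_set_module_config(toy_config_vector *v, const toy_module_id *m,
                           void *cfg)
{
    v->slots[m->module_index] = cfg;
}

/* Fetch a module's private configuration; NULL if none was stored. */
void *toy_get_module_config(const toy_config_vector *v,
                            const toy_module_id *m)
{
    return v->slots[m->module_index];
}
```

The core never needs to know what a slot points to; only the owning module casts the pointer back to its own structure type.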

Finally, once both the TransferLog and LogFormat directives have been processed, the server is ready to initialize itself. The config_log_module's initialization requires that it open a file on disk (or, if the "| ..." format was passed to TransferLog, open a pipe to a child process) for logging purposes:

void init_config_log (server_rec *s, pool *p)
{
    /* First, do "physical" server, which gets default log fd and format
     * for the virtual servers, if they don't override...
     */
    
    config_log_state *default_conf = open_config_log (s, p, NULL);
    
    /* Then, virtual servers */
    
    for (s = s->next; s; s = s->next) open_config_log (s, p, default_conf);
}
init_config_log simply calls open_config_log() for every server (the main server and all virtual hosts) being run with Apache. It walks the linked list of servers through the "next" pointer in the server_rec structure, calling open_config_log for each one:

config_log_state *open_config_log (server_rec *s, pool *p,
				   config_log_state *defaults)
{
    config_log_state *cls = get_module_config (s->module_config,
					       &config_log_module);
  
    if (cls->log_fd > 0) return cls; /* virtual config shared w/main server */
    
    if (cls->format == NULL) {
	char *dummy;
	
	if (defaults) cls->format = defaults->format;
	else cls->format = parse_log_string (p, DEFAULT_LOG_FORMAT, &dummy);
    }

    if (cls->fname == NULL) {
	if (defaults) {
	    cls->log_fd = defaults->log_fd;
	    return cls;
	}
	else cls->fname = DEFAULT_XFERLOG;
    }
    
    if (*cls->fname == '|') {
	FILE *dummy;
	
	spawn_child(p, config_log_child, (void *)(cls->fname+1),
		    kill_after_timeout, &dummy, NULL);

	if (dummy == NULL) {
	    fprintf (stderr, "Couldn't fork child for TransferLog process\n");
	    exit (1);
	}

	cls->log_fd = fileno (dummy);
    }
    else {
	char *fname = server_root_relative (p, cls->fname);
	if((cls->log_fd = popenf(p, fname, xfer_flags, xfer_mode)) < 0) {
	    fprintf (stderr,
		     "httpd: could not open transfer log file %s.\n", fname);
	    perror("open");
	    exit(1);
	}
    }

    return cls;
}
The open_config_log function opens the file or spawns the process to which the log lines, formatted according to the LogFormat directive, are written. (The TransferLog directive allows a command in the form of "| ..." to be executed, to which the log lines are sent.)
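
The way virtual servers fall back on the main server's settings can be isolated in a small sketch. The toy_server type, the TOY_DEFAULT_ values and resolve_log_settings are hypothetical; the sketch copies names and format strings rather than sharing an open file descriptor as the real function does.

```c
#include <stddef.h>

/* Hypothetical, reduced server record. */
typedef struct toy_server {
    const char *log_name;     /* NULL means "not configured" */
    const char *log_format;   /* NULL means "not configured" */
    struct toy_server *next;  /* next virtual server, or NULL */
} toy_server;

/* Hypothetical built-in defaults for this sketch. */
#define TOY_DEFAULT_NAME   "logs/access_log"
#define TOY_DEFAULT_FORMAT "%h %l %u %t \"%r\" %s %b"

/* Fill in unset fields from defaults, or from the built-ins when
 * defaults is NULL (that is, for the main server itself). */
static void fill_in(toy_server *s, const toy_server *defaults)
{
    if (s->log_format == NULL)
        s->log_format = defaults ? defaults->log_format : TOY_DEFAULT_FORMAT;
    if (s->log_name == NULL)
        s->log_name = defaults ? defaults->log_name : TOY_DEFAULT_NAME;
}

/* Resolve the main server first, then every virtual server after it. */
void resolve_log_settings(toy_server *main_server)
{
    toy_server *s;

    fill_in(main_server, NULL);
    for (s = main_server->next; s != NULL; s = s->next)
        fill_in(s, main_server);
}
```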

Once everything is set up, the server can finally begin to accept requests. As this module is only a logging module, it doesn't use any of the API functionality other than the "logger" function.

The logger function, like the rest of the functions used in the process of handling a request, takes as an argument a pointer to the "request_rec" data structure. The "request_rec" stores all the data pertaining to a particular request made on the server. The logger function uses the data stored within this structure to find the information it needs to log to the file.

The request_rec structure is defined as follows:

struct request_rec {

  pool *pool;
  conn_rec *connection;
  server_rec *server;

  request_rec *next;		/* If we wind up getting redirected,
				 * pointer to the request we redirected to.
				 */
  request_rec *prev;		/* If this is an internal redirect,
				 * pointer to where we redirected *from*.
				 */
  
  request_rec *main;		/* If this is a sub_request (see request.h) 
				 * pointer back to the main request.
				 */

  /* Info about the request itself... we begin with stuff that only
   * protocol.c should ever touch...
   */
  
  char *the_request;		/* First line of request, so we can log it */
  int assbackwards;		/* HTTP/0.9, "simple" request */
  int proxyreq;                 /* A proxy request */
  int header_only;		/* HEAD request, as opposed to GET */
  char *protocol;		/* Protocol, as given to us, or HTTP/0.9 */
  
  char *status_line;		/* Status line, if set by script */
  int status;			/* In any case */
  
  /* Request method, two ways; also, protocol, etc..  Outside of protocol.c,
   * look, but don't touch.
   */
  
  char *method;			/* GET, HEAD, POST, etc. */
  int method_number;		/* M_GET, M_POST, etc. */

  int sent_bodyct;		/* byte count in stream is for body */
  
  /* MIME header environments, in and out.  Also, an array containing
   * environment variables to be passed to subprocesses, so people can
   * write modules to add to that environment.
   *
   * The difference between headers_out and err_headers_out is that the
   * latter are printed even on error, and persist across internal redirects
   * (so the headers printed for ErrorDocument handlers will have them).
   *
   * The 'notes' table is for notes from one module to another, with no
   * other set purpose in mind...
   */
  
  table *headers_in;
  table *headers_out;
  table *err_headers_out;
  table *subprocess_env;
  table *notes;

  char *content_type;		/* Break these out --- we dispatch on 'em */
  char *handler;		/* What we *really* dispatch on           */

  char *content_encoding;
  char *content_language;
  
  int no_cache;
  
  /* What object is being requested (either directly, or via include
   * or content-negotiation mapping).
   */

  char *uri;                    /* complete URI for a proxy req, or
                                   URL path for a non-proxy req */
  char *filename;
  char *path_info;
  char *args;			/* QUERY_ARGS, if any */
  struct stat finfo;		/* ST_MODE set to zero if no such file */
  
  /* Various other config info which may change with .htaccess files
   * These are config vectors, with one void* pointer for each module
   * (the thing pointed to being the module's business).
   */
  
  void *per_dir_config;		/* Options set in config files, etc. */
  void *request_config;		/* Notes on *this* request */

/*
 * a linked list of the configuration directives in the .htaccess files
 * accessed by this request.
 * N.B. always add to the head of the list, _never_ to the end.
 * that way, a sub request's list can (temporarily) point to a parent's list
 */
  const struct htaccess_result *htaccess;
};
This article will not go into the details of how the config_log_module actually does its logging and parses the LogFormat format string, as that is just standard C.

The config_log_transaction function prototype, however, is as follows:

int config_log_transaction(request_rec *r);
The function depends on the request_rec structure, some elements of which we describe here.

pool - A pointer to the pool of memory from which allocations should be made while processing this one HTTP request. After processing this request, all allocations made from this pool are freed.

connection - A pointer to the conn_rec structure, which describes details of the connection, such as the local socket address, remote socket address, etc. We will not discuss the conn_rec in detail in this article.

server - The 'server' is a pointer to the server_rec, which points to all the configuration information specific to the server (i.e. either the main server or one of the virtualhost servers) under which this request was made. Most important within the server_rec structure is the "module_config" pointer, which is used by the get_module_config function to return module-specific configuration directives.

main/next/prev - Some requests may result in an internal redirect, which creates a separate logical request even though the client made only a single HTTP request. The next and prev pointers link the chain of request_rec structures which were processed through internal redirects for the current HTTP request. The main pointer leads from a sub-request back to the main request.
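
A module that wants either end of a redirect chain can simply follow these pointers. The sketch below uses a hypothetical toy_req type with only the fields needed for the walk; final_request and original_request are illustrative helper names, not part of the API.

```c
#include <stddef.h>

/* Hypothetical, reduced request record: just the chain pointers. */
typedef struct toy_req {
    const char *uri;
    struct toy_req *next;   /* request we were redirected to, or NULL */
    struct toy_req *prev;   /* request we were redirected from, or NULL */
} toy_req;

/* Follow next pointers to the request that was finally served. */
toy_req *final_request(toy_req *r)
{
    while (r->next != NULL)
        r = r->next;
    return r;
}

/* Follow prev pointers back to the request the client actually sent. */
toy_req *original_request(toy_req *r)
{
    while (r->prev != NULL)
        r = r->prev;
    return r;
}
```

A logger might, for example, report the request line from the original request together with the status of the final one.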

the_request - string which just contains the first line of the request. (e.g. "GET /index.html HTTP/1.0")

assbackwards - boolean flag to see whether or not we're processing an old-style HTTP/0.9 "simple" request.

status - the HTTP status return code pertaining to the request (e.g. 200 for "Document Follows" or 404 for "Not Found"). httpd.h contains a list of all the status codes that the server currently supports.

headers_in - A pointer to a "table" structure (the table structure is not described in this article) which lists all of the incoming HTTP headers which the client sent to the server.

headers_out - A "table" pointer to the headers which the server sends back to the client.

subprocess_env - Another "table" pointer with all of the environment variables that are set for CGI, SSI, etc.
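
Although the table structure is outside the scope of this article, it helps to think of these three fields as simple key/value lists. The following sketch assumes that view; toy_entry, toy_table and toy_table_get are hypothetical names and do not reflect the real table layout. Keys are compared without regard to case, since HTTP header names are case-insensitive.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical key/value pair and table of pairs. */
typedef struct {
    const char *key;
    const char *val;
} toy_entry;

typedef struct {
    const toy_entry *entries;
    size_t count;
} toy_table;

/* Return 1 if a and b are equal ignoring ASCII case, else 0. */
static int same_name(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == '\0' && *b == '\0';
}

/* Return the value stored under key, or NULL if the key is absent. */
const char *toy_table_get(const toy_table *t, const char *key)
{
    size_t i;

    for (i = 0; i < t->count; i++) {
        if (same_name(t->entries[i].key, key))
            return t->entries[i].val;
    }
    return NULL;
}
```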

content_type - The MIME Content-type for use with dispatching the actual request handlers. This Content-type may be an actual MIME type or it may actually be an internal type in order to dispatch to a specific module's handler based on various criteria. (e.g. CGI_MAGIC_TYPE)

uri - The URL path for a given request. (a request "GET /index.html HTTP/1.0" would associate "/index.html" to the "uri" variable within request_rec)

filename - If the request has translated to an actual file in the filesystem, this is the full path to that file. In some instances (proxy module, for example) the "filename" is not a representation of a file in the filesystem, but perhaps a proxy URL.

finfo - a "stat" structure with information about the file, if it exists in the filesystem. If it doesn't, the server sets finfo.st_mode equal to zero.
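
The st_mode convention described for finfo can be reproduced with the POSIX stat() call. This is a sketch assuming a POSIX system; toy_fill_finfo and toy_file_exists are hypothetical helper names, not functions from the Apache source.

```c
#include <string.h>
#include <sys/stat.h>

/* Fill *finfo for path. If stat() fails (for example because the file
 * does not exist), zero the structure so that st_mode is 0.
 * Returns 1 when stat() succeeded, 0 otherwise. */
int toy_fill_finfo(const char *path, struct stat *finfo)
{
    if (stat(path, finfo) != 0) {
        memset(finfo, 0, sizeof *finfo);
        return 0;
    }
    return 1;
}

/* A later phase can test for existence without calling stat() again. */
int toy_file_exists(const struct stat *finfo)
{
    return finfo->st_mode != 0;
}
```

Testing st_mode works because every existing file has at least its file type bits set in that field, so a zero value can only mean "no such file".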