Building a high capacity e-mail system

Friday Jul 7th 2000 by Simon Horman

E-mail server running out of room? With open source programs like Sendmail and Perdition, you can build a server farm to handle large amounts of messages.

E-mail continues to be one of the most popular services used on the Internet. According to research from Internet service provider UUNET, "E-mail comprises the bulk of network traffic and has increased an average of 20% a year."

This tremendous growth means that service providers frequently find that the server where users' mail is stored can no longer keep up with the traffic. When upgrading the server is no longer practical, they face the choice of buying a larger machine or deploying multiple mail servers. But can you provide end users with a homogeneous view of their mail when more than one server is handling mail for the domain?

You can--and it's not hard to do. This article will explain how to use existing technologies and protocols to distribute incoming e-mail messages across multiple servers. This approach is designed to handle users who are in a single domain, and not to handle distributing messages to users on a mailing list, which is an entirely different problem.

Multiplexing SMTP

Simple mail transfer protocol (SMTP) traffic can be split across multiple machines by assigning multiple IP addresses to the primary mail exchanger (MX) for a domain. Incoming messages can also be multiplexed using layer 4 switch technology, such as that provided by the Linux Virtual Server Project.

Once e-mail arrives on one of the lowest preference mail exchangers for a domain, it may need to be relayed to another server. There are various ways of dividing up the e-mail, but regardless, the mechanism for determining the end destination for an e-mail is the same.

Rule-based multiplexing

One way to split e-mail between back-end servers is to make an arbitrary division in the address space for users and route mail based on this. With two servers, for instance, you could decide that addresses beginning with the letters A-K have mail delivered to one server and mail for all other addresses is delivered to the other server. Under Sendmail, a simple rule in the Sendmail configuration file would divide the mail, like this:

Kalf regex -s1 -a<@alf.bigisp.com.> (^[a-k][^@]*)<@bigisp.com.>
Kbarney regex -s1 -a<@barney.bigisp.com.> (^[^a-k][^@]*)<@bigisp.com.>

These maps define how an address will be translated. The alf map looks for addresses that begin with A-K and are addressed to @bigisp.com. If a match is found, an address @alf.bigisp.com is returned. The barney map looks for addresses that do not begin with A-K and are addressed to @bigisp.com and returns an address @barney.bigisp.com. To use regex maps you need either Sendmail 8.8 patched with map-regex or Sendmail 8.9. In either case you need to compile with DBMDEF= -DMAP_REGEX set in the Makefile.

These maps should be placed under the definition of Dn in sendmail.cf or under LOCAL_CONFIG in sendmail.mc if the m4 preprocessor is used to generate sendmail.cf. Now place the following under ruleset 98 in sendmail.cf or LOCAL_RULE_0 in sendmail.mc.

R$*                  $: $(alf $1 $)
R$*                  $: $(barney $1 $)
RERROR $*            $#error $: $1

The first line will apply the alf map which translates users beginning with the letters A-K. The second line translates all other users as defined by the barney map. The RERROR line causes the rules to abort if an error has been encountered.
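To make the effect of the two maps concrete, here is a minimal Python sketch of the same split. It is an illustration only, not how Sendmail evaluates its rulesets: the function name route_by_rule is hypothetical, the address is handled as plain user@domain rather than Sendmail's internal form, and case-insensitive matching is an assumption.

```python
import re

# Each entry mirrors one regex map from the article: a pattern applied to
# the local part, and the back-end host the domain is rewritten to.
RULES = [
    (re.compile(r"^[a-k][^@]*$"), "alf.bigisp.com"),
    (re.compile(r"^[^a-k][^@]*$"), "barney.bigisp.com"),
]


def route_by_rule(address, domain="bigisp.com"):
    """Return the address rewritten to a back-end host, or unchanged."""
    local, _, addr_domain = address.partition("@")
    if addr_domain.lower() != domain:
        return address  # not addressed to our domain: leave it alone
    for pattern, backend in RULES:
        # Assumption: matching ignores case, so lowercase before testing.
        if pattern.match(local.lower()):
            return f"{local}@{backend}"
    return address


print(route_by_rule("alice@bigisp.com"))
print(route_by_rule("zoe@bigisp.com"))
```

Because the two patterns are complements of each other, every non-empty local part in the domain matches exactly one rule, which is what makes the split exhaustive.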

Per-user multiplexing

The rule-based solution is simple to implement, but lacks flexibility. Making more complicated rules soon defeats the elegance of this solution as maintaining the rules would become a major chore.

A more flexible approach is to define a map that can redirect each user's mail to the server that their mail is hosted on. If a user's mailbox needs to be moved to another server, then it is a simple matter of changing the entry for that user. The map should be in the form of a hash, to enable fast retrieval of data for an individual user. Conveniently, many message transfer agents (MTAs) including Sendmail and qmail using the fastforward add-on provide such a map in the form of an alias file.
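The idea of a per-user map can be sketched in a few lines of Python. This is only an illustration of the lookup, not of any MTA's alias handling: the flat-file layout (user, whitespace, destination) and the function name load_user_map are assumptions, and fritz and ingrid are used as example back-end hosts.

```python
def load_user_map(lines):
    """Parse 'user<whitespace>destination' lines into a dict (a hash)."""
    user_map = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        user, destination = line.split(None, 1)
        user_map[user.lower()] = destination
    return user_map


# Hypothetical flat file contents.
flat_file = [
    "# user    destination",
    "john      john@fritz.bigisp.com",
    "alison    alison@fritz.bigisp.com",
]
user_map = load_user_map(flat_file)
print(user_map["john"])

# Moving a mailbox to another server is a one-entry change.
user_map["john"] = "john@ingrid.bigisp.com"
print(user_map["john"])
```

The dictionary stands in for the on-disk hash: lookups stay fast no matter how many users there are, and no other entry is affected when one mailbox moves.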

Hybrid multiplexing

The alias mechanism provides great flexibility in assigning users to a mail server. However, it places a burden on administrators to ensure that all users have an alias. Users who receive mail and do not have a valid alias will -- depending on the mail server setup -- have their mail delivered locally, on whichever mail server it arrives on, or the mail will be rejected. This onus is particularly great when a new user is added to the system, in which case it is preferable for users' mail to work with minimal configuration.

A hybrid system that combines simple rules with aliases that can override the rules provides both simplicity and flexibility. New users to the system are covered by the general rules, and users whose mailbox has been migrated for one reason or another are covered by their own alias. This can be achieved with the following:

Kalf regex -s1 -a<@alf.bigisp.com.> (^[a-k][^@]*)<@bigisp.com.>
Kbarney regex -s1 -a<@barney.bigisp.com.> (^[^a-k][^@]*)<@bigisp.com.>
Kuser_map hash /etc/mail/user_map

R$+ < @ $+ > $*      $: $(user_map $1 $: $1 < @ $2 > $3 $)
R$*                  $: $(alf $1 $)
R$*                  $: $(barney $1 $)
RERROR $*            $#error $: $1

The rules are the same as for rule-based multiplexing, with the addition of a map to allow per-user server assignment. This is analogous to the aliases file but is a separate map for greater flexibility. The flat file user_map is built into a hash using the command:

makemap -v hash user_map.db < user_map

The general rules will only take effect in the absence of a valid entry in the user_map. Per-user server assignments are processed first, and if the address to which the user_map entry points is not @bigisp.com then the more general mapping will not take place.
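The order of evaluation described here, per-user entry first and general rules only as a fallback, can be sketched in Python as follows. This is a simplified model rather than Sendmail's ruleset processing; the function name resolve and the single override entry are hypothetical.

```python
import re

# Hypothetical per-user override: alison's mailbox has been moved to fritz.
USER_MAP = {"alison": "alison@fritz.bigisp.com"}

# General rules, applied only when no per-user entry exists.
FALLBACK_RULES = [
    (re.compile(r"^[a-k]"), "alf.bigisp.com"),
    (re.compile(r"^[^a-k]"), "barney.bigisp.com"),
]


def resolve(address, domain="bigisp.com"):
    """Return the delivery address for mail sent to `address`."""
    local, _, addr_domain = address.lower().partition("@")
    if addr_domain != domain:
        return address  # not our domain: nothing to rewrite
    # Step 1: a per-user entry always wins.
    if local in USER_MAP:
        return USER_MAP[local]
    # Step 2: otherwise fall back to the letter-based split.
    for pattern, backend in FALLBACK_RULES:
        if pattern.match(local):
            return f"{local}@{backend}"
    return address


print(resolve("alison@bigisp.com"))  # override applies
print(resolve("adam@bigisp.com"))    # general rule applies
```

Returning as soon as the override matches plays the same role as the rewritten address no longer being @bigisp.com: the general rules never see it.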

Sample topology

One approach to multiplexing incoming mail is to have two layers of servers, a front line of relays that accept connections from foreign mail servers and a back line of mail servers that house users' mail. The front line servers or relay hosts should be set up so that their IP addresses are "A" records for mail.bigisp.com. In addition, an A record should assign a unique host name to each relay host so that specific servers can be accessed for administrative purposes. The back-end servers that hold users' mail should only be accessed by the relay hosts and can be given any host name as long as it is unique. In this example, barney and alf are relay servers, while ingrid and fritz are the back-end servers where mail is stored.

Figure: Sample topology for multiplexing e-mail

Every message is handled twice: it must first be accepted from the sending host by one of the relay hosts and then relayed to one of the back-end servers. This effectively doubles network traffic. This may, however, be offset by putting an additional NIC into the relay hosts to handle traffic to the back-end servers.

It is quite possible to place the back-end servers on an internal network. The back-end servers could then be placed behind packet filtering protection and even placed on private address space networks, as defined in RFC 1918. Only the relay hosts need to be exposed to traffic from foreign hosts, so the servers that are most vulnerable to attack contain no user content, adding extra protection for end users.

Since the relay hosts do not hold any user data, if one fails or is taken down for maintenance, its load can be switched to a backup server or one of the other relay hosts, using a technique such as IP address takeover.

Multiplexing POP3 and IMAP4

Multiplexing of mail retrieval, accessed using either Post Office Protocol 3 (POP3) or Internet Message Access Protocol 4 (IMAP4), is done using Perdition, a mail retrieval proxy written with this purpose in mind. Perdition allows users to connect to a content-free POP3 or IMAP4 server that will proxy a connection to their real POP3 or IMAP4 server, respectively. This enables mail retrieval for a domain to be split across multiple real servers on a per-user basis. Perdition is freely available from http://vergenet.net/linux/perdition/ and is distributed under the GNU General Public License.

Perdition should be run on each server that users access to read their e-mail via POP3 or IMAP4. Typically, these would be the same servers that foreign hosts connect to when sending mail via SMTP. When a connection is made to Perdition in POP3 mode, it reads the USER and PASS commands and then refers to its popmap to find where the user's connection should be forwarded. A connection is then made to the user's real POP3 server, and Perdition issues the USER and PASS commands to that server using the username and password read from the user. If authentication is successful, Perdition pipes data between the client and the real server. If authentication fails, the connection to the real server is closed and the client connection is reset to the state it was in on initial connection; that is, new USER and PASS commands are expected. Similarly, in IMAP4 mode, Perdition accepts the LOGIN command and passes the username and password on to the back-end IMAP4 server specified in the popmap for authentication.
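The POP3 login sequence above can be modelled as a small state machine. The Python sketch below is not Perdition's code and does no networking: the real servers are faked with an in-memory table, and the names POPMAP, BACKENDS, backend_login and proxy_login, along with the sample password, are all hypothetical.

```python
# user -> real server, standing in for the popmap
POPMAP = {"john": "fritz.bigisp.com"}

# Stand-in for the real servers: host -> {user: password}. Hypothetical data.
BACKENDS = {"fritz.bigisp.com": {"john": "secret"}}


def backend_login(host, user, password):
    """Pretend to send USER/PASS to the real server and report success."""
    return BACKENDS.get(host, {}).get(user) == password


def proxy_login(commands):
    """Consume client commands until a login succeeds.

    Returns the real server the session would be piped to, or None if the
    client runs out of commands without authenticating.
    """
    user = None
    for line in commands:
        verb, _, arg = line.strip().partition(" ")
        verb = verb.upper()
        if verb == "USER":
            user = arg
        elif verb == "PASS" and user is not None:
            host = POPMAP.get(user)
            if host is not None and backend_login(host, user, arg):
                return host  # from here on, data is piped both ways
            user = None  # failure: back to the initial state
    return None


print(proxy_login(["USER john", "PASS wrong", "USER john", "PASS secret"]))
```

Resetting user to None after a failed attempt models the return to the initial state: the client has to send USER again before another PASS is considered.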

Pop map

The pop map is analogous to the aliases file and user_map used to multiplex incoming mail on a per-user basis. The pop map determines the server to which each user will be directed once they have connected to Perdition. The format is:

user:server[:port]

For example:

john:fritz.bigisp.com
alison:fritz.bigisp.com
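As an illustration of how such entries break down, here is a small Python parser for a single pop map line. It assumes each entry is user:server with an optional trailing :port; the function name parse_popmap_line is hypothetical and this is not how Perdition itself reads the map.

```python
def parse_popmap_line(line):
    """Split 'user:server[:port]' into (user, server, port_or_None)."""
    fields = line.strip().split(":")
    if len(fields) not in (2, 3) or not fields[0] or not fields[1]:
        raise ValueError(f"malformed pop map line: {line!r}")
    user, server = fields[0], fields[1]
    # The port is optional; None means "use the protocol's default port".
    port = int(fields[2]) if len(fields) == 3 else None
    return user, server, port


for entry in ["john:fritz.bigisp.com", "alison:fritz.bigisp.com"]:
    print(parse_popmap_line(entry))
```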

The program makegdbm, which is provided as part of Perdition, can be used to create a binary of the pop map. To rebuild the pop map run:

makegdbm popmap.db < popmap

Support for regular expression and MySQL-based maps is available, and access to PostgreSQL is currently in testing. Details of how to administer these maps are included in the documentation for Perdition.

Multiplexing other protocols

Multiplexing of incoming mail is covered by multiplexing SMTP, as this is the only protocol commonly used to distribute e-mail on the Internet. If other protocols were to be used for mail delivery, these could be passed through an SMTP gateway in any case.

Multiplexing of mail retrieval is more complex. In my experience, POP3 and IMAP4 are overwhelmingly the most popular methods of mail retrieval. Multiplexing these protocols is handled by Perdition. Users who wish to access their mail from a shell can be assigned to a single server and given shell access to the spool directory, possibly via a network file system. Another approach to offering shell access is to develop a method of transparently transferring mail for shell users into a local mailbox.

Conclusion

The methods of distributing mail between multiple servers presented here represent a scalable solution for applications where hosting mail for a domain on a single server becomes impractical. The cost of this method is that mail delivery and retrieval must typically travel through one extra hop, increasing latency and effectively doubling network utilization for mail. The former is a problem inherent in any multistep mechanism, and the latter can be alleviated by running multiple physical interfaces on a machine and spreading traffic between the interfaces.

Some work remains to be done to this system of distributing e-mail. In particular, pulling together the pop map, aliases, user_map, and general rules into a single resource would simplify management greatly. It may also be necessary to migrate users between back-end servers from time to time.

The architecture developed relies on each user's mail being located on only a single server. This alleviates contention between servers for locks, and avoids having to query each server for knowledge of a mailbox, since all servers have access to mailbox location information. Intelligent construction of pop maps, aliases, or user_map entries could be designed to migrate users' mailboxes to one of the servers closest to the user. When used in conjunction with routing techniques to force traffic for a particular port onto a specific host, the user can be forced to access one of the closest servers. Hence, a mail system distributed among separate physical locations can easily be created, enhancing the quality of service to end users through faster, more reliable access.
