SAGE ;login: - On Reliability - Networks and Services

On Reliability—Networks and Services

John Sellens has recently joined the Network Engineering group at UUNET Canada in Toronto, after 11 years as a system administrator and project leader at the University of Waterloo.

This issue's "reliability" article is concerned with network and network service planning. This includes such things as routers, switches, cabling, leased lines, and servers for such things as mail, DNS, and file service.

Once again, it is important to understand what you want to accomplish before you set out to accomplish it: recall the ideas of service levels, risk evaluation, costs of failures, etc. (and if you don't recall them, you may wish to see the articles in the June and August issues). In this article, I'll point out a few places that tend to be "single points of failure" and try to suggest ways to deal with them. And remember, as reliability increases, so do cost and complexity; work to find the balance that is best for your organization.

Network Topology and Components

Let's talk a little bit about network topology—the physical layout of the cables, routers, hubs, etc. that make up the "skeleton" of your network (if the main part of a network is the "backbone," I figure that it's fair to call the whole shebang the "skeleton").

If you're a small organization and everyone fits on one floor of a typical office building, your layout is likely going to be pretty straightforward. All the routing and repeating hardware will end up in your single wiring closet (which, for this size of network, may actually be a closet), with a single connection to your internet service provider (ISP). While you don't have a lot of reliability choices in this instance, there are a number of things that you can do to help you recover quickly when a failure happens, and many of these tactics can be applied to local hubs in much larger networks.

Office Wiring Drops

I have three suggestions for your office drops: use quality components (e.g., "Cat 5" wiring, terminations, punch downs, etc.), install them carefully and within specifications (e.g., length limits), and install spares, because some wires and connectors will eventually fail, and it's much nicer when you can just switch people to the next port to get them working again. You should also be careful where cables are run in the offices or cubicles. This is not just a safety thing; you don't want people walking on or tripping over your nice, new cables either. You should probably consider hiring an outside contractor to do your wiring instead of trying to do it yourself. It will probably get done faster, and better, and you'll have a written guarantee and someone else to point the finger at when things don't work.

Local Hubs and Repeaters

Whether you choose fancy-shmancy smart hubs with all sorts of whiz-bang remote management features, or the dumbest, plainest hubs you can find to minimize the number of things that can go wrong, try to:

standardize on one model or brand
maintain a good relationship with your supplier
keep a spare or two on hand

Make sure you have spare ports available in case a single port or group of ports goes bad. If your main file/mail/Web/doom server is the only fast Ethernet device you have, make sure you have a spare fast Ethernet port to use if the one in use goes bad. (If your organization is like most, you'll need extra repeater ports eventually anyway. Just remember to buy more repeaters when you start running out of ports!) Cascading hubs, where a number of slave modules are daisy chained off a single master, can be attractive if you need managed hubs, but make sure you're not left completely dead in the water when the master module fails.

Local Routers

If you're a small office, you may have only a single network, and your only router may be for your (single) connection to the outside world via your ISP. Routers tend to be relatively expensive, so it may not be financially practical to keep a spare on hand. If you have only one or a few routers, see if you can strike a deal with your ISP or router vendor (and you do have only one router vendor, don't you?) for quick replacement in case of failure, or, at the very least, purchase a maintenance contract that guarantees you overnight (or faster) delivery of a replacement. If not, be prepared to be cut off from the rest of the world at some point.

That pretty much covers local network design, other than to mention that you should consider the security and safety of your wiring closet. Ideally, you would like one that is 100% secure from prying eyes and fingers, is air-conditioned and rodent-free, has uninterruptible power, and is not located in a hurricane- or tornado-prone area, on a floodplain, or underneath a kitchen or washroom. Unless you're very lucky, you'll probably have to settle for something less than ideal.

The "upstream" or "backbone" portion of your network is where you will probably be more concerned about higher levels of reliability. It's one thing for a workgroup LAN to go down and isolate 15 or 20 people, but it's a different matter entirely when your global corporate backbone melts down and thousands of employees are left idle. This is where you should consider redundant paths and cables, uninterruptible power and good environmental controls, high reliability hardware, and very careful configuration and monitoring.

One of the most obvious reliability approaches in network topology is the use of redundant communication paths to guard against natural or backhoe-related cable failures. This is often more important when your organization is spread across multiple buildings, cities, or continents than when you are within a single building. Cable or fiber failures within a building are usually easier to find and deal with than a failure on a leased fiber somewhere between Chicago and New York. By building some sort of "looping" into your network (e.g., three buildings and each has a direct connection to each of the others), you can live with the failure of any one wide-area link, albeit at a reduced total bandwidth. When possible, you should consider the use of multiple access points to your building and the use of different wide-area communication carriers to reduce the risk of a single incident (fire, backhoe, disgruntled telco employee) taking out both of your redundant links. For example, a power outage at a communication carrier can be a big problem for your network if your only connection goes through that switching office.
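To make the "looping" idea concrete, here is a minimal Python sketch that checks whether a topology survives the loss of any single link. The building names and link lists are made up for illustration, and the check ignores bandwidth entirely; it only asks whether everyone can still reach everyone else.

```python
def is_connected(nodes, links):
    """Return True if every node is reachable from the first one over the links."""
    nodes = list(nodes)
    if not nodes:
        return True
    neighbours = {n: set() for n in nodes}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen = {nodes[0]}
    stack = [nodes[0]]
    while stack:
        current = stack.pop()
        for nxt in neighbours[current] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return len(seen) == len(nodes)


def fragile_links(nodes, links):
    """List the links whose individual failure would partition the network."""
    links = list(links)
    return [
        link
        for i, link in enumerate(links)
        if not is_connected(nodes, links[:i] + links[i + 1:])
    ]


# Hypothetical three-building campus.
buildings = ["A", "B", "C"]
triangle = [("A", "B"), ("B", "C"), ("A", "C")]  # each building linked to both others
chain = [("A", "B"), ("B", "C")]                 # no loop: B sits in the middle

print(fragile_links(buildings, triangle))  # no single link is critical
print(fragile_links(buildings, chain))     # both links are single points of failure
```

Running the same check against your own link map (one entry per wide-area circuit) is a quick way to spot circuits that have no alternate path.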

One alternative you might want to consider instead of private leased lines is the use of the Internet for wide-area communication links, using virtual private networks (VPNs), which can be implemented with software or hardware encryption to provide secure communication over public paths. The reliability advantage is that you can use the multiple redundant links of your ISP (and the rest of the Internet) to provide a reliable communication path. This, of course, can also be a disadvantage, because you're depending on someone else to provide appropriately reliable service for your network.

The routing and switching hardware you use on your WAN is also a very important part of reliability. The unfortunate thing is that, as your network grows larger, it often makes the most sense (from a bandwidth and management point of view) to use larger routers, rather than collections of smaller routers. This means that the cost of a router goes up, as does the cost of keeping a spare available. But fortunately, the larger routers tend to be more reliable, with such things as redundant power supplies (make sure they're on different building power circuits!) and multiple, independent interfaces. In practice, you're more likely to have a cabling or WAN circuit problem than a router problem (or at least a router problem that can't be fixed with a more current software release or a reboot).

And, as always, label, document, and be ready for problems. Make sure that each cable is labelled (with something that won't fall off!), make maps of floor plans and network drops, map your network links, and keep your vendor support numbers handy, with a list of part numbers, options, and wide-area communication circuit numbers. Make sure you have more than one copy of your documentation, including a paper copy, in different locations, so that a fire or natural (or network) disaster can't cut you off from the copy you need to recover. (This, of course, includes the configurations for your routers. Don't keep your only copy in the router's memory!) Remember to plan ahead to limit problems in the first place and to make it easy to recover when you've had a failure.

Network Servers

Once you have your physical network in place, you may actually want to put it to use by connecting some machines. I'll claim that, other than the networking infrastructure, you can pretty much split network-connected machines into clients and servers (though it is never actually that clear-cut in practice). I'll define a "client" as a machine that no other machines or services rely on. A client machine is one that can disappear off the network and the only people inconvenienced will be those who want to sign on to that machine directly (e.g., personal workstations). I'm going to ignore client systems in this article; see the August issue for my comments on computing hardware reliability.

I think it's worthwhile to differentiate between "servers" and "services." The server is the hardware; the service is the protocol or information that the server provides or records. Increased reliability is easier if your service can be decoupled from your server (i.e., if the service in question does not require special purpose hardware and replicates easily, you're in luck). Examples of services that decouple and replicate well are DNS name service, DHCP and BOOTP service, and (most) Web servers; services that don't fare as well are file servers, database servers, and mail servers (because you usually want to have one "authoritative" server for these purposes).

For those services that don't decouple or replicate well, there are two basic approaches to reliability: make the server machine itself as reliable as possible (see the August article) and/or make it as easy as possible to move the service to a different machine in the event of a failure. The latter approach is more or less practical, depending on the service and the size of the data and client populations. But planning, record keeping, and preparing will make service movement much easier.

One technique that is applicable when you have multiple networks is the use of multiple network interfaces on your servers. For example, if you have a file server with a separate network interface for each network that it serves, router failures won't interrupt your file service. This technique can, of course, be applied to just about any service and will usually provide better performance as well as better reliability.

For those services that can be replicated, the obvious approach to reliability (and often performance) is to replicate. For example, DNS service is usually best provided by a primary server and multiple secondaries, geographically dispersed throughout your company. That way, a failure (power, network, or human) that isolates the part of your network containing your primary nameserver won't disrupt your other networks (unless the failure continues long enough that the DNS data start to time out). Topological (physical and logical) dispersion is a very important technique on nontrivial networks. Other services that benefit from replication and dispersion include security servers (e.g., Kerberos or SecurID) and relatively static file servers (e.g., software servers for your workstations).

For other services, such as mail and USENET news, replication usually isn't appropriate (or just plain doesn't work). However, depending on the size of your organization, you should consider having multiple servers, with people assigned to servers based on location or some arbitrary differentiation such as user id. This is the "don't put all your eggs in one basket" reliability technique. It doesn't help those people assigned to the failed server, but at least the people on other servers can keep on working.
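As a rough illustration of "some arbitrary differentiation such as user id," the Python sketch below spreads users across several mail servers by hashing the login name. The server names are hypothetical, and the hash-and-modulo rule is just one possible assignment scheme, not a recommendation from any particular mail system.

```python
import zlib

# Hypothetical mail servers; substitute your own host names.
MAIL_SERVERS = ["mail1.example.com", "mail2.example.com", "mail3.example.com"]


def mail_server_for(user_id, servers=MAIL_SERVERS):
    """Map a login name to one server, the same way every time it is asked."""
    # crc32 is stable across runs and machines, unlike Python's built-in hash().
    bucket = zlib.crc32(user_id.encode("utf-8")) % len(servers)
    return servers[bucket]


for user in ["alice", "bob", "carol", "dave"]:
    print(user, "->", mail_server_for(user))
```

One caveat with this simple rule: changing the number of servers reassigns many users, so in practice you would record each assignment (in the password map or an alias table, say) once it has been made rather than recompute it.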

One alternative that can be especially appropriate for Web servers is to co-locate your Web server with your ISP or other service provider. This means that your Web server is no longer limited by the bandwidth or reliability of your network link to your ISP, and you can benefit from the UPS, air conditioning, or 7x24 services from your ISP without having to do it all yourself.

Software and Configuration

In addition to the network and system hardware and planning, proper configuration and software support are essential for a reliable network and services. The following are some useful techniques.

Automate Your Configurations

This makes it easier to maintain and replicate your servers and services and makes it much less likely that a finger slip will cause your servers to stop working. One of the most obvious places for automation is when configuring your routers, especially security-related routers (see [1] for an example).

Use a Dynamic or Load-balancing Nameserver

Use a dynamic or load-balancing nameserver (such as "lbnamed" [2]) so that DNS lookups will ignore redundant servers that are down or otherwise unavailable. There are also hardware devices, sold primarily as front ends for multiple WWW servers that serve the same URLs, that act as gateways to the private server network and choose the fastest or currently available server.
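The selection logic behind such a nameserver can be sketched in a few lines of Python. This is not how lbnamed itself is written; it only illustrates the idea of answering with a server that is up, preferring the least loaded one. The addresses, the "alive" flag, and the "load" figure are all assumptions standing in for whatever health data you actually collect.

```python
def pick_address(servers):
    """Return the address of the least-loaded live server, or None if all are down."""
    candidates = [s for s in servers if s["alive"]]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s["load"])["address"]


# Hypothetical status table, as a poller might report it.
www_pool = [
    {"address": "192.0.2.10", "alive": True, "load": 0.80},
    {"address": "192.0.2.11", "alive": False, "load": 0.05},  # down, so ignored
    {"address": "192.0.2.12", "alive": True, "load": 0.30},
]

print(pick_address(www_pool))
```

Answers chosen this way need short DNS time-to-live values, otherwise clients keep using a cached address long after the server behind it has failed.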

Configure Your Client Machines (When Possible) Using Protocols like BOOTP or DHCP

With BOOTP or DHCP properly configured, a simple config file change can cause all the client machines in your organization to choose different servers (e.g., DNS, time, gateways, etc.) the next time they boot, which makes it much easier to reconfigure things in case of a failure.
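Here is a small Python sketch of that "simple config file change," assuming (my assumption, not a requirement) that your DHCP server is ISC dhcpd. It renders a subnet declaration from a table of settings, so switching clients to a different nameserver means changing one value and regenerating the file. All addresses are placeholders from the documentation range.

```python
def dhcp_subnet_stanza(subnet, netmask, pool, routers, dns_servers, ntp_servers):
    """Render one ISC dhcpd 'subnet' declaration as text."""
    lines = [
        f"subnet {subnet} netmask {netmask} {{",
        f"  range {pool[0]} {pool[1]};",
        f"  option routers {', '.join(routers)};",
        f"  option domain-name-servers {', '.join(dns_servers)};",
        f"  option ntp-servers {', '.join(ntp_servers)};",
        "}",
    ]
    return "\n".join(lines)


# Hypothetical office network settings.
OFFICE = {
    "subnet": "192.0.2.0",
    "netmask": "255.255.255.0",
    "pool": ("192.0.2.100", "192.0.2.200"),
    "routers": ["192.0.2.1"],
    "dns_servers": ["192.0.2.53", "192.0.2.54"],
    "ntp_servers": ["192.0.2.5"],
}

print(dhcp_subnet_stanza(**OFFICE))

# If the first nameserver dies, hand out only the survivor from now on.
print(dhcp_subnet_stanza(**dict(OFFICE, dns_servers=["192.0.2.54"])))
```

Clients pick up the new values only when they boot or renew their lease, so the lease time you choose sets how quickly a change like this takes effect.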

Use DNS MX Records for Your Mail Machines

Use DNS MX records for your mail machines to cause incoming mail to collect on another one of your servers when a mail machine is unavailable. This is much more convenient than having your mail pile up on the sending system because it allows you to adjust the expiry time or manually redirect the mail elsewhere to accommodate the failure and provide alternative recovery methods.
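The way a sending system walks a list of MX records can be sketched roughly as follows. The host names and preference values are invented, and the reachability test is a stand-in for a real SMTP connection attempt; a real mailer also randomizes among hosts of equal preference, which this sketch does not.

```python
def delivery_order(mx_records):
    """Order MX hosts by preference; the lowest number is tried first."""
    return [host for _, host in sorted(mx_records)]


def choose_mail_host(mx_records, is_reachable):
    """Return the first reachable MX host, or None if the sender must queue."""
    for host in delivery_order(mx_records):
        if is_reachable(host):
            return host
    return None


# Hypothetical records, equivalent to:
#   example.com.  IN MX 10 mail.example.com.
#   example.com.  IN MX 20 backup.example.com.
MX = [(20, "backup.example.com"), (10, "mail.example.com")]

down = {"mail.example.com"}  # pretend the primary mail machine has failed
print(choose_mail_host(MX, lambda host: host not in down))
```

With the primary down, mail lands on the backup host, which is a machine you control and can inspect, rather than sitting in queues scattered around the Internet.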

Use DNS CNAME (Alias) Records

Use DNS CNAME (alias) records to name your server machines so that it's easy to move the service to another machine when necessary without forcing all your users to change their habits or reconfiguring all your client machines.
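A toy Python model of alias lookup shows why this helps: clients keep asking for the service name, and moving the service means changing a single record. The names and addresses are hypothetical, and the two dictionaries merely stand in for the CNAME and A records in your zone.

```python
def resolve(name, cnames, addresses, max_hops=8):
    """Follow CNAME aliases until an address is found; None if there isn't one."""
    for _ in range(max_hops):
        if name in addresses:
            return addresses[name]
        if name not in cnames:
            return None
        name = cnames[name]
    return None  # too many hops, probably an alias loop


# Hypothetical zone data.
addresses = {"hosta.example.com": "192.0.2.10", "hostb.example.com": "192.0.2.11"}
cnames = {"ftp.example.com": "hosta.example.com"}

print(resolve("ftp.example.com", cnames, addresses))

# hosta has failed: repoint the alias, and clients need no changes at all.
cnames["ftp.example.com"] = "hostb.example.com"
print(resolve("ftp.example.com", cnames, addresses))
```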

Finally

Make sure you have monitoring software and systems in place so you can detect failures as soon as (or before!) they happen. I'll cover more about monitoring in a later article.

Next time I plan to discuss system administration for reliability and how you can make your job easier and a little more predictable.

References

[1] Christopher J. Calabrese, "A Tool for Building Firewall-Router Configurations," Computing Systems vol. 9, no. 3, Summer 1996, pp. 236-253.

[2] Roland J. Schemers, III, "lbnamed: A Load Balancing Name Server in Perl," Ninth Systems Administration Conference (LISA '95), September 17-22, 1995, pp. 1-11.

21st November 1997 efc
Last changed: 16 Dec. 1999 jr
