[Microsoft Certified Professional] Accelerated MCSE Study Guide, Windows NT Workstation 4.0,
Chapter 26

Copyright 1999, Dave Kinnaman, all rights reserved

Chapter 26
Troubleshooting, Fault Tolerance, and Disaster Recovery

MCSE candidates are expected to have advanced troubleshooting, fault tolerance, and disaster recovery skills. This chapter provides you with information needed to handle the Windows NT 4.0 Workstation exam questions in these areas, beginning with troubleshooting, followed by fault tolerance and disaster recovery.


Troubleshooting

The most successful troubleshooter is one with experience and knowledge: experience with the software and the hardware. Knowledge includes knowing where to go for more information when faced with a new problem. This chapter will cover the basics of troubleshooting and available resources.

Get Help

Possible sources of additional information include

In addition, Internet mailing lists and newsgroups, and printed magazines and newsletters are available that cover almost every aspect of Windows NT.

Figure It Out

The first step in troubleshooting is to have a systematic approach to diagnosing and solving the problem. Steps that you might take include

Preventative Measures

Regardless of problems a networked computer may experience and the solutions to those problems, there is no replacement for prevention on the part of the network administrator. Careful planning and prevention reduce the number of problems that arise and the downtime to resolve problems that do. An effective network administrator will always perform these general preventative tasks:

Planning

One of the best ways to avoid problems is to carefully plan network administration, implementation, and growth. Plan things that will happen soon in great detail, with plenty of lists, tables, and charts. Plan things that will happen later in less detail, building a foundation for detailed planning later.

Document the current configuration thoroughly as part of your plan for the future. Plan for whatever levels of security each part of the network requires. Plan a storage and backup system that deals with the whole network. Plan for fault tolerance and disaster recovery (see the next two sections).

Standardization

Standardizing the network helps avoid hardware and software conflicts and reduces potentially difficult and expensive problems. Open standards allow many manufacturers to compete for position in the marketplace, and often lead to innovations and growth. Proprietary standards can lock the network into a single vendor, or only a few vendors, limiting your options and your future possibilities.

Documentation

Aside from planning, documentation is the single most important aspect of solving problems when they occur. Wiring diagrams, notes, and implementation documentation can save network administrators a tremendous amount of time when correcting network problems. Additionally, careful notes about issues that could become problems help solve those problems before they occur.

Network Monitoring

Windows NT Server enables the administrator to monitor the network through various included tools. These tools help the administrator identify problem areas and establish a baseline of acceptable performance.

One of the most important aspects of troubleshooting is documentation. Be sure to document the details of the problems and attempted solutions. It is important to record every solution attempted, whether it works or not.

Troubleshooting Tools

There are many tools supplied with Windows NT that are helpful in troubleshooting problems. These tools have already been covered and will not be reviewed here. They include Event Viewer, Performance Monitor, Network Monitor, User Manager, etc.

One of the most common troubleshooting problems is access to resources. Chapter 13, Shares and NTFS Permissions, and Chapter 14, Printers and Print Devices, discuss these issues. Additional tools and methods will be covered in the remainder of this chapter.

Installation Problems
When troubleshooting installation problems, here are some areas that should be examined:

Hardware Compatibility
The most frequent problems with installation result from incompatible hardware. Check the Hardware Compatibility List (HCL), a copy of which is included on the Windows NT CD-ROM. It is best to consult the most recent version of the HCL, which is available for download from Microsoft's web site.

In addition, Microsoft provides the Windows NT Hardware Qualifier (NTHQ) utility on the CD that is used to identify installed hardware. The NTHQ detects what hardware is installed on the computer. Use this list to validate that the hardware is on the HCL.

Windows NT Hardware Qualifier (NTHQ)
Create an NTHQ disk by running makedisk.bat, which is found on the Windows NT CD-ROM in the \support\hqtools directory. After creating the disk, reboot your computer with the disk in the A: drive.

CD-ROM
Often the installed CD-ROM on a computer may not be compatible with Windows NT but can be accessed under an alternative operating system, such as Windows 95 or MS-DOS. If this is the case, boot that operating system and copy the installation files to the hard drive. You can then install Windows NT from the copy on the hard drive.

As an alternative solution, if a network connection is available, the installation can be accomplished by using a network share.

Hardware Configuration
When installing Windows NT, make sure that no conflicts exist between hardware devices. This is especially true when installing on a computer that has been running Windows 95.

Remember, Windows NT is NOT plug'n'play compatible. Many hardware devices are designed to be plug'n'play and work well under Windows 95. However, these same devices may not be recognized by Windows NT.

Other devices are either plug'n'play or manually configured. Configurable settings include I/O addresses and Interrupts. Check to see if manual configuration has been properly done.

You can use Windows NT Diagnostics to determine what resources are being used by what hardware devices as shown in Figure 26-1. Other versions of WINMSD can be used when installing Windows NT on a computer running an alternative operating system. A version of WINMSD can be run under Windows 95. A nongraphical version also comes with the most recent versions of MS-DOS.

[Accelerated MCSE Study Guides - Workstation, Chapter 26, illustration #1.]

Figure 26-1. Windows NT Diagnostics is one of the most powerful troubleshooting tools.

WINMSD provides a wealth of information about your system.
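
Once you have written down which device uses which interrupt, checking for conflicts is mechanical. The following Python sketch is only an illustration of that check: the device names and IRQ numbers are made-up sample data (not output from Windows NT Diagnostics), and find_irq_conflicts is a hypothetical helper, not part of any Windows NT tool.

```python
from collections import defaultdict

def find_irq_conflicts(devices):
    """Return {irq: [device names]} for every IRQ claimed by more than one device."""
    by_irq = defaultdict(list)
    for name, irq in devices:
        by_irq[irq].append(name)
    return {irq: names for irq, names in by_irq.items() if len(names) > 1}

# Hypothetical sample data, as it might be copied by hand from a resource list.
sample = [
    ("Network adapter", 10),
    ("Sound card", 5),
    ("SCSI controller", 10),
    ("Serial port COM1", 4),
]

conflicts = find_irq_conflicts(sample)
for irq, names in sorted(conflicts.items()):
    print(f"IRQ {irq} is claimed by: {', '.join(names)}")
```

With this sample data, the network adapter and the SCSI controller both claim IRQ 10, so one of them would need to be reconfigured before installation. The same approach works for I/O addresses.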

Logging
Create a log file of the installation process by using the /l switch during installation. This forces the creation of a log file called $winnt.log that will contain any errors encountered.

Viruses
Many types of viruses exist, some of which can affect the system partition. Make sure that no viruses are present. Viruses can corrupt an installation of Windows NT or prevent installation entirely.

Boot Problems

Next to installation, the most common problem with Windows NT has to do with being unable to boot. There are three main types of problems that can occur during boot.

It is helpful to understand the boot process when troubleshooting boot problems. By knowing where in the boot process the error occurs, you can more easily identify the cause of the problem.

A successful boot does not occur until a logon has been completed. Assuming that Windows NT is the selected operating system, the process consists of four steps:

    1. POST (Power On Self Test). This portion is where the computer performs a self-test. The boot portion of the hard disk is located and NTLDR is initialized.
    2. Boot Loader Phase. It is during this phase that NTLDR consults the BOOT.INI file for the operating system that is to be started and the location of the system files. Next, NTDETECT.COM is used to determine what hardware is installed and a list of the appropriate device drivers is created.
    3. Kernel Phase. NTLDR turns over control to NTOSKRNL.EXE. It is during this phase that device drivers are loaded, services started, and the pagefile is initialized.
    4. Logon Phase. The last phase of a successful boot is completion of the logon sequence. See Chapter 6 for more information on logging onto a Windows NT computer.

The approach to boot problems varies depending on the cause and on when the problem occurs during the boot process. The following section covers the most common approaches to solving boot problems.
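
As a study aid, the four boot phases listed above can be written as a small lookup table, so that a file or component named in an error message can be traced to the phase that uses it. This Python sketch is an illustration only: the table simply restates the list above, and phase_for is a hypothetical helper name.

```python
# Each boot phase, in order, with the components the text associates with it.
BOOT_PHASES = [
    ("POST", ["self-test", "NTLDR"]),
    ("Boot Loader Phase", ["BOOT.INI", "NTDETECT.COM"]),
    ("Kernel Phase", ["NTOSKRNL.EXE", "device drivers", "services", "pagefile"]),
    ("Logon Phase", ["logon"]),
]

def phase_for(component):
    """Return the name of the first phase that uses the component, or None."""
    wanted = component.upper()
    for phase, components in BOOT_PHASES:
        if wanted in (c.upper() for c in components):
            return phase
    return None

for name in ("ntdetect.com", "NTOSKRNL.EXE", "pagefile"):
    print(f"{name}: {phase_for(name)}")
```

An error that mentions NTDETECT.COM, for example, points to the Boot Loader Phase, while a problem with the pagefile or a device driver points to the Kernel Phase.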

Last Known Good Configuration

 

During the boot process, an option is presented to boot using the Last Known Good Configuration. If, after installing a new device driver, you notice that Windows NT does not boot correctly, select the Last Known Good Configuration.

When a user successfully logs onto a Windows NT computer, the configuration that is in effect at that time is copied to be used whenever Last Known Good is selected. For this reason, if a new piece of hardware or an updated device driver is installed, do not log back on after rebooting if you suspect that the system is not functioning properly.
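
The reason for not logging on after a suspect change is easier to see in a toy model. The Python sketch below is a deliberately simplified illustration of the behavior just described, not the actual Windows NT mechanism; the class name, method names, and driver names are all hypothetical.

```python
import copy

class BootConfig:
    """Toy model: one current configuration plus the saved Last Known Good copy."""

    def __init__(self, drivers):
        self.current = {"drivers": list(drivers)}
        self.last_known_good = copy.deepcopy(self.current)

    def install_driver(self, name):
        self.current["drivers"].append(name)

    def logon(self):
        # A successful logon copies the configuration in effect to Last Known Good.
        self.last_known_good = copy.deepcopy(self.current)

    def boot_last_known_good(self):
        # Selecting Last Known Good discards changes made since the last logon.
        self.current = copy.deepcopy(self.last_known_good)

cfg = BootConfig(["disk", "video"])
cfg.install_driver("bad_video_v2")
cfg.boot_last_known_good()   # no logon yet, so the bad driver is dropped
print(cfg.current["drivers"])

cfg.install_driver("bad_video_v2")
cfg.logon()                  # logging on saves the bad driver as "good"
cfg.boot_last_known_good()   # too late: the bad driver is now part of the copy
print(cfg.current["drivers"])
```

In the first case the faulty driver disappears; in the second, the logon has already overwritten the saved copy, so Last Known Good no longer helps.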

Repair Process and the Emergency Repair Disk

If you find that using the Last Known Good does not allow you to successfully boot the machine, then the next option is to attempt a repair of your installation. To do a repair

Creating an Emergency Repair Disk (ERD)

During installation, you are prompted to create an ERD. After installation you can use the rdisk.exe utility to create an ERD. Run this utility from either the Run line or from a command prompt. You will then be presented with the Repair Disk Utility as shown in Figure 26-2.

[Accelerated MCSE Study Guides - Workstation, Chapter 26, illustration #2.]

Figure 26-2. Make a new ERD every time a system changes.

The Repair Disk Utility provides the ability to create a new ERD or update repair information.

The Repair Disk Utility has two options: update the repair information or create a new repair disk. If you choose to update the computer's configuration information, you will be prompted to create an ERD after the update is complete.

When creating an ERD, the information contained in the %SYSTEMROOT%\Repair directory is used. This information is written during installation and is not updated automatically.

Restoring the System Configuration
If you use the information saved to the Repair directory during installation to create an ERD, using that ERD will return the system to its original configuration. That means that any users or groups that have been created or changed will be lost, along with any system configuration changes.

The repair process presents you with four options. Select the appropriate options and then continue with the repair process. The available options are

Reinstall
If repairing the installation does not work, then your only choice is to reinstall the operating system. After successfully reinstalling Windows NT, the system can be returned to its original configuration by using the ERD or restoring the registry from a recent backup.

'Blue Screen of Death'

The most infamous problem associated with troubleshooting Windows NT is the Stop Screen, affectionately known as the Blue Screen of Death or BSOD. A Stop Screen is displayed whenever Windows NT encounters a fatal error.

A Stop Screen is a blue screen (hence the nickname) that contains debugging information. This information is helpful in identifying what caused the problem and how to correct it. Stop Screens are not as daunting as they appear. There are five sections to this information.

Windows NT may be configured to create a memory dump whenever a Stop error occurs. This is configured using the System applet in the Control Panel on the Startup/Shutdown tab. Two utilities are available for analysis of a crash dump. They are

Both of these utilities are on the Windows NT Server CD and are specific to the platform of the involved computer.

Dr. Watson

Dr. Watson is an application debugger that is part of Windows NT. Examine the results of Dr. Watson's work in the Application Log of Event Viewer.

Troubleshooting Windows NT can be less of an obstacle if you know the software and the hardware the operating system is running on. By taking a systematic approach, any troubleshooting question becomes less of a roadblock. The more experience you have with troubleshooting problems the easier it becomes.


Fault Tolerance

Because Windows NT 4.0 Workstation does not offer fault tolerant disk configurations (unlike Windows NT 4.0 Server), this section is brief.

All computer resources will eventually fail. Some computer parts fail more often than others; nowadays power supplies and hard drives fail far too often. Network administrators take steps every day to maintain the integrity of the network's most vital data. By reducing the chances of system failures, you protect critical data and your job. This chapter reviews practical steps network administrators may take to provide fault tolerance for the network.

Uninterruptible Power Supplies (UPSs)

UPSs are fancy switched batteries. A UPS unit is plugged into the wall alternating current (AC) outlet, and the computer, monitor, and necessary peripherals are then plugged into the UPS. The primary purpose of the UPS is to provide temporary battery power for a graceful shutdown of the computer in the event of a prolonged power failure. A UPS can also prevent the need for shutting down the computer by providing power to survive very brief power failures, until regular power is restored by a gasoline-powered generator or through the regular electrical power distribution grid.

A serial connection may link the computer and the UPS, allowing the two to communicate during a power failure. UPS software is specifically designed to control the computer-UPS combination and to enable the graceful shutdown of both devices before the UPS battery is exhausted.
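
The decision that UPS software has to make can be sketched in a few lines. The Python example below is an illustration under stated assumptions, not the logic of any particular UPS product: the function name, the 60-second safety margin, and the battery readings are all hypothetical.

```python
def should_begin_shutdown(battery_seconds_left, shutdown_seconds_needed, safety_margin=60):
    """True when remaining battery time no longer comfortably covers a graceful shutdown."""
    return battery_seconds_left <= shutdown_seconds_needed + safety_margin

# Hypothetical battery readings (seconds left), as a UPS might report them
# over the serial connection during a prolonged power failure.
readings = [900, 600, 300, 140, 80]

for left in readings:
    if should_begin_shutdown(left, shutdown_seconds_needed=90):
        print(f"{left}s left: begin graceful shutdown")
        break
    print(f"{left}s left: keep running on battery")
```

The point of the margin is that the shutdown must finish before the battery is exhausted, so the software has to act while there is still more battery time left than the shutdown itself requires.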

Providing a UPS for a computer also tends to lengthen the computer's useful life by protecting the computer from harmful variations in electrical current. This is because most UPS units also provide line conditioning to smooth over temporary aberrations and spikes in the electrical power supplied to the computer. Although built-in computer power supplies originally provided some protection in this regard, today power supplies manufactured overseas at rock-bottom prices often simply omit these originally designed but costly protective features. The resource list at the end of this section has a URL for power supplies and computer cases manufactured to higher standards than those available on the popular market.

Network servers should always be protected with a UPS. Vital workstations are also possible candidates for UPS protection.

Disk Management

Fault tolerant data systems protect data by duplicating it and by placing it in different physical locations.

Three types of RAID (Redundant Arrays of Inexpensive Disks) are supported by Windows NT Server. Only "RAID 0" is available from the operating system on Windows NT Workstation.

RAID 0
Disk striping divides data into 64 KB blocks and spreads it equally among all disks in the array. Disk striping is not fault tolerant. The only recovery option is to restore from backup.
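
To see why striping offers no protection, it helps to work out where the data actually lands. The Python sketch below assumes a simple round-robin layout, which "spreads it equally" suggests but the text does not spell out; the function names are hypothetical and the 64 KB block size is the one given above.

```python
STRIPE_SIZE = 64 * 1024  # 64 KB blocks, as stated in the text

def locate(offset, num_disks, stripe_size=STRIPE_SIZE):
    """Map a byte offset in the striped volume to (disk index, byte offset on that disk)."""
    if offset < 0 or num_disks < 1:
        raise ValueError("offset must be >= 0 and num_disks >= 1")
    stripe = offset // stripe_size   # which 64 KB block overall
    disk = stripe % num_disks        # blocks rotate across the disks
    row = stripe // num_disks        # full rows of blocks that precede it
    return disk, row * stripe_size + offset % stripe_size

def disks_touched(start, length, num_disks, stripe_size=STRIPE_SIZE):
    """Return the sorted disk indexes holding any byte of [start, start + length)."""
    if length <= 0:
        return []
    first = start // stripe_size
    last = (start + length - 1) // stripe_size
    return sorted({s % num_disks for s in range(first, last + 1)})

print(locate(65536, 3))                  # second block lands on the second disk
print(disks_touched(0, 256 * 1024, 3))   # a 256 KB file on a three-disk stripe set
```

Under this layout, a 256 KB file on a three-disk stripe set has pieces on every disk, so the loss of any one disk damages the file and there is no second copy to fall back on.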

Only RAID 1 and RAID 5 provide fault tolerance.

"RAID 0" is, of course, a misnomer, because there is no redundancy to the data in a "RAID 0" striped disk system. For the exam you should know that "RAID 0" means not fault tolerant.

Hardware RAID
The RAID options on Windows NT Server are provided in the Windows NT operating system software by Microsoft. RAID hardware cards, particularly those by leading manufacturers such as Adaptec, can provide higher reliability, more fault tolerant protection, and even greater speed than Microsoft's software solutions. The URL for Adaptec is given in the Resources section.

Backup Systems

Although Windows NT 4.0 Workstation does not natively provide fault tolerant disk configurations, data on networked workstations can be backed up over the network with several available software and hardware systems.

Resources


Disaster Recovery

Planning for network disaster recovery is like buying disability income insurance: you fervently hope you never need to use your recovery plan or your disability income insurance. However, if either is ever needed, you'll be much better off if you've planned and prepared for that possible day.

Written Disaster Recovery Plan

External disasters, equipment disasters, and human-mediated disasters all should be in your plan.

External disasters

Equipment and Software disasters

Human-mediated disasters

Risk Assessment

Assess all the risks your network is vulnerable to. The items above are only a beginning list to get you started. Then, from the risks you've identified for your network, generate clearly stated, precise, prioritized goals for dealing with each risk. These goals and priorities are what will guide you in designing the disaster recovery plan for your network.

Bringing the network back to "business as usual" is often a goal in a disaster recovery plan. What services should come up first? How long are you willing to wait for network services to be back at 100%? Should customers be the first to experience "business as usual?" How long before staff will experience "business as usual?" Should physical network-server security be re-established within 24 hours, or can it wait 3 days?

Contingent Resources

These are tough decisions that can involve large amounts of money. Therefore, another part of the disaster recovery plan is to make a thorough hardware and software (licensed and unlicensed) inventory of all network assets, component by component. Also inventory all business insurance covering these assets.

Next, calculate the current network investment. With the current investment figures, you can begin to estimate the resources needed to implement disaster plans under various salvage scenarios. Eventually, management must face the situation and plan to make appropriate contingent resources available in the event of disaster.
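
The arithmetic behind these estimates is simple once the inventory exists. The Python sketch below is only an illustration: every item, quantity, and price is invented sample data, the function names are hypothetical, and treating a salvage scenario as a flat fraction of equipment lost is an assumption made here to keep the example short.

```python
# (component, quantity, unit cost) -- hypothetical sample figures only
inventory = [
    ("File server", 2, 6500.00),
    ("Workstation", 40, 1800.00),
    ("Network hub", 3, 1200.00),
]

def total_investment(items):
    """Sum quantity times unit cost over the whole inventory."""
    return sum(quantity * unit_cost for _, quantity, unit_cost in items)

def replacement_estimate(items, fraction_lost):
    """Rough replacement cost if the given fraction of the investment is lost."""
    if not 0.0 <= fraction_lost <= 1.0:
        raise ValueError("fraction_lost must be between 0 and 1")
    return total_investment(items) * fraction_lost

print(f"Current investment: {total_investment(inventory):,.2f}")
for fraction in (0.25, 0.5, 1.0):
    print(f"Loss of {fraction:.0%}: {replacement_estimate(inventory, fraction):,.2f}")
```

A real plan would also account for insurance coverage, labor, and software licenses, but even a rough table like this gives management figures to plan around.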

Notification Procedures

Your plan should include exact procedures for notification of everyone at the appropriate times. Maintain a comprehensive roster of emergency officials and contacts for all relevant city, county, state or provincial agencies. Murphy's Law says that the one agency or official you leave out will be the one you need the most. Keep this list up-to-date.

Also, you'll possibly need the home addresses and telephone numbers for:

Alternate Sites

Many disaster recovery plans require storage of important plan documents and replacement equipment at alternate sites, in the event of a total loss of the primary facility. At the least, duplicate plan documents and backup media must be stored off site. Keeping off-site copies up to date must also be a tested part of the plan.

Redundancy At All Points of Failure

Assign clear responsibilities for every action necessary to implement the plan. It is not enough to designate who is responsible for an action or goal. Be clear about exactly who will perform each activity.

Each responsibility should entail clearly written instructions, checklists, procedures, training requirements, and practice drills to confirm that the person responsible actually can do the task required.

Regular tests of every component of the disaster recovery plan must be built into implementation of the plan. Refine your disaster recovery plan based on the results of these tests, and demand improvements in performance for each testing cycle.

The principle of redundancy also requires that the chain of command be clear, because backup people may be required, as well as backup equipment. Who will take over if Worker A is unavailable? Fail-over responsibilities must be clear, too. Who is responsible to tell Worker B to replace Worker A if Worker A is unavailable?

Backup All Important Data Regularly

Create, implement, enforce, and verify continued performance of a complete, formal backup system, as if your job depended on it. Your job success as an MCSE may well depend on a mundane, boring backup system. When disaster strikes, your ordinary, everyday backups are going to save the company's bacon.

Prioritize the company's data resources. Mission-critical data should receive priority protection. Without data backups to restore everyday data operations, your network's operation, whether it is based on restoration or survival, is irrelevant.

Vital data must be protected with a robust backup system. Ask yourself, can the company afford to start from scratch to replace this data?

A well-designed backup system includes consideration of backup hardware, backup software, backup scheduling, training, tests, and verification. Appropriate replication policies must also be integrated into the backup system.


For Review


From Here

It may be time to review installation in Chapter 6, Installation Overview, Chapter 7, Upgrade Installations, and Chapter 8, Installation Methods. Or perhaps you'd like to look at Chapter 9, Installing and Configuring Hardware, for a hardware refresher?


Version 1.6
This page resides at http://www.Emissary.Net/mcse/NTWKSC26.html

Copyright 1998-1999, Dave Kinnaman, all rights reserved.