Accelerated MCSE Study Guide, Windows NT Workstation 4.0,
all rights reserved
MCSE candidates are
expected to have advanced troubleshooting,
fault tolerance, and
disaster recovery skills.
This chapter provides you with information needed to
handle the Windows NT 4.0 Workstation exam questions in these areas,
beginning with troubleshooting, followed by
fault tolerance and disaster recovery.
Troubleshooting, Fault Tolerance, and Disaster Recovery
The most successful troubleshooter is
one with experience and knowledge: experience with the
software and the hardware. Knowledge includes knowing where to go for
more information when faced with a new problem. This chapter will cover
the basics of troubleshooting and available resources.
Possible sources of
additional information include
- Microsoft TechNet
- Web sites
- Classes and training materials
- Microsoft resource kits
In addition, Internet mailing lists and newsgroups,
and printed magazines and newsletters are available that cover almost
every aspect of Windows NT.
Figure It Out
The first step in
troubleshooting is to have a systematic approach to diagnosing and
solving the problem. Steps that you might take include
- Did it ever work?
- If it did work, what has changed since the last time it worked?
- Gather specifics about the problem.
- Consider possible approaches to solving it.
- Try one possible solution at a time. If it works, great. If it does
not work, then try another approach.
Regardless of problems a networked
computer may experience and the solutions to those problems, there is
no replacement for prevention on the part of the network administrator.
Careful planning and prevention reduce the number of problems that arise
and the downtime needed to resolve those that do. An effective network
administrator will always perform these general preventative tasks:
One of the best ways to avoid problems
is to carefully plan network administration, implementation, and growth.
Plan things that will happen soon in great detail, with plenty of lists,
tables, and charts. Plan things that will happen later in less detail,
building a foundation for detailed planning later.
Document the current configuration thoroughly as part of your plan for
the future. Plan for whatever levels of security each part of the
network requires. Plan a storage and backup system that deals with the
whole network. Plan for fault tolerance and disaster recovery (see the
next two sections).
Network standardization helps avoid hardware and software
conflicts and reduces potentially difficult
and expensive problems. Open standards allow many manufacturers to
compete for position in the marketplace, and often lead to innovations
and growth. Proprietary standards can lock the network into a single
vendor, or only a few vendors, limiting your options and your future
Aside from planning, documentation is
the single most important aspect of solving problems when they occur.
Wiring diagrams, notes, and implementation documentation can save
network administrators a tremendous amount of time when correcting network
problems. Additionally, careful notes about issues that
could become problems help solve those problems before they occur.
Windows NT Server enables the
administrator to monitor the network with its included tools.
These tools help the administrator identify problem areas and establish
a baseline of acceptable performance.
One of the most important aspects of troubleshooting is documentation.
Be sure to document the details of each problem and every attempted solution.
It is important to record any solution attempted, whether it works or not.
There are many tools supplied with
Windows NT that are helpful in troubleshooting problems. These tools
have already been covered and will not be reviewed here. They include
Event Viewer, Performance Monitor, Network Monitor, User Manager, etc.
One of the most common troubleshooting problems is access to resources.
Chapter 13, Shares and NTFS Permissions, and Chapter 14, Printers and
Print Devices, discuss these issues. Additional tools and methods will be
covered in the remainder of this chapter.
When troubleshooting installation problems, here are some areas that
should be examined:
- Hardware Compatibility
- Hardware Configuration
- Reviewing Logs
The most frequent problems with installation result from incompatible
hardware. Check the Hardware Compatibility List, a copy of which is
included on the Windows NT CD-ROM. It is best to consult the most
recent version of the HCL, which is available for download from
Microsoft's web site.
In addition, Microsoft provides the Windows NT Hardware Qualifier
(NTHQ) utility on the CD that is used to identify installed hardware.
The NTHQ detects what hardware is installed on the computer. Use this
list to validate that the hardware is on the HCL.
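The comparison described above can be sketched in a few lines of Python. This is only an illustration: the device names and HCL entries below are made-up placeholders, not real NTHQ output or a real HCL excerpt, and the matching is a simple case-insensitive comparison that you would type the lists into by hand.

```python
def find_unlisted_hardware(detected, hcl):
    """Return detected devices that do not appear on the compatibility list."""
    listed = {name.strip().lower() for name in hcl}
    return [dev for dev in detected if dev.strip().lower() not in listed]


# Hypothetical sample data for illustration only.
hcl_entries = ["Example SCSI Adapter 100", "Example Network Card A"]
detected_devices = ["example scsi adapter 100", "Example Sound Card Z"]

for device in find_unlisted_hardware(detected_devices, hcl_entries):
    print("Not on HCL:", device)
```

Any device the sketch reports deserves a closer look before you begin installation.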
Windows NT Hardware Qualifier (NTHQ)
Create an NTHQ disk by running
which is found on the Windows NT CD-ROM in the
directory. After creating the disk, reboot your computer
with the disk in the A: drive.
Often the installed CD-ROM on a computer may not be compatible with
Windows NT but can be accessed under an alternative operating system,
such as Windows 95 or MS-DOS. If this is the case, boot that
operating system and copy the installation files to the hard drive.
You can then install Windows NT from the copy on the hard drive.
As an alternative solution, if a network connection is available, the
installation can be accomplished by using a network share.
When installing Windows NT, make sure that no conflicts exist between
hardware devices. This is especially true when installing on a
computer that has been running Windows 95.
Remember, Windows NT is NOT plug'n'play compatible. Many hardware
devices are designed to be plug'n'play and work well under Windows 95.
However, these same devices may not be recognized by Windows NT.
Other devices are either plug'n'play or manually configured.
Configurable settings include I/O addresses and Interrupts. Check to
see if manual configuration has been properly done.
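As a rough illustration of what checking for conflicts means, the Python sketch below looks for two devices that claim the same interrupt or I/O address. The device names and settings are hypothetical values you might record by hand; nothing here reads the real hardware.

```python
from collections import defaultdict


def find_conflicts(devices):
    """Map each contested resource value to the devices that claim it."""
    claims = defaultdict(list)
    for name, settings in devices.items():
        for resource, value in settings.items():
            claims[(resource, value)].append(name)
    return {key: names for key, names in claims.items() if len(names) > 1}


# Hypothetical settings for illustration only.
devices = {
    "network card": {"irq": 5, "io": 0x300},
    "sound card": {"irq": 5, "io": 0x220},
    "serial port": {"irq": 4, "io": 0x3F8},
}

for (resource, value), names in find_conflicts(devices).items():
    print(f"{resource} {value} is claimed by: {', '.join(names)}")
```

In this made-up example the network card and the sound card both claim interrupt 5, so one of them would need to be reconfigured.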
You can use Windows NT Diagnostics to determine what resources are
being used by what hardware devices as shown in Figure 26-1. Other
versions of WINMSD can be used when installing Windows NT on a
computer running an alternative operating system. A version of WINMSD
can be run under Windows 95. A nongraphical version also comes with
the most recent versions of MS-DOS.
Figure 26-1. Windows NT Diagnostics (WINMSD) is one of the most powerful
troubleshooting tools and provides a wealth of information about your system.
Create a log file of the installation process by using the
during installation. This forces the creation of a log file called
that will contain any errors encountered.
Many types of viruses exist, some of which can affect the system
partition. Make sure that no viruses are present. Viruses can corrupt
an installation of Windows NT or prevent installation entirely.
Next to installation, the most common
problem with Windows NT has to do with being unable to boot. There are
three main types of problems that can occur during boot.
- Corrupted or missing boot file
- Incorrect or corrupted device driver
- Memory incompatibilities resulting from applications
It is helpful to understand the boot process when troubleshooting boot
problems. By knowing where in the boot process the error occurs, the
cause of the problem may be more easily identified.
A successful boot does not occur until a logon has been completed.
Assuming that Windows NT is the selected operating system, the process
consists of four steps, which are
- POST (Power On Self Test). This portion is where the computer
performs a self-test. The boot portion of the hard disk is located
- Boot Loader Phase. It is during this phase that
file for the operating system that is to be started and
the location of the system files. Next,
is used to
determine what hardware is installed and a list of the appropriate
device drivers is created.
- turns over control to
. It is
during this phase that device drivers are loaded, services started,
and the pagefile is initialized.
- Logon Phase. The last phase of a successful boot is completion
of the logon sequence. See Chapter 6 for more information on logging
onto a Windows NT computer.
The approach to a boot problem varies depending on the cause and
on when it occurs during the boot process. The following section covers
the most common approaches to solving boot problems.
Last Known Good
During the boot process, an option
is presented to boot using the Last Known Good Configuration. If
after installing a new device driver, you notice that Windows NT
does not boot correctly, select the Last Known Good configuration.
When a user successfully logs onto a Windows NT computer, the
configuration that is in effect at that time is copied to be used
whenever Last Known Good is selected. For this reason, if a new piece
of hardware or an updated device driver is installed, do not log back
on after rebooting if you suspect that the system is not functioning correctly.
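The effect of logging on can be shown with a toy model in Python. This is only a sketch of the behavior described above: the class, its method names, and the driver versions are invented for illustration, and it says nothing about how Windows NT actually stores its configurations.

```python
class ConfigModel:
    """Toy model of the current and Last Known Good configurations."""

    def __init__(self, config):
        self.current = dict(config)
        self.last_known_good = dict(config)

    def install_driver(self, name, version):
        self.current[name] = version

    def logon_succeeded(self):
        # A successful logon copies the current configuration to Last Known Good.
        self.last_known_good = dict(self.current)

    def boot_last_known_good(self):
        # Selecting Last Known Good discards changes made since the last logon.
        self.current = dict(self.last_known_good)


careful = ConfigModel({"video": "1.0"})
careful.install_driver("video", "2.0")   # suspect driver installed
careful.boot_last_known_good()           # chosen at boot, before any logon
print(careful.current)                   # {'video': '1.0'}

careless = ConfigModel({"video": "1.0"})
careless.install_driver("video", "2.0")
careless.logon_succeeded()               # logging on saves the suspect driver
careless.boot_last_known_good()
print(careless.current)                  # {'video': '2.0'}
```

The second case is the one to avoid: once you log on, Last Known Good no longer holds the configuration that worked.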
Repair Process and the Emergency Repair Disk
If you find that using the Last Known
Good does not allow you to successfully boot the machine, then the next
option is to attempt a repair of your installation. To do a repair:
- Boot your computer with the Windows NT setup disk, #1.
- After inserting Disk #2, you will be given the opportunity to repair
- Insert the Emergency Repair Disk when prompted.
Creating an Emergency Repair Disk
During installation, you are prompted
to create an ERD. After installation you can use the rdisk.exe utility
to create an ERD. Run this utility from either the RUN line or from a
command prompt. You will then be presented with the Repair Disk Utility
as shown in Figure 26-2.
Figure 26-2. The Repair Disk Utility provides the ability to create a new ERD
or update repair information. Make a new ERD every time a system changes.
The Repair Disk Utility has two options, to either update the repair
information or to create a new repair disk. If you choose to update the
computer's configuration information, you will be prompted to create an
ERD after the update is complete.
When creating an ERD, the information contained in
directory is used. This information is written during installation and
is not updated automatically.
Restoring the System Configuration
If you use the information saved to the
Repair directory during installation to create an ERD, when you use that
ERD the system will be returned to its original configuration. What that
means is that any users or groups that have been created or changed will
be lost. Any system configuration changes will also be lost.
The repair process presents you with
four options. Select the appropriate options and then continue with the
repair process. The available options are
- Inspect Registry Files. You will be prompted for verification of
replacement of each registry file.
- Inspect Startup Environment. Inspects the boot.ini file and
verifies that Windows NT is a valid option.
- Verify Windows NT System Files. Compares existing system files with
the version on the CD. You will be prompted to replace any differing files.
- Inspect Boot Sector. Checks to make sure that Ntldr is referenced by
the primary boot sector. This is helpful if the sys.com utility has been
used on the partition, such as if Windows 95 or MS-DOS were installed
after Windows NT.
If repairing the installation does not work, then your only choice is to
reinstall the operating system. After successfully reinstalling Windows
NT, the system can be returned to its original configuration by using
the ERD or restoring the registry from a recent backup.
'Blue Screen of Death'
The most infamous problem associated
with troubleshooting Windows NT is the Stop Screen, affectionately known
as the Blue Screen of Death or BSOD. A Stop Screen is displayed whenever
Windows NT encounters a fatal error.
A Stop Screen is a blue screen, hence the nickname, that contains
debugging information. This information is helpful in identifying what
caused the problem and how to correct it. Stop Screens are not as daunting as
they appear. There are five sections to this information.
- Debug Port Status Indicator. This information appears in the upper
right hand corner of the screen if a modem or serial cable is connected
and the debug option has been turned on.
- BugCheck Information. This section begins with the word STOP and
contains the most helpful information. The most common Stop Codes are
hardware related. Record this and consult TechNet or Microsoft Technical
Support for further information.
- Driver Information. The next section contains three columns
containing the base address, time stamp and the name of each loaded
driver. This section often provides information as to the address of
the instruction that caused the error.
- Kernel Build Number and Stack Dump. Contains the build number. Also
lists the range of addresses in the stack. This might indicate the
component that caused the crash.
- Debug Port Information. Confirms if a dump file was created.
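When you record the BugCheck information, it helps to separate the Stop code from the values that follow it. The Python sketch below does this for a line you have copied down by hand. It assumes the line has the form STOP: code (parameter, parameter, ...), and the sample line is a hypothetical example, not output from a real crash.

```python
import re

STOP_PATTERN = re.compile(r"STOP:\s*(0x[0-9A-Fa-f]+)\s*\(([^)]*)\)")


def parse_stop_line(line):
    """Return (stop_code, parameters) from a BugCheck line, or None."""
    match = STOP_PATTERN.search(line)
    if match is None:
        return None
    code = int(match.group(1), 16)
    params = [int(p, 16) for p in match.group(2).split(",") if p.strip()]
    return code, params


# Hypothetical line, typed in by hand from a Stop Screen.
sample = "*** STOP: 0x0000000A (0x00000000,0x00000002,0x00000000,0x8013A1B4)"
code, params = parse_stop_line(sample)
print(f"Stop code: 0x{code:08X}")
print("Parameters:", [f"0x{p:08X}" for p in params])
```

The Stop code is the value to look up in TechNet; the parameters give additional detail for whoever investigates further.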
Windows NT may be configured to
create a memory dump whenever a Stop error occurs. This is configured
using the System applet in the Control Panel on the Startup/Shutdown
tab. Two utilities are available for analysis of a crash dump. They are
utility. Verifies that all memory addresses in the memory
dump file are valid.
utility. Creates a text file called
information contained in the memory dump file.
Both of these utilities are on the Windows NT Server CD and are
specific to the platform of the involved computer.
Dr. Watson is an application debugger
that is part of Windows NT. Examine the results of Dr. Watson's work in
the Application Log of Event Viewer.
Troubleshooting Windows NT can be less of an obstacle if you know the
software and the hardware the operating system is running on. By taking
a systematic approach, any troubleshooting question becomes less of a
roadblock. The more experience you have with troubleshooting problems
the easier it becomes.
Because Windows NT 4.0 Workstation
does not offer fault tolerant disk configurations (unlike
Windows NT 4.0 Server), this section is brief.
All computer resources will eventually fail. Some computer parts
fail more often than others; nowadays power supplies and hard
drives fail far too often. Network administrators take steps every
day to maintain the integrity of the network's most vital data. By
reducing the chances of system failures, you protect critical data
and your job. This chapter reviews practical steps network
administrators may take to provide fault tolerance for the network.
UPSs (uninterruptible power supplies) are fancy switched batteries.
A UPS unit is plugged into the wall alternating current (AC)
outlet, and the computer, monitor, and necessary peripherals are
then plugged into the UPS. The primary purpose of the UPS is to
provide temporary battery power for a graceful shutdown of the
computer in the event of a prolonged power failure. A UPS can also
prevent the need for shutting down the computer by providing power to
survive very brief power failures, until regular power is restored by
a gasoline-powered generator or through the regular electrical power
A serial cable may connect the computer and the UPS to
allow the UPS and computer to communicate during a power failure.
UPS software is specially designed to control the computer-UPS
combination and to enable the graceful shutdown of both devices
before the UPS battery is exhausted.
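The decision that UPS software has to make can be sketched in Python. This is a simplified illustration, not the logic of any particular UPS product: the function name, the readings, and the 60-second safety margin are all assumptions chosen for the example.

```python
def should_begin_shutdown(on_battery, runtime_left_s, shutdown_time_s, margin_s=60):
    """Decide whether the computer should start shutting down now."""
    if not on_battery:
        return False
    # Start early enough to finish before the battery is exhausted.
    return runtime_left_s <= shutdown_time_s + margin_s


# Hypothetical readings: seconds of battery left, polled during an outage.
for runtime_left in (600, 300, 150):
    if should_begin_shutdown(True, runtime_left, shutdown_time_s=120):
        action = "shut down"
    else:
        action = "keep running"
    print(runtime_left, "seconds left:", action)
```

With these made-up numbers the computer keeps running through a brief outage and begins its shutdown only when the remaining battery time approaches the time a graceful shutdown needs.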
Providing a UPS for a computer also tends to lengthen the
computer's useful life by protecting the computer from harmful
variations in electrical current. This is because most UPS units
also provide line conditioning to smooth over temporary aberrations
and spikes in the electrical power supplied to the computer. Although
built-in computer power supplies originally provided some protection in
this regard, today power supplies manufactured overseas at rock bottom
prices often simply omit these originally designed, but costly
protective features. The resource list at the end of this section has a
URL for power supplies and computer cases manufactured to higher
standards than those available on the popular market.
Network servers should always be protected with a UPS. Vital
workstations are also possible candidates for UPS protection.
Fault tolerant data systems protect
data by duplicating it and by placing it in different physical
Three types of RAID (Redundant Arrays of Inexpensive Disks) are
supported by Windows NT Server. Only "RAID 0" is available from
the operating system on Windows NT Workstation.
- "RAID 0" – Striping
- RAID 1 – Mirroring or Duplexing
- RAID 5 – Striping with Parity
Only RAID 1 and RAID 5 provide fault tolerance.
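To see why parity provides fault tolerance, consider the Python sketch below. It assumes the usual exclusive-OR (XOR) parity scheme and uses tiny made-up blocks. It is a simplification for study purposes: it keeps parity in one place and ignores how a real array lays blocks out across its disks.

```python
def xor_blocks(blocks):
    """XOR equal-length byte blocks together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


data_blocks = [b"AAAA", b"BBBB", b"CCCC"]   # one block per data disk
parity = xor_blocks(data_blocks)            # stored on another disk

# Simulate losing the second disk and rebuilding its block.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data_blocks[1])            # True
```

Any single missing block can be rebuilt from the blocks that remain, which is what lets a parity array survive the loss of one disk.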
"RAID 0" is, of course, a misnomer, because there is no
redundancy to the data in a "RAID 0" striped disk system. For the
exam you should know that "RAID 0" means not
Disk striping divides data
into 64 KB blocks and spreads it equally among all disks in the array.
Disk striping is not fault tolerant. The only recovery option is to
restore from backup.
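A short Python sketch shows how striping spreads blocks across the disks. It is an illustration only: the function name and the round-robin order are assumptions, and the "disks" are just lists in memory.

```python
BLOCK_SIZE = 64 * 1024  # 64 KB blocks, as described above


def stripe(data, disk_count, block_size=BLOCK_SIZE):
    """Distribute data across disks in round-robin blocks."""
    disks = [[] for _ in range(disk_count)]
    for index, offset in enumerate(range(0, len(data), block_size)):
        disks[index % disk_count].append(data[offset:offset + block_size])
    return disks


data = bytes(5 * BLOCK_SIZE)              # five blocks of dummy data
disks = stripe(data, disk_count=3)
print([len(blocks) for blocks in disks])  # [2, 2, 1]
```

Because every file larger than one block ends up with pieces on several disks and there is no redundant copy, losing any one disk damages the whole set.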
The RAID options on Windows NT
Server are provided in the Windows NT operating system software by
Microsoft. RAID hardware cards, particularly those by leading
manufacturers such as Adaptec, can provide higher reliability, more
fault tolerant protection, and even greater speed than Microsoft's
software solutions. The URL for
Adaptec is given in
the Resources section.
Although Windows NT 4.0 Workstation
does not natively provide fault tolerant disk configurations, data on
networked workstations can be backed up over the network with several
available software and hardware systems.
Planning for network disaster recovery
is like buying disability income insurance: you fervently hope you
never need to use your recovery plan or your disability income
insurance. However, if either is ever needed, you'll be much better
off if you've planned and prepared for that possible day.
Written Disaster Recovery Plan
External disasters, equipment disasters,
and human-mediated disasters all should be in your plan.
- Other Natural Disasters
- Power failure
- Transportation failure
- Communication failure
Equipment and Software disasters
- Server Component Failure
- Network Component Failure
- Workstation Component Failure
- "Upgrade" software that incapacitates the network
- Influenza epidemic
- Virus infection
Assess all the risks your network is
vulnerable to. The items above are only a beginning list to get you
started. Then, from the risks you've identified for your network,
generate clearly stated, precise, prioritized goals for dealing with
each risk. These goals and priorities are what will guide you in
designing the disaster recovery plan for your network.
Bringing the network back to "business as usual" is often a goal in a
disaster recovery plan. What services should come up first? How long
are you willing to wait for network services to be back at 100%?
Should customers be the first to experience "business as usual?" How
long before staff will experience "business as usual?" Should physical
network-server security be re-established within 24 hours, or can it
wait 3 days?
These are tough decisions that can
involve large amounts of money. Therefore, another part of the disaster
recovery plan is to make a thorough hardware and software (licensed and
unlicensed) inventory of all network assets component by component.
Also inventory all business insurance covering these assets.
Next, calculate the current network investment. With the current
investment figures, you can begin to estimate the resources needed to
implement disaster plans under various salvage scenarios. Eventually,
management must face the situation and plan to make appropriate
contingent resources available in the event of disaster.
Your plan should include exact
procedures for notification of everyone at the appropriate times.
Maintain a comprehensive roster of emergency officials and contacts
for all relevant city, county, state or provincial agencies. Murphy's
Law says that the one agency or official you leave out will be the one
you need the most. Keep this list up-to-date.
Also, you'll possibly need the home addresses and telephone numbers of:
- Employees crucial to recovery
- Employees affected by the disaster
- Other employees and contractors
- Equipment vendors
Many disaster recovery plans require storage of important plan
documents and replacement equipment at alternate sites, in the
event of a total loss of the primary facility. At the least,
duplicate plan documents and backup media must be stored off site.
Keeping off-site copies up to date must also be a tested part of the plan.
Redundancy At All Points of Failure
Assign clear responsibilities for every
action necessary to implement the plan. It is not enough to
designate who is responsible for an action or goal. Be clear about
exactly who will perform each activity.
Each responsibility should entail clearly written instructions,
checklists, procedures, training requirements, and practice drills
to confirm that the person responsible actually can do the task.
Regular tests of every component of the disaster recovery plan must
be built into implementation of the plan. Refine your disaster
recovery plan based on the results of these tests, and demand
improvements in performance for each testing cycle.
The principle of redundancy also requires that the chain of command
be clear, because backup people may be required, as well as
backup equipment. Who will take over if Worker A is unavailable?
Fail-over responsibilities must be clear, too. Who is responsible to
tell Worker B to replace Worker A if Worker A is unavailable?
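One way to make fail-over responsibilities unambiguous is to write the chain down in order. The Python sketch below is a hypothetical illustration: the task name, the roster, and the function are invented for the example.

```python
def assign_worker(task, roster, available):
    """Return the first available person in the task's chain, or None."""
    for person in roster.get(task, []):
        if person in available:
            return person
    return None


# Hypothetical roster: primary first, then backups in order.
roster = {"restore file server": ["Worker A", "Worker B", "Worker C"]}

print(assign_worker("restore file server", roster, available={"Worker B", "Worker C"}))
```

Here Worker A is unavailable, so the task falls to Worker B. A result of None means the roster has a gap that the plan still needs to close.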
Backup All Important Data
Create, implement, enforce, and
verify continued performance of a complete, formal backup system, as
if your job depended on it. Your job success as an MCSE may well
depend on a mundane, boring backup system. When disaster strikes, your
ordinary, everyday backups are going to save the company's bacon.
Prioritize the company's data resources. Mission-critical data should
receive priority protection. Without data backups to restore everyday
data operations, your network's operation, whether it is based on
restoration or survival, is irrelevant.
Vital data must be protected with a robust backup system. Ask
yourself, can the company afford to start from scratch to replace
A well-designed backup system includes consideration of backup
hardware, backup software, backup scheduling, training, tests, and
verification. Appropriate replication policies must also be
integrated into the backup system.
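Verification can be as simple as comparing each original file with its backup copy. The Python sketch below does this with SHA-256 checksums. It is a minimal illustration that assumes the backup is an ordinary directory tree mirroring the source; it is not a feature of any backup product, and the function names are invented for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(source_dir, backup_dir):
    """Return relative paths that are missing or different in the backup."""
    source_dir, backup_dir = Path(source_dir), Path(backup_dir)
    problems = []
    for source_file in sorted(source_dir.rglob("*")):
        if not source_file.is_file():
            continue
        relative = source_file.relative_to(source_dir)
        backup_file = backup_dir / relative
        if not backup_file.is_file() or sha256_of(source_file) != sha256_of(backup_file):
            problems.append(relative.as_posix())
    return problems
```

Calling verify_backup with a source directory and a backup directory returns an empty list when every file matches, and otherwise lists the files that need attention.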
Record what worked and what did not work for all problems. This
will help you solve future occurrences of this problem faster.
Use the tools that come with Windows NT for troubleshooting.
They include Event Viewer, Performance Monitor, and Network Monitor.
Verify hardware configuration, looking especially for resource conflicts.
Determine where in the boot process a problem exists to help in
identifying its cause.
Once you log on, the Last Known Good Configuration is overwritten
by the present configuration.
Keep an up-to-date ERD available at all times.
If you must reinstall Windows NT, use the ERD or restore a backup
to return the computer to its prior configuration.
Dr. Watson records application errors in the Application Log.
A UPS allows for the graceful shutdown of network servers when the power fails.
Your disaster recovery plan should include:
- Risk assessment
- Contingent resources
- Asset inventory
- Notification procedures
- Alternate sites
- Redundancy at all points of failure
Without a robust backup system for mission-critical data, a disaster
recovery plan may be useless.
It may be time to
review installation in Chapter 6, Installation Overview, Chapter 7, Upgrade
Installations, and Chapter 8, Installation Methods. Or perhaps you'd like
to look at Chapter 9, Installing and Configuring Hardware for a hardware
This page resides at http://www.Emissary.Net/mcse/NTWKSC26.html
Copyright © 1998-1999, Dave Kinnaman,
all rights reserved.