Title: Windows System State, Active Directory DR backup and recovery
Description: Diagnosing Windows System State problems, including Active Directory. A Word version is attached to the solution.
This document aims to cover:
• A rough procedure for diagnosing System State faults.
• Information on common System State problems and questions.
For a background to System State and how NetVault supports it, please see:
• The appendix, for a very brief overview.
• The end user filesystem documentation, which has recently been rewritten to accurately reflect our System State support.
• The engineering release note attached to version 6.17 (or later) of the filesystem plugin.
Diagnosing Backup Problems:
1. What version of the plugin is being used? If the plugin is earlier than 6.17, the user needs to upgrade to 6.17 and start again.
2. Is the user's problem covered by any of the following common questions or situations?
a) Is OFM/VSS required for System State backups? In the end the company did not make VSS/OFM a hard requirement for System State backups. However, please see the item "User is expecting successful backup and restore of more than the System State" in the restore section of the document and also read the following notes...
We tested with and without VSS/OFM and saw things work correctly.
Microsoft does not have native snapshotting technology on Win 2000 so ntbackup supports AD backup/restore without a snapshot. On Win 2003, ntbackup will always use VSS, so it's less clear that Microsoft think it's OK without a snapshot.
If the user is doing a disaster recovery backup/restore (i.e. system drive and System State) we would strongly recommend a snapshot. Without one, there is no protection against files falling out of synchronisation with each other. If only the System State is being backed up then the snapshot is less important, because many of the System State elements effectively have their own snapshotting techniques.
However, this is not the case for all System State elements (e.g. SYSVOL contents) so a snapshot may still be preferable.
b) When browsing for backup, System State items are shown that are not present on the machine. The plugin's detection of System State items at browse time is currently limited, and some items are either always presented or presented even though the feature is not in use on the machine. This may be improved in the future.
As long as the user does not put an explicit tick on the item that is not present/functional there should not be a problem at backup time.
c) Why is a Network Load Balancing Cluster not backed up with the System State? Network Load Balancing clusters were introduced in Windows 2003 and are a different Windows technology to the quorum-disk-based clustering that has been present since Windows 2000. Microsoft does not define a Network Load Balancing cluster as part of the System State.
d) Must the whole of the System State always be backed up in one job? See the user manual for more detail. Although the plugin allows individual items to be selected, this is considered an advanced and unsupported operation. The supported procedures involve backing up or restoring the whole of the System State as outlined in the procedures later in this document.
e) Windows XP: Jobs failing due to bad Microsoft software update. A bad Microsoft update for Windows XP can cause all backup and restore jobs to fail.
See MS KB 883357: Your backup program may fail or incorrectly exclude some files from your backup in Windows XP. The user should apply the update discussed in the article.
f) Jobs that are aborted or fail during System State backup may leave System State components in a hung state. This can definitely happen for Active Directory and also possibly for other System State components. In the case of Active Directory, future backups will fail and the Active Directory may not function correctly. The machine is likely to need re-booting if this occurs. Covered by open fault 21600.
3. What type of backup is being performed?
- A backup of just the System State.
- A backup of the whole operating system / disaster recovery backup.
Things to find out: Exactly what has been ticked/crossed in the selection tree for the job and is OFM or VSS being used?
4. Is the user following our documented backup procedure? The supported procedures are duplicated in Appendix II or are contained in the end user documentation.
5. What are the details of the machine being backed up, i.e. Windows version, service pack, etc.? Is it up to date with Microsoft hotfixes and patches?
Is it a domain controller? Is it a cluster node? What drive is Windows installed on? Which other drives have operating system files on them e.g. are the Active Directory files stored on a different drive?
6. What error is the user seeing? Need to gather: A binary dump of the NetVault logs, a dump of all the operating system logs (as seen in the Windows event viewer), LIBVERBOSE trace from the filesystem plugin.
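The first and last steps of the backup checklist above are mechanical enough to sketch in code. The following is a minimal illustration only, not part of NetVault: the function names and the evidence keys are hypothetical, and it assumes the plugin version is reported as a plain dotted number such as "6.17".

```python
from typing import Iterable, List, Tuple

# Minimum plugin version with the re-worked System State support (from this document).
MIN_PLUGIN_VERSION: Tuple[int, int] = (6, 17)

# Evidence listed in step 6 of the backup checklist. The key names are hypothetical.
REQUIRED_EVIDENCE = (
    "netvault_binary_log_dump",
    "os_event_logs",
    "libverbose_trace",
)


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as '6.17' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def plugin_needs_upgrade(plugin_version: str) -> bool:
    """Step 1: plugins earlier than 6.17 must be upgraded before going further."""
    return parse_version(plugin_version) < MIN_PLUGIN_VERSION


def missing_evidence(collected: Iterable[str]) -> List[str]:
    """Step 6: list the diagnostic items that have not been gathered yet."""
    have = set(collected)
    return [item for item in REQUIRED_EVIDENCE if item not in have]


if __name__ == "__main__":
    print(plugin_needs_upgrade("6.5"))
    print(missing_evidence(["os_event_logs"]))
```

Comparing versions as integer tuples avoids the string-comparison trap where "6.5" would sort after "6.17".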
Diagnosing Restore Problems:
1. What version of the filesystem plugin is doing the restore? Which version did the backup? Version 6.17 and later include re-worked System State support. These versions will only restore backups made with 6.17 or later. If the backup was made by an earlier version of the plugin it cannot be restored with 6.x. The user may be able to restore their backup using an alternative procedure but this is a separate topic to the System State support in 6.17+.
2. Do any of the following common problem situations explain the user's problem?
a) Machine won't boot due to dissimilar hardware. The disaster recovery procedure does support restore to dissimilar hardware. However, even if NetVault has functioned correctly and the user has followed the correct procedures, the restored system may not boot or may have trouble booting. In general, the more dissimilar the hardware, the more likely this is to occur. Differences in fundamental areas, such as processor architecture, are the most problematic of all. Please see Microsoft Knowledge Base articles 263532 and 237556 for more information on the potential issues and procedures for correcting problems that occur.
b) Restoring an Active Directory backup that is too old. You should not attempt to restore Active Directory data that is older than the "tombstone lifetime" setting for the enterprise. If the whole domain is lost and only a backup that is older than the tombstone lifetime exists, it may be possible to rebuild the domain but special steps are required. For issues relating to tombstone lifetime see MS KB 216993:
https://support.microsoft.com/en-us/kb/216993
c) Attempting to transfer System State between machines. Just to emphasise, System State information can only ever be restored to the computer it was backed up from or a re-built version of the computer that complies with the disaster recovery procedure outlined below. Other than this it is totally non-transferable between machines.
d) Windows XP: Jobs failing due to bad Microsoft software update. A bad Microsoft update for Windows XP can cause all backup and restore jobs to fail. See MS KB 883357: Your backup program may fail or incorrectly exclude some files from your backup in Windows XP. The user should apply the update discussed in the article.
e) System State and Virtual Clients: Virtual clients were added in 7.3 to support clustered environments. System State backups may be performed using a Virtual Client, however the user must be careful because Virtual Clients, by their nature, switch from running on one machine to running on another machine. System State backups may only be restored to the exact physical machine they were made on. So, suppose a System State backup was made using virtual client (VCA). At the time of the backup, VCA was running on physical machine (PMA). At restore time, VCA may be running on a different physical machine (PMB). Additionally, the backups using VCA may reflect some System State backups from PMA and some from PMB.
At restore time, the user must be very careful that the restore runs on the exact machine that the backup was made on. In general, staying away from Virtual Clients when using System State is a good idea. The user should generally be looking to schedule System State backups using physical clients rather than Virtual Clients.
f) User is expecting successful backup and restore of more than the System State. The System State backup procedure is only designed to back up the operating system and its state. It does not constitute a guaranteed backup of applications and application data, even though some or all of an application's files may have been included in the backup and restore. Some applications will be successfully backed up and restored using this procedure. This is especially true if the application is in a quiescent state before the backup (e.g. it has been stopped, or its database has been taken offline), or if the application is guaranteed to be satisfactorily backed up by the presence of an OFM or VSS snapshot. Likewise, application data will often be consistently and successfully backed up if the applications that use the data are quiescent. However, the Microsoft DR procedure we are implementing is only designed to guarantee the backup and restore of the base operating system and its state. For other applications and application data the user should either use the appropriate NetVault APM where available, or satisfy himself that the backup of an application or its data files is adequately covered by this procedure. As mentioned, quiescent applications and data are likely to be OK, but the onus is on the user to be sure.
g) Machine dependencies when performing restores / domain servers not functioning correctly after restore. It is important for the administrator restoring a machine to understand the dependencies the machine has on other computers in the forest. In a number of cases a machine will not be fully functional after a restore until it makes contact with other machines or is told not to contact those machines. In particular, please note the following:
- A domain controller cannot become a domain controller for a domain until the file replication service allows it to do so. If the replica sets on a machine have been restored non-authoritatively then the file replication service must make contact with another domain controller holding an authoritative version of the replica set before it will allow the machine to become a domain controller.
- If a restored domain controller is assigned certain FSMO Active Directory roles and believes it is part of a domain with more than one domain controller, it will not start performing these roles until it has replicated with another domain controller. For example, if a machine has the RID Master role it will not start performing the role until it has made contact with another domain controller with which it is configured to replicate. Symptoms include being unable to create any new Active Directory objects and the Security Accounts Manager on the machine being unable to start operating because it cannot contact the RID master.
Note that even once another domain controller is brought online, it can take some time (we have seen up to an hour) for the machines to settle and for all FSMO roles to start being performed. Please see Microsoft Knowledge Base articles 839879, 305476, 822053 and 257338.
The user must ensure that all the machines necessary for the domain/forest to function are backed up, and must understand how to re-create the domain from these backups, including the order in which the machines must be restored. Such procedures are outlined in Microsoft's documentation and knowledge base.
h) Mix of old and new domain controllers after primary restore. As mentioned in the notes in the appendix, once a domain controller has been restored as a primary it should never come into contact with (i.e. attempt to replicate with) any domain controller from before the domain was lost. Check that there is no possibility of a domain controller restored with the primary option being able to see older domain controllers on the network.
i) Disaster recovery restore has a long pause at the end (approx 5 minutes). This pause occurs when attempting to restore the IIS metabase. It is the expected behaviour.
j) Must the whole of the System State always be restored in one job? See the user manual for more detail. Although the plugin allows individual items to be selected, this is considered an advanced and unsupported operation. The supported procedures involve backing up or restoring the whole of the System State in one operation.
k) Issues performing a Disaster Recovery restore of the cluster database. Ensure the user is really following the disaster recovery procedure, including the step at the end to restore the cluster database on its own as a separate stage. Beyond this, the user may still run into problems: If the shared cluster disks have been changed between backup and restore (e.g. because of disk loss or failure) the new disks will have different disk signatures to the old disks. The disk signatures are stored on the disk itself and are how the Windows cluster service identifies each disk used in the cluster. To get the cluster to work again, the signatures on the new disks need to be changed to match the signatures on the old disks. This is covered in Microsoft KB article 280425. Note that using the Force Cluster Database Restore option will re-write the signature on the quorum disk. However, any other disks in use in the cluster will not have their disk signature reset by using this option.
l) Issues with clustered applications e.g. Exchange. Beyond the procedures we document, there may be additional work needed to restore the clustered applications running in a cluster. This is true, for example, with Exchange. There is extensive Microsoft documentation on this, for example the online book Disaster Recovery for Exchange 2000 Server.
To date, as far as the filesystem plugin is concerned, we have taken the approach that the details of how to perform these procedures are beyond the scope of what it is reasonable for us to document and support. Pointing people at the Microsoft documentation is probably the best approach in this sort of case.
Only if we suspect a problem with NetVault's ability to restore the cluster database or System State generally should we consider there to be a supportable filesystem problem. Note that our database APMs for Exchange etc. may need to work in specific ways to support restoring a clustered application. Presumably the same logic applies: we will only offer support if we think the APM is at fault, rather than the user not being aware of Microsoft's procedures for fully recovering a cluster.
3. What type of restore is the user performing? It should be one of: a restore of just the System State (System State only restore), or a disaster recovery restore after the loss of a machine (Disaster Recovery restore).
4. What error is the user seeing? Need to gather:
A binary dump of the NetVault logs
A dump of all the operating system logs (as seen in event viewer).
If available, trace.
Exact text of any Windows error message the user is seeing.
An exact description of when the error occurs e.g. when rebooting after the NetVault restore, just after Windows has displayed the message "rebuilding Active Directory Indexes".
5. Is the user following the documented restore procedure? The supported procedures are duplicated in Appendix II or are contained in the end user documentation. Please work through the procedure in the appendix alongside what the user is doing and check for any differences. Particular things to check:
- Confirm the user is attempting to restore the whole of the System State in one restore job.
- Confirm whether any previous System State restores (successful or otherwise) have ever been made to the machine.
- What is the overall plan the user is carrying out? Where does the current restore fit into the picture? Have any other restores completed successfully? For example:
  - Current restore is the first machine when recovering a complete domain.
  - Current restore is a replacement of one server in an otherwise healthy domain.
Things to check for Disaster Recovery restores:
- Has the user definitely installed the exact same Microsoft service pack, hotfixes and patches as were on the machine that was backed up?
- Is the machine's name exactly the same?
6. What are the details of the machine being restored?
- Windows version and service pack.
- Is it a domain controller?
- Is it a cluster node?
- What drive is Windows installed on?
- Which other drives have operating system files on them e.g. are the Active Directory files stored on a different drive?
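Two of the restore checks above (the plugin version in step 1 and the tombstone lifetime in item 2 b) reduce to simple comparisons. The sketch below is illustrative only and assumes the restoring plugin is 6.17 or later; the function and parameter names are hypothetical, and the tombstone lifetime must be supplied by the caller because it is configured per forest.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# Backups made by plugins earlier than this cannot be restored by 6.17+ (from this document).
MIN_PLUGIN_VERSION: Tuple[int, int] = (6, 17)


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as '6.17' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def restore_blockers(
    backup_plugin_version: str,
    backup_time: datetime,
    restore_time: datetime,
    tombstone_lifetime_days: int,
    restores_active_directory: bool,
) -> List[str]:
    """Return the reasons, if any, why this restore should not go ahead as planned."""
    blockers = []
    if parse_version(backup_plugin_version) < MIN_PLUGIN_VERSION:
        blockers.append("backup was made by a plugin earlier than 6.17")
    backup_age = restore_time - backup_time
    if restores_active_directory and backup_age > timedelta(days=tombstone_lifetime_days):
        blockers.append("Active Directory data is older than the tombstone lifetime")
    return blockers


if __name__ == "__main__":
    # The dates and lifetime here are made-up example values.
    for reason in restore_blockers(
        "6.17", datetime(2005, 1, 1), datetime(2005, 4, 1), 60, True
    ):
        print(reason)
```

An empty list means neither check objects; it does not mean the restore is otherwise safe.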
Appendix I
System State Background.
System State is Microsoft's term for a collection of operating system databases and files. These items must be backed up and restored together. Collectively, these items represent the state of the operating system on a particular machine. The System State components are:
• Registry
• COM+ Class Registration Database
• System Files (including the Boot Files and Windows File Protection Files)
• Certificate Services Database
• Active Directory Database
• SYSVOL Directory
• Cluster Service Information
• IIS Metabase
• Removable Storage Manager
• Disk Quota Database
Some components are only present on certain machines e.g. Active Directory only exists on a domain controller.
Supported procedures
System State only.
This is a backup and restore of just the System State.
To restore the System State on its own, the restore must be made to the exact machine that the backup was made on.
Disaster Recovery.
This is a backup of the System State and all operating system files. A disaster recovery restore must be made to a re-built version of the original machine that conforms to a number of Microsoft restrictions. Either a Disaster Recovery or System State only restore can be made from a Disaster Recovery backup.
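The rules above about which restore can be made from which backup, and to which machine, can be written down as a small lookup. This is a simplified sketch with hypothetical names; it only encodes the two supported procedures described in this appendix and does not model the individual Microsoft restrictions on a re-built machine.

```python
# Which restore types each backup type can serve (from the text above).
ALLOWED_RESTORES = {
    "system_state_only": {"system_state_only"},
    "disaster_recovery": {"system_state_only", "disaster_recovery"},
}

# Where each restore type must be made (from the text above). Labels are hypothetical.
REQUIRED_TARGET = {
    "system_state_only": "original_machine",
    "disaster_recovery": "rebuilt_original_machine",
}


def restore_is_supported(backup_type: str, restore_type: str, target: str) -> bool:
    """True if this backup/restore/target combination matches a supported procedure."""
    if restore_type not in ALLOWED_RESTORES.get(backup_type, set()):
        return False
    return REQUIRED_TARGET[restore_type] == target


if __name__ == "__main__":
    print(restore_is_supported("disaster_recovery", "system_state_only", "original_machine"))
    print(restore_is_supported("system_state_only", "disaster_recovery", "rebuilt_original_machine"))
```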
Appendix II
Supported System State procedures.
This information is contained in the end user documentation. The key elements are reproduced here.
Backup procedure:
Tick System State.
Disaster recovery scenario only:
Tick all fixed drives holding operating system information; on many systems this is just the C: drive.
This should back up everything required for recovering the base operating system and state in a disaster.
Recommended: Select OFM or VSS in the backup options.
Run the backup.
System State only restore procedure:
If the machine is a domain controller, boot into Directory Services Restore mode (F8 on boot, select Directory Services Restore mode).
If the machine is running Certificate Server, either reboot into Directory Services Restore Mode or manually stop the Certificate Server service.
If the cluster database is being restored, stop the cluster service on all nodes except the node on which the restore will be performed. (Alternatively, select the Force cluster database restore option on the restore options screen. This is not recommended unless the user is sure this is what he wants.)
Tick System State.
Please see section below on authoritative and primary restores. Choose the SYSVOL authority level on the restore options screen. Run the restore.
If the machine is a domain controller and Active Directory is being authoritatively restored, before rebooting use the ntdsutil tool to mark the Active Directory database as authoritative (Microsoft KB article 241594).
Reboot.
Disaster recovery procedure:
Install the base operating system in the usual way.
Please note the following restrictions:
- Exactly the same operating system and service pack must be installed.
- The same MS hotfixes and updates that were installed at the time of the backup must be installed.
- The machine must have the same name.
- The same drive letter mappings must be in existence.
- Each drive must be the same size or bigger than when the backup was made.
- Each drive must be formatted with the same file system (and version of the filesystem) as when the backup was made.
- Significant hardware differences between the two systems may cause issues with Windows' ability to process the restore.
Do not make the machine a domain controller, set up Active Directory or do anything beyond OS, service pack and hotfix/update installation.
Install NetVault to the same location as it was on the backed up system.
Install version 6.14 or later of the NetVault filesystem plugin.
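The checkable restrictions listed above (operating system, service pack, hotfixes, machine name, drive letters, drive sizes and filesystems) can be compared mechanically if both machines are described the same way. The sketch below is an illustration under that assumption: the MachineProfile and Drive types are hypothetical, nothing here queries Windows, and hardware differences are deliberately left out because they cannot be reduced to a simple equality test.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Drive:
    size_bytes: int
    filesystem: str  # filesystem name including its version, as recorded by the operator


@dataclass
class MachineProfile:
    name: str
    os_version: str
    service_pack: str
    hotfixes: FrozenSet[str]
    drives: Dict[str, Drive]  # keyed by drive letter, e.g. "C:"


def restriction_violations(original: MachineProfile, rebuilt: MachineProfile) -> List[str]:
    """Compare a re-built machine against the profile recorded at backup time."""
    problems = []
    if (rebuilt.os_version, rebuilt.service_pack) != (original.os_version, original.service_pack):
        problems.append("operating system or service pack differs")
    if rebuilt.hotfixes != original.hotfixes:
        problems.append("hotfix/update set differs")
    if rebuilt.name != original.name:
        problems.append("machine name differs")
    for letter, old in sorted(original.drives.items()):
        new = rebuilt.drives.get(letter)
        if new is None:
            problems.append(f"drive {letter} is missing")
            continue
        if new.size_bytes < old.size_bytes:
            problems.append(f"drive {letter} is smaller than at backup time")
        if new.filesystem != old.filesystem:
            problems.append(f"drive {letter} filesystem differs")
    return problems
```

Recording such a profile at backup time would give support something concrete to compare against when a disaster recovery restore misbehaves.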
If the machine is a NetVault server:
- Configure NetVault so it has access to the devices and tapes on which the backup resides.
- Scan in the tapes.
If the machine is a domain controller, boot into Directory Services Restore mode.
Tick all drives in the backup and System State.
Please see section below on authoritative and primary restores. Choose the SYSVOL authority level on the restore options screen. Run the restore.
If the machine is a domain controller and Active Directory is being authoritatively restored:
- Reboot the system and ensure that Directory Services Restore mode is again selected from the boot menu.
- Use the ntdsutil tool to mark the Active Directory database as authoritative (MS KB article 241594).
Reboot.
If the cluster database is being restored:
Reboot the machine again (and, if the machine is a domain controller, boot into Directory Services Restore Mode).
Restore just the cluster database element of the System State. Reboot.
If the machine is a NetVault server:
- Restore the NetVault database from backup media.
Install/re-install applications, including performing application data restores as required. See comments earlier on applications and application data.
Primary, authoritative and non-authoritative restores.
There are two concepts of authoritativeness in a restore.
1. Active Directory. This may be restored authoritatively or non-authoritatively. We only have the power to perform a non-authoritative restore. Ntdsutil must be used to make a restore authoritative, as described in the procedures above.
2. SYSVOL (or File Replication Service). This may be restored authoritatively, non-authoritatively or as primary. We do control this via options on the restore options screen. The user may combine Active Directory and SYSVOL authority options as he sees fit. However, Microsoft state that:
a. A primary restore should only be used when all domain controllers have been lost and the domain is being completely re-created.
b. Because of interdependencies between Active Directory and SYSVOL, they should be restored with matching authority.
Given this, the normal authority choices are as shown below. These apply regardless of whether the user is performing a System State only or disaster recovery restore.
Restore operation: Re-creating first or only domain controller
  SYSVOL=Primary, Active Directory=Authoritative
Restore operation: Restoring data non-authoritatively (DR or System State only scenario)
  SYSVOL=Non-authoritative, Active Directory=Non-authoritative
Restore operation: Restoring data authoritatively (DR or System State only scenario)
  SYSVOL=Authoritative, Active Directory=Authoritative
By default, Active Directory information will be restored to a machine non-authoritatively. This means that more up to date Active Directory information from other domain controllers will take precedence over the restored information. For many scenarios, this is the desired behaviour. If you need to mark some or all of the Active Directory as authoritative, so that it will take precedence over information held on other domain controllers, use the Microsoft ntdsutil tool.
All File Replication Service replica sets stored on a server will be restored non-authoritatively by default. This is usually the required type of restore. However, if all domain controllers in a domain have been lost and the domain is being re-created, the first domain controller that is restored should have its replica sets marked as primary. You should never restore primary replica sets in a scenario where other domain controllers are in existence and holding other copies of the replica sets.
It is worth emphasising that once a domain controller has been restored as primary it should never come into contact with (i.e. attempt to replicate with) any domain controller from before the domain was lost. There is more detailed information in the Microsoft KB.
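The normal authority choices described above amount to a three-row lookup, sketched here for illustration. The scenario keys and function names are hypothetical; the values come from this appendix. The sketch also captures two rules from the text: an authoritative Active Directory restore needs the ntdsutil step, and a primary SYSVOL restore is only appropriate when no other domain controllers exist.

```python
from typing import Dict

# scenario -> (SYSVOL authority, Active Directory authority), as listed in this appendix.
AUTHORITY_BY_SCENARIO = {
    "recreate_first_or_only_dc": ("Primary", "Authoritative"),
    "non_authoritative": ("Non-authoritative", "Non-authoritative"),
    "authoritative": ("Authoritative", "Authoritative"),
}


def authority_for(scenario: str) -> Dict[str, str]:
    """Look up the normal SYSVOL and Active Directory authority for a scenario."""
    sysvol, active_directory = AUTHORITY_BY_SCENARIO[scenario]
    return {"sysvol": sysvol, "active_directory": active_directory}


def needs_ntdsutil(scenario: str) -> bool:
    """The plugin restores Active Directory non-authoritatively; ntdsutil does the rest."""
    return authority_for(scenario)["active_directory"] == "Authoritative"


def primary_restore_is_safe(other_domain_controllers_exist: bool) -> bool:
    """Primary replica sets must never be restored while other domain controllers exist."""
    return not other_domain_controllers_exist


if __name__ == "__main__":
    print(authority_for("recreate_first_or_only_dc"))
    print(needs_ntdsutil("non_authoritative"))
```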