Failover Cluster – An error was encountered while loading the network topology

Sharing this, and a friendly reminder to myself, as I will surely run into this problem again.

When creating a Windows Failover Cluster, you get this error message (by the way, did you know you can press Ctrl+C in a dialog box to copy its text and paste it like this?):

 

Solution: In this case I had to remove the user account from the AD “Protected Users” group, because Failover Cluster still uses NTLM and CredSSP, which are restricted for members of Protected Users. Log off and on again, and it is now possible to create the cluster. Then just add the user back into the Protected Users group.

How to solve EVENT ID 1202 SceCli 0x57 Parameter is incorrect

A customer is repeatedly getting this event on all servers and clients; on the domain controllers it is logged every 5 minutes.

Searching for that error gives thousands of results, most of them not very helpful. This is how I solved the problem:

  1. On one of the servers having the problem, run RSOP.MSC
  2. Resultant Set of Policy showed a warning on the Computer Configuration. Opening its properties showed the same error as in our event log.
  3. Browsing the tree showed that there was a problem in the Password Policy section, coming from the Default Domain Policy.
  4. The same problem was also visible in GPMC (Group Policy Management Console).
  5. After editing the Default Domain Policy and fixing the bad entries (no clue how they got there), the error message (and the problem) is gone.

Use OMS (Log Analytics) to monitor and send alerts for the Blue Screen of Death

At times a driver or two misbehaves and causes bluescreens. As the server automatically reboots after dumping memory to the memory.dmp file, you might not get a report from your users that there has been a problem, and depending on your monitoring tool, you might not get an alert there either. Operations Manager can easily alert you on things like that, but far from all customers use OpsMgr, due to its complexity. Luckily, it’s just a one-minute job to get an alert in OMS when you have had a bluescreen! And as OMS can run in Free mode, you may be able to monitor your servers for free (depending on the amount of data you collect); otherwise it’s really cheap, so it’s no big deal if you need a standard subscription. Anyway, let’s get to the technical stuff!

First of all, configure OMS to collect the System event log, including all Error events.

Then create an Alert like this,

The Alert text to be used is:

That will only alert on crashes. You can also create an alert for Event ID 6008, which fires on any unexpected shutdown. The difference is that my alert only fires after a BSOD, while an unexpected-shutdown alert also fires if someone pulls the power cord. You could even combine both into one alert with an OR statement. In my case, I only want to be alerted about BSODs, so that is the only thing I look for right now.
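The original alert query did not survive in this post, so the sketch below does not try to reconstruct it. It is a minimal Python illustration of the alert logic only (not OMS query syntax), assuming a BSOD shows up in the System log as Event ID 1001 from the BugCheck source, and an unexpected shutdown as Event ID 6008 from the EventLog source; verify both against your own logs. The flag shows the OR variant.

```python
from dataclasses import dataclass

@dataclass
class SystemEvent:
    event_id: int
    source: str

# Assumed event signatures (verify against your own System log):
# a bugcheck is logged as Event ID 1001 after the reboot, and an
# unexpected shutdown as Event ID 6008 from the EventLog source.
BSOD_SOURCES = {"BugCheck", "Microsoft-Windows-WER-SystemErrorReporting"}
BSOD_EVENT_ID = 1001
UNEXPECTED_SHUTDOWN = ("EventLog", 6008)

def should_alert(event: SystemEvent, include_unexpected_shutdown: bool = False) -> bool:
    """BSOD only by default; BSOD OR unexpected shutdown when the flag is set."""
    is_bsod = event.event_id == BSOD_EVENT_ID and event.source in BSOD_SOURCES
    is_unexpected = (event.source, event.event_id) == UNEXPECTED_SHUTDOWN
    return is_bsod or (include_unexpected_shutdown and is_unexpected)

events = [
    SystemEvent(1001, "BugCheck"),
    SystemEvent(6008, "EventLog"),
    SystemEvent(7036, "Service Control Manager"),
]
print([e.event_id for e in events if should_alert(e)])  # [1001]
print([e.event_id for e in events if should_alert(e, include_unexpected_shutdown=True)])  # [1001, 6008]
```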

Set how often it should check; there is usually no need to check more than once or twice an hour. Finally, define whether it should send an email alert or use one of the other alert methods.

Easy as that! Next time you get a bluescreen on a server, you will get an alert by mail so you can debug the dump and find out what’s causing it.

It will look like this,

Disable ASUS Mini Bar (AsPowerBar.exe)

Not 100% work related, but computer related. I reinstalled my home PC last week and also installed the ASUS AI Suite 3 tools to make overclocking easier and to handle the fans and pump for my custom liquid cooling system.

One annoying thing is the ASUS Mini Bar (AsPowerBar.exe in Task Manager), which starts automatically when you log on to Windows. Easy to remove, I thought, and downloaded one of the best (and free!) tools ever, Sysinternals Autoruns, which makes it super easy to see and disable all programs that start automatically for various reasons, including the ones launched from Task Scheduler or as shell extensions.

But there was no reference to the ASUS Mini Bar anywhere. Ehh? Well, it turned out to be a lot easier than that.

Just right-click the ASUS AI Suite icon in the system tray and clear the checkbox for ASUS Mini Bar! It’s the AI Suite tool itself that launches the Mini Bar…

I hope this helps someone else who, like me, is digging through the registry, the autorun folders and whatever else.

Enable Driver Verifier for all non-Microsoft drivers with PowerShell

I’ve been doing some debugging for a customer who has multiple industrial client PCs that reboot regularly. To get more information into the memory dumps, I needed to configure the systems to write a complete memory dump and also to enable extra verification of the drivers in the system, to find the cause of the bluescreens.

Windows has a built-in tool called “Verifier” (Driver Verifier) that can enable extra checks on the calls made by specific drivers. You generally don’t want to enable it on all drivers, as that slows the system down noticeably. And truthfully, it is very rare for a Microsoft driver to be the cause of the issue, because Microsoft tests and stress-tests its drivers far more thoroughly than most other vendors do. So to start with, it is better to enable the extra checks for all drivers except the ones from Microsoft.

As I didn’t want to walk around to all the client PCs and configure Verifier by hand, I made a small PowerShell script that reads the names of all non-Microsoft drivers on the system and enables verification for just those drivers. It can then be executed in any number of ways.

It uses both Get-WmiObject and Get-WindowsDriver to get a complete list of third-party drivers on the system. It also configures the system for a complete memory dump.

Just to be safe, I’ve added /bootmode resetonbootfail, so the Verifier settings are reset if the system bluescreens during boot because Verifier caught a bad driver in the boot process.
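The PowerShell script itself is not included in this post. As a rough sketch of its core logic, written in Python: assume you have already collected (driver file name, provider name) pairs; how you map them out of Get-WindowsDriver or Get-WmiObject output is left out. The sketch keeps every driver whose provider is not Microsoft and builds the verifier command lines. The provider-name test and the sample driver names are assumptions, and the complete memory dump configuration is not covered.

```python
from typing import Iterable, List, Tuple

def build_verifier_commands(drivers: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Return verifier command lines for every non-Microsoft driver.

    `drivers` is a sequence of (driver file name, provider name) pairs.
    Assumption: inbox drivers report a provider containing "Microsoft".
    """
    third_party = sorted({
        name.lower()
        for name, provider in drivers
        if "microsoft" not in provider.lower()
    })
    if not third_party:
        return []
    return [
        # Standard checks, limited to the third-party drivers only.
        ["verifier", "/standard", "/driver", *third_party],
        # Reset the verifier settings if the machine bluescreens during boot.
        ["verifier", "/bootmode", "resetonbootfail"],
    ]

# Hypothetical sample data; duplicates are collapsed by the set above.
sample = [
    ("ntfs.sys", "Microsoft"),
    ("e1d65x64.sys", "Intel"),
    ("ACME_io.sys", "ACME Industrial"),
    ("e1d65x64.sys", "Intel"),
]
for cmd in build_verifier_commands(sample):
    print(" ".join(cmd))
```

Each printed line is meant to be run in an elevated prompt on the client; a reboot is needed before Verifier takes effect.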

Reboot the PC, get a big cold Coke and wait for the bluescreen to happen.

Live (VSM) migration fails with mirror operation failed and access is denied error

When doing a Live Migration from SCVMM (System Center Virtual Machine Manager) with VSM, moving a virtual machine from one cluster to another cluster and at the same time to a new storage location, you may get an error message similar to this:

The strange thing is that the destination folder does get created in the new location; the migration just does not copy any content into it and aborts with the Access Denied error. But if you shut down the VM first, so it’s just a migration over the network, it works!

The solution is to give the SOURCE cluster write access to the DESTINATION storage. During a VSM migration, the destination Hyper-V host creates the directory on the SOFS node, but it is the Hyper-V host that currently owns the VM that copies the VHD files to the destination storage. As the current owner by default has no write access there, the copy fails. One would think that VMM should grant a host permissions when it knows the host needs to write to that location?

Maybe it’s fixed in the next version, but until then there are two ways to work around it.
Solution 1) In VMM, add the destination SOFS shares as storage on the source VM hosts. VMM then grants the VM hosts Modify permissions on the SOFS shares, so they can write there.

This works fine if the Hyper-V clusters and all storage are in roughly the same location. But if you have one compute cluster with storage in one location and another compute cluster with storage in another location, there is a risk that you end up running VMs across the WAN link.

Solution 2) This is the one we used. By granting the share permissions manually instead of through VMM, we get the same result as above, with the added benefit that a new VM is always provisioned on local storage, so there is no (or much less) risk of running a VM across the WAN link. Yes, it is still technically possible, but no one will accidentally provision a VM that uses storage in the other datacenter.

You can either add each node manually or, as we did, create a security group in AD (“Domain Servers Hyper-V Hosts”) that ALL Hyper-V hosts are added to during deployment, and then add that group to the share and NTFS permissions. All Hyper-V hosts then automatically have write access to every location they may need.

I wrote two short scripts that query the VMM database for the available SOFS nodes and use PowerShell to grant permissions on the share and in NTFS.

As all our SOFS shares were named vDiskXX or CSVXX (where XX is a number), I just used the wildcards vDisk* and CSV* to apply the change to all of those shares. You might have to modify this a little to suit your naming standard.

Updated Script (2016-02-04):
I got a report that the script failed with an error on some servers, which I managed to reproduce. Here is an alternative version that connects to each server and runs the ACL change locally via Invoke-Command. It also only changes permissions on Continuously Available (SOFS) shares.
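The scripts themselves are not included in this post. The share-selection rule they describe (names matching vDisk* or CSV*, and only continuously available shares) can be sketched in Python as below. The CONTOSO domain prefix, the share paths and the choice of Full share access plus NTFS Modify ((OI)(CI)M) for the group are hypothetical; the generated Grant-SmbShareAccess and icacls commands would then be run locally on the SOFS node, for example through Invoke-Command.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import List

@dataclass
class Share:
    name: str
    path: str
    continuously_available: bool

# Name patterns from this post; change them to match your naming standard.
PATTERNS = ("vDisk*", "CSV*")
# Hypothetical domain prefix in front of the group name used in this post.
GROUP = r"CONTOSO\Domain Servers Hyper-V Hosts"

def shares_to_update(shares: List[Share]) -> List[Share]:
    """Keep continuously available shares whose name matches a pattern (case-insensitive)."""
    return [
        s for s in shares
        if s.continuously_available
        and any(fnmatchcase(s.name.lower(), p.lower()) for p in PATTERNS)
    ]

def grant_commands(share: Share, group: str = GROUP) -> List[str]:
    """Commands to run on the SOFS node: share permission first, then NTFS Modify."""
    return [
        f"Grant-SmbShareAccess -Name '{share.name}' -AccountName '{group}' -AccessRight Full -Force",
        f'icacls "{share.path}" /grant "{group}:(OI)(CI)M"',
    ]

# Hypothetical shares.
shares = [
    Share("vDisk01", r"C:\ClusterStorage\Volume1\Shares\vDisk01", True),
    Share("CSV02", r"C:\ClusterStorage\Volume2\Shares\CSV02", True),
    Share("vdisk03", r"D:\Shares\vdisk03", False),  # not continuously available
    Share("Library", r"C:\ClusterStorage\Volume1\Shares\Library", True),
]
for share in shares_to_update(shares):
    for cmd in grant_commands(share):
        print(cmd)
```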

 

 

Microsoft Fabric (datacenter and private cloud) related Hotfixes

Here is the list of hotfixes I deploy in our production environment and regularly deploy at customers. These production environments are a fabric (private cloud) running Hyper-V, Storage Spaces, SOFS, AD FS, domain controllers, Azure Pack, System Center, SQL Server and more; yes, everything you need in a fabric, though not Exchange, Lync, SharePoint etc. So this list might not be complete for your system.
And as always, use your own judgement about which hotfixes to deploy in your environment. Hotfixes are not tested as thoroughly as service packs used to be, or as update rollups are, so it is possible that some of them cause problems.

My philosophy is to keep everything updated and so reduce the risk of having a problem. As far as I can remember, I have had issues with a hotfix exactly once (1), and that includes the several years I worked at Microsoft Premier Support, assisting customers with problems and now and then providing a hotfix for an issue. So I would rather install the relevant hotfixes I know of, to reduce the risk of hitting a real problem, than wait for that issue to actually happen and then find a hotfix or open a case with Microsoft.

A hotfix also includes all previous fixes for that module, so when troubleshooting a problem it is common for Microsoft Support to ask you to install hotfixes X, Y and Z to bring the components involved to the latest revision. Thus, some of the KB articles and hotfixes below might look like they don’t apply to you, or describe a problem you don’t have. But if they relate to Cluster, Hyper-V or any other component you use, it might be wise to install them anyway, as they could fix 10 other problems you are not aware of.

I always import the updates directly into WSUS and deploy them from there, so I can use approval rules and get reports on which updates have been installed where. Here is a good guide on how to do it: http://www.thirdtier.net/2013/03/how-to-manually-add-a-hotfix-to-wsus/

As far as I know (and I’ve also asked Premier Support), there is no way to script the import of updates into WSUS directly from the Microsoft Update Catalog. You have to import them manually with a web browser. Click, click, click, wait, click, click…

The list is ordered by release date, so the latest hotfixes are at the top. Looking at a fresh fabric deployment, most hotfixes older than 10/14/2014 appear to have been superseded, except for KB2965733, which was still needed by a couple of servers in that new environment. But things might be different for you.
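The supersedence observation above can be expressed as a simple filter. This Python sketch uses a handful of entries copied from the list in this post; the cutoff rule is only the rule of thumb from one fresh deployment, and the supersedence information in WSUS is what you should actually rely on.

```python
from datetime import date

# A few (KB, release date) entries copied from the list in this post.
HOTFIXES = [
    ("KB3072380", date(2015, 7, 14)),
    ("KB2995054", date(2014, 10, 14)),
    ("KB2973052", date(2014, 8, 12)),
    ("KB2965733", date(2014, 6, 10)),
    ("KB2913695", date(2014, 1, 14)),
]

CUTOFF = date(2014, 10, 14)
ALWAYS_INCLUDE = {"KB2965733"}  # still needed on a fresh deployment

def hotfixes_for_new_fabric(hotfixes):
    """Keep hotfixes released on or after the cutoff, plus the known exceptions."""
    return [kb for kb, released in hotfixes
            if released >= CUTOFF or kb in ALWAYS_INCLUDE]

print(hotfixes_for_new_fabric(HOTFIXES))  # ['KB3072380', 'KB2995054', 'KB2965733']
```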

I’ve also written a PowerShell script for SCVMM that creates baselines and imports all updates and hotfixes into them, which makes it easy to run compliance scans and remediate from SCVMM to keep the fabric updated.
You can find it here: https://gallery.technet.microsoft.com/scriptcenter/SCVMM-Automatic-Baseline-8779597b

It is not easy to find new hotfixes or to know which ones are mandatory. Luckily, there is a blog post to help you out: I’ve collected all the sources where Microsoft product teams list the hotfixes they recommend.
You can find the lists here: http://www.isolation.se/list-of-resources-to-find-hotfixes-and-updates-for-windows-server-2012-r2/

Anyway, here is the long list of fixes for possible problems in your environments. Updated: 7/22/2015

Hyper-V cluster unnecessarily recovers the virtual machine resources in Windows Server 2012 R2
http://support.microsoft.com/kb/3072380   Released: 7/14/2015

Virtual machines that host on Windows Server 2012 R2 may crash or restart unexpectedly
http://support.microsoft.com/kb/3068445   Released: 7/14/2015

Added 07/22/2015    “0xc0000017” error when you restart a UEFI-based computer in Windows
https://support.microsoft.com/kb/3072381   Released: 7/13/2015

Interrupts to the Intelligent Platform Management Interface driver are missed in Windows Server 2012 R2
http://support.microsoft.com/kb/3061460   Released: 6/9/2015

Unexpected ASP.Net application shutdown after many App_Data file changes occur on a server that is running Windows Server 2012 R2
http://support.microsoft.com/kb/3052480   Released: 6/9/2015

Update adds support for compound ID claims in AD FS tokens in Windows Server 2012 R2
http://support.microsoft.com/kb/3052122   Released: 6/9/2015

Update to improve the backup of Hyper-V Integrated components in Hyper-V Server 2012 R2
http://support.microsoft.com/kb/3063283   Released: 6/9/2015

Stop error code 0xD1, 0x139, or 0x3B and random crashes in Windows Server 2012 R2
http://support.microsoft.com/kb/3055343   Released: 5/12/2015

Backup application that calls the VSS service becomes unresponsive when the DFSR service is running in Windows
http://support.microsoft.com/kb/3054249   Released: 5/12/2015

Resolution of external DNS records on a Windows Server 2012 R2 Hyper-V guest cluster fails through a Hyper-V Network Virtualization Gateway
http://support.microsoft.com/kb/3049448   Released: 5/12/2015

Shared Hyper-V virtual disk is inaccessible when it’s located in Storage Spaces on a Windows Server 2012 R2-based computer
http://support.microsoft.com/kb/3025091   Released: 5/12/2015

“The URL cannot be resolved” error in DirectAccess and routing failure on HNV gateway cluster in Windows Server 2012 R2
http://support.microsoft.com/kb/3047280   Released: 5/12/2015

Hyper-V host crashes and has errors when you perform a VM live migration in Windows 8.1 and Windows Server 2012 R2
http://support.microsoft.com/kb/3031598   Released: 4/14/2015

Hotfix enables AD FS token replay protection for Web Application Proxy authentication tokens in Windows Server 2012 R2
http://support.microsoft.com/kb/3042121   Released: 4/14/2015

“HTTP 400 – Bad Request” error when you open a shared mailbox through WAP in Windows Server 2012 R2
http://support.microsoft.com/kb/3042127   Released: 4/14/2015

Files cannot be copied when drive redirection is enabled in Windows 8.1 or Windows Server 2012 R2
http://support.microsoft.com/kb/3042841   Released: 4/14/2015

“STATUS_PURGE_FAILED” error when you perform VM replications by using SCVMM in Windows Server 2012 R2
http://support.microsoft.com/kb/3044457   Released: 4/14/2015

You cannot upgrade Hyper-V integration components or back up Windows virtual machines
http://support.microsoft.com/kb/3046826   Released: 4/14/2015

RDP session becomes unresponsive when you connect to a Windows Server 2012 R2-based computer
http://support.microsoft.com/kb/3047296   Released: 4/14/2015

“Your computer can’t connect to the remote computer” error because RD Gateway service freezes in Windows Server 2012 R2
http://support.microsoft.com/kb/3042843   Released: 4/14/2015

A SQL Server that is running in a Hyper-V virtual machine takes a long time to restore a database to a dynamic VHD
http://support.microsoft.com/kb/2970653   Released: 3/10/2015

DNS server does not try the second forwarder and other DNS improvements in Windows Server 2012 R2
http://support.microsoft.com/kb/3038024   Released: 3/10/2015

“0x000000D1” Stop error when you fail over a cluster group in Windows Server 2012 or Windows Server 2012 R2
http://support.microsoft.com/kb/3036614   Released: 3/10/2015

Hotfix for update password feature so that users are not required to use registered device in Windows Server 2012 R2
http://support.microsoft.com/kb/3035025   Released: 3/10/2015

AD FS cannot process SAML response in Windows Server 2012 R2
http://support.microsoft.com/kb/3033917   Released: 3/10/2015

Added 7/18/2015    “0x0000003B” or “0x0000007E” Stop error on a Windows-based computer that has 4K sector disks
https://support.microsoft.com/kb/3027108  Released: 2/10/2015

Custom values for various MPIO timers in Windows Server 2012 R2 may not be honored
http://support.microsoft.com/kb/3027115   Released: 2/10/2015

System may freeze if a reserved disk is mounted accidentally in Windows 8.1 or Windows Server 2012 R2
http://support.microsoft.com/kb/3027110   Released: 2/10/2015

RemoteApp window is too large or too small when you use RDP to run a RemoteApp application in Windows Server 2012 R2
http://support.microsoft.com/kb/3026738   Released: 2/10/2015

Operation fails when you try to save an Office file through Web Application Proxy in Windows Server 2012 R2
http://support.microsoft.com/kb/3025080   Released: 2/10/2015

You are not prompted for username again when you use an incorrect username to log on to Windows Server 2012 R2
http://support.microsoft.com/kb/3025078   Released: 2/10/2015

Hotfix to avoid a deadlock situation on a CSV file system volume on Windows Server 2012 R2
http://support.microsoft.com/kb/3022333   Released: 2/10/2015

You are prompted for authentication when you run a web application in Windows Server 2012 R2 AD FS
http://support.microsoft.com/kb/3020813   Released: 2/10/2015

Time-out failures after initial deployment of Device Registration service in Windows Server 2012 R2
http://support.microsoft.com/kb/3020773   Released: 2/10/2015

You are prompted for a username and password two times when you access Windows Server 2012 R2 AD FS server from intranet
http://support.microsoft.com/kb/3018886   Released: 2/10/2015

Cluster fixes for deadlock and resource time-out issues in Windows Server 2012 R2 Update 1
http://support.microsoft.com/kb/3023894   Released: 2/10/2015

RDS License Manager shows no issued free or temporary client access licenses in Windows Server 2012 R2
http://support.microsoft.com/kb/3013108   Released: 12/9/2014

iSCSI SAN server that’s running Windows Server 2012 R2 restarts unexpectedly on a high-speed network
http://support.microsoft.com/kb/3000123   Released: 11/11/2014

TRIM and UNMAP activities for thin provisioning on one volume block all activities on other volumes
http://support.microsoft.com/kb/2996802   Released: 11/11/2014

SMBv1 named pipe requests do not time out when the remote server hangs in Windows 7, Windows Server 2008, Windows 8.1, and Windows Server 2012 R2
http://support.microsoft.com/kb/2995054   Released: 10/14/2014

SMB 3.0 Transparent Failover feature does not work after you disconnect a drive cable in Windows
http://support.microsoft.com/kb/2991247   Released: 10/14/2014

WTSQuerySessionInformation API function always returns zero bytes for WTSIncomingBytes and WTSOutgoingBytes
http://support.microsoft.com/kb/2981330   Released: 10/14/2014

A network printer is deleted unexpectedly in Windows
http://support.microsoft.com/kb/2967077   Released: 8/12/2014

“0x00000018” Stop error when volumes are mounted in Windows Server 2012 R2 or Windows Server 2012
http://support.microsoft.com/kb/2973052   Released: 8/12/2014

Updates to improve the compatibility of Azure RemoteApp in Windows 8.1 or Windows Server 2012 R2
http://support.microsoft.com/kb/2977219   Released: 8/12/2014

Error 58 when an application calls BackupRead function to back up files that are shared by using SMB in Windows
http://support.microsoft.com/kb/2973055   Released: 7/8/2014

The guest cluster is not available to service users after failover in a Hyper-V Network Virtualization environment
https://support.microsoft.com/kb/2965733   Released: 6/10/2014

NFS version 4.1 and version 3 work unexpectedly in Windows Server 2012 R2 or Windows Server 2012
http://support.microsoft.com/kb/2934249   Released: 4/8/2014

CSV snapshot file is corrupted when you create some files on the live volume in Windows
http://support.microsoft.com/kb/2929869   Released: 4/8/2014

On-demand virus scan freezes in Windows
http://support.microsoft.com/kb/2904100   Released: 3/11/2014

Windows Server 2012 R2 or Windows 8.1 crashes when virtual volumes are exposed to hyper-v virtual machines
http://support.microsoft.com/kb/2925766   Released: 2/11/2014

iSCSI Target stops responding to requests in Windows Server 2012 R2
http://support.microsoft.com/kb/2919740   Released: 2/11/2014

Memory and deadlock issues for the RD Virtualization Host and RD Connection Broker role services in Windows 8.1
http://support.microsoft.com/kb/2908810   Released: 2/11/2014

Hotfix improves storage enclosure management for Storage Spaces in Windows 8.1 and Windows Server 2012 R2
http://support.microsoft.com/kb/2913766   Released: 1/14/2014

OffloadWrite is doing PrepareForCriticalIo for the whole VHD in a Windows Server 2012 or Windows Server 2012 R2 Hyper-V host
http://support.microsoft.com/kb/2913695   Released: 1/14/2014

 

The Interactive Services Detection service terminated with the following error: Incorrect function.

This morning I noticed that one of the Hyper-V hosts at a customer was regularly logging this error in the System event log:

The full detailed entry:

The events seem to happen every 30 minutes, at the same time as Windows, for some (so far) unknown reason, reinstalls a lot of MSI packages; the Interactive Services error is triggered exactly when it reinstalls DHCPExt.msi.

Unfortunately, I have so far not found anything that logs why Windows reconfigures all MSI packages on the server every 30 minutes.

It does look like the DHCP Server extension is causing the Interactive Services errors, as they always happen at the same time. Then again, the DHCP Server extension shouldn’t be reconfigured in the first place.

We always enable Reliability History on all servers, which can be handy at times to see when a problem began.
Check this out!

It looks like the problem started on April 28 at 8:42 PM.

As Reliability History is disabled by default, I’ll write another blog post showing how to enable it on all your servers.

When I wanted to see what had happened around April 28th, I noticed that those were the oldest entries in the Application log. When the log became full, the oldest entries were overwritten, as configured.

So I don’t think I’ll get any more details that way, and it does look like this problem has gone on for quite some time.

I’ll just reinstall the Hyper-V host, as that is done in a few minutes, compared to spending hours trying to fix the problem.
AND… I’ll create a Group Policy that increases the event log size to 10x the default, so the next time something like this happens, I’ll have information to dig deeper.

Updated 2015-05-19 09:08:

After some more digging: according to this KB article (KB974524: Event log message indicates that the Windows Installer reconfigured all installed applications), this problem can happen if one of the following is true:

  • You have a group policy with a WMI filter that queries the Win32_Product class.
  • You have an application installed on the machine that queries the Win32_Product class.

As the problem does not happen every 90-120 minutes, which would be the case if it were triggered by Group Policy, I would say it is an application that uses the Win32_Product class. After some more digging, it turns out to be a known problem with VMM, which will be fixed in UR7, or hopefully earlier with a hotfix.
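The interval argument can be checked mechanically. A minimal Python sketch, assuming Group Policy background refresh runs every 90 minutes plus a random offset of up to 30 minutes (so GPO-triggered events land 90 to 120 minutes apart), and that you have exported the event timestamps yourself; the timestamps below are made up.

```python
from datetime import datetime, timedelta
from typing import List

# Assumed GPO background refresh window: 90 minutes + random 0-30 minute offset.
GPO_MIN = timedelta(minutes=90)
GPO_MAX = timedelta(minutes=120)

def intervals(timestamps: List[datetime]) -> List[timedelta]:
    """Gaps between consecutive events, in chronological order."""
    ts = sorted(timestamps)
    return [b - a for a, b in zip(ts, ts[1:])]

def looks_gpo_triggered(timestamps: List[datetime]) -> bool:
    """True only if every gap falls inside the GPO refresh window."""
    gaps = intervals(timestamps)
    return bool(gaps) and all(GPO_MIN <= g <= GPO_MAX for g in gaps)

start = datetime(2015, 5, 19, 8, 0)
every_30 = [start + timedelta(minutes=30 * i) for i in range(5)]
print(looks_gpo_triggered(every_30))  # False
```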

Updated 2015-05-19 10:12:

Wow, I got a hotfix for the issue within 15 minutes of contacting the VMM team.
I’ve just installed it in our test environment and will later install it in the customer’s production environment.

Unfortunately, I don’t have a KB or hotfix ID for this, but if you contact Premier Support, I think you can mention that you need the hotfix for Engine.Adhc.Operations.dll that adds support for the registry key UpdateDHCPExtension.
That info should help them find the correct hotfix.

Unable to Connect to VMM in AzurePack after UR install

After upgrading to Update Rollup 6 (UR6), we got the same issue as with earlier URs: it is not possible to connect to VMM in Azure Pack, so you can’t see your virtual machines, clouds or networks.

It turned out that when UR6 is applied to SPF, the bindings are once again messed up. To fix this, log on to the server hosting SPF and check the bindings of the SPF website in IIS.

The SPF website is not running, and there are two bindings.
In my case, one had a certificate and the other didn’t, so I just removed the binding without a certificate. Then start the website, and everything works as expected again.

With earlier URs I have also seen no bindings listed at all, in which case you have to create the binding yourself.

The request size exceeded the configured MaxEnvelopeSize quota

Today, while updating our Azure Pack Websites servers, I got an error that prevented the upgrade of most of the Websites roles: the Management Servers, Publishing Servers, Front End Servers and all the Web Workers. Yes, every role except the Web Sites Controller.
The result was some unexpected downtime. Luckily, the only thing affected was this blog.

The error message I got was:
The WinRM client sent a request to the remote WS-Management service and was notified that the request size exceeded the configured MaxEnvelopeSize quota.
I could also see that the files being copied to c:\windows\temp (WebFarmAgent.msi) were corrupt.

I also saw the error “Failed to copy role artifacts to agent” in the log file on the Windows Azure Pack Websites Controller.

First, I ran this command in an elevated command prompt on the server hosting the Controller role:
C:\Windows\system32>winrm g winrm/config

And then the same command on one of the failing servers:
C:\Windows\system32>winrm g winrm/config

Notice the difference in MaxEnvelopeSizekb between the servers. Another of the servers had MaxEnvelopeSizekb set to 700.

I don’t know why the value differs between the servers or what suddenly changed it; my guess is some Windows Update patch. Then again, the same patches are installed on all the servers, and I’ve seen three different values. Wicked.
By using the same value on all the servers, I got the setup to work. And as you can see, this blog is running again. YAY!

I chose to set the value to the same as on the Controller server, since that is the server running the commands and copying the files to the other servers:
winrm set winrm/config @{MaxEnvelopeSizekb="8192"}
It then takes 5-60 minutes for all update and repair jobs to complete.
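With more than a handful of servers, comparing the output by eye gets tedious. A minimal Python sketch that parses the MaxEnvelopeSizekb line from saved "winrm get winrm/config" output and lists the servers that differ from the controller. The server names and the value 500 are hypothetical sample data; 700 and 8192 are the values from this post.

```python
import re
from typing import Dict, Optional

# Matches a line such as "    MaxEnvelopeSizekb = 500" in winrm config output.
_MAX_ENVELOPE = re.compile(r"^\s*MaxEnvelopeSizekb\s*=\s*(\d+)", re.IGNORECASE | re.MULTILINE)

def max_envelope_kb(winrm_output: str) -> Optional[int]:
    """Extract MaxEnvelopeSizekb, or None if the line is missing."""
    match = _MAX_ENVELOPE.search(winrm_output)
    return int(match.group(1)) if match else None

def servers_to_fix(outputs: Dict[str, str], reference: str) -> Dict[str, Optional[int]]:
    """Servers whose MaxEnvelopeSizekb differs from the reference (controller) server."""
    target = max_envelope_kb(outputs[reference])
    mismatched = {}
    for server, text in outputs.items():
        value = max_envelope_kb(text)
        if value != target:
            mismatched[server] = value
    return mismatched

# Hypothetical server names and sample output.
outputs = {
    "WAPCONTROLLER": "Config\n    MaxEnvelopeSizekb = 8192\n    MaxTimeoutms = 60000\n",
    "WAPWORKER01": "Config\n    MaxEnvelopeSizekb = 500\n    MaxTimeoutms = 60000\n",
    "WAPFRONTEND01": "Config\n    MaxEnvelopeSizekb = 700\n    MaxTimeoutms = 60000\n",
}
target = max_envelope_kb(outputs["WAPCONTROLLER"])
for server, value in servers_to_fix(outputs, "WAPCONTROLLER").items():
    print(f'{server}: {value} -> winrm set winrm/config @{{MaxEnvelopeSizekb="{target}"}}')
```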

I couldn’t find a Group Policy setting that sets this value as a default on all Azure Pack Websites servers, so I have to come up with another long-term solution. Maybe Desired State Configuration (DSC), or Configuration Manager?