By APK | Tuesday 26 April 2016 15:28 | 0 Comments
One of our largest customers decided to use the Good Friday holiday to handle some server maintenance.  While they are normally a 24/7 shop, this is one of the few days of the year when they run not just a skeleton crew but a skeleton of a skeleton crew, and they generally receive no calls.

After we performed and verified the backups and upgraded the OS with all the relevant patches, the server was rebooted.  A quick check showed that the LH service was not running, which seemed a bit odd, mostly because it had never failed to start before.  Trying to start the service immediately displayed an error stating that the service did not start in a timely manner.  While none of us was completely sure what's considered "a timely manner", we were all in agreement that "a timely manner" was not "immediately".

At this point, we thought it was time to examine the Windows event log for some clues.  However, this proved to be a problem because the event viewer wouldn't load.  This did not seem like a good sign, so we rebooted the server, but nothing changed.  Looking deeper into the system, we found that the event viewer service wasn't starting either.  By now we thought there was something horribly wrong with the server and that something had gone wrong with the patch installation.  While some people checked what had been installed and what side effects there might have been, others looked into why the event viewer wouldn't work.

Eventually, we worked out that the event viewer wouldn't start because the subdirectory storing the log files was read-only.  Once we set those files to read-write, the event viewer service started.  Once the event viewer service was running, the LH service started as well.
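For anyone who wants to rule out the same cause quickly, here is a minimal sketch of the check we ended up doing by hand.  It walks a directory, lists the files whose write bit is cleared (which is how Python reports the Windows read-only attribute), and can clear that flag again.  The function names are our own for illustration, and the directory you pass in is whatever folder your system actually keeps its event logs in; we are not assuming a particular path here.

```python
import os
import stat
from pathlib import Path


def find_read_only(directory):
    """Return the files under `directory` whose write bit is cleared.

    On Windows, Python maps the read-only file attribute onto this bit.
    """
    read_only = []
    for path in Path(directory).rglob("*"):
        if path.is_file() and not (path.stat().st_mode & stat.S_IWRITE):
            read_only.append(path)
    return sorted(read_only)


def make_writable(paths):
    """Set the write bit on each path (clears read-only on Windows)."""
    for path in paths:
        os.chmod(path, path.stat().st_mode | stat.S_IWRITE)
```

Running `find_read_only` on the log folder before and after a patch cycle would have shown the change right away.  Note that this only inspects the read-only flag; it does not look at ACLs, which can block writes as well.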

As near as we can work out, the LH service couldn't update the event log and immediately failed.  This started an interesting discussion with the client: how would we have coded for this particular error?  Revelation is correct in that the LH service should halt when an OS feature as basic as event logging is unavailable.  Such a failure points to a fundamental error on the server, which should be handled immediately.  The difficulty is that there isn't a very good way to log the error.  Normally the error would go into the event log files, but the error was that the service couldn't write to the event logs.  The LH service at least managed to avoid the perpetual loop of writing to the event log, generating an error, which writes to the event log, generating another error.
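To make the discussion concrete, here is a rough sketch of the "halt, and don't loop" behaviour described above.  This is not how the LH service is implemented; `start_service`, `write_event` and `run` are hypothetical names we made up for the example.  The idea is simply that the first log write at startup doubles as a health check, and that a failure is reported through a different channel (standard error here) and is never fed back into the logger that just failed.

```python
import sys


def start_service(write_event, run, stderr=sys.stderr):
    """Start a service only if its event log is writable.

    `write_event` is a hypothetical callable that writes one log entry
    and raises OSError on failure; `run` is the service's main routine.
    Returns 0 on a normal start and 1 if logging is unavailable.
    """
    try:
        write_event("service starting")
    except OSError as exc:
        # Report on a separate channel.  Calling write_event again here
        # would risk the write -> error -> write loop.
        print(f"fatal: event log unavailable: {exc}", file=stderr)
        return 1
    run()
    return 0
```

Because the failure path makes exactly one attempt to log and then gives up, the service stops with a visible error instead of spinning.  A real service would probably also want to hand the failure reason back to whatever started it, but that part depends on the platform and is left out.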

The most interesting thing in all this was that Windows never bothered to inform us of the error, and all (most?) of the other Windows services loaded.  We still have no real idea which services started and which failed.  Had the LH service not acted correctly, a major production server could have run for weeks in a potentially inoperable state.

