Why Servers Fail

Many small businesses have one to three servers sitting in a dark closet. This closet is rarely looked at unless something fails or someone needs an extra power cord.

The servers housed in these closets continue to run for months and sometimes years until they fail. When they do crash, there’s an office-wide panic. Key personnel (owners, the accountant, operations managers, or a salesperson with a knack for gadgets) scramble to recover the server and the information on it. If their attempts fail, the third-party contract vendor is called. While awaiting the vendor's arrival and “the fix,” everyone nervously turns to neglected, non-digital tasks. During this time there is no focus on business objectives, which diminishes productivity and increases overhead cost.

The questions that immediately come to the ‘owner’ of the crash are:

  • What happens now?

  • How long will we be down?

  • How will I get my information back?

  • What will it cost to fix?

  • What are my employees going to do?

  • What is the business impact (cost and productivity)?

  • Did the last backup run?

  • Can I find my last backup tape?

  • Can I use my last backup tape?

It is not a pleasant place to be. Many companies feel they have a good backup and recovery plan, but haven’t tested whether the plan even works. Information is being backed up, but there is no confirmation that the data is actually on the tape. We will save recovery for another date.

The big question is “So what causes a server to fail?” There are many reasons. We’ve written a very concise piece which details not only why they fail but more importantly what you can do to prevent it from happening.

Whether you have one or 100 servers, over time they will fail. Customers experience a component failure in approximately 6% of their x86 servers, spread out fairly evenly during the first three years. This does not mean that 6% of the servers will be unavailable at any time during the first three years. If, for example, the failed component was a fan and there were redundant fans in the server, then the server will continue to operate, and the administrator can replace the failed fan soon thereafter. In the fourth year, 50% of the x86 servers will likely have experienced a component failure, and by the end of the fifth year, all x86 servers will have experienced a component failure. Again, these component failures may not result in server failures, but they will require attention eventually.
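To make these percentages concrete for your own closet, here is a minimal Python sketch of the arithmetic. It assumes the 6% figure is a three-year total split evenly at 2% per year, which is one reading of the numbers above rather than a stated fact, and the function name is purely illustrative.

```python
# Cumulative share of servers that have had at least one component failure
# by the end of each year, using the figures quoted above.
# ASSUMPTION: the 6% covers years 1-3 in total and is split evenly (2% a year).
CUMULATIVE_FAILURE_SHARE = {1: 0.02, 2: 0.04, 3: 0.06, 4: 0.50, 5: 1.00}


def expected_servers_affected(fleet_size: int, year: int) -> float:
    """Expected number of servers with a component failure by the end of `year`."""
    if year not in CUMULATIVE_FAILURE_SHARE:
        raise ValueError("year must be between 1 and 5")
    return fleet_size * CUMULATIVE_FAILURE_SHARE[year]


if __name__ == "__main__":
    # Print the year-by-year expectation for a hypothetical fleet of 100 servers.
    for year in sorted(CUMULATIVE_FAILURE_SHARE):
        affected = expected_servers_affected(100, year)
        print(f"Year {year}: about {affected:.0f} of 100 servers")
```

Remember that these are component failures, not outages: a server with redundant parts may keep running through many of them.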

Most x86 servers come with a standard three-year warranty. Support in years four and five may be expensive, but customers can consider self-support or third-party maintenance to reduce those costs. After five years, replacement parts may become almost impossible to obtain. Based on these expected hardware failure rates and the cost of maintenance and replacement parts, enterprises should aim to replace their x86 servers approximately 48 to 60 months after installation. A firm plan should be in place to replace x86 servers no later than the end of the fifth year.
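The 48-to-60-month guideline is easy to turn into dates for each server you own. The sketch below is one way to do it in Python; the function names are hypothetical, and rounding to the first day of the month is a simplifying assumption for planning purposes.

```python
from datetime import date


def add_months(start: date, months: int) -> date:
    """Return the first day of the month that falls `months` after `start`."""
    total = start.year * 12 + (start.month - 1) + months
    return date(total // 12, total % 12 + 1, 1)


def replacement_window(installed: date) -> tuple:
    """Return (earliest, latest) replacement dates: 48 and 60 months after install."""
    return add_months(installed, 48), add_months(installed, 60)


if __name__ == "__main__":
    # Hypothetical install date, used only for illustration.
    earliest, latest = replacement_window(date(2020, 3, 15))
    print(f"Plan replacement between {earliest} and {latest}")
```

Keeping a simple list of install dates and running them through a calculation like this gives you a replacement calendar well before the warranty runs out.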

Why do servers fail?

  • The computer room/closet is not cool enough thereby causing the server(s) to overheat. I call this a slow death because heat slowly damages the internal circuitry. It is important to have controlled cooled airflow around your computing environment.

  • Security and critical patches are not up to date. There are patches released daily and many are difficult to install and manage. In many cases patches can have an ill effect on applications running on the server.

  • Improperly installed software. If an application is not installed correctly, a missing driver, DLL, or other file can cause a server to crash.

  • Mismanagement. Without regular monitoring of the server, log files grow. Eventually you run out of space causing the server to fail. Have you ever noticed how slow your PC becomes as it ages? The same thing happens with servers. You need to perform regular maintenance on your servers.

  • Limited capacity: storage, memory, CPUs. Applications are often installed on a server with no regard for the other applications already running there. The applications then starve for resources, which slows performance and in many cases crashes the server.

  • Outdated operating system. Windows XP will only be supported through 2014. Windows 2000 is no longer supported by Microsoft. Once an operating system stops receiving regular patch updates, there may be no way to recover the server when it eventually fails.

How can this be prevented?

  • Hardware redundancy. Multiple power supplies, redundant RAM, and a RAID (Redundant Array of Independent Disks) help ensure availability.

  • A defined approach for managing and monitoring the server environment. It starts with creating standard operating procedures for managing your server(s). These operating procedures include installation instructions, a backup and recovery plan, and a list of installed software. This creates your baseline.

  • Regular patch updates. This improves the performance of the server and limits security vulnerabilities. Please note that it is important to review patch documentation before installing to prevent it from impacting the server’s performance.

  • Routine preventative hardware and software maintenance (PM) on the server. Regular hardware maintenance, like removing dust and keeping the airflow through the server from being obstructed, goes a long way toward improving the server’s longevity. On the operating system side, emptying the recycle bin, deleting files in the temp folders, and defragmenting the hard drive all improve the server’s performance.

  • Daily review of event logs and the overall performance of the server to understand performance trends and recognize potential problems before they happen.

  • A well-maintained server closet or datacenter that includes clean power (UPS, generator, power conditioning units) and controlled, cooled airflow to keep the server from crashing.
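Part of the monitoring described above can be automated. As a rough illustration, the Python sketch below checks how full a disk is, since running out of space from growing log files is one of the failure causes listed earlier. The 80% warning threshold and the function names are assumptions chosen for the example, not recommendations from any vendor.

```python
import shutil


def disk_usage_percent(path: str) -> float:
    """Return how full the disk holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def check_disk(path: str, warn_at_percent: float = 80.0) -> str:
    """Return a one-line status message for the disk holding `path`."""
    percent = disk_usage_percent(path)
    # ASSUMPTION: 80% is an illustrative default; pick a threshold that suits you.
    status = "WARNING" if percent >= warn_at_percent else "OK"
    return f"{status}: {path} is {percent:.1f}% full"


if __name__ == "__main__":
    # Check the disk that holds the current directory.
    print(check_disk("."))
```

A check like this could be scheduled to run daily so that a filling disk is noticed long before the server stops.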

Effectively running an IT environment is expensive, and in many cases the full costs and risks are not completely understood. This is where Synergy Technical provides an opportunity. We take away the routine operations needed to manage an organization’s IT environment, including servers. We take on the heavy lifting and remove the risks, delivering you a well-managed computing environment at an affordable cost.

 
