24/7/365 support [closed]
So the organization I work for as a sysadmin is considering extending our hours to 24/7/365 - can anyone who works for or has worked in an extended hours shop maybe provide some basic info on your organization size, support hours, & call/incident volume so I can do a little comparison?
A lot of my coworkers seem to think that extended hours coverage is unnecessary due to our relatively small size. We do have 20+ international offices, but they are mostly quiet in terms of requesting technical assistance. Here are some more stats on my employer:
2000 employees
250 production servers worldwide
~20 incidents daily (sysadmin)
7-8 systems/network admins
We are also unsure if covering a 24 hour schedule will be difficult with only about 8 people (considering vacation and sick time) - so if you care to share how many people are in your group, that would be helpful as well.
Solution 1:
Whether this will work depends a lot on whether you're dealing with automated alerts, a staffed operations center, or waiting for people to call for support. Our shop is a 24/7/365 shop, but I work in production systems, not in desktop support/client support. We don't get calls directly from users, so that helps a lot. We also support our own home-grown apps, and the development teams are invested in keeping the systems error free (from a code standpoint; the servers are up to us). Depending on which system a given team supports, after-hours calls can range from 2-5 a month to 2-5 a week. This can also vary depending on whether a new code release caused issues.
As for compensation, our system was in place when I started working here, so my compensation was set with that in mind. If you are going to be expected to offer support after hours, you should be entitled to additional compensation. At the very least, if someone is on call and has to deal with an after-hours issue, they should be able to come in late the next day, or work from home if possible.
Solution 2:
One of the places I worked at had a 24/7 monitoring center. It took basically 10 staffers to fill all the slots, to have "spares" and to have a manager in charge. The biggest problem was the graveyard shift (2400-0800), since there was basically one guy coming off shift (in the entire building) and one guy coming on shift. Any delay meant automatic overtime for the guy on shift, and if the midnight guy was a no-show, a nice 16-hour shift.
We also had terrible turnover rates on those jobs and much difficulty recruiting for the night/evening shifts... Of course the non-peak shifts were only staffed with "operator" level employees, with technical back-up on call. Basically, management was unwilling to pay well enough to get highly technical resources to do night shifts, especially if they had families.
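To put rough numbers on staffing like this, here is a back-of-the-envelope sketch. It assumes a 40-hour work week and that about 15% of each person's time is lost to vacation, sickness, and training. Both figures, and the `min_staff` helper itself, are illustrative assumptions, not numbers from that monitoring center.

```python
import math

HOURS_PER_WEEK = 24 * 7  # 168 hours must be covered every week


def min_staff(seats_per_shift: int,
              weekly_hours_per_person: float = 40.0,
              availability: float = 0.85) -> int:
    """Smallest headcount that keeps `seats_per_shift` people on duty 24/7.

    `availability` is the assumed fraction of contracted hours a person
    actually works after vacation, sick time, and training.
    """
    if seats_per_shift < 1:
        raise ValueError("seats_per_shift must be at least 1")
    if weekly_hours_per_person <= 0 or not 0 < availability <= 1:
        raise ValueError("hours must be positive and availability in (0, 1]")
    effective_hours = weekly_hours_per_person * availability
    return math.ceil(seats_per_shift * HOURS_PER_WEEK / effective_hours)


if __name__ == "__main__":
    for seats in (1, 2):
        print(f"{seats} seat(s) staffed around the clock: "
              f"at least {min_staff(seats)} people")
```

Under these assumptions a single always-staffed seat already needs about five people before counting a manager or any on-call technical back-up, which is why a team of eight has little slack once two people are away at the same time.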
Solution 3:
One thing you're not asking - when you have an incident, how long is acceptable until someone actually looks at it? And will this impact your regular maintenance windows?
I saw a shop go gradually (over a couple of years) from exclusively 8/5 coverage to 50% 12/5 and 50% 24/6 without adding staff. The new issues this created were somewhat mitigated by more redundant hardware and better monitoring (especially alert delivery).
The real problem is that everyone had come to expect a large maintenance window during convenient hours of the day. This was handled by having more people work from 8am-8pm, more people scheduled to work Saturdays and Sundays, and three-day weekends being considered an opportunity to do maintenance that had previously been done after hours on normal days.
Another problem is that everyone had come to expect that only unusual problems, extremely severe problems, or problems with a certain small handful of systems needed to be handled in the middle of the night or on the weekend. This was handled by never defining these requirements, so that a severely sleep-deprived individual could choose to ignore a particular issue if he was unable to deal with it at the time, without fear of repercussions (although it could cost the company a great deal).
People became increasingly burnt out and disgruntled. But IT management was able to hide these problems from their management, looking like heroes able to do more work with less staff. Upper management never understood that the decreased maintenance windows for hardware maintenance were a serious problem, because IT management would not admit that this caused an issue.
The environment became increasingly pathological, and I ended up leaving due to IT management's habit of lying about requirements and issues all the way up the chain. The last I heard, the remaining staff are feeling more burned out and more ill-treated. People are losing their families because three-day weekends are considered a time to do extra work, and because they are expected to work such long hours that their family is asleep when they return home (only to do more work anyway).
However, one group started offering secret "comp days", i.e. if Saturday is scheduled for 18 hours of work, then Monday could be scheduled as a weekend day. This had to be done in secret for fear of management finding out that extra work requires extra people.
The company's profits are increasing, and the salary of the IT management is increasing.
Vacations were surprisingly not a problem due to careful scheduling and a diverse staff that did not share many religious holidays.
It seems like an overall success to me.