Disaster Recovery 101

A Disaster Recovery Plan is essential to manage your business critical systems should there be a seriously disruptive event, says Richard Booth

By Richard Booth, Infrastructure Service Manager, ULCC

Over the past few months we have had various queries regarding disaster recovery (DR) services and how they operate. In particular, there seems to be a growing number of Learning Technologists asking how they should protect their VLE resources in case of a disaster.

One area of apparent confusion is the roles played by data backups and service replication within disaster recovery. The distinction between the two is briefly explored below and should help people understand how each relates to the DR process.

Planning and KPIs

What is a disaster? Typically it’s defined as a sudden, unplanned event that results in an organisation failing to provide critical business functions in line with contracted service levels. The most likely scenarios in the UK include power failures, network connectivity issues, infrastructure component failure, security breaches, loss of workplace access, fire and flooding.

In the context of IT service provision, a Disaster Recovery Plan (DRP) is essential to manage your business-critical systems should there be a seriously disruptive event. The question the DRP answers is: “if we lost our IT services, how would we recover them?” The DRP documents a comprehensive and consistent set of IT procedures to be followed before, during and after a disaster. The document should cover all aspects of the DR life-cycle, including likely events and scenarios, invoking DR, governance, budgeting, recovery strategies, implementation, testing, incident records and process documentation.

Your DRP should also be owned by a manager with sufficient authority to make organisational or business-level decisions. The IT director is usually the obvious choice for this role, as they should sit on the senior management team (SMT), hold the DR budget and have a good view of the various operational functions.

Recovery Time and Recovery Point Objective

If a disaster were to strike, your DRP should enable you to recover as quickly as possible, within a target maximum downtime (Recovery Time Objective or RTO), and with as little data loss as possible, within a target maximum loss window (Recovery Point Objective or RPO). These KPIs are set by the organisation, so the DRP should be built to support them. A DRP that doesn’t support the RPO and RTO is likely to lead to reputational and financial repercussions. A successful DRP, by contrast, will be your emergency manual if the worst happens, enabling you to restore your organisation’s IT functions in good time.
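As a rough illustration of how these two KPIs can be checked after an incident or a DR rehearsal, the sketch below compares measured downtime and data loss against the objectives. The names `Objectives` and `assess_recovery`, and all timestamps and targets, are hypothetical and chosen only for this example; they are not taken from any ULCC tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Objectives:
    """Hypothetical container for the KPIs set by the organisation."""
    rto: timedelta  # maximum acceptable time to restore service
    rpo: timedelta  # maximum acceptable window of lost data


def assess_recovery(objectives, last_good_copy, outage_start, service_restored):
    """Compare an actual (or rehearsed) recovery against the objectives."""
    # Downtime is measured from the start of the outage to restored service.
    downtime = service_restored - outage_start
    # Data written after the last recoverable copy is assumed to be lost.
    data_loss_window = outage_start - last_good_copy
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= objectives.rto,
        "rpo_met": data_loss_window <= objectives.rpo,
    }


# Illustrative figures only: a 4-hour RTO and a 1-hour RPO.
result = assess_recovery(
    Objectives(rto=timedelta(hours=4), rpo=timedelta(hours=1)),
    last_good_copy=datetime(2014, 10, 17, 2, 0),
    outage_start=datetime(2014, 10, 17, 9, 30),
    service_restored=datetime(2014, 10, 17, 12, 30),
)
print(result["rto_met"], result["rpo_met"])
```

In this made-up scenario the service is back within the RTO, but the last good copy is seven and a half hours old, so the RPO is missed.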

Backups and DR

The relationship between your backups and your DRP may not be as straightforward as first thought. Typically, backups are performed as daily dumps and also fall into broader weekly, monthly and yearly schedules. Their main aim is usually compliance and the granular recovery of data during normal workplace operation; for example, recovering single files from various point-in-time increments, or recovering from a single system failure.

A DR strategy based on offsite backups doesn’t usually turn out to be the most cost-effective solution when you look at the whole picture. There can be a whole raft of planning, external hosting and testing considerations that get overlooked and add to budgetary pressures. Disaster recovery requires that the right people, software and supporting platforms are all present and available before your data can be restored in a meaningful way.

Using backups as your only DR solution is likely to substantially impact both your RPO and RTO aspirations. Your RPO will depend on when your last data dump was taken, which could be 24 hours ago or more, depending on your scheduling. Your RTO will be entirely reliant on how quickly you can restore your data on to a re-provisioned platform, which could take several days.
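The arithmetic behind this can be sketched in a few lines. The function below is a back-of-the-envelope estimate only: it assumes the worst-case RPO is roughly one backup interval, and that RTO is the time to re-provision a platform plus the time to copy the data back at a steady rate. The function name, parameters and all figures are hypothetical.

```python
from datetime import timedelta


def backup_only_estimate(backup_interval, data_size_gb,
                         restore_rate_gb_per_hour, reprovision_time):
    """Estimate worst-case RPO and RTO when backups are the only DR measure."""
    if restore_rate_gb_per_hour <= 0:
        raise ValueError("restore rate must be positive")
    # Worst case: the disaster strikes just before the next scheduled dump,
    # so roughly one full interval of changes is lost.
    worst_case_rpo = backup_interval
    # Time to copy the data back, assuming a constant restore throughput.
    restore_time = timedelta(hours=data_size_gb / restore_rate_gb_per_hour)
    # The platform has to exist before any data can be restored on to it.
    estimated_rto = reprovision_time + restore_time
    return worst_case_rpo, estimated_rto


# Illustrative figures only: daily dumps, 2 TB of data restored at 100 GB/hour,
# and two days to re-provision a replacement platform.
rpo, rto = backup_only_estimate(
    backup_interval=timedelta(hours=24),
    data_size_gb=2000,
    restore_rate_gb_per_hour=100,
    reprovision_time=timedelta(days=2),
)
print(f"Worst-case RPO: {rpo}, estimated RTO: {rto}")
```

With these assumed inputs, the restore alone takes 20 hours and the re-provisioning dominates the total, which is why a backup-only RTO can easily run to days.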

Benefits of Replication

Data replication technologies have significantly improved the protection of services hosted on virtualised platforms. The main focus of replication is business continuity: ensuring that mission-critical systems remain highly available, even when a disaster happens.

At ULCC, we have been using VMware’s Site Recovery Manager (SRM) to replicate some customer services from our datacentre in central London to our remote facility in Maidstone, Kent. This is an additional bolt-on service, which works seamlessly in the background and allows full, non-disruptive and auditable testing.

As an aside, one trend we are starting to see is that customers are becoming more interested in lowering their RPO at the expense of their RTO. Understandably, data loss seems to be becoming the main influencing KPI.

The original blog post first appeared on the ULCC Infrastructure Services Blog on 17/10/2014. To read the complete article visit: https://bit.ly/1HyWxON
