
Do You Need IP-Based Disaster Recovery?

A new category of disaster recovery solutions is emerging. Known as virtually synchronous replication or asynchronous mirroring, it was previously unavailable because of the difficulty of maintaining the characteristics of a localized disk mirror over distance. Now that it is available, here is how and when to use it.

Due to increased concerns and legislation around data recovery and retention, organizations are evaluating their remote data recovery options. Unfortunately, weeding through the myriad options and claims can be quite daunting. Companies must take a holistic approach, carefully considering many issues including the target environment, scope of the project along with chief objectives, and business and technical risks associated with deploying a solution.

When evaluating disaster recovery needs, companies must consider the following:

  • Legislative and compliance requirements
  • Internal and external infrastructure requirements
  • Implementation budget and recurring costs
  • Management and personnel costs and availability
  • Acceptable risk levels for disaster recovery

Determining infrastructure requirements is the most involved of these considerations. First, companies must identify and define the data they'd like to preserve. There are several questions IT professionals can ask to ascertain this information. What is the most critical time-sensitive data? How much data is there? How quickly does it change (that is, what is the data change rate)? Where and how is it stored? On a single server? On a single data storage device? On multiple servers and storage arrays? Are the servers homogeneous, or are multiple operating systems and hardware vendors involved? Does the organization have a storage area network (SAN) or network attached storage (NAS) in place? Are there branch offices with critical data?

As companies begin to understand the scope of the data recovery project, they may gravitate to one solution or another. To avoid significant pitfalls, they should pay particular attention to how potential solutions measure up with regard to scalability, compatibility, and performance.

Scalability is an important issue, both in terms of the current network's size, and any expected growth. A suitable disaster recovery solution must be able to grow with the organization's network infrastructure and data.

Compatibility of the solution with the existing servers and network is critical. Companies must overcome the common notion that when it comes to disaster recovery, the cure is worse than the original problem. Software compatibility issues cause network and server outages, drain IT resources, and impact customer confidence. Before implementing a solution, companies must have a thorough understanding of all compatibility issues related to the product.

Performance is another concern. Software agents and drivers required by many solutions can have an adverse effect on server or network performance. It's important to confirm that the solution doesn't degrade performance and that operations remain within acceptable levels. Also, the available network should be able to support the solution selected. Solutions designed for environments that operate around a defined schedule can experience issues when deployed in a 24-by-7 environment.

Once the target environment is understood, companies should determine budgetary requirements. An organization should budget for the initial cost of the solution and for such recurring costs as support contracts, software updates and hardware expansion. Another often-overlooked expense is the price of personnel required to maintain the disaster recovery solution. More complex solutions require greater expertise. Companies should consider whether they have budget to bring in consultants or train employees if additional expertise is required.

Finally, companies should determine their risk tolerance, which has two components: the timeliness of the data and the time required to recover data. Businesses, overwhelmed by the arduous task of deciphering the many aspects of today's disaster recovery solutions, often settle for risk scenarios that are "good enough." For example, some businesses consider traditional backup solutions to be good enough because, in the event of a disaster, offsite tapes can be retrieved, new servers can be brought online, and data as timely as the current or previous month can be restored from the tapes.

A well-planned backup and recovery strategy using traditional disaster recovery solutions such as tape backups has a company up and running with relatively recent data within a few days. If that scenario is sufficient, an IP-based disaster recovery strategy may not be necessary. On the other hand, many businesses are finding that an IP-based solution presents the same cost as tape backups, yet when disaster strikes, they gain immediate access to current data.

IP-based disaster recovery

The two-tiered goal of an IP-based disaster recovery solution is to ensure that data is copied to a remote site in real-time and that recovery can occur immediately. Businesses will find that in most disaster recovery solutions, there are tradeoffs between cost, ease-of-use, infrastructure requirements, and reliability of the data.

Disaster recovery technologies face many challenges in today's dynamic business world. Many of the solutions available today were designed when data changes happened within defined schedules (e.g., 9:00 am to 5:00 pm) rather than on a 24x7 basis. As such, these solutions tend to fail when used outside a defined schedule environment. Another challenge, ironically, stems from improvements made to data storage methods. As operating systems and applications became increasingly sophisticated, they employed such techniques as journaling and logging to ensure that data is recoverable in the event of a power failure. The result is that operating systems and applications recover well locally. However, recovery risks exist in remote data protection scenarios when solutions operate outside the guidelines of how current operating systems and applications write data. Data protection solutions that are susceptible to these recovery risks often try to offset them with additional data recovery agents or drivers. Such solutions tend to address many of the core recovery issues while missing many of the edge conditions that can negatively impact recovery. Additionally, as discussed earlier, agents and drivers can have an adverse impact on server or network performance.

In a heterogeneous network, employing sophisticated solutions to handle every server type can require a complex combination of disaster recovery products and software agents. This approach not only adds complexity and expense to a solution, but also requires specialized expertise for managing storage and recovery. IT professionals responsible for maintaining these systems must learn a whole new set of recovery techniques for each product or application used. When it comes time to recover data, the simplest solution is often the best.

What is replication?

The term replication is now a buzzword in the disaster recovery market, and is often used incorrectly to refer to any technology that creates a backup of primary data on a secondary device. This creates a confusing situation for consumers, since many believe that all solutions perform the same task and provide the same level of protection. In actuality, there are many different technologies to choose from, and each has its own set of pros and cons.

At a base level, replication is defined as a technique that uses software to copy and modify data so that it can be transferred effectively to a remote location. Traditionally, replication acquires its data through such external means as an application, driver, or software agent. The first technique employed was originally called file shadowing, and is now commonly referred to as file level replication. This software works by scanning files to determine whether or not they have changed. If a file has changed, the file level replication application waits for the document to close and makes a copy of this file to send to the remote site. While this works well in an environment where a small number of changes are made to files that are quickly closed, file level replication becomes inefficient when there are a lot of file changes. File level replication can also use larger amounts of bandwidth than necessary as it copies whole files to the remote system. Moreover, file level replication outright fails in a scenario (common in database applications) where the file does not close, or closes only at the end of the business day, after which the replicator must "catch up" during off-hours (See Figure 1).
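The behavior described above can be illustrated with a minimal in-memory sketch. It is not any vendor's implementation: the function name `file_level_replicate`, the dictionary-based "file system," and the file names and sizes are all assumptions made for illustration, and a real product would detect changes through timestamps or file-system notifications rather than by comparing contents.

```python
def file_level_replicate(source, replica, open_files=frozenset()):
    """Toy file-level replicator (hypothetical, for illustration only).

    source and replica map file name -> file contents (bytes).
    Files listed in open_files are skipped, because a file-level
    replicator waits for a file to close before copying it.
    Returns the number of bytes sent to the replica.
    """
    sent = 0
    for name, content in source.items():
        if name in open_files:
            continue  # still open: nothing is replicated yet
        if replica.get(name) != content:
            replica[name] = content  # the WHOLE file is copied
            sent += len(content)
    return sent


# Assumed example data: a 1 MB document and a 5 MB database file.
source = {"report.doc": b"x" * 1_000_000, "orders.db": b"y" * 5_000_000}
replica = {}
initial_bytes = file_level_replicate(source, replica)

# One byte changes in each file, but the database stays open.
source["report.doc"] = b"z" + source["report.doc"][1:]
source["orders.db"] = b"w" + source["orders.db"][1:]
bytes_sent = file_level_replicate(source, replica, open_files={"orders.db"})

# A one-byte edit costs a full-file transfer, and the open database
# file is left stale at the remote site until it closes.
database_is_stale = replica["orders.db"] != source["orders.db"]
```

In this sketch a single changed byte in the document triggers a transfer of the entire file, while the open database file is not replicated at all, which mirrors the two weaknesses noted above.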

A newer form of replication is the snapshot. Snapshot technology resembles traditional backups in that it uses an agent to take a picture of your data at a single point in time and transfer it to a remote location. The term point-in-time copy is often used synonymously with snapshot and is essentially the same technique. This solution, while effective in many cases, requires increasingly large amounts of bandwidth as volume sizes and file changes increase.

To offset this problem, some snapshot solutions employ incremental updates. On a periodic basis, after taking a snapshot (point-in-time copy), replication software records data changes made since the last snapshot or the last incremental update. The changes are recorded at the remote location until another snapshot takes place. While this decreases the amount of data that needs to be transmitted, it creates data consistency issues and can increase recovery times. As with incremental tape backups, recovery involves restoring the snapshot and then restoring each of the incremental updates (See Figure 2).
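The recovery sequence for a snapshot with incremental updates can be sketched as follows. This is a simplified model, not a product's restore procedure: the `restore` function and the block-number-to-contents dictionaries are hypothetical.

```python
def restore(snapshot, incrementals):
    """Rebuild a volume from a full snapshot plus ordered incremental updates.

    snapshot maps block number -> contents at the point-in-time copy.
    incrementals is a list of dicts, oldest first, each holding only the
    blocks that changed since the previous snapshot or update.
    """
    volume = dict(snapshot)          # step 1: restore the snapshot
    for update in incrementals:      # step 2: replay every update in order
        volume.update(update)
    return volume


# Assumed example data: three blocks and three incremental updates.
snapshot = {0: b"A", 1: b"B", 2: b"C"}
incrementals = [{1: b"B1"}, {2: b"C1"}, {1: b"B2"}]

restored = restore(snapshot, incrementals)

# Replaying the same updates in the wrong order yields a different volume,
# one way that incremental schemes can end up with inconsistent data.
misordered = restore(snapshot, list(reversed(incrementals)))
```

Recovery time grows with the number of incremental updates that must be replayed, and the result is only correct if every update is present and applied in its original order.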

Block-level replication is becoming a common technique. The method scans the hard drive or volume to determine which blocks (sections of contiguous data on the drive) have changed since the last scan. These changes are then transmitted to the desired remote location. Block-level replication uses one of two techniques: synchronous replication and asynchronous replication.
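One way to picture the scan for changed blocks is to compare per-block checksums between scans. The sketch below assumes a 4 KB block size and SHA-256 digests; both are illustrative choices, and actual products may track changes quite differently (for example, with a change bitmap maintained by a driver).

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def block_digests(volume):
    """Return one SHA-256 digest per fixed-size block of the volume."""
    return [
        hashlib.sha256(volume[i:i + BLOCK_SIZE]).digest()
        for i in range(0, len(volume), BLOCK_SIZE)
    ]


def changed_blocks(previous_digests, volume):
    """Return (indexes of blocks changed since the last scan, new digests)."""
    current = block_digests(volume)
    changed = [
        i for i, digest in enumerate(current)
        if i >= len(previous_digests) or digest != previous_digests[i]
    ]
    return changed, current


# A 16-block volume; one byte inside block 1 changes between scans.
volume = bytearray(16 * BLOCK_SIZE)
digests = block_digests(bytes(volume))
volume[5000] = 0xFF
changed, digests = changed_blocks(digests, bytes(volume))

bytes_to_send = len(changed) * BLOCK_SIZE
```

Here only the one changed block needs to cross the network, rather than the whole 64 KB volume, which is the efficiency gain over copying entire files.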

Synchronous replication employs a process that creates a "modified" byte-by-byte copy of the primary data at a secondary location. Synchronous replication ensures that data is committed to the remote data volume before informing the client that the data has been written. Another write is not allowed to commit until the local client receives confirmation of the remote write. A substantial limitation of this technology is that it requires a high-speed and high-cost connection to the remote location. For example, suppliers of synchronous remote replication solutions commonly recommend customers locate a recovery site within 20 km of the primary site. As such, synchronous replication does not meet the requirements of an ideal remote data protection solution with respect to distance limitations. In this scenario, companies must balance the amount of data to protect against the cost of the distance to the remote site. Additionally, a significant external infrastructure is required to ensure the performance of this solution, including dedicated high bandwidth leased lines. Synchronous replication solutions are sensitive to delays caused by router hops or multiple switches. As a result, distance rapidly increases costs while degrading performance and reliability.
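A rough calculation helps show why distance matters so much for synchronous replication. Because each write waits for the remote confirmation, the round-trip time caps how many writes can complete in sequence. The sketch below assumes signals travel through optical fiber at about 200,000 km per second and ignores everything except propagation and an optional per-hop delay; the function name and parameters are hypothetical, and real-world throughput will be lower.

```python
FIBER_KM_PER_SECOND = 200_000  # approximate signal speed in optical fiber


def max_serialized_writes_per_second(distance_km, hops=0, per_hop_delay_s=0.0):
    """Upper bound on back-to-back synchronous writes over a given distance.

    Each write must wait one full round trip (out and back), plus any
    delay added by routers or switches in each direction.
    """
    round_trip_s = 2 * distance_km / FIBER_KM_PER_SECOND
    round_trip_s += 2 * hops * per_hop_delay_s
    return 1.0 / round_trip_s


nearby = max_serialized_writes_per_second(20)      # about 5,000 writes/s
distant = max_serialized_writes_per_second(1000)   # about 100 writes/s

# Five router hops at an assumed 0.5 ms each dominate the 20 km case.
nearby_with_hops = max_serialized_writes_per_second(
    20, hops=5, per_hop_delay_s=0.0005
)
```

Under these assumptions, moving the recovery site from 20 km to 1,000 km cuts the ceiling on serialized writes by a factor of fifty, before any equipment delay is counted.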

Asynchronous replication is similar to synchronous replication with one key difference. Asynchronous replication ensures that data is committed to the local volume before informing the client that the data has been written. It does not ensure that the data is written to the remote location. Instead, it trusts the software and the bandwidth to correctly transfer information to the remote location. Skipping verification of the data transfer solves many cost, distance and bandwidth limitations of synchronous replication, while failing to ensure the integrity of remote data.
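The difference in when the client is told "written" can be modeled in a few lines. This is a toy model under stated assumptions: the `Replicator` class, its in-memory lists, and the `drain` method standing in for the network are all hypothetical.

```python
from collections import deque


class Replicator:
    """Toy model contrasting synchronous and asynchronous acknowledgment."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.local = []           # writes committed at the primary site
        self.remote = []          # writes committed at the remote site
        self.in_flight = deque()  # acknowledged but not yet at the remote

    def write(self, data):
        self.local.append(data)
        if self.synchronous:
            # Remote commit happens before the client is acknowledged.
            self.remote.append(data)
        else:
            # Client is acknowledged now; the transfer happens later.
            self.in_flight.append(data)
        return "ack"

    def drain(self, count):
        """Simulate the network delivering up to `count` queued writes."""
        for _ in range(min(count, len(self.in_flight))):
            self.remote.append(self.in_flight.popleft())


sync = Replicator(synchronous=True)
async_ = Replicator(synchronous=False)
for record in ["w1", "w2", "w3", "w4", "w5"]:
    sync.write(record)
    async_.write(record)

async_.drain(2)  # only two writes cross the link before a disaster strikes

sync_missing = len(sync.local) - len(sync.remote)
async_missing = len(async_.local) - len(async_.remote)
```

In the asynchronous case, three writes that the client believes are safe never reached the remote site, which is the exposure the paragraph above describes.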

While block-level replication offers efficiency improvements over file-level replication, attempts to further optimize the transmission of data introduce new issues related to data integrity. Many block-level replication solutions attempt to reduce the amount of data that must be sent by compressing writes, sending writes in parallel, or employing other techniques that change the order in which the data is written. As long as all of the data reaches the remote site, the remote copy will be in sync with the data at the primary site. However, with data changes occurring around the clock, it is more likely that some data will be buffered locally or in transit when a disaster occurs. The previously mentioned advances in file system technology, which use journaling and logging to ensure data can be recovered in the event of a local disruption, will work only if the data is written in the expected order. Block-level replication solutions that modify the write order of the data, while making more efficient use of the network, can thwart attempts to recover the data at the remote site.
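A toy journal makes the write-order problem concrete. The model below is deliberately simplified and is an assumption for illustration, not how any particular file system works: each transaction writes a data record followed by a commit marker, and recovery trusts any transaction whose commit marker arrived.

```python
def recover(remote_writes):
    """Replay a toy journal at the remote site.

    Each write is a (kind, transaction_id, payload) tuple. A transaction
    is treated as valid if its commit marker is present; its data is
    whatever data record arrived for that transaction (None if missing).
    """
    data, committed = {}, []
    for kind, txn, payload in remote_writes:
        if kind == "data":
            data[txn] = payload
        elif kind == "commit":
            committed.append(txn)
    return {txn: data.get(txn) for txn in committed}


# Writes in the order the application issued them (assumed example data).
ordered = [
    ("data", 1, "balance=100"), ("commit", 1, None),
    ("data", 2, "balance=250"), ("commit", 2, None),
]

# An "optimizing" replicator sends the commit for txn 2 ahead of its data.
reordered = [ordered[0], ordered[1], ordered[3], ordered[2]]

# Disaster strikes after only three writes have reached the remote site.
in_order_result = recover(ordered[:3])
reordered_result = recover(reordered[:3])
```

With the original order preserved, the interrupted transaction is simply absent and recovery is clean. With the order changed, the remote site holds a commit marker for data that never arrived, so the journal vouches for a record that does not exist.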

Asynchronous Mirroring Emerges

A new category of disaster recovery solutions is emerging. Often referred to as virtually synchronous replication or asynchronous mirroring, the technology was previously unavailable due to the difficulty of maintaining the characteristics of a localized disk mirror over distance. Local mirroring maintains an exact local order of writes and associated data so that there is immediate recovery if the primary volume fails. As with synchronous replication, a local mirror cannot continue to the next write until the first write is committed. Asynchronous mirroring creates this recovery environment remotely by maintaining a synchronous state-consistent mirror locally and a synchronous state-consistent write order between the local and remote sites.

The technique is performed by presenting a device locally that emulates a local mirrored volume on either a server or SAN. This volume receives ordered writes locally and acts as a redundant mirrored volume within the local mirror set. As data is written to this device, each write is copied to a buffer and then transmitted asynchronously over a standard TCP/IP network to a similar device at the recovery site. The local device transmits the next set of writes only after receiving confirmation that the current set was received by and committed to the remote device, so that there is no loss of data integrity. In this way, the remote recovery device maintains state and write order consistency, enabling continuous recovery under journaled operating systems and applications. This allows the remote recovery device to provide full recovery in the event of a disaster, even if packets are in transit when the event occurs.
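The buffering and confirmation cycle described above can be sketched as a small state machine. This is only a model of the idea under stated assumptions: the `AsyncMirror` class, its batch size, and its method names are hypothetical, and a real appliance would handle retransmission, buffer limits, and the TCP/IP transport itself.

```python
from collections import deque


class AsyncMirror:
    """Toy model of an asynchronous mirror that preserves write order."""

    def __init__(self, batch_size=2):
        self.batch_size = batch_size
        self.buffer = deque()   # ordered writes waiting to be sent
        self.pending = []       # batch sent but not yet confirmed
        self.remote = []        # writes committed at the recovery site
        self.awaiting_ack = False

    def write(self, data):
        """Local mirror write: completes without waiting for the network."""
        self.buffer.append(data)

    def send_next_batch(self):
        """Send the next batch only if the previous one was confirmed."""
        if self.awaiting_ack or not self.buffer:
            return False
        count = min(self.batch_size, len(self.buffer))
        self.pending = [self.buffer.popleft() for _ in range(count)]
        self.awaiting_ack = True
        return True

    def remote_commit_and_ack(self):
        """Remote device commits the batch in order, then confirms it."""
        self.remote.extend(self.pending)
        self.pending = []
        self.awaiting_ack = False


writes = ["w1", "w2", "w3", "w4", "w5"]
mirror = AsyncMirror(batch_size=2)
for record in writes:
    mirror.write(record)

mirror.send_next_batch()
mirror.remote_commit_and_ack()           # w1, w2 are safe at the remote site
mirror.send_next_batch()                 # w3, w4 are in transit...
blocked = mirror.send_next_batch()       # ...so nothing else may be sent

# If disaster strikes now, the remote copy is an exact, in-order prefix
# of what was written locally, so journaled recovery can proceed.
remote_is_prefix = mirror.remote == writes[:len(mirror.remote)]
```

The in-transit batch is lost in this scenario, but nothing at the remote site is out of order, which is the property that distinguishes this approach from the reordering optimizations discussed earlier.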

Because the local device is a volume in a mirrored set, the solution has the added benefit of providing a second local copy of the data in case the primary volume fails, enhancing local fault tolerance. Additionally, because the local mirror is created using standard hardware or operating system mirroring technology, there is no need for additional software agents to be installed for use in writing the data or recovery. Installation of the devices can be performed by any IT personnel who can set up a hard drive, reducing the costs of consultants and employee training.

One potential drawback of asynchronous mirroring compared with block-level replication is that, since it does not optimize data prior to sending it, more data may be transferred over the network. However, this is balanced by the fact that it transfers only the writes that were performed on the local volume, which may not cover an entire block of data. Typically, for an asynchronous mirroring solution, network bandwidth must be at least equal to the data change rate, but the solution does not have the high-speed or proximity requirements of most other disaster recovery solutions. While a dedicated network is recommended, many businesses find that the asynchronous mirror can share an existing Internet connection without significantly impacting network performance (See Figure 3).
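The rule that bandwidth must at least match the data change rate lends itself to a quick estimate. The sketch below converts a daily volume of changed data into an average link speed; the function name and the 50 GB per day figure are assumptions for illustration, and the result is an average that ignores protocol overhead and bursts.

```python
def required_mbps(changed_gb_per_day, active_hours=24.0):
    """Average link speed (megabits/s) needed to keep up with data changes.

    changed_gb_per_day: decimal gigabytes written per day.
    active_hours: hours per day over which those changes occur.
    """
    megabits = changed_gb_per_day * 8 * 1000   # GB -> megabits
    return megabits / (active_hours * 3600)


# Assumed workload: 50 GB of changed data per day.
around_the_clock = required_mbps(50)                 # spread over 24 hours
business_hours = required_mbps(50, active_hours=8)   # packed into 8 hours
```

Under these assumptions, the same 50 GB needs roughly 4.6 Mbps when changes are spread evenly across the day, but roughly 13.9 Mbps if they all occur within an eight-hour window, so the shape of the workload matters as much as its total size.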

Many different IP-based disaster recovery solutions are on the market, and it is difficult to compare the attributes and risks of each. By understanding the target environment, the scope of the project and its key objectives, and the business and technical risks, companies are well positioned to evaluate possible solutions. With good planning, companies can find a solution that meets the organization's budgetary, technological, and recovery objectives. Remember: a lot of IP-based solutions will get your data offsite, but companies must be sure to choose one that will also get it back.

About the Author

Ron McCabe is founder, president, and CEO of MiraLink, and has more than ten years of experience in the rapidly emerging data storage and recovery industries. During his twenty-one year tenure Ron has introduced innovative hardware, software, and chip designs that have disrupted the market while generating strong sales and profits. He is world-renowned for developing technologies that control the speed and use of memory bandwidth. At MiraLink, Ron has been instrumental in leading the development of cost-effective business continuity and data recovery storage appliances. Ron can be reached at: rmccabe@miralink.com

Republished with the permission of MiraLink

