For the past few years, we have heard a lot about the benefits of Hadoop, the dominant big data technology. One use case that gets less attention, however, is using Hadoop as a backup target for databases.
Data Warehouse Backup to Hadoop
Data warehouses, especially large ones, are expensive, and backing them up by replicating to another data warehouse is out of the question for many enterprises. Consequently, the method of choice for backing up a data warehouse is quite often tape. Tape backup is neither cheap nor fast, and a restore from tape can cause significant business disruption depending on how long it takes. Yet there has been no other cost-effective solution until now.
By using commodity hardware and cheap, replicated disks, Hadoop has proven to be a safe and fast backup solution. It is easy to set up, and its low cost and short recovery time make it an ideal choice for this important function. One of our customers, a major bank, took this approach and saved a considerable amount of money while avoiding a large data warehouse upgrade.
The other big advantage is that the backup system is live and the data in it can be analyzed. Use Hive, Impala or Lingual and users will never know whether they are accessing the active data warehouse or a backup!
Traditional backups are always squirreled away into hiding, never to be seen by engineers or analysts. In contrast, the Hadoop solution is not just active, fast and cheap but can be used for analytics.
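As a rough sketch of how that access can work with Hive: an external table can be defined over the backup files, letting analysts query them with SQL while the files stay in place. The table name, columns, and HDFS path below are assumptions for illustration.

```shell
# Hypothetical: expose backup files (assumed CSV under /backups/salesdb/orders)
# to SQL users via an external Hive table. An EXTERNAL table points at the
# files in place; dropping the table later does not delete the backup data.
DDL="CREATE EXTERNAL TABLE IF NOT EXISTS orders_backup (
       order_id INT, customer_id INT, amount DOUBLE)
     ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
     LOCATION '/backups/salesdb/orders'"

# Run the DDL only if the Hive CLI is available on this machine.
command -v hive >/dev/null && hive -e "$DDL" || true
```

Once the table is defined, `SELECT` queries against `orders_backup` look exactly like queries against a table in the active warehouse.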
Online Database Backup to Hadoop
It is not just large data warehouses that can benefit from backing up to Hadoop; online, relational databases can as well.
As online databases grow, many DBAs prefer online backups with point-in-time recovery. This allows fast restores, a very important requirement for online databases. To ensure reliable, fast backups and restores, expensive SAN/NAS or specialized disk drives are used. Every backup makes a physical copy of the database, and depending on the frequency of backups and how many you retain, the storage costs quickly add up.
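As one concrete illustration (a sketch, assuming MySQL and the database name `salesdb`): an online, non-blocking dump can be taken while the binary log position is recorded, so that replaying binary logs on top of the dump gives point-in-time recovery.

```shell
# Hypothetical MySQL online backup supporting point-in-time recovery:
#   --single-transaction  dumps InnoDB tables without locking them,
#   --flush-logs          starts a fresh binary log at the backup point,
#   --master-data=2       records the binlog coordinates as a comment in the dump.
DUMP=/var/backups/salesdb-$(date +%Y-%m-%d).sql

# Run only if mysqldump is installed on this machine.
command -v mysqldump >/dev/null &&
  mysqldump --single-transaction --flush-logs --master-data=2 \
            salesdb > "$DUMP" || true
```

To restore to a point in time, load the dump and then replay the binary logs (for example with `mysqlbinlog`) up to the desired timestamp.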
In contrast, consider a Hadoop backup solution. Hadoop uses commodity servers with vanilla disks, achieving its scale and reliability because of its redundant, distributed architecture. You can even cobble together a Hadoop cluster using older equipment, perhaps beefing up the disks depending on the amount of storage required.
Back up your database as usual, then copy it over to the Hadoop cluster. Multiple backups can all be safely stored in the same cluster.
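A minimal sketch of the copy step, assuming the dump file and HDFS paths shown (both are illustrative):

```shell
# Copy a finished backup into HDFS. A date-stamped directory lets multiple
# backups coexist in the same cluster, and HDFS replicates each block
# (3x by default), so no further copies are needed for safety.
DUMP=/var/backups/salesdb.dump              # produced by your usual backup tool
DEST=/backups/salesdb/$(date +%Y-%m-%d)     # one directory per backup date

# Run only if the hdfs CLI is available on this machine.
command -v hdfs >/dev/null &&
  hdfs dfs -mkdir -p "$DEST" &&
  hdfs dfs -put "$DUMP" "$DEST/" || true
```

For very large dumps, `hadoop distcp` can parallelize the copy across the cluster instead of pushing everything through a single machine.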
Shanti Subramanyam is Founder & CEO at Orzota, Inc., a big data services company whose mission is to make Big Data easy to consume. Shanti is a Silicon Valley veteran and a performance and benchmarking expert. She is a technical leader with many years of experience in distributed systems and their performance, acquired at companies like Sun Microsystems and Yahoo! Contact Shanti via twitter @shantiS, on LinkedIn, or http://orzota.com/contact
The opinions expressed in this article are hers alone.