In my previous post I described the storage requirements for your enterprise content repository or Data Lake:
- Cost Efficient
- Multi-protocol access
In my experience, the data maintained in your Data Lake will be measured against six storage criteria.
These criteria will likely change based on the analysis being performed at any given time, not necessarily the age of the data, so your storage architecture will most likely consist of a federation of arrays, storage media, and access protocols.
Today I commonly see four storage components used to support most Data Lake solutions:
- Integrated HDFS with Hadoop distribution
- HDFS Storage Array Interface
- HDFS by Storage Virtualization Software
- HDFS Analytics Appliances
These components can be federated to leverage your existing storage investments and meet your data criteria.
INTEGRATED HDFS WITH HADOOP DISTRIBUTION
Many customers start by deploying the HDFS software that comes with their Hadoop distribution on direct-attached server storage or enterprise storage arrays. This approach has many advantages, including:
- Tightly coupled with Hadoop software, integrated software support
- Low cost
- Storage hardware choice
I have seen many customers hit a wall with this solution at about 30TB of data. Standard HDFS keeps three full copies of the data for redundancy, so 30TB of usable data translates to roughly 100TB of raw infrastructure capacity. The operational overhead of replacing failed hardware, running backups, and meeting the environmental requirements for this much infrastructure is often not sustainable. Most customers then begin to look at enterprise infrastructure solutions as the next logical step.
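The capacity math above can be sketched in a few lines. This is a back-of-the-envelope illustration only; the 10% overhead figure is my assumption for temporary, spill, and operating-system data, not a vendor specification.

```python
# Sketch: how HDFS 3x replication inflates raw capacity requirements.
# REPLICATION_FACTOR matches the HDFS default (dfs.replication = 3);
# OVERHEAD is an assumed 10% headroom for illustration.

REPLICATION_FACTOR = 3
OVERHEAD = 0.10

def raw_capacity_tb(usable_tb, replication=REPLICATION_FACTOR, overhead=OVERHEAD):
    """Raw disk (TB) needed to hold `usable_tb` of user data in HDFS."""
    return round(usable_tb * replication * (1 + overhead), 1)

print(raw_capacity_tb(30))  # 30TB of data -> 99.0TB raw, i.e. ~100TB
```

Even modest usable capacities triple before any operational headroom is counted, which is why the 30TB "wall" shows up as a roughly 100TB infrastructure footprint.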
HDFS STORAGE ARRAY INTERFACE
Many customers have large amounts of existing file data they want to analyze. If this data is stored on an enterprise storage array like EMC Isilon, the customer can enable HDFS access to it for analytics. More information on this capability can be found here.
This option has many advantages including:
- Fast time to analytics processing
- NameNode Fault Tolerance
- Eliminate 3x mirroring
- Simultaneous support for multiple Hadoop distributions
In addition, enterprise storage arrays bring to Big Data the same data efficiency, data security, and operational benefits they have been providing to other applications for years, including:
- Smart-Dedupe for Hadoop
- SEC 17a-4 Compliance
- Kerberos Authentication
- Application Multi-tenancy
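As a sketch of what this looks like from the Hadoop side, the client can simply be pointed at the HDFS endpoint served by the array rather than at a conventional NameNode. The hostname below is a hypothetical placeholder, and the fragment is illustrative, not a vendor-documented configuration:

```xml
<!-- Hypothetical client-side core-site.xml fragment: aim the Hadoop
     client at an HDFS service hosted by the storage array instead of
     a dedicated NameNode. "isilon.example.com" is a placeholder. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://isilon.example.com:8020</value>
  </property>
</configuration>
```

Because the array serves the HDFS protocol itself, the existing file data becomes analyzable in place, with no 3x-replicated copy into a separate Hadoop cluster.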
As I discussed previously, we have built a deployment guide to enable customers to deploy this solution reliably and rapidly. The guides are available here.
HDFS BY STORAGE VIRTUALIZATION SOFTWARE
While utilizing HDFS natively on a storage array has many benefits, many customers want the flexibility to use any enterprise storage array. EMC ViPR is one example of a new storage virtualization technology that can provide HDFS data services on top of almost any major enterprise storage array or direct-attached server storage. This has many advantages for Big Data analytics users:
- Multi-protocol access - Object, HDFS, Block (iSCSI), more coming
- Write file, read object & vice versa
- NameNode Fault Tolerance
- Heterogeneous storage
In addition, a storage virtualization software layer still allows the enterprise storage array to provide all its traditional benefits, including:
- Data efficiency - Eliminate 3x mirroring, de-dup, and compression
- Security - SEC 17a-4 Compliance, Kerberos Authentication
I believe storage virtualization is the most interesting Data Lake storage opportunity: it combines the benefits of enterprise storage with the flexibility of hardware choice. As I discussed previously, we have built a deployment guide to enable customers to deploy this solution reliably and rapidly. The guides are available here.
HDFS ANALYTICS APPLIANCES
The fourth option is a purpose-built HDFS analytics appliance. These solutions offer:
- Rapid deployment
- Predictable performance & scale
- Optimum resource utilization
- Integrated, simplified management
- Simplified support & maintenance
- Highest Reliability, Availability, and Stability
Appliance solutions can be limiting for some customers. By definition they are highly predictable, fully integrated stacks, which tend to lag in the introduction of new features and functionality. The Big Data analytics space is evolving quickly, with new HDFS features and functions introduced every few months. Appliance upgrades can delay access to these new functions, so you will want to consider how quickly you need access to the latest and greatest HDFS features.
In conclusion, each of these storage options has strengths and weaknesses. Most customers use a combination of these components, in addition to an in-memory data grid service, to build a fully functional Data Lake solution. These storage options can be federated together from the beginning, or as needed, to meet your requirements.