When a new technology is introduced to the IT community, we often approach its use like a small child with a new toy. Think about the first time you used an Apple iPad. You probably looked at it from a few different angles, tried to integrate it with technology you already knew, like VDI, and tried a lot of different use cases to see where it worked well and where it did not. Once it passed the exploration tests, the excitement phase took over and you started using it in the best ways you had discovered. Ultimately it became part of your technology tool belt.
As part of my role at EMC, we started to see a lot of our customers experimenting with Hadoop. Many of them were having trouble getting past the exploration phase. Some of the common challenges they identified were:
- IT took too long to set up the infrastructure, so shadow IT or the public cloud became the preferred deployment path
- There are several Hadoop distributions, and customers needed the ability to use more than one
- Hadoop processing was slowed by data gravity: large amounts of data had to be copied to the Hadoop cluster before processing could start
- Hadoop needed a flexible infrastructure that could expand compute capacity and data capacity/performance independently
- Hadoop was inefficient with storage capacity: it creates three full copies of each data set for redundancy, so a 10 TB data set consumes roughly 30 TB of raw disk
EMC recognized a need to give customers an easy, consistent way to deploy their favorite Hadoop distribution on an efficient, reliable, and flexible IT infrastructure that could easily scale. Working with James Ruddy, we started to look at the common Hadoop architectures. We found that many customers deployed Hadoop using:
- bare metal servers using Hadoop’s built-in clustering functionality
- direct-attached storage with commodity hard drive technology
- little automation, with what did exist varying across distributions
When we mapped these findings to the challenges customers were experiencing with Hadoop deployments, we saw an opportunity for EMC to help. We knew bare metal server architectures were inefficient, more time consuming to deploy, and could not be dynamically reconfigured on demand. We thought server virtualization would help minimize these challenges, and VMware Big Data Extensions (BDE) was about to be released, with its functionality being incorporated upstream into Hadoop to make it virtualization compatible.
Based on our findings, we knew many customers wanted to analyze data already stored on their EMC Isilon storage cluster. Isilon had recently announced native support for the HDFS protocol at no additional charge for existing or new customers. We decided to leverage the EMC Isilon storage platform because Isilon storage is highly reliable, scalable, and cost effective, and its built-in resiliency features let us eliminate the need to create three copies of the large data sets for redundancy.
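To make that concrete, here is a minimal sketch of how a Hadoop client can be pointed at an Isilon cluster over the HDFS protocol using the standard Hadoop FileSystem API in Java. The SmartConnect hostname, port, and path are hypothetical placeholders, and the property name assumes a Hadoop 2.x style configuration (on Hadoop 1.x the equivalent property is fs.default.name).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IsilonHdfsClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Point the default file system at the Isilon SmartConnect zone name.
        // Hostname and port are placeholders; substitute your own zone name.
        conf.set("fs.defaultFS", "hdfs://isilon-smartconnect.example.com:8020");

        // Data protection is handled by OneFS on the Isilon side, so the
        // cluster no longer depends on HDFS's default three-way replication.
        FileSystem fs = FileSystem.get(conf);
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}
```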
The automation challenge was more difficult. We looked at several options and decided to use the pre-GA VMware Project Serengeti solution. Ultimately we were impressed with its functionality and tight integration with VMware vSphere. We met with the Project Serengeti team and agreed to collaborate on a joint solution.
We reviewed our proposed solution with our EMC Office of the CTO colleague and Hadoop expert, Dan Baskette. If you are serious about Hadoop, make sure to talk with Dan. After incorporating Dan’s feedback, we worked with our EMC Isilon team, led by Ryan Peterson, and the Project Serengeti team to build our environment and document the cookbook that automates the deployment of an Apache Hadoop environment.
Our solution allows the Hadoop compute and data services to be scaled easily and independently. This is an important feature, since it lets IT users adjust processing, data capacity, and performance based on user needs, which is exactly the flexibility a good Hadoop infrastructure should provide. The initial solution, EMC’s Hadoop Starter Kit, was released at VMworld 2013 on 8/25 here.
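As a rough illustration of that separation (not the starter kit’s actual configuration), the sketch below shows a client setup in which the data endpoint and the compute endpoint are addressed independently, so either tier can be grown without touching the other. The hostnames are hypothetical, and the compute-side property assumes a YARN (Hadoop 2.x) cluster.

```java
import org.apache.hadoop.conf.Configuration;

public class IndependentScaling {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Data tier: HDFS served by the Isilon cluster (hypothetical hostname).
        // Grow capacity and throughput by adding Isilon nodes.
        conf.set("fs.defaultFS", "hdfs://isilon-smartconnect.example.com:8020");

        // Compute tier: YARN ResourceManager in a BDE-provisioned virtual cluster
        // (hypothetical hostname). Grow processing power by adding worker VMs.
        conf.set("yarn.resourcemanager.address", "hadoop-master.example.com:8032");

        // Because the endpoints are independent, scaling one tier never forces
        // a re-copy or re-balance of data on the other.
        System.out.println("Data endpoint:    " + conf.get("fs.defaultFS"));
        System.out.println("Compute endpoint: " + conf.get("yarn.resourcemanager.address"));
    }
}
```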
This project is a great example of the value of EMC’s Open Innovations Lab. We took a problem that many of our customers, and EMC itself, were struggling to address, and by combining our engineering expertise, the EMC product portfolio, and emerging products in a new way, we created an innovative solution. This is allowing EMC to engage with our customers and partners to quickly prototype solutions for emerging applications like Hadoop. We are working on both enhancing this solution (a VMworld Barcelona announcement is coming) and creating a converged infrastructure solution with our product teams. I would love to hear your feedback on the Everything Big Data at EMC site on the EMC Community Network.