This past week I attended CiscoLive! 2015 as EMC’s Big Data “expert”. The conference confirmed that businesses in every industry are collecting data from new sources and leveraging next-generation analytics to improve their customers’ experience, deliver new products and services, and deliver them much more efficiently. Improving technology is accelerating the use of Big Data solutions. We are able to deploy and embed more cost-effective data collection sensors in even the most traditional commodity devices, such as light bulbs. Cisco demonstrated impressive higher-capacity, more reliable wireless networks this week. In an earlier post I discussed the new EMC storage technology that enables much faster and more cost-effective ingestion and analysis of data.
Although the technology needed for Big Data is getting better, it needs to get significantly easier to use and deploy to keep pace with business demands. Two main industry challenges were very evident this week:
- Analytics tools are difficult to use
- Data fabrics are hard to implement
A lack of people who can build the data models and algorithms needed to make the available data actionable came up in many of the conversations I had with both product teams and users. Many are focused on expanding the training and education capacity needed to develop analysts who can use the analytics tools available today. I think we also need to make the tools much easier to use. Compare the complexity of the tools analysts are using for Big Data projects to traditional query languages like SQL, and analytics tools such as:
Big Data analysts need to be experienced with programming languages such as Java, Python, and R. Data access is a polyglot combination of often disparate low-level APIs and formats, and the data frequently has to be transformed before it can be processed. Some of the complexity of the new tools comes from new capabilities that will mature and simplify over time. Next-generation analytics tools such as Splunk and Tableau are promising, but I think we need to be less accepting of the poor usability of many Big Data analytics tools. Analysts and Data Scientists need to be able to focus more on designing data models and algorithms, and less on building unique solutions that require a lot of application programming.
The second challenge for the industry is to greatly simplify Big Data infrastructure deployment. Today it can take several months for an organization to install and configure the IT infrastructure, data fabric, and analytics tools. Look above at all the Big Data tools and products that have to be deployed to create a Big Data Lake. There is still no consensus IT infrastructure model. For example, some fundamental attributes are still being debated, including:
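To make the usability gap concrete, here is a minimal sketch in Python that computes the same per-device average two ways: once with a single SQL statement, and once with hand-written map, shuffle, and reduce steps in the style that many Big Data frameworks expect. The sensor readings, device names, and function names are all hypothetical and exist only for illustration; this is not taken from any particular product.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sensor readings: (device_id, temperature). Illustrative data only.
READINGS = [
    ("bulb-1", 20.0),
    ("bulb-2", 30.0),
    ("bulb-1", 22.0),
    ("bulb-2", 34.0),
]


def average_with_sql(readings):
    """Declarative style: one SQL statement describes the result we want."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (device_id TEXT, temperature REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", readings)
    rows = conn.execute(
        "SELECT device_id, AVG(temperature) FROM readings GROUP BY device_id"
    ).fetchall()
    conn.close()
    return dict(rows)


def average_with_map_reduce(readings):
    """Procedural style: the analyst writes each processing step by hand."""
    # Map: emit a (key, value) pair for every record.
    mapped = ((device_id, temperature) for device_id, temperature in readings)

    # Shuffle: group all values that share a key.
    grouped = defaultdict(list)
    for key, value in mapped:
        grouped[key].append(value)

    # Reduce: collapse each group of values into a single average.
    return {key: sum(values) / len(values) for key, values in grouped.items()}


if __name__ == "__main__":
    print(average_with_sql(READINGS))
    print(average_with_map_reduce(READINGS))
```

Both functions return the same averages, but the second one makes the analyst responsible for the mechanics of grouping and aggregating, which is the kind of application programming I would like the tools to take off their plate.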
- Storage - external array vs. commodity direct-attached server storage
- Compute - bare metal vs. hypervisor
I think a standard architecture will emerge soon. It will follow the same path as other new technology paradigms of the past, such as server virtualization. In the mid-2000s it was challenging and time consuming to deploy VMware virtualization because of some of the same issues. Eventually the industry settled on a common architecture that the storage, server, and network vendors all optimized around. As that happened, adoption accelerated rapidly.
At the data fabric layer, Hadoop has become widely accepted. Although it is based on an Apache open source standard, there are multiple commercial distributions (Hortonworks, Cloudera, MapR, PivotalHD), and it is not easy to deploy the same workload across all of them. Foundations such as the Open Data Platform (ODP) have been formed by the industry to facilitate collaboration on this issue. The growth in the number of participants in the ODP initiative demonstrates that the industry recognizes the problem.
The early results from using Big Data to improve customer experience, create new products and services, and deliver them more efficiently have been promising. Our ability to address the complexity of the tools and infrastructure will determine how fast the benefits of Big Data solutions are realized.