This year, Apache Hadoop celebrates its 10-year anniversary. When it first launched in 2006, Apache Hadoop was a small project deployed on 20 machines at Yahoo.
By 2010, it was running on 45,000 machines and had become the backbone of Yahoo’s data infrastructure. Today, the Apache Hadoop market is forecast to surpass $16 billion by 2020.
While the open source platform is proven and popular among seasoned developers who need a technology that can power large, complex applications, the fragmented nature of its ecosystem is a leading reason businesses struggle to extract value from their Hadoop investments. Even Apache Hadoop and Big Data proponents acknowledge that the technology has not yet achieved its game-changing business potential.
One glaring barrier to adoption is the high-speed innovation happening across Apache Hadoop components and distributions. This rapid, fragmented growth can slow big data ecosystem development and stunt adoption.
Currently, Apache Hadoop lacks consistent distributions that would allow application developers and enterprises to build data-driven applications more easily. Compatibility across distributions and across management and integration offerings would create widespread industry interoperability, providing an official baseline of technological expectations for anyone encountering an Apache Hadoop distribution.
The container shipping industry (and I don’t mean Docker) grew significantly once universal guidelines were implemented to ensure the safe and efficient transport of containers. After the International Organization for Standardization (ISO) adopted a formal shipping container standard, trade increased more than 790 percent over 20 years, an incredible case for unifying and optimizing an entire ecosystem to ensure its longevity.
To take full advantage of today’s growing enterprise buyer opportunity, with enterprises eager to harness the estimated 4ZB of data the world generates, the ecosystem will need to support greater stability across Apache Hadoop and other Apache project technologies, ensuring adopters’ confidence in their investment regardless of the industry they serve.
For distributions, ISVs, and SIs alike, known standards to operate within will not only help sustain this piece of the Big Data ecosystem but also define how these pieces interoperate and integrate more simply, for the benefit of the ever-important end users.
I leave you with a quote from Todd Moore, vice president of open technology at IBM:
Within the IBM ecosystem, cloud plays a tremendous role as a connecting point and springboard for companies to begin taking advantage of Big Data. As we are beginning to dive deeper into the cognitive world, there has been an explosion of structured and unstructured data feeding into it, and the cloud gives developers the tools they need to tap into this goldmine. Hadoop, Big Data, machine learning–they all play into how this cognitive world will evolve. So, having a cloud platform to count on to access these services–starting with a small, specific, and consistent packaging model that lives within the ecosystem–is priceless. As people take advantage of this and deploy into their own cloud infrastructure, they will be able to test once and run everywhere. In this way, standardization is an enabler and is a huge deal as we move into the cognitive era.
John Mertic is Director of Program Management for ODPi and the Open Mainframe Project at The Linux Foundation. Mertic comes from a PHP and open source background, having been a developer, evangelist, and partnership leader at Bitnami and SugarCRM, a board member at OW2, president of OpenSocial, and a frequent conference speaker around the world.