Reasons people are migrating to Spark. Image: Databricks
Hadoop and MapReduce, the parallel programming paradigm and API originally behind Hadoop, used to be synonymous. Nowadays when we talk about Hadoop, we mostly talk about an ecosystem of tools built around the common file system layer of HDFS, and programmed via Spark.
Spark is the new Hadoop. One of the defining trends of this time, confirmed by both practitioners in the field and surveys, is the en masse move to Spark for Hadoop users. Spark is itself an ecosystem of sorts, offering options for SQL-based access to data, streaming, and machine learning.
People are migrating to Spark for a number of reasons, including an easier programming paradigm. Easier than MapReduce does not necessarily mean easy, though, and there are a number of gotchas when programming and deploying Spark applications.
So why are people migrating to Spark? The top reason seems to be performance: 91 percent of 1615 people from over 900 organizations participating in the Databricks Apache Spark Survey 2016 cited this as their reason for using Spark. But there's more. Advanced analytics and ease of programming are almost equally important, cited by 82 percent and 76 percent of respondents.
All industry sources we have spoken to over the last few months point in the same direction: programming against Spark's API is easier than using MapReduce, so MapReduce is seen as a legacy API at this point. Vendors will continue to offer support for it as long as there are clients using it, but practically all new development is Spark-based.
Not everyone using Spark has the same responsibilities or skills. Image: Databricks
As Ash Munshi, Pepperdata CEO, puts it: "Spark offers a unified framework and SQL access, which means you can do advanced analytics, and that's where the big bucks are. Plus it's easier to program: gives you a nice abstraction layer, so you don't need to worry about all the details you have to manage when working with MapReduce. Programming at a higher level means it's easier for people to understand the down and dirty details and to deploy their apps."
Great. What's the problem then? Munshi points out that the flip side of Spark abstraction, especially when running in Hadoop's YARN environment which does not make it too easy to extract metadata, is that a lot of the execution details are hidden. This means it's hard to pinpoint which lines of code cause something to happen in this complex distributed system, and it's also hard to tune performance.
Having a complex distributed system in which programs are run also means you have to be aware of not just your own application's execution and performance, but also of the broader execution environment. Pepperdata calls this the cluster weather problem: the need to know the context in which an application is running. A common issue in cluster deployment, for example, is inconsistency in run times because of transient workloads.
Pepperdata is not the only one that has taken note. A few months back Alpine Data also pinpointed the same issue, albeit with a slightly different framing. Alpine Data pointed to the fact that Spark is extremely sensitive to how jobs are configured and resourced, requiring data scientists to have a deep understanding of both Spark and the configuration and utilization of the Hadoop cluster being used.
Incorrectly resourced Spark jobs will frequently fail with out-of-memory errors, leading to inefficient, time-consuming, trial-and-error resourcing experiments. This requirement significantly limits the utility of Spark and hampers its adoption beyond deeply skilled data scientists, according to Alpine Data.
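To make the resourcing problem concrete, here is a minimal sketch of one common cause of such failures on YARN: an executor whose heap plus off-heap overhead does not fit in the largest container the cluster will grant. The 384 MB floor and 10 percent factor are Spark's documented defaults for executor memory overhead; the function names are hypothetical, and the sketch ignores YARN's rounding of requests to its minimum allocation increment.

```python
# Illustrative pre-flight check, not part of Spark or any vendor tool.
# Assumption: default overhead is max(384 MB, 10% of executor memory),
# as documented for spark.executor.memoryOverhead.
MIN_OVERHEAD_MB = 384
OVERHEAD_FRACTION = 0.10


def container_request_mb(executor_memory_mb: int) -> int:
    """Memory Spark asks YARN for per executor: heap plus overhead."""
    overhead_mb = max(MIN_OVERHEAD_MB, int(executor_memory_mb * OVERHEAD_FRACTION))
    return executor_memory_mb + overhead_mb


def fits_in_yarn(executor_memory_mb: int, yarn_max_allocation_mb: int) -> bool:
    """True if the request fits under yarn.scheduler.maximum-allocation-mb."""
    return container_request_mb(executor_memory_mb) <= yarn_max_allocation_mb


if __name__ == "__main__":
    # An 8 GB heap does not fit in an 8 GB container once overhead is added.
    for heap_mb in (2048, 7168, 8192):
        print(heap_mb, container_request_mb(heap_mb), fits_in_yarn(heap_mb, 8192))
```

A check like this only catches one class of misconfiguration; it says nothing about whether the heap is large enough for the job itself, which is the harder part of the tuning problem.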
This is based on hard-earned experience, as Alpine Data co-founder & CPO Steven Hillion explained. At some point one of Alpine Data's clients was using the Alpine Data Science platform (ADSP) to do some very large scale processing on consumer data: billions of rows and thousands of variables. ADSP uses Spark under the hood for data crunching jobs, but the problem was that these jobs would either take forever or break.
The reason was that the tuning of Spark parameters in the cluster was not right. People using ADSP in that case were data scientists, not data engineers. They were proficient in finding the right models to process data and extracting insights out of them, but not necessarily in deploying them at scale.
The result was that data scientists would get on the phone with ADSP engineers to help them diagnose the issues and propose configurations. As this would obviously not scale, Alpine Data came up with the idea of building the logic their engineers applied in this process into ADSP. Alpine Data says it worked, enabling clients to build workflows within days and deploy them within hours without any manual intervention.
So the next step was to bundle this as part of ADSP and start shipping it, which Alpine Data did in Fall 2016. This was presented at Spark Summit East 2017, and Hillion says the response has been "almost overwhelming. In Boston we had a long line of people coming to ask about this."
Hillion emphasized that their approach is procedural, not based on ML. This may sound strange, considering their ML expertise. Alpine Data, however, says this is not a static configuration: it works by determining the correct resourcing and configuration for the Spark job at run-time, based on the size and dimensionality of the input data, the complexity of the Spark job, and the availability of resources on the Hadoop cluster.
"You can think of it as a sort of equation if you will, in a simplistic way, one that expresses how we tune parameters" says Hillion. "Tuning these parameters comes through experience, so in a way we are training the model using our own data. I would not call it machine learning, but then again we are learning something from machines."
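As a rough illustration of what such an "equation" could look like, the sketch below derives executor settings from the four inputs the article names: input size, dimensionality, job complexity, and available cluster resources. This is not Alpine Data's method, which is proprietary; the formula, the constants, and every name here are assumptions made up for the example. Only the three Spark property names in `as_spark_conf` are real.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusterState:
    free_cores: int
    free_memory_mb: int


@dataclass
class SparkResources:
    num_executors: int
    executor_cores: int
    executor_memory_mb: int


def suggest_resources(input_size_mb, num_columns, complexity, cluster,
                      cores_per_executor=4, target_partition_mb=128):
    """Hypothetical procedural rule; every constant here is an assumption."""
    # Working set grows with input size, job complexity, and (slowly) width.
    width_factor = 1 + math.log10(max(num_columns, 1))
    working_set_mb = input_size_mb * complexity * width_factor

    # Aim for one task slot per partition, capped by the cores that are free.
    partitions = max(1, math.ceil(input_size_mb / target_partition_mb))
    wanted = max(1, math.ceil(partitions / cores_per_executor))
    allowed = max(1, cluster.free_cores // cores_per_executor)
    num_executors = min(wanted, allowed)

    # Split the working set across executors, capped by the memory that is free.
    per_executor_mb = math.ceil(working_set_mb / num_executors)
    cap_mb = cluster.free_memory_mb // num_executors
    return SparkResources(num_executors, cores_per_executor,
                          min(per_executor_mb, cap_mb))


def as_spark_conf(res):
    """Render the suggestion as standard Spark configuration properties."""
    return {
        "spark.executor.instances": str(res.num_executors),
        "spark.executor.cores": str(res.executor_cores),
        "spark.executor.memory": f"{res.executor_memory_mb}m",
    }
```

The point of the sketch is the shape of the approach: a deterministic function evaluated at run-time against the current cluster state, rather than a fixed configuration or a trained model.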
Pepperdata now also offers a solution for Spark automation with last week's release of Pepperdata Code Analyzer for Apache Spark (PCAAS), but it addresses a different audience with a different strategy. Data scientists make up 23 percent of all Spark users, but data engineers and architects combined make up 63 percent. This is the audience Pepperdata aims at with PCAAS.
Architects are the people who design (big data) systems, and data engineers are the ones who work with data scientists to take their analyses to production. Munshi says PCAAS aims to give them the ability to take running Spark applications, analyze them to see what is going on and then tie that back to specific lines of code.
The thinking there is that by being able to understand more about CPU utilization, garbage collection or I/O related to their applications, engineers and architects should be able to optimize applications. PCAAS boasts the ability to do part of the debugging, by isolating suspicious blocks of code and prompting engineers to look into them.
PCAAS aims to help decipher cluster weather as well, making it possible to understand whether run time inconsistencies should be attributed to a specific application or to the workload at the time of execution. Munshi also points out the fact that YARN heavily uses static scheduling, while using more dynamic approaches could result in better hardware utilization.
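One simple way to think about the cluster weather question is to check how strongly an application's run times track the load on the cluster when it ran. The sketch below does this with a plain Pearson correlation. It is not how PCAAS works, which the article does not describe; the function names, the 0.7 threshold, and the sample figures are all hypothetical.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def likely_cause(run_times_s, cluster_load, threshold=0.7):
    """Crude attribution: blame the cluster if run time rises with its load."""
    r = pearson(run_times_s, cluster_load)
    return "cluster weather" if r >= threshold else "application"


if __name__ == "__main__":
    # Made-up samples: run time in seconds and cluster utilization (0 to 1).
    print(likely_cause([100, 150, 200, 250], [0.2, 0.4, 0.6, 0.8]))
    print(likely_cause([100, 250, 100, 250], [0.5, 0.5, 0.5, 0.5]))
```

Correlation alone cannot prove causation, so a result like this is a prompt for where to look first rather than a diagnosis.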
Better hardware utilization is clearly a top concern in terms of ROI, but in order to understand how this relates to PCAAS and why Pepperdata claims to be able to overcome YARN's limitations we need to see where PCAAS sits in Pepperdata's product suite. PCAAS is Pepperdata's latest addition to a line of products including the Application Profiler, the Cluster Analyzer, the Capacity Optimizer, and the Policy Enforcer.
The first two are about collecting telemetry data, while the latter two are about intervening in real-time, says Munshi. Pepperdata's overarching ambition is to bridge the gap between Dev and Ops, and Munshi believes that PCAAS is a step in that direction: a tool Ops can give to Devs to self-diagnose issues, resulting in better interaction and more rapid iteration cycles.
Interestingly, Hillion also agrees that there is a clear division between proprietary algorithms for tuning ML jobs and the information that a Spark cluster can provide to inform these algorithms. There are differences as well as similarities between the Alpine Data and Pepperdata offerings, though.
To begin with, neither offering is stand-alone. Spark auto-tuning is part of ADSP, while PCAAS relies on telemetry data provided by other Pepperdata solutions. So if you are only interested in automating parts of your Spark cluster tuning or application profiling, tough luck.
In our discussion with Hillion, we pointed out that not everyone interested in Spark auto-tuning will necessarily want to subscribe to ADSP in its entirety, so perhaps making this capability available as a stand-alone product would make sense. Hillion hinted that the part of their solution that is about getting Spark cluster metadata from YARN may be open sourced, while the auto-tuning capabilities may be sold separately at some point.
Alpine Data is worried about giving away too much of its IP, but this concern may be holding it back from commercial success. When facing a similar situation, not every organization reacts in the same way. Case in point: Metamarkets built Druid and then open sourced it. Why? "We built it because we needed it, and we open sourced it because if we had not, something else would have replaced it."
The AI lock-in loop: great investment begets greater results begetting greater investment. Image: Azeem Azhar / Schibsted
In all fairness though, for Metamarkets Druid is just infrastructure, not core business, while for Alpine Data ADSP is its bread and butter. As for Pepperdata, they are toying with the idea of giving free access to PCAAS for non-production clusters to get a foothold in organizations. The reasoning is tried and true: get engineers to know and love a tool, and the tool will eventually spread and find its way into IT budgets.
Either way, if you are among those who would benefit from having such automation capabilities for your Spark deployment, for the time being you don't have much of a choice. You will have to either pay a premium and commit to a platform, or wait until such capabilities eventually trickle down.
The bigger picture however is clear: automation is finding an increasingly central role in big data. Big data platforms can be the substrate on which automation applications are developed, but it can also work the other way round: automation can help alleviate big data pain points.
Remember the AI lock-in loop? First mover advantage may prove significant here, as sitting on top of millions of telemetry data points can do wonders for your product. This is exactly the position Pepperdata is in, and it intends to leverage it to apply Deep Learning to add predictive maintenance capabilities as well as monetize it in other ways.
Whether Pepperdata manages to execute on that strategy and how others will respond is another issue, but at this point it looks like a strategy that stands a better chance of addressing the need for big data automation services.
Read more:
Spark gets automation: Analyzing code and tuning clusters in production - ZDNet