How To Fix Spark Performance Issues Without Thinking Too Hard

At Pepperdata we have analyzed many thousands of Spark jobs across many different clusters: on-prem and cloud production clusters running Spark across a wide variety of industries, applications, and workload types.

In this presentation, Alex and I are going to cover a very brief intro to Spark, then discuss some of the common issues we have seen, the symptoms of those issues, and how you can address and overcome them without having to think too hard.

Spark is an analytics engine that lets you run jobs in Python and many other languages incredibly fast. Spark can integrate with machine learning and other types of libraries, and it can run in many different types of environments. I was actually on a presentation where Spark was referred to as the Swiss Army knife of analytics platforms, based on all of these capabilities. But that said, Spark definitely has some challenges.
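To ground that a bit, here is a minimal PySpark sketch of what a simple Spark job can look like; the file name and column are made up for illustration, not taken from the talk:

from pyspark.sql import SparkSession

# Start a Spark session (application name is arbitrary)
spark = SparkSession.builder.appName("quick-demo").getOrCreate()

# Load a CSV and run a simple aggregation
df = spark.read.csv("events.csv", header=True, inferSchema=True)
df.groupBy("user_id").count().show()

spark.stop()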

So, one of the challenges we have seen from looking across our customers' production clusters is that Spark jobs tend to fail more than other jobs; we've seen a failure rate of over four percent. And of course, because of Spark's capabilities and power, all the reasons I mentioned in the previous slide, Spark in particular tends to be a very popular option for developers.

We actually looked at 25 random clusters on our platform and found 24 of them running Spark. So, that just says to me that if you can address problems with Spark, you can probably address a lot of the problems on your clusters in general.

So, because Spark jobs tend to be so common, we find that where people are seeing waste in their cloud and on-prem environments is typically concentrated around Spark. And this is a chart showing a grouping of 42 different clusters that we looked at.

In almost all of these, there's very little waste from the majority of the jobs, which is great. In fact, for 95% of the jobs, you can see that there's not very much waste. But the flip side is that five percent of the jobs are responsible for almost all the waste we see on the cluster.

So, if your cluster is typical, you might be able to eliminate a lot of your wasted resources just by focusing on those jobs, the five percent that are the big wasters. Looking a little more closely at waste, the other thing we found across those same 42 clusters is that within a typical week, the median maximum memory utilization is around 42%.
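To put that 42% figure in concrete terms, here is a hypothetical illustration with made-up numbers: if an executor is provisioned with 16 GB of memory but its peak usage is around 42% of that, roughly 6.7 GB, then about 9.3 GB per executor sits reserved but unused; across 100 executors, that is on the order of 930 GB of memory paid for and never touched.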

But over-provisioning in the cloud definitely has a different cost profile than doing the same in an on-prem environment. Over-provisioning in the cloud can lead to really large monetary waste, in addition to wasted computing resources. So, this is definitely a delicate challenge and a delicate balance, and something to keep in mind when provisioning memory for Spark applications.

So, we're gonna go through those three sections that she mentioned:
1. Memory-related issues: overallocation and out-of-memory errors
2. Data skew
3. Configuration issues

Let's start with the memory-related issues. From an overallocation perspective, there are two areas in Spark where this can happen.

The first one's gonna be in the executors, which are the parts of your Spark application that are actually processing the work. And it's fairly common, especially if you've ever seen the other side happen, the under-allocation where you've run into out-of-memory issues or garbage collection issues, to just say, “Okay, I'm just gonna double or triple or choose an arbitrarily large amount of memory so that I don't have that problem again.”
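To make that concrete, here is a hedged sketch of where executor memory is typically requested in PySpark, assuming a YARN-style setup; the values are placeholders for illustration, not recommendations, and the idea is to size them against the job's observed peak usage rather than doubling blindly:

from pyspark.sql import SparkSession

# Placeholder values for illustration only; size these against observed
# peak memory rather than picking an arbitrarily large number.
spark = (
    SparkSession.builder
    .appName("executor-memory-example")
    .config("spark.executor.memory", "4g")             # JVM heap per executor
    .config("spark.executor.memoryOverhead", "512m")   # off-heap / container overhead per executor
    .config("spark.executor.instances", "10")          # number of executors requested
    .getOrCreate()
)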

The second area is the driver, and depending on where it runs, you may not have the same visibility into it. If you're running Spark in client mode, the driver is going to be outside of your cluster, and at that point you may not even know that you're using this extra memory, especially if you're on a shared gateway node into the cluster. If the driver is on the cluster, you might have some visibility into it, but you still need to recognize that you're simply asking for too much. And depending on how much extra you're asking for, that can also lead to slower performance, because now your job has to manage a larger heap.
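As a rough illustration of where the driver-side knob lives (placeholder value, and an assumption about a typical setup rather than a prescription):

from pyspark.sql import SparkSession

# Placeholder value for illustration only. In client mode the driver JVM
# starts before this code runs, so spark.driver.memory is normally set at
# submit time (for example with --driver-memory on spark-submit) rather
# than here; in cluster mode the driver runs on the cluster, where its
# memory use is at least visible to cluster-level monitoring.
spark = (
    SparkSession.builder
    .appName("driver-memory-example")
    .config("spark.driver.memory", "2g")  # driver heap request
    .getOrCreate()
)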

Check out our blog: https://www.pepperdata.com/blog/

/////////////////////////////////////////////////////////////////////////////////////////
Connect with us:
Visit Pepperdata Website: https://www.pepperdata.com/
Follow Pepperdata on LinkedIn: https://www.linkedin.com/company/pepperdata
Follow Pepperdata on Twitter: https://twitter.com/pepperdata
Like Pepperdata on Facebook: https://www.facebook.com/pepperdata/