Spark Performance Management Optimization Best Practices | Pepperdata

Learn from Spark veteran Alex Pierce how to manage the challenges of maintaining the performance and usability of your Spark jobs.

Learn why Enterprise clients use Pepperdata products and Services:

#sparkperformance #bigdatamanagement #pepperdata

Apache Spark provides more sophisticated ways for enterprises to leverage big data than Hadoop MapReduce. However, the amount of data being analyzed and processed through the framework continues to grow, pushing the boundaries of the engine.

This webinar draws on experiences across dozens of production deployments and explores the best practices for managing Apache Spark performance. Learn how to avoid common mistakes and improve the usability, supportability, and performance of Spark.

Topics include:

– Serialization
– Partition sizes
– Executor resource sizing
– DAG management

More on the episode:
Just as a quick overview, Spark is a sophisticated execution engine for processing data in your big data environments. And we're definitely seeing a very strong uptick in Spark utilization, even over just the last six months, and especially over the last year.

So, we would like to make sure that everybody is using Spark safely and in a performant manner. Just as a heads up, we're going to talk a little bit about standard Spark challenges, walk through some of the best practices for Spark performance management, and cover a little bit about what we feel Spark success looks like.

And, of course, at the end, I'll quickly recap those points and answer any questions that you guys have submitted. So, your Spark performance challenges. Some of these you might be familiar with, some maybe not.

I'm sure anybody who's worked with Spark is familiar with the memory and garbage collection issues: trying to determine how much memory to assign to your executors so you're not clogging up queues in your working environment, while at the same time not spending 10 to 15 percent of your execution time on garbage collection.
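As an illustration of the executor-sizing knobs being discussed, here is a sketch of the relevant Spark configuration keys. The specific values are assumptions for a hypothetical cluster, not recommendations; the GC-logging flags shown are the Java 8 forms (Java 11+ uses `-Xlog:gc` instead).

```properties
# Illustrative starting point only; right-size against your own cluster and queues.
spark.executor.memory            4g
spark.executor.memoryOverhead    512m
spark.executor.cores             4
# Surface GC activity in the executor logs so you can see whether garbage
# collection is eating 10-15% of your execution time.
spark.executor.extraJavaOptions  -verbose:gc -XX:+PrintGCDetails
```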

Managing garbage collection is a pretty important problem with Spark. Another common one is data skew, and I'll spend a little bit of time on that today. We see a lot of cases, especially when dealing with Spark SQL, where teams don't fully understand the key space in their data sets. With Spark, data skew is a very big problem we see consistently across the board in our customer base.
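One common mitigation for a skewed key space is key salting: splitting a hot key into several sub-keys so its rows spread across partitions. The webinar doesn't prescribe this specific technique, so treat the sketch below, in plain Python, as one illustrative approach; in a real join, the other side must also be replicated once per salt value, which is not shown here.

```python
import random
from collections import Counter

random.seed(42)  # deterministic for the example

def salt_key(key, num_salts=8):
    """Turn one hot key into up to num_salts sub-keys, (key, salt)."""
    return (key, random.randrange(num_salts))

# A skewed key space: one key dominates the data set.
keys = ["hot"] * 10_000 + ["cold"] * 100

# Without salting, all 10,000 "hot" rows hash to the same bucket;
# with salting, they spread across 8 sub-keys.
buckets = Counter(salt_key(k) for k in keys)
hot_counts = [count for (key, _), count in buckets.items() if key == "hot"]
```

After salting, `hot_counts` holds eight roughly equal bucket sizes instead of one bucket of 10,000 rows, which is exactly the shape a partitioner needs to balance work across executors.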

That feeds a little bit into parallelism and partitions. Sometimes skew is due to too small a partition space relative to your key space; sometimes it's due to the data scheme itself. So it's about understanding which is true, but also understanding how to choose partition sizing and parallelism based on the computing environment you're working in.
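To make "choosing partition sizing based on your environment" concrete, here is a small heuristic in Python. The 128 MB target and the 2x-cores floor are common rules of thumb, not figures from the webinar, so treat them as assumptions to tune for your own cluster.

```python
import math

def suggest_partitions(input_bytes, target_partition_mb=128, total_cores=None):
    """Heuristic: aim for roughly target_partition_mb per partition, but
    never fewer partitions than 2x the total executor cores, so every
    core stays busy even when some partitions finish early."""
    n = max(1, math.ceil(input_bytes / (target_partition_mb * 1024 * 1024)))
    if total_cores:
        n = max(n, 2 * total_cores)
    return n
```

For example, a 10 GiB input at the default target works out to 80 partitions, while a small 100 MiB input on a 64-core cluster is still split 128 ways so the whole cluster participates.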

And we'll just touch on a couple of common misconfigurations. So, the first part of Spark performance management starts with observability: what's going on? You need to be able to actually see: why is my application running slow? Why is my application failing? What is happening to the system when my application fails? What is my application doing when it misses an SLA? So, one of the first and most important parts of any performance management tool is going to be visibility.

So, understanding what's actually happening inside your Spark environment. Now we'll touch on a couple of things, starting with a common misconfiguration: serialization. Basically, as you're transferring data between jobs and stages in your applications, you're often going to need to serialize objects to transfer the data across.

It's a very simple, easy thing to tune for the most part: just make sure you're using the Kryo serializer, not the Java one. There might be times when you need to increase or decrease the maximum buffer memory for the serializer. But, for the most part, just choose a serializer with better performance, especially if you're spending a lot of time serializing data to transfer across the wire between executors.
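The two settings being described map to the following Spark configuration keys. The 256m value is an assumed example; the default maximum buffer is 64m, and you would only raise it if large objects trigger Kryo buffer-overflow errors.

```properties
# In spark-defaults.conf, or passed via --conf on spark-submit:
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max  256m
```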

When you're doing that, there are a couple of things you need to keep in mind. One, make sure you're not using anonymous classes, so you only have to serialize the set of classes you care about, and not any outer wrapping classes. Also avoid static variables, because multiple tasks can run inside the JVM...
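The "outer wrapping classes" pitfall is easy to demonstrate outside Spark. Spark serializes closures with its own closure serializer, but the same capture behavior shows up in Python's stdlib `pickle`: shipping a bound method drags along the entire enclosing object, while copying just the field you need into a local keeps the payload small. The class below is purely hypothetical, for illustration.

```python
import pickle

class Enricher:
    """Helper with one small field a task needs and one big field it doesn't."""
    def __init__(self):
        self.rates = {"usd": 1.0, "eur": 1.1}   # what the task actually uses
        self.cache = bytearray(1_000_000)       # large state unrelated to the task

    def convert(self, amount, currency):
        return amount * self.rates[currency]

e = Enricher()

# Shipping the bound method serializes the whole instance, cache included.
heavy = pickle.dumps(e.convert)

# Copying just the needed field into a plain local keeps the payload tiny.
rates = e.rates
light = pickle.dumps(rates)
```

Here `heavy` weighs in at over a megabyte because the unrelated `cache` rides along, while `light` is a few dozen bytes; in Spark, the same capture mistake inflates every task that gets shipped to an executor.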
