How to Optimize Spark Enterprise Application Performance | Pepperdata


Does your big data analytics platform provide you with the Spark recommendations you need to optimize your application performance and improve your own skillset? Explore how you can use Spark recommendations to untangle the complexity of your Spark applications, reduce waste and cost, and enhance your own knowledge of Spark best practices.

Learn why Enterprise clients use Pepperdata products and Services:

#sparkoptimization #applicationperformance #pepperdata

Topics include:

  • Avoiding contention by ensuring your Spark applications are requesting the appropriate amount of resources,
  • Preventing memory errors,
  • Configuring Spark applications for optimal performance,
  • Real-world examples of impactful recommendations,
  • And more!

Join Product Manager Heidi Carson and Field Engineer Alex Pierce from Pepperdata to gain real-world experience with a variety of Spark recommendations, and participate in the Q and A that follows.

More on the episode:
My name is Heidi Carson. I work in product management, and today we're going to be talking to you about Spark recommendations and how they can help you improve your Spark application performance.

So, just to get started, as I advance the slide, I'm sure most of you are already familiar with the incredible power of Apache Spark. But, I'm just going to summarize a few highlights developers usually really appreciate.

Spark's speed, ease of use, generality, and flexibility. Of course, it's an analytics engine that allows you to run jobs written in Java, Python, and other languages incredibly fast.

It can integrate with machine learning and other types of libraries, and Spark can run in lots of different environments, connecting to many different types of databases.

In fact, there was a previous Pepperdata webinar where Spark was referred to as the Swiss Army knife of analytics platforms because of all of these incredible capabilities.

That said, Spark definitely has some challenges, and if you're a Spark developer, I'm sure you're familiar with the sorts of things you might be thinking about every day: how to efficiently source potentially huge volumes of data.

How to perform extract, transform, and load (ETL) operations efficiently. How do you validate data sets at huge scale? How do you work with other applications and play nicely with them on a multi-tenant system? And, of course, how do you do all of this while minimizing cost, maximizing efficiency, and getting your jobs done as quickly as possible?
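To make the validation challenge a bit more concrete, here is a toy sketch in plain Python (no Spark required) of a per-partition quality check: count records with missing required fields instead of failing on the first bad row. The field names and record shape are purely illustrative, not anything from Pepperdata's tooling; at scale you would aggregate counters like these across partitions, for example with Spark accumulators.

```python
from typing import Iterable

def validate_partition(rows: Iterable[dict],
                       required_fields: tuple = ("user_id", "event_ts")) -> dict:
    """Count rows missing required fields, rather than failing fast.

    In a real Spark job you would run a check like this per partition and
    aggregate the counters, instead of inspecting rows on the driver.
    """
    total = bad = 0
    for row in rows:
        total += 1
        if any(row.get(field) is None for field in required_fields):
            bad += 1
    return {"total": total, "bad": bad}

# Hypothetical sample records for illustration:
sample = [
    {"user_id": 1, "event_ts": "2021-01-01T00:00:00"},
    {"user_id": None, "event_ts": "2021-01-01T00:01:00"},  # missing user_id
]
print(validate_partition(sample))  # {'total': 2, 'bad': 1}
```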

In fact, we've seen at Pepperdata, in our recent big data performance report, that with all of these complexities and challenges, Spark jobs tend to fail more often than other types of jobs, with a failure rate of over 4 percent.

And so, today we're going to talk to you about some of these common issues and causes of failure. Now, across the many thousands of Spark jobs running on Pepperdata, we've seen some common themes, and those themes tend to be the three that I've listed here.

The first one is memory-related issues. Even if a job ran successfully and didn't fail, sometimes you find issues with overallocation of memory wasting resources. Or, sometimes you see out-of-memory errors due to underallocation.
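As a concrete illustration of why "how much did I actually request?" matters: when Spark runs on YARN, each executor container asks for the executor memory plus an overhead, which in stock Spark defaults to the larger of 384 MiB or 10% of executor memory. A quick sketch of that arithmetic (the defaults are standard Spark behavior; the function itself is just illustrative):

```python
def container_request_mib(executor_memory_mib: int,
                          overhead_factor: float = 0.10,
                          min_overhead_mib: int = 384) -> int:
    """Total memory one executor container requests from YARN, in MiB.

    Mirrors Spark's default: overhead = max(384 MiB, 10% of executor memory),
    i.e. spark.executor.memoryOverhead on top of spark.executor.memory.
    """
    overhead = max(min_overhead_mib, int(executor_memory_mib * overhead_factor))
    return executor_memory_mib + overhead

# A 4 GiB executor actually asks YARN for 4 GiB + 409 MiB:
print(container_request_mib(4096))  # 4505
# Small executors are dominated by the 384 MiB floor:
print(container_request_mib(1024))  # 1408
```

Multiply that per-executor figure by the executor count and it becomes clear how a slightly oversized `spark.executor.memory` setting can quietly overallocate a large share of a cluster.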

Another common topic that we see is data-skew-based issues. Now, in an ideal scenario, if you had a pool of executors, you would want each of them sharing the load equally and processing the same amount of work.

But, that doesn't always happen in reality. And so, you end up sometimes with unbalanced work across the executors creating this kind of skew. And finally, sometimes we see just basic configuration issues, maybe related to your choice of serializer, for example.

So, that actually takes us to our very first poll question, and I'll hand it back to our moderator, Dave, to take us through that. Great. So, thank you, Heidi. And that poll question should show up below your console right now.

And that question is: Which of these is the biggest issue you encounter in your Spark environment? Is it overallocation of memory? Is it out-of-memory errors? Is it data skew? Or is it configuration issues?...
