Benchmarking Llama 4 with GitHub Multiple Choice Benchmarks

May 9, 2025

How accurately can LLMs predict how bugs were fixed? As a first step toward answering that question, we tested Llama 4 and other leading models on a GitHub Multiple Choice Benchmark. Each model was given a real bug ticket and had to pick, from a set of candidate answers, the pull request that resolved it.

The benchmark framework is open-source and available on GitHub: https://github.com/Rootly-AI-Labs/GMCQ-benchmark
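
To make the task format concrete, here is a minimal sketch of how one multiple-choice item of this kind could be represented and scored. It is an illustration only, not code from the repository: the `Question` class, the `format_prompt`, `parse_choice`, and `accuracy` helpers, and the lettered prompt layout are all assumptions made for this example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    """One hypothetical benchmark item: a bug ticket plus candidate pull requests."""
    ticket: str               # text of the bug report
    choices: tuple[str, ...]  # candidate pull request descriptions
    answer: int               # index of the pull request that fixed the bug


def format_prompt(q: Question) -> str:
    """Render the question as a lettered multiple-choice prompt (assumed layout)."""
    lines = [f"Bug ticket: {q.ticket}", "Which pull request resolved it?"]
    for i, choice in enumerate(q.choices):
        lines.append(f"{chr(ord('A') + i)}. {choice}")
    return "\n".join(lines)


def parse_choice(reply: str, n_choices: int) -> int | None:
    """Map a model reply such as 'B' or 'b.' to a choice index, else None."""
    reply = reply.strip().upper()
    if not reply:
        return None
    idx = ord(reply[0]) - ord("A")
    return idx if 0 <= idx < n_choices else None


def accuracy(questions: list[Question], replies: list[str]) -> float:
    """Fraction answered correctly; unparseable replies count as wrong."""
    if not questions:
        return 0.0
    correct = sum(
        parse_choice(reply, len(q.choices)) == q.answer
        for q, reply in zip(questions, replies)
    )
    return correct / len(questions)
```

In this sketch, a model's raw reply is reduced to a single letter before scoring, so a reply that does not start with a valid letter is simply marked incorrect. A real harness may parse answers differently.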