LAS VEGAS—On a raised floor in a ballroom at the Paris Hotel, seven competitors stood silently.
These combatants had fought since 9:00am, and nearly $4 million in prize money loomed over the proceedings. Now, some 10 hours later, their final rounds were being accompanied by all the play-by-play and color commentary you’d expect from an episode of American Ninja Warrior. Yet no one in the competition showed signs of nerves.
To observers, this all likely came across as odd—especially because the competitors weren’t hackers; they were identical racks of high-performance computing and network gear.
The finale of the Defense Advanced Research Projects Agency’s Cyber Grand Challenge, a DEFCON game of “Capture the Flag,” is all about the “Cyber Reasoning Systems” (CRSs).
And these collections of artificial intelligence software armed with code and network analysis tools were ready to do battle.
Inside the temporary data center arena, referees unleashed a succession of “challenge” software packages.
The CRSs would vie to find vulnerabilities in the code, use those vulnerabilities to score points against competitors, and deploy patches to fix the vulnerabilities.
Throughout the whole thing, each system had to also keep the services defined by the challenge packages up and running as much as possible. And aside from the team of judges running the game from a command center nestled amongst all the compute hardware, the whole competition was untouched by human hands.
Greetings, Professor Falken
Waiting to rumble—the seven AI supercomputing hacker “bots” and their supporting cast hum quietly in the Paris Hotel ballroom
Some of the 7 supercomputers required just to keep tabs on the participating systems.
The Airgap robot allows scoring data to be passed to DARPA’s visualization team by moving burned Blu-Ray discs from one side to the other.
It’s time to rumble. Well, actually, they started 10 hours ago…it’s just time for the show.
The crowd watches as the show gets underway.
Daniel Tkacik, Ph.D, Carnegie Mellon University
The Cyber Grand Challenge (CGC) was based on the formula behind the successful Grand Challenges that DARPA has funded in areas such as driverless vehicles and robotics.
The intent is to accelerate development of artificial intelligence as a tool to fundamentally change how organizations do information security. Yes, in the wrong hands such systems could be applied to industrial-scale discovery and weaponization of zero-days, giving intelligence and military cyber-operators a way to quickly exploit known systems to gain access or bring them down.
But alternatively, systems that can scan for vulnerabilities in software and fix them automatically could, in the eyes of DARPA director Arati Prabhakar, create a future free from the threat of zero-day software exploits.
In such a dream world, “we can get on with the business of enjoying the fruits of this phenomenal information revolution we’re living through today,” she said.
For now, intelligent systems—call them artificial intelligence, expert systems, or cognitive computing—have already managed to beat humans at increasingly difficult reasoning tasks with a lot of training.
Google’s AlphaGo beat the world’s reigning Go master at his own game.
An AI called ALPHA has beaten US Air Force pilots in simulated air combat.
And, of course, there was that Jeopardy match with IBM’s Watson.
But those sorts of games have nothing on the cutthroat nature of Capture The Flag—at least as the game is operated by the Legitimate Business Syndicate, the group that oversees the long-running DEFCON CTF tournament.
This was the culmination of an effort that began in 2013, when DARPA’s Cyber Grand Challenge program director Mike Walker began laying the groundwork for the competition. Walker was a computer security researcher and penetration tester who had competed widely in CTF tournaments around the world. He earned this project after working on a “red team” that performed security tests on a DARPA prototype communications system.
After a 2012 briefing he gave the leadership of DARPA’s Information Innovation Office (I2O) on vulnerability detection and patching, the I2O leadership had one thought. “Can we do this in an automated fashion?” I2O deputy director Brian Pierce told Ars. “When it comes down to cyber operations, everything operates on machine time—the question was could we think about having the machine assist the human in order to address these challenges.”
The same question was clearly on the minds of Defense Department leaders, particularly at the US Cyber Command with its demand for some way to “fight the network.” In 2009, Air Force Gen. Kevin Chilton, then commander of the Strategic Command, said, “We need to operate at machine-to-machine speeds…we need to operate as near to real time as we can in this domain, be able to push software upgrades automatically, and have our computers scanned remotely.”
Walker saw an opportunity to push forward what was possible by combining the CTF tournament model with DARPA’s “Grand Challenge” experience. He drew heavily from then-I2O deputy director Norm Whitaker’s experience running the DARPA self-driving vehicle grand challenges from 2004 to 2007.
“We learned a lot from that,” said Pierce.
But even with a template to follow, “a lot of things had to be built from scratch.” Those things included the creation of a virtual arena in which the competitors could be fairly judged against each other—one that was a vastly simplified version of the real world of cybersecurity so competitors focused on the fundamentals.
That was the same model DARPA followed in its initial self-driven vehicle challenges, as Walker pointed out at DEFCON this month.
The winning vehicle of the 2005 DARPA Grand Challenge, a modified Volkswagen Touareg SUV named “Stanley,” “was not a self-driving car by today’s standards,” said Walker. “It was filled with computing and sensor and communications gear.
It couldn’t drive on our streets, it couldn’t handle traffic…it couldn’t do a lot of things.
All the same, Stanley earned a place in the Smithsonian by redefining what was possible, and today vehicles derived from Stanley are driving our streets.”
Similarly, the Cyber Grand Challenge devised by Walker and DARPA didn’t look much like today’s world of computer security.
The systems would “work only on very simple research operating systems,” Walker said. “They work on 32-bit native code, and they spent a huge amount of computing power to think about the security problems of small example services. The complex bugs they found are impressive, but they’re not as complex as their real-world analogues, and a huge amount of engineering remains to be done before something like this guards the networks we use.”
A highlight reel from DARPA’s Cyber Grand Challenge finale.
The “research operating system” built by DARPA for the CGC is called DECREE (the DARPA Experimental Cyber Research Evaluation Environment).
It was purpose-built to support playing Capture the Flag with an automated scoring system that changes some of the mechanics of the game as it is usually played by humans.
There are many variations on CTF.
But in this competition, the “flag” to be captured is called a Proof of Vulnerability (POV)—an exploit that successfully proves the flaw on opponents’ servers.
Teams are given “challenge sets,” or pieces of software with one or sometimes multiple vulnerabilities planted in them, to run on the server they are defending.
The teams race to discover the flaw through analysis of the code, and they can then score points either by patching their own version of the software and submitting that patch to the referee for verification or by using the discovered exploit to hack into opponents’ systems and obtain a POV.
The catch with patching is that once a team patches, its fixed code is shared with every other team. Patching also generally means briefly bringing down the “challenge set” code to apply the fix.
It’s a risk for competitors: if the patch fails and the software doesn’t work properly, that counts against your score.
Based on 32-bit Linux for the Intel architecture, DECREE only supports programs compiled to run in the Cyber Grand Challenge Executable Format (CGCEF)—a format that supports a much smaller number of possible system calls than used in software on general-purpose operating systems.
CGCEF also comes with tools that allow for quick remote validation that software components are up and running, debugging and binary analysis tools, and an interface to throw POVs at the challenge code either as XML-based descriptions or as C code.
So just as with the human version of CTF, each of the CRSs was also tasked with defending a “server”—in this case, running instances of DECREE.
The AI of each bot controlled the strategy used to analyze the code and the creation of potential POVs and patches.
AIs made decisions based on the strategy they were trained with plus the state of the game, adapting to either submit patches (and share them with everyone else as a result), create a network-based defense to prevent exploits from landing, or go on the attack and prove vulnerabilities on other systems.
While teams could score points based on patches submitted, successfully deflected attacks, and POVs scored against the other teams’ systems, those points were multiplied by the percentage of time their copies of the challenge sets were available. Patching would mean losing availability, as the challenge sets would have to go down to be patched.
That meant possibly giving up the chance to exploit the bug against others to score more points.
Setting up an intrusion detection system to block attacks could also affect availability—especially if it was set wrong and it blocked legitimate traffic coming in.
All of this makes for a complex set of game strategies in CTF, and that really tests the flexibility of the AI controlling the bots.
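A toy calculation shows why availability dominated these choices. This is a hypothetical simplification, not DARPA’s actual formula—the real CGC scoring combined availability with separate security and evaluation scores—but it captures the basic tension:

```python
def round_score(points_earned: float, availability: float) -> float:
    """Toy model: points from patches, deflected attacks, and POVs
    are scaled by the fraction of the round the services stayed up."""
    assert 0.0 <= availability <= 1.0
    return points_earned * availability

# An aggressive team that lands every exploit but takes its services
# down to patch can still lose a round to a cautious, available one.
aggressive = round_score(points_earned=100, availability=0.60)
cautious = round_score(points_earned=80, availability=0.95)
```

Under this toy model the aggressive team nets 60 points to the cautious team’s 76—exactly the trade-off the competitors’ AIs had to weigh each round.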
One of the major differences between human CTF and the DARPA version of the game was its pace.
In a typical human CTF tournament, only about 10 challenge sets would be posted over the course of a two-day tournament.
In the qualifying round of the Cyber Grand Challenge, held last August at DEFCON, there were 131 challenge sets with a total of 590 vulnerabilities.
In total, the 28 systems that made the first cut competed over the course of 24 hours—meaning they had to handle a new challenge about once every 10 minutes.
In the finale of the Cyber Grand Challenge, there would be 100 challenge sets—but they would be posted to systems at the rate of one challenge every five minutes.
Another difference in this event was how the “offensive” part of the CTF worked. Rather than launching their POVs directly at competitors’ servers to try to score, the POVs were submitted to and launched by the referee system.
That way, the success or failure of any POV could be instantly assessed by the scoring system, just as any patch submitted by competitors could be independently evaluated.
All of the action recorded by the referee system could then be played back through a set of visualization tools to show the results of each round.
On top of all this was the hardware itself.
CTF tournaments generally look like rooms full of people huddled at tables around laptops.
The CGC version of the game required the construction of a portable data center, assembled in Las Vegas: an “air-gapped” network of 15 supercomputing systems, each with 128 2.5 GHz Intel Xeon processors (totaling over 1000 processor cores) and 16 terabytes of RAM. Physically disconnected from any outside network, the only way data left the “arena” network was via a robotic arm that passed Blu-ray discs burned with scoring data from one tray to another.
Seven of the identical supercomputing racks ran the AI-powered “bots.” Seven more were dedicated to running the match itself—handling the deployment of the challenge sets, verifying POVs and patches, performing forensic analysis, and tracking the score.
The last system acted as a sparring partner for the competitors in warm-ups.
The whole raised-floor rig was cooled by water piped in from three industrial chillers sitting on the Paris Convention Center’s loading dock; it drew 21 kilowatts of power over cables snaked in from outside.
Listing image by Sean Gallagher
Report to the game grid, program
Members of the seven teams behind the battling hacker AIs relax as the battle begins, redundant for now.
A DARPA slide explains the visualization system for scoring the Cyber Grand Challenge.
A view of the “card” for a participant in the CGC, showing active “challenge sets” and their level of security.
Each round gets visualized like a game of Missile Command, with incoming “proof of vulnerability” attacks color-coded to the team they’re coming from.
The Cyber Grand Challenge scoreboard after a few rounds favors DeepRed from Raytheon…
…but ForAllSecure’s Mayhem soon pulls ahead for good.
When the Cyber Grand Challenge was first announced in 2014, 104 teams of security researchers and developers registered to take on the challenge of building systems that could compete in a Capture the Flag competition. Of them, 28 teams completed a “dry run”—demonstrating that they could find software flaws in new code and interact with the CTF game system.
Those 28 battled in the first-ever artificial intelligence CTF competition—last year’s first round of the CGC held at DEFCON 2015.
The 131 challenge sets were the most ever used in a single CTF event.
In that first full run, several competitors were able to detect and patch bugs in individual software packages in less than an hour.
All of the 590 bugs introduced into the “challenge” software packages used in the competition were patched by at least one of the competing systems during the match.
The seven finalists were given a budget of $750,000 to prepare their systems for this year’s competition.
They would need it: the final round would not only bring on challenges at twice the speed of the first round, but it would include more difficult challenge sets.
Some were even based on “historic” vulnerabilities such as the Morris Worm, SQL Slammer, and Heartbleed.
The most challenging of these, in the minds of the team behind the DARPA CTF, was a reproduction of the Sendmail bug known as Crackaddr.
This exploit took advantage of a bug that defied the usual types of static analysis of code.
The final seven teams brought a mix of skills to the game:
David Brumley, CEO of the security start-up ForAllSecure and director of Carnegie Mellon University’s CyLab, sent a team led by CMU doctoral student Alexandre Rebert. Most of the team were also members of CMU’s Plaid Parliament of Pwning CTF team, which has participated in DEFCON’s human CTF tournaments for a decade.
TechX featured a team made up of engineers from the software assurance company GrammaTech and researchers from the University of Virginia.
ShellPhish was an academic team from the University of California-Santa Barbara with deep experience in human CTF tournaments.
DeepRed came from Raytheon’s Intelligence, Information and Services division and was led by Mike Stevenson, Tim Bryant, and Brian Knudson.
CodeJitsu was a team of researchers from UC Berkeley, Syracuse University, and the Swiss company Cyberhaven.
CSDS featured a two-person team of researchers from the University of Idaho.
Disekt was led by University of Georgia professor Kang Li, also a CTF veteran.
It turned out ForAllSecure’s human CTF-winning experience paid off. The team’s bot was named Mayhem, after the symbolic execution analysis system developed by CMU researchers.
It almost ran away with the match early on, eventually finishing at the front of the pack with 270,042 points to win the $2 million prize for first place.
Brumley told Ars right after the victory that part of the reason Mayhem was able to succeed was that the team intentionally avoided using any sort of intrusion detection.
Instead, they focused on attacks and patching. “We did everything as software security, not network security,” he explained. “For intrusion detection, we did zero.
Every scenario we ran through, an IDS slowed us down. We weren’t dedicating all the cores to offense or defense—we dedicated more to deciding what the right thing to do was.”
That additional computing power gave the AI the resources to do more testing before making a decision. “We have a lot of patch strategies, and we choose which one we use depending on where we are in the competition,” Brumley explained. Mayhem made patching decisions in part by using some of its processing cores to run multiple versions of the patch.
“We test all the different patches we generate,” Brumley said. “We use the AI to select the best one. We had actually two fundamentally different approaches to patching—one was a hot patch, where if we detected a vulnerability, we would fix it specifically; the other was more agnostic or general patches that would fix a broad range of things. We had a bunch of candidate patches [pre-built].
The executive system would run the two different patches on parallel cores to see how they performed.”
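Brumley’s description suggests a pattern like the following sketch—hypothetical code, not ForAllSecure’s actual system: run each candidate patch against the same workload in parallel, then keep whichever survives best.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical candidate patches for a divide-by-zero bug, mirroring
# the "hot patch" vs. "general patch" trade-off Brumley describes.
def hot_patch(x, y):
    # Targeted fix: handle the one known-bad input.
    return 0 if y == 0 else x // y

def general_patch(x, y):
    # Broader fix: validate everything (catches more, costs more).
    if not isinstance(y, int) or y == 0:
        return 0
    return x // y

WORKLOAD = [(10, 2), (7, 0), (9, 3)]  # inputs a referee might replay

def survival_score(patch):
    """Count how many workload cases the patched code survives."""
    survived = 0
    for x, y in WORKLOAD:
        try:
            patch(x, y)
            survived += 1
        except Exception:
            pass
    return survived

def pick_best(candidates):
    """Evaluate all candidates in parallel; keep the best survivor."""
    with ThreadPoolExecutor() as pool:
        scores = list(pool.map(survival_score, candidates))
    return candidates[scores.index(max(scores))]
```

A real CRS would be replaying binaries rather than calling Python functions, but the shape is the same: spare cores buy the executive more evidence before it commits to a patch.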
To decide when to patch, Mayhem used an expert system that looked at information about the state of the game, including how Mayhem was doing relative to competitors on the scoreboard.
The AI “executive” was “running modules through an expert system where we had different weights based on where we were,” said Brumley. “If we were behind, we would switch strategies.
If we were getting exploited a lot, we would switch to a more heavy defense.”
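That kind of state-weighted switching can be sketched in a few lines. The thresholds and strategy names here are invented for illustration—a hypothetical reading of Brumley’s description, not Mayhem’s actual expert system:

```python
def choose_strategy(score_rank: int, recent_exploits_against_us: int) -> str:
    """Weight candidate strategies by game state; pick the heaviest."""
    weights = {"hold": 1.0, "attack": 1.0, "patch": 1.0}
    if score_rank > 3:                      # trailing on the scoreboard:
        weights["attack"] += 1.0            # take more offensive risks
    if recent_exploits_against_us > 5:      # bleeding points to POVs:
        weights["patch"] += 2.0             # switch to heavy defense
    return max(weights, key=weights.get)
```

A leading, unexploited team holds steady; a trailing one attacks; a team being hit hard patches, even at the cost of availability.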
To find those vulnerabilities in the first place, Mayhem had two separate analytical components: Sword (an offensive tool that analyzed code in search of exploits) and Shield (an analysis tool for creating patches).
To find bugs in the challenge sets, the system used a mix of old-school brute-force “fuzzing” and a technique called symbolic execution.
Fuzzing is essentially throwing random inputs at software to see what makes it crash, and it’s the most common way vulnerabilities are found. Mayhem’s fuzzing analysis was built on AFL, developed by Michal Zalewski (also known as lcamtuf). “We took that as sort of the base idea,” Brumley explained. “Then we built on a variety of techniques, first to make fuzzing faster.
But the problem with fuzzing is it can often get stuck in one area of code and it can’t get out.
So we paired it with symbolic execution, which is a much different approach.”
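In its simplest form, fuzzing can be sketched in a few lines. This is a toy illustration, not AFL: it throws random bytes at a target program and records what crashes. The magic-number branch in the target shows the “stuck” problem Brumley describes: random bytes essentially never match a specific four-byte constant.

```python
import random

def target(data: bytes):
    """Toy parser with two flaws: a shallow crash and a guarded one."""
    if len(data) < 4:
        raise ValueError("short input")       # easy for a fuzzer to hit
    if data[:4] == b"\x7fCGC":
        raise MemoryError("deep bug reached") # ~1-in-2^32 by chance
    return len(data)

def fuzz(rounds: int = 10_000, seed: int = 1) -> dict:
    """Throw random inputs at the target; record one input per crash type."""
    rng = random.Random(seed)
    crashes = {}
    for _ in range(rounds):
        data = bytes(rng.randrange(256) for _ in range(rng.randrange(8)))
        try:
            target(data)
        except Exception as exc:
            crashes.setdefault(type(exc).__name__, data)
    return crashes
```

After 10,000 rounds the shallow bug surfaces almost immediately, while the bug behind the magic-byte check effectively never does.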
Symbolic execution tries to bring some order to fuzzing by defining ranges of inputs, varying them to see what combinations of inputs trigger different paths within the program and which cause errors like endless loops, memory buffer overflows, and crashes.
By using a more controlled approach to applying variables, symbolic execution can sometimes drill deeper into programs to expose bugs than fuzzing can.
But it can also be slower. “One of the keys in our strategy was how do you do this handoff between dumb fuzzing and symbolic execution which is more of a formal method,” Brumley said. “We have a nice system where they flow between each other.”
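The handoff itself can also be sketched, with heavy simplification: real symbolic execution builds constraints over every branch a program takes and hands them to a solver, while this hypothetical toy handles only a direct byte-equality check by recording the comparison that keeps failing and constructing an input that satisfies it.

```python
# Record the operands of equality checks so a stuck input can be
# "solved" by construction: a toy stand-in for symbolic execution.
observed_compares = []

def traced_eq(a: bytes, b: bytes) -> bool:
    observed_compares.append((a, b))
    return a == b

def target(data: bytes) -> str:
    if len(data) >= 4 and traced_eq(data[:4], b"\x7fCGC"):
        return "deep path"   # code a random fuzzer almost never reaches
    return "shallow path"

def solve_stuck_branch(base_input: bytes) -> bytes:
    """Replay once, read off the failing comparison, patch the input."""
    observed_compares.clear()
    target(base_input)
    if not observed_compares:
        return base_input
    _got, wanted = observed_compares[-1]
    return wanted + base_input[len(wanted):]
```

Starting from a random input like b"AAAAAAAA", one replay recovers the magic prefix, and the rewritten input reaches the deep path on the next run—the kind of barrier-crossing that fuzzing alone gets stuck on.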
The AI also engaged in a bit of deception, generating fake network traffic “that was actually chaff traffic that we were generating to shoot at our competitors,” Brumley added.
This traffic might have been detected by other systems as attempted exploits, triggering patching or IDS changes.
Elsewhere in the competition, DeepRed’s Rubicon was apparently a bit more dependent on IDS. While Rubicon scored a number of early POVs, in mid-match the availability of its server started to plummet, seriously impacting its score.
Tim Bryant said he wasn’t sure what caused the drop (since the airgap was still in place immediately following the competition), but his suspicion was that the IDS may have brought the performance of the server down.
By mid-match, almost all of the competitors were relatively close from round to round in their performance, though Mayhem had built up a lead of over 10,000 points.
If it hadn’t been for the failure of a software component, Mayhem’s margin of victory might have been much wider than the 8,000 points it ultimately won by over TechX’s Xandra bot. “As far as we can tell, what happened was that the submitter—the thing that’s supposed to submit our patch and POV candidates—started lagging,” Brumley said. “It started submitting binaries for the wrong part of the competition.
It’s actually the simplest part of the system. We’ll just have to do some analysis to figure out what happened.
It’s kind of cool, though, because we had such a big lead that we were able to cruise in and it started working again in the end.”
TechX’s Xandra bot earned a $1 million second place finish.
ShellPhish’s Mechanical Phish narrowly defeated DeepRed for third, earning the all-student team $750,000. Mechanical Phish also won some other bragging rights—it was the only one of the seven systems to successfully patch the Crackaddr vulnerability with its symbolic execution analysis.
All others choked on the problem.
Shall we play a game?
ForAllSecure’s team celebrates their $2 million win.
The champion: Mayhem.
In victory, however, there was no rest.
As Alex Rebert, the team leader for ForAllSecure, accepted a trophy from DARPA’s Prabhakar and Walker, he also accepted a challenge to bring Mayhem (virtually) to play in another CTF tournament at DEFCON.
But this tournament, called by some the World Series of Hacking, features humans.
“Yeah, we accepted the challenge from the CTF organizer,” said Brumley just after the award ceremony. “We’re going to go and have our system compete against the best hackers in the world and see how it does.
I think it’s going to be pretty exciting—we honestly have no idea what’s going to happen.
I think the machine is going to win if there’s a high number of challenges, just through brute force.
It will be best at the parts of the competition that require a quick reaction.
I think we’re going to have an advantage.”
But Brumley also noted that human creativity was important in CTF competitions—and that the AI would have to play more aggressively against human competitors. “In the DARPA event, because we weren’t competing against humans, we were pretty careful about not doing anything that looks like counter-autonomy (attacks on the system the other AIs were running on).
It wasn’t in the scope of the DARPA competition.
In DEFCON, it’s a bit more aggressive, so we’re going to enable a bit more aggressive techniques.”
The DEFCON tournament, however, added another wrinkle for Mayhem. Many members of the team would be playing against the AI as the Plaid Parliament of Pwning, a burgeoning dynasty at this competition. “Half the team is going to segregate off and has been playing at DEFCON for the past 10 years,” Brumley said. “There’ll be an airgap between the two teams, so it won’t just be DEFCON vs. Mayhem—it’ll be Mayhem against the creators of Mayhem.
It’ll be fun.”
It turned out not to be as much fun as expected. When we checked in with Alex Rebert late on the first day of the DEFCON CTF, he was too busy to talk—the interface between Mayhem and the tournament system wasn’t working properly, and he and a teammate were trying to fix it. On Saturday, he and the team minding Mayhem were a bit more relaxed given there was nothing more for them to do but watch.
The late start had hurt them, but he told Ars he was just hoping the system would work well enough not to be a total embarrassment.
At the conclusion of the tournament on Sunday, it was clear that Mayhem was not ready to compete with its masters quite yet.
The Plaid Parliament of Pwning again emerged victorious with 15 points, their third victory in four years. Mayhem found itself on the other end of the scoreboard with a single point to its credit.
Maybe next year, robots.