AI CTF Adventures with Vexbot

Introduction

At the recent BSides Canberra, we ran a Capture the Flag (CTF) for visitors to our sponsor's booth. With recent interest in AI, we decided to create a Large Language Model based CTF; one that didn't require any custom tools, one that you could play on your phone.

The result was Vexbot, so named because it can be very, very frustrating to interact with. We had 130 people attempt the CTF and 55 managed to find all five flags. We learned a lot about AI both in the building and interaction with the CTF, and quite a bit about humans too.

Team Names

The CTF had five flags to find, inventively named flag1 through flag5. The model we decided to use was called Gemma2 from Google, and we used the 9B version. We picked this model due to its fun conversational nature,  its willingness to roleplay and because it was relatively easy to manipulate.

The model was used in three different contexts within the CTF and the first that the user would experience was as a profanity filter on team registration. The model was instructed to rejected any team name which had offensive content, and to prevent people from stealing flags with their team name, both the terms flag4 and flag5 were instructed to be offensive. We had 36 rejected team names, but it turns out that Gemma2 has a pretty broad definition of 'offensive'.


Here are some examples: 


The Flags

The first flag, flag1, was designed to be very easy to find. It was listed in the system prompt and the LLM was instructed to give it out if the user if they even so much as hinted that they wanted a flag. It was also instructed to tell the user to submit the flag using the 'check_flag' tool.

Flag2 was not present in the system prompt, but could be retrieved using the 'get_flag' tool. The majority of players used the tool by doing something like 'get_flag flag2', but others tried a different approach. The team below attempted to enter "Debug Mode". None of the instructions for the LLM included any reference to debug mode, and so these kind of instructions are handled by however the LLM was trained. Gemma2 does like to roleplay, and so as can be seen below, it started a debug type mode, and included flag values. Unfortunately, the flag2 that it revealed was entirely made up, and even worse, when it was submitted, the score was also made up. This was an important lesson that can be learned regarding LLMs. You absolutely cannot trust anything that is output by them.



Flag3 was also present in the system prompt, but it was protected by an external guardrail. The LLM was told that a guardrail was present, but not what it was or how it worked. The actual guardrail was just a string match which replaced the value of the flag with '<redacted>' when it was displayed to the user. There were several ways to get around this guardrail. The LLM could be instructed to display the flag backwards, or with a space in the middle of it and this would prevent the string matching and display the slightly modified flag, which could then be unmodified and submitted. A more interesting, and faster solution was just to get the LLM to submit the value of the flag directly to the check_flag tool. The guardrail only operated on the display of the flag to the user, so it could just be submitted directly without the user ever seeing it. These were the intended ways of submitting the flags, but LLMs do things in strange and mysterious ways as can be seen below. The user submitted the flag as the literal string ‘<redacted>’, and the LLM passed the actual value of the flag to the backend tool instead.



This was absolutely not something that we thought could happen. We assume that because the LLM has the value for flag3 in its system prompt, and the context from the conversation of that flag being redacted, when the user asked to submit a redacted flag, it just submitted the actual value instead. Another important lesson regarding LLMs was learned, you cannot test them enough to cover every possible outcome.

Flag4 was supposed to be the fun flag. The LLM was instructed not to tell the user what the value of it was, so in order to retrieve the flag, the user needed to convince the LLM somehow. There were a lot of different solutions involving roleplay, codewords, and stories. Here are some of the more interesting solutions. It is very difficult to protect information included in the system prompt.



Flag5 was a bit different. There was no way to get it by interacting with the chatbot itself. Instead, the 'team comment' was used to create a joke for the team using a separate system prompt. This separate instance also had access to a single tool called 'get_flag5'. In order to find the flag, the teams needed to update their comment with instructions to call the tool and print out the result instead of, or as part of, the joke. While an actual joke was not required, we appreciate that some players tried to use one anyway.


Actual jokes, or even making sense seems to be a bit of a stretch for this AI.


Conclusions

The feedback from participants of the CTF was very positive. People had a lot of fun and the difficulty level seemed to be about right. The fastest completion time was six minutes but most people took about an hour.

The CTF will be updated an improved based on the results from this run, and some new flags will be added, so look out for the next iteration of Vexbot.