This article has been co-authored by Daax Rynd

Introduction

The cat-and-mouse game of game hacking continues to fuel innovation in both exploitation and mitigation. The use of virtualization technology in game hacking has exploded ever since copy-pastable hypervisors such as Satoshi Tanda’s DdiMon and Petr Beneš’ hvpp hit the scene. These two projects are used by most of the paid cheats in the underground hacking scene, thanks to their low barrier to entry and extensive documentation. Their release has almost certainly accelerated the hypervisor arms race that is now beginning to show its face in the game-hacking community. Here’s what wlan, the administrator of one of the world’s largest game-hacking communities, says about the situation:

With the advent of ready-made hypervisor solutions for game hacking it’s become unavoidable for anti-cheats such as BattlEye to focus on generic virtualization detections

Hypervisors are so widespread now because recent developments in kernel anti-cheats leave very little room for hackers to modify games through traditional means. Their popularity can be explained by the simplicity of evasion: virtualization makes it much easier to hide information from the anti-cheat, through mechanisms such as syscall hooks and MMU virtualization.

BattlEye has recently implemented a time-based detection of generic hypervisors such as the previously mentioned platforms (DdiMon, hvpp). The detection aims to spot abnormal execution times for the CPUID instruction. CPUID is relatively cheap on real hardware and generally requires only around two hundred cycles, whereas in a virtualized environment it may take up to ten times as long due to the overhead incurred by an introspective engine. Unlike real hardware, which simply performs the operation as expected, an introspective engine monitors and conditionally changes the data returned to the guest based on arbitrary criteria.

Fun fact: CPUID is commonly used in these time-based detection routines because it is an unconditionally exiting instruction as well as an unprivileged serializing instruction. This means that CPUID acts as a ‘fence’: it ensures that the instructions before it have completed before those after it begin, which makes the timing independent of typical instruction reordering. One could use an instruction like XSETBV, which also exits unconditionally, but to keep the timing independent one would need some sort of FENCE instruction so that no reordering occurs before or after it that would affect the timing’s reliability.

Detection

Below is the detection routine that I reverse-engineered and reconstructed into pseudo-C from the BattlEye module “BEClient2” before posting it on Twitter. In an unexpected turn of events, the BattlEye developers changed the obfuscation on BEClient2 a day after my tweet, probably hoping that it would prevent me from analyzing the module. The previous obfuscation had not changed in over a year, so that is an impressive turnaround.

void battleye::take_time()
{
    // SET THREAD PRIORITY TO THE HIGHEST
    const auto old_priority = SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
 
    // CALCULATE CYCLES FOR 1000MS
    const auto timestamp_calibrator = __rdtsc();
    Sleep(1000);
    const auto timestamp_calibration = __rdtsc() - timestamp_calibrator;
 
    // TIME CPUID
    std::uint64_t total_time = 0;
    for (std::size_t count = 0; count < 0x6694; count++)
    {
        // SAVE PRE CPUID TIME
        const auto timestamp_pre = __rdtsc();
 
        int cpuid_data[4] = {};
        __cpuid(cpuid_data, 0);
 
        // SAVE THE DELTA
        total_time += __rdtsc() - timestamp_pre;
    }
 
    // SAVE THE RESULT IN THE GLOBAL REPORT TABLE
    battleye::report_table[0x1A8] = 10000000 * total_time / timestamp_calibration / 0x65;
 
    // RESTORE THREAD PRIORITY
    SetThreadPriority(GetCurrentThread(), old_priority);
}

As mentioned earlier, timing an unconditionally intercepted instruction is the most common detection technique. However, it is vulnerable to time-forging, which we’ll detail in the next section.

Circumvention

There are a few issues with this detection method. The first is that it’s susceptible to time-forging, which is typically done in one of two ways: TSC offsetting in the VMCS, or decreasing the TSC every time CPUID is executed. There are many more ways to beat time-based attacks, but the latter is much simpler to implement, as you can ensure that instruction execution times are within one or two clock ticks of real hardware execution. Detecting this kind of time-forging can be difficult without prior experience. We’ll cover detection of time-forging and an improvement on BattlEye’s implementation in the next section.

The second issue is that CPUID latency (execution time) varies widely across processors and can get worse depending on the leaf value given. It takes anywhere from 70 to 300 cycles to execute.

The third issue is the use of SetThreadPriority. This Windows function sets the priority value for the given thread handle; however, the OS doesn’t always honor the request. The call is effectively a suggestion to increase the thread priority, with no guarantee that it takes effect, which leaves this method vulnerable to interference from interrupts or other processes.

Circumvention in this instance is simple, and the time-forging technique described above beats this detection method effectively. If BattlEye wanted to improve this method, some suggestions are offered in the next section.

Improvement

There are multiple improvements that could be made to this function. The first is to deliberately disable interrupts and force thread priority by raising the IRQL to its highest level through CR8. It would also be ideal to isolate the test to a single CPU core. Another improvement would be to use a different timer. Many are not as accurate as the TSC, but one that stands out is the APERF timer, also known as the Actual Performance Clock. This clock is recommended because it is more difficult to cheat and only accumulates counts while the logical processor is in the C0 power state, which makes it a great alternative to the TSC. Other candidates include the ACPI timer, HPET, PIT, GPU timer, NTP clock, or the PPERF timer, which is similar to APERF but only counts cycles that are perceived as instruction execution. A downside of PPERF is that it requires HWP to be enabled, which an interceding operator can disable, rendering it useless.

Given below is an improved version of their detection routine, which must be executed in the kernel:

void battleye::take_time()
{
    // IA32_APERF IS MSR 0xE8
    constexpr unsigned long IA32_APERF_MSR = 0xE8;

    int cpuid_regs[4] = {};

    // DISABLE INTERRUPTS SO THE MEASUREMENT CANNOT BE PREEMPTED
    _disable();

    // READ THE FULL 64-BIT APERF COUNT AROUND CPUID
    const auto aperf_pre = __readmsr(IA32_APERF_MSR);
    __cpuid(cpuid_regs, 1);
    const auto aperf_post = __readmsr(IA32_APERF_MSR);

    const auto aperf_diff = aperf_post - aperf_pre;

    // CPUID IET ARRAY STORE
    // BATTLEYE REPORT TABLE STORE

    _enable();
}

Note: IET just means Instruction Execution Time.

Still, this could be unreliable in detecting a generic hypervisor, as CPUID execution time can vary wildly. A better idea would be to compare the IET of two instructions, one of which has a longer execution latency than CPUID. An example would be FYL2XP1, an arithmetic instruction that takes slightly longer than the average IET of CPUID; it also doesn’t cause any trap into the hypervisor and can be reliably timed. Using these two instructions, a proper profiling function would create an array to store the IETs of CPUID and FYL2XP1. Using the APERF timer, it would read the starting clock for the arithmetic instruction, execute the instruction, and calculate the delta clock count for it. It would then store the result in the IET array for that instruction, repeat for N profiling loops to get an average, and do the same for CPUID. If the CPUID execution time is longer than that of the arithmetic instruction, it’s a reliable indication that the system is virtualized, because on real hardware CPUID should under no circumstances take longer to grab vendor or version information than the arithmetic instruction takes to execute. This detection will also catch those using TSC offsetting/scaling.

Once again, they would need to pin the execution of this test to one core, disable interrupts, and force the IRQL to the maximum value to ensure consistent and reliable data. It would be surprising if the BattlEye developers decided to implement this, as it requires a greater level of effort. There are two other VM detection routines within BattlEye’s kernel driver, but those will be a topic for later 🙂