Cycle Exact Emulation

From the Macscene Wiki

The following was taken from Richard Bannister's FAQ

Traditional emulators tend to run each part of the hardware for a period of time, only switching a couple of times per frame. In this situation, you might run the CPU for a given period of emulated time, then the video hardware, then the sound hardware. If you only have to switch control once every 10ms of emulated time, you can achieve good performance on any modern computer.
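The time-slice approach described above can be sketched roughly as follows. This is only an illustration: the `Component` class, `run_time_slices`, and the 1 MHz clock rate are assumptions made for the example and are not taken from any particular emulator.

```python
CPU_HZ = 1_000_000   # assumed clock rate of the emulated CPU
SLICE_MS = 10        # slice length mentioned in the text


class Component:
    """Stand-in for a CPU, video or sound core (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.cycles_run = 0

    def run(self, cycles):
        # A real core would emulate its hardware here; we only count cycles.
        self.cycles_run += cycles


def run_time_slices(components, slices, cycles_per_slice):
    """Give each component a whole slice in turn; return the switch count."""
    switches = 0
    for _ in range(slices):
        for component in components:
            component.run(cycles_per_slice)
            switches += 1
    return switches


cycles_per_slice = CPU_HZ * SLICE_MS // 1000   # 10,000 cycles per slice
parts = [Component("cpu"), Component("video"), Component("sound")]
# 100 slices of 10 ms make up one emulated second.
switches = run_time_slices(parts, slices=100, cycles_per_slice=cycles_per_slice)
```

Under these assumptions, one emulated second costs only 300 switches between components, which is why this style of emulator is cheap to run.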

However, some systems, particularly those with the 6502 processor or variants, often have extremely tight timing, and over the life of the hardware programmers learnt how to push the systems to the absolute limit. Consider a piece of code that modifies a video register ten times in a single time slice. If the video hardware is not updated every time a register write occurs, then nine of those ten writes will already have been overwritten by the time the update does happen, and only the final value is ever displayed. A cycle exact emulation prevents this. Unfortunately, a 1 MHz 6502 has one million cycles per second, which equates to a lot of jumps between different parts of the code. This is what costs so much performance.
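The register-write problem can be illustrated with a small sketch. Everything here is hypothetical: a single video register, a slice of ten scanlines, and one write per scanline. `coarse_slice` draws only after the CPU has finished its slice, while `catch_up_on_write` brings the video hardware up to date before each write is applied.

```python
def coarse_slice(writes, lines_per_slice):
    """Apply every write first, then draw the whole slice in one go."""
    register = 0
    for _line, value in writes:
        register = value                 # earlier values are overwritten
    return [register] * lines_per_slice


def catch_up_on_write(writes, lines_per_slice):
    """Draw up to the line of each write before applying that write.

    Assumes `writes` is sorted by line number.
    """
    register = 0
    drawn = []
    for line, value in writes:
        drawn.extend([register] * (line - len(drawn)))   # video catches up
        register = value
    drawn.extend([register] * (lines_per_slice - len(drawn)))
    return drawn


# Ten writes in one slice: line n sets the register to n + 1.
writes = [(line, line + 1) for line in range(10)]
coarse = coarse_slice(writes, 10)          # every line shows the last value
exact = catch_up_on_write(writes, 10)      # each line shows its own value
```

In the coarse version every line is drawn with the tenth value; in the catch-up version each line is drawn with the value that was current when the beam reached it.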

It is true that the majority of titles do not require anything like the accuracy of timing that cycle exact emulation provides. However, current computer hardware is well up to the challenge of accurate emulation, so some authors feel that it is time to do things properly.

The following refers specifically to BSNES as opposed to Snes9x, but the concepts can be generalized to other emulators.

Most emulators execute ROMs one CPU instruction at a time. Instructions at this level (equivalent to assembly) are things like 'read a byte from memory' or 'add the contents of registers A and B.' e.g.

 sta $00,x
  • store the contents of the accumulator (the A register) at address $00 offset by the value of the x register

Over the course of 'executing' one instruction, however, real hardware must retrieve the byte(s) encoding the opcode from memory, parse out any operands, possibly make another memory read to get operand values that couldn't fit in the opcode (such as a 16-bit memory address, which would never fit in a 16-bit opcode that also has to spend bits on specifying what to do and where to put the result), link the proper registers as inputs to the arithmetic unit / bus / whatever, and possibly write a value back to a memory location. Each of these steps takes a particular number of clock ticks, meaning a given opcode can take ~30(?) clock ticks to execute. Example:

 [1: 6] Read <read opcode from regs.pc  >
  • Step 1 in executing the 'sta' store instruction is to actually retrieve the instruction from the memory address held in the program counter (regs.pc) so you know it's a 'sta' instruction to begin with.
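One way to picture an instruction as a series of steps is a generator that pauses after each bus cycle. The sketch below is loosely modelled on the four-cycle 6502 form of 'sta $nn,x' and is deliberately simplified; it does not reproduce the clock counts shown in the figure, and the `cpu` dictionary and function name are assumptions made for the example.

```python
STA_ZP_X = 0x95   # 6502 encoding of 'sta $nn,x'


def sta_zp_x(cpu, memory):
    """Yield once per bus cycle of a simplified 'sta $nn,x'."""
    if memory[cpu["pc"]] != STA_ZP_X:      # cycle 1: fetch the opcode
        raise ValueError("not a 'sta $nn,x' opcode")
    cpu["pc"] += 1
    yield "fetch opcode"
    base = memory[cpu["pc"]]               # cycle 2: fetch the operand byte
    cpu["pc"] += 1
    yield "fetch operand"
    addr = (base + cpu["x"]) & 0xFF        # cycle 3: add X to the base address
    yield "index"
    memory[addr] = cpu["a"]                # cycle 4: write the accumulator
    yield "write"


memory = [0] * 256
memory[0x10:0x12] = [STA_ZP_X, 0x00]       # the instruction 'sta $00,x'
cpu = {"pc": 0x10, "a": 0x42, "x": 0x05}
steps = list(sta_zp_x(cpu, memory))        # run all four cycles
```

Because the generator hands control back after every cycle, a caller could let other hardware run between the steps instead of treating the whole instruction as one indivisible unit.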

If that weren't bad enough, things like memory reads and writes take time. Values must physically propagate across the circuit board and through silicon gates, and all memories have fixed 'set-up' and 'hold' time requirements that must be met to write a value successfully or to specify the address to read from. For memory elements larger than single-value storage registers, these times can be longer than 1 cycle. Whether or not a value is actually read/written can depend on how long the CPU makes data available. e.g.

  • During the retrieval of the 'sta' instruction from read-only data memory, the CPU has to keep the wires specifying the address to read from held at the proper values for 2 cycles so the memory can recognize the address. The memory will then guarantee the data at that address will be valid for 2 cycles so the CPU can use it.

While the CPU (or other processor) is in the middle of executing an instruction, other processors can be doing other things and may change the values they make available to the CPU (or other processors). e.g.

  • While the CPU is busy waiting on the memory in the previous example, the graphics processor, left to its own devices, has had time to render one pixel. The (x,y) position of the scan beam is therefore different, and any changes made to the graphics buffer will not be reflected at any point above/left of the new position until the next frame. (NES games do a lot of this intentionally for parallax and splitscreen effects.)

Most emulators ignore all the mucky timing within an instruction, simply doing whatever computation the instruction specifies in effectively no time, using the values from other processors as they were when the instruction was encountered. However, some tightly coded games rely on cramming multiple changes in other processors into a single CPU instruction timeframe and having the CPU use the new values as they become available. e.g.

  • Snes9x would have finished the entire 'sta' by now, but it would have rendered an 8-pixel chunk of graphics based on old values, even though the 'sta' could very well have been writing to graphics memory and changing what the last 2 pixels of that chunk should have been.*

BSNES (among other cycle-exact emulators) accurately emulates the time it takes to do the individual steps which exist at one level of generalization below single instructions, and can occasionally subdivide even those steps when necessary. This allows games which use timing as a method to know/control some part of the program/output to do so accurately. The cost is that evaluating all parts of the system 5-10 times per instruction takes more than 5-10 times as long as doing so once per instruction.

* this is a contrived example and probably isn't true of this particular instruction, but it shows what can happen in general
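The contrived 8-pixel example can be written out as a toy model. The numbers are assumptions chosen to match the example rather than real hardware timings: the graphics processor draws one pixel per CPU cycle, the instruction lasts eight cycles, and the store becomes visible after six of them.

```python
PIXELS_PER_INSTRUCTION = 8   # assumed: one pixel drawn per CPU cycle
WRITE_CYCLE = 6              # assumed: the store lands after six cycles


def opcode_based(old, new):
    """Render the whole chunk from the old value, then apply the store."""
    vram = old
    pixels = [vram] * PIXELS_PER_INSTRUCTION
    vram = new
    return pixels, vram


def cycle_based(old, new):
    """Step the CPU and the graphics processor together, cycle by cycle."""
    vram = old
    pixels = []
    for cycle in range(PIXELS_PER_INSTRUCTION):
        if cycle == WRITE_CYCLE:
            vram = new           # the store becomes visible here
        pixels.append(vram)      # one pixel is drawn this cycle
    return pixels, vram


coarse_pixels, _ = opcode_based("old", "new")
exact_pixels, _ = cycle_based("old", "new")
```

Both versions end with the same memory contents, but only the cycle-based one draws the last two pixels of the chunk from the new value.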


by Byuu in explanation of BSNES - (note that I messed up the last cycle by cutting the write one clock too short, I'll fix the image in a few days)

Right now, ZSNES, SNES9x et al use the instruction breakdown portrayed in the left-hand side of this image. Now imagine that all those extra horizontal clock lines going down the image were the PPU or APU updating something that would affect the CPU operation. An opcode-based emulator would miss those updates, but a real system would not. This is basically parallelism.
Currently, bsnes uses the cycle-based emulation, with hackery to simulate the subcycle-based example, but without the synchronization to make it correct.
Consider each horizontal line as a break in execution. You have to break out of the CPU core, then synchronize and call the PPU and APU and synchronize them back up, and then re-enter into the CPU core for each line.
On SNES9x, this would only need to occur once. In the example listed (plus the off-by-one error), this would need to occur sixteen times. And each time would be more complex as each individual action is broken down into smaller parts.
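The difference in synchronization cost can be counted with a small sketch. The `Scheduler` class and `run_instruction` are hypothetical, and the segment lengths (sixteen segments of two clocks each) are invented for illustration; only the counts of one and sixteen come from the text.

```python
class Scheduler:
    """Tracks how often control leaves the CPU core (hypothetical)."""

    def __init__(self):
        self.cpu_clock = 0
        self.ppu_clock = 0
        self.apu_clock = 0
        self.syncs = 0

    def sync(self):
        # Break out of the CPU core and let the PPU and APU catch up.
        self.ppu_clock = self.cpu_clock
        self.apu_clock = self.cpu_clock
        self.syncs += 1


def run_instruction(scheduler, segments, sync_every_segment):
    """Advance the CPU through the clock segments of one instruction."""
    for clocks in segments:
        scheduler.cpu_clock += clocks
        if sync_every_segment:
            scheduler.sync()
    if not sync_every_segment:
        scheduler.sync()         # opcode-based: a single sync at the end


segments = [2] * 16              # assumed: sixteen breaks, two clocks apart

opcode_style = Scheduler()
run_instruction(opcode_style, segments, sync_every_segment=False)

cycle_style = Scheduler()
run_instruction(cycle_style, segments, sync_every_segment=True)
```

Both schedulers finish at the same clock, but the opcode-style run synchronizes once while the cycle-style run leaves and re-enters the CPU core sixteen times for the same instruction.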