Qb001: My understanding of fetching includes a mechanism to "de-fetch" in case an instruction turns out not to be executed. Think of all the conditional branches, for instance, that have to be de-fetched.
That is basically correct...
Qb001: And it also explains why the latest Pentium processors have this "branch prediction mechanism", whose role is to try to predict whether a branch will be taken or not.
Correct - although this mechanism is used by almost every "big" CPU by now, since it helps boost clock frequencies with longer prefetch, decode and execution queues. The 80x86 family suffers from the longest queues by far due to its completely brain-damaged base concept, which is still being dragged along at all cost (and with progressively diminishing returns).
When the prediction mechanism fails to predict a conditional branch properly, the entire rest of the queue has to be purged - and in modern CPUs some of these instructions may even have begun executing internally. This is quite "expensive" with a long queue, so the quality of the branch prediction becomes crucial to the resulting overall performance.
Back to self-modifying code: In order to make it work, you'd need to make sure the modified instruction hasn't been fetched already by the time the modifying instruction completes. Explicit synchronisation by specialized instructions is the cleanest way to prevent that from happening (and when you're dealing with a cache, you've got no other choice). Merely "stuffing the pipeline" with as many instructions as necessary to keep the distance is a hack that may or may not work, depending on the specific CPU variant you're dealing with... (execute your code on a new CPU with a longer queue than your code expects and you're screwed again!)
Caches are especially tricky, as most modern CPUs have separate data and instruction caches: your modifying write operation will go into the data cache, but the instructions will be fetched through the instruction cache. This means you may need to do the following:
a) write your modified code (this will actually change the data cache and may not be visible in main memory immediately)
b) "flush" the data cache so it's actually written to main memory (that may not happen immediately, to save bandwidth)
c) "purge" the instruction cache so it won't "remember" the old incarnation of the newly modified code
d) purge the prefetch queue (if that isn´t done implicitly by purging the i-cache)
e) jump to the modified code (will force the i-cache to read from main memory, then start fetching the new code from the i-cache)
The actual topology of your CPU will determine which of the above steps are necessary. Some operating systems will also protect code memory from modifying data access, just to keep things interesting...
With all of the above it becomes clearer why in most cases self-modifying code can't be used to improve execution efficiency: it requires far too "costly" operations to be worthwhile in almost every case, except on "smaller" CPUs without queues or caches.