TASKSOFF/TASKSON is more accurate to what it actually does, but still qualifies as Miyagi instructions.
It is a current limitation of the assembler on the number of characters in an instruction mnemonic, which has other implications such as tab stops (7 characters plus a space). While the limit could be removed, there is currently no time to even examine the implications, so Chip has kept to the 7-character limit.
Instruction names etc. can all be changed later by a new assembler. Let's not get bogged down; just accept the restrictions for now.
How much time/resources (Beau's, presumably) and risk would it take to rework the COG RAM block so that it can perform WIDE reads and writes?
Currently, COG RAM is a 512-long block with 4 ports (3 read and 1 write) for each cog.
I believe that AUX RAM is a 256-long block with 2 ports (1 read and 1 read/write, roughly equivalent to 2 read and 1 write) for each cog.
I am presuming the cog and aux RAMs were originally composed of smaller RAM blocks, repeated.
Could a new building block perhaps be made as an 8-long block with 4 ports (3 read and 1 write), so that 1 read port and 1 write port could also be accessed as a WIDE (8 longs)?
This could then be used as the basis for the COG RAM, with 64 such blocks of 8 longs, and for the AUX RAM, with 32 such blocks of 8 longs. I realise that the AUX RAM would leave 1 read port unused, but that may not be such a waste.
Could this basic 8-long block be used as the WIDE CACHEs also?
This would permit WIDE reads and writes between HUB and COG or AUX to be performed in 1 clock plus setup. This should simplify quite a bit of logic, simplify the TASKing as well, and give us back 7 clocks for instructions (faster).
With this in place, COG-to/from-AUX transfers could perhaps also use a new WIDE instruction.
I think there may be even further advantages to this. We might even be able to control the HUBEXEC caching into cog directly.
Your thoughts???
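To make the proposal concrete, the addressing arithmetic an 8-long building block implies can be sketched in software. This is an illustrative Python model only, not RTL; the class and method names are invented for the sketch:

```python
# Illustrative model of a RAM built from 8-long (256-bit) building blocks.
# A normal access touches one long; a WIDE access moves all 8 longs of a
# block in one operation. Names here are invented for illustration.

class WideRam:
    LONGS_PER_BLOCK = 8  # one WIDE = 8 longs = 256 bits

    def __init__(self, num_blocks):
        self.blocks = [[0] * self.LONGS_PER_BLOCK for _ in range(num_blocks)]

    def read_long(self, addr):                 # normal 32-bit port
        return self.blocks[addr // 8][addr % 8]

    def write_long(self, addr, value):
        self.blocks[addr // 8][addr % 8] = value & 0xFFFFFFFF

    def read_wide(self, addr):                 # one-shot 256-bit transfer
        return list(self.blocks[addr // 8])    # addr aligned to 8 longs

    def write_wide(self, addr, eight_longs):
        self.blocks[addr // 8] = [v & 0xFFFFFFFF for v in eight_longs]

cog = WideRam(64)   # 64 blocks x 8 longs = 512 longs of COG RAM
aux = WideRam(32)   # 32 blocks x 8 longs = 256 longs of AUX RAM

cog.write_wide(0x010, list(range(8)))   # e.g. one HUB->COG wide transfer
print(cog.read_long(0x013))             # -> 3
```

The point of the model is that a WIDE lands on a whole 8-long block boundary, so a single transfer fills what would otherwise take 8 separate long accesses.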
The cog RAM is actually a single block that is internally organized as 64 bits by 256 rows. It uses a 2:1 mux on the bit lines to get a 32-bit data path. It used to be 32 bits by 512 rows, but that aspect ratio didn't work with the layout, so we changed it to 64 x 256 rows to make it almost square. It would be great if it was 256 bits wide, as we could read and write as much in a whack, but that aspect ratio would be really skewed. Squarish is best for RAMs, as the signals don't have to travel as far, and they are faster. It would really open some simple hub exec caching possibilities. We'd just need that look-aside table that David Betz was telling me about.
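The 2:1 column mux Chip describes just means one address bit selects which half of a 64-bit row feeds the 32-bit port. Roughly, in illustrative Python (my own naming, not the actual circuit):

```python
# Model of a 64-bit x 256-row array behind a 2:1 bit-line mux, presenting
# a 512 x 32-bit interface (the 64 x 256 organization described above).

ROWS = 256
rows = [0] * ROWS   # each entry models one 64-bit physical row

def read32(addr):               # addr in 0..511
    row  = addr >> 1            # upper 8 address bits pick the row
    half = addr & 1             # low bit drives the 2:1 bit-line mux
    return (rows[row] >> (32 * half)) & 0xFFFFFFFF

def write32(addr, value):
    row, half = addr >> 1, addr & 1
    mask = 0xFFFFFFFF << (32 * half)
    rows[row] = (rows[row] & ~mask) | ((value & 0xFFFFFFFF) << (32 * half))

write32(5, 0xDEADBEEF)          # longs 4 and 5 share physical row 2
write32(4, 0x12345678)
print(hex(read32(5)))           # -> 0xdeadbeef
```

A 256-bit-wide organization would simply widen each `rows[]` entry and deepen the mux, which is exactly the aspect-ratio trade-off discussed below.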
Thanks Chip. I understand that squarish works best.
I wonder how square 64 rows of 256 bits would be? It would need an 8:1 mux on the bit lines to get to the 32-bit data path, but WIDEs would use the whole 256-bit path. Could the Aux then be an extension of a further 32 rows? This would permit WIDE transfers between AUX and COG too.
Maybe the instruction caches (if still required - see my comments below) could also follow along in this block, i.e. another 4+ rows of 256 bits?
And the data cache another 1 row?
With this mechanism, I guess it would be quite simple to switch the cog blocks at $000+ in wide blocks (8 longs).
Would we really need a separate instruction cache? Could we use a fixed block of Aux or Cog for this? Then we would only need the tags (the LRU).
Maybe we could even control the cache ourselves with PASM wide load instructions?
Could the RAM blocks be built using the OnSemi standard cells, or don't they have multi-port memory blocks?
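The "only the tags are extra" idea amounts to bookkeeping like the following. This is a hypothetical direct-mapped sketch in Python (line size of 8 longs as above; all names invented, and the real design might well use LRU over a set instead):

```python
# Direct-mapped instruction-cache bookkeeping over a fixed region of cog/aux
# RAM: the cached data lives in ordinary RAM lines; only the tags and valid
# state are extra hardware.

LINE_LONGS = 8          # one cache line = one WIDE = 8 longs
NUM_LINES  = 4          # e.g. 4 fixed 8-long blocks set aside in cog RAM

tags = [None] * NUM_LINES                       # hub line held per slot
data = [[0] * LINE_LONGS for _ in range(NUM_LINES)]

def hub_read_wide(hub_line):
    # Stand-in for a real HUB wide read; returns recognizable dummy longs.
    return [hub_line * LINE_LONGS + i for i in range(LINE_LONGS)]

def fetch(hub_addr):
    """Return the long at hub_addr, refilling a line with one WIDE on a miss."""
    line_addr = hub_addr // LINE_LONGS
    slot = line_addr % NUM_LINES          # direct-mapped placement
    if tags[slot] != line_addr:           # miss: one WIDE transfer refills
        data[slot] = hub_read_wide(line_addr)
        tags[slot] = line_addr
    return data[slot][hub_addr % LINE_LONGS]

print(fetch(35))    # -> 35 (miss fills slot 0 with hub line 4, longs 32..39)
print(fetch(32))    # -> 32 (hit in the same line, no HUB access)
```

Under this scheme a PASM-controlled wide load would just be `fetch` with the tag update driven by software instead of by miss logic.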
Whereas now the aspect ratio of the cog RAM cells is ~1:1 for 64x256, it would become ~16:1 for 256x64, since it would be doubled on the X axis twice and halved on the Y axis twice.
It would make caching really simple and fast, for sure. Just add some extra rows.
AUX would be tough to integrate into the cog RAM because it has an asynchronous port. AUX could be made 256 bits wide, too, though.
I don't know what kinds of memories OnSemi could generate, but if they didn't have something 256 bits wide, they'd probably have something 32 or 64 bits wide that could be used in multiple instances.
This is how Prop3 should be made, with very wide memories.
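Chip's 16:1 figure follows directly from the geometry. Assuming, purely for illustration, a bit cell about 4x wider than it is tall (which is what would make the 64-bit x 256-row array come out "almost square"), the arithmetic is:

```python
# Aspect-ratio check for the cog RAM organizations discussed above.
# CELL_W/CELL_H are assumed relative bit-cell dimensions, chosen so that
# 64 bits x 256 rows is square; real cell dimensions will differ.

CELL_W, CELL_H = 4.0, 1.0

def macro_aspect(bits_wide, rows):
    width, height = bits_wide * CELL_W, rows * CELL_H
    return width / height

print(macro_aspect(64, 256))    # -> 1.0  (current organization, ~square)
print(macro_aspect(32, 512))    # -> 0.25 (original organization, too tall)
print(macro_aspect(256, 64))    # -> 16.0 (256-bit-wide proposal, very flat)
```

Each doubling of width with a matching halving of depth multiplies the ratio by 4, hence 1:1 becoming 16:1 after two such steps.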
Oh, I didn't realise the RAM cell aspect ratio would scale like this.
Looking at this differently, what if the hub ram was in the center of the die, surrounded by rectangular cog and aux memory? Then each cog's logic etc would be spread around the outside of this block. Just a thought.
It is critical that all cogs' logic cells converge in the center of the die because there are tons of interconnections among them. It would never meet timing requirements, otherwise.
OK. Well at least it was worth thinking about.
Even with Aux sitting on the long side, together with the caches, it would still be > 4:1 ratio.
        NOTP    #0      'these instructions execute according to the time slots
        NOTP    #0
        TLOCK           'execute only this task, beginning at the next instruction
        NOTP    #0      'these instructions execute at full-speed
        NOTP    #0
        NOTP    #0
        NOTP    #0
        TFREE           'resume multitasking after two more instructions
        NOTP    #0      'these two instructions still execute at full-speed
        NOTP    #0
        NOTP    #0      'these instructions now execute according to the time slots
        NOTP    #0
Once TLOCK executes, the next same-task instruction will be the first in a stream of full-speed instructions from that task. When TFREE executes, there will be two more instructions executing at full-speed before multitasking resumes from where it left off, before the TLOCK. An intervening SETTASK can be executed, but won't take effect until the third instruction after TFREE. If a JMPTASK affects a TLOCK'd task, the TLOCK gets cancelled.
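The timing described above can be mimicked with a toy model (illustrative Python, not the real pipeline; the two-instruction latency after TFREE is modeled explicitly as instructions already in flight):

```python
# Which execution mode each instruction in the listing above runs in.
# TLOCK takes effect at the next instruction; after TFREE, two more
# instructions are already in the pipe and still run at full speed.

program = ["NOTP", "NOTP", "TLOCK", "NOTP", "NOTP", "NOTP", "NOTP",
           "TFREE", "NOTP", "NOTP", "NOTP", "NOTP"]

modes, full_speed, drain = [], False, 0
for op in program:
    modes.append("full-speed" if full_speed or drain else "time-slots")
    if drain:
        drain -= 1               # drain the instructions already in the pipe
    if op == "TLOCK":
        full_speed = True        # next instruction starts the locked run
    elif op == "TFREE":
        full_speed = False
        drain = 2                # two instructions already fetched/in pipe

for op, mode in zip(program, modes):
    print(f"{op:5} {mode}")
```

Running this reproduces the comments in the listing: the four NOTPs after TLOCK and the two after TFREE show as full-speed, and the final two fall back to time slots.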
TLOCK sounds fine, but as there are delay slots, perhaps it should be TFREED?
Alerting users/maintainers that there is a delay is a good idea. (I remain uneasy with the term LOCK.)
Not realising there is a phase/delay effect here carries less risk than with other opcodes, where getting that wrong can change what you think is running. The TFREED delay is subtle, and code flow is not affected - just the precise time of the 'gear change'.
hmm..
What happens if TFREED is immediately followed by a WAIT on a flag from another task?
Will that wait forever, as the resume has not occurred yet and the PC has stalled?
The only reason why there is a two-instruction delay is because those two instructions are already in the pipe (well, one is in the pipe and the other's fetch address is being issued). There is no lingering state, so whatever happens next is fine.
So that means the delays are really in clock cycles and not in opcodes?
An immediate WAITxx opcode will have just a brief pause, until normal threads resume?
...What happens if TFREED is immediately followed by a WAIT on a flag from another task ?
Will that wait forever, as the resume has not occurred yet & PC has stalled ?
Ah, I missed the subtlety of your question the first time.
Most WAITxxx instructions will loop to themselves in multitasking mode (set/cleared when SETTASK executes), so this wouldn't be a problem. WAITPEQ/WAITPNE always stall the pipeline, so you probably wouldn't be using those instructions in a multitasking program, anyway.
The delays are through pipeline stages, which do not always relate 1:1 to clock cycles.
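Chip's point that most WAITxxx instructions "loop to themselves" under multitasking can be illustrated with cooperative generators (a hedged Python sketch; the flag and task names are invented, and real time slots are hardware, not `yield`):

```python
# In multitasking mode, a waiting task's WAIT re-executes as a jump-to-self
# in its own time slots, so the other tasks keep running instead of the
# whole pipeline stalling.

flag = {"ready": False}
trace = []

def task_a():                       # waits on a flag set by task_b
    while not flag["ready"]:        # WAITxxx as a self-loop: burns only
        yield "A:wait"              # its own slot each time around
    yield "A:go"

def task_b():
    yield "B:work"
    flag["ready"] = True            # the other task releases the wait
    yield "B:done"

tasks = [task_a(), task_b()]
for _ in range(3):                  # three rounds of round-robin slots
    for t in tasks:
        try:
            trace.append(next(t))
        except StopIteration:
            pass

print(trace)   # -> ['A:wait', 'B:work', 'A:wait', 'B:done', 'A:go']
```

A pipeline-stalling wait (WAITPEQ/WAITPNE) would be the case where `task_a` never yields at all, so `task_b` could never set the flag - which is why Chip suggests avoiding those in multitasking code.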
Comments
BEG_FAST & END_FAST
It's a burst kind of thing, sooo...
BURST & NOBURST, or BEGBRST & ENDBRST
SET_FAST & END_FAST
etc...
Heh turbo is always fun!
TURBO & NOTURBO
.
.
happy carefree tasking code
.
.
TASKOFF
.
.
NASTY PIGGISH NON-TASKING CODE
.
.
.
TASKON
.
.
friendly sharing code
.
.
(The Miyagi instructions)
Love your sense of humor Chip.
+1
Before I saw this I was favouring BURST/NOBURST
Remember everyone, we are limited to 7 characters. Kinda just too short but too much to change.
Just for a laugh (continuing on from Rick's example)...
.
.
happy carefree tasking code
.
.
PIGGY
.
.
NASTY PIGGISH NON-TASKING CODE
.
.
.
NOPIGGY
.
.
friendly sharing code
.
.
Why 7 ?
BTW what is the Miyagi reference regarding this?
[video=youtube_share;SMCsXl9SGgY]
Thinks it's a load, until the reveal!
After some practice, he enters tourney against nemesis, prevails, gets girl, everybody who matters is happy!
[video=youtube_share;5pL6uUYdWbU]
I thought everyone knew Miyagi opcodes ?
If it really needs to be 7, then TSKSOFF and TSKSON fits
I think the COG RAM is a full custom cell, and morphing to/from wide sounds like more muxes to me?
Byte and word read/writes of AUX would be nice too.
I agree! That might have to wait for another chip, at this point, though.
P.S. These instruction names can be changed.
(or perhaps WAXON WAXOFFD <grin>)