At one point you, or maybe it was Ken, suggested that you might make the RTL for P1 available after P2 shipped. Now it seems that the RTL for P1+ is going to be an extension of the RTL for P1. Do you still plan to release any RTL either before or after you ship the next chip? Did you by any chance archive the RTL for P1 before you started morphing it into P1+ or P2 or whatever the chip being described in this thread will be called?
I made a mistake in thinking about and explaining the indirect issues. We CAN do an extra clock to achieve indirection, without any clock speed penalty. We would let the first clock go through, reading D and S according to the D and S fields in the instruction. Then, on the next clock, we could issue the reads for indirect D and S using some implied hidden registers. This would help speed and code density, but would give us only maybe two indirect registers. I kind of like the universal approach using two instructions, but it is wasteful, compared to what could be done with implied indirection via special INDA/INDB register locations.
> but it is wasteful, compared to what could be done with implied indirection via special INDA/INDB register
But does the INDA/INDB have optional post-inc/pre-dec flags? Any space saved is lost if you have to put a sub before or an add after the INDx instruction.
Regarding REPS and the code-size matter, remember that there's the $ for origin:
REPS #count,#:end-$
inst
inst
:end inst
The problem with this simpler, vanilla form, is the code fails if that last inst opcode is a double-size one
Is there still a delay following REPS before the looping block, or has that gone ?
If REPS now starts immediately, then a single label form is ok, a finite delay needs two labels.
REPS #count,EndLoop
inst
inst
inst ' can be one or two sized inst
:EndLoop
yes, that also works well, if there is no lead-in delay on REPS
I think REPS cannot be nested, so this form is fine, and if anyone does REPS..REPS ENDR ENDR it can spit an error
I think everything should be documented in sysclocks, I've been confused a number of times, which suggests to me that those new to the prop will generally go HUH??? at the two clocks.
Yes, and if there are going to be 3 SysClk opcodes, that gives little choice, as you cannot really spec 1.5 OpCodeClks ?
That would make mnemonics 2/3/4 SysClks in speed.
I agree with this. It is confusing. Sysclocks would be unambiguous.
Is there a delay now on REP? I thought I read we don't have one sans the complex pipeline. If that is true, I like the form Bill posted last the best. (unable to quote on this device in any sane way)
REP has no delay slots in this design, since there's no pipeline.
The way we can get INDA/INDB to have all the different pre/post-inc/dec possibilities is to make 8 registers worth of INDx registers:
That way there is no need to have so many registers...
Is there a spare opcode bit, that can re-map these to give the decode choices, but not to consume (valuable) register address space with fixed-use-locations ?
I think you can safely deep-six the four level stack.
Good point about task switching. So how about
How about SETINDMOD #ddd,#sss
ddd - applies to whichever index register is used as the destination
0xx = use INDd value directly, d=A/B
100 = INDd++
101 = INDd--
110 = ++INDd
111 = --INDd
sss - applies to whichever index register is used as the source
0xx = use INDs value directly, s=A/B
100 = INDs++
101 = INDs--
110 = ++INDs
111 = --INDs
That way there is no need to have so many registers... and the INDMOD stays in effect until next SETINDMOD
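As a rough illustration of the mode bits proposed above, here is a minimal Python sketch of how a 3-bit mode might turn an index register into an effective address plus an updated pointer. It assumes the four active encodings are 100, 101, 110 and 111, and that the pointer wraps at 512 cog locations; the function name `apply_ind_mode` is made up for this example and is not part of any proposal in the thread.

```python
COG_SIZE = 512  # assumed cog RAM size in longs; pointer wraps at 9 bits


def apply_ind_mode(ptr, mode):
    """Return (effective_address, new_ptr) for a 3-bit SETINDMOD-style mode."""
    mask = COG_SIZE - 1
    if mode & 0b100 == 0:
        # 0xx: use the pointer value directly, no update
        return ptr, ptr
    if mode == 0b100:
        # INDx++ : use the pointer, then increment it
        return ptr, (ptr + 1) & mask
    if mode == 0b101:
        # INDx-- : use the pointer, then decrement it
        return ptr, (ptr - 1) & mask
    if mode == 0b110:
        # ++INDx : increment first, then use the new value
        new_ptr = (ptr + 1) & mask
        return new_ptr, new_ptr
    # 0b111, --INDx : decrement first, then use the new value
    new_ptr = (ptr - 1) & mask
    return new_ptr, new_ptr
```

Because the mode would stay in effect until the next SETINDMOD, a loop could call this once per access with the same mode value.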
The only way we could get a spare bit, or two, would be to reduce cog RAM to 256 longs. Then, we could get indirect bits for both D and S. And cog size would be cut in half. It might be viable with hub exec, you know. 32 cogs!
With 512 cog locations (minus shadow regs) it is still barely possible to do a 256 entry lookup table for vm's using cog memory, without taking a hub cycle hit.
16 cogs with 512 registers, and tasks, is FAR more useful than 32 cogs with 256 registers.
Good idea on the mode bits. If we gave two registers per INDx, we could pick between two of the preset modes. That would enable stacks.
That sounds too costly, I was thinking more along the lines of 'blurring' the 9b + 9b opcode fields to something like 10b+ 8b in some sparse cases, where that split allows things like dedicated pointers/sfr (special function registers) 'above' usual 512 limit, and now the #Immediate value is smaller, & the source register form is limited to one of the lower 256.
-but it has avoided eating-into general purpose RAM.
But we need to be able to do stuff like JMPSW INDA,++INDA.
....
If this is for task switching, do we then not also need the automatic wrapping inside a table (FIXINDx instructions) ?
I have the feeling this needs a lot of logic gates (comparators, muxes, registers) ?
I think we can always do it with hardcoded task switches instead of a task-list. For example with 3 tasks:
I'd like to avoid that wrapping issue, as it is costly. I hope to have all these issues settled tonight. There are many other things that need attention.
I could see that, but how would you know if you were in a 10b + 8b situation?
One of these other things:
I think we need a word-move instruction for SDRAM drivers. We have Nibble and Byte instructions (ROLNIB, GETNIB, SETNB etc.) but nothing like that with word size.
If we want to access 16bit wide SDRAM with half the sysclock rate we need to do the move and shift in one single cycle instruction.
My proposal is a MOVWORD instruction which moves the lower or higher 16bits from S to the lower or higher 16bits of D. The other 16bits in D must not be affected:
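To make the intended behaviour concrete, here is a small Python model of such a word move. It is only a sketch of the semantics described in the proposal; the function name and the two boolean selectors (`dst_hi`, `src_hi`) are assumptions for illustration, not an actual encoding.

```python
def movword(d, s, dst_hi, src_hi):
    """Copy one 16-bit half of s into one 16-bit half of d (32-bit values).

    The half of d that is not selected is left unchanged.
    """
    word = (s >> 16) & 0xFFFF if src_hi else s & 0xFFFF
    if dst_hi:
        # replace the upper word, keep the lower word of d
        return (d & 0x0000FFFF) | (word << 16)
    # replace the lower word, keep the upper word of d
    return (d & 0xFFFF0000) | word
```

For example, `movword(0x11112222, 0xAAAABBBB, True, False)` places the low word of S into the high word of D, which is the move-and-shift combination an SDRAM driver would otherwise need two instructions for.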
In common with SDRAM which would want 16 bit and a CS# strobe, LCD parallel interfaces are similar.
Some need 24b, which may be best via the video , others need 16b i8080 bus models.
A useful opcode here could be a double-move, that does 2 x16b moves on a 32 bit register.
With 2 SysClks available, it may be possible to get close to 200MHz bursts ?
When used as a stack, you are either PUSHing (CALLing) or POPing (RETing). So you choose PUSH INDA++ which pushes first, then increments, and POP --INDA which decrements first then pops. This only requires 2 register spaces for INDA. For the occasional times you require just take a copy from the stack you have to POP then PUSH.
If we want 2 stacks, the probability is we will go from each end. Therefore make INDB work the opposite - PUSH --INDB and POP INDB++, and 2 more register spaces.
Now, you will notice that INDA & INDB both have pre-decrement and post-increment. They just get reversed for CALL to RET and also for PUSH to POP. Silicon should be simplified here.
This also gives us the freedom to use INDx++ and --INDx in standard op codes too.
Might INDA & INDB be better called STACKA and STACKB ?
BTW I agree, the 4 level LIFO (which is only 17 address bits + flags) can go.
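The two-stack idea can be modelled in a few lines of Python. This is a sketch under stated assumptions: the class name `DualStack` is hypothetical, the stacks share one block of RAM and grow toward each other, and there is no overflow or underflow checking.

```python
class DualStack:
    """Two stacks in one RAM block: INDA grows upward, INDB grows downward."""

    def __init__(self, size):
        self.ram = [0] * size
        self.inda = 0      # next free slot at the low end
        self.indb = size   # one past the most recent entry at the high end

    def push_a(self, value):
        # PUSH INDA++ : write first, then increment
        self.ram[self.inda] = value
        self.inda += 1

    def pop_a(self):
        # POP --INDA : decrement first, then read
        self.inda -= 1
        return self.ram[self.inda]

    def push_b(self, value):
        # PUSH --INDB : decrement first, then write
        self.indb -= 1
        self.ram[self.indb] = value

    def pop_b(self):
        # POP INDB++ : read first, then increment
        value = self.ram[self.indb]
        self.indb += 1
        return value
```

Only post-increment and pre-decrement are needed in this arrangement, which matches the point that two register spaces per pointer would be enough.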
MOVNIB D,S/#, #0..7
MOVBYTE D,S/#,#0..3
MOVWORD D,S/#,#0..1
where the NIB/BYTE/WORD (rightmost bits) of S/# replace the bits in D as indexed by #0..n
and
GETNIB D,S,#0..7
GETBYTE D,S,#0..3
GETWORD D,S,#0..1
where D is left-zero filled and #0..n is an index into S.
These 6 instructions could map nicely to one set of opcode sets.
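One way to see why these six could share decode logic is that they differ only in field width. The Python sketch below models both families with a width parameter (4, 8 or 16 bits); the function names `mov_field` and `get_field` are invented for this illustration.

```python
def mov_field(d, s, index, width):
    """MOVNIB/MOVBYTE/MOVWORD: low `width` bits of s replace field `index` of d."""
    mask = (1 << width) - 1
    shift = index * width
    cleared = d & ~(mask << shift) & 0xFFFFFFFF  # clear the target field in d
    return cleared | ((s & mask) << shift)


def get_field(s, index, width):
    """GETNIB/GETBYTE/GETWORD: field `index` of s, zero-filled on the left."""
    return (s >> (index * width)) & ((1 << width) - 1)
```

With `width=4` the index runs 0..7, with `width=8` it runs 0..3, and with `width=16` it runs 0..1, matching the ranges listed above.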
Can you remind me what ROLNIB/BYTE/WORD does?
My preference when performing a MOVe to cog memory is to use "MOVxxx". Similarly I prefer MOVD, MOVS, MOVI or MOVINST, and MOVCOND.
My preference to use "SETxxx" is for buried registers or setting modes.
Thinking about what has been discussed above, I wonder if an alternative could be...
RDQUADC D/#,S/PTRA++/PTRB++ 'reads a quad long into cog ram at a quad boundary (no need for buried DCACHE) and resets the internal OFFSET 4-bit counter.
RDBYTEC D,S/# WC 'reads a byte from the cog where S/# specifies the location D/# used in the RDQUADC, and increments the OFFSET counter +1. "C" set if OFFSET wraps = last byte.
RDWORDC D,S/# WC 'reads a word from the cog where S/# specifies the location D/# used in the RDQUADC, and increments the OFFSET counter +2. "C" set if OFFSET wraps = last word.
Note: mixing RDBYTEC and RDWORDC is not supported.
Note if the user gives a different S/# in RDBYTEC or RDWORDC than the D/# used in the RDQUADC then results will be unpredictable (reads from the cog location specified)
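The OFFSET counter behaviour can be sketched in Python as follows. This is a model of the idea only: the class name `QuadReader` is hypothetical, a quad is taken to be four longs (16 bytes), and little-endian byte order within each long is an assumption that the post does not state.

```python
class QuadReader:
    """Model of RDQUADC plus sequential RDBYTEC/RDWORDC with a 4-bit OFFSET."""

    def __init__(self):
        self.cog = [0] * 512   # cog RAM, 32-bit longs
        self.offset = 0        # 4-bit byte offset into the current quad

    def rdquadc(self, base, longs):
        # Load four longs at a quad boundary and reset OFFSET
        assert base % 4 == 0 and len(longs) == 4
        self.cog[base:base + 4] = longs
        self.offset = 0

    def _read(self, base, nbytes):
        long_value = self.cog[base + self.offset // 4]
        shift = (self.offset % 4) * 8
        value = (long_value >> shift) & ((1 << (8 * nbytes)) - 1)
        self.offset = (self.offset + nbytes) & 0xF
        carry = self.offset == 0   # C set when OFFSET wraps (last item)
        return value, carry

    def rdbytec(self, base):
        return self._read(base, 1)

    def rdwordc(self, base):
        return self._read(base, 2)
```

Sixteen RDBYTEC calls (or eight RDWORDC calls) walk the whole quad, and the carry returned by the last one signals that a new RDQUADC is needed.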
Comments
We plan to release Prop1 code, at first.
Good to hear
Having that would be good for my serial-comms experiments.
I knew you would figure out a way.
Ummm... could we have our cake, and eat it too?
Adding the universal ALT instruction gives us MANY additional pointers, and we can use INDA/INDB for the highest speed uses.
But does the INDA/INDB have optional post-inc/pre-dec flags? Any space saved is lost if you have to put a sub before or an add after the INDx instruction.
The problem with this simpler, vanilla form, is the code fails if that last inst opcode is a double-size one
Is there still a delay following REPS before the looping block, or has that gone ?
If REPS now starts immediately, then a single label form is ok, a finite delay needs two labels.
addit :
yes, that also works well, if there is no lead-in delay on REPS
I think REPS cannot be nested, so this form is fine, and if anyone does REPS..REPS ENDR ENDR it can spit an error
I was wondering the same thing.
I'm unclear whether Chip means an extra opcode clock (4 SysClk opcode) or an extra SysClk (3 SysClk opcode) in #645?
fyi,
I think everything should be documented in sysclocks, I've been confused a number of times, which suggests to me that those new to the prop will generally go HUH??? at the two clocks.
Yes, and if there are going to be 3 SysClk opcodes, that gives little choice, as you cannot really spec 1.5 OpCodeClks ?
That would make mnemonics 2/3/4 SysClks in speed.
Is there a delay now on REP? I thought I read we don't have one sans the complex pipeline. If that is true, I like the form Bill posted last the best. (unable to quote on this device in any sane way)
REP has no delay slots in this design, since there's no pipeline.
The way we can get INDA/INDB to have all the different pre/post-inc/dec possibilities is to make 8 registers worth of INDx registers:
$1F8 = INDA
$1F9 = INDA++
$1FA = INDA--
$1FB = ++INDA
$1FC = INDB
$1FD = INDB++
$1FE = INDB--
$1FF = ++INDB
MOV INDB++,INDA++ ...same as... MOV $1FD,$1F9
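Read as a decode rule, the table maps the low three bits of a register address in $1F8..$1FF to a pointer and an update mode. A small Python sketch of that mapping, with the function name `decode_ind` and the mode strings chosen purely for illustration:

```python
IND_BASE = 0x1F8
# Order follows the table: INDx, INDx++, INDx--, ++INDx
MODES = ["direct", "post-inc", "post-dec", "pre-inc"]


def decode_ind(addr):
    """Return (pointer_name, mode) for $1F8..$1FF, or None for a normal register."""
    if not IND_BASE <= addr <= 0x1FF:
        return None
    n = addr - IND_BASE
    pointer = "INDA" if n < 4 else "INDB"
    return pointer, MODES[n & 3]
```

With four slots per pointer there is no room for a pre-decrement variant, which is what the mode-bit suggestions elsewhere in the thread try to address.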
aaa
0xx = use INDA value directly
100 = INDA++
101 = INDA--
110 = ++INDA
111 = --INDA
bbb
0xx = use INDB value directly
100 = INDB++
101 = INDB--
110 = ++INDB
111 = --INDB
That way there is no need to have so many registers...
But we need to be able to do stuff like JMPSW INDA,++INDA.
Hey, I just realized that we had cog RAM stacks in the Prop2, all along:
JMPSW INDA--,ADR = CALL ADR
JMP ++INDA = RET
This takes fewer transistors than a hardware LIFO.
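The CALL/RET pairing above can be traced with a short Python model. It assumes that JMPSW D,S stores the return address (taken here as PC+1) into D and jumps to S; the class name `Cog`, the starting stack location and the 9-bit wrap are illustrative assumptions, not details from the thread.

```python
class Cog:
    """Minimal model of a cog-RAM call stack addressed through INDA."""

    def __init__(self, stack_top=0x1EF):
        self.ram = [0] * 512
        self.pc = 0
        self.inda = stack_top  # assumed initial stack location

    def call(self, target):
        # JMPSW INDA--,target : save return address at [INDA], then decrement
        self.ram[self.inda] = self.pc + 1
        self.inda = (self.inda - 1) & 0x1FF
        self.pc = target

    def ret(self):
        # JMP ++INDA : increment first, then jump to the saved address
        self.inda = (self.inda + 1) & 0x1FF
        self.pc = self.ram[self.inda]
```

Nesting depth is limited only by how much cog RAM is set aside, rather than by a fixed four-level hardware stack.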
Is there a spare opcode bit, that can re-map these to give the decode choices, but not to consume (valuable) register address space with fixed-use-locations ?
I think you can safely deep-six the four level stack.
Good point about task switching. So how about
The only way we could get a spare bit, or two, would be to reduce cog RAM to 256 longs. Then, we could get indirect bits for both D and S. And cog size would be cut in half. It might be viable with hub exec, you know. 32 cogs!
With 512 cog locations (minus shadow regs) it is still barely possible to do a 256 entry lookup table for vm's using cog memory, without taking a hub cycle hit.
16 cogs with 512 registers, and tasks, is FAR more useful than 32 cogs with 256 registers.
Good idea on the mode bits. If we gave two registers per INDx, we could pick between two of the preset modes. That would enable stacks.
That sounds too costly, I was thinking more along the lines of 'blurring' the 9b + 9b opcode fields to something like 10b+ 8b in some sparse cases, where that split allows things like dedicated pointers/sfr (special function registers) 'above' usual 512 limit, and now the #Immediate value is smaller, & the source register form is limited to one of the lower 256.
-but it has avoided eating-into general purpose RAM.
If this is for task switching, do we then not also need the automatic wrapping inside a table (FIXINDx instructions) ?
I have the feeling this needs a lot of logic gates (comparators, muxes, registers) ?
I think we can always do it with hardcoded task switches instead of a task-list. For example with 3 tasks:
Andy
I'd like to avoid that wrapping issue, as it is costly. I hope to have all these issues settled tonight. There are many other things that need attention.
I could see that, but how would you know if you were in a 10b + 8b situation?
One of these other things:
I think we need a word-move instruction for SDRAM drivers. We have Nibble and Byte instructions (ROLNIB, GETNIB, SETNB etc.) but nothing like that with word size.
If we want to access 16bit wide SDRAM with half the sysclock rate we need to do the move and shift in one single cycle instruction.
My proposal is a MOVWORD instruction which moves the lower or higher 16bits from S to the lower or higher 16bits of D. The other 16bits in D must not be affected:
Andy
In common with SDRAM which would want 16 bit and a CS# strobe, LCD parallel interfaces are similar.
Some need 24b, which may be best via the video , others need 16b i8080 bus models.
A useful opcode here could be a double-move, that does 2 x16b moves on a 32 bit register.
With 2 SysClks available, it may be possible to get close to 200MHz bursts ?
Thanks for pointing this out. Perhaps GETNIB/BYTE/WORD should perform a ROL function, too.
Why do we need more than one mode?
When used as a stack, you are either PUSHing (CALLing) or POPing (RETing). So you choose PUSH INDA++ which pushes first, then increments, and POP --INDA which decrements first then pops. This only requires 2 register spaces for INDA. For the occasional times you require just take a copy from the stack you have to POP then PUSH.
If we want 2 stacks, the probability is we will go from each end. Therefore make INDB work the opposite - PUSH --INDB and POP INDB++, and 2 more register spaces.
Now, you will notice that INDA & INDB both have pre-decrement and post-increment. They just get reversed for CALL to RET and also for PUSH to POP. Silicon should be simplified here.
This also gives us the freedom to use INDx++ and --INDx in standard op codes too.
Might INDA & INDB be better called STACKA and STACKB ?
BTW I agree, the 4 level LIFO (which is only 17 address bits + flags) can go.
May I suggest...
MOVNIB D,S/#, #0..7
MOVBYTE D,S/#,#0..3
MOVWORD D,S/#,#0..1
where the NIB/BYTE/WORD (rightmost bits) of S/# replace the bits in D as indexed by #0..n
and
GETNIB D,S,#0..7
GETBYTE D,S,#0..3
GETWORD D,S,#0..1
where D is left-zero filled and #0..n is an index into S.
These 6 instructions could map nicely to one set of opcode sets.
Can you remind me what ROLNIB/BYTE/WORD does?
My preference when performing a MOVe to cog memory is to use "MOVxxx". Similarly I prefer MOVD, MOVS, MOVI or MOVINST, and MOVCOND.
My preference to use "SETxxx" is for buried registers or setting modes.
Thinking about what has been discussed above, I wonder if an alternative could be...
RDQUADC D/#,S/PTRA++/PTRB++ 'reads a quad long into cog ram at a quad boundary (no need for buried DCACHE) and resets the internal OFFSET 4-bit counter.
RDBYTEC D,S/# WC 'reads a byte from the cog where S/# specifies the location D/# used in the RDQUADC, and increments the OFFSET counter +1. "C" set if OFFSET wraps = last byte.
RDWORDC D,S/# WC 'reads a word from the cog where S/# specifies the location D/# used in the RDQUADC, and increments the OFFSET counter +2. "C" set if OFFSET wraps = last word.
Note: mixing RDBYTEC and RDWORDC is not supported.
Note if the user gives a different S/# in RDBYTEC or RDWORDC than the D/# used in the RDQUADC then results will be unpredictable (reads from the cog location specified)