The primary objective here is to shrink CogRAM and/or AuxRAM, via a setting, in exchange for a larger hub exec cache, right? ... That implies a four-ported cache that can be treated with the same flexibility and speed as CogRAM.
Nice idea, but this would also imply widening register addressing to 32 bits, or at least 18 bits for the current hub address range ... and the instruction size limits your register addressing range! Personally, I feel 32-bit instructions are rather large for a microcontroller. 64-bit would be gross wastage, and even then you wouldn't be able to achieve a full 32 bits for each of S and D.
Another alternative is to drop back to a single direct address and have an accumulator in the ALU.
True. We also need to break MOVFLD into SETFLD and GETFLD, and maybe have a ROLFLD, too, which would be like GETFLD, but would rotate the gotten field into D rather than just zero-extending it. Also, a RORFLD would really round things out. We'd be set then!
These things are so easy to do, but I'm mired deep in changes to support hub exec now. Please remind me later if you don't see this implemented.
I guess what you are saying is that my initial proposal is unlikely to be implemented because of complexity (die size).
Of course, the dynamic get/set field instructions would take care of all cases where bits are aligned and would be better than nothing.
Instead of mask + offset, or offset + bit length, to set the field, wouldn't a single 32-bit mask be sufficient? The number of trailing zeros gives the offset, the span from the least significant one to the most significant one gives the bit length, and the bits in between form a normal mask.
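In C terms, that decode is roughly the following (just a sketch with made-up function names, assuming a nonzero mask; not anything from the actual design):

#include <stdint.h>

/* Derive shift count and field width from a single 32-bit mask. */
static void mask_to_field(uint32_t mask, int *offset, int *width)
{
    int lsb = __builtin_ctz(mask);         /* trailing zeros = offset    */
    int msb = 31 - __builtin_clz(mask);    /* position of the topmost 1  */
    *offset = lsb;
    *width  = msb - lsb + 1;               /* span between the two ones  */
}

/* For %00001101010100000000000000000000 this yields offset = 20, width = 8. */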
It could work like that, but it would take more circuitry for the needed priority encoders. Ultimately, the circuit needs to know two things: mask size and shift count, each 5 bits.
Your idea of magically laying bits into a random mask, sequentially, is beautiful, but it would take a lot of transistors, perhaps 10x as many as using a mask size and shift count to map a contiguous pattern.
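For contrast, here is a rough C sketch (purely illustrative, names are mine) of the bits-into-holes operation; every destination bit needs its own count of the mask bits below it, which is where the priority-encoder cost comes from, whereas a contiguous field needs only a 5-bit width and a 5-bit shift:

#include <stdint.h>

/* Scatter the low bits of src, in order, into the 1-positions of an
   arbitrary 32-bit mask. */
static uint32_t scatter_into_mask(uint32_t src, uint32_t mask)
{
    uint32_t out = 0;
    int j = 0;                               /* next source bit to place */
    for (int i = 0; i < 32; i++)
        if (mask & (1u << i))
            out |= ((src >> j++) & 1u) << i;
    return out;
}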
Chip,
In my last reply I was not talking about my initial proposal to spread bits into holes. I meant that a 32-bit mask is sufficient to replace the need for two parameters to set the field.
The following bit mask, %00001101010100000000000000000000, would be equivalent to "offset = 20, length = 8", or, if a mask is used instead of a length, to "offset = 20, mask = %000011010101".
If the latter (mask version) is possible to implement, one could do something like...

        SETFLDMASK bitMask
        SETFLD     source, destination

bitMask long       %00001101010100000000000000000000

This is equal to the following in P1 PASM...

        ANDN    destination, bitMask
        SHL     source, #20                 ' yes, shift BEFORE masking
        AND     source, bitMask
        OR      destination, source

bitMask long       %00001101010100000000000000000000
BTW, maybe an indexed register scheme could work, where you could set, say, eight 32-bit mask registers.
And then do something like: "SETFLD source, destination, #%xxx".
Where xxx is the 3-bit index that selects a predefined mask to use. Many applications don't need more than a handful of masks anyway. And if I am not mistaken, there is space for instructions with 3 extra bits if certain flags (like the immediate flag, execution flags, etc.) are not used.
Then you could do a lot of user-defined single-cycle field instructions without the need to first set the mask.
like so..
I understand now. It would still take a priority encoder to determine the base offset.
Did you know that we now have 'D/#,S/#' instructions? You can convey two 9-bit constants in one instruction. This could be used to avoid needing a separate bitmask.
I see the value of the mask, though. It allows holes to exist. Are there many instances where you'd want holes, as opposed to a contiguous bit field?
I can't think of any specific situation right off the bat where it could be used.
In my initial proposal the mask was, of course, the key thing, but then I thought that it might be convenient to have a mask even if the bits are not spread. But if you still need a priority encoder to do that, it may not be worth it.
You mean like, "SETFLD #offset, #width"?!
Still, it would be nice to be able to define some "user fields" and then reference them by index. That way, in tight loops that get and set different fields, you wouldn't need a SETFLD instruction before each GET/SET.
/Johannes
Agreed!
We've got probably a dozen instructions which could be replaced by this method. I agree that having maybe four settings would be good. They could be initialized to something common, like I,X,D,S fields.
On second thought, I think we should have two of these field mask settings, but leave SETI/SETX/SETD/SETS as they are, since they are needed all the time.
This will let us get rid of GETNIB/SETNIB/GETBYTE/SETBYTE/GETWORD/SETWORD, which take up 28 'D,S/#' instruction slots.
Those will be replaced by SETFLDA/SETFLDB/GETFLDA/GETFLDB/RORFLDA/RORFLDB/ROLFLDA/ROLFLDB D,S/#, which take up only 8 slots. Then, we'll add CFGFLDA/CFGFLDB D/#,S/#.
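To picture how the configured-field instructions might behave, here is a toy C model (my own guess at the semantics, so treat the names and details as assumptions): CFGFLDx sets a width and shift, GETFLDx extracts and zero-extends that field, and SETFLDx inserts it.

#include <stdint.h>

/* One of the two field configurations (A or B), as set by CFGFLDx. */
typedef struct { unsigned width, shift; } fldcfg;

static uint32_t fld_mask(fldcfg f)
{
    uint32_t m = (f.width >= 32) ? 0xFFFFFFFFu : ((1u << f.width) - 1u);
    return m << f.shift;
}

/* GETFLDx D,S : extract the configured field of S, zero-extended. */
static uint32_t getfld(fldcfg f, uint32_t s)
{
    return (s & fld_mask(f)) >> f.shift;
}

/* SETFLDx D,S : insert the low bits of S into the configured field of D. */
static uint32_t setfld(fldcfg f, uint32_t d, uint32_t s)
{
    return (d & ~fld_mask(f)) | ((s << f.shift) & fld_mask(f));
}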
For hub exec mode, I think I've just about got everything in place to add the four 8-long instruction cache lines.
This has been really tedious work. When so much is on the operating table, it gets a little overwhelming, but then I'm kind of giddy about how neat it's going to be, so I have to force myself to focus.
A lot of instructions have moved around to make room for the new branch instructions. I spread out the boundaries of different instruction types and left some opcode ranges open to avoid having to move so much in the future when something else gets added. This makes the instruction decoding simpler, too. I did a test compile and came back at Fmax = 73MHz, which is really good.
After I get the Verilog back into shape, I must modify PNut.exe for the new instruction mapping. Then I can try it out. And when it works, I'll modify the Prop2_Docs.txt file and post an update. It's probably several days away, yet.
Someone had suggested starting a cog in hub mode. This would only take one more bit of conveyance in COGNEW/COGINIT, so it's quite doable. That way, there's no load time - task0 just starts executing from the hub at the address specified in D.
This sounds great! How did you decide to handle the cache? Are you going to implement an LRU cache replacement policy?
Yes, LRU is what I'm planning on implementing.
About LR (the link register idea), there are going to be hub stacks using PTRA/PTRB, as well as AUX stacks using PTRX/PTRY (used to be called SPA/SPB) available for both hub and cog tasks. Is there still a need for LR?
Well, I guess it isn't strictly necessary but leaf functions would execute faster using LR since there would be no hub access for pushing the PC and popping it again on return. I guess non-leaf functions would probably be a bit faster using PTRA/PTRB since the return address would get saved as part of the call instruction. I assume that's what you're planning. I guess we can look at the function prologue and epilogue in more detail once you have the instruction set completely defined to see if there would be a big advantage to having a call instruction that uses an LR register.
Chip
Everything looks better, every day!
I'm sure it will be worth all your time and effort.
As for HUB stack management, is it planned to have a twofold WIDE cache mechanism per task, enabling each task to perform some HUB r/w ops in the background while continually exposing one WIDE space, to be on par with the COG's usage demands?
So PTRA/PTRB are for hub stacks, and PTRX/PTRY for cog stacks?
Sounds good to me.
With hub-stack-oriented call/ret (which automatically push/pop the PC to/from a hub stack), I see no need for LR.
CALLA/CALLB use hub RAM via PTRA/PTRB. RETA/RETB are their counterparts.
CALLX/CALLY use AUX RAM via PTRX/PTRY. RETX/RETY are their counterparts.
You can use either set from either cog or hub mode. Just match the proper return to the call.
There are delayed versions (-d suffix) of all jmps, calls, and returns.
The immediate JMP and CALL instructions have 16 bits of address and are available as follows (JMP shown, but applies to CALLA/CALLB/CALLX/CALLY, too):
JMP #address 'jump to address
JMP_ #address 'jump to address, toggle hub/cog mode
JMP @address 'jump to relative address
JMP_ @address 'jump to relative address, toggle mode
JMPD #address 'jump to address, delayed
JMPD_ #address 'jump to address, delayed, toggle mode
JMPD @address 'jump to relative address, delayed
JMPD_ @address 'jump to relative address, delayed, toggle mode
JMP D 'jump to address
JMP_ D 'jump to address, toggle mode
JMPD D 'jump to address, delayed
JMPD_ D 'jump to address, delayed, toggle mode
CALLA/CALLB/CALLX/CALLY always save the hub/cog mode in bit 18 of the stack long. Z and C go into bits 17 and 16. The return address goes into bits 15..0.
When a return executes, it always takes bit 18 of the popped long and uses it to restore the caller's hub/cog mode. WZ/WC control restoration of the caller's Z/C flags.
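In C terms, the stack long and its use on return look roughly like this (just a sketch of the layout described above, not actual hardware behavior; function names are mine):

#include <stdint.h>

/* Pack the long that CALLA/CALLB/CALLX/CALLY push:
   bit 18 = hub/cog mode, bit 17 = Z, bit 16 = C, bits 15..0 = return address. */
static uint32_t pack_ret(int hub_mode, int z, int c, uint16_t ret_addr)
{
    return ((uint32_t)(hub_mode & 1) << 18)
         | ((uint32_t)(z & 1) << 17)
         | ((uint32_t)(c & 1) << 16)
         | ret_addr;
}

/* What a return does: mode is always restored; Z/C only if WZ/WC were given. */
static void unpack_ret(uint32_t v, int wz, int wc,
                       int *hub_mode, int *z, int *c, uint16_t *pc)
{
    *hub_mode = (v >> 18) & 1;
    if (wz) *z = (v >> 17) & 1;
    if (wc) *c = (v >> 16) & 1;
    *pc = (uint16_t)v;
}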
Here is RETA (also applies to RETB, RETX, and RETY):
RETA 'return to hub[--PTRA], restore caller's mode
RETAD 'return to hub[--PTRA], delayed, restore caller's mode
Nice! Could you please name the call instruction that uses the hub stack "PUSHJ" so I can relive my old PDP-10 days? :-)
We'll need macros for that sort of thing.
As for HUB stack management, is it planned to have a twofold WIDE cache mechanism per task, enabling each task to perform some HUB r/w ops in the background while continually exposing one WIDE space, to be on par with the COG's usage demands?
It will be reading new WIDEs in the background, while code executes, in order to keep the cache lines filled.
How will the hub-based call/return instructions work? Will they be using the equivalent of WRLONGC and RDLONGC to save and restore the return address? Or will they access the hub directly, without benefit of the RDLONGC/WRLONGC cache?
They won't use the cache because it would waste it. They'll just use background RDLONG/WRLONG instructions. The WIDEs and read cache will be available for the program to use. The instruction cache lines will be completely separate.
Don't worry about it. I was just kidding. I'm sure there will have to be an "A" or "B" in the opcode to indicate which PTRx register to use. Lots of people here seem to want to relive their CP/M-80 days, but I am more attached to my TOPS-10 / PDP-10 days. Maybe I should write a PDP-10 emulator. Might be kind of tough, though, since the PDP-10 was a 36-bit machine.
About LR (the link register idea), there are going to be hub stacks using PTRA/PTRB, as well as AUX stacks using PTRX/PTRY (used to be called SPA/SPB) available for both hub and cog tasks. Is there still a need for LR?
Eric: Can you comment on this?
LR would be nice to have. Actually the ideal would be something very like the current JMPRET D, S, that can save the HUB address in any register. Then you can define calling conventions that avoid hitting the HUB memory for most common cases. That's the way most RISC processors (ARM, MIPS, SPARC, etc.) work. I realize that instruction encodings are probably tight though :-(.
Okay. I'll see if I can fit it in. I think making JMPRET/JMPRETD do different things in hub vs cog mode is the key. If it can work like I imagine it could, it'll only be a few mux's.
You also must catch up on your sleep sometime soon...
Going to bed...
Ooops!
It happened while I was typing my post, in some lazy worm fashion way!
Have a nice sleeping session, don't forget Chilly Willy and the singing bear!
Yanomani
Hope you have had some sleep before you read this !!!
Sounds like you have it all under control - cannot wait to see the results in their full glory
I offer that cheeky Pedward for the menial tasks.
I claim full Harvard!
CALL_LR D/#16b
That stores the return address in LR (suggest $1F1)
and a RET_LR would simply be a macro for
JMP $1F1
as this would have a nice embedded 16 bit address in a single-long instruction.
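In C-level terms, the effect of that suggestion would be roughly the following (a toy sketch only; LR, i.e. register $1F1, is modeled as a plain variable, and the names are mine):

#include <stdint.h>

static uint32_t pc, lr;                 /* program counter and link register */

static void call_lr(uint32_t target)    /* CALL_LR #target */
{
    lr = pc + 1;                        /* return address = next instruction */
    pc = target;
}

static void ret_lr(void)                /* RET_LR, a macro for JMP $1F1 */
{
    pc = lr;
}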