If PA/PB/PTRA/PTRB could be your basereg, then maybe we could make an instruction which adds a 20-bit immediate, but rather than modify PA/PB/PTRA/PTRB, the sum appears as S in the next instruction.
All those ALTxx instructions do is modify fields in the next instruction, not final S or D values. So, getting 20-bit alterations requires something more.
This would limit CALLD to just PA and PB, but I don't really see why you'd want to jump to where PTRA/PTRB points.
This LOCOS instruction would sum PA/PB/PTRA/PTRB with a 20-bit immediate offset and put it into the next instruction's S value, which would be used by the subsequent RDxxxx/WRxxxx instruction.
How many concurrent baseregs do you think you need? Would PA and PB be sufficient, or might you need several?
The compiler would generally use arbitrary registers. If we can only use a limited set then we can stick with PTRA and PTRB (rather than PA and PB) but the more special cases there are the more tricky optimization becomes.
There is nothing technically stopping more than one COG from reaching into XIP memory, except that would thrash the bus more, and increase latencies. In some cases, that could be tolerable.
That would need some kind of mutex on the bus though, right? You couldn't do it concurrently in multiple cogs I assume?
Users can, and will, pull out 'driver level like' functions to run on other COGS. Some of those may creep into HUB space, but the general nature is they would be loaded and run without being moved.
XIP allows the 'main code' to place most often used functions in HUB, and then call into further memory, where the usual size-speed trade-offs apply. Code like POST, or Setup menus, are some examples of not-used-often, but still required to be available code.
That would also require compiler, linker, and loader support. This functionality is possible with P1 using PropGCC to compile overlays into eeprom, but almost no one has ever used it. Macca did a demo of it recently, but I think the latest versions of the Prop loader tool removed the support for overlay code, and it's not accessible through SimpleIDE - you need to hand code the command lines to make it go. It'd be nice to see overlays "officially supported", though probably less of an issue with more memory and fast compressed instruction decoding.
Would it be a problem if CALLD (the "LINK" instruction) only worked with PA/PB?
What about an AUG instruction that extends the 5-bit N field in PTRA/PTRB hub instructions?
augn #imm_offset
wrlong data,PTRA[imm_offset]
This way you get the benefits of pre/post indexing too.
I think this is the best idea. It's only a few more adders to add this and, like you said, you have the benefit of pre-/post-usage and update/no-update. This would do it all.
By the way, it could just work off the AUGS instruction. If AUGS is in play during RDxxxx/WRxxxx, then it is used to augment the normally-5-bit offset field. This could still allow for 20-bit immediate addressing without PTRA/PTRB by making the PTRA/PTRB use kick in only when bit 31 of the augmented S is set.
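To make the bit-31 idea concrete, here is a minimal Python sketch of how the address selection might behave. It is only a model of the proposal as described above, not the actual hardware: the use of bit 30 to choose PTRB over PTRA, the unscaled offset, and the function name `effective_address` are all assumptions made for illustration.

```python
ADDR_MASK = (1 << 20) - 1  # 20-bit hub address space


def effective_address(s_value, ptra, ptrb):
    """Model the proposed RDxxxx/WRxxxx address selection after AUGS.

    s_value is the full 32-bit augmented S value.
    """
    offset = s_value & ADDR_MASK
    if not (s_value >> 31) & 1:
        # Bit 31 clear: plain 20-bit immediate absolute address.
        return offset
    # Bit 31 set: pointer-relative access.
    # ASSUMPTION: bit 30 selects PTRB instead of PTRA (hypothetical).
    base = ptrb if (s_value >> 30) & 1 else ptra
    # The sum wraps within the 20-bit address space.
    return (base + offset) & ADDR_MASK
```

Because the sum wraps modulo 2^20, the same model covers both signed and unsigned readings of the 20-bit offset: an offset of $FFFFC behaves like -4.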
There is nothing technically stopping more than one COG from reaching into XIP memory, except that would thrash the bus more, and increase latencies. In some cases, that could be tolerable.
That would need some kind of mutex on the bus though, right? You couldn't do it concurrently in multiple cogs I assume?
Yes, hence my comment on increased latencies.
Instead of one COG having control, each would need to fetch small packets, then release the bus for possible other use.
Keeping delays under control gets challenging, but given the way P2 shares pins, there is no hardware prevention.
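As a rough illustration of the "fetch small packets, then release" idea, the sketch below simulates cogs taking turns on a shared bus. The round-robin policy, the packet size parameter, and the function name `round_robin_fetch` are assumptions for the sake of the example; the P2 itself provides no such arbiter, so this would have to be a software convention.

```python
from collections import deque


def round_robin_fetch(requests, packet_size):
    """Return the order of bus grants as (cog_id, byte_count) tuples.

    requests maps a cog id to the number of bytes it wants to fetch.
    Each cog holds the bus for at most packet_size bytes, then goes
    to the back of the queue if it still needs more.
    """
    queue = deque(sorted(requests.items()))
    grants = []
    while queue:
        cog, remaining = queue.popleft()
        if remaining <= 0:
            continue  # nothing to fetch for this cog
        burst = min(packet_size, remaining)
        grants.append((cog, burst))
        if remaining > burst:
            queue.append((cog, remaining - burst))
    return grants
```

With two cogs wanting 10 and 4 bytes and a 4-byte packet, cog 1 gets the bus after cog 0's first packet rather than waiting for all 10 bytes. Smaller packets bound the worst-case wait but add more bus handovers.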
That would also require compiler, linker, and loader support. This functionality is possible with P1 using PropGCC to compile overlays into eeprom, but almost no one has ever used it. Macca did a demo of it recently, but I think the latest versions of the Prop loader tool removed the support for overlay code, and it's not accessible through SimpleIDE - you need to hand code the command lines to make it go.
Yes, that did seem a good idea, given the caveats of P1, but I guess the things slowing down use of this are the finite EEPROM size and the somewhat slow I2C EEPROM speed (far slower than XIP memory).
It'd be nice to see overlays "officially supported", though probably less of an issue with more memory and fast compressed instruction decoding.
My thinking is users will always run out of code space, eventually.
P2 is larger, but can do far more too. (I see other MCUs shipping now that offer 2MB Flash and 1MB SRAM.)
The question is what happens then, and how much of a hard ceiling is that?
Could alts/altd etc. modify final values rather than fields? Probably a big change, but it's worth asking...
There's no time for any math, only mux'ing, for the field substitution. The field substitution permits register indexing.
Substituting values is more relaxed on timing, but I need to figure out what would be useful.
You know, having the full 20 bits is overkill -- this is for offsetting into a structure, and so it doesn't really need to span the whole address space. I think a lot of processors allow 12 bit signed offsets. 9 bits would be better than nothing: it would allow us to reduce the 3 instruction sequence above down to 2 (1 + an augment), and would be an improvement over the 5 bit offset available in PTRA/PTRB.
So I think my first choice would be an ALTS-like instruction that replaced the source value (rather than the source field). But an AUG that expands the range of offsets for PTRA/PTRB would be an OK second choice.
All of this is conditional on it not being too expensive either in gate counts or in your time!
Eric
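For a sense of scale, the offset widths mentioned in this post can be compared with a few lines of Python. This assumes plain two's-complement signed offsets in bytes, which may not match the actual PTRA/PTRB index encoding; `signed_range` and `fits` are illustrative names only.

```python
def signed_range(bits):
    """Return (min, max) for a two's-complement value of the given width."""
    half = 1 << (bits - 1)
    return -half, half - 1


def fits(offset, bits):
    """True if offset is representable as a signed value of that width."""
    low, high = signed_range(bits)
    return low <= offset <= high


for width in (5, 9, 12, 20):
    print(width, signed_range(width))
```

Under that assumption a structure field at byte offset 200 is out of reach for a 5-bit offset but fits comfortably in 9 bits.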
Making AUGS behave differently with RDxxxx/WRxxxx is almost free. There is no new instruction in that case, only a change in behavior. The thing about any 20-bit immediate instruction is that it's going to have, at most, two bits available for register selection. If we can just say PTRA/PTRB is good enough, then it's really simple to make this happen.
Imagine that 5-bit index is expanded to 20 bits. Would you still want the value scaled according to byte/word/long, or should it just be direct, without scaling? Are both cases needed? Actually, "no", right? Because this is a compile-time thing with full range. So, it could be direct without scaling?
Shouldn't it be? Indexed data or code may not be aligned.
Right, as long as we've got 20 bits, I see no need for scaling, as all it might do is make 3/4 of the memory unreachable (RDLONG). If the compiler knows what the scaled value should be, it knows the unscaled value, and it can just use that.
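The 3/4 figure can be checked with a short Python sketch. It assumes the scaled offset simply wraps within the address space and that the base pointer is fixed; `reachable_fraction` is a hypothetical helper, not anything from the P2 tools.

```python
def reachable_fraction(addr_bits, scale):
    """Fraction of byte addresses an addr_bits-wide index can reach
    when it is multiplied by scale and wrapped to the address space."""
    space = 1 << addr_bits
    reached = {(index * scale) % space for index in range(space)}
    return len(reached) / space


for name, scale in (("byte", 1), ("word", 2), ("long", 4)):
    print(name, reachable_fraction(20, scale))
```

With long scaling only every fourth byte address is reachable from a given base, so three quarters of the address space is lost; without scaling every address is reachable.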
Imagine that 5-bit index is expanded to 20 bits. Would you still want the value scaled according to byte/word/long, or should it just be direct, without scaling? Are both cases needed? Actually, "no", right? Because this is a compile-time thing with full range. So, it could be direct without scaling?
Direct without scaling would actually be easier for the compiler to handle. The scaling is nice for PASM programmers, but for the machine it's one more thing it has to adjust for. Also, for packed structures it would be nice to be able to access unaligned fields this way.
Eric
All right. I'll get this change made, test it, and recompile the FPGA images.
I haven't followed the details of the P2 for about a year, and I've forgotten much of what I knew about it from a year ago. So please bear with me on some possibly silly questions. Most of the time C code will be executing from HUB memory. It will use the CALLD instruction to call functions. This will result in using the long version of CALLD, which only allows using either PA, PB, PTRA or PTRB as the link register. PA/PB are cog RAM locations, and PTRA/PTRB are registers, correct? Are there any limitations on using PTRA/PTRB for the link register? That is, do any instructions access the shadow memory instead, which would not contain the same values as the registers?
It seems like a reasonable thing to do for C is to use PA for the link register and PTRA for the stack pointer. Does that make sense? For C the stack starts at the high end of memory, and a pre-decrement is done before writing to the stack. After a read is done a post-increment is performed. Is this consistent with the way the P2 handles the pointers?
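The C stack convention described here (pre-decrement on write, post-increment on read) can be modelled as below. This is only a sketch of the convention itself, not of P2 behaviour: the 4-byte little-endian longs, the class name `DescendingStack`, and the use of a `bytearray` for hub memory are assumptions for illustration.

```python
class DescendingStack:
    """Stack that grows down from the top of memory."""

    def __init__(self, top, memory_size):
        self.mem = bytearray(memory_size)  # stand-in for hub RAM
        self.sp = top                      # stand-in for PTRA

    def push_long(self, value):
        # Pre-decrement, then write.
        self.sp -= 4
        self.mem[self.sp:self.sp + 4] = (value & 0xFFFFFFFF).to_bytes(4, "little")

    def pop_long(self):
        # Read, then post-increment.
        value = int.from_bytes(self.mem[self.sp:self.sp + 4], "little")
        self.sp += 4
        return value


stack = DescendingStack(top=64, memory_size=64)
stack.push_long(0x11)
stack.push_long(0x22)
print(stack.pop_long(), stack.pop_long(), stack.sp)
```

The stack pointer always addresses the most recently pushed long, and it returns to its starting value once every push has been matched by a pop.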
What are CALLPA/CALLPB? They're new since I've last kept up with the P2. It appears that they put the return address on the hardware stack and also PA/PB. What's the purpose for doing that?
1F6 PA (used to be called ADRA)
1F7 PB (used to be called ADRB)
1F8 PTRA
1F9 PTRB
and
CALLPA {#}D,{#}S 'Call to S** by pushing {C, Z, PC[19:0]} onto stack, copy D to PA.
CALLPB {#}D,{#}S 'Call to S** by pushing {C, Z, PC[19:0]} onto stack, copy D to PB.
I understand that PA, PB, PTRA and PTRB are mapped to cog locations, but I was curious about how they were implemented. It appears that PA and PB are physically in cog RAM, but PTRA and PTRB are implemented as registers. That implies that there may be cog RAM also at the same addresses as PTRA and PTRB. When an instruction is fetched from $1F8 does it come from cog RAM or from the register that implements PTRA? There doesn't appear to be any way to write to cog RAM at $1F8, unless something like "mov PTRA, #5" puts a 5 in both the cog RAM and the PTRA register.
Chip, thanks for the clarification on CALLPA/CALLPB. Of course the documentation states exactly what you said. I just didn't read it correctly.
Thanks for confirming this. I actually knew this stuff a year or two ago, but my memory had faded on these details. And things have changed a bit since then.
Comments
Andy
Hey, that could be almost for free, but would preclude immediate absolute addressing using AUGS, as happens now. It's something to consider, though.
I was thinking we could just change the encoding of CALLD to eke out another instruction which has a 20-bit immediate.
Currently:
...could become:
This time is super high value. Lean and mean tools to make a lot of people happy.
They are all cog locations
I believe writes to $1F8/$1F9 do write the RAM. I'll check soon.
Curiosity got the better of me, so I tried this. Even after indexing the PTRA register, the original $1F8 RAM contents persist.