What's the point of non-long-aligned hubexec? From what I understand, it just complicates addressing, and makes even streamer-aligned jumps take a tick longer because the first instruction can't be run until two longs are fetched because it spans two longs. Of course, I could just ignore it by putting "long" at the top of all my PASM...
Otherwise, I'm very excited for the P2 and wish I had the time and money for an FPGA!
I thought all instruction fetches are long aligned, including in hubexec mode.
Then what's with the byte addressing (e.g. cogram goes from $0 - $7FC) Chip keeps referring to?
The byte addressing is natural and normal. The smallest "atom" of data in any machine nowadays is the byte.
That does not imply that instructions can sit at any byte address. There are many machines, especially those with fixed-size 16- or 32-bit instructions, where the instructions have to be located at two- or four-byte boundaries.
If you are going to have byte addressing in HUB it makes sense to be consistent and have byte addressing in COG, even if you can never address an arbitrary byte in COG.
First of all, let me clarify that I'm only talking about instruction fetching. Data should definitely be byte-addressable. I love it that (R|W)F(BYTE|WORD|LONG) can operate on misaligned words and longs.
All instructions are exactly one long in length. The hub streamer reads one aligned long at a time. If an instruction isn't long-aligned, then it must be split over a long boundary. That means that the first instruction after a jump can't be run until two aligned longs are fetched by the streamer. An optimizing compiler that tries to align hubexec jumps, so that the streamer is ready to load the jump target instruction immediately after the jump, would never misalign an instruction, because doing that would slow down the first instruction after each jump. Yes, this is a matter of ticks, but they add up for inner loops.
Caveats/Questions:
1. COGSTART/COGINIT will have to clear the special registers in cog - or could it be done at BOOT and COGSTOP ???
2. Will remainder of COG RAM/registers & LUT be cleared?
If possible, suggest NO - we can then in fact hold what was in Cog RAM on a previous start ???
-or do you power down the whole cog and its RAM & LUT at boot and/or COGSTOP ???
3. HUB RAM - will it be cleared on BOOT ???
Again suggest NO - it's been nice with the P2 FPGA to be able to reboot and examine the hub ram by jumping into the monitor.
How is the HUB ROM going to work? Will it be part of the HUB Address space or will it switch out after bootup ???
1. The reset hardware causes all the cog I/O to reset. However, it is still necessary to clear the RAM registers for OUTA/OUTB/DIRA/DIRB. This will be done by the now-simplified cog startup code that is realized in logic, as it's only a 5-long program:
2. Nope. They will retain whatever they had in them before.
3. Ultimately, on the chip, the hub RAM will have to be cleared on failed startup when security is enabled to get rid of all SHA-256 residue and any data left behind. Without security enabled, there's no need to clear the hub RAM, though.
(4.) The hub ROM is read via COGID WC, sequentially. This happens at boot-up and the contents are loaded into the first 16KB of RAM and executed by cog0. 16KB is complete overkill, for now, but it is sufficient room for a complete on-chip development system in the future.
I was thinking about how addresses above $1000 are hub-exec, while addresses $0..$7FF are cog-exec and $800..$FFF are LUT-exec, and what a pain it is that you can't have hub-executable code below $1000. Then, it dawned on me that cog/LUT-exec could be restricted to long-aligned addresses, only, allowing hub exec to occur on non-long-aligned addresses below $1000. Here's the new way:
This area (4KB) of low hub memory will not be wasted anyway,
even if there is not enough data to fill it:
it can contain cog code, to be loaded into cog RAM or LUT RAM (using RDLONG-repeat).
This way, not even the EEPROM space would be wasted!
(I'm assuming that the boot process loads EEPROM contents at hub address zero.)
In other words, since you appear to be supporting 20-bit addressing (1MB), put cog and lut execution above the hub address range. Then there is absolutely no overlap in the addressing and you still get full coverage. Data addressing is not an issue anyhow, since you use different instructions for each type of memory.
Yes, it means our cog code would have to have "ORG $80000", but I don't see an issue with that. Now that all of the jumps are either long jumps from a register value or relative jumps, it doesn't really matter where the cog (and LUT) address range is actually located.
That low address space will not see normal code use due to the misalignment.
That is a good thing, just like the COG shadow register RAM was. Special dev type code can go there and it's not expected to have code there otherwise.
We will have P2 FPGA projects in addition to the real chip and not breaking the full megabyte will pay off for those.
I would much rather not have executable code there at all before breaking the addressing above 512kb
Ahh. Now, the FPGA argument I agree with. But, only just. It's going to have to be a pretty beefy FPGA to provide 1MB of hub ram plus ram for the cogs and LUT (and internal registers). Seems like the tail wagging the dog.
(note: I suspect the more economical approach will be to use external DDR and map its page size in via an additional address range above $7FFFF.)
Additional note: I'm not attached to the suggestion, by the way. Just throwing it out there as an alternative suggestion in case Chip is determined to make it work one way or another. The instruction address space from $80000 is not used by the P2, even though it has instructions that support it. How about this instead:
$00000-007FF : Cog instruction and data space
$00800-00FFF : LUT instruction and data space
$01000-7FFFF : unused
$80000-FFFFF : Hub instruction and data space
Now, if future FPGA people want to extend memory down, they can, and be no worse off than they would be with the current design (prior to Chip's current suggestion).
Again, I'm not attached. Just putting it out there...
COG addresses stay simple, HUB code starts at $1000, done. Large address constants for COG code are goofy more often than the occasional non-aligned code is, and that can be managed by software too, if we want.
Just saw the revised suggestion. Same issue. Why break a nice clean address model?
Think success. If P2 does well, a 1MB variant won't be a big deal. Could happen.
Now that I think about it, the booter, monitor, crypt, etc. can go in the non-aligned space, leaving some room for user debug or dev code and/or hooks to nicely integrate the same and the dev system.
Call this the system area and it is expected to be used in that fashion.
And if it is really needed, it can be used anyway, unlike the hot ROM.
It just seems rather ugly to have two address spaces overlaid on top of each other with a slight offset. What is this? Parallel universes? :-)
(4.) The hub ROM is read via COGID WC, sequentially. This happens at boot-up and the contents are loaded into the first 16KB of RAM and executed by cog0. 16KB is complete overkill, for now, but it is sufficient room for a complete on-chip development system in the future.
Does that mean you have a price for a ROM-revision mask ?
I was thinking about how addresses above $1000 are hub-exec, while addresses $0..$7FF are cog-exec and $800..$FFF are LUT-exec, and what a pain it is that you can't have hub-executable code below $1000. Then, it dawned on me that cog/LUT-exec could be restricted to long-aligned addresses, only, allowing hub exec to occur on non-long-aligned addresses below $1000. Here's the new way:
Interesting approach. Users can still use the lowest hub memory for arrays/data, right?
Does adding that wider decode have much of a speed penalty ?
What about access outside HUB - does/can that generate a trap (interrupt?) like some MCUs do?
Doing that would give a low-level way to manage off-chip access for large data.
Comments
I thought all instruction fetches are long aligned, including in hubexec mode.
Then what's with the byte addressing (e.g. cogram goes from $0 - $7FC) Chip keeps referring to?
The byte addressing is natural and normal. The smallest "atom" of data in any machine nowadays is the byte.
That does not imply that instructions can sit at any byte address. There are many machines, especially those with fixed-size 16- or 32-bit instructions, where the instructions have to be located at two- or four-byte boundaries.
If you are going to have byte addressing in HUB it makes sense to be consistent and have byte addressing in COG, even if you can never address an arbitrary byte in COG.
32-bit alignment is not an article of faith. Whatever is easiest to do is the way to go.
First of all, let me clarify that I'm only talking about instruction fetching. Data should definitely be byte-addressable. I love it that (R|W)F(BYTE|WORD|LONG) can operate on misaligned words and longs.
All instructions are exactly one long in length. The hub streamer reads one aligned long at a time. If an instruction isn't long-aligned, then it must be split over a long boundary. That means that the first instruction after a jump can't be run until two aligned longs are fetched by the streamer. An optimizing compiler that tries to align hubexec jumps, so that the streamer is ready to load the jump target instruction immediately after the jump, would never misalign an instruction, because doing that would slow down the first instruction after each jump. Yes, this is a matter of ticks, but they add up for inner loops.
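To make the alignment argument concrete, here is a minimal Python sketch that counts how many aligned longs a 4-byte instruction touches at a given byte address. The function name `longs_spanned` is made up for illustration, and the sketch models only the number of fetches, not actual P2 tick timing.

```python
LONG_BYTES = 4  # a long is 4 bytes; every instruction is exactly one long


def longs_spanned(byte_addr: int, size: int = LONG_BYTES) -> int:
    """Count the aligned longs touched by `size` bytes starting at byte_addr."""
    first = byte_addr // LONG_BYTES               # index of the first aligned long
    last = (byte_addr + size - 1) // LONG_BYTES   # index of the last aligned long
    return last - first + 1


if __name__ == "__main__":
    # Compare a long-aligned jump target with the three misaligned offsets.
    for addr in (0x1000, 0x1001, 0x1002, 0x1003, 0x1004):
        print(f"${addr:05X}: {longs_spanned(addr)} aligned long(s)")
```

Only addresses that are a multiple of 4 need a single aligned fetch; any other offset straddles a long boundary, which is the extra fetch described above.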
Potatohead, In that case it should be implemented. But it should never be used.
1. The reset hardware causes all the cog I/O to reset. However, it is still necessary to clear the RAM registers for OUTA/OUTB/DIRA/DIRB. This will be done by the now-simplified cog startup code that is realized in logic, as it's only a 5-long program:
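MOV OUTA,#0    ' clear the OUTA register
MOV OUTB,#0    ' clear the OUTB register
MOV DIRA,#0    ' clear the DIRA register
MOV DIRB,#0    ' clear the DIRB register
JMP PTRB       ' jump to the address held in PTRB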
2. Nope. They will retain whatever they had in them before.
3. Ultimately, on the chip, the hub RAM will have to be cleared on failed startup when security is enabled to get rid of all SHA-256 residue and any data left behind. Without security enabled, there's no need to clear the hub RAM, though.
(4.) The hub ROM is read via COGID WC, sequentially. This happens at boot-up and the contents are loaded into the first 16KB of RAM and executed by cog0. 16KB is complete overkill, for now, but it is sufficient room for a complete on-chip development system in the future.
Thanks for the reply.
3. Makes sense as I totally forgot about security. Would be nice to leave uncleared when security is not enabled.
4. Nice for future ROM use
%000000000xxxxxxxxx00 = cog-exec
%000000001xxxxxxxxx00 = LUT-exec
everything else = hub-exec
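The bit patterns above can be read as a small decode function. The following Python sketch is only a model of that decode, not Chip's logic: the function names are invented here, the address is assumed to be 20 bits wide, and the "original" scheme is taken from the ranges quoted in this thread ($0..$7FF cog, $800..$FFF LUT, hub from $1000 up).

```python
def exec_mode_original(addr: int) -> str:
    """Original scheme: the range alone selects cog, LUT, or hub execution."""
    if addr < 0x800:
        return "cog"
    if addr < 0x1000:
        return "lut"
    return "hub"


def exec_mode_proposed(addr: int) -> str:
    """Proposed scheme: only long-aligned addresses below $1000 are cog/LUT."""
    addr &= 0xFFFFF          # assume a 20-bit instruction address
    if addr & 0b11:          # low two bits not %00: never cog-exec or LUT-exec
        return "hub"
    top = addr >> 11         # the nine leading bits of the pattern
    if top == 0:             # %000000000xxxxxxxxx00
        return "cog"
    if top == 1:             # %000000001xxxxxxxxx00
        return "lut"
    return "hub"             # everything else


if __name__ == "__main__":
    for a in (0x000, 0x001, 0x7FC, 0x800, 0x802, 0xFFC, 0x1000):
        print(f"${a:05X}: {exec_mode_original(a)} -> {exec_mode_proposed(a)}")
```

In this model the only addresses that change meaning are the misaligned ones below $1000, which move from cog/LUT-exec to hub-exec.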
My first reaction is that original way was fine...
Although, I guess we'd also have to figure out how code beyond $1000 is handled...
I second this.
Nooo, that is monstrous.
Define your memory spaces. Be done with it.
Third!
Please don't do this!
This area (4KB) of low hub memory will not be wasted anyway,
even if there is not enough data to fill it:
it can contain cog code, to be loaded into cog RAM or LUT RAM (using RDLONG-repeat).
This way, not even the EEPROM space would be wasted!
(I'm assuming that the boot process loads EEPROM contents at hub address zero.)
Having a directive to run code in that space might be great for debuggers and other kinds of things like we did on P1 with shadow RAM.
Instruction address space:
$00000-$7FFFF : Hub
$80000-$807FF : Cog
$80800-$80FFF : LUT
In other words, since you appear to be supporting 20-bit addressing (1MB), put cog and lut execution above the hub address range. Then there is absolutely no overlap in the addressing and you still get full coverage. Data addressing is not an issue anyhow, since you use different instructions for each type of memory.
Yes, it means our cog code would have to have "ORG $80000", but I don't see an issue with that. Now that all of the jumps are either long jumps from a register value or relative jumps, it doesn't really matter where the cog (and LUT) address range is actually located.
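As a rough illustration of this non-overlapping layout, here is a Python sketch that maps an instruction address to its region. The names `REGIONS`, `exec_region` and `cog_register_index` are hypothetical, and the conversion to a cog register index assumes byte addressing with 4 bytes per register, as discussed elsewhere in this thread.

```python
# Suggested instruction address map: (first address, last address, region name)
REGIONS = [
    (0x00000, 0x7FFFF, "hub"),
    (0x80000, 0x807FF, "cog"),
    (0x80800, 0x80FFF, "lut"),
]

COG_BASE = 0x80000  # matches the "ORG $80000" mentioned in the post


def exec_region(addr: int):
    """Return the region an instruction address falls in, or None if unassigned."""
    for lo, hi, name in REGIONS:
        if lo <= addr <= hi:
            return name
    return None


def cog_register_index(addr: int) -> int:
    """Convert a cog instruction address to a register index (assumes 4 bytes each)."""
    if exec_region(addr) != "cog":
        raise ValueError("address is not in the cog range")
    return (addr - COG_BASE) >> 2


if __name__ == "__main__":
    for a in (0x00000, 0x7FFFF, 0x80000, 0x807FC, 0x80800, 0x81000):
        print(f"${a:05X}: {exec_region(a)}")
```

Because the three ranges are disjoint, a plain range check is enough and alignment plays no part in choosing the execution mode.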
... Seriously? You are arguing against a suggestion based on some distant future version of the Propeller?
That low address space will not see normal code use due to the misalignment.
That is a good thing, just like the COG shadow register RAM was. Special dev type code can go there and it's not expected to have code there otherwise.
We will have P2 FPGA projects in addition to the real chip and not breaking the full megabyte will pay off for those.
I would much rather not have executable code there at all before breaking the addressing above 512kb
(note: I suspect the more economical approach will be to use external DDR and map its page size in via an additional address range above $7FFFF.)
$00000-007FF : Cog instruction and data space
$00800-00FFF : LUT instruction and data space
$01000-7FFFF : unused
$80000-FFFFF : Hub instruction and data space
Now, if future FPGA people want to extend memory down, they can, and be no worse off than they would be with the current design (prior to Chip's current suggestion).
Again, I'm not attached. Just putting it out there...
COG addresses stay simple, HUB code starts at $1000, done. Large address constants for COG code are goofy more often than the occasional non-aligned code is, and that can be managed by software too, if we want.
Just saw the revised suggestion. Same issue. Why break a nice clean address model?
Think success. If P2 does well, a 1MB variant won't be a big deal. Could happen.
Call this the system area and it is expected to be used in that fashion.
And if it is really needed, it can be used anyway, unlike the hot ROM.
Does that mean you have a price for a ROM-revision mask ?
Does adding that wider decode have much of a speed penalty ?
What about access outside HUB - does/can that generate a trap (interrupt?) like some MCUs do?
Doing that would give a low-level way to manage off-chip access for large data.