Chip,
That sounds great to me. Even better than what I was suggesting. I recall you suggesting something like this before but saying it might take more work; maybe I am remembering incorrectly.
There's still the thing where hub addresses from $00000 to $00400 are not executable in hubexec mode, but that seems just fine to me.
I did start going down this path earlier, but when I realized that it would make hub code incompatible with cog/LUT code, I overreacted and abandoned the effort. It really doesn't matter, though. Hub code is almost always going to be different, anyway, for a number of reasons. Having the addresses stretched out in hub code won't make it that much more incompatible with cog/LUT code than it probably already is.
$00000..$001FF = cog exec (register addressing is 1:1, PC steps by 1)
$00200..$003FF = LUT exec (register addressing is 1:1, PC steps by 1)
$00400..$FFFFF = hub exec (PC steps by 4, relative D,@S (9-bit immediate) branches are shifted left twice)
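For anyone who wants to play with the map above, here is a minimal Python sketch of it. The function and constant names are my own, purely for illustration, and it only models which region a PC value falls in and how far the PC steps per instruction.

```python
# Illustrative model of the proposed execution map (names are hypothetical).
COG_END = 0x001FF   # last cog exec address
LUT_END = 0x003FF   # last LUT exec address
HUB_END = 0xFFFFF   # last hub exec address (20-bit PC)

def exec_mode(pc):
    """Return (mode, pc_step) for a 20-bit program counter value."""
    if not 0 <= pc <= HUB_END:
        raise ValueError("PC must fit in 20 bits")
    if pc <= COG_END:
        return ("cog", 1)   # register addressing is 1:1, PC steps by 1
    if pc <= LUT_END:
        return ("lut", 1)   # same stepping as cog exec
    return ("hub", 4)       # hub is byte addressed, so PC steps by 4

def next_pc(pc):
    """Advance the PC by one instruction, ignoring branches.

    This does not model any start-up requirement for hub exec;
    it only shows the step size in each region.
    """
    _, step = exec_mode(pc)
    return (pc + step) & HUB_END

for pc in (0x001FF, 0x00200, 0x00400):
    print(hex(pc), exec_mode(pc), hex(next_pc(pc)))
```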
Where there's no perfect solution, there's usually a perfect compromise.
Chip's latest proposal is sounding pretty simple to use to me, definitely a big improvement.
Although not imposing limits on flexibility is good, I'm concerned that a complicated addressing scheme could really put off potential users who are new to the Prop family.
As such, I'll gladly give up running hub code down low and so on for the sanity that such "restrictions" bring. Anyway, try to keep it clean and simple, though different people will define those differently.
Again, if perfection is not possible, go for the perfect compromise.
Thanks for pausing to regroup, Chip, at this critical juncture. You have our support and prayers.
$00000..$001FF = cog exec (register addressing is 1:1, PC steps by 1)
$00200..$003FF = LUT exec (register addressing is 1:1, PC steps by 1)
$00400..$FFFFF = hub exec (PC steps by 4, relative D,@S (9-bit immediate) branches are shifted left twice)
I think this means binary images have to be compiled for their segment, and must run in that segment ?
How are calls from COG/LUT to code that is HUBEXEC managed ?
What about crossing the LUT-HUB boundary ?
It's true that code with 20-bit branches to itself would have to run in either cog/lut or hub, depending on which it was assembled for.
Branches between cog/lut and hub are just as before. No special considerations.
I was mistaken when I said earlier that you can execute through LUT, straight into hub. It would not work, because hub exec needs to be initiated by a branch into hub range. This branch is needed to trigger the cog state machine to issue a RDFAST to start the instruction stream. Without that branch, it would start pulling longs from the FIFO without it being loaded.
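A toy Python model may make the point about needing a branch clearer. Everything here (the class, the `fifo_primed` flag, the return strings) is invented for illustration and is not how the cog state machine is actually built; it only captures the rule that a branch into hub range is what starts the instruction stream.

```python
HUB_START = 0x00400  # first hub exec address in the proposed map

class CogModel:
    """Hypothetical sketch: tracks the PC and whether the FIFO was started."""

    def __init__(self):
        self.pc = 0
        self.fifo_primed = False

    def branch(self, target):
        self.pc = target
        # Assumption: only a branch into hub range issues the RDFAST
        # that loads the FIFO with the instruction stream.
        self.fifo_primed = target >= HUB_START

    def step(self):
        """Fetch one instruction sequentially and say where it came from."""
        if self.pc < HUB_START:
            source = "cog/lut"
            self.pc += 1
        else:
            source = "fifo" if self.fifo_primed else "fifo (not loaded)"
            self.pc += 4
        return source

# Falling off the end of LUT: the second fetch reads a FIFO nobody started.
fall_through = CogModel()
fall_through.branch(0x003FF)
print(fall_through.step(), fall_through.step())

# Branching into hub range first: the FIFO is ready.
branched = CogModel()
branched.branch(0x00400)
print(branched.step())
```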
It's true that code with 20-bit branches to itself would have to run in either cog/lut or hub, depending on which it was assembled for.
Does that mean a subroutine can run either in HUB, or copied into COG ? (with the latter being more deterministic ? ) [assume it uses relative jumps only, and rets]
The Relative jumps are handled ok in both cases ?
What about code like indexed case statements ?
Can the tools warn, if someone tries to call/jump to 20-bit absolute code in the wrong place ?
I have thought long and hard about this. It comes down to a few basics which seem to have been lost along the way.
Instructions
1. Make all instructions/code addresses longs and long-aligned.
- seems no arguments here
- addresses can now be reduced to 18 bits to address 256K longs (1MB of hub).
- when fetching instructions from hub (hubexec), the PC (program counter) simply appends 2 (LSB) bits "00".
2. Use a "Flat Instruction Address" model
- seems most here want that (IMHO I don't think this is necessary but certainly the simplest to explain)
(18-bit long address)
$00000-001FF = COG RAM/Code (2KB=512-Longs) - usable as traditional cog code
$00200-003FF = LUT RAM/Code (2KB=512-Longs) - usable as lutexec
$00400-005FF = LUT RAM/Code (2KB=512-Longs) - Possible expansion if space on P2 die
$00600-007FF = LUT RAM/Code (2KB=512-Longs) - Possible future expansion
*$00000-007FF = HUB RAM (8KB=2K-Longs) - NOT usable for hubexec
$00800-1FFFF = HUB RAM/Code (504KB=126K-Longs) - also usable as hubexec
$20000-3FFFF = HUB RAM/Code (512KB=128K-Longs) - possible future expansion & hubexec
Memory
3. Use a "Flat Memory Address" model
- $00000-007FF = COG/LUT contiguous address (long addresses only)
- $00000-FFFFF = HUB 1MB contiguous address (byte addresses)
. - word addresses must be word-aligned, long addresses must be long-aligned (20-bits)
RD/WR-LONG/WORD/BYTE
4. HUB Addresses are always byte addresses and 20-bits (max address is 1MB)
5. COG/LUT Addresses are contiguous and are 11-bits (max address is 2K-Long=8KB )
SETQ sets up a count (11-bits = 2,048) of 32-bit longs to read/write from/to Hub to/from COG/LUT.
Therefore, the SETQ & RD/WR-LONG combination can load COG & LUT in one step.
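As a rough sketch of the SETQ & RDLONG block load described in item 5, here is a Python model. The function name, the argument order, and the little-endian byte order are my assumptions for illustration; the count is taken literally as "number of longs", as stated above.

```python
import struct

COGLUT_LONGS = 0x800  # 11-bit COG/LUT space: 2K longs

def block_load(hub, hub_byte_addr, count, regs, first_reg=0):
    """Copy `count` longs from hub (byte addressed) into COG/LUT (long addressed).

    Hypothetical model: `hub` is a bytes-like object, `regs` is a list of
    COGLUT_LONGS integers, and longs are assumed to be little-endian.
    """
    if hub_byte_addr % 4:
        raise ValueError("long addresses must be long-aligned in this model")
    if not 1 <= count <= 2048:
        raise ValueError("count must be 1..2048 longs")
    if first_reg + count > len(regs):
        raise ValueError("block would run past the end of COG/LUT")
    for i in range(count):
        regs[first_reg + i] = struct.unpack_from("<I", hub, hub_byte_addr + 4 * i)[0]

hub = bytes(range(16))
regs = [0] * COGLUT_LONGS
block_load(hub, 4, 2, regs, first_reg=0x200)   # land two longs at the start of LUT
print(hex(regs[0x200]), hex(regs[0x201]))
```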
General
Presuming the above, then...
6. JMP/CALLx/RETx Instructions could now be simplified to
- Use a "Flat Instruction Address" model of 18-bits
- Always address 32-bit Longs
It's true that code with 20-bit branches to itself would have to run in either cog/lut or hub, depending on which it was assembled for.
Does that mean a subroutine can run either in HUB, or copied into COG ? (with the latter being more deterministic ? ) [assume it uses relative jumps only, and rets]
The Relative jumps are handled ok in both cases ?
What about code like indexed case statements ?
Can the tools warn, if someone tries to call/jump to 20-bit absolute code in the wrong place ?
If a subroutine has relative D,@S branches within itself and RETurns, it could run in either cog/lut or hub, in raw binary form. 20-bit relative branches would cause incompatibility, though.
The only reason I'll shift those 9-bit relative branches up two bits for hub exec would be to increase their range, at the cost of insisting alignment be common.
Indexed case statements would cause incompatibility, too.
I kind of doubt much binary code would ever be expected to run in both cog/lut and hub modes.
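To illustrate the shifted 9-bit relative branch, here is a small Python sketch. It assumes the offset is counted from the instruction after the branch, which is my guess for the example and not something stated above; the function names are made up.

```python
HUB_START = 0x00400
PC_MASK = 0xFFFFF

def sign_extend_9(value):
    """Interpret the low 9 bits of `value` as a signed number (-256..255)."""
    value &= 0x1FF
    return value - 0x200 if value & 0x100 else value

def branch_target(pc, imm9):
    """Target of a relative D,@S branch located at `pc`.

    Assumption: the offset is relative to the next instruction.
    """
    offset = sign_extend_9(imm9)
    if pc < HUB_START:
        # cog/LUT exec: PC counts longs, offset used as-is
        return (pc + 1 + offset) & PC_MASK
    # hub exec: PC counts bytes, offset shifted left twice
    return (pc + 4 + (offset << 2)) & PC_MASK

# The same raw immediate skips the same number of instructions in both modes.
print(hex(branch_target(0x00010, 3)))    # cog exec
print(hex(branch_target(0x01000, 3)))    # hub exec
```

Under these assumptions the raw immediate means "N instructions" in either mode, which is why a routine that uses only these branches and RETurns could run from cog/lut or hub unchanged, provided the hub copy is long-aligned.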
All JMP/CALLx/RETx addresses would be long-addresses, and long-aligned, and long +/- relative
So, @relative would always be +/- Longs.
The PC (program counter) would always be in longs, so always increments by 1.
And @relative always +/- the exact count (because they are both in longs).
Only when accessing hub for fetching hubexec code, does the access add 2 LSB "00"s because hub is byte addressed. This is hidden from the user.
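A minimal Python sketch of this long-address PC model, with hypothetical function names, and again assuming relative offsets count from the instruction after the branch:

```python
LONG_PC_MASK = 0x3FFFF  # 18-bit long address covers 256K longs (1MB of hub)

def hub_fetch_address(pc_long):
    """Byte address used to fetch from hub: the long PC with "00" appended."""
    if not 0 <= pc_long <= LONG_PC_MASK:
        raise ValueError("PC must be an 18-bit long address")
    return pc_long << 2

def relative_target(pc_long, offset_longs):
    """Relative branch target; PC and offset are both in longs.

    Assumption: the offset is relative to the next instruction.
    """
    return (pc_long + 1 + offset_longs) & LONG_PC_MASK

print(hex(hub_fetch_address(0x00800)))     # first hubexec long in this model
print(hex(relative_target(0x00800, -1)))   # branch-to-self under this assumption
```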
I saw benefits for HLL flows, if the code can 'run anywhere', library management is much easier.
eg HUB calls would be used if you needed spare room in the COG N, whilst COG M may be speed focused, and it can copy to a local version.
If two modes have to be carefully tracked and managed, that really needs flags and a linker to ensure you really do have the right code in the right location. (and still there is room for user slip-ups)
HLL code generators will need two instances of code, and a means to propagate settings for builds. They will also need to create separate symbols, as some designs could use both libraries.
Hmmm... This sounds like it will be a pain. If an entire program is either hub or COG then GCC can pick the right library. However, I'm not sure how to handle a program that consists of both types of code.
If there were binary compatibility between modes, there would still be some things to think about. You'd have to make sure the code would fit, after the variables. You couldn't access any hub data from relative addresses, like you could in a normal hub-exec app. It would basically only work for certain routines that were not very data dependent. In my pondering, I've come to the conclusion that hub code and cog/lut code will do different things, typically, and not seek to run in both modes.
All JMP/CALLx/RETx addresses would be long-addresses, and long-aligned, and long +/- relative
So, @relative would always be +/- Longs.
The PC (program counter) would always be in longs, so always increments by 1.
And @relative always +/- the exact count (because they are both in longs).
Only when accessing hub for fetching hubexec code, does the access add 2 LSB "00"s because hub is byte addressed. This is hidden from the user.
Interesting idea, but it would necessitate some new instructions to handle the two-bit differences for resolving data addresses. I think while this would improve cog/lut and hub code compatibility, it would introduce special considerations that would need ongoing minding. If we believe that cog/lut code and hub code are going to typically be mutually exclusive in purpose and place, there's no need to try to make them compatible, as this will introduce new complexities.
Since we've now decided that COG and hub code don't have to be compatible, I guess that opens up the possibility that they could be two entirely different instruction sets. Maybe hub could run RISC-V code? ARM code? (just kidding)
I'm on it!
Actually, I was thinking that maybe in the future, if the RISC-V thing looks good, we could use it and add all kinds of nice features, like analog I/O, streaming, CORDIC, etc. and make a uniquely useful RISC-V chip. I suppose the availability of lots of silicon IP for all the popular protocols would be even more important than what architecture glues it all together, given you program in C, either way. Oh, it doesn't seem very exciting, overall.
I'm not following the "You couldn't access any hub data from relative addresses" comment.
COG data is register-addressed, and HUB arrays would have an absolute base, so I'm not sure where 'relative addresses' applies ?
If code cannot run in both modes, then HLL libraries will need to have both variants.
Will that be possible with a single source base, (eg conditional assembly) or does this force a move to two-source development ?
I think I generally agree with you, though I would maybe make a few crazy suggestions.
Instruction Addresses
$00000-$001FF : Cog (longs)
$00200-$003FF : LUT (longs)
$00400-$7FFFF : reserved (I would just map it back on the Cog/LUT instruction space for now)
$80000-$9FFFF : Hub (longs)
$A0000-$FFFFF : reserved (I would just map it back on the Hub instruction space for now)
Data Addresses
$00000000-$000001FF : Cog (512 longs)
$00000200-$000003FF : LUT (512 longs)
$00000400-$0007FFFF : reserved for per-cog expansion
$00080000-$000FFFFF : Hub (512K bytes)
$00100000-$FFFFFFFF : reserved for hub expansion
Additionally, I would change the ORGx directives to be LONG-oriented. All labels are long-oriented, unless qualified in some way (wrapped in {}, prefaced with "&", etc.) to indicate their byte value instead.
I would also keep BYTE, WORD, and LONG aligned to natural boundaries (as on P1), then add a PBYTE, PWORD, and PLONG for packed data usage. Since the packed variants will only be useful in hub memory, their associated labels will always need to be qualified-to-byte-address.
Note: you may ask why I suggest moving the hub stuff to start at $80000 for both instructions and data. A few reasons:
* First, most access to hub memory (anything over the first 512 bytes or instructions) is going to require greater-than-9-bit addresses, so you may as well take advantage of the address space (20-bit for instructions, 32-bit for data). This gives plenty of room to expand while still keeping the instruction and data address spaces flat.
* Second, now all of hub memory is executable.
* Third, this allows both instruction and data addressing to start at the same value ($80000). Of course, from there, you are dealing with longs (instruction addresses) or bytes (data addresses). To make conversion easy (for those times you need to treat instructions like data and vice versa), you could have two instructions, HLTOB (hub long-to-byte) and HBTOL (hub byte-to-long), which basically do DATA_ADDR = ((INST_ADDR << 2) - $180000) and the inverse.
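The conversion formula is easy to check in Python. The sketch below uses the suggested mnemonics as function names; the range checks are my own additions based on the address maps above.

```python
HUB_BASE = 0x80000      # hub starts here in both address spaces
INST_LAST = 0x9FFFF     # last hub instruction (long) address in the proposed map
DATA_LAST = 0xFFFFF     # last hub data (byte) address in the proposed map
BIAS = 0x180000         # (HUB_BASE << 2) - HUB_BASE

def hltob(inst_addr):
    """Hub long-to-byte: DATA_ADDR = ((INST_ADDR << 2) - $180000)."""
    if not HUB_BASE <= inst_addr <= INST_LAST:
        raise ValueError("not a hub instruction address")
    return (inst_addr << 2) - BIAS

def hbtol(data_addr):
    """Hub byte-to-long: the inverse, valid only for long-aligned byte addresses."""
    if not HUB_BASE <= data_addr <= DATA_LAST:
        raise ValueError("not a hub data address")
    if data_addr % 4:
        raise ValueError("byte address is not long-aligned")
    return (data_addr + BIAS) >> 2

print(hex(hltob(0x80006)))   # long address of a label -> its byte address
print(hex(hbtol(0x80018)))   # and back again
```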
A code example would look like:
dat
            orgh    $80000              ' any value below $80000 will not compile
init        coginit #16, #init          ' start the next cog (in hubexec mode)
            jmp     #init2              ' notice that init2 = $80002, but still works because jmp has 20-bit address field
            'jmp    @init2              ' relative address is in longs, so 20-bit address would be $00001
init2       mov     ptrb, #{cog_code}   ' cog_code = $80006 (longs), {cog_code} = $80018 (bytes)
            setq    #(x-cog_entry)      ' x and cog_entry are longs, so simple subtraction for setq
            rdlong  cog_entry, ptrb
            jmp     #cog_entry          ' cog_entry = $00008
cog_code
            org     8                   ' any value over $3FF will not compile
cog_entry
blink       rep     @:end, #0
            cogid   x
            setb    dirb, x
            notb    outb, x
            add     x, #16
            shl     x, #18
            waitx   x
:end
x           res     1
In any case, I really think we should just go with only long aligned execution across all types.
I would even be fine with having hub access always be long oriented, and just have instructions for reading the proper bytes or words from a long address. So, for example, RDBYTE D,S,N where N is what byte to read from the hub address at S and place into D. This would be similar to the get/setbyte and get/setword instructions. In fact, we could even do the nibble versions if we wanted.
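Roughly what I have in mind for RDBYTE D,S,N, as a Python sketch. The function names are made up, and numbering byte 0 as the least significant byte is an assumption chosen to match how get/setbyte style instructions usually count.

```python
def read_byte(hub_longs, long_addr, n):
    """Model of RDBYTE D,S,N: return byte `n` (0..3) of the long at `long_addr`.

    Assumption: byte 0 is the least significant byte of the long.
    """
    if n not in range(4):
        raise ValueError("byte number must be 0..3")
    return (hub_longs[long_addr] >> (8 * n)) & 0xFF

def read_word(hub_longs, long_addr, n):
    """Same idea for words: return word `n` (0..1) of the long at `long_addr`."""
    if n not in range(2):
        raise ValueError("word number must be 0..1")
    return (hub_longs[long_addr] >> (16 * n)) & 0xFFFF

hub_longs = [0x44332211, 0x88776655]

# Walking consecutive bytes means splitting the index into (long address, byte number).
for i in range(8):
    long_addr, n = divmod(i, 4)
    print(i, hex(read_byte(hub_longs, long_addr, n)))
```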
The more I think about this, the more I think it's the way to go. I hope Chip likes it.
Wouldn't this destroy linear byte and word addressing? We'd have to maintain separate counters to cycle through bytes and longs.
I've been stalled out, trying to come to some resolution about how to rein this mess in, because moving forward on the current path is dismal.
Here is what looks best to me:
$00000..$001FF = cog exec (register addressing is 1:1, PC steps by 1)
$00200..$003FF = LUT exec (register addressing is 1:1, PC steps by 1)
$00400..$FFFFF = hub exec (PC steps by 4, relative D,@S (9-bit immediate) branches are shifted left twice)
This keeps the cogs simple and fun, like they are on Prop1, which is a necessity. It also gets rid of any impetus to make overlapped cog/LUT/hub execution spaces.
There's no way we ought to clutter up the assembly language with all kinds of operators to overcome the current 4:1 hub:cog/LUT addressing ratio. Making it 1:1 keeps us sane, happy, and free.
This takes a few gates to implement and it doesn't slow the chip down.
This would make the Prop2 just like the Prop1, but without any hub alignment requirements for longs and words, and with the pleasant addition of hub exec. It even gets rid of the notion that hub exec instructions ought to be long-aligned for any reason. Hub execution and data access all become dirt simple with no alignment caveats.
Chip,
I am sure you are making this far more difficult than it needs to be, and at the same time introducing unnecessary restrictions.
Remember the complexities we had to make LMM work. Hubexec complexities are insignificant compared to LMM.
HUB Address Differences
The only differences in HUB addresses (labels) are where they are used...
1. In an instruction as #abs or @rel when used as a destination address. They should always be long addresses.
2. When used as a Data location address... They should always be byte addresses.
The compiler should be able to determine this, and use the appropriate address (ie discard the 2xLSBs when a destination address).
If we want to mix, load, manipulate, etc, any hub address as an instruction destination (or return address), then we will need to take the hub address and >>2.
ie the programmer will need to take care of this. There is no other solution that will work, and it is IMHO the simplest to cover.
However, if you really want to simplify this, you could use an alternative identifier other than # or ##.
The @ will always be long.
Summary
IMHO, the above seems both logical and simple.
What do you think???
I need to go over what you wrote more carefully, because it didn't click the first time. I will do that now.
DJxx/TJxx D,S/@ and JP/JNP D/#,S/@ Instructions
These are simple and straightforward.
- @rel Restricted to +/- 256 long relative jumps within HUB or within COG/LUT
- S/#) Restricted to direct jump to Cog-Register code(ie COG/LUT $008..1FF)
Again, simple and straightforward (reduce to 18-bit absolute/relative long address).
- @rel Restricted to +/- 128K-Long relative jumps within HUB or
..... within COG/LUT (COG/LUT ignores bits[18:11])
- #abs) Restricted to absolute jump to Hub $00800-$3FFFF
..... COG/LUT $00008..007FF
CCCC 11101ww Rnn nnnnnnnnn nnnnnnnnn LOC reg,#abs/@rel 'loads Register A..D with 20-bit address A..D
We need to determine if we require two similar instructions for this, one with a 20-bit result and one with an 18-bit result. Otherwise, we may need to perform a >>2 on the result.
People are talking like hubexec is going to be so significantly slower that you would sometimes want to run the same code from cog space instead of hub space.
However, it's just not the case. For straight line code it's the same speed once it's going. It's only branches that incur a stall, and it's not uber long.
Realistically, I can't think of a case where I'd want the exact same binary code to both run in hub space and cog space. Especially, since it ALREADY has to live in hub space, and doing the copy to cog space to then run it there is enough overhead that you probably end up with a wash or loss doing that instead of just running it in hub space. Tight loops being the exception, and the code would need to function both ways (with the hub stalls on branches) anyway, so why?
Can you give a non-contrived example of it being actually useful?
People are talking like hubexec is going to be so significantly slower that you would sometimes want to run the same code from cog space instead of hub space.
However, it's just not the case. For straight line code it's the same speed once it's going. It's only branches that incur a stall, and it's not uber long.
Those all add up and each stall is a variance, and moves away from the hard determinism the user was promised by having 16 separate cores.
Realistically, I can't think of a case where I'd want the exact same binary code to both run in hub space and cog space. ... Tight loops being the exception,
Can you give a non-contrived example of it being actually useful?
I believe you answered your own question. (I made it bold)
Hard real time, deterministic code still matters.
The new LUT execute has opened a lot more space to include local-libraries, but I get the feeling the extra admin complexity will get that placed in the 'too hard box' ( as your post illustrates), and so libraries will tend to be HUB only, which relegates the P2 in typical use cases, to much like any other MCU running interrupts.
People are talking like hubexec is going to be so significantly slower that you would sometimes want to run the same code from cog space instead of hub space.
However, it's just not the case. For straight line code it's the same speed once it's going. It's only branches that incur a stall, and it's not uber long.
Totally agree with this.
But you have missed an important issue... It will most likely consume more power because you have an additional 16KB block active every clock, and per hubexec executing cog. So we are going to want to restrict hubexec where possible.
Realistically, I can't think of a case where I'd want the exact same binary code to both run in hub space and cog space. Especially, since it ALREADY has to live in hub space, and doing the copy to cog space to then run it there is enough overhead that you probably end up with a wash or loss doing that instead of just running it in hub space. Tight loops being the exception, and the code would need to function both ways (with the hub stalls on branches) anyway, so why?
Can you give a non-contrived example of it being actually useful?
I cannot think of an example. I am sure there are some, but whether they are important in the whole scheme of things, I think not.
But I would like the programming model, if possible, to be as similar as possible for hubexec vs cog/cogexec.
Thus, IMHO long addresses make the most sense. This is what we are used to in the P1. Look at the problems this is causing in the cog currently. We don't want to have 7 bit cog addresses with 2 "00" bits appended. The whole thing is upside down IMHO.
Anyway, with the same model using long addresses, @rel will work identically in both cogexec and hubexec. So it is possible (there will be some restrictions such as no rep instruction) to make the same routines able to be hub and/or cog resident.
So I am with you - keep it as simple as possible, no kludges starting hub at $00001, skip the lower 8KB hub and map the cog/lut here. We now have 504KB of other hubexec space which gives us 126K instruction space.
How are calls from COG/LUT to code that is HUBEXEC managed ?
What about crossing the LUT-HUB boundary ?
Amen, Brother!
The inspiration doesn't arrive until it does.
It's true that code with 20-bit branches to itself would have to run in either cog/lut or hub, depending on which it was assembled for.
Branches between cog/lut and hub are just as before. No special considerations.
I was mistaken when I said earlier that you can execute through LUT, straight into hub. It would not work, because hub exec needs to be initiated by a branch into hub range. This branch is needed to trigger the cog state machine to issue a RDFAST to start the instruction stream. Without that branch, it would start pulling longs from the FIFO without it being loaded.
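To make the three ranges concrete, here is a minimal Python sketch of the map Chip posted earlier in the thread (cog at $00000..$001FF, LUT at $00200..$003FF, hub exec from $00400 up). The function names are made up for illustration, and the cog-to-LUT fall-through is simply assumed to work; only the LUT-to-hub case is described above as failing.

```python
# Sketch of the three execution ranges from Chip's map (names are illustrative).
COG_START, LUT_START, HUB_START, ADDR_LIMIT = 0x00000, 0x00200, 0x00400, 0x100000


def exec_region(pc):
    """Return 'cog', 'lut' or 'hub' for a 20-bit program counter value."""
    if not COG_START <= pc < ADDR_LIMIT:
        raise ValueError(f"PC ${pc:05X} is outside the 20-bit range")
    if pc < LUT_START:
        return "cog"
    if pc < HUB_START:
        return "lut"
    return "hub"


def pc_step(pc):
    """PC steps by 1 in cog/LUT exec and by 4 in hub exec."""
    return 4 if exec_region(pc) == "hub" else 1


def next_sequential_pc(pc):
    """Advance without a branch; refuse to fall through from LUT into hub."""
    nxt = pc + pc_step(pc)
    if exec_region(pc) != "hub" and nxt >= HUB_START:
        # Assumption taken from the post above: without a branch into hub
        # range no RDFAST is issued, so the FIFO holds no instructions.
        raise RuntimeError("hub exec has to be entered by a branch")
    return nxt
```

Stepping from $3FF raises here, which mirrors the point that the instruction stream only starts after a branch into hub range.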
The Relative jumps are handled ok in both cases ?
What about code like indexed case statements ?
Can the tools warn, if someone tries to call/jump to 20-bit absolute code in the wrong place ?
Instructions
1. Make all instructions/code addresses longs and long-aligned.
- seems no arguments here
- addresses can now be reduced to 18 bits to address 256K longs (1MB of hub).
- when fetching instructions from hub (hubexec), the PC (program counter) simply appends 2 (LSB) bits "00".
2. Use a "Flat Instruction Address" model
- seems most here want that (IMHO I don't think this is necessary but certainly the simplest to explain)
(18-bit long address)
$00000-001FF = COG RAM/Code (2KB=512-Longs) - usable as traditional cog code
$00200-003FF = LUT RAM/Code (2KB=512-Longs) - usable as lutexec
$00400-005FF = LUT RAM/Code (2KB=512-Longs) - Possible expansion if space on P2 die
$00600-007FF = LUT RAM/Code (2KB=512-Longs) - Possible future expansion
*$00000-007FF = HUB RAM (8KB=2K-Longs) - NOT usable for hubexec
$00800-1FFFF = HUB RAM/Code (504KB=126K-Longs) - also usable as hubexec
$20000-3FFFF = HUB RAM/Code (512KB=128K-Longs) - possible future expansion & hubexec
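As a rough check of the proposed map, the ranges can be written out in Python. This is only a sketch of the proposal as listed; the region labels and function names are hypothetical.

```python
# Sketch of the proposed 18-bit long-address instruction map.
def classify_long_address(addr):
    """Map an 18-bit long address to the region named in the proposal."""
    if not 0 <= addr <= 0x3FFFF:
        raise ValueError("not an 18-bit long address")
    if addr <= 0x001FF:
        return "cog"
    if addr <= 0x003FF:
        return "lut"
    if addr <= 0x007FF:
        return "lut-expansion"   # listed as possible expansion only
    if addr <= 0x1FFFF:
        return "hub"
    return "hub-expansion"       # listed as possible future expansion


def hub_fetch_byte_address(addr):
    """Hub is byte addressed, so a hubexec fetch appends two '00' LSBs."""
    if classify_long_address(addr) not in ("hub", "hub-expansion"):
        raise ValueError("address does not execute from hub")
    return addr << 2


# Size of the hubexec range in longs: $00800..$1FFFF inclusive.
HUBEXEC_LONGS = 0x1FFFF - 0x00800 + 1
```

The first hubexec long address $00800 lands on byte address $02000, which is the 8KB that the map gives up, and the range works out to 126K longs as stated.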
Memory
3. Use a "Flat Memory Address" model
- $00000-007FF = COG/LUT contiguous address (long addresses only)
- $00000-FFFFF = HUB 1MB contiguous address (byte addresses)
. - word addresses must be word-aligned, long addresses must be long-aligned (20-bits)
RD/WR-LONG/WORD/BYTE
4. HUB Addresses are always byte addresses and 20-bits (max address is 1MB)
5. COG/LUT Addresses are contiguous and are 11-bits (max address is 2K-Long=8KB )
SETQ sets up a count (11-bits = 2,048) of 32-bit longs to read/write from/to Hub to/from COG/LUT.
Therefore, the SETQ & RD/WR-LONG combination can load COG & LUT in one step.
General
Presuming the above, then...
6. JMP/CALLx/RETx Instructions could now be simplified to
- Use a "Flat Instruction Address" model of 18-bits
- Always address 32-bit Longs
Summary
IMHO, the above seems both logical and simple.
What do you think???
If a subroutine has relative D,@S branches within itself and RETurns, it could run in either cog/lut or hub, in raw binary form. 20-bit relative branches would cause incompatibility, though.
The only reason I'll shift those 9-bit relative branches up two bits for hub exec would be to increase their range, at the cost of insisting alignment be common.
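A small Python sketch of how that shift could work out. It assumes the 9-bit field is a signed offset applied to the address of the instruction following the branch; that base is an assumption, not something stated in the thread, and the function names are invented.

```python
# Sketch: 9-bit relative branch resolved in cog/LUT exec versus hub exec.
HUB_START = 0x00400  # hub exec begins here in Chip's map


def sign_extend_9(field):
    """Interpret the low 9 bits of `field` as a signed offset (-256..+255)."""
    field &= 0x1FF
    return field - 0x200 if field & 0x100 else field


def branch_target(pc, field):
    """Target of a relative D,@S branch located at `pc`.

    Assumption: the offset is relative to the next instruction, i.e.
    pc + 1 in cog/LUT exec and pc + 4 in hub exec.
    """
    offset = sign_extend_9(field)
    if pc >= HUB_START:
        return pc + 4 + (offset << 2)  # shifted left twice for hub exec
    return pc + 1 + offset
```

With that rule the same field value skips the same number of instructions in either mode: a field of 3 goes from $010 to $014 in cog exec and from $01000 to $01010 in hub exec, which is why the raw binary could run in both places.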
Indexed case statements would cause incompatibility, too.
I kind of doubt much binary code would ever be expected to run in both cog/lut and hub modes.
All JMP/CALLx/RETx addresses would be long-addresses, and long-aligned, and long +/- relative
So, @relative would always be +/- Longs.
The PC (program counter) would always be in longs, so always increments by 1.
And @relative always +/- the exact count (because they are both in longs).
Only when accessing hub for fetching hubexec code, does the access add 2 LSB "00"s because hub is byte addressed. This is hidden from the user.
eg HUB calls would be used if you needed spare room in the COG N, whilst COG M may be speed focused, and it can copy to a local version.
If two modes have to be carefully tracked and managed, that really needs flags and a linker to ensure you really do have the right code in the right location. (and still there is room for user slip-ups)
HLL code generators will need two instances of code, and a means to propagate settings for builds. They will also need to create separate symbols, as some designs could use both libraries.
If there were binary compatibility between modes, there would still be some things to think about. You'd have to make sure the code would fit, after the variables. You couldn't access any hub data from relative addresses, like you could in a normal hub-exec app. It would basically only work for certain routines that were not very data dependent. In my pondering, I've come to the conclusion that hub code and cog/lut code will do different things, typically, and not seek to run in both modes.
Interesting idea, but it would necessitate some new instructions to handle the two-bit differences for resolving data addresses. I think while this would improve cog/lut and hub code compatibility, it would introduce special considerations that would need ongoing minding. If we believe that cog/lut code and hub code are going to typically be mutually exclusive in purpose and place, there's no need to try to make them compatible, as this will introduce new complexities.
I'm on it!
Actually, I was thinking that maybe in the future, if the RISC-V thing looks good, we could use it and add all kinds of nice features, like analog I/O, streaming, CORDIC, etc. and make a uniquely useful RISC-V chip. I suppose the availability of lots of silicon IP for all the popular protocols would be even more important than what architecture glues it all together, given you program in C, either way. Oh, it doesn't seem very exciting, overall.
I'm not following the "You couldn't access any hub data from relative addresses" comment.
COG data is register address, and HUB arrays would have an absolute base, so I'm not sure where 'relative addresses' applies ?
If code cannot run in both modes, then HLL libraries will need to have both variants.
Will that be possible with a single source base, (eg conditional assembly) or does this force a move to two-source development ?
The COG code case is still what we do in P1. They are discrete so they can be reused in the COGS. Stuff that really needs a COG is special.
Hub code is the more typical case for libraries, and/or lower performance use cases.
I think I generally agree with you, though I would maybe make a few crazy suggestions.
Instruction Addresses
$00000-$001FF : Cog (longs)
$00200-$003FF : LUT (longs)
$00400-$7FFFF : reserved (I would just map it back on the Cog/LUT instruction space for now)
$80000-$9FFFF : Hub (longs)
$A0000-$FFFFF : reserved (I would just map it back on the Hub instruction space for now)
Data Addresses
$00000000-$000001FF : Cog (512 longs)
$00000200-$000003FF : LUT (512 longs)
$00000400-$0007FFFF : reserved for per-cog expansion
$00080000-$000FFFFF : Hub (512K bytes)
$00100000-$FFFFFFFF : reserved for hub expansion
Additionally, I would change the ORGx directives to be LONG-oriented. All labels are long-oriented, unless qualified in some way (wrapped in {}, prefaced with "&", etc.) to indicate their byte value instead.
I would also keep BYTE, WORD, and LONG aligned to natural boundaries (as on P1), then add a PBYTE, PWORD, and PLONG for packed data usage. Since the packed variants will only be useful in hub memory, their associated labels will always need to be qualified-to-byte-address.
Note: you may ask why I suggest moving the hub stuff to start at $80000 for both instructions and data. A couple reasons:
* First, most access to hub memory (anything over the first 512 bytes or instructions) is going to require greater-than-9-bit addresses, so you may as well take advantage of the address space (20-bit for instructions, 32-bit for data). This gives plenty of room to expand while still keeping the instruction and data address spaces flat.
* Second, now all of hub memory is executable.
* Third, this allows both instruction and data addressing to start at the same value ($80000). Of course, from there, you are dealing with longs (instructions addresses) or bytes (data addresses). To make conversion easy (for those times you need to treat instructions like data and vice versa), you could have two instructions HLTOB (hub long-to-byte) and HBTOL (hub byte-to-long), which basically does DATA_ADDR = ((INST_ADDR << 2) - $180000) and the inverse.
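To check the arithmetic behind the suggested HLTOB and HBTOL, here is a Python sketch. The two instructions are only a proposal in this post; the alignment check is an added assumption.

```python
# Sketch of the proposed hub long-to-byte / byte-to-long conversions.
HUB_BASE = 0x80000   # first hub address in both proposed address spaces
BIAS = 0x180000      # (HUB_BASE << 2) - HUB_BASE


def hltob(inst_addr):
    """Hub long (instruction) address to hub byte (data) address."""
    return (inst_addr << 2) - BIAS


def hbtol(data_addr):
    """Hub byte (data) address to hub long (instruction) address."""
    if data_addr & 3:
        # Assumption: only long-aligned data addresses can name an instruction.
        raise ValueError("data address is not long-aligned")
    return (data_addr + BIAS) >> 2
```

Both spaces start at $80000, and the last instruction address $9FFFF maps to byte address $FFFFC, the last long of the 512K-byte hub range.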
A code example would look like:
+ 1
Chip,
I am sure you are looking at this way more difficult than it needs to be, and at the same time introducing unnecessary restrictions.
Remember the complexities we had to make LMM work. Hubexec complexities are insignificant compared to LMM.
HUB Address Differences
The only differences in HUB addresses (labels) are where they are used...
1. In an instruction as #abs or @rel when used as a destination address. They should always be long addresses.
2. When used as a Data location address... They should always be byte addresses.
The compiler should be able to determine this, and use the appropriate address (ie discard the 2xLSBs when a destination address).
If we want to mix, load, manipulate, etc, any hub address as an instruction destination (or return address), then we will need to take the hub address and >>2.
ie the programmer will need to take care of this. There is no other solution that will work, and it is IMHO the simplest to cover.
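What the compiler would do with a hub label under this rule could look roughly like the Python below. It is a sketch only: the function, the "branch"/"data" tags and the error checks are hypothetical, and the $00800 limit is taken from the long-address map proposed earlier in the thread.

```python
# Sketch: resolving one hub label differently depending on how it is used.
HUBEXEC_MIN_BYTE = 0x00800 << 2  # hub below long address $00800 is not executable


def resolve_hub_label(byte_addr, use):
    """Return the value an assembler would emit for a hub label.

    use == "data":   operand of RD/WR-LONG/WORD/BYTE, stays a byte address
    use == "branch": #abs or @rel destination, becomes a long address
    """
    if use == "data":
        return byte_addr
    if use == "branch":
        if byte_addr & 3:
            raise ValueError("branch target is not long-aligned")
        if byte_addr < HUBEXEC_MIN_BYTE:
            raise ValueError("branch target is below the hubexec range")
        return byte_addr >> 2  # discard the two LSBs
    raise ValueError("unknown use: " + repr(use))
```

The manual case mentioned above, where a hub address is loaded and then used as a destination, is the same >>2 done by the programmer instead of the tool.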
However, if you really want to simplify this, you could use an alternative identifier other than # or ##.
The @ will always be long.
(see the next post for more...)
BTW Did you see this post of mine?
Now, I need to update the assembler and recompile everything to test it.
I need to go over what you wrote more carefully, because it didn't click the first time. I will do that now.
To clarify, this change means exactly what an opcode does, now depends on the memory address it executes from ?
Here are all the relevant JMP/CALLx/RETx etc instructions which are affected by the Addressing Issue...
DJxx/TJxx D,S/@ and JP/JNP D/#,S/@ Instructions
These are simple and straightforward.
- @rel Restricted to +/- 256 long relative jumps within HUB or within COG/LUT
- S/#) Restricted to direct jump to Cog-Register code (ie COG/LUT $008..1FF)
Again, simple and straightforward (reduce to 18-bit absolute/relative long address).
- @rel Restricted to +/- 128K-Long relative jumps within HUB or
..... within COG/LUT (COG/LUT ignores bits[18:11])
- #abs) Restricted to absolute jump to Hub $00800-$3FFFF
..... COG/LUT $00008..007FF
We need to determine if we require two similar instructions for this, one with a 20-bit result and one with an 18-bit result. Otherwise, we may need to perform a >>2 on the result.
IMHO all other instructions are straightforward.