I've got the ROM booter written now, but it still needs some debugging. It does signed loader verification using the fuses. The bug is in the SHA-256/HMAC code, and I hope to have that fixed today. Once that's done, we may be finished with the chip, as far as what's needed to make the silicon goes.
I decided NOT to have a monitor program in ROM, for several reasons which I'll write about later.
Sounds like progress! Once you've got the boot code written, are you going to release a version of pnut that builds the binary with the boot loader?
The Flash/Eprom/OTP (included OTP also) would not necessarily have to be that big.
Large enough for one or more of the following...
* Boot from SD Card
* Perform a Monitor/Debug resident function
* Perform an alternate Download/Update function other than the default ROM
* Remove the requirement for an additional SPI Flash in every design
The last is perhaps my biggest bugbear with the P1 because almost all my current designs use an SD Card, and the EEPROM takes precious board space and adds unnecessary cost.
MRAM would be way better than any of those options, Cluso. It can replace the HubRAM directly and would allow the full 1MB, because its cell size is much smaller than SRAM's.
I asked Beau about MRAM way back and got an answer of it needing four metal layers, or something like that. Since I didn't ask for clarification at the time, I've never known why that was a roadblock.
Would it load the bootloader into RAM and execute and then load program over serial?
The chip would always check for a signed 1st-stage boot loader, which would, in turn, load the main program. Eventually, there will be custom options for what these two loaders are. For now, PNut will handle it, but I'll document the protocol, so that you guys can make your own tools, if you want.
It's like the Prop1, where it checks for a serial connection, then attempts to load from SPI flash (just starts reading at $000000, gets $200 longs, with the last 8 being the HMAC signature).
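For anyone writing their own tools, here is a rough host-side sketch in Python of the image layout described above: $200 longs, with the last 8 longs (32 bytes, the size of a SHA-256 digest) holding the HMAC signature. Exactly which bytes get signed and how the key is derived from the fuses are not stated in the thread, so the assumption here is that the signature covers the first $1F8 longs and that `key` is simply supplied by the caller. The function names are made up for illustration.

```python
import hashlib
import hmac

LONG_BYTES = 4
IMAGE_LONGS = 0x200   # $200 longs read from flash starting at $000000
SIG_LONGS = 8         # the last 8 longs hold the HMAC signature


def split_image(image: bytes):
    """Split a boot image into (body, signature)."""
    if len(image) != IMAGE_LONGS * LONG_BYTES:
        raise ValueError("expected exactly $200 longs")
    cut = (IMAGE_LONGS - SIG_LONGS) * LONG_BYTES
    return image[:cut], image[cut:]


def loader_is_authentic(image: bytes, key: bytes) -> bool:
    """Check the trailing signature against HMAC-SHA-256 of the body.

    Assumption: the signature covers everything before the last 8 longs.
    """
    body, signature = split_image(image)
    expected = hmac.new(key, body, hashlib.sha256).digest()
    return hmac.compare_digest(expected, signature)
```

A signing tool would do the reverse: compute the HMAC over the $1F8-long body and append the 32-byte digest.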
I discovered some problems with the way LOC assembles. In the case of cog-register code, it needs to use absolute addresses, only. I'll work on this tonight. The ROM booter is working with 'MOV PTRA,#', for now.
Chip,
Is this how the Cog & Lut interact with the dual port rams?
There doesn't seem to be any good reason that I can think of why the COG and LUT need to be separate entities. I realise they need to be two blocks of DPRAM so that the streamer can deliver data from LUT to the DACs/etc blocks on the S/D clock.
Is there any reason the LUT could not be treated as just COG with addresses $200-$3FF ?
When not in streamer mode, the AUGS/AUGD instruction(s) could be used to access the additional register space $200-$3FF.
When streamer mode was selected, this pathway would be "disabled".
Would this affect the shared LUT between cogs?
Would this simplify the design, or is it not worth considering???
Cluso, sorry it's taken me so long to respond to this. Almost anything can be done, as you know, but I think it's too disruptive, at this point, to embark on a change like that. For what it's worth, though, could you outline how AUGS/AUGD would be used to make the LUT an extension of the cog RAM?
I don't want anything that may cause any problems with the current P2 progress. It just seemed to me that it may simplify usage of the LUT space if it were just considered to be Cog Ram that has an alternative use, being the streamer. When the streamer is not used (I would think more times than not), Cog Ram could just be considered flat cog space $000-$3FF.
Normal instructions could only directly address (the lower 2KB of) Cog Ram as register space $000-$1FF, but when used in conjunction with AUGS/AUGD (the upper 2KB of) Cog Ram $200-$3FF could be reached as register space too (in addition to instruction code space).
                ORG     $300                            ' ORG into LUT space

routine         mov     lutreg, #$100                   ' move "$100" into Cog $380 (ie into LUT $180)

' If lutreg was > $1FF (ie LUT), the assembler would replace the above single instruction with a pair
' of instructions as follows, or the programmer could supply the pair of instructions instead

routine         augd    #(lutreg >> 9)
                mov     (lutreg & $1FF), #$100          ' move "$100" into Cog $380

                add     lutreg, ##$123456               ' add "$123456" to Cog $380 (ie to LUT $180)
' would become...
                augd    #(lutreg >> 9)
                augs    #($123456 >> 9)
                add     (lutreg & $1FF), #($123456 & $1FF)      ' add "$123456" to Cog $380

                rdlong  lutreg, lutptr                  ' read long from hub at lutptr into lutreg (Cog $380)
' becomes...
                augd    #(lutreg >> 9)
                augs    #(lutptr >> 9)
                rdlong  (lutreg & $1FF), (lutptr & $1FF)        ' read long from hub at lutptr into lutreg (Cog $380)

                ORG     $380
lutreg          long    0
lutptr          long    $0_7000
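To make the assembler-side rewrite concrete, here is a small Python sketch of the expansion shown above. It models only the proposal in this post (AUGD/AUGS supplying the 10th address bit for D and S), not how any existing tool or the current silicon behaves. It handles register operands only, not `#`/`##` immediates, and `split_reg` and `expand` are hypothetical names.

```python
FLAT_LIMIT = 0x400   # proposed flat register space $000-$3FF


def split_reg(addr):
    """Split a 10-bit register address into (prefix bit, 9-bit field)."""
    if not 0 <= addr < FLAT_LIMIT:
        raise ValueError("register address out of range")
    return addr >> 9, addr & 0x1FF


def expand(mnemonic, d, s=None):
    """Return the instruction sequence for 'mnemonic d, s'.

    An AUGD/AUGS prefix is emitted only for operands above $1FF.
    """
    lines = []
    d_hi, d_lo = split_reg(d)
    if d_hi:
        lines.append(f"augd #{d_hi}")
    operands = [f"${d_lo:03X}"]
    if s is not None:
        s_hi, s_lo = split_reg(s)
        if s_hi:
            lines.append(f"augs #{s_hi}")
        operands.append(f"${s_lo:03X}")
    lines.append(f"{mnemonic} " + ", ".join(operands))
    return lines
```

With lutreg at $380 and lutptr at $381, `expand("rdlong", 0x380, 0x381)` gives the three-instruction form from the listing, while operands in $000-$1FF stay a single instruction.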
I would think that RDLUT and WRLUT would no longer be required as we could do it with 2 instructions, which may not be too much of an imposition.
We would still require SETQ and SETQ2 to do fast block moves to/from lower Cog ram or upper Cog ram. Alternatively, if AUGS could be interposed between SETQ and RD/WR LONG/WORD/BYTE, then SETQ2 would not be required either.
A future P2 variant could simply expand Cog Ram beyond 4KB using the above method.
I'm on the fence about the use of AUGx for LUT access. On the one hand, it's probably a wash, in terms of overhead, between using AUGS/AUGD and RDLUT/WRLUT. I'm sure we can all come up with examples where either version is worse than the other. On the other hand, it certainly would open up the use of LUT. And there's no reason you couldn't have support for both, allowing the user to choose what works best.
But, I'm still of the mind that the best use of LUT (when not used as a LUT, streamer, or whatever) is running code. That then allows the entire COGRAM to be used for data, without using AUGS/AUGD (or RDLUT/WRLUT, for that matter).
I'd agree that Code is the first non LUT natural use, but it can be good to reduce the jump required by the user in thinking about COG vs LUT.
This '64b opcode' approach is already used elsewhere in P2. Is it worth checking whether it can apply to LUT as well?
I think also some code has to reside in COG, to start.
One use case could be a larger single array that spans both COG and LUT - is that supported now ?
Looking at the 8051, they have DATA space and IDATA space, which roughly map as COG and LUT, but the 8051 does allow IDATA pointers to access the whole of [DATA+IDATA] as arrays.
As here, variables placed in DATA give smaller code, but you can use IDATA for variables (usually data arrays)
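As a way to picture the "single array spanning COG and LUT" case, here is a toy Python model of a flat 10-bit index over two separate 512-long banks, in the spirit of 8051 IDATA pointers. It is only an illustration of the addressing idea, not a description of what the P2 opcodes currently support, and the class name is invented.

```python
class FlatCogModel:
    """Toy model: COG ($000-$1FF) and LUT ($200-$3FF) behind one flat index."""

    BANK_LONGS = 0x200

    def __init__(self):
        self.cog = [0] * self.BANK_LONGS
        self.lut = [0] * self.BANK_LONGS

    def _select(self, addr):
        # Bit 9 of the flat address picks the bank; the low 9 bits index it.
        if not 0 <= addr < 2 * self.BANK_LONGS:
            raise IndexError("address outside $000-$3FF")
        bank = self.lut if addr >= self.BANK_LONGS else self.cog
        return bank, addr & 0x1FF

    def read(self, addr):
        bank, index = self._select(addr)
        return bank[index]

    def write(self, addr, value):
        bank, index = self._select(addr)
        bank[index] = value & 0xFFFFFFFF   # registers are 32-bit longs
```

An eight-long array starting at $1FC would then occupy the last four COG longs and the first four LUT longs, with the indexing code unaware of the boundary.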
I would think that RDLUT and WRLUT would no longer be required as we could do it with 2 instructions, which may not be too much of an imposition.
I would be wary of any change that makes code usage larger, but LUT may be able to auto-select by address in order to reduce opcodes ? (as above, without a size/speed penalty ?)
It is only necessary to use AUGS/AUGD when using the upper 2KB of Cog Ram as register space. There is no issue with code residing there because it is identical to it residing in LUT.
I see the big advantage in how you think about the memory model (being simpler).
The other big advantage comes when using AUGS/AUGD and any of the regular instructions to access a new additional set of registers. eg ROL, AND, OR, ADD, CMP, etc.
Perhaps instead of RD/WRLUT we could call them MOVX and the compiler just use the correct instruction. Need to recheck the rd/wrlut instruction bits.
Really, I am just using AUGS/AUGD to provide the 10th bit for the S and/or D in the usual instructions to permit the upper cog ram to also be used as additional register space. This saves moving via rd/wrlut from/to normal register space should you need more register space, such as table space, etc.
The other big advantage comes when using AUGS/AUGD and any of the regular instructions to access a new additional set of registers. eg ROL, AND, OR, ADD, CMP, etc.
The more complex opcodes I think need 4 port memory, so they may be harder to stretch into LUT ? (ie they need prefix opcode plus more cycles to write back for 6/12 clocks ?)
I think it is feasible for arrays (an indirect VAR) to be able to cross the COG-LUT boundary. I'm not sure if the current opcodes can do that, though.
I should add that I cannot see a major difference in LUT vs Upper Cog Ram. If there is no real hardware difference, then IMHO it would be simpler to call it Cog Ram where the upper half has some slight differences when using it as register space. All is fine when using it as code space.
There is a benefit if LUT is used as upper cog space because both D and S can be fetched on the same clock cycle, and the result can be written and the next instruction fetched on the same clock cycle. This is identical to how lower cog ram is accessed.
Therefore, I thought it might be simpler hardware (in terms of verilog description) to have 4KB of Cog Ram (with AUGS/AUGD used to provide the 10th address bit for D and/or S). When used as the streamer this pathway is changed to suit.
I agree & can see the benefits of less distinction across COG/LUT, but I'm not so sure there is zero hardware difference.
FWIR Chip added another (write only?) port to LUT, but I believe they are hardwired as one port for streamer ?
As I understand it currently, both cog and lut are dual port. If the dual ports were linked to the alu, like is done for cog, then all this aug stuff should work automatically. Now, if the streamer was put into operation a mux would disable the second port to alu, and instead feed to/from the streamer. There are also paths to/from the lut to/from the cog via rd/wrlut, and there are also paths to/from hub via setq2 and rd/wrxxxx.
My thoughts were that perhaps these paths could be simplified by treating lut as cog, and only switch a path if the streamer is active.
Unfortunately I don't know much about the streamer, so I am not sure of precisely how this works in conjunction with the lut.
BTW I implemented AUGS/AUGD in the P1V and expanded cog ram to 4KB so that I could access extended cog ram as registers and code space. In P1V there is no dual port ram.
Comments
Sounds like progress! Once you've got the boot code written, are you going to release a version of pnut that builds the binary with the boot loader?
Of course. That will be the only way forward.
Have you asked OnSemi about onchip Flash/EEPROM/OTP yet?
Postedit: added OTP
I have not. I will. I'm sorry. I'm just working to get what we've got into producible shape.
BTW when is the current shuttle due?
No problem with me waiting.
The new version 11 is posted at the top of this thread. I'll be working on updating the documentation over the next 24 hours.
Raring to go and get SD boot working!
Peter and I will be in close contact while we do this. In fact, we had two long phone calls only a few hours ago regarding this.
Sounds great!
Is that supposed to work with negative numbers?