I'd miss tasks. I have three asynchronous tasks running in a single (P3-) cog, and only one of those (video stream) needs to interact with the hub rather than the pins.
If tasks go then the replacement is probably a 4~6 cog "traditional P1 style" cog solution, with key functions divided into one P1+ cog each. It would have a heap of data going via hub, which would be unnecessary, but it'd still work. 4~6 cogs depending on how smart the pins are.
The good news is I'd have more cogs left over now, and perhaps get a P1+ slightly sooner, which has appeal too. But it worries me slightly that we're getting towards the point where two P1's (also giving 16 cogs, 64 pins, but available now) is looking like valid competition. I know that's simplistic.
I haven't looked in depth at software tasking, so I'd be interested in your proposed general approach for the monitor, Chip. I expect there would be lots of comms objects where you have an input stream, an output stream, and a command processor / state machine, and what kind of performance looks achievable with a software approach.
I had totally forgotten about video. Tasking helps a cog get a lot of other stuff done.
I'll get things running without tasking, at first, and then see what it would take to add it back in. I just need to hew this down to get it started.
Is anyone going to be disappointed if we get rid of hardware multitasking in the cogs?
It's a cool feature, but does introduce jitter in tasks, depending on the instruction mix. It also takes some extra flops and logic to support properly, beyond the Z/C/PC's.
I'm thinking about the ROM_Monitor and realizing that I could code it with a single task by doing cooperative multitasking at a few time-critical points in the program. I wouldn't need hardware multitasking, after all.
No multitasking would keep the cogs very simple to understand, and keep them deterministic.
Any thoughts?
Totally fine here without hardware multitasking Chip, we had 8 cogs without hardware multitasking on the original Prop, we now have 16 cogs with awesome bells and whistles, we will be fine without it!
I agree with jmg that having either tasks or hub exec, selectable on a cog by cog basis, would work well too, if that would make your life simpler
Chip while I'm a big supporter of tasks, I do think just getting out a basic DE0 P1+ release (tasks out, hubexec out, just for now) would be prudent. We can start testing the cog logic, but we can also port current P2 applications in and see how they "fit".
Question: would "smart pins" be included in the DE0 release? (We could wait for smart pins too, we've done ok so far on P2)
Tasks are not that hard, once everything else is in place.
I'm just nailing the instruction set down now, so I'll know what I'm making. I hope to start the Verilog tomorrow.
Smart pins would be on the DE0-Nano. We can add as many as will fit.
I agree, a cog may use either hubexec or multi-tasking, but not both concurrently (in the same cog).
But even simple hubexec has preference over multi-tasking.
With the old 8 or 4 cog P2, I thought multitasking was critical.
But, with 16 cogs, I don't think it's worth the trouble...
And, hubexec is 100x more important, even if it's a barebones hubexec.
Hubexec is not worth the risk of not having a P16X32B.
Ross.
I know that hub execution is probably now out of scope but I wanted to point out one other possible advantage of having it rather than falling back onto the old LMM technique of executing code from hub memory. If we had direct execution of code from hub memory then the vast majority of COG memory could be made available for a CMM kernel which would allow us to have a compressed code model for non-time-critical code along with hub execution for code that needs to run faster. This isn't possible with LMM+CMM because there isn't enough room in COG memory for both the LMM and the CMM kernel at the same time. You could even combine direct hub execution with XMM to get the ability to run some code from external memory. Of course, these combination modes would require compiler support in PropGCC and in Catalina but they are possible.
Hub exec IS going to be in the next chip. I've just been stalled out over the last few days getting the new instruction set nailed down. At times, all the details seem overwhelming and I think about paring it down, just to get it going again. This new memory scheme (dual-port 128x128 bits, instead of quad-port 512x32) changes a lot of things. It's hard for me to get there in one step. Asking you guys how you'd feel about dropping certain features alleviates the pressure on me. In the end, we'll likely have hardware multi-tasking back, too. Sorry for all the alarm.
Phew! Putting heart pills away now...
I assume the DE0 + DE2 FPGA builds will fit more cogs than before?
Maybe DE0 4+Cogs and more Hub, and DE2 8+Cogs and Hub?
A fmax gain too?
Cheers
Brian
Not cheating, my vote was about loss of performance and capabilities - which does include hubexec; however, hubexec was not the only reason. Others were:
- loss of AUX/CLUT/HD output
- loss of single cycle instructions
- loss of INDA/INDB/PTRX/PTRY
- loss of AUX stacks
- loss of CORDIC/big MUL/big DIV
etc.
So you are correct, hubexec was a big part of it - but not all. I should not have said "nothing to do with hubexec", but rather "only partially due to hubexec".
As my wife points out... my memory is not great anymore
On the DE0, could you bring out all 64 I/O's? Make the rest of them 'dumb' but still available for regular dumb-io
which does not equate to cheating.
Orrrrr, seeing the light, maybe? :P ... Nah, if I may offer an analysis, I think the voting thing caught you off guard. You had invested your time into helping Chip and then some punk thought this was all a democracy and it appeared like that was the deciding factor.
While Chip may have taken a second look due to the vote, in the end, I think the last round of thermal simulations at four Cogs is really what did it in for the P2 design at 180nm.
I would drop as many features as needed to make hubexec easy. Fall back to the simple P1 instruction set if necessary, but keep the multiplier, and maybe the divider.
What made the P1 special is multiple cores. Not lots of fancy asm instructions. Hubexec removes the central limitation of the P1.
Fantastic news for hubexec !!!
Multi-tasking I will likely never use except if there is a standard driver using it.
> won't even find VGA monitors. Its all HDMI.
If it were me designing the P2/P3, I would start it out as an SDRAM-to-HDMI transmitter and add the MCU/analog stuff later.
But it's the 180nm process that is the problem for achieving GBit transfers, so it would have to be done in expensive (for now) 65nm
So we don't establish a new term, how about naming them "standard" I/O for now if they're not all "smart" on the FPGA and don't have the same characteristics? Minor suggestion and I respect your contributions a bunch.
Ken Gracey
Thanks, Ken!!
I've been staying out of this tumultuous debate but since this exchange made me spew "morning beverage" on my keyboard, I have to reply!!
Yes, "standard I/O' is a more acceptable name!
(now I need to go clean my keyboard! )
On a serious note, it will be very nice to have the Smart I/O functions available on the FPGA so they can be tested and explored before the silicon is cast.
Based on the 8W for P2 I'm guessing the current P1+ design will be around 1W-1.5W. The P2 required 16x the number of LEs as the P1, and ran at 8x the instruction rate. The P1+ has 2x the number of cogs with 5x the instruction rate. Throwing in another factor of 2x for the extra features gives a total of 20x for P1+ versus 128x for P2. Scaling this from the 8W P2 value gives 1.25W.
I do wonder about the impact of 64 smart pins, with 32 bit counters, running at 200MHz - which is one of the reasons I put the low end of my estimate at 1.5W
Please note - I would be absolutely delighted if the simulation returned less than 1.5W as the maximum!
> 64 smart pins, with 32 bit counters, running at 200MHz
As the 32-bit counter is binary, the first bit toggles at 200MHz, the next at 100MHz, the next at 50MHz, the next at 25MHz, the next at 12.5MHz, and so on.
It's the toggling that uses power, so the upper 24 bits toggle at such a low rate that their power usage is negligible.
Question: Is anyone going to be disappointed if we get rid of hardware multitasking in the cogs?
Get rid of it. I think the increase in cogs has made this feature less necessary anyhow. There will certainly be use cases where hardware multitasking is preferred (possibly even necessary), but the trade-off probably isn't worth it at this point. You can revisit that for P2+. ::) In earlier posts, I had suggested bringing back PORTD (or something like it) in lieu of hardware multitasking. The truth is that this would only be a "nice to have" feature. With the P1+ running at 100MIPS and hub access every 16 clocks, hub data interchange will still be faster than on the P1. It may only be a marginal increase, but it's free!
Post is corrected now.
An FPGA with anything will be a help asap
Thanks Ken.
Even better news
Actually what really threw me was the "sky is falling" reaction when 1-2W typical was predicted a long time ago.
I thought the last published sim was: (I could be wrong)
8W w 8 P2 cogs @ 180MHz
3W w 8 P2 cogs @ 100MHz
Both unrealistic worst case figures due to ALL subsystems in the P2 working at once, which I think is not possible with pasm or compiled code.
I am really curious to see what the max power envelopes will be for the currently discussed P16X512 running at 200MHz (100MIPS)
My gut feeling says sims will be something like 1.5W-3W w 16 P1+ cogs @ 200MHz (100MIPS) (prediction only, I hope I am wrong!)
(If I am right, I can already hear the "Toss out hubexec! Toss out PTRA/PTRB! It will save a lot of power!" - when in fact it would save very little.)
Well I'll be damned, I didn't grok Ken at all. "dumb-i/o" is a perfectly ordinary phrase to me.
I simply don't know how much 64 pins x even just the low 4 bits will matter. It will matter some; the question is how much?