I've been mentally combing through the issues here. I think what was getting me flustered was tying hub exec to cog RAM stacks. That's a headache! CALLA and CALLB are very simple, but slower.
This is all easy, and adequate:
CALLA/CALLB/RETA/RETB - necessary for hub exec, even just one set
CALL/RET - use 4-level LIFO stack, perfect for internal cog programs
LINK - useful for many things
I'll proceed with these. I want to have this nailed down before I sleep again. I need to get moving on the Verilog.
Cool, having a minimal set of working opcodes sounds even better than just reserving opcodes in the map.
Then the software whizzes can exercise that and identify if/where there are any 'squeeze points' on some real code.
rdquad can improve LMM speed, but all generated code would have to be in four instruction packets (think VLIW), which would require a LOT of compiler work (new gcc backend), which would cost a lot of $$$ - and even then, it would not reach hubexec speed.
Maybe a simpler version of it? Read-only: treat the hub RAM as if it were compiled into flash that feeds the cog a continuous stream of opcodes.
No calls, just jumps. Sure, it will be harder to program, but it's better than nothing; the programmer would have to check a program-supplied flag to decide where to jump back to.
Are you just saying that to get things moving? I suppose you, working on C compilers, would be lamenting the absence of hub exec more than most.
Chip,
Of course I'll miss hub-exec, but at this point there are so many attractive, cheap, low power, and feature rich alternatives out there that it doesn't matter anymore. BUT NONE of them offer this: some superset of P1, with at least 512KB of on-board HUB RAM, 64+ IO pins, and a bunch of ADCs.
I've given up on wishing for lots of things including the idea that the FTDI chip might be eliminated for USB or that some kind of a very useful SERDES might be possible.
So, what is your feature set? It is certainly lost to me right now. Please define a list and maintain it yourself so that we can easily follow.
Everyone and his brother seems to want to get something they suggested into your design. I suppose as long as they don't sue for royalties it doesn't matter.
I'm sorry, but the new chip needs to be in silicon before the end of this year. The wait has left me catatonic, with little Parallax pulse left.
It's a moving target, Chip has since said this :
["This is all easy, and adequate:
CALLA/CALLB/RETA/RETB - necessary for hub exec, even just one set
CALL/RET - use 4-level LIFO stack, perfect for internal cog programs
LINK - useful for many things
I'll proceed with these. "]
Such a simplified version can be exercised to find any caveats, and it gets an FPGA image out to test all the other stuff.
Flipped into the Power Domain, this means Hubexec code can have one quarter the power footprint of equivalent LMM.
This chip is still an exercise in keeping inside a Power Envelope.
The problem at the moment is that he's stuck on hubexec implementation, and he needs to move to something more productive. This doesn't mean it's entirely removed from the design, but it would help him to show some productivity elsewhere. With some time passing the hubexec design might work itself out, you might provide some assistance, or the C compiler guys Jazzed, Eric and David might find performance opportunities elsewhere. But for the time being, it's probably best for Chip to work on parts of the design he's both enthused and confident about.
Not necessarily...the LMM engine can handle it, it is just more complex.
When an operation that changes the LMM PC occurs:
- If address is quad aligned, enter the 'normal' fast LMM loop
- otherwise enter a LMM 'stub' that executes from one to three instructions depending on alignment prior to entering the 'normal' loop.
Except, the "fast" four instruction sequences cannot contain any load-32-bit const primitives, fcall primitives, fjump primitives - as the 32 bit long following the call to the primitive would be executed, with undesired results.
Which is why a qlmm compiler needs to understand the vliw nature. All this was gone over back in nov/dec 2012 in an old thread of mine.
What I'm saying is to use the RDQUAD more like a CACHE. All of the primitives would (might?) invalidate the CACHE causing re-entry based on alignment.
I'm saying there is a middle ground between LMM and QLMM.
Getting an FPGA image out with the enhanced instruction set would let the testers start testing. That gives Chip time to solve the hubexec riddle.
We need some kind of hubexec mode that does not take too big a hit compared to running in cog RAM. This is not only a practical issue for driving the expanded I/O; it is a marketing issue. If Parallax has to publish that their cores run at 1/8 speed, or whatever it is, when running C programs compared to small cog PASM programs, that would be bad. Can you just hear the programmers looking at the data sheet? "Wow, 100 MIPS per core! Oh wait, that is for really small PASM programs... OMG, only 12.5 MIPS for my C programs? Forget that...". They won't know, or care, about the reasons for the loss of performance; they will just see it as the chip having really bad C tools, which we all know is no longer the case. Or worse, they will pick the chip based on 100 MIPS, then try to program it in C and wonder why they have such bad performance.
Right now I can't see how it could be done to be faster than LMM. And it would need significant compiler modifications.
Perhaps you could give an example of how it would work, and also show how it would not need significant compiler mods. I am not being sarcastic, I'd love to see it work.
Depending on spacer requirements, an RDLONGC might help get LMM to ~25MIPS.
Gee, I was hoping to get your gears spinning so you would do it...
Basically the main LMM loop would use RDQUAD and execute up to 4 instructions.
If any of the 4 instructions is a primitive or a jump the handler code does what it needs and then either returns to the loop if the current PC is QUAD aligned or enters a section of code that reads and executes from 1 to 3 instructions from the hub prior to reentering the main LMM loop.
It is really just like the old LMM except it needs a section that executes one instruction at a time until it is QUAD aligned.
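To make the alignment idea concrete, here is a rough Python model of the fetch pattern described above. It is only a sketch: addresses are counted in longs rather than bytes, and the names `stub_count` and `fetch_plan` are made up for illustration, not part of any real LMM kernel.

```python
QUAD = 4  # longs delivered by one RDQUAD (assumption: addresses counted in longs)

def stub_count(pc):
    """Single-instruction steps needed before pc is quad aligned (0 to 3)."""
    return (-pc) % QUAD

def fetch_plan(pc, count):
    """Hub fetches needed to run `count` straight-line instructions from pc.

    Returns a list of ("single", address) and ("quad", address) entries:
    the stub runs one instruction at a time until the PC is quad aligned,
    then the fast loop takes over and fetches four longs per hub access.
    """
    plan = []
    end = pc + count
    while pc < end and pc % QUAD != 0:
        plan.append(("single", pc))
        pc += 1
    while pc < end:
        plan.append(("quad", pc))
        pc += QUAD
    return plan

print(stub_count(5))     # 3
print(fetch_plan(5, 8))  # three singles at 5, 6, 7, then quads at 8 and 12
```

A jump or primitive would simply restart this plan at the new PC, which is the "re-entry based on alignment" part.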
Would it be possible to do a 4-cycle LMM loop like this:
REPS #0, #2
RDLONGC instr, PTRA++
instr NOP
The #0 in the REPS would represent an infinite loop, or the loop count could just be whatever the maximum is for REPS. This loop would require that the next instruction could be executed without requiring NOPs for a pipeline delay. Also, RDLONGC would need to take 2 cycles. If this could be made to work it would run just as fast as HUBEX. If we wanted to keep the instruction cache we would need a RDLONGI instruction to read from it instead of using the RDLONGC instruction with the data cache.
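As a sanity check on the numbers, here is a small Python calculation of what such a loop would deliver. It assumes the 200 MHz system clock used elsewhere in this thread and 2 clocks per spacer instruction, and it ignores any stall when the cache line has to be refilled from the hub; the function name and parameters are hypothetical.

```python
def lmm_loop_mips(read_clocks=2, exec_clocks=2, spacer_instructions=0,
                  clocks_per_spacer=2, clock_mhz=200):
    """LMM instructions per microsecond for one pass of the fetch/execute loop.

    Assumptions (not confirmed for the real chip): 200 MHz system clock,
    2 clocks per spacer, no stalls waiting for a hub window.
    """
    clocks_per_pass = (read_clocks
                       + spacer_instructions * clocks_per_spacer
                       + exec_clocks)
    return clock_mhz / clocks_per_pass

print(lmm_loop_mips())                       # 50.0 with a 2-clock read and no spacers
print(lmm_loop_mips(spacer_instructions=2))  # 25.0 if two spacers turn out to be needed
```

So the 4-cycle loop only matches 50 MIPS hubexec if no spacers are required between the read and the fetched instruction.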
I don't wish to sound like a downer... but, the P1+ needs to be able to do hub exec at 100 MIPS to be competitive with ARMs. To increase bandwidth I don't see why the hub can't be controlled by a round robin priority encoder. This would give the business logic cog the ability to read/write to the hub as much as possible and driver cogs would get hub access without starvation. One cog then could effectively get every access cycle.
If hub exec is too hard, at least put in a round robin priority encoder for hub access. It makes no sense to keep the simple P1 style hub anymore. It wastes too much bandwidth and most code already assumes nondeterministic hub access timing.
---
A round robin priority encoder selects the "lowest numbered cog" that wants hub access every clock cycle. The "highest number cog" being the one that got access last clock cycle.
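In software terms the selection rule looks roughly like this. It is a behavioural sketch in Python, not a hardware description; `round_robin_grant` and its arguments are names invented for the example.

```python
def round_robin_grant(requests, last_granted):
    """Pick the cog that gets the hub this clock, or None if nobody asks.

    requests:     list of booleans, one per cog (True = wants hub access)
    last_granted: index of the cog that had the hub last clock; it becomes
                  the lowest-priority ("highest numbered") cog this clock.
    """
    n_cogs = len(requests)
    for offset in range(1, n_cogs + 1):
        cog = (last_granted + offset) % n_cogs
        if requests[cog]:
            return cog
    return None

# Only cog 3 is asking, so it wins every clock, even right after a grant.
only_cog_3 = [i == 3 for i in range(16)]
print(round_robin_grant(only_cog_3, last_granted=3))  # 3

# Cogs 3 and 5 both asking: they alternate, so neither starves.
two_cogs = [i in (3, 5) for i in range(16)]
print(round_robin_grant(two_cogs, last_granted=3))    # 5
print(round_robin_grant(two_cogs, last_granted=5))    # 3
```

The price is the one mentioned above: hub timing for any given cog depends on what the other cogs are doing.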
...at least put in a round robin priority encoder for hub access. It makes no sense to keep the simple P1 style hub anymore. It wastes too much bandwidth and most code already assumes nondeterministic hub access timing.
Yes, with 16 COGs, some form of not-just-1:16 slot mapping is certainly needed.
There have been discussions around this already, not sure what Chip will actually implement.
I favour using a table mapping as that is easy to visualize and manage, and with a modulus to allow matching to any used numbers of COGS and Slots.
If COG clock enables are also implemented for power savings, the same table can extend to support that. (again working from an easy to visualize and manage angle)
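A table-mapped scheme could be modelled along these lines. This is one guess at how the table and modulus might fit together, not what Chip will implement; the 16-entry example table, `slot_owner` and `slot_share` are all invented for illustration.

```python
# Hypothetical 16-entry slot table: cog 0 owns slots 0 and 8 (2/16 of the hub),
# cog 8 gives up its slot, every other cog keeps its usual 1/16.
EXAMPLE_TABLE = [0, 1, 2, 3, 4, 5, 6, 7, 0, 9, 10, 11, 12, 13, 14, 15]

def slot_owner(table, slot_counter, active_len=None):
    """Cog that owns the hub for this slot.

    active_len is the modulus: only the first active_len entries are cycled
    through, so the table can be matched to the number of slots in use.
    """
    if active_len is None:
        active_len = len(table)
    return table[slot_counter % active_len]

def slot_share(table, cog, active_len=None):
    """Fraction of hub slots that the given cog receives."""
    if active_len is None:
        active_len = len(table)
    return table[:active_len].count(cog) / active_len

print(slot_owner(EXAMPLE_TABLE, 24))               # 0 (24 mod 16 = slot 8)
print(slot_share(EXAMPLE_TABLE, 0))                # 0.125, i.e. 2/16
print(slot_owner(EXAMPLE_TABLE, 9, active_len=8))  # 1 (9 mod 8 = slot 1)
```

Unlike a priority encoder, the table keeps every cog's hub timing fixed and easy to read off.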
Kye, it seems you are proposing a HUBEX that is twice as fast as the one Chip was pursuing. Not only that, but you're also asking for hub-slot sharing. I suggest you read the past 2 or 3 thousand posts on P1+ and P2 so you're up to speed on the subject.
- Chip dropped the 'RDxxxC' instructions
- Chip has yet to state how many spacers are required between an RDLONG and the execution of the instruction it fetched
1) Chip's 256 bit bus with pre-fetch, which requires at least two cache lines, a lot more transistors, and more power
2) my slot mapping scheme, with 100MIPS hubexec cogs being assigned 2/16 hub cycles (16/128 with my 128 length hub slot table), which needs some more transistors
But we are in agreement, 100MIPS hubexec would make P1+ competitive.
So if HUBEX is too complex to implement, maybe RDLONGC and zero spacers are feasible.
What Chip said earlier was that INDA/INDB cog stack support was too complex, but that hubexec was fine with a PTRA/PTRB hub based stack.
Also 100MIPS is far more difficult, 50MIPS is easy.
We will see what he ends up doing later. I think he desperately needs to catch up on sleep/rest!
I would not be surprised if there will be an FPGA image without hubexec, to get started with very soon, and after he thinks it through, another image in a week or two with hubexec.
We know hubexec works, it worked after all in the P2 image - it just has to be adapted to the P1+ (enhanced P1 / pruned P2). I think Chip was trying to pull off the impossible (full INDA autoinc/dec modes in 2 clocks) and it must have been very frustrating.
The impression I got (and I could be wrong on this one) is he found hubexec support easier / taking less logic than supporting RDxxxxC.
All the extra features are nice... but, it's very important for the core to be fast. Honestly, most of the processors will do nothing most of the time. Core 0 will likely be doing the most work. So, it's best to optimize for that. Even now on the P1 most of the other cores spend their time in wait states.
I would suggest other features be dropped to support hub exec. You can emulate most of the extra instructions in the time you lose to having to run LMM code.
With 64 ADC/DAC, 16x100MIPS cores, 512KB RAM and deterministic timing, there is no competition from any ARM?
Pretty much, but 'ARM' now covers a very wide range.
There is also a significant market available working with an ARM (or Atom), so there is some 'them or us' on the smaller ARMs, but more of a 'them and us' on the larger parts, and there, minimizing the culture shock of ARM users supporting P1+ is going to be important.
In these areas, the P1+ is going to displace perhaps a small FPGA or moderate CPLD.
You might be right there - but hubexec pulls behind it all that feature creep that made the last P2 impossible.
Just read Post #1 from Chip on this thread, and then skim through.
Ken's plan to send Chip into the orchard with the chainsaw for a couple of days is quite good. Cutting away trees and such will get him mentally focused on the task at hand.
I've been doing a little side work on a project that uses Atmel SAM D20 MCUs, which are ARM Cortex M0+ based chips with lots of I/Os(52), sercoms(6), and ADCs(20). The top model has 256K flash and 32K sram (code runs from flash). They allow you to map resources to pins in a reasonably flexible way. These are the kinds of MCUs that I see P2 going up against based on price and complexity to implement.
Most of the Cortex M0/M0+ based chips run at 48 MHz max. They do have a fair number of instructions that take 1 clock, but all the load/store ones are 2 clocks, and branches are 3-4 clocks. So effective MIPS is probably much lower than the 48 it could do with all 1-clock instructions. Most likely somewhere in the 30-35 range (or worse, since you pretty much have to load/store everything to registers to do operations on them, and branching is so costly).
So the proposed P2 so far would compare favorably in most cases, if not soundly smash them due to having 16 cores.
Yeah, there are more powerful and feature rich ARM based MCUs out there, but the price/complexity levels rise pretty steeply. I don't think we could or should worry about the higher end ARM stuff at all.
Anyway, I think a 50 MIPS hubexec along with 16 cores and 512K shared memory (+32k local across the cores) will compare pretty nicely with the ARM stuff we'll be going against.
Comments
I'll go so far as to say that it'll kill the P1+ dead UNLESS LMM works at a reasonable speed.
So, the question is, what speed, in raw MIPS, will LMM give us on the new chip?
For simple, 2-cycle instructions, the hard limit with one instruction fetched per hub access is 12.5 MIPS:
200/16 in system clock cycles
100/8 in cog instruction cycles (2 cycles per instruction)
Without hubexec or caching, LMM straight-line code runs at 12.5 MIPS (*fcache can help a lot, if the code fits in fcache)
Hubexec with 1 quad long cache can run 4 simple instructions per hub cycle, so 50MIPS (*fcache can help a lot, if the code fits in fcache)
Factor of four difference.
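The same arithmetic as a quick Python check, using only the figures quoted above (200 MHz system clock, 2 clocks per instruction, one hub access per cog every 16 clocks). The function names are just for illustration.

```python
SYSTEM_CLOCK_MHZ = 200      # from the "200/16" figure above
CLOCKS_PER_INSTRUCTION = 2  # simple instructions take 2 clocks
HUB_WINDOW_CLOCKS = 16      # one hub access per cog every 16 clocks

def cog_native_mips(clock_mhz=SYSTEM_CLOCK_MHZ):
    """Peak rate for code running from cog RAM."""
    return clock_mhz / CLOCKS_PER_INSTRUCTION

def hub_fed_mips(instructions_per_hub_access, clock_mhz=SYSTEM_CLOCK_MHZ):
    """Straight-line rate when every instruction has to come from the hub."""
    hub_accesses_per_us = clock_mhz / HUB_WINDOW_CLOCKS
    fetch_limited = hub_accesses_per_us * instructions_per_hub_access
    # The cog cannot execute faster than its native rate, whatever the hub delivers.
    return min(fetch_limited, cog_native_mips(clock_mhz))

lmm = hub_fed_mips(1)      # one long per hub access
hubexec = hub_fed_mips(4)  # one quad (4 longs) per hub access
print(lmm, hubexec, hubexec / lmm)  # 12.5 50.0 4.0
```

Neither figure accounts for jumps, primitives or fcache, so real code will land somewhere else.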
The problem at the moment is that he's stuck on hubexec implementation, and he needs to move to something more productive. This doesn't mean it's entirely removed from the design, but it would help him to show some productivity elsewhere. With some time passing the hubexec design might work itself out, you might provide some assistance, or the C compiler guys Jazzed, Eric and David might find performance opportunities elsewhere. But for the time being, it's probably best for Chip to work on parts of the design he's both enthused and confident about.
Ken Gracey
1) 1/4 the number of LMM instructions run, BUT
2) to run one LMM instruction takes a minimum of two cog instructions, three if no auto-increment ptra, plus any required spacer instructions
so from power envelope point of view, maybe 1/2 of the power footprint... but will take 4 times as long, so actually use 2x the power overall.
FYI, if there is need to reduce power usage, crank the multiplier down
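The 2x figure is just power multiplied by time. As a one-line Python check, taking the rough ratios from the estimate above as given (they are guesses, not measurements):

```python
def relative_energy(power_ratio, time_ratio):
    """Energy used relative to a baseline: energy = power x time."""
    return power_ratio * time_ratio

# LMM at roughly half the power draw, but taking four times as long as hubexec.
print(relative_energy(0.5, 4))  # 2.0
```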
Not necessarily...the LMM engine can handle it, it is just more complex.
When an operation that changes the LMM PC occurs:
- If address is quad aligned, enter the 'normal' fast LMM loop
- otherwise enter a LMM 'stub' that executes from one to three instructions depending on alignment prior to entering the 'normal' loop.
C.W.
Which is why a qlmm compiler needs to understand the vliw nature. All this was gone over back in nov/dec 2012 in an old thread of mine.
What I'm saying is to use the RDQUAD more like a CACHE. All of the primitives would (might?) invalidate the CACHE causing re-entry based on alignment.
I'm saying there is a middle ground between LMM and QLMM.
C.W.
Gee, I was hoping to get your gears spinning so you would do it...
Basically the main LMM loop would use RDQUAD and execute up to 4 instructions.
If any of the 4 instructions is a primitive or a jump the handler code does what it needs and then either returns to the loop if the current PC is QUAD aligned or enters a section of code that reads and executes from 1 to 3 instructions from the hub prior to reentering the main LMM loop.
It is really just like the old LMM except it needs a section that executes one instruction at a time until it is QUAD aligned.
This should work without any compiler changes.
C.W.
I was hoping you had a magic solution
1) Chip's 256 bit bus with pre-fetch, which requires at least two cache lines, a lot more transistors, and more power
2) my slot mapping scheme, with 100MIPS hubexec cogs being assigned 2/16 hub cycles (16/128 with my 128 length hub slot table), which needs some more transistors
But we are in agreement, 100MIPS hubexec would make P1+ competitive.
50MIPS is an also-ran.
25MIPS or less is no competition to ARM.
With 64 ADC/DAC, 16x100MIPS cores, 512KB RAM and deterministic timing, there is no competition from any ARM?
Enjoy!
Mike
You might be right there - but hubexec pulls behind it all that feature creep that made the last P2 impossible.
Just read Post #1 from Chip on this thread, and then skim through.
Ken's plan to send Chip into the orchard with the chainsaw for a couple of days is quite good. Cutting away trees and such will get him mentally focused on the task at hand.
Enjoy!
Mike