...some cycles need to be stolen to move the result into the dest register in COG memory.
I hadn't thought of this, and you're probably right. If that is indeed the case, then it would need to be split into two instructions. Still worthwhile, I think.
It's looking like a split opcode and more choices (I think someone else mentioned split opcodes?) would be best; see the sketch below:
RDREQ - issues Read Address (resets RDGET flags?)
...
RDGET - fetches the RDREQ result, stalls if not ready yet (must be paired with RDREQ)
WRREQ - sends Register to Write Buffer, stalls if Buffer not empty
WRDIR - Direct Write, stalls until the address nibble matches the rotation
RDDIR - Direct Read, stalls until the address nibble matches the rotation
I think those would allow higher average bandwidths (worst case is the same), but they don't solve cycle-determinism, which needs SNAPCNT or similar.
SNAPCNT could optionally 'attach' to those opcodes.
Tools could advise on spacing, but the auto-stall protects users.
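To make the pairing concrete, here is a small C toy model of how a hypothetical RDREQ/RDGET pair might behave against the rotating hub. Everything here is an assumption for illustration, not anything Chip has specified: RDREQ latches the address, the transfer happens when the rotation next reaches that address's slice (the low nibble), opcodes take 2 clocks (per the current design), and RDGET stalls only if the data is not back yet.

#include <stdio.h>

static unsigned clk = 0;                     /* free-running SysCLK count */

typedef struct { unsigned ready_at; } hub_req_t;

/* Hypothetical RDREQ: latch the address; the data arrives when the
   rotating hub next reaches this address's slice (low nibble).          */
static void rdreq(hub_req_t *r, unsigned addr)
{
    unsigned slice = addr & 0xF;                 /* which of 16 slices   */
    r->ready_at = clk + ((slice - clk) & 0xF);   /* 0..15 clocks away    */
    clk += 2;                                    /* assumed 2-clock op   */
}

/* Hypothetical RDGET: fetch the result, stalling if it isn't back yet.  */
static unsigned rdget(const hub_req_t *r)        /* returns stall clocks */
{
    unsigned stall = (clk < r->ready_at) ? r->ready_at - clk : 0;
    clk += stall + 2;
    return stall;
}

int main(void)
{
    for (unsigned gap = 0; gap <= 16; gap += 4) {
        hub_req_t r;
        clk = 3;                             /* arbitrary starting phase */
        rdreq(&r, 0x1230);                   /* slice 0: near-worst case */
        clk += gap;                          /* other useful work here   */
        printf("gap %2u clocks -> RDGET stalls %2u clocks\n",
               gap, rdget(&r));
    }
    return 0;
}

Running it shows the stall shrinking to zero once enough useful work sits between the pair: the worst case is unchanged, but with roughly 16 clocks of other work between REQ and GET the wait disappears entirely.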
Nice simple description of the op-codes that would be needed to support a buffered read and write. I actually think the RDREQ/RDGET pair will improve determinism of the code, assuming you put 16-18 clocks between them. Basically this pair allows you to do a hub read without a hub sync, as long as you know the read address soon enough (just like the RDBLOCK instruction). WRREQ would be the same. Keep your write frequency down, and your code would never have to sync with the hub. The biggest cost I see is that using RDREQ, RDGET, and WRREQ would have a higher minimum read-modify-write latency.
Marty
Please forget all these RDREQ etc. The perceived benefits are not really there because you still require the clocks to do the transfer anyway. It is not like it's free, because it has to access the cog to perform the transfer.
It is much better for Chip to spend the time getting basic hubexec working which will bring far more benefits from this new hub scheme.
It takes ~9,000 LEs, which is equivalent to ~2 cogs. It's hefty, for sure, but buys a lot of performance.
Yep, that's a muck'n big crossbar switch. :cool: Wonder why it's not compiled as run-time reconfiguring of the FPGA routing fabric? (too slow?)
The Wikipedia entry for Nonblocking minimal spanning switches shows that there are ways to shrink crossbar switches at the cost of more complexity (lots of useful-looking hyperlinks and search terms in that Wiki article too).
Marty
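For the curious, the scaling argument from that Wikipedia article, as a back-of-envelope: a full $N \times N$ crossbar needs $N^2$ crosspoints, while a three-stage Clos network that is strict-sense nonblocking (ingress switches with $n$ inputs each, at least $2n-1$ middle switches) needs about

$$C(n) \approx (2n-1)\left(2N + \frac{N^2}{n^2}\right), \qquad \min_n C(n) \approx 4\sqrt{2}\,N^{3/2} \ \text{at } n \approx \sqrt{N/2},$$

so the crosspoint count grows as $N^{1.5}$ instead of $N^2$, with the "more complexity" being the extra stages and routing.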
Please forget all these RDREQ etc. The perceived benefits are not really there because you still require the clocks to do the transfer anyway. It is not like it's free, because it has to access the cog to perform the transfer.
Sure, the cycles per HUB access are not changed, but what has changed is the other code you can run while the HUB is rotating to the needed slot.
Think of it as boosting opcode bandwidth. In operation, rather like a buffered UART where you can do other stuff while waiting for new data, which you know will be along eventually.
I'm not opposed to a simple wait or the SNAPCNT instruction. As you say, it is useful in and out of hub loops. That does not make a R/W that lets execution continue less useful or more risky. The only risk I can see is that the programmer tries to use the data being read before it arrives, or overwrites the data in a register before it is written (if the implementation does not buffer the address and data). Bad programmer... don't do that.
As to the no-gain/stolen-cycles argument: the data has to be written to the register, so the time is lost either way, and what you gain is the execution of several instructions instead of a stalled cog for several cycles. If the loop is shorter than the time between accesses to the hub block, it would automatically synchronize to that hub block access with no fuss or muss.
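A worked version of that auto-sync point, assuming the 16-clock rotation period this thread is built around: if the loop body takes $t$ clocks and ends in a blocking hub op on the same address, the stall simply tops $t$ up to the next slot,

$$t_{\text{iter}} = t + \big((16 - t) \bmod 16\big) = 16 \qquad \text{for } 1 \le t \le 16,$$

so every iteration takes exactly 16 clocks and the loop locks to its hub slot with no explicit sync instruction.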
It does pretty much what you say, i.e. it has a bit to attach the wait to WR or RD, so those opcodes apply the snap; but you need to define the delay value, which is why a separate opcode (or config register) is suggested.
The SNAPCNT is quite useful outside of HUB loops, which is another reason to have an opcode for it.
Don't let it get you down.
Doesn't get me down at all. I enjoy my work and dabbling in programming. I did write software for a number of years, but after getting a taste of field work I just could not go back to sitting in an office all day. Thanks for the thought though.
I'm not opposed to a simple wait or the SNAPCNT instruction. As you say, it is useful in and out of hub loops. That does not make a R/W that lets execution continue less useful or more risky. The only risk I can see is that the programmer tries to use the data being read before it arrives, or overwrites the data in a register before it is written (if the implementation does not buffer the address and data). Bad programmer... don't do that.
The split read manages that, with low programmer risk, and no extra cost (as the transfer has to use cycles).
Relying on hidden transfers is a very bad idea, and not HLL friendly.
Any non-stalling solution would have to include a single buffer level, so the write is 'self-protecting'.
My head hurts. I also find it amusing that in another topic two people, whose work I admire, have difficulty communicating how this new scheme works and the timing implications. Can you imagine what someone browsing datasheets for a new processor will think?
This new processor is in danger of becoming a case study in how not to do things. Spot a problem, focus in on it, solve it by adopting a clever complex solution, rinse, repeat. Nowhere is anyone standing back and looking at the bigger picture. There's a good reason most mainstream chip makers have moved to linear memory maps and why they do things like analyse how C compilers work.
It's been stated that C is a major 'must-have' for this chip. I've been using it, and other embedded languages, for over 30 years whilst keeping an eye on what goes on under the hood. I used to be able to decompile PL/M80 and PL/M86 by eye. And I'll make this prediction... C compilers for this chip will not generate anything like optimal code without a lot of work on the part of the compiler writers and the programmer. It's not going to happen. Are programmers really expected to have to use a tool to check how their data access timings will work out? How will the compiler optimise code to make best use of this new access scheme? It's madness.
The good news is I've discovered an even better access scheme and have got a picture which illustrates it...
My head hurts. I also find it amusing that in another topic two people, whose work I admire, have difficulty communicating how this new scheme works and the timing implications. Can you imagine what someone browsing datasheets for a new processor will think?
This new processor is in danger of becoming a case study in how not to do things. Spot a problem, focus in on it, solve it by adopting a clever complex solution, rinse, repeat. Nowhere is anyone standing back and looking at the bigger picture. There's a good reason most mainstream chip makers have moved to linear memory maps and why they do things like analyse how C compilers work.
It's been stated that C is a major 'must-have' for this chip. I've been using it, and other embedded languages, for over 30 years whilst keeping an eye on what goes on under the hood. I used to be able to decompile PL/M80 and PL/M86 by eye. And I'll make this prediction... C compilers for this chip will not generate anything like optimal code without a lot of work on the part of the compiler writers and the programmer. It's not going to happen. Are programmers really expected to have to use a tool to check how their data access timings will work out? How will the compiler optimise code to make best use of this new access scheme? It's madness.
Agreed 110%. If you can't explain it in a couple of paragraphs and one diagram, then it ain't gonna fly.
I do think Chip's basic scheme has merit - but if he adopts it then we don't need all these additional instructions and complications. For those cogs that need absolute determinism, an additional instruction or two fixes the problem (but loses any speed advantage).
Ross.
I don't think we need any additions or changes from what Chip has implemented and originally described.
I think people just need to stop thinking it's more complex than it is, and stop trying to make it more complex than it is with new stuff that really doesn't buy you anything worthwhile.
+1
-Phil
Roy,
Yes it is simple, and easy to explain, especially with the lazy susan concept. And no, it doesn't need to be any more complex.
And it certainly delivers on throughput - each cog could achieve 800MB/s in parallel (with tricks of course).
But, by the same token, it doesn't hurt for some of us to explore other possibilities. After all, if you and Chip hadn't explored other ideas, you would not have come up with this. In fact, many of us have been trying to work out ways to increase the hub bandwidth (while Chip was off in other parts of the design anyway), while many, IIRC you included, have tried to shut down the ideas discussion.
Cluso,
I am fine with exploring ideas. I'm not fine with adding complications to this when it's not warranted. I was against all the hub sharing ideas because they all involved making the cogs unequal (and most of them were complicated messes).
This solution came about from a discussion with Chip while he was explaining to me how things were currently working in the ALU/main cog pipeline, and how the memory stuff was arranged and split apart. A big part of the reason we "went with it" was because it simplified things, allowed for going down to 32-bit, lower-power memory setups, and because it kept with the spirit of the Propeller where all the cogs are equal and independent of each other. It also gave us better overall bandwidth between cog/hub than we had even for the old P2 design. It was one of those things where Chip was super excited the whole time, which in my experience means very cool things.
Yes it is simple, and easy to explain, especially with the lazy susan concept.
So, let's assume I'm a vegetarian (I am) and all the veggie dishes are placed next to each other. How long will I have to wait for my meal which consists of 4 different dishes given...
a) that it's constantly rotating
b) it takes me a finite time to transfer the food from the dish to my plate
c) different dishes take different amounts of time to transfer
d) I like some things more than others

Oh wait, I know, there's an app for that.
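For what it's worth, mapping the question straight onto the scheme (a toy reading: the dishes are the 16 RAM slices passing at one per clock, with (c) and (d) set aside because the hardware moves exactly one long per slot): with the 4 dishes adjacent,

$$t_{\text{worst}} = 15 + 4 = 19 \text{ clocks}, \qquad t_{\text{avg}} = 7.5 + 4 = 11.5 \text{ clocks},$$

i.e. at worst you wait most of one turn for the first dish, then take one per clock.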
I have to agree with Brian; for an outsider this scheme is a bit of a brain bruiser. I think I get it, but it's complicated. Complicated enough that until I see a working FPGA image, and a C compiler that generates code for it without forcing the coder to be aware of the underlying architecture, I'll remain skeptical.
If you can't explain it in a couple of paragraphs and one diagram, then it ain't gonna fly.
The first thing that has to go is any attempt to explain it by way of any analogy. Gears, hubs, lazy susans, Ferris wheels, they all have to go. It's a big rectangular box in the middle of the block diagram that people have to explain.
There's no tricks with the cog/hub memory bandwidth. It's 16 longs every 18 clocks, which is about 711MB/s. In real-world practical use cases (where you are actually doing work to read/write the data from/to pins) you will probably not realize the full bandwidth except in short bursts.
The 800MB/s number comes from the fact that during the 16-long transfer period it is going at the 800MB/s rate, but then you have a 2-clock gap between each of those bursts for the RDBLOCK/WRBLOCK instruction.
The REAL kicker is that if multiple cogs are working together you could easily realize much higher overall throughput (up to around 12GB/s actually being possible).
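Those figures all fall out of one assumption, a 200 MHz SysCLK (the clock the quoted numbers imply); a quick check:

#include <stdio.h>

int main(void)
{
    const double fclk = 200e6;                  /* assumed SysCLK       */
    double burst = 4.0 * fclk;                  /* 1 long per clock     */
    double sustained = 16 * 4.0 * fclk / 18.0;  /* 16 longs / 18 clocks */
    printf("burst:     %.0f MB/s\n", burst / 1e6);       /* 800         */
    printf("sustained: %.1f MB/s\n", sustained / 1e6);   /* 711.1       */
    printf("16 cogs:   %.1f GB/s\n", 16 * burst / 1e9);  /* 12.8        */
    return 0;
}

16 x 800MB/s is 12.8GB/s, which is where "around 12GB/s" comes from.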
I don't think we need any additions or changes from what Chip has implemented and originally described.
I think people just need to stop thinking it's more complex than it is, and stop trying to make it more complex than it is with new stuff that really doesn't buy you anything worthwhile.
Sounds great.
Now show me how what Chip has described, (no improvements), can stream Data continually into HUB, at 3 SysCLKs per sample, no jitter.
Or is that sort of speed gain, over what we have now, what you meant by 'worthwhile' ?
Now show me how what Chip has described, (no improvements), can stream Data continually into HUB, at 3 SysCLKs per sample, no jitter.
Or is that sort of speed gain, over what we have now, what you meant by 'worthwhile' ?
3 SysCLKs per sample? That's slower than WRBLOCK is now. It's effectively 1.125 SysCLKs per long now. Jitter only matters when hitting the pins, not when hitting HUB.
The first thing that has to go is any attempt to explain it by way of any analogy. Gears, hubs, lazy susan's, Ferris wheels, they all have to go. It's a big rectangular box in the middle of the block diagram that people have to explain.
Why "a big rectangular box", when the rotation model is exactly how this actually works ?.
But that's one of the things that concerns me. How on earth do you put that as a bullet point on a datasheet?
That number is easy, but of more interest to customers is simple examples of what the device can actually DO.
For example,
"be configured so 3 COGs can stream Pin information into Main Memory at 200Ms/s (3 SysCLKs each)"
and
"be configured so 3 more COGs can stream Pin from Main Memory at 200Ms/s (3 SysCLKs each)"
and do this at the same time, with no bandwidth impact on the 10 COGs left.
What would you like to do with the remaining 10 COGs ?
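One consistent reading of those numbers (assuming a 200 MHz SysCLK, as above): three cogs each sample once per 3 SysCLKs, phase-shifted by one clock, so between them they own every clock. A sketch of that stagger:

#include <stdio.h>

int main(void)
{
    /* Three cogs sampling every 3rd clock, offset by 1 clock each,
       cover every SysCLK: 200 Ms/s combined at 200 MHz.              */
    for (int clock = 0; clock < 9; clock++)
        printf("clk %d: cog %d samples\n", clock, clock % 3);
    printf("per cog: %.1f Ms/s, combined: 200 Ms/s\n", 200.0 / 3);
    return 0;
}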
3 SysCLKs per sample? That's slower than WRBLOCK is now. It's effectively 1.125 SysCLKs per long now. Jitter only matters when hitting the pins, not when hitting HUB.
Yes, I want to sample the pins at 3 SysCLKs, no jitter (or drive the pins at 3 SysCLKs, no jitter).
Instructions are 2 sysclks each. Reading INA is possible every 2 sysclks, so you could in theory read pins at 2 sysclks per sample in bursts on a single cog, and then burst that out to HUB. With 2 cogs doing it, you could achieve continuous reading of pins at that rate. Not sure what you are advocating changing or doing for your 3 sysclk thing, but 2 sysclks seems doable already...
Instructions are 2 sysclks each. Reading INA is possible every 2 sysclks, so you could in theory read pins at 2 sysclks per sample in bursts on a single cog, and then burst that out to HUB. With 2 cogs doing it, you could achieve continuous reading of pins at that rate. Not sure what you are advocating changing or doing for your 3 sysclk thing, but 2 sysclks seems doable already...
No cigar. This is continual: 500k samples, no pauses or jitter.
Why "a big rectangular box", when the rotation model is exactly how this actually works ?.
So, what exactly is rotating, and how does that answer my question in post #290?
I'm not against analogies, I even use them myself to explain technical concepts to non-technical people, but when you have to have one in a datasheet then it's time to worry.
No cigar. This is continual: 500k samples, no pauses or jitter.
Like I said, with 2 cogs you could get continuous reading at 2 sysclks per sample. One would read 16 and write it to HUB; the other would read 16 while the first was writing to hub. They would alternate. They could be synced up easily, so no jitter, and continuous reading...
People have done similar setups on the P1 with multiple cogs synced up to read data quickly to HUB. The same will be possible on P2, with MUCH higher rates possible.
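A schematic C model of that two-cog ping-pong (the 32-clock phase length is an illustrative assumption, not a worked PASM loop): while one cog samples 16 longs at 2 SysCLKs each, the other writes its full buffer to hub, then they swap.

#include <stdio.h>

int main(void)
{
    /* One phase = 16 samples x 2 SysCLKs = 32 clocks, which easily
       covers a 16-long block write (16 clocks plus overhead), so the
       writer always finishes before the buffers swap: no pauses.     */
    for (int phase = 0; phase < 4; phase++) {
        int sampler = phase & 1;             /* alternates 0,1,0,1    */
        int writer  = sampler ^ 1;
        printf("clocks %3d-%3d: cog %d samples 16 longs, "
               "cog %d block-writes to hub\n",
               phase * 32, phase * 32 + 31, sampler, writer);
    }
    return 0;
}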