USB FS gives us an exact case in point. The responses must happen extremely fast, within a small number of clocks. The hub can never deliver this minimum latency. The egg beater does nothing to help in this instance. In fact it makes calculating the latency even more difficult, because you need to know your cog and the other cog, plus the hub address of the byte/word/long.
With the Smart Pin cell support, is FS USB still an issue ?
I thought with Chip's change to the faster pin-cell interface, and the HW support, that USB FS was doable ?
Does anyone have a working FS-USB example yet ?
Aside from USB, are there other use cases where low enough latency cannot be designed in ?
I can see that the eggbeater fundamentals are going to need user COG placement control, for truly deterministic COG-COG data flows.
We are missing a great opportunity by not using the dual port ability of the LUT. There is silicon available, and Chip has done it. It should be a simple matter of addressing to permit extended LUT exec but I can live without it.
I gave an example where more cogs can be used to get better results.
Maybe Chip should leave this in, and let some real use cases be explored ?
If finding consecutive COGs clashes with finding suitable COGs for different pin functionality, that sounds like a real headache.
That is why removing pin-constraints (as Chip is mentioning doing on DACs) makes more sense than removing possible COG-COG constraints.
The Egg-beater already scans COGs in a certain order, that is fundamental to the silicon design.
Any P2 simulator will have to know which COG the code is running in, to correctly simulate timing.
I've not studied Prop1 code that relies on clock counting between Cogs. The only one that comes to mind is the hi-res VGA drivers and they are unidirectional data flow that is timed for output timing, not Hub phase timing. Although that probably also occurs incidentally because the loops are identically matched to Hub's lowest latency window.
The Prop2's lowest latency Hub access window is something that hasn't been discussed that I know of. I certainly don't understand it.
Here is a real life use case that I have been pondering... How many PropCams can be used by a single P2 and deliver 500FPS each? As it stands, the fastest mode is to use a single nib coming from each PropCam... so one long is 8 PropCams and 32+ pins. If we want 500+ fps, as far as I can tell, LUT sharing will reduce the number of cogs used, but have no effect on the maximum FPS. Deterministic Cog signaling, however, will greatly reduce the complexity and stability of the software design.
For the shared LUT case, I have used a very specific case (which has been specifically satisfied by the silicon in the smart pins) to show how passing data between cogs via the hub falls down. The P2 is a very different animal to what was originally conceived as an enhanced P1+.
Everyone seems to presume that the hub egg beater design solves everything. It simply does not!!!
We don't have fixed, calculable latency without some complex maths, which requires knowing not only which two cogs are involved, but also the lower 4 long-address bits of the hub location for a byte/word/long transfer. Fast block transfers are covered quite well, although not quite deterministically, but it really doesn't matter in those cases anyway.
But the plethora of cogs are there to do smart protocols with pins, particularly where the smart pins don't cater for these cases.
To kill the simple method of LUT sharing between adjacent cogs on the basis that all cogs are not equal is simply putting one's head in the sand. If you don't like it, don't use it. If you take this attitude, then you'd best kill the egg beater, because there cogs are not equal in latency!!!
All sorts of tricks can be done with LUT sharing beyond the example I have given. Are we trying to get the best functionality out of the P2? It's been implemented by Chip, so there is no further delay. Even adding extended LUT exec would not be a big delay, and would be worth the trouble, but I still remain happy without it.
Look at it this way... If the USB smart pins were not there, or there is a flaw in the design that shows up, the shared LUT could be the saviour. Remember, this is only for a 12MHz signal. These days that is quite a low speed. What else can we do, or not do, because we cannot use cooperating cogs due to sharing latency, where smart pins are unable to help.
Continual discussions on the P2 have been about how to increase data transfers. This is no coincidence. The P1 has a problem here, and many of us have hit real problems here. There have been discussions about slot sharing, threads, hubexec. All these and many others have been attempts to solve inter-processor communication. P2HOT was 1-clock instructions. The current P2 is 2-clock instructions.
Who knows, perhaps 10Mbps Ethernet might be possible, maybe even 100Mbps with smart pins and a few cogs using shared LUT.
Another 'real world' use case I was wondering about, was Ethernet ?
With the streamer supporting nibble transfers, it seems the 25MHz nibble flow that 100Mbps Ethernet needs (MII is 4 bits at 25MHz) might become doable. 10Mbps is probably easier, but some have mentioned that may be going away... ?
Those are likely to need some serious SW, and maybe two COGs, one for each direction ?
Ethernet will most likely require a cog just for CRC calculation, while another cog decodes the packet, and probably another prepares the reply in parallel with both. Separate cogs for transmit and receive are not the main problem here.
BTW I have not looked at the Ethernet protocol. I was not aware nibbles were used - are you sure of this??? Just reread your comment. FYI, in the 80's I did 2780 bisync and 3270 SDLC with an attached 3274 terminal.
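On the CRC point: the Ethernet FCS is a standard CRC-32, and a minimal bitwise C sketch (reflected polynomial 0xEDB88320, init and final XOR of 0xFFFFFFFF) shows the per-byte work a dedicated CRC cog would have to keep up with at wire speed. Just an illustration, not P2 code:

    #include <stdint.h>
    #include <stddef.h>

    /* Ethernet FCS is the standard reflected CRC-32: poly 0xEDB88320,
       init 0xFFFFFFFF, final complement. Bitwise (no table) to show
       the raw per-byte work a CRC cog would be doing. */
    uint32_t eth_crc32(const uint8_t *p, size_t n)
    {
        uint32_t crc = 0xFFFFFFFFu;
        while (n--) {
            crc ^= *p++;
            for (int i = 0; i < 8; i++)
                crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
        return ~crc;    /* sent on the wire least-significant byte first */
    }

A table-driven version trades a 1KB lookup table for most of that bit-twiddling, which is the sort of thing a LUT could hold.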
Everyone seems to presume that the hub egg beater design solves everything. It simply does not!!!
Well, the presumption isn't quite that it solves everything; instruction stalling is still present. The FIFO can't even flip direction without a stall.
We don't have fixed, calculable latency without some complex maths, which requires knowing not only which two cogs are involved, but also the lower 4 long-address bits of the hub location for a byte/word/long transfer.
The question is: is pre-arranged instruction counting as simple as on the Prop1?
If you take this attitude, then you'd best kill the egg beater, because there cogs are not equal in latency!!!
That's wrong. The amount of variance is even and symmetrically ordered - predictable.
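To make that concrete, here is a toy C model of the rotation as I understand it: 16 hub slices selected by the low 4 bits of the long address, with the rotor advancing one slice per clock (the exact phase relation on real silicon may differ). It shows how the cog-to-cog wait depends on both cog numbers and the address slice, yet follows a fixed, regular pattern:

    #include <stdio.h>

    /* Clocks cog `cog` waits from time `t` until the rotor gives it
       hub slice `slice` (slice = low 4 bits of the long address).
       Assumes cog c reaches slice ((c + t) mod 16) at time t. */
    static int wait_clocks(int cog, int slice, int t)
    {
        return ((slice - cog - t) % 16 + 16) % 16;
    }

    int main(void)
    {
        /* Cog 3 writes a long, cog 4 reads it one clock later:
           total latency varies with the address slice, but the
           pattern is fixed and symmetric - hence "predictable". */
        for (int slice = 0; slice < 16; slice++) {
            int w = wait_clocks(3, slice, 0);
            int r = wait_clocks(4, slice, w + 1);
            printf("slice %2d: write wait %2d, read wait %2d\n",
                   slice, w, r);
        }
        return 0;
    }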
All sorts of tricks can be done with LUT sharing beyond the example I have given. Are we trying to get the best functionality out of the P2? It's been implemented by Chip, so there is no further delay. Even adding extended LUT exec would not be a big delay, and would be worth the trouble, but I still remain happy without it.
The sacrifice would be removal of COGNEW as a feature, dynamic allocation out the window, only COGINIT would exist. Personally, I'm okay with this.
My question is, how do I get those two consecutive COGs started?
Can't you do
COGINIT COGID+1
IIRC Cluso99 asked for a way to get fast, low-latency communication between cogs. It was Chip who brought up the idea of making the LUT dual-port and sharing it.
I agree with Cluso99 that a way of direct (not necessarily big, but fast) communication between cogs is needed.
I do not know how the architecture of the P2 is routed and muxed internally, but since the main cog RAM/registers are dual-port, can't a new opcode set a working mode (default disabled) where two registers (e.g. 0 and 1, or whatever number in the 0..511 range) become shared between cogs? E.g. register 0 with the cog below and register 1 with the cog above?
Or just make a latch (actually two: one to the lower cog and one to the higher cog), writable and readable by two new opcodes (RDCOGL, WRCOGL), where
- RDCOGL: the source selects the upper/lower latch and the destination the register to transfer the value into within the cog; WZ is set on the first read of new data, cleared for consecutive reads (of the same old data). WC acknowledges the read to the other cog.
- WRCOGL: the source selects the register, the destination the upper/lower latch. With WC, the latch is written only if the previous value has been read. WZ returns the write ACK.
Perhaps a write to the latch could be an interrupt source for the written cog.
This would be a direct path only, should not involve any mux (I hope), and also would not break the equality of cogs with each other.
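A C sketch of the latch semantics as described above (my reading of the proposal: a single-entry mailbox per cog pair, with a full/empty flag standing in for the WZ/WC behaviour; RDCOGL/WRCOGL are proposed opcodes, not existing ones):

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        uint32_t value;
        bool     full;    /* set by a write, cleared by the first read */
    } coglatch_t;

    /* WRCOGL with WC: the latch is written only if the previous value
       has been read. Returns the write ACK (the proposed WZ result). */
    bool wrcogl(coglatch_t *l, uint32_t v)
    {
        if (l->full)
            return false;         /* other cog hasn't read it yet */
        l->value = v;
        l->full  = true;
        return true;
    }

    /* RDCOGL: *first_read (the proposed WZ) is true only on the first
       read of new data; rereads return the stale value. The read
       itself is the acknowledge back to the writer (the WC side). */
    uint32_t rdcogl(coglatch_t *l, bool *first_read)
    {
        *first_read = l->full;
        l->full = false;
        return l->value;
    }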
I do not know how the architecture of the P2 is routed and muxed internally, but since the main cog RAM/registers are dual-port, can't a new opcode set a working mode (default disabled) where two registers (e.g. 0 and 1, or whatever number in the 0..511 range) become shared between cogs? E.g. register 0 with the cog below and register 1 with the cog above?
I think when Chip was looking at different sizes, he mentioned that the data mux needs 32 bits and the address mux 9, and halving the LUT area only saves one of those 9 address-mux bits.
In other words, doing a full LUT MUX has low incremental cost, but best benefit.
If it has no speed impact, that makes sense to do.
From a chip physical-placement viewpoint, you can place one LUT area between two COGs (odd/even) and have them share it.
COGNEW makes things simple by getting rid of the need to plan your cogs. Objects don't have to know anything about which cog to use. They just start one up. It is easy on the mind, as it keeps you away from some ugly details that you'd rather not have to think about all the time. LUT sharing aside, I think automatic cog allocation is important.
Sounds good, for general use cases.
Is this P2 COGNEW 100% predictable, when run in the same order ?
ie the user may not need to define a COG#, but their code will always start the same way, and use the same COG#s (I guess in the simple order requested ?)
While at power-on all cogs will normally be allocated in the same order, should one cog start another at some random time, set by some external means, the cog order can differ.
The most efficient (minimal latency) is for cog n to pass a data byte to cog n-1 (or perhaps n-2).
The rule for COGNEW is to use the lowest-numbered inactive cog. If you wanted two cogs in a row, you could, at startup, start them both, and they would be adjacent.
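That rule is simple enough to model in a few lines of C (an illustration of the stated rule, not Parallax code), showing why two COGNEWs issued back-to-back at startup come out adjacent, and why they need not be if anything else starts a cog in between:

    #include <stdbool.h>
    #include <stdio.h>

    static bool active[16];    /* 16 cogs on the current P2 image */

    /* COGNEW rule: take the lowest-numbered inactive cog. */
    static int cognew(void)
    {
        for (int i = 0; i < 16; i++)
            if (!active[i]) { active[i] = true; return i; }
        return -1;             /* no free cog */
    }

    int main(void)
    {
        active[0] = true;      /* cog 0 runs the main program */
        int a = cognew();      /* -> 1 */
        int b = cognew();      /* -> 2: adjacent, so LUT-pairable */
        printf("got cogs %d and %d\n", a, b);
        /* Had any other cog been started between the two calls,
           a and b need not be adjacent - hence "at startup". */
        return 0;
    }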
I don't see 'cog-number-agnosticism' as a holy grail, ...
It is kind of a holy grail when the OBEX comes into play. Without it, jobs that might fit otherwise won't, because of cog-assignment constraints. ("My project has a cog left over. Whaddya mean it won't fit?") Agnosticism also makes PCB layouts way simpler, since cog/pin-assignment decisions don't have to be made at hardware design time. Moreover, software written for a project will not have constraints imposed solely by a particular board layout.
That said, I understand that sophisticated fitting algorithms can overcome many allocation issues. But without total agnosticism, there will always be those projects that won't fit solely because of it. And when they do work, how much does it add to the compile time? When I hear of multi-minute compiles for FPGAs, compared with the fraction of a second that most Spin/PASM compiles require for the Prop 1, it makes my ears bleed. There's a lot to be said for instant gratification!
-Phil
... Agnosticism also makes PCB layouts way simpler, since cog/pin-assignment decisions don't have to be made at hardware design time. Moreover, software written for a project will not have constraints imposed solely by a particular board layout.
Of course, but Pin-agnostic, and COG-agnostic are not quite the same topic.
Also, the pairing needed for LUT sharing does not exclude any other COG from being used, and the DAC fix Chip has recently suggested makes COG-Pin usage more decoupled.
..And when they do work, how much does it add to the compile time?
You likely use (or have used) tool chains that use a linker ? - How much time does that add ?
(A: Nothing to the compile time, and very little to the total build time)
The CPLD fitters I use are usually all done in 1-2 seconds, and that is FAR more complex than any COG allocate could be.
The rule for COGNEW is to use the lowest-numbered inactive cog. If you wanted two cogs in a row, you could, at startup, start them both, and they would be adjacent.
Interesting. So lower-numbered cogs will be used more often than higher-numbered ones. Do they wear out?
I think LUT sharing between two neighbour cogs is too complicated to be used. On the P1 we had mailboxes in the HUB, or unused pins, for cog-to-cog communication. We can still do that on the P2. I also remember some talk about special HUB locations for interrupt-driven mailboxes; not sure if they are still there.
Is it possible to use the smartpin interface to send messages from cog to cog?
As for the signaling register (16-times-ported 32 bits of RAM?), it would be nice to have two bits per cog, so one bit could be data and one clock?
Enjoy!
Mike
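A sketch of how those two bits per cog could work as a little clock/data link; two bits for each of 16 cogs conveniently fills a 32-bit register. This is a pure host-side model of the handshake logic only (the register and bit layout are my guesses, and on real silicon the two sides would run in different cogs):

    #include <stdint.h>
    #include <stdbool.h>

    static volatile uint32_t sigreg;   /* modelled shared signalling register */

    #define DATA_BIT(cog) (1u << ((cog) * 2))
    #define CLK_BIT(cog)  (1u << ((cog) * 2 + 1))

    /* Sender: present the data bit, then toggle the clock bit so the
       receiver knows a new bit is valid. */
    void send_bit(int cog, bool bit)
    {
        if (bit) sigreg |=  DATA_BIT(cog);
        else     sigreg &= ~DATA_BIT(cog);
        sigreg ^= CLK_BIT(cog);
    }

    /* Receiver: spin until the clock bit toggles, then sample data. */
    bool recv_bit(int cog)
    {
        static uint32_t last_clk;
        while ((sigreg & CLK_BIT(cog)) == last_clk)
            ;                          /* wait for a clock edge */
        last_clk = sigreg & CLK_BIT(cog);
        return (sigreg & DATA_BIT(cog)) != 0;
    }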
Is it possible to use the smartpin interface to send messages from cog to cog?
Of course, and many self-test examples will do this.
However, there are caveats :
* It consumes a Pin
* The latency is likely to be larger than simply going via the HUB, so why bother, if you just want COG-COG ?
I was thinking about this cog-number-agnosticism matter, myself, this evening, lamenting that the cogs' four fast DAC channels are hard-tied to pins %CCCCxx, where %CCCC is the cog number. This killed cog-number-agnosticism a while back and the LUT sharing further muddied the water.
Here's what I think we need to do: ...
For a second, I thought the whole outer ring, or whatever it was called, was coming back [as the die size was increased, though perhaps not enough to accommodate that, not to mention that the A9 is out of resources and that Treehouse completed the design work around the periphery]. Anyway, got to love those "Here's what I think we need to do" lines. Just when you thought the movie was over, you get even more bang for your buck (in the sense of design refinements, I mean). Riveting stuff!
The rule for COGNEW is to use the lowest-numbered inactive cog. If you wanted two cogs in a row, you could, at startup, start them both, and they would be adjacent.
COGNEW, being a language feature, could maybe grow a new extension where it can be asked to start multiple tasks at once/consecutively and guarantee those will land in consecutive, ordered Cogs.
EDIT: Of course, this then becomes a complex function that needs a varying bunch of collated parameters. Bit of a turn-off for easy reading.
If your project requires special cog numbers, just let the main program (first to run) start the required cog numbers, with a dummy jump if need be.
Remember, special requirements may require special resources. If you don't use them, everything is fine. But if you do need them, you pay the price, which is better than never being able to do it because the hardware wasn't implemented.
BTW Typically we are talking about requiring two consecutive cogs - there are 8 pairs available!!!
The rule for COGNEW is to use the lowest-numbered inactive cog. If you wanted two cogs in a row, you could, at startup, start them both, and they would be adjacent.
Interesting. So lower-numbered cogs will be used more often than higher-numbered ones. Do they wear out?
No. Otherwise single-CPU micros would wear out too.
I think LUT sharing between two neighbour cogs is too complicated to be used.
Really it isn't. But if you are concerned, don't use it. But please don't deprive others of this big feature.
On the P1 we had mailboxes in the HUB, or unused pins, for cog-to-cog communication. We can still do that on the P2. I also remember some talk about special HUB locations for interrupt-driven mailboxes; not sure if they are still there.
Is it possible to use the smartpin interface to send messages from cog to cog?
Too slow. I am trying to get the fastest (and easiest) way possible.
As for the signaling register (16-times-ported 32 bits of RAM?), it would be nice to have two bits per cog, so one bit could be data and one clock?
If your project requires special cog numbers, just let the main program (first to run) start the required cog numbers, with a dummy jump if need be.
Problem is, OBEX programs will need to be modified to suit. If that's the case then COGNEW has failed its purpose and therefore shouldn't exist.
BTW Typically we are talking about requiring two consecutive cogs - there are 8 pairs available!!!
The idea of going to the trouble of extending COGNEW is to cater to what may be needed.
But it also has a possible compactness advantage if all tasks can be launched by a single statement. May as well provide a general improvement at the same time.
COGNEW makes things simple by getting rid of the need to plan your cogs. Objects don't have to know anything about which cog to use. They just start one up. It is easy on the mind, as it keeps you away from some ugly details that you'd rather not have to think about all the time. LUT sharing aside, I think automatic cog allocation is important.
How do you know COGID+1 is available? Some other COG may have taken it into use since the COG executing that instruction was started. Oops.
Because,
- if you are the programmer of the whole thing, you know what you are doing; this applies to cog allocation too
- if you develop a driver, COGINIT COGID+1 makes the thing dynamic, so it can go in the OBEX. The user knows about the driver's two-cog requirement and knows not to start cogs with COGNEW in between. The driver/OBEX init method can signal the end of initialization, i.e. the completion of the two-cog allocation, so the main init can proceed further.
When you start a cog it is not immediately available; you have to wait for it to boot. You are aware of this, so you can handle it. COGINIT COGID+1 is the same thing in my opinion: if you are aware of it, you can handle it. Moreover, I think this would be an option, not the default behaviour, so whoever uses it is responsible for what they do and how.
COGNEW is still needed, and with the above assumptions it has not failed its purpose.
Any OBEX program that uses COGINIT forces everything to use it.
EDIT: I might be wrong about the separate API thing.
The rule for COGNEW is to use the lowest-numbered inactive cog.
Any code that requires two or more consecutive cogs could start from the highest available cog downward with COGINIT, thus going in the opposite direction to COGNEW, checking for cogs already started by other, similar code. Chances are that it can find all its consecutive cogs at the first attempt. I don't know if there is already an instruction that checks whether a given cog is running or not, but it would be useful in this case.
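The top-down idea in sketch form (a C model only; it assumes some way to test whether a cog is running, which, as noted above, may or may not exist as a single instruction):

    #include <stdbool.h>

    static bool active[16];                 /* stand-in for real cog status */
    static bool cog_running(int id) { return active[id]; }

    /* Claim n consecutive cogs scanning downward from cog 15 - the
       opposite direction to COGNEW's upward search, so the two schemes
       rarely collide. Returns the lowest cog of the free block, or -1. */
    int find_consecutive_from_top(int n)
    {
        for (int top = 15; top - n + 1 >= 0; top--) {
            bool all_free = true;
            for (int k = top; k > top - n; k--)
                if (cog_running(k)) { all_free = false; break; }
            if (all_free)
                return top - n + 1;
        }
        return -1;
    }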
Nobody came back with an answer to my COG allocation question.
If finding consecutive COGs clashes with finding suitable COGs for different pin functionality, that sounds like a real headache.
Compilers can resolve what goes on at run time.
Did you also just say "eggbeater fundamentals are going to need user COG placement control, for truly deterministic COG-COG data flows.".
It gets worse and worse...
@evanh, I'm not.