Any tricks to make passing data between cogs using hub RAM more efficient?

ags · 2011-03-09 07:48

I posted this question (inappropriately) buried in another thread in another sub-forum, and would like to "elevate" it to its own topic on the proper forum.

Background:

I'm working on a project that requires high data throughput (data is received from the WIZnet module that is used in the Spinneret). The PASM driver that gets data from the WIZnet module will run in one cog, with the received data available as bytes in that cog's RAM. The data will be processed, manipulated, and then sent out a Propeller pin; which pin is determined by specific values embedded in the data. The output format will require bit-banging. The WIZnet is expected to produce data faster than any one cog could output it, so there will be multiple cogs running in parallel, each consuming a slice of the data produced by the WIZnet driver cog.

The only way I know how to get the data from the WIZnet driver ("master") cog RAM to the output pin driver ("slave") cog RAM is through the shared hub RAM. Timing will be critical if I am to sustain the desired data rate. If every time the master cog wants to write to hub RAM it has to wait the worst-case 22 clock cycles (from the documentation), and every time a slave cog wants to read from hub RAM it's the same situation, that would be a problem.

Questions:

1) Are there methods used to reliably make sure that each cog is able to read/write hub RAM in 7 cycles instead of closer to the worst-case 22 cycles? (I will make a point of writing a long at a time rather than bytes/words to save time that way.)

2) Is there something I can do within each cog PASM code so that it can get 2 hub RAM reads or writes in one "slice" of access to shared resources?

3) Can I treat wrlong/rdlong as an atomic operation? If a rdlong takes 7 clocks (best case) it seems that it would not be complete before the next cog in line could try to read that same location. Will I need to use locks to protect against this problem?

4) Am I on the right track? Are these the right questions?

Thanks to any and all that offer their knowledge and experience on this.

Bill Henning · 2011-03-09 08:20

1) NO

Best case: back-to-back hub ops, 16 cycles for all but the first hub op

2) NO

3) atomic: YES

NO WAY to get more than one hub access every 16 cycles, and that requires back to back hub access, with first access taking up to 22 cycles

4) good questions, but can't break the rules. Read the Propeller datasheet re: round-robin hub access
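To put the "one hub access every 16 cycles" limit into numbers, here is a rough back-of-the-envelope sketch in Python. The 80 MHz system clock is an assumption (the thread does not state a clock speed), and `hub_ceiling` is just an illustrative helper name, not anything from a Propeller library.

```python
def hub_ceiling(clock_hz, bytes_per_access=4, clocks_per_access=16):
    """Upper bound on bytes per second one cog can move through hub RAM.

    Assumes one hub access per 16 clocks (the limit stated above) and
    ignores the longer first access of a burst.
    """
    accesses_per_second = clock_hz // clocks_per_access
    return accesses_per_second * bytes_per_access

CLOCK_HZ = 80_000_000  # assumption: 80 MHz system clock, not stated in the thread

print("longs:", hub_ceiling(CLOCK_HZ, bytes_per_access=4), "bytes/s")
print("bytes:", hub_ceiling(CLOCK_HZ, bytes_per_access=1), "bytes/s")
```

Under those assumptions, moving longs instead of bytes quadruples the ceiling for the same number of hub slots, which is why writing a long at a time is worth the effort.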

ags · 2011-03-09 14:27

Bill Henning wrote: »

4) good questions, but can't break the rules. Read the Propeller datasheet re: round-robin hub access

I've read all that the Propeller Manual (the .pdf that ships with the Propeller Tool) has to say on the subject. Maybe I'm missing something. I've read elsewhere on this board that that manual is not correct, but don't know exactly how or where.

[Note: I just read the datasheet (page 7) as suggested. I see that a hub instruction can only happen on the first of the two clocks during which the cog "owns" hub resource access. As you said, only one hub operation per cog time-slice is possible; the best thing I can do is to interleave 2 non-hub instructions in between hub operations to avoid having the cog doing nothing while waiting between hub operations.]

I'm stuck early on at a fundamental issue: if each cog has just two clocks per slice of access to shared resources (namely hub RAM) and a hub RAM operation takes 7 clocks best case, how can that work? For example, if cog0 writes to hub RAM address $4000 during its "slice of time", and 2 clocks later cog1 gets its "slice of time" and attempts to read from hub RAM address $4000, and that's just 2 clocks after cog0's write started (which takes 7 clocks, so it is not done), how is there not a conflict?

[Note: it is clearly stated in the datasheet (page 7) that this will not cause a resource conflict. All I can guess is that there is some pipelining going on and the actual read/write to hub RAM happens always at exactly clock<n> after the cog gains access to hub RAM, and each successive cog will have the same delay of <n> clocks, so there is no conflict. I wondered why there are 2 clocks per time slice, and if the answer is that the actual read/write from hub RAM takes exactly 2 clocks (of access time - where a conflict could occur) that would explain it.]

As I said, I must be missing some fundamental understanding. I would be very happy to read any other documentation that is available.

Thanks.

Mike Green · 2011-03-09 14:57

You've got some of the basic facts correct, but they go together a bit differently than you seem to think. Figures 1-3 and 1-4 on page 25 of the Propeller Manual show how the timing works. The hub read or write instruction takes a couple of clock cycles to set up. If the hub access slot for the cog comes up immediately, the hub read or write occurs immediately and the instruction takes a total of 7 clock cycles. If the hub access slot has not come up when the cog is ready, the cog waits until the hub access slot comes up. The worst-case wait would be 15 clock cycles, for a total execution time for the instruction of 22 clock cycles.
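The timing described above can be modeled in a few lines of Python. This is only a sketch using the figures quoted in this thread (7 clocks best case, a hub slot every 16 clocks); `slot_phase` is a hypothetical parameter standing for the start time, modulo 16, at which an instruction happens to meet its slot with no wait.

```python
HUB_PERIOD = 16  # clocks between one cog's hub slots (8 cogs x 2 clocks each)
BEST_CASE = 7    # hub instruction length when the slot is ready at once (figure from this thread)

def hub_op_cycles(start_clock, slot_phase=0):
    """Clocks taken by a single hub instruction that begins at start_clock.

    slot_phase is a made-up modeling parameter: the start time (mod 16)
    at which the instruction needs no wait at all.
    """
    wait = (slot_phase - start_clock) % HUB_PERIOD  # 0..15 clocks of stalling
    return BEST_CASE + wait

for start in (0, 1, 8, 15):
    print(f"start at clock {start:2d}: {hub_op_cycles(start)} clocks")
```

In this model the instruction length ranges from 7 clocks (perfectly aligned) to 22 clocks (just missed the slot), matching the best and worst cases given above.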

For successive accesses to hub RAM, the first of a series of hub accesses can take from 7 to 22 clock cycles. Two cog-only instructions (4 clock cycles each) can execute after that, for a total of 7 + 4 + 4 = 15 clock cycles, followed by another access to hub RAM. The hub will force the cog to wait for one clock cycle, then allow the hub access to take place. It's possible to have a sequence of hub accesses this way with only one idle clock cycle per access. Usually one or both cog-only instructions are used to increment the hub address and increment / decrement the cog address.
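A small Python simulation of such a loop, again using the thread's figures (7-clock hub instruction plus any wait, 4-clock cog-only instructions, one slot every 16 clocks), shows why two interleaved instructions fit for free and a third does not. `loop_cycles` and its `start_clock` parameter are illustrative names, and the model is a simplification rather than a cycle-exact description of the chip.

```python
HUB_PERIOD = 16  # clocks between one cog's hub slots
BEST_CASE = 7    # hub instruction length with no wait (figure from this thread)
COG_INSTR = 4    # clocks for an ordinary cog-only instruction

def loop_cycles(cog_instrs, iterations=4, start_clock=3):
    """Clocks used by each pass of: one hub op, then cog_instrs cog-only instructions."""
    clock = start_clock
    passes = []
    for _ in range(iterations):
        begin = clock
        wait = (0 - clock) % HUB_PERIOD    # stall until this cog's slot lines up
        clock += BEST_CASE + wait          # the hub instruction itself
        clock += cog_instrs * COG_INSTR    # the interleaved cog-only work
        passes.append(clock - begin)
    return passes

for n in range(4):
    print(n, "cog-only instructions per pass:", loop_cycles(n))
```

After the first pass synchronizes the cog to its slot, zero, one, or two cog-only instructions all give 16 clocks per pass in this model, while a third instruction misses the next slot and the pass stretches to 32 clocks.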

jazzed · 2011-03-09 15:54

Sometimes it helps to think about ways to employ 2 (or more) cogs to solve one problem, assuming you have such luxury. I'm not sure if that's of any help in your example as I didn't read it in detail. However, you may find it really helps with Propeller programming to step away from your normal toolbox and consider the unique features of the chip and the instruction set for finding better solutions. Just a thought or two that may help. Good luck with your Propellering.

Tubular · 2011-03-09 17:14

I'm no expert on this, but didn't Linus (Turbulence demo) write to and from I/O pins for instant inter cog communication, bypassing the hub bottleneck?

RossH · 2011-03-09 17:14

Tubular wrote: »

I'm no expert on this, but didn't Linus (Turbulence demo) write to and from I/O pins for instant inter cog communication, bypassing the hub bottleneck?

Who has that many pins to spare???
