P2 - New Instruction Ideas, Discussions and Requests
Cluso99
Posts: 18,069
I have started a new thread to discuss ideas etc., so they can be in one place, away from the other generalised discussions that take place on the main thread.
Post links to other discussions where relevant.
This thread can be used so we can summarise the good points more easily after the imminent FPGA release has been done.
So it's not lost, Bill proposed a change to the SETRACE instruction
http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG?p=1250704&viewfull=1#post1250704
Bill, would you mind reposting here?
Comments
Chip,
AFTER the next FPGA image is released...
if it is not a lot of logic, or work, I think a slight addition to TRACE would improve debugging capabilities greatly, and also make verifying P2 shuttle runs a bit easier.
According to the latest documentation, SETRACE D/# only uses the lowest four bits to control where the trace data is placed, as SETRACE #%TTTT
If it is easy, then
SETTRACE D/#%FFFTTTT
would be nice, where FFF decodes as:
%0XX - operate as currently defined
%1tt - filter output for task tt only, and in the output, show current pipeline stage for task tt instruction being executed
If it is simpler to implement, it could be fixed to task3 as
%FTTTT
where
F = 0 work as currently defined
F = 1 filter to task three trace only, replace task id in trace output with task3's pipeline stage
By setting task 3 to 1/16, we would get a "slow motion" view of T3's instructions as they pass through the pipeline (including REPx blocks)
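For concreteness, here is a minimal C sketch of how the proposed %FFFTTTT operand could be decoded. The struct and field names are just illustrations of the encoding described above, not actual P2 logic:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical decode of the proposed SETTRACE D/#%FFFTTTT operand.
   Bits 3..0 (TTTT) keep their current meaning (where trace data goes),
   bits 6..4 (FFF) select the proposed filter mode. */
typedef struct {
    uint8_t dest;     /* TTTT: trace destination, as currently defined */
    int     filtered; /* 1 = filter trace output to a single task      */
    uint8_t task;     /* tt: task whose pipeline stage is reported     */
} trace_cfg_t;

static trace_cfg_t decode_settrace(uint8_t operand)
{
    trace_cfg_t cfg;
    uint8_t fff = (operand >> 4) & 0x7;

    cfg.dest     = operand & 0xF;        /* %TTTT                     */
    cfg.filtered = (fff & 0x4) != 0;     /* %1tt -> filtering enabled */
    cfg.task     = fff & 0x3;            /* tt   -> task to filter on */
    return cfg;
}

int main(void)
{
    trace_cfg_t cfg = decode_settrace(0x7F);  /* %111_1111: filter task 3 */
    printf("dest=%u filtered=%d task=%u\n", cfg.dest, cfg.filtered, cfg.task);
    return 0;
}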
What do you think?
I am sure there are a few things that would really help. This doesn't mean I don't like your idea Bill, because I do.
Ozpropdev will likely have some ideas too, with his new tracer.
Previously I suggested that it might be nice to be able to pause execution, but that may be no longer necessary with the new task/thread instructions.
BTW I don't mind using extra bits if it's optional to output more info.
or a link to CTRx as the clock source. This would allow easy generation of "scope"-like viewing of I/O pins. :):)
Also an optional transfer cancel when PTRY rolls over, to prevent data overwrite.
Speaking of which, given we can send an entire wide's worth (256 data bits) to the hub every 8 clocks, it would be rather cool if we could sample all 96 (or 128) pins in the same clock cycle and write them all to the hub in one go (i.e. no skew in sampling). A rudimentary logic analyzer application is then possible, and we can store ~252kB of data in the hub. It would be awesome if one COG could do this, though maybe 3-4 COGs could already do it (doing 1 port each) if they can be aligned to sample their port on the exact same sample clock and we ensure it is not their hub cycle when they sample. So only half of the 8 sequential hub time slots can be used to sample; the other half can be used to write the data. A REPD could allow a tight loop with a lot of samples written before we have to stop, and prevent running out of hub memory once triggered. The other 4-5 COGs are free for data display and analysis applications etc.
With the 3-4 COGs approach you could probably sample up to 100MHz.
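As a rough sanity check of those figures, a small C sketch of the arithmetic (the 200MHz clock, 128 pins and ~252kB of free hub RAM are assumptions taken from the post above):

#include <stdio.h>

int main(void)
{
    const double sysclk_hz    = 200e6;          /* assumed P2 system clock        */
    const int    hub_window   = 8;              /* clocks per hub rotation        */
    const int    sample_slots = 4;              /* half the window spent sampling */
    const int    pins         = 128;            /* 4 ports x 32 pins              */
    const double hub_bytes    = 252.0 * 1024.0; /* hub RAM available for samples  */

    double sample_rate_hz   = sysclk_hz * sample_slots / hub_window; /* 100 MHz  */
    double bytes_per_sample = pins / 8.0;                            /* 16 bytes */
    double samples          = hub_bytes / bytes_per_sample;
    double capture_us       = samples / sample_rate_hz * 1e6;

    printf("sample rate : %.0f MHz\n", sample_rate_hz / 1e6);
    printf("samples     : %.0f\n", samples);
    printf("capture time: %.1f us\n", capture_us);
    return 0;
}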
How much of a stretch is it for COGSTOP to work on the next cycle and to implement a COGPUSH that allowed one cog to dump another cog's entire state into hub RAM? Or even to dump it through the intercog exchange? I mentioned before in the "What's needed for preemptive..." thread that I feared it would take too much silicon, but maybe it's a little closer now.
***** that's five stars, as far as the rating system goes.
I have no concept what a prescaler does to the architecture, but it sounds awfully nice.
Chip has said the Counters already have capture (I hope with atomic control), and a variant on simple grab-and-save is to capture time-stamps on edges.
A small variant on this would be to allow Capture from a group of pins, and save a pin pattern with the time-stamp.
If you do want a burst save, then a REP loop should allow quite high speeds, even if not with huge depth.
Those two approaches can combine: packed samples for short bursts at high speed, and then below a moderate speed, the edge-time-tag approach has very wide dynamic range, for a more generally useful engineering capture.
In some cases at lower capture frequencies it may be possible to stream data to external SDRAM, which would increase the pin capture depth significantly. For example, if we collect 32-bit samples (one port) of capture data at 50MHz we get a 200MB/s data stream, which I think is a readily achievable bandwidth we can target for writing 16-bit wide SDRAM @ 200MHz when using large transfers. We could also try to capture a 32-bit port and a 16-bit port together (15 usable pins on the second port, given how many pins the SDRAM interface would consume). There is possibly enough SDRAM bandwidth for that too. Hub bandwidth using wides is 32 * 25MHz = 800MB/s @ 200MHz P2, so we have plenty of that. Maybe 2 SDRAM COGs would be needed for sustaining the writes to SDRAM in a ping-pong manner and keeping data flowing at the peak rate. I'm not totally sure of the final capabilities there as it depends on the new XFER engine and clock speed etc.
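Putting the quoted bandwidth figures side by side in a quick sketch (the clock rates and bus widths are the assumptions from the paragraph above):

#include <stdio.h>

int main(void)
{
    /* Figures quoted above; all clock rates are assumptions. */
    double capture   = 4.0  * 50e6;         /* 32-bit port @ 50 MHz   -> 200 MB/s        */
    double sdram     = 2.0  * 200e6;        /* 16-bit SDRAM @ 200 MHz -> 400 MB/s peak   */
    double hub_wides = 32.0 * (200e6 / 8);  /* one 32-byte wide per hub slot -> 800 MB/s */

    printf("capture stream : %.0f MB/s\n", capture   / 1e6);
    printf("SDRAM peak     : %.0f MB/s\n", sdram     / 1e6);
    printf("hub wide rate  : %.0f MB/s\n", hub_wides / 1e6);
    return 0;
}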
Love this idea. Would be even better if we could also take 8 32-bit, 4 64-bit or 2 128-bit samples, one per clock, put them in a wide and store it in the hub RAM.
The basic 3-4 COG algorithm I was thinking of for reaching 100MHz sampling on all 3 or 4 ports would go something like this below. You'd need to ensure each COG chosen writes in the odd hub cycles (or all in the even hub cycles) so they can sample at the same clock cycle, once they become synchronized to the hub.
If you only wanted to acquire a single 32 bit port at a time, you could just use two COGs (one odd and one even) and this would hit 200MHz. The collected data would be interleaved in two regions of hub memory but you can always compensate for that offline in your analysis application.
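To show what "compensate for that offline" might look like, here is a small C sketch that merges the two per-COG capture regions back into one time-ordered stream. The buffer layout (even-slot COG holds the earlier sample of each pair) is a hypothetical example, not a defined format:

#include <stdint.h>
#include <stddef.h>

/* De-interleave two capture regions written by an "even-slot" and an
   "odd-slot" COG into a single time-ordered stream. */
static void deinterleave(const uint32_t *even_buf, const uint32_t *odd_buf,
                         uint32_t *out, size_t samples_per_cog)
{
    for (size_t i = 0; i < samples_per_cog; i++) {
        out[2 * i]     = even_buf[i];  /* sample taken on the even clock */
        out[2 * i + 1] = odd_buf[i];   /* sample taken one clock later   */
    }
}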
EDIT: Maybe this new pin transfer engine Chip has recently updated can help out here and give us a boost if things can be done in parallel.
Yes, I understand most of this can be done using multiple cogs and clever programming, but having hardware read the pins at 200MHz, pack the data into wides, and store it to hub RAM using a single cog and one instruction would be fantastic. In 20/20 hindsight I should have said in multiples of 8 pins, so 8, 16, 32, 64, 128 bits. Of course the data rate for 64 and 128 bits would be limited by hub access, so multiple cogs would be needed to hit 200MHz.
A cog can already sample 16 or 32 bits of a port at the full clock frequency (200 MHz) and write the data to hub RAM. Either use the Pin-Transfer hardware or the new Auto-Read/Write-Wide instructions: it's even easier to sample at full clock speed than at 1/2 clock rate, which would need an unrolled loop.
Andy
I know this was more extensively debated regarding consuming any spare hub slots, but I don't remember Chip having anything else to say about ease of implementation.
What I think would be nice is two options...
1. A cog can donate its slot to another cog, either giving the other cog priority, or retaining priority.
Typically two cogs would be paired by software and would cooperate. They would typically be spaced 4 slots apart, i.e. cogs 0 & 4, 1 & 5, etc.
2. Any cog can set itself to utilise unused slot(s).
Typically, this would just mean that when the cog requires a hub slot, it would grab the next free (unused) slot, and then typically it would not require its next allocated slot, thereby placing it back in the unused pool.
Chip put forward a plan but to me this was more complex than necessary.
1) example
DONOR cog 0 donates its slot to DONEE cog 4
DONEE is now locked to 4-clock hub cycles
Any of the 4-cycle-apart slots that DONEE does not use, DONOR can use
(DONEE/DONOR names totally arbitrary, could be A/B whatever)
2) "GREEN RECYCLE MODE"
A cog that sets the GREEN mode can use its own slots, and any otherwise unused slots
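A behavioural C sketch of how those two modes could resolve a single hub slot. The structure and field names are just a model of the description above, not a proposal for the actual silicon:

#include <stdint.h>

#define NCOGS 8

typedef struct {
    int wants_hub;  /* cog is requesting a hub access this cycle          */
    int partner;    /* paired cog (DONOR/DONEE), -1 if unpaired           */
    int priority;   /* 1 = this cog is the DONEE (first pick of the pair) */
    int green;      /* cog opted in to GREEN recycling of free slots      */
} cog_t;

/* Return the cog granted hub access on this clock, or -1 if the slot
   goes unused.  'slot' is the cog that owns the current hub cycle. */
static int grant_hub(const cog_t cogs[NCOGS], int slot)
{
    int p = cogs[slot].partner;

    if (p >= 0) {
        /* paired: the DONEE gets first pick of both slots in the pair
           (4-clock access), the DONOR takes whatever is left over */
        int donee = cogs[slot].priority ? slot : p;
        int donor = cogs[slot].priority ? p    : slot;
        if (cogs[donee].wants_hub) return donee;
        if (cogs[donor].wants_hub) return donor;
    } else if (cogs[slot].wants_hub) {
        return slot;  /* normal deterministic 1:8 access */
    }

    /* otherwise the slot is unused: recycle it to a requesting GREEN cog
       (a real design might rotate this choice round-robin) */
    for (int c = 0; c < NCOGS; c++)
        if (cogs[c].green && cogs[c].wants_hub)
            return c;

    return -1;  /* slot goes unused */
}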
I agree that the above two scenarios are all we need right now. Chip at one point suggested that if there is more than one "GREEN" cog, green cogs could share unused hub cycles round-robin.
GREEN would provide a huge speed boost to hubexec.
1. Yes, agreed. I hadn't actually thought that the donor could use unused slots of the donee too, but it makes sense. If it's not too much additional silicon, then if neither requires the slot, it becomes available to the pool.
2. Yes, agreed.
With the wide access, I am expecting there are not going to be many cases where cogs will actually be able to use much more than 1:8 slots, except maybe in a short burst where the instruction cache gets loaded and there is also hub data access too. Most useful is the ability of a cog to not have to wait for its own slot, but instead get an earlier slot, and then not require its own slot. This will give a nice performance boost.
Should there be an option of turning all cogs "off", so to speak? … so that the machine operates as one big cog at full clock speed until it resumes normal operations… at some time or at some command?
I can see why this hasn't been sorted… consider 8 cogs with 4 tasks each… at 80 MHz we have the equivalent of 32 P1 cogs… sort of. Then we steal back hub cycles using some scheme.
If a cog is already multitasking… why bother? If a cog isn't multitasking AND only needs a half, quarter or eighth of its hub slots, then I guess it sort of makes sense.
Rich
For example, if Cog 0 writes a value to the hub, which cog will first see that value in the round-robin order?
IIRC on the P1 it is 4 slots, so if Cog 0 writes a value, Cog 4 will be the first to see it.
If we don't have any docs, maybe someone with a DE2-115 would be willing to find out through testing (or Chip could tell us).
C.W.
Exactly, all "GREEN" / "PAIRED" gets us is reduced latency... but that will be a HUGE deal to hubexec, especially as there is only one line of dcache per cog.
I expect hubexec data access to be 2-3x faster with "GREEN"
It will especially help with tasks where more than one task (cog mode or hubexec) needs hub access.
"PAIRED" would allow one cog (in a pair) to be able to deterministically count on 4 cycle hub access - a win for some applications, even if it does not increase hub bandwidth.
Also, I'd expect the default would be all cogs coming up in "HUB8" mode - ie 8 cycle deterministic, and a cog would have to execute some configuration instruction to enter DONOR/DONEE or GREEN modes.
In another thread I proposed a hub arbitration scheme that would work as follows:
- cogs start up using their own hub-slot, and must explicitly run a "HUNGRY" instruction to enable using other slots
- if a cog is not accessing the hub during its slot time the hub-slot is automatically available to other cogs
- a cog always gets guaranteed hub access during its own hub-slot
- access to unused hub slots is granted on the basis of the time since the last access, and the time till its next hub slot
The arbitration algorithm should reward cogs that haven't used the hub for a while, and are not close to their own hub slots.
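One way to express that reward rule as a simple priority function (a sketch only; the weighting of the two terms is an arbitrary choice, not part of the proposal):

/* Among cogs competing for an unused slot, favour the one that has
   waited longest since its last hub access and is furthest from its
   own guaranteed slot.  Higher value = higher priority. */
static int hungry_priority(int cycles_since_last_access,
                           int cycles_until_own_slot)
{
    return cycles_since_last_access * 8 + cycles_until_own_slot;
}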
Great data Dave, thanks. Which propgcc mode did you run the test in?
The above is the same as what I proposed; it makes sense.
Recently I started calling it "GREEN recycle mode" - some people seemed to be offended by the "HUNGRY" label for some reason.
Sounds good to me!
I think this is effectively pretty much the same as what Chip was suggesting when I was discussing this with him - he wanted to give round-robin access to unused slots, to ensure fair distribution.
I suspect that the greatest performance wins will be for heavy hub usage apps, such as VMs and multi-tasking hubexec cogs, as those usage cases are badly affected by only having a one-line dcache.
Again, thanks for getting a benchmark for this!
Fibo running with a hub stack should also speed up nicely.
A fairly simple way to arbitrate the hub would be to weight a hub request by the distance from the cog's hub-slot. The weighting might look like this: each column represents a cycle in the repeating sequence of 8 cycles, and each row represents a different cog. The weight pattern is offset by one cycle for each cog. Of all the cogs requesting the hub, the cog with the highest weight for the particular cycle will get the hub. A weight of 7 guarantees that the cog will get access during its hub slot if it needs it. The drawback of this scheme is that it doesn't factor in the length of time since the cog last used the hub.
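The weight table itself didn't make it into this copy of the post, so the values printed below are a reconstruction under the stated constraints (weight 7 in a cog's own slot, pattern offset by one cycle per cog); the exact numbers may differ from the original:

#include <stdio.h>

/* Reconstructed weight matrix: each row is a cog, each column a cycle
   in the repeating sequence of 8.  A cog scores 7 in its own slot and
   one less for every cycle after it, wrapping every 8 cycles. */
int main(void)
{
    for (int cog = 0; cog < 8; cog++) {
        for (int cycle = 0; cycle < 8; cycle++) {
            int weight = 7 - ((cycle - cog + 8) % 8);
            printf("%d ", weight);
        }
        printf("\n");
    }
    return 0;
}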
To clarify my thoughts, having a scheme where you explicitly manage the hub access by pairing up cogs or some other explicit management seems wrong to me. It sounds error-prone and will definitely make reuse difficult for any code that uses it. If we're proposing hub access slot sharing, I prefer a hub access model where every cog is hungry. The only question then becomes: how to manage code that is designed to run in lock-step with hub access, explicitly relying on the 8-cycle delay for its own timing?
I like Dave's hub access matrix listed above, it's a good, clear statement of the logic. It makes it clear what cog will have precedence at any given hub cycle.
I can see a case for paired cogs (deterministic 4-cycle access), however anything more than that becomes more complex than we need for P2, and should be left for P2.1+
In my opinion, any "prioritizing" complicates things too much. Two classes of cogs - deterministic hub access (8 cycles and perhaps 4 with pairs), and other cogs that can use early/more frequent hub access to reduce latency (green recycling cogs) is enough for now.
Edit:
"Hungry" or "Green" CANNOT affect cogs that expect to run in lock-step with the hub every eight clocks, as that is the normal & default behaviour.
Only cogs that don't need that determinism, and want a chance at recycling hub slots that would otherwise go unused in order to decrease latency for hub access would ask for "recycled" slots.
Prediction:
1) High bandwidth drivers (HD etc.) will use the fully deterministic access with RDWIDE, as that can already maximize hub bandwidth and is predictable.
2) Compiled code and virtual machines will use "GREEN" mode to run faster and partially work around the limits of a single dcache line
And...
Gets us to individually setting the hungry mode per COG, with the assumption that we know it gets X cycles minimum. You could flag all the cogs, or just one, etc...
If timing needs to be explicit, then the developer can mark it not hungry, or use other means to establish the timing if the program would fail when granted more HUB cycles than it would get under strict round robin.
Completely agreed. This discussion is a lot like the tasking one. We talked about a ton of stuff, all of which boiled down to a couple of key features in the silicon.
Just this one addition would leave the rest of COG coordination in software where it should be. And code that needs or is written to its share of cycles will always get them.