P2 - New Instruction Ideas, Discussions and Requests
Cluso99
Posts: 18,069
I have started a new thread to discuss ideas etc., so they can be in one place, away from the other generalised discussions that take place on the main thread.
Post links to other discussions where relevant.
This thread can be used so we can summarise the good points more easily after the imminent FPGA release has been done.
So it's not lost, Bill proposed a change to the SETRACE instruction
http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG?p=1250704&viewfull=1#post1250704
Bill, would you mind reposting here?
Comments
Chip,
AFTER the next FPGA image is released...
if it is not a lot of logic, or work, I think a slight addition to TRACE would improve debugging capabilities greatly, and also make verifying P2 shuttle runs a bit easier.
According to the latest documentation, SETRACE D/# only uses the lowest four bits to control where the trace data is placed, as SETRACE #%TTTT
If it is easy, then
SETTRACE D/#%FFFTTTT
would be nice, where FFF decodes as:
%0XX - operate as currently defined
%1tt - filter output for task tt only, and in the output, show current pipeline stage for task tt instruction being executed
If it is simpler to implement, it could be fixed to task3 as
%FTTTT
where
F = 0 work as currently defined
F = 1 filter to task three trace only, replace task id in trace output with task3's pipeline stage
By setting task 3 to 1/16, we would get a "slow motion" view of T3's instructions as they pass through the pipeline (including REPx blocks)
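For concreteness, here is a minimal C sketch of how the proposed %FFFTTTT operand could be decoded. The struct and field names are just illustrations of the encoding described above, not actual P2 logic:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical decode of the proposed SETTRACE D/#%FFFTTTT operand.
   Bits 3..0 (TTTT) keep their current meaning (where trace data goes),
   bits 6..4 (FFF) select the proposed filter mode. */
typedef struct {
    uint8_t dest;     /* TTTT: trace destination, as currently defined */
    int     filtered; /* 1 = filter trace output to a single task      */
    uint8_t task;     /* tt: task whose pipeline stage is reported     */
} trace_cfg_t;

static trace_cfg_t decode_settrace(uint8_t operand)
{
    trace_cfg_t cfg;
    uint8_t fff = (operand >> 4) & 0x7;

    cfg.dest     = operand & 0xF;        /* %TTTT                     */
    cfg.filtered = (fff & 0x4) != 0;     /* %1tt -> filtering enabled */
    cfg.task     = fff & 0x3;            /* tt   -> task to filter on */
    return cfg;
}

int main(void)
{
    trace_cfg_t cfg = decode_settrace(0x7F);  /* %111_1111: filter task 3 */
    printf("dest=%u filtered=%d task=%u\n", cfg.dest, cfg.filtered, cfg.task);
    return 0;
}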
What do you think?
I am sure there are a few things that would really help. This doesn't mean I don't like your idea Bill, because I do.
Ozpropdev will likely have some ideas too, with his new tracer.
Previously I suggested that it might be nice to be able to pause execution, but that may be no longer necessary with the new task/thread instructions.
BTW I don't mind using extra bits if it's optional to output more info.
or a link to CTRx as the clock source. This would allow easy generation of "scope"-like viewing of I/O pins. :):)
Also an optional transfer cancel when PTRY rolls over, to prevent data overwrite.
Speaking of which, given we can send an entire wide's worth (256 data bits) to the hub every 8 clocks, it would be rather cool if we could sample all 96 (or 128) pins in the same clock cycle and write them all to the hub in one go (i.e. no skew in sampling). A rudimentary logic analyzer application is then possible, and we can store ~252kB of data in the hub. It would be awesome if one COG could do this, though maybe 3-4 COGs could already do it (doing 1 port each) if they can be aligned to sample their port on the exact same sample clock and we ensure it is not their hub cycle when they sample. So only half of the 8 sequential hub time slots can be used to sample; the other half can be used to write the data. A REPD could allow a tight loop with a lot of samples written before we have to stop, and prevent running out of hub memory once triggered. The other 4-5 COGs are free for data display and analysis applications etc.
With the 3-4 COGs approach you could probably sample up to 100MHz.
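As a rough sanity check of those figures, a small C sketch of the arithmetic (the 200MHz clock, 128 pins and ~252kB of free hub RAM are assumptions taken from the post above):

#include <stdio.h>

int main(void)
{
    const double sysclk_hz    = 200e6;          /* assumed P2 system clock        */
    const int    hub_window   = 8;              /* clocks per hub rotation        */
    const int    sample_slots = 4;              /* half the window spent sampling */
    const int    pins         = 128;            /* 4 ports x 32 pins              */
    const double hub_bytes    = 252.0 * 1024.0; /* hub RAM available for samples  */

    double sample_rate_hz   = sysclk_hz * sample_slots / hub_window; /* 100 MHz  */
    double bytes_per_sample = pins / 8.0;                            /* 16 bytes */
    double samples          = hub_bytes / bytes_per_sample;
    double capture_us       = samples / sample_rate_hz * 1e6;

    printf("sample rate : %.0f MHz\n", sample_rate_hz / 1e6);
    printf("samples     : %.0f\n", samples);
    printf("capture time: %.1f us\n", capture_us);
    return 0;
}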
How much of a stretch is it for COGSTOP to work on the next cycle and to implement a COGPUSH that allowed one cog to dump another cog's entire state into hub RAM? Or even to dump it through the intercog exchange? I mentioned before in the "What's needed for preemptive..." thread that I feared it would take too much silicon, but maybe it's a little closer now.
***** that's five stars, as far as the rating system goes.
I have no concept what a prescaler does to the architecture, but it sounds awfully nice.
Chip has said the Counters already have capture (I hope with atomic control), and a variant on simple grab-and-save is to capture time-stamps on edges.
A small variant on this would be to allow Capture from a group of pins, and save a pin pattern with the time-stamp.
If you do want a burst save, then a REP loop should allow quite high speeds, even if not with huge depth.
Those two approaches can combine: packed samples for short bursts at high speed, and then below a moderate speed, the edge-time-tag approach has very wide dynamic range, for a more generally useful engineering capture.
In some cases at lower capture frequencies it may be possible to stream data to external SDRAM, which would increase the pin capture depth significantly. For example, if we collect 32-bit samples (one port) of capture data at 50MHz we get a 200MB/s data stream, which I think is a readily achievable bandwidth we can target for writing 16-bit wide SDRAM @ 200MHz when using large transfers. We could also try to capture a 32-bit port and a 16-bit port together (15 usable pins on the second port, given how many pins the SDRAM interface would consume). There is possibly enough SDRAM bandwidth for that too. Hub bandwidth using wides is 32 * 25MHz = 800MB/s @ 200MHz P2, so we have plenty of that. Maybe 2 SDRAM COGs would be needed for sustaining the writes to SDRAM in a ping-pong manner and keeping data flowing at the peak rate. I'm not totally sure of the final capabilities there as it depends on the new XFER engine and clock speed etc.
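Putting the quoted bandwidth figures side by side in a quick sketch (the clock rates and bus widths are the assumptions from the paragraph above):

#include <stdio.h>

int main(void)
{
    /* Figures quoted above; all clock rates are assumptions. */
    double capture   = 4.0  * 50e6;         /* 32-bit port @ 50 MHz   -> 200 MB/s        */
    double sdram     = 2.0  * 200e6;        /* 16-bit SDRAM @ 200 MHz -> 400 MB/s peak   */
    double hub_wides = 32.0 * (200e6 / 8);  /* one 32-byte wide per hub slot -> 800 MB/s */

    printf("capture stream : %.0f MB/s\n", capture   / 1e6);
    printf("SDRAM peak     : %.0f MB/s\n", sdram     / 1e6);
    printf("hub wide rate  : %.0f MB/s\n", hub_wides / 1e6);
    return 0;
}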
Love this idea. Would be even better if we could also take 8 32-bit, 4 64-bit or 2 128-bit samples, one per clock, put them in a wide and store it in the hub RAM.
The basic 3-4 COG algorithm I was thinking of for reaching 100MHz sampling on all 3 or 4 ports would go something like this below. You'd need to ensure each COG chosen writes in the odd hub cycles (or all in the even hub cycles) so they can sample at the same clock cycle, once they become synchronized to the hub.
If you only wanted to acquire a single 32 bit port at a time, you could just use two COGs (one odd and one even) and this would hit 200MHz. The collected data would be interleaved in two regions of hub memory but you can always compensate for that offline in your analysis application.
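To show what "compensate for that offline" might look like, here is a small C sketch that merges the two per-COG capture regions back into one time-ordered stream. The buffer layout (even-slot COG holds the earlier sample of each pair) is a hypothetical example, not a defined format:

#include <stdint.h>
#include <stddef.h>

/* De-interleave two capture regions written by an "even-slot" and an
   "odd-slot" COG into a single time-ordered stream. */
static void deinterleave(const uint32_t *even_buf, const uint32_t *odd_buf,
                         uint32_t *out, size_t samples_per_cog)
{
    for (size_t i = 0; i < samples_per_cog; i++) {
        out[2 * i]     = even_buf[i];  /* sample taken on the even clock */
        out[2 * i + 1] = odd_buf[i];   /* sample taken one clock later   */
    }
}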
EDIT: Maybe this new pin transfer engine Chip has recently updated can help out here and give us a boost if things can be done in parallel.
Yes, I understand most of this can be done using multiple cogs and clever programming, but having hardware read the pins at 200MHz, pack the data into wides, and store it to hub RAM using a single cog and one instruction would be fantastic. In 20/20 hindsight I should have said in multiples of 8 pins, so 8, 16, 32, 64, 128 bits. Of course the data rate for 64 and 128 bits would be limited by hub access, so multiple cogs would be needed to hit 200MHz.
A cog can already sample 16 or 32 bits of a port at the full clock frequency (200 MHz) and write the data to hub RAM. Either use the Pin-Transfer hardware or the new Auto-Read/Write-Wide instructions: it's even easier to sample at full clock speed than at 1/2 clock rate, which would need an unrolled loop.
Andy
I know this was more extensively debated regarding consuming any spare hub slots, but I don't remember Chip having anything else to say about ease of implementation.
What I think would be nice is two options...
1. A cog can donate its slot to another cog, either giving the other cog priority, or retaining priority.
Typically two cogs would be paired by software and would cooperate. They would typically be spaced 4 slots apart, i.e. cogs 0 & 4, 1 & 5, etc.
2. Any cog can set itself to utilise unused slot(s).
Typically, this would just mean that when the cog requires a hub slot, it would grab the next free (unused) slot, and then typically it would not require its next allocated slot, thereby placing it back in the unused pool.
Chip put forward a plan but to me this was more complex than necessary.
1) example
DONOR cog 0 donates its slot to DONEE cog 4
DONEE is now locked to 4-clock hub cycles
Any of the 4-cycle-apart slots that DONEE does not use, DONOR can use
(DONEE/DONOR names totally arbitrary, could be A/B whatever)
2) "GREEN RECYCLE MODE"
A cog that sets the GREEN mode can use its own slots, and any otherwise unused slots
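A behavioural C sketch of how those two modes could resolve a single hub slot. The structure and field names are just a model of the description above, not a proposal for the actual silicon:

#include <stdint.h>

#define NCOGS 8

typedef struct {
    int wants_hub;  /* cog is requesting a hub access this cycle          */
    int partner;    /* paired cog (DONOR/DONEE), -1 if unpaired           */
    int priority;   /* 1 = this cog is the DONEE (first pick of the pair) */
    int green;      /* cog opted in to GREEN recycling of free slots      */
} cog_t;

/* Return the cog granted hub access on this clock, or -1 if the slot
   goes unused.  'slot' is the cog that owns the current hub cycle. */
static int grant_hub(const cog_t cogs[NCOGS], int slot)
{
    int p = cogs[slot].partner;

    if (p >= 0) {
        /* paired: the DONEE gets first pick of both slots in the pair
           (4-clock access), the DONOR takes whatever is left over */
        int donee = cogs[slot].priority ? slot : p;
        int donor = cogs[slot].priority ? p    : slot;
        if (cogs[donee].wants_hub) return donee;
        if (cogs[donor].wants_hub) return donor;
    } else if (cogs[slot].wants_hub) {
        return slot;  /* normal deterministic 1:8 access */
    }

    /* otherwise the slot is unused: recycle it to a requesting GREEN cog
       (a real design might rotate this choice round-robin) */
    for (int c = 0; c < NCOGS; c++)
        if (cogs[c].green && cogs[c].wants_hub)
            return c;

    return -1;  /* slot goes unused */
}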
I agree that the above two scenarios are all we need right now. Chip at one point suggested that if there is more than one "GREEN" cog, green cogs could share unused hub cycles round-robin.
GREEN would provide a huge speed boost to hubexec.
1. Yes, agreed. I hadn't actually thought that the donor could use unused slots of the donee too, but it makes sense. If it's not too much additional silicon, then if neither requires the slot, it becomes available to the pool.
2. Yes, agreed.
With the wide access, I am expecting there are not going to be many cases where cogs will actually be able to use much more than 1:8 slots, except maybe in a short burst where the instruction cache gets loaded and there is also hub data access too. Most useful is the ability of a cog to not have to wait for its own slot, but instead get an earlier slot, and then not require its own slot. This will give a nice performance boost.
Should there be an option of turning all cogs "off", so to speak? … so that the machine operates as one big cog at full clock speed until it resumes normal operations… at some time or at some command?
I can see why this hasn't been sorted… consider 8 cogs with 4 tasks each… at 80 MHz we have the equivalent of 32 P1 cogs… sort of. Then we steal back hub cycles using some scheme.
If a cog is already multitasking… why bother? If a cog isn't multitasking AND only needs a half, quarter or eighth of its hub slots, then I guess it sort of makes sense.
Rich
For example, if Cog 0 writes a value to the hub, which cog will first see that value in the round-robin order?
IIRC on the P1 it is 4 slots, so if Cog 0 writes a value, Cog 4 will be the first to see it.
If we don't have any docs, maybe someone with a DE2-115 would be willing to find out through testing (or Chip could tell us).
C.W.
Exactly, all "GREEN" / "PAIRED" gets us is reduced latency... but that will be a HUGE deal to hubexec, especially as there is only one line of dcache per cog.
I expect hubexec data access to be 2-3x faster with "GREEN"
It will especially help with tasks where more than one task (cog mode or hubexec) needs hub access.
"PAIRED" would allow one cog (in a pair) to be able to deterministically count on 4 cycle hub access - a win for some applications, even if it does not increase hub bandwidth.
Also, I'd expect the default would be all cogs coming up in "HUB8" mode - ie 8 cycle deterministic, and a cog would have to execute some configuration instruction to enter DONOR/DONEE or GREEN modes.
In another thread I proposed a hub arbitration scheme that would work as follows:
- cogs start up using their own hub-slot, and must explicitly run a "HUNGRY" instruction to enable using other slots
- if a cog is not accessing the hub during its slot time the hub-slot is automatically available to other cogs
- a cog always gets guaranteed hub access during its own hub-slot
- access to unused hub slots is granted on the basis of the time since the last access, and the time till its next hub slot
The arbitration algorithm should reward cogs that haven't used the hub for a while, and are not close to their own hub slots.
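One way to express that reward rule as a simple priority function (a sketch only; the weighting of the two terms is an arbitrary choice, not part of the proposal):

/* Among cogs competing for an unused slot, favour the one that has
   waited longest since its last hub access and is furthest from its
   own guaranteed slot.  Higher value = higher priority. */
static int hungry_priority(int cycles_since_last_access,
                           int cycles_until_own_slot)
{
    return cycles_since_last_access * 8 + cycles_until_own_slot;
}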
Great data Dave, thanks. Which propgcc mode did you run the test in?
The above is the same as what I proposed; it makes sense.
Recently I started calling it "GREEN recycle mode" - some people seemed to be offended by the "HUNGRY" label for some reason.
Sounds good to me!
I think this is effectively pretty much the same as what Chip was suggesting when I was discussing this with him - he wanted to give round-robin access to unused slots, to ensure fair distribution.
I suspect that the greatest performance wins will be for heavy hub usage apps, such as VMs and multi-tasking hubexec cogs, as those usage cases are badly affected by only having a one-line dcache.
Again, thanks for getting a benchmark for this!
Fibo running with a hub stack should also speed up nicely.
A fairly simple way to arbitrate the hub would be to weight a hub request by the distance from the cog's hub-slot. The weighting might look like this: each column represents a cycle in the repeating sequence of 8 cycles, and each row represents a different cog. The weight pattern is offset by one cycle for each cog. Of all the cogs requesting the hub, the cog with the highest weight for the particular cycle will get the hub. A weight of 7 guarantees that the cog will get access during its hub slot if it needs it. The drawback of this scheme is that it doesn't factor in the length of time since the cog last used the hub.
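The weight table itself didn't make it into this copy of the post, so the values printed below are a reconstruction under the stated constraints (weight 7 in a cog's own slot, pattern offset by one cycle per cog); the exact numbers may differ from the original:

#include <stdio.h>

/* Reconstructed weight matrix: each row is a cog, each column a cycle
   in the repeating sequence of 8.  A cog scores 7 in its own slot and
   one less for every cycle after it, wrapping every 8 cycles. */
int main(void)
{
    for (int cog = 0; cog < 8; cog++) {
        for (int cycle = 0; cycle < 8; cycle++) {
            int weight = 7 - ((cycle - cog + 8) % 8);
            printf("%d ", weight);
        }
        printf("\n");
    }
    return 0;
}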
To clarify my thoughts, having a scheme where you explicitly manage the hub access by pairing up cogs or some other explicit management seems wrong to me. It sounds error-prone and will definitely make reuse difficult for any code that uses it. If we're proposing hub access slot sharing, I prefer a hub access model where every cog is hungry. The only question then becomes: how to manage code that is designed to run in lock-step with hub access, explicitly relying on the 8-cycle delay for its own timing?
I like Dave's hub access matrix listed above, it's a good, clear statement of the logic. It makes it clear what cog will have precedence at any given hub cycle.
I can see a case for paired cogs (deterministic 4-cycle access), however anything more than that becomes more complex than we need for P2, and should be left for P2.1+
In my opinion, any "prioritizing" complicates things too much. Two classes of cogs - deterministic hub access (8 cycles and perhaps 4 with pairs), and other cogs that can use early/more frequent hub access to reduce latency (green recycling cogs) is enough for now.
Edit:
"Hungry" or "Green" CANNOT affect cogs that expect to run in lock-step with the hub every eight clocks, as that is the normal & default behaviour.
Only cogs that don't need that determinism, and want a chance at recycling hub slots that would otherwise go unused in order to decrease latency for hub access would ask for "recycled" slots.
Prediction:
1) High bandwidth drivers (HD etc.) will use the fully deterministic access with RDWIDE, as that can already maximize hub bandwidth and is predictable.
2) Compiled code and virtual machines will use "GREEN" mode to run faster and partially work around the limits of a single dcache line
And...
Gets us to individually setting the hungry mode per COG, with the assumption that we know it gets X cycles minimum. You could flag all the cogs, or just one, etc...
If timing needs to be explicit, then the developer can mark it not hungry, or use other means to establish the timing if the program would fail when granted more HUB cycles than it would get under strict round robin.
Completely agreed. This discussion is a lot like the tasking one. We talked about a ton of stuff, all of which boiled down to a couple of key features in the silicon.
Just this one addition would leave the rest of COG coordination in software where it should be. And code that needs or is written to its share of cycles will always get them.