Each cog has a 34-bit bus to each smart pin for write data and acknowledgement signalling. Each smart pin ORs all incoming 34-bit buses from the cogs, in the same way DIR and OUT bits are ORed before going to the pins. Therefore, if you intend to have multiple cogs execute WRPIN / WXPIN / WYPIN / RDPIN / AKPIN instructions on the same smart pin, you must be sure that they do so at different times, in order to avoid clobbering each other's bus data. Any number of cogs can read a smart pin simultaneously without bus conflict, though, by using RQPIN ('read quiet'), since it does not use the 34-bit cog-to-smart-pin bus for acknowledgement signalling, as RDPIN does.
Each smart pin has an outgoing 33-bit bus which conveys its Z result and a special flag. RDPIN and RQPIN are used to multiplex and read these buses, so that a pin's Z result is read into D and its special flag can be read into C. C will be either a mode-related flag or the MSB of the Z result.
The first paragraph indicates RDPIN will mess with any coinciding writes ...
Because RDPIN isn't read-only. It writes to the write bus - it uses the write bus to acknowledge the read.
but the second paragraph also states there is a separate read bus for RDPIN/RQPIN.
There is one read bus per smart pin, and that one bus is used by both RDPIN and RQPIN. In terms of reading, both instructions do the same thing and are equivalent. But RDPIN additionally acknowledges the read, using the write bus. RQPIN does not acknowledge, and thus is "quiet" and the smart pin cannot even tell that some cog is reading anything.
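Roughly, in PASM2 (a sketch; the pin number and label are arbitrary):

        rdpin   value, #16 wc   ' owner cog: reads P16's Z result, flag into C, and acknowledges
        rqpin   value, #16 wc   ' any other cog: same read, but no acknowledge - the pin never notices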
That now begs the question of how the write bus works. It looks like there are 32 bits for data plus 2 bits for register select, with one of the four combinations, %00, being no action. So three register selects: Mode, X and Y. That leaves a question mark over how ACK is achieved.
@pic18f2550 said:
The calculation of the sine curve is all well and good, but it consumes more than 60 clock cycles.
Not really
The CORDIC latency only matters if you want to calculate the value once and never again. Then the latency can't be hidden. But in that case you don't care about the latency: after all, you're just computing one value, likely in some setup code.
When you are running computations where speed matters, you obviously do many of them. In a loop (whether unrolled or not) of some sort. So you should arrange the operations so that they keep CORDIC as busy as you can, and CORDIC then operates in parallel with the COG. The actual cost of doing a computation on the CORDIC is 4-6 clock cycles: 2 clocks to submit an operation, and 2-4 clocks to read the result, based on its size (2 clocks for 32 bits, 4 clocks for 64 bits).
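For instance (a minimal sketch; the operand names are made up):

        qrotate amp, phase      ' submit a rotation: 2 clocks of COG time
        getqx   sample          ' read the 32-bit result: 2 more clocks
        ' in isolation the GETQX stalls for the full pipeline latency;
        ' the 2+2 clock figure is the COG's cost once other work fills that gap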
CORDIC is pipelined, so think of every computation as a wavefront in a tube, traveling from the input (operation request) to the output (result read). There can be 54 wavefronts in the CORDIC at any given time. Each COG can launch a new wavefront every 8 clock cycles - i.e. every four 2-clock instructions. Most of the time it is rather hard to keep CORDIC 100% busy, so CORDIC being "slow" is very rarely a problem. In other words, if you actually manage to keep the CORDIC at 100% utilization in some algorithm, you'll find that speed isn't the problem, since the COG will be barely keeping up with the CORDIC. And such algorithms can usually be split easily across two COGs, with shared data in LUT memory, and you get 2x the performance right away.
In other words: if you can't stuff the CORDIC with a new operation or read request every 4 COG instructions, the problem is the COG code being inefficient, or the algorithm not being sufficiently parallelized - not the CORDIC being slow.
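The usual arrangement is to software-pipeline the loop: launch wavefront N, then collect wavefront N-1 while N is still in flight. A sketch (all labels and operands are made up):

        qrotate amp, phs        ' prime the pipe with the first wavefront
        add     phs, frq
loop    qrotate amp, phs        ' launch the next wavefront...
        add     phs, frq        ' ...and do useful work while it travels
        getqy   sample          ' collect the previous wavefront's sine result
        wypin   sample, #dacpin ' e.g. hand it to a DAC smart pin (scaling omitted)
        djnz    count, #loop
        getqy   sample          ' drain the final wavefront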
The HUB architecture is set up in such a way that each COG has independent access to resources: it is as if each COG had its own access pipe to the HUB memory and CORDIC. In other words, it doesn't matter that there's only one CORDIC, or only one HUB RAM. Each COG is guaranteed independent access to these resources, at a guaranteed bandwidth, without interference from other COGs. So, the mental view of HUB resources like memory or CORDIC is that they are "per cog", with the memory contents being shared, but not the bandwidth. The fact that there's a round-robin system of access from COGs to HUB resources is just an implementation detail. It has nil effect on how COGs perceive the HUB, other than each COG's HUB access being offset from another COG's by a number of clock cycles proportional to the distance between COGs (i.e. COG 1 and 4 are 3 clock cycles apart).
I am in favour of each COG getting its own cordic.
That would be a solid waste of silicon
First of all, CORDIC accepts a new wavefront (operation) every clock cycle. A COG with a dedicated CORDIC would only be able to issue operations every other clock cycle. So that's half of the dedicated CORDIC wasted. Then, assuming you are actually interested in the results of the computations, you need another 2-4 clock cycles to read the results. So, with dedicated CORDIC per COG, and the COG doing no work but just issuing instructions and retrieving the results, 50-75% of the CORDIC capacity would be unused.
In real algorithms, you're actually doing something more with the data than just shuffling it between the COG RAM and CORDIC, so you'd likely need to fit additional instructions between CORDIC requests/reads just to manage that. As it currently stands, with the central CORDIC, you can execute 3 COG instructions per every CORDIC instruction. Since it takes 2-3 CORDIC instructions to perform an operation, i.e. start it and read the results, you have the ability to execute 6-9 two-cycle COG instructions per every CORDIC operation.
And, it turns out that this is often not enough, i.e. if you are actually using the results for something practical, you need more COG instructions than that, and the CORDIC speed is not a limiting factor.
Your mistake, IMHO, is that you're trying to run the oscillators in a vacuum. That's just a silly micro-benchmark, with no relevance to real-world use. In a real synthesizer, you'll be taking those CORDIC results and applying envelopes, computing FM inputs to the CORDIC, mixing, filtering, etc. You'll soon find that it's extremely hard to keep the CORDIC 100% utilized - the COG will be the limiting factor.
And thus, the approach you need to take is to never optimize prematurely. First, make the entire single voice functional: implement the oscillator, FM operators, envelope, mixing, etc. - really, everything you have in mind for the final "product". Only when the functionality is there and works should you start optimizing it. You'll then see that to keep the CORDIC 100% occupied, you'll need to move some of the COG code out to another COG. With 100% CORDIC and COG utilization, you may only get 8-16 voices per COG, but those voices are complete - not merely oscillators, but the whole "kitchen sink". And that's fine: such voices will be completely utilizing the hardware resources, and then it's a simple matter to have the same code run on several COGs and provide the capacity you need. You can even let the user allocate resources: i.e. if they don't care for a fancy UI that takes two COGs, they can use a simpler single-COG UI and get 8-16 more voices.
Premature optimization can be the root of all evil, and in pipelined parallel architectures (e.g. CORDIC+COG), micro-optimizations only pay off once all the needed functionality is finished.
Now, of course, you could argue that a per-COG CORDIC would be smaller in terms of silicon, since it wouldn't need to be as fast, so it could merge many pipeline stages and operate them in short "loops" before passing the result to the next stage. But that's still a gigantic amount of silicon. P2 does what it can with the available resources, but if you need the computational capacity of an ASIC or an FPGA, P2 is not fit for the application. It seems that your application, though, is well within the abilities of P2, it just requires efficient code and maximum utilization of resources on the chip, i.e. parallel execution on multiple COGs, etc. And it requires that you implement everything you want it to do first, and scale it up (i.e. expand number of identical voices) when that's done. That approach has worked well for me, at the very least.
I also find that in cases where lots of identical parallel computations have to be done, it may well be more cost efficient to use 2 or 4 P2s in parallel than going the FPGA route - at least in small volume applications, where your time is a significant chunk of the product price.
Comments
OK now I understand.
I have adjusted my example accordingly.
I just have to test it.
I have to think about a runtime measurement.
That was like five weeks back! O_o
The runtime measurement brought it to light.
I had to reduce the number of generators from 32 to 30.
1st block: 244 clocks with fastread, 164 without fastread.
2nd block: 234 clocks.
With this I have enough reserve, up to 250 clocks.
How can I synchronize my DACs?
I know that the Y value is reloaded into the DAC after 256 clocks and the conversion starts.
I am looking for a way to make them all start converting at the same time.
The DIR signal is the control for resetting a smartpin to its initial state. So bashing the 32-bit DIRA/B registers, or the DIRL/H instructions with an "addpins" span ((N-1)<<6 added to the base pin), can also reset them in parallel. The FLT/DRV instructions do the same, respectively (FLT clears DIR, DRV sets it).
Eg: This code resets DIR for a group of four pins, together. Then it sets up the same smartpin mode in all four, together. Then it finally enables all four smartpins, together. The smartpins share the same timing, so they will stay in phase with each other.
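A minimal sketch of that sequence (the smartpin mode and X value are assumptions, using Spin2's built-in smartpin constants):

CON
  pllpins = 28 | 3<<6                   ' base pin 28 plus 3 more: P28..P31

DAT     org
        dirl    #pllpins                ' DIR low: reset all four smartpins together
        wrpin   ##(P_DAC_124R_3V | P_OE | P_DAC_DITHER_RND), #pllpins   ' same mode in all four (a DAC mode assumed)
        wxpin   #$FF, #pllpins          ' same sample period in all four (value assumed)
        dirh    #pllpins                ' DIR high: enable all four at once, in phase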
Sounds good, only which pins are they?
For now, I'm only breaking out pins 0 to 5.
Where are the pins in "pllpins = 28 | 3<<6"?
Can I also use arbitrary pins, e.g. 3, 5, 7, 12, ...?
pllpins is P[31..28]. Same as "28 addpins 3" in Spin2.
Not with ADDPINS; a bitmask can be used instead. Eg:
OR DIRA, ##%1_0000_1010_1000    ' sets the DIR bits for P3, P5, P7 and P12 in one go
Completed. The example is in #001.
Thank you.
@Ariba
I have been working with your code and have a question: I cannot find what the "_," in the line "_,smp := polxy(vol, phs2)" does. I cannot find it in the Propeller 2 Spin2 language documentation v35u.
Would you please explain.
Thanks in advance
Martin
It's in the Spin2 docs under "using methods"...
Underscore means to ignore that return value.
Useful when a method returns two values but you only need one of them...
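For instance, a minimal Spin2 snippet (names mirror the question above):

PUB demo(vol, phs2) : smp | x, y
  x, y := polxy(vol, phs2)      ' POLXY returns two values; this keeps both
  _, smp := polxy(vol, phs2)    ' same call, but the first value is discarded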
@Rayman
I will look thanks.
@Rayman
Found it, thanks. Now it makes sense.
Martin