Of those 54 stages, how many are recursive in nature?
None. I made it as short as I could. There used to be only 38 stages, but there was no time to do the K-factor compensation within the normal stages. So, I had to make 16 discrete stages, littered among the iteration stages, just to do subtractions to keep the scale at 1.00000000.
The great thing about those periodic subtraction stages is that they keep overflow totally in check. You can rotate ($7FFF_FFFF,$0000_0000) by $8000_0000 and get ($8000_0001,$0000_0000). In most CORDIC implementations, the K-factor compensation is done at the end of the computation and you need a few guard MSBs to contain the result, then an over-sized multiplier to scale the result down. Much easier to tap it down here and there, along the way, so that it comes out perfect at the end.
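As a rough illustration of tapping the gain down along the way, here is a minimal floating-point sketch in Python. It is not the P2's actual stage design: the greedy choice of shifts in `trim_shifts`, the placement of one subtraction after every second iteration, and all names are assumptions made for the example. It also makes no attempt to reproduce the overflow behaviour described above.

```python
import math

ITERATIONS = 32
# Total CORDIC gain after ITERATIONS circular micro-rotations (about 1.6468).
GAIN = math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(ITERATIONS))
ATAN = [math.atan(2.0 ** -i) for i in range(ITERATIONS)]


def trim_shifts(target, count):
    """Greedily pick shifts s so that prod(1 - 2**-s) approaches target from above.

    Hypothetical decomposition, chosen only for this sketch.
    """
    shifts, product, s = [], 1.0, 1
    while len(shifts) < count and s < 60:
        if product * (1.0 - 2.0 ** -s) >= target:
            shifts.append(s)
            product *= 1.0 - 2.0 ** -s
        else:
            s += 1
    return shifts


# Sixteen shift-and-subtract trims that together approximate 1/GAIN.
TRIMS = trim_shifts(1.0 / GAIN, 16)


def rotate(x, y, angle):
    """Rotate (x, y) by angle in radians (|angle| up to about 1.74)."""
    z = angle
    for i in range(ITERATIONS):
        d = 1.0 if z >= 0.0 else -1.0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ATAN[i]
        if i % 2 == 1:
            # One subtraction stage after every second iteration:
            # v -= v >> s in hardware terms.
            s = TRIMS[i // 2]
            x -= x * 2.0 ** -s
            y -= y * 2.0 ** -s
    return x, y
```

With these assumptions, `rotate(1.0, 0.0, math.pi / 2)` comes out close to (0, 1) with no multiplier at the end, because the sixteen small subtractions have already removed the gain.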
Recursive in nature, as in: could they be rolled back up into a loop if dedicated to one command only?
EDIT: You have kind of already answered this by saying previously that it could be done with just two barrel shifters.
Okay. I see what you are asking about now. Thirty-two of those pipelined stages are iterative and would otherwise have been implemented by two barrel shifters and two adders.
I am not sure I understand this. It appears that if you use CORDIC, you cannot use interrupts. That means that one OBEX can break another? If so, P2 tools need a built in lint.
John Abshier
John, you can use interrupts with CORDIC. You just can't interleave CORDIC operations to get really high throughput. And this all happens within one COG. It wouldn't impact OBEX programs.
Chip,
From that info I've built the following cordic execution map for a single cog, of a 16-cog prop2, feeding the cordic at full speed. Let me know if anything is in the wrong place.
1 x,y,z mag 49 CORDIC 33 sub 17 sub
2 x,y,z sft 50 CORDIC 34 CORDIC 18 CORDIC
3 CORDIC 51 sub 35 CORDIC 19 CORDIC
4 CORDIC 52 hyperbolic 36 sub 20 sub
5 sub 53 shift and round 37 CORDIC 21 CORDIC
6 CORDIC 54 final x,y 38 CORDIC 22 CORDIC
7 CORDIC 55 39 sub 23 sub
8 sub 56 40 CORDIC 24 CORDIC
9 CORDIC 57 41 CORDIC 25 CORDIC
10 CORDIC 58 42 sub 26 sub
11 sub 59 43 CORDIC 27 hyperbolic
12 CORDIC 60 44 CORDIC 28 CORDIC
13 CORDIC 61 45 sub 29 CORDIC
14 sub 62 46 CORDIC 30 sub
15 CORDIC 63 47 CORDIC 31 CORDIC
16 CORDIC 64 48 sub 32 CORDIC
17 sub 1 x,y,z mag 49 CORDIC 33 sub
18 CORDIC 2 x,y,z sft 50 CORDIC 34 CORDIC
19 CORDIC 3 CORDIC 51 sub 35 CORDIC
20 sub 4 CORDIC 52 hyperbolic 36 sub
21 CORDIC 5 sub 53 shift and round 37 CORDIC
22 CORDIC 6 CORDIC 54 final x,y 38 CORDIC
23 sub 7 CORDIC 55 39 sub
24 CORDIC 8 sub 56 40 CORDIC
25 CORDIC 9 CORDIC 57 41 CORDIC
26 sub 10 CORDIC 58 42 sub
27 hyperbolic 11 sub 59 43 CORDIC
28 CORDIC 12 CORDIC 60 44 CORDIC
29 CORDIC 13 CORDIC 61 45 sub
30 sub 14 sub 62 46 CORDIC
31 CORDIC 15 CORDIC 63 47 CORDIC
32 CORDIC 16 CORDIC 64 48 sub
33 sub 17 sub 1 x,y,z mag 49 CORDIC
34 CORDIC 18 CORDIC 2 x,y,z sft 50 CORDIC
35 CORDIC 19 CORDIC 3 CORDIC 51 sub
36 sub 20 sub 4 CORDIC 52 hyperbolic
37 CORDIC 21 CORDIC 5 sub 53 shift and round
38 CORDIC 22 CORDIC 6 CORDIC 54 final x,y selection/adding
39 sub 23 sub 7 CORDIC 55
40 CORDIC 24 CORDIC 8 sub 56
41 CORDIC 25 CORDIC 9 CORDIC 57
42 sub 26 sub 10 CORDIC 58
43 CORDIC 27 hyperbolic 11 sub 59
44 CORDIC 28 CORDIC 12 CORDIC 60
45 sub 29 CORDIC 13 CORDIC 61
46 CORDIC 30 sub 14 sub 62
47 CORDIC 31 CORDIC 15 CORDIC 63
48 sub 32 CORDIC 16 CORDIC 64
49 CORDIC 33 sub 17 sub 1 magnitude determination of inputs
50 CORDIC 34 CORDIC 18 CORDIC 2 initial x,y,z shift
51 sub 35 CORDIC 19 CORDIC 3 CORDIC
52 hyperbolic 36 sub 20 sub 4 CORDIC
53 shift and round 37 CORDIC 21 CORDIC 5 sub
54 final x,y 38 CORDIC 22 CORDIC 6 CORDIC
55 39 sub 23 sub 7 CORDIC
56 40 CORDIC 24 CORDIC 8 sub
57 41 CORDIC 25 CORDIC 9 CORDIC
58 42 sub 26 sub 10 CORDIC
59 43 CORDIC 27 hyperbolic 11 sub
60 44 CORDIC 28 CORDIC 12 CORDIC
61 45 sub 29 CORDIC 13 CORDIC
62 46 CORDIC 30 sub 14 sub
63 47 CORDIC 31 CORDIC 15 CORDIC
64 48 sub 32 CORDIC 16 CORDIC
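The map above can be regenerated with a few lines of Python, which may make the staggering easier to see. This is only a sketch: the 16-clock slot and 64-clock window are read off the map itself, the order of the middle stages simply follows the map (CORDIC, CORDIC, sub, with hyperbolic stages at 27 and 52), and the function names are made up for the example.

```python
STAGES = 54   # pipeline stages, per the thread
SLOT = 16     # clocks between one cog's CORDIC slots on a 16-cog chip
WINDOW = 64   # clocks covered by the map (four slots)


def stage_name(n):
    """Label for pipeline stage n, following the map's ordering."""
    if n > STAGES:
        return ""                      # idle clocks 55..64
    if n == 1:
        return "x,y,z mag"
    if n == 2:
        return "x,y,z sft"
    if n in (27, 52):
        return "hyperbolic"
    if n == 53:
        return "shift and round"
    if n == 54:
        return "final x,y"
    # Iteration region: CORDIC, CORDIC, sub; the pattern restarts after stage 27.
    k = (n - 3) % 3 if n < 27 else (n - 28) % 3
    return "sub" if k == 2 else "CORDIC"


def map_row(clock):
    """Stages occupied at a given clock (1..64) by the four commands in flight."""
    row = []
    for cmd in range(WINDOW // SLOT):
        stage = (clock - 1 - SLOT * cmd) % WINDOW + 1
        row.append((stage, stage_name(stage)))
    return row


def print_map():
    for clock in range(1, WINDOW + 1):
        print("  ".join(f"{n:2d} {name:<15}" for n, name in map_row(clock)))


if __name__ == "__main__":
    print_map()
```

Each column is the same command sequence shifted by one 16-clock slot, so at any clock one cog has up to four operations at different depths in the pipeline.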
If you want CORDIC throughput, batch up your operations in special timed code. Once the first CORDIC command executes, your timing will be locked in. No getting off that crazy train. Once you are on, you are committed. No interruptions allowed. You will always come out the other end safely, with all your results. It is GLORIOUS!!!!
If no interrupts are allowed, then the CORDIC should disable interrupts in hardware. However, that is the exact inverse of how users expect interrupts to work (and indeed why they are named interrupts!).
Can the cordic instead be paused for that cog, if an interrupt does occur ?
That’s the more expected operation.
If you do a single CORDIC instruction and then get the results, interrupts are fine. If you want high throughput by interleaving CORDIC operations, then interrupts are not fine.
Jmg, if we pause the thing, its deterministic nature will be impacted, which will affect any other cogs using it. It's up to each user of the CORDIC to meet the timing.
The way it is right now, any cog can do whatever it wants and not affect any other cog.
I think people are getting hung up on a couple of things:
Interrupts are not Global to the P2, only the Cog in which they happen. This means programs in the object exchange will not break one another, because they're all running in different cogs.
The other thing that was done, which is different from P1, is we definitely put facilities in for non-deterministic programs.
So people got to choose on this. And the reward for making that choice, is a lot of Fast Math. It's a killer feature.
They either meet timing, write their programs in a way that does that, or they're interrupt driven, and they write their programs in a way that deals with that.
There's no protecting anyone on this, without a big logic cost, or breaking the symmetry of this thing and with that limiting its throughput.
The CORDIC is super simple, input arguments, hit the timing to get results. That's it. People just have to do that. And they only really have to do that, if they're doing a whole lot of math. And it needs to be fast.
Well, you could have the ISR routine set a flag that tells code that cordic results may be invalid and make it do it again, right?
I think there is also an underflow event that was mentioned, set when this corruption/loss occurs?
(Reading a non-existent answer)
That could trigger a re-do, and alert the user they have an issue?
1 - magnitude determination of inputs
2 - initial x,y,z shift
3..52 - 32 iteration stages punctuated by 16 subtraction stages and 2 extra hyperbolic stages
53 - post-iteration shift and round
54 - final x,y selection/adding
I can see now that what I wanted to do as a partial pipeline doesn't pack well. It would still need the fully unrolled cordic for larger cog counts ... and the resource saving probably wouldn't be as good as I had hoped.
What if you added a mode in which it automatically drops CORDIC results in ascending LUTRAM addresses? You would submit CORDIC commands as fast as you could, and they'd show up in LUTRAM eventually. This would be convenient for FFTs: you could do the smaller sub-transforms out of LUTRAM and have the results automagically show up in the right places in LUTRAM. If an interrupt happened while you were submitting CORDIC commands, the results would still go to the right place. The CORDIC underflow event would let you know when they were all done.
EDIT: The write would happen at the same time when a write from the neighboring cog would take place - one would win if they both tried to write at the same time. For simplicity, I guess you'd have it so getqx and getqy would still do the right things, although I can't imagine why you'd want to do both.
I like it. Good from cog view. Not sure how easy it will be for cordic to reach into every cog like that though. Currently the cogs are all reaching out.
Sounds like too big a change for the current design, and the 'knobs' that Chip has detailed for scaling of the current design do not include this type of change either.
It doesn't have to change anything that's already there; getq[xy] would work just as they do now; there would just be a second way to get the results that is activated when you run an instruction to set it up with a start pointer. If it's added and it does cause any problems, it can be ignored until the next design. The only problem it could cause that would break current functionality is if the muxing of the LUT write port is buggy.
I agree that the only way to automate high CORDIC throughput would be to have it write directly into the LUT. That's probably more change than we have safe margin for, at this point.
I was pleased to confirm yesterday that in an 8-cog setup, any mix of CORDIC commands can be initiated, overlapped, and trailing results received at a pace of 8 clocks per function. That's pretty decent and not hard to manage. You just need to get the concept clear in your head.
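To get the concept clear, it can help to write the overlapped schedule down. The sketch below assumes an 8-clock slot (the 8-cog setup) and takes the latency to be the 54 pipeline stages; the latency a cog actually sees may differ, and the function names are invented for the example.

```python
import math


def schedule(commands, slot=8, latency=54):
    """Return (issue_clock, result_clock) pairs for back-to-back commands."""
    return [(i * slot, i * slot + latency) for i in range(commands)]


def in_flight_before_first_result(slot=8, latency=54):
    """How many commands can be issued before the first result appears."""
    return math.ceil(latency / slot)


def batch_clocks(commands, slot=8, latency=54):
    """Clocks from the first issue to the last result when fully overlapped."""
    return latency + (commands - 1) * slot
```

Under these assumptions, ten overlapped operations finish in 126 clocks, against at least 540 if each one waited for the previous result before starting.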
Evanh, that looks correct, neverminding the exact order of the middle stages. You could treat all 54 stages as black boxes for the purpose of helping a programmer understand.
EDIT: But there is an event (QMT) for when the last GETQx got nothing. This will probably trigger on an attempt to re-retrieve the final result.
Is using that event trap going to be a reliable 'lost cordic value' flag ? The 'probably' sounds a little unsure ?
That "probably" wasn't about lost data. I was unsure of the exact condition that could trigger a QMT event at all. The thing is, a result that was produced a million clocks prior will still be there to be collected.
What must happen is that GETQx flags that it has done the collection, so the buffer becomes empty. Attempting another result fetch will then either wait for an upcoming result or, if there is no more data to come, not wait but trigger the QMT event.
QX and QY will each have an empty flag. Either can trigger the QMT event upon GETQx while empty and inactive.
So Chip is now asking us if we want another condition combined into the same QMT event; again, both QX and QY can trigger it. It detects a new result arriving at the result buffer while the buffer is not empty, i.e. the prior result was overwritten.
PS: A small detail: the buffer empty flags are forced set whenever a solitary command is issued, i.e. the first command of a batch.
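The buffer behaviour described here can be sketched as a small Python model. It is only a toy under stated assumptions: a single result buffer stands in for QX and QY, the 54-clock latency is illustrative, the overrun flag models the proposed extra condition rather than anything confirmed, and every class and attribute name is hypothetical.

```python
class CordicResultModel:
    """Toy model of one cog's CORDIC result buffer (not the real hardware)."""

    LATENCY = 54  # assumed clocks from command to result

    def __init__(self):
        self.clock = 0
        self.in_flight = []    # (ready_clock, value), oldest first
        self.buffer = None     # last delivered result
        self.empty = True      # buffer-empty flag
        self.qmt = False       # GETQx while empty and nothing pending
        self.overrun = False   # proposed: result overwritten before collection

    def issue(self, value):
        if not self.in_flight:
            self.empty = True  # solitary command forces the empty flag
        self.in_flight.append((self.clock + self.LATENCY, value))

    def advance(self, clocks):
        for _ in range(clocks):
            self.clock += 1
            while self.in_flight and self.in_flight[0][0] <= self.clock:
                _, value = self.in_flight.pop(0)
                if not self.empty:
                    self.overrun = True   # prior result was never collected
                self.buffer = value
                self.empty = False

    def getq(self):
        if self.empty:
            if self.in_flight:
                # Wait for the next result to arrive.
                self.advance(self.in_flight[0][0] - self.clock)
            else:
                # Nothing to come: do not wait, flag QMT, return the old value.
                self.qmt = True
                return self.buffer
        self.empty = True
        return self.buffer
```

In this model a second `getq()` after a single command returns the same value again and sets `qmt`, while a result that arrives before the previous one was collected sets `overrun`.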
I'm done with this.
Maybe the event should capture both:
a) Result overwritten with new result because GETQX/GETQY didn't execute in time.
b) GETQX/GETQY executed without prior CORDIC instruction.
So the event flag would mean "CORDIC result not valid"?
I've worked out enough now to be sure it would need two designs, depending on cog count. So have thrown in the towel.
I've just checked it: "b)" means GETQx has returned immediately with the same result as before and there's nothing new to come.
Example:
shortlen will be correctly length/10. But that will also trigger a QMT event.
Here's an example of using the QMT event as it is right now, (b) only:
So, do you think it would be better to trap overrun, too?
I do not follow the depths of the Cordic queue, but yes, to me it makes sense to also have
a) Result overwritten with new result because GETQX/GETQY didn't execute in time.
because (I think) that gives you an earlier warning, and that makes both recovery and debug easier.