I wonder if the torture test is faster because there is only a single cog implemented on my nano, and thus it actually ends up using "other cogs" hub cycles?
#1 makes sense - you might want to read the (long) explanation I put in the post#156 with the code that calculates the expected results for the torture test.
I added the Torture Test documentation and latest version to post#2 - this version adds a test for my out of order LMM2 loop that does not use PTRA, but uses a pc register.
Code snippet:
'nop ' needed for aligned, comment out for un-aligned test
setquaz #ins1
reps #256,#8
getcnt start
ins1 nop 'four LMM instructions from RDQUAD before last execute here
ins2 nop
ins3 nop
ins4 nop
ins0 rdquad pc 'LMM loop
ins5 add pc,#16 '(this is where the mapped QUADs actually become executable after RDQUAD)
ins6 nop
ins7 nop
Me too, I've been wondering. I'm half tempted to just attempt it anyway, tinkering with the modes until I see that. Hope it's there, because it's useful on a few levels, not just video. Bit banging is better now, but still... (fingers crossed)
Baggers, that would be maybe too much at this late stage. I wish I would have known about this earlier. You told me, but I didn't picture it this clearly back then. That would be pretty simple.
The problem is that RDLONG's take more than one clock and therefore make subsequent instructions in QUADs available for execution earlier. I don't know what to do about this, other than suggest timing provisions are made to execute the just-read QUADs and not execute overlapped QUADs, as the overlapped execution is where timing gets advanced by multi-clock instructions and causes new instructions to execute in place of old.
Read this carefully and it will make sense:
After a RDQUAD, mapped QUAD registers are accessible via D and S after three clocks:
RDQUAD hubaddress 'read a quad into the QUAD registers mapped at quad0..quad3
NOP 'do something for at least 3 clocks to allow QUADs to update
NOP
NOP
CMP quad0,quad1 'mapped QUADs are now accessible via D and S
After a RDQUAD, mapped QUAD registers are executable after three clocks and one instruction:
SETQUAD #quad0 'map QUADs to quad0..quad3
RDQUAD hubaddress 'read a quad into the QUAD registers mapped at quad0..quad3
NOP 'do something for at least 3 clocks to allow QUADs to update
NOP
NOP
NOP 'do at least 1 instruction to get QUADs into pipeline
quad0 NOP 'QUAD0..QUAD3 are now executable
quad1 NOP
quad2 NOP
quad3 NOP
After a SETQUAD, mapped QUAD registers are writable immediately, but original contents are readable via D and S after 2 instructions:
SETQUAD #quad0 'map QUADs to quad0..quad3 (new address)
NOP 'do at least two instructions to queue up QUADs
NOP
CMP quad0,quad1 'mapped QUADS are now accessible via D and S
On cog startup, the QUAD registers are cleared to 0's.
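The two RDQUAD latency rules above can be restated as a tiny model. This is only a sketch in Python, not Prop2 code, and it is a deliberate simplification: it assumes each instruction is described by nothing more than its clock cost, and the function names are made up for illustration. It does not model REPS, hub sync, or the SETQUAD case.

```python
ACCESS_CLOCKS = 3  # clocks after RDQUAD before QUADs are readable via D and S

def quads_readable(clocks_after_rdquad):
    """clocks_after_rdquad: clock cost of each instruction issued since RDQUAD."""
    return sum(clocks_after_rdquad) >= ACCESS_CLOCKS

def quads_executable(clocks_after_rdquad):
    """True once >= 3 clocks have elapsed AND at least one more instruction ran."""
    elapsed = 0
    for i, clocks in enumerate(clocks_after_rdquad):
        elapsed += clocks
        if elapsed >= ACCESS_CLOCKS:
            # instructions issued after the point where the QUADs updated
            return len(clocks_after_rdquad) - (i + 1) >= 1
    return False

print(quads_readable([1, 1, 1]))       # three NOPs
print(quads_executable([1, 1, 1]))     # three NOPs only
print(quads_executable([1, 1, 1, 1]))  # three NOPs plus one more instruction
```

Under this model, three single-clock NOPs are enough to read the QUADs through D and S, but a fourth instruction is needed before the QUADs themselves can execute, matching the two listings above.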
The only way overlapped QUADs can be executed is if the instructions within them are all single-cycle. Otherwise, the overlapping gets mixed up, affecting the 4th and maybe even 3rd instruction.
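As a scheduling aid, the suggestion to keep a multi-clock instruction in the 4th slot can be written as a simple check. This is a Python sketch with hypothetical names; each instruction is reduced to its clock cost, and the cost used for RDLONG below is a placeholder, not a measured figure.

```python
def follows_slot4_rule(quad_clocks):
    """quad_clocks: clock cost of the four instructions in one QUAD.

    Encodes the suggestion from the thread: three single-clock
    instructions first, with any multi-clock instruction (e.g. RDLONG)
    only in the 4th (last) slot.
    """
    if len(quad_clocks) != 4:
        raise ValueError("a QUAD holds exactly four instructions")
    return all(clocks == 1 for clocks in quad_clocks[:3])

RDLONG_CLOCKS = 8  # placeholder cost, for illustration only

print(follows_slot4_rule([1, 1, 1, RDLONG_CLOCKS]))  # RDLONG last
print(follows_slot4_rule([RDLONG_CLOCKS, 1, 1, 1]))  # RDLONG first
```

A check like this could be run over every group of four instructions in a test pattern before loading it into the hub.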
Comments
NANO reprogrammed
> Function correctly
Thanks
I will try now some other features of the Prop2...
Andy
I go to dinner, and to sleep, and look how much happens....
I will head down to the lab shortly, update my Nano, run tests, and catch up on the posts :-)
Here is my massive "Catch Up" post:
jmg,David:
I think Chip wants something *REALLY* simple in hardware; I don't believe he has the time / logic budget for anything complicated. If he did, I think all of us would inundate him with additional suggestions.
Chip:
First, thanks for SETQUADZ and everything!
Right after this message I'll program your latest Nano image & update to the latest PNut; then I will torture-test RDQUADLONG.
Cluso99:
I like the movx (condition code)
For byte packing/unpacking Chip has a MOVF that he will document (can't wait)
potatohead:
Yep, testing RDQUAD was "interesting", and non-cog aligned RDQUAD will be nice.
re/ compressed: take a look at what I proposed; I was wary of any more complexity due to implementation delay, logic budget, and Chip's time
Sapieha:
I think movf will do this... (I hope!)
Cluso99:
Of course I don't mind you posting! An efficient loader will be needed for FCACHE
I will try your code... looks good at a quick glance.
Baggers:
You *REALLY* should get a Nano...
Chip:
Downloading new .zip ... thanks!
Andy:
Looks good!
I will torture your version too
Chip's sample:
not what we would like to see :-(
Andy's main loop, modified for my test framework:
not what we would like to see :-(
My latest main loop, quad aligned:
IT WORKS!!!
My latest main loop, NOT quad aligned:
THANKS CHIP - WORKS PERFECTLY!!!
No more worries about alignment...
I am attaching the torture test to this message, it has all three different LMM2 loops in it - just uncomment the one you want to run.
THE TORTURE TEST EXPLAINED
Currently, the torture test consists of the following eight instructions, repeated 256 times in the hub (2k instructions, 8k hub used)
Due to how RDQUAD works, the LMM2 loop will alternately execute i1-i4, then i5-i8, ensuring that RDQUAD reads a different group of four instructions on every loop iteration
i1-i4 are simple adds to counters, which are later deposited into result1-result4
i5-i7 increment result6 in the hub
i8 does an extra increment of result1
The LMM2 loop's RDQUAD is executed 256 times (so we get nice, easy-to-read cycle counts in hex); however, the last four instructions fetched are not executed, and the first four are executed as NOPs because nothing has been fetched yet.
The first iteration executes four NOP's on the first pass (thanks to SETQUAZ)
This means that i1-i4 gets executed 128 times, and i5-i8 gets executed 127 times
Therefore the mathematically correct expected results (in hex) are:
result 1: $0000FE ($7F+$7F)
result 2: $000080
result 3: $000080
result 4: $000080
result 5: $000FF7 *** approximate
result 6: $00007F
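The 128/127 split can be double-checked with a small model. This is only a sketch in Python (not Prop2 code). It assumes what the post describes: 256 RDQUADs, with each fetched quad executing on the following loop pass, so the first pass runs the cleared QUADs and the last fetched quad never runs. The function name is made up for illustration.

```python
def count_quad_executions(iterations=256):
    """Count how often each half of the 8-instruction pattern executes.

    Assumed model: pass n executes the quad fetched on pass n-1, so pass 0
    runs the cleared QUADs (NOPs) and the quad fetched on the final pass
    is never executed.
    """
    first_half = 0   # quads holding i1-i4 (even-numbered fetches)
    second_half = 0  # quads holding i5-i8 (odd-numbered fetches)
    for n in range(1, iterations):  # pass 0 only executes NOPs
        fetched_on = n - 1          # quad executed during pass n
        if fetched_on % 2 == 0:
            first_half += 1
        else:
            second_half += 1
    return first_half, second_half

a, b = count_quad_executions()
print(f"i1-i4: {a} (${a:02X}), i5-i8: {b} (${b:02X})")
```

With the default of 256 iterations this prints `i1-i4: 128 ($80), i5-i8: 127 ($7F)`, which agrees with result 2 through result 4 and result 6 in the list above.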
256*8 cycles for the RDQUADs ($800 cycles), 127*8 for the RDLONGs ($3F8 cycles), and 127*8 for the WRLONGs ($3F8 cycles), since after the first hub access all will hit hub sync sweet spots
$800+$3F8+$3F8 = $FF0 cycles
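The cycle estimate is easy to recompute. A minimal Python sketch, assuming (as the post does) that every hub access lands on an 8-clock hub window and ignoring any loop setup overhead; the names are hypothetical:

```python
HUB_WINDOW = 8  # clocks per hub access slot, per the post

def expected_cycles(rdquads=256, rdlongs=127, wrlongs=127):
    """Return (rdquad, rdlong, wrlong, total) clock estimates."""
    q = rdquads * HUB_WINDOW
    r = rdlongs * HUB_WINDOW
    w = wrlongs * HUB_WINDOW
    return q, r, w, q + r + w

q, r, w, total = expected_cycles()
print(f"${q:X} + ${r:X} + ${w:X} = ${total:X}")
```

This prints `$800 + $3F8 + $3F8 = $FF0`, a few clocks under the approximate figure given for result 5.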
Sorry about the incorrect calculation in an earlier edit of this message - the timing works as expected!
p.s.
Other instructions can be tried in i1-i8, but then predicted results should be calculated for them.
The attached file is basically a framework for testing LMM2 and native instructions.
I fixed my initial incorrect calculation, the cycle count was correct as well.
Could someone try the torture test using my loops on a DE2-115 and post the results?
Meanwhile, I found an error in my initial calculation for expected cycle count - and the result I am getting is correct.
#1 --> "my out of order LMM2 inner loop"
=== Propeller II Monitor ===
>n2000.201f
02000- 000000FE 00000080 00000080 00000080 '................'
02010- 00000FF6 0000007F 00000000 00000000 '................'
>
#2 --> "Chip's easy to read LMM2 loop"
=== Propeller II Monitor ===
>n2000.201f
02000- 00000080 00000080 000000FF 000000FF '................'
02010- 00000FF3 00000000 00000000 00000000 '................'
>
#3 --> "Andy's non-PTRA LMM2 loop with jumpd, modified for 256 iterations"
=== Propeller II Monitor ===
>N2000.201f
02000- 00000080 00000080 000000FF 000000FF '................'
02010- 00000FF3 00000000 00000000 00000000 '................'
>
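Comparing these dumps against the expected results by eye gets tedious once more test variants appear. Below is a minimal Python sketch that parses the monitor output shown above and reports which results differ. The expected values are the ones listed earlier in this post; result 5 (the cycle count) is skipped because it is only approximate. The helper names are made up for illustration.

```python
import re

EXPECTED = [0xFE, 0x80, 0x80, 0x80, None, 0x7F]  # None: cycle count, approximate

def parse_dump(text):
    """Extract the 32-bit longs from 'addr- xxxxxxxx ...' monitor lines."""
    longs = []
    for line in text.splitlines():
        m = re.match(r"\s*[0-9A-Fa-f]+-\s+((?:[0-9A-Fa-f]{8}\s+)+)", line)
        if m:
            longs.extend(int(tok, 16) for tok in m.group(1).split())
    return longs

def check(longs, expected=EXPECTED):
    """Return the 1-based result numbers that do not match."""
    return [i + 1 for i, want in enumerate(expected)
            if want is not None and longs[i] != want]

dump1 = """02000- 000000FE 00000080 00000080 00000080 '................'
02010- 00000FF6 0000007F 00000000 00000000 '................'"""
dump2 = """02000- 00000080 00000080 000000FF 000000FF '................'
02010- 00000FF3 00000000 00000000 00000000 '................'"""

print(check(parse_dump(dump1)))  # loop #1
print(check(parse_dump(dump2)))  # loops #2 and #3
```

For the dumps above this prints `[]` for loop #1 and `[1, 3, 4, 6]` for loops #2 and #3.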
Unfortunately Chip's and Andy's loops have pipeline issues, whereas my out of order loop gets the right result.
I am planning to come up with other torture tests over time - we really need to get LMM2 right.
Well, we've now got a nice test framework, and a template to calculate expected results. Time to try different sets of instructions, among other things. If needed, I can run things this evening. The pipeline is going to be interesting to work with...
The more verification we can get done, the more likely the test shuttle Chip's (pun intended) will just work!
I am going to concentrate on making sure LMM2 is bullet proof, and appreciate additional torture testing on that.
It would be very useful if we could get every instruction verified in both LMM2 and native cog environment... but we'd need many volunteers for that. I would prefer that LMM2 test results be reported in this thread.
Ideally people would post a list of instructions they are working on, and the verified/not status.
(I am certain that Chip already has great test vectors for his Verilog files, but extra verification never hurts)
Code snippet:
Test results:
The results are correct
I really hope the non-DAC parallel mode output for VGA is still possible, I need it for LCDs; otherwise I'll have to bit (byte? word? long?) bang it
waitvid classic mode
There seem to be issues with hub access instructions inside LMM code. Unfortunately your code also does not work right; you get the (nearly) right results only by chance. Try the following:
1) initialize lc with #$180, for example, before the LMM loop - you will get $1FF as result6, so lc is just incremented and written to hub, but never loaded by rdlong.
2) make i5 a NOP instead of rdlong,result6 - and all 3 versions show the same results.
What happens is that your "out of order loop" is so out of order that the first instruction is not ready to execute after the previous rdquad.
I spent half an hour trying to find instruction slots that always work, but it looks like the rdlong destroys something in the pipeline, and the working slots change for every code variant I tried.
With normal 1-cycle instructions I don't see these problems.
I think Chip needs to look again in his Verilog code....
Andy
No worries, you can save it for P3
I will try your modifications, and no doubt get the same results, and I am sure Chip will look over the Verilog code again - all of us want this to work
Releasing the Terasic loadables was a brilliant move - it is letting us find bugs before the shuttle run.
I added:
testvalue long $180
at the bottom of the file, and
wrlong testvalue,result6
before the loop, and I verified that at the end of the run, result6 was $7F
Thinking about it, I think that only RDLONG has an issue, otherwise we would not see $7F... and the single cycle adds are all executing correctly.
Update:
Thinking some more about it, it pretty much has to be a difference in how the Verilog handles executing RDLONG from a "real cog register" and a "RDQUAD cache mapped to registers"
Reason I think this is the case is that two days ago the RDLONG tests worked fine - but I am going to go and re-test that right now...
The problem pretty much has to be the difference between executing RDLONG from regular registers, or from the RDQUAD cache buffer.
I initialized result6 to $180, then I executed the following code:
in the torture test framework, and got the following result:
Note that result six is $180+$7F, and all the other results are correct as well.
The problem is that RDLONG's take more than one clock and therefore make subsequent instructions in QUADs available for execution earlier. I don't know what to do about this, other than suggest timing provisions are made to execute the just-read QUADs and not execute overlapped QUADs, as the overlapped execution is where timing gets advanced by multi-clock instructions and causes new instructions to execute in place of old.
Read this carefully and it will make sense:
Question:
RDQUAD fully exposes the 7 cycles after it (in an 8 cycle hub window)
Is it possible for the RDBYTE/WORD/LONG (non cache version) to do the same? it would make it simpler to schedule.
I'll go try to make a new torture test to show RDLONG does work if scheduled right.
The current thread expected RDLONG to pause the pipe until synced to the hub like in non-quad code, which makes writing code for it easier.
I can't change the timing of RDxxxx. That's kind of set, at this point.
I think one thing you could do would be to put the RDLONG at the 4th (last) location, and have three single-cycle instructions in front of it. This would inhibit the problem.
Fills registers that are hidden by QUAD with same data as QUAD ?
Your explanation let me schedule the code properly for the kernel.
Re-written, scheduled LMM2 code:
Test results - which make sense:
Basically, it was NOT a Verilog issue, just improper scheduling of code to be run through the LMM2 kernel.
No, it has to do with when new QUAD values enter the pipeline, based on how many clocks have elapsed since the last RDQUAD.