4 Cog overclocked P1V
rjo__
Posts: 2,114
Before my first attempt (help me Jesus) at a 4-cog P1V.
The goal is to improve bandwidth by increasing the main clock and by reducing the hub cycle.
The question here isn't if I am missing something, but what I am missing:)
I have found just two places that need to be modified. In dig.v, in "generate" I need to change
"for (i=0; i<8; i++)" to "for (i=0; i<4; i++)"
Below that at line 129, I need to change
"wire [7:0] cog_ena;" to "wire [3:0] cog_ena;"
I know how to change the clock; my question is how best to limit the hub cycle to 4 cogs.
Obviously, it can't be this simple:)
Thanks,
Rich
Comments
These lines in dig.v implement a one-hot shift register that selects one of eight cogs to connect to the hub.
In a 4-cog design you can reduce the length of the shift sequence, for example to 6 positions, so the hub rotates faster.
I found that further reducing the run length gives strange results. Reducing it to 4 positions results in only 2 cogs running, but that is OK for a 2-cog design.
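For anyone following along, the rotation described above can be modelled outside Verilog. This is a minimal Python sketch, not the dig.v code: it assumes the selector is a plain one-hot ring that shifts one position per hub slot and starts at slot 0, and the function names are made up for illustration. It only shows that a shorter ring comes back to the same slot sooner; it says nothing about why fewer cogs end up running with the shorter rings.

```python
def rotate_one_hot(value, width):
    """Rotate a one-hot value left by one position within `width` bits."""
    mask = (1 << width) - 1
    return ((value << 1) | (value >> (width - 1))) & mask


def slot_sequence(width, steps):
    """Return the selected slot index for `steps` successive hub slots."""
    value = 1  # assumption: slot 0 is selected first
    sequence = []
    for _ in range(steps):
        sequence.append(value.bit_length() - 1)  # index of the single set bit
        value = rotate_one_hot(value, width)
    return sequence


def rotation_period(width):
    """Count the steps until the ring returns to its starting state."""
    value = rotate_one_hot(1, width)
    steps = 1
    while value != 1:
        value = rotate_one_hot(value, width)
        steps += 1
    return steps


if __name__ == "__main__":
    for width in (8, 6, 4):
        print(width, "positions:", slot_sequence(width, width + 2),
              "period", rotation_period(width))
```

With 8 positions a given slot comes around every 8 steps, with 6 positions every 6 steps, and with 4 positions every 4 steps.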
This may not be a practical idea longer term, but it may be worth trying to make a 1 COG P1V first just to understand the code.
Then try making a 4 COG P1V.
Then try making a 4 COG P1V that could share hub in some alternative way with the 1 COG P1V. ...
For example, the one COG P1V might use the even 4 of 8 HUB slots, and the 4 COG P1V could only use the odd 4 HUB slots.
Wishing I had more time for this stuff .... Maybe soon.
That's right. The other thing to address is the COGID instruction (in hub.v), as it will return a wrong 3-bit cog# most of the time.
changes mentioned in my first post:
In dig.v, in "generate" change
"for (i=0; i<8; i++)" to "for (i=0; i<4; i++)"
and following the leads above
in dig.v
and in hub.v
test code run in PropellerIDE
Nothing breaks but I get exactly the same timing on both the nano_P1V and a P1.
"hubslots" ... where are those hubslots?
Thanks
Rich
This code generates a 9-bit value in which a single "1" bit cycles over 5 positions, i.e. 5 hub timeslots or 10 CPU clocks. Using bus_sel[2:0] would result in 4 hub timeslots.
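The arithmetic behind those numbers, as a small Python sketch. The only input taken from the thread is the ratio above (5 timeslots take 10 CPU clocks, so 2 clocks per slot); the function name is made up, and this is a back-of-envelope model rather than a measurement.

```python
CLOCKS_PER_SLOT = 2  # from the post above: 5 timeslots = 10 CPU clocks


def clocks_per_rotation(slots):
    """CPU clocks for the hub selector to visit every timeslot once."""
    return slots * CLOCKS_PER_SLOT


if __name__ == "__main__":
    for slots in (8, 6, 5, 4):
        print(slots, "timeslots ->", clocks_per_rotation(slots), "clocks per rotation")
```

So going from 8 timeslots to 4 should halve the interval between a cog's hub windows, from 16 clocks to 8, if the rotation really is the only thing limiting hub access.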
I have done some experiments with fewer hub timeslots; look in this thread: http://forums.parallax.com/showthread.php/156955-Small-V-Prop-2-Cog-s-4-KB-ROM-4KB-Hub-RAM In the last post I posted two oscilloscope screens that show the speedup with 4 slots / 2 cogs for a series of sequential RDLONGs compared to 8 slots / 8 cogs.
Warning: more Verilog changes are probably necessary to change the number of timeslots without breaking some logic. With 6 timeslots I can have only 4 cogs running, with 4 timeslots only 2 cogs.
So you were lucky with your 5 timeslots; I guess that only 3 cogs can be running with that (did not try).
So, if anyone wants half a P1v... it seems to be available here:)
BUT I am seeing absolutely no differences in the timing.
Cluso99 is working on documentation. That should help a lot.
I went back to PropellerIDE and used cog 4 and it worked... then I switched to cog 5 (which shouldn't exist). The code ran fine. The timing was unaffected and is still the same as for a normal P1. The correct LED for each assigned cog lit up on my Nano.
I know that I recompiled and reprogrammed correctly... the time stamps prove it.
I am thinking that when I am asking for a cog that doesn't exist, it uses the 2 lsb and chooses a cog that does exist... and the LED is simply an artifact.
But I'm not sure about this.
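To make that guess concrete, here is a tiny Python sketch of what "uses the 2 lsb" would mean. It is only an illustration of the hypothesis, not something read out of hub.v, and `wrapped_cog` is a made-up name.

```python
def wrapped_cog(requested, cog_count=4):
    """Keep only the low bits of the requested cog number (hypothesis only)."""
    # cog_count must be a power of two for the mask trick to work
    assert cog_count > 0 and cog_count & (cog_count - 1) == 0
    return requested & (cog_count - 1)


if __name__ == "__main__":
    for requested in range(8):
        print("requested cog", requested, "-> cog", wrapped_cog(requested))
```

Under that assumption a request for cog 5 would land on cog 1 and a request for cog 4 on cog 0, which would explain code running fine on a cog that shouldn't exist.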
The Verilog code for COGINIT/COGNEW and other hub functions is way beyond me; it may be that this code knows about "active" cogs.
I've seen the same LED and performance behaviour with code I've tried. I'm building with your changes now.
This is the Spin code I used to test performance.
MySimpleSerial.spin
I'm getting the same "Diff Time 704" as before. So it seems either the Spin is being optimized away (very unlikely) or more Verilog digging is in order.
For anyone looking in, who hasn't been here for a while, I need to add that I am a pure hobbyist, who ordinarily leaves the room when serious conversations start.
I am such a feckless programmer that I normally test my code after entering each line. That's a little tedious with FPGAs.
So far I haven't found anything I want to do with a propeller that I can't eventually do... and the verilog code makes about as much
sense to me as PASM did when I first looked at it... I'm guessing the end results will be similar.
It could take a while:)
Thank you pik33.
On my Nano P1V*1/2, I got largely the same results as pik33. At 150 MHz, the loading was unreliable. At 141.666 MHz and 133.333 MHz, the loading seemed reliable. I added a blinking LED on P0 and the program seemed to run just fine. The LED seemed to blink at the right rate, and the cog LEDs, which are hooked up in the Verilog, behaved correctly.
However, I could not get reliable serial communications, despite using FullDuplexSerial and MySimpleSerial (above in Jazzed's post) at a variety of baud rates. I even hard-coded the clock into the serial drivers, just in case something in the declaration wasn't quite kosher. No luck. I am out of time for the next couple of days. I used both bst and PropellerIDE with identical results.
If you are following along, the next step is to put out a frequency on one of the pins and measure it with a real Prop. If you have the time, give it a whirl. And if that works, then we will have to figure out what is going on with the serial stuff.
Tah Tah
Rich
Originally, I had the following:
which nutson correctly informed me had too many bits. I then made another error in the zip files above... right number of bits, wrong number of cogs:)
Which brings me to this: which I am fairly certain is correct.
The issue is that even though I think I am properly restricting bus selection to 4 cogs... the timing for my test file remains unchanged.
at line 246 of hub.v
I also made this change:
The concatenation works as in the original code (except for 4 cogs), but I think I need to change to something like sys_q <= {1'b0, ac[1:0]... but .... but...
And right now that looks like it could take forever:)
I think it can be wrong.
Try this one instead:
HOWTO testbench:
p1bus.v
p1bus_tb.v
Execute with icarus verilog:
(Code borrowed from http://iverilog.wikia.com/wiki/Getting_Started
... I do not know why ena_bus has 3 bits.)
Because there are 8 cogs?
Big thank you. Away from my massive XP machine. I need an itty bitty 64-bit laptop:)
No. Actually it had 9 bits. I have found the typo: a duplicated variable in the monitor line:
BAD> $monitor("At time %t, bus_sel = %b, ena_bus = %h", $time, bus_sel, bus_sel, ena_bus);
OK > $monitor("At time %t, bus_sel = %b, ena_bus = %b", $time, bus_sel, ena_bus);
It didn't warn me that I used three format specifiers (%t, %b, %h) with four variables; the output just ran the last two variables together, 8 bits + 1 bit (the 2nd bus_sel and ena_bus).
Substitute this:
with this:
Beware! Not tested.
Thanks again.
We are moving our house... and it isn't going well:)
I had just enough time tonight to go to my "lab" and test your changes. I first tested post #21: it worked but did not change the number of clocks (same as measured on my "p1v*1/2" and on a regular P1). I started and stopped all 4 cogs and they worked as expected.
I then added the change from the above (and if you hadn't shown me the truth table, I wouldn't have believed it). Again, everything works, but the timing in my test code remains as it was when tested on a regular P1 with the Spin file that I posted.
One issue I have with post #25 (and I suspect that it is just me) is this: you show results for all bus_sel options, but to my mind bus_sel[7:4] should always be 0. The idea is to never have these selected.
It doesn't seem to make a difference to my final result.... so, I don't see a reason to change it back.
I'm at something of a loss. There must be some other source of bus arbitration that I am missing... I would have expected the measured clocks to drop... maybe not by half, but substantially. They are exactly the same. Kind of amazing.
We are taking a hack saw to a Propeller... and it doesn't seem to care:)
On the bright side... we do have a smaller footprint and much quicker compile times in Quartus... but that is not exactly what I want:)(:~~~~^^^^^
BUT using spin to measure elapsed times... I get a decrease of 16 clocks(496->480) when the code is run on a Project Board vs. p1v*1/2
Note... in PropellerIDE, use a baud rate of 115200.
to
In Spin, the measured clocks drops to 448.
As before, the LEDs light up appropriately, so the Prop1v*1/2 seems to think the cog is being used.
Cogstop does work but the pasm routine never writes to hub ram.
Now that Ramon, Cluso99 and Ozpropdev have me heading in the right direction (thank you guys:)
I'm going back to over clocking and see what I screwed up there:)
With unoptimized pasm code, there is about a 15 percent improvement in PASM timing of the P1v*1/4 over a standard P1 and a similar increase (though smaller) for Spin.
The ultimate goal here is to make the 4Cog P1v... perform all hub related tasks about as fast as optimized PASM, with no regard to code optimization.
It bothers me (but I don't know what to do about it) that cog_ena doesn't reflect anything that we have done so far.
I have tried to follow the uses and assignments through multi-file searching, but it seems very much like 32-bit sudoku:)
Yes, my code was wrong. It introduced an "all_zero" in bus_sel. Good to know that you solved it. I have found that this one may also be OK: "bus_sel <= {bus_sel[2:0], ~|bus_sel[2:0]};"
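A quick way to convince yourself about that last expression without running a simulator is to model the one line in Python. This sketch assumes bus_sel is 4 bits wide (3 bits shifted up plus the NOR bit) and that nothing else drives it; `next_bus_sel` and `run` are made-up helper names, and the rest of dig.v is not modelled.

```python
def next_bus_sel(bus_sel):
    """Model of: bus_sel <= {bus_sel[2:0], ~|bus_sel[2:0]};"""
    low = bus_sel & 0b111      # bus_sel[2:0]
    nor = 0 if low else 1      # ~|bus_sel[2:0]: 1 only when all three bits are 0
    return (low << 1) | nor    # concatenate into a 4-bit value


def run(start, steps):
    """Return the list of states visited after each of `steps` clock edges."""
    states = []
    value = start
    for _ in range(steps):
        value = next_bus_sel(value)
        states.append(value)
    return states


if __name__ == "__main__":
    # Starting from the problematic all-zero state
    print([format(s, "04b") for s in run(0b0000, 5)])
    # Every possible 4-bit start value, including illegal multi-bit ones
    for start in range(16):
        print(format(start, "04b"), "->", [format(s, "04b") for s in run(start, 4)])
```

In this model the all-zero state moves to 0001 on the next clock, and every other 4-bit start value settles into the 0001, 0010, 0100, 1000 rotation within three clocks, so the ring cannot get stuck.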
Look at the following code; I think there is an assign that may need to be changed: