4 Cog overclocked P1V
rjo__
Posts: 2,114
Before my first attempt, help me Jesus, a 4Cog P1V.
The goal is to improve bandwidth by increasing the main clock and by reducing the hub cycle.
The question here isn't if I am missing something, but what I am missing:)
I have found just two places that need to be modified. In dig.v, in "generate" I need to change
"for (i=0; i<8; i++)" to "for (i=0; i<4; i++)"
Below that at line 129, I need to change
"wire [7:0] cog_ena;" to "wire [3:0] cog_ena;"
I know how to change the clock; my question is how best to limit the hub cycle to 4 cogs.
Obviously, it can't be this simple:)
Thanks,
Rich

Comments
reg [7:0] bus_sel;

always @(posedge clk_cog or negedge nres)
if (!nres)
    bus_sel <= 8'b0;
else if (ena_bus)
    bus_sel <= {bus_sel[6:0], ~|bus_sel[6:0]};

These lines in dig.v implement a one-hot shift register that selects one of the eight cogs to connect to the hub.
In a 4 cog design you can reduce the length of the shift sequence, for example to 6 positions, so the hub rotates faster.

reg [7:0] bus_sel;

always @(posedge clk_cog or negedge nres)
if (!nres)
    bus_sel <= 8'b0;
else if (ena_bus)
    bus_sel <= {2'b00,bus_sel[4:0], ~|bus_sel[4:0]};

I found that further reducing the run length gives strange results. Reducing it to 4 positions results in only 2 cogs running, which is OK for a 2 cog design.
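For anyone who wants to see the rotation without firing up Quartus, here is a rough Python model of the two assignments above. It is only a sketch: `next_bus_sel` and `hub_cycle` are names made up for this illustration, and the model only mimics the one register, nothing else in dig.v.

```python
def next_bus_sel(bus_sel, n):
    """Model bus_sel <= {zeros, bus_sel[n-2:0], ~|bus_sel[n-2:0]} for an 8-bit reg."""
    low = bus_sel & ((1 << (n - 1)) - 1)   # bus_sel[n-2:0]
    nor_bit = 1 if low == 0 else 0         # ~|bus_sel[n-2:0]
    return ((low << 1) | nor_bit) & 0xFF


def hub_cycle(n):
    """States visited after reset (bus_sel = 0) until the pattern repeats."""
    state, seen = next_bus_sel(0, n), []
    while state not in seen:
        seen.append(state)
        state = next_bus_sel(state, n)
    return seen


# n = 8 is the stock expression, n = 6 is the shortened one.
for n in (8, 6):
    print(n, [format(s, "08b") for s in hub_cycle(n)])
```

With n = 8 the single "1" visits all eight bit positions; with n = 6 it wraps after bit 5, so the hub comes around sooner.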
This may not be a practical idea longer term, but it may be worth trying to make a 1 COG P1V first just to understand the code.
Then try making a 4 COG P1V.
Then try making a 4 COG P1V that could share hub in some alternative way with the 1 COG P1V. ...
For example, the one COG P1V might use the even 4 of 8 HUB slots, and the 4 COG P1V could only use the odd 4 HUB slots.
Wishing I had more time for this stuff .... Maybe soon.
That's right. The other thing to address is the COGID instruction (in hub.v), as it will return a wrong 3-bit cog# most of the time.
changes mentioned in my first post:
In dig.v, in "generate" change
"for (i=0; i<8; i++)" to "for (i=0; i<4; i++)"
and following the leads above
in dig.v
reg [7:0] bus_sel;

always @(posedge clk_cog or negedge nres)
if (!nres)
    bus_sel <= 8'b0;
else if (ena_bus)
    bus_sel <= {4'b0,bus_sel[3:0], ~|bus_sel[3:0]};

and in hub.v

// output
reg [2:0] sys_q;
reg sys_c;

always @(posedge clk_cog)
if (ena_bus && sys)
    //sys_q <= ac[2:0] == 3'b001 ? { bus_sel[7] || bus_sel[6] || bus_sel[5] || bus_sel[0],   // cogid
    //                               bus_sel[7] || bus_sel[4] || bus_sel[3] || bus_sel[0],
    //                               bus_sel[6] || bus_sel[4] || bus_sel[2] || bus_sel[0] }
    //                           : num;                                                      // others
    sys_q <= ac[2:0]==3'b001 ? {1'b0, bus_sel[3] || bus_sel[0], bus_sel[2] || bus_sel[0]} : num;

test code run in PropellerIDE

CON
  _clkmode = xtal1+pll16x
  _clkfreq = 80_000_000

OBJ
  ser : "FullDuplexSerial"

VAR
  long i,x[1000],time1,time2,elapsed

pub main
  ser.Start(31,30,0,115200)
  i:=0
  time1 :=50
  waitcnt(clkfreq/4 +cnt)
  ser.str(string("Watch LEDs"))
  ser.tx(13)
  waitcnt(clkfreq*4 + cnt)
  repeat 1000
    x[i]:=i
    i++
  coginit(3,@timeit,@x)
  repeat until x[0] > 0
  ser.str(string("Results:",13))
  i:=0
  ser.dec(x[0])
  ser.str(string(" should be = 100",13))
  ser.dec(x[999])
  ser.str(string(" should be = 1099",13))
  elapsed:=time2-time1
  ser.dec(elapsed)
  ser.str(string("_clocks "))
  waitcnt(clkfreq*2+cnt)
  cogstop(3)
  waitcnt(clkfreq*2+cnt)
  cogstop(1)
  waitcnt(clkfreq*2+cnt)

Dat
              org     0
timeit        mov     loops,reps
              mov     a1,par
              mov     t1,cnt
myloop        rdlong  xval,a1
              add     xval,#100
              wrlong  xval,a1
              add     a1,#4
              nop
              djnz    loops,#myloop
              mov     t2, cnt
              wrlong  t1,a1
              add     a1,#4
              wrlong  t2,a1
nothing       jmp     #nothing

reps          long    1000
t1            res     1
t2            res     1
t3            res     1
loops         res     1
a1            res     1
xval          res     1

Nothing breaks but I get exactly the same timing on both the nano_P1V and a P1.
"hubslots" ... where are those hubslots?
Thanks
Rich
bus_sel <= {4'b0,bus_sel[3:0], ~|bus_sel[3:0]};

This code generates a 9-bit value (the top bit is dropped when it is assigned to the 8-bit bus_sel) in which a single "1" bit cycles over 5 positions = 5 hub time slots = 10 CPU clocks. Using bus_sel[2:0] would result in 4 hub time slots.
I have done some experiments with fewer hub time slots; look in this thread: http://forums.parallax.com/showthread.php/156955-Small-V-Prop-2-Cog-s-4-KB-ROM-4KB-Hub-RAM
In the last post I posted two oscilloscope screens that show the speedup with 4 slots / 2 cogs for a series of sequential RDLONGs compared to 8 slots / 8 cogs.
Warning: more Verilog changes are probably necessary to change the number of time slots without breaking some logic. With 6 time slots I can have only 4 cogs running, with 4 time slots only 2 cogs.
So you were lucky with your 5 time slots. I guess that only 3 cogs can be running with that (did not try).
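The 9-bit point is easy to check with a few lines of Python. This is only a model of the one assignment, with made-up names (`step_9bit`, `CLOCKS_PER_SLOT`); the 2 clocks per slot figure is simply the "5 slots, 10 CPU clocks" ratio quoted above.

```python
CLOCKS_PER_SLOT = 2  # from "5 positions ... 10 CPU clocks" in the post above


def step_9bit(bus_sel):
    """{4'b0, bus_sel[3:0], ~|bus_sel[3:0]} is 9 bits wide.

    Assigning it to reg [7:0] keeps only the low 8 bits.
    """
    low = bus_sel & 0xF                      # bus_sel[3:0]
    nor_bit = 1 if low == 0 else 0           # ~|bus_sel[3:0]
    nine_bits = (0b0000 << 5) | (low << 1) | nor_bit
    return nine_bits & 0xFF                  # truncation on assignment


state, slots = step_9bit(0), []
while state not in slots:
    slots.append(state)
    state = step_9bit(state)

print([format(s, "08b") for s in slots])
print(len(slots), "slots,", CLOCKS_PER_SLOT * len(slots), "clocks per hub rotation")
```

The "1" reaches bit 4 before wrapping, which is why this version gives 5 slots rather than 4.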
So, if anyone wants half a P1v... it seems to be available here:)
BUT I am seeing absolutely no differences in the timing.
Cluso99 is working on documentation. That should help a lot.
I went back to PropellerIDE and used cog 4 and it worked... then I switched to cog 5 (which shouldn't exist). The code ran fine. The timing was unaffected and is still the same as for a normal P1. The correct LED for each assigned cog lit up on my Nano.
I know that I recompiled and reprogrammed correctly... the time stamps prove it.
I am thinking that when I am asking for a cog that doesn't exist, it uses the 2 lsb and chooses a cog that does exist... and the LED is simply an artifact.
But I'm not sure about this.
The Verilog code for COGINIT/COGNEW and other hub functions is way beyond me; it may be that this code knows about "active" cogs.
I've seen the same LED and performance behaviour with code I've tried. I'm building with your changes now.
This is the Spin code I used to test performance.
CON
  _clkmode = XTAL1 + PLL16X
  _clkfreq = 80_000_000

OBJ
  ser : "MySimpleSerial"

PUB start | addr, t0, t1
  ser.init(31,30,19200)
  waitcnt(CLKFREQ/2+CNT)                '' Wait for start up
  'ser.str(string($d,"Hello.",$d))
  t0 := CNT
  addr := $7f00
  t1 := CNT
  ser.Str(string("Diff Time "))
  ser.Dec(t1-t0)
  ser.tx($d)

MySimpleSerial.spin
''*******************************************************************
''*  Simple Asynchronous Serial Driver v1.3                         *
''*  Authors: Chip Gracey, Phil Pilgrim, Jon Williams, Jeff Martin  *
''*  Copyright (c) 2006 Parallax, Inc.                              *
''*  See end of file for terms of use.                              *
''*******************************************************************
''
'' Performs asynchronous serial input/output at low baud rates (~19.2K or lower) using high-level code
'' in a blocking fashion (ie: single-cog (serial-process) rather than multi-cog (parallel-process)).
''
'' To perform asynchronous serial communication as a parallel process, use the FullDuplexSerial object instead.
''
''
'' v1.3 - May 7, 2009    - Updated by Jeff Martin to fix rx method bug, noted by Mike Green and others, where uninitialized
''                         variable would mangle received byte.
'' v1.2 - March 26, 2008 - Updated by Jeff Martin to conform to Propeller object initialization standards and compress by 11 longs.
'' v1.1 - April 29, 2006 - Updated by Jon Williams for consistency.
''
''
'' The init method MUST be called before the first use of this object.
'' Optionally call finalize after final use to release transmit pin.
''
'' Tested to 19.2 kbaud with clkfreq of 80 MHz (5 MHz crystal, 16x PLL)

VAR

  long  sin, sout, inverted, bitTime, rxOkay, txOkay

PUB init(rxPin, txPin, baud): Okay
{{Call this method before first use of object to initialize pins and baud rate.

  For true mode (start bit = 0), use positive baud value.     Ex: serial.init(0, 1, 9600)
  For inverted mode (start bit = 1), use negative baud value. Ex: serial.init(0, 1, -9600)

  Specify -1 for "unused" rxPin or txPin if only one-way communication desired.
  Specify same value for rxPin and txPin for bi-directional communication on that pin and connect a pull-up/pull-down resistor
  to that pin (depending on true/inverted mode) since pin will set it to hi-z (input) at the end of transmission to avoid
  electrical conflicts.  See "Same-Pin (Bi-Directional)" examples, below.

  EXAMPLES:

    Standard Two-Pin Bi-Directional True/Inverted Modes     Ex: serial.init(0, 1, ±9600)
    Standard One-Pin Uni-Directional True/Inverted Mode     Ex: serial.init(0, -1, ±9600)  -or-  serial.init(-1, 0, ±9600)
    Same-Pin (Bi-Directional) True Mode                     Ex: serial.init(0, 0, 9600)
    Same-Pin (Bi-Directional) Inverted Mode                 Ex: serial.init(0, 0, -9600)

    [wiring diagrams: Propeller P0/P1 to I/O Device; the same-pin modes add a 4.7 kΩ resistor on the shared line]
}}

  finalize                                              ' clean-up if restart

  rxOkay := rxPin > -1                                  ' receiving?
  txOkay := txPin > -1                                  ' transmitting?

  sin := rxPin & $1F                                    ' set rx pin
  sout := txPin & $1F                                   ' set tx pin

  inverted := baud < 0                                  ' set inverted flag
  bitTime := clkfreq / ||baud                           ' calculate serial bit time

  return rxOkay | TxOkay

PUB finalize
{{Call this method after final use of object to release transmit pin.}}

  if txOkay                                             ' if tx enabled
    dira[sout]~                                         ' float tx pin
  rxOkay := txOkay := false

PUB rx: rxByte | t
{{ Receive a byte; blocks caller until byte received. }}

  if rxOkay
    dira[sin]~                                          ' make rx pin an input
    waitpeq(inverted & |< sin, |< sin, 0)               ' wait for start bit
    t := cnt + bitTime >> 1                             ' sync + 1/2 bit
    repeat 8
      waitcnt(t += bitTime)                             ' wait for middle of bit
      rxByte := ina[sin] << 7 | rxByte >> 1             ' sample bit
    waitcnt(t + bitTime)                                ' allow for stop bit
    rxByte := (rxByte ^ inverted) & $FF                 ' adjust for mode and strip off high bits

PUB tx(txByte) | t
{{ Transmit a byte; blocks caller until byte transmitted. }}

  if txOkay
    outa[sout] := !inverted                             ' set idle state
    dira[sout]~~                                        ' make tx pin an output
    txByte := ((txByte | $100) << 2) ^ inverted         ' add stop bit, set mode
    t := cnt                                            ' sync
    repeat 10                                           ' start + eight data bits + stop
      waitcnt(t += bitTime)                             ' wait bit time
      outa[sout] := (txByte >>= 1) & 1                  ' output bit (true mode)

    if sout == sin
      dira[sout]~                                       ' release to pull-up/pull-down

PUB str(strAddr)
{{ Transmit z-string at strAddr; blocks caller until string transmitted. }}

  if txOkay
    repeat strsize(strAddr)                             ' for each character in string
      tx(byte[strAddr++])                               ' write the character

PUB dec(value) | i, z
'' Print a signed decimal number

  if value < 0
    -value
    tx("-")

  i := 1_000_000_000
  z~

  repeat 10
    if value => i
      tx(value / i + "0")
      value //= i
      z~~
    elseif z or i == 1
      tx("0")
    i /= 10

PUB hex(value, digits)
'' Print a hexadecimal number

  value <<= (8 - digits) << 2
  repeat digits
    tx(lookupz((value <-= 4) & $F : "0".."9", "A".."F"))

PUB bin(value, digits)
'' Print a binary number

  value <<= 32 - digits
  repeat digits
    tx((value <-= 1) & 1 + "0")

{{
  TERMS OF USE: MIT License

  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation
  files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy,
  modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software
  is furnished to do so, subject to the following conditions:

  The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

  THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
  WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
  COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
  ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
}}

I'm getting the same "Diff Time 704" as before. So it seems either the Spin is being optimized away (very unlikely) or more Verilog digging is in order.
For anyone looking in, who hasn't been here for a while, I need to add that I am a pure hobbyist, who ordinarily leaves the room when serious conversations start.
I am such a feckless programmer that I normally test my code after entering each line. That's a little tedious with FPGAs.
So far I haven't found anything I want to do with a propeller that I can't eventually do... and the verilog code makes about as much
sense to me as PASM did when I first looked at it... I'm guessing the end results will be similar.
It could take a while:)
Thank you pik33.
On my Nano P1V*1/2, I got largely the same results as pik33. At 150 MHz, the loading was unreliable. At 141.666 MHz and 133.333 MHz, the loading seemed reliable. I added a blinking LED on P0 and the program seemed to run just fine. The LED seemed to blink at the right rate, and the cog LEDs, which are hooked up in the Verilog, behaved correctly.
However, I could not get reliable serial communications, despite using FullDuplexSerial and MySimpleSerial (above in Jazzed's post) at a variety of baud rates. I even hard coded the clock into the serial drivers, just in case something in the declaration wasn't quite kosher. No luck. I am out of time for the next couple of days. I used both bst and PropellerIDE with identical results.
If you are following along, the next step is to put out a frequency on one of the pins and measure it with a real Prop... if you have the time, give it a whirl. And if that works, then we will have to figure out what is going on with the serial stuff.
Tah Tah
Rich
Originally, I had the following:

reg [7:0] bus_sel;

always @(posedge clk_cog or negedge nres)
if (!nres)
    bus_sel <= 8'b0;
else if (ena_bus)
    bus_sel <= {4'b0,bus_sel[3:0], ~|bus_sel[3:0]};

which nutson correctly informed me had too many bits. I then made another error in the zip files above... right number of bits, wrong number of cogs:)
Which brings me to this:

reg [7:0] bus_sel;

always @(posedge clk_cog or negedge nres)
if (!nres)
    bus_sel <= 8'b0;
else if (ena_bus)
    bus_sel <= {3'b0,bus_sel[3:0], ~|bus_sel[3:0]};

which I am fairly certain is correct. The issue is that even though I think I am properly restricting bus selection to 4 cogs... the timing for my test file remains unchanged.
At line 246 of hub.v I also made this change:

// output
reg [2:0] sys_q;
reg sys_c;

always @(posedge clk_cog)
if (ena_bus && sys)
    sys_q <= ac[2:0]==3'b001 ? {1'b0, bus_sel[3] || bus_sel[0], bus_sel[2] || bus_sel[0]} : num;

The concatenation works as in the original code (except for 4 cogs), but I think I need to change to something like sys_q <={1'b0,ac[1:0]... but .... but...
And right now that looks like it could take forever:)
I think it can be wrong.
Try this one instead:
bus_sel <= {4'b0,bus_sel[2:0], ~|bus_sel[3:0]};

HOWTO testbench:
p1bus.v

module p1bus (bus_sel, nres, ena_bus, clk_cog);
  output [7:0] bus_sel;
  input nres, clk_cog, ena_bus;

  reg [7:0] bus_sel;

  always @(posedge clk_cog or negedge nres)
  if (!nres)
      bus_sel <= 8'b0;
  else if (ena_bus)
      bus_sel <= {4'b0,bus_sel[2:0], ~|bus_sel[3:0]};
endmodule

p1bus_tb.v

module test;
  reg nres = 1;
  reg ena_bus = 0;

  initial begin
    # 16 nres = 0;
    # 1 nres = 1;
    # 1 ena_bus = 1;
    # 100 nres = 0;
    # 1 nres = 1;
    # 100 $finish;
  end

  reg clk_cog = 0;
  always #5 clk_cog = !clk_cog;

  wire [7:0] bus_sel;
  p1bus p1 (bus_sel, nres, ena_bus, clk_cog);

  initial
    $monitor("At time %t, bus_sel = %b, ena_bus = %h", $time, bus_sel, bus_sel, ena_bus);
endmodule // test

Execute with Icarus Verilog:
(code borrowed from http://iverilog.wikia.com/wiki/Getting_Started
... do not know why ena_bus has 3 bits)
Because there are 8 cogs?
Big thank you. Away from my massive Xp machine. I need a itty bitty 64 bit laptop:)
No. Actually it had 9 bits. I have found the typo: a duplicated variable in the monitor line:
BAD> $monitor("At time %t, bus_sel = %b, ena_bus = %h", $time, bus_sel, bus_sel, ena_bus);
OK > $monitor("At time %t, bus_sel = %b, ena_bus = %b", $time, bus_sel, ena_bus);
It didn't warn me that I had used three format specifiers (%t, %b, %h) with four arguments; the compiler just joined the last two variables, 8 bits + 1 bit (the 2nd bus_sel and ena_bus).
Substitute this:
sys_q <= ac[2:0] == 3'b001 ? { bus_sel[7] || bus_sel[6] || bus_sel[5] || bus_sel[0],   // cogid
                               bus_sel[7] || bus_sel[4] || bus_sel[3] || bus_sel[0],
                               bus_sel[6] || bus_sel[4] || bus_sel[2] || bus_sel[0] }
                           : num;                                                      // others

// 76543210   {OR(7,6,5,0), OR(7,4,3,0), OR(6,4,2,0)}
// ========
// 00000001   { 1 , 1, 1} = 111b (7)
// 00000010   { 0 , 0, 0} = 000b (0)
// 00000100   { 0 , 0, 1} = 001b (1)
// 00001000   { 0 , 1, 0} = 010b (2)
// 00010000   { 0 , 1, 1} = 011b (3)
// 00100000   { 1 , 0, 0} = 100b (4)
// 01000000   { 1 , 0, 1} = 101b (5)
// 10000000   { 1 , 1, 0} = 110b (6)

with this:

sys_q <= ac[2:0] == 3'b001 ? { 1'b0,                                                   // cogid
                               bus_sel[7] || bus_sel[4] || bus_sel[3] || bus_sel[0],
                               bus_sel[6] || bus_sel[4] || bus_sel[2] || bus_sel[0] }
                           : num;                                                      // others

// 76543210   {1'b0, OR(7,4,3,0), OR(6,4,2,0)}
// ========
// 00000001   { 0 , 1, 1} = 011b (3)
// 00000010   { 0 , 0, 0} = 000b (0)
// 00000100   { 0 , 0, 1} = 001b (1)
// 00001000   { 0 , 1, 0} = 010b (2)
// 00010000   { 0 , 1, 1} = 011b (3)
// 00100000   { 0 , 0, 0} = 000b (0)
// 01000000   { 0 , 0, 1} = 001b (1)
// 10000000   { 0 , 1, 0} = 010b (2)

Beware! Not tested.
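The two truth tables can be double-checked with a small Python model of the OR terms. This is just a sketch for checking the encoding, not a test of the hardware; `cogid_8`, `cogid_4` and `bit` are names invented for this example.

```python
def bit(value, n):
    """Return bit n of value (0 or 1)."""
    return (value >> n) & 1


def cogid_8(bus_sel):
    """Stock hub.v encoding: one-hot bus_sel to 3-bit cog number."""
    hi = bit(bus_sel, 7) | bit(bus_sel, 6) | bit(bus_sel, 5) | bit(bus_sel, 0)
    mid = bit(bus_sel, 7) | bit(bus_sel, 4) | bit(bus_sel, 3) | bit(bus_sel, 0)
    lo = bit(bus_sel, 6) | bit(bus_sel, 4) | bit(bus_sel, 2) | bit(bus_sel, 0)
    return (hi << 2) | (mid << 1) | lo


def cogid_4(bus_sel):
    """Proposed 4-cog encoding: top bit forced to 0."""
    mid = bit(bus_sel, 7) | bit(bus_sel, 4) | bit(bus_sel, 3) | bit(bus_sel, 0)
    lo = bit(bus_sel, 6) | bit(bus_sel, 4) | bit(bus_sel, 2) | bit(bus_sel, 0)
    return (mid << 1) | lo


for n in range(8):
    sel = 1 << n
    print(format(sel, "08b"), cogid_8(sel), cogid_4(sel))
```

In both versions the cog number comes out as the slot index minus one, wrapping around (modulo 8 in the stock code, modulo 4 in the proposal), which matches the tables above.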
Thanks again.
We are moving our house... and it isn't going well:)
I had just enough time tonight to go to my "lab" and test your changes... I first tested post#21... it worked but did not change the number of clocks (same as measured on my "p1v*1/2" and on a regular P1). I started and stopped all 4 cogs... they worked as expected.
I then added the change from the post above (and if you hadn't shown me the truth table, I wouldn't have believed it). Again, everything works, but the timing in my test code remains as it was tested on a regular P1 with the Spin file that I posted.
One issue I have about post#25... (and I suspect that it is just me) is this: you show results for all bus_sel options... but to my mind, bus_sel[7:4] should always be 0... the idea is to never have these selected.
It doesn't seem to make a difference to my final result.... so, I don't see a reason to change it back.
I'm at something of a loss. There must be some other source of bus arbitration that I am missing... I would have expected the measured clocks to drop... maybe not in half, but substantially. They are exactly the same. Kind of amazing.
We are taking a hack saw to a Propeller... and it doesn't seem to care:)
On the bright side... we do have a smaller footprint and much quicker compile times in Quartus... but that is not exactly what I want:)(:~~~~^^^^^
BUT using spin to measure elapsed times... I get a decrease of 16 clocks(496->480) when the code is run on a Project Board vs. p1v*1/2
Note... in PropellerIDE, use a baud rate of 115200.
Changing
bus_sel <= {4'b0,bus_sel[2:0], ~|bus_sel[3:0]};
to
bus_sel <= {4'b0,bus_sel[2:0], ~|bus_sel[2:0]};
drops the clocks measured in Spin to 448.
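The difference between the ~|bus_sel[3:0] and ~|bus_sel[2:0] versions shows up if you step both by hand, or with a little Python model like the one below. It is a sketch only; `step` and `trace` are illustrative names, and `nor_width` is just how many low bits feed the NOR.

```python
def step(bus_sel, nor_width):
    """Model bus_sel <= {4'b0, bus_sel[2:0], ~|bus_sel[nor_width-1:0]}."""
    shifted = (bus_sel & 0b111) << 1                     # bus_sel[2:0] moved up one
    nor_in = bus_sel & ((1 << nor_width) - 1)            # bits fed to the NOR
    nor_bit = 1 if nor_in == 0 else 0
    return (shifted | nor_bit) & 0xFF


def trace(nor_width, steps=6):
    """States after reset (bus_sel = 0) for the given NOR width."""
    state, out = 0, []
    for _ in range(steps):
        state = step(state, nor_width)
        out.append(state)
    return out


print("~|bus_sel[3:0]:", [format(s, "04b") for s in trace(4)])
print("~|bus_sel[2:0]:", [format(s, "04b") for s in trace(3)])
```

With the 4-bit NOR the register passes through an all-zero state after 1000, so the rotation takes 5 steps and one of them selects nobody; with the 3-bit NOR it goes straight from 1000 back to 0001, giving 4 slots.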
As before, the LEDs light up appropriately, so the Prop1v*1/2 seems to think the cog is being used.
Cogstop does work but the pasm routine never writes to hub ram.
Now that Ramon, Cluso99 and Ozpropdev have me heading in the right direction (thank you guys:)
I'm going back to overclocking to see what I screwed up there:)
With unoptimized pasm code, there is about a 15 percent improvement in PASM timing of the P1v*1/4 over a standard P1 and a similar increase (though smaller) for Spin.
The ultimate goal here is to make the 4Cog P1v... perform all hub related tasks about as fast as optimized PASM, with no regard to code optimization.
It bothers me (but I don't know what to do about it) that cog_ena doesn't reflect anything that we have done so far.
I have tried to follow the uses and assignments through multi-file searching, but it seems very much like 32-bit sudoku:)
Yes, my code was wrong. It introduced an "all_zero" state in bus_sel. Good to know that you solved it. I have found that this one may also be OK: "bus_sel <= {bus_sel[2:0], ~|bus_sel[2:0]};"
Look at the following code, I think that there is an assign that maybe need to be changed:
P1V     -> assign bus_ack = ed ? { bus_sel[1:0], bus_sel[7:2]} : 8'b0;
P1V_1/2 -> assign bus_ack = ed ? {4'b0, bus_sel[1:0], bus_sel[3:2]} : 8'b0;

bus_sel     P1v_cog[n-2]   P1v_1/4[n-2]
=========   ============   ============
0000 0001   0100 0000      0000 0100
0000 0010   1000 0000      0000 1000
0000 0100   0000 0001      0000 0001
0000 1000   0000 0010      0000 0010
0001 0000   0000 0100
0010 0000   0000 1000
0100 0000   0001 0000
1000 0000   0010 0000
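Both bus_ack expressions are a rotate-right by two, just over different widths, and that can be checked with a short Python sketch. `rotr` is a helper written for this example (not anything from the P1V sources), and the 4-bit version only models the proposed P1V_1/2 assign, which has not been confirmed on hardware here.

```python
def rotr(value, amount, width):
    """Rotate the low `width` bits of value right by `amount`."""
    mask = (1 << width) - 1
    value &= mask
    return ((value >> amount) | (value << (width - amount))) & mask


def bus_ack_p1v(bus_sel):
    """{bus_sel[1:0], bus_sel[7:2]}: rotate right by 2 over 8 bits."""
    return rotr(bus_sel, 2, 8)


def bus_ack_half(bus_sel):
    """{4'b0, bus_sel[1:0], bus_sel[3:2]}: rotate right by 2 over 4 bits."""
    return rotr(bus_sel, 2, 4)


for n in range(4):
    sel = 1 << n
    print(format(sel, "08b"),
          format(bus_ack_p1v(sel), "08b"),
          format(bus_ack_half(sel), "08b"))
```

For the four slots that exist in the cut-down design the two columns differ only in the first two rows, where the 8-bit rotate lands in bits 6 and 7 and the 4-bit rotate lands in bits 2 and 3.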