NEW! Fast indirect access to COG RAM with LOAD/STORE instructions

rogloh · 2014-09-16 23:38

I've just added the latest piece of the puzzle I am solving, in the form of two more P1V instructions. They enable COG RAM based LOAD and STORE functionality with register pointers, giving indirect access to COG RAM in a single operation. Previously this was only really possible on the Prop with self-modifying code, where MOVD/MOVS modifies the D/S address of another instruction and a delay cycle is required between that modification and the affected instruction. No self-modifying code is used here, so LOAD/STORE can conveniently run back to back with any code that changes the pointer values.

The (attached) change appears to be working well in my testing so far. This is what it does.

LOAD  D, S/#    ' register D is written with the contents of the COG RAM location addressed by register S or by the constant #.  (D = *S or *#)
STORE D, S/#    ' the COG RAM location addressed by register D is written with the contents of register S or the constant #.     (*D = S or #)

These two primitives essentially enable the 32 bit data in COG memory to be accessed similarly to the RAM on your typical microcontroller. It behaves like your LD/ST instructions do on an AVR micro for example, but it transfers 32 bits in one instruction cycle for high performance (the AVR takes 2 cycles for reading/writing its 8 bit wide internal RAM).

High level code such as assembled C can easily do this now:

LOAD  R1, R2   ' R1<=[R2]  read R1 from COG memory using R2 as address pointer
STORE R2, R3   ' [R2]<=R3  write R3 to COG memory using R2 as address pointer
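For anyone wanting to reason about these semantics without an FPGA to hand, here is a minimal Python sketch of the behaviour described above. It is only a behavioural model, not the Verilog: the names `CogRam`, `load` and `store` are made up for illustration, and wrapping the pointer into the 512-long range of a standard 2kB COG is an assumption.

```python
# Behavioural sketch of LOAD/STORE as described in the post.
# CogRam, load and store are hypothetical names, not part of the P1V source.

COG_LONGS = 512            # a standard 2kB COG holds 512 longs
MASK32 = 0xFFFFFFFF


class CogRam:
    def __init__(self):
        self.mem = [0] * COG_LONGS
        self.z = False     # Z flag, only updated when WZ is requested

    def load(self, d, s, imm=False, wz=False):
        """LOAD D, S/#  ->  D = *S (or *# when imm is True)."""
        # Assumption: the pointer wraps to the COG RAM size.
        addr = (s if imm else self.mem[s]) % COG_LONGS
        value = self.mem[addr]
        self.mem[d] = value
        if wz:
            self.z = (value == 0)

    def store(self, d, s, imm=False, wz=False):
        """STORE D, S/#  ->  *D = S (or # when imm is True)."""
        addr = self.mem[d] % COG_LONGS
        value = (s if imm else self.mem[s]) & MASK32
        self.mem[addr] = value
        if wz:
            self.z = (value == 0)


# Mirror of the two-line example: STORE R2, R3 then LOAD R1, R2.
cog = CogRam()
R1, R2, R3 = 1, 2, 3
cog.mem[R2] = 100              # R2 is the pointer
cog.mem[R3] = 0xDEADBEEF       # R3 is the data
cog.store(R2, R3)              # [R2] <= R3
cog.load(R1, R2)               # R1 <= [R2]
```

After running this, location 100 and R1 both hold the value that was in R3, with no instruction patched along the way.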

I have carefully considered this mod along with my other COGSTACK change for CALLX/RETX/PUSH/POP and set it up in a way that should interoperate with Cluso's >2kB AUGDS/AUGS work in the future (where things will start to really shine when the PC is greater than 9 bits wide). The opcode I am using is (000110) which I have tried to share with his changes. So far it seems to fit very nicely with his instruction formats.

Assembled high level code (eg. GCC output) will normally keep its internal machine registers in the lower 2kB to avoid AUGDS overheads everywhere, and access to the data segment in COGRAM will primarily be through register pointers. However, we will still want the ability in raw PASM to write directly to any COG memory above 2kB at times, and this is when we can use Cluso's AUGDS stuff as well. Having both forms of data access (direct with optional AUGDS, and indirect with pointers) gives the best of both worlds.

Here's the encoding I used for LOAD/STORE.

                ' iiiiii_zcri_cccc_ddddddddd_sssssssss

LOAD  D, S/#    ' 000110_x01x_cccc_ddddddddd_sssssssss    optional WZ functions as expected (ie. sets Z flag if written value = 0)
STORE D, S/#    ' 000110_x11x_cccc_ddddddddd_sssssssss        (ditto)

AUGDS #D,#S     ' 000110_x001_xxxx_ddddddddd_sssssssss    reserved for Cluso (distinguished from LOAD/STORE by using write result bit = 0)
AUGDS D,S       ' 000110_x000_xxxx_ddddddddd_sssssssss    reserved for Cluso/indirect way of setting up Ish:Isl and Idh:Idl of next instruction
AUGS  #S        ' 000110_x10s_ssss_sssssssss_sssssssss    reserved for Cluso (ditto)
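The table above can be turned into a small encoder/decoder to check the bit patterns by hand. This is a sketch only: the function names are hypothetical, the field positions follow the `iiiiii_zcri_cccc_ddddddddd_sssssssss` layout given above, and the default condition of `1111` (always) is the usual P1 convention rather than something stated in the post.

```python
# Sketch of the LOAD/STORE encoding from the table above.
# encode, encode_load, encode_store and classify are hypothetical helpers.

OPCODE = 0b000110


def encode(c_bit, r_bit, d, s, imm=False, wz=False, cond=0b1111):
    """Pack the fields as iiiiii_zcri_cccc_ddddddddd_sssssssss."""
    return ((OPCODE << 26) | (int(wz) << 25) | (c_bit << 24) |
            (r_bit << 23) | (int(imm) << 22) | ((cond & 0xF) << 18) |
            ((d & 0x1FF) << 9) | (s & 0x1FF))


def encode_load(d, s, **kw):
    return encode(0, 1, d, s, **kw)    # zcri = x01x


def encode_store(d, s, **kw):
    return encode(1, 1, d, s, **kw)    # zcri = x11x


def classify(word):
    """Tell the instructions sharing opcode 000110 apart by their C/R/I bits."""
    if (word >> 26) != OPCODE:
        return None
    c = (word >> 24) & 1
    r = (word >> 23) & 1
    i = (word >> 22) & 1
    if r:                               # write-result bit set: LOAD or STORE
        return "STORE" if c else "LOAD"
    if c:                               # R = 0, C = 1: AUGS (I bit is data)
        return "AUGS"
    return "AUGDS #D,#S" if i else "AUGDS D,S"


load_word = encode_load(1, 2)          # LOAD  R1, R2
store_word = encode_store(2, 3)        # STORE R2, R3
print(f"{load_word:08X} {store_word:08X}")
```

The decoder makes the sharing scheme visible: the write-result bit alone separates LOAD/STORE from the AUGDS/AUGS forms reserved for Cluso.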

So in summary, we now have the following nice new group of P1V instructions using the spare opcodes (0001xx) which we can combine in useful ways to run different languages on the P1V:

MUL/MULS - 32 bit HW multiplication thanks to Willy Ekerslyke
LOAD - indirect COG memory READs
STORE - indirect COG memory WRITEs
AUGDS/AUGS - Cluso working on this. Will enable larger programs!
PUSH - to COGRAM stack
POP - from COGRAM stack
RETX - uses new COGRAM stack
CALLX - uses new COGRAM stack, support planned for >2kB COGRAM
JUMPX - support planned for >2kB COGRAM

My earlier hub mods are also available for faster LMM (using existing hub memory transfer instructions with WC modifier). I still plan to get that Link Register feature going there too.

WRxxxx WC
RDxxxx WC

In case you are wondering why I am interested in all this: my end goal for a lot of it (apart from getting back into FPGAs again and learning some Verilog) is a nice little P1V dev system on the DE0 nano that has the following features:

A superCOG that can access the onboard 32MB SDRAM as expanded HUB memory and run huge LMM applications from there, hence my HUB based PUSH/POP work
At least one COG with larger COGRAM (maybe 16-32kB instead of 2kB), for running embedded COG based C/PASM code fast! Cluso's and my other COGRAM changes will help this. I have another crazy idea here I will keep for later...
Probably 6 regular 2kB COGs (for I/O driver stuff).
An SDRAM controller to interleave access from the single superCOG and an internal LVDS based video controller I want to create with text/gfx modes. This needs to be carefully designed for simultaneous bandwidth sharing enabling 1280x800x18bpp LVDS output (it's just doable in the memory bandwidth available according to my calculations).
I/O PORTA/B (mapped to internal/external peripherals, looking forward to some INA masks etc for this too).

Note: the attached code below is combined with the MUL/MULS stuff Willy added, so the ALU operating speed may be slower (at least before optimizations).
It is also starting to grow a bit in size. There's scope to drop it back as I've not removed the COGs video features yet.

Actually, in my case a final clock speed of 80MHz+ is not the main goal, as I am probably limited to ~72MHz by both the SDRAM and the LVDS panel I want to hook up. Speed will inevitably drop as more features get added. Hmm, just thinking 72MHz is a nice 12MHz multiple too for USB...

Enjoy.

Cluso99 · 2014-09-17 01:14

WOW - WTG !!!

Now I understand why you wanted the LOAD/STORE.

I suggest you swap the WC & NR bits so that they conform to the RD/WRxxxx where read from hub to cog is a write to cog and write to hub from cog is a read from cog and hence NR. It just fits nicer. Means I will have to change AUGS which will not matter much having the NR bit reset. Your thoughts?

Currently I have the pc using 11 bits, but it's easy to expand. Willy showed me how to define tables for the cog ram for each cog, and vga for each cog. I have the pc width as a parameter, so now I will be able to make it variable depending on cog ram size.

rogloh · 2014-09-17 02:49

Cluso99 wrote: »

WOW - WTG !!!

Now I understand why you wanted the LOAD/STORE.

I suggest you swap the WC & NR bits so that they conform to the RD/WRxxxx where read from hub to cog is a write to cog and write to hub from cog is a read from cog and hence NR. It just fits nicer. Means I will have to change AUGS which will not matter much having the NR bit reset. Your thoughts?

Currently I have the pc using 11 bits, but it's easy to expand. Willy showed me how to define tables for the cog ram for each cog, and vga for each cog. I have the pc width as a parameter, so now I will be able to make it variable depending on cog ram size.

Thanks for that. I am kind of hoping some of the PropGCC guys like David Betz and Eric Smith etc may see this stuff I am adding and like what it now offers enough to ultimately help support some of the new capabilities we will have soon in a future GCC update. I would like to look at that part too but have a heap of other things in my posted list to try to work on too so I'm not sure how much time I can put into that piece.

As to my thoughts on NR/WR actually I'm not yet quite sold on the option you've suggested. Yes I do understand your point about trying to keep some symmetry/consistency with hub access opcode forms of RDLONG/WRLONG, but in the case of LOAD/STORE it's actually a little different here because in both cases underlying COG register write backs are happening from the ALU. I need to writeback the ALU result to a COG register when I LOAD and I need to write the result to a COG register from the ALU when I STORE. So using WR=0 is troubling there and I would have to override it in one case. It could be done but it doesn't fit quite as cleanly in my opinion.

Also I believe it suits us nicely to have AUGDS and AUGS use WR=0 as you already coded because these are actually intermediate/temporary operations that are not writing back to any COG register within this instruction, so WR=0 seems to fit better for helping people understanding that. At least that's how I think of it.

Really looking forward to AUGDS fitting into all this and expanding the PC width for larger COGRAM space and the capability that will allow. I have another idea for the non-immediate form of AUGDS that might help too but I first need to nut it out a bit more in my own head to see if it has legs and is worthy of inclusion.

Cheers,
rogloh.

Cluso99 · 2014-09-17 05:34

Ok, I follow the NR use. I am likewise using the NR bit to prevent the writeback in AUGDS. I only have to prevent the writeback with AUGS as the NR bit is used for immediate data.

rogloh · 2014-09-17 06:44

Cluso99 wrote: »

Ok, I follow the NR use. I am likewise using the NR bit to prevent the writeback in AUGDS. I only have to prevent the writeback with AUGS as the NR bit is used for immediate data.

From what I can tell it seems you shouldn't actually need to use the WR/NR bit as immediate data (if that is what you are saying above) if the IM bit gets used as one of the data bits for the AUGS constant data. I count 23 consecutive bits available to AUGS data that way: 1 (IM) + 4 (CCCC) + 18 (D+S), leaving the remaining 9 of the 32 to come from the next instruction itself. So WR/NR can be left set as zero. Isn't that how it would work or have I got something wrong here?
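The bit count above can be checked with a few lines of Python. This is a sketch under stated assumptions: the helper names are hypothetical, and it assumes the AUGS word carries the upper 23 bits of the constant while the following instruction supplies the lower 9 in its S field. How Cluso's AUGS actually splits the constant is not specified in this thread.

```python
# Sketch of the 23 + 9 bit split for a 32-bit AUGS constant.
# split_augs and encode_augs are hypothetical helper names.

def split_augs(value):
    """Return (upper 23 bits for AUGS, lower 9 bits for the next instruction)."""
    value &= 0xFFFFFFFF
    return value >> 9, value & 0x1FF


def encode_augs(value, wz=False):
    """AUGS #S as 000110_x10s_ssss_sssssssss_sssssssss (C = 1, R/NR bit = 0)."""
    upper23, _ = split_augs(value)
    # upper23 fills IM (bit 22), CCCC (21..18), D (17..9) and S (8..0).
    return (0b000110 << 26) | (int(wz) << 25) | (1 << 24) | (0 << 23) | upper23


upper, lower = split_augs(0x12345678)
word = encode_augs(0x12345678)
rebuilt = (upper << 9) | lower     # what the next instruction would see
```

Because the 23 data bits fit entirely below bit 23, the write-result bit stays zero for every possible constant, which is the point being made above.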

Bill Henning · 2014-09-17 09:12

VERY NICE addition!

jac_goudsmit · 2014-09-17 11:29

This is a great feature! I'll add it to my TO DO list of things to add to Github.

===Jac

Cluso99 · 2014-09-17 12:27

rogloh wrote: »

From what I can tell it seems you shouldn't actually need to use the WR/NR bit as immediate data (if that is what you are saying above) if the IM bit gets used as one of the data bits for the AUGS constant data. I count 23 consecutive bits available to AUGS data that way: 1 (IM) + 4 (CCCC) + 18 (D+S), leaving the remaining 9 of the 32 to come from the next instruction itself. So WR/NR can be left set as zero. Isn't that how it would work or have I got something wrong here?

Of course you are correct... need more coffee

Todd Marshall · 2014-09-17 16:10

Rogloh: Your specialized COGs would benefit from an indirect addressed approach to sequencing the COGs (i.e. rather than COG slots, the sequence would be viewed as time slots and many more than 8 would be implemented). This would give some COGs more than one bite at the apple in the rotation if that benefits the application. Right now the sequence is accomplished by shifting a bit in an 8 bit register and is not programmable. I wish I knew how to program it. I'm sure a Verilog jock could implement it in minutes.
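To make the idea concrete, here is a small Python sketch contrasting the fixed one-hot shift register with a programmable time-slot table. It is purely illustrative: the function names and the 16-entry example table are assumptions, and nothing here reflects how the hub logic is actually written in the P1V Verilog.

```python
# Illustrative comparison of a fixed rotation and a programmable slot table.
# All names and the example table are hypothetical.
from itertools import cycle, islice


def shift_register_sequence(num_cogs=8):
    """Fixed rotation: one hot bit shifted through num_cogs positions."""
    mask = (1 << num_cogs) - 1
    onehot = 1
    while True:
        yield onehot.bit_length() - 1          # index of the selected COG
        onehot = ((onehot << 1) & mask) or 1   # wrap back to COG 0


def slot_table_sequence(table):
    """Programmable rotation: each time slot names the COG it services."""
    return cycle(table)


# 16 time slots in which COG 0 gets every other slot.
table = [0, 1, 0, 2, 0, 3, 0, 4, 0, 5, 0, 6, 0, 7, 0, 1]

fixed = list(islice(shift_register_sequence(), 10))
mapped = list(islice(slot_table_sequence(table), 16))
cog0_share = mapped.count(0) / len(mapped)
```

With this table COG 0 is serviced in half of all slots, while the fixed rotation always gives each COG exactly one slot in eight.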

rogloh · 2014-09-17 20:41

Todd Marshall wrote: »

Rogloh: Your specialized COGs would benefit from an indirect addressed approach to sequencing the COGs (i.e. rather than COG slots, the sequence would be viewed as time slots and many more than 8 would be implemented). This would give some COGs more than one bite at the apple in the rotation if that benefits the application. Right now the sequence is accomplished by shifting a bit in an 8 bit register and is not programmable. I wish I knew how to program it. I'm sure a Verilog jock could implement it in minutes.

Hi Todd. Actually these COG changes I added don't really touch the hub memory; that was the point, to avoid running from hub. With the latest changes above we can keep a stack, access a data segment and ultimately store more PASM/assembled C code in COG RAM for the highest performance (with expanded COG RAM). That's what we are working to here. Even my earlier hub based push/pop LMM stuff wouldn't gain much from any more hub bandwidth as we need to loop and execute the instruction which already fills in the remaining cycles between hub accesses. Once we lock to the hub after the first read instruction we are running flat out from then on.

Changes to the hub allocation hardware are no doubt doable, but you'll get lots of differing views on the best model. It has been covered so much before but from what I recall no-one came up with a simple/flexible model with enough buy in. Anyway, I'd say that discussion best suits another thread outside this one.

Tubular · 2014-09-17 21:18

Very nice work you've done there, Rogloh. Thanks for taking the time to outline the broader picture, too

regards
Lachlan

rogloh · 2014-09-17 21:50

Tubular wrote: »

Very nice work you've done there, Rogloh. Thanks for taking the time to outline the broader picture, too

regards
Lachlan

Thanks, no problem, I was hoping it may interest other people. I recall you might have mentioned LVDS sometime back too so you might be interested in this. I noticed the DE-0 nano has this lower set of IDC pins (shared with ADC) which if it could be made to work would make a great connection to a 7-10 inch LVDS LCD panel (maybe with capacitive touchscreen). A small connector daughterboard would be a great project and I'm considering that if I can make the time to design another board, unless someone else who is interested builds one first (I know you make some nice boards Lachlan, LOL).

I'm just hoping the routing on the DE-0 nano board to these pins was designed well enough to carry a 4 lane 18bpp LVDS @ 65-72MHz clock (SERDES factor of x7) or so with no serious signal integrity issues but I don't know yet. Will just have to try it I guess. But if it worked that would leave the top rows of pins free for general GPIO on two ports A,B. It would then all fit my needs nicely.

On the DE-0 nano it appears the Cyclone IV device already has 3 LVDS capable pairs with internal terminations going over to this interface; the main issue is that this still leaves the differential clock (which is thankfully lower at only 65-72MHz). I can likely do the clock over emulated LVDS transmitter pin pairs, but that still requires external terminations that would need to be populated after the IDC connector and on the daughterboard before driving the panel wiring. For a short run this may be okay, kind of just praying there a bit because I don't know how much skew might occur. At least the fast data lanes are already terminated internally on the FPGA so that should help out a bit....probably just the clock and maybe VCCIO levels not being 2.5V that may become show stoppers (I think it's set to 3.3V, but I read online someone still got it to work with that setting).
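For a rough idea of what those data pairs have to carry, the per-lane rate follows directly from the pixel clock and the x7 SERDES factor mentioned above. A quick sketch, with hypothetical function names; the 18 colour bits plus 3 sync/enable bits per pixel is the usual arrangement for 18bpp panels with 3 data pairs and is assumed here rather than taken from a specific panel datasheet.

```python
# Back-of-envelope LVDS numbers for the clock range quoted above.
# Function names are hypothetical; panel bit mapping is an assumption.

def lvds_lane_rate_mbps(pixel_clock_mhz, serdes_factor=7):
    """Serial bit rate on each data pair, in Mbit/s."""
    return pixel_clock_mhz * serdes_factor


def bits_per_pixel_clock(data_lanes=3, serdes_factor=7):
    """Total bits moved per pixel clock across all data pairs."""
    return data_lanes * serdes_factor


low = lvds_lane_rate_mbps(65)      # slowest clock considered
high = lvds_lane_rate_mbps(72)     # fastest clock considered
payload = bits_per_pixel_clock()   # 18 colour bits + 3 sync/enable bits
print(low, high, payload)
```

So each data pair would run at roughly 455 to 504 Mbit/s, which is the figure the board routing and terminations need to cope with.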

porcupine · 2014-09-18 06:31

This sounds really nice. With memory capacity like this I wonder about the feasibility of porting a small operating system to run on it. It'd be really nifty to port EmuTOS (modern reimplementation of the Atari ST TOS/GEM OS) to it. Could build a nice fun microcomputer project.

Todd Marshall · 2014-09-18 08:50

rogloh wrote: »

Hi Todd. Actually these COG changes I added don't really touch the hub memory; that was the point, to avoid running from hub. With the latest changes above we can keep a stack, access a data segment and ultimately store more PASM/assembled C code in COG RAM for the highest performance (with expanded COG RAM). That's what we are working to here. Even my earlier hub based push/pop LMM stuff wouldn't gain much from any more hub bandwidth as we need to loop and execute the instruction which already fills in the remaining cycles between hub accesses. Once we lock to the hub after the first read instruction we are running flat out from then on.

Changes to the hub allocation hardware are no doubt doable, but you'll get lots of differing views on the best model. It has been covered so much before but from what I recall no-one came up with a simple/flexible model with enough buy in. Anyway, I'd say that discussion best suits another thread outside this one.

My point was, that as you make the COGS heterogeneous, you may want to change the sequence in which they are invoked (serviced). This has nothing to do with HUB memory ... except that sequence is managed in the HUB logic.

Of course it is doable. And maybe in time I'll do it and prove myself wrong about the utility of the feature. Right now I'm climbing the learning curve.

It just seems like right now, with this kind of activity going on with the configurations of the COGs, someone with interest and jock status with Verilog would give it a try. If nothing else, the resources consumed in switching from a shift register to an indirect addressed (mapped) pipelined memory vector of arbitrary long length could be shown to be trivial.

Tubular · 2014-09-18 18:12

rogloh wrote: »

Thanks, no problem, I was hoping it may interest other people. I recall you might have mentioned LVDS sometime back too so you might be interested in this. I noticed the DE-0 nano has this lower set of IDC pins (shared with ADC) which if it could be made to work would make a great connection to a 7-10 inch LVDS LCD panel (maybe with capacitive touchscreen). A small connector daughterboard would be a great project and I'm considering that if I can make the time to design another board, unless someone else who is interested builds one first (I know you make some nice boards Lachlan, LOL).

I'm just hoping the routing on the DE-0 nano board to these pins was designed well enough to carry a 4 lane 18bpp LVDS @ 65-72MHz clock (SERDES factor of x7) or so with no serious signal integrity issues but I don't know yet. Will just have to try it I guess. But if it worked that would leave the top rows of pins free for general GPIO on two ports A,B. It would then all fit my needs nicely.

On the DE-0 nano it appears the Cyclone IV device already has 3 LVDS capable pairs with internal terminations going over to this interface, the main issue is that still leaves the differential clock (which is thankfully lower at only 65-72MHz). Can likely do the clock over emulated LVDS transmitter pin pairs but it still requires external terminations that would need to be populated after the IDC connector and on the daughterboard before driving the panel wiring. For a short run this may be okay, kind of just praying there a bit because I don't know how much skew might occur. At least the fast data lanes are already terminated internally on the FPGA so that should help out a bit....probably just the clock and maybe VCCIO levels not being 2.5V that may become show stoppers (I think its set to 3.3V, but I read online someone still got it to work with that setting).

Yeah I would like to try a LVDS display, and it would be really neat if it could be persuaded to work on that 26 pin header.

I think it would be a good exercise to see what the issues may be for P2 - eg what termination resistances and levels and frequencies are required. I think like you say it would be worth just jumping in and trying it. If we strike timing or phase issues then running on a DE2 (HSMC) or BeMicroCV are alternatives with enough matched lanes that might get something up and running.

The Pixel Qi kit from adafruit appeals. It uses 3v3 LVDS and a relatively friendly 22 to 46 MHz dotclock. It has an encoder board that we could scope to cross check timing. I think it'd be a good stepping stone to more serious LVDS, higher clock rate screens.

edit: If we need to make up some quickturn pcbs, that's easy to do too.

rogloh · 2014-09-18 18:51

Funny you mentioned that display at adafruit, as I actually had my eye on one of these guys, also from adafruit: http://www.adafruit.com/products/1033 or http://www.adafruit.com/products/1667. Possibly to cannibalize, experiment with, and maybe try to fit the nano inside. That would probably be my ideal. If it failed I'd end up with a potentially useful screen for something else anyway. Ebay probably has cheaper knock offs, not sure what panel you'd end up with though.

Another good feature of the DE-0 nano is that it has an inbuilt accelerometer. So I was thinking something allowing a handheld form factor would be nice (esp. with touch). I wonder if you can retrofit slim touch screen films over these panels? Maybe that might be possible too, or find something else that already has it.

rogloh · 2014-09-18 20:39

porcupine wrote: »

This sounds really nice. With memory capacity like this I wonder about the feasibility of porting a small operating system to run on it. It'd be really nifty to port EmuTOS (modern reimplementation of the Atari ST TOS/GEM OS) to it. Could build a nice fun microcomputer project.

I know, it certainly would be nifty to be able to run LMM at full hub speed from the 32MB available as expanded HUB RAM, then port some existing OS or add some home grown OS and have the gfx/text framebuffer mapped into the same RAM address space too. That is basically the capability I want as my end goal and am trying to (slowly) migrate towards it. Once I get extra GCC support maximizing LMM's new potential with RDLONG WC, WRLONG WC and my planned SDRAM driver working, it's on!

porcupine · 2014-09-19 06:16

rogloh wrote: »

I know, it certainly would be nifty to be able to run LMM at full hub speed from the 32MB available as expanded HUB RAM, then port some existing OS or add some home grown OS and have the gfx/text framebuffer mapped into the same RAM address space too. That is basically the capability I want as my end goal and am trying to (slowly) migrate towards it. Once I get extra GCC support maximizing LMM's new potential with RDLONG WC, WRLONG WC and my planned SDRAM driver working, it's on!

Let me know if you get there. Right now I'm playing with baremetal ARM (via QEMU anyways) in my 'spare time', waiting for the Prop 2. But it certainly would be interesting if someone could get P1V to a place where it could be used for 'small computer' projects rather than just 'microcontroller' ones.

Bill Henning · 2014-09-24 07:53

Following all this nice work with interest... looking forward to having time to play, but it's looking like I won't have that time until late Oct.

DavidZemon · 2014-09-24 08:01

proper indirect addressing?!?! on the propeller!?!?! You rock

Take my money now! I've always felt this was one of the biggest shortcomings of the propeller. When will I see this in an ASIC

rogloh · 2014-09-24 08:53

SwimDude0614 wrote: »

proper indirect addressing?!?! on the propeller!?!?! You rock Take my money now! I've always felt this was one of the biggest shortcomings of the propeller. When will I see this in an ASIC

Cheers SwimDude0614.

I know how you feel - it was a real limitation. We've had eight 32 bit 20 MIPS Propeller processors available, but in many situations they really couldn't come close to running C code as fast as an AVR could, particularly with simple 8 bit data stuff. I felt that sucked and I hope to help fix it on the P1V, especially with expanded COGRAM and MUL/MULS where it will start to really shine. Should then be able to kick an AVR to the kerb. Not that I dislike AVRs or anything. I actually really like those Atmel chips too, played around with them for years, they perform well.
