SDRAM and the PropII

jmg · 2013-12-13 12:01

Seairth wrote: »

Fair enough. Note, however, that there are other 8-bit buses that this could be used with. For instance, parallel-to-serial shift registers still have their place. And believe it or not, there are still legacy US Navy systems that use 8-bit ISA!

But I'm not going to push this one. If it's trivial to add, it might be worth doing. If it's not, then there's always bit-banging...

There are also FIFOs, and even FPGAs, or other controllers' parallel bus interfaces...?

It seems a valid question to ask whether the SDRAM streamer engine will allow x8 or x32 choices.

Seairth · 2013-12-13 12:05

Here's my current understanding of the SDRAM support:

It is still going to largely be supported via software. Chip is improving the XFR hardware to be a bit more user-friendly, but you will still be bit-banging the addresses, commands, etc (though the new SETFLD instructions might help a bit here too).
There is no special-purpose hardware support for SDRAM. As a result, the pin assignment is entirely user-defined (in software).
jazzed, just look at the balls.spin example code in Chip's latest code release (see the link earlier in this thread). Towards the bottom of the file, you will see the SDRAM driver code. It's fairly short and easy to understand. Even with Chip's other improvements, I expect the new version of the code to be roughly similar to the current version (functionally speaking). Note that you will not see any code for setting up the CLK pin. My vague recollection is that this is automatically wired up for the FPGAs, while the final version of the driver will have to do some CTR configuration.

jazzed · 2013-12-13 12:22

Seairth wrote: »

If I get your meaning, that's why I asked if the XFR changes were going to include writing to cog ram (in addition to aux and hub). While this wouldn't allow you to execute out of external RAM "directly", it would be possible to perform LMM-style execution without ever having to touch the hub.

Or, with careful partitioning of the hub space and the use of the new HUBEXEC stuff, I could easily see each cog having a cog-local SDRAM driver that's managing the executable code blocks in the hub for that same cog. For instance, you could encode something like a "long jump" in the hub code, which would do the following steps:
Invoke the local SDRAM driver (in cog-execution mode) to read the new code into the HUB.

Perform a JMPA (or whatever Chip's calling it now) to switch back to hub-execution mode.

I could see some support for caching in the HUBEXEC COG based on switching from HUB to COG instructions if necessary. The problem though is that we really need some hardware caching algorithm support for that. Doing caching algorithms in PASM is not easy because we don't have things like a cache hit instruction or even a primitive like a magnitude comparator (which I asked for about 5 years ago and was told would never be added).

At the moment it takes about 20+ instructions to do a simple caching algorithm in P1. Maybe P2's instruction set is more caching friendly now. I'm aware of Chip's cache COG; it doesn't do any caching.
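For anyone following along, here is a minimal sketch of what a "simple caching algorithm" has to do on every access: split the address into tag, index and offset, compare the tag, and refill the line on a miss. It is written in Python for readability, not PASM. The line size, line count and the `read_block` callback (standing in for an external memory driver that only reads blocks) are assumptions for illustration, not anything from Chip's or PropGCC's code.

```python
LINE_SIZE = 64   # bytes per cache line (assumed for illustration)
NUM_LINES = 8    # number of lines kept in fast memory (assumed)


class DirectMappedCache:
    """Direct-mapped read cache in front of a block-read driver."""

    def __init__(self, read_block):
        # read_block(addr, size) -> bytes is a hypothetical driver call
        self.read_block = read_block
        self.tags = [None] * NUM_LINES
        self.lines = [bytes(LINE_SIZE)] * NUM_LINES
        self.hits = 0
        self.misses = 0

    def read_byte(self, addr):
        offset = addr % LINE_SIZE      # position inside the line
        line_no = addr // LINE_SIZE    # which line of external memory
        index = line_no % NUM_LINES    # which cache slot it maps to
        tag = line_no // NUM_LINES     # identifies the line in that slot
        if self.tags[index] != tag:    # the "cache hit" test
            self.misses += 1
            self.lines[index] = self.read_block(line_no * LINE_SIZE, LINE_SIZE)
            self.tags[index] = tag
        else:
            self.hits += 1
        return self.lines[index][offset]


# Usage: a 4 KB fake external memory behind the cache.
memory = bytes(range(256)) * 16
cache = DirectMappedCache(lambda addr, size: memory[addr:addr + size])
first = cache.read_byte(0)     # miss, fills slot 0
second = cache.read_byte(1)    # hit in the same line
third = cache.read_byte(512)   # maps to slot 0 again, evicts the first line
```

Each of the divide, modulo, compare and branch steps above costs at least one instruction when done by hand, which is roughly where the instruction count comes from.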

David Betz · 2013-12-13 12:35

jazzed wrote: »

I'm aware of Chip's cache COG; it doesn't do any caching.

Ummm... As far as I know, Chip doesn't have a "cache COG". He has an SDRAM driver that was mostly designed for providing video data but can also be used by the cache code in the PropGCC XMM kernel to read/write cache lines. The cache code itself will be in the XMM kernel. The driver, like the ones currently in the default branch of the propgcc tree, is just an external memory driver whose only function is to read/write blocks of external memory. It isn't a cache driver.

jazzed · 2013-12-13 12:56

David Betz wrote: »

The cache code itself will be in the XMM kernel. The driver, like the ones currently in the default branch of the propgcc tree, is just an external memory driver whose only function is to read/write blocks of external memory. It isn't a cache driver.

Well the caching algorithm has to go somewhere

I suppose XMM is really stuck with the VM model unless we can find a way to make better use of HUBEXEC COG resources.

David Betz · 2013-12-13 13:37

jazzed wrote: »

Well the caching algorithm has to go somewhere

Maybe some day (P3 timeframe?) it will go in hardware! :-)

I suppose XMM is really stuck with the VM model unless we can find a way to make better use of HUBEXEC COG resources.

I've been talking with Eric about the possibility of having a kernel that could handle hub execution (to replace LMM) and maybe CMM and XMM as well. As you say, XMM will have to be handled by a VM scheme at least for the memory access instructions but it might still be nice to use it for outer loops and less time-critical stuff and use hub mode for anything requiring good performance.

jazzed · 2013-12-13 14:29

David Betz wrote: »

I've been talking with Eric about the possibility of having a kernel that could handle hub execution (to replace LMM) and maybe CMM and XMM as well. As you say, XMM will have to be handled by a VM scheme at least for the memory access instructions but it might still be nice to use it for outer loops and less time-critical stuff and use hub mode for anything requiring good performance.

Hmm, I suppose that complicates things for the HUBTEXT label used in XMM modes to force functions into HUBRAM a little. I suppose the HUBTEXT code could be run in HUBEXEC mode while XMM code could be run with the VM assuming all COGs have HUBEXEC.

Bill Henning · 2013-12-13 15:48

Yep, XMM and CMM will still need a kernel.

Fortunately, there is no longer a need for a kernel for hubexec mode, as it totally replaces LMM.

David Betz wrote: »

I've been talking with Eric about the possibility of having a kernel that could handle hub execution (to replace LMM) and maybe CMM and XMM as well. As you say, XMM will have to be handled by a VM scheme at least for the memory access instructions but it might still be nice to use it for outer loops and less time-critical stuff and use hub mode for anything requiring good performance.

rogloh · 2013-12-13 16:39

Is there any merit in having some sort of XMM model that could have a COG directly attached to external SDRAM pull in (largish) sections of cached instructions to hub RAM where it can then be executed in hubexec mode either by this same COG (or perhaps even other COGs) for higher performance, until there is some miss condition? Could this even be done, and would it buy us anything in performance for larger programs? I'm wondering how the miss condition can even be detected, perhaps it needs special opcodes to trigger a new page read if you jump outside a valid page???

It may not be possible, and might not help, but it's just an idea to throw out there... not trying to request that we add paging/virtual memory H/W support to P2 or anything.
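A rough sketch of the idea, in Python rather than PASM: every "long jump" checks whether the target page is already resident in hub RAM and, on a miss, asks the SDRAM driver to load it before switching back to hub execution. The page size, slot count and the `HubPager` name are assumptions made up for illustration. The load counter shows how quickly a small number of slots starts reloading pages when code on two different pages keeps calling back and forth.

```python
PAGE_SIZE = 4096   # bytes per code page (assumed for illustration)


class HubPager:
    """Tracks which external-memory code pages are resident in hub RAM."""

    def __init__(self, slots):
        self.slots = slots     # how many pages fit in hub RAM
        self.resident = []     # page numbers, least recently used first
        self.loads = 0         # number of SDRAM page reads so far

    def long_jump(self, target_addr):
        """Make the page holding target_addr resident; return its number."""
        page = target_addr // PAGE_SIZE
        if page in self.resident:
            self.resident.remove(page)      # hit: just refresh its age
        else:
            self.loads += 1                 # miss: page must be read in
            if len(self.resident) == self.slots:
                self.resident.pop(0)        # evict least recently used
        self.resident.append(page)
        return page


# Two functions on different pages calling each other four times.
one_slot = HubPager(slots=1)
two_slots = HubPager(slots=2)
for _ in range(4):
    for pager in (one_slot, two_slots):
        pager.long_jump(0x0000)
        pager.long_jump(0x2000)
```

With one slot every jump reloads a page; with two slots only the first two jumps do. The sketch ignores relocation entirely, which a real loader would have to handle.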

Bill Henning · 2013-12-13 16:54

A compiler could be written to do something like that, however the issue is that functions call other functions... and would require relocation while loading, and much thrashing would ensue.

An XLMM kernel will be needed for now, until a future chip (P3? P4?) adds sufficient caching infrastructure to run code from SDRAM/DDRx.

rogloh wrote: »

Is there any merit in having some sort of XMM model that could have a COG directly attached to external SDRAM pull in (largish) sections of cached instructions to hub RAM where it can then be executed in hubexec mode either by this same COG (or perhaps even other COGs) for higher performance, until there is some miss condition? Could this even be done, and would it buy us anything in performance for larger programs? I'm wondering how the miss condition can even be detected, perhaps it needs special opcodes to trigger a new page read if you jump outside a valid page???

It may not be possible, and might not help, but it's just an idea to throw out there... not trying to request that we add paging/virtual memory H/W support to P2 or anything.

David Betz · 2013-12-13 19:47

jazzed wrote: »

Hmm, I suppose that complicates things for the HUBTEXT label used in XMM modes to force functions into HUBRAM a little. I suppose the HUBTEXT code could be run in HUBEXEC mode while XMM code could be run with the VM assuming all COGs have HUBEXEC.

Yes there are lots of details that would have to be worked out and it might turn out not to be practical in the end but I thought it was worth thinking about.

David Betz · 2013-12-13 19:49

Bill Henning wrote: »

Yep, XMM and CMM will still need a kernel.

Fortunately, there is no longer a need for a kernel for hubexec mode, as it totally replaces LMM.

This is why I thought it might be possible to support XMM and hub execution at the same time. Hub execution doesn't require much COG space. In the case of the current propgcc ABI it could be just 16 registers plus LR and PC. The rest of the space could be used for an XMM kernel with ways of switching modes back and forth between hub execution and XMM execution. Something similar could be done with CMM to allow more compressed code in hub memory but also allow fast hub exec code for time-critical code.

rjo__ · 2013-12-14 11:52

I know that there is no such thing as a "stupid question," but this might be an exception.

Do the recent changes mean that we might be able to write to one SDRAM chip while we read from the other?

Seairth · 2013-12-14 12:58

rjo__ wrote: »

I know that there is no such thing as a "stupid question," but this might be an exception.

Do the recent changes mean that we might be able to write to one SDRAM chip while we read from the other?

Can you give a more concrete example?

potatohead · 2013-12-14 13:25

One might be a video display coupled with a larger program. The larger program would reside in its memory space. It could direct the COGS driving the video display to do lots of things cached in that video RAM area. Think PC card running on a bus type structure rather than the more traditional Propeller shared memory model.

Dedicate a large amount of RAM to each, and there wouldn't need to be much communication between the two, particularly if a blitter, sprites and backing store were implemented on the video side of things. Data about that would land in the HUB, or stream over PORTD, leaving both to perform well.

Frankly, depending on the fill rates needed, the video memory bus could be smaller and handle a lot of non-game applications very well, saving pins for other things. The reverse is true, depending on the structure and performance requirements of a larger program.

On boot, the whole thing initializes, fetching assets and code from some connected storage. That takes a while, but then it's running at speed when done. The larger program is pretty easy. It just gets loaded in. Video tasks may include fonts, starting up the various COGS, init of the screen area, objects such as icons, sprites, window widgets and whatever procedural elements are necessary. From there, clean up and set pointers and register mailboxes for comms.

After that, the main program makes calls to its video, handling very few actual assets for a speedy interaction.

Jim Bagley did something very similar to this on P1, using two P1's for PROPGFX. It ran over serial, or an 8 bit parallel connection.

On P2, just do it using a shared HUB memory region to buffer calls and results, etc...

Rjo is wanting a camera and various other high bit-depth display things. Doing it this way would make a lot of sense, IMHO.

Given what we've learned so far, nothing prevents this, other than pin limits.

jmg · 2013-12-14 13:56

I think we still do not know if the SDRAM engine will support 2 CLK lines, and is contained in each COG ?

SDRAMs do not pin-share well, as the CE#/CKE controls gate the state-in lines but do not disable the DQ.

That said, there could be a hardware solution where there is ONE CLK, and multiple CE#, and each SDRAM has its own DQ connection. (i.e. DQ are not shared, CE# are separate, but CLK, RAS, CAS, A0..A12 etc. are common.)

The CE# of the two SDRAMs interleave the setup information, so that runs at <= 50% bandwidth.
Once done, the read bursts would clock from both SDRAMs into each DQ at full speed.

The data sheets show low-density CE# signals, so HW control interleave may be possible.
It would need Mux hardware (which may already be there, if Chip has one per COG ?)

This is then a small morph to run that Control line hardware at high swap rates. (DQ map separately)
(just like some SRAM systems interleave Video read, and possible Video Write )

The rule would be one COG per connected SDRAM chip.
Incremental pin cost per SDRAM chip added becomes 1 x CE# and 16 x DQ ( or, maybe 8 x DQ ?)

A simple 50% slot allocate, on SDRAM commands, would make all transactions deterministic
Data transfers, up to Page Size, are at 100% bandwidth, over each DQ highway.
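To make the slot idea concrete, here is a small Python sketch of the bookkeeping only: command slots on the shared CLK/RAS/CAS/address lines alternate between the chips, so the wait for a chip's next command slot is bounded, and each added chip costs one CE# plus its own DQ pins. The function names are made up for illustration and nothing here models real SDRAM timing.

```python
def command_slot_owner(slot, num_chips=2):
    """Which SDRAM chip may issue a command in this shared-bus slot."""
    return slot % num_chips


def next_command_slot(now, chip, num_chips=2):
    """First slot at or after `now` that belongs to `chip`."""
    return now + (chip - now) % num_chips


def extra_pins(added_chips, dq_width=16):
    """Incremental pin cost: one CE# plus a private DQ bus per added chip."""
    return added_chips * (1 + dq_width)


# Worst-case wait for a command slot with two chips is one slot.
worst_wait = max(
    next_command_slot(now, chip) - now
    for now in range(8)
    for chip in range(2)
)
```

With two chips each one owns every other command slot (the 50% allocation), while data bursts run on separate DQ pins and are not slotted at all.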

Baggers · 2013-12-14 14:11

When I did DuoGfx with two P1s, they used both sets of hub RAM to generate a 256x192 8-bit bitmap. The props alternated generating the display, each floating its pins whilst the other prop drew a scanline, then swapping. That put 256x96 (24KB) in each prop, leaving space for sprites as well! So this wouldn't be like the pin-sharing jmg is talking about.
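The numbers work out as follows. This is a quick Python check, assuming one byte per pixel for the 8-bit bitmap, the P1's 32 KB of hub RAM, and a simple even/odd split of scanlines between the two props. The even/odd ownership is an assumption for illustration; the post only says the props alternate.

```python
WIDTH, HEIGHT = 256, 192
BYTES_PER_PIXEL = 1           # 8-bit bitmap
P1_HUB_RAM = 32 * 1024        # bytes of hub RAM on a P1

full_frame = WIDTH * HEIGHT * BYTES_PER_PIXEL   # whole bitmap
per_prop = full_frame // 2                      # 256x96 held by each prop
left_over = P1_HUB_RAM - per_prop               # room for sprites and code


def scanline_owner(line):
    """Which prop draws this scanline (assumed even/odd split)."""
    return line % 2
```

The full bitmap is 48 KB, which cannot fit in one P1's hub RAM, while each half is 24 KB and leaves 8 KB free per prop.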

Seairth · 2013-12-14 20:20

jmg wrote: »

I think we still do not know if the SDRAM engine will support 2 CLK lines, and is contained in each COG ?

SDRAMs do not pin-share well, as the CE#/CKE controls gate the state-in lines but do not disable the DQ.

That said, there could be a hardware solution where there is ONE CLK, and multiple CE#, and each SDRAM has its own DQ connection. (i.e. DQ are not shared, CE# are separate, but CLK, RAS, CAS, A0..A12 etc. are common.)

The CE# of the two SDRAMs interleave the setup information, so that runs at <= 50% bandwidth.
Once done, the read bursts would clock from both SDRAMs into each DQ at full speed.

The data sheets show low-density CE# signals, so HW control interleave may be possible.
It would need Mux hardware (which may already be there, if Chip has one per COG ?)

This is then a small morph to run that Control line hardware at high swap rates. (DQ map separately)
(just like some SRAM systems interleave Video read, and possible Video Write )

The rule would be one COG per connected SDRAM chip.
Incremental pin cost per SDRAM chip added becomes 1 x CE# and 16 x DQ ( or, maybe 8 x DQ ?)

A simple 50% slot allocate, on SDRAM commands, would make all transactions deterministic
Data transfers, up to Page Size, are at 100% bandwidth, over each DQ highway.

Yup! That's more or less what I was suggesting in my earlier post. As for hardware support, I don't think anything is being added for SDRAM (except some improvements to the XFR hardware). But I'd be happy to be wrong about that!

rjo__ · 2013-12-15 09:28

Potatohead said it far better than I could:)

I think part of my problem might have already been solved by changes to XFR, Hub RAM and Hubexec but since I am understanding about 20 percent of those conversations, I thought I would ask.

To give a real-world example, the Balls sample waits to fill the SDRAM before it displays anything. It would be far better if each rendering was shown incrementally at about 1 per second and then the animation would start.
From a camera point of view, we need to know what we are pointing the camera at before we push button #1:) Given all the upgrades, I think it might be possible to display a reduced resolution in near real time maybe:)

You guys are absolutely fantastic.

Thanks

Rich

potatohead · 2013-12-15 11:32

Is your camera going to have two displays, or will the one be used for both purposes?

Seems to me, in viewfinder mode, you just adjust the pixel clock down, until you get the fill rate and frame rate you want for the preview. Once the image capture has completed, adjust it up and scan it right out of the SDRAM, where you put it. Only one SDRAM needed.

Not that using two is a bad idea.

---assuming the clk issue above is resolved, and I think it is.

Alternatives include a monochrome preview, etc...

rjo__ · 2013-12-16 19:04

The improvements that Chip has created with waitpr and waitpf (if I have understood them correctly… and I should know soon:) make the coding a breeze. I started today and if my schedule stays the same, I should have it working by 2014 or 2015 at the very latest:) I have no doubt that a simple camera (single display) will be a no-brainer. The OV2640 actually allows you to drive its vsync and href signals, so synchronizing everything to a display seems more than possible.

But we don't really have enough Hubram for a single buffer… so single and double buffering has to be in the SDRAM. Being able to read from one buffer while writing to another (without killing the bandwidth) is important for all kinds of reasons.
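As a sketch of the address bookkeeping only (it says nothing about whether the bandwidth is there), double buffering in SDRAM can be pictured as two frame-sized regions whose roles swap at each frame boundary. The 640x480, 2 bytes per pixel frame and the `PingPongFrames` name are assumptions for illustration, not a statement about the camera mode that will actually be used.

```python
class PingPongFrames:
    """Two frame buffers in SDRAM: one being written, one being read."""

    def __init__(self, base_addr, frame_bytes):
        self.addrs = (base_addr, base_addr + frame_bytes)
        self.write_index = 0

    @property
    def write_addr(self):
        """Where the camera side stores the incoming frame."""
        return self.addrs[self.write_index]

    @property
    def read_addr(self):
        """Where the display side scans out the previous frame."""
        return self.addrs[1 - self.write_index]

    def swap(self):
        """Call at a frame boundary, once a capture has completed."""
        self.write_index = 1 - self.write_index


FRAME_BYTES = 640 * 480 * 2    # assumed frame: 640x480 at 2 bytes per pixel
frames = PingPongFrames(base_addr=0, frame_bytes=FRAME_BYTES)
before = (frames.write_addr, frames.read_addr)
frames.swap()
after = (frames.write_addr, frames.read_addr)
```

The display never reads the region the camera is filling, so a partially captured frame is never shown.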

I am interested in 3D analysis, so I will have two cameras and probably at least 3 prop2 boards talking to each other.

I can't tell you how excited I am at the prospects and how enjoyable I find the intelligent and purposeful conversations happening on this forum.

Best wishes

Rich
