DDR, SDRAM, and DAC thoughts + the future
pedward
Posts: 1,642
Rather than contribute more stuff to the blog thread, here are my thoughts.
I think the P2 should target the standard 16-bit SDRAM that Chip already has working. The chip should offer hardware-assisted SDRAM access, though, as a peripheral. The Verilog for accessing the SDRAM is already out there, and Chip already knows what is necessary. I think 16-bit SDRAM with hardware-assisted access is the best way to go.
All of Port C and half of Port B are used by the SDRAM access, which leaves 48 bits for Chip's proposed high-speed DAC pins.
I spoke with him and he explained that there is now a message bus for setting pin info and states. He would expand this to allow the P2 to use the message bus to set DAC states (low speed bus).
The video circuitry requires the high speed DAC bus, which is presently 288 bits wide, or 9 bits times 32 DACs. He proposes to map the high speed signals to dedicated per-COG per-pin locations.
This would look something like the low half of Port B and the upper half of Port A, where COGs 4-7 would be Port B pins 0-15 and COGs 0-3 would be Port A pins 16-31. If you don't plan on using video output or SDRAM, the whole of Ports A, B, and C would still allow "low" speed DAC output. According to Chip, the port updates basically as fast as you issue changes in software. I don't know what the limits of this would be.
On the subject of DDR SDRAM, I think the way to go is to leave that for future development. I would recommend using a PoP-packaged LPDDR1 chip, since those are 1.8V and support a 200MHz clock. A 128Mx32 or 64Mx32 part would give a peak of 1,600MB/s at 200MHz.
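For anyone checking the 1,600MB/s figure, here is a minimal sketch of the arithmetic. It assumes a 32-bit data bus with two transfers per clock (DDR) and ignores refresh, turnaround, and command overhead, so it is a peak number only. The function name is just for illustration.

```python
def peak_bandwidth_mb_s(clock_mhz, bus_width_bits, transfers_per_clock=2):
    """Peak transfer rate in MB/s (10**6 bytes per second).

    transfers_per_clock is 2 for DDR (data on both clock edges), 1 for SDR.
    """
    bytes_per_transfer = bus_width_bits / 8
    return clock_mhz * transfers_per_clock * bytes_per_transfer


# The x32 LPDDR1 part from the post, clocked at 200MHz.
print(peak_bandwidth_mb_s(200, 32))  # 1600.0
```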
The use of PoP would allow Chip to allocate however many pins he needs. It could be possible that we would gain back Ports A, B, and C in full, because dedicated pins could be allocated to SDRAM and managed directly by Verilog.
I would make the SDRAM accessible as a tertiary memory, with instructions that are similar to those for HUB memory. I am pretty certain that the access wouldn't work the same as HUB memory, because of the setup and turnaround, so it would probably be a Long Quad (8 longs) in N hub cycles. Writing directly to the AUX RAM is interesting, because it allows DMA to occur while HUB access is occurring too.
I can't help but think that perhaps a cleaner interface to the SDRAM would be via an MMU. The MMU would map the SDRAM into HUB address space (above 128K) and allow the SDRAM to appear as HUB RAM. Then there would be no additional instructions and there would only need to be one hardware element to access the SDRAM, because it would happen in the hub cycle. The caveat would be that you'd wait twice as long for a HUB access to the SDRAM (theoretically; I assume the overhead of getting a Quad from SDRAM would be 2x that of normal HUB RAM).
If SDRAM were linearly mapped into HUB address space, it would facilitate direct execution of C and SPIN using the existing code that executes from hub.
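To make the mapping idea concrete, here is a rough sketch of the address decode such an MMU would do. It is a software model only, not how the hardware would be built. The 128K boundary comes from the post; the 2x cost for SDRAM is the post's guess; the names `decode`, `HUB_COST`, and `SDRAM_COST` are made up for illustration.

```python
HUB_BYTES = 128 * 1024  # on-chip hub RAM; SDRAM is mapped directly above it

# Relative access cost in hub windows. The 2x figure for SDRAM is the
# post's guess, not a measured number.
HUB_COST = 1
SDRAM_COST = 2


def decode(addr):
    """Map a flat address to (region, offset within region, cost)."""
    if addr < 0:
        raise ValueError("address must be non-negative")
    if addr < HUB_BYTES:
        return ("hub", addr, HUB_COST)
    return ("sdram", addr - HUB_BYTES, SDRAM_COST)


print(decode(0x1FFFF))  # ('hub', 131071, 1)
print(decode(0x20000))  # ('sdram', 0, 2)
```

The point of the flat map is that code fetching from hub addresses would not need to know which region it is hitting; only the wait time changes.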
I think that as the Prop moves towards a P8x32-D model and a P8x64-A model, some consideration should be given to a proper MMU for hub access, allowing memory pages and protection. If you have 128MB of RAM and 1280+ MIPS with 64-bit operands, an MMU with paging and protection capability really needs to be considered, to allow operating systems to implement ring levels.
I would make it so COGs could be initialized with a ring level, and they could only access memory that was mapped at their own ring level or a less privileged one (a numerically equal or higher RING). RING 0 would be COG 0; it would run the core kernel that divvies up memory to other COGs. COGs would be started by COGINIT with RING 0, 1, 2, or 3. A COG can COGINIT another COG, but only to a RING that is equally or less privileged than its own: RING 0 can INIT 0,1,2,3. RING 1 can INIT 1,2,3. RING 2 can INIT 2,3. RING 3 can INIT 3. This provides for de-escalation of privileges, but if a COG re-inits itself at a less privileged RING, the chip could get into a state where no COG is at RING 0. That may be by design in an embedded program, however. You initialize all of your memory at start, then drop privs for the running of everything else. That way you can't move across boundaries, EVER, preventing security compromises.
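The COGINIT rule above boils down to one comparison. Here is a minimal sketch, assuming RING 0 is the most privileged and RING 3 the least; `can_coginit` and `allowed_targets` are hypothetical names, not real P2 instructions.

```python
RINGS = range(4)  # 0 = most privileged, 3 = least privileged


def can_coginit(current_ring, target_ring):
    """True if a COG at current_ring may start a COG at target_ring."""
    if current_ring not in RINGS or target_ring not in RINGS:
        raise ValueError("ring must be 0..3")
    # Privileges can stay the same or drop, never rise.
    return target_ring >= current_ring


def allowed_targets(current_ring):
    """All rings a COG at current_ring is allowed to hand out."""
    return [r for r in RINGS if can_coginit(current_ring, r)]


for ring in RINGS:
    print(ring, allowed_targets(ring))
```

Because the check only ever lets the ring number stay put or go up, once the last RING 0 COG drops itself there is no path back, which is the "drop privs and never cross boundaries" behavior described above.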
I don't think execute disable applies, since HUB RAM has no notion of it and it can't be enforced.
Paging is tricky. Page table mapping can quickly lead to non-deterministic execution, which is bad. Allowing virtual mapping is useful for virtual memory and for creating copy-on-write memory allocations. I just don't think a microcontroller should be doing swapping, because it destroys deterministic execution.
So, the real purpose of the MMU would be to manage access to SDRAM and to implement security bits.
Conventionally, pages are 4KB in size. With a P3 and 512 64-bit values per COG, that's also 4KB per COG.
Perhaps having the ability to swap COG memories could be useful for task switching.
If you had 128MB of RAM (1Gb memory size), that's 32768 4KB pages. To store a RING value with each page, at 4 bits per page, would require 16KB of RAM.
Either the SDRAM could be used to store a RING value (with some HUB caching), or page size could be increased to substantially more than 4KB.
4KB is pretty small in today's world. The Huge Page extension to x86_64 starts at 2MB pages, which for 128MB would only need 64 x 4 = 256 bits to store permissions, which is trivial. Perhaps 1024 bits could be allocated for permissions mapping inside the MMU, allowing up to 4Gb memories (512MB).
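The table sizes in the last few paragraphs can be checked with a few lines. This is a sketch that assumes 4 bits of RING storage per page, which is what the 16KB and 64*4 figures imply; the helper names are made up.

```python
KB = 1024
MB = 1024 * KB


def ring_table_bits(mem_bytes, page_bytes, bits_per_page=4):
    """Bits needed to hold one RING value for every page of memory."""
    pages = mem_bytes // page_bytes
    return pages * bits_per_page


def max_memory_bytes(table_bits, page_bytes, bits_per_page=4):
    """Largest memory a fixed-size RING table can cover."""
    return (table_bits // bits_per_page) * page_bytes


print(ring_table_bits(128 * MB, 4 * KB) // 8 // KB)  # 16 (KB of table, 4KB pages)
print(ring_table_bits(128 * MB, 2 * MB))             # 256 (bits of table, 2MB pages)
print(max_memory_bytes(1024, 2 * MB) // MB)          # 512 (MB covered by 1024 bits)
```

The same two functions work for any other page size, so it is easy to see how quickly the table shrinks as pages grow.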
Perhaps a simpler compromise could be made: let's say the page size is the same as the size of HUB RAM. On a P3 it would be a safe estimate to have 512KB of HUB RAM, because halving the lithography gives 4x the capacity in the same area. It would be arranged as 64K x 64 logical.
So you have 512KB pages and 4 bits per page for RING. 128MB is 256 pages x 4 bits = 1024 bits = 128 bytes (256 bytes if each page gets a whole byte). Give it 1KB, at a byte per page, and you've got 512MB in 512KB pages, with a RING per page.
If a COG tries to read a page that it doesn't have permission for, it would get a repeated value: perhaps 0, perhaps the RING value of the COG, perhaps the RING value for the address it tried to read. A flag could be set too.
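Here is a toy model of that read behavior, as a sketch only. It assumes the "return 0 and set a flag" option from the choices above, and it treats a page as readable when its RING is numerically equal to or higher than the COG's. The names `can_access` and `guarded_read` are hypothetical.

```python
def can_access(cog_ring, page_ring):
    """A COG may touch pages at its own ring or a less privileged one."""
    return page_ring >= cog_ring


def guarded_read(memory, page_rings, page_size, cog_ring, addr, fill=0):
    """Return (value, fault_flag) for a read by a COG at cog_ring."""
    page_ring = page_rings[addr // page_size]
    if not can_access(cog_ring, page_ring):
        return fill, True   # denied: fixed fill value, flag set
    return memory[addr], False


memory = list(range(100, 108))   # 8 words of toy memory
page_rings = [0, 3]              # page 0 is kernel-only, page 1 is open to all
print(guarded_read(memory, page_rings, 4, cog_ring=3, addr=1))  # (0, True)
print(guarded_read(memory, page_rings, 4, cog_ring=3, addr=5))  # (105, False)
print(guarded_read(memory, page_rings, 4, cog_ring=0, addr=1))  # (101, False)
```

Swapping `fill` for the COG's RING or the page's RING would model the other two options; either way the read completes in fixed time, which keeps execution deterministic.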
That's all for now.
Comments
Does this message bus mean hub-like time-slot access (one COG at a time)?
How many high-speed DACs can then be controlled from one COG?
Do you mean something like this?
http://download.micron.com/pdf/datasheets/dram/mobile/ddr_mobile_sdram_only_168b_pop.pdf
i.e. a 12mm package, with balls on the outer ring only, to allow stacking.
A dual-die approach certainly has appeal, and is used by RasPi and Nuvoton ARM parts, so is worth looking carefully at.
Nuvoton manage to do theirs, and keep a TQFP128 package.
A P2R version, that allowed an upgrade of P2 designs, would have obvious design-in ramp advantages.
There is a neat idea, with message passing, to set ANY pin to any arbitrary value, so only the per-COG high speed DAC is the limited resource.
Going from micro-controller to a microprocessor hybrid.
Why not just use TI's OMAP line instead at that point? The COGs aren't going to compete with upper-end ARMs for performance, and if you need to run Linux or, say, QNX, there are better choices than a chip without a normal bus architecture.
Combine these ideas with the P2 being in perpetual revision mode, and I really wonder if we'll ever see working silicon in 2014.
Take a chill pill. No one is proposing an MMU for the P2.