Propeller II update - BLOG

potatohead · 2013-12-02 14:37

Re: What is the issue? It's the 2K of course

Personally, I prefer the low memory mapping, if the ROM needs to exist in the address space.

I also prefer a baseline, known present monitor too, though it can and should be small.

Given that code is largely authored, I'm for carrying it forward with the basic changes needed for the hardware changes. Low risk, low change, ship soon.

Low risk, low change, ship soon is pretty much my default preference from here forward.

I remain opposed to the hub slot timing changes. Reasons already given.

Cluso99 · 2013-12-02 14:41

jmg wrote: »

The question then is, just how much SerDes help is needed, to pack this into a single COG

I just don't know. Certainly it would need to read/write the block complete with stuffing/unstuffing and crc check/send. IMHO I think it's too much to ask and too restrictive for other uses.

What is the code-size cost of bit stuff/unstuff in SW ?

Bradc utilises a trick that just shifts each bit 6 times into a 32bit register and checks the result for Z. If you shift six 0 bits in 6x each, you get 36 zero bits (4 are lost) and you have a Z result. Any time a 1 bit is shifted in, it prevents the register from being Z until a further 6 sets of 0's are shifted in. It's a very simple and effective test.
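
To make the trick concrete, here is a minimal Python model of it. This is only a sketch of the idea as described above, not Bradc's actual PASM; the function name and the all-ones starting value of the register are assumptions chosen so the register does not read as zero before any bits arrive.

```python
MASK32 = 0xFFFFFFFF  # model a 32-bit cog register

def six_zero_flags(bits, reg=MASK32):
    """For each incoming bit, report whether the register reads zero (Z set)."""
    flags = []
    for b in bits:
        for _ in range(6):                      # shift the same bit in six times
            reg = ((reg << 1) | (b & 1)) & MASK32
        flags.append(reg == 0)                  # zero only after six 0 bits in a row
    return flags

# A 1 followed by six 0s: Z appears on the sixth 0 and clears on the next 1.
print(six_zero_flags([1, 0, 0, 0, 0, 0, 0, 1]))
```

After five 0 bits only 30 zero positions have been shifted in, so the top two bits of the register still hold copies of the last 1, which is why the test fires on exactly the sixth 0.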

With P2, Chip is going to implement a 1 clock instruction that reads a pin and XORs that with the C flag (which contains the previous bit) and places the result into the C flag (which is also ready for the next bit). The Z flag is set if the pin and its twin pair are both Z which is an SE0 condition which needs testing. The C flag is now used in another 1 clock instruction that Chip is going to implement (a single bit CRC adder using the C flag as the bit input).
Now we just shift the C bit into a byte accumulator (data byte), JZ if Z is set (SE0), shift the C flag into the unstuff counter/register and if Z is set, jmp to unstuff the next bit, otherwise go get the next bit.
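
As a rough illustration of the receive loop described above, the following Python sketch decodes sampled line levels and drops stuffed bits. It is not the proposed P2 instruction sequence: the function name is hypothetical, SE0 detection is left out, and it assumes the USB convention that no transition decodes as 1 and a transition as 0 (the inverse of a raw XOR of the pin with the previous bit).

```python
def nrzi_unstuff(levels, idle=1):
    """Decode sampled line levels (0/1) into data bits, dropping stuffed bits."""
    prev = idle   # previous line level (the role the C flag plays in the post)
    ones = 0      # consecutive decoded 1s seen so far
    out = []
    for level in levels:
        bit = 1 if level == prev else 0   # no transition = 1, transition = 0
        prev = level
        if ones == 6:
            # The bit after six 1s is a stuffed 0: discard it.
            # (A 1 here would be a stuffing error; this sketch does not check.)
            ones = 0
            continue
        out.append(bit)
        ones = ones + 1 if bit else 0
    return out

# Six 1s, a stuffed 0, then data bits 0 and 1, starting from an idle level of 1.
print(nrzi_unstuff([1, 1, 1, 1, 1, 1, 0, 1, 1]))
```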

With these 2 instructions (the CRC one also being of general use for bit-wise CRC calculation), I am unsure how much help the serdes will be unless it does sync detection, unstuffing, SE0 detection and CRC.
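
For readers unfamiliar with a single-bit CRC step, here is a hedged Python sketch of what such an instruction would compute per bit. The function names are made up for illustration, and the parameters (reflected polynomial 0xA001, all-ones start, inverted result) are the ones USB uses for its 16-bit data CRC; how the proposed P2 instruction would actually be parameterised is not stated in this thread.

```python
POLY_REFLECTED = 0xA001  # x^16 + x^15 + x^2 + 1, bit-reversed for LSB-first data

def crc16_step(crc, bit):
    """Fold one data bit into the CRC (one call per received bit)."""
    feedback = (crc ^ bit) & 1
    crc >>= 1
    if feedback:
        crc ^= POLY_REFLECTED
    return crc

def crc16_usb(data):
    """CRC over whole bytes, feeding bits LSB first as they appear on the wire."""
    crc = 0xFFFF
    for byte in data:
        for i in range(8):
            crc = crc16_step(crc, (byte >> i) & 1)
    return crc ^ 0xFFFF

print(hex(crc16_usb(b"123456789")))
```

Because the step works on one bit at a time, it can run after unstuffing, so stuffed bits never enter the CRC.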

I am not confident that I can adequately explain what is required to Chip. I'd hate for him to spend the time, miss something, and make it useless.
So I would rather see the few generic pin-pair instructions that I suggested, and Chip liked, implemented,
along with a generalised serdes that can be used in many ways, including chaining, etc.

Cluso99 · 2013-12-02 14:46

jmg wrote: »

Currently ROM is made by patching RAM, so that naturally makes it want to be small.
It also means 256k of RAM + 2K ROM, bumps to another address bit = more decode = some speed impact.
Memory generators also often are not expecting fractional pages, so manual intervention could be required.

It should be possible to alias the ROM to some high memory area, (for future proof) but it will still overlay into the top of the physical on-chip memory.

The ROM has to appear somewhere, so apart from cosmetic, what is the issue with 256k-2k ?

I don't think it has to appear anywhere. I think it could just be part of the special boot instruction that copies an external block into cog ram. It would be a little 2KB standalone block. I think it would just be a hardwired RAM cell with read only access.

potatohead · 2013-12-02 14:47

Getting rid of the DAC bus means that our custom routing largely goes away, so that the place-and-route tool will make all the connections from the core to both the memories and the pad frame. This is going to save tons of layout time.

NICE!! And quite a good trade for the DAC pin flexibility.

Seairth · 2013-12-02 14:52

cgracey wrote: »

I'm serious!

This is low-risk and it's the easiest way to make all that space useful. Changing to 12 cogs is simple. The biggest pain is remapping the COGINIT instruction, which is just a lot of busy work.

Also, Port D would have to be changed...

(Man, I totally chose the wrong 48 hours to stop reading this forum!)

David Betz · 2013-12-02 14:54

whicker wrote: »

I'd rather see the boot rom map into Cog 0's RAM area, because that's ultimately where it ends up being copied to anyway.

How would that work? If it maps to hub memory then it can be unmapped once COG 0 has started. If it overlaps COG 0 RAM, when would you unmap it to provide access to the underlying COG RAM?

Seairth · 2013-12-02 14:58

ozpropdev wrote: »

This makes sense for the VIDEO guys out there.

In one of my programs I am generating a 2 color 800x600 VGA image.
The image buffer gobbles up nearly half of the HUB ram (60K)
That's just one example of a need for HUB ram expansion.

But....I love COG's too.
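
The buffer size quoted above can be checked with a quick calculation. This is only a sketch: the helper name is hypothetical, and the 128 KB hub size is an assumption about the design at the time of the quoted post.

```python
def framebuffer_bytes(width, height, bits_per_pixel):
    """Bytes needed for a packed framebuffer."""
    return width * height * bits_per_pixel // 8

HUB_BYTES = 128 * 1024                 # assumed hub RAM size, not stated in the post
size = framebuffer_bytes(800, 600, 1)  # 2 colors = 1 bit per pixel
share = 100 * size / HUB_BYTES

print(size, "bytes,", round(share, 1), "% of hub RAM")
```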

No, I think it's more of an example of a need for better external memory management. Increasing hub ram is just a stop-gap measure. More on this thought shortly...

Cluso99 · 2013-12-02 15:04

David Betz wrote: »

How would that work? If it maps to hub memory then it can be unmapped once COG 0 has started. If it overlaps COG 0 RAM, when would you unmap it to provide access to the underlying COG RAM?

That would mean a change to cog 0 design. Better to drop it onto the cog 0 bus between cog/hub - I suppose it's sort of like overlaying it with hub $0 and immediately/automatically switching it out after the cog ram is loaded.
I don't really see the problem with the way it works now, but I don't see it being that complex to implement the other way either.
Originally, it provided access to the monitor from all cogs by being able to reload the monitor in any cog. But the real purpose is lost because the moment you do that, your cog program is lost. Way better to load a new soft monitor into flash which is then available in hub ram, where it can co-exist with your cog code to enable better debugging using LMM and few cog resources.

jmg · 2013-12-02 15:05

Cluso99 wrote: »

With P2, Chip is going to implement a 1 clock instruction that reads a pin and XORs that with the C flag (which contains the previous bit) and places the result into the C flag (which is also ready for the next bit). The Z flag is set if the pin and its twin pair are both Z which is an SE0 condition which needs testing. The C flag is now used in another 1 clock instruction that Chip is going to implement (a single bit CRC adder using the C flag as the bit input).
Now we just shift the C bit into a byte accumulator (data byte), JZ if Z is set (SE0), shift the C flag into the unstuff counter/register and if Z is set, jmp to unstuff the next bit, otherwise go get the next bit.

With these 2 instructions, the CRC which is of general CRC bit calculation use also, I am unsure how much help the serdes will be unless it does sync detection, unstuffing, SE0 detection and crc.

Sounds like we need to wait for the new Pin-pair opcodes and CRC, and then see what the code size can shrink to.
What is your LS USB code size, without those opcodes ?

jmg · 2013-12-02 15:10

Cluso99 wrote: »

I don't think it has to appear anywhere. I think it could just be part of the special boot instruction that copies an external block into cog ram. It would be a little 2KB standalone block. I think it would just be a hardwired RAM cell with read only access.

True, once you go away from Patched-HUB-RAM as ROM, it then becomes non-readable at runtime, and only boot loaded.
It can become serial ROM (see my comment above), which could have a smaller cell and need no address decode matrix.

Roy Eltham · 2013-12-02 15:35

Really, I was just hoping for more ROM (since it's small), and I wouldn't want it to take away from the RAM. I'm fine with 256K - 2K for ROM if it stays as before, but it would be nice if we could have 8K or 16K of rom, and put in a more capable bootup that could maybe read from SDcard and do other interesting things.
I guess it doesn't matter that much...

Seairth · 2013-12-02 15:39

Dave Hein wrote: »

Bean, I agree with you. However, the good news is that the P2 is being skipped entirely so that the P3 will come out sooner.

You joke (I think), but this actually makes sense from a marketing perspective. At this point, treat the P2 as an iterative, community-influenced (designed, tested, etc) CPU that only ever manifested as an FPGA core. From that, we get the P3 (an actual ASIC). From there, you can have the P4 be the next iterative design, followed by the P5 as the next ASIC.

In a way, this honors the real work and value of the P2, while also conveying the progress made as an actual commercial product (as the P3).

potatohead · 2013-12-02 15:41

Roy Eltham wrote: »

Really, I was just hoping for more ROM (since it's small), and I wouldn't want it to take away from the RAM. I'm fine with 256K - 2K for ROM if it stays as before, but it would be nice if we could have 8K or 16K of rom, and put in a more capable bootup that could maybe read from SDcard and do other interesting things.
I guess it doesn't matter that much...

I didn't think of it this way. Agreed on all points.

jmg · 2013-12-02 15:43

Seairth wrote: »

You joke (I think), but this actually makes sense from a marketing perspective. At this point, treat the P2 as an iterative, community-influenced (designed, tested, etc) CPU that only ever manifested as an FPGA core. From that, we get the P3 (an actual ASIC). From there, you can have the P4 be the next iterative design, followed by the P5 as the next ASIC.

In a way, this honors the real work and value of the P2, while also conveying the progress made as an actual commercial product (as the P3).

Yes, this actually makes sense

jmg · 2013-12-02 15:49

Roy Eltham wrote: »

Really, I was just hoping for more ROM (since it's small), and I wouldn't want it to take away from the RAM.

Present ROM is not small, on a cell-size basis, but it is easy to do, and is read the same as RAM.

Roy Eltham wrote: »

I'm fine with 256K - 2K for ROM if it stays as before, but it would be nice if we could have 8K or 16K of rom, and put in a more capable bootup that could maybe read from SDcard and do other interesting things.
I guess it doesn't matter that much...

If you wanted to use a serial ROM (adds 2K back to RAM), then it would be simplest as 2K, so the boot state engine simply block-dumps it into a COG.
Going above 2k needs some means to address serial ROM, as now you load fractions of ROM.
Not impossible, just another thing to do.

Chip got quite a lot packed into the present ROM

Cluso99 · 2013-12-02 15:59

Chip:
(1) Could the RDOCTET/RDOCTETC/WROCTET instructions (I presume they replace the QUADs) use the Z & C bits to define A & B OCTET registers, and could that be windowed or part of Aux Ram? This would allow quick RDAUX/WRAUX instructions to access this in place, and also permit A to be used by the cog while B is being updated to/from hub.

(2) I mentioned adding AUXA/AUXB pointers to cog $1F0-1F1 and used in a similar way to INDA/INDB, but accesses Aux instead of cog. But I thought of an alternative...

(3) When referring to Cog Registers, they ultimately map to 9 bits. Could an instruction enable a new mapping scheme where 0_xxxx_xxxx uses Aux Ram instead of Cog Ram and 1_xxxx_xxxx uses existing Cog Ram 1_xxxx_xxxx for both the S and D registers?
If simple and possible, then..
Advantages:
* All instructions could use up to 256 variables in cog and up to 256 variables in aux.
* All instructions could operate on both cog and aux together.
* Therefore MOV D,S could move directly to/from cog/aux in 1 clock
* Gives cog code easy access to another 256 long variables which could also be LMM/XMM instructions.
* INDA/INDB/PINx/INx/DIRx are still directly accessible.
Disadvantages:
* Self-modifying code would now be restricted to the top 256-14? longs of cog.
* Variables would be restricted to the top 256-14 longs of cog.
Is it simple/doable/make sense???
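
The mapping proposed in (3) can be modelled in a few lines. This Python sketch only illustrates the address decode as described; the function name is hypothetical, and nothing here says how (or whether) the hardware would implement it.

```python
def resolve(addr9):
    """Decode a 9-bit S or D register address under the proposed mapping."""
    if not 0 <= addr9 <= 0x1FF:
        raise ValueError("register addresses are 9 bits")
    if addr9 & 0x100:
        return ("cog", addr9)        # 1_xxxx_xxxx: existing cog RAM, same address
    return ("aux", addr9 & 0xFF)     # 0_xxxx_xxxx: aux RAM location xxxx_xxxx

print(resolve(0x042))   # lands in aux
print(resolve(0x1F0))   # upper half of cog RAM stays where it is
```

With this decode, a MOV whose D is below $100 and whose S is at or above $100 would copy from cog to aux, which is the single-instruction cog/aux transfer listed under the advantages.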

Cluso99 · 2013-12-02 16:09

jmg wrote: »

Present ROM is not small, on a cell-size basis, but it is easy to do, and is read the same as RAM.

That is just because it was easiest to do this way because it did not require another bus extension.

If you wanted to use a serial ROM (adds 2K back to RAM), then it would be simplest as 2K, so the boot state engine simply block-dumps it into a COG.
Going above 2k needs some means to address serial ROM, as now you load fractions of ROM.
Not impossible, just another thing to do.

Serial would slow the boot process down. But only a 32bit data bus is required and a counter to increment the address to the ROM.
If chip is expanding the hub address by 1 bit, perhaps he might expand it by 2 bits, and allow a larger ROM to be mapped in.
But I don't see this as an issue.

Chip got quite a lot packed into the present ROM

Yes he did, but most of it was rarely used and almost a waste - not that it would have increased hub ram any.

The only advantage to larger ROM would be to have an optional SD boot that could remove the FLASH requirement. Unfortunately I don't see that happening ATM though.

ozpropdev · 2013-12-02 16:17

Ray,
Looking at your idea from another angle.
In multi-tasking we can use SETMAP to map blocks of variables to COG ram.
If these could be mapped to AUX it frees up COG space for code not data.
This is the best use of COG ram...executable code!

Ozpropdev

jmg · 2013-12-02 16:44

Cluso99 wrote: »

Yes he did, but most of it was rarely used and almost a waste - not that it would have increased hub ram any.

? In the present design, ROM is RAM => each byte of ROM removed, adds 1 byte of user RAM

Cluso99 · 2013-12-02 16:50

ozpropdev wrote: »

Ray,
Looking at your idea from another angle.
In multi-tasking we can use SETMAP to map blocks of variables to COG ram.
If these could be mapped to AUX it frees up COG space for code not data.
This is the best use of COG ram...executable code!

Ozpropdev

Agreed!

My reasoning is to permit Aux to be used as variable space, just like we would use variable space in cog. This way all instructions would just operate on Aux if the 9th address bit was "0" (ie S & D addresses within $000..$0FF). This frees up cog variable space. It still permits some to be used as a stack, and some as swap space for instructions.

I am not even sure that this can be done as cog is quad-access and aux is not. So it might be impossible. But it is certainly worth asking.

Anyway, as Chip reads this, he might think of a simple way, as soon as he understands what we are trying to achieve.

Perhaps even if a variant of the RDOCTET instruction could run in the background like some of the maths routines do, that could help significantly.

So an LMM loop might be able to do something like
RDOCTC [#]D/PTRA++ WC ' continuously and autonomously reads 8*longs at a time into AUX cache A, then B, then A, etc, (ping-pongs) in the background until stopped
' - once A & B have been read, it waits for A to be completely read by the cog before automatically fetching the next 8*longs. The same with B.
' - This happens in the background and only stalls the cog if the data is not available.
REPS #n,#i
NOP
RDLONGC [#]D/PTRB++ ' stalls if its not available.
xxxx
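
The ping-pong behaviour sketched in the comments above can be modelled in software as follows. This is a simplified Python illustration, not a description of any real instruction: the names are hypothetical, the refill is modelled as instantaneous, and stalls are therefore not represented.

```python
BLOCK = 8  # longs fetched per background read

def pingpong_read(hub, start, count):
    """Yield `count` longs from `hub`, refilling two 8-long buffers alternately."""
    # Prime both buffers (A, then B) before the consumer starts.
    buffers = [hub[start:start + BLOCK], hub[start + BLOCK:start + 2 * BLOCK]]
    next_fetch = start + 2 * BLOCK
    active = 0                      # 0 = buffer A, 1 = buffer B
    for i in range(count):
        offset = i % BLOCK
        yield buffers[active][offset]
        if offset == BLOCK - 1:
            # Buffer fully consumed: refill it and switch to the other one.
            buffers[active] = hub[next_fetch:next_fetch + BLOCK]
            next_fetch += BLOCK
            active ^= 1

hub = list(range(64))               # stand-in for hub RAM contents
print(list(pingpong_read(hub, 0, 20)))
```

In hardware the refill of one buffer would overlap with the cog reading the other, and the cog would stall only if it caught up with the fetch.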

Cluso99 · 2013-12-02 16:54

jmg wrote: »

? In the present design, ROM is RAM => each byte of ROM removed, adds 1 byte of user RAM

Yes, as I understand it, each RAM transistor bit cell is modified with a layer that forces a bit to be read as 0 or 1 depending on the mask.
Takes more space per bit, but saves another block with the associated bus and multiplexers. Ideal for small ROM space, not for larger ROM space.
It was an excellent trade-off and simplified the design.

ozpropdev · 2013-12-02 16:55

jmg wrote: »

? In the present design, ROM is RAM => each byte of ROM removed, adds 1 byte of user RAM

Correct, HUB ram now starts at $E00 instead of $E80. This only adds 32 longs to HUB though.
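
The figure of 32 longs follows directly from the two addresses given; a quick check, with variable names chosen only for this example:

```python
OLD_RAM_START = 0xE80   # first user RAM address before the change
NEW_RAM_START = 0xE00   # first user RAM address after the change

freed_bytes = OLD_RAM_START - NEW_RAM_START
freed_longs = freed_bytes // 4          # a long is 4 bytes

print(freed_bytes, "bytes =", freed_longs, "longs")
```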

cgracey · 2013-12-02 17:01

I'm only up to post #3386, so I don't know what's been said in the interim, but I wanted to post a layout that Beau and I worked on today:

After taking out the DAC bus, we doubled the hub RAM to 256KB (minus $E00 bytes for the ROM). After doing that, we now have 19.7 square mm of standard cell space, whereas before we had only 14.7. So, we've got ~1/3 more area for logic than before.

This is going to be easier to get ready for synthesis, since the cell/routing area is going to be a huge square, with cut-outs for the memories and PLLs.

With 8 hub RAMs, we could have a 256-bit data bus for reading/writing 8 longs at a stroke, instead of the current 4. This would absorb some of the new cell area, along with the SERDES and PIN/CRC instructions.

I'm going to see about going from QUADs to OCTLs.
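
For reference, the numbers in this post work out as below. This is just arithmetic on the figures quoted above; the variable names are illustrative.

```python
HUB_BYTES = 256 * 1024
ROM_BYTES = 0xE00                       # ROM carved out of the hub RAM

user_ram = HUB_BYTES - ROM_BYTES        # bytes left for user RAM
area_gain = 19.7 / 14.7 - 1             # relative growth in standard cell area
bus_bits = 8 * 32                       # 8 longs of 32 bits per hub access

print(user_ram, round(area_gain, 3), bus_bits)
```

The area gain comes out at about 34%, matching the "~1/3 more" estimate, and 8 longs per access gives the 256-bit bus mentioned.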

cgracey · 2013-12-02 17:08

Phil Pilgrim (PhiPi) wrote: »

This is what I would do with the surplus die area:
1. Onboard 1.8V regulator, so the chip could be powered from 3.3V only.
2. Fatter power distribution buses. This would provide more likely success to less-than-optimal board layouts.

Anyway, that's what I think your OEM customers would appreciate more than enhanced feature-set complexity.

-Phil

I've thought about a regulator, but dropping 3.3V to 1.8V through a linear regulator would burn a lot of power. If the peak current requirement at 1.8V is 1000mA, there will already be plenty of heat. I think a compact off-chip switcher is the best way to get the 1.8V.
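
To put a number on the heat, here is the linear-regulator arithmetic for the figures above. It assumes an ideal linear regulator (input current equals output current, no quiescent current), and the 1000mA value is the hypothetical peak from the post, not a measured figure.

```python
V_IN = 3.3       # supply voltage
V_CORE = 1.8     # core voltage
I_CORE = 1.0     # amps, the hypothetical peak from the post

p_core = V_CORE * I_CORE                 # power used by the core itself
p_reg = (V_IN - V_CORE) * I_CORE         # power burned in the pass element
efficiency = p_core / (p_core + p_reg)

print(round(p_core, 2), round(p_reg, 2), round(efficiency, 3))
```

Under these assumptions the regulator would add about 1.5 W on top of the core's 1.8 W, nearly doubling the heat in the package.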

Cluso99 · 2013-12-02 17:11

ozpropdev:
Brian,
I still don't quite understand how the hub updates the aux for video use? What instruction loop are you using? I seem to be missing something here.

jmg · 2013-12-02 17:17

cgracey wrote: »

I've thought about a regulator, but dropping 3.3V to 1.8V through a linear regulator would burn a lot of power. If the peak current requirement at 1.8V is 1000mA, there will already be plenty of heat. I think a compact off-chip switcher is the best way to get the 1.8V.

Besides the clear power-budget issues, on-chip regulators are not easy to adjust, and the Prop may benefit from a user-choice on Core Vcc.

Best to keep the heat off-chip, where it can be better managed.

Cluso99 · 2013-12-02 17:19

cgracey wrote: »

I'm only up to post #3386, so I don't know what's been said in the interim, but I wanted to post a layout that Beau and I worked on today:

After taking out the DAC bus, we doubled the hub RAM to 256KB (minus $E00 bytes for the ROM). After doing that, we now have 19.7 square mm of standard cell space, whereas before we had only 14.7. So, we've got ~1/3 more area for logic than before.

This is going to be easier to get ready for synthesis, since the cell/routing area is going to be a huge square, with cut-outs for the memories and PLLs.

With 8 hub RAMs, we could have a 256-bit data bus for reading/writing 8 longs at a stroke, instead of the current 4. This would absorb some of the new cell area, along with the SERDES and PIN/CRC instructions.

I'm going to see about going from QUADs to OCTLs.

Great news Chip & Beau! That pic is fantastic to see - yes a nice regular space

Fingers crossed about OCTLs as that alone doubles the transfer rate.

With the video cog being able to transfer at this speed, and getting a double slot, this should leave the cog reasonable time to do other things during the display time.

jazzed · 2013-12-02 17:25

Looks great Chip!

So that's rdoctal and wroctal 8 longs at a time without caching?

Thanks.

JRetSapDoog · 2013-12-02 17:35

The move from QUADS to OCTETS changes the analysis a lot! Doubling the throughput, particularly to SDRAM, where there was kind of a bottleneck could help considerably. That makes giving up the possibility of additional hubs MUCH more bearable (wiping my eyes and blowing my nose and starting to laugh a little). We get something great in return: double the memory and maximum throughput.

So, does it definitely double the throughput to SDRAM (or is there another limiting factor)? Also, about using 2 SDRAM chips: if done, how many additional pins does that consume, just a chip select pin or also more data (and maybe control) pins? If only a CS pin, would access be interleaved among the pair?

It's interesting to see the Master at work: he detects a bottleneck and considers moving to the seemingly ideal DDR2, but upon weighing the risks involved returns to SDRAM and figures out a way to double the throughput (hopefully). I believe it's a natural thought process to consider the "ideal" before optimizing what's practical.

Wonder if there's any way to double the throughput to SDRAM yet again. A related question: would allowing a cog to yield all of its hub slots to an SDRAM driver cog allow the driver to access SDRAM faster (even after the move to OCTETS), or would the access speed of the SDRAM or something else in the P2 (or board design) be the limiting factor?

At any rate, regarding hub slot yielding, presently, I'm inclined to give the user the choice if doing so doesn't add much design time and risk. I guess it'd basically be "off" by default, but available for special cases, a kind of "overdrive."

Whoops! Looks like the move to OCTET's (OCTL's) is not a done-deal/slam-dunk yet (with all the data lines to route and related logic). Fingers crossed!

ozpropdev · 2013-12-02 17:47

Cluso99 wrote: »

ozpropdev:
Brian,
I still don't quite understand how the hub updates the aux for video use? What instruction loop are you using? I seem to be missing something here.

Ray,
Basically I use RDQUAD to get my 4 longs then I PUSH those 4 longs into the AUX.
I then use a WAITVID to send a 128 bit pixel packet out.
Added to this I have to use a 3 stage buffer to feed WAITVID to avoid double-buffer issues.

Brian
