Propeller II update - BLOG

cgracey · 2013-12-02 17:51

Cluso99 wrote: »

Chip:
(1) Could the RDOCTET/RDOCTETC/WROCTET instructions (presume they replace the QUADs) use the Z & C bits to define A & B OCTET registers, and could that be windowed or part of Aux Ram? This would allow quick RDAUX/WRAUX instructions to access this in place, and also permit the A being used by the cog while the B is being updated to/from hub.

(2) I mentioned adding AUXA/AUXB pointers to cog $1F0-1F1 and used in a similar way to INDA/INDB, but accesses Aux instead of cog. But I thought of an alternative...

(3) When referring to Cog Registers, they ultimately map to 9 bits. Could an instruction enable a new mapping scheme where 0_xxxx_xxxx uses Aux Ram instead of Cog Ram and 1_xxxx_xxxx uses existing Cog Ram 1_xxxx_xxxx for both the S and D registers?
If simple and possible, then..
Advantages:
* All instructions could use up to 256 variables in cog and up to 256 variables in aux.
* All instructions could operate on both cog and aux together.
* Therefore MOV D,S could move directly to/from cog/aux in 1 clock
* Gives cog code easy access to another 256 long variables which could also be LMM/XMM instructions.
* INDA/INDB/PINx/INx/DIRx are still directly accessible.
Disadvantages:
* Self-modifying code would now be restricted to the top 256-14? longs of cog.
* Variables would be restricted to the top 256-14 longs of cog.
Is it simple/doable/make sense???

AUX RAM has a single port to the cog, whereas cog RAM has three read ports and one write port. Those read/write ports operate at 3 different stages of the pipeline (stage 1 = read instruction, stage 3 = read D and S, stage 4 = write D). AUX can't be read ahead of time to suffice for D or S. As it is, if you use an SPx or immediate address, AUX gets read at stage 3, unless your address is an S register, in which case the read must occur in stage 4, when the S register value is available, then stalling the pipeline for one cycle while the result comes back. All AUX writes occur at stage 4.

...Wait maybe AUX could be read in time for D or S usage. The address would be the D or S field of the instruction in stage 2 of the pipeline. I guess it could work, but my stack is overflowing at the moment.
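The read timing described above for the current design can be written down as a small lookup. This is only a sketch of the rules as stated in the post, not of the actual hardware; the names `AuxAddr` and `aux_read_timing` are made up for illustration.

```python
from enum import Enum


class AuxAddr(Enum):
    """Where the AUX address comes from (hypothetical names)."""
    SPX = "SPx pointer"
    IMMEDIATE = "immediate"
    S_REGISTER = "S register"


# Per the post: all AUX writes happen at pipeline stage 4.
AUX_WRITE_STAGE = 4


def aux_read_timing(addr_source):
    """Return (pipeline_stage, stall_cycles) for an AUX read.

    SPx and immediate addresses are known early, so the read happens at
    stage 3 with no stall. An address held in an S register is only
    available at stage 4, which costs one stall cycle.
    """
    if addr_source is AuxAddr.S_REGISTER:
        return 4, 1
    return 3, 0
```

The idea floated at the end of the post (using the D or S field in stage 2 as the AUX address) would add another case to this table, but its timing is not specified here.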

Roy Eltham · 2013-12-02 17:51

jazzed, should be the same as RDQUAD/WRQUAD were before, just with 8 longs now instead of 4. I assume we'll just get rid of the QUAD variants and just have the OCTET ones.

It should also mean that the RD*C instructions now get much better, with RDLONGC getting 7 one cycle reads after 1 hub access speed read (assuming consecutive reads).
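To illustrate the "1 hub read, then 7 cached reads" point, here is a rough model of a single-line read cache. It assumes the cache holds one aligned block of longs and that addresses are long indices; neither detail is confirmed in the thread, and `count_cached_reads` is a hypothetical helper.

```python
LONGS_PER_LINE = 8  # OCTET: 8 longs (256 bits) per hub access


def count_cached_reads(long_addresses, longs_per_line=LONGS_PER_LINE):
    """Return (hub_fetches, cache_hits) for a sequence of long-index reads.

    Assumption: a single cache line holding the most recently fetched
    aligned block of `longs_per_line` longs.
    """
    cached_line = None
    hub_fetches = 0
    cache_hits = 0
    for addr in long_addresses:
        line = addr // longs_per_line
        if line == cached_line:
            cache_hits += 1      # served from the cached block
        else:
            hub_fetches += 1     # has to wait for a hub slot
            cached_line = line
    return hub_fetches, cache_hits


# Eight consecutive longs starting on a block boundary.
octet = count_cached_reads(range(8))                    # 8-long line
quad = count_cached_reads(range(8), longs_per_line=4)   # old 4-long line
```

Under these assumptions the 8-long line needs one hub fetch for the eight reads, where the 4-long line needs two.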

cgracey · 2013-12-02 17:56

ozpropdev wrote: »

Ray,
Looking at your idea from another angle.
In multi-tasking we can use SETMAP to map blocks of variables to COG ram.
If these could be mapped to AUX it frees up COG space for code not data.
This is the best use of COG ram...executable code!

Ozpropdev

Now, there's the application for AUX being D or S. I still need to ponder this.

Our brains need an 8-port connection.

Cluso99 · 2013-12-02 17:59

cgracey wrote: »

AUX RAM has a single port to the cog, whereas cog RAM as three read ports and one write port. Those read/write ports operate at 3 different stages of the pipeline (stage 1 = read instruction, stage 3 = read D and S, stage 4 = write D). AUX can't be read ahead of time to suffice for D or S. As it is, if you use an SPx or immediate address, AUX gets read at stage 3, unless your address is an S register, in which case the read must occur in stage 4, when the S register value is available, then stalling the pipeline for one cycle while the result comes back. All AUX writes occur at stage 4.

...Wait maybe AUX could be read in time for D or S usage. The address would be the D or S field of the instruction in stage 2 of the pipeline.

Thanks for the detailed explanation.

I guess it could work, but my stack is overflowing at the moment.

You need more coffee and a larger stack

cgracey · 2013-12-02 18:00

Cluso99 wrote: »

Yes, as I understand it, each RAM transistor bit cell is modified with a layer that forces a bit to be read as 0 or 1 depending on the mask.
Takes more space per bit, but saves another block with the associated bus and multiplexers. Ideal for small ROM space, not for larger ROM space.
It was an excellent trade-off and simplified the design.

Yes, this is how it works and to change it would mean mux's, and it would wind up being slower. In an SRAM-based design, the SRAM is the inescapable critical path. SRAM reads are equivalent to many stages of logic. You design the logic to be as complicated as needed, but not to exceed the SRAM access time.

Ariba · 2013-12-02 18:02

jmg wrote: »

Ariba wrote:

But if you use many objects with fast DACs (Audio, Functiongenerators and so on) then the cog allocation can only work if you start the drivers in the exact right order.

Lots of SW development involves getting things organized into the right places, so this is a SW management problem.

It just needs a means to define, and check, that you are getting what you expect.
PCs are great at this sort of mundane allocate and cross check stuff, they do it in milliseconds.

The problem is that the cog allocation happens at runtime.
I don't say it's an unresolvable problem, but it has more influence on the easy mixing of OBEX objects than any hub-slot reuse.

Some solutions I see:

1) You need to start the objects that use fast DACs first. The object allocates the cog with COGINIT and should check first if it is free. You only will get an Error message at runtime, if the object sends one.
The hardware (VGA connectors and such) must not be connected on P0..P8, because cog0 is normally not free.

2) The Spin compiler implements a cog-mapper for the objects. All the objects use COGINIT with a cogid that is generated by the compiler. So the compiler can detect if the current cog-pin mapping is not possible.

Andy

cgracey · 2013-12-02 18:05

jazzed wrote: »

Looks great Chip!

So that's rdoctal and wroctal 8 longs at a time without caching?

Thanks.

Right. That's one long per clock. I'm trying not to think about executing from hub RAM, directly, with automatic RDOCTL's in the background. CALLA/RETA would be a suitable stack, after widening the PC bits from 9 to 18.

cgracey · 2013-12-02 18:08

JRetSapDoog wrote: »

...Also, about using 2 SDRAM chips: if done, how many additional pins does that consume, just a chip select pin or also more data (and maybe control) pins? If only a CS pin, would access be interleaved among the pair?

It would just be another 16 data pins. Currently, considering the DE2-115 board, it would mean P32..P47 get connected to the other SDRAM chip's data pins. All the control inputs to the SDRAM chips would be identical.

jazzed · 2013-12-02 18:09

cgracey wrote: »

Right. That's one long per clock. I'm trying not to think about executing from hub RAM, directly, with automatic RDOCTL's in the background. CALLA/RETA would be a suitable stack, after widening the PC bits from 9 to 18.

Cool. Ya I figured fetch and execute from hub RAM would be for the next chip.

Good luck squeezing in all the glue.

cgracey · 2013-12-02 18:11

JRetSapDoog wrote: »

...A related question: would allowing a cog to yield all of its hub slots to an SDRAM driver cog allow the driver to access SDRAM faster (even after the move to OCTETS), or would the access speed of the SDRAM or something else in the P2 (or board design) be the limiting factor?

You want the hub slot to appear only once every 8 clocks, because it must be coordinated in time with XFR, which writes the incoming data into the OCTL's. So, slot sharing would hurt, not help.

Bill Henning · 2013-12-02 18:13

LOL

well, I have not been trying to not think about it...

The only issue would be that the simplest/fastest implementation would require that jump/call/return destinations be oct-long aligned, and calls/jumps would have to occur (I think) 1-3 clocks before the last long gets executed, to give the system time to update the hub-pc

cgracey wrote: »

Right. That's one long per clock. I'm trying not to think about executing from hub RAM, directly, with automatic RDOCTL's in the background. CALLA/RETA would be a suitable stack, after widening the PC bits from 9 to 18.

cgracey · 2013-12-02 18:21

Bill Henning wrote: »

LOL

well, I have not been trying to not think about it...

The only issue would be that the simplest/fastest implementation would require that jump/call/return destinations be oct-long aligned, and calls/jumps would have to occur (I think) 1-3 clocks before the last long gets executed, to give the system time to update the hub-pc

I figured that unless you had a branch, you'd automatically read the next 8 longs in your hub slot, in the likely case that you'd need subsequent instructions.

I think the glue for this would be very small. The work would be in making sure it was easy to use.

JRetSapDoog · 2013-12-02 18:21

@Chip: I see. Thanks! Although not-insignificantly cutting into available pins, 2 SDRAM chips could work for many designs. Things like built-in DAC's also help in terms of pins. But that probably doesn't mean that you're committing Parallax to a dual SDRAM board, I'm guessing. Although maybe a board that allowed for all cases (0, 1 or 2 SDRAM chips) could be made.

@Ariba: I had wondered a bit about pin usage, too, prior to your post #3391 (before your follow-on post 3487 above) though not in as much detail. I'm sorry that your first post on the matter got overlooked in all the rush. But I'm glad to see you suggesting some work-arounds or guidelines, as the freed-up die space with the functionality it will provide is basically too much to resist. Not only that, it looks like it simplifies everything based on Chip's words: "Getting rid of the DAC bus means that our custom routing largely goes away, so that the place-and-route tool will make all the connections from the core to both the memories and the pad frame. This is going to save tons of layout time." That's music to Ken's (and all of our) ears, too!

@Chip: Ah, slot-sharing doesn't help at all for SDRAM access, it hurts or would mess it up (sharing could still help in other ways, though). Thanks, again.

Erik Friesen · 2013-12-02 18:27

I may not be completely up to speed, but why exactly would there need to be a 1000 mA source of power? As a comparison, what is being done differently than, say, a pic32 which has a vcap/vcore pin and internal regulator for internal logic? Assuming all the io would be fed from the 3.3 rail, this means the logic is going to consume up to 1A?

Bill Henning · 2013-12-02 18:31

Agreed.

I was talking about branches - JMP, CALL and RET

using all the D and S bits, we get an 18 bit address (just right for 256KB hub)

I see two paths:

1) we know D:S point to a long, so it can actually address 1MB of hub space

Needs some extra logic - always fetch octal-long (OL) aligned 8 longs, extra logic picks long to resume execution with within the OL block

This is the easiest path for compilers.

2) D:S points to an OL, we can address 8MB of HUB

Bit rougher on compilers, jumps/calls must be to OL boundaries, and must occur 1-3 clocks before end of OL block. Needs less logic, and a bit faster due to alignment.
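The address reach for each interpretation of the 18-bit D:S value is easy to check numerically. The sketch below assumes 9-bit D and S fields, 4-byte longs and 8-long octal-longs; `reach_bytes` and `split_long_address` are illustrative names, not anything from the P2 design.

```python
ADDRESS_BITS = 18          # 9-bit D field + 9-bit S field
BYTES_PER_LONG = 4
LONGS_PER_OL = 8           # octal-long (OL) = 8 longs = 32 bytes


def reach_bytes(unit_bytes, bits=ADDRESS_BITS):
    """Bytes addressable when each address value selects `unit_bytes`."""
    return (1 << bits) * unit_bytes


def split_long_address(long_addr):
    """Path 1: fetch the aligned OL, then pick the long inside it.

    Returns (first long index of the OL block, offset within the block).
    """
    return long_addr & ~(LONGS_PER_OL - 1), long_addr & (LONGS_PER_OL - 1)


byte_reach = reach_bytes(1)                              # D:S as byte address
long_reach = reach_bytes(BYTES_PER_LONG)                 # path 1
ol_reach = reach_bytes(BYTES_PER_LONG * LONGS_PER_OL)    # path 2
```

With these definitions the three cases come out to 256KB, 1MB and 8MB, matching the figures in the post. In path 2 the offset returned by `split_long_address` would always have to be zero, which is the alignment burden on compilers.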

cgracey wrote: »

I figured that unless you had a branch, you'd automatically read the next 8 longs in your hub slot, in the likely case that you'd need subsequent instructions.

I think the glue for this would be very small. The work would be in making sure it was easy to use.

jmg · 2013-12-02 18:42

Erik Friesen wrote: »

I may not be completely up to speed, but why exactly would there need to be a 1000 mA source of power? As a comparison, what is being done differently than, say, a pic32 which has a vcap/vcore pin and internal regulator for internal logic? Assuming all the io would be fed from the 3.3 rail, this means the logic is going to consume up to 1A?

In simplest terms, a PIC32 specs 80mA typ at 200Mhz sys clock, for one core.
A Prop 2, can have 8 cores plus Video active, so ~700mA is not impossible (1A with margin) - and this is still largely an unknown figure.

There are better uses for die space, especially on a part that may need VCore adjusting up or down.

cgracey · 2013-12-02 18:46

Bill Henning wrote: »

Agreed.

I was talking about branches - JMP, CALL and RET

using all the D and S bits, we get an 18 bit address (just right for 256KB hub)

I see two paths:

1) we know D:S point to a long, so it can actually address 1MB of hub space

Needs some extra logic - always fetch octal-long (OL) aligned 8 longs, extra logic picks long to resume execution with within the OL block

This is the easiest path for compilers.

2) D:S points to an OL, we can address 8MB of HUB

Bit rougher on compilers, jumps/calls must be to OL boundaries, and must occur 1-3 clocks before end of OL block. Needs less logic, and a bit faster due to alignment.

Good observation about the combined 18 bits. They could be always used as relative addresses, too, in case there was more memory someday.

Yanomani · 2013-12-02 18:58

cgracey wrote: »

I figured that unless you had a branch, you'd automatically read the next 8 longs in your hub slot, in the likely case that you'd need subsequent instructions.

I think the glue for this would be very small. The work would be in making sure it was easy to use.

Chip

I was on an 80-mile trip, driving back home, just wondering about how a 256 bit bus, between HUB ram and the COGs would be useful, and you come with them, and RDOCTLs!
Way, way, way damn good!

If you still have some time, and coffee, and after flushing your stack, automatic RDOCTLs, in the background, are the way to go, just as if you'd used some endless REPS with them, but when the straight instruction block must be cut off, in the case of an out-of-the-straight-line JUMP or CALL, the REPS vanish automatically.
Jumping inside the OCT block, should otherwise be preserved, since the target instruction is already present.
Perhaps if JUMPs whose target is already loaded, and progressing inside the pipeline, could activate the "no execute" bit of the intervening ones, it will act as a 1, 2 or up to three SKIP.
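A quick way to picture the "target already present" case: a jump stays inside the loaded block when the current and target long addresses fall in the same aligned group of 8. This is only a sketch under that alignment assumption, and `target_in_loaded_block` is a made-up name.

```python
LONGS_PER_BLOCK = 8  # one OCT block = 8 longs


def target_in_loaded_block(current_long, target_long,
                           longs_per_block=LONGS_PER_BLOCK):
    """True if the jump target lies in the same aligned block as the PC.

    Assumption: blocks are fetched on 8-long boundaries, so two long
    addresses share a block when their block indices match.
    """
    return (current_long // longs_per_block
            == target_long // longs_per_block)
```

For example, a jump from long 9 to long 15 stays within the block, while a jump from long 9 to long 16 would need a fresh hub fetch.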

Yanomani

Heater. · 2013-12-02 19:00

Bean,

I hate to say it but I see the P2 being the best microcontroller never made.

The Prop 1.5 was damn good as well.

Bill Henning · 2013-12-02 19:09

Thanks... but they can address 2**18 longs, 1MB of hub - p3 ready

cgracey wrote: »

Good observation about the combined 18 bits. They could be always used as relative addresses, too, in case there was more memory someday.

rogloh · 2013-12-02 19:11

I do like the hub RAM boost to 256kB. The increase to 256k RAM is very good for GUI type applications wishing to fit entire graphics video buffers into the hub RAM, and then running a VM from larger external RAMs allowing XMM model applications which won't suffer shared bandwidth impacts from any graphics accesses.

256kB minus ~4kB boot ROM space in theory now allows these video modes to fit

640x400 x 8 bpp (256 color) = 256000 bytes (only barely fits)
640x480 x 4 bpp (16 color) = 153600 bytes
800x600 x 4 bpp (16 color) = 240000 bytes
1280x1024 x 1bpp (mono) = 163840 bytes
1366x768 x 1bpp (mono) = 131136 bytes
1600x1200 x 1bpp (mono) = 240000 bytes

These are pretty common and well supported video modes that can nicely fit the available panels and their aspect ratios. Color depth is not huge but works out ok for GUI and some games.
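The sizes in the list can be re-derived with a few lines of arithmetic. The sketch below assumes tightly packed pixels with no per-line padding and takes the "~4kB" boot ROM figure at face value; `framebuffer_bytes` is just an illustrative helper.

```python
HUB_BYTES = 256 * 1024
BOOT_ROM_BYTES = 4 * 1024            # approximate figure from the post
AVAILABLE_BYTES = HUB_BYTES - BOOT_ROM_BYTES


def framebuffer_bytes(width, height, bpp):
    """Bytes for a tightly packed frame buffer, rounded up to whole bytes."""
    return (width * height * bpp + 7) // 8


MODES = [
    (640, 400, 8),
    (640, 480, 4),
    (800, 600, 4),
    (1280, 1024, 1),
    (1366, 768, 1),
    (1600, 1200, 1),
]

for width, height, bpp in MODES:
    size = framebuffer_bytes(width, height, bpp)
    fits = "fits" if size <= AVAILABLE_BYTES else "too big"
    print(f"{width}x{height} x {bpp}bpp = {size} bytes ({fits})")
```

All six modes come in under the 258048 bytes left after the ROM, with 640x400 at 8 bpp leaving only about 2kB to spare.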

Also in an earlier post I mentioned a desire for some type of 8 ported AUX RAM. After contemplating this further and reading the replies I agree it will be very difficult to manage shared access to all COGs unless it is somehow mapped into the hub address space, because it doesn't really fit into any known software model. It is very specialized and tricky to deal with.

So instead of that whole 8 port nightmare, perhaps the otherwise unusable ROM hole in the hub address space could be used to access any shared AUX RAM memory regions. Each COG would still have direct access to its AUX RAM, and some of this AUX RAM (eg upper half) can be made accessible to other COGs via the ROM hub address space if desired. After bootup any COG could decide to make part (not necessarily all) of its upper AUX ram visible to other COGs via this hub memory hole. When it is not mapped by default it allows the COG its own dedicated exclusive access to this AUX RAM (so other COGs can't possibly stomp on it). The PASM code author gets to decide this for their COG depending on what it needs to do.

The feature might be useful for fast parallel I/O drivers wishing to share their data amongst themselves, and may provide slightly higher performance when transacting back and forth between COGs, by reducing some of the latency of individual transactions (on one side). The good thing is the client COGs don't have to know whether or not it is special shared AUX RAM that they are reading from, they just use a regular hub memory access in low memory, everything still works from that point of view. It also only requires dual ported SRAM to implement (one side of the AUX RAM is from the attached COG reader/writer, the other side from the hub reader/writer). One problem I can see there however is what if multiple COGs all want to map their AUX RAM to the same window and all write to the same byte at the same time... some precedence rules need to apply there. Or one very simple solution there is just to divide this hub ROM address hole by 8, and each COG ID only gets its own fixed part of it. If you round it up to 4kB ROM then each COG gets 128 longs x 32 bits = 512 bytes of this space to share its AUX RAM to other COGs.
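The fixed split of the ROM hole could be decoded along these lines. This is a sketch only: it assumes a 4kB hole starting at hub address 0 and eight equal windows in cog-ID order, and `decode_hole_address` is a hypothetical name. How a window long maps onto a cog's upper AUX RAM is left open.

```python
ROM_HOLE_BYTES = 4 * 1024
NUM_COGS = 8
WINDOW_BYTES = ROM_HOLE_BYTES // NUM_COGS   # 512 bytes per cog
WINDOW_LONGS = WINDOW_BYTES // 4            # 128 longs per cog


def decode_hole_address(hub_addr, hole_base=0):
    """Map a hub byte address inside the hole to its owning cog.

    Returns (cog_id, long_index_in_window, byte_in_long).
    `hole_base` is an assumed start address for the ROM hole.
    """
    offset = hub_addr - hole_base
    if not 0 <= offset < ROM_HOLE_BYTES:
        raise ValueError("address is outside the ROM hole")
    cog_id = offset // WINDOW_BYTES
    within = offset % WINDOW_BYTES
    return cog_id, within // 4, within % 4
```

Because each cog owns a fixed window, two cogs can never be mapped onto the same hub byte, which sidesteps the precedence problem mentioned above.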

Roger.

Yanomani · 2013-12-02 19:12

jmg wrote: »

In simplest terms, a PIC32 specs 80mA typ at 200Mhz sys clock, for one core.
A Prop 2, can have 8 cores plus Video active, so ~700mA is not impossible (1A with margin) - and this is still largely an unknown figure.

There are better uses for die space, especially on a part that may need VCore adjusting up or down.

100% Agreed!

Unless the regulator could be placed at die's center, to ease the thermal balance and gradient! BUT, just there is where we need to place the synthesized logic block!
Also, the center is the best place to dissipate power, thru the use of an exposed padded pad frame.

Placing a regulator at the center, and having a so nice eight-fold COG design, it will better match the thermal issue in an eight-sided polygon shaped die, so, more tooling costs on the horizon.

IMHO, the best place for it, is way out of the chip.

Yanomani

potatohead · 2013-12-02 19:27

Re: Video

Personally, I think the optimal scenario will be to store a lot of it in SDRAM, using the HUB for backing store, sprites and other things that are smaller in size. That way, one can have a high color depth for things that do not require dynamic drawing, or a lot of movement, or frame locked movement, etc... A lower color depth can be used for other elements, such as a mouse pointer, high resolution bitmap display, etc... and those can be overlay or pixel blended due to the nice, shiny new pixel instructions.

For those not using external memory, then yes! The HUB is roomy, and with dynamic drawing techniques could very easily maintain an 800x600 screen with a nice color depth.

Just catching up: 8 longs per hubop? SWEET!

Re: Best one never made.

Indeed. I find myself sharply limiting my comments here. Lots of great ideas, but we do have a window that is highly likely closing at this point. Remember, fully half the value of a product is contained in the early majority. If that's missed, literally half the revenue possible may well never appear.

jmg · 2013-12-02 19:36

rogloh wrote: »

256kB minus ~4kB boot ROM space in theory now allows these video modes to fit

640x400 x 8 bpp (256 color) = 256000 bytes (only barely fits)
640x480 x 4 bpp (16 color) = 153600 bytes
800x600 x 4 bpp (16 color) = 240000 bytes
1280x1024 x 1bpp (mono) = 163840 bytes
1600x1200 x 1bpp (mono) = 240000 bytes

and a couple of others, using Palette would be

(640*480*6/8)* 32/30 ( 64 colour palette, using 5*6 packing per long ) = 245760

(800*480*5/8)* 32/30 ( 32 colour palette, using 6*5 packing per long ) = 256000
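The 32/30 factor comes from packing whole pixels into each 32-bit long and wasting the 2 leftover bits. A short check of both figures, assuming pixels never straddle a long boundary (`packed_framebuffer_bytes` is an illustrative helper):

```python
def packed_framebuffer_bytes(width, height, bits_per_pixel, long_bits=32):
    """Bytes needed when whole pixels are packed into 32-bit longs.

    Leftover bits in each long are unused, so 6-bit pixels pack 5 per
    long and 5-bit pixels pack 6 per long (30 of 32 bits used).
    """
    pixels_per_long = long_bits // bits_per_pixel
    total_pixels = width * height
    longs = -(-total_pixels // pixels_per_long)   # ceiling division
    return longs * (long_bits // 8)


size_64_colour = packed_framebuffer_bytes(640, 480, 6)   # 5 pixels per long
size_32_colour = packed_framebuffer_bytes(800, 480, 5)   # 6 pixels per long
```

Both results agree with the figures above: 245760 bytes for the 64 colour mode and 256000 bytes for the 32 colour mode.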

For text, there is also a display-list approach, where font lookups are used within the scan loading.
The P2 is a moving target, so not clear where that would top-out.

potatohead · 2013-12-02 19:40

Right now, it's possible to get font lookups done on the early part of the scanline no problem. That's at the 60Mhz I last tested it on. Basic tile / text type displays run really well.

Dynamic drawing over the top of that is slower, but the tests done with the last pixel engine were favorable. With waitvid double buffered, you basically have the whole scan line to work with. For very high resolutions, you simply would be working ahead of the beam, populating the buffered waitvid while the other one is rendering to the screen.

This technique will be good for multi-purpose displays too. Say you've got text on most of the screen, save for a monochrome high bit depth grey scale data / visualization area and maybe another one that's a 16 color bitmap for line art, or some graph or other. A list of modes and waitvids can be packed into a display list, making for custom displays to be HUB RAM efficient. Doing this is difficult on P1.

All of that is a huge improvement over P1. For many basic video needs, an entire scanline can be packed in to one waitvid.

A P1 could throw up a couple hundred sprites per screen, with some high limit of say 50 sprites per scan line @ 4-5 COGS @ TV resolution (320x200)

P2 can exceed that, even at the 60Mhz clock by several times, and easily does higher resolutions and color depths.

Frankly, for simple byte / color displays, P2 can mash the images together overlay style quick enough for the video COG to also process quite a number of sprites all by itself. I've not tried the pixel blend instructions yet.

At the production clock of 160Mhz, basic video displays are only going to consume a small fraction of a scan line at fairly serious resolutions. 800x600+

This will leave the COG mostly free for basic displays. Video in a task is already being done at 1/16th by Ozpropdev through some careful timing. That all frees up at production speed.

Other nifty things are possible too. Say only a monochrome display is needed. If the component video display format is chosen, just output the "Y" channel, and it contains sync, the grey scale signal, etc... all on one pin with 8 bits of grey.

jmg · 2013-12-02 19:41

potatohead wrote: »

.... Remember, fully half the value of a product is contained in the early majority. If that's missed, literally half the revenue possible may well never appear.

Sounds like marketing 101 fluff, if it ever applied, it was only to consumer fashion products.

Away from the consumer fashion markets, the area under the curve still dominates, and there, design-lifetime is king.

Look at PIC and 8051 for examples of very long design lifetimes, still generating revenue.

ozpropdev · 2013-12-02 19:43

With the removal of the DAC bus and it allowing Beau to use more automated tools instead of
manual layout, makes me think P2 is a lot closer than we think!

Yanomani · 2013-12-02 19:55

ozpropdev wrote: »

With the removal of the DAC bus and it allowing Beau to use more automated tools instead of
manual layout, makes me think P2 is a lot closer than we think!

This also spared him, to buy an AUX pair of hands, arms and shoulders!

Yanomani

P.S. I'd just figured out; Coz I'm a 105kg pater familias, my "exposed center pad" just resembles Ganesh's one! OUCH

Thanks ozpropdev, for helping me to get the insight!

I believe I'd just found my best resembling AVATAR! It just lacks some Propeller related reference! Will study the symbology, so as not to mess with some divinity related curse!:nerd:

User Name · 2013-12-02 20:14

ozpropdev wrote: »

With the removal of the DAC bus and it allowing Beau to use more automated tools instead of
manual layout, makes me think P2 is a lot closer than we think!

Hear, hear!

Entirely too much hyperventilating going on in certain circles.

Cluso99 · 2013-12-02 20:20

Here is a pic that I think reflects the current data bus architecture between hub/cog/aux.

Hopefully the QUAD (128bit) access will go to OCTET (256bit) access. This goes HUB to/from the CACHE. However, we are still looking to see how we can move this CACHE to/from AUX or COG fast and easily.

Unfortunately ATM I cannot see any ways to get it into the COG faster.
However, there could be ways to get it into the AUX faster.
And there may be ways for the COG to access AUX as variables better, or to map some AUX to COG space.

HUB to/from AUX faster:
Currently when we execute RD/WRQUAD, it takes 3..11 (1..8 for WR) clocks. This stalls the cog while it waits for the hub slot.

Could the AUX be built as 8 * 32 blocks so that the new RD/WROCT instructions can write straight into one of these 8 blocks, thereby effectively removing the CACHE requirement. Perhaps the RD/WRxxxxC instructions could now use AUX instead?

The benefits of this would be enormous!!!
* With 8 x RDOCT instructions, the whole AUX would be filled in 8 slots. No need to move the cache into AUX as it's immediately there. Great for video buffering.
* Even better with a donated slot (1:4).
* LMM/XMM would benefit too, with an unrolled 8 instruction loop.
* Transferring data blocks between cogs can be done using 8 longs at once. Postedit: I mean if we can use data in the AUX, we can transfer between AUX-HUB-AUX (between 2 cogs) 8 longs at a time.
