The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

Seairth · 2015-09-02 16:02

cgracey wrote: »

Because the FIFO is sequential and not randomly accessed, it must be reloaded on every jump to hub memory, even if it contains the needed code.

I wasn't talking about randomly accessing the FIFO. But it occurs to me that my thinking was still wrong. Because of the pipeline, the instruction that would be returned to would have already been removed from the FIFO anyhow. Ah well...

cgracey · 2015-09-04 01:11

Well, it looks like we're back, minus a few posts.

I got the block-write instruction done today.

Thinking about the hub-to-LUT block read now. I don't think I'll make a LUT-to-hub block write, though, because it presents some big complexities and I don't know that it's very important.

ozpropdev · 2015-09-04 01:19

cgracey wrote: »

Well, it looks like we're back, minus a few posts.

I got the block-write instruction done today.

Thinking about the hub-to-LUT block read now. I don't think I'll make a LUT-to-hub block write, though, because it presents some big complexities and I don't know that it's very important.

Nice to see the thread back!
If we have RDLUT now then a lut2hub can be done in code if really needed.
hub2lut would be handy though.

jmg · 2015-09-04 01:22

cgracey wrote: »

Well, it looks like we're back, minus a few posts.

Good.
Strangely, it took two 'hops' to get back to latest post - first click topped out at 129 pages, then exit and re-enter 'found' 2 more pages... Strange things, this forum does...

evanh · 2015-09-04 02:47

cgracey wrote: »

Because the FIFO is sequential and not randomly accessed, it must be reloaded on every jump to hub memory, even if it contains the needed code.

So even a small REP loop will be repeatedly fetching from HubRAM, right? Which effectively means HubExec always consumes about 50% of that Cog's available Hub bandwidth.

Tubular · 2015-09-04 03:58

How long do you estimate doing the smart pins will take, Chip?

jmg · 2015-09-04 04:01

Tubular wrote: »

How long do you estimate doing the smart pins will take, Chip?

These Pins are so smart, that they will design themselves !!

Tubular · 2015-09-04 04:30

Hehe, if only. I think the danger in calling them smart pins is everyone expects them to do pretty much everything, which would be a mighty big ask.

I think Chip mentioned recently the P2 logic might grow by 10k flops or so for the smart pins, which is a budget of around 150 flops per pin. You probably need 100 flops just for working and config registers (a la P1 counters). I don't doubt mighty useful things can be achieved, but it's quite a thought exercise.

edit: might be worth Chip starting a new thread as per Ken's request, in case that helps the forum server

potatohead · 2015-09-04 05:47

So even a small REP loop will be repeatedly fetching from HubRAM, right? Which effectively means HubExec always consumes about 50% of that Cog's available Hub bandwidth.

And that's the advantage of the egg beater. If this performance plays out nicely, it's a very good thing, because it means we get a lot of pretty speedy COGS without all the complex interactions on the "hot" chip.

Looking forward to a jam session one day here soon.

evanh · 2015-09-04 07:06

Has Chip clarified if a jump/rep stalls execution or not?

Dave Hein · 2015-09-04 13:21

It was mentioned earlier that a jump will cause the streaming FIFO to refill, even if the jump address matches the top of the FIFO. I would guess that a REP would also cause the FIFO to refill as well. So, yes there would be stalls. The streaming FIFO is just a FIFO and not a cache. Hubexec will efficiently execute straight line code, but I don't think it will be very efficient with loops and jumps. This concerns me a bit because C code tends to have lots of jumps and function calls.
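To put rough numbers on the concern above, here is a toy Python model of hub exec in which every taken jump forces a FIFO refill. The clock counts (CLOCKS_PER_INSTR, REFILL_PENALTY) are assumptions picked for illustration, not figures from this thread or from any P2 documentation.

```python
# Toy model only: both constants below are assumed values.
CLOCKS_PER_INSTR = 2   # assumed cost of one instruction while the FIFO is streaming
REFILL_PENALTY = 8     # assumed extra clocks to refill the FIFO after a taken jump


def hubexec_clocks(instructions, taken_jumps):
    """Estimate clocks when every taken jump forces a FIFO refill."""
    return instructions * CLOCKS_PER_INSTR + taken_jumps * REFILL_PENALTY


# 400 instructions executed as straight-line code
straight = hubexec_clocks(instructions=400, taken_jumps=0)
# The same 400 instructions executed as a 4-instruction loop with 100 taken jumps
looped = hubexec_clocks(instructions=400, taken_jumps=100)

print(straight, looped, looped / straight)
```

With these assumed costs the tight loop takes twice as long as the straight-line run; the real ratio depends on the actual refill latency.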

Ken Gracey · 2015-09-04 15:20

I suggest that we start new topics for the P2 instead of using this thread. We still aren't sure what caused the vaporizing of it earlier this week. - Ken

Bill Henning · 2015-09-04 16:10

At the risk of being shot, and only if it is trivial, the following instruction would make Spin and all other VMs much faster:

JMP clut(S)

which would simply do a jump to the address contained in clut location being pointed to by S

JMP clut(#)

may also be usable for jump tables, case() etc

CALL clut() may also be useful.

I suspect that the next step, JMP clut(S+=1/2/4) would be too expensive in terms of logic and mux delays.
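The idea behind JMP clut(S) can be modelled in Python as a table-driven dispatch loop. This is only a sketch: the handler names, the three-entry table and the bytecode values are hypothetical, and the list `lut` merely stands in for the cog's LUT RAM.

```python
# Hypothetical model: lut holds handler "addresses" (here, Python functions).
def op_push1(stack):
    stack.append(1)


def op_dup(stack):
    stack.append(stack[-1])


def op_add(stack):
    b, a = stack.pop(), stack.pop()
    stack.append(a + b)


lut = [op_push1, op_dup, op_add]   # table indexed by bytecode value


def run(bytecodes):
    stack = []
    for s in bytecodes:
        # Stand-in for JMP clut(S): one indexed lookup, then a jump to the handler
        lut[s](stack)
    return stack


result = run([0, 1, 2, 1, 2])   # push 1, dup, add, dup, add
print(result)
```

Without such an instruction, the lookup and the jump are two separate steps in the interpreter's inner loop, which is the saving being proposed.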

cgracey wrote: »

Cluso99 wrote: »

Chip,
What can the LUT (256 longs) be used for besides CLUT (and obviously lookup tables since you mention sin/cos tables)?

There's a Goertzel algorithm circuit in the streamer which can output the lower two bytes of the LUT longs to DACs, while the top two bytes are treated as sine and cosine for accumulation. On each clock, the 32-bit phase is added to, the LUT is read using the upper bits of the phase as the address, then the looked-up top two bytes are each multiplied by an ADC feedback bit (0/1 --> -1/+1) and accumulated separately. After many cycles of this, those accumulations represent an (X,Y) point that expresses angle and amplitude. The CORDIC instruction ARCTAN converts (X,Y) into (rho,theta) so you get power and phase angle. Those bottom two bytes from the LUT were output to DACs as a stimulus, if needed, to excite some system through which the ADC returns a bitstream. When used in this closed-loop mode, you have an instrument which should be able to resolve all kinds of interesting things in the real world that relate to resonance, time-of-flight, phase differences through sensor arrays, and who knows what else. Something new to play with.
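The accumulation loop described in the quote can be sketched in Python as follows. This is a software model under assumptions: the byte order within each LUT long (sine in the top byte, cosine in the next), the signed 8-bit scaling, and the synthetic ADC bitstream are all illustrative choices, and `math.hypot`/`math.atan2` stand in for the CORDIC conversion.

```python
import math

LUT_SIZE = 256  # the LUT is described as 256 longs at this point in the thread


def build_lut():
    """Pack signed 8-bit sine and cosine into the top two bytes (assumed layout)."""
    lut = []
    for i in range(LUT_SIZE):
        angle = 2 * math.pi * i / LUT_SIZE
        s = int(round(127 * math.sin(angle))) & 0xFF
        c = int(round(127 * math.cos(angle))) & 0xFF
        lut.append((s << 24) | (c << 16))  # low two bytes (DAC stimulus) left at 0
    return lut


def signed8(b):
    """Interpret a byte as a two's-complement signed value."""
    return b - 256 if b & 0x80 else b


def goertzel(adc_bits, phase_step, lut):
    """Accumulate sine and cosine, each weighted by the ADC bit mapped to -1/+1."""
    phase = 0
    acc_x = acc_y = 0
    for bit in adc_bits:
        phase = (phase + phase_step) & 0xFFFFFFFF   # 32-bit phase accumulator
        entry = lut[phase >> 24]                    # upper 8 bits address the LUT
        sign = 1 if bit else -1
        acc_y += sign * signed8((entry >> 24) & 0xFF)
        acc_x += sign * signed8((entry >> 16) & 0xFF)
    return acc_x, acc_y


lut = build_lut()
step = 1 << 26  # one full cycle every 64 clocks
# Synthetic 1-bit "ADC" stream: the sign of a sine at the probe frequency
bits = [1 if math.sin(2 * math.pi * (((k + 1) * step) & 0xFFFFFFFF) / 2**32) >= 0 else 0
        for k in range(6400)]
x, y = goertzel(bits, step, lut)
rho, theta = math.hypot(x, y), math.atan2(y, x)   # software stand-in for the CORDIC step
print(x, y, rho, theta)
```

Because the test bitstream is in phase with the sine column, nearly all of the energy lands in the sine accumulator and the cosine accumulator stays close to zero.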

Dave Hein · 2015-09-04 20:28

Bill, some of us would like to see an FPGA image as soon as possible. And Chip is working toward a deadline of getting the Verilog to Treehouse by November 1st. I don't think random suggestions for changes to P2 are helpful at this point.

cgracey · 2015-09-04 23:44

Dave Hein wrote: »

It was mentioned earlier that a jump will cause the streaming FIFO to refill, even if the jump address matches the top of the FIFO. I would guess that a REP would also cause the FIFO to refill as well. So, yes there would be stalls. The streaming FIFO is just a FIFO and not a cache. Hubexec will efficiently execute straight line code, but I don't think it will be very efficient with loops and jumps. This concerns me a bit because C code tends to have lots of jumps and function calls.

REP is not allowed in hub exec, as it would be a pain to make work and pretty much defeat the purpose of its efficiency. So, REP is only usable within cog exec.

There will be lots of hiccups branching around in hub code, for sure. An instruction cache could help that, in cases of tight loops, but I don't want to go there. I just figure that hub exec, in itself, is miraculous enough for this chip.

cgracey · 2015-09-04 23:48

Bill Henning wrote: »

At the risk of being shot, and only if it is trivial, the following instruction would make Spin and all other VMs much faster:

JMP clut(S)

which would simply do a jump to the address contained in clut location being pointed to by S

JMP clut(#)

may also be usable for jump tables, case() etc

CALL clut() may also be useful.

I suspect that the next step, JMP clut(S+=1/2/4) would be too expensive in terms of logic and mux delays.

cgracey wrote: »

Cluso99 wrote: »

Chip,
What can the LUT (256 longs) be used for besides CLUT (and obviously lookup tables since you mention sin/cos tables)?

There's a Goertzel algorithm circuit in the streamer which can output the lower two bytes of the LUT longs to DACs, while the top two bytes are treated as sine and cosine for accumulation. On each clock, the 32-bit phase is added to, the LUT is read using the upper bits of the phase as the address, then the looked-up top two bytes are each multiplied by an ADC feedback bit (0/1 --> -1/+1) and accumulated separately. After many cycles of this, those accumulations represent an (X,Y) point that expresses angle and amplitude. The CORDIC instruction ARCTAN converts (X,Y) into (rho,theta) so you get power and phase angle. Those bottom two bytes from the LUT were output to DACs as a stimulus, if needed, to excite some system through which the ADC returns a bitstream. When used in this closed-loop mode, you have an instrument which should be able to resolve all kinds of interesting things in the real world that relate to resonance, time-of-flight, phase differences through sensor arrays, and who knows what else. Something new to play with.

It would save an instruction, right? I don't think it's worth doing at this point. But, keep thinking.

jmg · 2015-09-05 00:47

cgracey wrote: »

...

There will be lots of hiccups branching around in hub code, for sure. An instruction cache could help that, in cases of tight loops, but I don't want to go there. I just figure that hub exec, in itself, is miraculous enough for this chip.

I would expect software tools to help a lot here, as there is always a mix of COG and HUB codes, so smallest most critical code (and certainly interrupts) would go into COG, and then a more elastic amount of not-inner-loop code can go into HUBEXEC.
Managing and controlling that mix and its thresholds is a software tools problem.

tdg8934 · 2015-09-09 20:04

It's been a long while since I've been back here hoping for current status of the Propeller 2 to be sold by Parallax. Where can I check the latest information (without searching through a ton of posts)?

kwinn · 2015-09-09 20:37

Right here in part 2. http://forums.parallax.com/discussion/162069/the-new-16-cog-512kb-64-analog-i-o-propeller-chip-part-2#latest

Tubular · 2015-09-09 20:52

tdg8934,
- about 7 weeks out from submission of P2 verilog for synthesis
- Parallax 1-2-3 A9 FPGA board soon to be available for $475. Smaller A7 (10 cogs) $375 available now
- P2 FPGA image to be released very soon for 6 or so target boards, including the above

cgracey · 2016-04-15 06:02

cgracey wrote: »
I've been running some numbers and I think that we are going to be just fine with 16 cogs and 512KB, after all:
die outline		8mm x 8mm = 64 mm2
pad frame		7.25mm x 0.75mm x 4 = 21.8 mm2
interior		64 - 21.8 = 42.2 mm2

16 of 8192x32 SP RAM	16 x 1.57 mm2 = 25.1 mm2
16 of 512x32 DP RAM	16 x 0.292 mm2 = 4.7 mm2
16 of 256x32 SP RAM	16 x 0.095 mm2 = 1.5 mm2
16384x8 ROM		0.3 mm2
memories		25.1 + 4.7 + 1.5 + 0.3 = 31.6 mm2

logic area		interior 42.2 - memories 31.6 = 10.6 mm2 for logic

gates allowance		120k/mm2 x 0.65 utilization x 10.6 mm2 = 827k gates
We'll only need about 300k gates, so planning for an 8 x 8 mm die with a 0.75 mm thick pad ring will give us all the room we need.

Sorry for the false alarm. Thanks for all your inputs. The consensus seems to be that big RAM is important.

P.S. You can figure the area savings of eliminating 8 cogs by cutting in half the 512x32 and 256x32 RAM areas (that comes to 2.35 and 0.75 mm2) and taking away the area of 150k gates (150k/827k x 10.6 mm2 = 1.92 mm2). Summing that all up: 2.35 + 0.75 + 1.92 = 5.0 mm2. That only saves about 8% of the die area, which is not much. Better to keep 16 cogs.
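The P.S. arithmetic can be checked with a few lines of Python. All figures are copied from the quoted post; only the variable names are invented here.

```python
# Figures from the quoted post; all areas in mm2.
dp_ram_512x32 = 4.7       # 16 cog RAMs
sp_ram_256x32 = 1.5       # 16 LUT RAMs
logic_area = 10.6         # area left for logic on the 8 x 8 mm die
gate_allowance = 827e3    # gates that fit in logic_area
gates_per_8_cogs = 150e3  # gates removed along with 8 cogs
die_area = 8.0 * 8.0

ram_saving = dp_ram_512x32 / 2 + sp_ram_256x32 / 2
logic_saving = gates_per_8_cogs / gate_allowance * logic_area
total_saving = ram_saving + logic_saving
fraction = total_saving / die_area

print(f"{total_saving:.2f} mm2 saved, {fraction:.1%} of the die")
```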

I had a two-hour Google Hangout meeting with Treehouse today and we went over the whole pad ring layout for Prop2. They did a really good job on it. Everything was in good order. I had them make the outer dimensions 8.5mm x 8.5mm, which is more than we should ever need, but will guarantee that we won't have to remove any features at the last minute during synthesis. I should have taken some screenshots. I'll ask them for some and will post them here.

I had to do a lot of Googling to find this post I quoted here. I was wondering just how much more room we have now with a bigger die.

Check this out:

die outline		8.5mm x 8.5mm = 72.25 mm2 (+13%)
pad frame		14.5 mm2 (-33%)
interior		57.75 mm2 (+37%)

16 of 8192x32 SP RAM	16 x 1.57 mm2 = 25.1 mm2
16 of 512x32 DP RAM	16 x 0.292 mm2 = 4.7 mm2
16 of 512x32 SP RAM	16 x 0.150 mm2 = 2.4 mm2
16384x8 ROM		0.3 mm2
memories		25.1 + 4.7 + 2.4 + 0.3 = 32.5 mm2

logic area		interior 57.75 - memories 32.5 = 25.25 mm2 for logic (2.38x)

gates allowance		120k/mm2 x 0.65 utilization x 25.25 mm2 = 1970k gates (2.38x)

There's not enough room to double the hub RAM to 1MB, but we could do something like switch the 512x32 SP LUT RAMs to dual-ports (like the cogs have) at a cost of 2.3 mm2. That would enable outputting from one LUT section while updating another, without any glitching.
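For anyone who wants to replay the two area budgets, here is a small Python sketch. The per-block areas, pad frame areas, gate density and utilization are taken from the posts above; the function name and its parameters are made up for this example, and results differ slightly from the posted totals because the posts round intermediate values.

```python
GATES_PER_MM2 = 120e3   # gate density quoted in the post
UTILIZATION = 0.65      # utilization quoted in the post


def gate_allowance(die_side_mm, pad_frame_mm2, lut_ram_mm2_each):
    """Return (logic area in mm2, gate allowance) for a square die with 16 cogs."""
    die = die_side_mm ** 2
    memories = (16 * 1.57                 # 16 of 8192x32 SP hub RAM
                + 16 * 0.292              # 16 of 512x32 DP cog RAM
                + 16 * lut_ram_mm2_each   # 16 LUT RAMs
                + 0.3)                    # 16384x8 ROM
    logic = die - pad_frame_mm2 - memories
    return logic, GATES_PER_MM2 * UTILIZATION * logic


# Original 8 x 8 mm plan with 256x32 LUTs
logic_8, gates_8 = gate_allowance(8.0, 21.8, 0.095)
# 8.5 x 8.5 mm die with 512x32 LUTs
logic_85, gates_85 = gate_allowance(8.5, 14.5, 0.150)

print(f"8.0 mm: {logic_8:.2f} mm2 logic, {gates_8 / 1e3:.0f}k gates")
print(f"8.5 mm: {logic_85:.2f} mm2 logic, {gates_85 / 1e3:.0f}k gates")
print(f"ratio: {logic_85 / logic_8:.2f}x")
```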

jmg · 2016-04-15 06:25

cgracey wrote: »

There's not enough room to double the hub RAM to 1MB, but we could do something like switch the 512x32 SP LUT RAMs to dual-ports (like the cogs have) at a cost of 3.2 mm2.

Do you mean so they would dual-port to adjacent COGs ? or just be able to extend std COG memory ?

cgracey · 2016-04-15 06:32

jmg wrote: »

cgracey wrote: »

There's not enough room to double the hub RAM to 1MB, but we could do something like switch the 512x32 SP LUT RAMs to dual-ports (like the cogs have) at a cost of 3.2 mm2.

Do you mean so they would dual-port to adjacent COGs ? or just be able to extend std COG memory ?

I just meant that the streamer would be able to access the LUT from its own bus.

What you are talking about would certainly solve the cog-to-cog communication problem!

Tubular · 2016-04-15 07:46

How many gates do you need, at the moment Chip?

That cost 2.3 instead of 3.2? (4.7-2.4mm2) ?

cgracey · 2016-04-15 08:35

Tubular wrote: »

How many gates do you need, at the moment Chip?

That cost 2.3 instead of 3.2? (4.7-2.4mm2) ?

We probably need about 600k gates.

Thanks for noticing the mixed-up-digits mistake. I just fixed it. Did you hear about the dyslexic agnostic insomniac? He sits up at night wondering if there is a dog.

Tubular · 2016-04-15 08:53

It's a great joke, Chip : )

Better work out how to fill that spare space fast, before we forumistas start our suggesting...

evanh · 2016-04-15 11:47

Wow, doubling HubRAM again! If that really fits I think many people would vote for it over other features. It begs the question though, given 72 - 25 = 47, how much cheaper would a 512kB HubRAM 7mm x 7mm die be?

Seairth · 2016-04-15 11:54

cgracey wrote: »

jmg wrote: »

cgracey wrote: »

There's not enough room to double the hub RAM to 1MB, but we could do something like switch the 512x32 SP LUT RAMs to dual-ports (like the cogs have) at a cost of 3.2 mm2.

Do you mean so they would dual-port to adjacent COGs ? or just be able to extend std COG memory ?

I just meant that the streamer would be able to access the LUT from its own bus.

What you are talking about would certainly solve the cog-to-cog communication problem!

That would mean that there would be only 8 LUT blocks (one per pair of cogs). That may still be reasonable. Could you then go to 4-port RAM and provide both glitch-free streaming and cog-to-cog access?

Edit: also add one more event for writing to LUT address $1FE. That way, the paired cogs can use $1FE and $1FF for signaling.

evanh · 2016-04-15 12:09

I still think HubRAM as soft buffers is best interCog comms, it's so much more flexible. Prop2 Hub has the potential bandwidth, why not use it?

Cluso99 · 2016-04-15 13:19

Oooooh! Maybe 16mm2 available

Two thoughts come to mind...

16 of 512x32 DP RAM 16 x 0.292 mm2 = 4.7 mm2
Additional 4KB of LUT dual-ported between adjacent cogs (addresses $400-$5FF) and $3FF interrupt becomes $5FF.

16 of 4096x32 SP RAM 16 x 1.57 x0.5 mm2 = 25.1/2 mm2 = ~12.6 mm2
Additional 256KB of hub ram (total 768KB)

PostEdit
Seriously though...

If we could squeeze another 256KB of hub ram, this would be worthwhile.

For COGs, 16 of 32x32 DP RAM between adjacent cogs would give some serious and fast comms between cogs. Coupled with an interrupt, it would be even better. 16 of 32x32 DP RAM = 16 x 0.292 / 16 = 4.7 / 16 = 0.29 mm2.

For COGs, adding another 4KB or 8KB of SP RAM (2.4 or 4.8 mm2) would be nice. It would give a huge LUT for streaming, and a serious boost to cog/lut-exec space.
Presuming an additional 4KB, that would give 2 of 4KB blocks of LUT. It might be possible to be filling one 4KB block while streaming from the other 4KB block (ie pseudo dual-port).
