Joyous times :)

Ariba · 2014-02-06 18:52

rjo__ wrote: »

I am modifying Chip's Monitor.spin to add some math and serial utilities for the P2-Nano.

When I try to move PTRA's address into a variable, tempptr: mov tempptr, PTRA

I get an error: Expected a Constant, Unary Operator or "(".

Given the new PTRA instructions, I don't actually have to do this. I was just wondering about it.

All these special registers like PTRA have SET... and GET... instructions; only the port registers are register-mapped and allow MOV.
So you need to do:
GETPTRA tempptr

Andy

rjo__ · 2014-02-06 19:10

Seems to be one of those forest for the trees situations.

Thanks

rjo__ · 2014-05-21 15:18

I have not really had time to properly lurk for the last month... let alone play with anything really fun... poor me:)

I just have to say that the current direction of this design is just.......brilliant.

For anyone not tickled by the new features, this design will provide a parsimonious environment for updating proven designs.

In a moment of legendary insight, it occurred to me that we will be able to stream data through the P2 at rates that defy processing. This is good. It means that, up to the limit of streaming, data can be captured, streamed to external memory and then post-processed at a leisurely pace. To me, this is a game changer. We don't really have to worry if the chip can keep up with the signal it is trying to process, we only have to stream the data to a large block and then hash it out. I can't stop thinking about it... I wake up in the middle of the night screaming..."Praise Jesus!!!"

I want to live in a place where it is legal to marry this chip. This is just fabulous.

We don't need simultaneous bidirectional FIFOs, this is PERFECT!

Bill Henning · 2014-05-21 15:26

divide and conquer

stream it to memory, then throw many cogs at the problem!

reverse is also true,

have many cogs generate a signal in a table, and stream it out.

and for the icing on the cake... the FIFO significantly speeds up hubexec (and even LMM)

if there is room for it, a write fifo is nice too - double-speed memory copies do not hurt - but it is by no means necessary.
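The "stream it to memory, then throw many cogs at the problem" idea can be sketched on a PC. This is only an illustration in Python, not P2 code: the worker threads stand in for cogs, `NUM_COGS`, `split_into_chunks`, `process_chunk` and `post_process` are made-up names, and the per-chunk job (a plain sum) is a placeholder for whatever analysis you actually want to run on the captured block.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_COGS = 8  # assumption: eight workers stand in for eight cogs


def split_into_chunks(buffer, parts):
    """Split buffer into `parts` contiguous slices of near-equal length."""
    base, extra = divmod(len(buffer), parts)
    chunks, start = [], 0
    for i in range(parts):
        # The first `extra` chunks take one additional element each.
        end = start + base + (1 if i < extra else 0)
        chunks.append(buffer[start:end])
        start = end
    return chunks


def process_chunk(chunk):
    """Hypothetical per-cog job: here it only sums the samples."""
    return sum(chunk)


def post_process(buffer, workers=NUM_COGS):
    """Process an already captured buffer with several workers, then combine."""
    chunks = split_into_chunks(buffer, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_chunk, chunks))
    return sum(partials)
```

Capture is decoupled from processing here: `post_process` only runs once the whole buffer exists, so the workers never have to keep pace with the incoming stream.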

Ken Gracey · 2014-05-21 15:28

Wonderful! What you think about, you bring about. Choose joyous times or lament - just be sure it's the right decision.

Seriously, spoke to Chip today and he's making solid progress on the hub RAM management and expects to have something working within the day. I'll stop by his home tomorrow just to see it with my own eyes. If I'm ever feeling lament around our progress, I can usually talk to Chip and get back the vibe of joyous times.

Ken Gracey

rjo__ · 2014-05-21 15:29

Don't touch that space:)... reserved for memory.

rjo__ · 2014-05-21 15:31

http://www.youtube.com/watch?v=nKgUq5dziEk&feature=kp

jmg · 2014-05-21 15:59

Bill Henning wrote: »

if there is room for it, a write fifo is nice too - double-speed memory copies do not hurt - but it is by no means necessary.

I think Chip already has it working in both directions, but in Simplex manner. (ie you configure direction).
He did ask about Duplex FIFOs (one for each way), but as these are looking quite large, I think starting with Simplex is better.

I'm trying to get a handle on the relative area/speed of FF+MUX or Dual Port RAM FIFOs, but ASIC info varies widely, depending on how custom your cells are.

Certainly on an FPGA, Dual Port RAM is smaller and faster than a sea of FF+MUX, because DPRAM is there as custom blocks.
Maybe OnSemi could give some guidelines re their Process/cells?
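For anyone trying to picture what "Simplex, you configure direction" means in practice, here is a toy software model in Python. It is not how Chip's hardware works; `SimplexFifo`, the `"hub_to_cog"` / `"cog_to_hub"` direction names and the `put`/`get` methods are all invented for illustration. The only point it makes is that one buffer serves one direction at a time, and changing direction is an explicit reconfiguration.

```python
from collections import deque


class SimplexFifo:
    """Toy model: a single buffer that carries data one way at a time."""

    DIRECTIONS = ("hub_to_cog", "cog_to_hub")  # hypothetical names

    def __init__(self, depth, direction):
        self.depth = depth
        self.buf = deque()
        self.configure(direction)

    def configure(self, direction):
        """Select the direction; any data left in the buffer is discarded."""
        if direction not in self.DIRECTIONS:
            raise ValueError("unknown direction: %r" % (direction,))
        self.direction = direction
        self.buf.clear()

    def _producer(self):
        return "hub" if self.direction == "hub_to_cog" else "cog"

    def _consumer(self):
        return "cog" if self.direction == "hub_to_cog" else "hub"

    def put(self, side, value):
        """Only the side currently configured as producer may write."""
        if side != self._producer():
            raise RuntimeError(side + " cannot write in mode " + self.direction)
        if len(self.buf) >= self.depth:
            raise OverflowError("FIFO full")
        self.buf.append(value)

    def get(self, side):
        """Only the side currently configured as consumer may read."""
        if side != self._consumer():
            raise RuntimeError(side + " cannot read in mode " + self.direction)
        return self.buf.popleft()  # raises IndexError when empty
```

A duplex design would amount to two of these, one per direction, which is roughly why it costs about twice the storage.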

Bill Henning · 2014-05-21 16:11

Good info.

I had some evil thoughts about double-purposing the LUT as an i/d cache for hubexec, but I won't mention it in the other thread as some people would freak out.

jmg wrote: »

I think Chip already has it working in both directions, but in Simplex manner. (ie you configure direction).
He did ask about Duplex FIFOs (one for each way), but as these are looking quite large, I think starting with Simplex is better.

I'm trying to get a handle on the relative area/speed of FF+MUX or Dual Port RAM FIFOs, but ASIC info varies widely, depending on how custom your cells are.

Certainly on an FPGA, Dual Port RAM is smaller and faster than a sea of FF+MUX, because DPRAM is there as custom blocks.
Maybe OnSemi could give some guidelines re their Process/cells?

jmg · 2014-05-21 16:16

Bill Henning wrote: »

Good info.

I had some evil thoughts about double-purposing the LUT as an i/d cache for hubexec, but I won't mention it in the other thread as some people would freak out.

Yes, best get it working first

Simplex FIFO seems fine to me, especially as there are direct HUB opcodes coming too.

Bill Henning · 2014-05-21 16:19

Agreed, simplex FIFO is good enough on the FIFO side. I love how it can help hubexec.

I am curious how many spacer instructions are needed after the fifo reads before the values are usable for D or S, or executable.

It makes a huge performance difference in my LMM model (results posted in other thread)

jmg wrote: »

Yes, best get it working first
Simplex FIFO seems fine to me, especially as there are direct HUB opcodes coming too.

rjo__ · 2014-05-21 16:32

Bill,

FWIW, I enjoy a little drama once in a while:) As long as it stays technical and I can learn something from it... which is about every other word, sometimes.
There are some terms which I think we should be able to use as terms of endearment. Phrases such as "You idiot" or "What century were you born in?" should be linked to some common resource that defines them as euphemisms for "I respectfully disagree" and "I respect your age and experience." Code phrases, meant for this forum and nowhere else.

Most of the time, the simple dry explanation isn't enough for me to appreciate all of the implications, interdependencies, etc.
This is a fairly complex design... but all of the heated discussion somehow renders the inscrutable slightly more scrutable.

Simple concepts can be very difficult to explain in a simple way.

I like the simplex design because we have some real talent sitting around. Leaving it in simplex requires real talent to orchestrate 2 Cogs working at the same data from different directions, but it can be done... and when the explanations come in as to exactly how it is done, the general understanding of the chip will be clearer, in ways that are generally useful but hard to define.

Bill Henning · 2014-05-21 16:38

LOL!

I know I am guilty of dry, overtly technical explanations.

At least I refrain from terms of endearment

rjo__ · 2014-05-21 16:44

jmg,

What in the world are you talking about?... I'm not an engineer. Most of the time I have a good sense of what is possible and absolutely no idea how to actually do it:)
Not knowing is fun sometimes... but usually... not. My mental picture is stuck on this: once the new Propeller goes into production, I am going to ask Chip to come up with an FPGA design to do what I have described... and then find a way for me to hook up my new Propeller to the FPGA... at that point, I will be in Heaven on Earth. No one will make any money from it... but for the early birds... and I suspect you are one: Why wait? Let's get cracking... beat Chip to the punch... save him a little effort. Make me the happiest guy in the world.

Thanks,

Rich

Tubular · 2014-05-21 17:26

Bill Henning wrote: »

I had some evil thoughts about double-purposing the LUT as an i/d cache for hubexec, but I won't mention it in the other thread as some people would freak out.

I appreciate the restraint, Bill, and that's a good idea until we get an FPGA image. Thanks for the figures of merit for different schemes.

But do keep developing those evil thoughts in the meantime. Roy's proposed architecture is a giant leap forward, takes a bit for the rest of the design to catch up to the new paradigm and balance again.

We need to get a current consumption estimate on that new FPGA image, and perhaps make some contingency plans in case it's too high.

Bill Henning · 2014-05-21 17:42

Tubular wrote: »

I appreciate the restraint, Bill, and that's a good idea until we get an FPGA image.

Tubular wrote: »

Thanks for the figures of merit for different schemes.

You are welcome!

Interesting that those opposed to hubexec and fifo cannot refute the numbers...

Tubular wrote: »

But do keep developing those evil thoughts in the meantime.

I can't stop

Tubular wrote: »

Roy's proposed architecture is a giant leap forward, takes a bit for the rest of the design to catch up to the new paradigm and balance again.

We need to get a current consumption estimate on that new FPGA image, and perhaps make some contingency plans in case it's too high.

Yep.

Contingency plan:

- only use the number of cogs you need for your application

- run them at the lowest clock rate that allows your application to run

- use the various WAITxxx's as much as possible

- sell it as different TDPs at different clock rates

If you need all cogs, maximum performance, then at least you can do your app, versus not being able to do it on a "simplified to death" - essentially deliberately crippled - prop.

Funny thing is... same contingency plan would have worked for the "5W" p2, with a typical 1w-2w current consumption.

Mind you, I prefer the new hub interface to the 256 bit bus on the p2, and love the idea of smart pins.

The good news is that the 32 bit hub bus, and dual ported (instead of quad ported) cog memory, and gated ALU sections ought to save a lot of power.

I'd be really curious about the power envelope for a P2 with 32 bit new style hub, with gated ALU sections... but the eight cogs otherwise staying the same (pipelined, all the nice stuff).

Cluso99 · 2014-05-21 20:52

Bill Henning wrote: »

You are welcome!

Interesting that those opposed to hubexec and fifo cannot refute the numbers...

I can't stop

Yep.

Contingency plan:

- only use the number of cogs you need for your application

- run them at the lowest clock rate that allows your application to run

- use the various WAITxxx's as much as possible

- sell it as different TDPs at different clock rates

If you need all cogs, maximum performance, then at least you can do your app, versus not being able to do it on a "simplified to death" - essentially deliberately crippled - prop.

Funny thing is... same contingency plan would have worked for the "5W" p2, with a typical 1w-2w current consumption.

Mind you, I prefer the new hub interface to the 256 bit bus on the p2, and love the idea of smart pins.

The good news is that the 32 bit hub bus, and dual ported (instead of quad ported) cog memory, and gated ALU sections ought to save a lot of power.

I'd be really curious about the power envelope for a P2 with 32 bit new style hub, with gated ALU sections... but the eight cogs otherwise staying the same (pipelined, all the nice stuff).

The ALU became a nightmare with the 4 stage pipeline and forwarding circuits. Then to top that off, 4 level multitasking was added, and this meant the ALU was carrying 4 sets of parallel data thru the pipeline. I am amazed it even worked, irrespective of power. And Chip had to wrap his head around it all too. That's what caused the brain meltdown.

Tubular · 2014-05-21 21:45

Yep, I have to admit I prefer the hw fifo with nco clock, vs doing it via the stack/aux ram.

Smart pins should be an improvement too, still waiting on the detail.

Bill, you're right, the contingencies are similar, but I think this new one should be easier to shrink, should that be necessary.

evanh · 2014-05-22 15:06

Cluso99 wrote: »

The ALU became a nightmare with the 4 stage pipeline and forwarding circuits. Then to top that off, 4 level multitasking was added, and this meant the ALU was carrying 4 sets of parallel data thru the pipeline.

That was the easy part and pretty typical for heavier processors.

Chip has said a few times now that hubexec is causing all the stress. Not so much just getting it to work, but getting it efficient and fast.

Hopefully, the new crosspoint switching and FIFO will smooth it all out and Chip won't have to add lots of special hubexec support in the end.
