"Out on a Limb" - Some P2 (radical?) thoughts... — Parallax Forums


For a little while I have wondered what maximum speed we are going to be able to run the P2 at.
Originally I had hoped for 200MHz, although nowadays I think 160MHz may be it.

Hence I asked Chip what the OnSemi RAM can run at - the answer ~350MHz.
My reasoning behind this question was, "Do we need Dual Port Cog RAM ?".

http://forums.parallax.com/utility/thumbnail/115347/FileUpload/ce/72e184cc9febdacf81e0558efeb520.jpg

My understanding from the P1V leads me to believe that the I & R information is ready at the beginning of one cog clock, and the S & D information at the beginning of the next clock.

That means that if the Cog RAM were clocked at 2x the Cog clock, then:
"I" could be read on the first half of the I&R clock and "R" written on the second half of the I&R clock.
"S" could be read on the first half of the S&D clock and "D" read on the second half of the S&D clock.

This would mean that standard single port RAM cells could be used for the COG RAM.
This saves a reasonable amount of die space, and perhaps complexity.
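To make the half-cycle interleaving concrete, here is a behavioural sketch in Python (not Verilog, and the class and method names are mine): a single-port RAM serviced twice per cog clock, which from the cog's point of view behaves like the dual-port RAM it would replace.

```python
class TimeMultiplexedRAM:
    """Single-port cog RAM clocked at 2x the cog clock.

    Each cog clock gives the RAM two half-cycles, so one port can
    service the two accesses that would otherwise need dual porting.
    """

    def __init__(self, size=512):
        self.mem = [0] * size

    def ir_clock(self, i_addr, r_addr, r_data):
        """I&R cog clock: half-cycle 1 reads the instruction (I),
        half-cycle 2 writes back the previous result (R)."""
        instruction = self.mem[i_addr]   # RAM half-cycle 1: read I
        self.mem[r_addr] = r_data        # RAM half-cycle 2: write R
        return instruction

    def sd_clock(self, s_addr, d_addr):
        """S&D cog clock: half-cycle 1 reads the source operand (S),
        half-cycle 2 reads the destination operand (D)."""
        s = self.mem[s_addr]             # RAM half-cycle 1: read S
        d = self.mem[d_addr]             # RAM half-cycle 2: read D
        return s, d
```

Each method body is one cog clock; the two statements inside it correspond to the two RAM half-cycles described above.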

What do we get for this ???

Now LUT RAM is identical to COG RAM. So what?

Well, now add an instruction "LUTDS D/#,S/#", where each # is 11 bits (possibly replacing the AUGDS instruction), that could precede any normal mov/add/etc instruction. All normal instructions could then not only address the whole LUT, but the register set would be expanded to the full COG/LUT space (admittedly at the cost of an extra instruction).
Now we no longer require the RDLUT/WRLUT instruction.





Comments

  • jmg Posts: 15,173
    edited 2015-10-03 01:56
    Cluso99 wrote: »
    Hence I asked Chip what the OnSemi RAM can run at - the answer ~350MHz.
    My reasoning behind this question was, "Do we need Dual Port Cog RAM ?".
    A little care is needed with numbers, as FPGAs also spec similarly high RAM MHz numbers, but give system speeds that are 1/2 to 1/3 or even 1/4 of that.
    That is because RAM never exists alone, it needs address decode, multiplexers and routing paths, and what it feeds, needs a setup time.

    I had a similar thought a while ago about counters, which are simpler than RAM.
    The question was could counters run at 2x the CPU speed, to give better timing resolutions ?
    IIRC the outcome was that counters are faster than opcode decode, but opcode decode is already quite well optimised (the P2 is a mature design in that area), which means a 32-bit counter is not more than 2x faster than the opcode decode path. i.e. you could not get 2x for free, even on a counter.
  • msrobots
    more interesting would be if the eggbeater could run faster with double clocked RAM.

    Interesting idea.

    Mike
  • Cluso99 Posts: 18,069
    msrobots wrote: »
    more interesting would be if the eggbeater could run faster with double clocked RAM.

    Interesting idea.

    Mike
    The COG/LUT RAM is tightly coupled whereas the eggbeater has long paths and muxes to all 16 cogs in its path.
  • evanh Posts: 15,915
    edited 2015-10-03 06:07
    Cluso,
    It's probably theoretically possible to have a 320MHz clock for the register-set while using 160MHz for the rest of the core. It won't happen though. The synthesis system just won't have the features to handle it. Chip already mentioned how the dual-cycle execute phase couldn't be timing simulated, or something to that effect, so he has added extra pipeline flops to satisfy that limitation.

    That's probably the biggest sacrifice of using so many levels of automated tools. You are bound by the limits of them all combined. It's a logical AND operation.
  • evanh Posts: 15,915
    I found this nice diagram of functional interfacing between SDR and DDR.
  • I'm not sure that 350MHz is a realistic target. I'm pulling this from my days working for National Semiconductor in their high-speed communications division. I was on the team building the MacPhy (Media Access Controller / Physical layer) for 10/100/1000 Ethernet cards that plug into laptops and PCs. That chip was built in the same 180nm TSMC process that Parallax is using, and for the MacPhy we targeted 300MHz for reliability reasons. The 180nm process itself only allows for just slightly over 300MHz .... So how do you achieve Gig speeds when 300 is the limit? ... you phase-shift with 4 oscillators, each running at 250MHz... the combined throughput is a Gig. This is still complex to design at frequencies that push the corner limit of the target silicon process. What looks good on paper and through the eyes of a simulator still requires empirical testing to get everything correct. Very seldom will a piece of silicon work right off the bat. The early P2 efforts are brutal evidence of this. Parallax got lucky with the P1.

    Note: That 300MHz limit is just a "heartbeat oscillator" limit.... it is based on the parasitic capacitance of the smallest transistor oscillator capable of sustaining adequate drive strength and the S/D leakage (resistance) to the substrate as well as any parasitic inductance forming a simple parasitic RLC filter by the result of component layout. To achieve those frequencies at NSC the quad-phased "heartbeat oscillator(s)" required strict layout requirements positioned in isolated NWELLs as far away from the substrate and other active circuitry as possible.

    Another concern I have, directly related to the frequency bandwidth, is clock propagation through proper fanout and delay-insertion techniques. This can have a significant impact on the number of cells calculated for the overall design size. This fanout also applies to sensitive signal lines such as the COG-to-COG, COG-to-MEMORY, or COG-to-I/O communication paths.

    More cells = more propagation delay = better fanout timing = slower clock speeds
    Fewer cells = less propagation delay = poorer fanout timing = faster clock speeds

    There is a fine balance between the two, and the automated P&R tool (Avanti! software - now Synopsys, the leading industry software to accomplish this) will programmatically accept whatever latency delays you provide on any selected signal and attempt to meet the timing requirements you give it, after several thousand/million iterations. I became very familiar with this tool at NSC and know its limitations quite well. It's the proper tool to use, in my opinion.

  • Ale Posts: 2,363
    edited 2015-10-03 16:06
    The problem is not the RAM. In the P1V the ALU output MUX was a 64-to-1 mux... that plus the shift muxes... there is your longest path, or one of them. The P1V gets two clocks for that... it is that long.
  • evanh Posts: 15,915
    Ale wrote: »
    The ALU output MUX was in the P1V a 64 to 1 mux... that plus the shift muxes... there is your longest path, or one of them. The P1V gets two clocks for that... it is that long.
    Cluso is proposing that that part still runs at the original speed. Only the final registering of the SRAM block be double-clocked just to eliminate the dual-porting. It would probably add another effective stage to the pipeline but that's a detail for when such a manoeuvre is even an option.
  • Ale Posts: 2,363
    A similar scheme could be tested on the P1V...
  • Cluso99 Posts: 18,069
    Ale wrote: »
    A similar scheme could be tested on the P1V...
    While it could be tested, the P1V doesn't give a clean compile anyway, so timing cannot be analysed properly (and I don't know how to do it either).
    But, I don't think the FPGA will test this out anyway. I think it's an OnSemi/Treehouse question about the possibility.

    If it worked, there would be enough positive benefits that I thought it was worth the question.
  • cgracey Posts: 14,152
    Double clocking would make timing closure a pain. I've found it's best to have one clock and no multicycle paths. Otherwise, complexity just explodes.
  • Cluso99 Posts: 18,069
    cgracey wrote: »
    Double clocking would make timing closure a pain. I've found it's best to have one clock and no multicycle paths. Otherwise, complexity just explodes.
    Thanks Chip for the further explanation. We certainly don't want to make it any more difficult than it is.