Prop2 FPGA files!!! - Updated 2 June 2018 - Final Version 32i


Comments

  • My SD code has 3 nops after clock before read. Been a while, but I must have decided they were needed a long time ago...
    Prop Info and Apps: http://www.rayslogic.com/
  • We had similar i/o delays with P2-hot
  • Peter Jakacki Posts: 8,514
    edited 2017-11-22 - 02:49:56
    Well we don't need delays for the data, in fact it would be possible to read it before taking the clock high. But adding delays in code because it needs to take into account the delay through the input path is not my idea of the way it should be. Yes, registered inputs can be good for some things, but we need to be able to read the pin metal itself unobstructed.

    I've been updating my code to use the newer pin output modes as this code is fairly old and still referenced port A or B. I do like the drvx instructions too!
    Yes, keep it "simple".

    Tachyon Forth - compact, fast, forthwright and interactive
    useforthlogo-s.png
    --->CLICK THE LOGO for more links<---
    P2 +++++ TAQOZ INTRO & LINKS +++++ P2 SHORTFORM DATASHEET
    P1 +++++ Latest binary V5.4 includes EASYFILE +++++ Tachyon Forth News Blog
    Brisbane, Australia
  • Yanomani Posts: 877
    edited 2017-11-22 - 03:05:29
    Hi Chip

    The better all those delays can be documented, the more chances we will have to get decent results in timing-sensitive interfaces, such as HyperBus (ddr) ones.

    I was almost praying for a way for data bus arrival (during a read) to be logically validated by the changes (both rising and falling) of RWDS, via an internally synchronized capture, so that the streamer capture advances if and only if that condition can be reliably sensed and reacted to (a hardware-only dependency), but it will not happen with the current design.

    The other thing a top-speed HyperBus device needs, and that will also not happen with the current design, is for the hyperclock driving pin to toggle its state half a clock cycle (based on the streamer's clock rate) AFTER the data driven by the streamer's logic reaches the data bus output pins.

    In the past months, maybe a year or so, I'm not sure, I've had the impression that the delays in the signal propagation paths inside the chip could fairly simulate the needed behavior, presenting a hyperclock-driving signal change at the pins just in the middle of the data bus change at the pins, as if it were driven by the complement of the system clock. I believe this will no longer be the case.

    As for write operations and the other controlling signal, RWDS could be held stable during the entire cycle, unless someone decides to individually control which of the two bytes that compose a word should be written; thus there is no critical timing relationship between RWDS and the data bus or the hyperclock signal during full, valid word data writes.
    Unless, of course, someone decides to tackle a WMLONG-instruction behavior simulation. Feasible at lower speeds, but not top ones.

    Only some thoughts

    Henrique
  • I think we will need a diagram showing where in the instruction pipeline the pin gets set and read. Obviously there are a number of clocks from the instruction writing to the OUTx register and those bits arriving at the PINs. Same must be true for input PINs arriving at the INx register, before an instruction can read the INx register.

    Sorry, but I hadn't anticipated such shortcomings in the P2, especially since P2 still embraces soft peripherals. I would expect this in ARM style chips though, as one cannot get to the bare metal in them. IMHO P2 is different. I don't expect everyone to use, even if it's possible, the smart pins.

    Will have to see the pipeline timing diagram.
    My Prop boards: P8XBlade2, RamBlade, CpuBlade, TriBlade
    Prop OS (also see Sphinx, PropDos, PropCmd, Spinix)
    Website: www.clusos.com
    Prop Tools (Index) , Emulators (Index) , ZiCog (Z80)
  • It's going to be like the HUB. Interleave ops to use the time well.

    Not happy either, though this effect could be largely ignored at the faster clocks, when dealing with slower signals.

    Do not taunt Happy Fun Ball! @opengeekorg ---> Be Excellent To One Another SKYPE = acuity_doug
    Parallax colors simplified: https://forums.parallax.com/discussion/123709/commented-graphics-demo-spin
  • evanh Posts: 7,520
    edited 2017-11-22 - 05:47:00
    I/O buffering is independent of the instruction pipeline. You could say they have their own pipelines. I/O clocking will be every single system clock cycle.

    So the question is really one of lag from execution to final output transition, and, back, from input registering to instruction execution. Chip has partly answered as a whole turnaround is about 3 instructions. I do remember noting this in my SmartPin testing, so it's about the same behaviour for Smartpins.

    EDIT: Different to the HubRAM stalling effect. There is no stalling, only a lag.

    The Prop1 will have equivalent behaviour but probably wasn't measurable in instructions.

    EDIT2: Oops, I only read JMG's quote from Chip. Input is where the extra delay is. Chip states 4 clocks (2 instructions) lag for an input.
    "... peers into the actual workings of a quantum jump for the first time. The results
    reveal a surprising finding that contradicts Danish physicist Niels Bohr's established view
    —the jumps are neither abrupt nor as random as previously thought."
  • jmg Posts: 13,783
    Yanomani wrote: »
    ...
    I was almost praying for a way for data bus arrival (during a read) to be logically validated by the changes (both rising and falling) of RWDS, via an internally synchronized capture, so that the streamer capture advances if and only if that condition can be reliably sensed and reacted to (a hardware-only dependency), but it will not happen with the current design.

    For the very highest speeds, these new memories (HyperBUS, OctalSPI) need the RWDS as a timing tracking capture clock.
    At lower clock speeds, that RWDS is less important for timing tracking.

    However, frustratingly vague in the data, is whether RWDS also has gaps across read boundaries, on continual reads.
    (ie adds dummy states). My reading of the HyperRAM is that it is ok, but HyperFLASH is less clear.

    If that is needed, then P2 could still read RWDS memory, with the caveat of not across a read boundary.

    The way the packing NOPs changes with MHz on FPGA builds, is not a good sign for HyperRAM etc development.


  • evanh Posts: 7,520
    edited 2017-11-22 - 07:28:51
    jmg wrote: »
    The way the packing NOPs changes with MHz on FPGA builds, ...
    I'm a little intrigued as to how that variance occurs at all. I would have thought that both in and out flops would be clocked in parallel so that the output pin would always transition after the input has been registered. So, clockrate shouldn't matter unless it takes more than a clock period for the output to swing.
  • jmg Posts: 13,783
    evanh wrote: »
    jmg wrote: »
    The way the packing NOPs changes with MHz on FPGA builds, ...
    I'm a little intrigued as to how that variance occurs at all. I would have thought that both in and out flops would be clocked in parallel so that the output pin would always transition after the input has been registered. So, clockrate shouldn't matter unless it takes more than a clock period for the output to swing.

    I think it needs routing delays to exceed a 80MHz SysCLK, and Peter suggested the threshold is even close to 40MHz (25ns) which seems a high delay for the FPGA speeds involved here ? - so parts of this are puzzling.

  • jmg wrote: »
    evanh wrote: »
    jmg wrote: »
    The way the packing NOPs changes with MHz on FPGA builds, ...
    I'm a little intrigued as to how that variance occurs at all. I would have thought that both in and out flops would be clocked in parallel so that the output pin would always transition after the input has been registered. So, clockrate shouldn't matter unless it takes more than a clock period for the output to swing.

    I think it needs routing delays to exceed a 80MHz SysCLK, and Peter suggested the threshold is even close to 40MHz (25ns) which seems a high delay for the FPGA speeds involved here ? - so parts of this are puzzling.

    It's not just routing delays. It's the logic in either direction after (out) and before (in) registering.
  • evanh Posts: 7,520
    edited 2017-11-22 - 13:09:41
    Oh, I see, OUT goes all the way from the depths of each Cog core to the pin drive transistors without further flops. And all the multi-Cog OR logic and Smartpin mux'ing gives the pin a large propagation delay.

    What about the Streamers, is it possible to work out their propagation delay? I imagine the Smartpin propagation to the pin drivers is very low. As Henrique has mentioned, it might be advantageous to use these delays for max speed synchronous clocking purpose.
  • Rayman Posts: 9,562
    edited 2017-11-22 - 15:08:24
    If the in and out delays were the same, seems like Peter's 20 MHz SPI code would work...
    Oops, that's not right... Or is it? Hard to do in my head...

    So, IN is actually from a few clock ago and OUT doesn't appear until a few clocks later, right?

    Maybe there's a way to make that work if we knew the delays better...
  • I hate to ask this, but can all of this be gated?

    Like, smart pin mode? On or off?

  • cgracey Posts: 11,508
    edited 2017-11-22 - 18:47:22
    potatohead wrote: »
    I hate to ask this, but can all of this be gated?

    Like, smart pin mode? On or off?

    To get timing sped up on the FPGA, we need to have flops inserted within certain high-fan-in and high-fan-out nets, due to the long (and slow) interconnects which they require.

    On the actual silicon, wiring delays aren't so relatively long, compared to logic delays. In order to optimize the logic delays in the FPGA design, which DO correlate well with real silicon logic delays, we must get the FPGA interconnects out of the way by using flops. Then, we can optimize the logic for timing, which will also benefit the actual silicon, later on.

    So, in the end, we wind up with a design that runs fast on both the FPGA and in real silicon. The cost of this is extra flops on certain nets.

    We could pull out some flops in the actual silicon, but it would break the design for FPGA testing at high speed. It's somewhat of a conundrum. The way we are doing it now is, at least, safe.

    While it seems like a disaster that there are multi-cycle delays, adding up OUT-to-pin time and pin-to-IN time, I've found it to be almost completely trivial. Think about this... If you are running really fast, is it ever requisite to change a pin, look at the feedback, and change it again within maybe 4 cycles? I have found that that is maybe never needed. What does require some thinking, though, is planning your OUT and IN timing for things like shift registers. That does require awareness. For pure IN or OUT activity, though, you would never even know these delays exist.
  • Ok. What is "slower" in terms of FPGA?

    If it's, say half current speed, that kind of sucks. But say it's one quarter slower?

    Shouldn't the end silicon be a focus? We don't technically need the fastest FPGA, but lower latency differences in the silicon are very highly desirable.

  • cgracey Posts: 11,508
    edited 2017-11-22 - 19:07:25
    potatohead wrote: »
    Ok. What is "slower" in terms of FPGA?

    If it's, say half current speed, that kind of sucks. But say it's one quarter slower?

    Shouldn't the end silicon be a focus? We don't technically need the fastest FPGA, but lower latency differences in the silicon are very highly desirable.

    The only way I see this working is if I have a switch in my Verilog code that eliminates certain flops by using alternate logic. We will lose logic-timing awareness in that mode, as interconnects will suddenly dominate. We'd have to test the result at maybe 2/3 speed on the FPGA.

    The egg-beater memory could trim maybe three cycles by doing this. It's not just pin activity that is slowed down by flop insertions.
  • potatohead Posts: 9,764
    edited 2017-11-22 - 19:35:36
    Might be worth it. Are you game?

    Am I way off track?

    Just thinking end outcome here, and feel we should maximize that.

  • Cluso99 Posts: 15,231
    edited 2017-11-22 - 19:38:59
    cgracey wrote: »
    potatohead wrote: »
    Ok. What is "slower" in terms of FPGA?

    If it's, say half current speed, that kind of sucks. But say it's one quarter slower?

    Shouldn't the end silicon be a focus? We don't technically need the fastest FPGA, but lower latency differences in the silicon are very highly desirable.

    The only way I see this working is if I have a switch in my Verilog code that eliminates certain flops by using alternate logic. We will lose logic-timing awareness in that mode, as interconnects will suddenly dominate. We'd have to test the result at maybe 2/3 speed on the FPGA.

    The egg-beater memory could trim maybe three cycles by doing this. It's not just pin activity that is slowed down by flop insertions.

    Seems like a much better result for the real silicon to be faster, and hence more predictable and intuitive, at the cost of lower speed FPGA testing.
    Is the FPGA dictating the P2 design restrictions, or is the aim for "the best" silicon that the P2 can achieve?

    In the P1, a lot of what I have done relies on driving outputs with an address, followed by reading the inputs for resulting data. This would add so much delay that it would have been unusable. While the P2 negates the requirement for external SRAM, I do wonder what the P2 will not be able to do, due to the extra delays in the output and input circuits.
  • potatohead Posts: 9,764
    edited 2017-11-22 - 19:40:19
    Agreed. The FPGA is fast enough, at 2/3, IMHO. That might put a few exotic things off the table during FPGA time, but do we really care?

    That puts us at like 50 to 60MHz, right?

    If so, seems prudent to do.



  • jmg Posts: 13,783
    cgracey wrote: »
    While it seems like a disaster that there are multi-cycle delays, adding up OUT-to-pin time and pin-to-IN time, I've found it to be almost completely trivial. Think about this... If you are running really fast, is it ever requisite to change a pin, look at the feedback, and change it again within maybe 4 cycles? I have found that that is maybe never needed..

    Fewer cycles is simpler and easier to design with, but the bigger concern I have is the delays exceeding one SysCLK period, so that as you ramp the frequency, you hit some threshold where suddenly another clock is added. (yikes!)
    Peter mentioned that seems to be as low as 40MHz ?

    Even DDR SPI memory does not do this nasty effect.
    Yes, they have some dummy adder for higher speeds, but that applies at ALL clock speeds, when selected.

    That effect is far harder to test for, and may bite users.
    Someone who crafts code using TCXO and WAITs to get what they think is exact-cycle-precision, can suddenly find 1 sysclk jitter on the pins.
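
    The threshold effect described above can be sketched with a small Python model. This is only an illustration: it assumes a fixed, frequency-independent path delay, and the 20 ns figure (`PATH_NS`) and the function name `clocks_spanned` are hypothetical, not measured or taken from the P2 design.

```python
import math

PATH_NS = 20.0  # hypothetical fixed path delay in ns (an assumption, not a measurement)

def clocks_spanned(sysclk_mhz, path_ns=PATH_NS):
    """Whole SysCLK periods needed to cover a fixed path delay."""
    period_ns = 1000.0 / sysclk_mhz
    return math.ceil(path_ns / period_ns)

if __name__ == "__main__":
    for mhz in (20, 40, 80):
        print(mhz, "MHz ->", clocks_spanned(mhz), "clock(s)")
```

    With these assumed numbers the count steps from 1 to 2 somewhere between 40MHz and 80MHz, which is the kind of jump that code tuned at one frequency would not anticipate.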

    cgracey wrote: »
    ...
    We could pull out some flops in the actual silicon, but it would break the design for FPGA testing at high speed. It's somewhat of a conundrum. The way we are doing it now is, at least, safe.

    The only way I see this working is if I have a switch in my Verilog code that eliminates certain flops by using alternate logic. We will lose logic-timing awareness in that mode, as interconnects will suddenly dominate. We'd have to test the result at maybe 2/3 speed on the FPGA.

    The egg-beater memory could trim maybe three cycles by doing this. It's not just pin activity that is slowed down by flop insertions.

    The FPGA is always slower than final silicon, so some reduction is tolerable, one key threshold seems to be for USB.
    What SysCLK does USB need, to still allow easy testing ?

    Maybe this is a good time to add my previous suggestion of more granular clock divider, by starting from a higher PLL value ?
    That allows smaller steps in clock speed, and can let users test a more gradual clock increase.
    It's a pity the FPGA cannot quite emulate the VCO/PLL/PFD of the final device.

    The test shuttle is not started yet, or is it now going down a line ? Can that be used as Clk Generator ?

  • I think it's OK as is... Assuming we get a 160 MHz clock and the #clocks of delay same as now...
    Then, should be a lot easier to get 20 MHz SPI clock...
  • evanh Posts: 7,520
    edited 2017-11-22 - 22:01:21
    Rayman wrote: »
    If the in and out delays were the same, seems like Peter's 20 MHz SPI code would work...
    Oops, that's not right... Or is it? Hard to do in my head...

    So, IN is actually from a few clock ago and OUT doesn't appear until a few clocks later, right?

    Maybe there's a way to make that work if we knew the delays better...

    Yep, Pin input data has a couple of registered steps on its way to the Cog before an IN can see the data, so is always lagging by that number of clocks.

    OUT is entirely different in that it doesn't have anything other than the instruction OUT register inside the Cog. As a result the signal has a long route to the pin drivers and this has a measurable amount of time - called a propagation delay.

    Each Cog OUT to Pin propagation delay will be slightly unique. Whereas the SmartPin propagations should hopefully be all the same, because the synthesis presumably will place each pin's SmartPin logic symmetrically alongside every I/O pin.

    These differences in propagation could be useful for external databus management. In particular a difference between a SmartPin and a Streamer.
  • I think Chip said that both input and outputs were now registered.
    So, all should update at exactly the same time...
  • Good point, I do remember such a statement. Then there must be something broken.
  • jmg Posts: 13,783
    Rayman wrote: »
    I think it's OK as is... Assuming we get a 160 MHz clock and the #clocks of delay same as now....

    The problem is, the #clocks is not precisely defined - it is at present the 'might be 5, or might be 6' outcome that will cause serious problems.

    It's important that be fixed (made frequency independent).

    If it can also be reduced, in Clock counts, that's nice as it allows lower MHz sysclks, instead of forcing users to highest MHz purely to reduce the pipeline effects.

  • Where are we up to with the SD pins on the BeMicroCV-A9?
    ie is the switch working to move the SD pins, and are they working?

    My SD generic code (tested on P1) is now compiling under pnut v27a, so I am ready to test it :)

  • Cluso99 wrote: »
    Where are we up to with the SD pins on the BeMicroCV-A9?
    ie is the switch working to move the SD pins, and are they working?

    My SD generic code (tested on P1) is now compiling under pnut v27a, so I am ready to test it :)

    V27 does map them and although I have used the zz version with those pin reassignments locked in, it still plays up. However, I'm using V26 and sorting out some issues with multiple cogs running so when I fix that I will test it back on V27zz. Now, can you make it signed to fit into the serial Flash? I haven't tried doing that yet.

  • cgracey Posts: 11,508
    edited 2017-11-23 - 11:46:58
    I went through the Verilog code to determine the IN/OUT paths to/from the cogs:
    OUT		registered in cog on 'go' (on last clock of instruction)
    		registered in hub after OR'ing all cogs' OUT signals *
    		goes through smart pin logic to physical pin
    		(total delay = 1 clock after last clock of instruction)
    
    		DRVC	#30	'2 (+1!)	updates after 1 clock after 2-clock instruction
    
    
    IN for D/S	registered from physical pin
    		goes through smart pin mux and logic/filtering
    		registered in hub for fan-out to all cogs *
    		registered in cog on 'go' (last clock of prior instruction)
    		(total delay = 3 clocks)
    
    		TESTB	INB,#31	'(?3+) 2	samples 3 clocks before 2-clock instruction
    
    
    IN for TESTP{N}	registered from physical pin
    		goes through smart pin mux and logic/filtering
    		registered in hub for fan-out to all cogs *
    		registered in cog on 'get' (first clock of instruction)
    		(total delay = 2 clocks, since it arrives in first clock of instruction)
    
    		TESTP	#31	'(?2+) 2	samples 2 clocks before 2-clock instruction
    
    
    * Can *maybe* be eliminated in ASIC
    

    As you can see, TESTP/TESTPN get IN data that is one clock fresher than instructions which read INA/INB via D or S.
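
    As a rough aid, the clock counts in the list above can be tabulated and converted to time at a given system clock. This is a minimal Python sketch: the counts come from the list, but the names (`OUT_DELAY`, `IN_DELAY`, `turnaround_clocks`, `lag_ns`) are illustrative, and it ignores any analog delay at the pin itself.

```python
# Clock counts taken from the IN/OUT path list; the layout is illustrative.
OUT_DELAY = 1  # OUT reaches the pin 1 clock after the instruction's last clock
IN_DELAY = {
    "D/S": 3,    # INA/INB read via D or S: sampled 3 clocks before the instruction
    "TESTP": 2,  # TESTP/TESTPN: sampled 2 clocks before the instruction
}

def turnaround_clocks(read_kind):
    """OUT lag plus IN lag in clocks, ignoring analog delay at the pin."""
    return OUT_DELAY + IN_DELAY[read_kind]

def lag_ns(clocks, sysclk_mhz):
    """Convert a clock count to nanoseconds at the given system clock."""
    return clocks * 1000.0 / sysclk_mhz

if __name__ == "__main__":
    for kind in IN_DELAY:
        n = turnaround_clocks(kind)
        print(kind, n, "clocks,", lag_ns(n, 80), "ns at 80MHz")
```

    Under these counts, a drive-then-read sequence carries 3 clocks of register lag with TESTP and 4 clocks with a D/S read of INA/INB, before any frequency-dependent path delay is added.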
  • cgracey Posts: 11,508
    edited 2017-11-23 - 12:46:08
    Here is a corrected code example:
    '
    ' Check OUT to IN time
    '
    con
    
    	x = 2, mode = $00	'at 20MHz, x=1 misses high, x=2 catches it
    '	x = 3, mode = $FF	'at 80MHz, x=2 misses high, x=3 catches it
    
    dat	org
    
    	clkset	#mode
    
    	drvl	#0	'2 (+1!)	make pin 0 low
    	waitx	#10	'12		give plenty of time
    
    	drvh	#0	'2 (+1!)	now make pin 0 high
    	waitx	#x	'2+x		wait 2+x cycles
    	testp	#0 wc	'(?2+) 2	sample pin 0 into c, !..? = 2+x+1
    
    	drvc	#32	'2!		write c to led
    	jmp	#$
    '
    ' 20MHz, x=2
    '
    ' 0	DRVH	get
    ' 1	DRVH	go
    ' 2	WAITX	get
    ' 3	WAITX	gox	OUT goes high from DRVH. IN samples high for TESTP.
    ' 4	WAITX	gox
    ' 5	WAITX	go
    ' 6	TESTP	get
    ' 7	TESTP	go
    '
    '
    ' 80MHz, x=3
    '
    ' 0	DRVH	get
    ' 1	DRVH	go
    ' 2	WAITX	get
    ' 3	WAITX	gox	OUT goes high from DRVH. Doesn't feed back to IN, yet.
    ' 4	WAITX	gox	IN samples high for TESTP.
    ' 5	WAITX	gox
    ' 6	WAITX	go
    ' 7	TESTP	get
    ' 8	TESTP	go
    '
    

    The extra clock cycle needed at 80MHz is due to the unconstrained timing paths not allowing same-clock feedback from a register, through logic, out to a pin, back in from a pin, through more logic, then back into a register, all in 12.5ns.

    There's no need to suppose that some variable number of NOPs might be needed, based on frequency. It is safest to write code for the 80MHz timing situation, as it will always work.