Prop2 FPGA files!!! - Updated 2 June 2018 - Final Version 32i

potatohead · 2017-12-01 16:27

We have interrupts for that now.

evanh · 2017-12-01 16:49

potatohead wrote: »

We have interrupts for that now.

True, and those have similar problems with the same instructions.

Seairth · 2017-12-01 18:55

potatohead wrote: »

We have interrupts for that now.

And, somewhat related (i.e. RESIx/RETIx), we have CALLD, which can be used for cooperative multitasking.

cgracey · 2017-12-02 09:53

Over the last several days, I've been consolidating all instruction decoding to the cycle before the two cycles that the instructions actually execute in.

This has two advantages:

1) It actually saves logic, since replica logic doesn't exist at two different pipeline stages.
2) It makes things go faster.

Prop2-Hot worked this way, but I had abandoned this in the new design, since it requires a flipflop per decode. I guess the new design drifted to the point where there were few enough decodes that flops became more efficient than extra logic.

This change caused Fmax for the 8-cog/64-smartpin Cyclone V A9 boards to go from 84.0 MHz to 89.6 MHz. That's a 6.7% speed increase that should translate straight into the silicon Fmax.
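A quick check of that figure, as a minimal sketch. The function name `percent_increase` is made up for illustration; only the two Fmax values come from the post.

```python
def percent_increase(old_mhz: float, new_mhz: float) -> float:
    """Relative change from old_mhz to new_mhz, in percent."""
    return (new_mhz - old_mhz) / old_mhz * 100.0


# Fmax values reported for the 8-cog/64-smartpin Cyclone V A9 build.
fmax_gain = percent_increase(84.0, 89.6)

# The same change seen as clock period (1000 / MHz gives nanoseconds).
period_old_ns = 1000.0 / 84.0
period_new_ns = 1000.0 / 89.6

print(f"Fmax gain: {fmax_gain:.1f}%")
print(f"Clock period: {period_old_ns:.2f} ns -> {period_new_ns:.2f} ns")
```

The gain is computed relative to the old frequency, which is how the 6.7% in the post works out.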

And look at the slack histogram on the FPGA. There are just a few dangling paths that are keeping the FPGA from reaching 100 MHz. The ASIC tools will be able to tuck these in a lot tighter.

I will be getting a v28 out soon and will update the documentation accordingly.

One other thing... I changed the memory mapping slightly so that the last 16KB of hub RAM always appears at both its natural location and at $FC000..$FFFFF. The write-protect mechanism works at both of the last 16KB address ranges. The debug interrupt jumps are always only accessible at the end of the 1MB map, though, and they are subject to the write-protect mechanism. This will let people use the memory more naturally if they don't care about fixed code at the top of the 1MB hub memory map.
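The mirroring described above can be modelled roughly like this. It is a sketch, not the actual logic: the 512KB RAM size is an assumption (it varies by FPGA build), the name `hub_ram_offset` is hypothetical, and returning `None` for addresses between the end of RAM and $FC000 is a placeholder, since the post does not say what happens there.

```python
HUB_MAP_SIZE = 0x100000   # 1MB hub address map
MIRROR_BASE = 0xFC000     # start of the $FC000..$FFFFF window
MIRROR_SIZE = 16 * 1024   # the last 16KB of hub RAM is mirrored here


def hub_ram_offset(addr, ram_size=512 * 1024):
    """Map a hub address to an offset in physical hub RAM.

    ram_size is an assumed value. Returns None for addresses the
    post does not describe (placeholder behaviour, not hardware fact).
    """
    if not 0 <= addr < HUB_MAP_SIZE:
        raise ValueError("address outside the 1MB hub map")
    if addr >= MIRROR_BASE:
        # Top-of-map window aliases the last 16KB of RAM.
        return ram_size - MIRROR_SIZE + (addr - MIRROR_BASE)
    if addr < ram_size:
        # Natural location.
        return addr
    return None


# With 512KB of RAM, $7C000 and $FC000 reach the same physical byte.
print(hex(hub_ram_offset(0x7C000)), hex(hub_ram_offset(0xFC000)))
```

With a full 1MB of RAM the two ranges coincide, so the model degenerates to a plain linear map.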

Cluso99 · 2017-12-02 10:29

cgracey wrote: »

Over the last several days, I've been consolidating all instruction decoding to the cycle before the two cycles that the instructions actually execute in.

This has two advantages:

1) It actually saves logic, since replica logic doesn't exist at two different pipeline stages.
2) It makes things go faster.

Prop2-Hot worked this way, but I had abandoned this in the new design, since it requires a flipflop per decode. I guess the new design drifted to the point where there were few enough decodes that flops became more efficient than extra logic.

This change caused Fmax for the 8-cog/64-smartpin Cyclone V A9 boards to go from 84.0 MHz to 89.6 MHz. That's a 6.7% speed increase that should translate straight into the silicon Fmax.

And look at the slack histogram on the FPGA. There are just a few dangling paths that are keeping the FPGA from reaching 100 MHz. The ASIC tools will be able to tuck these in a lot tighter.

I will be getting a v28 out soon and will update the documentation accordingly.

One other thing... I changed the memory mapping slightly so that the last 16KB of hub RAM always appears at both its natural location and at $FC000..$FFFFF. The write-protect mechanism works at both of the last 16KB address ranges. The debug interrupt jumps are always only accessible at the end of the 1MB map, though, and they are subject to the write-protect mechanism. This will let people use the memory more naturally if they don't care about fixed code at the top of the 1MB hub memory map.

The timing tweaks are great news.

This Hub mapping seems a much better way to me. Thanks for this.

Rayman · 2017-12-03 17:56

I must have forgotten how to use LOC...

Have some code that was working fine with this

loc       ptra,#@OV965X_REGS_QVGA

But then, I removed some debugging code and it stopped working...
Replaced with this

mov       ptra,##@OV965X_REGS_QVGA

and it works again...

The label, OV965X_REGS_QVGA, is around $400 in HUB

msrobots · 2017-12-03 18:30

Good idea to have the ROM area mirrored. This will make any loading easier, since it is a contiguous block of RAM to load.

I still do not get the importance of treating the debug vectors differently from the rest of the ROM.

Why are the debug vectors not accessible at the end of 512KB, and why are they excluded from being loaded at boot time?

just curious,

Mike

evanh · 2017-12-03 21:41

msrobots wrote: »

I still do not get the importance of treating the debug vectors differently from the rest of the ROM.

This is so the ROM version stays intact by default.

Why are the debug vectors not accessible at the end of 512KB, and why are they excluded from being loaded at boot time?

Well, the jump table data will appear at both. Just the execution is always fixed at the high end.

This is for compatibility as much as anything. It provides for the simplest hard coded absolute addressing to be used in software. It's not much good having different pieces of code assuming different fixed locations for the table just because they were developed for different editions of the Prop2.

rjo__ · 2017-12-04 06:05

I vote for putting tables in the ROM ... my favorite would be one that makes p_i log2 p_i calculations easier ... for measuring information. This can always be put into a file and read into RAM... but I want that RAM for other things!!!

No doubt there are other tables that would be useful?

Cluso99 · 2017-12-04 07:41

The ROM must be copied to HUB RAM to be usable. You are not getting extra code/table space.

evanh · 2017-12-04 07:46

Umm, there is no address space mapped to the Prop2 mask ROM. It is not execute in place (XIP). It is basically a tiny byte-wide ROM poked in a corner of the Prop2, that is copied into HubRAM at boot up time. Cog0, alone I think, has a special circuit and special microcode to access it. Execution only happens once it's in HubRAM. It's the one part of the Prop2 that is not symmetrical.

EDIT: Dang, Cluso beat me to it.

ozpropdev · 2017-12-04 11:27

Rayman wrote: »
I must have forgotten how to use LOC...

Have some code that was working fine with this
loc       ptra,#@OV965X_REGS_QVGA
But then, I removed some debugging code and it stopped working...
Replaced with this
mov       ptra,##@OV965X_REGS_QVGA
and it works again...

The label, OV965X_REGS_QVGA, is around $400 in HUB

@Rayman
For Hub addresses below $400 use the absolute version of LOC.

loc       ptra,#\OV965X_REGS_QVGA

Rayman · 2017-12-04 14:01

Thanks. I was hoping it was a bug and not a feature.

Is there a reason that loc can't work with #@ below $400?

rjo__ · 2017-12-04 22:09

@Cluso99

I vote not to put tables in the ROM

ozpropdev · 2017-12-05 01:06

Rayman wrote: »

Is there a reason that loc can't work with #@ below $400?

This might shed some light on it.
Found this in the instructions_v27.txt file.

chip wrote:

A symbol declared under ORGH will return its hub address when referenced.

A symbol declared under ORG will return its cog address when referenced,
but can return its hub address, instead, if preceded by '@':

COGINIT #0,#@newcode

For immediate-branch and LOC address operands, "#" is used before the
address. In cases where there is an option between absolute and relative
addressing, the assembler will choose absolute addressing when the branch
crosses between cog and hub domains, or relative addressing when the
branch stays in the same domain. Absolute addressing can be forced by
following "#" with "\".

Addresses below $400 would be assumed to be cog/LUT addresses by PNut, methinks.
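The rule quoted from the instructions file can be modelled as below. This is only a sketch of the stated rule, not PNut's actual implementation: the function names are hypothetical, and classifying the domain purely by address (below $400 means cog/LUT) follows the guess above rather than anything confirmed. The real assembler also knows whether a symbol was declared under ORG or ORGH.

```python
COG_LUT_LIMIT = 0x400  # addresses below this are treated as cog/LUT here


def domain(addr):
    """Classify an address as 'cog' (cog/LUT) or 'hub' by value alone.

    Simplifying assumption: the domain is inferred from the address.
    """
    return "cog" if addr < COG_LUT_LIMIT else "hub"


def addressing_mode(origin, target, force_absolute=False):
    """Pick 'absolute' or 'relative' per the quoted rule.

    force_absolute models writing "#\\" instead of "#".
    """
    if force_absolute:
        return "absolute"
    if domain(origin) != domain(target):
        return "absolute"   # branch crosses between cog and hub
    return "relative"       # branch stays in the same domain


print(addressing_mode(0x1000, 0x2000))        # hub to hub
print(addressing_mode(0x010, 0x2000))         # cog to hub
print(addressing_mode(0x1000, 0x2000, True))  # forced with "\"
```

Under this model, a reference that stays in one domain is encoded relative unless the "\" form is used.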

Rayman · 2017-12-05 01:36

I think I'd prefer the opposite way, where @ always gives hub address and "\" can give cog address...

I saw this part in that txt file:

but can return its hub address, instead, if preceded by '@'

But missed that second part...

Anyway, shouldn't the boundary be $800? Or, did I do my math wrong...

Cluso99 · 2017-12-05 06:36

Addresses below $400 used by JMP/CALL and similar instructions are taken as LUT and cog addresses. So there are restrictions for hub addresses below $400: they cannot be used for hubexec code, only for rd/wr instructions, i.e. data, or cog/LUT code that can be loaded into cog/LUT for execution.

ozpropdev · 2017-12-05 09:03

Cluso99 wrote: »

They cannot be used for hubexec code, only for rd/wr instructions. ie data or cog/lut code that can be loaded into cog/lut for execution.

I think Rayman's point is that LOC @ won't let you point to data for rd/wr, or launch cog/LUT code, below $400.

jmg · 2017-12-07 05:31

jmg wrote: »

IIRC the P2 PLL/VCO is now like most, with a SysCLK divider, and a VCO_FB_Divider, and Xtal_FB_Divider to the common PFD frequency.

Command then looks something like
">Prop_PLL Sys_Div VCO_Div Xtal_Div" + some pause for PLL lock, and host Baud-redefine, and then '>' at the new higher Baud rate.

Addit: Using this, a simple means to boost boot from a fast-UART part like the EFM8UB3 becomes available.
With the available ~32 kBytes of P2 code storage in the UB3, that's roughly 55~41 ms loading times (at 10 bits per byte) at the 6~8 MBaud that part should be capable of
(plus other hard-wired delays inside the P2; hopefully those are not too great...).
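A back-of-envelope version of that load-time estimate, as a sketch. It assumes raw bytes with 8N1 framing (10 bits on the wire per byte) and ignores any text encoding used by the loader and any delays inside the P2; `load_time_ms` is a made-up helper name.

```python
def load_time_ms(num_bytes, baud, bits_per_byte=10):
    """Transfer time in milliseconds for num_bytes at the given baud rate.

    bits_per_byte=10 assumes 8N1 framing (start + 8 data + stop).
    """
    return num_bytes * bits_per_byte / baud * 1000.0


image_bytes = 32 * 1024  # ~32 kBytes of P2 code held in the UB3

slow_ms = load_time_ms(image_bytes, 6_000_000)
fast_ms = load_time_ms(image_bytes, 8_000_000)

print(f"6 MBaud: {slow_ms:.1f} ms")
print(f"8 MBaud: {fast_ms:.1f} ms")
```

Any encoding overhead scales these times proportionally, so the figures are best read as a lower bound.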

I'll bump this with news that the new EFM8UB3 USB-MCU is now showing stock and prices.
88.5c/1000 gets you an 8-bit MCU with Full Speed USB, 40kB Flash, 3kB RAM, 5 Volt, 12-bit ADC, UART, SMBus, SPI, 13 GPIOs.

The 40k Flash is quite an increase from the EFM8UB1, and would allow multiple bridge devices to be coded.
e.g. a Mass Storage device could program EEPROM on a FLiP-like P1 module, or program SPI Flash on a P2 design.

Such a MCU can also manage the Prop1/2 reset, reducing the BOM, and offer more than one bridge link.
It may even be able to power a P1, and the 48MHz SysClk could output 6MHz to P1, to further reduce the BOM.

Peter Jakacki · 2017-12-08 13:06

Hi Chip, looking forward to trying out V28 when you have it available.

I had a funny bug a while ago so let me relate how it affected my system and how I worked around it.

Normally I load up most cogs with Tachyon but get them to run an IDLE after reset. When cog 0 finally does a coginit, the reset routine checks the cogid and, if it is zero, runs the terminal startup instead. The trouble was that my serial receive seemed to get corrupted when I was downloading a source file into Tachyon, yet a dump of the large receive buffer showed nothing wrong. Through the time-honored method of trial and error and the process of elimination, I knew that the problem was in my startups, and with the IDLE coginits disabled everything seemed to work. However, it didn't matter which one I enabled again, the bug came back. Disable them, no bug.

(It was as if another cog was identifying as cog 0 and running as the console, stealing a character from the receive stream now and then when the timing was right)

Was it that the coginits needed a delay between them? That seemed to work but didn't seem right, so I applied the time honored methods again and again. Finally I decided to insert a NOP after the CLKSET #$FF and prior to the coginits. That fixed the problem and since then I have left it at that.

However that got me to thinking that maybe this was one of the reasons why the DE2-115 had weird stepped levels on the output pins. Anyway, food for thought.

                org
                clkset  #$FF                    ' switch to 80MHz (if pll, else 50MHz)
reboot
                nop                             ' seems to need delay after clkset (otherwise next coginit ids incorrectly)
                coginit #7,#@RESET
                coginit #6,#@RESET
                coginit #5,#@RESET
                coginit #4,#@RESET
                coginit #3,#@RESET
                coginit #2,#@RESET
                coginit #1,#@rxcog
                coginit #0,#@RESET              ' RESET does a COGID so that #0 can run the console instead of an IDLE loop

evanh · 2017-12-08 13:22

Well done Peter. That looks a horrible bug to have almost got past.

All of my test code has a WAITX (for the purpose of giving the PC debug terminal time to take over the comport after download) immediately following the CLKSET and is being executed correctly, afaik.

evanh · 2017-12-08 14:58

Hmm, I don't know how to use COGINIT properly. The only way I can make it work at all is to not use any cog-declared symbols in the "RESET" code.

So far, I have not been able to duplicate Peter's symptom above.

Past my bedtime ...

cgracey · 2017-12-09 00:34

Peter,

I can't make sense of what the trouble could be.

Would it be possible for you to distill the erring code to the bare essentials, so that we could determine what the trouble is? Just changing speed should have no effect on the logic. It does the same thing at any speed.

cgracey · 2017-12-09 01:47

This seems to work okay:

dat	org

	clkset	#$FF
	coginit	#7,#@go
	coginit	#6,#@go
	coginit	#5,#@go
	coginit	#4,#@go
	coginit	#3,#@go
	coginit	#2,#@go
	coginit	#1,#@go
	coginit	#0,#@go

	org

go	cogid	x
	add	x,#32
lp	drvnot	x
	waitx	##10_000_000
	jmp	#lp

x	res	1

All 8 LEDs blink, anyway.

ozpropdev · 2017-12-09 03:52

Peter
Does adding a 'WC' to COGINIT make a difference?

Cluso99 · 2017-12-09 05:07

Peter,
Are you still using v26, and might that be different to v27a/z/zz ?

I have my SD card booter ready for v28. Just need to know where the SD pins will be.

cgracey · 2017-12-09 07:01

Cluso99 wrote: »

Peter,
Are you still using v26, and might that be different to v27a/z/zz ?

I have my SD card booter ready for v28. Just need to know where the SD pins will be.

Peter Jakacki · 2017-12-09 12:11

V27z has been crashing after some time so I've been using V26. However in light of the fact that I have uncovered this startup bug I could try it out on V27z again. Now this bug is subtle, and certainly in the case of TAQOZ on V26 it was not always readily apparent but once I went to download a file I would get all kinds of download errors which had nothing to do with the received data. The RESET routine that is used with coginit #7,#@RESET etc immediately calls hub exec code which does a cogid as part of the init to check for cog 0 and have it run the console else run an idle loop. So it's not just a simple coginit.

Once V28 is available I will try out TAQOZ on there and look for subtle problems, for instance by removing the nop. Let's see how it goes, and I will try V28 on the CVA9 and DE2.

cgracey · 2017-12-09 20:58

This matter of v27 being flaky is really concerning me. I think this is what's been eating me up. We need to discover whatever is wrong with it, ASAP.

cgracey · 2017-12-09 20:59

I'm a bit into compiling v28 now.
