Over the last several days, I've been consolidating all instruction decoding to the cycle before the two cycles that the instructions actually execute in.
This has two advantages:
1) It actually saves logic, since replica logic doesn't exist at two different pipeline stages.
2) It makes things go faster.
Prop2-Hot worked this way, but I had abandoned this in the new design, since it requires a flipflop per decode. I guess the new design drifted to the point where there were few enough decodes that flops became more efficient than extra logic.
This change caused Fmax for the 8-cog/64-smartpin Cyclone V A9 boards to go from 84.0 MHz to 89.6 MHz. That's a 6.7% speed increase that should translate straight into the silicon Fmax.
And look at the slack histogram on the FPGA. There are just a few dangling paths that are keeping the FPGA from reaching 100 MHz. The ASIC tools will be able to tuck these in a lot tighter.
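As a quick check of the numbers above, here is a minimal Python sketch of the arithmetic. The function names are made up for illustration; only the 84.0, 89.6 and 100 MHz figures come from the post.

```python
def speedup_percent(old_mhz, new_mhz):
    """Relative Fmax gain, in percent."""
    return (new_mhz / old_mhz - 1.0) * 100.0

def period_ns(mhz):
    """Clock period in nanoseconds for a frequency in MHz."""
    return 1000.0 / mhz

gain = speedup_percent(84.0, 89.6)
# Time the slowest paths would still have to shed to close timing at 100 MHz.
shortfall_ns = period_ns(89.6) - period_ns(100.0)

print(f"gain: {gain:.1f}%")
print(f"period: {period_ns(84.0):.2f} ns -> {period_ns(89.6):.2f} ns")
print(f"still to recover for 100 MHz: {shortfall_ns:.2f} ns")
```

The last figure is only the gap on the worst path; the slack histogram is what shows how many paths sit in that gap.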
I will be getting a v28 out soon and will update the documentation accordingly.
One other thing... I changed the memory mapping slightly so that the last 16KB of hub RAM always appears at both its natural location and at $FC000..$FFFFF. The write-protect mechanism works at both of the last 16KB address ranges. The debug interrupt jumps are always only accessible at the end of the 1MB map, though, and they are subject to the write-protect mechanism. This will let people use the memory more naturally if they don't care about fixed code at the top of the 1MB hub memory map.
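To make the aliasing concrete, here is a rough Python model of the mapping described above. It is a sketch, not the actual hardware decode: the 512KB RAM size is just an example, and treating the gap between the end of RAM and $FC000 as unmapped is an assumption.

```python
HUB_MAP_SIZE = 0x100000                 # 1MB hub address map, $00000..$FFFFF
ALIAS_SIZE = 16 * 1024                  # the last 16KB of hub RAM is mirrored
ALIAS_BASE = HUB_MAP_SIZE - ALIAS_SIZE  # $FC000

def hub_to_ram_offset(addr, ram_size=512 * 1024):
    """Map a hub address to a physical RAM offset, or None if unmapped.

    ram_size is an assumed example value; behaviour in the gap between
    the end of RAM and $FC000 is also an assumption of this sketch.
    """
    if not 0 <= addr < HUB_MAP_SIZE:
        raise ValueError("address outside the 1MB hub map")
    if addr >= ALIAS_BASE:
        # Top-of-map alias of the last 16KB of RAM.
        return ram_size - ALIAS_SIZE + (addr - ALIAS_BASE)
    if addr < ram_size:
        # Natural location.
        return addr
    return None

# With 512KB of RAM, $7C000 and $FC000 hit the same physical byte.
print(hex(hub_to_ram_offset(0x7C000)), hex(hub_to_ram_offset(0xFC000)))
```

Code that only uses the natural addresses never needs to know the alias exists, which is the "use the memory more naturally" point.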
The timing tweaks are great news.
This Hub mapping seems a much better way to me. Thanks for this.
I still do not get the importance of treating the debug vectors differently from the rest of the ROM.
This is so the ROM version stays intact by default.
Why are the debug vectors not accessible at the end of 512KB, and why are they excluded from being loaded at boot time?
Well, the jump table data will appear at both. Just the execution is always fixed at the high end.
This is for compatibility as much as anything. It provides for the simplest hard coded absolute addressing to be used in software. It's not much good having different pieces of code assuming different fixed locations for the table just because they were developed for different editions of the Prop2.
I vote for putting tables in the ROM ... my favorite would be one that makes p_i*log2(p_i) calculations easier ... for measuring information. This can always be put into a file and read into RAM... but I want that RAM for other things!!!
No doubt there are other tables that would be useful?
Umm, there is no address space mapped to the Prop2 mask ROM. It is not execute in place (XIP). It is basically a tiny byte-wide ROM poked in a corner of the Prop2, that is copied into HubRAM at boot up time. Cog0, alone I think, has a special circuit and special microcode to access it. Execution only happens once it's in HubRAM. It's the one part of the Prop2 that is not symmetrical.
A symbol declared under ORGH will return its hub address when referenced.
A symbol declared under ORG will return its cog address when referenced,
but can return its hub address, instead, if preceded by '@':
COGINIT #0,#@newcode
For immediate-branch and LOC address operands, "#" is used before the
address. In cases where there is an option between absolute and relative
addressing, the assembler will choose absolute addressing when the branch
crosses between cog and hub domains, or relative addressing when the
branch stays in the same domain. Absolute addressing can be forced by
following "#" with "\".
Addresses below $400 would be assumed to be cog/lut addresses by Pnut me thinks.
JMP/CALL and similar instructions will treat addresses below $400 as cog and LUT addresses. So there are restrictions on hub addresses below $400: they cannot be used for hubexec code, only for rd/wr instructions, i.e. for data, or for cog/LUT code that is loaded into cog/LUT for execution.
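The rules quoted above can be summed up in a small Python model. This is a sketch of the described behaviour, not PNut's actual implementation; the function names are invented for illustration.

```python
COG_LUT_LIMIT = 0x400  # branch targets below this are taken as cog/LUT addresses

def branch_target_domain(addr):
    """Where execution lands for a branch to addr, per the rule above."""
    return "cog/lut" if addr < COG_LUT_LIMIT else "hub"

def choose_addressing(src_domain, dst_domain, force_absolute=False):
    """Assembler's choice for an immediate branch or LOC operand.

    src_domain/dst_domain are 'cog' (symbol under ORG) or 'hub' (under ORGH).
    force_absolute models following "#" with a backslash.
    """
    if force_absolute or src_domain != dst_domain:
        return "absolute"
    return "relative"

print(choose_addressing("cog", "hub"))        # crossing domains
print(choose_addressing("hub", "hub"))        # same domain
print(choose_addressing("hub", "hub", True))  # forced with #\
print(branch_target_domain(0x3FF), branch_target_domain(0x400))
```

The last line shows why hubexec code cannot live below $400: a branch there is taken as a cog/LUT address, so only rd/wr access reaches that part of hub RAM.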
IIRC the P2 PLL/VCO is now like most, with a SysCLK divider, and a VCO_FB_Divider, and Xtal_FB_Divider to the common PFD frequency.
Command then looks something like
">Prop_PLL Sys_Div VCO_Div Xtal_Div" + some pause for PLL lock, and host Baud-redefine, and then '>' at the new higher Baud rate.
Addit: Using this, a simple means to boost boot from a fast-UART part like EFM8UB3 becomes available
With the ~32 kBytes of P2 code storage available in the UB3, that gives loading times of roughly 54~40 ms at the 6~8 MBaud that part should be capable of.
(plus other hard-wired delays inside P2, hopefully, those are not too great...)
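For anyone who wants to plug in their own image size or baud rate, here is a minimal Python sketch of the transfer-time estimate. It assumes plain 8N1 framing (10 bits on the wire per byte) and takes 32 kBytes as 32*1024; it ignores any protocol overhead and the P2's own internal delays.

```python
def uart_load_time_ms(num_bytes, baud, bits_per_byte=10):
    """Raw UART transfer time in ms; bits_per_byte=10 assumes 8N1 framing."""
    return num_bytes * bits_per_byte / baud * 1000.0

image_bytes = 32 * 1024  # assumed size of the stored P2 image
for baud in (6_000_000, 8_000_000):
    ms = uart_load_time_ms(image_bytes, baud)
    print(f"{baud / 1e6:.0f} MBaud: {ms:.1f} ms")
```

If the loader encodes the data as text (hex or similar) rather than raw bytes, the time scales up accordingly.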
I'll bump this with news the new EFM8UB3 USB-MCU is now showing stock and prices
88.5c/1000 gives 8-bit MCU, Full Speed USB, 40kB Flash, 3kB RAM, 5 Volt, 12-bit ADC, UART, SMBus, SPI, 13 GPIOs
The 40k Flash is quite an increase from the EFM8UB1, and would allow multiple bridge devices to be coded.
eg a Mass Storage device could program EEPROM on a FLiP like P1 module, or program SPI Flash on a P2 design.
Such a MCU can also manage the Prop1/2 reset, reducing the BOM, and offer more than one bridge link.
It may even be able to power a P1, and the 48MHz SysClk could output 6MHz to P1, to further reduce the BOM.
Hi Chip, looking forward to trying out V28 when you have it available.
I had a funny bug a while ago so let me relate how it affected my system and how I worked around it.
Normally I load up most cogs with Tachyon but get them to run an IDLE after reset. When cog 0 finally does a coginit, the reset routine checks the cogid and, if it is zero, runs the terminal startup instead. The trouble was that my serial receive seemed to get corrupted when I was downloading a source file into Tachyon, yet a dump of the large receive buffer showed nothing wrong. Through the time honored method of trial and error and the process of elimination, I knew that the problem was in my startups, and by disabling the IDLE coginits everything seemed to work. However, it didn't matter which one I enabled again, there was a bug there. Disable them, no bug.
(It was as if another cog was identifying as cog 0 and running as the console, stealing a character from the receive stream now and then when the timing was right)
Was it that the coginits needed a delay between them? That seemed to work but didn't seem right, so I applied the time honored methods again and again. Finally I decided to insert a NOP after the CLKSET #$FF and prior to the coginits. That fixed the problem and since then I have left it at that.
However that got me to thinking that maybe this was one of the reasons why the DE2-115 had weird stepped levels on the output pins. Anyway, food for thought.
org
clkset #$FF 'switch to 80MHz (if pll, else 50MHz)
reboot
nop ' seems to need delay after clkset (otherwise next coginit ids incorrectly)
coginit #7,#@RESET
coginit #6,#@RESET
coginit #5,#@RESET
coginit #4,#@RESET
coginit #3,#@RESET
coginit #2,#@RESET
coginit #1,#@rxcog
coginit #0,#@RESET ' RESET does a COGID so that #0 can run the console instead of an IDLE loop
Well done Peter. That looks a horrible bug to have almost got past.
All of my test code has a WAITX (for the purpose of giving the PC debug terminal time to take over the comport after download) immediately following the CLKSET and is being executed correctly, afaik.
Would it be possible for you to distill the erring code to the bare essentials, so that we could determine what the trouble is? Just changing speed should have no effect on the logic. It does the same thing at any speed.
V27z has been crashing after some time so I've been using V26. However in light of the fact that I have uncovered this startup bug I could try it out on V27z again. Now this bug is subtle, and certainly in the case of TAQOZ on V26 it was not always readily apparent but once I went to download a file I would get all kinds of download errors which had nothing to do with the received data. The RESET routine that is used with coginit #7,#@RESET etc immediately calls hub exec code which does a cogid as part of the init to check for cog 0 and have it run the console else run an idle loop. So it's not just a simple coginit.
Once V28 is available I will try out TAQOZ on it and look for subtle problems, for instance by removing the nop. Let's see how it goes; I will try V28 on both the CVA9 and the DE2.
This matter of v27 being flakey is really concerning me. I think this is what's been eating me up. We need to discover whatever is wrong with it. ASAP.
Comments
And, somewhat related (i.e. RESIx/RETIx), we have CALLD, which can be used for cooperative multitasking.
Have some code that was working fine with this
But then, I removed some debugging code and it stopped working...
Replaced with this and it works again...
The label, OV965X_REGS_QVGA, is around $400 in HUB
just curious,
Mike
EDIT: Dang, Cluso beat me to it.
For Hub addresses below $400 use the absolute version of LOC.
Is there a reason that loc can't work with #@ below $400?
I vote not to put tables in the ROM
This might shed some light on it.
Found this in the instructions_v27.txt file.
I saw this part in that txt file:
But missed that second part...
Anyway, shouldn't the boundary be $800? Or, did I do my math wrong...
So far, I have not been able to duplicate Peter's symptom above.
Past my bedtime ...
I can't make sense of what the trouble could be.
All 8 LEDs blink, anyway.
Does adding a 'WC' to COGINIT make a difference?
Are you still using v26, and might that be different to v27a/z/zz ?
I have my SD card booter ready for v28. Just need to know where the SD pins will be.