TAQOZ - Tachyon Forth for the P2 BOOT ROM

jmg · 2018-04-19 03:52

Peter Jakacki wrote: »

.... Really need an instance of at least 20k of ROM though, it's a pity.

The next increment is likely 32kB, but it may be too late to increase the ROM size? Silicon-wise it is not much, but it's likely been P&R'd around already.

cgracey · 2018-04-19 05:00

Peter, $30ddaaaa is just for over-writing the boot ROM for ROM development purposes. It's just in the FPGA.

Cluso99 · 2018-04-25 15:54

Peter, just been rethinking TAQOZ.

I think you said it needs to run in lower 64KB hub space because of 16 bit addressing. Is this correct? If so, then it will most likely need to be compiled for the addresses where it will run. So if it will be copied to lower hub space from rom, then it will need to be put into rom as a set of longs taken from the pnut output.

cgracey · 2018-04-26 05:29

Cluso99 wrote: »

Peter, just been rethinking TAQOZ.

I think you said it needs to run in lower 64KB hub space because of 16 bit addressing. Is this correct? If so, then it will most likely need to be compiled for the addresses where it will run. So if it will be copied to lower hub space from rom, then it will need to be put into rom as a set of longs taken from the pnut output.

If he were to use all relative addressing, which is default between hub code addresses, he might get away with not having to relocate it as data in the ROM source file. Some address masking at the source level may be necessary.

Cluso99 · 2018-04-26 07:12

cgracey wrote: »

Cluso99 wrote: »

Peter, just been rethinking TAQOZ.

I think you said it needs to run in lower 64KB hub space because of 16 bit addressing. Is this correct? If so, then it will most likely need to be compiled for the addresses where it will run. So if it will be copied to lower hub space from rom, then it will need to be put into rom as a set of longs taken from the pnut output.

If he were to use all relative addressing, which is default between hub code addresses, he might get away with not having to relocate it as data in the ROM source file. Some address masking at the source level may be necessary.

From what I have seen, there are lots of code snippets which have a list of 16 bit addresses to run a few instructions. So it's not JMP instructions causing the limitation, but the compiler re-targeting the code (word definitions) to an address other than where it's compiling to.

Cluso99 · 2018-04-26 07:15

BTW Chip,
Thanks for a great discussion!
Hope you get that other problem sorted.

kwinn · 2018-04-26 12:59

Cluso99 wrote: »

cgracey wrote: »

Cluso99 wrote: »

Peter, just been rethinking TAQOZ.

I think you said it needs to run in lower 64KB hub space because of 16 bit addressing. Is this correct? If so, then it will most likely need to be compiled for the addresses where it will run. So if it will be copied to lower hub space from rom, then it will need to be put into rom as a set of longs taken from the pnut output.

If he were to use all relative addressing, which is default between hub code addresses, he might get away with not having to relocate it as data in the ROM source file. Some address masking at the source level may be necessary.

From what I have seen, there are lots of code snippets which have a list of 16 bit addresses to run a few instructions. So it's not JMP instructions causing the limitation, but the compiler re-targeting the code (word definitions) to an address other than where it's compiling to.

Is this the TAQOZ compiler? If so that may be a problem for using Tachyon on the P2. Any way to deal with this?

Cluso99 · 2018-04-26 15:03

kwinn wrote: »

Cluso99 wrote: »

cgracey wrote: »

Cluso99 wrote: »

Peter, just been rethinking TAQOZ.

I think you said it needs to run in lower 64KB hub space because of 16 bit addressing. Is this correct? If so, then it will most likely need to be compiled for the addresses where it will run. So if it will be copied to lower hub space from rom, then it will need to be put into rom as a set of longs taken from the pnut output.

If he were to use all relative addressing, which is default between hub code addresses, he might get away with not having to relocate it as data in the ROM source file. Some address masking at the source level may be necessary.

From what I have seen, there are lots of code snippets which have a list of 16 bit addresses to run a few instructions. So it's not JMP instructions causing the limitation, but the compiler re-targeting the code (word definitions) to an address other than where it's compiling to.

Is this the TAQOZ compiler? If so that may be a problem for using Tachyon on the P2. Any way to deal with this?

It's all fine. Just needs to be in lower 64KB of Hub.

jmg · 2018-04-26 19:38

Cluso99 wrote: »

From what I have seen, there are lots of code snippets which have a list of 16 bit addresses to run a few instructions. So it's not JMP instructions causing the limitation, but the compiler re-targeting the code (word definitions) to an address other than where it's compiling to.

It's all fine. Just needs to be in lower 64KB of Hub.

I'm not following the details here ? - is this due to a P2 silicon limitation, whereby code does not relocate as well, or a tools limitation ?

potatohead · 2018-04-26 19:43

He wrote it in a 16 bit address space.

A Forth is its own tool, in the toolchain sense. This is very cool when it comes to things like bootstrapping onto a CPU. One only needs to author a few basic things, then load a dictionary and go. It won't be optimized, but will work very reasonably.

Further, it was highly optimized for that space, due to how the P1 is. And that's where the big limitation is, from what I understand.

jmg · 2018-04-26 19:58

potatohead wrote: »

He wrote it in a 16 bit address space.
...
Further, it was highly optimized for that space, due to how the P1 is. And that's where the big limitation is, from what I understand.

Well, yes, but the P2 has relative opcodes.
Are there not enough / wrong mix of relative opcodes in P2 ? Or do the tools not yet fully reach all of them ?

potatohead · 2018-04-26 20:15

Again, Forth is its own tool. The very core, or basis was PASM, and isn't written in that fashion. It's also making address space assumptions that are 16bit.

The rest depends on that.

Peter would need to rewrite the kernel, then apply a dictionary, which may also likely require changes.

kwinn · 2018-04-26 20:49

jmg wrote: »

Cluso99 wrote: »

From what I have seen, there are lots of code snippets which have a list of 16 bit addresses to run a few instructions. So it's not JMP instructions causing the limitation, but the compiler re-targeting the code (word definitions) to an address other than where it's compiling to.

It's all fine. Just needs to be in lower 64KB of Hub.

I'm not following the details here ? - is this due to a P2 silicon limitation, whereby code does not relocate as well, or a tools limitation ?

I do not think it is a limitation of the P2 silicon. If I understand Peter's explanation of Tachyon the "functions/instructions" of a program are the starting address of the P2 code that performs the function. I am picturing them as the 16 bit equivalent of a byte-code that points to the actual P2 code. That absolute address is what limits it to running in the lower 64K of Hub.

Please feel free to correct me if I am wrong here.

jmg · 2018-04-26 21:00

kwinn wrote: »

I do not think it is a limitation of the P2 silicon. If I understand Peter's explanation of Tachyon the "functions/instructions" of a program are the starting address of the P2 code that performs the function. I am picturing them as the 16 bit equivalent of a byte-code that points to the actual P2 code. That absolute address is what limits it to running in the lower 64K of Hub.

Please feel free to correct me if I am wrong here.

That's how I understand it too, but the P2 supports PC += D[19:0] relative opcodes, which should relocate. (hence my question of are there not enough of them)
With a 16k ROM limit, those index-codes cannot exceed 14 bits, so should relocate on 16k blocks (ie into the ROM upper placement)

Cluso99 · 2018-04-26 21:49

Anyone taking the time to look at Peter's source will see that Tachyon is full of little code snippets made up of a bunch of addresses of other code snippets, which are then in turn a bunch of snippets. So the whole dictionary is a bunch of addresses. They are 16 bit addresses for a reason.
Changing to 32 bit addresses would double the code, and slow it down too.
These addresses cannot be relative because there's no base to be relative from.
It's possible to make the whole enchilada relative to a base so the blob could be relocatable, but not for now. It's fine being in the bottom 64KB hub IMHO.
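
To make the size and relocation argument concrete, here is a small Python sketch (not Tachyon source). The three addresses are invented for illustration; the only thing taken from the post is that a definition is a list of 16-bit absolute addresses.

```python
import struct

# Hypothetical word definition: a list of absolute addresses of other words.
# The values are made up; real Tachyon addresses will differ.
definition = [0xC120, 0xC2A4, 0xC00E]

# The same definition stored as 16-bit cells and as 32-bit cells.
as_words = struct.pack("<3H", *definition)
as_longs = struct.pack("<3L", *definition)

def relocate(cells, old_base, new_base):
    """Rebase every absolute cell; required if the blob moves in memory."""
    return [cell - old_base + new_base for cell in cells]

def to_relative(cells, base):
    """Store offsets from a chosen base instead, so the blob can move freely."""
    return [cell - base for cell in cells]

print(len(as_words), len(as_longs))           # 6 bytes versus 12 bytes
print([hex(c) for c in relocate(definition, 0xC000, 0x4000)])
print([hex(c) for c in to_relative(definition, 0xC000)])
```

With absolute cells every entry has to be patched when the image moves; with base-relative cells the interpreter would have to add the base on every fetch, which is the extra cost being avoided for now.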

potatohead · 2018-04-26 22:14

Peters source will see that Tachyon is full of little code snippets made up of a bunch of addresses of other code snippets which are then in turn a bunch of snippets. So the whole dictionary is a bunch of addresses. They are 16 bit addresses for a reason.

Yes, you said it much better. It's not the P2 and opcodes at issue.

Peter Jakacki · 2018-04-26 22:17

While P2 instructions do support relative addressing, they aren't really of much use at the higher level of the Tachyon VM itself. For instance, the dictionary is separate from the code and is comprised of an array of names that also include a "wordcode", which most of the time is simply a 16-bit address pointing to the code area (some upper addresses are decoded for compact literals and branches etc).

Now, one of the things I can and need to do for the boot ROM version is relocate the dictionary which is an easy matter of moving the whole array and pointing to the start address (the dictionary grows down). There are no link fields since this is a contiguous array of headers in the format [attribute+count,<name>,address(16)]. So I can move that dictionary anywhere in the 512k memory and simply repoint to it.
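
As a rough Python illustration of why a link-free header array is easy to move, the sketch below builds two headers in the [attribute+count, name, address(16)] shape, copies the array elsewhere, and searches it again from the new pointer. The bit split of the first byte (low 5 bits as the count), the word names, and the addresses are all assumptions made for the example, not details from the TAQOZ source.

```python
import struct

COUNT_MASK = 0x1F  # assumption: low 5 bits hold the name length, upper bits the attributes

def make_header(name, address, attributes=0):
    """Build one header: [attribute+count][name bytes][16-bit address]."""
    raw = name.encode("ascii")
    return bytes([attributes | len(raw)]) + raw + struct.pack("<H", address)

def find(memory, start, end, name):
    """Walk the contiguous headers between start and end; return the wordcode or None."""
    pos = start
    while pos < end:
        count = memory[pos] & COUNT_MASK
        entry = memory[pos + 1 : pos + 1 + count].decode("ascii")
        (address,) = struct.unpack_from("<H", memory, pos + 1 + count)
        if entry == name:
            return address
        pos += 1 + count + 2
    return None

def relocate(memory, start, end, new_start):
    """Copy the whole array and return the new (start, end) pointers."""
    size = end - start
    memory[new_start : new_start + size] = memory[start:end]
    return new_start, new_start + size

memory = bytearray(0x80000)  # 512k of hub memory
blob = make_header("DUP", 0xC120) + make_header("SWAP", 0xC2A4, attributes=0x80)
start, end = 0xF000, 0xF000 + len(blob)
memory[start:end] = blob

new_start, new_end = relocate(memory, start, end, 0x40000)
print(hex(find(memory, new_start, new_end, "SWAP")))
```

Because no header stores the address of another header, the copy needs no fix-ups; only the pointer to the start of the array changes.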

The code itself consists of an array of 16-bit wordcodes/addresses plus in-line data, all of which are "interpreted" in a similar manner to how LMM is "interpreted", except that wordcodes can be cog addresses, hubexec, hub addresses, or decoded further as conditional IF/UNTIL relative branches or as 9-bit literals etc, as this code snippet shows:

_IF		=       $FC00		' IF relative forward branch 0 to 127 words
_UNTIL		=       $FC80		' UNTIL relative reverse branch 0 to 127 words
rg		=	$FD00		' task/cog register 8-bit offset
w		=	$FE00		' wordcode offset for 9-bit literals
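
One way to read those four constants is as range boundaries in the 16-bit wordcode space. The Python sketch below classifies a wordcode on that reading; the ranges are inferred only from the constants and their comments, everything below $FC00 is lumped together as "address", and the real TAQOZ decode may well differ.

```python
# Boundaries taken from the constants above; the range interpretation is an assumption.
_IF, _UNTIL, RG, W = 0xFC00, 0xFC80, 0xFD00, 0xFE00

def classify(wordcode):
    """Return (kind, operand) for a 16-bit wordcode."""
    if not 0 <= wordcode <= 0xFFFF:
        raise ValueError("wordcode must fit in 16 bits")
    if wordcode >= W:
        return ("literal", wordcode - W)       # 9-bit literal, 0..511
    if wordcode >= RG:
        return ("register", wordcode - RG)     # 8-bit task/cog register offset
    if wordcode >= _UNTIL:
        return ("until", wordcode - _UNTIL)    # reverse branch, 0..127 words back
    if wordcode >= _IF:
        return ("if", wordcode - _IF)          # forward branch, 0..127 words ahead
    return ("address", wordcode)               # cog, hubexec or hub wordcode address

for code in (0xC120, 0xFC05, 0xFC83, 0xFD10, 0xFE2A):
    print(hex(code), classify(code))
```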

But I had an idea that I'm going to try out. If code is compiled for the $FC000 area and address pointers are always limited to 16-bit words I should be able to copy my code to the first 64kB on boot and simply run it from the $0Cxxx area. Of course the dictionary can be moved to anywhere in memory so that it can grow, but perhaps to just before $0Cxxx since it grows down (actually even over the top of Chip and Cluso's code image). The variable HERE which points to where new code will be compiled can be set to point to low memory and the area vacated by the dictionary at around $0Fxxx can be used for buffers.
$00xxx new user wordcode
$0Bxxx dictionary (grows down)
$0Cxxx TAQOZ code
$0Fxxx buffers and data
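
The idea rests on a bit of address arithmetic, sketched here in Python as a sanity check rather than as anything from the TAQOZ source: truncating a $FCxxx address to 16 bits gives the matching $0Cxxx address, so a 16kB image compiled for $FC000 lines up with a copy placed at $0C000.

```python
ROM_BASE = 0xFC000       # where the 16kB ROM image is loaded
ROM_SIZE = 16 * 1024
COPY_BASE = 0x0C000      # where the image is copied in the first 64kB

def run_address(compiled_address):
    """Address a 16-bit pointer actually refers to: the low 16 bits only."""
    return compiled_address & 0xFFFF

def copy_address(compiled_address):
    """Where the same byte lands after the ROM image is copied down."""
    return compiled_address - ROM_BASE + COPY_BASE

# Every address in the ROM image agrees under both views.
consistent = all(
    run_address(a) == copy_address(a)
    for a in range(ROM_BASE, ROM_BASE + ROM_SIZE, 4)
)
print(hex(run_address(ROM_BASE)), consistent)
```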

One little thing that needs extra processing is decoding the hubexec addresses. That was an easy matter before: if they weren't in the cog address range they could just be decoded as being below hub wordcode, since I clump all the high level hubexec code in one area.

In the unlikely event that someone wants to use as much code space as possible it is a very easy matter to relocate the dictionary elsewhere but 60kB of wordcode is like 10 times more functionality than 30kB.

I need to go out now but I will try this out later as I think it can work with the minimum of fuss.

Cluso99 · 2018-04-27 08:35

A little over my head without concentrating. I'll leave it to Peter to sort out.

Peter Jakacki · 2018-04-30 12:52

I've already mentioned this in another thread but I did manage to integrate TAQOZ into Chip's boot code and load that into ROM but because this is loaded via PNut it needed a coginit #0,#$FC000 instruction at address 0 to make it run first time. A hard or hub reset though causes the fixed ROM to be loaded and wipe out the new test ROM so there is no way to stop that and does not allow power-up testing either. TAQOZ copies all of ROM to $0C000 and then runs from there while moving the dictionary to just before $0C000 and setting up the code pointer in low memory. For the moment I assume I have an 80MHz clock available and run at 115200 baud but final touches may allow the autobaud info to be used in RCFAST.

Chip, I mentioned before about having a fixed boot ROM that did absolutely nothing else but load 16kB from EEPROM on independent pins from other boot devices. Now it could just as well be 3-pin SPI Flash if that helps and you could use the same SPI Flash code you have now but if that were compiled into the ROM test version FPGA then all any of us would have to do is connect an SPI Flash to say P0..P2 which would hold the 16k image that gets loaded into $FC000 (over the top of the test loader) and run. Of course we would need some way of loading code into that in the first place but it's easy enough to program a P1 to emulate the P2 loader and program the SPI Flash.

It would be easy enough to do and would allow a full test of the boot sequence from power-up. The preloader would always assume 3-pin SPI Flash and load 16kB then jump to $FC000. I can send Cluso an SPI Flash module to plug straight into the CVA9 along with a P1 as the ROM loader so he can fully test it too.

jmg · 2018-04-30 21:20

Peter Jakacki wrote: »

...
Chip, I mentioned before about having a fixed boot ROM that did absolutely nothing else but load 16kB from EEPROM on independent pins from other boot devices. Now it could just as well be 3-pin SPI Flash if that helps and you could use the same SPI Flash code you have now but if that were compiled into the ROM test version FPGA then all any of us would have to do is connect an SPI Flash to say P0..P2 which would hold the 16k image that gets loaded into $FC000 (over the top of the test loader) and run. Of course we would need some way of loading code into that in the first place but it's easy enough to program a P1 to emulate the P2 loader and program the SPI Flash.

It would be easy enough to do and would allow a full test of the boot sequence from power-up. The preloader would always assume 3-pin SPI Flash and load 16kB then jump to $FC000. I can send Cluso an SPI Flash module to plug straight into the CVA9 along with a P1 as the ROM loader so he can fully test it too.

That sounds the closest to 'actual operation' too.

Another (less special hardware) solution could be to expand the BOOT serial command set slightly, to allow a call into the SPI loader.
In this use, Serial would load the boot-image, and then can load any part of FLASH memory, or can re-run the boot decision flow, after change of a signal pin.

Peter Jakacki · 2018-04-30 22:49

jmg wrote: »

That sounds the closest to 'actual operation' too.

Another (less special hardware) solution could be to expand the BOOT serial command set slightly, to allow a call into the SPI loader.
In this use, Serial would load the boot-image, and then can load any part of FLASH memory, or can re-run the boot decision flow, after change of a signal pin.

Since we need to test all aspects of the final boot ROM it seems prudent to "boot the boot ROM" from an independent device so that when the test boot ROM boots we can test it loading from serial, or SPI Flash, or SD, or escaping into TAQOZ, just like the real thing.

Chip hasn't replied yet, but he is probably a bit busy. However, it would not only be nice to have an FPGA configured this way, it would also do away with having to do a time-consuming FPGA compile every time we think we have a final boot ROM candidate but find there is still a little tweaking. The independent SPI Flash chip may not be as easy, but it is the only sure way we can test boot properly, and with the boot ROM programmer setup it becomes quick and easy to load up new boot ROMs.

Then when we are happy with the boot ROM then Chip can do a final FPGA compile that all of us can test just to be on the safe side.

It might be possible to massage the current bootloader to do both but I'm looking at keeping it simple for this special case.

BTW, I posted the hardwired cog 0 boot code in another thread but let's have a look at what the P2 actually goes through internally to "boot".
1. Reset
2. Run the Stage 1 code hardwired as 5 instructions in cog 0 that loads the internal sequential 16kB ROM into $FC000
3. Run Stage 2 boot code at $FC000
4. Special development boot ROM reads special SPI Flash on P0..P2 back into $FC000
5. Run Stage 3 boot code as if it were the final boot ROM
6A. If Stage 3 loads 256 longs from the default boot SPI then this needs another stage as well to load the final code
6B. If Stage 3 loads serial then it runs it from address 0
6C. If Stage 3 loads SD then it may be loading another boot loader stage

The other thing I discussed with Cluso was making the SPI Flash boot the same way as SD in that we have a signature and header that describes the source and destination and size of the image to load. That way the SPI Flash won't need another boot loader stage.
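
A header of that kind could look something like the Python sketch below. The thread only says "a signature and header that describes the source and destination and size", so the field order, the 32-bit little-endian widths, and the `BOOT` signature bytes are all placeholders invented for the example.

```python
import struct

# Hypothetical layout: 4-byte signature, then source, destination and size as
# 32-bit little-endian values. None of this is the actual ROM format.
HEADER = struct.Struct("<4sLLL")
SIGNATURE = b"BOOT"  # placeholder, not the real signature

def pack_header(source, destination, size):
    """Build a header describing where the image is and where it should go."""
    return HEADER.pack(SIGNATURE, source, destination, size)

def parse_header(raw):
    """Return the header fields, or None if the signature does not match."""
    signature, source, destination, size = HEADER.unpack_from(raw)
    if signature != SIGNATURE:
        return None
    return {"source": source, "destination": destination, "size": size}

raw = pack_header(source=0x1000, destination=0x00000, size=0x4000)
print(parse_header(raw))
print(parse_header(b"\xff" * HEADER.size))   # erased flash: no valid header
```

With a header like this the loader can read the whole image in one pass, which is why no extra boot loader stage would be needed.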

jmg · 2018-04-30 23:42

Peter Jakacki wrote: »

The other thing I discussed with Cluso was making the SPI Flash boot the same way as SD in that we have a signature and header that describes the source and destination and size of the image to load. That way the SPI Flash won't need another boot loader stage.

Yes, I think having a header for Source/Dest/Size keeps things flexible, and simpler for final users.
I would also like to see Serial boot being able to call into the SPI loader - that allows a small MCU as a stage 1 loader, and to select a choice of flash images.
Common for systems to have test and diagnostics code, and final product code.

Peter Jakacki wrote: »

BTW, I posted the hardwired cog 0 boot code in another thread but let's have a look at what the P2 actually goes through internally to "boot".
1. Reset
2. Run the Stage 1 code hardwired as 5 instructions in cog 0 that loads the internal sequential 16kB ROM into $FC000
3. Run Stage 2 boot code at $FC000
4. Special development boot ROM reads special SPI Flash on P0..P2 back into $FC000

An alternative to that, could be to have the verilog code in 2., talk directly to the external ROM-mimic SPI, and load $FC000 ?
If doing that, you could likely load from $1CF000 in Flash, to RAM $FC000 (ie now use one flash, with test booter image well above user code )

Cluso99 · 2018-04-30 23:42

Thinking about the above comments. Will respond a little later, perhaps after a call to Peter.
I am home today so I can work on it.

Peter Jakacki · 2018-05-15 06:18

All the testing seems to be good, just minor tweaks (I hope).

I pasted my SD code into TAQOZ and did a BACKUP into Flash just like we do with EEPROM except I write to the last 64KB of 1MB of Flash. A ^R or RESTORE is all I need to do after a hard reset to get it all back again. Once it is loaded you can FLOAD Forth source code directly too. Anyway, I used my SD software to open the SD boot file and enter a new 3 instruction LED Blinky in as machine code, here is the source:

main		drvnot	#5
		waitx	##3000000
		jmp	#main

and enter the machine code like this:

TAQOZ# MOUNT 
Mounted NO NAME     0A0D.A260-6130.6430 [SDSL08G]          FAT32    7,944MB (32kB/cluster)  ok
TAQOZ# ls
/NO NAME     
_BOOT_P2.BIX   _BOOT_P2.BIY   SDCARD  .FTH   SPLAT   .FTH     ok
TAQOZ# FOPEN _BOOT_P2.BIX...opened at 0000.4040   ok
TAQOZ# 0 $20 FS DUMPL 
00.0000: FD65.FE00 0000.0000 FD64.1258 FD64.0E58    ..e.....X.d.X.d.
00.0010: FD64.0A59 FD60.2C1A FA60.2C15 FD60.2224    Y.d..,`..,`.$"`.
TAQOZ# $FD640A5F 0 FS!  ok
TAQOZ# $FF8016E3 4 FS!  ok
TAQOZ# $FD65801F 8 FS!  ok
TAQOZ# $FD8FFFF0 12 FS!  ok
TAQOZ# 0 $10 FS DUMPL 
00.0000: FD64.0A5F FF80.16E3 FD65.801F FD8F.FFF0    _.d.......e.....
TAQOZ#

FS! will store a long at the address specified (0, 4, 8, 12) in the currently open file.
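
For anyone wanting to cross-check the session above, this Python sketch mimics the four stores into a 16-byte buffer and pulls the immediate operands back out of the first three longs. `fs_store` is only a stand-in for FS!, and the field positions used (a 9-bit operand in bits 17..9, a 23-bit value in the prefix long) are my reading of the P2 instruction encoding rather than anything stated in the thread.

```python
import struct

longs = [0xFD640A5F, 0xFF8016E3, 0xFD65801F, 0xFD8FFFF0]
image = bytearray(16)

def fs_store(buf, value, offset):
    """Stand-in for FS!: write one little-endian long at a byte offset."""
    struct.pack_into("<L", buf, offset, value)

for index, value in enumerate(longs):
    fs_store(image, value, index * 4)

def d_field(instruction):
    """9-bit operand held in bits 17..9 (assumed field position)."""
    return (instruction >> 9) & 0x1FF

def augmented(prefix, instruction):
    """Combine the prefix long's low 23 bits with the following 9-bit operand."""
    return ((prefix & 0x7FFFFF) << 9) | d_field(instruction)

pin = d_field(longs[0])                 # drvnot #5
delay = augmented(longs[1], longs[2])   # waitx ##3000000
print(image[:4].hex(), pin, delay)
```

The first four bytes come out as 5F 0A 64 FD, which matches the `_.d.` shown in the ASCII column of the dump.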

Flush or close the file and press the reset and it loads the SD code at boot which flashes P5 ~10 times a second.
I can also edit Flash in a similar way but I could also copy from SD to Flash and automatically calculate the Prop checksum.
Playing like this helps me stumble across little things that need to be fixed up in ROM before it is ROM!

This shows reset going high and 91ms later the program is running toggling P5 (fast).
BTW. You can make an LED blink with smartpins like this:

TAQOZ# 4 PIN 10 HZ  ok

cgracey · 2018-05-15 22:27

Nice, Peter!

We had the weekly status meeting with OnSemi a few hours ago. They are waiting for us to deliver the ROM. I hope we get it right.

jmg · 2018-05-15 22:43

cgracey wrote: »

We had the weekly status meeting with OnSemi a few hours ago. They are waiting for us to deliver the ROM. I hope we get it right.

What is the cost of a ROM revision, if done between FAB runs ? IIRC you said one metal layer ?

cgracey · 2018-05-15 23:13

jmg wrote: »

cgracey wrote: »

We had the weekly status meeting with OnSemi a few hours ago. They are waiting for us to deliver the ROM. I hope we get it right.

What is the cost of a ROM revision, if done between FAB runs ? IIRC you said one metal layer ?

Not sure. Probably between $5k and $10k.

Peter Jakacki · 2018-05-15 23:19

Chip, sorry for the late ROM code that I've emailed you and Cluso. I've added in some basic virtual memory support for SD as well as Flash where we can fetch and store and dump just like hub memory etc. If we want full FAT32 support for handling files we can boot a full version of Tachyon from SD in that case.

I noticed however that if I leave an SD card in on boot and it doesn't have any boot files, perhaps only data for logging, that the P2 stops. I need the P2 to either check Flash first for boot files or check after SD since I will most definitely have hardware where the SD card is only used for data.

jmg · 2018-05-15 23:27

Peter Jakacki wrote: »

... since I will most definitely have hardware where the SD card is only used for data.

Wouldn't most cases of data-only SD, use different pins to keep the data completely separate for better security ?
Or do you want to try and save pins, by allowing some sort of auto-sense of data/boot ?
What if someone grabs any old SD, not knowing it has a P2 boot still on it ?

cgracey · 2018-05-16 10:29

Peter and Cluso got their TAQOZ and SD Boot and Monitor code in today. We got it all wrapped up and tested. Looks good.

The ROM code is off to OnSemi now.

Thanks, Peter and Cluso!
