Prop2 FPGA files!!! - Updated 2 June 2018 - Final Version 32i

cgracey · 2018-03-13 07:26

jmg wrote: »

cgracey wrote: »

There was one issue I had to rediscover and I'm thinking about how to handle it: Hub-exec takes control of the hub RAM FIFO and clobbers whatever RDFAST/WRFAST activity was underway. There's no way around this, except to wholly locate the debug ISR activity within the cog RAM..

Any RDFAST/WRFAST activity has a short time limit, so the choices might be to delay the exec part on BREAK, or to have just the WAIT:JMP code in COG?

Bytecode engines run RDFAST continuously and can't get blown up on debug interrupts. Also, streamer RDFAST/WRFAST activity might be running continuously and needs to act sanely.

I've got it worked out now...

Instead of debug jump vectors at the end of hub RAM, there are debug buffer areas. For each cog, there is a 32-long save/restore area for registers $000..$01F and a 32-long debug routine area which gets loaded into registers $000..$01F.

Then, there is an 8-long ROM in each cog that always supplies instructions (instead of the cog RAM) when code is executed from $001F8..$001FF. It contains a debug-entry program at $1F8 and a debug-exit program at $1FD. PTRA and PTRB read as special values when the ROM is executing:

Debug ISR entry code - saves $000..$01F to hub RAM, loads debugger code into $000..$01F and jumps to $000:

$1F8 -	setq	#$1F	'ready to save registers $000..$01F
$1F9 -	wrlong	0,ptra	'ptra reads $000FFx00, x = !cog_id
$1FA -	setq	#$1F	'ready to load registers $000..$01F
$1FB -	rdlong	0,ptrb	'ptrb reads $000FFx80, x = !cog_id
$1FC -	jmp	#0	'jump to loaded registers

Debug ISR exit code - restores $000..$01F from hub RAM and returns from debug interrupt:

$1FD -	setq	#$1F	'ready to load registers $000..$01F
$1FE -	rdlong	0,ptra	'ptra reads $000FFx00, x = !cog_id
$1FF -	reti0		'return from debug interrupt

This gets us the hub-cog data interchange that we need to save and restore context, and load in debugger code for breakpoint activity - all without affecting RDFAST/WRFAST or anything that is using them. And this is totally hidden from the cog program without any need for agreement on register usage, etc.
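To make the hard-wired pointer values concrete, here is a minimal Python sketch of the address arithmetic. It assumes that x in $FFx00 / $FFx80 is the 4-bit inverse of the cog ID, as the comments on $1F9 and $1FB suggest. The function names are made up for illustration and are not part of any tool.

```python
def debug_save_addr(cog_id: int) -> int:
    """Hub address PTRA reads as during debug-ROM execution: $FFx00, x = !cog_id."""
    x = ~cog_id & 0xF            # assumption: 4-bit bitwise inverse of the cog ID
    return 0xFF000 | (x << 8)

def debug_code_addr(cog_id: int) -> int:
    """Hub address PTRB reads as during debug-ROM execution: $FFx80."""
    # 32 longs * 4 bytes = $80, so the code area starts right after the save area.
    return debug_save_addr(cog_id) + 0x80

for cog in range(8):
    print(f"cog {cog}: save ${debug_save_addr(cog):05X}, code ${debug_code_addr(cog):05X}")
```

With 8 cogs, the lowest address used is $FF800, so the buffers occupy the last 2KB of the hub map.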

jmg · 2018-03-13 07:58

cgracey wrote: »

This gets us the hub-cog data interchange that we need to save and restore context, and load in debugger code for breakpoint activity - all without affecting RDFAST/WRFAST or anything that is using them. And this is totally hidden from the cog program without any need for agreement on register usage, etc.

Sounds good.
How do you show the user-values in ptra, ptrb ?

Is there room for an early breakpoint pass counter ?

I think the entry code must be in COG, but the exit code could be anywhere, since the RDFAST/WRFAST is done by then?

cgracey · 2018-03-13 08:18

It all works!

Those special PTRA/PTRB values only appear during $1F8..$1FF ROM execution. At all other times, PTRA and PTRB reflect their true values. I just needed some efficient way to get pointers to the cog's load and save buffers at the end of hub memory. I only had 8 instructions in the available $1F8..$1FF space, so there was no room for a COGID instruction plus math instructions to compute those pointer values. They needed to just be there, already, so I did it in hardware.

You can always point INA/IJMP0 to any address, either in cog RAM, lut RAM or hub RAM, if you want. That 8-instruction ROM mapped into $1F8..$1FF is just there for initialization and in case you want to keep using it. INA/IJMP0 is pointed to $1F8 on cog initialization, so that you can place any custom debugger program into the end-area of hub RAM.

ozpropdev · 2018-03-13 08:45

Neat!
I'm working on my debugger at the moment.
This will smooth the connection between the two nicely.

cgracey · 2018-03-13 08:59

ozpropdev wrote: »

Neat!
I'm working on my debugger at the moment.
This will smooth the connection between the two nicely.

I've got to go to Parallax in the morning, so I must go to bed, but I'll start compiling you a BeMicro-A9 image with this stuff in it. It works really well.

All that's left to do is figure out exactly what kind of write protection (if any) is needed towards the end of hub RAM.

potatohead · 2018-03-13 15:08

Does that put the 16k block into question, or does this mean a secondary layer for debug of some kind?

cgracey · 2018-03-13 15:30

potatohead wrote: »

Does that put the 16k block into question, or does this mean a secondary layer for debug of some kind?

It means that the last 2KB of RAM, in the case of 8 cogs, will be used by the debug mechanism.

potatohead · 2018-03-13 18:01

So we get the 16k block write protect, and the upper 2k of that is used by debug?

Whether one wants to write-protect that 16k, say for some resident tool, doesn't matter for debug, which will always write-protect the upper 2k when in use?

jmg · 2018-03-13 20:21

cgracey wrote: »

...
Those special PTRA/PTRB values only appear during $1F8..$1FF ROM execution.

Hehe, undocumented features.... perhaps those special PTRS need a special alias name too ?

cgracey wrote: »

You can always point INA/IJMP0 to any address, either in cog RAM, lut RAM or hub RAM, if you want. That 8-instruction ROM mapped into $1F8..$1FF is just there for initialization and in case you want to keep using it. INA/IJMP0 is pointed to $1F8 on cog initialization, so that you can place any custom debugger program into the end-area of hub RAM.

I'm not quite following - is there both code ROM and code RAM overlaying the registers ?

I guess that ROM could be slightly larger, if needed - the entry needs to be $1F8..$1FF, to enable it, but that can then enable a larger area.

evanh · 2018-03-13 21:56

potatohead wrote: »

So we get the 16k block write protect, and the upper 2k of that is used by debug?

I think Chip is wanting to separately de-write-protect that 2 KB to give some known debug/kernel workspace without turning off the write protect completely.

potatohead · 2018-03-13 23:01

I think so too. Just not clear on it all yet.

And, if so? Great! Seems like a solid approach.

cgracey · 2018-03-14 04:50

I'm thinking we need a global debug-enable via HUBSET.

Also, 32-long save/restore areas for each cog are necessary, but unique 32-long debugger programs for each cog are not necessary. There can be just one 32-long program that all cogs initially load.

I'm thinking that the hub RAM write protect needs to be set-only. Otherwise, cog programs will be setting and clearing the mechanism asynchronously and causing problems. Or, we have no write-protect. For that matter, maybe no global debug-enable, either.

As Evanh said, it's a slippery slope to 'protected mode'. Maybe we shouldn't go in that direction, at all. I mean, if some rogue cog program is wiping out hub RAM, there will be other likely problems wreaking havoc of their own.

Maybe simple is best: No write-protect and debug always runs, even if it only shuts itself off in each case. Fewer cases means simpler management.

jmg · 2018-03-14 05:08

cgracey wrote: »

I'm thinking we need a global debug-enable via HUBSET.

Also, 32-long save/restore areas for each cog are necessary, but unique 32-long debugger programs for each cog are not necessary. There can be just one 32-long program that all cogs initially load.

I'm thinking that the hub RAM write protect needs to be set-only. Otherwise, cog programs will be setting and clearing the mechanism asynchronously and causing problems. Or, we have no write-protect. For that matter, maybe no global debug-enable, either.

As Evanh said, it's a slippery slope to 'protected mode'. Maybe we shouldn't go in that direction, at all. I mean, if some rogue cog program is wiping out hub RAM, there will be other likely problems wreaking havoc of their own.

Maybe simple is best: No write-protect and debug always runs, even if it only shuts itself off in each case. Fewer cases means simpler management.

I'm not following the final choice here - there is still boot area protection, right ?

Are the boot ROM and Debug ROMs top-metal type ROMS that can be changed, or are they compiled into the fabric ?
If you can do metal ROMS, I could see some customer interest in those that would include protection controls.
ie Protection can always be left off, but it may bring more customers too...

potatohead · 2018-03-14 05:44

Set only, meaning debug gets access, nothing else does?

Great. Let's do that.

Seems to me, that all can be used for fast, robust debug, and that is attractive to many people. An easily stomped thing will seem like a kludge. It kind of is.

Should someone want to do more, they can, and in those use cases, raw speed isn't all that important.

The reason for doing this at all is robustness. Forcing that makes a ton of sense. Keeps people out of trouble.

And the use cases are super clear. Debug, or maybe dev, interactive. Or, it's just running raw, P2 full access as P1 is today.

cgracey · 2018-03-15 15:23

potatohead wrote: »

Set only, meaning debug gets access, nothing else does?

Great. Let's do that.

Seems to me, that all can be used for fast, robust debug, and that is attractive to many people. An easily stomped thing will seem like a kludge. It kind of is.

Should someone want to do more, they can, and in those use cases, raw speed isn't all that important.

The reason for doing this at all is robustness. Forcing that makes a ton of sense. Keeps people out of trouble.

And the use cases are super clear. Debug, or maybe dev, interactive. Or, it's just running raw, P2 full access as P1 is today.

After a day and a half of reasoning this whole thing out and writing test code and Verilog, I think I've got it all solved.

The top 16KB of hub RAM is always going to be pushed up to address range $FC000..$FFFFF and it will not appear where it would normally end, which in the case of 512KB would mean that hub memory looks like this:

$00000..$7BFFF = RAM
$7C000..$FBFFF = empty, reads $00
$FC000..$FFFFF = RAM, write-protectable, contains debug buffers

When the last 16KB is write-protected, only cogs in debug ISRs using (SETQ+)WRxxxx instructions can write that range of memory.
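As a rough model of the map above, the following Python sketch classifies a 20-bit hub address and applies the write-protect rule. The names are hypothetical, and treating writes to the empty range as having no effect is an assumption; the post only says that range reads $00.

```python
MAP_TOP = 0x100000          # 20-bit hub address space
TOP_BLOCK = 16 * 1024       # the relocated last 16KB

def hub_region(addr: int, ram_bytes: int = 512 * 1024) -> str:
    """Classify an address as 'ram', 'empty' or 'top16k' for a given RAM size."""
    if not 0 <= addr < MAP_TOP:
        raise ValueError("hub addresses are 20 bits")
    low_end = ram_bytes - TOP_BLOCK      # $7C000 for 512KB
    top_start = MAP_TOP - TOP_BLOCK      # $FC000
    if addr < low_end:
        return "ram"
    if addr < top_start:
        return "empty"
    return "top16k"

def write_allowed(addr: int, write_protect: bool, in_debug_isr: bool) -> bool:
    """True if a WRxxxx to addr would land in RAM under the rule described above."""
    region = hub_region(addr)
    if region == "empty":
        return False                     # assumption: nothing there to write
    if region == "top16k" and write_protect:
        return in_debug_isr              # only debug ISRs may write when protected
    return True
```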

There is a global control bit to enable debug interrupts.

Both write-protect and debug-enable bits are lockable:

HUBSET ##%0010_0000_0000_0000_0000_0000_0000_0LWB

L = 1 to lock W and B being written, reset required to unlock again
W = 1 to write-protect last 16KB of hub RAM located at top of hub map
B = 1 to enable debug interrupts
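For anyone building that HUBSET operand in a host-side tool, here is a small Python sketch of the bit packing, taken directly from the pattern above (L in bit 2, W in bit 1, B in bit 0). The function name is made up for illustration.

```python
HUBSET_DEBUG_BASE = 0x2000_0000   # %0010_0000_0000_0000_0000_0000_0000_0000

def hubset_debug_value(lock: bool, write_protect: bool, debug_enable: bool) -> int:
    """Pack the L, W and B bits into the HUBSET operand shown above."""
    return (HUBSET_DEBUG_BASE
            | (int(lock) << 2)            # L: lock W and B until reset
            | (int(write_protect) << 1)   # W: write-protect the last 16KB
            | int(debug_enable))          # B: enable debug interrupts

print(f"{hubset_debug_value(True, True, True):032b}")
```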

Being able to globally disable debug interrupts is nice, because it lets cogs launch much faster.

Each cog has 64 longs worth of buffer space at the end of hub memory for saving and restoring registers $000..$03F:

Hub range %1111_1111_CCCC_xxxx_xxxx (where CCCC = !cogid) holds buffers for each of the potential 16 cogs.

Hub range %1111_1110_1111_xxxx_xxxx holds the debug code that is loaded into cog registers $000..$03F between saving and restoring.
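Reading those two bit patterns as byte addresses gives the layout sketched below in Python. This is only a worked check of the arithmetic, assuming CCCC is the 4-bit inverse of the cog ID; the names are hypothetical.

```python
BUFFER_BYTES = 64 * 4                      # 64 longs per area = $100 bytes

def cog_buffer_range(cog_id: int) -> tuple[int, int]:
    """First and last byte of a cog's save/restore buffer: %1111_1111_CCCC_xxxx_xxxx."""
    cccc = ~cog_id & 0xF                   # assumption: CCCC = 4-bit inverse of cog ID
    start = 0xFF000 | (cccc << 8)
    return start, start + BUFFER_BYTES - 1

# Shared debug code area: %1111_1110_1111_xxxx_xxxx
DEBUG_CODE_RANGE = (0xFEF00, 0xFEF00 + BUFFER_BYTES - 1)

for cog in range(16):
    lo, hi = cog_buffer_range(cog)
    print(f"cog {cog:2d}: ${lo:05X}..${hi:05X}")
print(f"debug code: ${DEBUG_CODE_RANGE[0]:05X}..${DEBUG_CODE_RANGE[1]:05X}")
```

Under these assumptions the 16 cog buffers fill $FF000..$FFFFF and the shared debug code sits directly below them, all inside the write-protectable top 16KB.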

The debugger will use ONE dedicated cog to coordinate the other cogs' debug interrupt activities. Each cog that (re)starts will alert the main debugger cog of its presence and communicate via the protected hub memory. The main debugger cog will handle the communication with the host system and report hub memory and smart pin status, along with cog debugging status. That main debugger cog will operate in debug ISR mode all the time, being able to r/w the protected hub memory, along with the other cogs, while in their own debug ISRs.

Having a whole 64 longs of program/data space may get around needing code overlays in the debugger program that runs in the cogs being debugged. I think 16 longs might be sufficient, but there's not much elbow room in there and several overlays would be needed. My next goal (after getting an FPGA update out) is to prove the whole debugger concept. It may be that 16 longs, or even 8, are sufficient. That would save memory, but not time, since many more cycles will be spent communicating with the main debugger cog and host system than loading and saving registers. A small set would be good, though, for things like breakpoint counters that Jmg was talking about.

Time to get some rest.

cgracey · 2018-03-15 15:45

I thought a lot about giving each cog an associated debug code area, aside from the register buffer space. That would allow custom little breakpoint patches. There are two problems, though: First, custom patches are not really that useful, it seems, and their functionality would pale in comparison to the full-blown debugger's. Second, starting cogs is most often done on a next-come/next-serve basis (COGNEW). It is impractical to try to sequester a cog and set it up for a custom breakpoint routine. It's much more solid to have EVERY cog load up the same 64-long debugger code and negotiate its way through the process.

I was thinking, "How will we know which source code associates with each cog?" Well, there's a bit flag that tells us when we are in an initial breakpoint after launching, before anything gets modified. We can use the PTRA and PTRB values to determine data and code addresses. That's usually sufficient for instance number and program recognition. Once matched, we show the corresponding source code in the high-level debugger window.
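One way a host-side tool might do that matching is sketched below in Python. Everything here is hypothetical: the table contents, the names, and the choice of which pointer carries the code address and which carries the data address are assumptions, not something stated above.

```python
# Hypothetical table produced at compile time: hub code address -> program name.
PROGRAMS = {0x00400: "blinker", 0x01200: "serial_driver"}

def identify_cog(code_addr, data_addr, programs, instances):
    """Return (program name, instance number) for a freshly launched cog, or None.

    instances maps each program name to the list of data addresses seen so far,
    so a second launch of the same code with different data gets the next number.
    """
    name = programs.get(code_addr)
    if name is None:
        return None                       # no compiled program starts at this address
    known = instances.setdefault(name, [])
    if data_addr not in known:
        known.append(data_addr)
    return name, known.index(data_addr)

seen = {}
print(identify_cog(0x00400, 0x08000, PROGRAMS, seen))
print(identify_cog(0x00400, 0x08100, PROGRAMS, seen))
```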

You can fire up your whole app, without any breakpoints, and just click on cogs' windows to generate async breakpoints, where they'll update their local debugger window and return to execution, unless you tell one to stop, from which point you could set an address breakpoint, single-step, wait for another async breakpoint, or a BRK {#}D instruction (software break).

potatohead · 2018-03-15 17:22

Chip, that all seems quite good. The lock, reset to unlock makes all the use cases possible.

These features are going to matter. Thanks, nice work.

>It's much more solid to have EVERY cog load up the same 64-long debugger code and negotiate its way through the process.

Yes, that is the Prop way. That negotiation is almost free compared to other activity, too.

jmg · 2018-03-15 19:34

cgracey wrote: »

Each cog has 64 longs worth of buffer space at the end of hub memory for saving and restoring registers $000..$03F:
Hub range %1111_1111_CCCC_xxxx_xxxx (where CCCC = !cogid) holds buffers for each of the potential 16 cogs.
Hub range %1111_1110_1111_xxxx_xxxx holds the debug code that is loaded into cog registers $000..$03F between saving and restoring.

I think this means the pointers to that HUB area are hard-wired, but the memory itself is simply HUB RAM, so the bootloader/user must avoid this 1024-long footprint?
This is also used for mailbox buffers (as below) ?
This means a typical memory download would be user code, and debug-stub info ? How does protection interact here ?

cgracey wrote: »

The debugger will use ONE dedicated cog to coordinate the other cogs' debug interrupt activities. Each cog that (re)starts will alert the main debugger cog of its presence and communicate via the protected hub memory. The main debugger cog will handle the communication with the host system and report hub memory and smart pin status, along with cog debugging status. That main debugger cog will operate in debug ISR mode all the time, being able to r/w the protected hub memory, along with the other cogs, while in their own debug ISRs.

Hmm, does that mean there is no way to debug an 8-COG design? That's quite a limit.
Could one COG (least loaded?) somehow build/compile in Debug-mode, and launch the debug stubs itself, before running user code ?

cgracey wrote: »

A small set would be good, though, for things like breakpoint counters that Jmg was talking about.

It would be nice to put the break pass counter before any save, to get closest to 'almost real time', but that entry stub area is tight, and it also limits you to one counter only.
The alternative is to have users add 2 lines of counter code, during debug, and break on the not-skipped line ?

Seairth · 2018-03-16 00:32

@cgracey, did you mean

$7C000-$FBFFF = empty, reads $00

above?

cgracey · 2018-03-16 01:33

Seairth wrote: »

@cgracey, did you mean

$7C000-$FBFFF = empty, reads $00

above?

You are right. I just fixed it. Thanks.

msrobots · 2018-03-16 03:10

I think this is going in a wrong direction.

OK a global debug enable flag makes sense, but requiring a dedicated master debug cog does not make sense at all.

The previous simple debug solution, which just provided an interrupt vector where a programmer could put some code without requiring another COG, is far more useful.

Sadly, we do not have 16 COGs with one to spare, just 8, and a master debug COG is simply not available when you need all your COGs. With 16 times the RAM and 2 times the pins but still 8 COGs, those COGs will become a scarce resource.

Sure, a dedicated controller/debug COG comes in handy, but it should not be the only way to use debug.

One should be able to add code to an 8-COG design and debug it without the need for a free COG.

Mike

msrobots · 2018-03-16 03:25

Moving the 16K completely to the top is a very good decision; having it mirrored at the end of the 512K HUB was a potential issue for later memory expansion.

Is there any way to have an interrupt source that fires on access to HUB memory outside the existing range?

just asking...

Mike

cgracey · 2018-03-16 03:30

Mike,

I know what you are saying. If we could debug just one cog at a time, one cog could do it. Maybe instead of a global debug enable, I need to make one enable for each cog. That way, you could, say, set cog7 up for debugging, and be sure to use that cog for your testing. Or, we could augment COGINIT to be able to stipulate debugging, but I kind of don't like that because it starts to require modification of source code.

Here's the thing about having a cog dedicated to handling debugging: It can continuously serve hub data and smart pin data to the host, as well as sensibly coordinate other cogs' debug events. It could even drive a local VGA monitor for really fast updates of internal state info. All kinds of things are possible.

The problem with having multiple cogs in debug mode without a coordinator cog is that there will be confusion about resource sharing. For example, we could use a LOCK to coordinate which cog in a debug ISR gets to talk over the serial port to the host machine, giving each a turn. That works great until some cog gets stopped or restarted by another cog, potentially leaving the LOCK state in limbo. I've been thinking about adding timeouts to LOCK bits or clearing them if their current owner gets stopped or restarted. The latter would be maybe better. Without some additional mechanism, though, I don't know if we could reliably share the serial port.

cgracey · 2018-03-16 03:38

msrobots wrote: »

Moving the 16K completely to the top is a very good decision; having it mirrored at the end of the 512K HUB was a potential issue for later memory expansion.

Is there any way to have an interrupt source that fires on access to HUB memory outside the existing range?

just asking...

Mike

We could easily have the 20-bit breakpoint address be used for read/write sensitivity on hub accesses, but only for RDxxxx/WRxxxx instructions (maybe without SETQ/SETQ2 before), but definitely not for hub FIFO accesses, as the activity is too decoupled to be easily traced.

cgracey · 2018-03-16 03:39

During a breakpoint, a cog could spill its guts and report any hub RAM data.

msrobots · 2018-03-16 03:47

@Chip, yes, for a perfect debug of multiple COGs your approach is completely right. I am just asking to keep the simplest solution available as well.

Say I have an 8-COG design running and want to call some debug routine of my own on COG 4, for example.

There is no need to sync multiple COGs and no need for a HOST at all. I just want to break at some address, write some value to HUB and continue, so there is nothing to exchange with a HOST.

You seem to concentrate too much on a debugger as a program instead of on the debugging features of the COG itself.

Mike

cgracey · 2018-03-16 03:55

Mike, I hear you.

I could go back to having separate debug programs for each cog, instead of a common one. It is really easy to make use of the debug interrupt and get whatever you want going on. It's fast, too.

I'm going to make the LOCK bits work such that if the current cog who has the '1' state goes dead, the LOCK goes back to '0', unless some other cog is simultaneously asking for it, in which case it stays '1'. This would be a nice addition to LOCK functionality, anyway. And it would allow central debugging without a coordinator cog.
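The intended LOCK behaviour can be modelled with a few lines of Python. This is only a behavioural sketch with made-up names, not the hardware logic, and it assumes that when the bit "stays '1'" the cog that was asking for it at that moment becomes the new owner.

```python
class LockBit:
    """Toy model of one LOCK bit that frees itself when its owner cog dies."""

    def __init__(self):
        self.owner = None                 # None means the bit is '0'

    def try_take(self, cog_id: int) -> bool:
        """Attempt to take the lock; True on success."""
        if self.owner is None:
            self.owner = cog_id
            return True
        return False

    def release(self, cog_id: int) -> None:
        """Normal release; only the current owner can clear the bit."""
        if self.owner == cog_id:
            self.owner = None

    def cog_stopped(self, cog_id: int, requester=None) -> None:
        """Called when a cog is stopped or restarted.

        If it held the lock, the bit drops to '0', unless another cog is asking
        for the lock at that same moment, in which case it stays '1'
        (assumption: ownership passes to that requester).
        """
        if self.owner == cog_id:
            self.owner = requester

lock = LockBit()
lock.try_take(3)          # cog 3 takes the serial-port lock
lock.cog_stopped(3)       # cog 3 gets stopped by another cog
print(lock.try_take(5))   # cog 5 can now take the lock instead of waiting forever
```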

cgracey · 2018-03-16 04:01

Ah, but I keep forgetting... having separate debug programs (64 longs, or whatever size) means that you need to know the number of the cog BEFORE you start it. This practically means doing a COGNEW to get a cog, but have it launch into an infinite loop program, then you ready its debug program from its newly-discovered ID, then COGINIT that cog with the final program. It's a pain in the butt when you are in the context of the greater tool system, which needs to be able to engage cogs on a random basis.

jmg · 2018-03-16 04:04

cgracey wrote: »

I'm going to make the LOCK bits work such that if the current cog who has the '1' state goes dead, the LOCK goes back to '0', unless some other cog is simultaneously asking for it, in which case it stays '1'. This would be a nice addition to LOCK functionality, anyway. And it would allow central debugging without a coordinator cog.

That last bit is important, as I mentioned above. Requiring a COG be consumed for debug, is quite an impact.
Having the choice of advanced debug of 7 COGs, using the 8th one, is more tolerable.

cgracey wrote: »

... If we could debug just one cog at a time, one cog could do it. Maybe instead of a global debug enable, I need to make one enable for each cog. That way, you could, say, set cog7 up for debugging, and be sure to use that cog for your testing. Or, we could augment COGINIT to be able to stipulate debugging, but I kind of don't like that because it starts to require modification of source code.

It's fairly common to have DEBUG and RELEASE build settings these days. So long as the impact of that Debug build can be kept minimal, that's tolerable.

cgracey · 2018-03-16 04:05

The way things are headed now, you'll just run your app and maybe have some initial breakpoints pre-defined in your source code. The breakpoints automatically occur and data is presented as your program runs. You can stop and poll any cog at any time. The debugger determines what program is running in each cog based on what it can match initial PTRA/PTRB values to in your compiled source. It takes a ton of setup off of your hands and just presents everything on a big platter.
