High-level languages and the LUT

RossH · 2019-08-26 23:32

I am doing some more tidying up on Catalina, and I'm just about to rewrite some of the library functions that currently reside in the LUT. Just before I do, I thought I'd canvas opinions on whether a high-level language implementation should or should not make use of the LUT to hold code.

Pro: it improves the speed of often-used functions (or language primitives) by making them cog-resident.

Con: it pretty much precludes the use of the LUT directly by high-level language programs for any other purpose (the LUT in another cog can still be used indirectly, by plugins etc).

While I think the "Pro" outweighs the "Con", the case is not really clear cut, since the speed increase gained by using the LUT as extra code space is not as dramatic on the P2 as it would have been on the P1 (which would have had to use LMM to execute code stored in Hub RAM).

One possibility I am considering is making it configurable whether the library code resides in the LUT or in Hub RAM. But I am not sure it is worth the effort. Even if I free up the LUT for high-level language applications to use, would there be any instances of anyone actually using it?

Does anyone have an opinion? Or an example where the LUT might actually be more useful if it was made available to high-level language programs?

twm47099 · 2019-08-26 23:51

I think that 2 adjacent cogs can share data in one of their LUTs (I forget the details). Would the ability for one cog to read the data from and write data to another cog's LUT have significant time savings compared with having to transfer data through the hub?

Tom

jmg · 2019-08-27 00:08

Other MCUs allow HLLs to specify the memory location of Arrays, & VARs based on memory type.

The 'close in' memory is faster and more precious, and the 'further away' memory is larger, but slower to access, and can be less atomic.

So I could certainly see uses for LUT specified as Array/var storage.

RossH wrote: »

One possibility I am considering is making it configurable whether the library code resides in the LUT or in Hub RAM. But I am not sure it is worth the effort. Even if I free up the LUT for high-level language applications to use, would there be any instances of anyone actually using it?

Does anyone have an opinion? Or an example where the LUT might actually be more useful if it was made available to high-level language programs?

Anything that gives users a gain, will certainly be used.

I think code can run from LUT at the same speed as COG (does that include jumps ?) and thus doing that frees up COG memory for more useful data storage.
Self modifying code I think needs to reside in COG, but that is rare in P2.

One approach could be to jump immediately from COG to LUT for fast-code, and default to COG as data and with code-growth into HUB for slower code (eg non interrupt), and someone using a lot of HUB code, will likely want a lot of data, so reducing 'fast' code impact into COG helps there.
Once LUT is full, a means to `use a bit more` from COG would be useful, but the COG SFR's located 0x1f0~0x1ff exclude a simple 'grow down' design.

I'm rusty on the details, but I think Chip worked to allow code to cross-over boundaries ? (LUT -> HUB ?)

RossH · 2019-08-27 00:22

jmg wrote: »

So I could certainly see uses for LUT specified as Array/var storage.

Yes, this is true, and I might implement that. But that isn't specifically to do with the LUT - it is just another way of treating the LUT as an extension to cog RAM.

I am looking more for any instances where the LUT may used - from a high-level language - as it was originally designed to be used.

RossH · 2019-08-27 00:27

twm47099 wrote: »

I think that 2 adjacent cogs can share data in one of their LUTs (I forget the details). Would the ability for one cog to read the data from and write data to another cog's LUT have significant time savings compared with having to transfer data through the hub?

Tom

Yes, you are right. But can you think of any instances where this would be useful from a high-level language? I always expected this feature to mostly be used for PASM drivers - i.e. drivers that required two cogs to co-operate (such as a video driver) plus some high-bandwidth communications between them. You don't typically write those in a high-level language.

Perhaps where the executive code was written in a high-level language in one cog, and the low-level code was written in PASM to run in the other cog so that they could share the LUT? But I can't think of a good example where this would actually be the preferred solution.

rogloh · 2019-08-27 00:39

LUTRAM can sometimes become useful for holding PASM ISRs if there is not enough space left in COG RAM to fit everything in there. Timing ISRs, smartpin based ISRs, Serial Rx transfers etc. Ideally such ISRs can be kept pretty small and may still fit in your COGRAM, but some might spill over and desire LUTRAM storage.

DavidZemon · 2019-08-27 01:59

RossH wrote: »

twm47099 wrote: »

I think that 2 adjacent cogs can share data in one of their LUTs (I forget the details). Would the ability for one cog to read the data from and write data to another cog's LUT have significant time savings compared with having to transfer data through the hub?

Tom

Yes, you are right. But can you think of any instances where this would be useful from a high-level language? I always expected this feature to mostly be used for PASM drivers - i.e. drivers that required two cogs to co-operate (such as a video driver) plus some high-bandwidth communications between them. You don't typically write those in a high-level language.

Perhaps where the executive code was written in a high-level language in one cog, and the low-level code was written in PASM to run in the other cog so that they could share the LUT? But I can't think of a good example where this would actually be the preferred solution.

In PropWare, I don't have any pure-assembly objects. Everything is written in C++ with inline-assembly where speed is critical. I'm not saying that is always the right choice - my objects are far from the fastest or smallest - but it is one way to do it. So for the way I write code I'm sure I'd eventually find a need for LUT accessible from a high-level language.

evanh · 2019-08-27 02:17

Aside from the various streamer functions, I've used LUT for doing block copies of hubRAM using SETQ2+RDLONG followed by SETQ2+WRLONG. Very fast.

RossH · 2019-08-27 02:33

All reasonable suggestions - but no "killer application" so far!

evanh · 2019-08-27 02:47

I used the streamer with lut addresses 0 and 1 to form a simple clock pattern to internally clock SPI smartpins. This improves the transmit clock to data lag timing by 3 sysclock which is a nice improvement.

twm47099 · 2019-08-27 02:48

RossH wrote: »

twm47099 wrote: »

I think that 2 adjacent cogs can share data in one of their LUTs (I forget the details). Would the ability for one cog to read the data from and write data to another cog's LUT have significant time savings compared with having to transfer data through the hub?

Tom

Yes, you are right. But can you think of any instances where this would be useful from a high-level language? I always expected this feature to mostly be used for PASM drivers - i.e. drivers that required two cogs to co-operate (such as a video driver) plus some high-bandwidth communications between them. You don't typically write those in a high-level language.

Perhaps where the executive code was written in a high-level language in one cog, and the low-level code was written in PASM to run in the other cog so that they could share the LUT? But I can't think of a good example where this would actually be the preferred solution.

With the P1 I write most programs in C, sometimes with a PASM driver in another cog, but often with a C function running in another cog monitoring sensors. When the sensor cog(s) get a value they may do some intermediate calculations and then pass the data through a global to the main (or other) cog for it to do something. With the P1 the data has to go through the hub, and sometimes if it is multiple related data (e.g. x & y values) it has to go through the hub 2 times.

I have one sensor (CMUCAM5 Pixy color tracking camera) that sends 7 pieces of data for each object that it detects. It sends the data for each object 50 times a second. The cycle of data is called a data frame. According to the Pixy spec it can send data on 135 objects each data frame. In reality the speed of the interface determines how many objects (each one with 7 pieces of data) can be received each frame. For example, using SPI interface in C using the SimpleTools library shiftin/shiftout I was able to capture 2 objects per frame (not very useful when trying to track multiple colors); using Spin with the SPIAsm objects from the PropTool Library, I got up to 6 objects. Using C with the SPIAsm PASM driver, I was able to get up to 14 objects, but when trying to use that data in an other cog, the number of objects dropped. Possibly transferring between cogs using LUT might have given more. (I realize that going to the P2 itself will increase the rate of collecting data, but cog-LUT-cog transfer should help.

Tom

RossH · 2019-08-27 04:03

Ok - clearly I need to learn a bit more not only about the P2's smartpins, but also about the streamer, goertzel processing and how all these use the LUT before I can make a sensible decision. I had hoped it would be easier than that, because I'm way out my depth on these things - I've read the documentation, but so far it is all just gobbledegook to me

But here's one fairly simple question that may help me shortcut the process: Is it perhaps worthwhile trying to reserve just part of the LUT for high-level applications to use, or is it basically an "all or nothing" deal?

Or, to put it another way - would an application usually need all of the LUT? I could currently reserve perhaps half of it without too much effort.

evanh · 2019-08-27 04:25

Yep. Goertzel is beyond me for the moment as well but I do know, like the video functions, Goertzel can use various sizes of lutRAM - application dependant. The remainder is available for other uses ... and with lutRAM being dual ported there isn't even any access contention between the streamer and cog.

ersmith · 2019-08-29 19:35

Coming to this a bit late as I've been away, but fastspin optionally will use the LUT for FCACHE (either give an explict --fcache= size for the amount LUT to use, or else use -O2). As you noticed Ross, it doesn't make nearly as much difference on P2 as FCACHE does on P1, but it does provide a small speedup.

Cluso99 · 2019-08-29 21:00

FWIW unless LUT is needed for some of P2s hardware lookups, LUT is probably best for code and COG for variables and possibly for stack.
This is because, as you know, the assembler instruction set can only perform on 9-bit addresses (eg AND/OR/ADC etc) which addresses COG.

RossH · 2019-08-30 00:27

Yes, I will probably continue to use LUT for some code. Otherwise it would be completely wasted in the vast majority of applications. I had not thought of using it for FCACHE. I didn't even bother implementing FCACHE in native mode - I didn't think that the overhead of performing the load would be worth the small speed increment except in a few unusual cases. However, I will give that some thought - thanks.

Cluso99 · 2019-08-30 06:44

Loading code blocks into LUT from HUB happen at 1 long per clock after ~7 clocks overhead, so FCACHE should give some amazing result too.

ersmith · 2019-08-30 09:18

Cluso99 wrote: »

Loading code blocks into LUT from HUB happen at 1 long per clock after ~7 clocks overhead, so FCACHE should give some amazing result too.

The loop overhead in hubexec is pretty nasty, so for small loops FCACHE is nice. As the loop gets larger the overhead matters less, since the code in the loop runs at pretty much the same speed for hubexec and lutexec.

The other nice thing on P2 is that loading into LUT is so simple, just a:

   setq2 #COUNT-1
   rdlong lutbase, hubbase

to read COUNT longs from HUB to LUT.

Tubular · 2019-08-30 09:42

Its worth keeping in mind we might have 2 cog variants in the future too, where the hub penalty isn't so dominant. We can run these on the fpga now

RossH · 2019-08-30 11:41

Thanks, all. For the moment, I have decided to just reserve half the LUT for application use. That may include a general FCACHE capability, but I am currently thinking of instead just allowing applications to specify that some specific functions and/or variables are LUT resident.

This is a lot simpler than FCACHE, gives you much the same benefit, and has no load overhead.

ersmith · 2019-08-30 12:15

RossH wrote: »

Thanks, all. For the moment, I have decided to just reserve half the LUT for application use. That may include a general FCACHE capability, but I am currently thinking of instead just allowing applications to specify that some specific functions and/or variables are LUT resident.

This is a lot simpler than FCACHE, gives you much the same benefit, and has no load overhead.

I have to disagree -- requiring the user to specify which functions are LUT resident is error prone (users aren't always very good at estimating which parts of the program are most frequently called) and more complicated for the user. If you could figure out some way of automatically putting loops or frequently called functions into LUT it would have a lot more benefit, I think. For example, it'd be nice if floating point functions went into LUT when used, but left it while not being used.

The load overhead is a red herring -- no matter what you do the processor has to fetch the instructions from HUB (hubexec has overhead too!) and the setq2/rdlong load procedure is as fast as possible, with only 1 cycle per long.

RossH · 2019-08-30 12:20

ersmith wrote: »

RossH wrote: »

Thanks, all. For the moment, I have decided to just reserve half the LUT for application use. That may include a general FCACHE capability, but I am currently thinking of instead just allowing applications to specify that some specific functions and/or variables are LUT resident.

This is a lot simpler than FCACHE, gives you much the same benefit, and has no load overhead.

I have to disagree -- requiring the user to specify which functions are LUT resident is error prone (users aren't always very good at estimating which parts of the program are most frequently called) and more complicated for the user. If you could figure out some way of automatically putting loops or frequently called functions into LUT it would have a lot more benefit, I think. For example, it'd be nice if floating point functions went into LUT when used, but left it while not being used.

The load overhead is a red herring -- no matter what you do the processor has to fetch the instructions from HUB (hubexec has overhead too!) and the setq2/rdlong load procedure is as fast as possible, with only 1 cycle per long.

Unless your compiler automatically profiles your application, I think users would be much better at figuring out where the LUT would be most useful than a compiler could ever be.

evanh · 2019-08-30 13:40

On many microcontrollers, particularly the prop, counting instructions for I/O timing can make all the difference. Admittedly this was generally done as pasm on the prop1, but maybe C could handle it more if allowed.

dMajo · 2019-08-30 14:22

RossH wrote: »

ersmith wrote: »

RossH wrote: »

Thanks, all. For the moment, I have decided to just reserve half the LUT for application use. That may include a general FCACHE capability, but I am currently thinking of instead just allowing applications to specify that some specific functions and/or variables are LUT resident.

This is a lot simpler than FCACHE, gives you much the same benefit, and has no load overhead.

I have to disagree -- requiring the user to specify which functions are LUT resident is error prone (users aren't always very good at estimating which parts of the program are most frequently called) and more complicated for the user. If you could figure out some way of automatically putting loops or frequently called functions into LUT it would have a lot more benefit, I think. For example, it'd be nice if floating point functions went into LUT when used, but left it while not being used.

The load overhead is a red herring -- no matter what you do the processor has to fetch the instructions from HUB (hubexec has overhead too!) and the setq2/rdlong load procedure is as fast as possible, with only 1 cycle per long.

Unless your compiler automatically profiles your application, I think users would be much better at figuring out where the LUT would be most useful than a compiler could ever be.

I agree.

There is also the case where the user want a better performance from a specific function, that can be time critical, even if it is not used the most.

TonyB_ · 2019-08-30 14:26

RossH wrote: »

Thanks, all. For the moment, I have decided to just reserve half the LUT for application use. That may include a general FCACHE capability, but I am currently thinking of instead just allowing applications to specify that some specific functions and/or variables are LUT resident.

This is a lot simpler than FCACHE, gives you much the same benefit, and has no load overhead.

In v2 only half of the LUT can be addressed immediately by RDLUT/WRLUT.

ersmith · 2019-08-30 14:59

RossH wrote: »

Unless your compiler automatically profiles your application, I think users would be much better at figuring out where the LUT would be most useful than a compiler could ever be.

There's more than one way to skin a cat. One way to implement FCACHE is to make it dynamic, so if a function that should be FCACHE'd gets called, it checks to see if its main body has already been loaded into LUT; if not, it does the load, and in any case it jumps to the LUT version. So there is a little hubexec stub, but it's only a few instructions.

Alternatively, the check and load could be done at the call site instead of inside the function.

Now, how many functions can fit in the cache at the same time is a bit trickier. The simple solution is to just to put one function / loop in at a time, and when a new one is loaded the old one is evicted. A fancier compiler could generate position independent code to allow multiple functions to be in LUT at the same time, and/or keep track of slots based on the HUB address.

The big advantage of this kind of dynamic cache is that it's automatic, and it adapts to the changing behavior of the program. If it's doing a lot of serial I/O, the serial handler lives in the LUT. If it then switches to floating point calculation, the floating point code lives in the LUT, and so on.

The downside is that you can end up thrashing the cache. On P1 this was almost never an issue -- the cost of the cache load was barely greater than the overhead of LMM execution (and indeed the cache loading could be folded into the LMM loop) so even one trip through a loop would result in a saving. On P2 this isn't as clear; branches are much faster in LUT, but straight line code is only a little bit faster, so the heuristic for which functions / loops benefit from FCACHE is trickier. OTOH the cache load is very cheap on P2, even cheaper than on P1, so it may still not be much of an issue.

jmg · 2019-08-30 19:45

dMajo wrote: »

RossH wrote: »

Unless your compiler automatically profiles your application, I think users would be much better at figuring out where the LUT would be most useful than a compiler could ever be.

I agree.
There is also the case where the user want a better performance from a specific function, that can be time critical, even if it is not used the most.

.. and interrupts complicate things further. A compiler has no way to know how often an interrupt may be called.
Some means to manually direct/place code is good, but it would be used quite sparingly, given how little LUT memory there is.
For 'most' user code the compiler decision would be good enough.

RossH · 2019-08-31 01:45

TonyB_ wrote: »

In v2 only half of the LUT can be addressed immediately by RDLUT/WRLUT.

Really? I hadn't spotted that! Do you have a reference?

cgracey · 2019-08-31 02:00

RossH wrote: »

TonyB_ wrote: »

In v2 only half of the LUT can be addressed immediately by RDLUT/WRLUT.

Really? I hadn't spotted that! Do you have a reference?

On the newer silicon, PTRA/PTRB expressions can be used with RDLUT and WRLUT, just like with RDxxxx and WRxxxx.

msrobots · 2019-08-31 02:39

and for that chip stole a bit in the instruction.

So just the lower half of it immediate.

Mike

RossH · 2019-08-31 03:00

msrobots wrote: »

and for that chip stole a bit in the instruction.

So just the lower half of it immediate.

Mike

I'm clearly missing something here

Are you saying that RDLUT/WRLUT can only immediately access the lower half of the LUT? I don't get that from reading the instruction definition, but if it is true I will change my use of the LUT so that instead of using the lower half for code, I will use the upper half.

High-level languages and the LUT

Comments