High-level languages and the LUT
RossH
Posts: 5,502
in Propeller 2
I am doing some more tidying up on Catalina, and I'm just about to rewrite some of the library functions that currently reside in the LUT. Just before I do, I thought I'd canvass opinions on whether a high-level language implementation should or should not make use of the LUT to hold code.
Pro: it improves the speed of often-used functions (or language primitives) by making them cog-resident.
Con: it pretty much precludes the use of the LUT directly by high-level language programs for any other purpose (the LUT in another cog can still be used indirectly, by plugins etc).
While I think the "Pro" outweighs the "Con", the case is not really clear cut, since the speed increase gained by using the LUT as extra code space is not as dramatic on the P2 as it would have been on the P1 (which would have had to use LMM to execute code stored in Hub RAM).
One possibility I am considering is making it configurable whether the library code resides in the LUT or in Hub RAM. But I am not sure it is worth the effort. Even if I free up the LUT for high-level language applications to use, would there be any instances of anyone actually using it?
Does anyone have an opinion? Or an example where the LUT might actually be more useful if it was made available to high-level language programs?
Comments
Tom
The 'close in' memory is faster and more precious, and the 'further away' memory is larger, but slower to access, and can be less atomic.
So I could certainly see uses for LUT specified as Array/var storage.
Anything that gives users a gain will certainly be used.
I think code can run from LUT at the same speed as COG (does that include jumps?) and thus doing that frees up COG memory for more useful data storage.
Self modifying code I think needs to reside in COG, but that is rare in P2.
One approach could be to jump immediately from COG to LUT for fast code, treat COG as data by default, and let code grow into HUB for slower (e.g. non-interrupt) code. Someone using a lot of HUB code will likely want a lot of data, so reducing the impact of 'fast' code on COG helps there.
Once LUT is full, a means to 'use a bit more' from COG would be useful, but the COG SFRs located at 0x1F0~0x1FF exclude a simple 'grow down' design.
I'm rusty on the details, but I think Chip worked to allow code to cross over boundaries? (LUT -> HUB?)
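To make the layout being discussed concrete, here is a minimal C sketch of how an execution address maps onto the three memories. It assumes the usual P2 arrangement (cog registers at 0x000~0x1EF, SFRs at 0x1F0~0x1FF, LUT execution at 0x200~0x3FF, hubexec from 0x400 upward). The names `region_t` and `classify_pc` are made up for illustration and are not part of any compiler or library.

```c
#include <assert.h>
#include <stdint.h>

/* Execution regions as seen by the program counter (illustrative names). */
typedef enum { REGION_COG, REGION_COG_SFR, REGION_LUT, REGION_HUB } region_t;

/* Classify a 20-bit execution address.
 * Assumed map: 0x000-0x1EF cog registers, 0x1F0-0x1FF special function
 * registers, 0x200-0x3FF LUT, 0x400 and above hub. */
region_t classify_pc(uint32_t pc)
{
    assert(pc <= 0xFFFFF);          /* the PC is 20 bits wide */
    if (pc < 0x1F0) return REGION_COG;
    if (pc < 0x200) return REGION_COG_SFR;
    if (pc < 0x400) return REGION_LUT;
    return REGION_HUB;
}
```

The SFR block sitting directly below 0x200 is what gets in the way of a simple 'grow down' from LUT into COG: the two code areas are not contiguous.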
Yes, this is true, and I might implement that. But that isn't specifically to do with the LUT - it is just another way of treating the LUT as an extension to cog RAM.
I am looking more for any instances where the LUT may be used - from a high-level language - as it was originally designed to be used.
Yes, you are right. But can you think of any instances where this would be useful from a high-level language? I always expected this feature to mostly be used for PASM drivers - i.e. drivers that required two cogs to co-operate (such as a video driver) plus some high-bandwidth communications between them. You don't typically write those in a high-level language.
Perhaps where the executive code was written in a high-level language in one cog, and the low-level code was written in PASM to run in the other cog so that they could share the LUT? But I can't think of a good example where this would actually be the preferred solution.
In PropWare, I don't have any pure-assembly objects. Everything is written in C++ with inline-assembly where speed is critical. I'm not saying that is always the right choice - my objects are far from the fastest or smallest - but it is one way to do it. So for the way I write code I'm sure I'd eventually find a need for LUT accessible from a high-level language.
With the P1 I write most programs in C, sometimes with a PASM driver in another cog, but often with a C function running in another cog monitoring sensors. When the sensor cog(s) get a value they may do some intermediate calculations and then pass the data through a global to the main (or other) cog for it to do something. With the P1 the data has to go through the hub, and sometimes if it is multiple related data (e.g. x & y values) it has to go through the hub 2 times.
I have one sensor (CMUCAM5 Pixy color tracking camera) that sends 7 pieces of data for each object that it detects. It sends the data for each object 50 times a second. The cycle of data is called a data frame. According to the Pixy spec it can send data on 135 objects each data frame. In reality the speed of the interface determines how many objects (each one with 7 pieces of data) can be received each frame. For example, using the SPI interface in C with the SimpleTools library shiftin/shiftout I was able to capture 2 objects per frame (not very useful when trying to track multiple colors); using Spin with the SPIAsm objects from the PropTool Library, I got up to 6 objects. Using C with the SPIAsm PASM driver, I was able to get up to 14 objects, but when trying to use that data in another cog, the number of objects dropped. Possibly transferring between cogs using LUT might have given more. (I realize that going to the P2 itself will increase the rate of collecting data, but cog-LUT-cog transfer should help.)
Tom
But here's one fairly simple question that may help me shortcut the process: Is it perhaps worthwhile trying to reserve just part of the LUT for high-level applications to use, or is it basically an "all or nothing" deal?
Or, to put it another way - would an application usually need all of the LUT? I could currently reserve perhaps half of it without too much effort.
This is because, as you know, most assembler instructions (e.g. AND/OR/ADC) can only operate on 9-bit addresses, which address COG registers.
The loop overhead in hubexec is pretty nasty, so for small loops FCACHE is nice. As the loop gets larger the overhead matters less, since the code in the loop runs at pretty much the same speed for hubexec and lutexec.
The other nice thing on P2 is that loading into LUT is so simple: just a SETQ2 followed by an RDLONG, to read COUNT longs from HUB to LUT.
This is a lot simpler than FCACHE, gives you much the same benefit, and has no load overhead.
I have to disagree -- requiring the user to specify which functions are LUT resident is error prone (users aren't always very good at estimating which parts of the program are most frequently called) and more complicated for the user. If you could figure out some way of automatically putting loops or frequently called functions into LUT it would have a lot more benefit, I think. For example, it'd be nice if floating point functions went into LUT when used, but left it while not being used.
The load overhead is a red herring -- no matter what you do the processor has to fetch the instructions from HUB (hubexec has overhead too!) and the setq2/rdlong load procedure is as fast as possible, with only 1 cycle per long.
Unless your compiler automatically profiles your application, I think users would be much better at figuring out where the LUT would be most useful than a compiler could ever be.
I agree.
There is also the case where the user wants better performance from a specific function that is time critical, even if it is not the one used the most.
In v2 only half of the LUT can be addressed immediately by RDLUT/WRLUT.
There's more than one way to skin a cat. One way to implement FCACHE is to make it dynamic, so if a function that should be FCACHE'd gets called, it checks to see if its main body has already been loaded into LUT; if not, it does the load, and in any case it jumps to the LUT version. So there is a little hubexec stub, but it's only a few instructions.
Alternatively, the check and load could be done at the call site instead of inside the function.
Now, how many functions can fit in the cache at the same time is a bit trickier. The simple solution is to just put one function / loop in at a time, and when a new one is loaded the old one is evicted. A fancier compiler could generate position independent code to allow multiple functions to be in LUT at the same time, and/or keep track of slots based on the HUB address.
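The simple one-function-at-a-time scheme can be modelled in a few lines of plain C. This is only a host-side sketch of the bookkeeping, not anything a real compiler emits: `fcache_t` and `fcache_enter` are hypothetical names, the `lut` array stands in for the cog's LUT RAM, and the `memcpy` stands in for the block read from HUB.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LUT_LONGS 512              /* size of the LUT in longs */

typedef struct {
    uint32_t lut[LUT_LONGS];       /* stand-in for the cog's LUT RAM */
    const uint32_t *resident;      /* hub address of the loaded body, NULL if none */
    unsigned loads;                /* how many hub-to-LUT copies were performed */
} fcache_t;

/* Models the small hubexec stub: check whether `body` is already in LUT,
 * load it if not (evicting whatever was there), and report whether a
 * load happened. A real stub would then jump to the LUT copy. */
int fcache_enter(fcache_t *c, const uint32_t *body, unsigned count)
{
    assert(count <= LUT_LONGS);    /* the body must fit in the LUT */
    if (c->resident == body)
        return 0;                  /* hit: nothing to do */
    memcpy(c->lut, body, count * sizeof body[0]);
    c->resident = body;            /* tag the cache with the hub address */
    c->loads++;
    return 1;                      /* miss: body was loaded */
}
```

Calling the same function repeatedly costs one load; alternating between two functions costs a load on every call, which is the thrashing case.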
The big advantage of this kind of dynamic cache is that it's automatic, and it adapts to the changing behavior of the program. If it's doing a lot of serial I/O, the serial handler lives in the LUT. If it then switches to floating point calculation, the floating point code lives in the LUT, and so on.
The downside is that you can end up thrashing the cache. On P1 this was almost never an issue -- the cost of the cache load was barely greater than the overhead of LMM execution (and indeed the cache loading could be folded into the LMM loop) so even one trip through a loop would result in a saving. On P2 this isn't as clear; branches are much faster in LUT, but straight line code is only a little bit faster, so the heuristic for which functions / loops benefit from FCACHE is trickier. OTOH the cache load is very cheap on P2, even cheaper than on P1, so it may still not be much of an issue.
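One way to think about the heuristic is as a break-even count: how many trips through a loop are needed before the load pays for itself. The sketch below is just that arithmetic in C. Every cycle figure is a caller-supplied assumption rather than a measured P2 number, the struct and function names are invented for the example, and the load is modelled as a fixed setup cost plus one cycle per long.

```c
#include <assert.h>

/* All cycle counts here are assumptions supplied by the caller. */
typedef struct {
    unsigned body_longs;           /* size of the loop body in longs */
    unsigned hub_cycles_per_iter;  /* cost of one iteration under hubexec */
    unsigned lut_cycles_per_iter;  /* cost of one iteration from LUT */
    unsigned setup_cycles;         /* fixed cost of the stub and starting the load */
} loop_cost_t;

/* Cost of copying the body into LUT: fixed setup plus one cycle per long. */
unsigned load_cycles(const loop_cost_t *l)
{
    assert(l->body_longs <= 512);  /* the body must fit in the LUT */
    return l->setup_cycles + l->body_longs;
}

/* Smallest iteration count at which loading into LUT is strictly cheaper
 * than running from hub; returns 0 if LUT execution gives no saving. */
unsigned break_even_iterations(const loop_cost_t *l)
{
    if (l->lut_cycles_per_iter >= l->hub_cycles_per_iter)
        return 0;
    unsigned saving = l->hub_cycles_per_iter - l->lut_cycles_per_iter;
    return load_cycles(l) / saving + 1;
}
```

With a small per-iteration saving (straight line code) the break-even count climbs quickly, while a branch-heavy loop with a large saving breaks even after very few iterations.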
.. and interrupts complicate things further. A compiler has no way to know how often an interrupt may be called.
Some means to manually direct/place code is good, but it would be used quite sparingly, given how little LUT memory there is.
For 'most' user code the compiler decision would be good enough.
Really? I hadn't spotted that! Do you have a reference?
On the newer silicon, PTRA/PTRB expressions can be used with RDLUT and WRLUT, just like with RDxxxx and WRxxxx.
So just the lower half of it can be addressed with an immediate.
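As I understand the claim, it comes down to how the 9-bit immediate field is interpreted. The C sketch below assumes RDLUT/WRLUT follow the same convention as the hub RDxxxx/WRxxxx instructions, where bit 8 of an immediate S field selects a PTRA/PTRB expression and leaves only 8 bits for a literal address. That convention is an assumption here, not something confirmed in this thread, and the names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { LUT_IMM_ADDRESS, LUT_PTR_EXPRESSION } lut_operand_t;

/* Interpret a 9-bit immediate S field of RDLUT/WRLUT.
 * ASSUMPTION: bit 8 clear means a literal LUT address (0x000-0x0FF),
 * bit 8 set means a PTRA/PTRB expression, as with hub RDxxxx/WRxxxx. */
lut_operand_t decode_lut_immediate(uint16_t s_field, uint16_t *address)
{
    assert(s_field <= 0x1FF);      /* the S field is 9 bits wide */
    if (s_field & 0x100)
        return LUT_PTR_EXPRESSION; /* not a literal address */
    *address = s_field;            /* literal address in the lower half */
    return LUT_IMM_ADDRESS;
}
```

Under that assumption the upper half of the LUT is still reachable, just not with a literal: the address has to come from a register or a pointer expression instead.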
Mike
I'm clearly missing something here
Are you saying that RDLUT/WRLUT can only immediately access the lower half of the LUT? I don't get that from reading the instruction definition, but if it is true I will change my use of the LUT so that instead of using the lower half for code, I will use the upper half.