
Catalina - a self-hosted PASM assembler and C compiler for the Propeller 2

  • pik33 Posts: 2,347
    edited 2023-05-20 14:26

    And how fast is your SD driver itself? Flexprop's driver, which is fast now, was much slower in the past.

  • @RossH do you know how many total MB are being read from your card for this compilation of Hello World? Are we talking just a few MB, or is it hundreds of MB? Maybe it's not so much raw I/O throughput as lots of time spent switching files, with FAT lookups and other slow card-access latencies. Not sure there.

    Pretty sure there should be something faster for reading SD these days - I think evanh might have done some improvements to the SPI transfers at some point, which may have made their way into flexspin too.

  • Rayman Posts: 13,797

    Bet a pasm2 compiler would be much faster right? Or, would allowing preprocessor things slow it down just as much?

  • RossH Posts: 5,336

    Another update: It is not the SD card, it is the preprocessor that is causing the problem!

    The preprocessor MUST be used to process C files - it's an important part of the language. But I have simplified things so that preprocessing the intermediate assembly file (which is typically about 100 KB) is now essentially just a simple file copy (i.e. preprocessing a file with nothing more than a single #include statement) and it still takes SIX MINUTES just to do that step! Whereas doing essentially the same copy sector by sector takes less than TEN SECONDS!

    I have never really looked at LCC's preprocessor - it is written in standard C and has never needed any changes - it just works. But clearly, it is a bit of a dog. If I eliminate the need for using it in the intermediate compilation steps - essentially replacing it with a simple line by line copy - it will reduce the compilation time by almost 6 minutes.
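
    Something along these lines should do it - a rough sketch only, in standard C (the file names and details here are illustrative, not the final Catalina code): read the intermediate file line by line, expand the single #include, and copy everything else through untouched.

    #include <stdio.h>

    /* Append the contents of one included file to the output (placeholder helper). */
    static int copy_file(const char *name, FILE *out)
    {
        char line[512];
        FILE *in = fopen(name, "r");
        if (in == NULL) return -1;
        while (fgets(line, sizeof line, in) != NULL)
            fputs(line, out);
        fclose(in);
        return 0;
    }

    int main(void)
    {
        char line[512], name[256];
        FILE *in  = fopen("program.s", "r");   /* intermediate assembly (name assumed) */
        FILE *out = fopen("program.i", "w");   /* "preprocessed" output (name assumed) */
        if (in == NULL || out == NULL) return 1;
        while (fgets(line, sizeof line, in) != NULL) {
            /* expand the one #include; copy every other line verbatim */
            if (sscanf(line, " #include \"%255[^\"]\"", name) == 1)
                copy_file(name, out);
            else
                fputs(line, out);
        }
        fclose(in);
        fclose(out);
        return 0;
    }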

    So I can reduce the current 15-minute compilation time down to about 10 minutes ... with 8 of those minutes being the final p2asm assembly.

    I know the issue with p2asm - it is to do with how it manages a symbol table that may contain up to 40,000 symbols (this is the size of the symbol table required to compile Catalina itself). Dave Hein (who sadly is no longer with us) did a great job writing it - it is written in standard C and is very easy to maintain and modify. But apart from maintaining it as a tribute to Dave, I have no special attachment to it, so if anyone knows of another pasm assembler that can cope with assembling files of this size, I'd happily use it - at least on the P2.

    Ross.

  • evanh Posts: 15,126

    Good to hear about the speed of file read/write. Cluso's bit-bashed code isn't bad for speed (from memory, something like sysclock/20) so shouldn't have caused a bottleneck.
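
    (For scale, and assuming "sysclock/20" means one bit transferred per 20 system clocks: at a ~300 MHz sysclock that is about 15 Mbit/s, or roughly 1.9 MB/s of raw SPI transfer.)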

  • RossH Posts: 5,336
    edited 2023-06-03 08:31

    Aaarrrggghhh!

    I just spent a couple of days trying to chase down what I thought was a serious bug in the self-hosted version of Catalina, which seemed to corrupt the SD card randomly. It finally occurred to me to try a different SD card, and I have not had a failure since. It seems some SD cards may not be very reliable when used this intensively! :(

    This stalled any progress on improving the compile time, but in the process I have been doing larger compiles, which is a more realistic test. They all now seem to work ok (albeit slowly). Here are some actual compile times:

    hello_world.c (5 lines) - 10 minutes
    othello.c (470 lines) - 13 minutes
    startrek.c (2200 lines) - 56 minutes

    It is clear that the compile time is not linear with the C program size. This is partly because a large chunk of the time (almost all of it in the case of hello_world.c) is not used to compile the C code at all - it is used to build the run-time environment, which is written in PASM and is pretty much the same for each of the above programs. Another factor is how much of the C library the program uses - a line of C that writes to a disk file will take longer to compile than a line that writes to the terminal, because it pulls in much more of the C library code.
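
    For example (illustrative only), both of the writes below are single lines of C, but the second also pulls the file-system portion of the library into the build, so the PASM that finally has to be assembled is much larger:

    #include <stdio.h>

    int main(void)
    {
        /* terminal output: links in a relatively small part of the C library */
        printf("hello\n");

        /* writing to a disk file: also drags in the file-system support, so a
           program containing these lines takes noticeably longer to compile
           and assemble */
        FILE *f = fopen("log.txt", "w");
        if (f != NULL) {
            fprintf(f, "hello\n");
            fclose(f);
        }
        return 0;
    }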

    My idea to pre-compile the run-time environment to avoid having to do this every time sounded good, but it turns out to be more complex than I expected, and it is also now clear that while this would improve the compile times of very small programs significantly (e.g. by 10 minutes or so), it will still take many minutes or even hours to compile programs of any significant size. So I will put that idea on hold for the moment, and just concentrate on tidying up what I have working already - after all, this is primarily a demo of what Catalina can do rather than a real-world practical application.

    It is also clear that there is no point in trying to support some of the more advanced (but really time consuming) features of Catalina (such as XMM support) in the self-hosted version. Compiling programs large or complex enough to need such features would be impractical on the Propeller 2, and will have to be compiled on a PC.

    So, included will be:
    1. Support for Propeller 2 LMM (TINY), CMM (COMPACT) and NMM (NATIVE) programs.
    2. Support for compiling programs for all Propeller 2 platforms supported by Catalina.

    Excluded will be:
    1. XMM support (since it would be too slow to compile programs large enough to require XMM RAM)
    2. Debugger support (since you need to use a PC to host the debugger anyway)
    3. Parallelizer support (since this requires awk, which I do not plan to port to the Propeller)
    4. Optimizer support (since this also requires awk)
    5. Propeller 1 support (since this requires a Spin compiler which I do not plan to port to the Propeller)

    There will be a simple catalina program added to Catalyst to execute all the necessary compilation steps using a single command, but it will support only a limited subset of the command-line options of the PC version. If you need any of the unsupported options, you will have to use the PC version. Also note that self-hosted development will require a Propeller 2 fitted with at least 16MB of PSRAM or HyperRAM.

    Ross.

    EDIT: Update timing.
    EDIT: Revised to remove Propeller 1 support.

  • pik33 Posts: 2,347

    I killed several SD cards while compiling programs on a Raspberry Pi 1. Massive writing of temporary files and swapping of memory to SD. They don't like that.

  • Still crazy that hello world takes 15 minutes to compile. Gut feeling tells me this thing should be able to perform far better on a ~300 MHz P2. I guess a lot of symbol-table data access is going on during final assembly, which just isn't scaling well, leading to inefficiencies and major slowdowns.

    In comparison, how long does it take on a real PC to compile the same example files? Are we talking just a few seconds, or does it still take a while to compile there too?

  • RossH Posts: 5,336
    edited 2023-05-28 12:49

    @rogloh said:
    Still crazy that hello world takes 15 minutes to compile. Gut feeling tells me this thing should be able to perform far better on a ~300 MHz P2. I guess a lot of symbol-table data access is going on during final assembly, which just isn't scaling well, leading to inefficiencies and major slowdowns.

    In comparison, how long does it take on a real PC to compile the same example files? Are we talking just a few seconds, or does it still take a while to compile there too?

    All these compilations take just a few seconds on a PC. But this is easy to explain - 15 minutes to compile 5 lines of C, but only 20 minutes to compile ~500 lines of C means the bottleneck is not in the compilation of the C code, it is in the compilation of the PASM run-time - i.e. the assembler. So Catalina on the Propeller 2 effectively has a fixed overhead of about 15 minutes for any program compilation.

    This run-time compilation on a PC takes about 10Mb of RAM. But while this process takes only seconds on a PC, this is largely because on a PC you have fast random access to the whole 10Mb. But on the P2 access to XMM RAM is comparatively slow when compared to accessing Hub RAM. Your drivers may be fast for sequential access, but they are not fast for random access when compared to Hub RAM (note: this is not a criticism of your drivers - it is just the nature of the beast!). To speed this up, Catalina uses a Hub RAM cache.

    However, the assembler currently uses a hash table, which relies for speed on fast random access to the whole of RAM. If you use a cache (as Catalina does) then this is not a good design, and can result in cache thrashing.
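
    To illustrate the mismatch - a simplified sketch only, not p2asm's or Catalina's actual code - a hash lookup deliberately scatters accesses across a multi-megabyte table, so when that table sits behind a block cache in Hub RAM, almost every probe lands on a block that is not currently resident:

    #include <stdint.h>
    #include <string.h>

    /* Illustrative sizes only: 64K buckets of ~72 bytes each is several
       megabytes, which is why the table has to live in PSRAM, not Hub RAM. */
    #define TABLE_SIZE 65536u

    typedef struct {
        char     name[64];
        uint32_t value;
        uint32_t used;
    } symbol_t;

    static symbol_t table[TABLE_SIZE];      /* conceptually in cached XMM/PSRAM */

    static uint32_t hash(const char *s)     /* FNV-1a */
    {
        uint32_t h = 2166136261u;
        while (*s) { h ^= (uint8_t)*s++; h *= 16777619u; }
        return h;
    }

    /* Open-addressing lookup: the hash spreads consecutive symbols across the
       whole table, so each first probe typically touches a cache block that is
       not resident and forces a PSRAM block fill - the opposite of the
       locality a cache depends on. */
    symbol_t *lookup(const char *name)
    {
        uint32_t i = hash(name) % TABLE_SIZE;
        while (table[i].used) {
            if (strcmp(table[i].name, name) == 0)
                return &table[i];
            i = (i + 1) % TABLE_SIZE;       /* linear probing */
        }
        return NULL;                        /* not found */
    }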

    Some redesign of the assembler may help.

    Ross.

  • RossH Posts: 5,336

    @pik33 said:
    I killed several SD cards while compiling programs on a Raspberry Pi 1. Massive writing of temporary files and swapping of memory to SD. They don't like that.

    No, they do not. I wonder if anyone has yet designed a disk interface for the Propeller 2? :)

  • @RossH said:
    This run-time compilation on a PC takes about 10Mb of RAM. But while this process takes only seconds on a PC, this is largely because on a PC you have fast random access to the whole 10Mb. But on the P2 access to XMM RAM is comparatively slow when compared to accessing Hub RAM. Your drivers may be fast for sequential access, but they are not fast for random access when compared to Hub RAM (note: this is not a criticism of your drivers - it is just the nature of the beast!). To speed this up, Catalina uses a Hub RAM cache.

    Actually, you should criticize roger's drivers here - they do a lot of stuff you don't need when you're running a block cache on top (and another lot of stuff that's not needed on any extant board - pins/latency/etc as a per-bank setting).

    You can get very low latency if you hardcode the size of your blocks and always access them aligned to that size. That bypasses all branches that are otherwise needed for checking row crossings etc.

    I haven't really benchmarked the limit, but my custom access code in NeoYume can certainly do more than 1.5 million operations per second (~ 96 ops/line * 15.7k hsync/s) with 8 byte blocks (16px * 4bpp). With the 16 bit RAM on the Edge it's probably(tm) closer to 3 million per second? Actually, I now want to calculate how fast it really is, lol.

  • Wuerfel_21 Posts: 4,369
    edited 2023-05-28 16:59

    Okay, quickly made a dummied-out version of the NeoYume sprite slot loop, which is about as close to optimal as a real-world use case gets (actually, one could probably interleave some setup for the next iteration into the wait times....) It seems that for a 16-bit bus and 8-byte blocks it comes out at ~92 cycles per loop, which would indeed imply more than 3.5 million ops per second are possible. I think roger said once that his driver needs roughly a microsecond to do anything, which would be ~350 cycles? Anyways, if you step up to 64-byte blocks the slot loop still only needs ~148 cycles. So there's an overhead of 84 cycles and then of course there's 1 byte/cycle of transfer time.

    CON _CLKFREQ = 10_000_000
    
    
    BLOCK_SIZE = 8
    BUS_WIDTH = 16
    LOOPS = 1000
    
    ' stuff don't touch
    PSRAM_BASE = 0
    MA_CLKDIV = 2
    PSRAM_WAIT  = 5
    PSRAM_DELAY = 14
    MA_CHAR_CYCLES = (BLOCK_SIZE * 8) / BUS_WIDTH
    
    
    
    DAT
                  org
                  setxfrq ma_nco_slow
                  mov ptrb,#0
                  mov ma_slotleft,##LOOPS
                  getct start_time
    .slotlp
                  rdlong ma_mtmp1,ptrb[2]
                  nop ' sentinel check
    
                  nop  ' NY-specific address math
                  nop  ' ^^
    
                  setbyte ma_mtmp1,#$EB,#3
                  splitb  ma_mtmp1
                  rev     ma_mtmp1
                  movbyts ma_mtmp1, #%%0123
                  mergeb  ma_mtmp1
    
                  rep @.irqshield,#1
                  nop ' drvl
                  nop ' drvh
                  xinit ma_psram_addr_cmd,ma_mtmp1
                  nop ' wypin
                  setq ma_nco_fast
                  xcont #PSRAM_WAIT*MA_CLKDIV+PSRAM_DELAY,#0
                  wrfast ma_bit31,ptrb
                  waitxmt
                  nop ' fltl
                  setq ma_nco_slow
                  xcont ma_psram_readspr_cmd,#0
                  waitxfi
    .irqshield
                  add ptrb,#4*4
                  djnz ma_slotleft,#.slotlp
    
                  getct end_time
                  sub end_time,start_time
                  debug(udec(end_time))
                  jmp #$
    
    
    ma_psram_addr_cmd long (PSRAM_BASE<<17)|X_PINS_ON | X_IMM_8X4_LUT + 8
    ma_psram_readspr_cmd long (PSRAM_BASE<<17)|X_WRITE_ON| X_16P_2DAC8_WFWORD + MA_CHAR_CYCLES
    
    
    ma_bit31      'alias
    ma_nco_fast   long $8000_0000
    ma_nco_slow   long $4000_0000
    
    start_time    res 1
    end_time      res 1
    
    ma_mtmp1      res 1
    ma_slotleft   res 1
    
    

    EDIT:

    actually, one could probably interleave some setup for the next iteration into the wait times....

    That actually seems possible. For the 8-byte case, there are at least 24 free cycles between XCONT and WAITXFI. I was able to make a version that takes 74 cycles per loop (instead of 92), but that requires knowing what data you want next ahead of time. Also, I haven't actually tested that with the real P2EDGE setup.

  • RossH Posts: 5,336

    @Wuerfel_21 said:
    Actually, you should criticize roger's drivers here - they do a lot of stuff you don't need when you're running a block cache on top (and another lot of stuff that's not needed on any extant board - pins/latency/etc as a per-bank setting).

    I'm sure there will eventually be faster drivers, and I will use them when they appear - but Roger's drivers are not the main issue here. Part of the problem is that I hadn't considered how badly a hash table, added to the assembler (Dave Hein's original didn't use one) to speed up performance on the PC, would perform in conjunction with cached XMM access on the P2 (but to be fair to myself, I had no idea at the time that I would be supporting either XMM RAM on the P2 or self-hosted development!).

    And even if we achieve a few million operations per second accessing XMM RAM, this is only a small percentage of the speed of accessing Hub RAM on the P2, which is itself only a small percentage of the speed of accessing RAM on a PC, so a slowdown of the magnitude seen here is not all that unexpected given such a naive and memory-hungry design. We have all been spoiled a bit by how we can get away with such profligacy on a modern PC :)

    Another part of the problem is the way C itself must be compiled. For instance, I talk blithely about a "5 line C program", but really there is no such thing. Just applying the C pre-processor (which is the necessary first step in the C compilation process) turns those 5 lines into about 100 lines of C code. The C preprocessor hides a multitude of sins! :)
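
    For example, a typical hello_world.c (the exact demo file may differ) is only five lines:

    #include <stdio.h>
    int main(void)
    {
        printf("Hello, world!\n");
    }

    Running just the preprocessor over it replaces the #include with the contents of stdio.h (plus anything stdio.h itself includes), which is where the hundred-odd lines come from before the compiler proper even starts.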

    Ross.

  • jmg Posts: 15,140

    I did find this for ISSI, picking only their 3.3 V parts. These are steadily getting larger, so they could be an alternative to flash?

    ISSI Serial Memory Roadmap
    
    Serial RAM
    Density  Part Numbers                       Bus     Voltage   Speed    Temp Range     Package  Status     Type
    32M      IS66WVS4M8BLL, IS67WVS4M8BLL       x1, x4  2.7~3.6V  104 MHz  -40 to 105°C   SOIC(8)  Prod       Serial RAM
             IS66WVQ8M4DBLL, IS67WVQ8M4DBLL     x4      2.7~3.6V  166 MHz  -40 to 125°C   BGA(24)  Prod       Serial QUADRAM
    64M      IS66WVS8M8FBLL, IS67WVS8M8FBLL     x1, x4  2.7~3.6V  104 MHz  -40 to 105°C   SOIC(8)  S=H2/2023  Serial RAM
             IS66WVQ16M4FBLL, IS67WVQ16M4FBLL   x4      2.7~3.6V  200 MHz  -40 to 125°C   BGA(24)  S=H2/2023  Serial QUADRAM
    128M     IS66WVS16M8FBLL, IS67WVS16M8FBLL   x1, x4  2.7~3.6V  104 MHz  -40 to 105°C   SOIC(8)  S=H2/2023  Serial RAM
    
    OctalRAM
    Density  Org    Part Numbers                       Voltage   Speed    Package  Status     Model
    256M     32Mx8  IS66WVO32M8DBLL, IS67WVO32M8DBLL   2.7-3.6V  166 MHz  BGA(24)  Prod       IBIS
    512M     64Mx8  IS66WVO64M8DBLL, IS67WVO64M8DBLL   2.7-3.6V  200 MHz  BGA(24)  S=Q4/2022  IBIS
    
    HyperRAM
    Density  Org    Part Numbers                       Voltage   Speed    Package  Status     Model
    256M     32Mx8  IS66WVH32M8DBLL, IS67WVH32M8DBLL   2.7-3.6V  166 MHz  BGA(24)  Prod       IBIS
    512M     64Mx8  IS66WVH64M8DBLL, IS67WVH64M8DBLL   2.7-3.6V  200 MHz  BGA(24)  S=Q4/2022  IBIS
    
    
  • evanh Posts: 15,126

    Huh, the octal parts must use a larger die now. There is a lot more room in the BGA package, assuming the pin bonds are direct die contacts and therefore there are no peripheral legs or bond wires.

    On that note, I've not seen any articles on how BGA, or PGA for that matter, packaging is achieved. It started with "flip-chip" I presume but that hasn't been the norm for a while, afaik. I'm just assuming there is a magic way to get pins routed through the silicon die to the top side these days.

  • @Wuerfel_21 said:
    Actually, you should criticize roger's drivers here - they do a lot of stuff you don't need when you're running a block cache on top (and another lot of stuff that's not needed on any extant board - pins/latency/etc as a per-bank setting).

    You can get very low latency if you hardcode the size of your blocks and always access them aligned to that size. That bypasses all branches that are otherwise needed for checking row crossings etc.

    I haven't really benchmarked the limit, but my custom access code in NeoYume can certainly do more than 1.5 million operations per second (~ 96 ops/line * 15.7k hsync/s) with 8 byte blocks (16px * 4bpp). With the 16 bit RAM on the Edge it's probably(tm) closer to 3 million per second? Actually, I now want to calculate how fast it really is, lol.

    I think roger said once that his driver needs roughly a microsecond to do anything, which would be ~350 cycles?

    Yeah, the one microsecond was an estimate from back in the HyperRAM days. It used to take around 250 cycles to do the mailbox polling for all the COGs, the memory request decoding and bank setup, and all the other overheads in the loop. It's probably a little out of date now for the PSRAM, which has other complications including its read-modify-write handling and row crossings.

    I'd estimate that if you optimize for fixed block sizes, write only long-sized elements, and bury some raw transfer code in your COG, you'll probably gain somewhere between a 2x and 2.6x boost. In the real world these gains will be smaller for an actual application using external memory through a cache, because many of the accesses will be cache hits that don't require a PSRAM access at all. If you need multi-cog sharing of memory or video access, you will likely want something much closer to what I already developed, and will have to poll/decode a number of mailboxes, which slows things down again. But if you have an emulator coupled directly to the RAM (like NeoYume), then as you found yourself you can achieve some gains versus using my drivers.

  • RossH Posts: 5,336

    I've decided not to include Propeller 1 support in the self-hosted version of Catalina. I already have too much to do, and I doubt such a feature would find any use at all. So the version of Catalina on the Propeller 2 will only support compiling programs for the Propeller 2.

    I've updated my earlier post accordingly.

    Ross.

  • @RossH said:

    @rogloh said:
    Still crazy that hello world takes 15 minutes to compile. Gut feeling tells me this thing should be able to perform far better on a ~300 MHz P2. I guess a lot of symbol-table data access is going on during final assembly, which just isn't scaling well, leading to inefficiencies and major slowdowns.

    In comparison, how long does it take on a real PC to compile the same example files? Are we talking just a few seconds, or does it still take a while to compile there too?

    All these compilations take just a few seconds on a PC. But this is easy to explain - 15 minutes to compile 5 lines of C, but only 20 minutes to compile ~500 lines of C means the bottleneck is not in the compilation of the C code, it is in the compilation of the PASM run-time - i.e. the assembler. So Catalina on the Propeller 2 effectively has a fixed overhead of about 15 minutes for any program compilation.

    This run-time compilation on a PC takes about 10Mb of RAM. But while this process takes only seconds on a PC, this is largely because on a PC you have fast random access to the whole 10Mb. But on the P2 access to XMM RAM is comparatively slow when compared to accessing Hub RAM. Your drivers may be fast for sequential access, but they are not fast for random access when compared to Hub RAM (note: this is not a criticism of your drivers - it is just the nature of the beast!). To speed this up, Catalina uses a Hub RAM cache.

    However, the assembler currently uses a hash table, which relies for speed on fast random access to the whole of RAM. If you use a cache (as Catalina does) then this is not a good design, and can result in cache thrashing.

    Some redesign of the assembler may help.

    Ross.

    Just being curious, I tried to follow this discussion. Do I understand this right: instead of linking pre-assembled object code, Catalina does the linking at the source level and needs to recompile and/or reassemble the libraries every time?
    Christof

  • evanh Posts: 15,126
    edited 2023-05-30 15:00

    Libraries will be pre-compiled. I read it as:
    - Step 1: Compile your C source code into assembly source code.
    - Step 2: The Pasm assembler needs to be compiled, from C I presume, before it can do the assembling of your compiled C code. This takes 15 minutes.
    - Step 3: Then the assembling process itself also has a bottleneck with its hash table causing a lot of cache thrashing.

    EDIT: Hmm, that doesn't make sense either. Assembler can't be assembled that way. Back to Ross ...

  • Rayman Posts: 13,797

    I'm thinking compiling PASM2 on chip is more interesting and practical. Although being able to compile C on chip is really cool, even if it takes a long time....

    I guess in principle the Spin2Cpp could also be used to compile Spin2 code in this way? Or, not?

  • RossH Posts: 5,336

    @"Christof Eb." said:

    Just being curious, I tried to follow this discussion. Do I understand this right: instead of linking pre-assembled object code, Catalina does the linking at the source level and needs to recompile and/or reassemble the libraries every time?
    Christof

    Catalina compiles C to SPIN/PASM. The final PASM assembly is done after linking in all the library and run-time code.

    You have to remember that there were no tools available for the Propeller when Catalina was first developed other than a Spin compiler. There wasn't a language-independent binary object format, and there wasn't (still isn't, AFAIK) a stand-alone linker - there wasn't even a stand-alone assembler - originally everything had to be done in Spin, because that was the only tool we had. This is still how Catalina works on the Propeller 1, but on the Propeller 2 we at least had a stand-alone assembler, developed fairly early by Dave Hein.

    Ross.

  • RossH Posts: 5,336

    @evanh said:
    Libraries will be pre-compiled. I read it as:
    - Step 1: Compile your C source code into assembly source code.
    - Step 2: The Pasm assembler needs to be compiled, from C I presume, before it can do the assembling of your compiled C code. This takes 15 minutes.
    - Step 3: Then the assembling process itself also has a bottleneck with its hash table causing a lot of cache thrashing.

    EDIT: Hmm, that doesn't make sense either. Assembler can't be assembled that way. Back to Ross ...

    See my previous post. Everything is compiled to PASM, then combined, then that PASM is assembled. So the assembler is the main bottleneck. It was written assuming it had fast access to as many megabytes of RAM as it needed, so it is quite a simple implementation - and on a PC that's fine. But not on the P2. I use a slightly cut-down version of the assembler on the P2, but even so the program size ends up at around 4 MB of code and data, and so it must execute entirely out of PSRAM.

    Ross.

  • RossH Posts: 5,336

    @Rayman said:
    I'm thinking compiling PASM2 on chip is more interesting and practical. Although being able to compile C on chip is really cool, even if it takes a long time....

    I guess in principle the Spin2Cpp could also be used to compile Spin2 code in this way? Or, not?

    If it is written in standard C, it can be run on a P2 using Catalina. So if there is a Spin compiler written in standard C, you can use it to compile Spin on a P2.

    My language of choice these days is Lua, and the Lua interpreter and compiler have been running on the P2 for years now. In fact, on the P2 those parts of Catalina outside the basic compiler and assembler (which are still written in C) are now written in Lua. I may eventually back-port this Lua version to the PC, since it is much cleaner and simpler than the C version.

    Ross.

  • @RossH said:
    See my previous post. Everything is compiled to PASM, then combined, then that PASM is assembled. So the assembler is the main bottleneck. It was written assuming it had fast access to as many megabytes of RAM as it needed, so it is quite a simple implementation - and on a PC that's fine. But not on the P2. I use a slightly cut-down version of the assembler on the P2, but even so the program size ends up at around 4 MB of code and data, and so it must execute entirely out of PSRAM.

    Ross.

    Executing I-cached code from PSRAM should not be a major bottleneck, at least based on what I saw running Micropython from PSRAM, so long as the I-cache is made sufficiently large. But it'd be ideal if all the data memory needed for Dave Hein's assembler tool could fit into whatever's left of the 512 KB of Hub RAM. That could make a huge performance difference. I'm not sure what size of symbol table that would allow if it can be migrated away from its current, potentially sparse, hash table implementation, or whether it would seriously limit the target application size. If it really needs a large amount of data-segment memory, a write-back D-cache may be of use, although it will only really achieve decent benefits if the data has sufficient locality and doesn't thrash the cache.

  • RossH Posts: 5,336

    @rogloh said:

    Executing I-cached code from PSRAM should not be a major bottleneck, at least based on what I saw running Micropython from PSRAM, so long as the I-cache is made sufficiently large. But it'd be ideal if all the data memory needed for Dave Hein's assembler tool could fit into whatever's left of the 512 KB of Hub RAM. That could make a huge performance difference. I'm not sure what size of symbol table that would allow if it can be migrated away from its current, potentially sparse, hash table implementation, or whether it would seriously limit the target application size. If it really needs a large amount of data-segment memory, a write-back D-cache may be of use, although it will only really achieve decent benefits if the data has sufficient locality and doesn't thrash the cache.

    Yes, the problem is not the code - executing Lua from PSRAM demonstrates that this can work ok. The problem is that the implementation of the assembler assumes that it can allocate the entire symbol table before starting the assembly, which means the symbol table - several megabytes of it - will not fit in Hub RAM but must be allocated in PSRAM. And accessing a hash table in PSRAM leads to cache thrashing.
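
    A rough back-of-envelope makes the point (the per-entry size is an assumption, not a figure taken from p2asm's sources):

    40,000 symbols x  64 bytes/entry  ~=  2.5 MB
    40,000 symbols x 128 bytes/entry  ~=  5.1 MB

    Either way the table comfortably exceeds the P2's 512 KB of Hub RAM before the hash table's empty buckets or the assembler's other buffers are even counted, so it has to go into PSRAM.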

    Some redesign of the assembler may be the only answer. Not Dave's fault, I should hasten to add - I am doing something with his code that he probably never envisaged.

    Ross.

  • Wuerfel_21 Posts: 4,369
    edited 2023-05-31 11:02

    I don't think the large hash table is actually your problem, O(1) random access and all. Pessimistically assuming your hello world + libc is 64K large (how large is it actually?) and each byte roughly corresponds to one token of pasm source, that would mean for it to take multiple minutes, the symbol table lookups would have to be screeching along at far less than 1000 tokens per second, so 2 to 3 orders of magnitude slower than it has any right to realistically be, even without any optimizations. There's clearly something wrong with the implementation. (Is the compiler sploding the hash function and everything ends up in the same bucket?)
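
    Putting rough numbers on that (all assumed, just for scale): 64K lookups spread over the ~8 minutes quoted earlier for the final assembly is

    65,536 lookups / ~480 s  ~=  ~140 lookups per second

    whereas even if every single lookup missed the cache and each miss cost on the order of 10 microseconds to refill a block from PSRAM, you would still get on the order of 100,000 lookups per second - hence the 2 to 3 orders of magnitude.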

  • RossH Posts: 5,336

    @Wuerfel_21 said:
    I don't think the large hash table is actually your problem

    A cache works by assuming that the next access is likely to be on the same page as recent accesses. A hash table works by ensuring this is not the case.

    The two paradigms don't work well together.

  • Wuerfel_21 Posts: 4,369
    edited 2023-05-31 11:25

    @RossH said:

    @Wuerfel_21 said:
    I don't think the large hash table is actually your problem

    A cache works by assuming that the next access is likely to be on the same page as recent accesses. A hash table works by ensuring this is not the case.

    Whether or not the cache is hitting is entirely irrelevant. If it takes a full one of god's very own milliseconds to look up a single symbol, you could cache/evict a hundred cache blocks in that time. That's why I think the slowness is either somewhere else or the hash table is mega broken.

    (Also, hash tables do not have inherently bad cache behavior? Esp. given that normal pasm source will use the same register labels a lot so their buckets would be kept cached.)

  • RossH Posts: 5,336

    @Wuerfel_21 said:

    Whether or not the cache is hitting is entirely irrelevant.

    https://stackoverflow.com/questions/49709873/cache-performance-in-hash-tables-with-chaining-vs-open-addressing

  • Wuerfel_21 Posts: 4,369
    edited 2023-05-31 11:58

    I meant that even if every lookup misses the cache multiple times (they shouldn't) it still should be orders of magnitude faster than it apparently(?) is. Please read my post first.

    @Wuerfel_21 said:
    you could cache/evict a hundred cache blocks in that time.
