
Catalina - a self-hosted PASM assembler and C compiler for the Propeller 2

  • pik33 Posts: 2,347
    edited 2023-05-20 14:26

    And how fast is your SD driver itself? Flexprop's driver, which is fast now, was much slower in the past.

  • @RossH do you know how many total MB are being read from your card for this compilation of Hello World? Are we talking just a few MB, or is it hundreds of MB? Maybe it's not so much raw I/O throughput as lots of time spent switching files, with FAT lookups and other slow card-access latencies. Not sure there.

    Pretty sure there should be something faster for reading SD these days - I think evanh might have done some improvements to the SPI transfers at some point, which may have made their way into flexspin too.

  • Rayman Posts: 13,797

    Bet a pasm2 compiler would be much faster right? Or, would allowing preprocessor things slow it down just as much?

  • RossH Posts: 5,336

    Another update: It is not the SD card, it is the preprocessor that is causing the problem!

    The preprocessor MUST be used to process C files - it's an important part of the language. But I have simplified things so that preprocessing the intermediate assembly file (which is typically about 100 KB) is now essentially just a simple file copy (i.e. preprocessing a file with nothing more than a single #include statement) and it still takes SIX MINUTES just to do that step! Whereas doing essentially the same copy sector by sector takes less than TEN SECONDS!

    I have never really looked at LCC's preprocessor - it is written in standard C and has never needed any changes - it just works. But clearly, it is a bit of a dog. If I eliminate the need for using it in the intermediate compilation steps - essentially replacing it with a simple line by line copy - it will reduce the compilation time by almost 6 minutes.
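
    Something along these lines should do it - a rough sketch only, in standard C (the file names and details here are illustrative, not the final Catalina code): read the intermediate file line by line, expand the single #include, and copy everything else through untouched.

    #include <stdio.h>

    /* Append the contents of one included file to the output (placeholder helper). */
    static int copy_file(const char *name, FILE *out)
    {
        char line[512];
        FILE *in = fopen(name, "r");
        if (in == NULL) return -1;
        while (fgets(line, sizeof line, in) != NULL)
            fputs(line, out);
        fclose(in);
        return 0;
    }

    int main(void)
    {
        char line[512], name[256];
        FILE *in  = fopen("program.s", "r");   /* intermediate assembly (name assumed) */
        FILE *out = fopen("program.i", "w");   /* "preprocessed" output (name assumed) */
        if (in == NULL || out == NULL) return 1;
        while (fgets(line, sizeof line, in) != NULL) {
            /* expand the one #include; copy every other line verbatim */
            if (sscanf(line, " #include \"%255[^\"]\"", name) == 1)
                copy_file(name, out);
            else
                fputs(line, out);
        }
        fclose(in);
        fclose(out);
        return 0;
    }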

    So I can reduce the current 15-minute compilation time down to about 10 minutes ... with 8 of those minutes being the final p2asm assembly.

    I know the issue with p2asm - it is to do with how it manages a symbol table that may contain up to 40,000 symbols (this is the size of the symbol table required to compile Catalina itself). Dave Hein (who sadly is no longer with us) did a great job writing it - it is written in standard C and is very easy to maintain and modify. But apart from maintaining it as a tribute to Dave, I have no special attachment to it, so if anyone knows of another pasm assembler that can cope with assembling files of this size, I'd happily use it - at least on the P2.

    Ross.

  • evanh Posts: 15,126

    Good to hear about the speed of file read/write. Cluso's bit-bashed code isn't bad for speed (from memory, something like sysclock/20) so shouldn't have caused a bottleneck.
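
    (For scale, and assuming "sysclock/20" means one bit transferred per 20 system clocks: at a ~300 MHz sysclock that is about 15 Mbit/s, or roughly 1.9 MB/s of raw SPI transfer.)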

  • RossH Posts: 5,336
    edited 2023-06-03 08:31

    Aaarrrggghhh!

    I just spent a couple of days trying to chase down what I thought was a serious bug in the self-hosted version of Catalina, which seemed to corrupt the SD card randomly. It finally occurred to me to try a different SD card, and I have not had a failure since. It seems some SD cards may not be very reliable when used this intensively! :(

    This stalled any progress on improving the compile time, but in the process I have been doing larger compiles, which is a more realistic test. They all now seem to work ok (albeit slowly). Here are some actual compile times:

    hello_world.c (5 lines) - 10 minutes
    othello.c (470 lines) - 13 minutes
    startrek.c (2200 lines) - 56 minutes

    It is clear that the compile time is not linear with the C program size. This is partly because a large chunk of the time (almost all of it in the case of hello_world.c) is not used to compile the C code at all - it is used to build the run-time environment, which is written in PASM and is pretty much the same for each of the above programs. Another factor is how much of the C library the program uses - a line of C that writes to a disk file will take longer to compile than a line that writes to the terminal, because it pulls in much more of the C library code.
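
    For example (illustrative only), both of the writes below are single lines of C, but the second also pulls the file-system portion of the library into the build, so the PASM that finally has to be assembled is much larger:

    #include <stdio.h>

    int main(void)
    {
        /* terminal output: links in a relatively small part of the C library */
        printf("hello\n");

        /* writing to a disk file: also drags in the file-system support, so a
           program containing these lines takes noticeably longer to compile
           and assemble */
        FILE *f = fopen("log.txt", "w");
        if (f != NULL) {
            fprintf(f, "hello\n");
            fclose(f);
        }
        return 0;
    }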

    My idea to pre-compile the run-time environment to avoid having to do this every time sounded good, but it turns out to be more complex than I expected, and it is also now clear that while this would improve the compile times of very small programs significantly (e.g. by 10 minutes or so), it will still take many minutes or even hours to compile programs of any significant size. So I will put that idea on hold for the moment, and just concentrate on tidying up what I have working already - after all, this is primarily a demo of what Catalina can do rather than a real-world practical application.

    It is also clear that there is no point in trying to support some of the more advanced (but really time consuming) features of Catalina (such as XMM support) in the self-hosted version. Compiling programs large or complex enough to need such features would be impractical on the Propeller 2, and will have to be compiled on a PC.

    So, included will be:
    1. Support for Propeller 2 LMM (TINY), CMM (COMPACT) and NMM (NATIVE) programs.
    2. Support for compiling programs for all Propeller 2 platforms supported by Catalina.

    Excluded will be:
    1. XMM support (since it would be too slow to compile programs large enough to require XMM RAM)
    2. Debugger support (since you need to use a PC to host the debugger anyway)
    3. Parallelizer support (since this requires awk, which I do not plan to port to the Propeller)
    4. Optimizer support (since this also requires awk)
    5. Propeller 1 support (since this requires a Spin compiler which I do not plan to port to the Propeller)

    There will be a simple catalina program added to Catalyst to execute all the necessary compilation steps using a single command, but it will support only a limited subset of the command-line options of the PC version. If you need any of the unsupported options, you will have to use the PC version. Also note that self-hosted development will require a Propeller 2 fitted with at least 16MB of PSRAM or HyperRAM.

    Ross.

    EDIT: Update timing.
    EDIT: Revised to remove Propeller 1 support.

  • pik33 Posts: 2,347

    I killed several SD cards while compiling programs on a Raspberry Pi 1. Massive writing of temporary files and swapping of memory to SD. They don't like that.

  • Still crazy that hello world takes 15 minutes to compile. Gut feeling tells me this thing should be able to perform far better on a ~300 MHz P2. I guess a lot of symbol-table data access is going on during final assembly, which just isn't scaling well, leading to inefficiencies and major slowdowns.

    In comparison, how long does it take on a real PC to compile the same example files? Are we talking just a few seconds, or does it still take a while to compile there too?

  • RossH Posts: 5,336
    edited 2023-05-28 12:49

    @rogloh said:
    Still crazy that hello world takes 15 minutes to compile. Gut feeling tells me this thing should be able to perform far better on a ~300 MHz P2. I guess a lot of symbol-table data access is going on during final assembly, which just isn't scaling well, leading to inefficiencies and major slowdowns.

    In comparison, how long does it take on a real PC to compile the same example files? Are we talking just a few seconds, or does it still take a while to compile there too?

    All these compilations take just a few seconds on a PC. But this is easy to explain - 15 minutes to compile 5 lines of C, but only 20 minutes to compile ~500 lines of C means the bottleneck is not in the compilation of the C code, it is in the compilation of the PASM run-time - i.e. the assembler. So Catalina on the Propeller 2 effectively has a fixed overhead of about 15 minutes for any program compilation.

    This run-time compilation on a PC takes about 10Mb of RAM. But while this process takes only seconds on a PC, this is largely because on a PC you have fast random access to the whole 10Mb. But on the P2 access to XMM RAM is comparatively slow when compared to accessing Hub RAM. Your drivers may be fast for sequential access, but they are not fast for random access when compared to Hub RAM (note: this is not a criticism of your drivers - it is just the nature of the beast!). To speed this up, Catalina uses a Hub RAM cache.

    However, the assembler currently uses a hash table, which relies for speed on fast random access to the whole of RAM. If you use a cache (as Catalina does) then this is not a good design, and can result in cache thrashing.
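
    To illustrate the mismatch - a simplified sketch only, not p2asm's or Catalina's actual code - a hash lookup deliberately scatters accesses across a multi-megabyte table, so when that table sits behind a block cache in Hub RAM, almost every probe lands on a block that is not currently resident:

    #include <stdint.h>
    #include <string.h>

    /* Illustrative sizes only: 64K buckets of ~72 bytes each is several
       megabytes, which is why the table has to live in PSRAM, not Hub RAM. */
    #define TABLE_SIZE 65536u

    typedef struct {
        char     name[64];
        uint32_t value;
        uint32_t used;
    } symbol_t;

    static symbol_t table[TABLE_SIZE];      /* conceptually in cached XMM/PSRAM */

    static uint32_t hash(const char *s)     /* FNV-1a */
    {
        uint32_t h = 2166136261u;
        while (*s) { h ^= (uint8_t)*s++; h *= 16777619u; }
        return h;
    }

    /* Open-addressing lookup: the hash spreads consecutive symbols across the
       whole table, so each first probe typically touches a cache block that is
       not resident and forces a PSRAM block fill - the opposite of the
       locality a cache depends on. */
    symbol_t *lookup(const char *name)
    {
        uint32_t i = hash(name) % TABLE_SIZE;
        while (table[i].used) {
            if (strcmp(table[i].name, name) == 0)
                return &table[i];
            i = (i + 1) % TABLE_SIZE;       /* linear probing */
        }
        return NULL;                        /* not found */
    }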

    Some redesign of the assembler may help.

    Ross.

  • RossH Posts: 5,336

    @pik33 said:
    I killed several SD cards while compiling programs on a Raspberry Pi 1. Massive writing of temporary files and swapping of memory to SD. They don't like that.

    No, they do not. I wonder if anyone has yet designed a disk interface for the Propeller 2? :)

  • @RossH said:
    This run-time compilation on a PC takes about 10Mb of RAM. But while this process takes only seconds on a PC, this is largely because on a PC you have fast random access to the whole 10Mb. But on the P2 access to XMM RAM is comparatively slow when compared to accessing Hub RAM. Your drivers may be fast for sequential access, but they are not fast for random access when compared to Hub RAM (note: this is not a criticism of your drivers - it is just the nature of the beast!). To speed this up, Catalina uses a Hub RAM cache.

    Actually, you should criticize roger's drivers here - they do a lot of stuff you don't need when you're running a block cache on top (and another lot of stuff that's not needed on any extant board - pins/latency/etc as a per-bank setting).

    You can get very low latency if you hardcode the size of your blocks and always access them aligned to that size. That bypasses all branches that are otherwise needed for checking row crossings etc.

    I haven't really benchmarked the limit, but my custom access code in NeoYume can certainly do more than 1.5 million operations per second (~ 96 ops/line * 15.7k hsync/s) with 8 byte blocks (16px * 4bpp). With the 16 bit RAM on the Edge it's probably(tm) closer to 3 million per second? Actually, I now want to calculate how fast it really is, lol.

  • Wuerfel_21 Posts: 4,369
    edited 2023-05-28 16:59

    Okay, quickly made a dummied-out version of the NeoYume sprite slot loop, which is about as close to optimal as a real-world use case gets (actually, one could probably interleave some setup for the next iteration into the wait times....) It seems that for a 16-bit bus and 8-byte blocks it comes out at ~92 cycles per loop, which would indeed imply more than 3.5 million ops per second are possible. I think roger said once that his driver needs roughly a microsecond to do anything, which would be ~350 cycles? Anyways, if you step up to 64-byte blocks the slot loop still only needs ~148 cycles. So there's an overhead of 84 cycles and then of course there's 1 byte/cycle of transfer time.

    CON _CLKFREQ = 10_000_000
    
    
    BLOCK_SIZE = 8
    BUS_WIDTH = 16
    LOOPS = 1000
    
    ' stuff don't touch
    PSRAM_BASE = 0
    MA_CLKDIV = 2
    PSRAM_WAIT  = 5
    PSRAM_DELAY = 14
    MA_CHAR_CYCLES = (BLOCK_SIZE * 8) / BUS_WIDTH
    
    
    
    DAT
                  org
                  setxfrq ma_nco_slow
                  mov ptrb,#0
                  mov ma_slotleft,##LOOPS
                  getct start_time
    .slotlp
                  rdlong ma_mtmp1,ptrb[2]
                  nop ' sentinel check
    
                  nop  ' NY-specific address math
                  nop  ' ^^
    
                  setbyte ma_mtmp1,#$EB,#3
                  splitb  ma_mtmp1
                  rev     ma_mtmp1
                  movbyts ma_mtmp1, #%%0123
                  mergeb  ma_mtmp1
    
                  rep @.irqshield,#1
                  nop ' drvl
                  nop ' drvh
                  xinit ma_psram_addr_cmd,ma_mtmp1
                  nop ' wypin
                  setq ma_nco_fast
                  xcont #PSRAM_WAIT*MA_CLKDIV+PSRAM_DELAY,#0
                  wrfast ma_bit31,ptrb
                  waitxmt
                  nop ' fltl
                  setq ma_nco_slow
                  xcont ma_psram_readspr_cmd,#0
                  waitxfi
    .irqshield
                  add ptrb,#4*4
                  djnz ma_slotleft,#.slotlp
    
                  getct end_time
                  sub end_time,start_time
                  debug(udec(end_time))
                  jmp #$
    
    
    ma_psram_addr_cmd long (PSRAM_BASE<<17)|X_PINS_ON | X_IMM_8X4_LUT + 8
    ma_psram_readspr_cmd long (PSRAM_BASE<<17)|X_WRITE_ON| X_16P_2DAC8_WFWORD + MA_CHAR_CYCLES
    
    
    ma_bit31      'alias
    ma_nco_fast   long $8000_0000
    ma_nco_slow   long $4000_0000
    
    start_time    res 1
    end_time      res 1
    
    ma_mtmp1      res 1
    ma_slotleft   res 1
    
    

    EDIT:

    actually, one could probably interleave some setup for the next iteration into the wait times....

    That actually seems possible. For the 8-byte case, there are at least 24 free cycles between XCONT and WAITXFI. I was able to make a version that takes 74 cycles per loop (instead of 92), but that requires knowing what data you want next ahead of time. Also, I haven't actually tested that with the real P2EDGE setup.

  • RossH Posts: 5,336

    @Wuerfel_21 said:
    Actually, you should criticize roger's drivers here - they do a lot of stuff you don't need when you're running a block cache on top (and another lot of stuff that's not needed on any extant board - pins/latency/etc as a per-bank setting).

    I'm sure there will eventually be faster drivers, and I will use them when they appear - but Roger's drivers are not the main issue here. Part of the problem is that I hadn't considered how badly a hash table, added to the assembler (Dave Hein's original didn't use one) to speed up performance on the PC, would perform in conjunction with cached XMM access on the P2 (but to be fair to myself, I had no idea at the time that I would be supporting either XMM RAM on the P2 or self-hosted development!).

    And even if we achieve a few million operations per second accessing XMM RAM, this is only a small percentage of the speed of accessing Hub RAM on the P2, which is itself only a small percentage of the speed of accessing RAM on a PC, so a slowdown of the magnitude seen here is not all that unexpected given such a naive and memory-hungry design. We have all been spoiled a bit by how we can get away with such profligacy on a modern PC :)

    Another part of the problem is the way C itself must be compiled. For instance, I talk blithely about a "5 line C program", but really there is no such thing. Just applying the C pre-processor (which is the necessary first step in the C compilation process) turns those 5 lines into about 100 lines of C code. The C preprocessor hides a multitude of sins! :)
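
    For example, a typical hello_world.c (the exact demo file may differ) is only five lines:

    #include <stdio.h>
    int main(void)
    {
        printf("Hello, world!\n");
    }

    Running just the preprocessor over it replaces the #include with the contents of stdio.h (plus anything stdio.h itself includes), which is where the hundred-odd lines come from before the compiler proper even starts.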

    Ross.

  • jmg Posts: 15,140

    I did find this for ISSI, picking only their 3.3 V parts. These are steadily getting larger, so they could be an alternative to flash?

    ISSI Serial Memory Roadmap
    
    Serial RAM
    Density  Part Numbers                       Bus     Voltage   Speed    Temp Range     Package  Status     Type
    32M      IS66WVS4M8BLL, IS67WVS4M8BLL       x1, x4  2.7~3.6V  104 MHz  -40 to 105°C   SOIC(8)  Prod       Serial RAM
             IS66WVQ8M4DBLL, IS67WVQ8M4DBLL     x4      2.7~3.6V  166 MHz  -40 to 125°C   BGA(24)  Prod       Serial QUADRAM
    64M      IS66WVS8M8FBLL, IS67WVS8M8FBLL     x1, x4  2.7~3.6V  104 MHz  -40 to 105°C   SOIC(8)  S=H2/2023  Serial RAM
             IS66WVQ16M4FBLL, IS67WVQ16M4FBLL   x4      2.7~3.6V  200 MHz  -40 to 125°C   BGA(24)  S=H2/2023  Serial QUADRAM
    128M     IS66WVS16M8FBLL, IS67WVS16M8FBLL   x1, x4  2.7~3.6V  104 MHz  -40 to 105°C   SOIC(8)  S=H2/2023  Serial RAM
    
    OctalRAM
    Density  Org    Part Numbers                       Voltage   Speed    Package  Status     Model
    256M     32Mx8  IS66WVO32M8DBLL, IS67WVO32M8DBLL   2.7-3.6V  166 MHz  BGA(24)  Prod       IBIS
    512M     64Mx8  IS66WVO64M8DBLL, IS67WVO64M8DBLL   2.7-3.6V  200 MHz  BGA(24)  S=Q4/2022  IBIS
    
    HyperRAM
    Density  Org    Part Numbers                       Voltage   Speed    Package  Status     Model
    256M     32Mx8  IS66WVH32M8DBLL, IS67WVH32M8DBLL   2.7-3.6V  166 MHz  BGA(24)  Prod       IBIS
    512M     64Mx8  IS66WVH64M8DBLL, IS67WVH64M8DBLL   2.7-3.6V  200 MHz  BGA(24)  S=Q4/2022  IBIS
    
    
  • evanh Posts: 15,126

    Huh, the octal parts must use a larger die now. There is a lot more room in the BGA package, assuming the pin bonds are direct die contacts and therefore there are no peripheral legs or bond wires.

    On that note, I've not seen any articles on how BGA, or PGA for that matter, packaging is achieved. It started with "flip-chip" I presume but that hasn't been the norm for a while, afaik. I'm just assuming there is a magic way to get pins routed through the silicon die to the top side these days.

  • @Wuerfel_21 said:
    Actually, you should criticize roger's drivers here - they do a lot of stuff you don't need when you're running a block cache on top (and another lot of stuff that's not needed on any extant board - pins/latency/etc as a per-bank setting).

    You can get very low latency if you hardcode the size of your blocks and always access them aligned to that size. That bypasses all branches that are otherwise needed for checking row crossings etc.

    I haven't really benchmarked the limit, but my custom access code in NeoYume can certainly do more than 1.5 million operations per second (~ 96 ops/line * 15.7k hsync/s) with 8 byte blocks (16px * 4bpp). With the 16 bit RAM on the Edge it's probably(tm) closer to 3 million per second? Actually, I now want to calculate how fast it really is, lol.

    I think roger said once that his driver needs roughly a microsecond to do anything, which would be ~350 cycles?

    Yeah, the one microsecond was an estimate from back in the HyperRAM days. It used to take around 250 cycles to do the mailbox polling for all the COGs, the memory request decoding and bank setup, and all the other overheads in the loop. It's probably a little out of date now for the PSRAM, which has other complications including its read-modify-write handling and row crossings.

    I'd estimate that if you optimize for fixed block sizes, write only long-sized elements, and bury some raw transfer code in your COG, you'll probably gain somewhere between a 2x and 2.6x boost. In the real world these gains will be smaller for an actual application using external memory through a cache, because many of the accesses will be cache hits that don't require a PSRAM access at all. If you need multi-cog sharing of memory or video access, you will likely want something much closer to what I already developed, and will have to poll/decode a number of mailboxes, which slows things down again. But if you have an emulator coupled directly to the RAM (like NeoYume), then as you found yourself you can achieve some gains versus using my drivers.

  • RossH Posts: 5,336

    I've decided not to include Propeller 1 support in the self-hosted version of Catalina. I already have too much to do, and I doubt such a feature would find any use at all. So the version of Catalina on the Propeller 2 will only support compiling programs for the Propeller 2.

    I've updated my earlier post accordingly.

    Ross.

  • @RossH said:

    @rogloh said:
    Still crazy that hello world takes 15 minutes to compile. Gut feeling tells me this thing should be able to perform far better on a ~300 MHz P2. I guess a lot of symbol-table data access is going on during final assembly, which just isn't scaling well, leading to inefficiencies and major slowdowns.

    In comparison, how long does it take on a real PC to compile the same example files? Are we talking just a few seconds, or does it still take a while to compile there too?

    All these compilations take just a few seconds on a PC. But this is easy to explain - 15 minutes to compile 5 lines of C, but only 20 minutes to compile ~500 lines of C means the bottleneck is not in the compilation of the C code, it is in the compilation of the PASM run-time - i.e. the assembler. So Catalina on the Propeller 2 effectively has a fixed overhead of about 15 minutes for any program compilation.

    This run-time compilation on a PC takes about 10Mb of RAM. But while this process takes only seconds on a PC, this is largely because on a PC you have fast random access to the whole 10Mb. But on the P2 access to XMM RAM is comparatively slow when compared to accessing Hub RAM. Your drivers may be fast for sequential access, but they are not fast for random access when compared to Hub RAM (note: this is not a criticism of your drivers - it is just the nature of the beast!). To speed this up, Catalina uses a Hub RAM cache.

    However, the assembler currently uses a hash table, which relies for speed on fast random access to the whole of RAM. If you use a cache (as Catalina does) then this is not a good design, and can result in cache thrashing.

    Some redesign of the assembler may help.

    Ross.

    Just being curious, I tried to follow this discussion. Do I understand this right: instead of linking pre-assembled object code, Catalina does the linking at the source level and needs to recompile and/or reassemble the libraries every time?
    Christof

  • evanh Posts: 15,126
    edited 2023-05-30 15:00

    Libraries will be pre-compiled. I read it as:
    - Step 1: Compile your C source code into assembly source code.
    - Step 2: The Pasm assembler needs to be compiled, from C I presume, before it can do the assembling of your compiled C code. This takes 15 minutes.
    - Step 3: Then the assembling process itself also has a bottleneck with its hash table causing a lot of cache thrashing.

    EDIT: Hmm, that doesn't make sense either. Assembler can't be assembled that way. Back to Ross ...

  • Rayman Posts: 13,797

    I'm thinking compiling PASM2 on chip is more interesting and practical. Although being able to compile C on chip is really cool, even if it takes a long time....

    I guess in principle the Spin2Cpp could also be used to compile Spin2 code in this way? Or, not?

  • RossH Posts: 5,336

    @"Christof Eb." said:

    Just being curious, I tried to follow this discussion. Do I understand this right: instead of linking pre-assembled object code, Catalina does the linking at the source level and needs to recompile and/or reassemble the libraries every time?
    Christof

    Catalina compiles C to SPIN/PASM. The final PASM assembly is done after linking in all the library and run-time code.

    You have to remember that there were no tools available for the Propeller when Catalina was first developed other than a Spin compiler. There wasn't a language-independent binary object format, and there wasn't (still isn't, AFAIK) a stand-alone linker - there wasn't even a stand-alone assembler - originally everything had to be done in Spin, because that was the only tool we had. This is still how Catalina works on the Propeller 1, but on the Propeller 2 we at least had a stand-alone assembler, developed fairly early by Dave Hein.

    Ross.

  • RossH Posts: 5,336

    @evanh said:
    Libraries will be pre-compiled. I read it as:
    - Step 1: Compile your C source code into assembly source code.
    - Step 2: The Pasm assembler needs to be compiled, from C I presume, before it can do the assembling of your compiled C code. This takes 15 minutes.
    - Step 3: Then the assembling process itself also has a bottleneck with its hash table causing a lot of cache thrashing.

    EDIT: Hmm, that doesn't make sense either. Assembler can't be assembled that way. Back to Ross ...

    See my previous post. Everything is compiled to PASM, then combined, then that PASM is assembled. So the assembler is the main bottleneck. It was written assuming it had fast access to as many megabytes of RAM as it needed, so it is quite a simple implementation - and on a PC that's fine. But not on the P2. I use a slightly cut-down version of the assembler on the P2, but even so the program size ends up at around 4 MB of code and data, and so it must execute entirely out of PSRAM.

    Ross.

  • RossH Posts: 5,336

    @Rayman said:
    I'm thinking compiling PASM2 on chip is more interesting and practical. Although being able to compile C on chip is really cool, even if it takes a long time....

    I guess in principle the Spin2Cpp could also be used to compile Spin2 code in this way? Or, not?

    If it is written in standard C, it can be run on a P2 using Catalina. So if there is a Spin compiler written in standard C, you can use it to compile Spin on a P2.

    My language of choice these days is Lua, and the Lua interpreter and compiler have been running on the P2 for years now. In fact, on the P2 those parts of Catalina outside the basic compiler and assembler (which are still written in C) are now written in Lua. I may eventually back-port this Lua version to the PC, since it is much cleaner and simpler than the C version.

    Ross.

  • @RossH said:
    See my previous post. Everything is compiled to PASM, then combined, then that PASM is assembled. So the assembler is the main bottleneck. It was written assuming it had fast access to as many megabytes of RAM as it needed, so it is quite a simple implementation - and on a PC that's fine. But not on the P2. I use a slightly cut-down version of the assembler on the P2, but even so the program size ends up at around 4 MB of code and data, and so it must execute entirely out of PSRAM.

    Ross.

    Executing I-cached code from PSRAM should not be a major bottleneck, at least based on what I saw running Micropython from PSRAM, so long as the I-cache is made sufficiently large. But it'd be ideal if all the data memory needed for Dave Hein's assembler tool could fit into whatever's left of the 512 KB of Hub RAM. That could make a huge performance difference. I'm not sure what size of symbol table that would allow if it can be migrated away from its current, potentially sparse, hash table implementation, or whether it would seriously limit the target application size. If it really needs a large amount of data-segment memory, a write-back D-cache may be of use, although it will only really achieve decent benefits if the data has sufficient locality and doesn't thrash the cache.

  • RossH Posts: 5,336

    @rogloh said:

    Executing I-cached code from PSRAM should not be a major bottleneck, at least based on what I saw running Micropython from PSRAM, so long as the I-cache is made sufficiently large. But it'd be ideal if all the data memory needed for Dave Hein's assembler tool could fit into whatever's left of the 512 KB of Hub RAM. That could make a huge performance difference. I'm not sure what size of symbol table that would allow if it can be migrated away from its current, potentially sparse, hash table implementation, or whether it would seriously limit the target application size. If it really needs a large amount of data-segment memory, a write-back D-cache may be of use, although it will only really achieve decent benefits if the data has sufficient locality and doesn't thrash the cache.

    Yes, the problem is not the code - executing Lua from PSRAM demonstrates that this can work ok. The problem is that the implementation of the assembler assumes that it can allocate the entire symbol table before starting the assembly, which means the symbol table - several megabytes of it - will not fit in Hub RAM but must be allocated in PSRAM. And accessing a hash table in PSRAM leads to cache thrashing.
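
    A rough back-of-envelope makes the point (the per-entry size is an assumption, not a figure taken from p2asm's sources):

    40,000 symbols x  64 bytes/entry  ~=  2.5 MB
    40,000 symbols x 128 bytes/entry  ~=  5.1 MB

    Either way the table comfortably exceeds the P2's 512 KB of Hub RAM before the hash table's empty buckets or the assembler's other buffers are even counted, so it has to go into PSRAM.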

    Some redesign of the assembler may be the only answer. Not Dave's fault, I should hasten to add - I am doing something with his code that he probably never envisaged.

    Ross.

  • Wuerfel_21 Posts: 4,369
    edited 2023-05-31 11:02

    I don't think the large hash table is actually your problem, O(1) random access and all. Pessimistically assuming your hello world + libc is 64K large (how large is it actually?) and each byte roughly corresponds to one token of pasm source, that would mean for it to take multiple minutes, the symbol table lookups would have to be screeching along at far less than 1000 tokens per second, so 2 to 3 orders of magnitude slower than it has any right to realistically be, even without any optimizations. There's clearly something wrong with the implementation. (Is the compiler sploding the hash function and everything ends up in the same bucket?)
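
    Putting rough numbers on that (all assumed, just for scale): 64K lookups spread over the ~8 minutes quoted earlier for the final assembly is

    65,536 lookups / ~480 s  ~=  ~140 lookups per second

    whereas even if every single lookup missed the cache and each miss cost on the order of 10 microseconds to refill a block from PSRAM, you would still get on the order of 100,000 lookups per second - hence the 2 to 3 orders of magnitude.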

  • RossH Posts: 5,336

    @Wuerfel_21 said:
    I don't think the large hash table is actually your problem

    A cache works by assuming that the next access is likely to be on the same page as recent accesses. A hash table works by ensuring this is not the case.

    The two paradigms don't work well together.

  • Wuerfel_21 Posts: 4,369
    edited 2023-05-31 11:25

    @RossH said:

    @Wuerfel_21 said:
    I don't think the large hash table is actually your problem

    A cache works by assuming that the next access is likely to be on the same page as recent accesses. A hash table works by ensuring this is not the case.

    Whether or not the cache is hitting is entirely irrelevant. If it takes a full one of god's very own milliseconds to look up a single symbol, you could cache/evict a hundred cache blocks in that time. That's why I think the slowness is either somewhere else or the hash table is mega broken.

    (Also, hash tables do not have inherently bad cache behavior? Esp. given that normal pasm source will use the same register labels a lot so their buckets would be kept cached.)

  • RossH Posts: 5,336

    @Wuerfel_21 said:

    Whether or not the cache is hitting is entirely irrelevant.

    https://stackoverflow.com/questions/49709873/cache-performance-in-hash-tables-with-chaining-vs-open-addressing

  • Wuerfel_21 Posts: 4,369
    edited 2023-05-31 11:58

    I meant that even if every lookup misses the cache multiple times (they shouldn't) it still should be orders of magnitude faster than it apparently(?) is. Please read my post first.

    @Wuerfel_21 said:
    you could cache/evict a hundred cache blocks in that time.
