Strange behaviour, my fault or compiler bug?

RossH · 2024-05-15 12:04

@Wuerfel_21 said:

@RossH said:

The provided Linux binaries will probably only work on recent Ubuntu releases, but building Catalina on Linux is fairly easy. See the BUILD.TXT document for details.

If you're building Linux binaries for distribution, you should really make them fully static where possible. It doesn't end up much bigger and doesn't ever have this dumbass glibc version issue. You install musl-tools from APT and then use musl-gcc -static -fno-pie as your compiler/linker.

Will investigate. But on Linux (unlike Windows) Catalina is fairly easy to build from source.

Ross.

ke4pjw · 2024-05-15 20:50

@ManAtWork said:
Yes, the code had timing issues. The original LAN9252 driver code from Microchip didn't check for the number of available data in the FIFO but instead assumed that the MCU was always slower than the FIFO being filled or written. I think I've fixed this but it might be still wrong. But in that case it should result in wrong values but no crashes or hangs. Hangs or infinite loops could be caused by timing issues but it hangs in the first call to a driver function that doesn't use the FIFO. Adding a printf() to the end cannot affect timing of an earlier call.

My wiznet driver had a similar problem. What I did to correct it was to check twice to see if the buffer had changed. If it changed, I knew that I hadn't received all of the data. That was also a suggestion from the wiznet forums.
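The double-read idea above can be sketched roughly as follows. This is only an illustration, not the actual wiznet driver code: `read_count_fn`, `read_stable_count`, `fake_fifo` and `fake_read` are hypothetical names, and the fake reader merely replays a list of sample values so the logic can be tried on a PC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback that reads the chip's "bytes available" count. */
typedef uint16_t (*read_count_fn)(void *ctx);

/*
 * Read the count until two consecutive reads agree.
 * Returns true and stores the stable value in *out, or false if the
 * value was still changing after max_tries comparisons.
 */
bool read_stable_count(read_count_fn read_count, void *ctx,
                       unsigned max_tries, uint16_t *out)
{
    assert(read_count != NULL && out != NULL);
    uint16_t prev = read_count(ctx);
    for (unsigned i = 0; i < max_tries; i++) {
        uint16_t cur = read_count(ctx);
        if (cur == prev) {          /* unchanged between two reads */
            *out = cur;
            return true;
        }
        prev = cur;                 /* still filling, compare again */
    }
    return false;
}

/* Stand-in for real hardware: replays samples, then repeats the last one. */
struct fake_fifo {
    const uint16_t *samples;
    size_t n;
    size_t pos;
};

uint16_t fake_read(void *ctx)
{
    struct fake_fifo *f = ctx;
    uint16_t v = f->samples[f->pos];
    if (f->pos + 1 < f->n)
        f->pos++;
    return v;
}
```

On real hardware the callback would wrap whatever register read the driver already has. The retry limit is there so that a count that never settles is reported instead of looping forever.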

RossH · 2024-05-16 05:40

@evanh said:
Just had a go at using Catalina. The supplied binaries complain that it requires GLIBC_2.34 ... I seem to have v2.31. Attempting to build Catalina using ./build_all in catalina/source nets the same error ...
gcc: error: ../catalina/awka-0.7.5/lib/libawka.a: No such file or directory
../catalina/awka: /lib/i386-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by ../catalina/awka)
PS: I'm currently using Kubuntu 20.04 with HWE but am planning on moving to 24.04 soon.

One more question - did you follow all the instructions in BUILD.TXT? In particular, before doing ./build_all, did you build awka? The version of awka in the distribution may depend on GLIBC 2.34 but you should be able to rebuild it to use whatever version you have installed:

   cd $LCCDIR/source/catalina/awka-0.7.5
   ./configure CFLAGS="-m32"
   make

Also, this is off-topic for this thread - please post follow ups to the main Catalina thread here.

ManAtWork · 2024-05-16 07:21

Thanks, Ross, for suggesting using Catalina. At the moment, I prefer Flexspin because it would mean a lot of work to port all the drivers from Spin2 to C. But if I finally give up, I'll consider giving Catalina another chance.

I had very little time the last days to further investigate the problem. I'll continue tomorrow.

ersmith · 2024-05-16 17:20

@ManAtWork said:
Any ideas how I can debug this??? I think I can rule out the following possible causes:

timing, code that isn't yet executed cannot affect the timing of previous actions

This is wrong, unfortunately. Hub alignment has a small but significant effect on timing, both for jumps (the hub cache miss timing) and for data accesses using WRLONG/RDLONG. So adding code can change timings even if that code is never executed, if you have either code or data in hub.
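To make the alignment effect concrete, here is a deliberately simplified model, not a cycle-accurate description of the P2. It assumes hub RAM is split into 8 slices of one long each and that a cog's access window advances by one slice per clock; the names `HUB_SLICES`, `hub_slice` and `hub_wait_clocks` are made up for this sketch. The real instruction timings are in the P2 documentation.

```c
#include <assert.h>
#include <stdint.h>

#define HUB_SLICES 8u   /* assumption: 8 slices, each one long wide */

/* Slice that holds the long at a given hub byte address. */
unsigned hub_slice(uint32_t byte_addr)
{
    return (byte_addr >> 2) % HUB_SLICES;
}

/*
 * Extra clocks a cog waits until its window reaches the wanted slice.
 * slot_now is the slice the cog can access on the current clock.
 * Only the variable part of the access time is modelled here.
 */
unsigned hub_wait_clocks(uint32_t byte_addr, unsigned slot_now)
{
    assert(slot_now < HUB_SLICES);
    return (hub_slice(byte_addr) + HUB_SLICES - slot_now) % HUB_SLICES;
}
```

In this model, moving a variable from $1000 to $1004 (for example because one long was added earlier in the image) changes the wait from 0 to 1 clock for the same `slot_now`, even though the added code never runs.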

Wuerfel_21 · 2024-05-16 18:16

@ersmith said:

@ManAtWork said:
Any ideas how I can debug this??? I think I can rule out the following possible causes:

timing, code that isn't yet executed cannot affect the timing of previous actions

This is wrong, unfortunately. Hub alignment has a small but significant effect on timing, both for jumps (the hub cache miss timing) and for data accesses using WRLONG/RDLONG. So adding code can change timings even if that code is never executed, if you have either code or data in hub.

Yep. Absolutely obnoxious when micro-optimizing compiled code (either on the user code side or by dumpster diving into optimize_ir.c). Saving a few instructions in one part of the program can be outweighed by some alignments becoming worse because the function is now smaller. So you really need to look at the diff of assembly output between versions to see which one is actually better.

Speaking of, nothing stops you from editing the p2asm file flexspin emits and building that again. So if you just want to black-box test if the issue has to do with code size/alignment:

build good code and bad code with listing enabled
diff the listing to see where the code starts shifting to different addresses.
insert NOPs either inside or between functions so the addresses in the bad code match the addresses in the good code
build modified bad p2asm and verify through the listing that the addresses of functions are now the same
see if it still fails
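The padding arithmetic in the steps above could be sketched like this. It is a hypothetical helper, not part of flexspin: it assumes the function addresses have already been copied out of the two listings by hand, and that each NOP occupies one 32-bit long (4 bytes) of hub memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Index of the first symbol whose address differs, or n if all match. */
size_t first_shift(const uint32_t *good, const uint32_t *bad, size_t n)
{
    assert(good != NULL && bad != NULL);
    for (size_t i = 0; i < n; i++) {
        if (good[i] != bad[i])
            return i;
    }
    return n;
}

/*
 * Number of NOPs to insert in the bad build so that a symbol at
 * bad_addr moves up to good_addr. Returns -1 if padding cannot help,
 * i.e. the bad address is already higher or the gap is not a whole
 * number of longs.
 */
long nops_needed(uint32_t good_addr, uint32_t bad_addr)
{
    if (bad_addr > good_addr)
        return -1;
    uint32_t gap = good_addr - bad_addr;
    if (gap % 4u != 0)
        return -1;
    return (long)(gap / 4u);
}
```

If `nops_needed` returns -1 because the bad build is the larger one, the same experiment can be run the other way round by padding the good build instead.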

ManAtWork · 2024-05-16 18:29

We are not talking about a video or memory driver that has to hit exact cycle-by-cycle timing to work. It's quite a simple process. The P2 writes a command to a register and checks a busy flag until the command has been processed. I don't think that small timing changes of 1 to 8 cycles have an effect on it. But you're right, I can't prove it, and arguing about speculation is not useful.

Modifying the pasm listing is a good idea. This way I can change the semantics of the code without changing the memory layout. Or I can change the layout (inserting NOPs) without changing the semantics.

The most important thing to find out is where and why the code actually hangs.
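One possible way to narrow down where it hangs is to put an upper bound on the busy-flag polling, so that a stuck flag returns an error instead of looping forever. The sketch below is not the LAN9252 driver: `read_reg_fn`, `wait_not_busy`, `fake_chip` and `fake_status` are hypothetical names, the busy mask is whatever bit the real status register uses, and the fake chip only exists so the logic can be exercised on a PC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback that reads the chip's status register. */
typedef uint32_t (*read_reg_fn)(void *ctx);

/*
 * Poll until the busy bit clears or max_polls reads have been made.
 * Returns true if the chip became ready. *polls_used (optional)
 * receives the number of reads that were performed.
 */
bool wait_not_busy(read_reg_fn read_status, void *ctx, uint32_t busy_mask,
                   unsigned max_polls, unsigned *polls_used)
{
    assert(read_status != NULL);
    for (unsigned i = 1; i <= max_polls; i++) {
        if ((read_status(ctx) & busy_mask) == 0) {
            if (polls_used != NULL)
                *polls_used = i;
            return true;
        }
    }
    if (polls_used != NULL)
        *polls_used = max_polls;
    return false;               /* still busy: report, don't hang */
}

/* Stand-in for real hardware: stays busy for a fixed number of reads. */
struct fake_chip {
    unsigned busy_reads_left;
};

uint32_t fake_status(void *ctx)
{
    struct fake_chip *c = ctx;
    if (c->busy_reads_left > 0) {
        c->busy_reads_left--;
        return 0x80000000u;     /* busy bit position chosen arbitrarily */
    }
    return 0;
}
```

When the function returns false, the caller knows which command never completed and can dump the last register values at that point, which is less intrusive than scattering printf() calls through the code.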

ManAtWork · 2024-05-16 18:39

@evanh said:
Catalina does have support for Spin/Pasm along the lines of driver objects I think. It's maybe not drop-in though, but rather one has to adapt it. I'm guessing.

I just checked the feature list of Catalina

Includes a debugger for source level debugging.

Wow! That would be a great advantage. I can't tell how many hours I wasted by inserting and removing printf()s to my code. A true source level debugger with the ability to add breakpoints and single step through the code would save so much time that it would justify accepting some other disadvantages.

Rayman · 2024-05-16 21:27

I will say that I've found Catalina to be rock solid and have never found an issue that wasn't my fault.
FlexProp has a lot of advantages though, so one has to make a choice...

Rayman · 2024-05-16 21:35

Or, use both... I suppose if you make the code ANSI compliant, then you have a few options...
I think they even tried to use the same Propeller2.h Propeller.h files at one point for the P2 specific stuff...

ManAtWork · 2024-05-17 07:45

@ersmith said:

@ManAtWork said:
Any ideas how I can debug this??? I think I can rule out the following possible causes:

timing, code that isn't yet executed cannot affect the timing of previous actions

This is wrong, unfortunately. Hub alignment has a small but significant effect on timing, both for jumps (the hub cache miss timing) and for data accesses using WRLONG/RDLONG. So adding code can change timings even if that code is never executed, if you have either code or data in hub.

A quite simple explanation could be that if the write of the command fails for some reason, the read of the busy/ready flag will loop forever (*). If that is the case and I manage to trigger the scope on the last write, I should be able to see what's going wrong. The quad SPI driver uses the streamer to transfer data at 50MHz so it is actually very timing sensitive. I thought I had checked that, and the assembler part of the driver should be copied to cog RAM for execution so it shouldn't be affected by hub latency variations. But I'll double check that.

And I remember that there is actually a timing constraint mentioned in the data sheet. Two back-to-back write/read commands to the same register or cross-coupled registers (read depends on the write) are required to have a minimum gap of ~50..100ns between them. Otherwise the read data can be incorrect. I think that requirement will never be violated because the turnaround time for loading the cog RAM and starting the streamer is always longer. But I'll also check that.

Wait... In that case the read should reflect the (ready) state before the command was issued. This should not cause a hang but an early termination with a false result. Same for (*) above. Theoretically it should not hang but just return with an invalid value. But hey, it's Microchip. Things do not always behave as expected.

evanh · 2024-05-18 02:29

[accidental reply post]

RossH · 2024-05-18 11:49

@ManAtWork said:

@evanh said:
Catalina does have support for Spin/Pasm along the lines of driver objects I think. It's maybe not drop-in though, but rather one has to adapt it. I'm guessing.

I just checked the feature list of Catalina

Includes a debugger for source level debugging.

Wow! That would be a great advantage. I can't tell how many hours I wasted by inserting and removing printf()s to my code. A true source level debugger with the ability to add breakpoints and single step through the code would save so much time that it would justify accepting some other disadvantages.

You can see Catalina's debugger (called BlackBox) in action in the videos on this page.

evanh · 2024-06-01 02:53

I might be on Kubuntu 20.04 a little longer after all. I've just tried to move to 24.04 but I'm getting blank screen on boot to desktop. It's not an ordinary boot, sans the picture, either because the usual CTRL-ALT-DEL doesn't initiate a shutdown.

The newest Mesa 24.1 update, for the 20.04 desktop, arrived today as well. I had kind of figured those sorts of updates would be deprecated by now. But not so it seems.

rogloh · 2024-06-01 04:47

@evanh said:
I might be on Kubuntu 20.04 a little longer after all. I've just tried to move to 24.04 but I'm getting blank screen on boot to desktop. It's not an ordinary boot, sans the picture, either because the usual CTRL-ALT-DEL doesn't initiate a shutdown.

I had problems the first time I tried to install 24.04 on a MacBook Pro (Intel based). I tried a second time and it was okay. Not sure if the USB installer dd command failed the first time and something got corrupted and I had a similar blank screen with apparent lockup. Second time I waited a little longer and it came back up and let me install completely. After full installation I've seen no problems so far. Maybe try a second time...?

evanh · 2024-06-01 10:01

The installing steps went smooth. It's the reboot into the fully installed SSD that fell over.

I tried multiple ways. It worked on a totally fresh SSD but that's not how I wanted it. I've always upgraded by keeping the existing home partition intact. It'll be some small config file that is incompatible, I guess. Anyway that was enough to get me trying other options ... Currently experimenting with the Endeavour Gemini entry point to Arch Linux. It's the newbie way in.

Not everything quite as smooth as Kubuntu but so far so good. I thought there wasn't any sound for a while until I realised it was only the first menu that was incorrectly reporting no hardware present. Once I opened the full sound settings everything was there and all I had to do was select DisplayPort device for sound output. It's using the newly out of beta KDE 6.x so I guess that's a KDE bug. The price of bleeding edge packages. Not a show stopper. There is supposedly an easy rollback for going back to what works - something for later.

evanh · 2024-06-01 10:17

Nice, Firefox was of course fully up-to-date on the old system so I could happily copy the entire Firefox profile over and that includes all the plugins exactly as I had them configured. All of a sudden it's a completely identical browsing experience on both.

evanh · 2024-06-02 06:39

That audio menu that wasn't working is working now. Must have come to life after a reboot.

So only an issue while no device was selected. And I wonder if it's possibly dependent on what's detected via my particular driver set. ie: the developers possibly aren't aware of the bug.

evanh · 2024-06-02 13:03

There is an annoying issue: I can't produce screen tearing when desired! You'd think that's a good thing but, for testing, if tearing can't be produced on demand then you know there is something preventing it. And that something is bound to be interfering with the presented real frame rate - As opposed to the software measured rate.

I think I noticed this kick in when I switched to the newer Radeon RX6800, replacing the older Geforce GTX960 I had previously.

It's something I've attempted to resolve a few times but never found a solution. The fresh Arch Linux install, and subsequent testing, has reminded me about it ... time to dig out the Geforce I guess. That sucks because nVidia's binary blob driver installing, and deinstalling, has always been a pain to deal with.

evanh · 2024-06-02 23:05

Damn, it's dead. Must have been a kernel update or something. Grub can't find the Arch boot image any longer. I'm back on Kubuntu again. Didn't get to swapping GPUs, although I guess I can try that with Kubuntu instead.

ManAtWork · 2024-06-08 14:44

@evanh and @rogloh Errr, sorry, what are you talking about? Does it have anything to do with the original problem?

evanh · 2024-06-08 22:40

It was me trying to get going with Catalina and learning that Catalina is fussy about glibc versions. Aside from me rebuilding all of Catalina for my older glibc, which worked in the end, it also got me looking at moving to a new distro release, which didn't work out so well. Never got back to Catalina.

evanh · 2024-06-08 22:46

To me, your difficulty does feel like you're tripping on a Flexspin compiler bug, rather than one of your own making.

EDIT: That said, did you try making the heap bigger anyway?
EDIT2: And guess you've tried to replicate with smaller code size already? I've often found I can't simply replicate such problems with a tiny test program. Because the bug wasn't with my source code.

Wuerfel_21 · 2024-06-08 22:47

tripping on a Flexspin compiler bug

Also known as a flexsplorp

evanh · 2024-06-08 22:58

Roger's idea of selectively controlling the optimiser is a good one too. That can often narrow down where to find the compiler's problem. The optimisation controls can be used globally on the command line, and on functions individually. Read up on Eric's "general.md" document.

ersmith · 2024-06-08 23:20

I first thought it was an optimizer bug too, but as I mentioned in message #14 I looked at the generated code with/without printf and it was identical (except for the missing printf). That means the only real change in the two versions is in the memory layout.

If indeed adding/removing the printf causes something to hang before the printf is even called, this could mean that there's some extremely timing sensitive code somewhere that's breaking when moved around in hub. Or a pointer dereference that's going wrong and picking up incorrect data, which depends on what's in the memory. Changing the optimizer settings will also move code around, so this is consistent.

I'm not saying it's definitely not a compiler bug (maybe the compiler is the one getting the pointer wrong) but I wouldn't rule out the other possibilities.

EDIT: one way to test this theory is to add a block of nops to the code where the printf should be, to see if that changes the behavior in the same way.
