Fastest way to reverse nibbles?

in Propeller 2
For a PSRAM memory driver I am working on, I need to reverse the 8 nibbles in the command+address phase so that they can be sent least significant nibble first (the streamer assumes this).
What would be the fastest way to do this on the P2?
E.g.
input is $89ABCDEF
output is $FEDCBA98
REV doesn't help much as it reverses bits.
MOVBYTS doesn't help much unless there is a good way to swap nibbles in each byte first.
Something to do with SPLIT/MERGE perhaps?
I'd prefer not to do 8 ROLNIBs if it can be avoided but I might have to.
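For reference, here is a minimal Python sketch of the transformation being asked for, handy as a golden model when checking faster sequences on a PC. The function name `reverse_nibbles` is just an illustration; the loop shifts the result left by one nibble and brings in the next source nibble, which is roughly what a chain of eight ROLNIBs would do.

```python
def reverse_nibbles(x: int) -> int:
    """Reverse the order of the eight 4-bit nibbles in a 32-bit value."""
    result = 0
    for _ in range(8):
        result = (result << 4) | (x & 0xF)  # append the current lowest nibble
        x >>= 4                             # move on to the next nibble
    return result


print(hex(reverse_nibbles(0x89ABCDEF)))  # 0xfedcba98
```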
Comments
I think I can brute force it in 7 instructions like this:
        movbyts data, #%%0123
        mov     temp, data
        and     data, mask1
        and     temp, mask2
        shl     temp, #4
        shr     data, #4
        or      data, temp

mask1   long    $f0f0f0f0
mask2   long    $0f0f0f0f
Can it be done faster?
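The seven-instruction version above can be modelled step by step in Python to confirm it gives the expected result. This is only a sketch: `movbyts_reverse` is a hypothetical helper standing in for `MOVBYTS data, #%%0123` (byte order reversed), and each line is commented with the instruction it is meant to mirror.

```python
MASK32 = 0xFFFFFFFF


def movbyts_reverse(x: int) -> int:
    """Stand-in for MOVBYTS x, #%%0123: reverse the four bytes of a long."""
    return int.from_bytes(x.to_bytes(4, "little"), "big")


def reverse_nibbles_masked(data: int) -> int:
    data = movbyts_reverse(data)    # movbyts data, #%%0123
    temp = data                     # mov     temp, data
    data &= 0xF0F0F0F0              # and     data, mask1
    temp &= 0x0F0F0F0F              # and     temp, mask2
    temp = (temp << 4) & MASK32     # shl     temp, #4
    data >>= 4                      # shr     data, #4
    return data | temp              # or      data, temp


print(hex(reverse_nibbles_masked(0x89ABCDEF)))  # 0xfedcba98
```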
how about
        splitb  data
        rev     data
        movbyts data, #%%0123
        mergeb  data
not sure if that actually works though.
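One way to check a sequence like this without hardware is a bit-level model in Python. The sketch below assumes my reading of the instruction descriptions: SPLITB moves bit k of nibble j to bit j of byte k, MERGEB is its inverse, REV reverses all 32 bits, and `MOVBYTS #%%0123` reverses the byte order. The helper names are made up for the illustration.

```python
def bit(x: int, i: int) -> int:
    return (x >> i) & 1


def splitb(x: int) -> int:
    """Assumed SPLITB model: bit k of nibble j moves to bit j of byte k."""
    out = 0
    for k in range(4):
        for j in range(8):
            out |= bit(x, 4 * j + k) << (8 * k + j)
    return out


def mergeb(x: int) -> int:
    """Assumed MERGEB model (inverse of SPLITB)."""
    out = 0
    for k in range(4):
        for j in range(8):
            out |= bit(x, 8 * k + j) << (4 * j + k)
    return out


def rev(x: int) -> int:
    """REV model: reverse all 32 bits."""
    return int(f"{x:032b}"[::-1], 2)


def movbyts_reverse(x: int) -> int:
    """MOVBYTS x, #%%0123 model: reverse the four bytes."""
    return int.from_bytes(x.to_bytes(4, "little"), "big")


def reverse_nibbles_split(x: int) -> int:
    x = splitb(x)            # each byte now holds one bit position of all 8 nibbles
    x = rev(x)               # rev + movbyts together reverse the bits
    x = movbyts_reverse(x)   # within each byte, i.e. the nibble order
    return mergeb(x)         # reassemble the nibbles


print(hex(reverse_nibbles_split(0x89ABCDEF)))  # 0xfedcba98
```

Under those assumptions the model reproduces the $89ABCDEF to $FEDCBA98 example: after SPLITB, reversing the bits inside each byte is the same thing as reversing the nibble order.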
Also, isn't there a bit in the streamer config that swaps the nibbles in a byte? Then you could just do one MOVBYTS
You'd think so, but unfortunately it's not the case for the P2 when sending immediates with the LUT for translating 4 nibbles into 16 bits (4 memory devices ganged in parallel, i.e. 4 copies of 4 bits). There is no "a" bit in those streamer commands.
I'll take a look at that, that stuff typically does my head in.
Wow, it seems to work! Wuerfel_21's solution. Thank you!!!
Here's it working on the bus (address is ABCDFC, QPI read command is EB). It is in the correct bus order, streaming nicely.

Will Wuerfel_21's piece of brilliance be forgotten then lost in a couple of weeks, like the pixel doubling tricks?
It'll be in my code and is captured in the forum so it won't be lost. The trick will be finding it later. But this forum title is at least relevant for searching perhaps.
Rogloh: what tool are you using to capture that data?
Just my logic analyzer. An original Saleae Logic 8 bit one.
Very limited capture rate over USB (~12M samples/s or so) but does the trick for many projects when you underclock the P2.
It looked sorta like my Saleae display does, but the buttons were on the wrong side and I couldn't figure out how you did that. That explains it. Thanks!
BTW, fastest way to swap nibbles of each byte (streamer "alternate" bit emulation):
        mergeb  data
        rol     data, #16
        splitb  data
input is $89ABCDEF
output is $98BADCFE
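The three-instruction nibble swap can be checked the same way with a small Python model. As before this is a sketch under assumed semantics: MERGEB gathers bit n of every byte into nibble n, SPLITB is the inverse, and the helper names are made up. After MERGEB the upper 16 bits hold bits 4..7 of every byte and the lower 16 bits hold bits 0..3, so rotating by 16 exchanges the two halves before SPLITB puts everything back.

```python
def bit(x: int, i: int) -> int:
    return (x >> i) & 1


def mergeb(x: int) -> int:
    """Assumed MERGEB model: bit j of byte k moves to bit k of nibble j."""
    out = 0
    for k in range(4):
        for j in range(8):
            out |= bit(x, 8 * k + j) << (4 * j + k)
    return out


def splitb(x: int) -> int:
    """Assumed SPLITB model (inverse of MERGEB)."""
    out = 0
    for k in range(4):
        for j in range(8):
            out |= bit(x, 4 * j + k) << (8 * k + j)
    return out


def rol(x: int, n: int) -> int:
    """ROL model: rotate a 32-bit value left by n bits."""
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF


def swap_nibbles_in_bytes(x: int) -> int:
    x = mergeb(x)      # mergeb data
    x = rol(x, 16)     # rol    data, #16
    return splitb(x)   # splitb data


print(hex(swap_nibbles_in_bytes(0x89ABCDEF)))  # 0x98badcfe
```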
Here's an easy way to test it.
CON
  _clkfreq = 10_000_000

PUB go() : i
  i := $89ABCDEF
  org
        mergeb  i               'reverse nibble order within bytes
        rol     i,#16
        splitb  i
        movbyts i,#%%0123       'reverse byte order within long
  end
  debug(uhex_long(i))
Huh, intriguing, MERGEB and SPLITB are exchanged from Ada's code. So two different ways to swap nibbles in a byte.