Thanks for that info, Chip. I ended up removing the nop after I figured out that it was safe to perform a testp after the clock falling edge. So this works at 80MHz, clocking data at 10MHz.
SPIRD   rep     @.end,#8        ' 8 bits
        outnot  sck             ' clock (low high)
        outnot  sck
        testp   miso wc         ' read data from card
        rcl     tos,#1          ' shift in msb first
.end    ret
Peter, that wound up being the same as my ROM code:
'
' SPI byte in
'
spi_in  rep     @.in,#8         'ready to input a byte
        outh    #spi_ck         'clock pin high
        outl    #spi_ck         'clock pin low
        testp   #spi_dq wc      'sample data pin ('testp' is from before 'outh')
        rcl     x,#1            'save data bit
.in
        ret
Modifying that a little, and I'm not sure how much this would be valued, but here's a slight unrolling to even out the clock duty:
'
' SPI byte in
'
spi_in  outh    #spi_ck         'clock pin high
        rep     @.in,#7         'ready to input a byte
        outl    #spi_ck         'clock pin low
        testp   #spi_dq wc      'sample data pin ('testin' is from before 'outl')
        outh    #spi_ck         'clock pin high
        rcl     x,#1            'save data bit
.in
        outl    #spi_ck         'clock pin low
        testp   #spi_dq wc      'sample data pin ('testin' is from before 'outl')
        rcl     x,#1            'save data bit
        ret
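To put numbers on the duty-cycle difference between the two spi_in variants, here is a rough Python sketch. It assumes every instruction in the loop takes 2 clocks and that the pin follows OUTH/OUTL with the same fixed delay on both edges (so the delay does not change the duty). The function and list names are made up for illustration.

```python
# Assumption: each instruction occupies 2 system clocks, and the pin follows
# OUTH/OUTL with a fixed delay that is the same for both edges.
CLOCKS_PER_INSTRUCTION = 2

# One pass of the repeating part of each routine (mnemonics only).
ROM_LOOP = ["outh", "outl", "testp", "rcl"]
UNROLLED_LOOP = ["outl", "testp", "outh", "rcl"]


def clock_duty(loop):
    """Return (clocks_high, clocks_low) for one pass through a repeating loop."""
    # The level on entry is whatever the last OUTH/OUTL of the previous pass set.
    level = None
    for op in reversed(loop):
        if op in ("outh", "outl"):
            level = op
            break
    high = low = 0
    for op in loop:
        if op in ("outh", "outl"):
            level = op
        if level == "outh":
            high += CLOCKS_PER_INSTRUCTION
        else:
            low += CLOCKS_PER_INSTRUCTION
    return high, low


def bit_rate_hz(sysclk_hz, loop):
    """Bits per second when one pass of the loop shifts in one bit."""
    return sysclk_hz // (len(loop) * CLOCKS_PER_INSTRUCTION)


if __name__ == "__main__":
    print("ROM loop high/low clocks:", clock_duty(ROM_LOOP))
    print("Unrolled loop high/low clocks:", clock_duty(UNROLLED_LOOP))
    print("Bit rate at 80MHz:", bit_rate_hz(80_000_000, ROM_LOOP), "Hz")
```

Under those assumptions the original loop holds the clock high for 2 of every 8 clocks, while the unrolled version splits it 4 and 4. Both shift one bit per 8 clocks, which matches the 10MHz data rate at 80MHz mentioned earlier.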
Looks good, Evanh. The SPI memory outputs new data on the falling edge of the clock, so you'd want to sample just before the clock drops, which all our code does in these snippets.
My SPI code just toggles the clock so that I can enter with it high or low to cater for various chips. The data that I'm reading though is already available before I do any clocking and so the clock is simply outputting the next data bit.
Chip, do you think it's a good idea to use smartpins for SPI for the boot code? I will probably try this out next just so I have the options.
The SPI slave has to be clocking out on the rising clock edge! Otherwise the first bit is not present for the first shift.
The extra clock cycle needed at 80MHz is due to the unconstrained timing paths not allowing same-clock feedback from a register, through logic, out to a pin, back in from a pin, through more logic, then back into a register, all in 12.5ns.
Peter also reported issues at 40MHz, so those delays look to be significant.
Your notes for IN say "registered from physical pin", so I'm unclear where this slack is sneaking in ...
There's no need to suppose that some variable number of NOPs might be needed, based on frequency. It is safest to write code for the 80MHz timing situation, as it will always work.
Problem is, the real world does not quite work that way.
As long as the silicon can add an arbitrary extra clock, based on SysCLK and PVT, it will bite users.
In reality, the user codes until it works, and then later someone else may adjust the clock speed.
Even if they strive to carefully count cycles, there is no proof they have that right.
As your numbers show, even counting opcodes is too coarse, as the sampling can be mid-opcode.
So long as the silicon has the risk, proving it is field safe is very hard indeed.
i.e. you may just have a fluked 'good' part.
I went through the Verilog code to determine the IN/OUT paths to/from the cogs:
OUT               registered in cog on 'go' (on last clock of instruction)
                  registered in hub after OR'ing all cogs' OUT signals *
                  goes through smart pin logic to physical pin
                  (total delay = 1 clock after last clock of instruction)

                  DRVC #30        '2 (+1!) updates after 1 clock after 2-clock instruction

IN for D/S        registered from physical pin
                  goes through smart pin mux and logic/filtering
                  registered in hub for fan-out to all cogs *
                  registered in cog on 'go' (last clock of prior instruction)
                  (total delay = 3 clocks)

                  TESTB INB,#31   '(?3+) 2 samples 3 clocks before 2-clock instruction

IN for TESTP{N}   registered from physical pin
                  goes through smart pin mux and logic/filtering
                  registered in hub for fan-out to all cogs *
                  registered in cog on 'get' (first clock of instruction)
                  (total delay = 2 clocks, since it arrives in first clock of instruction)

                  TESTP #31       '(?2+) 2 samples 2 clocks before 2-clock instruction
* Can *maybe* be eliminated in ASIC
As you can see, TESTP/TESTPN get IN data that is one clock fresher than instructions which read INA/INB via D or S.
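For anyone who wants those clock counts as nanoseconds, here is a small Python sketch. The delay figures (1, 3 and 2 clocks) are taken straight from the list above. The dictionary keys and function names are just labels I made up, and the result is only as good as those clock counts.

```python
# Delays in system clocks, as listed in the post above.
DELAY_CLOCKS = {
    "OUT": 1,           # after the last clock of the instruction
    "IN via D/S": 3,    # sampled 3 clocks before the instruction
    "IN via TESTP": 2,  # sampled 2 clocks before the instruction
}


def clock_period_ns(sysclk_hz):
    """Length of one system clock in nanoseconds."""
    return 1e9 / sysclk_hz


def delay_ns(path, sysclk_hz):
    """Delay of one of the listed paths in nanoseconds."""
    return DELAY_CLOCKS[path] * clock_period_ns(sysclk_hz)


if __name__ == "__main__":
    for path in DELAY_CLOCKS:
        print(f"{path}: {delay_ns(path, 80_000_000):.1f} ns at 80MHz")
```

At 80MHz one clock is 12.5ns, so by these figures TESTP sees pin data that is 12.5ns fresher than a read of INA/INB through D or S.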
I'm not easily seeing a long path anywhere there, especially with the "registered from physical pin" on the pin input.
In terms of *possible* ASIC speed-ups, the IN path seems safer than OUT.
The OUT path includes a wide OR (slow), and you want the OUT pin to be close to the clock edge, to allow for the setup times of other external parts. A smart pin bypass sounds like a fast path.
For IN, what slack is 'registered in hub for fan-out to all cogs' swallowing, given there is no wide OR on IN?
Could that be easier to remove than the OUT case?
What exactly is "goes through smart pin mux and logic/filtering"?
The mux here is a simple bypass for no-smart-pin, and the filtering is default-off, right?
JMG,
I read it as the issue is on the output. Input is good, albeit with a fixed lag.
It looks like the big OR is hidden behind a register in the Hub. So that leaves delays from long route from the Hub and also any mux'ing with associated Smartpin. The Smartpin propagation should be very short.
I'm suspicious, without any changes, that the final silicon will be a lot improved simply because the long route won't have all the FPGA's selectable interconnect.
I can almost bet the problem is just the FPGA ... but I won't.
That's a good point, a scope should show which of the OUT, or IN, paths gets the bonus added clock.
Triggering the scope could be tricky. Maybe trigger from a pin toggled by a Smart Pin and compare it with the SW-driven pin; that would presume the Smart Pin never has this effect, and so cannot move?
An alternative could be to add a SysClk/N scope pin to the FPGA build, e.g. /128 using a fast sync counter, and then the SW can pace itself to the same total clocks. Such a fixed ref avoids all Smart Pin complexities, and hopefully 'fails last'.
That fast-test-pin could also check IN effects, by wiring to a Smart Pin.
From my understanding, one of the gotchas with FPGA is that the "OR"ing of the various "OUT"s come from a number of places...
1. each cog (ultimately the OUTx registers by way of various instructions)
2. smart pins
3. streamer
This results in a big OR gate. In the FPGA, only 4 or 6 lines can be "or"ed at a time. So cascading OR gates needs to be done. On top of that, we have the routing lines adding delays.
In the real silicon, the "OR" gate can have as many inputs as required, resulting in a single OR gate with less routing delays.
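As a rough illustration of the cascading, this Python sketch counts how many levels of k-input gates it takes to OR together N signals. It is only a back-of-envelope model: real synthesis may pack logic differently, and the input counts in the example (16 and 64) are assumptions picked for illustration, not figures from the Verilog.

```python
def or_tree_levels(num_inputs, gate_inputs):
    """Levels of gate_inputs-wide OR gates needed to combine num_inputs signals."""
    if num_inputs < 1 or gate_inputs < 2:
        raise ValueError("need at least 1 signal and gates with 2 or more inputs")
    levels = 0
    signals = num_inputs
    while signals > 1:
        # Each level reduces the signal count by the gate width (rounded up).
        signals = -(-signals // gate_inputs)
        levels += 1
    return levels


if __name__ == "__main__":
    for n in (16, 64):
        for k in (4, 6):
            print(f"{n} signals, {k}-input gates: {or_tree_levels(n, k)} levels")
```

Every extra level adds a gate delay plus routing between gates, which is the part the FPGA is expected to handle worse than real silicon.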
In SPI Mode0 (the preferred SD mode) & Mode3 (Flash supports modes 0 & 3):
Data is output on CLK falling edge (must be valid within 6-8ns after the falling edge)
Data is sampled on CLK rising edge (must be valid tds>2ns before, and thd>2ns after the rising edge)
Ah, thanks for the nudge, I've just had a nosy at a Microchip SPI flash datasheet. The detail I hadn't understood is that the last clock (rising) of the read command and address is also the first clock (falling) of the data reply.
The first data bit is always present for reading back before the SPI-In-routine clock even gets raised ... so only 7 data clocks will be needed in all those examples above.
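Here is a toy Python model of that behaviour, to show where the count of 7 comes from. It assumes a Mode 0 style slave that already has its first data bit on MISO when the read routine starts and moves on to the next bit on each falling clock edge. The class and function names are invented for the sketch, and no real timing is modelled.

```python
class Mode0Slave:
    """Toy slave: first bit already on MISO, next bit after each falling edge."""

    def __init__(self, data):
        # MSB-first bit stream of the reply bytes.
        self._bits = [(byte >> i) & 1 for byte in data for i in range(7, -1, -1)]
        self._pos = 0  # bit 0 was put out by the falling edge that ended the command

    @property
    def miso(self):
        # Past the end of the data, pretend the line idles high (an assumption).
        return self._bits[self._pos] if self._pos < len(self._bits) else 1

    def falling_edge(self):
        self._pos += 1


def read_byte(slave):
    """Sample 8 bits MSB first; return (value, clock pulses issued)."""
    value = 0
    pulses = 0
    for bit in range(8):
        value = (value << 1) | slave.miso  # sample before the clock drops
        if bit < 7:
            slave.falling_edge()           # one full pulse: rise, then fall
            pulses += 1
    return value, pulses


if __name__ == "__main__":
    print(read_byte(Mode0Slave(bytes([0xA5]))))
```

In this model the byte is complete after 7 pulses. Reading a following byte would need one more pulse first to bring its first bit out, so the saving only applies per transfer, not per byte.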
Chip,
I just bumped into a small group of TESTP/TESTPN instructions that are explicitly encoded for the pin I/O bits. I note there is also a general case of these instructions that looks to be 100% compatible. The similarity is such that the {#}S bit selection field from the general cases is also listed in the pin I/O cases but these ones don't actually use S field at all.
I'm thinking the pin I/O versions should just be aliases of the general versions.
For immediate pin numbers, you could use 'TESTB INx,#pin', but for variable pin numbers, you'd need extra code to resolve INx. That's where TESTP/TESTPN are needed.
Ha, I've probably asked the same thing in the past and forgotten the answer... it's a touch of deja vu just looking down the instruction list and seeing this virtually identical group of instructions.
3. SPLITB to REGEXP descriptions mention S, but D must be S and it would be less confusing to use D only.
Thanks! That "D,{#}S" was a mistake. I changed it to "{#}D", like it should have been.
There is a bunch of overlap between the TESTPx and the DIRx/OUTx/FLTx/DRVx encodings. The CZ bits differentiate the two sets. If the CZ bits are %01 or %10, it's TESTPx. If the CZ bits are %00 or %11, it's DIRx/OUTx/FLTx/DRVx. In other words, if one flag is affected, it's TESTPx. If neither or both flags are affected, it's the others.
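The rule is easy to express in code. This Python sketch assumes the 32-bit layout shown in the encodings in this thread (EEEE, 7-bit opcode, CZL, 9-bit D, 9-bit S), which puts C at bit 20 and Z at bit 19. The function names are made up for illustration.

```python
def cz_field(instruction):
    """Extract the 2-bit CZ field (C at bit 20, Z at bit 19) from a 32-bit word."""
    return (instruction >> 19) & 0b11


def classify_cz(cz):
    """Which instruction family a shared encoding decodes to, given its CZ bits."""
    c = (cz >> 1) & 1
    z = cz & 1
    # Exactly one flag affected means TESTPx; none or both means the pin ops.
    return "TESTPx" if c != z else "DIRx/OUTx/FLTx/DRVx"


if __name__ == "__main__":
    # EEEE=1111, opcode=1101011, CZL=100, D=5, S=001000000
    word = int("1111" "1101011" "100" "000000101" "001000000", 2)
    print(bin(cz_field(word)), classify_cz(cz_field(word)))
```

With CZL = %100 the example word has only C set, so it lands in the TESTPx group.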
I do now, however, see one difference: the bit range is 0-63 for the TESTPx instructions versus 0-31 for the TESTBx instructions.
EDIT: Oh, lol, I've kind of noted the same thing in a different way. You can't just runtime alias the two INx registers. Understood now.
1. TESTPx are shown with ,{#}S but there is no S. Is this a copy-and-paste error from TESTB?
2. TESTPx and the following DIRx instructions have identical opcodes.
EEEE 1101011 CZL DDDDDDDDD 001000110        TESTP   D,{#}S XORC/XORZ
EEEE 1101011 CZL DDDDDDDDD 001000110        DIRRND  {#}D {WCZ}
3. SPLITB to REGEXP descriptions mention S, but D must be S and it would be less confusing to use D only.
I have V26 working and my program works.
So I retried programming V27a (BeMicro_A9_Prop2_v27.jic 11/11/2017 2:39AM) and it worked.
But when I repower (replug the USB programming port) there are no Cog LEDs lit (shouldn't Cog 0 be lit?).
I have an RGB LED connected to P5, P7 & P9 via resistors to GND. These are ON, indicating these pins are High. On v26 the RGB LEDs are OFF.
PNut_v27a.exe 19/11/2017 8:31AM (dd/mm/yyyy format) cannot find the USB port for downloading - ie it cannot find the prop.
Do I have the correct versions? Am I missing something?
https://drive.google.com/file/d/1omGhklqFgAEEoR0jrSNupT_UUxO7Cye8/view?usp=sharing
Here's the overlapping opcodes expanded for clarity
EEEE 0100000 00I DDDDDDDDD SSSSSSSSS BITL D,S/# EEEE 0100000 01I DDDDDDDDD SSSSSSSSS TESTB D,S/# WZ EEEE 0100000 10I DDDDDDDDD SSSSSSSSS TESTB D,S/# WC EEEE 0100000 11I DDDDDDDDD SSSSSSSSS BITL D,S/# WCZ EEEE 0100001 00I DDDDDDDDD SSSSSSSSS BITH D,S/# EEEE 0100001 01I DDDDDDDDD SSSSSSSSS TESTBN D,S/# WZ EEEE 0100001 10I DDDDDDDDD SSSSSSSSS TESTBN D,S/# WC EEEE 0100001 11I DDDDDDDDD SSSSSSSSS BITHw D,S/# WCZ EEEE 0100010 00I DDDDDDDDD SSSSSSSSS BITC D,S/# EEEE 0100010 01I DDDDDDDDD SSSSSSSSS TESTB D,S/# ANDZ EEEE 0100010 10I DDDDDDDDD SSSSSSSSS TESTB D,S/# ANDC EEEE 0100010 11I DDDDDDDDD SSSSSSSSS BITC D,S/# WCZ EEEE 0100011 00I DDDDDDDDD SSSSSSSSS BITNC D,S/# EEEE 0100011 01I DDDDDDDDD SSSSSSSSS TESTBN D,S/# ANDZ EEEE 0100011 10I DDDDDDDDD SSSSSSSSS TESTBN D,S/# ANDC EEEE 0100011 11I DDDDDDDDD SSSSSSSSS BITNC D,S/# WCZ EEEE 0100100 00I DDDDDDDDD SSSSSSSSS BITZ D,S/# EEEE 0100100 01I DDDDDDDDD SSSSSSSSS TESTB D,S/# ORZ EEEE 0100100 10I DDDDDDDDD SSSSSSSSS TESTB D,S/# ORC EEEE 0100100 11I DDDDDDDDD SSSSSSSSS BITZ D,S/# WCZ EEEE 0100101 00I DDDDDDDDD SSSSSSSSS BITNZ D,S/# EEEE 0100101 01I DDDDDDDDD SSSSSSSSS TESTBN D,S/# ORZ EEEE 0100101 10I DDDDDDDDD SSSSSSSSS TESTBN D,S/# ORC EEEE 0100101 11I DDDDDDDDD SSSSSSSSS BITNZ D,S/# WCZ EEEE 0100110 00I DDDDDDDDD SSSSSSSSS BITRND D,S/# EEEE 0100110 01I DDDDDDDDD SSSSSSSSS TESTB D,S/# XORZ EEEE 0100110 10I DDDDDDDDD SSSSSSSSS TESTB D,S/# XORC EEEE 0100110 11I DDDDDDDDD SSSSSSSSS BITRND D,S/# WCZ EEEE 0100111 00I DDDDDDDDD SSSSSSSSS BITNOT D,S/# EEEE 0100111 01I DDDDDDDDD SSSSSSSSS TESTBN D,S/# XORZ EEEE 0100111 10I DDDDDDDDD SSSSSSSSS TESTBN D,S/# XORC EEEE 0100111 11I DDDDDDDDD SSSSSSSSS BITNOT D,S/# WCZ EEEE 1101011 00L DDDDDDDDD 001000000 DIRL D/# EEEE 1101011 01L DDDDDDDDD 001000000 TESTP D/# WZ EEEE 1101011 10L DDDDDDDDD 001000000 TESTP D/# WC EEEE 1101011 11L DDDDDDDDD 001000000 DIRL D/# WCZ EEEE 1101011 00L DDDDDDDDD 001000001 DIRH D/# EEEE 1101011 01L DDDDDDDDD 001000001 TESTPN D/# WZ EEEE 1101011 10L DDDDDDDDD 
001000001 TESTPN D/# WC EEEE 1101011 11L DDDDDDDDD 001000001 DIRH D/# WCZ EEEE 1101011 00L DDDDDDDDD 001000010 DIRC D/# EEEE 1101011 01L DDDDDDDDD 001000010 TESTP D/# ANDZ EEEE 1101011 10L DDDDDDDDD 001000010 TESTP D/# ANDC EEEE 1101011 11L DDDDDDDDD 001000010 DIRC D/# WCZ EEEE 1101011 00L DDDDDDDDD 001000011 DIRNC D/# {WCZ} EEEE 1101011 01L DDDDDDDDD 001000011 TESTPN D/# ANDZ EEEE 1101011 10L DDDDDDDDD 001000011 TESTPN D/# ANDC EEEE 1101011 11L DDDDDDDDD 001000011 DIRNC D/# {WCZ} EEEE 1101011 00L DDDDDDDDD 001000100 DIRZ D/# EEEE 1101011 01L DDDDDDDDD 001000100 TESTP D/# ORZ EEEE 1101011 10L DDDDDDDDD 001000100 TESTP D/# ORC EEEE 1101011 11L DDDDDDDDD 001000100 DIRZ D/# WCZ EEEE 1101011 00L DDDDDDDDD 001000101 DIRNZ D/# EEEE 1101011 01L DDDDDDDDD 001000101 TESTPN D/# ORZ EEEE 1101011 10L DDDDDDDDD 001000101 TESTPN D/# ORC EEEE 1101011 11L DDDDDDDDD 001000101 DIRNZ D/# WCZ EEEE 1101011 00L DDDDDDDDD 001000110 DIRRND D/# EEEE 1101011 01L DDDDDDDDD 001000110 TESTP D/# XORZ EEEE 1101011 10L DDDDDDDDD 001000110 TESTP D/# XORC EEEE 1101011 11L DDDDDDDDD 001000110 DIRRND D/# WCZ EEEE 1101011 00L DDDDDDDDD 001000111 DIRNOT D/# EEEE 1101011 01L DDDDDDDDD 001000111 TESTPN D/# XORZ EEEE 1101011 10L DDDDDDDDD 001000111 TESTPN D/# XORC EEEE 1101011 11L DDDDDDDDD 001000111 DIRNOT D/# WCZ
Thanks Brian. v27z jic is working. Are the SD pins mapped on this version, and if so, are they the P3x or P6x sets and is SW1 used?
I also found v27zz which has fixed SD mapping but I missed where the SD is mapped.
Thanks Chip & ozpropdev.
I am not finding any lockups on v27z. I have been running some code outputting to serial for >1hr so far.
Haven't yet figured out if SD is on ~P38 or ~P60.
Update: now been running for more than 6 hours, no lockups.
Last mention of "settask" instruction seems to be circa 2014...
I guess I didn't pay a whole lot of attention back then because there were 16 cores...
When HubExec was devised, coherent integration with the time-sliced threads was troublesome. HubExec was clearly the favourite of the two and, as you've noted, we suddenly had 16 true cores to play with, so no-one felt losing the threads was a significant loss.