Thanks, that works. I recall that in a previous version we had to use $7F for 60 MHz and $FF for 120 MHz. I'm testing my loader so that I can send it out to you, Peter and Cluso to run some of my test programs. I was able to get it to work at 40 MHz and 20 MHz, but I had to adjust my formula slightly for computing the number of bit cycles for 2 megabaud. I'll post it later today.
Also, does a 64-bit free running counter in the hub make any sense?
Might be useful for measuring longer event periods now that the possible clock frequency is so high. With a 32-bit counter, rollover on the P1 at 80 MHz comes at ~53 seconds; on the P2 at 240 MHz it would come in less than 18 seconds.
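For anyone who wants to check those rollover figures, here is a minimal sketch of the arithmetic. The function name `rollover_seconds` is just something I made up for illustration; the only assumption is a free-running counter that increments once per system clock.

```python
def rollover_seconds(clock_hz: float, bits: int = 32) -> float:
    """Seconds until a free-running counter of the given width wraps."""
    return (1 << bits) / clock_hz

# 32-bit counter at the two clock rates mentioned above
for label, hz in (("P1 @ 80 MHz", 80e6), ("P2 @ 240 MHz", 240e6)):
    print(f"{label}: 32-bit counter wraps after {rollover_seconds(hz):.1f} s")

# The same clock with a 64-bit counter, expressed in years
SECONDS_PER_YEAR = 365.25 * 24 * 3600
years_64 = rollover_seconds(240e6, bits=64) / SECONDS_PER_YEAR
print(f"P2 @ 240 MHz: 64-bit counter wraps after roughly {years_64:.0f} years")
```

The point of the last line is only that a 64-bit count at these clock rates effectively never wraps.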
The Prop2 has that much shorter attention span. You'd be torturing it!
Hehe, being serious, it would be a secondary 32-bit counter. A single 64-bit counter is not easy to grapple with when all instructions only have 32-bit data access. Once that is true then this new counter may as well be independent.
Amusingly, RDPIN is a 2-clock instruction. A smartpin can be set up as a tick counter. There's bound to be an unused smartpin amongst the SPI boot pins. We could commonly make use of that in all software, say, as a de facto microsecond counter. A steady 71.58-minute rollover, irrespective of sysclock rate.
Don't CORDIC operations return 64-bit results? Maybe we can tie into that?
Use getqx, getqy...
Funny you bring that up. The CORDIC doesn't take 64-bit inputs. This is important. This detail hit home for me when I was doing some multiplies and figured I could use the fast 16x16 multiply instruction instead of cordic's multiply. Turns out that multiplying the same variable by ten a few times quickly makes it leap out of the 16-bit input limit.
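To make the 16-bit input limit concrete, here is a small Python model. It assumes the fast multiply uses only the low 16 bits of each operand and produces a full 32-bit product; `mul16` and `first_bad_step` are hypothetical helper names, not anything from the P2 tools.

```python
MASK16 = 0xFFFF

def mul16(d: int, s: int) -> int:
    """Model of a 16x16 unsigned multiply: inputs are truncated to 16 bits."""
    return (d & MASK16) * (s & MASK16)

def first_bad_step(start: int, factor: int = 10, max_steps: int = 20):
    """Return the first repeated-multiply step where the model diverges from exact math."""
    true_value = start
    model_value = start
    for step in range(1, max_steps + 1):
        true_value *= factor
        model_value = mul16(model_value, factor)
        if model_value != true_value:
            return step
    return None

# Starting from 7: 70, 700, 7000, 70000 are all fine, because each *input* still fit in 16 bits.
# The next multiply takes 70000 as an input, which gets truncated to 4464.
print(first_bad_step(7))
```

Note that the failure shows up one step after the value first exceeds $FFFF, since it is the input that gets truncated, not the product.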
The Cordic summary says this:
32-bit, pipelined CORDIC solver with scale-factor correction
32 x 32 unsigned multiply (not stated is 64b result)
64 / 32 unsigned divide
64 → 32 square root
or did you mean other operations ?
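A quick model of what "32 x 32 with a 64-bit result" means in practice: the product comes back as two 32-bit halves. My understanding is that GETQX fetches the low half and GETQY the high half after a multiply, but treat that mapping as an assumption; `qmul_model` is only an illustrative name.

```python
MASK32 = 0xFFFFFFFF

def qmul_model(d: int, s: int):
    """Model a 32x32 unsigned multiply returning (low 32 bits, high 32 bits)."""
    product = (d & MASK32) * (s & MASK32)
    return product & MASK32, product >> 32

# Worst case: both inputs at maximum
low, high = qmul_model(0xFFFFFFFF, 0xFFFFFFFF)
print(f"low  = ${low:08X}")
print(f"high = ${high:08X}")

# Recombining the halves gives back the full 64-bit product
assert (high << 32) | low == 0xFFFFFFFF * 0xFFFFFFFF
```

So the inputs are 32-bit and only the result is 64 bits wide, which is why it doesn't directly help with feeding in a 64-bit count.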
XORO32 D has a 64-bit output: 32-bit state to D and 32-bit PRN injected into S-field of next instruction. We don't want GETCT to always take four cycles, however.
I think GETCT could write to the flags just the same as GETRND. The two opcodes differ only in bit 0 so there should be no logic cost.
Huh, a smartpin won't do it. The Z register doesn't update until a rollover in the NCO modes, and there's not really anything else suitable. And you lose control of IN and OUT for bit bashing anyway, so it never was suitable to share with, say, an SPI chip select pin.
So the only way to have a microsecond or millisecond counter would be to do it in software.
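One common way to do it in software is to extend the 32-bit count by watching for wraparound. This is a generic sketch, not P2 code; `ExtendedCounter` is a made-up name, and it only works if `update` is called at least once per rollover period of the 32-bit counter.

```python
MASK32 = 0xFFFFFFFF

class ExtendedCounter:
    """Extend a wrapping 32-bit tick count to 64 bits by counting wraps."""

    def __init__(self, first_sample: int = 0):
        self.last = first_sample & MASK32
        self.high = 0  # number of wraps seen so far

    def update(self, sample: int) -> int:
        """Feed a fresh 32-bit sample; return the extended 64-bit count."""
        sample &= MASK32
        if sample < self.last:  # the counter wrapped since the last sample
            self.high += 1
        self.last = sample
        return (self.high << 32) | sample

# Usage: samples taken just before and just after a wrap
counter = ExtendedCounter(0xFFFFFFF0)
print(hex(counter.update(0xFFFFFFFF)))
print(hex(counter.update(0x00000010)))
```

Dividing the extended count by ticks-per-microsecond then gives a microsecond counter, at the cost of a cog (or interrupt) polling often enough not to miss a wrap.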
P2-Hot had a 64 bit counter. I forget how the upper 32-bits were read, but there was a way to do it.
P2Hot had GETCNT and GETCNTX.
Here's a snippet from the old docs.
The hub contains a 64-bit counter called CNT that increments on each clock cycle. Each cog can use CNT to mark time in various ways. On chip reset, the ROM Booter initializes CNT to $00000000_00000000, from which point it begins incrementing.
Here are the instructions which relate to CNT:
GETCNT D     Get CNT[31..0] into D.
GETCNTX D    Get CNT[63..32], delayed by 1 clock, into D. A single-task program executing a GETCNT, immediately followed by a GETCNTX, would get a 64-bit snapshot of CNT.
Good read Oz. Hehe, it was really two 32-bit counters, one chained from the other. Hence the reason why the paired single-cycle instructions provided a coherent read.
I'm guessing WAITCNT didn't exist or didn't apply to the whole 64-bits.
BTW
On the P2 @ 80MHz the WAITCTx range is ~26 secs.
The CT overflow was changed in V32b IIRC.
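The ~26 second figure matches 2^31 ticks at 80 MHz, which is what you would get if the wait compares the counter against the target as a signed 32-bit difference. That interpretation is my assumption and isn't confirmed in this thread; the sketch below just shows how such a compare behaves. Function names are illustrative.

```python
MASK32 = 0xFFFFFFFF

def target_reached(ct: int, target: int) -> bool:
    """Assumed compare: target counts as reached when (ct - target) is non-negative as a signed 32-bit value."""
    diff = (ct - target) & MASK32
    return diff < 0x80000000

def max_wait_seconds(clock_hz: float) -> float:
    """Longest wait that such a half-range compare can represent."""
    return (1 << 31) / clock_hz

print(f"Max wait at 80 MHz: {max_wait_seconds(80e6):.1f} s")

# The compare still works when the counter has wrapped past the target
print(target_reached(0x00000005, 0xFFFFFFF0))
# ...and reports "not yet" when the target is a short way ahead
print(target_reached(0x00000000, 100))
```

With this kind of compare, a target more than half the counter range ahead looks like it is already in the past, hence the halved usable range.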
Chip,
I'm of the opinion the LOC instruction has no useful purpose for PC-relative addressing mode. Neither for the instruction encoding itself nor the generated data it writes.
Even if the Prop2 hardware is left with the ability, I think Pnut should stop generating any relative encodings for LOC in the machine code. Having to use #\@label every time, to be sure of correct encoding, is just messy.
Assemblers are meant to be low level, and they should generate what you tell them. If you want LOC to generate an absolute address you should use "\@ ". If for some bizarre reason you want a relative address you should use "@ ". Who knows, there might be a reason to use a relative address, such as for position-independent code. It's reasonable for the assembler to have a mode where it will generate warnings if you do something unusual, but that could be controlled by a setup flag.
There's no sense in LOC ever wanting a relative address. And as it stands, there is a bug in the way Pnut tries to generate relative addresses inappropriately.
So if LOC should never use relative addressing, then this is more than an assembler issue. It's also an issue with the hardware since it allows for relative addressing. I guess it was easier to implement LOC that way since it uses the same instruction format as CALLD. One problem with LOC is that it only allows writing into PA, PB, PTRA or PTRB. Maybe the R bit could be used to increase the register range from $126-$129 to $122-$129. The alternative to LOC is to use MOV with a 32-bit source value. Of course, that requires an extra long and 2 extra cycles.
I'll probably add a warning message to p2asm if LOC is used with relative addressing. However, I'll still allow it just in case someone comes up with an imaginative way to use relative addresses with LOC.
^^^this
Assemblers don't 'correct' what the software author meant. That way lies chaos.
Throw an error, throw a warning, don't silently correct.
LOC with relative addresses seems like it could be pretty useful for position independent code. Think of it as LEA rather than MOV and it may make more sense.
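To illustrate the LEA analogy: a PC-relative offset stays valid when the whole code block is loaded at a different base, while an absolute address would need patching. This is a generic sketch with made-up addresses and a 4-byte instruction size assumed; it does not reproduce the actual LOC encoding.

```python
INSTR_SIZE = 4  # assumed instruction size in bytes

def encode_relative(instr_addr: int, target_addr: int) -> int:
    """Offset from the address of the following instruction to the target."""
    return target_addr - (instr_addr + INSTR_SIZE)

def resolve_relative(instr_addr: int, offset: int) -> int:
    """Address recovered at run time from the current PC and the stored offset."""
    return instr_addr + INSTR_SIZE + offset

# Assemble at base $400: instruction at $400, label at $480
offset = encode_relative(0x400, 0x480)

# Load the same code $1000 bytes higher without re-assembling
relocation = 0x1000
resolved = resolve_relative(0x400 + relocation, offset)
print(f"offset = ${offset:X}, resolved = ${resolved:X}")

# The relative form tracks the move; the absolute $480 would now be stale
assert resolved == 0x480 + relocation
```

That is the case for keeping the relative form available, even if absolute is what most code wants.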
WAITX'ing longer than a few seconds isn't something of value. This one seems okay for the moment.
Events, maybe. But then 53 seconds wasn't a huge long time either. Long interval events will have to be done another way.
WAITX is less clear as to whether it is the same or not. The description says 2+D so I'm thinking its duration is as per Prop1.
Further reading - https://forums.parallax.com/discussion/comment/1456687/#Comment_1456687
Yup - my post was intended to provide violent agreement :-)