From the way I read it, and given that SETQ would be needed to specify more than 32 pins, I assumed the pin span did cross ports, but alas, I tested it and it does not.
Check pin states where h and l are input states and H and L are output states, then from pin 30 set 8 pins low and check.
So SETQ cannot be used to specify more than 32 pins anyway.
EDIT: This is also a perfect way to sync smartpins. I tried setting P24 for 8 pins as a PIN specifier and then set them to NCO mode at 1 MHz; they were all perfectly synced.
It was not possible to involve both A and B ports in the pin span, since the data-forwarding circuitry only handles one 32-bit register. The bit span wraps, as well.
So, it's not a bug, but it's not really a feature, either.
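To make the wrap concrete, here is a small Python model (not P2 code) of which pins a pin field selects. It assumes the span wraps modulo 32 inside the port of the base pin, which is how I read the explanation above, and it uses the ADDPINS formula from the operator list in this thread. The function names are made up for illustration.

```python
def addpins(base, extra):
    """Build a pin field: (base & $3F) | (extra & $1F) << 6, per the ADDPINS description."""
    return (base & 0x3F) | ((extra & 0x1F) << 6)

def pins_in_field(field):
    """List the pins a field selects, assuming the span wraps within one 32-bit port."""
    base = field & 0x3F
    extra = (field >> 6) & 0x1F
    port = base & 0x20  # 0 = port A (P0..P31), 32 = port B (P32..P63)
    return [port | ((base + i) & 0x1F) for i in range(extra + 1)]

# The test described above: start at pin 30 and ask for 8 pins
print(pins_in_field(addpins(30, 7)))  # [30, 31, 0, 1, 2, 3, 4, 5]
```

Under that assumption the span never reaches P32 and up; it comes back around to P0 instead.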
This is a case where I'd wonder about violating the RISC philosophy and having this instruction take 4 (or more) clock cycles, so as to give the expected result.
> @Rayman said:
> This is a case where I'd wonder about violating the RISC philosophy and having this instruction take 4 (or more) clock cycles, so as to give the expected result.
We could have done it all in 2 clocks, but it would have grown the data-forwarding circuit immensely. Didn't seem like it was worth doing.
IMHO this is not a bug, but the standard (and quite logical) way things have been done on any computer, microprocessor, or microcontroller I have worked with. Having such a feature would not only make the logic circuitry more complicated, it would also open up a whole new set of potential bugs.
Think I just found a bug in some code related to SHR...
I assumed that SHR with a source value >31 would result in destination:=0.
But, seems this type of instruction only looks at the lower 5 bits.
Wouldn't it be better if SHR with source>31 made result zero?
There are many instructions that have unused source bits but I would not call that a bug. To have it handle the unused bits requires extra logic and do you really want it to work that way anyway?
I just realized that I had marked the source fields of these instructions in my formatted copy of the instruction sheet as xxxxSSSSS, but that is only correct for immediate mode, or more correctly only 5 bits of the source data.
If it's a bug, it's a bug shared by x86, ARM, and RISC-V -- it seems to be pretty standard in microprocessors to use only the lower 5 bits of the shift amount. I agree that it's unfortunate (it would make more sense to output 0 for values > 31) but the P2 is in good company there.
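For anyone bitten by this, a quick Python model (not P2 code) of the behaviour described above: the shift count is masked to its lower 5 bits, so a count of 32 acts like 0. The second function is only a sketch of the "zero for counts above 31" behaviour some of us expected; both names are invented here.

```python
MASK32 = 0xFFFFFFFF

def p2_shr(value, count):
    """Model of SHR as described: only the lower 5 bits of the count are used."""
    return (value & MASK32) >> (count & 0x1F)

def shr_or_zero(value, count):
    """What some callers expect instead: any count above 31 gives 0."""
    count &= MASK32
    return 0 if count > 31 else (value & MASK32) >> count

print(hex(p2_shr(0x80000000, 32)))       # 0x80000000 (count 32 behaves like 0)
print(hex(shr_or_zero(0x80000000, 32)))  # 0x0
```

In real code the equivalent fix is a compare against 31 before the shift.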
Hitachi/Renesas/whoever-owns-it-this-week SuperH processors (if you know nothing about them, just know that narrow loads/stores are sign-extended by default, which is enough to be scared of them for life) do something like that with their SHLD instruction: if you shift left by -32, you get zero... (yet it still only looks at the bottom five bits and the sign bit, so -33 is the same as -1). This wacky nonsense, plus the fact that SHLD is the only dynamic shift instruction (aside from SHAD, which does arithmetic shifts, you just get fixed 1/2/8/16-bit shifts and a 1-bit rotate, oof), makes bitwise ops real slow on those, lol.
Spin2: >| changed to ENCOD
...and with an operational difference. In the P1, >| returns the highest bit set plus one, or 0 if the tested value is 0. In the P2, ENCOD returns the highest bit set, and returns 0 even if the tested value is 0.
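Here is the difference as a Python model (not Spin), based only on the description above. The function names are invented for the comparison.

```python
def p1_encode(x):
    """P1 '>|': position of the highest set bit plus one, 0 when x is 0."""
    return (x & 0xFFFFFFFF).bit_length()

def p2_encod(x):
    """P2 ENCOD: position of the highest set bit, 0 when x is 0."""
    x &= 0xFFFFFFFF
    return x.bit_length() - 1 if x else 0

for value in (0, 1, 0x80000000):
    print(hex(value), p1_encode(value), p2_encod(value))
# 0x0 0 0
# 0x1 1 0
# 0x80000000 32 31
```

Note that ENCOD gives 0 for both 0 and 1, so ported code that relied on >| returning 0 only for an empty value needs a separate zero check.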
Spin2: Return value(s) must be declared.
In the P1 we could do this in a method:
return 5
or
result := 5
The P2 doesn't have a default return value, hence it must be declared. To emulate the single return value of the P1, update the method declaration with an explicit return value. Now return will work as in the P1.
Here are the changes I had to make to Graphics.spin to make it work in Fastspin on P2:
'MINS-->FGES
'MAXS-->FLES
'MOVS-->SETS
':-->. (many places)
'SETD-->SETD2 'SETD is an instruction in P2
'cmps wc,wr --> subs wc
'command := cmd << 16 + argptr --> command := cmd << 24 + argptr
'Using QROTATE to fix missing sin table
'Used MUL to speed multiply
The command change is due to the larger address space.
A lot of P1 code assumed the address fit in the lower word.
But now 20 bits are used for the address (the upper 12 bits are ignored by the hardware).
Also, I found that PTRA now takes the place of PAR when passing a pointer from hub to cog via coginit...
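A small Python model (not Spin) of why the shift had to change. It assumes the receiving side splits the long back into a command in the upper bits and an address in the lower bits; the helper names and the example values are made up.

```python
def pack(cmd, argptr, shift):
    """Combine a command and a hub pointer into one long, as in 'cmd << shift + argptr'."""
    return ((cmd << shift) + argptr) & 0xFFFFFFFF

def unpack(command, shift):
    """Split the long back into (cmd, argptr), assuming the same shift on the receiving side."""
    return command >> shift, command & ((1 << shift) - 1)

hub_address = 0x12345  # needs more than 16 bits, which is possible with 20-bit addresses

print(unpack(pack(3, hub_address, 16), 16))  # (4, 9029): command and pointer both corrupted
print(unpack(pack(3, hub_address, 24), 24))  # (3, 74565): round-trips correctly
```

With a 16-bit shift, any address above $FFFF spills into the command field, which is the failure the change to << 24 avoids.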
'Trying new approach to jump table (suggested by Wuerfel_21)
'ALTGB D, S will select byte #D from table starting at S
'Getbyte D brings that byte into D
'JMP to that D (and not #D) takes you where you need to go.
altgb t1,#jumps
getbyte t1
jmp t1
long
jumps byte 0 '0
byte setup_ '1
byte color_ '2
byte width_ '3
byte plot_ '4
byte line_ '5
byte arc_ '6
byte vec_ '7
byte vecarc_ '8
byte pix_ '9
byte pixarc_ 'A
byte text_ 'B
byte textarc_ 'C
byte textmode_ 'D
byte fill_ 'E
byte loop 'F
Also, when moving graphics to hubexec I noticed that:
1. Had to leave self-modifying PASM in the cog. These two lines were operated on by SETS:
DAT 'NeededInCog -This part doesn't work in hubexec because target of sets needs to be in cog
NeededInCog
ToShifts
shift0 shl mask0,#0 'position slice
shift1 shr mask1,#0
jmp #DoneShifts
2. Things like TJZ jumping back into the cog don't work; I had to replace them with CMP and JMP instructions.
And PTRA can be modified, whereas PAR was a fixed value that stayed the same while the cog was active.
Here is the simplified list of P2 operators with P1 comparisons/changes.
Spin Operators
*  New operator in P2
** Behavioral change in P2

P2            P1          Description
-----------------------------------------------------------------
++ (pre)      ++          Pre-increment
-- (pre)      --          Pre-decrement
?? (pre)      ?        ** XORO32, iterate and return pseudo-random
++ (post)     ++          Post-increment
-- (post)     --          Post-decrement
!! (post)                 Post-logical NOT
! (post)                  Post-bitwise NOT
\ (post)                  Post-set
~ (post)      ~           Post-set to 0
~~ (post)     ~~          Post-set to -1
!             !           Bitwise NOT, 1's complement
-             -           Negation, 2's complement
ABS           ||       *  Absolute value
ENCOD         >|       ** Encode MSB, 31..0
DECOD         |<       *  Decode, 1 << (x & $1F)
BMASK                     Bitmask, (2 << (x & $1F)) - 1
ONES                      Count ones
SQRT          ^^       *  Square root of unsigned x
QLOG                      Unsigned to logarithm
QEXP                      Logarithm to unsigned
>>            >>          Shift right, insert 0's
<<            <<          Shift left, insert 0's
SAR           ~>       *  Shift right, insert MSB's
ROR           ->       *  Rotate right
ROL           <-       *  Rotate left
REV           ><       *  Reverse y LSBs of x and zero-extend
ZEROX                     Zero-extend above bit y
SIGNX         ~, ~~    ** Sign-extend from bit y
&             &           Bitwise AND
^             ^           Bitwise XOR
|             |           Bitwise OR
*             *           Signed multiply
/             /           Signed divide, return quotient
+/                        Unsigned divide, return quotient
//            //          Signed divide, return remainder
+//                       Unsigned divide, return remainder
SCA                       Unsigned scale (x * y) >> 32
SCAS                      Signed scale (x * y) >> 30
FRAC                      Unsigned fraction {x, 32'b0} / y
+             +           Add
-             -           Subtract
#>                        Ensure x => y, signed
<#                        Ensure x <= y, signed
ADDBITS                   Make bitfield, (x & $1F) | (y & $1F) << 5
ADDPINS                   Make pinfield, (x & $3F) | (y & $1F) << 6
<             <           Signed less than (returns 0 or -1)
+<                        Unsigned less than (returns 0 or -1)
<=            =<       *  Signed less than or equal (returns 0 or -1)
+<=                       Unsigned less than or equal (returns 0 or -1)
==            ==          Equal (returns 0 or -1)
<>            <>          Not equal (returns 0 or -1)
>=            =>       *  Signed greater than or equal (returns 0 or -1)
+>=                       Unsigned greater than or equal (returns 0 or -1)
>             >           Signed greater than (returns 0 or -1)
+>                        Unsigned greater than (returns 0 or -1)
<=>                       Signed comparison (<,=,> returns -1,0,1)
!!, NOT       not      *  Logical NOT (x == 0, returns 0 or -1)
&&, AND       and      *  Logical AND (x <> 0 AND y <> 0, returns 0 or -1)
^^, XOR       xor      ** Logical XOR (x <> 0 XOR y <> 0, returns 0 or -1)
||, OR        or       ** Logical OR (x <> 0 OR y <> 0, returns 0 or -1)
? :                    *  If x <> 0 then choose y, else choose z
:=            :=       ** Set var(s) to x
                          P2: v1,v2,... := x,y,... ' set v1 to x, v2 to y, etc. '_' = ignore
Complex math functions
---------------------------------------------------------------------------------------------------
var_x,var_y := ROTXY(x,y,t) Rotate cartesian (x,y) by t and assign resultant (x,y)
var_r,var_t := XYPOL(x,y) Convert cartesian (x,y) to polar and assign resultant (r,t)
var_x,var_y := POLXY(r,t) Convert polar (r,t) to cartesian and assign resultant (x,y)
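A few of the new operators in the list come with formulas, so here is a Python model (not Spin) that simply transcribes those formulas on 32-bit values. It is a sketch of the arithmetic as listed, not a statement about how the compiler implements them, and the function names are just the operator names in lowercase.

```python
MASK32 = 0xFFFFFFFF

def decod(x):
    """DECOD: 1 << (x & $1F)"""
    return (1 << (x & 0x1F)) & MASK32

def bmask(x):
    """BMASK: (2 << (x & $1F)) - 1"""
    return ((2 << (x & 0x1F)) - 1) & MASK32

def sca(x, y):
    """SCA: unsigned (x * y) >> 32"""
    return ((x & MASK32) * (y & MASK32)) >> 32

def frac(x, y):
    """FRAC: unsigned {x, 32'b0} / y (y must be non-zero in this model)"""
    return (((x & MASK32) << 32) // (y & MASK32)) & MASK32

print(decod(35))                  # 8, because only the lower 5 bits of x are used
print(hex(bmask(3)))              # 0xf
print(sca(0x80000000, 10))        # 5, i.e. 10 scaled by one half
print(hex(frac(1, 2)))            # 0x80000000, i.e. one half as a 32-bit fraction
```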
Floating Point Constants
I think this applies to both P1 and P2, but it bit me a few days ago. It's convenient to use floating point math in the CON section, but it's very unlikely that your program wants IEEE754 floating point values. When processed as an integer, the value will be much different than you expect.
This is a snippet from a VGA driver bundled with PNut. This is fine:
fpix = 40_000_000
...
qfrac ##fpix,pa
fpix = 40_000_000.0 ' PROBLEM
...
qfrac ##fpix,pa ' PROBLEM - IEEE754 value inserted where P2 most likely expects an integer
fpix = 40_000_000.0
...
qfrac ##round(fpix),pa ' FIXED use round() or trunc() to convert float constant to integer
Should there be a warning about this? On Fastspin, round() or trunc() on an integer constant causes an error.
There's another difference with these types of constant definitions: PNut auto-promotes to floats, while Fastspin does not and will complain when integers and floats are mixed without explicit casts.
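To see how far off the value is, this Python snippet (not Spin) shows the raw 32-bit pattern of the float constant next to the integer you actually wanted. It assumes the constant is stored as an IEEE754 single, as described above; float_bits is a helper written for this post.

```python
import struct

def float_bits(value):
    """Return the 32-bit IEEE754 single-precision pattern of value as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", value))[0]

fpix = 40_000_000.0

print(hex(float_bits(fpix)))  # 0x4c189680, what an instruction sees if the float is used raw
print(round(fpix))            # 40000000, what the code intended
```

So the raw constant reads as roughly 1.28 billion instead of 40 million.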
Spin2: >< (reverse operator) replaced with REV
REV behaves differently, as well. For example, this snippet of code preps a value for a bit-banged SPI output in the P1.
if (mode == LSBFIRST)
outbits ><= 32 ' flip bits, align lsb to bit31
else
outbits <<= (32-bits) ' align msb to bit31
The modification for running on the P2 is:
if (mode == LSBFIRST)
outbits rev= 31 ' flip bits, align lsb to bit31
else
outbits <<= (32-bits) ' align msb to bit31
Note that the value used in the P2 is the last bit to be preserved; REV will reverse bits 0 through the target bit, then clear everything above the target to 0.
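Here is that REV behaviour as a Python model (not Spin), following the description above: bits 0 through y are reversed and everything above bit y is cleared. The name p2_rev is made up for the sketch.

```python
def p2_rev(x, y):
    """Reverse bits 0..y of x (y limited to 0..31) and clear all bits above y."""
    n = (y & 0x1F) + 1          # number of bits taking part in the reversal
    x &= (1 << n) - 1           # drop everything above the target bit
    out = 0
    for i in range(n):
        if x & (1 << i):
            out |= 1 << (n - 1 - i)
    return out

print(hex(p2_rev(0x00000001, 31)))  # 0x80000000, the LSB ends up in bit 31
print(bin(p2_rev(0xF1, 3)))         # 0b1000, bits above bit 3 are cleared
```

The first line matches the SPI case: with 31 as the operand, the LSB is aligned to bit 31.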
1) MOVS and MOVD are renamed to SETS and SETD respectively. This is noted in the 2nd post in the thread.
2) The P2's pipeline requires an additional instruction between the SETS/SETD and the instruction that uses it.
There is an instruction timing diagram in the "Assembly Language" section of the Documentation.
1. Ib read ' SETD
2. Db,Sb read
3. Ic read, Ra write ' First delay instruction
4. Dc,Sc read
5. Id read, Rb write ' SETD is writing result here, reading the same location does not result in new data
6. Dd,Sd read
7. Ie read, Rc write ' Can read Rb now
setd .loop,#Temp_Data
nop ' Seems like P2 needs an additional delay
add t3,#1 ' Address the next data register
.loop wrbyte 0-0,t3 ' Write the data bytes into hub memory
add .loop,bit_9
add t3,#1 ' Address the next data register
djnz t2,#.loop
This code snippet is from the CAN bus object. My P2 port is working pretty well now.
Trap: waitms() and waitus() are limited to a delay of 2^31 / clkfreq seconds. In my 200MHz test that works out to 10_737 milliseconds for waitms(), and 10_737_419 microseconds for waitus().
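A quick Python check (not Spin) of that limit, plus a sketch of splitting a long delay into chunks that each stay under it. It takes the 2^31 / clkfreq figure above at face value; the helper names are invented.

```python
def max_waitms(clkfreq):
    """Largest whole-millisecond delay under the 2^31 / clkfreq seconds limit."""
    return (2**31 * 1000) // clkfreq

def split_delay(ms, clkfreq):
    """Break a long delay into chunks that each fit within the limit."""
    limit = max_waitms(clkfreq)
    chunks = []
    while ms > 0:
        step = min(ms, limit)
        chunks.append(step)
        ms -= step
    return chunks

print(max_waitms(200_000_000))           # 10737
print(split_delay(25_000, 200_000_000))  # [10737, 10737, 3526]
```

Calling waitms() once per chunk is one way around the limit.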
For long delays I put these methods into my jm_timer.spin2 object.
pub pause(ms) | t0, tixms
'' Delay in milliseconds
org
getct t0 ' snapshot counter
sub t0, ##592 ' fix call overhead
rdlong tixms, #$44 ' get clkfreq
qdiv tixms, ##1_000 ' get ticks/ms
getqx tixms
rep #2, ms ' delay
addct1 t0, tixms
waitct1
end
pub pause_us(us) | t0, tixus
'' Delay in microseconds
'' -- for low speed system frequency, use waitus()
org
getct t0 ' snapshot counter
sub t0, ##560 ' fix call overhead
rdlong tixus, #$44 ' get clkfreq
qdiv tixus, ##1_000_000 ' get ticks/us
getqx tixus
rep #2, us ' delay
addct1 t0, tixus
waitct1
end
Edits:
-- fixed opening statement for clarity
-- updated delay routines to be frequency independent
Comments
This is unfortunate. This must be a bug in the Verilog.
Wonder if Chip knows about this...
But, not for BITH, BITL, etc.
Are they the same way?
scratch that... Obviously not the same type of instruction...
I guess this is another thing that should be added to the docs at some point...
But since all pins are equal, unlike on other microcontrollers, one can design the board pinout to avoid the overlap.
I am pretty confident that this is the same behaviour on P1.
To get the absolute value:
The P2 || operator is now logical OR.
I wonder if Eric will change Fastspin over to this at some point...
Big one for me is that >= is finally in the correct order (the way you say it).
That "<-" turns out to be bitwise rotate left in Spin1 and not less than a negative number.
Guess it was good to change that...