GCC should use a fair number of instructions to start. It can always be improved.
Maxing out hardware features on this chip is basically intended to be done with PASM. The C people will in line and encapsulate just like the SPIN people will.
I suppose the downside to this (sorry) is that GCC is unlikely to use any of these new instructions, right?
So, we may be delaying rollout for a small fraction of potential users....
I suppose there could be a version of CMM for P2 that could use these instructions if its VM can't fit in COG/LUT otherwise or to provide more "fcache" space.
I've got the Prop123_A9 and BeMicro_A9 compiles done. Compiling for the Prop123_A7 now.
I'm considering using one of my CVA9s for a specialized test instrument since it's a one off so I will try out the latest enhancements and see if I can improve my interpreter as well.
I just got the unary math operators together for Spin2:
op_unary popa x wz 'pop a b c d e f g h i j k a: NOT
if_nz not x,#0 'NOT a | | | | | | | | | | b: !
not x '! a b | | | | | | | | | c: -
neg x '- | | c | | | | | | | | d: ||
abs x '|| | | | d | | | | | | | e: >|
topone x '>| | | | | e | | | | | | f: |<
decod x '|< | | | | | f | | | | | g: ~
shl x,#24 '~ | | | | | | g | | | | h: ~~
sar x,#24 ' | | | | | | g | | | | i: SQRT()
shl x,#16 '~~ | | | | | | | h | | | j: LOG()
sar x,#16 ' | | | | | | | h | | | k: EXP()
qsqrt x 'SQRT() | | | | | | | | i | |
qlog x 'LOG() | | | | | | | | | j |
qexp x 'EXP() | | | | | | | | | | k
getqx x ' | | | | | | | | i j k
_ret_ pusha x 'push a b c d e f g h i j k
That block gets called from the 9-clock interpreter loop via EXECF, the instructions of interest execute, and then it returns to the loop.
All but the "~" operator suffer a single NOP delay, since they must skip 8 instructions in a row, at one point. The code savings is huge and this approach is only two clocks (one NOP) slower than dedicated routines containing only the instructions of interest would have been.
I did find some numbers for PLC code, running via a eCLR byte-code engine :
eCLR Performance data 1000 logical or arithmetic-STL statements
a 1 Cortex®-M3 72 MHz
b ARM9 ™ 240 MHz
c Cortex®-A9 800 MHz
d PowerPC ™ e500 1.2 GHz 1
e Intel® 2.8 GHz
a b c d e
72M3 240A9 800A9 1.2G i2.8GHz
BOOL 33 μs 12 μs 1 μs 1 μs 0.35 μs
BYTE 33 μs 12 μs 1 μs 1 μs 0.35 μs
INT 33 μs 18 μs 4 μs 1 μs 0.35 μs
DINT 33 μs 18 μs 4 μs 8 μs 0.35 μs
REAL 449 μs 100 μs 40 μs 13 μs 1.8 μs
1 The indicated system limits and performance values may differ depending on the configuration of the hardware.
Supported CPUs Altera® Cyclone® V Intel® X86, Intel® Atom ™ Altera® Nios® II PowerPC ™ e300, PowerPC ™
e500 ARM7 ™, ARM9 ™, ARM11 ™ SH02, SH03, SH04 Cortex®-M3, Cortex®-M4, Cortex®-R4F Xilinx Zynq®
I think this means the most basic add/and/not operations. like A = B+C or A= D AND E.
The derived sysCLKs here seem suss ?
A forum posting at RaspBerryPi suggests this
10,000 rungs that look like this:
----------][---------------][----------------][----------------][-------------------------( )---------
Time to execute all 10,000 is about 3 mSec.
This one is 300us for 1000, so quite a lot slower than the other claims.
Assuming (say) 800MHz SysCLK for Pi, that's ~ 240 SysCLK for R=A AND B AND C AND D type operation.
That sounds more realistic ?
You should be able to derive a time for R=A AND B AND C AND D on P2, in SysCLKs, via your byte code ?
I did find some numbers for PLC code, running via a eCLR byte-code engine :
eCLR Performance data 1000 logical or arithmetic-STL statements
a 1 Cortex®-M3 72 MHz
b ARM9 ™ 240 MHz
c Cortex®-A9 800 MHz
d PowerPC ™ e500 1.2 GHz 1
e Intel® 2.8 GHz
a b c d e
72M3 240A9 800A9 1.2G i2.8GHz
BOOL 33 μs 12 μs 1 μs 1 μs 0.35 μs
BYTE 33 μs 12 μs 1 μs 1 μs 0.35 μs
INT 33 μs 18 μs 4 μs 1 μs 0.35 μs
DINT 33 μs 18 μs 4 μs 8 μs 0.35 μs
REAL 449 μs 100 μs 40 μs 13 μs 1.8 μs
1 The indicated system limits and performance values may differ depending on the configuration of the hardware.
Supported CPUs Altera® Cyclone® V Intel® X86, Intel® Atom ™ Altera® Nios® II PowerPC ™ e300, PowerPC ™
e500 ARM7 ™, ARM9 ™, ARM11 ™ SH02, SH03, SH04 Cortex®-M3, Cortex®-M4, Cortex®-R4F Xilinx Zynq®
I think this means the most basic add/and/not operations. like A = B+C or A= D AND E.
The derived sysCLKs here seem suss ?
A forum posting at RaspBerryPi suggests this
10,000 rungs that look like this:
----------][---------------][----------------][----------------][-------------------------( )---------
Time to execute all 10,000 is about 3 mSec.
This one is 300us for 1000, so quite a lot slower than the other claims.
Assuming (say) 800MHz SysCLK for Pi, that's ~ 240 SysCLK for R=A AND B AND C AND D type operation.
That sounds more realistic ?
You should be able to derive a time for R=A AND B AND C AND D on P2, in SysCLKs, via your byte code ?
I found that .pdf and they are talking about native ASM instructions actually executing. It seems that this eCLR bytecode translates easily into instructions. It also has a 256k runtime minimum. If the Prop2 were to run that same test in the way those other benchmarks were computed it would be about (Fclk/2 * 16 processors) / 1000 operations.
These 13 instructions constitute the binary operators for multiply, divide, remainder, scale, and fraction:
op_binary2 setq #2-1 'pop a b c d e a: *
rdlong x,ptra[--2] wc 'C=Ys a b c d e b: /
testb x,#31 wz '!Z=Xs a b c | | c: //
abs x ' a b c | | d: **
abs y ' a b c | | e: */
qmul x,y ' a | | d |
qdiv x,y ' | b c | |
qfrac x,y ' | | | | e
getqx x ' a b | | e
if_c_eq_z neg x ' a b | | |
getqy x ' | | c d |
if_nz neg x ' | | c | |
_ret_ pusha x 'push a b c d e
If you add up all the letters, you can see that these routines coded discretely would have taken 37 instructions. Here, EXECF results in 65% smaller code that runs just as fast!
These 13 instructions constitute the binary operators for multiply, divide, remainder, scale, and fraction:
op_binary2 setq #2-1 'pop a b c d e a: *
rdlong x,ptra[--2] wc 'C=Ys a b c d e b: /
testb x,#31 wz '!Z=Xs a b c | | c: //
abs x ' a b c | | d: **
abs y ' a b c | | e: */
qmul x,y ' a | | d |
qdiv x,y ' | b c | |
qfrac x,y ' | | | | e
getqx x ' a b | | e
if_c_eq_z neg x ' a b | | |
getqy x ' | | c d |
if_nz neg x ' | | c | |
_ret_ pusha x 'push a b c d e
If you add up all the letters, you can see that these routines coded discretely would have taken 37 instructions. Here, EXECF results in 65% smaller code that runs just as fast!
All but the "~" operator suffer a single NOP delay, since they must skip 8 instructions in a row, at one point. The code savings is huge and this approach is only two clocks (one NOP) slower than dedicated routines containing only the instructions of interest would have been.
You could also do
op_unary popa x wz 'pop a b c d e f | | | | |
if_nz not x,#0 'NOT a | | | | | | | | | | a: NOT
not x '! a b | | | | | | | | | b: !
neg x '- | | c | | | | | | | | c: -
abs x '|| | | | d | | | | | | | d: ||
topone x '>| | | | | e | | | | | | e: >|
decod x '|< | | | | | f | | | | | f: |<
popa x wz 'pop | | | | | | g h i j k
_ret_ pusha x 'push a b c d e f | | | | |
shl x,#24 '~ | | | | | | g | | | | g: ~
sar x,#24 ' | | | | | | g | | | |
shl x,#16 '~~ | | | | | | | h | | | h: ~~
sar x,#16 ' | | | | | | | h | | |
qsqrt x 'SQRT() | | | | | | | | i | | i: SQRT()
qlog x 'LOG() | | | | | | | | | j | j: LOG()
qexp x 'EXP() | | | | | | | | | | k k: EXP()
getqx x ' | | | | | | | | i j k
_ret_ pusha x 'push | | | | | | g h i j k
An additional popa/pusha have been added at slots 8 and 9. This does add two instructions, but it gets rid of the NOP penalty.
All but the "~" operator suffer a single NOP delay, since they must skip 8 instructions in a row, at one point. The code savings is huge and this approach is only two clocks (one NOP) slower than dedicated routines containing only the instructions of interest would have been.
You could also do
op_unary popa x wz 'pop a b c d e f | | | | |
if_nz not x,#0 'NOT a | | | | | | | | | | a: NOT
not x '! a b | | | | | | | | | b: !
neg x '- | | c | | | | | | | | c: -
abs x '|| | | | d | | | | | | | d: ||
topone x '>| | | | | e | | | | | | e: >|
decod x '|< | | | | | f | | | | | f: |<
popa x wz 'pop | | | | | | g h i j k
_ret_ pusha x 'push a b c d e f | | | | |
shl x,#24 '~ | | | | | | g | | | | g: ~
sar x,#24 ' | | | | | | g | | | |
shl x,#16 '~~ | | | | | | | h | | | h: ~~
sar x,#16 ' | | | | | | | h | | |
qsqrt x 'SQRT() | | | | | | | | i | | i: SQRT()
qlog x 'LOG() | | | | | | | | | j | j: LOG()
qexp x 'EXP() | | | | | | | | | | k k: EXP()
getqx x ' | | | | | | | | i j k
_ret_ pusha x 'push | | | | | | g h i j k
An additional popa/pusha have been added at slots 8 and 9. This does add two instructions, but it gets rid of the NOP penalty.
Right. Or, just break them into two different sets with no overlap. That wouldn't take any more instructions if you've already added the PUSHA and POPA.
Check out these equality operators. 70% code size reduction:
op_binary0 setq #2-1 'pop a b c d e f a: <
rdlong x,ptra[--2] ' a b c d e f b: =<,<=
cmps x,y wc,wz ' a | c d e | c: ==
cmps y,x wc,wz ' | b | | | f d: <>
muxc x,_FFFFFFFF ' a | | | | f e: =>,>=
muxnc x,_FFFFFFFF ' | b | | e | f: >
muxz x,_FFFFFFFF ' | | c | | |
muxnz x,_FFFFFFFF ' | | | d | |
_ret_ pusha x 'push a b c d e f
Right. Or, just break them into two different sets with no overlap. That wouldn't take any more instructions if you've already added the PUSHA and POPA.
Originally, I was thinking you could insert just the pusha, then re-arrange the instructions to have to most frequently used ones in the first half. But at that point, one additional instruction for removing the penalty from the second half seemed like a better deal. But, you are right. At that point, it may as well be two blocks. I wasn't thinking about the fact that the jump address was part of the the lookup too.
Right. Or, just break them into two different sets with no overlap. That wouldn't take any more instructions if you've already added the PUSHA and POPA.
Originally, I was thinking you could insert just the pusha, then re-arrange the instructions to have to most frequently used ones in the first half. But at that point, one additional instruction for removing the penalty from the second half seemed like a better deal. But, you are right. At that point, it may as well be two blocks. I wasn't thinking about the fact that the jump address was part of the the lookup too.
I'm thinking that when there's just a few overhead instructions, it might make sense to break things into smaller blocks if it means saving cycles. The only slow thing here is the POPA and PUSHA, but they can't easily be gotten around. Saving a few cycles could get you into an earlier egg-beater slot, though, since the POPA and PUSHA addresses have a static timing relationship.
Right. Or, just break them into two different sets with no overlap. That wouldn't take any more instructions if you've already added the PUSHA and POPA.
Originally, I was thinking you could insert just the pusha, then re-arrange the instructions to have to most frequently used ones in the first half. But at that point, one additional instruction for removing the penalty from the second half seemed like a better deal. But, you are right. At that point, it may as well be two blocks. I wasn't thinking about the fact that the jump address was part of the the lookup too.
I'm thinking that when there's just a few overhead instructions, it might make sense to break things into smaller blocks if it means saving cycles. The only slow thing here is the POPA and PUSHA, but they can't easily be gotten around. Saving a few cycles could get you into an earlier egg-beater slot, though, since the POPA and PUSHA addresses have a static timing relationship.
Oh, right! So actually we shouldn't be getting hung up on the NOP penalty (in this case), since all of those instructions will be processed within a single hub cycle. Even with the NOP, you're still waiting on the pusha.
Right. Or, just break them into two different sets with no overlap. That wouldn't take any more instructions if you've already added the PUSHA and POPA.
Originally, I was thinking you could insert just the pusha, then re-arrange the instructions to have to most frequently used ones in the first half. But at that point, one additional instruction for removing the penalty from the second half seemed like a better deal. But, you are right. At that point, it may as well be two blocks. I wasn't thinking about the fact that the jump address was part of the the lookup too.
I'm thinking that when there's just a few overhead instructions, it might make sense to break things into smaller blocks if it means saving cycles. The only slow thing here is the POPA and PUSHA, but they can't easily be gotten around. Saving a few cycles could get you into an earlier egg-beater slot, though, since the POPA and PUSHA addresses have a static timing relationship.
Oh, right! So actually we shouldn't be getting hung up on the NOP penalty (in this case), since all of those instructions will be processed within a single hub cycle. Even with the NOP, you're still waiting on the pusha.
I'll have to do some tests to find out what the sweet spot is, but getting within a near egg-beater slot is worth 16 cycles.
Look at top center. While the 6507 ROM listing I found varies a tiny bit from the bits in the die simulation rendering, the basic idea is you can see the bit patterns in the Visual 6502 are wired to atomic operations, much like the bit patterns are associated with instructions.
The chip does this in parallel. All at the same time.
I just got the unary math operators together for Spin2:
op_unary popa x wz 'pop a b c d e f g h i j k a: NOT
if_nz not x,#0 'NOT a | | | | | | | | | | b: !
not x '! a b | | | | | | | | | c: -
neg x '- | | c | | | | | | | | d: ||
abs x '|| | | | d | | | | | | | e: >|
topone x '>| | | | | e | | | | | | f: |<
decod x '|< | | | | | f | | | | | g: ~
shl x,#24 '~ | | | | | | g | | | | h: ~~
sar x,#24 ' | | | | | | g | | | | i: SQRT()
shl x,#16 '~~ | | | | | | | h | | | j: LOG()
sar x,#16 ' | | | | | | | h | | | k: EXP()
qsqrt x 'SQRT() | | | | | | | | i | |
qlog x 'LOG() | | | | | | | | | j |
qexp x 'EXP() | | | | | | | | | | k
getqx x ' | | | | | | | | i j k
_ret_ pusha x 'push a b c d e f g h i j k
That block gets called from the 9-clock interpreter loop via EXECF, the instructions of interest execute, and then it returns to the loop.
All but the "~" operator suffer a single NOP delay, since they must skip 8 instructions in a row, at one point. The code savings is huge and this approach is only two clocks (one NOP) slower than dedicated routines containing only the instructions of interest would have been.
The three unary CORDIC operations take significantly longer than the other unary operations. Would eliminating the NOP delay for qsqrt, qlog, and qexp make a difference?
If so, this could be done by moving qsqrt, qlog, and qexp to the top, but leaving the getqx at the bottom. That way, the NOP delay for those operators wouldn't matter, since it'd be waiting for the CORDIC anyway.
op_unary popa x wz 'pop a b c d e f g h i j k a: NOT
qsqrt x 'SQRT() | | | | | | | | i | | b: !
qlog x 'LOG() | | | | | | | | | j | c: -
qexp x 'EXP() | | | | | | | | | | k d: ||
if_nz not x,#0 'NOT a | | | | | | | | | | e: >|
not x '! a b | | | | | | | | | f: |<
neg x '- | | c | | | | | | | | g: ~
abs x '|| | | | d | | | | | | | h: ~~
topone x '>| | | | | e | | | | | | i: SQRT()
decod x '|< | | | | | f | | | | | j: LOG()
shl x,#24 '~ | | | | | | g | | | | k: EXP()
sar x,#24 ' | | | | | | g | | | |
shl x,#16 '~~ | | | | | | | h | | |
sar x,#16 ' | | | | | | | h | | |
getqx x ' | | | | | | | | i j k
_ret_ pusha x 'push a b c d e f g h i j k
Also, have you considered leaving the top stack element in a register? Tachyon Forth does this. It would eliminate two hub accesses from every stack operation. Unary operations would require no hub access, and binary operations would require only one, to pop the second value on the stack.
Also, have you considered leaving the top stack element in a register? Tachyon Forth does this. It would eliminate two hub accesses from every stack operation. Unary operations would require no hub access, and binary operations would require only one, to pop the second value on the stack.
That is a great idea! I will do that.
Also, about those unary CORDIC instructions, they actually get skipped over by the PC when they are not needed, so they take no time, at all. That's the cool thing about SKIPF/EXECF. The only time a NOP occurs is when eight contiguous instructions are getting skipped.
10,000 rungs that look like this:
----------][---------------][----------------][----------------][-------------------------( )---------
Time to execute all 10,000 is about 3 mSec.
This one is 300us for 1000, so quite a lot slower than the other claims.
Assuming (say) 800MHz SysCLK for Pi, that's ~ 240 SysCLK for R=A AND B AND C AND D type operation.
That sounds more realistic ?
You should be able to derive a time for R=A AND B AND C AND D on P2, in SysCLKs, via your byte code ?
I found that .pdf and they are talking about native ASM instructions actually executing. It seems that this eCLR bytecode translates easily into instructions. It also has a 256k runtime minimum. If the Prop2 were to run that same test in the way those other benchmarks were computed it would be about (Fclk/2 * 16 processors) / 1000 operations.
The first numbers are more marketing, and they rather fail a simple litmus test of checking the SysCLK ratio,
in some cases, I get under 1 sysclk ?! Something is awry. Not even a JIT compiler is that good...
The second, forum value of 3ms/10k lines seems to be a real measured system, with an equation given, so it should be possible to compare that with PASM and Spin.ByteCodes, for each of BOOL (this can be a pin, or memory bit), BYTE, INT, DINT, REAL ?
Fastest I think is COG u32 vars, which can be 4 lines of PASM for 8 SysCLKS or 100ns @ 80MHz
(Pi is ~ 300ns/line)
u8,u16,u64 will all be somewhat slower, more so if they use HUB vars.
BOOL code on P2 I've not looked at too closely yet.
SpinByte codes will be in the ballpark of 1us/line ?
Here are the unary operators, using electrodude's idea about keeping the top level of the stack in a register. This is crazy simple now. Things like !, -, and || will take a total of 13 clocks to completely execute, from bytecode fetch to bytecode fetch:
op_notb test x wz 'NOT
_ret_ muxz x,_FFFFFFFF
op_not _ret_ not x '!
op_neg _ret_ neg x '-
op_abs _ret_ abs x '||
op_ecd _ret_ topone x '>|
op_dcd _ret_ decod x '|<
op_sxb shl x,#24 '~
_ret_ sar x,#24
op_sxw shl x,#16 '~~
_ret_ sar x,#16
op_sqrt qsqrt x 'SQRT() a a: SQRT()
op_log qlog x 'LOG() | b b: LOG()
op_exp qexp x 'EXP() | | c c: EXP()
_ret_ getqx x ' a b c
What would you guys think about making the TRUE result = 1, instead of $FFFFFFFF, for NOT/AND/OR and equality tests?
I think my original thinking was that all ones could be used as an & (bitwise AND) argument. Maybe just 1 would be better, though. Verilog behaves this way and I've come to like it.
EDIT: ...On second thought, this is not a good idea because it would confuse the behavior between ! and NOT. The way it is now is probably best.
I got all the unary and binary operators, plus equality operators, rewritten to work by keeping the top of the stack in a register. The code size is 60 longs, same as before. The difference now is that it runs way faster because it does fewer pushes and pops and all the SKIPF-induced NOPs are gone, since I tightened things up a bit.
'
'
' Unary operators
'
op_notb test x wz 'NOT
_ret_ muxz x,_FFFFFFFF
op_not _ret_ not x '!
op_neg _ret_ neg x '-
op_abs _ret_ abs x '||
op_ecd _ret_ topone x '>|
op_dcd _ret_ decod x '|<
op_sxb shl x,#24 '~
_ret_ sar x,#24
op_sxw shl x,#16 '~~
_ret_ sar x,#16
op_sqrt qsqrt x ' a a: SQRT()
op_log qlog x ' | b b: LOG()
op_exp qexp x ' | | c c: EXP()
_ret_ getqx x ' a b c
'
'
' Binary operators
'
op_logic popa y 'pop a b c d e a: AND
test x wz ' a b | | | b: OR
muxnz x,_FFFFFFFF ' a b | | | c: &
test y wz ' a b | | | d: |
muxnz y,_FFFFFFFF ' a b | | | e: ^
_ret_ and x,y ' a | c | |
_ret_ or x,y ' b d |
_ret_ xor x,y ' e
op_shift mov y,x 'swap a b c d e f g a: >>
popa x 'pop a b c d e f g b: <<
_ret_ shr x,y ' a | | | | | | c: ~>
_ret_ shl x,y ' b | | | | | d: <~
_ret_ sar x,y ' c | | | | e: ->
_ret_ sal x,y ' d | | | f: <-
_ret_ ror x,y ' e | | g: ><
_ret_ rol x,y ' f |
rev x ' g
rol x,y ' g
rol x,#1 ' g
_ret_ triml x,y ' g
op_addsub mov y,x 'swap a b c d a: #>
popa x 'pop a b c d b: <#
_ret_ mins x,y ' a | | | c: +
_ret_ maxs x,y ' b | | d: -
_ret_ add x,y ' c |
_ret_ sub x,y ' d
op_muldiv popa y wc 'pop a b c d e a: *
testb x,#31 wz 'c=ys a b c | | b: /
abs y '!z=xs a b c | | c: //
abs x ' a b c | | d: **
qmul y,x ' a | | d | e: */
qdiv y,x ' | b c | |
qfrac y,x ' | | | | e
getqx x ' a b | | e
if_c_eq_z neg x ' a b | | |
getqy x ' | | c d |
if_c neg x ' | | c | |
ret ' a b c d e
op_equal popa y 'pop a b c d e f a: <
cmps y,x wc,wz ' a | c d e | b: =<,<=
cmps x,y wc,wz ' | b | | | f c: ==
_ret_ muxc x,_FFFFFFFF ' a | | | | f d: <>
_ret_ muxnc x,_FFFFFFFF ' b | | e e: =>,>=
_ret_ muxz x,_FFFFFFFF ' c | f: >
_ret_ muxnz x,_FFFFFFFF ' d
'
'
' Constants
'
con_low pusha x 'push a b c d e f g a: 0
_ret_ mov x,#0 ' a | | | | | | b: 1
_ret_ mov x,#1 ' b | | | | | c: 2
_ret_ mov x,#2 ' c | | | | d: 3
_ret_ mov x,#3 ' d | | | e: 4
_ret_ mov x,#4 ' e | | f: 5
_ret_ mov x,#5 ' f | g: 6
_ret_ mov x,#6 ' g
con_high pusha x 'push a b c d e f g a: 7
_ret_ mov x,#7 ' a | | | | | | b: 8
_ret_ mov x,#8 ' b | | | | | c: 15
_ret_ mov x,#15 ' c | | | | d: 16
_ret_ mov x,#16 ' d | | | e: 31
_ret_ mov x,#31 ' e | | f: 32
_ret_ mov x,#32 ' f | g: -1
_ret_ mov x,_FFFFFFFF ' g
con_data pusha x 'push a b c d e f g h i a: byte
_ret_ rfbyte x ' a | | | | | | | | b: !byte
rfbyte x ' b | | | f g h i c: word
_ret_ rfword x ' | c | | | | | | d: !word
rfword x ' | d | | | | | e: long
_ret_ rflong x ' | | e | | | | f: byte + decode
_ret_ decod x ' | | f | | | g: byte + decode + not
decod x ' | | g h i h: byte + decode + decrement
_ret_ not x ' b d g | | i: byte + decode + negate
_ret_ sub x,#1 ' h |
_ret_ neg x ' i
Not sure if you all looked at the 6502 links I dropped, but this general approach looks one heck of a lot like that chips instruction decode and execute ROM.
Seems doing so much Verilog has done Chip some good.
We have a sequential version of what is a common hardware approach.
I see it now as a truth table version of logic. If one works at encoding the table, it ends up being a set of branches and operators, "the function."
Like XOR is a function, with a little table. This is some function, we don't care what function, just that the truth table for it matches requirements. Without it, one would basically make this same table as a starting point, then work out efficient combinations of branches, jumps, and, nor, XOR, ops to get it all done.
Doing it this way avoids all of that logic, and it's possible to just evaluate the table for all cases, set the bits and go.
Frankly, once people see what happens, this will be easier for a lot of parse and decision type things.
I got all the unary and binary operators, plus equality operators, rewritten to work by keeping the top of the stack in a register. The code size is 60 longs, same as before. The difference now is that it runs way faster because it does fewer pushes and pops and all the SKIPF-induced NOPs are gone, since I tightened things up a bit.
You still need to manage all those masks tho, and store them somewhere...
Your comment syntax could, with small changes, lend itself to parsing to auto-extract the masks by name.
The main issue I think is the right-most tag is not valid name chars, but a function name could be ?
Which brings up the point: is now a good time to deprecate the syntactic-soup that is Spin operators, and introduce a cleaner functionname() approach, that would allow other front-ends on the Byte-code generator ?
Final code size is no different, but source is (far) easier to scan and read, and has far less culture shock coming from other languages.
When the ASM is easier to read than the HLL, it is time to pause and re-think.
Spin's operators should be brought into line with those of C, C++, C#, Java, Javascript.
I see no particular reason they are different. It confuses and catches out those coming from those other languages. I.e everybody.
By all means have some extra exotic operators as well.
Yup.
The way other tools manage this, is with a simple mode switch, if you find some that are mutually exclusive.
ie you can compile both forms, with a modicum of effort.
I was looking at Chip's tables, and thinking how to automate that, and there, the cryptic-names bite.
Without those, it is not too hard to Automate.
Some clearer name is needed, for automation, so why not push that up to the source, since it is then in the docs.
Python 3 moves more from the 'language' into function syntax, which has many benefits.
Comments
Maxing out hardware features on this chip is basically intended to be done with PASM. The C people will in line and encapsulate just like the SPIN people will.
No reason for delays.
I'm considering using one of my CVA9s for a specialized test instrument since it's a one off so I will try out the latest enhancements and see if I can improve my interpreter as well.
That block gets called from the 9-clock interpreter loop via EXECF, the instructions of interest execute, and then it returns to the loop.
All but the "~" operator suffer a single NOP delay, since they must skip 8 instructions in a row, at one point. The code savings is huge and this approach is only two clocks (one NOP) slower than dedicated routines containing only the instructions of interest would have been.
I did find some numbers for PLC code, running via a eCLR byte-code engine :
I think this means the most basic add/and/not operations. like A = B+C or A= D AND E.
The derived sysCLKs here seem suss ?
A forum posting at RaspBerryPi suggests this This one is 300us for 1000, so quite a lot slower than the other claims.
Assuming (say) 800MHz SysCLK for Pi, that's ~ 240 SysCLK for R=A AND B AND C AND D type operation.
That sounds more realistic ?
You should be able to derive a time for R=A AND B AND C AND D on P2, in SysCLKs, via your byte code ?
I found that .pdf and they are talking about native ASM instructions actually executing. It seems that this eCLR bytecode translates easily into instructions. It also has a 256k runtime minimum. If the Prop2 were to run that same test in the way those other benchmarks were computed it would be about (Fclk/2 * 16 processors) / 1000 operations.
If you add up all the letters, you can see that these routines coded discretely would have taken 37 instructions. Here, EXECF results in 65% smaller code that runs just as fast!
You could also do
An additional popa/pusha have been added at slots 8 and 9. This does add two instructions, but it gets rid of the NOP penalty.
Right. Or, just break them into two different sets with no overlap. That wouldn't take any more instructions if you've already added the PUSHA and POPA.
Check out these equality operators. 70% code size reduction:
Originally, I was thinking you could insert just the pusha, then re-arrange the instructions to have to most frequently used ones in the first half. But at that point, one additional instruction for removing the penalty from the second half seemed like a better deal. But, you are right. At that point, it may as well be two blocks. I wasn't thinking about the fact that the jump address was part of the the lookup too.
I'm thinking that when there's just a few overhead instructions, it might make sense to break things into smaller blocks if it means saving cycles. The only slow thing here is the POPA and PUSHA, but they can't easily be gotten around. Saving a few cycles could get you into an earlier egg-beater slot, though, since the POPA and PUSHA addresses have a static timing relationship.
Oh, right! So actually we shouldn't be getting hung up on the NOP penalty (in this case), since all of those instructions will be processed within a single hub cycle. Even with the NOP, you're still waiting on the pusha.
I'll have to do some tests to find out what the sweet spot is, but getting within a near egg-beater slot is worth 16 cycles.
http://visual6502.org/wiki/index.php?title=6507_Decode_ROM
http://visual6502.org/JSSim/
Look at top center. While the 6507 ROM listing I found varies a tiny bit from the bits in the die simulation rendering, the basic idea is you can see the bit patterns in the Visual 6502 are wired to atomic operations, much like the bit patterns are associated with instructions.
The chip does this in parallel. All at the same time.
A P2 does it sequentially in software.
Very microcode like, if you ask me.
The three unary CORDIC operations take significantly longer than the other unary operations. Would eliminating the NOP delay for qsqrt, qlog, and qexp make a difference?
If so, this could be done by moving qsqrt, qlog, and qexp to the top, but leaving the getqx at the bottom. That way, the NOP delay for those operators wouldn't matter, since it'd be waiting for the CORDIC anyway.
That is a great idea! I will do that.
Also, about those unary CORDIC instructions, they actually get skipped over by the PC when they are not needed, so they take no time, at all. That's the cool thing about SKIPF/EXECF. The only time a NOP occurs is when eight contiguous instructions are getting skipped.
The first numbers are more marketing, and they rather fail a simple litmus test of checking the SysCLK ratio,
in some cases, I get under 1 sysclk ?! Something is awry. Not even a JIT compiler is that good...
The second, forum value of 3ms/10k lines seems to be a real measured system, with an equation given, so it should be possible to compare that with PASM and Spin.ByteCodes, for each of BOOL (this can be a pin, or memory bit), BYTE, INT, DINT, REAL ?
Fastest I think is COG u32 vars, which can be 4 lines of PASM for 8 SysCLKS or 100ns @ 80MHz
(Pi is ~ 300ns/line)
u8,u16,u64 will all be somewhat slower, more so if they use HUB vars.
BOOL code on P2 I've not looked at too closely yet.
SpinByte codes will be in the ballpark of 1us/line ?
I think my original thinking was that all ones could be used as an & (bitwise AND) argument. Maybe just 1 would be better, though. Verilog behaves this way and I've come to like it.
EDIT: ...On second thought, this is not a good idea because it would confuse the behavior between ! and NOT. The way it is now is probably best.
A big chunk of interpreter in 60 longs (with blazing speed) is very neat.
Not sure if you all looked at the 6502 links I dropped, but this general approach looks one heck of a lot like that chips instruction decode and execute ROM.
Seems doing so much Verilog has done Chip some good.
We have a sequential version of what is a common hardware approach.
I see it now as a truth table version of logic. If one works at encoding the table, it ends up being a set of branches and operators, "the function."
Like XOR is a function, with a little table. This is some function, we don't care what function, just that the truth table for it matches requirements. Without it, one would basically make this same table as a starting point, then work out efficient combinations of branches, jumps, and, nor, XOR, ops to get it all done.
Doing it this way avoids all of that logic, and it's possible to just evaluate the table for all cases, set the bits and go.
Frankly, once people see what happens, this will be easier for a lot of parse and decision type things.
SPIN should have it. No joke.
You still need to manage all those masks tho, and store them somewhere...
Your comment syntax could, with small changes, lend itself to parsing to auto-extract the masks by name.
The main issue I think is the right-most tag is not valid name chars, but a function name could be ?
Which brings up the point: is now a good time to deprecate the syntactic-soup that is Spin operators, and introduce a cleaner functionname() approach, that would allow other front-ends on the Byte-code generator ?
Final code size is no different, but source is (far) easier to scan and read, and has far less culture shock coming from other languages.
When the ASM is easier to read than the HLL, it is time to pause and re-think.
I see no particular reason they are different. It confuses and catches out those coming from those other languages. I.e everybody.
By all means have some extra exotic operators as well.
Yup.
The way other tools manage this, is with a simple mode switch, if you find some that are mutually exclusive.
ie you can compile both forms, with a modicum of effort.
I was looking at Chip's tables, and thinking how to automate that, and there, the cryptic-names bite.
Without those, it is not too hard to Automate.
Some clearer name is needed, for automation, so why not push that up to the source, since it is then in the docs.
Python 3 moves more from the 'language' into function syntax, which has many benefits.