Dave,
Intriguing, so the FIFO stops being used when the code path enters CogExec, the FIFO is then being used for a buffered block write, then automatically refills with HubExec code when the RET is hit.
There's probably a FIFO flush delay at this point but "That's so easy!" Now for the block copy version ... it'll need to chop it up into chunks like Chip has.
One nice thing about SETQ+RDLONG/WRLONG is that the FIFO is not used. Data is streamed directly between the cog and the hub at one clock per long. FIFO status and mode are unchanged.
I've been using Dave's comport code for my testing. It's simple bit-bashing code (presumably been like that since Prop2Hot days) but it saved me going through the trial and error effort myself. Had to hack it around a little to remove the GCC parts. I wanted all the printf() code too but that was just too hard to extract. So I settled for making my own itoa() that uses the putch() primitive.
Chip, that looks great. However, I wonder if you could get a speed improvement by separating xxxxFILL from xxxxMOVE. Also, because the P2 does unaligned reads and writes the BYTE and WORD operations could be almost as fast as the LONG operations. It just requires performing 0 to 3 BYTE accesses at the beginning, and then doing LONG accesses after that. This is the code I used for memset() in p2gcc, which is basically the same as BYTEFILL() in Spin.
' This code resides in cog RAM
__LONGFILL
wrfast #0, r0
rep #1, r2
wflong r1
ret
// This code resides in hub RAM
// ptr, val and num are in r0, r1 and r2
void memset(void *ptr, int val, int num)
{
__asm__(" mov r3, #3");
__asm__(" and r3, r2 wz");
__asm__(" if_z jmp #label1");
__asm__("label2 wrbyte r1, r0");
__asm__(" add r0, #1");
__asm__(" djnz r3, #label2");
__asm__("label1 setnib r1, r1, #1");
__asm__(" setbyte r1, r1, #1");
__asm__(" setword r1, r1, #1");
__asm__(" shr r2, #2 wz");
__asm__(" if_nz call #\\__LONGFILL");
}
I could speed it up a bit by moving all the code to cog RAM, but p2gcc C code currently calls functions that only reside in hub RAM.
Dave, there was a lot of overlap between the move and fill functions. So, I melded them together. I didn't want one full-length SETQ+WRLONG fill, since interrupt latency would have been compromised.
If you don't care about interrupt latency, you can do one SETQ+WRLONG combo, using a constant for the WRLONG's D. It will fill at the rate of 1 long per clock.
I'll have to change it to use SETQ+WRLONG. Since it doesn't use the FIFO I should be able to do this from hub RAM and I won't need to use cog RAM. I guess if I need to worry about interrupt latency in the future I could break the SETQ+WRLONG into to smaller chunks.
Work on the interpreter is going slowly, but almost done. It is really challenging for me because there are so many subtle things that can be done to create optimal code, that it's enough to drive one nuts.
We almost need some kind of compiler that can optimize variable placement and coordinate it with SETQ+RDLONG activity to get the big speed improvements possible.
I've been thinking for years now that it would be really neat to have a tool where you can define some kind of outcome that you want, based on inputs, and the tool, using random instruction selection and simulation, comes up with really efficient solutions that would have been too difficult for a human to produce in as timely a way.
Work on the interpreter is going slowly, but almost done. It is really challenging for me because there are so many subtle things that can be done to create optimal code, that it's enough to drive one nuts.
We almost need some kind of compiler that can optimize variable placement and coordinate it with SETQ+RDLONG activity to get the big speed improvements possible.
I've been thinking for years now that it would be really neat to have a tool where you can define some kind of outcome that you want, based on inputs, and the tool, using random instruction selection and simulation, comes up with really efficient solutions that would have been too difficult for a human to produce in as timely a way.
So you mean sort of like a Place & Route in FPGA, only for Opcodes... ?
Could be interesting to try, but likely to quickly become exponentially slower, as the task increases.
Easier could be more-automated tools for your masked skip opcodes.
ie fairly simple ones that re-create masks as lines are added and removed.
Yes, it seems almost like an FPGA compilation problem, but with a larger-grain and a more-random set of building blocks that cover a different set of functions. It's like chess, whereas FPGA's are more like Legos.
Yesterday, I had the problem of needing to find if a register held a value that was between two other random register values. I can do it in six instructions, but there must be a simpler solution. I think it would be reasonable to have a compiler come up with a solution for that simple of a problem.
That didn't occur to me, at first, but that would be the best way to do it. You could eliminate bad combos right away with that approach. A full parametric sweep would be impossible.
Maybe the simplest way to make such a thing is to give the tool some code sequence that works, and have it try to find a shorter/faster sequence. That way, there's less abstraction needed on defining the goal. It just simulates the model while it tries to find a better one.
Maybe the simplest way to make such a thing is to give the tool some code sequence that works, and have it try to find a shorter/faster sequence. That way, there's less abstraction needed on defining the goal. It just simulates the model while it tries to find a better one.
Sounds like your moving toward genetic algorithms :-)
The fastest way to build this and the fastest way to run it, is on the prop2, itself. No need for a simulator, then:
1) generate new random input values
2) run 'model' routine with generated random input values, capture outputs
4) run 'test' routine with generated random input values, capture outputs
5) compare outputs and iterate 'test' routine if mismatch
6) repeat until n cycles with no mismatch
isn't that exactly what neuronal networks do?
Mike
No, thats exactly what Genetic Algorithms do.
"The genetic algorithm is a method for solving both constrained and unconstrained optimization problems that is based on natural selection, the process that drives biological evolution. The genetic algorithm repeatedly modifies a population of individual solutions."
Yesterday, I had the problem of needing to find if a register held a value that was between two other random register values. I can do it in six instructions, but there must be a simpler solution. I think it would be reasonable to have a compiler come up with a solution for that simple of a problem.
Hi Chip,
I would like to show you a solution that I came up with to your above problem. I have tested it pretty thoroughly
using Cluso's debugger, so I hope that it helps.
Yesterday, I had the problem of needing to find if a register held a value that was between two other random register values. I can do it in six instructions, but there must be a simpler solution. I think it would be reasonable to have a compiler come up with a solution for that simple of a problem.
Hi Chip,
I would like to show you a solution that I came up with to your above problem. I have tested it pretty thoroughly
using Cluso's debugger, so I hope that it helps.
Neat! I was wondering if something like this was possible - checking the range in two directions. It hurt my head so bad that I couldn't get on top of it. Looks like you got there, though.
Meanwhile, here is what I came up with (comments are first, to state the goal of the code):
'
'
' LOOKDOWN(target : ,,,value1..value2,,,)
'
' if value1 <= value2
' delta = value2 - value1
' if value1 <= target <= value2
' result = - value1 + index + target
' branch to address
' else index += delta + 1
'
' if value1 > value2
' delta = value1 - value2
' if value2 <= target <= value1
' result = value1 + index - target
' branch to address
' else index += delta + 1
'
'
' x value2 x
' ptra[-1] value1 t
' ptra[-2] index u
' ptra[-3] target v
' ptra[-4] address w
'
'
lookdnr setq #3 'pop value1+index+target+address into t+u+v+w
rdlong w,--ptra[4]
cmps x,t wc 'z = (value1 > value2)
modcz 0,_c wz
if_nz cmps v,t wc 'if nz, nc = (value1 <= target <= value2)
if_nz_and_nc cmps x,v wc
if_z cmps v,x wc 'if z, nc = (value2 <= target <= value1)
if_z_and_nc cmps t,v wc
if_nc negnz x,t 'if target within range,
if_nc add x,u '..get result on stack top
if_nc sumz x,v
if_nc jmp #casex '..and branch
if_nz sub x,t 'else, get new index on stack top
if_z subr x,t
addx x,u '(c=1, results in +1)
_ret_ add ptra,#2*4 '..and unpop address+target
This code gets the direction of the compare into Z, so that it can still be used after C indicates if we are in range, or not.
Yesterday, I had the problem of needing to find if a register held a value that was between two other random register values. I can do it in six instructions, but there must be a simpler solution. I think it would be reasonable to have a compiler come up with a solution for that simple of a problem.
Hi Chip,
I would like to show you a solution that I came up with to your above problem. I have tested it pretty thoroughly
using Cluso's debugger, so I hope that it helps.
Neat! I was wondering if something like this was possible - checking the range in two directions. It hurt my head so bad that I couldn't get on top of it. Looks like you got there, though.
Meanwhile, here is what I came up with (comments are first, to state the goal of the code):
'
'
' LOOKDOWN(target : ,,,value1..value2,,,)
'
' if value1 <= value2
' delta = value2 - value1
' if value1 <= target <= value2
' result = - value1 + index + target
' branch to address
' else index += delta + 1
'
' if value1 > value2
' delta = value1 - value2
' if value2 <= target <= value1
' result = value1 + index - target
' branch to address
' else index += delta + 1
'
'
' x value2 x
' ptra[-1] value1 t
' ptra[-2] index u
' ptra[-3] target v
' ptra[-4] address w
'
'
lookdnr setq #3 'pop value1+index+target+address into t+u+v+w
rdlong w,--ptra[4]
cmps x,t wc 'z = (value1 > value2)
modcz 0,_c wz
if_nz cmps v,t wc 'if nz, nc = (value1 <= target <= value2)
if_nz_and_nc cmps x,v wc
if_z cmps v,x wc 'if z, nc = (value2 <= target <= value1)
if_z_and_nc cmps t,v wc
if_nc negnz x,t 'if target within range,
if_nc add x,u '..get result on stack top
if_nc sumz x,v
if_nc jmp #casex '..and branch
if_nz sub x,t 'else, get new index on stack top
if_z subr x,t
addx x,u '(c=1, results in +1)
_ret_ add ptra,#2*4 '..and unpop address+target
This code gets the direction of the compare into Z, so that it can still be used after C indicates if we are in range, or not.
Thanks chip for looking at my code, I really busted my brain! Trying to make a range check that works in two directions. I’ll add some comments soon to my code, I just have to figure out a good way to explain it. Chip your code really looks good and because you write such efficient code sometimes I have trouble understanding what you wrote :-)
I have documented my code that performs a range check in two directions.
dat
{
if n>=r1 r1<r2 direction
if n<=r2
return true n is in range of r1 and r2
elseif n<=r1 r1>r2 direction
if n>=r2
return true n is in range of r1 and r2
else
return false n is not in range of r1 and r2
}
InRange cmps n,r1 wz,wc ' checks for n>=r1
IF_NZ_AND_NC cmps r2,n wz,wc ' if n>=r1 then check if n<=r2
IF_NC jmp #InRange_ret ' if n>=r1 and n<=r2 then exit
cmps r1,n wz,wc ' checks for n<=r1
IF_NZ_AND_NC cmps n,r2 wz,wc ' if n<=r1 then check if n>=r2
InRange_ret ret
' defines what flags mean upon subroutine exit
' z=1&c=0 means that n is equal to r1 or n is equal to r2
' z=0&c=0 means that n is between r1 and r2, but does not include r1 or r2
' z=0&c=1 means that n is not in range!
Thanks for documenting that. Looks great! Stuff like this is hard to think about.
If that code is used as a subroutine, it could be quickened a little:
InRange cmps n,r1 wz,wc ' checks for n>=r1
IF_NZ_AND_NC cmps r2,n wz,wc ' if n>=r1 then check if n<=r2
IF_NC ret ' if n>=r1 and n<=r2 then exit
cmps r1,n wz,wc ' checks for n<=r1
IF_NZ_AND_NC cmps n,r2 wz,wc ' if n<=r1 then check if n>=r2
ret
I ran into a limitation while working on Spin2 with XBYTE...
XBYTE uses SETQ to establish the bytecode base and LSB/MSB indexing, usually at the start of bytecode execution:
_RET_ SETQ #$000 'set bytecode base to $000 in LUT, stack = $1F8, do XBYTE
After that, _RET_+Instruction's are done at the end of each code snippet to execute the next bytecode.
What if you have an alternate set of bytecodes, though, that you want to execute through a portal bytecode? You could do a new _RET_+SETQ, but then you permanently change the base and MSB/LSB indexing for subsequent _RET_+instruction's.
There needs to be a way to do a temporary base-and-MSB/LSB-indexing change that only works on the next bytecode, so that alternate sets of bytecodes could be used to run lower-usage code, without any need for a restorative _RET_+SETQ instruction later.
SETQ2 will now serve this purpose:
_RET_ SETQ2 #$100 'set bytecode base to $100 in LUT for next bytecode only, stack = $1F8, do XBYTE
This will add 10 flops to each cog, but make alternate bytecode sets easy to engage without the need for post clean-up.
Yesterday, I had the problem of needing to find if a register held a value that was between two other random register values. I can do it in six instructions, but there must be a simpler solution. I think it would be reasonable to have a compiler come up with a solution for that simple of a problem.
In P1 code:
sub lim2, val wz
if_nz sub lim1, val wz
xor lim1, lim2
shl lim1, #1 wc
The value is in range "if_c_or_z". Unfortunately it changes the values of your limt variables, and I'm guessing that's not allowed.
Yesterday, I had the problem of needing to find if a register held a value that was between two other random register values. I can do it in six instructions, but there must be a simpler solution. I think it would be reasonable to have a compiler come up with a solution for that simple of a problem.
In P1 code:
sub lim2, val wz
if_nz sub lim1, val wz
xor lim1, lim2
shl lim1, #1 wc
The value is in range "if_c_or_z". Unfortunately it changes the values of your limt variables, and I'm guessing that's not allowed.
Jonathan
Jonathan, this is hard to think about it. I will make the effort to ingest this soon.
Further, refined thoughts on two options for a possible new instruction as introduced above. The existing instructions that would be replaced by the new one are as follows:
1. Minimal option
MOV temp,bytecode 'preserve bytecode for later
AND temp,#andmask 'andmask = %11 {1 of 4} or %111 {1 of 8}
DECOD temp 'temp = %0001...%1000 or %00000001...%10000000
XOR temp,#xormask 'xormask = %1111 or %11111111
SHL temp,#1 'temp = %1110_0...%0111_0 or %11111110_0...%01111111_0
SKIPF temp 'temp = skip pattern, unused high bits all zero
2. Ideal option
MOV temp,bytecode
SHR temp,#shift 'right shift desired group to low bits
AND temp,#andmask
DECOD temp
XOR temp,#xormask
SHL temp,#1
SKIPF temp
The ideal option assumes the existing shift logic can be used for the right shift, the decode logic afterwards is trivial. The left shift inserts a '0' as the first skip bit, so that one instruction is always executed before the four or eight skippable ones.
As the instruction after SKIPF can only be cancelled if unused, not skipped, the leading '0' avoids 3 of 4 or 7 of 8 options incurring a two-cycle penalty, it keeps the timing consistent and creates space for a preliminary jump instruction. The new skip instruction, SKIPX for want of a better name, could be placed one instruction early if a branch is not needed.
The minimal option would require a right shift beforehand if the lowest bytecode bits are not the desired ones and a D-only instruction could suffice, with an opcode bit selecting 1 of 4 / 1 of 8. The ideal option would need a D,S slot, where S[4:0] could select the right shift and S[8] 1 of 4 / 1 of 8.
As can be seen, SKIPX could replace up to six or seven instructions, a better ratio than some other instructions. It would be a direct binary-to-skip-fast instruction, which is currently missing.
TonyB_,
I'm not readily understanding your idea. I'm going to need to look carefully when I have more time.
Right now, I'm getting to leave in the morning to go to Treehouse in Colorado, so that we can get the layout changes done once and for all, hopefully.
I'm not readily understanding your idea. I'm going to need to look carefully when I have more time.
Right now, I'm getting to leave in the morning to go to Treehouse in Colorado, so that we can get the layout changes done once and for all, hopefully.
After more consideration, what I'm trying to do could be achieved with two or three existing instructions, not six or seven, therefore no real need for a new one.
There is a real issue here, though, related to your recent SETQ2 addition to XBYTE. I hope I can explain it better when you get back from your trip.
Comments
Intriguing, so the FIFO stops being used when the code path enters CogExec, the FIFO is then being used for a buffered block write, then automatically refills with HubExec code when the RET is hit.
There's probably a FIFO flush delay at this point but "That's so easy!" Now for the block copy version ... it'll need to chop it up into chunks like Chip has.
I've been using Dave's comport code for my testing. It's simple bit-bashing code (presumably been like that since Prop2Hot days) but it saved me going through the trial and error effort myself. Had to hack it around a little to remove the GCC parts. I wanted all the printf() code too but that was just too hard to extract. So I settled for making my own itoa() that uses the putch() primitive.
Dave, there was a lot of overlap between the move and fill functions. So, I melded them together. I didn't want one full-length SETQ+WRLONG fill, since interrupt latency would have been compromised.
If you don't care about interrupt latency, you can do one SETQ+WRLONG combo, using a constant for the WRLONG's D. It will fill at the rate of 1 long per clock.
We almost need some kind of compiler that can optimize variable placement and coordinate it with SETQ+RDLONG activity to get the big speed improvements possible.
I've been thinking for years now that it would be really neat to have a tool where you can define some kind of outcome that you want, based on inputs, and the tool, using random instruction selection and simulation, comes up with really efficient solutions that would have been too difficult for a human to produce in as timely a way.
So you mean sort of like a Place & Route in FPGA, only for Opcodes... ?
Could be interesting to try, but likely to quickly become exponentially slower, as the task increases.
Easier could be more-automated tools for your masked skip opcodes.
ie fairly simple ones that re-create masks as lines are added and removed.
https://en.m.wikipedia.org/wiki/Monte_Carlo_method
Yesterday, I had the problem of needing to find if a register held a value that was between two other random register values. I can do it in six instructions, but there must be a simpler solution. I think it would be reasonable to have a compiler come up with a solution for that simple of a problem.
That didn't occur to me, at first, but that would be the best way to do it. You could eliminate bad combos right away with that approach. A full parametric sweep would be impossible.
Sounds like your moving toward genetic algorithms :-)
1) generate new random input values
2) run 'model' routine with generated random input values, capture outputs
4) run 'test' routine with generated random input values, capture outputs
5) compare outputs and iterate 'test' routine if mismatch
6) repeat until n cycles with no mismatch
provide them with input-data and a goal output to train them?
nothing new here
Mike
No, thats exactly what Genetic Algorithms do.
"The genetic algorithm is a method for solving both constrained and unconstrained optimization problems that is based on natural selection, the process that drives biological evolution. The genetic algorithm repeatedly modifies a population of individual solutions."
Hi Chip,
I would like to show you a solution that I came up with to your above problem. I have tested it pretty thoroughly
using Cluso's debugger, so I hope that it helps.
HydraHacker
Neat! I was wondering if something like this was possible - checking the range in two directions. It hurt my head so bad that I couldn't get on top of it. Looks like you got there, though.
Meanwhile, here is what I came up with (comments are first, to state the goal of the code):
This code gets the direction of the compare into Z, so that it can still be used after C indicates if we are in range, or not.
Thanks chip for looking at my code, I really busted my brain! Trying to make a range check that works in two directions. I’ll add some comments soon to my code, I just have to figure out a good way to explain it. Chip your code really looks good and because you write such efficient code sometimes I have trouble understanding what you wrote :-)
HydraHacker
I have documented my code that performs a range check in two directions.
Thanks for documenting that. Looks great! Stuff like this is hard to think about.
If that code is used as a subroutine, it could be quickened a little:
XBYTE uses SETQ to establish the bytecode base and LSB/MSB indexing, usually at the start of bytecode execution:
After that, _RET_+Instruction's are done at the end of each code snippet to execute the next bytecode.
What if you have an alternate set of bytecodes, though, that you want to execute through a portal bytecode? You could do a new _RET_+SETQ, but then you permanently change the base and MSB/LSB indexing for subsequent _RET_+instruction's.
There needs to be a way to do a temporary base-and-MSB/LSB-indexing change that only works on the next bytecode, so that alternate sets of bytecodes could be used to run lower-usage code, without any need for a restorative _RET_+SETQ instruction later.
SETQ2 will now serve this purpose:
This will add 10 flops to each cog, but make alternate bytecode sets easy to engage without the need for post clean-up.
Very cool, that will make emulation of some CPUs and VMs nicer too!
Man! I was just thinking that I wanted you to see this, because you would probably like it.
Jonathan
Please see this post and later
Jonathan, this is hard to think about it. I will make the effort to ingest this soon.
TonyB_,
I'm not readily understanding your idea. I'm going to need to look carefully when I have more time.
Right now, I'm getting to leave in the morning to go to Treehouse in Colorado, so that we can get the layout changes done once and for all, hopefully.
After more consideration, what I'm trying to do could be achieved with two or three existing instructions, not six or seven, therefore no real need for a new one.
There is a real issue here, though, related to your recent SETQ2 addition to XBYTE. I hope I can explain it better when you get back from your trip.