We do have 16 COGS. As Chip says, in-COG timing isn't hard. It's a lot like a P1, with some spiffy new hardware. Doing things like we did on P1 will continue to be a reasonable option.
The difference is that one can load up a COG with interrupts, etc... That will take a bit more thought, but I don't see it being overly difficult.
The Streamer will help a lot. It's the general-purpose device we all thought WAITVID would become in the P2.
In HUB land, it's going to vary a lot more. Personally, I see people putting critical stuff into a COG, P1 style.
Inline will be for hardware access and some speed boost. People can code for the worst case, and where that becomes an issue, into a COG it goes!
And we all understood the HUB tradeoffs. Moving to a more conservative design gets us HUBEXEC, and a lot of speed for big programs.
Chip, my apologies for derailing the main topic. Go back to tweaking the interpreter so that we can get the instruction set finished. Arguments over syntax can wait.
Hey, no problem. I really didn't even notice.
I was working out the variable-related bytecodes last night and they are probably going to be 10x faster than Spin on Prop1. A lot is getting done now with very few instructions. Plus, the snippets are more specialized, as opposed to general-case, so efficiency is way higher.
I really like RET_AFTER. Is there any way to get that under 8 characters, though, so that it can sit at the tab stop just prior to the instruction? Then, it aligns with most IF_x prefixes.
I really like ozpropdev's suggestions (double alias), smart and clean. If 7 chars is the limit, then:
RETNEXT
RETELSE
Says that something like Python would have to be stripped of much of the library to fit in small systems...
Both would be good for P2, and even a smaller Python is OK.
As P2 continues to gestate, other offerings advance. Recent news was the WiFi Pi Zero W for $10, and this news from NXP:
http://www.nxp.com/products/microcontrollers-and-processors/arm-processors/kinetis-cortex-m-mcus/k-series-performance-m4/k2x-usb/kinetis-k28-150-mhz-2x-usb-core-voltage-bypass-2mb-flash-1mb-sram-mcus-based-on-arm-cortex-m4:K28_150#overviewExpand
That part is an M4 core at 150MHz, with 1MB of SRAM and 2MB of Flash, and QuadSPI XIP support (& HS USB).
The 1MB of SRAM moves it ahead of P2's target memory, and the 150MHz speed is similar to the P2 target (which I suspect will drop before release).
Price-wise, the K28 is likely to cost more than a P2, but it is a possible companion device.
Both PiZW and K28 can run 'larger' language implementations, should users need that.
It will be important to allow easy movement of code across such platform boundaries.
Spin is rather too niche, but can spawn other efforts.
In other news, I see the highest-end multi-core Infineon MCU parts now include an 8-bit MCU in the corner.
I'm looking forward to seeing the numbers that P2 byte-code fetch can reach when fed from the Streamer in XIP applications.
That should be part of this byte-code core testing.
Seems to me a ton of library code could be in external storage too. A few COGS dedicated to paging and a compiler able to break all that out in a useful way could perform well.
Too diverse a library use would grind away, but a lot of more focused uses may do just fine.
I think most of that library code is not needed for the kinds of things people do with micro-controllers. Rather like we can write C++ code for the Propeller but we don't expect the luxury of the gigantic C++ standard library to be available.
This is the approach taken with MicroPython (https://micropython.org/) and the tiny Javascript engines.
That's quite cool. I just finished reading through your P2 document and there are some amazing capabilities there. It will be a challenge to find a way to use them from the GCC code generator.
I made another change to the Verilog, but it's very minor.
The SETCZ instruction has always gotten D[1:0] into {C,Z}. Now, SETCZ rotates D right by two bits if D is a register (not #D).
This automatic rotation allows for new C and Z values with each SETCZ. It comes in very handy for the memory-read/write/modify code in the bytecode interpreter. Note that 'p' is the bytecode:
'
' Read memory - no index
'
rdmem           setcz   p       wc,wz   'get offset
        if_00   rfbyte  m
        if_01   rfword  m
        if_10   rflong  m
                setcz   p       wc,wz   'add base
        if_00   add     m,pbase
        if_01   add     m,vbase
        if_10   add     m,dbase
                setcz   p       wc,wz   'read memory
        if_00   rdbyte  x,m
        if_01   rdword  x,m
        if_10   rdlong  x,m
        _ret_   pusha   x
Here is the read with indexing:
'
' Read memory - with index
'
rdmemi          setcz   p       wc,wz   'get offset
        if_00   rfbyte  m
        if_01   rfword  m
        if_10   rflong  m
                setcz   p       wc,wz   'add base
        if_00   add     m,pbase
        if_01   add     m,vbase
        if_10   add     m,dbase
                setcz   p       wc,wz   'handle index
                popa    x
        if_01   shl     x,#1
        if_10   shl     x,#2
                add     m,x
        if_00   rdbyte  x,m     'read memory
        if_01   rdword  x,m
        if_10   rdlong  x,m
        _ret_   pusha   x
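For anyone following along, here is a rough Python model of the SETCZ behaviour described above, just to show how one register can feed all three SETCZ steps in the listings. The two-bit field layout of 'p' used here is only an illustration, not the actual bytecode encoding.

    def setcz(d):
        # Model of SETCZ D (register form) as described above:
        # C = D[1], Z = D[0], then D is rotated right by two bits.
        c = (d >> 1) & 1
        z = d & 1
        d = ((d >> 2) | (d << 30)) & 0xFFFFFFFF    # 32-bit rotate right by 2
        return c, z, d

    # Illustrative bytecode: bits[1:0] = offset size, bits[3:2] = base, bits[5:4] = access.
    p = 0b10_01_00                                 # byte offset, vbase, long read
    for step in ("get offset", "add base", "read memory"):
        c, z, p = setcz(p)
        print(f"{step}: C={c} Z={z}")              # each SETCZ exposes a fresh two-bit field

Running it gives C=0 Z=0, then C=0 Z=1, then C=1 Z=0, which is the if_00 / if_01 / if_10 selection the rdmem code above relies on.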
Last night I was thinking about a challenge I have in implementing the interpreter. I've been tempted to write out separate routines for all the different memory variable accesses, since they could be fast and concise that way.
For example, reading a long variable with an 8-bit offset from dbase would look like this:
                rfbyte  m       'fetch the 8-bit offset from the bytecode stream
                add     m,dbase 'add dbase to form the hub address
                rdlong  x,m     'read the long variable
        _ret_   pusha   x       'push the value and return
That's pretty simple, but all the different routines needed would be huge.
A general-case routine is several times slower, as it must handle every possibility. That would really slow the interpreter down, but save memory.
I was thinking about how it may be possible to selectively SKIP certain instructions in a long pattern of instructions, in order to get what you want out of the sequence.
Look at this code. It contains every instruction needed to execute all the hub read/write operations. To do a specific operation, you'd want to skip most of these instructions and execute only a few select ones:
'
' Read/write hub memory
'
rw_mem          rfbyte  m       'one of these (offset)  3 x
                rfword  m
                rflong  m
                add     m,pbase 'one of these (base)    3 x
                add     m,vbase
                add     m,dbase
                popa    x       'maybe this (index)     2 x (on/off)
                shl     x,#1    '...and maybe this
                shl     x,#2    '...or maybe this
                add     m,x     '...and this
                rdbyte  x,m     'one of these (read)    6 x
                rdword  x,m
                rdlong  x,m
        _ret_   pusha   x       '...and this
                popa    x       'or this (write)
        _ret_   wrbyte  x,m     '...and one of these
        _ret_   wrword  x,m
        _ret_   wrlong  x,m
There are 108 permutations possible. That would be a lot of separate small, fast routines, or one big slow one that does it all.
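Just to sanity-check the 108 figure against the choices the comments in the listing call out, a quick bit of Python arithmetic (nothing here beyond the counts already given):

    offset_sizes = 3        # rfbyte / rfword / rflong
    bases        = 3        # pbase / vbase / dbase
    indexing     = 2        # indexed or not
    accesses     = 3 + 3    # read byte/word/long, or write byte/word/long

    print(offset_sizes * bases * indexing * accesses)    # 3 * 3 * 2 * 6 = 108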
Well, we can now get ultimate compactness AND highest speed:
                skip    ##%001111110111100      'execute jmp/rfbyte/add/rdlong/pusha
                jmp     #rw_mem

...then this executes, pieced together from the rw_mem code...

                rfbyte  m
                add     m,dbase
                rdlong  x,m
        _ret_   pusha   x       'only the four instructions we wanted, zero time overhead!
SKIP uses bits in D, LSB first, to selectively cancel instructions in the order they occur in memory, starting with the next instruction. It works through branches, too. And it doesn't just cancel instructions. In cog-exec mode, it actually increments the PC by 1..8, not just by 1, in order to get to the next instruction, saving time. It works in hub-exec mode by just cancelling instructions, always stepping by 4, to not reset the FIFO.
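To take some of the magic out of the ##%001111110111100 constant above, here is a rough Python sketch that rebuilds it from the rw_mem listing. It assumes only what is stated above: bit 0 of the pattern lines up with the instruction right after SKIP (the JMP), a 1 cancels an instruction, a 0 lets it execute, and bits past the last 1 are implied zeros. The instruction strings are just hand-copied labels for this sketch.

    # Instructions in memory order, starting with the JMP that immediately follows SKIP.
    sequence = [
        "jmp #rw_mem",
        "rfbyte m", "rfword m", "rflong m",
        "add m,pbase", "add m,vbase", "add m,dbase",
        "popa x (index)", "shl x,#1", "shl x,#2", "add m,x",
        "rdbyte x,m", "rdword x,m", "rdlong x,m",
        "_ret_ pusha x",
        "popa x (write)", "_ret_ wrbyte x,m", "_ret_ wrword x,m", "_ret_ wrlong x,m",
    ]

    # The variant wanted here: read a long, 8-bit offset, dbase, no index.
    keep = {"jmp #rw_mem", "rfbyte m", "add m,dbase", "rdlong x,m", "_ret_ pusha x"}

    last = max(i for i, ins in enumerate(sequence) if ins in keep)   # the _ret_ ends the variant
    mask = 0
    for bit, ins in enumerate(sequence[:last + 1]):
        if ins not in keep:
            mask |= 1 << bit                     # 1 = cancel this instruction

    print(f"skip    ##%{mask:0{last + 1}b}")     # -> skip    ##%001111110111100

Swapping the entries in keep gives the pattern for any of the other variants.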
This is going to allow for really compact and fast code. This will shrink the interpreter down to almost nothing and it's going to be as fast as can be. For each bytecode, I'll look up a branch address and a SKIP pattern, and do a SKIP+JMP to execute the code and pattern needed by the bytecode. It will be very efficient.
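As a sketch of the shape only (the bytecode value and the table contents here are invented for illustration, not taken from the interpreter), the per-bytecode lookup described above could be modelled like this:

    # Hypothetical table: bytecode -> (branch target, SKIP pattern).
    DISPATCH = {
        0x4A: ("rw_mem", 0b001111110111100),    # e.g. read long, 8-bit offset, dbase, no index
        # ... one entry per bytecode ...
    }

    def dispatch(bytecode):
        # Look up the entry, then issue the SKIP pattern and jump to the snippet.
        target, pattern = DISPATCH[bytecode]
        print(f"        skip    ##%{pattern:b}")
        print(f"        jmp     #{target}")

    dispatch(0x4A)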
I'm thinking with _RET_ and this new SKIP, it's probably time for an FPGA update.
When I started working on this, I first had a 32-bit encoder to look ahead at the SKIP pattern, so that it could increment the PC by 1..33. That encoder was big and slow, and then a big shifter was needed. Not practical. By making it only look eight SKIP bits ahead, the encoder and shifter were 1/4 the size and fast enough to work in time. It was very little logic to implement. If you do wind up skipping 32 instructions, it takes a total of 8 clocks: 1st instruction is cancelled, then the 9th, 17th, and 25th are cancelled, with PC incrementing by 8, 8, 8, then 8, dropping you off at the end of the sequence. Interrupts are disabled during SKIP. The SKIP pattern can be as short as you want, as leading 0's are implied. Once the 1's are exhausted, it's over.
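Here is a rough Python trace of that stepping behaviour, as I read the description above (a model of the description, not of the actual Verilog): each fetched instruction is cancelled if its own skip bit is set, and the PC then advances by one plus up to seven further skipped slots, so short runs cost nothing while a run of 32 produces exactly the four cancels described.

    def skip_trace(pattern, count):
        # pattern bit k corresponds to the k-th instruction after SKIP (bit 0 = first).
        # Returns (index, 'run' or 'cancel', pc_increment) for each instruction reached.
        trace, i = [], 0
        while i < count:
            cancelled = bool((pattern >> i) & 1)
            step = 1
            while step < 8 and (pattern >> (i + step)) & 1:
                step += 1                        # absorb up to 7 following skipped slots
            trace.append((i, "cancel" if cancelled else "run", step))
            i += step
        return trace

    # The rw_mem example above: five instructions run, nothing is ever cancelled.
    print(skip_trace(0b001111110111100, 15))
    # 32 consecutive skips: the 1st, 9th, 17th and 25th are cancelled, with the PC stepping by 8.
    print(skip_trace((1 << 32) - 1, 33))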
It looks clever, but also like a source of virtually unreadable code. Is it really faster than the one-routine-per-instruction implementation if you put those routines in hub memory? How about paring down the instruction set so you don't need so many cases?
You don't want those routines in the hub because the FIFO will always be slowing down the RDxxxx/WRxxxx instructions. This allows for the smallest code: just the bare set of instructions needed to build any variant routine, plus a mask for the SKIP instruction.
I still think it might be better to just rearchitect the byte code VM instruction set to fit into COG/LUT RAM. Are you attempting to keep the byte codes the same as the ones used by the P1 VM?
David Betz,
I think that ship already sailed with self-modifying code and all the ALT* instructions. This is a neat code-density trick, and with a little smarts an editor could show the executed snippets in a tooltip or some form of expanded view mode. Heck, we might even be able to let you edit the "expanded" form and have it auto-generate the condensed form.
Maybe I am biased, since I have to deal with templates, lambdas, and what not in C++ all the time, and those are equally hard (if not far worse in some cases) to parse and debug. I agree that it will make PASM2 a bit harder to read and debug when it has them, but I think the benefits far outweigh that cost.
We should not be worrying about an extra tab stop. By the time I have labels and conditionals in, the instructions are already far into the line.
Of course, actual tab characters should never exist in a source file.
Looks like a cross between C and Spin. The description uses the words bytecode and interpreter...
Compiles to 100kB.
Is this something that could work for P2?
Also, I am looking forward to trying to get OpenSpin compiled/running on P2 using propgcc when it's up and running.
The bytecode dispatcher jumps right to these snippets and they return to the bytecode loop. This is going to be 10x faster than Spin on the current Prop1.
Reminds me of the old POPZC instruction in P2-Hot.
Clever! So, to be clear, it also means that D[1:0] are rotated to D[31:30]?
It might be worthwhile to make this two separate instructions (same encoding, though):
SETCZ #
RORCZ D
This makes it a bit clearer when reading code. Also, that aligns it with "ROR D, #1 WC" (which itself might benefit from an alias RORC).
Yes, D[1:0] are rotated into the top two bits.
Small and fast, fabulous!
What happens if there's a second skip when one is already in effect?
That depends on whether the first SKIP skipped the second SKIP.