I REALLY like your SETTRACE instruction! It will make debugging much easier, as it will be possible to watch the PC and find out if a cog is stuck.
Here is an idea for the next shuttle run:
If a specific pin (P85? whichever is the highest "unused" pin) is found to be pulled low on startup, SETTRACE the boot loader / monitor to, say, P64-P79... it only adds 2 instructions to the "ROM".
The above, and a logic analyzer, would help verify the chip until it's fully tested; and since the bootloader/monitor source is published, it won't hurt to leave it in for a production run (to avoid a mask charge).
For development work, it will be fantastic - capturing the execution profile of 1..4 tasks, for post-capture analysis!
Neat idea, Bill. Maybe we should make it initialize to ON, in case the cog's brain stem is all that works.
Totally. It's a great feature and I think it will go a long way toward addressing that "but does it have JTAG?" question. Answer: "No, but this is really cool.... " and given some spiffy tools, a perfectly reasonable and quite useful answer.
IMHO, one of the strong points of the Propeller architecture is bit manipulation. I often use instructions such as rcr, rev, muxc, andn, movi, movd for intended and unintended things. Using "movd" in some situations saves me 3 instructions compared with doing "and dest" -> "and source" -> "shl source" -> "or dest, source". Designing your code (and formats) with Propeller instruction bit fields in mind can increase execution speed for inner loops. I often think of how nice it would be to have arbitrary bit field read/write instructions. That would be very useful for a lot of general cases.
3 instructions would be needed.
SMM - Set Multiplex Mask (maybe the accumulator could be used as a mask?)
MUXV - Multiplex Value
DMIV - Demultiplex Into Value
If you want to fill a destination address with data in some arbitrary bits, you could just do:
SMM bitMask
MUX dest, source
The bits in source would fill up all the ones of the mask in destination, starting from the LSB and continuing until there are no more "holes" to fill.
SMM -> DMIV would do the opposite, of course.
Maybe these kinds of "dynamic" instructions eat more silicon and are harder to implement?
/Johannes
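To make the proposal above concrete, here is a minimal Python model of the bit-filling behaviour described for SMM followed by MUXV, and of the reverse direction for DMIV. The function names, the 32-bit width, and the exact semantics are assumptions drawn from the description in the post; nothing here reflects an actual Propeller instruction.

```python
def mux_into_mask(dest, source, mask, width=32):
    """Model of the proposed SMM + MUXV pair (hypothetical semantics).

    Consecutive bits of `source`, starting at its LSB, are written into the
    positions of `dest` where `mask` has a 1. Other bits of `dest` are kept.
    """
    result = dest
    src_bit = 0
    for pos in range(width):
        if (mask >> pos) & 1:
            bit = (source >> src_bit) & 1
            result = (result & ~(1 << pos)) | (bit << pos)
            src_bit += 1
    return result & ((1 << width) - 1)


def demux_from_mask(value, mask, width=32):
    """Model of the proposed SMM + DMIV pair (hypothetical semantics).

    Bits of `value` found at the 1-positions of `mask` are gathered into a
    contiguous result, starting at the LSB.
    """
    result = 0
    out_bit = 0
    for pos in range(width):
        if (mask >> pos) & 1:
            result |= ((value >> pos) & 1) << out_bit
            out_bit += 1
    return result


if __name__ == "__main__":
    mask = 0b1010_1100
    packed = mux_into_mask(0, 0b1011, mask)
    print(hex(packed), bin(demux_from_mask(packed, mask)))
```

The loop walks the mask one bit at a time, which mirrors the serial "find the next hole" step that makes this costly in hardware.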
Defaulting to it is great for the shuttle runs, but before production, the pull-low trick would be helpful (so external logic can drive P64-P79 in production boards)
I just realized that one of the SPI pins (other than CS) could be used for the pull low detection, sparing P85... say 20k pull-down, easily over-driven by prop's MOSI
Are there enough bits left to be able to specify both source and destination NIB/BYTE/WORD?
Consider
REPS #6,#512
MOV dira, qspi_read_mask
ANDN outa,qspi_clock_mask
SETNIB ptra.0,ina.0
OR outa,qspi_clock_mask
ANDN outa,qspi_clock_mask
SETNIB ptra++.1, ina.0
OR outa,qspi_clock_mask
Presto! CLKFREQ/3 QSPI read without needing special QSPI support in SERDES! (add nopx's for other data rates, or use counters for CLKFREQ/1)
The code for QSPI write is left as an exercise to the reader
Octal writes/reads would be almost identical to the above.
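For readers following the SETNIB ptra.0 / ptra++.1 pair in the snippet above, here is a small Python model of the packing step only: two 4-bit samples from the data pins become one byte. The function name and the low-nibble-first order are assumptions read off the .0/.1 suffixes; it says nothing about timing, or about the nibble order a particular flash part uses.

```python
def pack_nibbles(samples):
    """Pack 4-bit samples into bytes, two per byte.

    Assumption: the first sample of each pair lands in the low nibble
    (field .0) and the second in the high nibble (field .1).
    """
    if len(samples) % 2:
        raise ValueError("need an even number of nibble samples")
    out = bytearray()
    for lo, hi in zip(samples[0::2], samples[1::2]):
        out.append(((hi & 0xF) << 4) | (lo & 0xF))
    return bytes(out)


if __name__ == "__main__":
    print(pack_nibbles([0x1, 0x2, 0xA, 0xB]).hex())
```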
IDEA:
You could cut the number of opcodes in half by:
MOVNIB dest.x, src.y ' x,y encoded like you would have before in #nibble
MOVBYTE dest.x, src.y ' x,y encoded like you would have before in #byte
MOVWORD dest.x, src.y ' x,y encoded like you would have before in #word
Hmmm.... MOVN / MOVB / MOVW as mnemonics would be shorter...
If there are not enough bits, or the logic is too complex, we can still get 3 cycles as long as the lowest 4 bits of porta/b/c are used for D0-D3, otherwise two extra instructions are needed (ie read source nibble to temp, write to destination nibble) leading to 4 cycles per nibble without a counter providing a clock, and 2 cycles with a counter used to provide the clock.
There aren't enough bit possibilities available to do random nibble-to-nibble moves. However, we could have GETNIB do a rotate-left-by-nibble to get some kind of chain going.
REPS #8,#512
MOV dira, qspi_read_mask
ANDN outa,qspi_clock_mask
GETNIB temp, ina, #0
OR outa,qspi_clock_mask
SETNIB ptra, temp, #0
ANDN outa,qspi_clock_mask
GETNIB temp, ina, #0
OR outa,qspi_clock_mask
SETNIB ptra++, temp, #1
clkfreq/4 per nibble without using counters to provide the clock (20 MB/sec @ 160 MHz)
clkfreq/2 per nibble using counters to provide the clock (40 MB/sec @ 160 MHz)
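The throughput figures above follow from simple arithmetic, sketched here in Python. The helper name is made up for illustration, and it assumes one clock per instruction and two nibbles per byte, as the loop counts in this thread imply.

```python
def nibble_rate_mb_per_s(clkfreq_hz, clocks_per_nibble):
    """Return throughput in MB/s for a loop that moves one nibble
    every `clocks_per_nibble` clocks (two nibbles make one byte)."""
    nibbles_per_s = clkfreq_hz / clocks_per_nibble
    return nibbles_per_s / 2 / 1_000_000


if __name__ == "__main__":
    for clocks in (4, 3, 2):
        print(clocks, round(nibble_rate_mb_per_s(160_000_000, clocks), 1))
```

The same function gives about 26.7 MB/s for the earlier clkfreq/3 loop at 160 MHz.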
Well, things are moving along well, despite a few delays due to hard-to-find bugs after the Big Change.
Right now, I'm working on adding the new pixel blending modes to the texture mapper, and then I must revisit the logic controlling the auxiliary RAM from the cog side. After that, I'll deal with the synchronous shifter issue. Then, updated test suites need to be made.
Last night, while trying to track down the source of a bug, I was hard-coding some internal cog signals out to I/O pins so that I could observe things on the logic analyzer. This helped me immensely. It occurred to me that this type of thing could be standardized almost for free, as it takes just a few muxes. So, I added a SETRACE D/#n (set trace) instruction which outputs that cog's internal signals onto a selectable word (as in 16 bits) of I/O pins. The signals output are, from top down: Z, C, GO, COND, VALID, TASK[1:0], PC[8:0]. This way, you can see, in real time (or capture through internal port D to AUX RAM or external SDRAM), the sequence of a cog's activity. VALID indicates that the instruction has not been cancelled as branch-trailing code, COND shows the condition, and GO is high whenever execution is proceeding and low when the pipeline is being stalled. The rest are what you'd expect: the flag states (Z and C), the task number (TASK), and the program counter (PC). When you need to see what a cog is doing, this really spills its guts. You could make a trace from another cog by having it wait for an internal (port D) edge event, then log so many clock cycles of activity, which can then be mapped back to the code that is known to be in the cog of interest. Anyway, it doesn't take any special code to operate; you just make 16 pins outputs, then do a SETRACE #word to start the outputting. This should be helpful to people who want to get an understanding of what's actually going on with their code at the clock-cycle level.
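For anyone post-processing a logic analyzer capture, the 16-bit trace word described above could be unpacked along these lines. This is a sketch in Python, not anything from Parallax: the bit positions are inferred from the "from top down" ordering (Z in bit 15 down to PC in bits 8..0), and the type and function names are invented for illustration.

```python
from typing import NamedTuple


class TraceWord(NamedTuple):
    z: int      # Z flag
    c: int      # C flag
    go: int     # 1 while execution proceeds, 0 while the pipeline is stalled
    cond: int   # condition result
    valid: int  # 1 when the instruction has not been cancelled
    task: int   # task number, 0..3
    pc: int     # program counter, 0..511


def decode_trace(word):
    """Split one 16-bit trace sample into its fields.

    Assumed layout (top down): Z, C, GO, COND, VALID, TASK[1:0], PC[8:0].
    """
    word &= 0xFFFF
    return TraceWord(
        z=(word >> 15) & 1,
        c=(word >> 14) & 1,
        go=(word >> 13) & 1,
        cond=(word >> 12) & 1,
        valid=(word >> 11) & 1,
        task=(word >> 9) & 0x3,
        pc=word & 0x1FF,
    )


if __name__ == "__main__":
    print(decode_trace(0xBC2A))
```

Feeding a run of captured samples through this and watching for a PC that never changes is one simple way to spot a stuck cog.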
Set only if encryption is disabled, of course. Aha, I see it's already been mentioned. Agreed: having the instruction disabled by encryption would be simplest.
Chip,
Another possibility - could there be an instruction that takes an input pin to force a "stall" - would give us single stepping even though there would be many caveats.
Ahle2 said..
SMM - Set Multiplex Mask (maybe the accumulator could be used as a mask?)
MUXV - Multiplex Value
DMIV - Demultiplex Into Value
Perhaps a simpler set might be possible?
SETMASK value - sets a MASK register (maybe 2 per cog or use ACCx) - this could then be used for other purposes too like the poly in CRC
MUX dest, srce
DMUX dest, srce
The MASK would contain two 9-bit values: srce = number of bits to skip left from 0, dest = number of bits to replace
As an example, the MOVI instruction would be the equivalent of
SETMASK value
MUX instr, bitfield
VALUE LONG 9<<9 | 23
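A rough Python model of the SETMASK / MUX / DMUX idea, assuming the mask long holds the skip count in bits 8..0 and the field width in bits 17..9, which is how 9<<9 | 23 reads. The function names are hypothetical and the semantics are only what the post describes.

```python
MASK32 = 0xFFFFFFFF


def _unpack(mask_value):
    """Split the proposed MASK long into (skip, width). Assumed encoding:
    bits 8..0 = bits to skip from bit 0, bits 17..9 = bits to replace."""
    return mask_value & 0x1FF, (mask_value >> 9) & 0x1FF


def field_mux(dest, src, mask_value):
    """Proposed MUX: write the low `width` bits of src into dest at `skip`."""
    skip, width = _unpack(mask_value)
    field = ((1 << width) - 1) << skip
    return ((dest & ~field) | ((src << skip) & field)) & MASK32


def field_dmux(value, mask_value):
    """Proposed DMUX: read `width` bits of value starting at `skip`."""
    skip, width = _unpack(mask_value)
    return (value >> skip) & ((1 << width) - 1)


if __name__ == "__main__":
    movi_mask = 9 << 9 | 23          # 9-bit field at bit 23, as in the post
    instr = field_mux(0x007FFFFF, 0x1AB, movi_mask)
    print(hex(instr), hex(field_dmux(instr, movi_mask)))
```

With the value 9<<9 | 23 this touches only bits 31..23, matching the post's MOVI equivalence.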
Just a thought as I realise we don't have a MOV icccc instruction.
BTW I noticed that MOVS/MOVD/MOVI have changed to SETxxx instructions. My preference is to retain MOVx rather than SETxxx.
The above MUX could be MOVX and DMUX could be EXTRACT??? I like being able to extract a source/destination/opcode even if I have to set a mask.
Effectively here, the MOVS/MOVD/MOVI instructions are a combo of the above with predefined masks. There are also the GET nibble/byte/word instructions as well. Maybe there could be some synergy between all these instructions???
Bill & Ahle2 - would you like to start a new thread to discuss possible simplification of these while Chip proceeds???
Another possibility - could there be an instruction that takes an input pin to force a "stall" - would give us single stepping even though there would be many caveats.
You could use WAITPEQ, masked to a "resume" pin. Externally, you would pulse the appropriate pin to resume execution. This too would have some caveats.
That's not the same as the "stall".
WAITPEQ only waits until condition is met. If you want it to stop again, you need to add another WAITPEQ.
What Cluso99 wants is an instruction that forces the Prop2 into single-step mode. No more WAITPEQ instructions filling up the COG RAM, just ONE instruction, then a lot of clicking with a button to slowly advance the code. Add a heap of LEDs connected to the I/O pins used in the 'Trace mode', and a whole lot more debugging is possible for anyone not fortunate enough to own a logic analyzer.
Yeah, I realize that. On the up side, it already exists!
But true single-step and breakpoint feature would be nice. Maybe SETRACE could also enable the other 16 pins in a bank as inputs, where you could have SE, STEP, BE, BRK[8..0]. SE would enable single-step mode, with STEP being pulsed to perform a single step. BE would enable a single breakpoint, whose address is given by BRK.
With SETTRACE (or SETNSAMODE? :-) ) will it be possible to use another cog to read the output pins, and if there's a single-step mode, to let another cog do the stepping? I'm thinking debug cog, for those unfortunate enough not to have a Logic Analyzer...
Why is there all this redesigning? It seems there will be some work to adapt to a different process and change a few errors that were found, but don't we want something to buy soon?
Am I missing something?
I'm wondering this too. PropGCC for P2 is now broken and will require an unknown amount of work to fix. I'm sure all of these changes will make P2 better but I wonder how much later it will be because of them?
Comments
Cool!
Are you giving the COGs 4th and 5th amendment flags? (oh, never mind, you said NSA)
Those are really nice ideas. It would take a lot of logic to make a bit-filling mux like that. It might be slow, too, because of serial computations needed to get to the next hole. I'll look into it today. I know it would save a lot of code in the Spin interpreter.
To address these kinds of issues, there are now instructions which can get and set nibbles, bytes, and words, with the field# being coded into the instruction, making them non-modal, so that all tasks can use them without any configuration contingencies:
SETNIB D,S,#0..7
GETNIB D,S,#0..7
SETBYTE D,S,#0..3
GETBYTE D,S,#0..3
SETWORD D,S,#0..1
GETWORD D,S,#0..1
There are a few other instructions which do other kinds of fixed moves.
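As a way to reason about the field instructions listed above, here is a Python model of what they appear to do. It assumes field 0 is the least significant nibble/byte/word, that the GET forms return the field zero-extended, and that the SET forms write the low bits of S into the chosen field of D. Those details are guesses from the mnemonics, not confirmed behaviour.

```python
MASK32 = 0xFFFFFFFF


def _get_field(s, n, bits):
    if not 0 <= n < 32 // bits:
        raise ValueError("field index out of range")
    return (s >> (n * bits)) & ((1 << bits) - 1)


def _set_field(d, s, n, bits):
    if not 0 <= n < 32 // bits:
        raise ValueError("field index out of range")
    field = ((1 << bits) - 1) << (n * bits)
    return ((d & ~field) | ((s << (n * bits)) & field)) & MASK32


# Models of GETNIB/SETNIB (#0..7), GETBYTE/SETBYTE (#0..3), GETWORD/SETWORD (#0..1)
def getnib(s, n): return _get_field(s, n, 4)
def setnib(d, s, n): return _set_field(d, s, n, 4)
def getbyte(s, n): return _get_field(s, n, 8)
def setbyte(d, s, n): return _set_field(d, s, n, 8)
def getword(s, n): return _get_field(s, n, 16)
def setword(d, s, n): return _set_field(d, s, n, 16)


if __name__ == "__main__":
    print(hex(setnib(0x12345678, 0xA, 3)), hex(getword(0x12345678, 1)))
```

Because the field number is part of each call rather than held in shared state, the model has no mode to set up first, which is the "non-modal" property mentioned above.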
WAIT A MINUTE !
Doesn't that immediately make it possible to watch all the encrypted code being decrypted, thus giving away the user's crown jewels?
Sounds like it's just blown all the carefully crafted code protection in the Prop II.
Unless trace only gets turned on at the end of boot and before starting the user code. Which is kind of pointless, isn't it?
Or am I missing a point again?
I think you could watch the end results (process), but not see the code or the crown jewels (key) that ultimately makes it happen.
It could be easily avoided if the decryption was launched in COG#1, which would not lift its skirt... simply have the decrypt code NOT settrace.
That can be permanently OFF if security bit's programmed
clkfreq/4 per nibble without using counters to provide the clock (20 MB/sec @ 160 MHz)
clkfreq/2 per nibble using counters to provide the clock (40 MB/sec @ 160 MHz)
Still pretty damned good for only six pins
Yeah, good idea! For safety (in things I'm not sure are actually compromised yet), that's probably the best plan.
How can SETTRACE operate, at the clock cycle level, @ 160 MHz?
Is there no concern at all about bandwidth limitations imposed by the pin circuits?
Also, it would be extremely useful to have SETTRACE selectively enabled/disabled under program control, even in cases where security bits are programmed.
I can envision its usefulness as a way to do 8- or even 9-bit banging, or to enable redundancy checking among three or more Propellers running synced tasks in parallel, audited off-chip by another Propeller or even an FPGA.
Yanomani
===Jac
I can relate to the feeling of being "almost there" and then feeling that I am miles away. But you have to remember that the various vendors have their own schedules. This corresponds to "windows," which roughly correlates to available schedules. I'm guessing Chip is more than aware of his own schedule and that of the vendors. It looks to me that Chip had ideas that he did not have time to implement before the last round of development and therefore came up with the idea of the P3 program. Since that shuttle failed, there was a window of opportunity and he jumped at it. This accounts for his hunkering down. I have a thousand questions that I would like to ask, but don't because the answer is likely to change in the very near future.
What is clear is that the P2 is going to be far more powerful and versatile than anyone had reasonably thought. I can't wait, either. But waitn we b.
Rich
All these changes sound wonderful for those of you writing compilers and all the gee-whiz-bang code for the someday-to-be P2, but to those of us out here who have been eager to have the P2 as first defined... it scares the hello out of us to see all these changes, each of which feels like it could increase the chances that the NEXT iteration of the chip will have another problem unrelated to the previous difficulties. I have products that I have delayed for years on the expectation of the P2 being available in "the next few months." Please...
Chip is an incredible engineer and no one stands in more awe of his genius than yours truly. However... as a lesser engineer, I know that my greatest successes have been in designs that have been forced to a close and a final period put at the sentence: "Here's what it does". My greatest failures have come from ever-expanding delivery dates caused by my own enthusiasm for adding "Just one more great thing".
It may be time to put that period at the end of "Here's what the P2 does:" and start the new sentence: "What would you rather have more of in the P3?"
In the meantime, I have been planning what I would like to do first with the P2, coding it on the P1 where that is possible.
I have been working on a camera driver that is just not possible on the P1 but will absolutely fly on the P2.
On the P1, here is what it looks like:)