Propeller II

cgracey · 2012-08-10 01:18

pedward wrote: »

Ok, those are the details I was after since we talked broad strokes.

As for 128 bits, that doesn't worry me: HMAC was designed for that, and SHA-256 hasn't been proven to be vulnerable to a 1-bit attack, much less 128 bits of deterministic entropy.

I am concerned about how WP is implemented. Last we talked, you and I floated the idea of redundant WP bits to prevent unintentional bricking. Is only part of the e-fuses WP? That's the distinction I make between "system" and "user". The "user" is still allowed to blow "user" fuses after the WP is set. Maybe WP should be two-fold: 5 WP bits (interleaved) for the "system" portion and several interleaved WP bits for the user portion?

I was kinda thinking WP bits could gate write access to the next 31 bits of fuses, so you have 1 WP bit for every 31 bits. Though they should be ANDed for the 2 different groups. The system group fuse bits are ANDed together and the user group bits are ANDed together.

Also, come to think of it, fuse bits need to be blown in ROM (Ring 0) code because you can read back the bits in Ring 1. The ROM could simply read 5 longs and check the sign bit (high order bit) of each to see if it's WP, and disable writes if all bits are set.
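A minimal sketch of the ROM-side check described above, assuming five fuse longs whose high-order (sign) bit is the write-protect bit. The function name and the sample values are hypothetical; the sketch only illustrates the "ANDed" redundancy, where writes are disabled only when every WP bit in the group is set.

```python
SIGN_BIT = 1 << 31  # high-order bit of a 32-bit long, used here as the WP bit


def group_write_protected(fuse_longs):
    """Return True only if the WP (sign) bit is set in every long of the group.

    ANDing the bits means one stray blown WP bit does not lock the group.
    """
    return all(value & SIGN_BIT for value in fuse_longs)


# Hypothetical fuse contents: four of the five WP bits are blown.
partly_blown = [0x8000_0001, 0x8000_0000, 0x8000_00FF, 0x8000_0000, 0x0000_0000]
# The same contents with the last WP bit blown as well.
fully_blown = [value | SIGN_BIT for value in partly_blown]
```

With these values, `group_write_protected(partly_blown)` is false and `group_write_protected(fully_blown)` is true, which is the redundancy against accidental bricking that the post asks for.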

Yes, the loader cares about 4K...

The fuses are vulnerable to write from the authenticated loader. This way, a special loader is sent to program fuses when desired. Otherwise, one of the first things a normal loader would need to do would be to hide the fuses, effectively write-protecting them, as well as hiding them. Also, the COGID mechanism is used not just for fuses, but for testing the 64-bit system counter (CNT). This testing would be done by a special loader, as well, during device testing. I wanted to free the ROM from having to do too many things. Do you see a problem with this?

Cluso99 · 2012-08-10 01:26

Thanks for the info on the SPI Flash chip. Pity I cannot find a SOT23-6 device. Smallest is SOIC8 0.153".

I would have thought we could have read a boot block from SD given there is SPI support in ROM. So I wait with bated breath for Peter to check (pure lack of time here for 1 month).

My debugger (the basic section) is small because it sits in the shadow ram and basically runs an LMM debugger from hub. I will try and find the time over the w/e to take a look at how this could work with the P2. IIRC I only used 4 longs in the cog (in shadow ram). The output method (to hub) is a single character buffer. If it's non-zero, the debugger just waits. So the ROM needs to have the hub code in it. The rest of the code to do the actual debugging can be soft.

How I implemented the debugger was that, prior to loading the hub code into the cog, I saved the first few longs in hub that would be loaded into cog $0+ into a new hub location, and then replaced them with my debugger initialisation code, then let cognew run. This caused the cog to start executing my debug code at cog $0. This then loaded up the shadow ram and jumped to it. It implemented an LMM execution machine (totally within the shadow cog ram) running my hub LMM code. This LMM code first replaced those instructions/longs that I had previously modified (before the cognew) at cog $0+. Then my LMM code would single-step/cycle the user code (via the LMM execution machine), beginning at cog $0.
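The save/patch/restore sequence above can be modelled in a few lines. This is only a sketch under stated assumptions: hub and cog memory are plain Python lists indexed by long, and `patch_image`, `restore_cog` and the stub contents are hypothetical names, not part of the actual debugger.

```python
def patch_image(hub, image_addr, boot_stub):
    """Save the first longs of the user's cog image, then overwrite them with the stub.

    `hub` is a list of longs and `image_addr` is a long index (an assumption
    made to keep the model simple).
    """
    end = image_addr + len(boot_stub)
    saved = hub[image_addr:end]
    hub[image_addr:end] = boot_stub
    return saved


def restore_cog(cog, saved):
    """First job of the hub LMM code: put the user's original longs back at cog $000+."""
    cog[0:len(saved)] = saved


# Hypothetical 8-long user image sitting at hub long index 8.
hub = list(range(100, 116))
stub = ["boot0", "boot1", "boot2", "boot3"]
saved = patch_image(hub, 8, stub)   # before cognew
cog = hub[8:16]                     # cognew copies the patched image into the cog
restore_cog(cog, saved)             # cog now holds the user's code again
```

The user's code is never edited by hand; only the hub copy is patched for the moment of loading, which is why a small save area in hub is needed.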

While I had not implemented any breakpoints, I had devised a way to do this.

Now, on the P2 there is no shadow ram (but we have a little more cog space because we have fewer registers).

So firstly, where will we place this LMM execution machine? Cog $0+ or at the end of cog (somewhere about $1F8+)?
The nicest way to implement the debugger is to not require any changes to the user's code. However, we must have a space in hub to save the few user longs for this way to work successfully.

Chip: If I asked for a hole in the hub rom (about 5 or so longs) to be ram instead of rom, is this possible? It certainly sounds like it would be. Maybe best if this could be 8 longs because there could be a few other things that could be done with this.

BTW The debugger cannot use the same LMM execution machine as C because we cannot use any of the registers (INDx etc).

While I am thinking about this, another routine that may be nice to be in ROM could be a simple cog loader to load a block of hub to cog. What I mean is this... If we want to dynamically load cogs from hub, then the ideal way is to only load the length required (just like an overlay) and not load the full $50x longs - this is faster. Might this be possible and useful? I presume there is no way to reset a cog's registers without actually resetting the cog by cognew?

pedward · 2012-08-10 01:33

cgracey wrote: »

The fuses are vulnerable to write from the authenticated loader. This way, a special loader is sent to program fuses when desired. Otherwise, one of the first things a normal loader would need to do would be to hide the fuses, effectively write-protecting them, as well as hiding them. Also, the COGID mechanism is used not just for fuses, but for testing the 64-bit system counter (CNT). This testing would be done by a special loader, as well, during device testing. I wanted to free the ROM from having to do too many things. Do you see a problem with this?

If authentication is ALWAYS ON and an authenticated bootloader can read and write the fuses, that's much more flexible than the previously proposed solution. It seems that in the 11th hour simplicity won out!

Nothing strikes me as wrong immediately; I will continue to think on the subject. It simply means that the second stage boot loader is even more the linchpin, which I've always said it was anyway.

I've been a proponent of soft-loading as much functionality as possible from the get-go. I think that leaves less room for error and opens up the system to more people.

What this does is put the onus on the write code to "do the right thing". The only downside is that it will be slightly harder to protect people from themselves. This allows for truly secure and custom boot loaders like the application I described with my cell phone. The boot loader can enforce user firmware revision escalation, to prevent downgrade attacks to the system. Downgrade attacks are the simplest and most viable attack on signed architectures. With user-fuses you can protect against downgrade attacks by incrementing the fuse pattern and checking that against a field in the firmware. In practical use you would only issue an escalation during a critical security fix, so as not to waste the escalation capabilities. There may be ways of using the bits non-linearly to achieve more than n bits of escalation. A pattern could represent a code where the number of usable symbols is greater than the number of OTP bits.
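A rough sketch of the fuse-based downgrade check described above, assuming the simplest linear scheme: the revision is the number of blown user fuses, and each escalation blows one more. The function names and the 4-fuse example are hypothetical, and the sketch does not attempt the non-linear coding idea.

```python
def blown_count(fuse_bits):
    """Revision = number of blown fuses. Fuses only go 0 -> 1, so it never decreases."""
    return bin(fuse_bits).count("1")


def escalate(fuse_bits, total_fuses):
    """Blow the lowest unblown fuse and return the new pattern."""
    if fuse_bits == (1 << total_fuses) - 1:
        raise ValueError("no escalation fuses left")
    return fuse_bits | (fuse_bits + 1)  # sets the lowest zero bit


def firmware_allowed(fuse_bits, firmware_revision):
    """Reject any image whose revision field is older than the fuse revision."""
    return firmware_revision >= blown_count(fuse_bits)


fuses = 0                   # fresh part, revision 0
fuses = escalate(fuses, 4)  # critical fix shipped: revision 1 from now on
```

After the escalation, `firmware_allowed(fuses, 0)` is false while `firmware_allowed(fuses, 1)` is true, so the old image can no longer be loaded. With this linear scheme n fuses give exactly n escalations, which is why they should be spent sparingly.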

That brings to mind, is there any chance that the mask ROM on the chip could have a user OTP area for those 128 longs you've got there?

On the other hand, 512 bytes is the size of a disk sector. I wonder if you could think up any tricky uses for 128 longs of RAM? I'm just kinda daydreaming here, but maybe some special mirror addresses, or debug bridge, or crossbar memory (all COGs, no timeslicing, like the D port, but bigger).

I know some of these ideas are crazy, considering the late hour, but it's worth asking you silly questions, you've been known to crack loose a great idea from time to time.

LoopyByteloose · 2012-08-10 01:53

Happy to hear from Chip that things are moving ahead. It is all rather exciting and 160 MHz is going to help out in video generation. The world has been moving away from VGA, NTSC, and PAL.

Cluso99 · 2012-08-10 01:58

Debugger

Here is the section of my debugger code... It's a little convoluted because I also had to get around compiler restrictions as well.

CON           'THESE LINES ARE REQUIRED IN THE TOP OBJECT AND AT THE TOP OF THE CODE !!!
'------------------------------------------------------------------------------------------------------------
' The following is the object offset needed to be added to #@xxxx in Debug_Block pasm instructions (compiler restriction)
  DB            = $10                                          '<=====
 ' The following is the debug code which is held in shadow ram (compiler restriction)
  X_ENTER       = $1F0 ' X_ENTER   RDLONG  X_OPCODE, #D_LMM    'read instruction to be executed 
  X_ADD4        = $1F1 '           ADD     X_ENTER,  #4        'inc to next hub instruction to be executed
  X_OPCODE      = $1F2 ' X_OPCODE  NOP                         'execute the instruction ("SUB X_ENTER,#4" if waiting)
  X_1F3         = $1F3 '           JMP     #X_ENTER            'loop again
DAT           'THESE LINES ARE REQUIRED IN THE TOP OBJECT AND AT THE TOP OF THE CODE !!! 
'------------------------------------------------------------------------------------------------------------
' THIS IS THE DEBUG BLOCK WHICH MUST BE LOCATED IN LOWER HUB RAM (entirely below $200)
' In order to achieve this the DAT block must be at the top of the top object.
'------------------------------------------------------------------------------------------------------------
' Cog execution begins as follows:
'   The bootcode placed into cog $000-003 executes which loads the debug code into cog $1F0-1F3.
'   The debug code executes next, which reloads the original (saved) user code back into cog $000-003.
'   The debug code sets the first instruction for the cog to be $000 and waits for the debugger code
'     to tell it what to do.
'   Execution is controlled by a simple LMM kernel.
'------------------------------------------------------------------------------------------------------------
Debug_Block             org     0                       '  v0.275   
SAVE_000                nop                             '\ Cog user code $000-003 is saved here
SAVE_001                nop                             '|
SAVE_002                nop                             '|
SAVE_003                nop                             '/
                        org     0
' Debug LMM code executes from Hub (works cooperatively with the Debug LMM code in Cog)
D_LMM_BOOT              rdlong  X_ENTER, #DB+@z_ENTER   '\ copies Debug code
                        rdlong  X_ADD4,  #DB+@z_ADD4    '|   from hub to cog $1F0-1F3
                        rdlong  X_OPCODE,#DB+@z_OPCODE  '| (workaround compiler restriction)
                        rdlong  X_1F3,   #DB+@z_1F3     '/
                        jmp     #X_ENTER
                        org     0
' This is the LMM debug code for the Debug kernel running in cog $1F0-1F3
D_LMM_SAVE              rdlong  SAVE_000,#DB+@SAVE_000  '\ restores the user code
                        rdlong  SAVE_001,#DB+@SAVE_001  '|   from hub to cog $000-003
                        rdlong  SAVE_002,#DB+@SAVE_002  '|
                        rdlong  SAVE_003,#DB+@SAVE_003  '/
' This is the LMM debug execution loop which the Debug kernel will then execute & communicate with the spin debugger
D_LMM_EXEC              sub     X_ENTER,#4-0            'normally #4 (set to #0 by "execute")
D_OPC_EXEC              nop                             'placeholder for the instruction to execute
                        wrbyte  X_ADD4,#DB+@D_LMM_EXEC  'set back to #4 when done
                        movs    X_ENTER,#DB+@D_LMM_EXEC 'simulates jump to D_LMM by changing the hub pointer
' The following are used to store variable data for hub store/load
D_VAL1_EXEC             long    0                       'data value  for cog to hub store/load
D_VAL2_EXEC             long    0                       'data value2 for cog to hub store/load (not used yet)
D_VAL3_EXEC             long    0                       'data value3 for cog to hub store/load (not used yet)
' Debug kernel code runs in cog shadow ram $1F0-1F3
                        org     $0                      'compiler will not allow $1F0 so workaround
z_ENTER                 rdlong  X_OPCODE,#DB+@D_LMM_SAVE
z_ADD4                  add     X_ENTER,#4
z_OPCODE                nop
z_1F3                   jmp     #X_ENTER
'------------------------------------------------------------------------------------------------------------
' Debug kernel code runs in cog $000-003 (bootloader code executes when cog starts)
'   The user code in cog $000-003 is replaced with the code below by spin while still in hub memory.
'   The user code is restored once the Debug kernel is loaded into cog $1F0-1F3.
                        org     0
B_ENTER                 rdlong  B_OPCODE,#DB+@D_LMM_BOOT
B_ADD4                  add     B_ENTER,#4
B_OPCODE                nop
B_003                   jmp     #B_ENTER
'------------------------------------------------------------------------------------------------------------

OBJ
  dbg              : "ClusoDebugger_275"                'Cluso debugger
CON
'  ┌──────────────────────────────────────────────────────────────────────────┐
'  │    ClusoDebugger Launch code                                             │
'  └──────────────────────────────────────────────────────────────────────────┘
VAR
  long  uservars                                        'user variables here
  
PUB Main            
'------------------------------------------------------------------------------------------------------------
  ' START THE DEBUGGER, THEN START THE USER TEST CODE (in pasm)
  dbg.StartDebugger( -1, @Debug_Block, @PasmCode, 0)          'start debugger in a new cog
  PauseMs(200)
  CogNew( @PasmCode, 0 )                                      '<---- The pasm code to be debugged
'------------------------------------------------------------------------------------------------------------
  repeat                                                      '<=== loop here (or stop this cog)

One of the problems I found was that some of my code needed to live below hub $200 because I used immediate addressing in the rdlong instruction.

Cluso99 · 2012-08-10 02:05

pedward: From what I understood Chip to say, the ROM code is just using the hub ram with the transistors (mosfets) effectively tied on/off as required. This is a simple modification to the actual mask (but in the software), if you like. However, OTP is actually implemented just like simple fuses, but unfortunately Parallax doesn't have access to this technology. So Chip is using a different method to make the fuses and they take a lot more room than a single bit of hub ram.

cgracey · 2012-08-10 02:21

Cluso99 wrote: »

Thanks for the info on the SPI Flash chip. Pity I cannot find a SOT23-6 device. Smallest is SOIC8 0.153".

I would have thought we could have read a boot block from SD given there is SPI support in ROM. So I wait with bated breath for Peter to check (pure lack of time here for 1 month).

My debugger (the basic section) is small because it sits in the shadow ram and basically runs an LMM debugger from hub. I will try and find the time over the w/e to take a look at how this could work with the P2. IIRC I only used 4 longs in the cog (in shadow ram). The output method (to hub) is a single character buffer. If it's non-zero, the debugger just waits. So the ROM needs to have the hub code in it. The rest of the code to do the actual debugging can be soft.

How I implemented the debugger was that, prior to loading the hub code into the cog, I saved the first few longs in hub that would be loaded into cog $0+ into a new hub location, and then replaced them with my debugger initialisation code, then let cognew run. This caused the cog to start executing my debug code at cog $0. This then loaded up the shadow ram and jumped to it. It implemented an LMM execution machine (totally within the shadow cog ram) running my hub LMM code. This LMM code first replaced those instructions/longs that I had previously modified (before the cognew) at cog $0+. Then my LMM code would single-step/cycle the user code (via the LMM execution machine), beginning at cog $0.

While I had not implemented any breakpoints, I had devised a way to do this.

Now, on the P2 there is no shadow ram (but we have a little more cog space because we have fewer registers).

So firstly, where will we place this LMM execution machine? Cog $0+ or at the end of cog (somewhere about $1F8+)?
The nicest way to implement the debugger is to not require any changes to the user's code. However, we must have a space in hub to save the few user longs for this way to work successfully.

Chip: If I asked for a hole in the hub rom (about 5 or so longs) to be ram instead of rom, is this possible? It certainly sounds like it would be. Maybe best if this could be 8 longs because there could be a few other things that could be done with this.

BTW The debugger cannot use the same LMM execution machine as C because we cannot use any of the registers (INDx etc).

While I am thinking about this, another routine that may be nice to be in ROM could be a simple cog loader to load a block of hub to cog. What I mean is this... If we want to dynamically load cogs from hub, then the ideal way is to only load the length required (just like an overlay) and not load the full $50x longs - this is faster. Might this be possible and useful? I presume there is no way to reset a cog's registers without actually resetting the cog by cognew?

Cluso99, your debugger sounds really sneaky. We can put regular RAM cells back in place of what is slated for ROM bits, no problem. By getting dual use out of our RAM layout, we didn't need tristated outputs and separate ROMs on a shared bus, so it saved a lot of space and sped things up. So, we could put some sneaky RAM locations within the ROM.

I'm trying to think if I could make a way to write the register RAM behind $1F8..$1FF (which code is always fetched from, but never D or S, as they only read the I/O pin flops). I would have to inhibit incidental shadow RAM writes from D, but enable them on some other path - like, maybe "MOV $1F8,inst NR". That way, we could innocuously write the register RAM which instructions get fetched from, without affecting the I/O's, and do it in some way that is totally harmless. Basically, only a "MOV $1F8,data NR" instruction could write the actual RAM. Would that work okay for you? There would be a need to execute FOUR filler instructions between writing and executing within $1F8..$1FF, since the data-forwarding logic would be blind to this activity. Also, you could only EXECUTE from $1F8..$1FF, not use it as variable space. Would that be okay?

cgracey · 2012-08-10 02:30

pedward wrote: »

If authentication is ALWAYS ON and an authenticated bootloader can read and write the fuses, that's much more flexible than the previously proposed solution. It seems that in the 11th hour simplicity won out!

Nothing strikes me as wrong immediately; I will continue to think on the subject. It simply means that the second stage boot loader is even more the linchpin, which I've always said it was anyway.

I've been a proponent of soft-loading as much functionality as possible from the get-go. I think that leaves less room for error and opens up the system to more people.

What this does is put the onus on the write code to "do the right thing". The only downside is that it will be slightly harder to protect people from themselves. This allows for truly secure and custom boot loaders like the application I described with my cell phone. The boot loader can enforce user firmware revision escalation, to prevent downgrade attacks to the system. Downgrade attacks are the simplest and most viable attack on signed architectures. With user-fuses you can protect against downgrade attacks by incrementing the fuse pattern and checking that against a field in the firmware. In practical use you would only issue an escalation during a critical security fix, so as not to waste the escalation capabilities. There may be ways of using the bits non-linearly to achieve more than n bits of escalation. A pattern could represent a code where the number of usable symbols is greater than the number of OTP bits.

That brings to mind, is there any chance that the mask ROM on the chip could have a user OTP area for those 128 longs you've got there?

On the other hand, 512 bytes is the size of a disk sector. I wonder if you could think up any tricky uses for 128 longs of RAM? I'm just kinda daydreaming here, but maybe some special mirror addresses, or debug bridge, or crossbar memory (all COGs, no timeslicing, like the D port, but bigger).

I know some of these ideas are crazy, considering the late hour, but it's worth asking you silly questions, you've been known to crack loose a great idea from time to time.

We can't do any OTP in the main memory.

I don't think we need to worry about the users shooting themselves in the foot, because almost all of them are going to be using well-written loaders that are part of their particular tool chain. Only tool developers will likely bother writing any loaders. Don't you think?

cgracey · 2012-08-10 02:42

Cluso99 wrote: »

...While I am thinking about this, another routine that may be nice to be in ROM could be a simple cog loader to load a block of hub to cog. What I mean is this... If we want to dynamically load cogs from hub, then the ideal way is to only load the length required (just like an overlay) and not load the full $50x longs - this is faster. Might this be possible and useful? I presume there is no way to reset a cog's registers without actually resetting the cog by cognew?

COGINIT is much faster on Prop II. It uses the RDLONGC to fetch its longs, which is a 4-long cached version of RDLONG. So, every hub cycle (8 clocks) it picks up 4 longs, making the whole load take 1024 clocks, or 6.4us at 160MHz. The current Prop I takes 102.4us at 80MHz. And there is no way to completely reset a cog without doing a COGINIT.
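The figures above can be checked with a little arithmetic. This is a sketch, not a timing specification: the 512-long cog size is implied by "1024 clocks at 4 longs per 8 clocks", and the Prop I model of one long per 16-clock hub window is an assumption inferred from the 102.4us figure.

```python
COG_LONGS = 512  # implied by 1024 clocks at 4 longs per 8-clock hub cycle


def load_time(longs, longs_per_hub_cycle, clocks_per_hub_cycle, clock_hz):
    """Return (clocks, microseconds) needed to load `longs` from hub."""
    hub_cycles = -(-longs // longs_per_hub_cycle)  # ceiling division
    clocks = hub_cycles * clocks_per_hub_cycle
    return clocks, clocks * 1_000_000 / clock_hz


# Prop II: RDLONGC picks up 4 longs every 8-clock hub cycle, 160 MHz.
prop2_clocks, prop2_us = load_time(COG_LONGS, 4, 8, 160_000_000)

# Prop I (assumed model): 1 long per 16-clock hub window, 80 MHz.
prop1_clocks, prop1_us = load_time(COG_LONGS, 1, 16, 80_000_000)
```

Under these assumptions the Prop II load is 1024 clocks against 8192 on the Prop I, and the doubled clock rate turns that 8x saving in clocks into the 16x saving in time quoted above (6.4us against 102.4us).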

Cluso99 · 2012-08-10 02:56

Chip: That would be great to have the ram back in for the rom for a few longs. I remembered how you were doing the ROM (when I asked if we could get access to what I thought had been made inaccessible). I thought that was really thinking outside the square - easy and no extra ROM block and buffering required.

This is the code that I currently place in the shadow registers $1F0-$1F3...

             org       $1F0
 X_ENTER     RDLONG    X_OPCODE, #D_LMM     'read instruction to be executed (from lower hub ram using immediate mode)
             ADD       X_ENTER,  #4         'inc to next hub instruction to be executed
 X_OPCODE    NOP                            'execute the instruction ("SUB X_ENTER,#4" if waiting)
             JMP       #X_ENTER             'loop again

The debugger program running in another cog (currently it is spin code) uses the instruction "SUB X_ENTER, #4" in the hub LMM code while it forces the cog to pause. So, the debugger is really slow but since it is tracing all the instructions that is fine.
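To make the pause mechanism concrete, here is a small model of the four-long kernel, written as a sketch. Hub memory is a dict of byte address to instruction text, and the class and field names are hypothetical; only the fetch, add-4, execute, loop order and the "SUB X_ENTER,#4" trick come from the post.

```python
WAIT = "SUB X_ENTER,#4"  # the pause instruction the debugger parks in hub


class LmmKernel:
    """Toy model of the RDLONG / ADD / opcode / JMP loop at $1F0-$1F3."""

    def __init__(self, hub, start):
        self.hub = hub        # dict: hub byte address -> instruction text
        self.ptr = start      # models the source field of the RDLONG at X_ENTER
        self.executed = []    # user instructions that actually ran

    def step(self):
        opcode = self.hub[self.ptr]  # RDLONG X_OPCODE, #ptr
        self.ptr += 4                # ADD X_ENTER, #4
        if opcode == WAIT:
            self.ptr -= 4            # fetched opcode undoes the ADD: same long is re-read
        else:
            self.executed.append(opcode)
        # JMP #X_ENTER is the caller invoking step() again


hub = {0x10: WAIT, 0x14: "MOV a,b"}
kernel = LmmKernel(hub, 0x10)
kernel.step()
kernel.step()          # still parked at 0x10, nothing executed
hub[0x10] = "NOP"      # the debugger cog releases the kernel
kernel.step()
kernel.step()
```

While the hub long holds the wait instruction, the kernel keeps re-reading the same address; as soon as the debugger cog writes a real instruction there, execution moves on one long at a time.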

Now, if we have an extra long in shadow, then I believe that we can avoid the problem of requiring the LMM code to live in lower hub (<$200).
So the shadow code would become...

             org       $1F8                 'the new shadow registers
 X_ENTER     RDLONG    X_OPCODE, X_HUBPTR   'read instruction to be executed (hub address held in X_HUBPTR)
             ADD       X_HUBPTR, #4         'inc to next hub instruction to be executed
 X_OPCODE    NOP                            'execute the instruction ("SUB X_ENTER,#4" if waiting) - it gets read into here by the rdlong at X_ENTER
             JMP       #X_ENTER             'loop again
 X_HUBPTR    LONG      D_LMM+(0-0)          'hub pointer to the next LMM instruction to be executed

As I see it, there are two problems with cog shadow ram on the P2...
1. We have to get this code into the cog shadow ram (could always make it ROM like you did in hub, but nicer if ram because there could be other uses)
2. We need to ensure that the rdlong instruction can write to shadow $1FA as held in the source. This might be the biggest problem.

Do you have a current register map for $1F8-$1FF ? (I will check the ParallaxSemi site)
Perhaps it might be possible to reorganise these instructions so that the rdlong can work correctly for its shadow address. As I see it, this instruction could live in any shadow location $1F8-$1FE (because it needs to be followed by at least one instruction, being a jmp).

I just had an idea to use the new REP instructions and the incrementing RDLONG instructions. Then I realised this is not possible because it would interfere with the user code running!

Ray

Sapieha · 2012-08-10 02:57

Hi Chip.

Your method is good for loading an entire COG, but it is not good for loading OVERLAYS that need to be loaded at a given COG address.

LOAD COG org 150 by 25 longs

As an example !!

cgracey wrote: »

COGINIT is much faster on Prop II. It uses the RDLONGC to fetch its longs, which is a 4-long cached version of RDLONG. So, every hub cycle (8 clocks) it picks up 4 longs, making the whole load take 1024 clocks, or 6.4us at 160MHz. The current Prop I takes 102.4us at 80MHz. And there is no way to completely reset a cog without doing a COGINIT.

Cluso99 · 2012-08-10 02:58

Wow Chip! Loading is really going to fly. Here comes dynamic loading and overlaying too!

cgracey · 2012-08-10 03:08

Sapieha wrote: »

Hi Chip.

Your method is good for loading an entire COG, but it is not good for loading OVERLAYS that need to be loaded at a given COG address.

LOAD COG org 150 by 25 longs
As an example !!

Here's how you overlay within a cog:

		SETPTRA	hub_code_address
		REPS	#instructions_to_load,#1
		SETINDA	cog_start_address
		RDLONGC	INDA++,PTRA++

This loads at the same rate as COGINIT, but doesn't reset any registers.
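As a sanity check on what the four instructions above do, here is a plain model of the overlay copy, using the "org 150 by 25 longs" example from the question. It is a sketch only: cog and hub are Python lists indexed by long, the function names are hypothetical, and the clock estimate simply reuses the 4-longs-per-8-clocks rate, ignoring setup instructions and any cache-line alignment effects.

```python
def overlay_load(cog, hub, cog_start, hub_start, count):
    """Copy `count` longs from hub into cog; every other cog long is left alone."""
    for i in range(count):                        # REPS repeats the one RDLONGC
        cog[cog_start + i] = hub[hub_start + i]   # RDLONGC INDA++,PTRA++


def estimated_clocks(count, longs_per_hub_cycle=4, clocks_per_hub_cycle=8):
    """Rough transfer time at the COGINIT rate (assumption: no other overhead)."""
    hub_cycles = -(-count // longs_per_hub_cycle)  # ceiling division
    return hub_cycles * clocks_per_hub_cycle


cog = [0] * 512
hub = list(range(1000, 1100))            # hypothetical hub contents
overlay_load(cog, hub, 150, 10, 25)      # 25 longs into cog address 150 onwards
```

Only cog longs 150 to 174 change, which is the point of the overlay: the rest of the cog, including its registers, keeps its state.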

Cluso99 · 2012-08-10 03:14

Chip,
I am wondering if it may be worth putting a dummy cog instruction as the first ROM long (cogstop or the like) in case an LMM program runs errant since I expect there will be a lot of P2s running LMM code?

cgracey · 2012-08-10 03:16

Cluso99 wrote: »
Chip: That would be great to have the ram back in for the rom for a few longs. I remembered how you were doing the ROM (when I asked if we could get access to what I thought had been made inaccessible). I thought that was really thinking outside the square - easy and no extra ROM block and buffering required.

This is the code that I currently place in the shadow registers $1F0-$1F3...
             org       $1F0
 X_ENTER     RDLONG    X_OPCODE, #D_LMM     'read instruction to be executed (from lower hub ram using immediate mode)
             ADD       X_ENTER,  #4         'inc to next hub instruction to be executed
 X_OPCODE    NOP                            'execute the instruction ("SUB X_ENTER,#4" if waiting)
             JMP       #X_ENTER             'loop again
The debugger program running in another cog (currently it is spin code) uses the instruction "SUB X_ENTER, #4" in the hub LMM code while it forces the cog to pause. So, the debugger is really slow but since it is tracing all the instructions that is fine.

Now, if we have an extra long in shadow, then I believe that we can avoid the problem of requiring the LMM code to live in lower hub (<$200).
So the shadow code would become...
             org       $1F8                 'the new shadow registers
 X_ENTER     RDLONG    X_OPCODE, X_HUBPTR   'read instruction to be executed (hub address held in X_HUBPTR)
             ADD       X_HUBPTR, #4         'inc to next hub instruction to be executed
 X_OPCODE    NOP                            'execute the instruction ("SUB X_ENTER,#4" if waiting) - it gets read into here by the rdlong at X_ENTER
             JMP       #X_ENTER             'loop again
 X_HUBPTR    LONG      D_LMM+(0-0)          'hub pointer to the next LMM instruction to be executed
As I see it, there are two problems with cog shadow ram on the P2...
1. We have to get this code into the cog shadow ram (could always make it ROM like you did in hub, but nicer if ram because there could be other uses)
2. We need to ensure that the rdlong instruction can write to shadow $1FA as held in the source. This might be the biggest problem.

Do you have a current register map for $1F8-$1FF ? (I will check the ParallaxSemi site)
Perhaps it might be possible to reorganise these instructions so that the rdlong can work correctly for its shadow address. As I see it, this instruction could live in any shadow location $1F8-$1FE (because it needs to be followed by at least one instruction, being a jmp).

I just had an idea to use the new REP instructions and the incrementing RDLONG instructions. Then I realised this is not possible because it would interfere with the user code running!

Ray

The one problem I see there is that X_HUBPTR needs to be read through D or S, which will not work, as the I/O registers will be read instead. You could have another cog feeding a static location from which the debugger pulls its RDLONG data. The other cog could be putting longs in there as fast as you are reading and executing them. That way, you only need one long.

1F8 = PINA
1F9 = PINB
1FA = PINC
1FB = PIND
1FC = DIRA
1FD = DIRB
1FE = DIRC
1FF = DIRD

msrobots · 2012-08-10 03:18

oh CHIP;

this is very nice. I do really like PASM as is but you are making it more beautiful.
You are the man. Since you have time to be here everything looks good for P2.

Thank you for doing all of this.

Enjoy!

Mike

Sapieha · 2012-08-10 03:22

Hi Chip.

That's nice, thanks.

Only one question: why is the REPS instruction placed before SETINDA cog_start_address, if REPS should only execute RDLONGC x times?

cgracey · 2012-08-10 03:29

Cluso99 wrote: »

2. We need to ensure that the rdlong instruction can write to shadow $1FA as held in the source. This might be the biggest problem.

Yes, that would be a problem, all right. I could make it so that RDxxxx instructions couldn't affect I/O, but only shadow ram.

I could see creating a lot of caveats to make this work, which might not even be adequate for what winds up being needed. I think the way to make this debugger work is to keep it above-board, where you have the user agree to give you $1F0..$1F5 of cog RAM, and you use some ordinary hub ram that is owned by the debugger. I'll keep thinking about this, though. Where there's a will...

msrobots · 2012-08-10 03:31

@Sapieha,

As I understand it, this is a pipeline issue: repxx can be followed by up to 2(?) instructions before it takes effect. Magic again for me. How to handle this right?

Enjoy!

Mike

Sapieha · 2012-08-10 03:37

Hi msrobots.

It is no problem for me to handle that (and not so magic, either), as long as I know what is going on.

msrobots wrote: »

@Sapieha,

As I understand it, this is a pipeline issue: repxx can be followed by up to 2(?) instructions before it takes effect. Magic again for me. How to handle this right?

Enjoy!

Mike

Heater. · 2012-08-10 03:42

@Chip,

Re: "..the old shackles that C has subtly placed on computing..."

You did remind me of one famous case where these shackles hurt: the INMOS Transputer chip. The Transputer was designed with multi-threaded and parallel processing in mind. To that end it could schedule threads on a single CPU in hardware, with no operating system, and it could communicate easily and quickly with other Transputer chips, again supported by hardware instructions. It really was designed with an eye for building hugely parallel systems.

To that end it was supplied with a language, Occam, that directly supported parallel processing and communication in its syntax and semantics. No messy function calls to start a bunch of parallel tasks, just write:

par
    doThis()
    doThat()
    doTheOther

If you actually wanted normal sequential code you had to write:

seq
    doThis()
    doThat()
    doTheOther

Note the use of white space block delimiting, no brackets!

Communication between tasks was as simple as:

channelA ? Avariable
channelB ! Avariable

To read and write data via channels to other tasks.
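
For readers who have not seen Occam, the `par` block and the `?` / `!` channel operators above can be loosely approximated in Python with threads and a queue. This is only a sketch: the helper names `par`, `producer` and `consumer` are made up here, the `None` end-of-stream sentinel is not part of Occam, and a `queue.Queue` of size 1 is a small buffer rather than the synchronous rendezvous an Occam channel provides.

```python
import queue
import threading


def producer(chan, values):
    """Send each value down the channel, roughly `chan ! v` in Occam."""
    for v in values:
        chan.put(v)
    chan.put(None)            # sentinel to end the stream (an assumption)


def consumer(chan, out):
    """Receive values from the channel, roughly `chan ? v` in Occam."""
    while True:
        v = chan.get()
        if v is None:
            break
        out.append(v * 2)


def par(*tasks):
    """Start every (function, args) pair in its own thread and wait for all."""
    threads = [threading.Thread(target=fn, args=args) for fn, args in tasks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


chan = queue.Queue(maxsize=1)
results = []
par((producer, (chan, [1, 2, 3])),
    (consumer, (chan, results)))
print(results)
```

Note how the parallelism and the channel are bolted on through library calls here, whereas in Occam they are part of the language itself.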

One could compile a whole program to run on one processor or on many, with almost no change, distributing tasks around the processors.

The Transputer and Occam failed to get mass acceptance, despite being streets ahead of the state of the art.

David May, the chief architect of all this, later said that he felt one reason for this failure was that Occam was not C!!! It was just too different for the masses to get their heads around.

As others have pointed out, those shackles are not really C's fault. C is just another member of a family of single-processor, single-threaded, block-structured languages that dates back to FORTRAN and ALGOL and includes Pascal, Ada, PL/M, etc. In essence these languages are all the same: they implement the three fundamentals of programming and algorithm design, namely sequence, selection and iteration.

Sequence = a list of statements to be executed in the order given.
Selection = conditional execution, "if".
Iteration = Loops, "for", "while", "repeat" and so on.
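
As a tiny illustration of those three building blocks, here is a Python function (a hypothetical example, not taken from any of the languages' documentation) that uses nothing else:

```python
def sum_of_evens(numbers):
    """Add up the even values using only sequence, selection and iteration."""
    total = 0                 # sequence: statements run in the order given
    for n in numbers:         # iteration: a loop over the input
        if n % 2 == 0:        # selection: conditional execution
            total += n
    return total


evens_total = sum_of_evens([1, 2, 3, 4])
```

The same function could be written almost line for line in C, Pascal, BASIC or Spin, which is the point being made here.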

That's pretty much all they do. Things like threads or parallel processing are bolted on via function calls that implement them. (Except Ada, which has some idea of such things built into its syntax.)

In this light Spin is just another member of that family, albeit with a simple object model included.

Aside: Those who think Spin and its objects are great for parallel processing, note that the Spin language itself has no support for those things. Syntactically the concepts of threads, parallel execution, communication between threads, mutual exclusion are not represented in the language at all. They are bolted on via plain old function calls, if at all.

How did these shackles of "sequence", "selection" and "iteration" come about? I believe it is because of the single-processor Von Neumann architecture adopted by most computers. And, well, it's all you need to write an algorithm.

Why do we have the Von Neumann architecture? I believe that is basically because computer builders have always been limited by the number of tubes, transistors or chip real estate they had to play with. Von Neumann style is about the only practical way to get the job done: it works, and it's very flexible.

We have always wanted faster machines, wider arithmetic, floating point, etc. That has always eaten all the resources available as we moved from 8 to 16 to 32 to 64 bits with just the Von Neumann architecture.

I hate to say it here, but when the Prop II is cooked and you have a little time, have a look at the XMOS architecture. There you will find some sideways thinking about how to build a multi-core MCU and create a non-traditional language. I'm not suggesting future Props should copy that, but there may be some stimulating ideas in there. The architect there is the same David May, who I think feels the same way as you about those "shackles".

jmg · 2012-08-10 03:46

Cluso99 wrote: »

... So, the debugger is really slow but since it is tracing all the instructions that is fine.

That is OK for some debug tasks, but a faster viewport would be nice for closer to real-time use.

Could a debugger use the SNDSER and RCVSER opcodes (even if they talked to another Prop 2, on a debug paddle)?

Heater. · 2012-08-10 04:02

Just to wrap up my language ramblings, I urge anyone who thinks BASIC, C, Pascal, Spin, etc. are different from each other to check out this video by Bob Martin on the subject of programming languages.
http://www.youtube.com/watch?v=mslMLp5bQD0&feature=player_embedded

The differences are pretty superficial when looked at that way.

cgracey · 2012-08-10 04:16

Sapieha wrote: »

Hi Chip.

That's nice, thanks.

Only one question: why is the REPS instruction placed before SETINDA cog_start_address, if REPS should only execute RDLONGC x times?

REPS executes early in the pipeline, in order to minimize the number of instructions that must execute before the address looping begins to take effect. Even so, the looping can't begin with the next instruction, but on the second. Therefore, you have an instruction slot you must fill between the REPS and the repeating code.
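
The effect of that extra slot can be pictured with a toy Python interpreter. This is a simplified model, not the real instruction encoding: it assumes a REPS that is given a repeat count and a block length directly, and the tuple-based program format and the `run` function are invented for the sketch.

```python
def run(program):
    """Return the order in which instruction names execute.

    Model: ("REPS", count, length) lets the very next instruction run once
    (the slot that must be filled), then repeats the following `length`
    instructions `count` times.
    """
    trace = []
    pc = 0
    while pc < len(program):
        op = program[pc]
        if op[0] == "REPS":
            _, count, length = op
            trace.append(program[pc + 1][0])          # filler slot, runs once
            block = program[pc + 2: pc + 2 + length]  # the repeated code
            for _ in range(count):
                trace.extend(instr[0] for instr in block)
            pc += 2 + length
        else:
            trace.append(op[0])
            pc += 1
    return trace


program = [
    ("REPS", 3, 1),   # repeat the 1-instruction block 3 times
    ("SETINDA",),     # fills the slot between REPS and the repeating code
    ("RDLONGC",),     # the repeated block
    ("NOP",),         # execution continues here afterwards
]
trace = run(program)
print(trace)
```

Under this model SETINDA is not wasted: it occupies the slot that would otherwise need a filler instruction, which matches the ordering Sapieha asked about.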

Cluso99 · 2012-08-10 04:20

cgracey wrote: »

Yes, that would be a problem, all right. I could make it so that RDxxxx instructions couldn't affect I/O, but only shadow ram.

I could see creating a lot of caveats to make this work, which might not even be adequate for what winds up being needed. I think the way to make this debugger work is to keep it above-board, where you have the user agree to give you $1F0..$1F5 of cog RAM, and you use some ordinary hub ram that is owned by the debugger. I'll keep thinking about this, though. Where there's a will...

Yes, with those registers PINx & DIRx there is no way shadow RAM will work the way I did it in P1. I forgot that INDx, PAR and others are now buried. So I agree that the user is going to need to reserve some space for the debugger stub code.

There is a bigger problem in that the REPD/NOPX/SETSKIP instructions cannot be used in a typical LMM style, though I suppose I could emulate them anyway within the main debugger code. It may be better in the P2 to leave the cog code in hub and execute the simpler part of the debugger within the cog, loading the cog user instructions as required from hub. Running any REPS routines will require this method.

ATM, as you said, if we require the user to make any changes to his cog code to use the debugger, then we may as well load it all soft and use normal hub RAM. Less room for error.
Likewise, I will sleep on it.

Sapieha · 2012-08-10 04:40

Hi Chip.

Thanks for clarification.

cgracey wrote: »

REPS executes early in the pipeline, in order to minimize the number of instructions that must execute before the address looping begins to take effect. Even so, the looping can't begin with the next instruction, but on the second. Therefore, you have an instruction slot you must fill between the REPS and the repeating code.

dMajo · 2012-08-10 04:52

cgracey wrote: »

I am currently working on getting the ROM booter code done, which loads (from half-duplex serial or SPI flash) an authenticated (SHA-256/HMAC) secondary booter which can handle full-chip loading with faster clocking and AES-128 decryption using the keys which get passed to it, which reside in special one-time-programmable fuse bits on the chip. I think these last two things (final die artwork and ROM bit pattern) will come together at about the same time and within three weeks we'll have final GDS2 data to send off to the fab. Then, three weeks later we'll see if it works.

I hope that the new P2 will also be programmable over Ethernet, e.g. by using one of these modules with the "remote mapped COM port" function (Ethernet to serial). Currently I buffer the program in the Tibbo device, but that eats up memory that could be used for other purposes.

http://www.lantronix.com/device-networking/embedded-device-servers/xport.html
http://www.lantronix.com/device-networking/embedded-device-servers/xport-ar.html
http://www.lantronix.com/device-networking/embedded-device-servers/xport-pro.html

http://tibbo.com/products/?form_filter=yes&type[]=module&type[]=accessory&type[]=add-on&class[]=programmable&wclass_set=out

Cluso99 · 2012-08-10 05:08

jmg wrote: »

Could a debugger use the SNDSER and RCVSER opcodes (even if they talked to another Prop 2, on a debug paddle)?

Probably. However, this may not help. The way I do the debugging is that, in essence, I single-step every instruction using an LMM-mode execution unit. So, not only do I have the LMM mode stepping through the code, but the actual LMM code is being placed there by another cog running Spin. Only when it is ready (it disassembles the instruction back to a form of source for display) does it force the LMM code to emit instructions to fetch the next real cog instruction to be executed by the LMM execution unit in the user cog. In other words, I take full control of the user's cog.

Now that doesn't mean it has to be done this way at all. Once you have control of the cog via hub LMM instructions, you can do what you like. I always intended to add breakpoints etc to it. Spin can be replaced with pasm/spin (maybe 2 cogs, 1 with fast pasm and hand this off to another cog for decoding in spin or C).

Actually, thinking further, using SNDSER may be a nice way to offload the decoding off chip, provided of course it's not being used already. I guess it depends on whether we can really run SNDSER from each cog; I presume we can.

Jazzed also implemented a debugger, but I know he is flat out on GCC and the IDE.

cgracey · 2012-08-10 05:51

Heater. wrote: »
@Chip,

Re: "..the old shackles that C has subtly placed on computing..."

You did remind me of one famous case where these shackles hurt: the INMOS Transputer chip. The Transputer was designed with multi-threaded and parallel processing in mind. To that end it could schedule threads on a single CPU in hardware, with no operating system, and it could communicate easily and quickly with other Transputer chips, again supported by hardware instructions. It really was designed with an eye for building hugely parallel systems.

To that end it was supplied with a language, Occam, that directly supported parallel processing and communication in its syntax and semantics. No messy function calls to start a bunch of parallel tasks, just write:
par
    doThis()
    doThat()
    doTheOther
If you actually wanted normal sequential code you had to write:
seq
    doThis()
    doThat()
    doTheOther
Note the use of white space block delimiting, no brackets!

Communication between tasks was as simple as:
channelA ? Avariable
channelB ! Avariable
To read and write data via channels to other tasks.

One could compile a whole program to run on one processor or on many, with almost no change, distributing tasks around the processors.

The Transputer and Occam failed to get mass acceptance, despite being streets ahead of the state of the art.

David May, the chief architect of all this, later said that he felt one reason for this failure was that Occam was not C!!! It was just too different for the masses to get their heads around.

As others have pointed out, those shackles are not really C's fault. C is just another member of a family of single-processor, single-threaded, block-structured languages that dates back to FORTRAN and ALGOL and includes Pascal, Ada, PL/M, etc. In essence these languages are all the same: they implement the three fundamentals of programming and algorithm design, namely sequence, selection and iteration.

Sequence = a list of statements to be executed in the order given.
Selection = conditional execution, "if".
Iteration = Loops, "for", "while", "repeat" and so on.

That's pretty much all they do. Things like threads or parallel processing are bolted on via function calls that implement them. (Except Ada, which has some idea of such things built into its syntax.)

In this light Spin is just another member of that family, albeit with a simple object model included.

Aside: Those who think Spin and its objects are great for parallel processing, note that the Spin language itself has no support for those things. Syntactically the concepts of threads, parallel execution, communication between threads, mutual exclusion are not represented in the language at all. They are bolted on via plain old function calls, if at all.

How did these shackles of "sequence", "selection" and "iteration" come about? I believe it is because of the single-processor Von Neumann architecture adopted by most computers. And, well, it's all you need to write an algorithm.

Why do we have the Von Neumann architecture? I believe that is basically because computer builders have always been limited by the number of tubes, transistors or chip real estate they had to play with. Von Neumann style is about the only practical way to get the job done: it works, and it's very flexible.

We have always wanted faster machines, wider arithmetic, floating point, etc. That has always eaten all the resources available as we moved from 8 to 16 to 32 to 64 bits with just the Von Neumann architecture.

I hate to say it here, but when the Prop II is cooked and you have a little time, have a look at the XMOS architecture. There you will find some sideways thinking about how to build a multi-core MCU and create a non-traditional language. I'm not suggesting future Props should copy that, but there may be some stimulating ideas in there. The architect there is the same David May, who I think feels the same way as you about those "shackles".

Heater, thanks for writing all that. I don't disagree with anything you wrote, and it's all pretty darn interesting. It does seem that Von Neumann is effectively inescapable for now, and practically all CPUs and languages are heavily bound by those precepts. We can hardly even think beyond it because it is all we know. I hope you watch that video Kye posted of Steve Teig. He articulated the situation we are in pretty well, like you did, and he talked about his own non-Von Neumann architecture, which is kind of like XMOS, but rather than scheduling and running little code snippets, he's placing and scheduling bits at the gate level.

I admit that the Propeller and Spin, especially, are in no way breakaways from the established fundamentals. They're maybe just pleasant arrangements of some ideas that were made to work somewhat harmoniously together.

average joe · 2012-08-10 06:21

cgracey wrote: »

I admit that the Propeller and Spin, especially, are in no way breakaways from the established fundamentals. They're maybe just pleasant arrangements of some ideas that were made to work somewhat harmoniously together.

Yes Chip, they are. I LOVE programming the Propeller because: #1. Spin. It's a very COMFORTABLE language for me. It's fairly easy to work with and it's very flexible. Not only that but it "looks" good too. It also gets the job done, for the most part. That brings up #2. PASM. When Spin doesn't quite cut it (or I'm incapable of figuring out the most efficient way of coding something) I can almost always get it done in PASM. #3. The silicon itself. The Propeller Chip is just brilliant, and the P2 just keeps getting better with every post! Keep up the excellent work!

Keep in mind that I got my BoeBot in 2002. While I was able to accomplish what I wanted (and had a blast doing it), I always wanted more... *like MIDI In while scanning a foot-switch array

The propeller has been all that and more! Now I'm playing with touch-screens! But back to the point.

I think if you HAD broken away from the established fundamentals, then programming would not be as comfortable as it is. I've dabbled in a few different languages over the years, and the combo of Spin and PASM is my favorite so far. Thank you again for all your hard work!
