p1spin

Dave Hein · 2015-11-24 14:17

The attached zip file contains p1spin, which runs P1 Spin programs on the P2. The P1 Spin program is included as a binary file at the end of the program using the FILE directive. p1spin currently maps ports A and B in the P1 to port B in the P2. The port mapping can be changed by modifying the "regmap" table in p1spin.spin.

I have only run this under spinsim, but it should work with an FPGA board as well. I have included a P1 Spin program that runs the Dhrystone 1.1 benchmark. The result from running this under spinsim is as follows:

# ./spinsim -t -b19200 ../p1spin/p1spin.obj

Testing Spin Interpreter
Dhrystone(1.1) time for 500 passes = 259 msec
This machine benchmarks at 1930 dhrystones/second

The P1 binary file can also be run using spinsim with the following results.

# ./spinsim  -b19200 ../p1spin/test.binary

Testing Spin Interpreter
Dhrystone(1.1) time for 500 passes = 692 msec
This machine benchmarks at 722 dhrystones/second

The P2 runs about 2.7 times faster than the P1. Adjusting for the difference between the 80 MHz P1 clock and the 50 MHz P2 clock gives a speedup factor of about 4.3.
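
Both factors follow directly from the run times reported above. A minimal Python sketch of the arithmetic, using only the numbers quoted in this post (the variable names are illustrative):

```python
# Run times reported by spinsim for 500 Dhrystone passes.
P2_MSEC = 259            # p1spin.obj on the simulated P2
P1_MSEC = 692            # test.binary on the simulated P1
P1_CLOCK_HZ = 80_000_000
P2_CLOCK_HZ = 50_000_000

# Wall-clock speedup: how much less time the P2 run takes.
raw_speedup = P1_MSEC / P2_MSEC

# Per-clock speedup: scale by the clock ratio, since the P2 is clocked slower.
per_clock_speedup = raw_speedup * P1_CLOCK_HZ / P2_CLOCK_HZ

print(f"raw speedup: {raw_speedup:.1f}")                   # raw speedup: 2.7
print(f"clock-adjusted speedup: {per_clock_speedup:.1f}")  # clock-adjusted speedup: 4.3
```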

David Betz · 2015-11-24 14:19

That is very cool!

Rayman · 2015-11-24 14:39

This is neat. Might allow me to test some hardware that I don't have drivers for yet...

Did you basically port the P1 Spin Interpreter to P2?

Not sure what your pin mapping means, but I'm sure I'll figure it out...

Dave Hein · 2015-11-24 15:04

p1spin is based on some code I wrote a couple of years ago that unrolled the P1 Spin interpreter. It uses jump tables stored in hub RAM that map a bytecode to the address of a routine that implements the function. The more frequently used bytecodes are executed from cog RAM, and the others are executed from hub RAM.
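
As a rough illustration of table-driven dispatch, here is a toy Python model. It is only a sketch of the idea: the opcode values and handler names are hypothetical and are not the real Spin bytecode encoding or the routines in p1spin.

```python
# Toy model of jump-table dispatch. Opcode values and handler names are
# invented for illustration; they are NOT the real Spin bytecodes.

def op_push_one(stack):
    stack.append(1)

def op_add(stack):
    b = stack.pop()
    a = stack.pop()
    stack.append(a + b)

def op_unsupported(stack):
    raise ValueError("unsupported opcode")

# One entry per possible byte value, in the spirit of a table in hub RAM
# that maps each bytecode to the routine implementing it.
JUMP_TABLE = [op_unsupported] * 256
JUMP_TABLE[0x01] = op_push_one   # hypothetical opcode
JUMP_TABLE[0x02] = op_add        # hypothetical opcode

def run(bytecodes):
    stack = []
    for code in bytecodes:
        JUMP_TABLE[code](stack)  # one indexed lookup instead of a decode chain
    return stack

print(run(bytes([0x01, 0x01, 0x02])))  # [2]
```

The point of the table is that every bytecode costs the same single lookup, whichever routine it ends up in.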

The register mapping is used to handle the difference in register addresses for DIRA, OUTA and INA between the P1 and the P2. Even though the P1 doesn't support Port B, the P1 Spin interpreter and compiler do support it. I currently map the registers for P1 port A to P2 port B because that's where the serial pins are located.

I make a special case for register $1F1, which is the CNT register. When I detect $1F1 I call a routine that uses GETCT and put that value on the stack. I also check if location $0000 is read, which is the CLKFREQ location. I return 50_000_000 when location $0000 is read.
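
The remapping and the two special cases can be modelled roughly as follows. This is a Python sketch, not the code in p1spin.spin: `REGMAP`, `resolve_register` and `read_long` are hypothetical names, the P2 side is shown as register names rather than addresses, and the routing of both P1 ports to P2 port B follows the description in the first post.

```python
# P1 special-register addresses.
P1_CNT = 0x1F1
P1_INA, P1_INB = 0x1F2, 0x1F3
P1_OUTA, P1_OUTB = 0x1F4, 0x1F5
P1_DIRA, P1_DIRB = 0x1F6, 0x1F7

# Hypothetical stand-in for the "regmap" table: P1 address -> P2 register name.
# Both P1 ports are routed to P2 port B here.
REGMAP = {
    P1_INA: "INB", P1_INB: "INB",
    P1_OUTA: "OUTB", P1_OUTB: "OUTB",
    P1_DIRA: "DIRB", P1_DIRB: "DIRB",
}

CLKFREQ_ADDR = 0x0000        # hub location that holds CLKFREQ on the P1
FIXED_CLKFREQ = 50_000_000   # value returned regardless of what is stored

def resolve_register(p1_addr):
    """Return what a P1 register access turns into (None if not modelled)."""
    if p1_addr == P1_CNT:
        return "GETCT"       # CNT is read with the GETCT instruction instead
    return REGMAP.get(p1_addr)

def read_long(hub, addr):
    """Read a long from a dict standing in for hub RAM, faking CLKFREQ."""
    if addr == CLKFREQ_ADDR:
        return FIXED_CLKFREQ
    return hub[addr]

print(resolve_register(P1_DIRA))         # DIRB
print(resolve_register(P1_CNT))          # GETCT
print(read_long({0: 80_000_000}, 0))     # 50000000
```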

I don't support the PAR register yet. I'm pretty sure that the PAR value is passed in PTRA on the P2, but I haven't seen any documentation on how this is passed by the COGINIT instruction.

Rayman · 2015-11-24 15:27

I don't think PAR is passed on COGINIT for P2, but could be wrong.

I might try the original pin mapping and just modify the SimpleSerial.spin to use port B.
That should work, right?

Dave Hein · 2015-11-24 16:25

Yes, if you change all the DIRA/OUTA/INA references to DIRB/OUTB/INB in Simple Serial it should work with P2's Port B. However, p1spin still has to remap the P1 register values to the P2 values since they differ between P1 and P2.

I suspect that PAR is passed through the D register in the "COGINIT D, S" instruction, but it would need to be shifted to the left to avoid the bits that specify the cog number, "new" cog and hubexec mode. The PAR value would set the initial value of PTRA. At least that's how it worked in P2-Hot. Maybe Chip, or someone else who knows, could comment on this.

cgracey · 2015-11-24 19:48

That's wild, Dave!

If you do a SETQ before a COGINIT, you establish the PTRA value, which is akin to PAR on the Prop1.

Dave Hein · 2015-11-24 20:24

Thanks Chip. I hadn't thought of using SETQ.

nutson · 2015-11-24 21:52

Dave, many thanks, very useful. After some modification I got your example running on my DE2-115.

Dave Hein · 2015-11-25 00:17

nutson, I saw that you changed _clkfreq to 50_000_000 and the baud rate to 1200. I don't think the value of _clkfreq matters since p1spin always returns 50MHz for CLKFREQ.

Did you need to change the baud rate to 1200 to get it to work? With spinsim it works fine at 19200, and I suspect it could run even higher. I wonder why the run time isn't repeatable on the DE2-115. It seems like it should be the same every time it is run.

nutson · 2015-11-25 07:32

Dave, baud rates up to 31250 work OK (I did not test higher, but will). Initially the program did not seem to work: the LED indicating COG1 activity dimmed shortly after starting, and I got no output in the Parallax Serial Terminal. My first try was some experimenting with clock, baud rates and delays. The thing came alive after adding the REPEAT. No clue about the variation in execution time; the same variation occurs at higher baud rates.

Coley · 2015-12-31 11:09

Dave, I've used nutson's version on my A9 board supposedly running at 80 MHz but I get the same results?

Am I missing something obvious?

At first the serial terminal just displayed garbage until I changed the section ldclkfreq to match the core frequency.

'***********************************************************************
' This code is used when absolute address 0 is accessed
'***********************************************************************
ldclkfreq               mov     x, ##80_000_000
                        jmp     #pushx1

I also changed nutson's file to operate at 80MHz to match.

CON
  _clkmode = xtal1+pll16x
  _clkfreq = 80_000_000

OBJ
  ser : "Simple_Serial"
  dry : "dry11"

PUB main | ptr, count
  waitcnt(clkfreq*3+cnt)
  ser.init(31, 30, 19200) 
  repeat
    ser.str(string(13, "Testing Spin Interpreter", 13))
    dry.Dhrystone11
    waitcnt(clkfreq/10+cnt)

Dave Hein · 2015-12-31 16:38

The routine ldclkfreq must be set for the correct frequency for serial to work. I guess for the A9 that is 80MHz. For the DE2-115 it is 50MHz. The values for _clkmode and _clkfreq shouldn't matter unless they are referenced by the code.

Coley · 2015-12-31 16:40

My point was that if the A9 is running at 80MHz as opposed to the DE2-115 at 50MHz, then surely the Dhrystone mark should be better?
It's a strange one.

Other than that it's great!

Dave Hein · 2015-12-31 16:54

There is a difference in run speed between spinsim and the FPGA. The FPGA takes about 1.5 times longer than spinsim. So spinsim must not be accounting for stall cycles correctly when running in hubexec mode.

There is also an error in computing the elapsed time. I'll look at it to determine the exact cause of the problem.

Dave Hein · 2015-12-31 17:00

The following method is at the end of dry11.spin.

PUB time_msec
  return cnt / (_clkfreq/1000)

So this does require that _clkfreq be set for the correct frequency. The method should use clkfreq/1000 instead of _clkfreq/1000, or _clkfreq needs to be set correctly.

Coley · 2015-12-31 17:16

_clkfreq is already set correctly but I tried the modification anyway and got the same timing results.

I will perform some other bit twiddling tests to confirm the speed this A9 is running at.

Dave Hein · 2015-12-31 21:27

I changed the time_msec method to use clkfreq instead of _clkfreq, and I get 652 msec on my DE2-115. The ratio of 652 to 408 is 8 to 5, so this time is consistent with the time you're getting on the A9. I could have changed _clkfreq to 50_000_000 in dry11.spin instead and would have gotten the same result.
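
The 8 to 5 ratio is what you would expect when the tick count is divided by a ticks-per-millisecond figure derived from the wrong clock. A small Python sketch of that effect, assuming the counter runs at 50 MHz while _clkfreq is compiled in as 80_000_000 (`reported_msec` is a hypothetical stand-in for time_msec):

```python
def reported_msec(ticks, assumed_hz):
    """Mimic time_msec: counter ticks divided by ticks per millisecond."""
    return ticks // (assumed_hz // 1000)

ACTUAL_HZ = 50_000_000    # clock the counter really runs at (DE2-115)
COMPILED_HZ = 80_000_000  # assumed value of _clkfreq in dry11.spin

# Number of counter ticks in a run that truly lasts 652 ms.
ticks = 652 * (ACTUAL_HZ // 1000)

print(reported_msec(ticks, ACTUAL_HZ))    # 652
print(reported_msec(ticks, COMPILED_HZ))  # 407
```

The second figure is within a millisecond of the 408 quoted above; the elapsed time is under-reported by a factor of 50/80.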

Coley · 2016-01-01 10:05

Thanks for clearing that up Dave, Happy New Year! ;-)

Coley · 2016-01-01 10:42

Dave, I've been testing the I/O and found that the .. operator when using DIRA and OUTA doesn't work.
It's fine when you are using a single pin but not for multiple.

This works:-

  dira[0]~~
  dira[1]~~

This doesn't:-

  dira[0..1]~~

Dave Hein · 2019-03-02 19:02

I saw Cluso's thread about running Spin 1 bytecodes on the P2, so I thought I would update my Spin 1 interpreter to the latest P2 opcodes. The attached zip file contains the updated code.

If you have bstc, p2asm and loadp2 installed you can use the runtest script to assemble the code and run it. Or you can just run the pre-built P2 binary file using loadp2. You have to specify a baud rate of 57600.

You can also use the Spin tool to assemble the .spin code, and PNut to assemble and load p1spin.spin2.

jmg · 2019-03-02 20:29

Dave Hein wrote: »

I saw Cluso's thread about running Spin 1 bytecodes on the P2, so I thought I would update my Spin 1 interpreter to the latest P2 opcodes. The attached zip file contains the updated code.

Shouldn't this be called something clearer than p1spin?
That name suggests Spin for P1, and P2 does not appear. Maybe P1SpinOnP2 or p1spin4p2 or...

Or, if there is going to be a V2Spin, (as in fastspin) that has much more language 'oomph' and runs on P2 and maybe P1 too, perhaps this needs to drop the confusion of numbers ?
Maybe SpinO (Spin Old) and SpinN (Spin New) ? (and maybe SpinR can be reserved for the exact, ROM binary image of original P1 Spin ? )

dgately · 2019-03-02 20:41

Dave Hein wrote: »

If you have bstc, p2asm and loadp2 installed you can use the runtest script to assemble the code and run it. Or you can just run the pre-built P2 binary file using loadp2. You have to specify a baud rate of 57600.

I built test.spin with fastspin (for P1)... And, loading with -b 57600 resulted in garbage text. So, loaded with 115200 baud, receiving the following:

$ loadp2 -p /dev/cu.usbserial-P2EEI8V p1spin.bin -b 115200 -l 115200 -t -CHIP
tcsetattr failed
( Entering terminal mode.  Press Ctrl-] to exit. )
Unsupported opcode 2C 32

Could fastspin have used an opcode that bstc would not?

dgately

Dave Hein · 2019-03-02 20:43

@jmg, yes, the name p1spin is a bit confusing, but a less confusing name seems too long.

@dgately, it appears that fastspin used different bytecodes than bstc. I don't recall offhand what 2C 32 does, but I'll look into it.

David Betz · 2019-03-02 21:43

Dave Hein wrote: »

@jmg, yes, the name p1spin is a bit confusing, but a less confusing name seems too long.

@dgately, it appears that fastspin used different bytecodes than bstc. I don't recall offhand what 2C 32 does, but I'll look into it.

I don't think fastspin uses bytecodes at all. It compiles to native P1 or P2 code, doesn't it?

Dave Hein · 2019-03-02 22:01

Good point. So p1spin was trying to interpret native assembly as Spin bytecodes. The bytecode 2C is supported by p1spin, but maybe the hub memory got corrupted by interpreting native assembly as bytecodes. I could add a test at the beginning to check if the code looks like valid Spin code.
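
One possible shape for such a test, sketched in Python rather than PASM. This is not something p1spin does, and it rests on assumptions: the function name is hypothetical, and the header layout (long clkfreq, byte clkmode, byte checksum, then the words pbase, vbase, dbase, pcurr, dcurr) is the usual P1 binary header as I understand it. A check like this only rejects images without a plausible header; passing it does not prove the rest of the image is bytecode.

```python
import struct

def looks_like_spin_binary(image: bytes) -> bool:
    """Rough plausibility check on a P1 Spin binary header (a sketch)."""
    if len(image) < 16:
        return False
    # Assumed layout: five little-endian words starting at offset 6.
    pbase, vbase, dbase, pcurr, dcurr = struct.unpack_from("<5H", image, 6)
    return (
        pbase == 0x0010               # object code starts right after the header
        and pbase < vbase < dbase     # code, then variables, then stack
        and pbase <= pcurr < vbase    # first bytecode lies inside the object
        and dcurr >= dbase            # stack pointer at or above the stack base
    )

# A made-up header with plausible values, and an all-zero image.
good = struct.pack("<IBBHHHHH", 80_000_000, 0x6F, 0, 0x10, 0x20, 0x28, 0x18, 0x2C)
print(looks_like_spin_binary(good))       # True
print(looks_like_spin_binary(bytes(16)))  # False
```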

ersmith · 2019-03-02 23:40

dgately wrote: »

Could fastspin have used an opcode that bstc would not?

fastspin doesn't use opcodes at all, it compiles to LMM PASM on P1. So for testing p1spin you should use bstc, openspin, or homespun, I think.

Cluso99 · 2019-03-03 01:43

@davehein
WOW. I didn't recall this so thanks for updating.

I've used PTRA=pcurr and PTRB=dcurr as these are the most used. This saves lots of space by using PTRA/B++ and --PTRA/B.

I used a vector_table for my version. It uses longs in hub (now moved to lut) which contains 3 routine addresses and 5 flags (flags unused?). Most bytecodes only use 1 or 2 routine addresses. Some routines require a popx/popyx/popayx first before their main routine.

The maths routines have 2 entry points, for binary and unary, followed by 5 different routines. One is where there is a mapped P2 instruction, and the instruction opcode is the 3rd vector. But I've been looking at perhaps using skipf and loading the skip values from a table.

I was tracing my code yesterday and realised the special register mapping for DIR/IN/OUT is different. It's a shame we didn't realise this earlier, as we could have asked Chip to maintain the mapping.

So, I was about to put in another mapping table. I support both ports A & B. BTW the new instruction register bit mapping on the next silicon will be a boost here too.

Anyway, I think there is some synergy in combining features from both our versions. Your fully unrolled loops are great (I didn't have space on P1).
Are you interested?

Dave Hein · 2019-03-03 02:45

I'm really not actively working on the Spin interpreter. The last time I looked at it was three years ago. I did try moving the portion that's in hub to lut memory. However, it made absolutely no difference on the execution time of my test program. I think most of the cycles are expended on fetching the bytecode and accessing the stack. It will be interesting to see the execution times you get with your interpreter.

Cluso99 · 2019-03-03 03:41

Do you mind then if I use your code as a base as you've already unrolled the code better than I was able to do on the P1?

Cluso99 · 2019-03-03 03:44

While there is a little gain from using LUT as the lookup, it is on every bytecode, and a few clocks every bytecode adds up.
Even the use of rdlong x,ptrx++ saves 2 clocks, and most bytecodes have a pop and push.
