JVM for prop

Peter Verkaik · 2008-03-05 08:23

Thanks Thomas for this new update.
Testing it with a simple program, I do not get a message window for
System.out.print() statements. Did you get a message window on your PC?

I was surprised by the savings: compiling shows 2200 longs free, where the spin
version shows 900 longs free, so the pasm version saves 1300 longs.

@Hippy: assuming those 0-0 are placeholders, how do you know where to place
the data? It is unknown which VP the data belongs to until the VP is installed, and it
won't be installed until the data is in place. So we need the vpRambank data structure
to place data, then when the VP installation routine is called, the data can be copied
to cog ram I suppose, but how can that be done from spin?
Also, some data changes while the VP is running. Maintaining two copies of the
data seems a bit awkward.
@Jazzed: Maybe we cannot have 6 vpslots in one cog, but 4 should be possible
(to not cross the 694 cycles). Then we just must use multiple cogs to support
6 or more VPs running simultaneously.

regards peter
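
The 694-cycle limit mentioned above matches the 8.68 us tick if the Propeller runs at 80 MHz. The thread does not state the clock, so 80 MHz is an assumption here, and the function names are made up for illustration. A minimal sketch of the arithmetic:

```python
# Sketch: convert the 8.68 us VP tick into system clock cycles.
# CLOCK_HZ = 80 MHz is an assumption (a common Propeller setting).
CLOCK_HZ = 80_000_000
TICK_SECONDS = 8.68e-6


def tick_cycles(clock_hz=CLOCK_HZ, tick_seconds=TICK_SECONDS):
    """Whole clock cycles available in one VP tick."""
    return int(clock_hz * tick_seconds)


def cycles_per_slot(slots, clock_hz=CLOCK_HZ, tick_seconds=TICK_SECONDS):
    """Upper bound on cycles per VP slot, ignoring all loop overhead."""
    return tick_cycles(clock_hz, tick_seconds) // slots


print(tick_cycles())       # 694
print(cycles_per_slot(4))  # 173
print(cycles_per_slot(6))  # 115
```

These are upper bounds only; any loop overhead comes out of the same 694 cycles.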

hippy · 2008-03-05 15:00

@ Peter : Yes, those 0-0 are the placeholders for the data the Cog VP uses and do effectively duplicate the vpRambank structures.

Using "mov #" this way saves an awful lot of time otherwise spent working around the lack of register indirection. Such indirection is only needed to update that data when the VP Cog state changes, and it would be needed anyway; it just requires "movs" here rather than "mov". When a VP doesn't alter state, execution should be blindingly quick, and only slightly slower when it does. That should also give some leeway where a VP takes longer than we would like: only when every VP slot is running the longest-executing VP and all of them update state on the same tick would the time exceed what's allowed. Even then, a slight slowdown shouldn't have much effect on VP or foreground operation.
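
To make the "movs" remark concrete: in a PASM instruction long the source field is bits 8..0 and the destination field is bits 17..9, and "movs" / "movd" overwrite only those fields. The sketch below models that patching in Python. The template value is just an illustrative bit pattern for a 0-0 placeholder instruction, not taken from hippy's code.

```python
# Sketch of what PASM "movs" and "movd" do to an instruction long.
# Field positions follow the Propeller instruction format:
# S (source) is bits 8..0, D (destination) is bits 17..9.
S_MASK = 0x1FF
D_SHIFT = 9
LONG_MASK = 0xFFFFFFFF


def movs(instr, value):
    """Replace the 9-bit source field, leaving all other bits alone."""
    return (instr & ~S_MASK & LONG_MASK) | (value & S_MASK)


def movd(instr, value):
    """Replace the 9-bit destination field, leaving all other bits alone."""
    cleared = instr & ~(S_MASK << D_SHIFT) & LONG_MASK
    return cleared | ((value & S_MASK) << D_SHIFT)


# Illustrative placeholder with both fields zero (the "0-0" idea).
template = 0xA0FC0000
patched = movs(template, 0x123)
print(hex(patched))  # 0xa0fc0123
```

Only the state-changing path needs such a patch; the unchanged instruction then runs at full speed on every later tick.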

The way I'd have the foreground ( native methods ) interface to the VP Cog would be to allocate one or more longs into which a native method pokes a command, address and data, then waits for the Cog to clear the command to signal "Done".

That means a small amount of blocking of the foreground execution but not too much, only when the VP executes every 8.68uS would it take a more substantial time, and in that respect it would be no worse than the way Javelin does it. It would be possible to queue up commands so writes and updates of data do not block, only reads would.

The main Cog handler would be something like ...

Do
  If 8.68us timeout about to expire Then
    Wait for 8.68uS timeout
    Execute Virtual Peripherals
  Else
    If any command in foreground queue Then
      Get command from foreground queue
      If it is a write command Then
        Put data to Cog address lsb's
      Else
        Get data from Cog address lsb's
        Pass data back to foreground
      End if
    End If
  End If
Loop Forever

The 'If 8.68us about to expire' is used to prevent the cog obeying a command which would take so long that the 8.68us timeout would be missed. This avoids jitter so the VP's do always run exactly every 8.68us. This may be overkill but easy enough to add.
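
The guard described above can be sketched as a simple cycle count: after the VPs have run, a foreground command is only started if it is certain to finish before the next tick. The numbers and names below are assumptions for illustration (694 cycles per tick assumes an 80 MHz clock).

```python
# Sketch: serve foreground commands only while they cannot overrun the tick.
TICK_CYCLES = 694  # 8.68 us at an assumed 80 MHz


def commands_served_in_tick(vp_cycles, command_cycles, pending):
    """Count how many pending commands fit after the VPs have run.

    vp_cycles      -- cycles already spent executing the VPs this tick
    command_cycles -- worst-case cost of one foreground command
    pending        -- number of commands waiting in the queue
    """
    elapsed = vp_cycles
    served = 0
    while served < pending and elapsed + command_cycles <= TICK_CYCLES:
        elapsed += command_cycles
        served += 1
    return served


print(commands_served_in_tick(600, 40, 5))  # 2
print(commands_served_in_tick(690, 40, 5))  # 0
```

Commands that do not fit simply stay queued for the next tick, which is what keeps the VP timing free of jitter.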

For the foreground ...

Wait until queue is not full
If it is a write Then
  Set write command, set address to write, set data
  Add command to the queue
Else
  Set read command, set address to read
  Add command to the queue
  Wait for data returned
End If
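
The two pseudocode halves can be tried out together in a small single-threaded model. This is only a sketch under assumptions: the class and method names are invented, the queue depth is arbitrary, and "waiting" is modelled by letting the cog side handle one command.

```python
from collections import deque


class VpMailbox:
    """Toy model of the foreground/Cog command queue described above."""

    def __init__(self, capacity=4, cog_longs=512):
        self.queue = deque()
        self.capacity = capacity
        self.cog_ram = [0] * cog_longs
        self.reply = None

    def cog_service_one(self):
        """Cog side: handle at most one queued command."""
        if not self.queue:
            return False
        op, addr, data = self.queue.popleft()
        if op == "write":
            self.cog_ram[addr] = data
        else:
            self.reply = self.cog_ram[addr]
        return True

    def _wait_for_space(self):
        # Stand-in for the foreground spinning until the real cog catches up.
        while len(self.queue) >= self.capacity:
            self.cog_service_one()

    def write(self, addr, data):
        """Foreground write: returns as soon as the command is queued."""
        self._wait_for_space()
        self.queue.append(("write", addr, data))

    def read(self, addr):
        """Foreground read: blocks until the cog has passed data back."""
        self._wait_for_space()
        self.reply = None
        self.queue.append(("read", addr, None))
        while self.reply is None:
            self.cog_service_one()
        return self.reply


mb = VpMailbox(capacity=2)
mb.write(10, 7)
mb.write(11, 9)
mb.write(12, 3)     # queue was full, so one command is drained first
print(mb.read(11))  # 9
```

Because commands are handled in order, a read also flushes every write queued before it.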

I don't see a problem with having vpRambank in the foreground and an equivalent within the Cog. The only issue is in mapping what is put into vpRambank into what needs to be put into the Cog.

It would be possible to update the VP handling within the Java Classes which interface to vpRamBank as how they actually work is hidden by abstraction layers which the end programmer uses, and I don't see a Java programmer updating the vpRamBank directly or through very low-level method calls. To me that's like poking data into a BIOS and expecting it to work on an entirely different architecture; tough, it won't.

If the VP's can be implemented without changing any of the Java Classes then all well and good ( and that's the preferable solution ) but with the fullness of time we are likely to add new VP's for things the Propeller supports which the Javelin doesn't anyway so some change will have to come. I don't see having to move from 'use stamp.core.*' to 'use propeller.core.*' to be that much of an onerous demand.

Peter Verkaik · 2008-03-05 16:11

Hippy,
I see possibilities here. In fact, queuing won't be necessary.
All the interfacing is done via readRegister and writeRegister, and those already
distinguish between vpRambank registers and jvm registers. So all I need
is the interface you described, only providing for a read command and a write command,
with parameters register ( = bank+offset) and in case of write also a value.
The VP cog mainloop can poll for such a command every 8.68 usec and only
needs to read it (a single long) and possibly write a value (a single long).
For 6 banks the cog must reserve 6*16 longs but that leaves 400 longs for
assembly code which should be sufficient. The vpRambank registers can be totally removed
from the main ram because they only have meaning for the VP code.

regards peter
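
The 400-long figure can be checked quickly. A cog has 512 longs, of which the top 16 ($1F0-$1FF) are special purpose registers; the 16 longs per bank is the figure used in the post. The function name is made up for illustration.

```python
# Sketch: cog RAM left for code after reserving the VP register banks.
COG_LONGS = 512          # total longs in one cog
SPECIAL_REGISTERS = 16   # $1F0-$1FF are special purpose registers
LONGS_PER_BANK = 16      # figure used in the thread


def longs_left_for_code(banks, longs_per_bank=LONGS_PER_BANK):
    """Longs available for assembly code and variables."""
    reserved = banks * longs_per_bank
    return COG_LONGS - SPECIAL_REGISTERS - reserved


print(longs_left_for_code(6))  # 400
print(longs_left_for_code(3))  # 448
```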

hippy · 2008-03-05 18:25

Looks like we're converging - I'm afraid I still don't really understand the Javelin enough to put my ideas into words which make them easier or clearer to understand.

Yes, what you suggest would work, and what I'd suggest as the first and entirely serviceable step.

I moved on from that though to queueing because it would otherwise stop the foreground for 8.68uS every write. This way, an entire vpRamBank can be setup almost instantly with no waiting and then be pulled into the Cog at its leisure ( actually at a much higher speed if the Cog doesn't wait for the 8.68us tick ).

For serial the queue is also a buffer so a Java program can form up a byte, send it, form up another, place it for sending in parallel with the bytes being sent by the VP. That should increase foreground throughput.

As always though, get it working, it can always be improved later.

jazzed · 2008-03-05 20:04

@Peter,
If you haven't seen it yet this stuff is described in "Tricks and Traps" in one of the stickies.

@Hippy,

Is it necessary to use a "label mov varname, 0-0" for every address to be indexed ?
Is it possible to define one "label indexer" and then provide varname+offset ?
Do you have a library that provides ready examples of this approach ?

Having to provide a label for each element of the "array" would be onerous and
require some incrementing element name of sorts.

TIA

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
jazzed ... about living in http://en.wikipedia.org/wiki/Silicon_Valley

Traffic is slow at times, but Parallax orders always get here fast 8)

hippy · 2008-03-05 20:57

@ Jazzed : No need for the label names every time ( they are there for vp0 to allow the constant offset from vp0 to be determined, ie "CON state_offset = vp0_state-vp0" ). Once the offset is determined it can be used wherever it's needed.

One can use a pointer to each, vp0, vp1, etc but then the code needs to be self-modifying to get the data pointed at and that slows things down. This is really just a quick way to get each entity's data into variables ready for use.
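
As an illustration of the constant-offset idea: once a field's offset from the start of vp0 is known, the same offset locates that field in every other VP's block. The block size and field names below are hypothetical, chosen only to make the sketch runnable.

```python
# Sketch: locating a field by base address + constant offset.
# BLOCK_LONGS and the field names are assumptions, not hippy's layout.
BLOCK_LONGS = 16
FIELD_OFFSETS = {"state": 0, "pin": 1, "count": 2}


def field_address(vp0_base, vp_index, field):
    """Cog address of a field in the block belonging to VP number vp_index."""
    return vp0_base + vp_index * BLOCK_LONGS + FIELD_OFFSETS[field]


print(hex(field_address(0x100, 0, "state")))  # 0x100
print(hex(field_address(0x100, 2, "count")))  # 0x122
```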

No other example than what I have there I'm afraid.

Peter Verkaik · 2008-03-06 07:45

Jazzed, Hippy, and others:
I moved vpRambank into the VP cog, and added methods writeVPregister and readVPregister.
At the end of the cog mainloop, vpCommand is read and decoded and a response is written
back to vpCommand that may or may not include a read value.
It certainly needs optimizing and completion but the idea should be clear.
Only 2 main ram accesses per cog mainloop cycle, i.e. per 8.68 usec

regards peter

hippy · 2008-03-06 13:59

@ Peter : That certainly looks to me to be heading in the right direction.

bboy8012 · 2008-03-06 18:55

I hate to sound like a beginner, which I am, but how would you use this, say, for the average person like me with a little bit of java experience, and now prop experience?

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Hunger hurts, starvation works!

jazzed · 2008-03-06 20:18

@Peter/@Hippy.
I've done a little integration. You will note that a jump table version of doVpUser is the default switcher. It is slower than Hippy's state machine, I'm sure, but faster than the if-else-if one. doVpJump requires all jumped procedures to return with "jmp doVpJump_ret" ... I added code for your read/write commands and am able to see the timer run when enable is written. Latch is empty and needs an update for jvmVpDemo to work. If you F10 jvmVpCore now and watch pins 25/1 you will see the bits toggle. The vpAjTimer runs faster than the latest vpAsmTimer, though I'm not sure what byte lane swapping needs to be done. I don't have more time for this today; will hit it again tomorrow.

@bboy8012, if you look back in this thread to Peter's last jvm.zip posting, you will find a JVM that runs normal javelin stamp code (without virtual peripherals).

Have a look at this link for more info:
http://propeller.wikispaces.com/Fast-Track+for+PropJavelin

jazzed · 2008-03-07 16:00

I'm finding the overhead for loading contents of registers via self-modifying code to be no faster than rdlong especially in the case of PWM where data addresses are non-contiguous. Being able to load data from back-to-back addresses would allow a tight loop, but how would the data be stored or used then ?

The other issue is that of how the data is being represented in cog ram. Using self-modifying code uses long access naturally. If we blow up the vpRambank to 4x then memory gets tight and more code is required to use the data.

Unless there are better ideas, I think I'll take a step back to the time where the vpRambank is in hub ram and work on other approaches to the design.

jazzed · 2008-03-08 18:14

Hi. I'm making some progress with VPs - have 6 timers running with updates in < 7us* in one COG. I'll update more after I have a working PWM in the code .... Taking a few hours break now.

* 8us -- I was reading the scope wrong

** now ~6.0us -- with some optimization

Post Edited (jazzed) : 3/9/2008 4:42:50 AM GMT

jazzed · 2008-03-10 07:54

Thread seems kind of lonely these days.

I'm including my latest work on the VP assembly stuff. Here are some details:

6 DACs can be used simultaneously.
Have not tested with R/C circuit, but it should work with right R*C.
4 PWMs can be used ... the PWM code is about 0.4us each over budget.
The rambank is all longs. All that shifting and masking kills us.
Getting and setting values from rambank other than timer is one-to-one.
6 Timers can be used simultaneously.
Timers can be latched with read/write on Timer1 address.
Timer value extraction will have to be translated by spin.

Peter Verkaik · 2008-03-10 11:19

Jazzed,
I noticed the comment about DIRA not working in another cog.
I thought all DIRA's and OUTA's were ORed across cogs, so setting a pin as a high output
in one cog makes the pin high. Normally, a pin is exclusively assigned to
some function programmed in the application java program so there are
no pin conflicts.

Also, there is only one timer VP running at the most. Multiple timers do not exist.

regards peter

jazzed · 2008-03-10 14:08

There are various statements from others that support my observations with the DIRA/OUTA.
Search the forum. Spin runs in a separate cog from the asm, so the registers are physically
different. Also, if you look closely at the manual it only talks about "wire or" outputs for cogs
and nothing about inputs. ADDED: OUTA is fine. Setting DIRA for one cog doesn't mean it's
set for another, and DIRA acts as an AND enable. That's why your spin manipulation did not work.

If you want to artificially restrict the number of timers, that's fine. The point is if you have
5 DACs running and add a timer, it will still work.

Getting 6 PWMs to run in budget will be difficult if not impossible in one cog. ADCs and Uarts
still need design. I'll look at these some today.

Post Edited (jazzed) : 3/10/2008 2:21:40 PM GMT

hippy · 2008-03-10 14:23

My understanding of the I/O is ...

All physical legs go to all Cogs in parallel as input pins.

A pin is an output pin for a particular Cog only if its DIR bit is set.

A physical leg becomes an output when any Cog's DIR bit is set.

All Cog's OUT pins which have that Cog's DIR bit set are or'd to produce the output on the physical leg.
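
The four rules above can be modelled with bit masks. This is a minimal sketch with invented names: each cog contributes its own DIR and OUT value, a pin is an output if any cog sets its DIR bit, and the level is the OR of every OUT bit that is enabled by the same cog's DIR bit.

```python
# Sketch: combine per-cog DIR/OUT registers into the physical pin state.
def pin_states(dirs, outs):
    """Return (output_mask, level_mask) for lists of per-cog DIR and OUT values."""
    direction = 0
    level = 0
    for cog_dir, cog_out in zip(dirs, outs):
        direction |= cog_dir       # any cog can make the leg an output
        level |= cog_dir & cog_out  # OUT only counts where that cog's DIR is set
    return direction, level


# Cog 0 drives pin 0 low, cog 1 drives pin 0 high,
# cog 2 sets OUT for pin 1 but never sets its own DIR bit.
direction, level = pin_states([0b01, 0b01, 0b00], [0b00, 0b01, 0b10])
print(bin(direction), bin(level))  # 0b1 0b1
```

The third cog shows why writing OUT alone has no visible effect: without that cog's own DIR bit, its OUT bit never reaches the pin.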

Peter Verkaik · 2008-03-10 14:45

So each cog has its own DIRA register and it needs to be set from the asm code.
I understand that.
Regarding Timer, the java class Timer.java only calls Timer_install for the first new Timer(),
subsequent calls to new Timer() do not call Timer_install.
So we can set aside one instance of timer variables for the Timer object.
That way we don't need to index in vpRambank for Timer (meaning the vpRambank registers
for the bank in which Timer is installed are simply not used). So we trade some memory
for using fewer CPU cycles.

regards peter

jazzed · 2008-03-10 16:08

Peter Verkaik said...
>> Regarding Tmer, the javaclass Timer.java only calls Timer_install for the first new Timer(),
>> subsequent calls to new Timer() do not call Timer_install.

Well, at least we don't have to be concerned about limiting timers in the VP code :)

I suppose one can make the timer tick independently and add zero/latch "commands" to the
vpUserCmd asm. Doing this would save about 100ns of cycle time. I much prefer a generic solution
with some restriction rather than further complicating the vpUserCmd interaction unless necessary.

If you want the design to work differently, provide code; just consider the time constraints.

The register access "vpUserCmd" consumes between 0.9us and 1.3us.

The budget per VP (plus vpUpdate time) appears to be =< 1.0 us today. If the vpUpdate could be
further optimized, that would help. I tried a variation of hippy's "state-machine", but could not make
it work faster than the current loop (removing "xor outa, #VP_DEBUG_PIN" helps some).
Maybe hippy can do better ?

@Hippy, can you conjure up an optimization of my "vpUpdate" ? Cycle time is ~330ns now.

Unless, PWM (plus update) can be chopped from 1.44us to =< 1.0us, a second cog will need to be
added anyway so much of this optimizing, while very educational for me, may be in vain ....
Not having register indirection built in is a terrible pain in the tushie.
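
A rough way to see why 1.44us per PWM does not fit six slots: add the per-VP time for every slot to the vpUserCmd time and compare with the 8.68us tick. This is a simplified model under assumptions: it treats the quoted per-VP figure as already including the update and ignores any other loop overhead. Times are in nanoseconds to keep the arithmetic exact.

```python
# Sketch: does a set of VP slots plus the register access fit in one tick?
TICK_NS = 8680  # 8.68 us


def slack_ns(per_vp_ns, slots, user_cmd_ns, tick_ns=TICK_NS):
    """Time left in the tick; negative means the tick is overrun."""
    return tick_ns - (per_vp_ns * slots + user_cmd_ns)


def fits_in_tick(per_vp_ns, slots, user_cmd_ns, tick_ns=TICK_NS):
    return slack_ns(per_vp_ns, slots, user_cmd_ns, tick_ns) >= 0


print(fits_in_tick(1440, 6, 900))   # False: 6 PWMs at 1.44 us each
print(fits_in_tick(1000, 6, 1300))  # True: 6 VPs at the 1.0 us budget
```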

jazzed · 2008-03-12 20:31

Peter/Hippy, et al.
This package can run 5 PWM and a Timer in 8.68us. Demo is just for timer as before.
File jvmVpCore.spin has 5 PWM + Timer initialized by default. I optimized the update
method to use jumps programmed by an "enableVP" command to get here. I'm this
close (") to having 6 PWM's but I have no idea where to get the cycles. If you're
curious about use of 1 for hi/low pulse width, that is the worst case as it makes all
code in PWM method execute. What a pain this is without indirect addressing.
Guess I can take a break from PWM for a while and do other stuff.

ADDED: Uugh :< Dang it there is still an issue. Sometimes a PWM counter goes
negative for PWM hi/low widths > 1. More to do; still taking a break though.

Post Edited (jazzed) : 3/12/2008 8:41:59 PM GMT

Peter Verkaik · 2008-03-12 23:19

Jazzed,
It looks like getting 6 PWM running may be just possible, receive uarts however will
take more cycles, so I think we should/must consider servicing fewer VP's per cog.
If we take 3 VP's per cog, we will have a better chance of getting this up and ready
in a reasonable time. It means 2 cogs are required for the VP code (each cog capable
of running 3 VP's from a set of 6 VP types, codesize is not the problem), the first cog
deals with bank 0 to 2, the 2nd cog with bank 3 to 5. Thomas's (Kaio) asm code for
the engine also requires 2 cogs, plus we need one cog for the main spin code.
That still leaves 3 cogs for enhancements like user-defined VP's and of course
the native types long, float and double.

It should be possible I guess, to put this 3VP design in an object and then
use 2 of those objects. VP's run independently of each other, so there is
no need for a shared 8.68usec tick on which those cogs must synchronize.
So basically we can use the design as is, but only for a loop of 3 instead of 6.
That also halves the required space for vpRambank in each cog.

regards peter

jazzed · 2008-03-13 15:54

Turns out the "optimization" in read/write/enable handler caused devices not to start properly.
Per-device time budget for 6 VP is 0.94us. For 3 VP, budget will be 2.1us (less overhead).

Kaio · 2008-03-13 23:58

Hi all,

here is the next release of the translated JVM using PASM. The first bytecode handler cog is now completely tested. The second bytecode handler cog so far implements only the basic arithmetic and logical functions (Jem_IADD - Jem_IXOR), which have been successfully tested.

So it should be possible to run small Java programs perhaps a fibo(29) benchmark.

Please be aware that currently only static class variables can be used because Jem_GETFIELD and Jem_PUTFIELD are not implemented yet.

I have not tested it with Java programs yet because I had trouble connecting the JavelinIDE to my Prop running the JVM. The device was also not identified when I used the spin version of the JVM.

Native functions are currently not supported.

Thomas

Peter Verkaik · 2008-03-14 22:18

Thomas,
We won't be able to fully test even the smallest java program because
the class String is always included, and the class String uses opcodes
new, getfield and putfield. If you could give these priority over any of
the other remaining opcodes, then we will be able to run small
java test programs.

regards peter

Kaio · 2008-03-14 22:34

Peter,

thank you for the info. Then I'll implement handlers for those bytecodes. After all, I want to see the ASM version running.

Thomas

jazzed · 2008-03-17 04:01

Hi.

Attached is my latest installment of jvmVpDemo which adds a limited version of Uart transmit (no backpressure mainly). The uartTxTest method is a fair example of transmitting bytes.

I had to use a code structure with an enable/init, common tx, and wrappers per VP to have room for 3 VP time-slots to meet timing constraints. Now I'm running out of cog ram space because of extra code complexity and variables (at some point the vpRambank being converted to bytes from long may help, but not by much). It is likely that the design should be limited to two VP's per cog.

With the current design, 104 longs were used on the uart transmitter. With a design that runs slower and is more dynamic for input variables (i.e. no init & separate wrappers), approx. 70 longs would be used for Uart TX, leaving more room for Uart RX and ADC. Of course many variables can be shortened from long to word and byte, but I've found PASD is not very happy with such a mix, and I've left them as longs until closer to a release.

Peter Verkaik · 2008-03-17 21:53

Jazzed,
I moved a step back to make the VP code look simpler.
I moved the vpRambank registers back to hub ram with the
addition of a single long vpTimerCount that serves as the
free running 32bit counter.
Both vpRambank and vpTimerCount are declared in jvmVirtualPeripheral
that also contains the VP code initialization native functions.
I added a jvmVpCore that starts up a COG for 3 VP slots.
Two of these COGs are started from jvmVirtualPeripheral,
each serving 3 VP banks.
Assuming overhead takes less than 94 cycles, for each VP code
there are (694-94)/3 = 200 cycles available. I used rdbyte and wrbyte
where required but that can still be optimized by using rdlong/wrlong
if some registers are moved or swapped.
I have put in a jump table for the vpType code but am not sure
whether that is correct. I don't understand the /4 in the jump table itself.
I thought addresses in COG's were $000-$1FF so why the need for /4?
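
The cycle budget earlier in this post can be written out as a short sketch. The 694 and 94 cycle figures come from the post; the instruction count assumes 4 clocks per ordinary PASM instruction, and hub accesses such as rdbyte/wrbyte cost more than that. Function names are made up for illustration.

```python
# Sketch: cycles per VP slot and a rough instruction count.
def vp_cycle_budget(tick_cycles=694, overhead_cycles=94, slots=3):
    """Cycles available to each VP once loop overhead is subtracted."""
    return (tick_cycles - overhead_cycles) // slots


def max_plain_instructions(cycles, clocks_per_instruction=4):
    """Upper bound on non-hub instructions that fit in the given cycles."""
    return cycles // clocks_per_instruction


budget = vp_cycle_budget()
print(budget)                          # 200
print(max_plain_instructions(budget))  # 50
```

Every hub access in a VP routine reduces that count of 50, so the number of rdbyte/wrbyte calls per slot matters.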

The VP code is now integrated into the entire package (attached).
The VP cogs are started and stopped from jvmMain.

regards peter

jazzed · 2008-03-18 03:44

Hi Peter,

I see you have an ADC routine. Have you tested it? I'll plug it in and try later.
I'll have a Uart receiver for you soon. I moved offset constants that I had to CON
space to make room for uart rx. You can integrate that and the transmitter as you like.

It is likely with all the rd/wr* stuff you will need 6 cogs.
My first experience with the uart tx had rd/wr and the time was 4.5us.
You might optimize that though. I will finish what I've started regardless.

Kaio · 2008-03-19 01:38

Hi Peter,

here is the next release of the JVM using PASM. The opcodes new, getfield and putfield are now handled. Could you please test it?

Thomas

Peter Verkaik · 2008-03-19 08:00

Thomas,
After programming the prop with your asm version, I can program a java program
using the javelin IDE, but it does not run properly. The simple program

import stamp.core.*;
public class registerTest3 {
  static void main() {
    System.out.print("hello world");
    while (true) {
    }
  }
}

does not make the JIDE message window appear.
When I use debug and click Step Into, I get error messages Invalid Class Offset.
So you need to check your asm code to see if you calculate object and class references
correctly (remember all references are stored in big endian format in the javaprog array,
whereas the prop uses little endian for any value that occupies more than 1 byte).

regards peter
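
The byte-order point can be shown with two bytes. The sketch assumes 16-bit references purely for illustration; the helper names are invented and the sample bytes are made up.

```python
# Sketch: the same two bytes read as big endian and as little endian.
def read_be16(buf, offset):
    """Big endian: most significant byte first (as stored in javaprog)."""
    return (buf[offset] << 8) | buf[offset + 1]


def read_le16(buf, offset):
    """Little endian: least significant byte first (the prop's native order)."""
    return buf[offset] | (buf[offset + 1] << 8)


javaprog = bytes([0x12, 0x34])        # made-up sample data
print(hex(read_be16(javaprog, 0)))    # 0x1234
print(hex(read_le16(javaprog, 0)))    # 0x3412
```

A native word read on the prop behaves like read_le16, so a reference fetched that way comes out byte-swapped unless the code swaps it back.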

Kaio · 2008-03-19 20:11

Peter,
many thanks for testing. I'll check where the problem is located.

Thomas
