TLMM: Threaded LMM - four, nine or even 19 drivers in one cog!

Bill Henning · 2010-07-08 22:43

Over in the i2c prop to prop thread http://forums.parallax.com/showthread.php?p=920681 I asked for a combination mouse/keyboard/serial driver, implemented on top of Peter's great scheduler.

Peter thought it was a great idea, and mentioned he was thinking about loading scheduled code on the fly - and asked if that was similar to LMM. It would be, if the LMM code used FCACHE :)

Why is this relevant?

Because it reminded me of the idea for a deterministic LMM threading model. Mind you, to keep it deterministic, there would be strict limitations - but if those are acceptable, it would be possible to have four 1MIPS or nine 0.5MIPS threads running in a single cog! It would also be possible to have 19 0.25MIPS threads.
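The thread counts and speeds above can be reproduced with simple arithmetic. The sketch below is only an estimate and rests on assumptions not spelled out in this post: an 80MHz clock, a hub access window every 16 clocks, one window per thread instruction, and one more window spent on the jump back to the top of the loop.

```python
# Rough throughput estimate for a TLMM loop (assumptions, not measurements):
#   - 80 MHz system clock
#   - the cog gets a hub access window every 16 clocks
#   - every thread uses one hub window per LMM instruction
#   - the jump back to the top of the loop costs one more window
CLOCK_HZ = 80_000_000
CLOCKS_PER_HUB_WINDOW = 16

def mips_per_thread(num_threads: int) -> float:
    """Estimated LMM instructions per microsecond for each thread."""
    windows_per_pass = num_threads + 1
    clocks_per_pass = windows_per_pass * CLOCKS_PER_HUB_WINDOW
    return CLOCK_HZ / clocks_per_pass / 1_000_000

for n in (4, 9, 19):
    print(n, "threads:", mips_per_thread(n), "MIPS each")
```

Under these assumptions the thread counts that fit neatly are the ones where the number of threads plus one is a convenient number of hub windows.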

This would allow four to nine low speed drivers to co-exist in a single cog - and allow them to be easy to write!

Mind you, Peter's scheduler is MUCH nicer for a larger number of tasks - but TLMM would be more deterministic.

I don't want to take too much time away from debugging the new PCBs, but I will post some simple source code to illustrate how it would work later.

Limitations required for deterministic timing:

- each thread is limited in size, about 1K instructions is the practical maximum, and only -127..+128 instruction relative jumps would be permitted
- no FCACHE, no calling native pasm "primitives"
- doing a CALL will take multiple instructions, so will a RET - this model is mainly meant for in-line code without subroutines
- no WAITxxx instructions
- must use very few RD* and WR* instructions, and only where it has little impact as it will offset all other threads by 200ns

Note how none of the limitations would really impact keyboard, mouse and serial drivers!

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.mikronauts.com E-mail: mikronauts _at_ gmail _dot_ com
My products: Morpheus / Mem+ / PropCade / FlexMem / VMCOG / Propteus / Proteus / SerPlug
and 6.250MHz Crystals to run Propellers at 100MHz & 5.0" OEM TFT VGA LCD modules
Las - Large model assembler Largos - upcoming nano operating system

Post Edited (Bill Henning) : 7/9/2010 12:08:37 AM GMT

Bill Henning · 2010-07-08 23:11

Here is a rough, untested implementation that illustrates how a four-thread TLMM works:

' TLMM - Threaded Large Memory Model
'
' This version of LMM is intended to be reasonably deterministic, and as such, does NOT support FCACHE
'
' Copyright 2010 by William Henning
'
' Distributable under the terms of the MIT license, attribution required in source and documentation.
'
' T4LMM - four thread version, 1MIPS per thread
' T8LMM - nine thread version, 0.5MIPS per thread
'
CON
  version = 1_00
  threads = 4

VAR
  long mbox[4]

PUB launch(thread1,thread2,thread3,thread4)
  mbox[0]:=thread1
  mbox[1]:=thread2
  mbox[2]:=thread3
  mbox[3]:=thread4
  cognew(@tlmm_init,@mbox)

' invoke as launch(@thread1,@thread2,@thread3,@thread4)

DAT

ptr     long  0 ' 1+9*4 = 37 longs for registers

pc1     long  0
r1a     long  0 ' A register for thread 1
r1b     long  0 ' B register for thread 1
r1c     long  0 ' C register for thread 1
r1d     long  0 ' D register for thread 1
r1e     long  0 ' E register for thread 1
r1f     long  0 ' F register for thread 1
r1g     long  0 ' G register for thread 1
r1h     long  0 ' H register for thread 1

pc2     long  0
r2a     long  0 ' A register for thread 2
r2b     long  0 ' B register for thread 2
r2c     long  0 ' C register for thread 2
r2d     long  0 ' D register for thread 2
r2e     long  0 ' E register for thread 2
r2f     long  0 ' F register for thread 2
r2g     long  0 ' G register for thread 2
r2h     long  0 ' H register for thread 2

pc3     long  0
r3a     long  0 ' A register for thread 3
r3b     long  0 ' B register for thread 3
r3c     long  0 ' C register for thread 3
r3d     long  0 ' D register for thread 3
r3e     long  0 ' E register for thread 3
r3f     long  0 ' F register for thread 3
r3g     long  0 ' G register for thread 3
r3h     long  0 ' H register for thread 3

pc4     long  0
r4a     long  0 ' A register for thread 4
r4b     long  0 ' B register for thread 4
r4c     long  0 ' C register for thread 4
r4d     long  0 ' D register for thread 4
r4e     long  0 ' E register for thread 4
r4f     long  0 ' F register for thread 4
r4g     long  0 ' G register for thread 4
r4h     long  0 ' H register for thread 4

'------------------------------------------------------------------------------------------
        org 0

tlmm_init ' later overlaid by program counters & registers
        mov   ptr,par
        rdlong pc1,ptr
        add   ptr,#4
        rdlong pc2,ptr
        add   ptr,#4
        rdlong pc3,ptr
        add   ptr,#4
        rdlong pc4,ptr
        jmp   #next

        long  0[28]

next    rdlong ins1,pc1
        add   pc1,#4
ins1    nop
        rdlong ins2,pc2
        add   pc2,#4
ins2    nop
        rdlong ins3,pc3
        add   pc3,#4
ins3    nop
        rdlong ins4,pc4
        add   pc4,#4
ins4    nop
        jmp   #next

'------------------------------------------------------------------------------------------

' area from here on is available as scratch registers for the threaded code!

' I recommend
' reg 100-199 for thread 1 scratch area
' reg 200-299 for thread 2 scratch area
' reg 300-399 for thread 3 scratch area
' reg 400-495 for thread 4 scratch area

'------------------------------------------------------------------------------------------
' sample thread 1 - blink an LED on P0 at 50Hz
'------------------------------------------------------------------------------------------

        org 0

thread1 or  dira,#%000000001

        mov r1a,#100
iloop1  mov r1b,#100
        sub r1b,#1 wz
 if_nz  sub pc1,#8  ' branch back to the decrement (pc1 already points past this instruction)
        sub r1a,#1 wz
 if_nz  sub pc1,#20

        xor outa,#%000000001

        sub pc1,#32

'------------------------------------------------------------------------------------------
' sample thread 2 - blink an LED on P1 at 25Hz
'------------------------------------------------------------------------------------------

        org 0

thread2 or  dira,#%000000010

        mov r2a,#50
iloop2  mov r2b,#100
        sub r2b,#1 wz
 if_nz  sub pc2,#8  ' branch back to the decrement (pc2 already points past this instruction)
        sub r2a,#1 wz
 if_nz  sub pc2,#20

        xor outa,#%000000010

        sub pc2,#32

'------------------------------------------------------------------------------------------
' sample thread 3 - blink an LED on P2 at 12.5Hz
'------------------------------------------------------------------------------------------

        org 0

thread3 or  dira,#%000000100

        mov r3a,#50
iloop3  mov r3b,#100
        sub r3b,#1 wz
 if_nz  sub pc3,#8  ' branch back to the decrement (pc3 already points past this instruction)
        sub r3a,#1 wz
 if_nz  sub pc3,#20

        xor outa,#%000000100

        sub pc3,#32

'------------------------------------------------------------------------------------------
' sample thread 4 - generate a 500Khz square wave on P3
'------------------------------------------------------------------------------------------

        org 0

thread4 or  dira,#%000001000

iloop4  xor outa,#%000001000
        sub pc4,#8  ' back to iloop4 (pc4 already points past this instruction)
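To sanity-check relative branch offsets without hardware, the loop can be modelled in a few lines. This is a toy model with hypothetical names, not the PASM itself: the program counter counts longs rather than bytes, and each "instruction" is a Python function. The point it shows is that the pc is advanced before the fetched instruction runs, so a branch back to the instruction just before it must subtract two longs (8 bytes).

```python
# Toy model of the TLMM loop: one instruction per thread per pass.
# The program counter is advanced BEFORE the fetched instruction runs,
# so a relative branch counts back from the FOLLOWING instruction.
# All names are hypothetical; pc counts longs, not bytes.

class Thread:
    def __init__(self, program):
        self.program = program   # list of callables taking the thread
        self.pc = 0
        self.toggles = 0

def run(threads, passes):
    for _ in range(passes):
        for t in threads:
            ins = t.program[t.pc]   # rdlong insN,pcN
            t.pc += 1               # add    pcN,#4
            ins(t)                  # insN:  execute the fetched instruction

def nop(t):
    pass

def toggle(t):
    t.toggles += 1                  # stands in for xor outa,#mask

def branch_back(longs):
    def ins(t):
        t.pc -= longs               # sub pcN,#(4*longs)
    return ins

# Shape of sample thread 4: one setup instruction, then toggle / branch forever.
square_wave = [nop, toggle, branch_back(2)]

threads = [Thread(square_wave), Thread(square_wave)]
run(threads, 9)
print([t.toggles for t in threads])
```

With `branch_back(1)` the model gets stuck re-executing the branch itself, which is an easy mistake to make when counting offsets by hand.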


Post Edited (Bill Henning) : 7/9/2010 12:08:30 PM GMT

Cluso99 · 2010-07-08 23:18

WOW, the prop is now really cooking with gas!

Just a thought...
Keyboard (and I presume mouse because I have not tested it) are really serial devices. I have written the keyboard driver using a single pin and therefore presume the mouse could be done similarly. This would save 2 pins (we are always short of pins).
Now the issues with a 1pin keyboard are:
* you cannot reset the keyboard (not really required anyway)
* you cannot set the leds on the keyboard (don't have them on a laptop anyway)
* need to test the timing of the keyboard initially. It can then be fixed in the program if required.
* Note it still works with the existing hardware 2pin interface

This would be a really neat driver to handle 1pin keyboard, 1pin mouse and serial in a single cog.

With a splitter cable, an existing Keyboard socket could be used to interface to both the 1pin keyboard and 1pin mouse.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Links to other interesting threads:

· Home of the MultiBladeProps: TriBlade,·RamBlade,·SixBlade, website
· Single Board Computer:·3 Propeller ICs·and a·TriBladeProp board (ZiCog Z80 Emulator)
· Prop Tools under Development or Completed (Index)
· Emulators: CPUs Z80 etc; Micros Altair etc;· Terminals·VT100 etc; (Index) ZiCog (Z80) , MoCog (6809)·
· Prop OS: SphinxOS·, PropDos , PropCmd··· Search the Propeller forums·(uses advanced Google search)
My cruising website is: ·www.bluemagic.biz·· MultiBlade Props: www.cluso.bluemagic.biz

Bill Henning · 2010-07-09 00:16

Thanks!

Since posting, I have been doing some additional thinking on this matter. If a lower MIPS rating is acceptable, I can see how to make the threads jitterless while still allowing them to do a hub read and a hub write per execution cycle.

But first, a problem. The flags are not preserved across threads, so the conditional branch I showed above can't work.

One simple solution is to use two slots for every thread, and preserve/restore C and Z for each thread.

I hope to have a better solution RSN.

The basis of this deterministic threaded LMM is to stay synchronized to the hub, and have a number of "hub slots" assigned to every thread.

For argument's sake, let's assume there are 20 slots.

Each slot can contain one of:

- LMM fetch, increment thread_pc, execute instruction
- hub read for one of the threads
- hub write for one of the threads
- jump to top of TLMM loop (call it TLOOP slot)

Note that if at least one slot contains a hub read or hub write, there is no need to waste a slot on a TLOOP slot!
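A slot table like this is easy to play with on paper or in a few lines of code. The sketch below is purely hypothetical: the allocation of slots to threads is invented for illustration, and it assumes each slot takes exactly one hub window (16 clocks at 80MHz, i.e. 200ns).

```python
# Hypothetical 20-slot table. Each slot is assumed to take one hub
# window (16 clocks at 80 MHz = 200 ns). The allocation is made up.
from collections import Counter

SLOT_NS = 200

slots = (
    [("exec", t) for t in range(7)] * 2      # 14 slots: two instructions per thread
    + [("hub_read", 0), ("hub_write", 0)]    # threads 0..2 each get one hub read
    + [("hub_read", 1), ("hub_write", 1)]    # and one hub write per pass
    + [("hub_read", 2), ("hub_write", 2)]
)
assert len(slots) == 20

pass_ns = len(slots) * SLOT_NS               # time for one trip round the table
exec_counts = Counter(t for kind, t in slots if kind == "exec")

def mips(thread: int) -> float:
    """LMM instructions per microsecond for one thread."""
    return exec_counts[thread] * 1000 / pass_ns

print("pass time:", pass_ns, "ns")
print("thread 0:", mips(0), "MIPS")
```

Because the table is fixed, every thread sees the same timing on every pass, which is where the determinism would come from.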


Post Edited (Bill Henning) : 7/9/2010 12:22:42 AM GMT

localroger · 2010-07-09 00:20

Cluso, a bit offtopic but worth mentioning -- one big advantage of the 2-pin keyboard driver is that a lot of hardened industrial keypads don't have a num lock annunciator and the ability to command the keyboard into num lock and disable switching is invaluable. I noticed this because another device I use lacks the capability of sending commands to the external PS/2 keyboard (not through hardware limit but firmware non-implementation), and I was gratified when I moved to the prop to see that I had gained the capability.

RossH · 2010-07-09 00:22

Nice work, Bill. I foresee great things ahead for Peter's scheduler.

I also feel a challenge coming on ...

Who will be the first to get a single cog to support a keyboard, mouse, sd card and rtc? All these drivers are individually quite small, and it should be possible to fit them all into a single cog.

I would include a display driver as well, but I suspect this would be pushing things a bit too far even for the Propeller. Or would it?

I don't think we really need a prize, but I'll happily give anyone who achieves this feat a copy of the new Catalina Code Optimizer (10% smaller code with up to 15% speed improvement!).

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Catalina - a FREE C compiler for the Propeller - see Catalina

Bill Henning · 2010-07-09 00:33

Thanks Ross!

Actually, I think Peter's scheduler is significantly more sophisticated, and better for most applications. His scheduler actually schedules... TLMM time-slices.

Where TLMM may be a bit better is on finer-grained determinism. I am still thinking through all the implications...

I can easily see a single TLMM cog handling:

- keyboard
- mouse
- two 38.4kbps serial ports
- maybe even a 100kHz I2C read/write engine instead of the serial ports

As for your challenge... I am afraid I have too much on my plate to compete :)

I think I am better off coming up with enabling technologies for stretching the propeller's limits (LMM, VMCOG, TLMM, more in the future) and letting others implement apps with them

As an aside, if I saved/restored flags, TLMM could run four non-deterministic (think Catalina <grin>) threads at about 0.5MIPS each! Or two at 1MIPS.


wjsteele · 2010-07-09 00:45

I wonder... if this were combined with an extended memory (outside cog or hub), is there a benefit to "interleaving" the instructions for the various threads? I'm thinking about reads being faster if they are sequential rather than random. That way, you could read and load all the longs at once instead of reading each one individually. I realize rdlong only does one at a time, but that's for local memory only.

Now that I think about it... what about interleaving the code in the cog itself... is there any benefit to doing that? It would basically make it unreadable, but a tool could easily be built to do it for us after we code them independently.

Is there some mechanism or memory model where this would be optimal?

BTW... I really love this model!!! I'm going to have to rethink what we're doing with our little toy, now! I think this can really save us a bunch of cost associated with our new upcoming hardware and software requirements. Instead of adding a second prop, I think we can get it to fit quite easily in the one we have with this technique. For example, we're spinning up 3 cogs for different serial stuff right now... which, with this, I'm sure can be done in one now!

Bill

RossH · 2010-07-09 01:07

@Bill,

Sorry - I thought your solution used Peter's scheduler. But in any case, the challenge itself is still valid - it could be done either using your time slicer, Peter's scheduler - or simply some very efficient hand-crafted code.

My main interest is simply in reducing the number of cogs currently required for Catalina drivers - this would give Catalina programs more cogs on which to run application code.

While multithreading Catalina is something I've long wanted to do, it hasn't proved practical yet. My original design for Catalina included a subsidiary kernel that was going to support multithreading - but my attempts to do this so far have required me to remove so much of the kernel functionality that I need yet another code generator to generate code for it. This really complicates everything, since it means Catalina now has to know at compile time where the code is to be executed (whereas the ideal situation is for this to be determined at run time- i.e. only running multiple threads on the one cog when there are no more available cogs). This in turn eliminates one of Catalina's main features - i.e. that it provides a sophisticated "Hardware Abstraction Layer" which allows the same Catalina program to run unmodified on any Propeller platform (since the number of cogs available changes depending on the device drivers - which are platform dependent).

However, while typing this I've just thought of a fairly simple idea that may make this unnecessary. I'll have to give this some more thought.

Ross.


kwinn · 2010-07-09 02:07

Very elegant Bill. Reminiscent of how a single Z80 CPM system was used to run 8 serial terminals for a key to disk data entry system.

potatohead · 2010-07-09 03:27

Well, a display driver would be tough. But, offering display services, such as calculating screen mode changes, manipulating text, setting up to draw a mouse pointer during blanking periods, and other things could be done in that COG.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Propeller Wiki: Share the coolness!
8x8 color 80 Column NTSC Text Object
Wondering how to set tile colors in the graphics_demo.spin?
Safety Tip: Life is as good as YOU think it is!

Cluso99 · 2010-07-09 04:12

localroger: The keyboard actually sends you codes indicating keypresses so the numlock feature is actually implemented in the cog. The led on the keyboard is just a light. Now if you need to know its state that is different. But as I said, you do not have one on a laptop.


potatohead · 2010-07-09 04:16

Not that I care about this (I don't), but I have a state indicator on all my laptops. Currently I have a Dell inspiron, a coupla HP entertainment class laptops, and a newer Thinkpad.


Roy Eltham · 2010-07-09 04:26

Cluso99: Every laptop I have ever owned or seen has had the keyboard led state indicators on them. Not sure why you think laptops don't have them.

Bill: I got an up-close and detailed demo of Peter's threading stuff along with Chip during the late night hours of UPEW, and it seemed to me to be reasonably deterministic. I guess it just depends on the speed you are going after and how well behaved your "threads" are...

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Check out the Propeller Wiki·and contribute if you can.

HollyMinkowski · 2010-07-09 06:26

Bill, this is great stuff!

This shows just how important it is for everyone
to learn and use PASM.

With the prop you just have to know PASM to do
really good work. The prop's assembly language
is a joy to work with compared to some other
processors.

Bill Henning · 2010-07-09 12:02

wjsteele:

Interleaving would be useful for a non-LMM threaded cog where there was no need for branching or looping - but that is of marginal utility.

I'm glad to hear you like TLMM - the whole intention was to be able to easily write multi-driver cogs, so that we don't "waste" cogs in our products!

RossH:

No worries! Peter's scheduler and TLMM are both great approaches for getting more work done in a single cog, precisely so that we can all save cogs.

Hmm.. maybe Peter and I should start a "Save the COGs!" foundation...

For multi-threading Catalina, after careful thought, I think you would be better off having a small scheduler and swapping out the state of the LMM kernel every X*100 LMM instructions. Less overhead, but less fine grained than TLMM - but that is fine for "business logic".

kwinn:

Thank you!

For a lot of uses Peter's scheduler is actually superior - ie if you have threads that need to be woken up at specific times or based on specific events (pins going high/low).

TLMM may be somewhat better when you are trying to fit several high speed serial ports into a cog.

potatohead:

TLMM is not intended to do video drivers!

Having said that, a TV driver *might* be doable if assigned several thread slots in order to give it about half of the cog's time. I'd have to do timing calculations to verify this.

It should however be possible to write a PASM TV driver that executed one LMM instruction per WAITVID, for a slow background thread.

Roy Eltham:

I totally agree - that's why in my previous post I said:

Actually, I think Peter's scheduler is significantly more sophisticated, and better for most applications. His scheduler actually schedules... TLMM time-slices.

Where TLMM may be a bit better is on finer-grained determinism. I am still thinking through all the implications...

I am aware that Peter has a pretty sophisticated scheduler, however I don't believe it could reliably schedule (with more than a single thread running) so that four threads run every 1us.

Conceptually, you can think of Peter's scheduler as a "real" scheduler for multiple threads, and you can think of TLMM as a way of turning one cog into four/nine/... baby slow cogs, which are implemented by round-robin time slicing a cog, giving each thread one LMM instruction before passing control to the next.

HollyMinkowski:

Thank you!

I could not agree with you more about the importance of learning PASM, and how powerful it is.


potatohead · 2010-07-09 14:14

Yeah, I can see that working. All in all, I'm intrigued with all the ways there are to get things done. Well done, BTW! I mentioned don't care up-thread, and it was about blinky lights. Just thought I would make that clear.


Cluso99 · 2010-07-09 14:17

Roy & potatohead: Before I sent the email off I checked my wife's laptop Compaq which is a week old (with the numeric keypad) and it does not have a numlock led.
Anyway, my point was that it is dealt with in the keyboard driver software, so unless the led is required then a pin can be saved.


Bill Henning · 2010-07-09 15:15

Thanks!

Actually the more I think about it, the more I like interleaving two LMM threads (one per waitvid) with a TV driver... perfect for also doing the keyboard and mouse driver in that cog!

Heck, there may even be enough time for a 38.4kbps serial driver too.

What's nice about TLMM is that the threads are pretty much normal PASM (as long as I save/restore the flags) - no need for SLEEP, no need to call a scheduler regularly. The down side is that threads need to busy wait for pin states or time if they need them as trigger events - so Peter's scheduler and TLMM are complementary, not competing with each other.

(i figured it was something like that for the don't care)


wjsteele · 2010-07-09 15:44

You know, one other thought occurred to me. Instead of writing the instruction over for each thread, could we simply swap the registers for each "thread" to the second/third/whatever set? That way, we don't have "duplicate" code... only one set of VMed code, but different stacks for each thread, if you will.

This would be especially handy where we are using the same serial code for multiple serial connections, etc. We could then initialize the VM thread with the code block to use as well as the stack block to use. Doing that, we could easily have something like 3 serial drivers + 1 keyboard driver = 4 stacks - giving us 4 logical threads, but only using two sets of instructions in cog ram.

Bill

Bill Henning · 2010-07-09 16:17

Something like that could be done, however it would be slower :(


Bill Henning · 2010-07-09 16:19

Here is another thought experiment... take the LMM out of TLMM, and you get TCOG - threaded cog!

This would fetch the code for the different threads from inside the cog; however any hub reference will cause a 200ns-300ns "offset" to running the next threads.

' i know - flags are not preserved... but interesting thought experiment - 5x >1MIPS in-cog threads!
' a thread can perform an absolute jump by 'movs threadN,#label'

    org 0

start   movs  thread1,#thread1_code
        movs  thread2,#thread2_code
        movs  thread3,#thread3_code
        movs  thread4,#thread4_code
        movs  thread5,#thread5_code

thread1 mov   ins1,0-0
        add   thread1,#1
ins1    nop

thread2 mov   ins2,0-0
        add   thread2,#1
ins2    nop

thread3 mov   ins3,0-0
        add   thread3,#1
ins3    nop

thread4 mov   ins4,0-0
        add   thread4,#1
ins4    nop

thread5 mov   ins5,0-0
        add   thread5,#1
ins5    nop
        jmp   #thread1    

thread1_code
    ' do stuff

thread2_code
    ' do stuff

thread3_code
    ' do stuff

thread4_code
    ' do stuff

thread5_code
    ' do stuff


Post Edited (Bill Henning) : 7/9/2010 4:25:45 PM GMT

Bill Henning · 2010-07-09 16:31

I can't find all the tricky ways of saving/restoring flags that were posted in the forum in the past; however my murky memory indicates that the following should work:

' save flags
muxnz flags,#1
rcl flags,#1

' restore flags
shr flags,#1 wc,wz ' C = saved C (bit 0); Z is set when the saved NZ bit is 0 (upper bits of flags must be clear)

Note, the above are untested, and are probably exactly the same as in the thread about saving/restoring the flags that I can't find....

Or they simply might not work.
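Since a flag save/restore pair is hard to eyeball, here is a small bit-level model for checking one. It models a save of `muxnz flags,#1` followed by `rcl flags,#1`, and a restore of `shr flags,#1 wc,wz`. The instruction semantics coded below are my reading of the Propeller manual and should be treated as assumptions; the model also assumes `flags` starts out as zero.

```python
MASK32 = 0xFFFFFFFF

def muxnz(d, s, z):
    """muxnz d,s : set the bits selected by s to NOT Z (assumed semantics)."""
    return (d | s) if not z else (d & ~s & MASK32)

def rcl1(d, c):
    """rcl d,#1 : shift left by one, current C enters bit 0."""
    return ((d << 1) | int(c)) & MASK32

def shr1_wc_wz(d):
    """shr d,#1 wc,wz : C = original bit 0, Z = (result == 0)."""
    result = d >> 1
    return result, bool(d & 1), result == 0

def save(flags, c, z):
    return rcl1(muxnz(flags, 1, z), c)

def restore(flags):
    return shr1_wc_wz(flags)

# Walk through every C/Z combination, reusing the same flags register.
flags = 0
for c in (False, True):
    for z in (False, True):
        flags = save(flags, c, z)
        flags, c2, z2 = restore(flags)
        assert (c2, z2) == (c, z)
print("all four C/Z combinations survive a save/restore round trip")
```

A model like this only checks the logic; the real test is still running it on a Propeller.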


wjsteele · 2010-07-09 17:07

Bill Henning said...
Something like that could be done, however it would be slower :(

Yep, I agree, but it could still allow a nice approach if speed weren't important. The thing to point out here is it's always going to be slower than PASM, that's what this trade off is, but this process is actually using cycles that were going to be wasted by the cog anyway. So, any gain we get is still a gain, and it's a real bonus if we're eliminating the need for additional cogs, right?

Bill

Bill Henning · 2010-07-09 17:59

Actually what I meant is that it would be slower, and not allow any extra cycles... ie it would get less work done per unit time (ie every sec); so while theoretically possible, it is a less attractive approach.

Where something like that is useful is in what I suggested to heater, swapping out the context every 100us or so, for "high level" threading of C apps.


wjsteele · 2010-07-09 18:05

Ah, got it!

Bill

pjv · 2010-07-09 18:27

@Bill & Bill;

Without indirect addressing capabilities in the Propeller, re-using code for several iterations is a real bear. It probably takes a similar amount of code to deal with that as it does to just make several copies.

That said however, it IS what I am pursuing with loading the tiny drivers from HubRam, so maybe an efficient method can be uncovered.

Cheers,

Peter (pjv)

Bill Henning · 2010-07-09 18:37

Peter,

I tend to agree :)

FYI, I think your scheduler is far superior to TLMM for most scheduled tiny drivers, and I plan on using your scheduler soon :)

The i2c thread just reminded me of my musings in '06 about making a multi-threaded LMM, which led me to think of how to make it deterministic at the 1us level. Then frustrations with debugging SPI RAM issues on a new PCB led me to whip up a quick sample TLMM implementation :) :) :)

Unfortunately having to save/restore flags reduces the throughput per thread; two threads would run at 1MIPS each, three at 0.714MIPS each, four at 0.55MIPS each, five at 0.45MIPS... seven threads at 0.333MIPS

MIPS PER THREAD = (COG_MIPS) / (8*NUM_THREADS + 4)

For TCOG, the variant that runs totally in-cog (and is less deterministic, because any hub access messes up the timing big time):

MIPS PER THREAD = (COG_MIPS) / (6*NUM_THREADS + 1)
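For anyone who wants to tabulate the two formulas, a minimal sketch follows. It assumes COG_MIPS is 20 (an 80MHz clock at four clocks per instruction), which is not stated explicitly above.

```python
# Tabulate the two per-thread throughput formulas.
# Assumption: COG_MIPS = 20 (80 MHz clock, 4 clocks per instruction).
COG_MIPS = 20.0

def tlmm_mips(num_threads: int) -> float:
    """TLMM with flag save/restore."""
    return COG_MIPS / (8 * num_threads + 4)

def tcog_mips(num_threads: int) -> float:
    """TCOG, the in-cog variant."""
    return COG_MIPS / (6 * num_threads + 1)

print("threads   TLMM    TCOG")
for n in range(2, 8):
    print(f"{n:7d}  {tlmm_mips(n):5.3f}   {tcog_mips(n):5.3f}")
```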

Regards,

Bill

Post Edited (Bill Henning) : 7/9/2010 6:48:36 PM GMT

Bill Henning · 2010-07-09 18:53

After doing the calculations in the above post, seven threads jumps out as a useful number.

Keyboard and mouse drivers have a clock frequency between 10kHz and 16.7kHz.

0.333MIPS / 10kHz = 33.3 instructions per keyboard bit - piece of cake.

0.333MIPS / 16.7kHz = 20 instructions per bit

So we know that a single TLMM cog can easily support up to seven keyboards or mice!

Here is a hypothetical thread allocation for a seven-thread cog:

thread 0 = keyboard
thread 1 = mouse
thread 2 = 19.2KBPS comm port

leaving four more threads available for additional drivers for which 0.333MIPS is fast enough!
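As a rough check of the per-bit budgets above, here is a small Python sketch. It assumes a 20MIPS cog, uses the seven-thread TLMM rate, and applies the same division to the 19.2kbps port (a figure the post does not work out); the names are made up for illustration.

```python
# Assumption: 20 MIPS cog, seven TLMM threads -> 20 / (8*7 + 4) MIPS per thread.
THREAD_MIPS = 20.0 / (8 * 7 + 4)


def instructions_per_bit(thread_mips, bit_rate_hz):
    """Instructions one thread can execute during a single bit time."""
    return thread_mips * 1_000_000 / bit_rate_hz


if __name__ == "__main__":
    for label, rate_hz in [("keyboard/mouse at 10kHz", 10_000),
                           ("keyboard/mouse at 16.7kHz", 16_700),
                           ("serial at 19.2kbps", 19_200)]:
        budget = instructions_per_bit(THREAD_MIPS, rate_hz)
        print(f"{label}: {budget:.1f} instructions per bit")
```

By the same arithmetic the 19.2kbps port gets a little over 17 instructions per bit, so it is the tightest of the three.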

Post Edited (Bill Henning) : 7/9/2010 7:00:39 PM GMT

jazzed · 2010-07-09 19:25

Bill, Is the aggregate TLMM throughput higher than single LMM?

I hope we can help Peter with his driver loader.

I've never heard of anyone trying to do multi-thread chunk overlay loading
except for this TLMM which is essentially a one long at a time overlay loader.

I guess a question is: do the scheduler code threads need to be swapped often?

Having a tiny unrestricted overlay loader would be useful. Many of us swap
out blocks of memory as required to run PASM chunks, the most efficient
cases appear to have predetermined begin/end points. I cheat and zero
terminate chunks (NOP must be non-zero and data is not interleaved in code).
I use the unrolled read 4 long then jump method because several fragments
required by the JVM have less than 6 instructions and use on COG macros.
Loading 6 instructions with an unrolled loop is essentially as fast and may
be faster than a perfect window timing loader because of lower overhead.

The things good about chunks over LMM style are that once loaded the PASM
can run at speed, natural jumps can be used, and on COG service macro
routines are accessible directly. The bad thing about chunks is that for the
most part they have to be predefined. LMM does not suffer from predefinition.

If there was a way for chunks to be compiled and used generically that would
be great ... FCACHE does this a little, but one still needs to use LMM macros
and if I remember correctly, you have to declare code as FCACHE-able.

BTW: how about a little golf challenge (maybe another thread):
How many instructions and registers does it take you to save/restore C & Z?

Cheers,
--Steve

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Propeller Pages: Propeller JVM

Bill Henning · 2010-07-09 20:08

jazzed said...
Bill, Is the aggregate TLMM throughput higher than single LMM?

Nope. Can't be.

LMM's theoretical maximum execution rate is 5MIPS for "pure" LMM code.

With a four-way unrolled LMM inner loop, 80% of that ideal is achieved, i.e. 4MIPS for a single thread.

An eight-way unrolled LMM inner loop would hit (8/9)*5MIPS, i.e. 4.44MIPS, for "pure" LMM code.
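The two figures above fit the pattern n/(n+1) of the 5MIPS ideal for an n-way unrolled loop. That generalization is inferred from the four-way and eight-way numbers only, so treat this Python sketch as a rule of thumb; the function name is made up.

```python
# Theoretical maximum for "pure" LMM code, as stated in the post.
LMM_PEAK_MIPS = 5.0


def unrolled_lmm_mips(unroll, peak_mips=LMM_PEAK_MIPS):
    """Estimated throughput of an n-way unrolled LMM loop: n/(n+1) of peak.

    Assumption: one slot per pass is lost to the jump back to the loop top.
    """
    return peak_mips * unroll / (unroll + 1)


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(f"{n}-way unrolled: {unrolled_lmm_mips(n):.2f} MIPS")
```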

Mind you, it is more complicated than that. Executing hub instructions slows things down, using FCACHE properly speeds things up dramatically.

A very good optimizing compiler, or good programmer, should be able to hit 15-19 MIPS, depending on how much advantage it can take of FCACHE.

jazzed said...
I hope we can help Peter with his driver loader.

I agree!

jazzed said...
I've never heard of anyone trying to do multi-thread chunk overlay loading

Yep, Peter's approach is very interesting.

jazzed said...
except for this TLMM which is essentially a one long at a time overlay loader.

Umm.. that's not how I'd describe it, nor do I think of it that way.

Basically, TLMM is an unrolled LMM loop (unroll factor = # threads) where instead of running the same thread, each read/exec cycle executes a different thread.
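A toy Python model of that idea, for illustration only: each thread keeps its own program counter, and one pass through the "unrolled loop" fetches and executes exactly one instruction from every thread in turn. The class, the instruction labels and the wrap-around PC are all hypothetical; the real kernel is PASM and is not shown in this post.

```python
class Thread:
    """One TLMM thread: a program (list of instruction labels) and its own PC."""

    def __init__(self, name, program):
        self.name = name
        self.program = program
        self.pc = 0


def run_rounds(threads, rounds):
    """Run `rounds` passes of the loop; each pass gives every thread one slot."""
    trace = []
    for _ in range(rounds):
        for t in threads:                        # one read/exec slot per thread
            op = t.program[t.pc]                 # fetch at this thread's own PC
            t.pc = (t.pc + 1) % len(t.program)   # advance; wrap models an endless driver loop
            trace.append((t.name, op))           # "execute" is just recorded here
    return trace


if __name__ == "__main__":
    threads = [Thread("kbd", ["k0", "k1"]), Thread("mouse", ["m0", "m1", "m2"])]
    for name, op in run_rounds(threads, 3):
        print(name, op)
```

Because every thread gets exactly one slot per pass, each thread's instruction rate depends only on the number of threads, not on what the other threads are doing.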

Theoretically, for threads that do not need to be highly deterministic, TLMM could use kernel primitives, FCACHE, etc, to run multiple compiled LMM threads on the same cog.

Heck, it could be independent processes!

Vi would probably run usably at 0.5MIPS!

This would trade almost-determinism for running large multi-threaded user code ("business logic"). That's the role I originally envisioned for multi-threaded LMM kernels a couple of years ago; it only dawned on me to use it for (almost) deterministic medium-speed drivers when I started this thread.

jazzed said...
I guess a question is: do the scheduler code threads need to be swapped often?

That would depend on the application.

jazzed said...
Having a tiny unrestricted overlay loader would be useful. Many of us swap
out blocks of memory as required to run PASM chunks, the most efficient
cases appear to have predetermined begin/end points. I cheat and zero
terminate chunks (NOP must be non-zero and data is not interleaved in code).
I use the unrolled read 4 long then jump method because several fragments
required by the JVM have less than 6 instructions and use on COG macros.
Loading 6 instructions with an unrolled loop is essentially as fast and may
be faster than a perfect window timing loader because of lower overhead.

Sounds like a perfect fit for your JVM!

jazzed said...
The things good about chunks over LMM style are that once loaded the PASM
can run at speed, natural jumps can be used, and on COG service macro
routines are accessible directly. The bad thing about chunks is that for the
most part they have to be predefined. LMM does not suffer from predefinition.

FCACHE does exactly that.

jazzed said...
If there was a way for chunks to be compiled and used generically that would
be great ... FCACHE does this a little, but one still needs to use LMM macros
and if I remember correctly, you have to declare code as FCACHE-able.

I think you may wish to re-read my postings on FCACHE in my original thread :)

In my original kernel, FCACHE loads code into the $080-$0FF range.

Code that is loaded there runs at full speed, and can use regular jumps, djnz etc within that range.

The intention was, for example, that str*() functions would be FCALL'ed

LMM code would do the initialization (as it is faster than loading the FCACHE'd block)

but the actual working loop would be FCACHE'd - so all the body of the str*() functions, the loops, would run at full speed!

Same for memcpy() and friends.

Heck, FltDIV / FltMul loops would also approach raw pasm speeds!

jazzed said...
BTW: how about a little golf challenge (maybe another thread):
How many instructions and registers does it take you to save/restore C & Z?

Cheers,
--Steve

A few posts above I showed some code that should save in two instructions, and restore in only one, and requires one flag register per thread. I am pretty sure it is exactly the same (or functionally the same) as the code posted in the original threads that talked about saving/restoring flags... which I could not find!

I'd love to see a way to save both C and Z in one instruction... but I just don't see how that could be done.
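For readers who have not seen the trick, here is a Python model of one common Propeller encoding that saves in two instructions and restores in one. It is an assumption that this matches the code referred to above. The PASM in the comments (muxc, muxnz, and shr with the nr effect) is real Propeller 1 syntax, but the function names and the per-thread flags register layout are made up for the sketch.

```python
def save_flags(c, z):
    """Model of the two-instruction save:
         muxc  flags, #1    ' bit 0 := C
         muxnz flags, #2    ' bit 1 := NOT Z
       Assumes a dedicated per-thread register whose other bits stay zero.
    """
    flags = 0
    if c:
        flags |= 1          # muxc copies C into the masked bit
    if not z:
        flags |= 2          # muxnz copies NOT Z into the masked bit
    return flags


def restore_flags(flags):
    """Model of the one-instruction restore:
         shr flags, #1 wc, wz, nr
       C receives the original bit 0, Z is set if the shifted result is zero,
       and nr leaves the flags register itself unchanged.
    """
    c = bool(flags & 1)
    z = (flags >> 1) == 0
    return c, z


if __name__ == "__main__":
    for c in (False, True):
        for z in (False, True):
            print((c, z), "->", save_flags(c, z), "->", restore_flags(save_flags(c, z)))
```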

(Having fun)

Bill
