Could the cache be two blocks of 8 longs each? Once executing from the first, the second could be loaded without stalling the first, and vice versa. This would prevent stalls due to WIDE reloads, unless there was a JMP/CALL/RET.
For HUBEXEC mode, Windowing into the cog isn't necessary. Do you agree?
Optionally, if a REP loop was fewer than 9 instructions, reloading of its cache could be suppressed (simplest implementation), keeping fast, small REP loops from reloading the cache.
For Video Gen or AUX data, can WIDEs load into AUX directly, without having to be moved by software from cache to AUX?
I'm thinking that for hub execution, there could be a separate set of 16 long registers which never get mapped (as you pointed out, they wouldn't need to be). They would serve as an instruction cache, each set of 8 being reloadable by a background read instruction. The WIDEs would be relegated for read cache and data use, only - no execution possible. This would reduce caveats. I still need to think about this. What we are going to lack is all the built-in pointer-based memory access that a normal large-memory CPU has for things like stacks and array indexing. To think that many people are going to use this mode because it's the type of thinking they're used to, means that they may find it lacking. Just a thought.
To answer your last question, it could be done with auto-move operations, which I'm working on. It would probably require that the hub execution stall itself (like in multi-tasking with jump-backs), so that it doesn't need the hub slot, leaving it available for the auto-move operation.
Don't break your key caps! Sometimes those are hard to get back on. Good grief, how do the Apple guys do it? Amazing small scale engineering. Took me a Looooong time to replace the one I hosed up.
I'm thinking that for hub execution, there could be a separate set of 16 long registers which never get mapped (as you pointed out, they wouldn't need to be). They would serve as an instruction cache, each set of 8 being reloadable by a background read instruction. The WIDEs would be relegated for read cache and data use, only - no execution possible. This would reduce caveats. I still need to think about this.
What we are going to lack is all the built-in pointer-based memory access that a normal large-memory CPU has for things like stacks and array indexing. To think that many people are going to use this mode because it's the type of thinking they're used to, means that they may find it lacking. Just a thought.
That's why I was hoping to use the cog's PC, to free up PTRA, as PTRA & PTRB are pretty much what people are used to - just with smaller indexing range. However, as you pointed out, using PTRA as the PC does simplify entering/exiting hubexec mode.
To answer your last question, it could be done with auto-move operations, which I'm working on. It would probably require that the hub execution stall itself (like in multi-tasking with jump-backs), so that it doesn't need the hub slot, leaving it available for the auto-move operation.
I think the idea of portable assembly language is pretty funny.
Wow, really ?
Portable means many things. With two assemblers already, the simplest meaning is the ease of moving code between them.
That simply makes libraries more useful.
The longer term meaning is portable across programmers.
Parallax strives for long design lifetimes, which means more than one person is likely to work on the code.
Someone looking at spaghetti soup code is not going to think 'That's clever, I love niche things!'
Most programmers think about who will come after them, or who will read their libraries.
It only takes a tiny effort at the tools end, to make that easier.
Yeah JMG, programmers think about that, until they don't. This is all about intent. In a structured setting, what you say makes sense. But what about the unstructured setting?
That's what SPIN + PASM is all about.
There will be an industry standard tool chain for P2, no worries right?
So then, who cares what SPIN + PASM looks like? Seems to me, you should be interested in maximizing the industry standard stuff, not so worried about SPIN + PASM.
Now I understand the worry is that some code you might want to use gets written in "that other toolchain Chip did." Yep. That will absolutely happen.
Comparing the fine syntax found in Chip's assembler to spaghetti soup code is unproductive. One could say the same thing about the killer little assembler Linus wrote for the P1, which he then used to produce a Breakpoint "wild" category winning demo on the P1. The whole package is brilliant, and self-documenting, for those who would go looking.
I will. Many others will too. Presenting your preferences as common to most programmers is also unproductive, because your needs, preferences, and outlook on this aren't very well aligned with a lot of the people who would find SPIN + PASM attractive.
Some people may well author things in SPIN + PASM for the sheer fun of it. That needs to exist. Anybody claiming that work needs to get done is completely free to go and do the work in that standard work environment, which BTW, isn't always the same kind of fun, which is sort of my point. I don't mean that as a bash, just highlighting some real differences.
It's that kind of thinking that gets us a Propeller instead of some other industry standard chip derivative. Does that make sense?
In the nice little assembler Chip wrote, I can represent binary data with a file, or in lots of common sense ways. Better still, every one of those makes great sense on sight! It was the leanest environment I had ever seen. One needs to go and read a manual to understand all the gas options. Great options, but who wants to go reading a manual when we get things like:
byte %0001_1000, %%1232, 56, $45, "alpha"
or
file:myfile.bin
Etc... ?
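For what it's worth, those literal forms decode with almost no machinery. A rough sketch in Python (my own helper, not any actual tool's parser) of what the prefixes in the `byte` line above mean:

```python
def parse_spin_literal(tok):
    """Decode a Spin/PASM-style numeric literal (illustrative sketch only)."""
    tok = tok.replace("_", "")          # underscores are just digit separators
    if tok.startswith("%%"):
        return int(tok[2:], 4)          # %% = quaternary (base 4)
    if tok.startswith("%"):
        return int(tok[1:], 2)          # %  = binary
    if tok.startswith("$"):
        return int(tok[1:], 16)         # $  = hexadecimal
    return int(tok)                     # plain decimal

# The numeric literals from the 'byte' line above:
values = [parse_spin_literal(t) for t in ("%0001_1000", "%%1232", "56", "$45")]
```

Each prefix reads on sight, which is the point being made.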
I could go and list off a ton of things, but the truth is, the PASM syntax as Chip created it is simple, beautiful, flexible, productive. But, it's just not standard. Oh well. He's not the first guy to write a tool that works well for him, and he won't be the last. And assembly language programmers tend to think that way.
I don't want to write up a gas vs Pnut.exe comparison, because I don't want to write bad things about gas, because I don't think gas is bad. Not about that at all.
If they are paid to worry about it, they've got industry standard tools to make the money in. On the other hand, maybe they just want to use the chip how they envision using it too. That really is an entirely different thing and equally valid.
The idea that there is always one right way to do these things is kind of bunk really, and on that basis, let's just leave SPIN and PASM to the guy who authored it, and when that's all done, the industry standard tools can follow and we all go and do our thing.
I'll look at your code if I see some magic in there. If you won't go and look at those who author it in SPIN + PASM, ask them nicely to translate.
Finally, this shouldn't be a bad conversation, nor a much longer one. If so, let's take it to another thread.
Yeah JMG, programmers think about that, until they don't. This is all about intent. In a structured setting, what you say makes sense. But what about the unstructured setting?
That's what SPIN + PASM is all about.
I do not quite get this 'us and them' locked mindset.
When I'm writing an Assembler, I do not think "let's make hex $, and hey, everyone else can follow"
Instead, I think differently, I guess.
I ask: what hex formats might users already be working with?
They may get hex tables from Calculators, or LCD generators, or many other tools...
See, it is not just what they can read, but what they are given, too.
So I'll add 0xAB and 0FEH as also-supported. Suddenly, it is more portable, and users can get their job done sooner.
See how little effort this really takes? You just need to be aware that it matters.
Guess what? Nowhere did I remove anyone's beloved syntax!
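To make the point concrete, accepting the extra hex spellings really is a few lines of effort. A hypothetical sketch in Python (the function name is mine, not from any real assembler) that takes all three forms users are likely to be handed:

```python
def parse_hex(tok):
    """Accept the hex spellings users actually encounter: $AB, 0xAB, 0FEH."""
    t = tok.strip().upper()
    if t.startswith("$"):
        return int(t[1:], 16)            # Parallax/Motorola style
    if t.startswith("0X"):
        return int(t[2:], 16)            # C style
    if t.endswith("H"):
        return int(t[:-1], 16)           # Intel style, e.g. 0FEH
    return int(t, 16)                    # bare hex digits
```

None of the existing syntax goes away; the parser simply tries each known spelling in turn.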
Great. Seems to me there is a nice effort to deliver precisely what you are writing about. What is the worry?
SPIN+PASM is a different thing.
I have to know more things when I use gas. That's not desirable, and I'll be honest about it. I have to know more things when I'm using C too, and Forth... sheesh! Gotta know a heck of a lot more things, but I'm not out there trying to get the Forth guys to make syntax that works in the "everything but the kitchen sink" environment.
Why would I? That takes away from the lean thing they've got going. Lean is awesome. But it's not standard. Sorry. Frankly, nobody has anything on Forth when it comes to lean. It's not how I feel I can work best, but I see it making the chip sing for those who do. And that is awesome! Those guys can work from machine language and be running higher level code in a day. Awesome, but totally non-standard. The more the merrier.
Let me know when all the nice little shortcut data format representations are in there. Oh, and when can I write a program that contains PASM + SPIN + its data and just hit "file, save"? Being able to do that is lean, mean, fun, easy, and quite potent for a lot of things. Having that exist takes nothing away, only adds. Or maybe I want to package up the tool and the code for use later on. Unpack it, write, build, go; no install, no environment, nothing. LEAN.
I just want that lean environment to exist, and it makes a lot of sense. Knowing a few odd operators is nothing compared to the extras that come with one size fits all kinds of environments. Again, not bad, just a difference in outlook and motivation, etc... One size fits all isn't bad, but it's not LEAN. Big difference.
What Chip does is LEAN. No getting around this. It is LEAN I want to continue to exist.
If we want to head toward on chip development, it's going to be the LEAN environments that get there first and they will be potent. Ask the Forthers about that. Right now, SPIN + PASM with no extras will require little more than a nice editor and an executable and some storage to work on chip. That's something I want to see happen.
I also want to see the gcc and friends tool chain happen too, and I want both to exploit the chip well. But they will happen for very different reasons.
If you are paying me, I'm happy to do it the standard way.
And it's funny how I will suddenly have a great reason to know all those extra things. BTW: I can do those extra things, but I really don't want to HAVE to do them.
See how that works?
I've said enough. Seems to me, we've got chip features to discuss.
I'm thinking that for hub execution, there could be a separate set of 16 long registers which never get mapped (as you pointed out, they wouldn't need to be). They would serve as an instruction cache, each set of 8 being reloadable by a background read instruction. The WIDEs would be relegated for read cache and data use, only - no execution possible. This would reduce caveats. I still need to think about this.
Yes, this sounds like a nice solution.
What we are going to lack is all the built-in pointer-based memory access that a normal large-memory CPU has for things like stacks and array indexing. To think that many people are going to use this mode because it's the type of thinking they're used to, means that they may find it lacking. Just a thought.
P2 is still a micro, after all. It doesn't need everything. Better to have the mode less some features than not have the mode at all, particularly because of the speed gains possible.
To answer your last question, it could be done with auto-move operations, which I'm working on. It would probably require that the hub execution stall itself (like in multi-tasking with jump-backs), so that it doesn't need the hub slot, leaving it available for the auto-move operation.
Yes, this would be nice. It permits the AUX loading in the background, and only stalls if there is hub contention.
On the matter of standards: I've used so many different development tools, programming languages, assemblers, etc., that I don't even care for standards anymore. It's hard to find standards when everyone is always re-inventing the wheel. I know that when I get back into the office and move on to the next project, it's going to be something different again.
I've coded on the P1 using the Parallax tool's PASM. It made sense to me. I found it very easy to use and learn.
I'm used to just accepting whatever the development environment gives me.
One idea I was thinking about: if it turns out to be a bit difficult to add all the additional instructions to manage the jumps, calls, and returns in hub exec mode in the time available, could things just be done manually to simplify, with the compiler issuing existing P2 instructions like those below?
One benefit then is that the compiler has full control over where the stack lives, at the expense of a couple of cycles for branches and a bit more function call prologue overhead. Given that most function calls will probably need to exit the currently loaded group of 8 cached instructions, there is often going to be time, while waiting for the next hub window to arrive, where some work can be fit in (of course it depends on where you are in the group of 8 instructions when you do the call/jump). Now I'm not saying this method is ideal compared to having dedicated instructions, but it might be a much simpler implementation, and it lets the code decide how to do things. You'd obviously still need some instruction to enter/exit hub exec mode. This code example assumes PTRA is the hub PC, and either SPA (for AUX RAM) or PTRB (for hub RAM) is the stack pointer.
To absolute jump in hub mode:
BIG #(JUMP_ADDR>>9) ; or whatever latest BIG equivalent is now
SETPTRA #(JUMP_ADDR & $1FF)
For relative jumps in hub mode:
BIG #(JUMP_OFFSET>>9)
ADDPTRA #(JUMP_OFFSET & $1FF)
For calls in hub mode:
GETPTRA tempreg
PUSHA tempreg (or) WRLONG tempreg, --PTRB (if ptrb is used as the stack pointer)
BIG #(CALL_ADDR>>9)
SETPTRA #(CALL_ADDR & $1FF)
To return:
POPA tempreg (or) RDLONG tempreg, ++PTRB if stack is in hub
ADD tempreg, #$C ; offset to compensate and skip three (or 4) instructions after getptra was first read
SETPTRA tempreg
Note the compensation could be done more understandably in the calling code too but I was just thinking it may be better in the return for code density, depending on number of calls vs returns for each function.
Roger.
ps. I think the changes to PTRA during hub mode code would need to be able to flush the execution pipeline as well, but only in hub mode. It would also need to trigger the next read of 8 longs from the hub as required, if the upper address bits differ from what was already loaded there previously.
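As a sanity check on the BIG/SETPTRA pairs above: splitting an address into its upper bits and low 9 bits, then recombining them, is lossless. A quick Python sketch (the helper names are illustrative, not proposed instructions):

```python
def split_hub_addr(addr):
    """Split a hub address into (BIG payload, 9-bit SETPTRA immediate)."""
    return addr >> 9, addr & 0x1FF

def join_hub_addr(big, lo9):
    """What BIG followed by SETPTRA effectively reconstructs."""
    return (big << 9) | lo9
```

The same split applies to the relative-jump case, with ADDPTRA taking the low 9 bits of the offset.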
While what you posted should work, it would use a lot more hub memory to encode the same functionality.
Fortunately opcode space has been found for the instructions, so we don't have to use extra longs; which is great because HJMP/HCALL/HCALLA/HCALLB fit nicely into a single 32 bit opcode. Given how many jumps/calls there are in code (I think the average is that one instruction in six is a jump or a call) using an extra long would increase code size by 16.6%, which would have the same effect as reducing the hub by approx. 43KB ... way too much space to waste.
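Bill's arithmetic here checks out; a quick sketch of the numbers (assuming, as the figure implies, a 256KB hub):

```python
# If ~1 instruction in 6 is a jump or call, and each one needs one extra
# long, code grows by 1/6 ~= 16.7%.  Measured against the hub, that growth
# is equivalent to losing roughly 1/6 of its capacity.
branch_fraction = 1 / 6
growth_pct = branch_fraction * 100             # ~16.7% larger code
hub_kb = 256
equivalent_loss_kb = hub_kb * branch_fraction  # ~42.7 KB, i.e. "approx. 43KB"
```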
Regarding PS...
While I'd prefer the cog program counter being extended to 18 bits so we get to keep PTRA for code use, I am happy to get hubexec even with PTRA being taken over as the hub PC.
Right now, I am just waiting to see what Chip does - Ken said he may have it running by Monday. Once we try that, we can think about tuning it.
NOTE: I DO NOT ADVOCATE SLOT STEALING!!!! That would remove determinism, and is in the "don't even think it" category.
Use case:
For argument's sake, let's say COG#1 is running, and COG#1 only gets its own slot.
hubexec running gcc code will have - for argument's sake - two hub references in every 8 longs.
To execute that window of eight longs will take 24 clock cycles (3x8) best case.
Now let's assume a peaceful world of happy cooperation, where unused slots go free to any cog that needs a slot. Note I am not even talking about yielding a slot, merely unused slots.
Let's say out of the other 7 cogs, two are not using their slot.
Happiness!
Now the propgcc code might execute in 8 clock cycles. Frankly, on average, more like 12 clock cycles.
But 2x-3x speedup is nothing to sneeze at.
Note - no loss of determinism to the other cogs.
The hubexec cog can just make use of unused slots to run faster. If two cogs were running hubexec, they'd get to share the surplus.
I don't think there is any real need for anything fancier than this.
This way, EVERY cog gets its guaranteed slot.
hubexec cogs can vacuum up any table scraps.
This does not undermine determinism, just speed up hubexec code (Spin, GCC, etc).
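The scheme described above is easy to model. A toy simulation in Python (assumed behavior, purely illustrative, not actual silicon): the hub visits one cog's slot per clock, owners always keep their own slot, and unused slots go round-robin to HUNGRY cogs:

```python
def run_hub(cycles, owners_busy, hungry):
    """owners_busy[c] True => cog c uses its own slot; hungry => cogs wanting spares."""
    grants = {c: 0 for c in range(8)}
    spare_rr = 0                                  # round-robin index over hungry cogs
    for t in range(cycles):
        owner = t % 8                             # the hub rotates through 8 slots
        if owners_busy[owner]:
            grants[owner] += 1                    # owner always keeps its own slot
        elif hungry:
            grants[hungry[spare_rr % len(hungry)]] += 1   # scrap goes to a HUNGRY cog
            spare_rr += 1
    return grants

# Example: cogs 3 and 5 leave their slots unused, cog 0 is HUNGRY.
busy = [c not in (3, 5) for c in range(8)]
grants = run_hub(80, busy, hungry=[0])            # 80 clocks = 10 hub rotations
```

In this example cog 0 gets its own 10 slots plus the 20 unused ones, a 3x speedup, while every busy cog still gets exactly its guaranteed 1-in-8 share.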
It may be useful to have a
SETSLOT "DETERMINISTIC | HUNGRY"
In deterministic mode, a cog can only get its own slots, and may not use other cogs' slots. The hub goes merrily around every 8 cycles. If it leaves a slot unused, a HUNGRY cog can use it.
In HUNGRY mode, a cog gets its own guaranteed slot, and shares spare slots with other hungry cogs. It is only guaranteed 1 slot in 8, but may get more.
I am sure I've seen this (or extremely similar) proposal posted before, but there have been too many messages, so no idea where.
Then, we can leave more complex schemes for P3 :-)
Personally, I like being able to set priorities, yielding etc., and don't see the Obex issue as being so serious - however the above may be a good compromise for the first P2 series microcontroller.
Perhaps "SETSLOT DETERMINISTIC | HUNGRY | LAZY" would be better. A LAZY cog voluntarily gives up its own slice, and will only use other cogs' scraps.
Talking to myself: I am not sure that LAZY is needed, as if the cog does not use its slice, it will be available to HUNGRY cogs anyway.
I think what you call LAZY is needed, at least in the paired scenario.
It lets the HUNGRY COG of the pair have deterministic 2X slots while the LAZY one gets anything the HUNGRY one doesn't want.
If you paired a HUNGRY and a DETERMINISTIC cog, the HUNGRY one could not depend on cycles being there in timing-critical applications like video.
While what you posted should work, it would use a lot more hub memory to encode the same functionality.
Fortunately opcode space has been found for the instructions, so we don't have to use extra longs; which is great because HJMP/HCALL/HCALLA/HCALLB fit nicely into a single 32 bit opcode. Given how many jumps/calls there are in code (I think the average is that one instruction in six is a jump or a call) using an extra long would increase code size by 16.6%, which would have the same effect as reducing the hub by approx. 43KB ... way too much space to waste.
Regarding PS...
While I'd prefer the cog program counter being extended to 18 bits so we get to keep PTRA for code use, I am happy to get hubexec even with PTRA being taken over as the hub PC.
Right now, I am just waiting to see what Chip does - Ken said he may have it running by Monday. Once we try that, we can think about tuning it.
Yeah, I know it does get pretty wasteful on hub memory for the calls and jumps; that is its Achilles' heel. With luck we will get the best dedicated instructions for hub mode and so avoid it. Its only advantage would be if there was no time to build in the extra instructions, but only to add the prefetch execution pipeline; in that case you could still have something, perhaps. It's a poor man's implementation, but it might work as a fallback or in a pinch.
I do think we ultimately need something for hubexec mode that can work with having the stack held either in hub RAM or in AUX. I can imagine some more typical application code where the stack/data should all be in hub RAM, just with register variables held in COG memory. There will be associated hub window performance hits for any stack/data access. The RDxxx using PTRB as stack pointer (assuming PTRA is hub PC) then also allows indexed accesses which is rather nice for reading locals on stack frames.
For higher performance I can also see other applications where the stack could be entirely kept in AUX, and all the data could be in the COG registers. Think a 256 entry stack, and maybe around ~460 longs of data or so. Basically that second type would be a Harvard type model with code kept separated from data/stack memory which is small and of similar size to some typical AVR/PICs etc, but obviously runs at a much higher performance level than those devices and still allows a very large program space. There's plenty of real time/driver stuff that doesn't need a large data space that would benefit there with that memory model, and it will be a lot easier to code in C instead of needing to resort to dedicated PASM.
Similarly there's plenty of regular application type code that can benefit from the other model too with much larger data/stack space in hub. We basically will want both. Plus we still get our full blown PASM for highest performance and complete customization. An awesome combination. Can't wait to hear back on how it goes with Chip looking at this more now.
I think what you call LAZY is needed, at least in the paired scenario.
It lets the HUNGRY COG of the pair have deterministic 2X slots while the LAZY one gets anything the HUNGRY one doesn't want.
If you paired a HUNGRY and a DETERMINISTIC the HUNGRY one could not depend on cycles being there in timing critical applications like video.
Yeah, I know it does get pretty wasteful on hub memory for the calls and jumps; that is its Achilles' heel. With luck we will get the best dedicated instructions for hub mode and so avoid it. Its only advantage would be if there was no time to build in the extra instructions, but only to add the prefetch execution pipeline; in that case you could still have something, perhaps. It's a poor man's implementation, but it might work as a fallback or in a pinch.
Fortunately between Ray and Chip's combing, we actually have 5 or 6 dual op opcodes available.
HJMP/HCALL/HCALLA/HCALLB only need one such opcode!!!!
HRET/HRETA/HRETB only need single op opcodes, there are tons of those that are unused - so everything fits nicely now.
I do think we ultimately need something for hubexec mode that can work with having the stack held either in hub RAM or in AUX. I can imagine some more typical application code where the stack/data should all be in hub RAM, just with register variables held in COG memory. There will be associated hub window performance hits for any stack/data access. The RDxxx using PTRB as stack pointer (assuming PTRA is hub PC) then also allows indexed accesses which is rather nice for reading locals on stack frames.
See post #1 & #2 in this thread; already accounted for.
HCALLA/HRETA use the SPA stack pointer into AUX
HCALLB/HRETB use the SPB stack pointer into AUX
HRET uses a link register, which can be pushed to a hub stack with a simple
For higher performance I can also see other applications where the stack could be entirely kept in AUX, and all the data could be in the COG registers. Think a 256 entry stack, and maybe around ~460 longs of data or so. Basically that second type would be a Harvard type model with code kept separated from data/stack memory which is small and of similar size to some typical AVR/PICs etc, but obviously runs at a much higher performance level than those devices and still allows a very large program space. There's plenty of real time/driver stuff that doesn't need a large data space that would benefit there with that memory model, and it will be a lot easier to code in C instead of needing to resort to dedicated PASM.
Totally agreed. My initial proposal was limited to three instructions (see post#1) HJMP / HCALL / HRET, and implicitly used SPA into the AUX stack.
The GCC guys need LR to make the port easier, and for symmetry with cog code I changed things to add the capability to use LR, SPA into AUX, or SPB into AUX - mirroring how cog pasm works.
Similarly there's plenty of regular application type code that can benefit from the other model too with much larger data/stack space in hub. We basically will want both. Plus we still get our full blown PASM for highest performance and complete customization. An awesome combination. Can't wait to hear back on how it goes with Chip looking at this more now.
Exactly!
I read Chip is adding a separate cache for the hubex mode, so all the RDxxxxC instructions will go to a different octal cache - avoiding much thrashing.
Plus using spare hub slots speeds things up even more.
256 stack entries will work for a lot of microcontroller style code. It definitely won't work for things like printf(), which use a ton of stack and call deeply nested helper functions.
512 would allow for moderate stack usage, and much bigger microcontroller style code.
If a few simple rules are used, we can get a lot of mileage out of AUX ... and it is MUCH faster than a hub based stack.
For large code (ie pnut compiler, other compilers, user interface apps, using printf and friends) supporting a hub based stack is needed.
Many of the Simple Library functions only go 5 or 6 stack frames deep. This print function (the printf "equivalent") with 32-bit doubles floating point goes about 8 stack frames deep. I'll do some measurements tomorrow evening to get some concrete numbers.
Summary of revised HUBEXEC instructions, including suggested instruction bit encoding:
...
TTTTTTT ZC I CCCC jjAAAAAAAAA AAAAAAA
where jj select between HJMP / HCALL / HCALLA / HCALLB
TTTTTTT ZC I CCCC 00AAAAAAAAA AAAAAAA HJMP
TTTTTTT ZC I CCCC 01AAAAAAAAA AAAAAAA HCALL
TTTTTTT ZC I CCCC 10AAAAAAAAA AAAAAAA HCALLA
TTTTTTT ZC I CCCC 11AAAAAAAAA AAAAAAA HCALLB
TTTTTTT is the seven bit op code to be assigned by Chip
ZC,I,CCCC as normal P1/P2 usage
jj selects between the four hub-address instructions
HJMP D/#addr
TTTTTTT ZC I CCCC 00AAAAAAAAA AAAAAAA HJMP
HCALL D/#addr
TTTTTTT ZC I CCCC 01AAAAAAAAA AAAAAAA HCALL
HCALLA D/#addr
TTTTTTT ZC I CCCC 10AAAAAAAAA AAAAAAA HCALLA
HCALLB D/#addr
TTTTTTT ZC I CCCC 11AAAAAAAAA AAAAAAA HCALLB
Bill,
I was looking at your latest encodings reproduced above.
Is there a problem with this encoding if the top two bits of where the D field normally sits in the 32-bit instruction encoding are also being used to indicate which of the four operations is being done? I can see how it works when the I bit is set and it is just using the 16-bit #constant format, but not for the other format, when a D COG register is being used to form the jump/call address, unless D is extracted from a different position in the instruction than where it normally comes from. Should your jj bits be made the LSB bits instead? If we are reading the address as a long, then the bottom two bits can be ignored anyway, right?
No, not a problem, because of the two implied zero bits (as addresses are long-aligned).
The reason I put the instruction code at the top is that a later P2.1+ might have more hub space, and use separate op codes... which would allow 1MB of hub.
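To illustrate the implied zero bits: with the two jj bits sitting above a 16-bit long-address field, long alignment supplies two more zero bits, so the field reaches 18 significant byte-address bits, i.e. 256KB of hub. A Python sketch of the packing (my own helper names, not part of the proposal):

```python
HJMP, HCALL, HCALLA, HCALLB = range(4)   # the four jj selector values

def encode_jj_addr(jj, byte_addr):
    """Pack jj plus a long-aligned hub byte address into the 18-bit field."""
    assert byte_addr % 4 == 0            # hub code addresses are long-aligned
    long_addr = byte_addr >> 2           # the two implied zero bits drop out
    assert long_addr < (1 << 16)         # 16 A bits -> 64K longs = 256KB of hub
    return (jj << 16) | long_addr

def decode_jj_addr(field):
    """Recover (jj, byte_addr) from the packed field."""
    return field >> 16, (field & 0xFFFF) << 2
```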
I was looking at your latest encodings reproduced above.
Is there a problem with this encoding if the top two bits of where the D field normally sits in the 32 bit instruction encoding are also being used for indicating which of the four operations are being done? I can see how it works when I bit is set and it is just using the 16 bit #constant format, but not for the other format when a D COG register is being used to form the jump/call address, unless D is extracted from a different position in the instruction to where it normally comes from. Should your jj bits be made as the LSB bits instead?
Ok, but which 9 bits are significant for representing the D register in the "D" form of the instructions? ie specifically for these variants, without the #addr.
Ok, but which 9 bits are significant for representing the D register in the "D" form of the instructions? ie specifically for these variants, without the #addr.
Many of the Simple Library functions only go 5 or 6 stack frames deep. This print function (the printf "equivalent") with 32bit-doubles floating point goes about 8 stack frames deep. I'll do some measurements tomorrow evening to get some concrete numbers.
For you compiler gurus, what is a respectable stack size?
I gather some of you believe 256 (longs) is too small. Would 512 be enough?
It's not so much that 256 longs is too small for a stack. It's more that the AUX registers are not in the same address space as hub or COG memory. C really wants a unified address space. You can come up with a subset of C that will work with the AUX stack but it won't be full C and I'm not sure it will be easy to find a way to get GCC to enforce the subset. You'd probably have to write an entirely new compiler. I don't even think Spin will do well with the AUX stack because some things we've gotten used to doing, like taking the address of a local variable, will work differently because the stack isn't in hub memory. Chip has already said that you need to use a different syntax to dereference a pointer to AUX memory. That means that any function that takes a pointer will need to know whether it is a hub pointer or AUX pointer and handle them differently.
Today, I got the pipeline and peripherals to handle stalls of any duration. Stalls are going to be necessary to hold off the pipeline while instruction caches reload for out-of-range branches.
Next, I must implement the instruction caches. This is going to be a bunch of big steps that must all be done in parallel before anything is testable. All PCs are going to become 16 bits. I'm planning on having four 8-long caches, since I could see that two would not be enough and three would be odd. With the right cache-picking algorithm, maybe all 4 tasks could execute from the hub without too much thrashing. Any ideas on how the cache-picking algorithm should work? I'm thinking that maybe the cache that was read longest ago should be the one to get reloaded, no matter the number of tasks executing from the hub. Is there a better way?
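The "read longest ago" policy Chip describes is classic least-recently-used (LRU) replacement over the four 8-long lines, and it is simple to model. A Python sketch (assumed behavior, purely illustrative):

```python
class ICache:
    """Four 8-long instruction cache lines with least-recently-read replacement."""
    def __init__(self, lines=4):
        self.tags = [None] * lines       # which 8-long hub block each line holds
        self.last_read = [0] * lines     # pseudo-time each line was last read
        self.clock = 0

    def fetch(self, pc):
        """Return True on a cache hit, False when a line had to be reloaded."""
        self.clock += 1
        tag = pc >> 3                    # 8 longs per line => drop low 3 PC bits
        if tag in self.tags:
            i = self.tags.index(tag)     # hit: instruction is already cached
            hit = True
        else:
            i = self.last_read.index(min(self.last_read))  # read longest ago
            self.tags[i] = tag           # miss: reload this line from the hub
            hit = False
        self.last_read[i] = self.clock
        return hit

# A short PC trace: blocks 0..3 fill the lines, block 0 stays hot, block 4 misses.
cache = ICache()
pattern = [cache.fetch(pc) for pc in (0, 1, 8, 16, 24, 0, 32)]
```

One nice property for the multi-tasking case: because replacement only looks at read recency, a task that keeps touching its line protects it, regardless of how many tasks are executing from the hub.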
Comments
Maybe some example would be enlightening, but not tonight please, because I'm getting QWERTY forehead.
I'm thinking that for hub execution, there could be a separate set of 16 long registers which never get mapped (as you pointed out, they wouldn't need to be). They would serve as an instruction cache, each set of 8 being reloadable by a background read instruction. The WIDEs would be relegated for read cache and data use, only - no execution possible. This would reduce caveats. I still need to think about this. What we are going to lack is all the built-in pointer-based memory access that a normal large-memory CPU has for things like stacks and array indexing. To think that many people are going to use this mode because it's the type of thinking they're used to, means that they may find it lacking. Just a thought.
To answer your last question, it could be done with auto-move operations, which I'm working on. It would probably require that the hub execution stall itself (like in multi-tasking with jump-backs), so that it doesn't need the hub slot, leaving it available for the auto-move operation.
Let me think on it Steve.
Interesting idea. Will think about it.
That's why I was hoping to use the cog's PC, to free up PTRA, as PTRA & PTRB are pretty much what people are used to - just with smaller indexing range. However, as you pointed out, using PTRA as the PC does simplify entering/exiting hubexec mode.
We are all waiting to see what you cook up
Wow, really ?
Portable means many things - With already two assemblers, the simplest meaning is the ease of moving code between either.
That simply makes libraries more useful.
The longer term meaning is portable across programmers.
Parallax strive for long design lifetimes, that means more than one person is likely to work on the code.
Someone looking at spaghetti soup code, is not going to think 'That's clever, I love niche things !!'
Most programmers think about who will come after them, or who will read their libraries.
It only takes a tiny effort at the tools end, to make that easier.
That's what SPIN + PASM is all about.
There will be an industry standard tool chain for P2, no worries right?
So then, who cares what SPIN + PASM looks like? Seems to me, you should be interested in maximizing the industry standard stuff, not so worried about SPIN + PASM.
Now I understand the worry is that some code you might want to use gets written in "that other toolchain Chip did." Yep, that will absolutely happen.
Comparing the fine syntax found in Chip's assembler to spaghetti soup code is unproductive. One could say the same thing about the killer little assembler Linus wrote for P1, which he then used to produce a Breakpoint "wild" category winning demo on the P1. The whole package is brilliant, and self documenting, for those who would go looking.
I will. Many others will too. Comparing your preferences as common to most programmers is also unproductive. It's unproductive because your needs, preferences, outlook on this isn't very well aligned with a lot of the people who would find SPIN + PASM attractive.
Some people may well author things in SPIN + PASM for the sheer fun of it. That needs to exist. Anybody claiming that work needs to get done is completely free to go and do the work in that standard work environment, which BTW, isn't always the same kind of fun, which is sort of my point. I don't mean that as a bash, just highlighting some real differences.
It's that kind of thinking that gets us a Propeller instead of some other industry standard chip derivative. Does that make sense?
In the nice little assembler Chip wrote, I can represent binary data with a file, or in lots of common sense ways. Better still, every one of those makes great sense on sight! It was the leanest environment I had ever seen. One needs to go and read a manual to understand all the gas options. Great options, but who wants to go reading a manual when we get things like:
byte %0001_1000, %%1232, 56, $45, "alpha"
or
file:myfile.bin
Etc... ?
I could go and list off a ton of things, but the truth is, the PASM syntax as Chip created it is simple, beautiful, flexible, productive. But, it's just not standard. Oh well. He's not the first guy to write a tool that works well for him, and he won't be the last. And assembly language programmers tend to think that way.
I don't want to write up a gas vs Pnut.exe comparison, because I don't want to write bad things about gas, because I don't think gas is bad. Not about that at all.
If they are paid to worry about it, they've got industry standard tools to make the money in. On the other hand, maybe they just want to use the chip how they envision using it too. That really is an entirely different thing and equally valid.
The idea that there is always one right way to do these things is kind of bunk really, and on that basis, let's just leave SPIN and PASM to the guy who authored it, and when that's all done, the industry standard tools can follow and we all go and do our thing.
I'll look at your code, if I see some magic in there. If you won't go and look at those who author it in SPIN + PASM, ask them nicely to translate.
Finally, this shouldn't be a bad conversation, nor a much longer one. If so, let's take it to another thread.
I do not quite get this 'us and them' locked mindset.
When I'm writing an Assembler, I do not think "let's make hex $, and hey, everyone else can follow"
Instead, I think differently I guess.
I think about what hex formats users might already be working with.
They may get hex tables from Calculators, or LCD generators, or many other tools...
See, it is not just what they can read, but what they are given too..
So I'll add 0xAB and 0FEH as also-supported. Suddenly, it is more portable, and users can get their job done sooner.
See how little effort this really takes? You just need to be aware that it matters.
Guess what? Nowhere did I remove anyone's beloved syntax!!
Macros and Conditional Assembly ? Likewise...
SPIN+PASM is a different thing.
I have to know more things when I use gas. That's not desirable, and I'll be honest about it. I have to know more things when I'm using C too, and Forth... sheesh! Gotta know a heck of a lot more things, but I'm not out there trying to get the Forth guys to make syntax that works in the "everything but the kitchen sink" environment.
Why would I? That takes away from the lean thing they've got going. Lean is awesome. But it's not standard. Sorry. Frankly, nobody has anything on Forth when it comes to lean. It's not how I feel I can work best, but I see it making the chip sing for those who do. And that is awesome! Those guys can work from machine language and be running higher level code in a day. Awesome, but totally non-standard. The more the merrier.
Let me know when all the nice little shortcut data format representations are in there. Oh, and when can I write a program that contains PASM + SPIN + its data and just hit "file, save"? Being able to do that is lean, mean, fun, easy, and quite potent for a lot of things. Having that exist takes nothing away, only adds. Or maybe I want to package up the tool and the code for use later on. Unpack it, write, build, go, no install, no environment, nothing. LEAN.
I just want that lean environment to exist, and it makes a lot of sense. Knowing a few odd operators is nothing compared to the extras that come with one size fits all kinds of environments. Again, not bad, just a difference in outlook and motivation, etc... One size fits all isn't bad, but it's not LEAN. Big difference.
What Chip does is LEAN. No getting around this. It is LEAN I want to continue to exist.
If we want to head toward on chip development, it's going to be the LEAN environments that get there first and they will be potent. Ask the Forthers about that. Right now, SPIN + PASM with no extras will require little more than a nice editor and an executable and some storage to work on chip. That's something I want to see happen.
I also want to see the gcc and friends tool chain happen too, and I want both to exploit the chip well. But they will happen for very different reasons.
If you are paying me, I'm happy to do it the standard way.
And it's funny how I will suddenly have a great reason to know all those extra things. BTW: I can do those extra things, but I really don't want to HAVE to do them.
See how that works?
I've said enough. Seems to me, we've got chip features to discuss.
Yes, this would be nice. Permits the aux loading in the background, and only stalls if there is hub contention.
I've coded on the P1 using the parallax tool's PASM. It made sense to me. I found it very easy to use and learn.
I'm used to just accepting whatever the development environment gives me.
One benefit then is that the compiler has full control over where the stack lives, at the expense of a couple of cycles for branches and a bit more function call prologue overhead. Given most function calls will probably need to exit the currently loaded group of 8 cached instructions, there is often going to be time while waiting for the next hub window to arrive where some work can be fit in (of course it depends on where you are in the group of 8 instructions when you do the call/jump). Now I'm not saying this method is ideal compared to having dedicated instructions, but it might be a much simpler implementation, and it lets the code decide how to do things. You'd obviously still need some instruction to enter/exit hub exec mode. This code example assumes PTRA is the hub PC, and either SPA (for AUX RAM) or PTRB (for HUB RAM) is the stack pointer.
To absolute jump in hub mode:
For relative jumps in hub mode:
For calls in hub mode:
To return:
Note the compensation could be done more understandably in the calling code too but I was just thinking it may be better in the return for code density, depending on number of calls vs returns for each function.
Roger.
ps. I think the changes to PTRA during hub mode code would need to be able to flush the execution pipeline as well but only in hub mode. It would also need to trigger the next read of 8 longs from the hub as required if the upper address bits differ to what was already loaded there previously.
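The "only reload when the upper address bits differ" check in that ps. amounts to comparing the line tag of the new PC against the tag of the currently loaded 8-long block. A minimal sketch in Python (names hypothetical; in hardware this is a single tag comparator):

```python
LINE_LONGS = 8  # longs per cached block

def ptra_write(new_pc, loaded_tag):
    """Model a write to the hub PC (PTRA) while in hub exec mode.

    Returns (needs_flush, needs_reload, new_tag).  Any PTRA write must
    flush the execution pipeline; the 8-long reload from hub is only
    needed when the branch leaves the cached block, i.e. when the
    upper address bits (the line tag) change.
    """
    new_tag = new_pc // LINE_LONGS
    needs_reload = (new_tag != loaded_tag)
    return True, needs_reload, new_tag
```

So a short branch within the current 8-long group costs only the pipeline flush, while a longer branch additionally stalls for the next hub window to refill the cache.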
While what you posted should work, it would use a lot more hub memory to encode the same functionality.
Fortunately opcode space has been found for the instructions, so we don't have to use extra longs, which is great because HJMP/HCALL/HCALLA/HCALLB fit nicely into a single 32 bit opcode. Given how many jumps/calls there are in code (I think the average is that one instruction in six is a jump or a call), using an extra long would increase code size by 16.6%, which would have the same effect as reducing the hub by approx. 43KB ... way too much space to waste.
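The arithmetic behind that estimate, checked numerically (assuming the 256KB hub of the then-current P2 design, and the quoted one-in-six branch density):

```python
HUB_BYTES = 256 * 1024        # assumed hub size
BRANCH_FRACTION = 1 / 6       # ~1 instruction in 6 is a jump or call

growth = BRANCH_FRACTION      # one extra long per branch -> code grows by 1/6
wasted = HUB_BYTES * growth   # hub space consumed by the extra longs

print(f"code growth: {growth:.1%}")                   # ~16.7%
print(f"equivalent hub loss: {wasted / 1024:.1f} KB") # ~42.7 KB
```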
Regarding PS...
While I'd prefer the cog program counter being extended to 18 bits so we get to keep PTRA for code use, I am happy to get hubexec even with PTRA being taken over as the hub PC.
Right now, I am just waiting to see what Chip does - Ken said he may have it running by Monday. Once we try that, we can think about tuning it.
I found a great use for unused hub slots - feeding hubexec!
Guesstimate is 2x-3x speedup for Spin, GCC, and all other code that uses hub based variables or tables.
Perhaps "SETSLOT DETERMINISTIC | HUNGRY | LAZY" would be better. A LAZY cog voluntarily gives up its own slice, and will only use other cogs scraps.
Talking to myself : I am not sure that LAZY is needed, as if the cog does not use its slice, it will be available to HUNGRY cogs anyway.
I think what you call LAZY is needed, at least in the paired scenario.
It lets the HUNGRY COG of the pair have deterministic 2X slots while the LAZY one gets anything the HUNGRY one doesn't want.
If you paired a HUNGRY and a DETERMINISTIC the HUNGRY one could not depend on cycles being there in timing critical applications like video.
C.W.
Yeah, I know it does get pretty wasteful on hub memory for the calls and jumps; that is its Achilles heel. With luck we will get the best dedicated instructions for hub mode and so avoid it. Its only advantage would be if there were no time to build in the extra instructions but only to add the prefetch execution pipeline; in that case you could still have something. It's a poor man's implementation, but it might work as a fallback or in a pinch.
I do think we ultimately need something for hubexec mode that can work with having the stack held either in hub RAM or in AUX. I can imagine some more typical application code where the stack/data should all be in hub RAM, just with register variables held in COG memory. There will be associated hub window performance hits for any stack/data access. The RDxxx using PTRB as stack pointer (assuming PTRA is hub PC) then also allows indexed accesses which is rather nice for reading locals on stack frames.
For higher performance I can also see other applications where the stack could be entirely kept in AUX, and all the data could be in the COG registers. Think a 256 entry stack, and maybe around ~460 longs of data or so. Basically that second type would be a Harvard type model with code kept separated from data/stack memory which is small and of similar size to some typical AVR/PICs etc, but obviously runs at a much higher performance level than those devices and still allows a very large program space. There's plenty of real time/driver stuff that doesn't need a large data space that would benefit there with that memory model, and it will be a lot easier to code in C instead of needing to resort to dedicated PASM.
Similarly there's plenty of regular application type code that can benefit from the other model too with much larger data/stack space in hub. We basically will want both. Plus we still get our full blown PASM for highest performance and complete customization. An awesome combination. Can't wait to hear back on how it goes with Chip looking at this more now.
I gather some of you believe 256 (longs) is too small. Would 512 be enough?
Fortunately between Ray and Chip's combing, we actually have 5 or 6 dual op opcodes available.
HJMP/HCALL/HCALLA/HCALLB only need one such opcode!!!!
HRET/HRETA/HRETB only need single op opcodes, there are tons of those that are unused - so everything fits nicely now.
See post #1 & #2 in this thread; already accounted for.
HCALLA/HRETA use the SPA stack pointer into AUX
HCALLB/HRETB use the SPB stack pointer into AUX
HRET uses a link register, which can be pushed to a hub stack with a simple
WRLONG LR,--PTRB
To return using hub stack:
RDLONG LR,PTRB++
HRET
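For readers new to link-register calling: HCALL drops the return address into LR instead of onto a stack, so a leaf function returns with HRET alone and only non-leaf functions pay for the WRLONG/RDLONG pair above. A toy model in Python (class and method names invented; the list models the hub stack that --PTRB/PTRB++ would walk):

```python
class LRMachine:
    """Toy model of link-register calls with an optional hub stack."""
    def __init__(self):
        self.lr = None       # link register
        self.stack = []      # hub stack; append ~ WRLONG LR,--PTRB

    def hcall(self, return_addr):
        self.lr = return_addr        # HCALL: return address -> LR, no hub access

    def push_lr(self):
        self.stack.append(self.lr)   # WRLONG LR,--PTRB (non-leaf functions only)

    def pop_lr(self):
        self.lr = self.stack.pop()   # RDLONG LR,PTRB++

    def hret(self):
        return self.lr               # HRET: jump to the address in LR
```

A nested call only works if the outer function saves LR before the inner HCALL clobbers it, which is exactly what the WRLONG/RDLONG pair buys; leaf calls cost zero hub traffic.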
Totally agreed. My initial proposal was limited to three instructions (see post#1) HJMP / HCALL / HRET, and implicitly used SPA into the AUX stack.
The GCC guys need LR to make the port easier, and for symmetry with cog code I changed things to add the capability to use LR, SPA into AUX, or SPB into AUX - mirroring how cog pasm works.
Exactly!
I read Chip is adding a separate cache for hubexec mode, so all the RDxxxxC instructions will go to a different 8-long cache - avoiding much thrashing.
Plus using spare hub slots speeds things up even more.
This is going to be great...
512 would allow for moderate stack usage, and much bigger microcontroller style code.
If a few simple rules are used, we can get a lot of mileage out of AUX ... and it is MUCH faster than a hub based stack.
For large code (ie pnut compiler, other compilers, user interface apps, using printf and friends) supporting a hub based stack is needed.
Yep, and I would say that is probably an understatement.
Without making the claim to be a guru, I haven't needed more than about 400 longs total with PropGCC.
Many of the Simple Library functions only go 5 or 6 stack frames deep. This print function (the printf "equivalent") with 32bit-doubles floating point goes about 8 stack frames deep. I'll do some measurements tomorrow evening to get some concrete numbers.
Thanks, good data point.
Steve,
Thanks, that would be much appreciated, and great info to have.
Bill,
I was looking at your latest encodings reproduced above.
Is there a problem with this encoding if the top two bits of where the D field normally sits in the 32 bit instruction encoding are also being used to indicate which of the four operations is being done? I can see how it works when the I bit is set and it is just using the 16 bit #constant format, but not for the other format where a D COG register is used to form the jump/call address, unless D is extracted from a different position in the instruction than where it normally comes from. Should your jj bits be made the LSB bits instead? If we are reading the address as a long, then the bottom two bits can be ignored anyway, right?
The reason I put the instruction code at the top is that a later P2.1+ might have more hub space, and use separate op codes... which would allow 1MB of hub.
HJMP D
HCALL D
HCALLA D
HCALLB D
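The point about the bottom two bits can be checked directly: longs are 4-byte aligned, so a long address always has its two LSBs clear, and a 2-bit jj selector can ride there without losing any address range. A quick sketch in Python (the field layout is purely illustrative, not the actual P2 encoding):

```python
def pack(addr, jj):
    """Hide a 2-bit operation selector in the two LSBs of a long-aligned address."""
    assert addr % 4 == 0, "long addresses are 4-byte aligned"
    return addr | (jj & 0b11)

def unpack(word):
    """Recover (address, jj) exactly - nothing is lost."""
    return word & ~0b11, word & 0b11

# every long-aligned address / selector pair round-trips losslessly
for addr in (0, 4, 0xFFFC):
    for jj in range(4):
        assert unpack(pack(addr, jj)) == (addr, jj)
```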