Also, I'm assuming that the following P2 features will not be ported:
SERDES
INDx
tasks
register remapping
If this were the case, would it be possible to add an extremely simple cooperative multitasking instruction set? I'm thinking something along the following lines:
Single internal TASK register for holding a PC/Z/C.
GETTASK instruction to read TASK.
SETTASK instruction to write TASK.
SWTASK instruction that takes PC+1/Z/C and swaps it with whatever is in TASK.
With just SETTASK and SWTASK, it would be possible to write drivers with "concurrent" read/write threads. With GETTASK, more complex schedulers could be developed. No, it's not as efficient as interleaved tasking, but it should add very little complexity and circuitry for a significant increase in usability over the current P1 approach(es).
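For illustration, here is a sketch of how a two-thread driver might be structured with the proposed instructions (SETTASK and SWTASK are the hypothetical mnemonics from this proposal, not existing P1 opcodes, and the labels are made up):

            settask #rx_thread          ' park the receive thread's start address in TASK

tx_thread   ' ...do a slice of transmit work...
            swtask                      ' yield: jump to whatever TASK holds, leaving our PC+1/Z/C there
            jmp     #tx_thread

rx_thread   ' ...do a slice of receive work...
            swtask                      ' yield back to the transmit thread
            jmp     #rx_thread

Each SWTASK hands the COG to the other thread and records where to pick this one up again, so the two loops ping-pong without any scheduler code.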
Here is my grep analysis of the OBEX files I use for compiler testing; this consists of 1465 Spin files. My counts below are the number of Spin files that contain each instruction at least once. Also, I can't easily account for Spin keywords that match PASM ones, but it shouldn't matter for this purpose.
Lest anyone think that the Moderators are not active on this site, understand that there are not that many of us and that there are a great many messages to oversee.
That said - I'm locking this thread until it can be reviewed for moderation.
Ditto - please keep CMPSUB. Once I saw it in use (credit: kuroneko), it became a go-to in tight loops for timing purposes, compared to separate CMP and SUB. Even if a new chip will be much faster, I'll just want to do 5% more than it can do, no matter how much that is.
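For anyone who hasn't tried it, here is a sketch of the kind of tight timing loop where CMPSUB earns its keep (register names are only illustrative):

loop        add     phase, delta            ' advance the phase accumulator
            cmpsub  phase, period  wc       ' if phase >= period: phase -= period and set C
  if_c      xor     outa, pin_mask          ' toggle the pin on every rollover
            jmp     #loop

Doing the same with a separate CMP and a conditional SUB costs an extra instruction (four clocks) on every pass through the loop.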
I also analyzed my use of waitpeq and waitpne. Out of 193 cases, only 8 used an immediate operand for the source. This may suggest there's a bit, besides the C flag, that could be used to distinguish port A from port B.
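For reference, the two forms look like this (register names are illustrative; zero holds 0):

            waitpeq target, mask            ' register source: wait until (INA & mask) == target
            waitpne zero, #%0000_0001       ' immediate source: wait until P0 goes high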
The thing that makes it confusing is that you are suggesting several things.
I was building up from minimum, to better... better... trying to save myself time.
My intent was for each stage to be read, analyzed, internalized before moving to the next.
This way I was hoping to save everyone time, and to present a "roadmap" from minimum gates with some performance improvement up to maximum performance, with as few gates as I could see using.
It seems like the minimal implementation would just be a RDLONGC. Latching the hub bus is equivalent to implementing a single cache line. The cache line could be used for data, instructions or both, i.e., the latched hub bus would be a shared instruction/data cache.
A shared, single-line, four-long cache cannot improve LMM or hubexec performance, as it would be reloaded on every hub data reference and again by the first instruction fetch after a hub reference. Performance would be terrible, with almost zero benefit.
Shared I/D caches can work very well when there are a LOT of cache lines and an LRU replacement algorithm is used.
Two lines of I-cache with prefetch and one line of D-cache is the minimum for decent performance. (Diminishing returns set in after 8 lines of I and 4 of D.)
I think all you need is the long JMP and CALL instructions. What's the purpose of the LOAD instruction? Isn't that the same as RDLONGC?
No. LOAD is equivalent to LOCPTRA on the P2, but it targets a fixed cog location to avoid needing bits for D.
It's a cheap (in gates) poor man's replacement for the P2 LOC* instructions, without needing PTRA. Not as good, but a good boost for compiled code. To wit:
' LMM
        CALL    #MVI_R4                 ' kernel helper: load the long that follows into R4
        long    hub_addr_of_array       ' inline constant consumed by MVI_R4
        RDLONG  R3, R4                  ' get first element; increment R4 to walk the array

' HUBEXEC
        LOADK   #hubaddr                ' proposed: write hubaddr to a fixed cog register
        RDLONG  R3, $1EE                ' (or whatever fixed address LOADK targets)
HUGE performance win, reduces memory use too.
As per my discussion with David, I'd be delighted if Chip instead could add AUGS:
RDLONG R3,##hubaddr
and that would also cover reading 32 bit constants.
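In P2-style notation that would look roughly like this (the ## prefix tells the assembler to emit an AUGS instruction supplying the upper bits of the immediate; the address and constant here are just examples):

            rdlong  r3, ##hub_addr_of_array     ' AUGS + RDLONG: full hub address in one source operand
            mov     r4, ##$1234_5678            ' the same prefix also yields an arbitrary 32-bit constant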
I did not have it in my minimized proposal... as it was the minimum
My intent was for each stage to be read, analyzed, internalized before moving to the next.
Hi Bill,
so if we take my example from earlier this afternoon, what would the numbers now be with your proposed mechanism?
So if a 16-COG P1+ appears, I can have 8 COGs acting as very capable intelligent peripherals accessing HUB RAM at 1/128 of the hub slots to exchange data, and 8 COGs running out of HUB at 15/128 of the slots, or around 23.5 MIPS each! So that's a chip with a pile of intelligent peripherals and the equivalent of 8 typical 8-bit processors (except they're 32-bit), all in one package.
The thing that makes it confusing is that you are suggesting several things. It seems like the minimal implementation would just be a RDLONGC. Latching the hub bus is equivalent to implementing a single cache line.
The cache line could be used for data, instructions or both, i.e., the latched hub bus would be a shared instruction/data cache.
I think all you need is the long JMP and CALL instructions. What's the purpose of the LOAD instruction? Isn't that the same as RDLONGC?
I'm not sure what Bill meant, but I wasn't proposing to use RDLONGC (which Chip hasn't promised anyway). I was hoping for a 17-bit PC and logic that would automatically do the equivalent of RDLONGC when fetching an instruction whose 8 high bits are non-zero.
Relevant posts have been moved from the other thread. That thread has been locked.
-Phil
Ummm... Why can't this thread just remain? You could just remove a couple of posts rather than pulling everything out of context and moving it to a new thread.
Edit: Oops. I read the post wrong. Sorry!
Boy! Talk about misunderstanding people in writing...
I read this post and my mind saw, "Relevant posts have been moved TO the other thread. THIS thread has been locked." I was thinking, "What in the heck do David James and Phil know that I keep missing?"
I came back later, after I saw it wasn't locked in the Prop2 Forum, and happened to re-read it correctly.
A few of my local friends here in Red Bluff are diagnosed as paranoid schizophrenics, and I've seen them completely mis-recall conversations that I happened to witness, as if the data fed into their heads through some upside-down filter.
Yes, exactly.
Out of 1465 files only 130 have one or more MOV instructions.
Assume all PASM has at least one MOV. (Show me useful PASM code that does not)
Ergo, only 130 files have PASM in them, which means 1335 files have no assembler in them.
Seems a bit odd to me.
This may actually be good news if the new Spin is source compatible with the old Spin. It may mean that almost all OBEX code will work on the new processor even if the assembly language changes a bit.
Good point, David. I just hope the compiler bods get a hearing, and that whatever little changes are made to the instruction set include help for compiled code.
I think a 16x16 multiplier would be good, per cog. Any thoughts on whether that would be precise enough? 16x16 yields a convenient 32-bit result, at least.
From a compiler perspective, the compiler doesn't know if you are multiplying two numbers of about the same size or a big one and a small one. So a multiply would become a subroutine call, which moves it into hub memory.
So actually my vote would be for a 32x32 multiply (which returns 32 significant bits) and as a shared resource.
Edit: assuming both numbers are declared as int, which in P* would be 32 bits.
It wouldn't move the code into hub memory. The PropGCC LMM kernel has a COG function to do multiply and divide. I'm sure Catalina does as well.
Would that COG function be pulled into COG memory on demand, or be there most of the time?
I believe it is there all the time although some stuff has been moved out into kernel extensions. I'll have to check. I would bet multiply is always in COG memory though. Not so sure about divide.
Edit: Just checked. As I expected, multiply is permanently resident and divide is in a kernel extension.
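For context, a COG-resident multiply is the classic shift-and-add loop, something along these lines (a sketch only; register and label names are illustrative):

umul        mov     acc, #0
            mov     bits, #32
:loop       shr     r1, #1  wc          ' next multiplier bit into C
  if_c      add     acc, r0             ' accumulate the shifted multiplicand
            shl     r0, #1
            djnz    bits, #:loop
            mov     r0, acc             ' low 32 bits of r0 * r1
umul_ret    ret

At four clocks per instruction that is on the order of 500 clocks per multiply, which is why a hardware multiplier keeps coming up.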
Regarding the moved posts and locked thread: I had placed a note in the other thread saying, "Relevant posts have been moved to the other thread. This thread has been locked." But somehow that post got moved to this thread, along with the good stuff. It wasn't supposed to. (I think maybe two of us moderators were involved, but it may just be early-onset dementia.) Anyway, when I saw what had happened, I edited the post in this thread to read the way it does now. But some of you may have seen it before I edited it. So you're not going crazy after all.
Also, I'm assuming that the following P2 features will not be ported:
If this were the case, would it be possible to add an extremely simple cooperative multitasking instruction set? I'm thinking something along the following lines:
With just SETTASK and SWTASK, it would be possible to write drivers with "concurrent" read/write threads. With GETTASK, more complex schedulers could be developed. No, it's not as efficient as interleaved tasking, but it should add very little complexity and circuitry for a significant increase in usability over the current P1 approach(es).
I will not presume to advise Chip on which ones to keep and which to eliminate.
-Phil
It's alive!!!
-Phil
Super job, Roy and Heater. This gives some great insight.
Roy, how many objects did you check? I ask because the numbers seem kind of small for a large code base.
-Phil
I have narrowed down my 563 questions to one.
Are CORDIC functions going to be in hardware on this go-around? An answer here will allow me to see the future and not ask any more questions. :)
Rich
Good point. Roy says 1465 files.
I'm going to speculate that every piece of assembler has a MOV in it. But only 130 files were counted with a MOV.
Is it really so that we have 1335 files with no assembler in OBEX?
Well since I had this space available...
+1 on Chip's comment below that we will still have CORDIC.
C.W.
You bet! I would be frustrated without CORDIC, myself.