How about a built-in LMM operating mode, where the existing 512 Cog words are treated exclusively as registers and Hub memory is the normal address space? Register-to-register opcodes would take a single Hub cycle. I assume that's roughly how LMM works already, except that the whole of Cog memory would be free for register space.
It's another whole instruction set, though. Would it be enough of an advantage over the existing LMM implementation?
For the enhanced video generator, how would you know that it's finished?
Beanie2k is right. One of the advantages of the Propeller over its competition is that it's pretty straightforward. Prop II is starting to sound like the opposite with state machines, independent math units, much more complex interaction of pipelining and programming. True, this complexity provides more execution power, but it makes assembly programming more and more inaccessible. There have been excellent computers in the past that had very complex instruction sets that were only programmed by the user in high level languages. Assembly language was only used by the compiler / library developers. I don't think this is what is wanted here.
I don't have a problem with a more complex video generator. That's already something that requires a lot of expertise that few people will "mess with". There'll be a driver with a high level interface and most people will interface with that.
After re-reading I noticed Chip's comments about multitasking. I also agree with Sapieha - keep the hardware simple and allow the software to do the work, unless the hardware feature is easy to add and implement. The less complicated the Prop II is, the more flexible it will be and the sooner we'll have it!
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
"I have always wished that my computer would be as easy to use as my telephone. My wish has come true. I no longer know how to use my telephone."
Beanie2k said...
Do we really want to create a Frankenstein chip that includes everything short of the proverbial kitchen sink but requires a PhD to program it?
That's a fair point. We've seen other multi-core CPUs noted in these forums and the general consensus is they appear far more complicated to use than the Propeller. For such a powerful chip the Propeller is simple to use and that's down to its clean and lean design.
No register indirect and self-modifying code took some time to master, but what attracted me was "it's so simple it's got to be easy".
I do wonder if it's heading off in the wrong direction, but then I'm not sure what its target market would be. Maybe we're just seeing Chip in full 'flight of fancy' mode, fired-up and enthusing about what could be rather than what will be? -- No offence meant to Chip there, as I know exactly what it's like to have an audience which really appreciates what the possibilities could be.
Some of the stuff looks simple enough: 'REP [x,y]' and the enhancement to RDxxxx/WRxxxx make sense, but I recall some mention elsewhere along the lines of 'that would preclude RDBYTE'. It's starting to look not like an enhancement but something entirely different. I've not absorbed all of the posts (busy debugging, so brain very much elsewhere!) so don't take that as specific criticism. It may all look a lot more rational and consolidated in its final form.
The other thing which sort of worries me is that a lot of this sounds quite complicated, and that would seem to indicate an extended time to market. I'm sure Chip enjoys his work and would embrace it all with relish, but is there a danger in extending it just a little too far for what's actually needed?
I might be a little late, as this thread grew 3 pages overnight, but I think it should be kept as simple as possible, ie. no hardware threading and no memory caching(?). I think I would go with the 256K, 8 cog version.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
I am 1011, so be surprised!
Oh.. and source compatible with the current Prop..
I don't mind re-compiling.. binary compatibility is a millstone.. just look at the abortion that is Windows..
The feature creep being bandied about here kinda scares me..
evanh said...
How about a built-in LMM operating mode. Where the existing 512 Cog words are treated exclusively as registers. Would it be enough of an advantage over existing LMM implementation?
Right now there are quite a few registers, providing the kernel can be kept tight. I don't think having access to 496 rather than 396 is going to make a lot of difference. What would help is some fast context switching to support multi-tasking with ease.
This is another case where a stack machine has an inherent advantage over a register based VM as there's only a minimal number of registers used to hold task context which need to be swapped over.
The biggest problem with LMM and VMs at the moment isn't so much doing it but the convoluted faffing about to represent it within the PropTool. "DJNZ reg,@LABEL" invariably ends up as something like ...
long LMM_FLG | LMM_DJNZ << 16 | LMM_NR | reg
long @Label+$0010
If there were an "LMM" block like "DAT" and the PropTool assembled LMM code according to some specified template ( maybe macros would be enough ) I'm sure a lot more people would be using LMM than there are.
LMM isn't always as good as it could be either, needing to call the kernel rather than execute sequentially - load > $1FF constant, indirect/self-modifying code, CALL/RET, TJZ/TJNZ/DJNZ.
What would be really handy would be a means of determining if a PASM opcode to be executed could be executed inline or needs to be handled by the kernel. That way near pure PASM could be used as LMM and the LMM wouldn't have to check whether it's one of a dozen different opcodes ...
LmmLoop     rdlong  :Opc,lmmPc
            add     lmmPc,#4
            jop     #Kernel         ' Use an internal flag set during rdlong
:Opc        nop                     ' Execute as normal PASM
            jmp     #LmmLoop
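A rough way to picture what that internal flag would decide, written as a small Python model rather than PASM. This is only a sketch: the set of kernel-handled mnemonics is an assumption taken from the opcodes named above (CALL/RET, TJZ/TJNZ/DJNZ), and `KERNEL_OPS`, `needs_kernel` and `run_lmm` are made-up names, not anything in the PropTool or the chip.

```python
# Model of the "inline or kernel?" test that the suggested jop would make.
# KERNEL_OPS is an assumption based on the opcodes named in the post above.
KERNEL_OPS = {"call", "ret", "tjz", "tjnz", "djnz"}

def needs_kernel(mnemonic):
    """Return True if an LMM opcode cannot simply be executed inline."""
    return mnemonic.lower() in KERNEL_OPS

def run_lmm(program):
    """Walk a list of mnemonics and record how each would be dispatched."""
    trace = []
    for mnemonic in program:
        route = "kernel" if needs_kernel(mnemonic) else "inline"
        trace.append((mnemonic, route))
    return trace

trace = run_lmm(["mov", "add", "djnz", "rdlong", "call"])
for mnemonic, route in trace:
    print(mnemonic, route)
```

In the model the test costs a set lookup per opcode; the point of doing it as a flag during the rdlong is that the LMM loop itself would not pay for it.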
I'm with Beanie2k, don't let feature creep turn the Prop II into some sort of monster that is unusable by the majority of people here. And for those who want hyperthreading, DSPs, true parallelism, etc., there are already plenty of offerings ready for you.
As it is, the Prop is the only multicore chip that is easily accessible to most folks, and I would like to see it stay that way.
evanh said...
How about a built-in LMM operating mode. Where the existing 512 Cog words are treated exclusively as registers. Would it be enough of an advantage over existing LMM implementation?
LMM isn't always as good as it could be either, needing to call the kernel rather than execute sequentially - load > $1FF constant, indirect/self-modifying code, CALL/RET, TJZ/TJNZ/DJNZ.
What would be really handy would be a means of determining if a PASM opcode to be executed could be executed inline or needs to be handled by the kernel. That way near pure PASM could be used as LMM and the LMM wouldn't have to check whether it's one of a dozen different opcodes ...
I'd say yes then. Or at least some support for improving the LMM kernel. It would certainly give some guidance on which feature creep would really make a splash.
While I have used more than 8 COGs before, I'm convinced that 8 COGs is enough for the platform because virtual COGs (deterministic or not) can be created in a few ways, especially the traditional method :P :P :P, and maybe with merging jmpret behaviour too. Given enough memory, one can use LMM to have "N" threads for non-time-critical tasks without extra hardware, instead of asking for more and more COGs, although the "I word" makes things easier and faster for software. Having LMM as a truly supported feature of the Prop II, as Chip described earlier, will make LMM almost as fast as PASM.
Memory and fast access to the external world are much more precious resources to me at least. Having 16 COGs would actually degrade pin read/write speed. Now if only more memory could be had.
The rest of this is OT, but providing an answer to the quoted sub-thread.
Sapieha said...
LMM is a very fine programming technique, but it only serves non-time-critical programs.
then ....
evanh said...
I would like a factual answer to that question. From someone that has worked on the LMM.
I wrote a module similar to Simple_Serial.spin for the ICC compiler which uses LMM called ASIO. This was to get past the lack of a FullDuplexSerial.spin "FDX" like feature (which has now been implemented using an interface to a COG running the FDX PASM code from ICC's LMM).
ASIO can work at baud rates up to 57600. To do that, however, requires tuning between compiler optimizations/versions and makes ASIO something of a maintenance headache. Much of the tuning is mitigated by using "waitcnt" calls, but there are still dependencies on function call overhead and other small issues.
So the answer is: all else being equal, with no optimizations, tool, or code changes, LMM can serve time-critical programs. This is dependent on the platform remaining the same, and a new compiler version, for example, would be cause for "tuning". Tuning because of tool changes is a no-no for me, and I would like to see ASIO replaced completely with FDX.
If Sapieha and evanh care to respond to this, a new thread or PM would be more appropriate.
Wow, a guy can't even go to bed at night without falling behind in class!
Well, it looks like multithreading is yesterday's news now. And that's okay: it got a fair and enthusiastic hearing, and I can't disagree with the verdict. To make it work cleanly from the user's perspective, I think one would need multiple, separate, complete execution units, each with its own pipeline, along with an overlay area for returns, etc. But if you go that far, you might as well spawn off another cog.
The idea still intrigues me, due to the raw MIPs available, so maybe it could be done in software via emulation. The EXEC capability via JMPD may hold the key. Once I've had my coffee, maybe I can wrap my mind around it. SMM, anyone?
The Propeller's power is its simplicity.
Do not construct a monster.
Every new addition must have the same simplicity.
I agree 100% with Sapieha here. Frankly this thread is scaring me. In my lifetime I've seen way too many excellent products completely ruined by feature bloat. The PropI is a good chip. Let's stay with the basic concept and simply address the glaring weaknesses that affect the majority of users. Otherwise we're going to end up with a hardware version of Windows Vista.
Don't worry, this is how things work. An idea is thrown out there, and it's shaped and formed while looking at all the facets of it. And in the end it's dismantled because the reality of things can't support the idea. Multi-threading is an excellent example: it was thrown out there, some tussling with the idea went on, taking a close look at how different parts of the chip would be affected. As each layer of alteration is added to get the idea to fit into the Propeller, the overall course begins to meander all over the place. At a certain point the path to get from here to there is so circuitous that we realize there is no elegant way to achieve the goal. And if an alternate way to the goal isn't known or doesn't exist, then the idea is scrapped.
I talked to Andre LaMothe (HYDRA) tonight about what he thought could be done to improve the video circuitry.
He thought adding a layer on top of the current video circuitry which would automate the gathering and outputting of color and pixel data would be good. This would mean that rather than doing flurries of WAITVIDs, you could point at the beginning of a stream of color and pixel longs within cog RAM and have the video circuit fetch them automatically when needed by stalling cog execution periodically for one clock to get access to the cog RAM. So, you would set the begin and end pointers, set VSCL, do a WAITVID, and it would release you as soon as it took your command. After that, it would gather and output all color/pixel long pairs, not accepting another WAITVID until it was done with the series. This means that rather than doing lots of WAITVIDs, you'd be free to compose a whole scan line. During that time, you would lose a cycle here and there, but not otherwise be interrupted.
Also, we talked about the possibility of putting a color lookup RAM into the video circuit which would translate those 8-bit pixels into 16-bit pixels which could be output as follows:
for composite video: %PPPPPP_TTTTT_BBBBB; where P=phase, T=top level, B=bottom level. This would use a 5-bit R2R DAC.
for VGA (possibility): %RRRRR_GGGGG_BBBB_HV; where RGB have 5:5:4 bits, and Horizontal and Vertical are in the LSBs. Each color would use an R2R DAC.
This would make both composite and vga quite high quality.
The lookup table would be loaded 1-color-word-at-a-time by special instructions.
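For anyone who wants to play with the composite layout, here is a minimal Python sketch of the lookup. It assumes the field order exactly as written in %PPPPPP_TTTTT_BBBBB (6 + 5 + 5 = 16 bits) and a 256-entry table, since the pixels are 8 bits wide. The function names and the sample values are made up for illustration; nothing here is a real Propeller API.

```python
# Sketch of the proposed colour lookup: 8-bit pixel -> 16-bit composite word.
# Field widths follow the %PPPPPP_TTTTT_BBBBB layout (6 + 5 + 5 = 16 bits).

def pack_composite(phase, top, bottom):
    """Pack phase (6 bits), top level (5 bits) and bottom level (5 bits)."""
    if not (0 <= phase < 64 and 0 <= top < 32 and 0 <= bottom < 32):
        raise ValueError("field out of range")
    return (phase << 10) | (top << 5) | bottom

def unpack_composite(word):
    """Split a 16-bit word back into (phase, top, bottom)."""
    return (word >> 10) & 0x3F, (word >> 5) & 0x1F, word & 0x1F

# 256 entries, loaded one colour word at a time (the values are arbitrary).
clut = [0] * 256
clut[0x12] = pack_composite(phase=33, top=20, bottom=12)

def translate(pixels):
    """Translate a run of 8-bit pixels through the lookup table."""
    return [clut[p] for p in pixels]

print(format(clut[0x12], "016b"))
```

Each 8-bit pixel costs one table read, so the full 16-bit colour space is reachable without widening the pixel data in cog RAM.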
Would these modifications be beneficial to you?
I think our current dilemma is that the phase-key shifting in the Propeller I is quantized to "a granularity of 16", and that the 4-bit phase-state variable in the color variable is constant until a video restart. If the phase-key modulator in the video generator is repeatable, could there be a possibility of using more than one generator cooperatively to achieve quadrature (or octal) phase-key shifting?
If you were to run virtually identical code in multiple cogs, synced to the system counter, would the time waiting between both cogs during a WAITCNT command be the same? If you could cause simultaneous passing of unique sets of color and pixel variables into the separate video generators, couldn't you then effectively use them to modulate each other at the output with a high-speed quadrature modulator, or something of the sort?
Here's what I'm thinking could happen if the generators' PLLs are just slightly coherent. You could dedicate cog A's generator to the quadrature signal and cog B's generator to the in-phase signal, or alternatively use CMYK phases instead of RGB (CK in one cog, the in-phase, and MY in the other), using a subtractive scheme instead of the additive one. That way, modulating the signals against each other on the way out, like a quadrature modulator does, you'd be able to phase-key to finer-quantized variables. Just a thought.
I have a feeling the PLLs do not care at all about each other, and that the sync is impossible. You'd probably have to put two generators on the same PLL.
Thoughts?
I dunno, I think there's some way to do this.
tpw_man said...
I might be a little late, as this thread grew 3 pages overnight, but I think it should be kept as simple as possible, ie. no hardware threading and no memory caching(?). I think I would go with the 256K, 8 cog version.
This is how it is shaping up: No hardware threading, no cache-line memory accessing, but 8 cogs and 256KB RAM.
The enhancements that will be added for sure are: CORDIC, MAC/MACS, REPeat, indirect register addressing, and hub memory pointers.
What will be investigated further is: improved video with CLUT and cog DMA. These expand the color immensely and free the cog to do meaningful work during rasterization.
Chip Gracey said...
In case you didn't read elsewhere, we've got a new JMPRETD instruction which DOESN'T flush the pipe on a branch, leaving the two trailing instructions to execute.
Chip,
In the Prop I, regular JMPs don't flush the pipe, either, except for things like DJNZ, which flush the pipe only if the jump is not taken. Are you implying, in your above statement, that this will change in the Prop II? Would it be possible to divulge exactly how the new pipeline works?
@mike_green You would know it's finished when the next waitvid is accepted. I take that to mean if one waitvid, with the COG memory stream bit set on, is in progress, the COG program execution would pause at the next one until it's ready.
So you do some math, and code your intermediate instructions such that they will finish before the waitvid does, just like we do now.
If this is done, via switch so that the old behavior is still there for the most part, I think this is a fine idea!
No tricks for decent color depths, and full use of the 8 bits / pixel mode possible.
I can see a routine where a scan line is built, waitvid initiated, then while waitvid is fetching from the scanline, the next one is built right behind it. Throughput is very high, meaning single cog video engines will be more than adequate for a lot of tasks. Damn cool. I can also see the rep instructions being extremely useful in this context as well, meaning very robust video with a far lower COG cost. This plus auto increment really packs a punch for this task. Home run, IMHO.
Can you explain more about the proposed color specification? lower level? Upper level? Is that for saturation, etc...? Just not sure what I'm seeing there. I think I'm also seeing sync being something declared explicitly, meaning that would now be something more automated? If so, that's cool too as more attention could be given to actually building the engine.
One thing I really wanted to do with the TV video was use more than one generator in tandem for cheap and easy overlays. Cursor, HUD, etc... It's a feature unique to the prop, it's a shame to not use it more.
In VGA this is possible and works well. There are no color signals to worry about. Those are the problem with the TV graphics generators being used in tandem.
Another useful thing would be to have the waitvid work once, then just turn off, for bursts of data in either DMA (guess that's the best term) or traditional mode. Again, I'm thinking about using more than one of them together, but also just thinking about other uses, like sound. A known short burst would be a nice option instead of having to manage two running streams. Having setup the PLLs, and data, just ask for one complete frame and know it's happening and that it will just finish with no worries.
Maybe two switches then? One for DMA, another for "one shot" mode?
Chip Gracey (Parallax) said...
What will be investigated further is: improved video with CLUT and cog DMA. These expand the color immensely and free the cog to do meaningful work during rasterization.
Considering you snubbed me pretty quickly when discussing Pseudo-DMA for input, but embraced Andre's suggestion for output, I take this to mean COG DMA would be for output only?
Chip Gracey said...
In case you didn't read elsewhere, we've got a new JMPRETD instruction which DOESN'T flush the pipe on a branch, leaving the two trailing instructions to execute.
Chip,
In the Prop I, regular JMPs don't flush the pipe, either, except for things like DJNZ, which flush the pipe only if the jump is not taken. Are you implying, in your above statement, that this will change in the Prop II? Would it be possible to divulge exactly how the new pipeline works?
Thanks,
-Phil
The current Propeller serially executes each instruction: Read S, Read D, Read next instruction, write D (4 clocks total). Because the next-instruction read is done during the ALU settling (before the results are known), the DJNZ assumes the branch WILL occur, and fetches the instruction at the top of the loop, not the one after DJNZ. This made DJNZ take 4 clocks for a branch and 8 clocks for a fall-through, in which case the instruction at the top of the loop got internally NOP'd and we started over again at the instruction after the DJNZ - hence, 8 clocks or 2 instruction cycles for a fall-through.
The next Propeller is like a turbine, with things flying through it in various stages. Every clock, everything happens: an instruction is read for inst N-2, S and D are read for inst N-1, and results (D, Z, C, and PC) are written for inst N (the 'executing' instruction). There are data forwarding mechanisms to pass result values backwards to keep N-1 and N-2 up-to-date. It's a real party.
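To visualize the three overlapping stages, here is a toy Python model that lists which instruction sits in which stage on each clock. It is only a sketch of the description above: the stage labels are made up, instructions are numbered in program order, and forwarding, WAITs and branches are ignored.

```python
def pipeline_table(n_instructions):
    """For each clock, map each stage to the instruction index occupying it."""
    stages = ("fetch", "read S/D", "write result")
    rows = []
    for clock in range(n_instructions + len(stages) - 1):
        row = {}
        for depth, stage in enumerate(stages):
            idx = clock - depth  # older instructions sit deeper in the pipe
            row[stage] = idx if 0 <= idx < n_instructions else None
        rows.append(row)
    return rows

rows = pipeline_table(3)
for clock, row in enumerate(rows):
    print(clock, row)
```

Once the pipe is full, every clock retires one instruction while the next two are already in flight, which is the "everything happens every clock" behaviour described.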
Right now, DJNZ takes 3 cycles for a branch and 1 for a fall-through. I would like to make a delayed DJNZD that you place two instructions up, so that it only takes 1 cycle for both looping and fall-through. We have a general JMPRETD which works like this.
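The figures quoted here can be compared with a little arithmetic. The Python sketch below counts only the clocks spent in the loop-closing instruction for a loop that runs a given number of times (branch taken every pass but the last, then one fall-through). It assumes a Prop II "cycle" is one clock, and the table keys are just labels.

```python
# Clocks spent in the loop-closing instruction alone. The (taken, fall-through)
# pairs are the figures quoted in the post; Prop II cycles are assumed to be
# single clocks.
COSTS = {
    "Prop I DJNZ": (4, 8),
    "Prop II DJNZ": (3, 1),
    "Prop II DJNZD": (1, 1),  # the proposed delayed form
}

def loop_overhead(kind, iterations):
    """Total branch cost for a loop that executes `iterations` times."""
    taken, fall_through = COSTS[kind]
    return (iterations - 1) * taken + fall_through

for kind in COSTS:
    print(kind, loop_overhead(kind, 100))
```

For a 100-pass loop this gives 404, 298 and 100 clocks respectively, so the delayed form matters most in tight loops where the branch is a large share of the body.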
I think our current dilemma is that the phase-key shifting in the Propeller I is quantized to "a granularity of 16", and that the 4-bit phase-state variable in the color variable is constant until a video restart. If the phase-key modulator in the video generator is repeatable, could there be a possibility of using more than one generator cooperatively to achieve quadrature (or octal) phase-key shifting?
If you were to run virtually identical code in multiple cogs, synced to the system counter, would the time waiting between both cogs during a WAITCNT command be the same? If you could cause simultaneous passing of unique sets of color and pixel variables into the separate video generators, couldn't you then effectively use them to modulate each other at the output with a high-speed quadrature modulator, or something of the sort?
Here's what I'm thinking could happen if the generators' PLLs are just slightly coherent. You could dedicate cog A's generator to the quadrature signal and cog B's generator to the in-phase signal, or alternatively use CMYK phases instead of RGB (CK in one cog, the in-phase, and MY in the other), using a subtractive scheme instead of the additive one. That way, modulating the signals against each other on the way out, like a quadrature modulator does, you'd be able to phase-key to finer-quantized variables. Just a thought.
I have a feeling the PLLs do not care at all about each other, and that the sync is impossible. You'd probably have to put two generators on the same PLL.
Thoughts?
I dunno, I think there's some way to do this.
Actually, separate PLLs can sync quite nicely when started at the same time with same initial phase in the CTR. Just try running the vga_1280_x_1024_demo (or whatever its exact name is). You will see.
I am glad you brought up this phase issue, though, because this is something that will get some attention. In the current Propeller, the 4-bit phase counter in each video unit is increment-only, and random on power-up, so it would take some special coding to joggle a few of them into exact phase. On the next Propeller, they will be reset through some simple mechanism, so that syncing multiple phase counters will be a breeze. Also, they will expand from 4 to 6 bits, giving 64 chroma phases, instead of 16.
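As a quick check on what the extra bits buy, this small Python sketch works out the angular step per count, assuming the phases are spread evenly over one full 360-degree cycle of the color carrier (that even spacing is an assumption, and the function name is made up).

```python
def phase_step_degrees(bits):
    """Angular step of a chroma phase counter with the given bit width."""
    return 360.0 / (1 << bits)

for bits in (4, 6):
    print(bits, 1 << bits, phase_step_degrees(bits))
```

Going from 4 to 6 bits takes the step from 22.5 degrees down to 5.625 degrees, i.e. four times finer hue resolution.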
Fascinating! So somehow you've managed to create a four-port memory, then? I see three reads and one write happening on every rising and falling clock edge!
Chip Gracey (Parallax) said...
What will be investigated further is: improved video with CLUT and cog DMA. These expand the color immensely and free the cog to do meaningful work during rasterization.
Considering you snubbed me pretty quickly when discussing Pseudo-DMA for input, but embraced Andre's suggestion for output, I take this to mean COG DMA would be for output only?
Ouch! Sorry, Steve. Weren't you talking about hub DMA which could actually be done in software just as fast (8 clocks = WRLONG + 6 instructions)? I know you are interested in pin states being written to hub RAM as fast as possible.
Fascinating! So somehow you've managed to create a four-port memory, then? I see three reads and one write happening on every rising and falling clock edge!
-Phil
That's right, there are 3 random read ports, and one random write port. Every clock, they all get used, except when you are hanging in some kind of WAIT. The bit cell for this memory is huge. Hey, Beau? Can you post a screen shot of the 4-port bitcell next to the single-port 6T cell? Beau's at lunch now, but will be back soon. Anyway, 4-port memory takes a lot of silicon, but enables single-clock execution for the Propeller.
Actually, separate PLLs can sync quite nicely when started at the same time with same initial phase in the CTR. Just try running the vga_1280_x_1024_demo (or whatever its exact name is). You will see.
I'm not so convinced that they're 100% sync'd... I think they each have some timing offset. But, I think the monitor must correct for this via horizontal sync...
I've noticed the jitter to be within a nanosecond, which is 1/8 of a pixel at a 125MHz pixel rate - not noticeable on a CRT, but some LCDs (which all use internal PLLs to sync) exhibit occasional pixel-boundary jitter. A monitor with a good PLL system always looks rock-solid, though. Some monitors are better than others in this regard.
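The 1/8 figure follows directly from the pixel period. A tiny Python sketch of the arithmetic (the function name is made up):

```python
def jitter_fraction(jitter_ns, pixel_rate_mhz):
    """Jitter expressed as a fraction of one pixel period."""
    pixel_period_ns = 1000.0 / pixel_rate_mhz  # 125 MHz -> 8 ns per pixel
    return jitter_ns / pixel_period_ns

print(jitter_fraction(1.0, 125.0))
```

The same function shows how the margin shrinks as the pixel clock rises: at a fixed 1 ns of jitter, a faster dot clock means a larger fraction of each pixel is uncertain.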
Comments
btw: I really like your tag.
LMM is a very fine programming technique, but it only serves non-time-critical programs.
It has big possibilities with the Prop II.
But for time-critical programming, only COG power is good.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nothing is impossible, there are only different degrees of difficulty.
Sapieha
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
"I have always wished that my computer would be as easy to use as my telephone. My wish has come true. I no longer know how to use my telephone."
- Bjarne Stroustrup
Keep the hardware straightforward
LMM is something that will be used a lot, imho.
Evan
Well, it looks like multithreading is yesterday's news now. And that's okay: it got a fair and enthusiastic hearing, and I can't disagree with the verdict. To make it work cleanly from the user's perspective, I think one would need multiple, separate, complete execution units, each with its own pipeline, along with an overlay area for returns, etc. But if you go that far, you might as well spawn off another cog.
The idea still intrigues me, due to the raw MIPS available, so maybe it could be done in software via emulation. The EXEC capability via JMPD may hold the key. Once I've had my coffee, maybe I can wrap my mind around it. SMM, anyone?
-Phil
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
'Still some PropSTICK Kit bare PCBs left!
You mean Symmetric Multi-processing Model?
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Paul Baker
Propeller Applications Engineer
Parallax, Inc.
Well, actually it was Small Memory Model, or in-cog emulation. I like your definition, though!
-Phil
granularity of 16", and that the 4-bit phase-state variable in the color variable is constant until a video restart. If the phase-key modulator in the video generator is repeatable, could there be a possibility of using more than one generator cooperatively to achieve quadrature (or octal) phase-key shifting?
If you were to run virtually identical code in multiple cogs, synced to the system counter, would the time spent waiting in both cogs during a WAITCNT command be the same? If you could cause simultaneous passing of unique sets of color and pixel variables into the separate video generators, couldn't you then effectively use them to modulate each other at the output with a high-speed quadrature modulator, or something of the sort?
Here's what I'm thinking could happen if the generators' PLLs are just slightly coherent. You could dedicate cog A's generator to the Quadrature signal and cog B's generator to the In-phase signal, or alternatively use CMYK phases instead of RGB (CK in one cog, the In-phase, and MY in the other), using a subtractive scheme instead of the additive one.
That way, modulating the signals against each other on the way out, like a quadrature modulator does, you'd be able to phase-key to finer-quantized variables. Just a thought.
I have a feeling the PLLs do not care at all about each other, and that the sync is impossible. You'd probably have to put two generators on the same PLL.
Thoughts?
I dunno, I think there's some way to do this.
The enhancements that will be added for sure are: CORDIC, MAC/MACS, REPeat, indirect register addressing, and hub memory pointers.
What will be investigated further is: improved video with CLUT and cog DMA. These expand the color immensely and free the cog to do meaningful work during rasterization.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
In the Prop I, regular JMPs don't flush the pipe, either, except for things like DJNZ, which flush the pipe only if the jump is not taken. Are you implying, in your above statement, that this will change in the Prop II? Would it be possible to divulge exactly how the new pipeline works?
Thanks,
-Phil
So you do some math, and code your intermediate instructions such that they will finish before the waitvid does, just like we do now.
If this is done via a switch, so that the old behavior is still there for the most part, I think this is a fine idea!
No tricks for decent color depths, and full use of the 8 bits / pixel mode possible.
I can see a routine where a scan line is built, waitvid initiated, then while waitvid is fetching from the scanline, the next one is built right behind it. Throughput is very high, meaning single cog video engines will be more than adequate for a lot of tasks. Damn cool. I can also see the rep instructions being extremely useful in this context as well, meaning very robust video with a far lower COG cost. This plus auto increment really packs a punch for this task. Home run, IMHO.
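The build-one-line-while-the-previous-one-is-fetched idea is a ping-pong (double) buffer. Below is a rough Python sketch of the scheduling only, not of any real video hardware. The names `render_frame` and `gradient` are hypothetical, and the background fetch is simulated by running the two steps in turn.

```python
def render_frame(build_line, num_lines, width):
    """Ping-pong two scanline buffers: one is shown while the other is built."""
    buffers = [[0] * width, [0] * width]
    shown = []                      # stands in for what the video hardware fetched
    build_line(0, buffers[0])       # prime the first buffer before output starts
    for line in range(num_lines):
        front = buffers[line % 2]         # buffer the (hypothetical) DMA reads
        back = buffers[(line + 1) % 2]    # buffer the cog is free to overwrite
        # On real hardware the fetch would run in the background while the
        # cog builds the next line; here the two steps simply run in turn.
        if line + 1 < num_lines:
            build_line(line + 1, back)
        shown.append(list(front))
    return shown

def gradient(line, buf):
    """Example line builder: a simple 8 bits/pixel ramp."""
    for x in range(len(buf)):
        buf[x] = (line + x) & 0xFF

frame = render_frame(gradient, num_lines=4, width=8)
print(len(frame), frame[0], frame[3])
```

The constraint is the same one mentioned earlier in the thread: the build step for line N+1 has to finish before the fetch of line N does.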
Can you explain more about the proposed color specification? lower level? Upper level? Is that for saturation, etc...? Just not sure what I'm seeing there. I think I'm also seeing sync being something declared explicitly, meaning that would now be something more automated? If so, that's cool too as more attention could be given to actually building the engine.
One thing I really wanted to do with the TV video was use more than one generator in tandem for cheap and easy overlays. Cursor, HUD, etc... It's a feature unique to the prop, it's a shame to not use it more.
In VGA this is possible and works well. There are no color signals to worry about. Those are the problem with the TV graphics generators being used in tandem.
Another useful thing would be to have the waitvid work once, then just turn off, for bursts of data in either DMA (guess that's the best term) or traditional mode. Again, I'm thinking about using more than one of them together, but also just thinking about other uses, like sound. A known short burst would be a nice option instead of having to manage two running streams. Having set up the PLLs and data, just ask for one complete frame and know it's happening and that it will just finish with no worries.
Maybe two switches then? One for DMA, another for "one shot" mode?
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Propeller Wiki: Share the coolness!
Chat in real time with other Propellerheads on IRC #propeller @ freenode.net
Post Edited (potatohead) : 8/29/2008 6:01:43 PM GMT
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve
The next Propeller is like a turbine, with things flying through it in various stages. Every clock, everything happens: an instruction is read for inst N-2, S and D are read for inst N-1, and results (D, Z, C, and PC) are written for inst N (the 'executing' instruction). There are data forwarding mechanisms to pass result values backwards to keep N-1 and N-2 up-to-date. It's a real party.
Right now, DJNZ takes 3 cycles for a branch and 1 for a fall-through. I would like to make a delayed DJNZD that you place two instructions up, so that it only takes 1 cycle for both looping and fall-through. We have a general JMPRETD which works like this.
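As a rough feel for what a delayed DJNZD would save, here is a back-of-envelope cycle count in Python. It assumes every loop-body instruction takes one cycle and that the loop body is long enough to place the branch two instructions up. The function name `loop_cycles` and the example figures (100 iterations, 4 body instructions) are illustrative only.

```python
def loop_cycles(iterations, body_instructions, taken_cost, fallthrough_cost=1):
    """Total cycles for a counted loop: body work plus the closing branch.

    The branch is taken on every pass except the last, which falls through.
    """
    body = iterations * body_instructions
    branches = (iterations - 1) * taken_cost + fallthrough_cost
    return body + branches

djnz = loop_cycles(100, 4, taken_cost=3)    # 3 cycles when the branch is taken
djnzd = loop_cycles(100, 4, taken_cost=1)   # delayed form: 1 cycle either way
print(djnz, djnzd)
```

The shorter the loop body, the larger the relative gain, since the branch cost is a bigger share of each pass.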
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
I am glad you brought up this phase issue, though, because this is something that will get some attention. In the current Propeller, the 4-bit phase counter in each video unit is increment-only, and random on power-up, so it would take some special coding to joggle a few of them into exact phase. On the next Propeller, they will be reset through some simple mechanism, so that syncing multiple phase counters will be a breeze. Also, they will expand from 4 to 6 bits, giving 64 chroma phases, instead of 16.
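The jump from 4 to 6 bits is easy to quantify. A small Python sketch of the phase resolution follows; the function name `chroma_phases` is made up for the example.

```python
def chroma_phases(bits):
    """Return (number of phases, degrees per phase step) for an n-bit phase counter."""
    count = 1 << bits
    return count, 360.0 / count

print(chroma_phases(4))  # current Propeller: 4-bit phase counter
print(chroma_phases(6))  # next Propeller: 6-bit phase counter
```

So each phase step shrinks from 22.5 degrees to 5.625 degrees of the color carrier.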
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
:)
Fascinating! So somehow you've managed to create a four-port memory, then? I see three reads and one write happening on every rising and falling clock edge!
-Phil
Cog DMA could both read and write, for sure.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.