
Observations of a multi-tasker


Comments

  • Bill Henning Posts: 6,445
    edited 2013-09-19 17:42
    Cluso,

    I think the simplest way would be to use the existing prop to prop 3 pin serial instructions, and make the following simple mods:

    - sender and receiver can configure the serial circuit for 8/16/32 bits (no funny number of bits to keep the logic simple)
    - sender supplies CLK, if needed have it be clkfreq/2 or one of the counters
    - receiver clocks data in on received clock

    This would allow low overhead prop 2 prop comms, removing the need for a shared crystal (but may limit comm to clkfreq/2), and maps nicely onto SPI master/slave

    Worst case impact: the ser/des does not work, P2 can go to market without it

    Best case: very fast SPI, I2S, prop2prop comm

    (famous last words) implementation should be easy, if kept this simple.
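    As a rough illustration of the receive side of this scheme (a software model only, not the P2 ser/des hardware; the pin-sampling hooks are invented for the sketch):

        #include <stdint.h>
        #include <stdbool.h>

        /* Software model of a fixed 32-bit frame clocked in on a received CLK,
           as in the 3-pin scheme above.  read_clk/read_data are hypothetical
           pin-sampling callbacks, not a real Propeller API. */
        typedef bool (*pin_read_fn)(void);

        static uint32_t shift_in_32(pin_read_fn read_clk, pin_read_fn read_data)
        {
            uint32_t value = 0;
            for (int i = 0; i < 32; i++) {
                while (!read_clk()) ;                 /* wait for CLK to rise   */
                value = (value << 1) | read_data();   /* sample DATA, MSB first */
                while (read_clk()) ;                  /* wait for CLK to fall   */
            }
            return value;
        }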
    Cluso99 wrote: »
    Chip:
    Is there some simple way that you could allow serial input to work?
    Something like putting a gate to input to the VGA registers and be able to read the VGA registers, or
    some way of applying a counter clock to the high speed interprop comms?

    We could take care of start and stop bit detection etc with software, but some slight additional hardware using
    the existing serial silicon would be fantastic.
  • Ariba Posts: 2,690
    edited 2013-09-19 18:15
    Ken Gracey wrote: »
    Nothing against Andy or the suggestion as it may be very beneficial.

    Ears perked up a bit on this kind of request. I realize there may be tremendous benefits to some changes.

    But they have to be considered in relation to the opportunity cost of not completing the project. Eight years is a long time, and Parallax has much to consider when changes are made. Parallax is one place where extended R&D has no consequences other than those that might matter the most: serious financial considerations, which can grind us to a stop if we are unable to derive revenue from our investments.

    Just a word of caution, that's all. . . I'll look into swapping out the Starbucks coffee for some Yuban next week.

    Ken
    I'm fully aware of this. At first I had a longer list, but then reduced it to only the things that Chip had already considered important enough to change if there ever were a new revision. Now we are at that point, and I just want to give a little reminder....

    Andy
  • Cluso99 Posts: 18,069
    edited 2013-09-19 18:26
    Cluso,

    I think the simplest way would be to use the existing prop to prop 3 pin serial instructions, and make the following simple mods:

    - sender and receiver can configure the serial circuit for 8/16/32 bits (no funny number of bits to keep the logic simple)
    - sender supplies CLK, if needed have it be clkfreq/2 or one of the counters
    - receiver clocks data in on received clock

    This would allow low overhead prop 2 prop comms, removing the need for a shared crystal (but may limit comm to clkfreq/2), and maps nicely onto SPI master/slave

    Worst case impact: the ser/des does not work, P2 can go to market without it

    Best case: very fast SPI, I2S, prop2prop comm

    (famous last words) implementation should be easy, if kept this simple.
    Bill,
    I am thinking of something even simpler...
    - Clock supplied by one of the counters (for sender and receiver) - or external?
    - Always 32 bits (can we clear the register) - do it by software

    So it could just be a matter of adding a new mode to have the clocks generated by a counter (or counters). But there is only one of these per prop???
    It would be nicer to use the VGA or other serial section, so that each cog could have one.

    I thought using the VGA might be simplest since we already have a shifter there, with clocking from the counters. Would only need an option to gate the output pin to be the input pin, and of course be able to read the VGA register (if we cannot do that now).
  • Lawson Posts: 870
    edited 2013-09-19 18:28
    Cluso99 wrote: »
    Chip:
    Is there some simple way that you could allow serial input to work?
    *snip* or
    some way of applying a counter clock to the high speed interprop comms?

    Where is the documentation for this feature? Anyway, I assume the hardware for this comms is a 32-bit shift register, with bits going out one end and coming in the other, that can be put in a master or slave mode (i.e. generate a clock or use an external clock). If it has a slave mode that needs an external clock, I think we can already run this shift register with a counter: set the sender and receiver shift registers to slave mode and use a counter to run the clock pin.

    Marty
  • Yanomani Posts: 1,524
    edited 2013-09-19 18:47
    Cluso99 wrote: »
    Yanomani: Unfortunately the variable hub cycle access has been beaten to death.

    I agree that it would be nice, and respectfully disagree with heater because those using the variable hub access would take on all the issues involved. I think there would be a number of apps that could make efficient use of additional hub cycles and it could be done in such a way to allow some cogs the normal deterministic access.

    Cluso99

    I wish I had been here for the last 14 years, at least, to learn as much as I could about the development of the whole Propeller concept since its inception.
    There are literally tens of thousands of posts to read, just to sync my mind with all the wonderful thoughts you all shared during those years.
    As a newcomer, I tend to concentrate on the topics being discussed, thinking about the wonderful opportunities for creative thinking passing in front of my starving eyes.
    Surely at the risk of being a bit repetitive, when all the good work done by many others is taken into consideration.
    After many decades of counting every CPU cycle used (and many times wasted, too) by my systems software, just to avoid conflicting interrupted tasks and their 'must be saved before proceeding' constraints, I found the Propeller concept to be the most refreshing experience I have ever had.
    I feel like an old miner who, after many years of drilling mountains, finds an everlasting source of gold and diamonds.
    I became excited by all the thoughts expressed in those threads, wondering something like: "Oh, oh, oh! If only I had had a little bit of this just a decade before!"
    When I imagine how easy my life would have been if only I had had it earlier, a huge bunch of blossoms starts popping inside my brain's flower pot.
    Many times I must pull the parking brake lever, just so as not to smash the flower pot.

    Many thanks for helping me realize when to do so, really! :lol:

    Yanomani
  • Ariba Posts: 2,690
    edited 2013-09-19 19:04
    cgracey wrote: »
    Andy,

    Yes!!! This is something that I remember you bringing up a while ago and I know it will make DSP a lot better.

    We have 20x20 bit signed multipliers for the MAC instructions, as you know, so by taking bits [top..18] of the result, we preserve significant digits, right? So, 18 is the single magic number of shifts, though 20..16 would be a nice range, correct? I will make some changes to accommodate this.

    I thought about going to separate IN/OUT registers, but it's working in Spin2 so nicely now with your XOR advice that it seems unnecessary to mess with. PINS, alone, seems fine for assembly programming. What do you think?

    A shift by 18 bits makes sense because the MAC factors and results then have the same bit format as for the SCL instruction. So you can use MACx and SCL instructions in an algorithm without shifting the results left and right. The signal path is then normalized to +-18 bits, and the 20-bit multiplier lets you use factors in the range -2.0 to +1.99999.
    If you only need factors in the range -1.0 to +1.0 or 0..+1.0, then you could have a signal resolution of 19 or 20 bits with a 20-bit multiplier. So a shift by 19 or 20 also makes sense, to maximize the signal resolution. 16 is not so important; this can also be done with MOVF, as you already showed here.
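    For illustration only (my own model of the scaling described here, not the MAC/SCL hardware itself): with s2.18 factors, the product of two values carries 36 fractional bits, so an arithmetic right shift by 18 restores the working format.

        #include <stdint.h>

        typedef int32_t fix2_18;                         /* value * 2^18, assumed to fit in 20 bits */

        #define FIX2_18(x)  ((fix2_18)((x) * 262144.0))  /* 2^18 = 262144 */

        /* Multiply two s2.18 factors: the 64-bit product has 36 fractional
           bits, so >>18 brings it back to 18 fractional bits. */
        static inline fix2_18 fix_mul(fix2_18 a, fix2_18 b)
        {
            return (fix2_18)(((int64_t)a * (int64_t)b) >> 18);
        }

        /* Multiply-accumulate in the same format, e.g. one FIR tap. */
        static inline fix2_18 fix_mac(fix2_18 acc, fix2_18 a, fix2_18 b)
        {
            return acc + fix_mul(a, b);
        }

    (So FIX2_18(-1.5) times FIX2_18(0.5) comes back as FIX2_18(-0.75), with no extra shifting between stages.)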


    About the PINx: I see a little problem with my XOR solution when concurrent PASM code messes with Spin's pins. The output latches and the OUTx register in your Spin interpreter are then not in sync, and the XOR flips the pins into the wrong state. In this case you have no chance to overwrite them with the right values, because the XOR always changes the bits relative to the expected state in the pseudo OUTx.
    If you could read the real states from the output latch, then Spin could write the right values.
    I don't know if this is really a problem; if PASM messes with Spin's pins then something is wrong anyway, and if it is intended then the PASM code can also read Spin's OUTx register and do the same atomic XOR as Spin.

    Andy
  • David Betz Posts: 14,516
    edited 2013-09-19 19:17
    Ariba wrote: »
    A shift by 18 bits makes sense because the MAC factors and results then have the same bit format as the SCL instruction. So you can use both instructions in an algorithm without shifting the results left and right. The signal path is then normalized to +-18 bits, and the 20-bit multiplier lets you use factors in the range -2.0 to +1.99999.
    If you only need factors in the range -1.0 to +1.0 or 0..+1.0, then you could have a signal resolution of 19 or 20 bits with a 20-bit multiplier. So a shift by 19 or 20 also makes sense, to maximize the signal resolution. 16 is not so important; this can also be done with MOVF, as you already showed.


    About the PINx: I see a little problem with my XOR solution when concurrent PASM code messes with Spin's pins. The output latches and the OUTx register in your Spin interpreter are then not in sync, and the XOR flips the pins into the wrong state. In this case you have no chance to overwrite them with the right values, because the XOR always changes the bits relative to the expected state in the pseudo OUTx.
    If you could read the real states from the output latch, then Spin could write the right values.
    I don't know if this is really a problem; if PASM messes with Spin's pins then something is wrong anyway, and if it is intended then the PASM code can also read Spin's OUTx register and do the same atomic XOR as Spin.

    Andy
    I still think separate IN and OUT registers are cleaner. They would allow the pins to be manipulated from C in the same way as they are on P1. Without the separate registers, we will probably have to resort to macros or inline functions. I'm worried about code like this:
       OUT |= 0x100;
    
    Spin can make a special case of this but the generic GCC compiler won't be able to do that.

    However, I'd much rather have P2 sooner than delay it to fix this.
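    A minimal sketch of one possible inline-function workaround, assuming a single combined PINS register whose read returns the pin states rather than the output latch (the shadow-copy approach and the names here are illustrative only, not a defined API):

        /* Hypothetical register and helper -- not an actual P2 C interface. */
        extern volatile unsigned int PINS;   /* read = pin states, write = output latch */
        static unsigned int out_shadow;      /* software copy of the output latch       */

        static inline void pins_or(unsigned int mask)
        {
            out_shadow |= mask;              /* modify the shadow, never read the pins  */
            PINS = out_shadow;               /* write the whole latch back              */
        }

        /* Instead of  OUT |= 0x100;  the code becomes  pins_or(0x100);  */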
  • Yanomani Posts: 1,524
    edited 2013-09-19 19:30
    Cluso99

    I still agree with Sapieha's comment about heavily using the internal port to exchange information between two COGs, without ever passing through HUB memory.
    I only wish for a small group of shadow semaphores to enhance interlocked handshaking, without having to mess with the HUB's LOCKNEW, LOCKRET, LOCKSET and LOCKCLR ones.
    Perhaps, after setting which COG any other wants to hear from, we could have some CHCKPDWR, CHCKPDRD instructions, with corresponding wc and/or wz effects to test, and stalling WAITPDWR, WAITPDRD ones too, so that every COG can peacefully wait for a data exchange to complete before proceeding.
    Sure, these new instructions would only make sense when the full 32 bits are used in each direction, but I believe they'd represent a big step forward in inter-cog data exchange speed and stability.

    Yanomani
  • potatohead Posts: 10,261
    edited 2013-09-19 19:36
    Re: Hub access changes.

    If it's not simple, round robin like it is now, it's simply not a Propeller anymore. I agree with what Heater put here. A COG needs to be a COG. There has been a huge, real benefit to this design principle. It's not just theoretical.
  • Ariba Posts: 2,690
    edited 2013-09-19 20:24
    Re: Hub access changes and fast inter-cog communication

    There is already RD/WRQUAD, which lets you access 4 longs in an 8-cycle window. And the cached versions of RDBYTE/WORD/LONG make this quad access very easy.

    And there is the virtual PORTD for faster inter-cog communication without the need to go over hubram.

    Andy
  • Seairth Posts: 2,474
    edited 2013-09-19 20:44
    Back on the original question about modifying the way that tasks are scheduled, I have a thought about how HUBOP behavior could be changed to reduce stalls. In stage four of the pipeline, if the cog is more than 3 cycles away from its access window, cancel all operations in the pipeline for that task and set the PC to the address of the HUBOP that was canceled. This would allow the other tasks to continue executing, and the HUBOP would still get queued back up in time for its access window. This assumes, of course, that the task in question is scheduled at least every fourth clock cycle and that none of the other tasks have multi-cycle instructions in the pipeline.

    Incidentally, this would also work for the trivial case where only one task is running.
  • Yanomani Posts: 1,524
    edited 2013-09-19 20:51
    potatohead wrote: »
    Re: Hub access changes.

    If it's not simple, round robin like it is now, it's simply not a Propeller anymore. I agree with what Heater put here. A COG needs to be a COG. There has been a huge, real benefit to this design principle. It's not just theoretical.

    potatohead

    It was my fault, or better put, it was my poor understanding of the English language guiding my hands when I wrote the word "theoretically" in a way that could be interpreted as if I were discussing the Propeller concept as a whole.
    In fact I meant that word to express my humble objection to any difficulty in using an OBEX object depending solely on the amount of HUB access bandwidth it intends to need or use.
    I believe my thought was clearly expressed in the way I exemplified some meaningful way to account for its needs at the time of code assembly or integration with any other piece of code. Now I can see I totally failed in my intent.
    It's clear to me that it is not the round-robin concept that is under discussion, but the consequences of the newly created multi-threading concept, along with the fact that not all instructions complete in a single or an even number of cycles, and how that is reflected in total COG throughput.
    The total amount of thread stalling seems to be the enemy to fight against, and I believe that everyone here is trying hard to contribute to Chip's effort to solve this and almost every other bottleneck that can be eliminated, without significantly redesigning anything.
    That said, I wish to thank you for pointing out a probable cause of misinterpretation in my posts. I'll try to be extra careful in the future. :thumb:

    Yanomani
  • potatohead Posts: 10,261
    edited 2013-09-19 20:53
    No worries here. :)
  • Yanomani Posts: 1,524
    edited 2013-09-19 21:19
    Ariba wrote: »
    Re: Hub access changes and fast inter-cog communication

    There is already RD/WRQUAD, which lets you access 4 longs in an 8-cycle window. And the cached versions of RDBYTE/WORD/LONG make this quad access very easy.

    And there is the virtual PORTD for faster inter-cog communication without the need to go over hubram.

    Andy

    Ariba
    Thanks for refreshing my memory about the QUAD operations; they're surely a perfect and fast way of moving data to/from hubram.
    And sure, I understood how to use PORTD to craft fast comms between two or more Cogs. It's only the lack of enough independent interlock semaphores to allow as many Cogs as we would need to exchange data between them in some orderly way.
    As things are right now, we have only the following choices, from my point of view:

    - use extra pins, at ports A, B or C, to control the handshaking process. That is why I suggested that Chip complete the four uncommitted pins P92 thru P95 with internal pullups or pulldowns, so they could be used as four more semaphores, accessible only internally. Not enough to cover all the possibilities, but better than nothing.
    - use fewer than 32 bits in each direction, reserving some bits for handshaking control;
    - use hubram to hold the handshaking flags, at the expense of bandwidth and possibly stalling some threads;
    - finally, use the eight semaphores we already have, which are still not enough, and this also comes at the expense of the only interlock mechanism we now have for crafting blocked-access protection of hubram.

    If I have forgotten some meaningful way to accomplish my intent, please advise me; I'll be glad to learn a bit more about the Propeller 2.

    Yanomani
  • Ariba Posts: 2,690
    edited 2013-09-19 23:29
    Yanomani wrote: »
    Ariba
    Thanks for refreshing my memory about the QUAD operations; they're surely a perfect and fast way of moving data to/from hubram.
    And sure, I understood how to use PORTD to craft fast comms between two or more Cogs. It's only the lack of enough independent interlock semaphores to allow as many Cogs as we would need to exchange data between them in some orderly way.
    As things are right now, we have only the following choices, from my point of view:

    - use extra pins, at ports A, B or C, to control the handshaking process. That is why I suggested that Chip complete the four uncommitted pins P92 thru P95 with internal pullups or pulldowns, so they could be used as four more semaphores, accessible only internally. Not enough to cover all the possibilities, but better than nothing.
    - use fewer than 32 bits in each direction, reserving some bits for handshaking control;
    - use hubram to hold the handshaking flags, at the expense of bandwidth and possibly stalling some threads;
    - finally, use the eight semaphores we already have, which are still not enough, and this also comes at the expense of the only interlock mechanism we now have for crafting blocked-access protection of hubram.

    If I have forgotten some meaningful way to accomplish my intent, please advise me; I'll be glad to learn a bit more about the Propeller 2.

    Yanomani

    It depends on the application, as always. Do you have a certain application in mind that needs to transfer a lot of 32-bit data between all cogs?

    My feeling is that if you have a lot of data to transfer, going through hubram with QUADs is the fastest, perhaps with some handshake bits in PORTD.
    As far as I know, PORTD is more than just a shared 32-bit register. You can also split PORTD into 4 byte channels, for example, and then define which cog sees which channel, or something like that. But I have not done any experiments in that direction; it would not make much sense on a DE0-Nano with one cog ;-)

    Andy
  • Heater. Posts: 21,230
    edited 2013-09-20 00:18
    Yanomani, Cluso,


    Changing the simple round robin HUB access scheduling breaks one of the Prop's most important features. That is: "All cogs are equal"

    Currently a user can fetch 8 different objects that use 8 COGs from OBEX and elsewhere and be sure that when he stitches them together into an application timing will not be a reason it does not work.

    If there is memory enough and cogs enough for an object to operate it does not care what any other objects or cogs are doing.

    As soon as you allow the possibility that a COG can have more than its "fair share" of HUB access slots, you force the user of modules to do extra work thinking about how the timing of his total project will work out with different combinations of modules.

    Yanomani is correct in that the compiler or build system could issue an error when you mix a bunch of objects together that demand too many hub access slots.

    That's nice; it means that the user finds out his project is not going to work at build time rather than run time.
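    Something along these lines, as a sketch of such a build-time check (the per-object "slots" figure is an invented attribute for illustration, not anything the current tools define):

        #include <stdio.h>

        struct obj { const char *name; int slots; };   /* declared hub-slot demand */

        /* Sum the hub-slot demand of all objects and refuse to build if the
           total exceeds the 8 slots available per hub rotation. */
        static int check_hub_budget(const struct obj *objs, int n)
        {
            int total = 0;
            for (int i = 0; i < n; i++)
                total += objs[i].slots;
            if (total > 8) {
                fprintf(stderr, "hub access budget exceeded: %d/8 slots\n", total);
                return -1;                             /* fail the build */
            }
            return 0;
        }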

    Bottom line though is that his project is not going to work.

    As I said, it forces the user of modules to worry about a timing budget, like in the good old interrupt-driven days. Yuk. Hell, he already has to worry about the memory budget.

    So the question is: Do you think it's worth breaking that "use of modules determinism" and introduce a whole load of complexity for everybody just to squeeze out a teensy bit of performance in some rare cases?

    I feel the answer is "no". If you feel the opposite I probably cannot sway you.
  • Heater. Posts: 21,230
    edited 2013-09-20 00:21
    Yanomani,
    When I imagine how easy my life would have been if only I had had it [Propeller] earlier, a huge bunch of blossoms starts popping inside my brain's flower pot.

    I love the way you put that. Many here feel the same but never found such a nice way to say it!
  • Heater. Posts: 21,230
    edited 2013-09-20 00:30
    Ken,
    Ears perked up a bit on this kind of request....
    I was already starting to have nightmare visions of "straw" and "camel" when these new requests started coming in.

    Yes, yes, I know, I'm guilty of throwing straw bales on as well. But...
    Eight years is a long time...
    I would really like to have time to play with the Prop II before dementia sets in (some say it's too late for me already) or I start pushing up daisies.
  • ozpropdev Posts: 2,792
    edited 2013-09-20 00:44
    Hi Guys

    You all have been very busy since I last viewed this post. Lots of interesting ideas out there....

    I've been analyzing my "stall" issues a bit further, and here are my latest findings.
    My original assessment of the situation "blamed" pipeline stall as the cause of my video sync glitches.
    While this is still a contributing factor, I now believe it's not the primary cause of the problem.
    It appears my VGA driver is the "villain" here. Here's a brief description of the driver.

    VGA 800x600 image with a video window of 512x300 pixels
    Each horizontal line is output 2x for a total of 600 lines.
    The 512 pixels are output followed by 288 blank pixels for 800 pixels width.
    Using a dot clock of 45MHz (22nS) we have a build window of 800 x 22nS = 17.6uS
    This is not including the sync for the line (256 x 22nS = 5.63uS)
    I am using STR1_RGB9 video mode to keep things simple.

    The video image is stored in the video buffer in a linear layout.
    Here's some code snippets for example
    :pp1			rdlong	ax9,vid_ptr	'fetch next long (4 pixels) from the hub buffer
    			add	vid_ptr,#4	'advance hub pointer
    'pixel conversion
    			mov	ex9,#4		'4 bytes per long
    :pp2			shl	dx9,#8		'make room for the next converted byte
    			mov	bx9,ax9
    			shr	bx9,#24		'isolate the top byte
    			rev	bx9,#24		'bit-reverse it to suit the pixel shifter
    			or	dx9,bx9		'merge it into the result
    			shl	ax9,#8		'bring the next byte to the top
    			djnz	ex9,#:pp2	'repeat for all 4 bytes
    			pusha	dx9		'store the converted long
    			djnz	cx9,#:pp1	'next long of the line
    
      
    
    :send_pixels		call	#qsync			'do horizontal sync
    :wait1a			polvid	wc
    		if_nc	jmp	#:wait1a
    			waitvid	mode,char_color
    
    'horizontal sync
    
    qsync			polvid	wc
    		if_nc	jmp	#qsync
    			waitvid	v16,sync_color0
    
    :wait3			polvid	wc
    		if_nc	jmp	#:wait3
    			waitvid	v96,sync_color1
    
    :wait4			polvid	wc
    		if_nc	jmp	#:wait4
    			waitvid	v48,sync_color0
    qsync_ret		ret
    
    
    
    
    To improve performance the pixel data is read in as a long to try and reduce HUB access.
    Two conversions need to be done here.
    Firstly the data is now in "little endian" format and needs to be swapped.
    Secondly the data has to be reversed to suit the pixel shifter.

    My scheduling of the tasker has the Video task at 13/16 time slots (81.25%)
    The elapsed time of my conversion was measured at ~15uS, well short of the 17.6uS calculated earlier.
    Add the actual sync time of 5.63uS to that, and I should be well clear.

    The pipeline stall therefore seems to affect the POLVID and WAITVID synchronization.
    I think this is the origin of the glitches. This seems to be the answer that fits at this point.
    More testing, more thinking and more coffee is required.....

    On the subject of suggested instructions (Sorry, Ken), I have a few to throw into the mix.
    A "swap" instruction to convert longs to "big endian".
    Perhaps this could be combined with a bit reversal as well.
    Maybe instructions specifically designed to boost video streaming bandwidth.
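    For reference, a C sketch of what the suggested "swap" (and the combined bit reversal) would replace in software, purely illustrative of the operation rather than a proposed instruction:

        #include <stdint.h>

        /* Byte-reverse a long (little endian -> big endian). */
        static inline uint32_t swap32(uint32_t x)
        {
            return  (x >> 24)
                 | ((x >>  8) & 0x0000FF00u)
                 | ((x <<  8) & 0x00FF0000u)
                 |  (x << 24);
        }

        /* Full 32-bit bit reversal: reverse the bits within each byte, then
           the byte order, i.e. the "swap plus bit reversal" combination. */
        static inline uint32_t bitrev32(uint32_t x)
        {
            x = ((x & 0x55555555u) << 1) | ((x >> 1) & 0x55555555u);
            x = ((x & 0x33333333u) << 2) | ((x >> 2) & 0x33333333u);
            x = ((x & 0x0F0F0F0Fu) << 4) | ((x >> 4) & 0x0F0F0F0Fu);
            return swap32(x);
        }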

    Cheers
    Brian
  • cgracey Posts: 14,134
    edited 2013-09-20 01:26
    Ariba wrote: »
    A shift by 18 bits makes sense because the MAC factors and results then have the same bit format as for the SCL instruction. So you can use MACx and SCL instructions in an algorithm without shifting the results left and right. The signal path is then normalized to +-18 bits, and the 20-bit multiplier lets you use factors in the range -2.0 to +1.99999.
    If you only need factors in the range -1.0 to +1.0 or 0..+1.0, then you could have a signal resolution of 19 or 20 bits with a 20-bit multiplier. So a shift by 19 or 20 also makes sense, to maximize the signal resolution. 16 is not so important; this can also be done with MOVF, as you already showed here.


    About the PINx: I see a little problem with my XOR solution when concurrent PASM code messes with Spin's pins. The output latches and the OUTx register in your Spin interpreter are then not in sync, and the XOR flips the pins into the wrong state. In this case you have no chance to overwrite them with the right values, because the XOR always changes the bits relative to the expected state in the pseudo OUTx.
    If you could read the real states from the output latch, then Spin could write the right values.
    I don't know if this is really a problem; if PASM messes with Spin's pins then something is wrong anyway, and if it is intended then the PASM code can also read Spin's OUTx register and do the same atomic XOR as Spin.

    Andy

    I'm thinking it might be good to get rid of that complex FITACCx stuff and just put arithmetic right shifters in. Then, you can get the bits exactly where you need them in one whack.

    The PASM/Spin pin conflict is not an issue, I think, because it would be very bad practice to modify the same pin(s) from PASM and Spin. It would be as silly as having multiple cogs try to manipulate the same pin(s). I see David Betz's point about C needing separate IN/OUT registers, but that would be a very deep change at this time, and I feel too constrained to attempt it.

    What about this, Guys: Would it be very beneficial to have a complete REPS/REPD circuit for each task, or is one per cog enough? It would add about 1k flip-flops to the chip, which currently has 43k flip-flops.

    I would have been responding to more threads all day, but my new Win8 machine's been saying "no". I think something's wrong with the way it's set up. Several times today it threw up a full-screen request that I "activate Windows" now. The trouble is, it can't use the internet any better than I can, so it takes me to that screen and wants me to click on a button, but it just hangs when I do. Until I click the button, it doesn't let me go. I'm typing now on a netbook I keep nearby.
  • cgracey Posts: 14,134
    edited 2013-09-20 01:34
    Yanomani wrote: »
    Ariba
    Thanks for refreshing my memory about the QUAD operations; they're surely a perfect and fast way of moving data to/from hubram.
    And sure, I understood how to use PORTD to craft fast comms between two or more Cogs. It's only the lack of enough independent interlock semaphores to allow as many Cogs as we would need to exchange data between them in some orderly way.
    As things are right now, we have only the following choices, from my point of view:

    - use extra pins, at ports A, B or C, to control the handshaking process. That is why I suggested that Chip complete the four uncommitted pins P92 thru P95 with internal pullups or pulldowns, so they could be used as four more semaphores, accessible only internally. Not enough to cover all the possibilities, but better than nothing.
    - use fewer than 32 bits in each direction, reserving some bits for handshaking control;
    - use hubram to hold the handshaking flags, at the expense of bandwidth and possibly stalling some threads;
    - finally, use the eight semaphores we already have, which are still not enough, and this also comes at the expense of the only interlock mechanism we now have for crafting blocked-access protection of hubram.

    If I have forgotten some meaningful way to accomplish my intent, please advise me; I'll be glad to learn a bit more about the Propeller 2.

    Yanomani

    WAITPEQ and WAITPNE can wait for values on PIND (the inter-cog exchange). In a single-task program, that would let you key off an incoming value and then go into a REPS loop where you could grab a long into cog RAM on every clock. In a multi-tasking program, you must poll to avoid stalling the other tasks, so you can use JP/JNP instructions to branch on a PIND bit going high or low. I can imagine more complex handshake mechanisms, but they don't seem needed to me right now. What do you think? I could easily see making a nice handshake circuit and having it wind up being a crutch, because its existence stops people from thinking about simpler/faster ways to do data exchange.
  • ozpropdev Posts: 2,792
    edited 2013-09-20 02:14
    cgracey wrote: »
    Would it be very beneficial to have a complete REPS/REPD circuit for each task, or is one per cog enough? It would add about 1k flip-flops to the chip, which currently has 43k flip-flops.

    Anything that makes threads more efficient is great, but that's a lot of flip-flops!
  • Cluso99 Posts: 18,069
    edited 2013-09-20 05:44
    Chip: I cannot answer about the use of reps/repd for each task at this time, but 1K ffs certainly sounds like a risk at this late stage. I am all for anything fast, simple and without risk.
  • ctwardell Posts: 1,716
    edited 2013-09-20 06:14
    Chip,

    The REPS/REPD per task would be nice to have. It does save code space, which is a real benefit when trying to fit multiple tasks into the COG space.

    Chris Wardell
  • Seairth Posts: 2,474
    edited 2013-09-20 06:26
    cgracey wrote: »
    Would it be very beneficial to have a complete REPS/REPD circuit for each task, or is one per cog enough? It would add about 1k flip-flops to the chip, which currently has 43k flip-flops.

    How much smaller would the cogs be if tasking support was removed altogether? Enough to squeeze in additional cogs? Okay, I know that's not likely to happen, but I know which of the two approaches I prefer.
  • Heater. Posts: 21,230
    edited 2013-09-20 06:42
    Seairth,

    As far as I can tell tasking support does not take up much space, compared to the size of a COG. It was a feature, at least the auto-scheduling part, that Chip implemented in a matter of days after everything else was done already.

    Having more COGs sounds great. Problem is if you double the number of COGs you halve the bandwidth to HUB RAM and double the maximum amount of time a COG has to wait for a HUB access slot.

    This is not good.

    The whole tasking concept is that our existing 8 COGs can run 8 threads like normal, only faster because it's a P2. But threads within a COG give you a chance to make better utilization of a COG by combining less processor-intensive tasks into a single COG.

    This is good.
  • Tubular Posts: 4,694
    edited 2013-09-20 07:08
    cgracey wrote: »
    What about this, Guys: Would it be very beneficial to have a complete REPS/REPD circuit for each task, or is one per cog enough? It would add about 1k flip-flops to the chip, which currently has 43k flip-flops.

    If I'm brutally honest, I'd say 1 REPS/D has proven to be enough. DJNZ instructions and similar do the job well for the other tasks. And would adding ~2.5% cramp the logic synthesis so that it might have an impact on overall efficiency/max clock rate?

    Having said that the task feature is very useful and is surely going to feature in many, many cogs. REPS/D instructions naturally become a favourite building block, and eliminating one more "gotchya" (that I admit falling into earlier) contributes to the overall pleasant experience of programming the Prop2. In other words I think it's a low priority, "aesthetic" decision, that would be nice but not essential.

    Andy's MAC suggestions seem a whole lot more useful.

    Another "aesthetic" that would be nice to see on Prop3, would be to allow a comment delimiting character, such as a semicolon, in your monitor. It would just ignore all characters to the end of the current line (for comment purposes). The monitor is useful and powerful, but being able to paste a "commented" script in, and having comments to explain say machine code ops, would make it even friendlier.
  • ozpropdev Posts: 2,792
    edited 2013-09-20 07:22
    The easiest way to get more cogs is to get another P2 and make it talk to the first one. :)

    Another major issue would be the power requirements of the chip, yikes!

    Multi-tasking is a great feature, and in the real silicon at full speed will blow us all away!
  • cgracey Posts: 14,134
    edited 2013-09-20 07:40
    Tubular wrote: »
    If I'm brutally honest, I'd say 1 REPS/D has proven to be enough. DJNZ instructions and similar do the job well for the other tasks. And would adding ~2.5% cramp the logic synthesis so that it might have an impact on overall efficiency/max clock rate?

    Having said that the task feature is very useful and is surely going to feature in many, many cogs. REPS/D instructions naturally become a favourite building block, and eliminating one more "gotchya" (that I admit falling into earlier) contributes to the overall pleasant experience of programming the Prop2. In other words I think it's a low priority, "aesthetic" decision, that would be nice but not essential.

    Andy's MAC suggestions seem a whole lot more useful.

    Another "aesthetic" that would be nice to see on Prop3, would be to allow a comment delimiting character, such as a semicolon, in your monitor. It would just ignore all characters to the end of the current line (for comment purposes). The monitor is useful and powerful, but being able to paste a "commented" script in, and having comments to explain say machine code ops, would make it even friendlier.

    Thanks, Lachlan and Everyone, for all your input.

    I think if I get rid of the complex FITACCx circuitry and put in shifters, there'd be no net increase from adding the REPS/REPDs per task. I'll see today. I've found one REPS/REPD to be adequate, but having only one does pose a 'gotcha' for the programmer.

    I love that semicolon idea! I wish I had thought of that earlier. Right now there's not a single long left in the monitor, but maybe I could scare a few up. It would probably take two longs to implement. I'll look into that.
  • cgracey Posts: 14,134
    edited 2013-09-20 07:47
    Seairth wrote: »
    How much smaller would the cogs be if tasking support was removed altogether? Enough to squeeze in additional cogs? Okay, I know that's not likely to happen, but I know which of the two approaches I prefer.

    The multitasking only added about 1% to the cogs. Cogs have those memories, too, which take a good amount of space.

    When we had our logic synthesized for fun at a 40nm-process node, it only took 0.89 square millimeters! If we could use a 40nm process, we could do 32 cogs of logic at 1GHz in 3.6 square millimeters. Right now, our 8 cogs take 10 square millimeters in the 180nm process.