I see the title of this thread got changed to, "Ruminations while awaiting an FPGA image." My grandmother always used to say, "A watched pot never boils." Perhaps if we avert our gaze ...
Last night (4am) I chased down the final bug that was inhibiting the fast-write setup for hub streaming. Fast-read had already been working for a few days, but now the whole system seems complete.
The next thing to do is to adapt the hub streaming to both hub exec and the NCO-driven pin/DAC I/O. Those NCO modes are going to be fun, because those are features which never existed on the prior Prop2.
The streaming is already set up so that you can create a loop in hub memory that automatically wraps for reading or writing. The only other thing we need is a way to redirect the start/size on a block loop. That would enable page-flipping, so to speak, for high-speed analog output, so that you can be writing one buffer while playing another - at 200MHz on the real chip, and hopefully 160MHz on the FPGA.
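The wrap-and-redirect idea above can be sketched as a toy software model. This is only an illustration: the class and method names (`LoopStreamer`, `redirect`, `next_sample`) are made up, and the rule that a new start/size takes effect at the next wrap is an assumption, not a description of the actual hardware.

```python
class LoopStreamer:
    """Toy model of a reader that wraps inside a block of hub memory.

    All names here are hypothetical; they do not correspond to real
    Prop2 instructions or registers.
    """

    def __init__(self, hub, start, size):
        self.hub = hub
        self.start = start
        self.size = size
        self.offset = 0
        self.pending = None  # (start, size) to switch to at the next wrap

    def redirect(self, start, size):
        # Assumption: the new block is only picked up when the loop wraps.
        self.pending = (start, size)

    def next_sample(self):
        value = self.hub[self.start + self.offset]
        self.offset += 1
        if self.offset == self.size:  # end of block: wrap around
            self.offset = 0
            if self.pending is not None:
                self.start, self.size = self.pending
                self.pending = None
        return value


hub = list(range(100, 108)) + list(range(200, 208))  # two 8-long buffers
player = LoopStreamer(hub, start=0, size=8)
first_pass = [player.next_sample() for _ in range(8)]   # plays buffer A
player.redirect(start=8, size=8)   # queue the flip while buffer A plays
second_pass = [player.next_sample() for _ in range(8)]  # still buffer A
third_pass = [player.next_sample() for _ in range(8)]   # now buffer B
```

In this model the writer is free to fill buffer B during `second_pass`, because the player does not touch it until the wrap.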
<humour mode: ON>
Yesterday I bought a large can of compressed air - you know those computer cleaning sprays - for my DE0-Nanos & DE2-115 in anticipation of ChipMas! <humour mode: OFF>
As difficult as it is to wait, I think all of us will find the wait worthwhile.
16 cogs and all that bandwidth with hubexec has me drooling.
The next thing to do is to adapt the hub streaming to both hub exec and the NCO-driven pin/DAC I/O. Those NCO modes are going to be fun, because those are features which never existed on the prior Prop2.
There is real potential for slimming down the Cogs even more. With buffering moving to the Hub like that, CogRAM is being freed up. Can 256 addresses work? Or, with HubExec, and presumably an instruction cache, is there a possibility for single-ported CogRAM?
Single ported CogRAM would slow things down too much.
256 is interesting, but that's likely to break backward-compatible operation with P1? That depends on how HubExec works.
P2 really needs to swallow ANY P1 project to be a clear superset; if some apps fail to fit, that's a disappointed customer.
8 @ 512 and 8 @ 256 could be a possible split, but I think COGS are pin-mapped now, so all need to be equivalent to keep pins equivalent.
Well, I'm looking towards a 32 core device. Drop-in software compatibility is of little concern. Keeping dual-porting with HubExec could be improved by having three source operands and one destination. Then fitting the instruction would be problematic, hence the need to reduce register set numbers.
Bill, I am talking about verbatim Cogs.
The one thing I would agree with is it's far too late to be discussing such changes.
Keeping dual-porting with HubExec could be improved by having three source operands and one destination. Then fitting the instruction would be problematic, hence the need to reduce register set numbers.
That may be closer than you think with the ALTDS instruction proposed.
It can handle two source and one destination operand.
You might be, but I think you'd be disappointed if it arrived. That halves the bandwidth to HUB again. It would run like a sloth. And you have just thrown away half the COG RAM so now you have no place to put fast code like an FFT for example. Disaster!
One could mitigate the reduced HUB bandwidth by adding more ports, widening buses etc. I'm not sure this scales well.
History, from the Intel 8080, through 8085, 8086, 186, 286, 386, 486, Pentium, whatever we call it today, and other examples, has shown us that backward compatibility is a very valuable thing. I for one would be very upset to find there was not an easy way to move my projects to the II. With half the COG space it's useless to me. I suspect a lot of useful Propeller objects would become useless. Great.
Yep, it's well past time to be making suggestions.
Heater, although more cogs would likely entail a reduction in HUB space, I guess you meant to write "half the COG" space, based on the line of reasoning you made. But whichever the case, you've definitely spelled out one of the advantages of keeping things at 2KB (512 longs, 9 address bits).
I don't mind people considering more cogs (even Chip did at one point), but I'd likely lose faith if we went back to 8. And as far as rules for posts here, I'd say that there are no rules other than mutual respect. But I definitely understand that decisions have to be made and then effort exerted to realize such decisions. Chip seems to be well along that path. I also understand that a window of opportunity exists in terms of available time.
Why I really replied was that I'm interested in the subject of indirect addressing (which ties in with address bits and the instruction set). I'm unclear if that's in the works for the new chip. Having to use self-modifying code seems quite ancient or inelegant for a modern chip (as others have said elsewhere). What would be required to have some kind of indirection, such as for table access? Could we give up the conditional bits to facilitate something? Could an indirect accessing mode make assumptions to "encode" an offset? For example, if the destination address was on a long boundary, then the offset could be some value (from conditional bits?) times 4, or something similar? It would seem that if some kind of assumption were made (yes, a limitation were imposed) that something could be done within one instruction cycle. Or maybe a setup instruction could be used. Perhaps something is already in the works. I think I recall talk about something but have forgotten. Does anyone know? And what are the limiting factors (instruction bits for operands, instruction cycle timing, pipelining (though that's partly gone, I think))? What's involved, so that we can think creatively about this? Or maybe self-modifying code isn't so inelegant as it seems. Maybe a macro could hide any inelegance somehow.
Sorry yes, I meant "With half the COG space it's useless to me". Well spotted. Post has been edited.
As for the forum rules and behavior, I like to think we all have mutual respect. Even if sometimes the debating gets a bit rough. Such is the nature of human interaction.
Thanks, Heater. Yes, I agree: we have a high degree of mutual respect, here (sorry if it sounded like I thought otherwise). This has to be one of the top forums anywhere! The exception--if there is one--might be the ranting and raving we fell into over hub access mechanisms and/or slot-sharing, but, thankfully, those "dark days" are behind us (assuming that the "egg beater" hub access scheme doesn't explode like the first wormhole ship did in the movie Contact). So why bring it up again (mutual respect, that is)? No reason. Just waxing on. Anyway, I always enjoy reading your posts (and those of other regulars/active contributors). Guess I sometimes pay too much attention, and that's why I noticed the "mis-speak." Most would have just left it alone, trusting that others would know what you meant. But I'm perhaps too pedantic for that and wanted some leverage for the indirect addressing tangent.
What would be required to have some kind of indirection, such as for table access? Could we give up the conditional bits to facilitate something? Could an indirect accessing mode make assumptions to "encode" an offset? For example, if the destination address was on a long boundary, then the offset could be some value (from conditional bits?) times 4, or something similar? It would seem that if some kind of assumption were made (yes, a limitation were imposed) that something could be done within one instructional cycle. Or maybe a setup instruction could be used.
Hmm...thinking some more, if we could conditionally give up the four-bit condition field and if the ZCI bits could be conditionally sacrificed (I think the current R will be gone), then that could free up 7 bits as an offset. So, there could be a move with offset instruction from a base register plus the offset to a destination register, as in movo dest, base, 7-bit-offset (movi may be taken). That would allow for long access within a table that could span a quarter of cog memory, wherein the offset is encoded in the 7 freed-up bits. That's likely not the first time that has been considered/suggested.
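The bit arithmetic behind that "quarter of cog memory" figure can be checked with a short sketch. The `movo` function below is a hypothetical model of the proposed instruction, not anything that exists in the chip; wrapping the address at 512 longs is also an assumption made here for illustration.

```python
COG_LONGS = 512          # 9 address bits, as stated earlier in the thread
COND_BITS = 4            # the four-bit condition field
FLAG_BITS = 3            # the Z, C and I bits
OFFSET_BITS = COND_BITS + FLAG_BITS   # 7 bits freed up
OFFSET_SPAN = 1 << OFFSET_BITS        # 128 longs reachable from the base


def movo(cog, dest, base, offset):
    """Hypothetical 'movo dest, base, offset': cog[dest] = cog[base + offset]."""
    if not 0 <= offset < OFFSET_SPAN:
        raise ValueError("offset does not fit in 7 bits")
    # Assumption: the effective address wraps within the 512-long cog space.
    cog[dest] = cog[(base + offset) % COG_LONGS]


cog = [0] * COG_LONGS
table = 0x100                          # arbitrary table location
for i in range(OFFSET_SPAN):
    cog[table + i] = i * i             # fill a 128-entry table of squares
movo(cog, dest=0x010, base=table, offset=5)
```

With 7 offset bits the table can span 128 longs, which is one quarter of the 512-long cog space.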
Yeah, that would mean adding an exception mode of sorts for the flag and condition bits. Anything that's seen as an exception or as breaking orthogonality is frowned on here, I know, and usually for good reason, as simplicity is a major win for the Prop. But some tradeoffs can be worth it. Here, my only argument in favor of such an exception is that it seems like there should be a way around self-modifying code (SMC) for table-like access, whether in this way or another. One wouldn't have to use the mode and could keep the standard usage of all those bits if willing to use SMC.
The thinking above is that the instruction opcode itself would determine whether to use the flag and condition bits as an offset. However, I suppose that, if a "magic bit" stored somewhere else in a cog indicated the usage of those bits, then such a magic bit could be set/cleared with a prior (and likely post) instruction for a block of code, such that a more general offset could be applied to other instructions, such as math instructions (I'm assuming one of the 7 (6 on the P1) opcode bits can't be sacrificed for that), but that's getting more complicated (setting/clearing) and I think table access would likely be the most important usage. So, I'm not currently advocating for that, but it could trigger an idea (in someone's mind) that is worthy of consideration.
INDA / INDB is about all the indirection we are going to get in the upcoming chip.
A generic scheme (use any register as an indirect register) requires WAY too much change at this point, and I for one do not wish to sacrifice the condition code bits, or half the register space.
Maybe in the next chip we can go to four IND registers.
Or, if the above scheme is not desirable in the grand scheme of things, perhaps the offset could be stored in a fixed (general purpose) register, either a cog register (fixed or user-specified with an instruction, but that begs the question) or a special-purpose register that doesn't have an address (in the sense of 0..511). Hmm...I kind of like the idea of having a special all-purpose register (kind of like a PC counter) somewhere for just such usage, such a register being brought into play by instruction opcodes that need to use it (i.e., implemented in silicon). The caveat for using this for table access (w/o SMC) is that it would entail a second operation to set the special register, but it would avoid self-modifying code (at the expense of speed and code space). But the offset generally needs to be changed anyway by an add instruction. In addition to a way of writing the special register (a mov instruction of some kind) there could be an add instruction used to change it. Well, I believe that Chip said that there's now more room for additional instructions (within the bits dedicated to the instruction set).
Edit/Update: Thanks, Bill. I composed this "Or..." post as you were posting. I haven't at all closely followed the INDA/INDB thing (though I recall seeing it) or the status thereof (and I'm a bit mixed up on the new versus old design for the new chip). Maybe what I just posted overlaps with INDA/INDB. Perhaps someone could summarize the details and/or update me/us on the status (yes, perhaps I could search/Google it, too). --Jim
INDA/INDB are indirection registers, and store the address of the register whose content is to be used. Basically, they are what you asked for.
There will also be a prefix instruction, INDS (?), that will allow auto increment/decrement of INDA/INDB and possibly other tweaks - we will know when the docs pop up.
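As a rough software picture of what "stores the address of the register whose content is to be used" means, here is a toy model. The `Cog` class, the `read_inda` method and the 9-bit wrap of the pointer are all assumptions for illustration; the real behaviour is whatever the docs end up saying.

```python
class Cog:
    """Toy register file with one indirection register (names are illustrative)."""

    def __init__(self):
        self.ram = [0] * 512   # 512 longs of cog RAM
        self.inda = 0          # holds the ADDRESS of the register to use

    def read_inda(self, post_increment=False):
        # Reading "through" INDA returns the register that INDA points at.
        value = self.ram[self.inda]
        if post_increment:
            # Assumption: the pointer wraps at 9 bits.
            self.inda = (self.inda + 1) & 0x1FF
        return value


cog = Cog()
cog.ram[0x080:0x084] = [11, 22, 33, 44]   # a small table in cog RAM
cog.inda = 0x080                          # point INDA at the table
total = sum(cog.read_inda(post_increment=True) for _ in range(4))
```

Walking a table this way needs no self-modifying code: only the pointer changes, never the instruction.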
... That halves the bandwidth to HUB again. It would run like a sloth. ...
One could mitigate the reduced HUB bandwidth by adding more ports, widening buses etc. I'm not sure this scales well.
Bandwidth is no problem. More ports is straightforward enough. I was personally very surprised how easy, and seemingly cheap, it was for Chip to put in the crosspoint switch. Hub burst duration becomes increasingly longer though.
History, from the Intel 8080, through 8085, 8086, 186, 286, 386, 486, Pentium, whatever we call it today, and other examples, has shown us that backward compatibility is a very valuable thing. ...
Not relevant here. General computing is a different world. We ditched compatibility with Prop1 a long time back.
When I brought up reducing number of general registers I was thinking of moving all code to separate Cog memory, namely an instruction space/cache, ie: Harvard.
I don't think INDA and INDB exist anymore in the current design. Indirect addressing is now done with the ALTDS instruction, which lets you alter the D or S field of the following instruction, and maybe allows some modification of the registers that provide the value for the following D or S field. Something like:
ALTS mypointer++ 'use the content of mypointer for the following S field, increment mypointer
MOV tmp,0-0 'indirect access of source
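To make the effect of that two-instruction sequence concrete, here is a minimal interpreter sketch. It assumes the behaviour described in the post: ALTS substitutes the low 9 bits of the pointer's content for the S field of the next instruction, then bumps the pointer. The tuple encoding and the register addresses are invented for the example.

```python
MYPOINTER, TMP, TABLE = 0x020, 0x021, 0x040   # arbitrary register addresses


def run(program, regs):
    """Execute a list of (op, d, s) tuples; only ALTS and MOV are modelled."""
    alt_s = None
    for op, d, s in program:
        if op == "ALTS":
            alt_s = regs[d] & 0x1FF   # pointer content replaces the next S field
            regs[d] += s              # s is the post-increment (assumed encoding)
        elif op == "MOV":
            src = s if alt_s is None else alt_s
            regs[d] = regs[src]
            alt_s = None              # the substitution applies to one instruction
        else:
            raise ValueError(op)


regs = [0] * 512
regs[TABLE], regs[TABLE + 1] = 7, 9
regs[MYPOINTER] = TABLE

program = [("ALTS", MYPOINTER, 1),    # ALTS mypointer++
           ("MOV", TMP, 0)]           # MOV tmp,0-0
run(program, regs)
first = regs[TMP]
run(program, regs)
second = regs[TMP]
```

The MOV instruction itself is never rewritten in memory; the 0-0 placeholder is overridden only in flight, which is the difference from classic self-modifying code.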
I'm no expert, but you saying that does not convince me. If it were so straightforward, how come my quad core Intel does not have four buses out to RAM? How come Amdahl's law exists? How come the world speaks so much of the "von Neumann bottleneck"? How come XMOS don't do this?
Not relevant here. General computing is a different world.
Is it?
Note that I have used and seen used the first 7 generations of the Intel family in real-time embedded systems (Well OK not the 8080 personally).
We ditched compatibility with Prop1 a long time back.
Did we?
The current Prop II design may not be a drop-in replacement for the Prop 1 software-wise. Code will need "porting". The last time I expressed concerns about code compatibility I was assured that Propeller 1 PASM could be tweaked to run on a Propeller II without a total rewrite and re-architecting. This is very important in terms of reusing much existing code; the Propeller is particularly in need of this, as it has no hardware peripherals and requires that code in order to be useful to people out of the box.
Comments
Great news. Thanks for the update.
David,
Nice title update.
It will work for analog and pin output, and pin input. We'd have to have some flash ADC to realize analog input at those speeds.
-Phil
Any difference between saying "pin input" and "digital input"? I read that thrice. Guess flash/video ADC's are waaay out of the question, or are they?
Moo.
C.W.
Yes, and if you don't have the stomach for it then you shouldn't be here
On that note, cows have 4 cores, er, stomachs!
We're going to need to milk that!
C.W.
Thanks for the update.
(Yet another suggestion)
keep all 16 with 512... remember everyone was asking "no more changes" and "keep all cogs the same"
HubExec becomes the norm in this scenario.
Thanks, Bill. Good to know!
Andy
Discussion of ALTS:
http://forums.parallax.com/showthread.php/156242-Question-about-ALTDS-implementation-in-new-chip
See the ALTDS instruction as an indirect replacement.
Edit: Thanks Bill for the link.