Might it be best to always start cogs in hub-exec mode?
I'm thinking about it for the following reasons:
1) It would simplify debug-interrupt initialization, as the pre-run setup technique could be standardized.
2) It would simplify COGINIT - no bits needed to specify cog/lut/hub
3) Loading cog or LUT with code is only two instructions: SETQ(2) #longs + RDLONG begin,PTRB. Affords flexible cog/LUT load begin address and length. One more instruction JMPs into cog/LUT. So, three instructions get you there with flexibility.
4) All code starts in the hub, anyway.
5) Feels simpler.
6) It puts your COG initialization and "one time" code in HUB memory space instead of COG memory space.
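To picture the load sequence in point 3, here is a toy model in Python. It is only a sketch, not P2 code: `block_load` is a hypothetical stand-in for the SETQ + RDLONG pair, hub and cog memory are plain lists, and the sizes and addresses are made up for illustration.

```python
COG_LONGS = 512  # cog RAM size in longs, as on the Prop 1


def block_load(hub, cog, hub_index, cog_addr, count):
    """Copy `count` longs from hub[hub_index:] into cog[cog_addr:].

    Hypothetical helper standing in for SETQ #longs + RDLONG begin,PTRB.
    Returns the cog address a following JMP would target.
    """
    if hub_index + count > len(hub):
        raise ValueError("block runs past the end of hub memory")
    if cog_addr + count > len(cog):
        raise ValueError("block does not fit in cog RAM")
    cog[cog_addr:cog_addr + count] = hub[hub_index:hub_index + count]
    return cog_addr


hub = list(range(1000, 1064))  # pretend hub image: 64 longs (made-up data)
cog = [0] * COG_LONGS

# Load 8 longs to an arbitrary cog address; nothing forces a start at 0.
entry = block_load(hub, cog, hub_index=16, cog_addr=0x40, count=8)
```

Only the requested longs are touched, and the entry point is whatever address the loader picked, which is the flexibility being argued for.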
I haven't thought of a reason to start in COG memory...but I just woke up to this!
Simpler is good from our side and from your side!
Yeah, it makes things a lot cleaner, all around. I will attack this first thing Monday morning. This is nice, because it actually REMOVES logic in a few different places and makes things very flexible. No need to start your cog code at 0 and load up the whole memory. Just load what is needed, if anything, right where you want it at 1 long per clock into cog and/or LUT. The RDLONG-repeat instruction makes it practical to go this direction.
This will make things easier for people to understand, as well, because there aren't three different potential startup scenarios with different rules - just one, now. Everything else is exceptional and you will be forced to consider the ramifications, as you are the one making it happen.
"more pins" is what everyone has been shouting since the P1 arrived and everyone was eagerly expecting the 64 I/O version of the P1 to be "any time soon"
Just that and more RAM would have made many people happy years ago.
Does P1 really have too few pins for "normal" microcontroller applications or is it just a problem when you try to use it to emulate classic computers and need more than 32k of RAM?
All I can say is that it was not just me and the other crazy emulator builders who were clamouring for more pins on the P1.
A cursory glance at the spec shows that each 32 bit processor has an average of 3.5 I/O pins in the normal configuration. Doesn't that seem a bit meagre?
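The 3.5 figure can be reproduced with a line of arithmetic. This sketch assumes the "normal configuration" means 4 of the P1's 32 I/O pins are taken by the boot EEPROM and the serial link; that split is my assumption, not something stated in the post.

```python
TOTAL_IO_PINS = 32  # P1 I/O pins
RESERVED_PINS = 4   # assumption: 2 for boot EEPROM, 2 for serial
COGS = 8

free_pins = TOTAL_IO_PINS - RESERVED_PINS
pins_per_cog = free_pins / COGS
print(pins_per_cog)  # 3.5
```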
More IO definitely before more ram or cogs.
Seems the first thing the FPGA P1V adopters did was enable PortB for 64 IO pins.
I have an application that needs 128+ right now.
You will never have enough pins. Having stated this, you always have enough pins. Because you will only solve the problems you can solve with the given hardware. I never had enough cogs, but I have a standard process to communicate between the cogs in one chip. Now I have transparently expanded the communication process to cross the chip border, and it turned out that I can very simply increase the number of I/Os. The communication mechanism copies a register from chip A to chip B: on one side I use the address INA or OUTA, on the other side just a memory cell. That's all. I had no idea it was so simple. That is the funny thing with the Propeller: whenever there is a problem, there is a simple solution. At least as an idea. As Chip just showed: default start from hub solves a lot of problems, you only have to find this solution.
Yeah, the P1 could drive 64 pins with ease, and had we got them, they would have been used for all sorts of things. Heater has it about right. A few pins per COG is a significant constraint given what COGS can actually do.
@Chip: Yes! Start in HUBEXEC. It's simple. We like simple.
You will never have enough pins. Having stated this, you always have enough pins. Because you will only solve the problems you can solve with the given hardware.
Too true. At the end of the day we have to work with what we have in reality or can reasonably create. And so there are limits to what we can achieve that we might resign ourselves to.
Still, when discussing processor architectures, or the design of anything I guess, we have ideas of "harmony" and "balance".
That sounds a bit zen or arty-farty but it's not really, it has practical outcomes:
If you are going to have multiple processors accessing a central RAM in a round robin fashion, how many processors makes sense? 1, 2...8, 16, 32...a million? I think you will agree there is a sweet spot here.
How much RAM should we have, 1K, 2K...32K, 64K,...4GB? I will argue 1K is obviously too constraining, 4GB is pointless given the speed of the Prop. Again there is a sweet spot.
And so on, and so on.
In all these cases the usefulness of a feature is constrained by other features of the overall design. A balance becomes optimal.
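The round-robin trade-off can be put in rough numbers. This is a sketch under a simplifying assumption: every processor gets one fixed-length hub slot per rotation, and `clocks_per_slot=2` is chosen only so that 8 processors give a 16-clock rotation. The function names are made up for illustration.

```python
def rotation_period(n_cogs, clocks_per_slot=2):
    """Clocks for the hub to visit every cog once in strict round-robin."""
    return n_cogs * clocks_per_slot


def hub_share(n_cogs):
    """Fraction of total hub bandwidth each cog gets."""
    return 1 / n_cogs


for n in (1, 2, 8, 16, 32):
    print(n, rotation_period(n), hub_share(n))
```

Each added processor lengthens the wait for a hub slot and shrinks the per-processor share, so at some point more cores stop paying for themselves.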
So how many pins is the sweet spot of balance and harmony for a machine with eight 32 bit cores?
Of course we should not discuss these "aesthetics" wrt the PI which is done and dusted but the P II.
Chip,
I really, REALLY, like the idea of starting in hubexec mode. Also, I assume that the "bootup" sequence doesn't read all 512KB from flash/eeprom into hub ram and then go. It just loads in some initial chunk of the flash/eeprom and starts that. Then that code does the actual loading up of the rest.
That will make for a much more flexible startup solution, and allow for better hub memory usage. We could do things like read in cog/lut resident code and get it all loaded into the cogs, then load in the hub resident code overlaying the same hub memory, leaving more hub memory free for vars/buffers/etc.
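The overlay idea above can be sketched as three steps. This is a toy Python model with made-up image contents and sizes; `load_region` is a hypothetical helper that stands in for reading a chunk from flash/EEPROM into hub RAM.

```python
def load_region(hub, base, image):
    """Write `image` into hub starting at `base` (models a flash/EEPROM read)."""
    if base + len(image) > len(hub):
        raise ValueError("image does not fit in hub")
    hub[base:base + len(image)] = image


hub = [0] * 64
cog = [0] * 16

cog_image = [0xC0 + i for i in range(8)]  # made-up cog-resident code
hub_image = [0xA0 + i for i in range(8)]  # made-up hub-resident code

BASE = 8
load_region(hub, BASE, cog_image)                          # 1: stage cog code in hub
cog[0:len(cog_image)] = hub[BASE:BASE + len(cog_image)]    # 2: copy it into the cog
load_region(hub, BASE, hub_image)                          # 3: reuse the same hub longs
```

After step 3 the cog still holds its code, and the hub longs that staged it are free for the hub-resident image, so nothing is stored twice.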
Of course, because it's flexible, the above is just an example, everyone is free to do whatever they want.
Chip, Absolutely love starting in hubexec mode!
Yes, starting the cogs in hubexec makes real sense!!!
Only necessary to have a standard hubexec routine to perform a cog load and jmp cog $0 in hub ram/rom and this can be soft (or we just include an object for this in the library).
1. With a tiny stub block in hub we can load just how much we need to load into a cog which will be faster to get going.
2. We don't have to start the cog at address $000.
Means we can run the same object in a number of cogs, and differentiate by starting at a different address, or we can load the object and patch load the differences to each cog individually.
3. We can start a cog in hubexec and actually not load anything into cog (ie totally run in hubexec mode) - extremely fast startup. This makes particular sense for initial P2 bootup.
And, I am thinking of many other possibilities that we cannot currently perform
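Point 2 (same object in several cogs, different entry address or a few patched longs) might look like this in a toy model. `start_cog` and the image values are invented for illustration; this is not an existing API.

```python
def start_cog(shared_image, patches, entry):
    """Build one cog's RAM from a shared image plus per-cog patches.

    Returns (cog_ram, entry); `entry` is where execution would begin.
    """
    cog = list(shared_image)  # copy, so the shared image stays untouched
    for addr, value in patches.items():
        cog[addr] = value
    return cog, entry


shared = [0x100 + i for i in range(8)]  # made-up shared object code

cog_a, entry_a = start_cog(shared, {7: 0xAAAA}, entry=0)
cog_b, entry_b = start_cog(shared, {7: 0xBBBB}, entry=4)
```

Both cogs run the same object, but each sees its own patched long and starts at its own address.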
Caveats/Questions:
1. COGSTART/COGINIT will have to clear the special registers in cog - or could it be done at BOOT and COGSTOP ???
2. Will remainder of COG RAM/registers & LUT be cleared?
If possible, suggest NO - we can then in fact hold what was in Cog RAM on a previous start ???
-or do you power down the whole cog and its RAM & LUT at boot and/or COGSTOP ???
3. HUB RAM - will it be cleared on BOOT ???
Again suggest NO - it's been nice with the P2 FPGA to be able to reboot and examine the hub ram by jumping into the monitor.
How is the HUB ROM going to work? Will it be part of the HUB Address space or will it switch out after bootup ???
PINS:
Yes I have always wanted more pins, commercial products included (where I have used 2 props together to get more pins to overcome HUB RAM shortage by using external SRAM). This project uses a total of 3 props.
2. Will remainder of COG RAM/registers & LUT be cleared?
If possible, suggest NO - we can then in fact hold what was in Cog RAM on a previous start ???
-or do you power down the whole cog and its RAM & LUT at boot and/or COGSTOP ???
3. HUB RAM - will it be cleared on BOOT ???
Again suggest NO - it's been nice with the P2 FPGA to be able to reboot and examine the hub ram by jumping into the monitor.
RAM memory clear is extra logic, and has to be done sequentially.
so that means NO
So how many pins is the sweet spot of balance and harmony for a machine with eight 32 bit cores?
Back when the P1 started DIP40 was the mainstream, low cost package.
There was a 64 pin DIP (same size & width as DIP40) that I think Hitachi used, but it was not mainstream.
I see you can still buy DIP 64 pin 1.78mm sockets, but at eye watering prices!
( so probably a good thing Parallax did not choose that )
As there is separate power for each 8pin group,
It would be really nice if one/any bank would allow gpio retention and counter active with the use of a supercap.
And when you add one of these new 1Hz mems-osc, your counter-rtc is still running.
Extra bonus: this pin group Vdd is also connected to an 8KB hub ram section for symbolic memory retention.
I guess NMOS at the right places will block reverse current?
The 64 pin DIP of the Motorola 68000 is a thing to behold. I have a few here. And I would have been overjoyed to get a Prop I in that package.
Practically I realize that would be far too expensive and probably not in much demand.
Still, there are SMD packages that will allow 64 I/O pins whilst still solderable by hand. And bread board friendly modules with EEPROMs, XTALS, USB/serial, coupling caps, etc would have become available.
"more pins" is what everyone has been shouting since the P1 arrived and everyone was eagerly expecting the 64 I/O version of the P1 to be "any time soon"
Just that and more RAM would have made many people happy years ago.
I couldn't agree more: you can never have too many pins. That said, if the P1 had had 64 pins from the beginning, probably no-one would have bothered writing a 1-pin TV driver or a 1-pin PS/2 driver...
On the other hand, I also like the idea of the DIP Propeller where P8-P23 are cut off.
So let's get a P2 with as many pins as possible/practical for now, and then maybe consider reduced pin versions?
===Jac
2. Will remainder of COG RAM/registers & LUT be cleared?
If possible, suggest NO - we can then in fact hold what was in Cog RAM on a previous start ???
-or do you power down the whole cog and its RAM & LUT at boot and/or COGSTOP ???
3. HUB RAM - will it be cleared on BOOT ???
Again suggest NO - it's been nice with the P2 FPGA to be able to reboot and examine the hub ram by jumping into the monitor.
RAM memory clear is extra logic, and has to be done sequentially.
so that means NO
The current P1 & P1V clears COG RAM on a COGINIT/COGSTART. Yes that requires logic, but that's how it is currently done, and it extends into the Special Registers which is how they are cleared.
So that doesn't mean NO. IMHO it is a valid question.
How is the HUB ROM going to work? Will it be part of the HUB Address space or will it switch out after bootup ???
IIRC, ROM is no longer patched-RAM, but instead is true (serial?) ROM (did Chip mention 16k bits or bytes ?) that loads via a tiny state engine
I missed the part about it being serial ROM. That would make the P2 quite slow to boot up
At one stage Chip was considering just loading the Boot ROM into COG 0. This would have to be a small ROM though.
So I am asking how it is going to work now.
1. Very few applications use all I/O pins, therefore most of the time there are some left over.
For the P1 it seems to me that the biggest problem (at least the one I run into, with every board I'm using) is that there are never enough I/O pins. As soon as you start to attach RAM, for example, before you know it you'll have to choose between VGA or something else because there are not enough pins to have it all. Of course, with the P2 there are more pins and the situation may be totally different. I have been waiting for those pins.
-Tor
Absolutely, it's easy to find applications that exceed the capacity of any chip. I primarily use the Prop for smaller, simpler applications where the ability to get a lot done in a short time using SPIN and the OBEX allows me to roll a relatively inexpensive custom board for much less than a traditional PLC. I'm even competitive on most one-offs.
So, again, I say some simple asynchronous PEEL-like logic between pins would enable some amazing applications, and cut the BOM on almost everything I make...
I'm no chip designer. Is this easy? Possible? Useful outside the Wonderland between my ears?
Configurable logic is certainly possible but it takes a lot more real-estate for the equivalent features, 10x maybe? Just compare the transistor count of FPGAs being used to contain one Prop2 vs the target transistor count of the finished Prop2. And if you take away block RAM transistors which will be a 1:1 mapping then the ratio will be pretty bad.
From Chip's earlier posts I estimate an average gate contains 10 transistors, with a target gate count of 300k, that makes 3M transistors for all the logic. Chip has allowed for 4M in real-estate.
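Spelling out that estimate, using only the figures quoted in the post (they are rough estimates, not confirmed numbers):

```python
TRANSISTORS_PER_GATE = 10  # average, as estimated in the post
TARGET_GATES = 300_000     # target gate count from the post
BUDGET = 4_000_000         # transistors allowed for in real-estate

logic = TRANSISTORS_PER_GATE * TARGET_GATES
headroom = BUDGET - logic
print(logic, headroom)  # 3000000 1000000
```

So on these numbers about a quarter of the logic budget is spare, which is not much room if configurable logic really costs something like 10x per equivalent feature.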
Funnily, a quick google doesn't give me any examples of transistors per LE for FPGAs, but it must be in the 100's once you include the config and routing.
I've been systematically ignoring the hubexec conversation... thinking that it was something for someone else... doing something I didn't really understand for people that I don't know in a market segment that I wished would just grow up and learn the right way to do things... and now you are telling me that I was wrong the whole time?
Well that's a bag of turtles.
Anyone other than Chip care to provide a quick refresher?
It's pretty simple. With the Prop 1, instructions in a cog's 512-long memory are executed by the hardware. With the Prop 2, instruction memory is extended, with addresses 0-511 still being cog memory. Addresses 512-1023 are LUT memory (a secondary cog memory that can be used as a CLUT (color lookup table) for video). Addresses 1024 and above are hub memory, with a small cache helping to speed things up. Since instructions normally use only 9 bit addresses, some instructions have been added with larger address fields. Effectively all of hub memory is available for nearly full speed instruction execution. Jumps, for example, require that the cache be reloaded, so there's a wait for the hub to be available for the cog executing the instructions.
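The address map described above, as a small Python sketch. The boundaries are the ones given in the post; the function names are made up for illustration.

```python
def region(addr):
    """Return which memory an instruction address falls in."""
    if addr < 0:
        raise ValueError("address must be non-negative")
    if addr < 512:
        return "cog"
    if addr < 1024:
        return "lut"
    return "hub"


def fits_9_bit(addr):
    """True if the address fits a normal 9-bit instruction address field."""
    return 0 <= addr < (1 << 9)


for a in (0, 511, 512, 1023, 1024):
    print(a, region(a), fits_9_bit(a))
```

Anything from 512 upward no longer fits the 9-bit field, which is why the instructions with larger address fields were added.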
The Prop 1 has been doing something similar for C code for some time now (LMM). There's a speed penalty since there's an interpreter involved, but it's the same idea. The Prop 2 is doing it directly in the hardware and doing it better.
What's the point of non-long-aligned hubexec? From what I understand, it just complicates addressing, and makes even streamer-aligned jumps take a tick longer because the first instruction can't be run until two longs are fetched because it spans two longs. Of course, I could just ignore it by putting "long" at the top of all my PASM...
Otherwise, I'm very excited for the P2 and wish I had the time and money for an FPGA!
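The alignment cost raised above comes down to how many hub longs a 4-byte instruction touches. A minimal sketch, assuming byte addressing and 4-byte longs (`longs_needed` is a made-up name):

```python
LONG_BYTES = 4


def longs_needed(byte_addr):
    """Number of hub longs a 4-byte instruction at `byte_addr` spans."""
    first = byte_addr // LONG_BYTES
    last = (byte_addr + LONG_BYTES - 1) // LONG_BYTES
    return last - first + 1


for addr in (0, 1, 2, 3, 4):
    print(addr, longs_needed(addr))
```

An aligned instruction needs one long; any other start address needs two, which is the extra fetch being objected to.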
We have been executing code from HUB since forever. First with the byte codes of the Spin interpreter, later with the LMM codes of the C compilers or the byte codes of Forth or even the Z80 instructions of the Z80 emulators, etc, etc.
HUBEXEC is basically the same except the COG itself does the fetch, decode, execute loop for itself rather than have some code in COG doing it. Much faster. Of course HUBEXEC works with actual Propeller instructions rather than any arbitrary codes we may dream up and write software to handle.
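For anyone who wants to see what "some code in COG doing it" means, here is a toy fetch-execute loop in Python. The `(op, arg)` tuples are invented for illustration and are not Propeller instructions. Under LMM or a byte-code interpreter, a loop like this is itself cog code; with HUBEXEC the hardware does the equivalent directly.

```python
def run_interpreter(hub, pc=0):
    """Software fetch-decode-execute over a program held in `hub`.

    Returns (accumulator, number_of_fetches).
    """
    acc = 0
    fetches = 0
    while True:
        op, arg = hub[pc]  # fetch from "hub"
        pc += 1
        fetches += 1
        if op == "add":    # decode and execute
            acc += arg
        elif op == "jmp":
            pc = arg
        elif op == "halt":
            return acc, fetches
        else:
            raise ValueError(f"unknown op {op!r}")


program = [("add", 2), ("jmp", 3), ("add", 100), ("add", 5), ("halt", 0)]
result = run_interpreter(program)
print(result)  # (7, 4)
```

Every step of that loop costs interpreter instructions on top of the work itself, which is where the speed penalty comes from.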
I'm curious, in case I ever grow up, what is the "right way to do things" ?
@Elecrodude,
I agree. I see no point in going out of the way to support non-aligned instructions, for HUB EXEC or any other case.
Yes, it's bags of turtles, all the way down...