So SNAPCNT just spins doing nothing for a while and wastes processing power.
Given that HUB accesses are randomized, how does this help get any kind of determinism back again?
It gives determinism to your CODE, exactly the same as WAITCNT can give determinism to your CODE now in P1.
Unlike WAITCNT, SNAPCNT has multiple exit conditions, at k * (M/fSys).
Tiny values, <<16, would not make sense mixed in with HUB access, but values > 16 would allow code loops containing HUB Randoms to become cycle-deterministic.
Easy to use, you set a SNAPCNT and that sets your loop rate.
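To make the "multiple exit conditions" point concrete, here is a minimal Python sketch. SNAPCNT is only a proposed instruction, so the semantics are an assumption read off the k * (M/fSys) description: it releases on the next multiple of M system clocks. The names snap_exit and run_loop, the 0..15 clock hub stall, and the 12-clock loop body are all hypothetical.

```python
import random

def snap_exit(cnt: int, m: int) -> int:
    """Clock on which an assumed SNAPCNT m releases: the next multiple of m at or after cnt."""
    return -(-cnt // m) * m  # ceiling division, then scale back up

def run_loop(m: int, iterations: int, seed: int = 1) -> list[int]:
    """Simulate a loop with a randomly stalling hub access, closed by SNAPCNT m."""
    rng = random.Random(seed)
    cnt = 0
    exits = []
    for _ in range(iterations):
        cnt += rng.randint(0, 15)  # assumed: hub access stalls 0..15 clocks
        cnt += 12                  # assumed: fixed-length loop body
        cnt = snap_exit(cnt, m)    # loop exit snaps to a multiple of m
        exits.append(cnt)
    return exits

exits = run_loop(m=32, iterations=8)
periods = [b - a for a, b in zip(exits, exits[1:])]
print(exits)    # every exit lands on a multiple of 32
print(periods)  # the loop period is constant despite the random stalls
```

As long as the worst-case work per pass (27 clocks here) stays below M, every pass takes exactly M clocks, which is the determinism being described.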
I would call those 'less-stalling' rather than non-stalling, as they allow code to fill in some gaps, but done in a loop, they will still stall if the loop is too fast, i.e. there is no free lunch.
I think the system needs both less-stalling RD/WR and SNAPCNT, as they improve different areas.
Ah, you knew what I meant, but I suppose not everyone else might have.
Just to reiterate on a point about "less-stalling" RD/WRx: they're also handy when your loop is longer than 8 instructions.
I do like Bill's write buffering idea...
Maybe there could be a read buffer too? Perhaps you ask to read some location with one instruction and then 20 clocks later are able to get it?
I'm not sure yet. It might need to buffer a block first in 16 cycles, then transfer that to the shifter and load the next block. That would take a lot of flops, though. I need to map out the timing.
You said your nibble adder was good for odd-N values, right?
Yes, Odd-N values are relatively simple in Nibble_Adder, but it is worth chasing Even-N (certainly for HW (DMA) moves).
I still don't get it. Can you present the simplest possible use of SNAPCNT that I can think about?
Suppose you have code that needs 200ns cycle-precise granularity, and it has a Hub access (that's a P1-equivalent time, but P1 can fit fewer opcodes).
Choices are
a) (Simplest) Just do the Hub Access, and run any code up to ~11 opcodes, including one SNAPCNT 40.
The loop always snaps/rounds to multiples of 40 SysCLKs (200ns).
b) If you want to save SNAPCNT in the loop, use the mode where it 'attaches' to the HUB ACCESS.
Now HUBACCESS occurs and exits snapped to 40 SysCLKs, and 11 opcodes of other loop work can be used.
Both forms will allow easier porting of P1 code, as this behaviour is closer to P1
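The clock budget behind this example can be checked with a few lines of Python. This is only a sketch: the 2 clocks per opcode and the 16-clock worst-case hub wait are assumptions chosen to match the "~11 opcodes" figure, not confirmed numbers, and the helper names are hypothetical.

```python
def clocks_in(period_ns: int, fsys_mhz: int) -> float:
    """System clocks that elapse in period_ns at a clock of fsys_mhz."""
    return period_ns * fsys_mhz / 1000

def loop_fits(budget_clocks: float, opcodes: int,
              clocks_per_opcode: int, worst_hub_wait: int) -> bool:
    """True if the opcodes plus the worst-case hub wait fit inside the snap period."""
    return opcodes * clocks_per_opcode + worst_hub_wait <= budget_clocks

budget = clocks_in(200, 200)     # 200ns at 200MHz -> 40.0 clocks, i.e. SNAPCNT 40
p1_budget = clocks_in(200, 80)   # the same 200ns on an 80MHz P1 -> 16.0 clocks

# Assumed: 2 clocks per opcode, hub wait of at most 16 clocks.
fits = loop_fits(budget, opcodes=11, clocks_per_opcode=2, worst_hub_wait=16)
too_long = loop_fits(budget, opcodes=13, clocks_per_opcode=2, worst_hub_wait=16)
print(budget, p1_budget, fits, too_long)
```

If the loop body ever exceeds the budget, the snap would land on the following multiple of 40 and the loop rate would drop to 400ns for that pass.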
Maybe Chip can give a simple table of MHz vs P1 Equivalent Cycles ?
I think at ~110MHz the P1+ fits about the same number of opcodes, and at ~200MHz it can do over 5x as many opcodes in the same 200ns time example.
Can someone remind me what was wrong with using a 128-bit or 256-bit bus? It seems like it is sooooo much cleaner than the new bus idea. How much more die space does it need than all the muxes and circuitry for special instructions that will be needed to support the new bus idea?
I think the fundamental differences are
a) New method has a much higher bandwidth, and the old design was a bottleneck.
b) Chip says the new bus needs less power. I think that is because the Wide-Bus read 128b every time, even if the opcode needed just a Byte from that read.
The new bus always reads 32, and if it needs a Block, simply uses 16 adjacent clocks. That change means bits are no longer read and then discarded, which is what wasted power.
c) The new BUS opens DMA style transfers, which can use fSys/N of the available spokes.
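Point b) can be put in numbers with a short Python sketch. It only counts bits fetched versus bits needed; it is not a power model, and the function name is hypothetical.

```python
def bits_discarded(bus_width_bits: int, needed_bits: int) -> int:
    """Bits fetched but not used when needed_bits are read over a bus of the given width."""
    reads = -(-needed_bits // bus_width_bits)  # whole bus reads required
    return reads * bus_width_bits - needed_bits

byte_on_wide = bits_discarded(128, 8)      # 120 bits thrown away per byte read
byte_on_new = bits_discarded(32, 8)        # 24 bits thrown away per byte read
block_on_new = bits_discarded(32, 16 * 32) # a 16-long block wastes nothing
print(byte_on_wide, byte_on_new, block_on_new)
```

The wide bus only breaks even when the code really wants all 128 bits; for byte, word, and long accesses the 32-bit read discards far less.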
16 LUTs (one per cog) would cost 1.15 square mm of silicon. I don't know if I'd make stacks out of them, as it complicates the cog more, and I think they should remain just LUTs for the sake of simplicity. They'd need to be writeable, of course, but maybe not readable.
Thanks, Chip. BTW, I don't normally say thanks unless there's something to add (to avoid thread clutter)
So, as a rough estimate, if 128KB takes 5.7 sq. mm (though this likely changed with the new memory layout) and there's 4 times that amount of hub ram, then that's 22.8 sq. mm. And if that's 70% of the available die area (not sure about the pad situation), then the total die area available is 32.57 sq. mm. TDAA = 5.7 x 4 / 0.7 = 32.57.
Then, for the ratio of area needed for separate LUT's to TDAA, we have 1.15 sq mm (from Chip, above) / 32.57 sq. mm = 3.53%, if I've reasoned correctly.
As for a cog area ratio, well, for the cog area, I guess we have 32.57 sq. mm x 0.30 / 16 cogs = 0.6107 sq. mm per cog, which is 1.875% of TDAA (0.6107 / 32.57) used per cog. So, the space needed for 16 LUT's is a little under 2X that needed for a single cog (though see the poor assumption involved below).
By the way, the above analysis doesn't take into consideration that Chip indicated that the HUB 2.0 "spinning" circuitry would take a couple cogs worth of logic, which would likely alter the above 70% and 30% figures for die space consumed by ram and logic, respectively. Also, the above calculations assume that the 30% for logic is all for cogs, when some of it is for smart pins and some other "overhead." But this should provide a rough estimate, though it could be off by a factor of, say, two.
So, maybe that 3.53% figure can help us better consider the cost involved vs. advantages. Still, we would do well to factor in Chip's statement about what logic layout seems to "flow."
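For anyone who wants to redo the arithmetic from the stated inputs, here is a small Python sketch. The inputs (5.7 sq. mm per 128KB, 4x that much hub RAM, a 70%/30% RAM/logic split, 16 cogs, 1.15 sq. mm for 16 LUTs) are the rough figures from the posts; the variable names are hypothetical.

```python
HUB_128KB_MM2 = 5.7   # area quoted for 128KB of hub RAM
RAM_FRACTION = 0.70   # assumed share of the die taken by RAM
LUTS_MM2 = 1.15       # area quoted for 16 LUTs (one per cog)
COGS = 16

hub_mm2 = HUB_128KB_MM2 * 4                         # 4x 128KB of hub RAM
total_mm2 = hub_mm2 / RAM_FRACTION                  # total die area available (TDAA)
lut_share = LUTS_MM2 / total_mm2                    # LUT area as a fraction of TDAA
cog_mm2 = total_mm2 * (1 - RAM_FRACTION) / COGS     # logic area per cog, if all logic were cogs
cog_share = cog_mm2 / total_mm2                     # one cog as a fraction of TDAA
luts_vs_cog = LUTS_MM2 / cog_mm2                    # 16 LUTs measured in cog-equivalents

print(f"TDAA        = {total_mm2:.2f} sq. mm")
print(f"LUT share   = {lut_share:.2%}")
print(f"cog area    = {cog_mm2:.4f} sq. mm ({cog_share:.3%} of TDAA)")
print(f"16 LUTs     = {luts_vs_cog:.2f} cogs' worth of area")
```

Changing RAM_FRACTION shows how sensitive the per-cog figure is to the 70%/30% assumption.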
...the total ALU size w/24-bit multiplier is now 0.25 square mm per cog.
Please consider the above just rough approximations and, of course, I may have made some big mistakes or bad assumptions (I often do, like my assuming that the separate LUT's would imply providing a secondary (primary?) usage for stacks).
We just got back from a father/son campout and I had a realization when I woke up for good at 3:45am about how to proceed with the video/hubexec buffering: A 20-stage FIFO in the cog that spits out hub longs at any rate at or below Fsys would simplify video quite a bit and provide hubexec instructions. And it would only take 640 flipflops.
I realized a FIFO can be built with each stage mux'ing either the hub read or the above FIFO stage into its inputs. Then we have a 5-bit counter to keep track of how many longs are stacked up, and what stage gets the next write. This would even bust out of the 16-long block constraint, since you can't really get started until you have your initial data, anyway. With this, you can get started with video or hub exec as soon as you get the first long, because the second and third are definitely following right behind. And if there's a stall in long consumption, there will be enough longs stacked up in the FIFO to allow loading to resume at the next identical hub opportunity (16 clocks later). This is a fix-all for hub long data needed at any rate, up to and including Fsys.
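As a way to think about the FIFO described here, the following Python sketch models it behaviourally. It is an interpretation, not the actual hardware: each stage either loads the incoming hub long or takes the value of the stage above it, and a small counter says how many longs are queued and which stage gets the next write. The class name, method names, and the Fsys/2 consumer are hypothetical.

```python
class HubFifo:
    DEPTH = 20  # stages; 20 x 32 bits = 640 flipflops

    def __init__(self) -> None:
        self.stages = [0] * self.DEPTH
        self.count = 0  # 0..20, so it fits in a 5-bit counter

    def clock(self, hub_long=None, consume=False):
        """Advance one system clock; returns the long consumed, if any."""
        out = None
        if consume and self.count > 0:
            out = self.stages[0]
            # Every stage loads from the stage above it.
            self.stages = self.stages[1:] + [0]
            self.count -= 1
        if hub_long is not None:
            if self.count == self.DEPTH:
                raise OverflowError("FIFO full")
            # The stage selected by the counter loads the hub read instead.
            self.stages[self.count] = hub_long & 0xFFFFFFFF
            self.count += 1
        return out

fifo = HubFifo()
outputs, peak = [], 0
for clk in range(40):
    hub = 100 + clk if clk < 16 else None  # one 16-long burst from the hub
    got = fifo.clock(hub_long=hub, consume=(clk % 2 == 1))  # consumer at Fsys/2
    if got is not None:
        outputs.append(got)
    peak = max(peak, fifo.count)
print(outputs)  # longs come out in hub order
print(peak)     # deepest the FIFO got during the burst
```

With the consumer running at Fsys/2 the burst only backs up eight longs deep, which is why slower consumption rates leave headroom for the next hub window.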
Sounds great, what sets the HUB Address on the 'far side' of this FIFO ?
Addit: Is this only for the DMA - HW pathways, not for SW-paced loops ?
Chip, for those of us here who are drowning in most of the intricate details of processor engineering:
A. Are we looking at any possibility of doing standard 720/1080 video with the new Prop?
B. If so, how many Core/s is it going to take do you think?
C. Am I going to be able to output HDMI easily, or be restricted to VGA/Composite/NTSC/PAL?
HDMI uses differential (TMDS) pathways, so that cannot be directly delivered.
There are HDMI (not so cheap) parts that can take wide-parallel-data and stream HDMI.
Every four I/O pins have their own VIO supply pin (3.3V) that is separate from all other groups. This keeps analog signal groups partitioned so that there's no crosstalk outside of groups of four. This is a little excessive, but will be very safe. We may be able to get by with fewer VDD pins, too, but this arrangement is very conservative and will help minimize switching noise on the internal power grid. If we cut all the VIO and VDD pins in half, we could get 16 more I/O pins. That could mean another 4 cogs, too, for 20, total. I do feel the pull for more I/O to support SDRAM.
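The pin arithmetic in this post can be laid out as a short Python sketch. The 64 I/O pins come from the thread's mention of 64 smart pins; the VDD pin count is not stated and is only inferred here from "cut in half gives 16 more", so treat it as an assumption. Variable names are hypothetical.

```python
IO_PINS = 64          # assumed: the 64 smart pins mentioned in the thread
COGS = 16
PINS_PER_VIO = 4      # every four I/O pins share one VIO pin
FREED_BY_HALVING = 16 # stated: halving VIO and VDD pins frees 16 pins

vio_pins = IO_PINS // PINS_PER_VIO            # 16 VIO pins
supply_pins = FREED_BY_HALVING * 2            # implied VIO + VDD total
vdd_pins = supply_pins - vio_pins             # implied VDD pins (inferred, not stated)
io_per_cog = IO_PINS // COGS                  # 4 I/O pins per cog
extra_cogs = FREED_BY_HALVING // io_per_cog   # 4 more cogs at the same ratio
total_cogs = COGS + extra_cogs                # 20 total

print(vio_pins, vdd_pins, extra_cogs, total_cogs)
```

The "another 4 cogs" figure simply keeps the existing ratio of four I/O pins per cog.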
Does anyone have any data on how other chips apportion VDD pins based on their core current? I know on very high-current chips, they place power pads right in the middle of the die and attach the package directly to them.
I don't think Chip received any alternative packaging suggestions in response to his inquiry. Anyway, although I'm doubtful that we'll have room for 4 more cogs, if there were a way to get 16 extra pins, they could be put to good use as a general port. I realize that suggestion will be considered as heresy, but I'm prepared for that.
Over on this short/forgotten/near-dead thread, I inquired about the possibility of having a general purpose port, though I initially over-limited it by referring to it as pins for SDRAM and also erroneously referred to it as "dedicated."
I would guess that the logic required for a general purpose port is quite minimal (almost nothing compared to everything else that's being provided). However, I don't know to what extent (if any) the additional pins (with pads, though maybe they're already in the outer ring and wouldn't adversely affect things) would add to the required logic. Hopefully, that's minimal, too.
A general-use port could be applied to a lot of things. I'm mainly thinking of SDRAM (there are too few pins left after interfacing SDRAM, in my opinion), but there could be other uses. I don't know if LUT outputs could readily be routed to such a port, such as for driving a "raw" LCD panel, but maybe. Anyway, there must be lots of uses.
There are at least three main issues involved here: [1] finding a reliable way to reduce the number of VIO/VDD pins needed (assuming the same package), [2] determining if a general purpose port somehow violates the Propeller philosophy (I don't think so), and, [3] determining if a general purpose port would offer good utility beyond what the existing 64 smart pins already provide.
Anyway, I know I'm off topic, but I've posted here because this is now the "active" thread (even the "New 16-Cog" thread has grown cold). But responses could be directed back to that old thread I created or a new one. Apologies for deviating from the hub scheme topic (though the LUT discussion goes somewhat beyond it, too, even if it relates tangentially).
By the way, if this new hub scheme does get implemented, then I think it might be helpful to create an entirely new "New 16-Cog" Propeller thread, as quite a bit of the information in that thread would be out of date, such as the quad access to the hub and so on. Maybe we won't know until there's new FPGA code, though.
I'm confused again... Is this FIFO part of the HUB or part of the COG? Are there 16 of them?
My reading is one per COG.
["A 20-stage FIFO in the cog that spits out hub longs at any rate at or below Fsys would simplify video quite a bit and provide hubexec instructions"]
Yes, this sounds ideal if 640 flops are available. If it's too costly, ~230 flops (6-stage), plus a direct write for Fsys/2, would let you do everything except Fsys/1. It'd be nice to be universal, though.
I figured you needed something like a variable length fifo (wasn't sure exactly how this would be done - uart style head/tail??) but you've hit the nail on the head with the ability for each stage to grab from either next stage or hub.
How do the HUB longs get into the cog FIFO? Does the cog do it with code? Or, is there some new hardware mechanism?
I think this is related to Chip's DMA idea, for fSys/N, so this is probably called a "new hardware mechanism"
- not sure if this helps streaming reads TO the HUB, as it seems to be one-way & HW, not SW.
This requires licenses and some restrictions on what the output may or may not do. Analog has no such restrictions, and can feed into a little chip that handles HDMI easily enough. If that feed is component (YPbPr), or VGA (RGB), the resulting HDMI display is going to be excellent.
I personally hope somebody designs a board with an HDMI interface, right next to an analog one. I don't think HDMI should be in the device for a lot of reasons, but the IP / license / approval process is the primary one.
From memory, Rayman already has an SSD1963 to HDMI, and if the new Prop gets parallel streaming Video, it should drop into the SSD1963 location. (and another COG can do Analog)
Yeah, my thoughts too. There might be some real advantages doing it that way. HDMI offers some low frame rate modes, (24P) which might work well with either analog or parallel feed.
I've not read up on it all just yet. But as we get close, I will.
That sounds great and it's good to hear that hubexec is still in the plans.
Sounds good for video. The benefits for HubExec would be very limited. You'd have to dump the FIFO before it was even full nearly every time. How many programs (especially compiled ones) go 20 instructions before branching?
Comments
Also, I like Bill's write buffering idea (#331).
Yes, Odd-N values are relatively simple in Nibble_Adder, but it is worth chasing Even-N (certainly for HW (DMA) moves).
- Even-N was more of a challenge, but I have updated the tables in the other thread with a now-working Even-N case.
http://forums.parallax.com/showthread.php/155692-Nibble-Carry-Higher-speed-Buffers-FIFOs-using-new-HUB-Rotate?p=1268158&viewfull=1#post1268158
Good question. Chip can provide the exact details, but from the "New 16-Cog...Propeller" thread I found these posts:
Just for comparison:
There are chips out there from TI and Analog that can do the conversion for you... I think they're ~ $5 each or so...
$5 seems a great deal!
Right, a cool/clever display controller chip-to-DVI chip solution: http://rayslogic.com/Propeller/Products/DviGraphics/DVI.htm
Sounds good for video. The benefits for HubExec would be very limited. You'd have to dump the FIFO before it was even full nearly every time. How many programs (especially compiled ones) go 20 instructions before branching?
Ross.
But for other 'N' the FIFO can be shorter