Variation on the New Hub Hub Scheme
Cluso99
Posts: 18,069
Take Chip & Roy's New Hub Scheme, but do the following...
- Hub can fetch 16 Longs per Clock (same)
- Each slot is dedicated to a Cog (like old scheme) ie 1:16
- Blocks (16 Longs = 512 bits) are fetched on 16 Long boundaries
- Determinism
- No jitter
- Software can now hit the "sweet spot"
- Way simpler silicon implementation
- Only 1 address bus per clock (because each cycle is only dealing with a single cog)
In fact, together with the 2 Table Hub Slot allocation, we can increase the hub bandwidth > 800MB/s.
So, each cog gets its normal 1:16 slot. It is supported by the following instructions...
- RD/WR BYTE / WORD / LONG / BLOCK(=16xLONGS)
- Is the FIFO now required, or can we do it in sw?
- Is the LUT now required, or can we do it in sw?
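To make the proposal concrete, here is a minimal C model (my own illustration, nothing from Chip's Verilog) of the fixed 1:16 slot ownership and the 16-long block boundaries described above:

#include <stdio.h>
#include <stdint.h>

#define NUM_COGS        16
#define LONGS_PER_BLOCK 16   /* 16 longs = 512 bits */

/* In this variation, slot ownership is fixed: cog N owns every clock where
   (clock mod 16) == N, exactly like the P1 hub rotation. */
static int slot_owner(uint32_t clock) { return (int)(clock % NUM_COGS); }

/* Blocks are fetched on 16-long boundaries, so the block base for any long
   address is just that address with its low 4 bits cleared. */
static uint32_t block_base(uint32_t long_addr) {
    return long_addr & ~(uint32_t)(LONGS_PER_BLOCK - 1);
}

int main(void) {
    /* Determinism: cog 3's hub slots are exactly 16 clocks apart. */
    for (uint32_t clk = 0; clk < 48; clk++)
        if (slot_owner(clk) == 3)
            printf("cog 3 owns the hub at clock %u\n", (unsigned)clk);

    /* A block transfer touching long address 0x123 moves the aligned block
       0x120..0x12F in the owning cog's single slot. */
    printf("block base of long 0x123 = 0x%X\n", (unsigned)block_base(0x123));
    return 0;
}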
Comments
1) FIFO not required if and only if hubexec has a 512 bit wide latch (cache) and video has another 512 bit wide latch, auto loaded every 16 cycles.
2) LUT still required, we still need 1 cycle 8-->32 expansion, which this would not help with at all
I don't think there is room for a 512 bit wide bus to all 16 cogs.
I'm not following? If the Hub delivers 16 longs per clock (same), and each slot is dedicated to a Cog, then that is exactly what is done now - i.e. the present rotator delivers each COG one long per clock.
The same address LSN appears on a 1:16 basis.
Read #2. By the "old scheme" he means like the P1. So it needs a 512-wide bus into cog RAM to deliver 16 longs in one clock to one cog. 'Not gonna happen.
-Phil
One of the reasons the old P2 design had the large power problem was the 256-bit bus it had between HUB and cogs (for the 8-long-wide reads/writes).
What I am unsure about is the potential of having a 512 bit bus. Certainly, the hub is 512 bits wide (that is how you can read 16 cogs' worth of 32 bits per clock).
There has to be a full 512 bit bus going to the cogs. It's either 16x separate 32 bit busses out of a multiplexer, or 1x 512 bit bus going to all cogs.
The power solution is not transferable.
Just ignore this for a minute while I rethink it through again.
Of course it is, Roy. Each of the 16 blocks of RAM would remain separate blocks. If you only access one 32-bit RAM in a clock, the other 15 RAM blocks don't get activated. This is the same with the other method. The difference is whether in one clock all cogs can have 32 bits each, or whether one cog can access 16x32 bits, and over 16 clocks each cog gets that turn in sequence.
Looking at it in another way...
Other: 16 clocks, each capable of 16 cogs of 32bits
Mine: 16 clocks, each one for a different cog capable of 16x32 bits.
Both result in a maximum of 16x32 bits each clock = 64B per clock =1280MB/s =800MB/s per cog.
Tks, missing a "0".
16x32bits @ 200MHz = 64B @ 200MHz = 12800MB/s = 800MB/s per cog
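For reference, the arithmetic behind those figures, assuming the 200MHz clock used above:

#include <stdio.h>

int main(void) {
    const double clock_hz      = 200e6;  /* assumed system clock       */
    const int    cogs          = 16;
    const int    bytes_per_cog = 4;      /* one long per cog per clock */

    double bytes_per_clock  = cogs * bytes_per_cog;                 /* 64    */
    double total_mb_per_s   = clock_hz * bytes_per_clock / 1e6;     /* 12800 */
    double per_cog_mb_per_s = total_mb_per_s / cogs;                /* 800   */

    printf("%.0f B/clock -> %.0f MB/s total -> %.0f MB/s per cog\n",
           bytes_per_clock, total_mb_per_s, per_cog_mb_per_s);
    return 0;
}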
Roy (or Chip, if needs to relax his mind with something else for a short),
could you help me understand? I always try to understand the FPGA development and its issues by comparing it with my experience from years ago, when I was designing 8051/32/52-based boards with external RAMs and EPROMs.
How is it different, in regard to power consumption, to have 16 cogs (uCs) accessing 16 separate 16Kx32-bit chips, asserting their CS and driving their address/data signals all at the same time, ... versus a single cog (uC) accessing 16 separate 4Kx128-bit chips, but only one at a time, because the CS and address/data signals are driven by the address/CS decode?
Thinking this way, I would say that the new rotator is more power hungry, because at each clock it exchanges data with all 16 blocks/chips of RAM (if the cogs need access), while the old way always accesses only 1/16 of the whole RAM at a time (provided it is segmented and address/CS decoded). Thus the former solution switches 14 address lines (for 16K) and 512 (16x32) data lines, while the latter switches only 12 address lines (for 4K) and 128 data lines at a time. When idle, I think either scheme uses the same power, as there is always the same number of memory cells to power regardless of how they are organized.
Edit: The new rotator delivers higher bandwidth (horsepower); how can it come at a lower energy cost (provided the same technology is used)?
Thanks in advance
Now, analysing a little further, my original assumption was to use cogs that would be 16 longs wide. But with the FIFO (which could just be a small 16-long RAM block per cog), it would be 32 bits wide, and each FIFO would be in series for the total 512-bit-wide hub RAM. In order to have the same cog RAM width in my case, each cog FIFO would be parallel loaded 512 bits wide, and the other side would be serial, 32 bits wide. Therefore this may be better implemented as RAM??? Each of my cog FIFOs would be in parallel on the hub side, instead of serial.
Anyway, I have described what I think could work. It has lots of advantages as far as I can see. But only Chip will know if this is any less/more silicon and less/more complex to implement.
Remaining inactive means idle.
With the new Roy/Chip scheme you have 16 16Kx32-bit blocks that can potentially all be used at the same time (same clock cycle) if all 16 cogs require access.
With the old 1:16 design, the whole 512KB of RAM can still be split into 16 (or more, or fewer) 4Kx128-bit blocks (or 8Kx64-bit, or 2Kx256-bit), and you can be sure that only 1 block at a time will be active (accessed, non-idle).
I am not pushing one scheme over the other (even if I like/prefer the new one, and even if I think someone else suggested this before Roy, but on the forum instead of at Chip's house).
The point is that here everybody says that something is more power friendly or power hungry relative to something else, and often uses that argument to like or dislike someone else's suggestion/question/idea.
Even though I like "Roy's" idea, I still can't understand how RAM blocks segmented/organized this way, requiring more "active" address lines, more data lines (32x16 compared to a single 32/64/128/256 with the old-style hub), and more switching/multiplexing logic to cross-link them to every cog ... AND where potentially more RAM cells and signal (address/data) lines are changing state every clock cycle (hence the higher bandwidth), can be more power friendly.
As far as I know, switching uses energy, while idle states only cost leakage current, and the latter is the same regardless of RAM organization/segmentation (at the same total size, of course).
More bandwidth on the same technology requires either more things in parallel or the same things at a higher frequency (both of which translate to higher power needs) ... like a car engine: to increase its power (horses), among other things, you need more fuel/propellant.
As far as I know you don't create energy, you transform it, and because we are not in an ideal world you do it with losses...
As I asked in the above post, can someone please explain this, because I REALLY don't understand it.
The key difference with your idea is that the COGs all now need to have 512bit busses/ram (256bit dual ported at least) to be able to read/write that 512 bits in one clock as you suggest.
dMajo,
I don't pretend to understand the power usage for the synthesised RAM, but Chip showed me several data tables he got from OnSemi showing the power usage and max speed values for all kinds of memory configurations (32, 64, 128, 256 bit, and various sizes of each). In the previous setup the cog RAMs were 128-bit, and that was quite a bit more power and significantly slower than the 32-bit RAMs. Chip used the term "sense amps" (you can look it up on Wikipedia) a lot when talking about the memory power usage.
I believe a big part of the reason that the current setup is lower power overall is because the cogs are now 32bit all the time instead of 128bit all the time. So every cog memory access (for execution of code, etc.) is now 32bit and lower power. That's every instruction clock (half system clock). Also, remember that cog memory is dual ported, so when it was 128bit, you were getting 256 bits of power usage per instruction clock, now it's 64bits. In Cluso's proposed plan, the cogs would need 256bit dual ported memory in order to write the 16 longs in one clock. So you see, it ramps up the cog power usage all the time, instead of having it spread out over 8 instruction clocks (or 16 system clocks).
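A rough back-of-envelope restating that point in numbers (these are just the widths mentioned above, not OnSemi figures):

#include <stdio.h>

/* Cog RAM is dual ported, so the bits touched per instruction clock are
   roughly the word width times two ports. */
static int bits_touched(int word_width_bits) { return word_width_bits * 2; }

int main(void) {
    printf("old 128-bit cog RAM : %d bits per instruction clock\n", bits_touched(128)); /* 256 */
    printf("new  32-bit cog RAM : %d bits per instruction clock\n", bits_touched(32));  /*  64 */
    printf("256-bit cog RAM (to write 16 longs in one clock): %d bits\n", bits_touched(256)); /* 512 */
    return 0;
}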
I give up with you. You just ignore what you don't like, and forge on with what you want.
Sorry, but I'll leave it to Chip...
Correct, the waste in the previous design was that even if only 32 bits were needed, it would read 128 and discard the unneeded 96 inside the COG. There was RAM power plus bus-width power wasted.
The new design will be much lower power, given the same Longs/second rate.
A 512b wide bus is going backwards, in that now a 32b read discards 480 bits.
By being able to BLOCK copy on every SysCLK, you consume power only when you really actually need a Block of Data, and yet still get a Block moved in 16 clocks.
A side effect of this is that the fSys/N video streaming comes too, and besides the clear Video+LUT benefits, that can also go to the smart pins, which is going to have a lot of benefits when the smart I/O cells get worked on seriously.
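Putting the waste argument into plain arithmetic for a single random 32-bit read (block transfers are a different story, of course):

#include <stdio.h>

/* Bits read out of RAM versus bits actually used for one 32-bit access. */
static void waste(int ram_width_bits, const char *label) {
    printf("%-24s %3d bits read, %3d bits discarded\n",
           label, ram_width_bits, ram_width_bits - 32);
}

int main(void) {
    waste(32,  "32-bit RAM (current)");
    waste(128, "128-bit RAM (previous)");
    waste(512, "512-bit wide bus");
    return 0;
}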
Well, almost no problem at all
Currently, I don't see my variation as being any more logic, and perhaps less. Same goes for power.
If that's the case, then my variant has a number of advantages as outlined at the start of this thread.
Meanwhile, it's nice to see that Chip has basically solved the fpga hub scheme.
The feeling is mutual. I am trying to solve the problems that the new scheme has. Obviously you just want to ignore them.
As far as I am concerned, I would rather the P2 suit Parallax's objectives. Ken has stated them quite clearly, resulting from commercial users. If the new chip doesn't fly with commercial users, I don't see us getting a later P2+ or P3 any time soon, and possibly not at all.
I have experienced all the issues on Ken's list except for security, although I endorse that too.
My commercial products (excluding my hobby boards) don't use video - I have a small parallel LCD and a keypad. They are not large volumes, but do use 3 x P1's. While I design my boards with minimum and low cost, and readily available parts, they are not price sensitive. The P1's are used for software simplicity - multi-core and no interrupts.
I'll have a shot at addressing your original points:
Yep, equal.
This is the only point of concern. I think it's worth having a play with the new scheme before condemning it.
Equal. With the mini-DMA feeding the FIFO, the individual order of fetching/writing is different, but the 16-long result is the same (see the sketch after this reply).
These points are all one and the same. It's being worked on and, as some have mentioned, the ultimate amount of jitter may not be such a big issue. There are ways to manage it; it just won't be quite the same as the Prop1.
I was thinking of another way to knock back the transistor count by dropping back to an 8x8 switch, but Chip currently doesn't seem too fazed by the size of the 16x16 switch.
Yes, it was known as the cache on the big Prop2 design. This allows the Cog to keep ticking while the Hub fetch/write is occurring.
Whole other question. The LUT isn't depended on for Hub access. Its usefulness is decided on other criteria.
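Going back to the mini-DMA point above: a small sketch of how a rotating fetch order still delivers the complete 16-long block (the exact start-slice rule here is my assumption, purely for illustration):

#include <stdio.h>

#define NUM_SLICES 16   /* hub RAM as 16 interleaved 32-bit slices */

int main(void) {
    /* Say the cog wants the aligned block of longs 0..15, and on the first
       clock the rotator happens to present slice 5 to this cog. */
    int start_slice = 5;
    int fetched[NUM_SLICES] = {0};

    printf("fetch order:");
    for (int clk = 0; clk < NUM_SLICES; clk++) {
        int slice = (start_slice + clk) % NUM_SLICES;  /* rotating access */
        fetched[slice] = 1;
        printf(" %d", slice);
    }
    printf("\n");

    /* After 16 clocks every long of the block has been fetched exactly once,
       just not in linear order - the 16-long result is the same. */
    int complete = 1;
    for (int i = 0; i < NUM_SLICES; i++) complete &= fetched[i];
    printf("all 16 longs fetched: %s\n", complete ? "yes" : "no");
    return 0;
}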
You've got to remember Chip had already dropped back to a 128 bit bus width, before changing to the 16 x 32 bit, for the very reason that the 256 bit bus was producing nasty figures on simulation.
My main reason for drawing block diagrams is to explain to others where the savings of logic are in my variation, and why it does not consume any more power than the current new scheme.
I understand logic design (40+ years' experience), and that's where I am concentrating my simplifications. I understood the power savings from blocking the hub RAM, and this concept works with either scheme. As far as hub access goes, at each clock there is a 512-bit bus present to the hub in both schemes, although not all of the 16 sets of 32 bits may be active on any one clock.
I think you are missing that in your case a 512-bit read is always required, whilst in Chip's new design a 512-bit read is optional, and the big power saving comes from not discarding memory reads.
There is Sense-Amp energy in reading, plus transport energy in sending large widths of data, to somewhere else, where it is then discarded.
Better to be smarter, and only enable the memory you actually need data from = less energy.
( - and all of this before the routing overhead of running 512 to every COG is included. )
-Phil
But the address bus to the hub can be one bus, where A0-1 selects the byte within a long, A2-5 selects which of the 16 hub blocks, and A6-18 selects the long within the hub block of 32KB = 8K longs. The other scheme requires 16 separate A6-18 (13-bit) buses to each hub block, since each block is servicing a different cog on every clock. This saves a reasonable amount of logic.
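That address split, written out as a small C sketch (19-bit byte address for the 512KB hub; my own illustration of the decode described above):

#include <stdio.h>
#include <stdint.h>

/* A1..A0  - byte within the long
   A5..A2  - which of the 16 hub RAM blocks
   A18..A6 - long within the selected 32KB (8K-long) block */
typedef struct { unsigned byte_in_long, block, long_in_block; } hub_addr_t;

static hub_addr_t decode(uint32_t addr) {
    hub_addr_t a;
    a.byte_in_long  = addr & 0x3;            /* A1..A0  */
    a.block         = (addr >> 2) & 0xF;     /* A5..A2  */
    a.long_in_block = (addr >> 6) & 0x1FFF;  /* A18..A6 */
    return a;
}

int main(void) {
    hub_addr_t a = decode(0x12345);
    printf("byte %u, block %u, long %u\n", a.byte_in_long, a.block, a.long_in_block);
    return 0;
}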
Cluso,
this is also what I was trying to understand.
But when 16 cogs access 16 hub blocks over 32 data bits each, compared to a single cog accessing a single portion of the hub RAM over a much wider data bus, they immediately get the data formatted as needed.
Single-cog access over a wider data bus can be used effectively only for quad or larger block transfers. Since the RAM is then quad (or wider) aligned/addressed, access to a single long/word/byte (e.g. in random access) would require a lot of shifting/multiplexing of the wider data bus to bring the needed piece to the CPU/ALU. Perhaps this is where the higher power needs Chip was referring to come from.
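To make that shifting/multiplexing cost concrete, this is roughly what a random byte read out of a 512-bit line looks like when expressed in software terms (purely illustrative, assuming little-endian byte order within a long):

#include <stdio.h>
#include <stdint.h>

/* A 512-bit hub line modelled as 16 longs. Pulling one byte out of it needs
   the lane-select and shift logic the post is talking about. */
static uint8_t read_byte(const uint32_t line[16], uint32_t byte_addr) {
    uint32_t lane  = (byte_addr >> 2) & 0xF;  /* which long in the line  */
    uint32_t shift = (byte_addr & 0x3) * 8;   /* which byte in that long */
    return (uint8_t)(line[lane] >> shift);
}

int main(void) {
    uint32_t line[16];
    for (int i = 0; i < 16; i++) line[i] = 0x11111111u * (uint32_t)i;
    printf("byte at offset 0x1D = 0x%02X\n", (unsigned)read_byte(line, 0x1D));
    return 0;
}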
Note that the 512bit hub bus is a requirement and that this then flows into each FIFO/RAM.
There are possibilities here to flow it into the cog RAM too. I also wonder whether, rather than having a FIFO at all, it would be possible to use the cog RAM for this, and have a small state machine (which is required anyway) just drive the cog RAM (the block used for the FIFO), which then drives the LUT or the DAC directly. Perhaps there is some synergy and simplification possible.
You will note that if the die is laid out with the hub RAM on the top edge, with the 512-bit bus under that, followed by the FIFOs and then the cog RAMs, then the flow seems quite nice. The cog RAM may then fit nicely as 512 bits wide. On the other side of the cogs would be the logic for the cogs (ALU etc).
Anyway, this is just food for thought for Chip and others to think about.
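A rough software model of the cog-RAM-as-FIFO idea (the names and sizes here are assumptions, just to show the shape of the state machine):

#include <stdint.h>
#include <string.h>

#define FIFO_LONGS 16

typedef struct {
    uint32_t buf[FIFO_LONGS];  /* could simply be a reserved block of cog RAM */
    int head, count;
} fifo_t;

/* Hub side: a whole 512-bit block arrives in a single hub slot. */
static void fifo_load_block(fifo_t *f, const uint32_t block[FIFO_LONGS]) {
    memcpy(f->buf, block, sizeof f->buf);
    f->head  = 0;
    f->count = FIFO_LONGS;
}

/* Cog/video side: drained 32 bits at a time toward the LUT or DAC. */
static int fifo_pop(fifo_t *f, uint32_t *out) {
    if (f->count == 0) return 0;   /* empty - the state machine must refill */
    *out = f->buf[f->head++];
    f->count--;
    return 1;
}

int main(void) {
    fifo_t f = {{0}, 0, 0};
    uint32_t block[FIFO_LONGS], v;
    for (int i = 0; i < FIFO_LONGS; i++) block[i] = (uint32_t)i;
    fifo_load_block(&f, block);              /* one hub slot */
    while (fifo_pop(&f, &v)) { /* one long per clock to the LUT/DAC */ }
    return 0;
}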
Notes:
* The "x" bus junctions will be controlled by decoding.
* A5-2 is decoded to select a specific hub block, or all hub blocks are selected by the FIFO R/W (and Block R/W if implemented as cog 512bits wide).
In the Roy/Chip scheme, the hub block implementation is basically the same, except that it also requires separate 13-bit A18-6 buses from each cog to each hub block. This is because on each hub slot, each cog may put out a different A18-6 address. In my variation, each hub slot has only a single cog accessing the hub, so the A18-6 address will be the same to all hub blocks.
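Just counting address wires for that argument (data buses and mux logic are a separate matter):

#include <stdio.h>

int main(void) {
    const int addr_bits_per_block = 13;  /* A18..A6: a long within a 32KB block */
    const int blocks = 16;

    /* Roy/Chip scheme: every hub block may see a different cog's address each clock. */
    printf("separate A18-6 buses: %d lines\n", addr_bits_per_block * blocks);  /* 208 */

    /* This variation: one cog owns the whole slot, so one shared address bus. */
    printf("shared A18-6 bus    : %d lines\n", addr_bits_per_block);           /*  13 */
    return 0;
}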