Like I said, with 2 cogs you could get continuous reading at 2 sysclks per sample. One would read 16 longs and write them to HUB; the other would read the next 16 while the first was writing to hub. They would alternate. They could be synced up easily, so no jitter and continuous reading...
People have done similar setups on the P1 with multiple cogs synced up to read data quickly to HUB. The same will be possible on the P2, at MUCH higher rates.
Brian's block model makes a lot more sense than what was previously put out. Nothing but a glorified cross-connect switch of sorts.
Your answer to the datasheet issue is a sales weenie evasion.
When I did T&E of new hardware for the DoD (Air Force side of the house), the first time a sales engineer avoided answering a technical question and went off on a spiel of blah, blah like you are doing, I'd get suspicious; that's used-car-salesman talk. It's dishonest. The devil is in the details. Do you have to hire a PASM guru like you, Phil, Cluso or Roy, or can it be done with a C compiler and an average coder without being hand-held by a guru?
It is definitely more difficult to work out precise (deterministic) timing while interacting with the hub.
However, most of the timing when hub is involved does not need to be that precise, and there are ways to synchronise this.
We definitely don't need to squeeze the last ounce out of the chip. What I didn't like before was that there were heaps being left on the table and many did not want to entertain any other discussion about it.
The two-table slot-sharing concepts were much cleaner for calculating determinism than this, because this method requires those unusual additional clocks. As I said, I don't consider it an issue though. Careful planning where it is precisely required will prevail.
I actually find some of the arguments conflicting. Why do many require all cogs to be equal, when the real world of prop cog usage is never equal?
Anyway, I cannot wait to try this out. And I will get at least a few cogs in my DE0 whereas all the previous images were restricted to a single cog. Infinitely better than multitasking IMHO.
Bring it on Chip. And many thanks Roy for chatting with Chip to reach this conclusion.
On this clock, the hub has rotated anticlockwise to this position, and all cogs can access the following hub addresses in parallel (ie simultaneously):
* Cog 0 can access any hub long address $xxxx0 (ie hub long address ends in nibble $0)
* Cog 1 can access any hub long address $xxxx1
* Cogs 2..13 can access any hub long address $xxxx2..$xxxxD respectively (where 13 = $D)
* Cog 14 can access any hub long address $xxxxE
* Cog 15 can access any hub long address $xxxxF
On the next clock, the hub rotates anticlockwise again, and all cogs can access the following hub addresses in parallel (ie simultaneously):
* Cog 0 can access any hub long address $xxxx1 (ie hub long address ends in nibble $1)
* Cog 1 can access any hub long address $xxxx2
* Cogs 2..13 can access any hub long address $xxxx3..$xxxxE respectively
* Cog 14 can access any hub long address $xxxxF
* Cog 15 can access any hub long address $xxxx0
This process continues for another 14 clocks. After the total of 16 clocks, the hub will be back in its original position above, where cog 0 is aligned with hub $xxxx0.
Does this explanation help?
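If a concrete model helps, here is a throwaway Python sketch of that mapping as I understand it (my own model, nothing official): on clock t, cog c lines up with the hub RAM slice holding long addresses that end in nibble (c + t) mod 16.

```python
# Sketch of the rotating hub mapping described above (my own model, not
# anything official): on clock t, cog c faces the hub RAM slice whose
# long addresses end in nibble (c + t) mod 16.

def slice_for(cog, clock):
    """Hub address nibble that 'cog' can access on this clock."""
    return (cog + clock) % 16

# Print the first two clocks for all 16 cogs (matches the lists above).
for t in range(2):
    row = ["cog %2d -> $xxxx%X" % (c, slice_for(c, t)) for c in range(16)]
    print("clock +%d: " % t + ", ".join(row))
```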
For an RDBLOCK instruction executed by Cog 5 at the first hub position above (we will call this clock n0)...
* Clock +0: Cog 5 = hub addr $xxxx5; RDBLOCK instruction fetched (takes 2 clocks to setup instruction)
* Clock +1: Cog 5 = hub addr $xxxx6; RDBLOCK decode continues
* Clock +2: Cog 5 = hub addr $xxxx7; First long transfer from hub $xxxx7 is copied to cog addr $xx7
* Clock +3: Cog 5 = hub addr $xxxx8; Next long transfer from hub $xxxx8 is copied to cog addr $xx8
* Clock +4: Cog 5 = hub addr $xxxx9; Next long transfer from hub $xxxx9 is copied to cog addr $xx9
* Clock +5: Cog 5 = hub addr $xxxxA; Next long transfer from hub $xxxxA is copied to cog addr $xxA
* Clock +6: Cog 5 = hub addr $xxxxB; Next long transfer from hub $xxxxB is copied to cog addr $xxB
* Clock +7: Cog 5 = hub addr $xxxxC; Next long transfer from hub $xxxxC is copied to cog addr $xxC
* Clock +8: Cog 5 = hub addr $xxxxD; Next long transfer from hub $xxxxD is copied to cog addr $xxD
* Clock +9: Cog 5 = hub addr $xxxxE; Next long transfer from hub $xxxxE is copied to cog addr $xxE
* Clock +10: Cog 5 = hub addr $xxxxF; Next long transfer from hub $xxxxF is copied to cog addr $xxF
* Clock +11: Cog 5 = hub addr $xxxx0; Next long transfer from hub $xxxx0 is copied to cog addr $xx0
* Clock +12: Cog 5 = hub addr $xxxx1; Next long transfer from hub $xxxx1 is copied to cog addr $xx1
* Clock +13: Cog 5 = hub addr $xxxx2; Next long transfer from hub $xxxx2 is copied to cog addr $xx2
* Clock +14: Cog 5 = hub addr $xxxx3; Next long transfer from hub $xxxx3 is copied to cog addr $xx3
* Clock +15: Cog 5 = hub addr $xxxx4; Next long transfer from hub $xxxx4 is copied to cog addr $xx4
* Clock +16: Cog 5 = hub addr $xxxx5; Next long transfer from hub $xxxx5 is copied to cog addr $xx5
* Clock +17: Cog 5 = hub addr $xxxx6; Next long transfer from hub $xxxx6 is copied to cog addr $xx6
* Clock +18: Cog 5 = hub addr $xxxx7; next cog instruction is fetched; the 16-long transfer has completed and the data is available.
From the above RDBLOCK instruction, you can see that it will always take 2+16 = 18 clocks to complete 16 long transfers. The transfers begin 2 clocks after the instruction starts, at whatever hub slot window is presented then, and the slot wraps around to 0 after the rotation passes slot 15.
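To double-check the wrap-around, here is a small Python model of that walkthrough (RDBLOCK itself is still a proposed/hypothetical instruction; the 2 setup clocks and 16 transfer clocks are taken straight from the sequence above):

```python
# My reading of the RDBLOCK timing above, as a throwaway model. RDBLOCK is
# still a proposed instruction; the 2 setup clocks and 16 transfer clocks
# come straight from the walkthrough.

def rdblock_schedule(cog, start_clock=0):
    """Return (clock offset, hub nibble) pairs for the 16 long transfers."""
    return [(k, (cog + start_clock + k) % 16) for k in range(2, 18)]

# Cog 5 starting at the hub position shown above: the first copy is from
# $xxxx7, the sequence wraps past $xxxxF to $xxxx0, and the whole thing
# always takes 2 + 16 clocks regardless of where the rotation started.
for clk, nib in rdblock_schedule(cog=5):
    print("clock +%2d: hub $xxxx%X -> cog $xx%X" % (clk, nib, nib))
```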
Unfortunately, I think that is becoming a politically incorrect term.
I may be a newbie, however it strikes me that Chip's new idea requires a lot of re-do on layout, and s/w.
After a lot of work by some folks on the Sharing scheme, which seemed to require little/nothing in the way of layout or s/w and thus time and money, the 'open process' apparently ended up being nothing more than whoever showed up in Rocklin to bend Chip's ear?
Which requires significantly more layout, s/w, time and money, not to mention at least a few more *,*,* caveats to use.
For weeks and hundreds of posts, time was of the essence, ANY delay for anything was not worth it, simple sharing was too difficult a burden for a programmer to wrap his/her head around, and anyway, high B/W is a niche corner case for the Prop....
One visit, and now a major re-design (arguably), more complex s/w changes, more timing "issues/complexity", time and money, and nary a peep...
This is now at least the 3rd time, possibly more, that this design is going from what was 'simple' to complex.
Why not compare this kitchen-sink-in-the-making with something far simpler (seemingly) such as Cluso's bolt-on idea in his poll?
Having local 18-26K Cog/Core RAM would seem to be a significant improvement, and for many use cases you would not have to even use hub as done now, as a substitute for actual local RAM.
I believe someone posted that Ken had said that video was not high on any large customer's wishlist.
Does it really make sense to focus on that as one of the new primary use cases, at the expense of all the benefits Cluso's suggestion offers?
Like I said, with 2 cogs you could get continuous reading at 2 sysclks per sample. One would read 16 longs and write them to HUB; the other would read the next 16 while the first was writing to hub. They would alternate. They could be synced up easily, so no jitter and continuous reading...
This sounds pretty good... With 4 cogs then, you could read a set of 32 pins on every clock and store it in the hub?
Sounds like you could do this up to the 512 kB of HUB... This could make for a much better oscilloscope type thing than P1 could...
Seems pretty trivial to make a 200 MSPS, 4-channel analog capture device...
Could probably double things up and do 400 MSPS by sampling on the falling clock edge.
Or, use 8 cogs and have 8-channels...
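Rough numbers for that, as I understand the scheme (the 200 MHz system clock is just an assumed placeholder, not a spec, and I'm assuming one pin-read instruction every 2 clocks per cog, with two reader cogs interleaved):

```python
# Back-of-envelope numbers for the capture idea above. The 200 MHz system
# clock is an assumed placeholder, not a spec.

SYSCLK_HZ   = 200_000_000   # assumed system clock
CLK_PER_INS = 2             # one pin-read instruction every 2 clocks
HUB_BYTES   = 512 * 1024    # shared hub RAM
SAMPLE_B    = 4             # 32 pins = one long per sample

per_cog_sps = SYSCLK_HZ / CLK_PER_INS   # a single cog sampling the pins
two_cog_sps = SYSCLK_HZ                 # two reader cogs interleaved: 1 sample/clock
depth       = HUB_BYTES / SAMPLE_B      # samples that fit in hub
capture_us  = depth / two_cog_sps * 1e6

print("per-cog rate : %.0f Msps" % (per_cog_sps / 1e6))
print("interleaved  : %.0f Msps (four 8-bit channels on 32 pins)" % (two_cog_sps / 1e6))
print("hub depth    : %.0f samples (~%.0f us at full rate)" % (depth, capture_us))
```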
...now a major re-design (arguably), more complex s/w changes, more timing "issues/complexity", time and money, and nary a peep...
This does not sound quite right.
The major redesign started the day Chip realized the old behemoth design was not workable. If you follow Chip's posts you see that applies to the HUB mechanism as well.
It seems that this new approach actually simplifies things internally. At least doing away with a ton of interconnections. So simple in fact it seems to be done already!
What "complex s/w changes" ?
There are no "timing issues/complexity". At least no more for the user than with the old round-robin approach. Certainly less than any of the HUB slot sharing schemes that required user configuration.
If you follow Chip's posts you see that applies to the HUB mechanism as well.
And, I suspect, to the whole original P1. Let's face it, the last couple of years have shown that it doesn't scale well. What it now looks like we'll get is a whole new chip apart from some odd hang-overs from the original. Although it does seem that one-by-one they too are going as Chip discovers problems. Only no-one seems to have stepped back and looked at the bigger picture.
There are no "timing issues/complexity". At least no more for the user than with the old round-robin approach.
Oh, but there is, and it has been discussed countless times already. If you're doing random writes or reads to/from an array spread across more than one page of ram there's no way you can prevent stalling like you could with the P1 by timing your RD/WRx instructions just right, at least not without significant overhead (which wouldn't be worthwhile).
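A rough model of that cost, under the rotation described earlier (my own numbers, just to put a figure on "random"): the wait for a single random rdlong is however far the target address's low nibble is from the slice currently in front of the cog.

```python
# Rough model of the random-access cost being described: with the rotating
# hub, a single random rdlong waits until the slice holding its address's
# low nibble comes around to the cog.

def wait_clocks(target_nibble, current_slice):
    """Clocks until the slice for target_nibble is in front of the cog."""
    return (target_nibble - current_slice) % 16

waits = [wait_clocks(t, s) for t in range(16) for s in range(16)]
print("worst-case wait:", max(waits), "clocks")                # 15
print("average wait   :", sum(waits) / len(waits), "clocks")   # 7.5
# Unlike the P1's fixed window, you can't line your RD/WRx up in advance
# when the address is effectively random.
```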
It seems that this new approach actually simplifies things internally. At least doing away with a ton of interconnections. So simple in fact it seems to be done already!
To be replaced with a ton of symmetrical interconnect! It's no small amount of transistors and wiring required for a 32+13+4+1=50 bit 16x16 crosspoint switch.
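For what it's worth, the 50-bit figure checks out if you assume the 512 KB hub is split into 16 single-long-wide RAMs (my assumption for the sketch below):

```python
# Sanity check of the "32+13+4+1 = 50 bit" figure, assuming the 512 KB hub
# is split into 16 single-long-wide RAMs (my assumption, not a spec).

hub_longs     = 512 * 1024 // 4                   # 128K longs in total
longs_per_ram = hub_longs // 16                   # 8K longs per RAM slice
addr_bits     = longs_per_ram.bit_length() - 1    # 13 address bits per slice

data_bits, byte_lanes, wr_enable = 32, 4, 1
path_bits = data_bits + addr_bits + byte_lanes + wr_enable

print("address bits per slice :", addr_bits)      # 13
print("bits per cog<->RAM path:", path_bits)      # 50
print("signals feeding a 16x16 crosspoint:", path_bits * 16)
```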
There are no "timing issues/complexity". At least no more for the user than with the old round-robin approach. Certainly less than any of the HUB slot sharing schemes that required user configuration.
There certainly is an extra burden on random access. I can see the block r/w instructions being used often just to keep determinism simple even though it'll burn excess power.
And, I suspect, to the whole original P1. Let's face it, the last couple of years have shown that it doesn't scale well.
I felt this P16 edition was scaling quite nicely as a better Prop1.
But we are pushing for more than that, in particular for Hubexec, and it might now be fair to say that Chip has become a bit frustrated trying to make Hubexec streamlined so it would stand out as superior to LMM.
Hmm. This new approach does take some effort to grok.
Chip will design the chip that he wants to, which is fine. This forum is a nice place to discuss ideas and offer some to Chip, though in the end it's up to him to design it. He's laid out an idea that gets people a lot of what they want. Is there any point, ever, where people will be happy with his design? At some point, the conversation is less about design ideas (I think he has settled on a design) and becomes group microbation.
The major redesign started the day Chip realized the old behemoth design was not workable. If you follow Chip's posts you see that applies to the HUB mechanism as well.
--- Well, it seemed from his posts that the most recent redesign was 16 core P1+, with possible hub exec, and then a note from Ken that Chip was going to stop all the 'what ifs' and concentrate on getting that whole design far more complete before looking at hub exec, etc.
And from then on, it was a couple of weeks with hundreds of posts on how to easily, simply, and in a timely fashion, try to improve B/W-hub access.
It seems that this new approach actually simplifies things internally. At least doing away with a ton of interconnections. So simple in fact it seems to be done already!
---- Yes, that's quite possibly true. There may be a lot of muxes and such able to be removed, with a reduction/addition of RAM lines somehow, that ultimately is less complex. Sure seems to be some contention with that conclusion though.
What "complex s/w changes" ?
--- Well, it sounds like the compiler needs some sort of work to figure out whether you are doing inc/dec/same-address loops, which affects how it will compile the actual code; there is talk of needing some Java app to help the programmer figure this out, and people are still discussing timing/determinism X days after introduction.
I don't remember any of these issues coming up with any of the sharing schemes, or at least not so bad that it took -days- for some of the brighter folks to wrap their heads around.
There are no "timing issues/complexity". At least no more for the user than with the old round-robin approach. Certainly less than any of the HUB slot sharing schemes that required user configuration.
--- Every complaint about hub sharing seemed to end with a post by JMG with code/verilog simulations that were never disputed by those unhappy with sharing. All in all, sharing seemed to cause heartbreak at somehow making Core/Cogs different. I didn't follow JMG's posts to the deepest level, but then I didn't see many times where anyone appeared to disprove them either.
You write your code. Sounds great.
--- In the end, it may ultimately be as easy-peasy as Roy and you believe. It's still early days.
For people who prefer to write LMM, or use the Prop as a video generator, I guess it's great.
For the people who want to use this as it was originally sold, and as something much more useful and closer to what it purports to be, it seems like Cluso's idea, which uses part of this concept, would be a better 'fit' and certainly a better 'sell', if anyone were really interested in Parallax increasing actual volume/revenue.
--- I am actually a bit disappointed that Chip didn't comment on some of the sharing schemes. IIRC, I thought he thought they might be useful, though I may be misremembering.
Good thing Ken kept him off the forums; that sure ended up working...
Maybe if JMG had just happened to drop by Parallax.....
Hmm. This new approach does take some effort to grok.
Chip will design the chip that he wants to, which is fine. This forum is a nice place to discuss ideas and offer some to Chip, though in the end it's up to him to design it. He's laid out an idea that gets people a lot of what they want. Is there any point, ever, where people will be happy with his design? At some point, the conversation is less about design ideas (I think he has settled on a design) and becomes group microbation.
Group microbation?
It was just 3-4 weeks ago that we ended up on the latest new design, and Chip was supposed to be offline, so he could avoid some of the trees and get to work on the forest.
Now we all of a sudden have a new forest, and there seem to be a lot more questions and concerns than there were when he left. And that's including sharing.
Arguing that people should not argue when there is a Major Change is rather.. odd.
At this point, if you could get 16 Cores that actually could be used the same as a normal uC (meaning actual RAM not some hub allocated resource), with 18-26K RAM, and a shared 128K resource,
how would that be worse than this, since you could still have the same 'new' B/W to hub?
You would probably see 50-90% of the need for hub access simply removed, the Cores doing more useful work instead of being stalled, and the hub used for what it should be, and not as a replacement for RAM.
Need more program space? THEN use LMM/Hub RAM.
Another bonus this method brings is faster cog start times!
My question is: do we still have the REPS instruction? That would allow bigger bursts than 16 longs.
One down side, is that if you were doing a program that required byte data, reading a byte at a time, you would pretty much always hit a stall.
Or if you were doing an interpreter or old CPU emulation, reading an instruction, decoding it, then doing the next read would also most likely take a hit, unless you were lucky enough to make sure the timing of the decoding works with the new slot timings.
Or if you're waiting for a variable flag to be set, the loop would be slow.
I guess it makes HUB-RAM more like SDRAM, preferred access in bursts, rather than random access.
But either way we will be able to live with it, they will both have advantages and disadvantages.
I can see it being less deterministic though, because most apps don't read incrementally. Plus, if you add a variable to your code, it will change the previous timings, as it will offset later variables by one! Same with Spin: every time you change the code, it'll take different times to execute.
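To picture the byte-at-a-time problem, here's which slice each access lands on under the rotation model (my own sketch, assuming the slice is simply the long address's low nibble): consecutive longs march through the slices, while consecutive bytes hammer the same slice four times in a row.

```python
# Which hub slice each access lands on, for sequential byte reads versus
# sequential long reads (my model: slice = long address low nibble).

byte_slices = [(a // 4) % 16 for a in range(16)]   # 16 sequential byte addresses
long_slices = [a % 16 for a in range(16)]          # 16 sequential long addresses

print("byte reads hit slices:", byte_slices)   # 0,0,0,0,1,1,1,1,... same slice 4x
print("long reads hit slices:", long_slices)   # 0,1,2,...,15  a new slice each time
# Re-reading a slice you just used means waiting roughly a full rotation
# (~16 clocks), which is the byte-at-a-time stall described above.
```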
@Baggers: I worry too about what someone else mentioned, the COGs potentially being 'jittery', but the concept sounds pretty cool. I guess we'll see how it works pretty soon on FPGA and we'll know.
At this point, if you could get 16 Cores that actually could be used the same as a normal uC (meaning actual RAM not some hub allocated resource), with 18-26K RAM, and a shared 128K resource,
how would that be worse than this, since you could still have the same 'new' B/W to hub?
You would probably see 50-90% of the need for hub access simply removed, the Cores doing more useful work instead of being stalled, and the hub used for what it should be, and not as a replacement for RAM.
Need more program space? THEN use LMM/Hub RAM.
Fair enough. There is a strong argument to be made for your position. I personally like having 512KB to play with after being memory constrained for so many years.
I've been wondering if the muxes could ultimately be eliminated from at least the data bus. To visualize how, imagine how a single cog propeller with its PortA I/O would look. The port connects directly to a memory-mapped register in the cog. There are no muxes necessary (remember, the propeller only requires the muxes for I/O because the ports are shared among cogs). Is there a reason that each page of hub ram couldn't similarly route directly to one of 16 non-memory mapped registers via a 512-bit bus, and use a circular shifter to sequentially select the appropriate register for reading/writing via RD/WRx?
>block r/w instructions being used often just to keep determinism simple even though it'll burn excess power.
If it ends up there not being any snapcnt/nibblecnt, a simple dummy rdlong from address 0 should create determinism.
Whether it should be placed after your 'random' rdlong if the next random access is to address $xx0 or $xx1, I'm not sure.
Hi, Brian. Your statement about getting rid of analogies is a bold one. I like daring statements with insightful analysis that cut through the smog. But I think you need to be more specific on how the analogies fall short in explaining this new hub (HUB 2.0?) concept.
The first thing that has to go is any attempt to explain it by way of any analogy. Gears, hubs, lazy Susans, Ferris wheels, they all have to go. It's a big rectangular box in the middle of the block diagram that people have to explain.
But I'll admit that it can be hard to come up with a good analogy. When I speculated about a similar hub mechanism, at least in terms of 16 cogs, each accessing 1 of 16 memory units, all on a simultaneous and rotating basis, I struggled to come up with a suitable analogy. I partly typed out an analogy of an old car distributor cap with 16 blades instead of one that connected with 16 separate coils. But before I moved on to the next sentence, I deleted the analogy because I was struggling to explain why the heck there were 16 coils.
Now, had I gone on to make a picture of what I was struggling to express in words, I believe I would have come up with one similar to the illustration that Chip released at the beginning of this thread, as it clearly shows the 16 "blades" between the cogs and the overall hub. Using a circular mechanism is kind of a given for illustrating this, though one could put the cogs on the inside or move the cogs instead of the hub, but this way seems best.
Anyway, I know that you drew up your own block diagram, which is useful. But you don't have a problem with this illustration, do you? I mean, the rotating 16-stage hub at the center of the 16 cogs, each cog with its own blade connector, seems to illustrate the "HUB 2.0" concept quite well. What's more, the illustration isn't an analogy any more than it has to be. That is, the spinning hub shows that the connections between cogs and the individual hub memories are temporary at any one moment. If that's an analogy, it seems like a good one. But I'm also willing to entertain everyday analogies, like a lazy Susan or whatever, if they are helpful and don't mislead (seems to beat the heck out of my distributor cap rotably connected with 16 coils).
Anyway, even if the underlying concept can be illustrated so nicely, I'll admit that the ramifications of this hub change could easily go beyond what we are currently envisioning (for better and/or worse). But regarding the potential downsides, we may find solutions or patterns that address them. When considering HUB 2.0, we should keep in mind that, had we all been sitting around the "hopefully round" table during discussions for the P1, we may have scoffed at the proposal for the hub (HUB 1.0), and look how interesting and useful the P1 turned out as a whole. While I was quite pleased with the way the new design was "scaling" in terms of more RAM, cogs and speed, I think that this new hub proposal has the potential to be the next evolutionary step (update: perhaps I mean "revolutionary," in more ways than one) for the Propeller family. When one looks at it from Chip's standpoint, it not only builds upon the P1 concept but goes well beyond it, which makes it exciting. It's scary, too, just as I imagine the P1 was.
But it's absolutely great that Brian and koehler and so on care enough to play a kind of "devil's advocate" or be willing to say that "the emperor is (or might be) wearing no clothes." It is from such discussions that we will better understand the merits or pitfalls of this proposal. Thanks for doing so!
I may be a newbie, however it strikes me that Chip's new idea requires a lot of re-do on layout, and s/w.
After a lot of work by some folks on the Sharing scheme, which seemed to require little/nothing in the way of layout or s/w and thus time and money, the 'open process' apparently ended up being nothing more than whoever showed up in Rocklin to bend Chip's ear?
It's not that complex, at all. In fact, the Verilog code that I posted yesterday IS the complete solution, as it just gets synthesized and poses no extra layout effort.
Before Roy showed up, I was thinking about how the hub memory would work and I was dissatisfied with what I was imagining. I was delaying implementation because I knew I didn't have the right idea, yet. So, this is all providence.
I've been running this scheme through my head and I think it is as fast as any scheme could be that has to connect 16 cogs to 16 RAMs. Each RAM muxes one of 16 cogs' control signals for its inputs, and each cog muxes one of 16 RAMs' data outputs for reading. Though rather large in transistor count, it's way fewer circuit stages than a central hub would have imposed on the signals. So, it seems pretty ideal.
One thing about the current scheme that is not ideal is that when data is transferring between hub and cog, execution is stalled. This means that for some high-bandwidth apps, two cogs will be needed to tag-team, so that while one loads/saves data from/to the hub, the other outputs/inputs data via pins/video. If the foundry had a 3-port RAM, the hub memory exchange could occur in the background, which would be fantastic. What we have is pretty good, though, and if the cogs can be kept small, it's no big deal to use two of them in an app where sustained bandwidth is needed.
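Here's a crude sketch of how I picture that tag-teaming (clock counts are illustrative only: 16-long bursts with 2 setup clocks, and the pin-side work assumed to fit in the same window):

```python
# Crude model of the two-cog tag-team described above: while one cog is
# stalled doing the hub block transfer, the other services the pins, then
# they swap buffers. Clock counts are illustrative (16-long bursts with
# 2 setup clocks), not a spec.

BURST      = 16             # longs per block transfer
HUB_CLOCKS = 2 + BURST      # setup + one long per clock while stalled

def tag_team(total_longs):
    """Alternate the hub-transfer role between two cogs, one burst at a time."""
    clock, plan = 0, []
    for burst in range(total_longs // BURST):
        hub_cog, pin_cog = burst % 2, (burst + 1) % 2
        plan.append((clock, "cog%d hub xfer, cog%d on pins" % (hub_cog, pin_cog)))
        clock += HUB_CLOCKS
    return clock, plan

clocks, plan = tag_team(64)
for t, what in plan:
    print("clock %3d: %s" % (t, what))
print("64 longs moved in %d clocks -> ~%.2f longs/clock sustained" % (clocks, 64.0 / clocks))
```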
On the surface this new scheme reminds me of an egg beater. I hate it. You have injected utter chaos into hub access. Now instead of waiting for a window to grab a few bytes from my neatly planned out data structures, I'm having to watch an EF5 tornado of data whizz around until what I need is in front of me. I really think that, apart from a few forum members, you're going to scare everyone away with this. It's not a Propeller any more, it's a "thing" that looks like I could get my hand caught in and lose a finger.
That before-and-after picture you posted scares the hell out of me. I wonder if the "volume buyers" have any similar reactions. I sure hope not. I hope they are smarter than I!
Maybe I'm overreacting, but this latest scheme is a tough pill to swallow.
On the surface this new scheme reminds me of an egg beater. I hate it. You have injected utter chaos into hub access. Now instead of waiting for a window to grab a few bytes from my neatly planned out data structures, I'm having to watch an EF5 tornado of data whizz around until what I need is in front of me. I really think that, apart from a few forum members, you're going to scare everyone away with this. It's not a Propeller any more, it's a "thing" that looks like I could get my hand caught in and lose a finger.
That before-and-after picture you posted scares the hell out of me. I wonder if the "volume buyers" have any similar reactions. I sure hope not. I hope they are smarter than I!
Maybe I'm overreacting, but this latest scheme is a tough pill to swallow.
It's honey for some, maybe a handful of people with a few specialized applications. I can't say I'm in that crowd though
I think I had a similar initial reaction... But, after a few days of thinking about it, I think the bandwidth increase offered by this scheme makes it worth the trouble.
With 2 cogs you can get a sustained transfer at half the system clock frequency between HUB and cog (or HUB and I/O) (if I understand it right...)
Maybe you need 4 cogs to sustain transfer between HUB and a set of 32 I/O pins?
Rayman, 2 COGs. And (as you probably already know), it's half the system clock speed for reading pins, because that's the instruction rate, and it takes an instruction to read the pins.
Hmm. This new approach does take some effort to grok.
Chip will design the chip that he wants to, which is fine. This forum is a nice place to discuss ideas and offer some to Chip, though in the end it's up to him to design it. He's laid out an idea that gets people a lot of what they want. Is there any point, ever, where people will be happy with his design? At some point, the conversation is less about design ideas (I think he has settled on a design) and becomes group microbation.
Rayman, 2 COGs. And (as you probably already know), it's half the system clock speed for reading pins, because that's the instruction rate, and it takes an instruction to read the pins.
First, thanks for the explanation given.
Second, half-sysclk sampling (you have said it already 3 times) is not possible IMHO with 2 cogs.
It's true that you need 16 instructions to transfer pins from/to cog RAM, and this will overlap with the other cog transferring cog RAM from/to hub, but you still have those 2 clocks to set up the rd/wr-block, which will jitter; it can't be continuous: 16 clocks are needed to sample 16 longs from the pins, but then you need 18 to transfer them to hub.
16 vs 18: two cogs can't team up for continuous sampling; it's not enough.
And what about doing multiple wrxxxx in a row? Now you have to do all the ugly flipflop thing many times so they can all avoid stalling? Seems bad to me...
If you need to respond to an async pin event (at a variable rate of e.g. 20/25 clocks) and you need some instructions to do work prior to the wrlong to the hub, having a buffered write will prevent losing the hub window and thus losing the next pin event. Just one is enough, because if the sampling rate were higher than the transfer rate then you would really end up needing something like a UART buffer, but one buffered write will help to deal with two unsynced events (the hub and the pin).
Here is a diagram I found
http://forums.parallax.com/attachment.php?attachmentid=108689&d=1400011603
What is rotating is the Memory-Nibble to COG allocator.
The rotation model is important, as it helps reveal that INC or DEC of address yields differing resultant slot rates.
No ONE COG solutions ?
I'm not following what point you were trying to make.
You need both types of information; no data sheet on a device like this has only one drawing.
Have you spotted yet that INC or DEC of address yields differing resultant slot rates?
How did you grasp that, and using what diagram?
http://forums.parallax.com/attachment.php?attachmentid=108689&d=1400011603
The diagram shows the positions I walked through above: cog 0 aligned with hub $xxxx0, with every cog advancing to the next hub address nibble on each clock.