New Hub Scheme For Next Chip

Rayman · 2014-05-16 08:23

For higher resolution displays, it would be interesting to try the trick that FTDI's EVE uses...

It uses memory, not as a frame buffer, but as a container for various graphics objects.
Then, it builds scan lines based on these objects, just in time...

P2 might work well for this as you can put several tasks to work to build the scan lines.
This is kinda the way the P1 works at high resolution now, but only for text using ROM font.

The EVE allows multiple fonts, photos and graphics all on the screen at the same time.

cgracey · 2014-05-16 08:45

Question:

Would it be very limiting to have all video modes use a LUT?

This would mean that all pixels would be represented by 8, 4, 2, or 1 bit(s), and those pixels would translate to 32-bit values which would drive four 8-bit DACs in parallel.

I ask because a separate LUT memory (single-port 256x32) would allow function generation, aside from video.
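
A minimal sketch of the lookup described above, for readers who want to see the data flow. The pixel order within a 32-bit word (least-significant pixel first), the byte-to-DAC assignment, and the function names are assumptions made for illustration; none of them are specified in the post.

```python
def unpack_pixels(word, bpp):
    """Split a 32-bit word into pixel indices.

    Assumption: the least-significant pixel is output first.
    """
    if bpp not in (1, 2, 4, 8):
        raise ValueError("bpp must be 1, 2, 4, or 8")
    mask = (1 << bpp) - 1
    return [(word >> shift) & mask for shift in range(0, 32, bpp)]


def lut_to_dacs(entry):
    """Split one 32-bit LUT entry into four 8-bit DAC values.

    Assumption: byte 0 feeds DAC 0, byte 1 feeds DAC 1, and so on.
    """
    return tuple((entry >> (8 * i)) & 0xFF for i in range(4))


def translate(word, bpp, lut):
    """Translate every pixel in a word through the LUT into DAC values."""
    return [lut_to_dacs(lut[p]) for p in unpack_pixels(word, bpp)]


# Hypothetical 256-entry LUT: a grey ramp on DACs 0..2, DAC 3 left at zero.
lut = [(i << 16) | (i << 8) | i for i in range(256)]
```

At 8 bpp a word carries 4 pixels and every LUT entry is reachable; at 1 bpp it carries 32 pixels and only entries 0 and 1 are used.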

Brian Fairchild · 2014-05-16 08:47

Rayman wrote: »

For higher resolution displays, it would be interesting to try the trick that FTDI's EVE uses...

Or, for $5, buy an EVE.

Brian Fairchild · 2014-05-16 08:52

cgracey wrote: »

Question:...

Madness. Here we are just 46 days after the P2 burst into flames and it's looking like we're going to get....a P2 again.

JRetSapDoog · 2014-05-16 09:05

cgracey wrote: »

Would it be very limiting to have all video modes use a LUT?

This would mean that all pixels would be represented by 8, 4, 2, or 1 bit(s), and those pixels would translate to 32-bit values which would drive four 8-bit DACs in parallel.

I ask because a separate LUT memory (single-port 256x32) would allow function generation, aside from video.

Is that 8/4/2 or 1 bits in total for driving all 4 (3, for just RGB, ignoring H) DACs? If so, that limits (by design) the total number of on-screen colors to 256 (at most), though chosen among 16M colors (ignoring alpha).

Such a 256 color limitation would be just fine with me...because I kind of feel the P2 has no business, pardon my bluntness, trying to drive displays at full color (and the memory is not there, either, without SDRAM). And if it simplifies things (and possibly adds power elsewhere), all the better. Besides, 256-color images done with a good CLUT can look photo-realistic.

cgracey · 2014-05-16 09:16

JRetSapDoog wrote: »

Is that 8/4/2 or 1 bits in total for driving all 4 (3, for just RGB, ignoring H) DACs? If so, that limits (by design) the total number of on-screen colors to 256 (at most), though chosen among 16M colors (ignoring alpha).

Such a 256 color limitation would be just fine with me...because I kind of feel the P2 has no business, pardon my bluntness, trying to drive displays at full color (and the memory is not there, either, without SDRAM). And if it simplifies things (and possibly adds power elsewhere), all the better. Besides, 256-color images done with a good CLUT can look photo-realistic.

It would limit a scan line to 256 different colors. It's true that we don't have the memory to do 24bpp, and an SDRAM to support 24bpp is complicated.

By using a separate small LUT memory, the pixel spooling/translation can happen in the background, which lets the cog do other things. Most importantly, a LUT run from an NCO affords very flexible function generation which can be parlayed into a simple Goertzel implementation, which opens big doors into measurements.
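
A rough sketch of the NCO-driven LUT idea, assuming a 32-bit phase accumulator whose top 8 bits address a 256-entry table. The accumulator width, the sine table contents, and all names here are illustrative assumptions, not details of the proposed hardware.

```python
import math

PHASE_BITS = 32   # assumed accumulator width
LUT_SIZE = 256    # matches the 256-entry LUT in the post


def make_sine_lut():
    """One sine cycle as 256 unsigned 8-bit samples (one DAC channel)."""
    return [
        int(round(127.5 + 127.5 * math.sin(2 * math.pi * i / LUT_SIZE)))
        for i in range(LUT_SIZE)
    ]


def freq_word_for(f_out, f_clk):
    """Frequency word that makes the NCO wrap f_out times per f_clk clocks."""
    return round(f_out * (1 << PHASE_BITS) / f_clk)


def nco_samples(lut, freq_word, count, phase=0):
    """Step the phase accumulator and read the LUT on every clock."""
    mask = (1 << PHASE_BITS) - 1
    out = []
    for _ in range(count):
        out.append(lut[phase >> (PHASE_BITS - 8)])  # top 8 bits index the LUT
        phase = (phase + freq_word) & mask
    return out
```

Changing only the frequency word changes the output frequency, and reloading the table changes the waveform, which is what makes the arrangement a general function generator.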

potatohead · 2014-05-16 09:18

I think the function generation / measurement attributes are worth it.

Besides, there are gonna be tricks to improve on 256 colors / scanline

Demo play time someday.

Cluso99 · 2014-05-16 09:20

Chip,
How many video modules do you propose?
If it's 16 as in 1 for each cog, then no LUT, as it will take too much silicon, and we'll be back to the same clut/stack/aux as the old P2 except there are now 16 cogs.
If it's 1 or 2, then fine.

Would it be possible to use the cog ram as the clut, even if it means losing the cog when in this mode?

dnalor · 2014-05-16 09:26

Brian Fairchild wrote: »

Madness. Here we are just 46 days after the P2 burst into flames and it's looking like we're going to get....a P2 again.

Same feeling here.
It started so good here: http://forums.parallax.com/showthread.php/155132-The-New-16-Cog-512KB-64-analog-I-O-Propeller-Chip

It's exasperating.

JRetSapDoog · 2014-05-16 09:28

Per scan line, I see. Thanks for the correction. That makes sense, as the LUT could be reloaded during h-blanking (allowing NSLx256 total colors, where NSL = No. of scanlines), though such could be difficult to take advantage of in practice.
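
To make the NSLx256 figure concrete, here is a small sketch that models a frame where a fresh 256-entry palette is loaded before each scan line. The function names and the 480-line example are assumptions for illustration only.

```python
def max_colors(num_scanlines, lut_entries=256):
    """Upper bound on distinct colors if the LUT is reloaded every line."""
    return num_scanlines * lut_entries


def render_frame(index_lines, palettes):
    """Translate each line of 8-bit indices through that line's own palette.

    index_lines[y] is a list of pixel indices for scan line y.
    palettes[y] is the 256-entry LUT assumed to be loaded during the
    horizontal blanking interval before line y.
    """
    frame = []
    for indices, palette in zip(index_lines, palettes):
        if len(palette) != 256:
            raise ValueError("each palette must have 256 entries")
        frame.append([palette[i] for i in indices])
    return frame


# Example: a hypothetical 480-line display.
print(max_colors(480))  # 122880
```

The bound is only theoretical: a line cannot show more colors than it has pixels, and the reload has to fit into the blanking time.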

But a limit of NSLx256 (or even 256) sounds like a small price to pay for background processing of the look-ups allowing the cog(s) to keep running, while offering the powerful and flexible function generation you mentioned. It sounds more organized and easier to conceptualize/use, too. At least that's my first take on it.

Update: I don't think I would quickly dismiss the possibility of 16 individual LUTs, one per cog, at all. Why does it consume too much silicon? What are the numbers? The LUT's can be used for other things: signal generation and stacks, right? Each cog needs one.

Cluso99 · 2014-05-16 09:40

16 x 256 x 32b = 16 x 1KB = 16KB

And I don't think the current LUT could be used as aux or stack because it's external to the cog.

There is already another 2 cogs worth of space used in the new hub scheme. I can see that 512KB rapidly shrinking.

JRetSapDoog wrote: »

Per scan line, I see. Thanks for the correction. That makes sense, as the LUT could be reloaded during h-blanking (allowing NSLx256 total colors, where NSL = No. of scanlines), though such could be difficult to take advantage of in practice.

But a limit of NSLx256 (or even 256) sounds like a small price to pay for background processing of the look-ups allowing the cog(s) to keep running, while offering the powerful and flexible function generation you mentioned. It sounds more organized and easier to conceptualize/use, too. At least that's my first take on it.

Update: I don't think I would quickly dismiss the possibility of 16 individual LUTs, one per cog, at all. Why does it consume too much silicon? What are the numbers? The LUT's can be used for other things: signal generation and stacks, right? Each cog needs one.

cgracey · 2014-05-16 09:43

Cluso99 wrote: »

Chip,
How many video modules do you propose?
If it's 16 as in 1 for each cog, then no LUT, as it will take too much silicon, and we'll be back to the same clut/stack/aux as the old P2 except there are now 16 cogs.
If it's 1 or 2, then fine.

Would it be possible to use the cog ram as the clut, even if it means losing the cog when in this mode?

The video modes are just DAC output modes that can be used for video. There will be nothing about them (if we use a LUT) that would indicate any special video purpose. It's just generic data through DACs. So, aside from video usage, these are function generators.

The cog RAM could be used as a LUT, but it would completely tie up the cog during output. By having a separate LUT, it can become a free-running state machine, where the cog can drive the LUT with an NCO, causing functions to be output on the DACs while an ADC stream is correlated with the LUT values to form a spectral I/O loop which could resolve all kinds of wild things. This implementation would be much simpler than what was going on in the Prop2, but would enable the next chip to do some really amazing things.
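
One way to picture the correlation step is a plain sine/cosine correlation of the ADC samples against the same 256-entry table that drives the DACs. This is a single-bin DFT written out directly, not the Goertzel recurrence and not the actual hardware loop; the table, the names, and the use of floating point are all assumptions.

```python
import math

N = 256  # table length, matching the 256-entry LUT
SINE = [math.sin(2 * math.pi * i / N) for i in range(N)]


def correlate(samples, k):
    """Amplitude of the component at k cycles per N samples.

    The reference index steps by k per sample, like an NCO. The cosine
    reference is the sine table read a quarter cycle ahead.
    """
    i_sum = 0.0
    q_sum = 0.0
    for t, s in enumerate(samples):
        idx = (k * t) % N
        q_sum += s * SINE[idx]
        i_sum += s * SINE[(idx + N // 4) % N]
    return 2.0 * math.hypot(i_sum, q_sum) / len(samples)


# Simulated ADC stream: amplitude 0.5 at 5 cycles per 256 samples.
adc = [0.5 * math.sin(2 * math.pi * 5 * t / N + 0.3) for t in range(N)]
```

Correlating at the driven frequency recovers the amplitude regardless of phase, while other bins read close to zero, which is the property a measurement loop would rely on.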

cgracey · 2014-05-16 09:46

JRetSapDoog wrote: »

Per scan line, I see. Thanks for the correction. That makes sense, as the LUT could be reloaded during h-blanking (allowing NSLx256 total colors, where NSL = No. of scanlines), though such could be difficult to take advantage of in practice.

But a limit of NSLx256 (or even 256) sounds like a small price to pay for background processing of the look-ups allowing the cog(s) to keep running, while offering the powerful and flexible function generation you mentioned. It sounds more organized and easier to conceptualize/use, too. At least that's my first take on it.

Update: I don't think I would quickly dismiss the possibility of 16 individual LUTs, one per cog, at all. Why does it consume too much silicon? What are the numbers? The LUT's can be used for other things: signal generation and stacks, right? Each cog needs one.

16 LUTs (one per cog) would cost 1.15 square mm of silicon. I don't know if I'd make stacks out of them, as it complicates the cog more, and I think they should remain just LUTs for the sake of simplicity. They'd need to be writeable, of course, but maybe not readable.

JRetSapDoog · 2014-05-16 09:48

Thanks, Cluso. Yeah, I made an unwarranted assumption about the stacks thing, though maybe it could still be done by going to tri-port (or whatever) cells (though that increases the silicon space). Anyway, the signal generation benefits still seem quite attractive. I still lean toward separate LUT's for some of the reasons I touched on pg 22, in post 430. You don't have to agree with me now; you can thank me later.

Cluso99 · 2014-05-16 09:49

Might I suggest it might be better to leave these extra features for a P2+ ???

cgracey wrote: »

The video modes are just DAC output modes that can be used for video. There will be nothing about them (if we use a LUT) that would indicate any special video purpose. It's just generic data through DACs. So, aside from video usage, these are function generators.

The cog RAM could be used as a LUT, but it would completely tie up the cog during output. By having a separate LUT, it can become a free-running state machine, where the cog can drive the LUT with an NCO, causing functions to be output on the DACs while an ADC stream is correlated with the LUT values to form a spectral I/O loop which could resolve all kinds of wild things. This implementation would be much simpler than what was going on in the Prop2, but would enable the next chip to do some really amazing things.

Cluso99 · 2014-05-16 09:58

Chip,
If you added, say, 2KB of single-port RAM to the cog, at addresses $200-$3FF...
Could it be used as the clut ?
It could be used like hub for hubexec, but at full speed.
It could be used as a stack(s).
It could be likely used as data.

I don't mean the same implementation as the previous aux, but an extension to hub without the slot, i.e. just like a traditional micro's normal memory, which is usually flash.

cgracey · 2014-05-16 09:58

Cluso99 wrote: »

Might I suggest it might be better to leave these extra features for a P2+ ???

I understand your concern. The thing is, this has to be completed somehow, and the easiest way is what flows best. That's what I'm trying to resolve now.

potatohead · 2014-05-16 09:59

I personally want Chip to maximize this process. So long as the clock is good, power consumption good, etc... buildable in other words, we will get a great chip.

Flow. Exactly. The major considerations that could bring failure are known. Being in flow means great things get done quickly. Go Chip!

Cluso99 · 2014-05-16 10:03

Sorry - editing on a Xoom is almost impossible.

By stack, I mean one implemented manually by sw, using GCC style call where the return address is placed in a fixed register, say $1EF.
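
A toy model of that kind of software stack, assuming the call writes only the fixed link register at $1EF and the callee saves it by hand before making a nested call. Placing the stack in the proposed extra RAM at $200-$3FF, and every name in the sketch, are assumptions for illustration.

```python
LR = 0x1EF          # fixed link register named in the post
STACK_BASE = 0x200  # assumption: stack lives in the proposed $200-$3FF RAM
MEM_SIZE = 0x400


class Cog:
    def __init__(self):
        self.mem = [0] * MEM_SIZE
        self.sp = STACK_BASE  # next free stack slot, grows upward

    def call(self, return_addr):
        """The call itself only records the return address in LR."""
        self.mem[LR] = return_addr

    def push_lr(self):
        """Callee prologue: save LR so a nested call may overwrite it."""
        if self.sp >= MEM_SIZE:
            raise OverflowError("software stack full")
        self.mem[self.sp] = self.mem[LR]
        self.sp += 1

    def pop_lr(self):
        """Callee epilogue: restore the saved return address into LR."""
        if self.sp <= STACK_BASE:
            raise IndexError("software stack empty")
        self.sp -= 1
        self.mem[LR] = self.mem[self.sp]

    def ret(self):
        """Return target is whatever LR currently holds."""
        return self.mem[LR]
```

Leaf routines never touch the stack in this scheme; only routines that call something else pay for the push and pop.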

danielstritt · 2014-05-16 10:31

A little off topic from video: will the new chip still support outputting CD-quality digital audio, like the P1? I know it has DACs, but if I remember correctly they are 8-bit, and I am thinking more of 16-bit sound.

User Name · 2014-05-16 10:37

potatohead wrote: »

I personally want Chip to maximize this process. So long as the clock is good, power consumption good, etc... buildable in other words, we will get a great chip.

Flow. Exactly. The major considerations that could bring failure are known. Being in flow means great things get done quickly. Go Chip!

What Doug said!

Bill Henning · 2014-05-16 11:00

Aloha!

I go off-line for a while and sheesh another re-architecting happens! :-)

Actually, I quite like Chip's new scheme, but it could use a bit of TLC to smooth out the random access aspect.

I LOVE the high bandwidth it enables.

Frankly, I am fine with the way it is, but there are some performance improvements possible.

I know, it complicates things a bit, and would need some extra gates, but Chip is the judge of what goes in.

1) RDxxxC is a big win for byte interpreters / vm's / compiled code (1 - 4 lines)

2) A single level of write buffering would help a lot (cog writes, does not stall until next write, by which time the buffer may be available again)

Scatter/gather of byte/word writes could also happen; I don't know how cheap that would be logic-wise.

3) hubexec could run close to cog speed... especially with one line of prefetch
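
The effect of point 2 can be sketched with a simple timing model. It assumes every hub write takes a fixed number of clocks to complete (16 in the examples, a hypothetical figure) and compares a cog that waits for each write with one that hands the write to a one-deep buffer. The model and its names are illustrative assumptions, not a description of the real hub timing.

```python
def stall_cycles(issue_gaps, drain_time, buffered):
    """Total clocks the cog spends stalled on hub writes.

    issue_gaps[i] is the number of clocks of other work done before write i.
    drain_time is the assumed number of clocks one write needs to complete.
    """
    now = 0
    busy_until = 0  # time at which the one-deep buffer is free again
    stalls = 0
    for gap in issue_gaps:
        now += gap
        if buffered:
            # Stall only if the previous write is still draining.
            wait = max(0, busy_until - now)
            now += wait
            busy_until = now + drain_time
        else:
            # No buffer: the cog waits out every write in full.
            wait = drain_time
            now += wait
        stalls += wait
    return stalls


print(stall_cycles([20, 20, 20], 16, buffered=False))  # 48
print(stall_cycles([20, 20, 20], 16, buffered=True))   # 0
```

When writes are spaced further apart than the drain time the buffer hides them completely; back-to-back writes still stall, only less.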

[PUTTING ON FLAME PROOF SUIT]

p.s.

I am back from my extended vacation (Apr.28-May.16), I had a bit of sporadic internet access until about a week ago

potatohead · 2014-05-16 11:01

I trust you had fun, got well rested, etc... ?

Bill Henning · 2014-05-16 11:04

Yep, thanks! It was GREAT!

I'll post some pics on my site this weekend. Apr.28-May.3 on Waikiki beach, May.3-May.10 cruising around Hawaii, then cruising home.

Now I am starting to catch up on the forum, and digest what has been posted while I am away.

I really like the bandwidth of Chip's new hub model.

potatohead wrote: »

I trust you had fun, got well rested, etc... ?

Brian Fairchild · 2014-05-16 11:05

potatohead wrote: »

...we will get a great chip...Being in flow means great things get done quickly.

But when?

I said it the other day, and I'll repeat it again now, there'll be no silicon available in the wild for 12 months.

Cluso99 · 2014-05-16 11:07

Welcome back Bill. Trust the last week was pleasurable.
Was any part in bad seas? I understand the ships are very steady in bad seas these days.

potatohead · 2014-05-16 11:09

Brian Fairchild wrote: »

But when?

I said it the other day, and I'll repeat it again now, there'll be no silicon available in the wild for 12 months.

Yep. That's very likely true. Take Chip out of flow, and that time is likely extended.

Cluso99 · 2014-05-16 11:11

Bill,
Bandwidth is fantastic. There are a few caveats. The way rdblock works is magnificent.

Brian Fairchild · 2014-05-16 11:12

potatohead wrote: »

Take Chip out of flow, and that time is likely extended.

Stop Chip messing with the spec every few weeks and it'll likely be sooner.

Bill Henning · 2014-05-16 11:15

Yep. I am still chewing on it.

A 1-level write buffer + 2-level read caching would fix pretty much everything, and allow cache fills in the "background", i.e. full-speed hubexec. But it would cost gates.

I was thinking that instead of RDxxxC if we got the movfb and movfw instructions back for the 16 long block those could be used for vm's.

One day was a little choppy, rest was fine ... except for wifey who gets seasick. I never get seasick.

Cluso99 wrote: »

Bill,
Bandwidth is fantastic. There are a few caveats. The way rdblock works is magnificent.
