New Hub Scheme For Next Chip

Cluso99 · 2014-05-16 11:15

Brian Fairchild wrote: »

Stop Chip messing with the spec every few weeks and it'll likely be sooner.

That's enough please.

Kerry S · 2014-05-16 11:17

Why I see this as over complicated for process control timing.

With the old style hub you had 16 clocks + get in sync for your first data read. Once you did that you KNOW it is 16 clocks back around so you can program your code to do work or wait for the next access. That is NOT dependent on the memory address you just read or the one you are going to read next.

With the new system this is no longer true, unless you are doing a block read.

Example:

I want to read a value at xxx1 and then xxx5. I wait to sync with xxx1. Now instead of a known 16 til next read it is 4. Want to read xxx2 then xxx12... oh now my wait is 10... you NEVER know how much time you have. You cannot use the time in between to do other things.

To say it again: The memory address I am randomly reading is going to change the time I have between reads. I can no longer program other things during that wait because I no longer KNOW how many instructions I have until I have to get back to issuing the next read instruction. Every random hub read is going to throw us out of sync.

<posted this by mistake over in the poll thread, putting it here where it really belongs, sorry for the double post>
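The arithmetic in the example above can be sketched in a few lines. This is only a rough model, assuming 16 hub slices selected by the low part of the long address, with the cog's window advancing one slice per clock. It ignores the clocks the read instruction itself takes, and the function names are made up for illustration.

```python
SLOTS = 16  # hub slices / cogs, as assumed throughout this thread


def old_hub_wait(prev_addr: int, next_addr: int) -> int:
    """Old round-robin hub: the window comes back every 16 clocks,
    whatever address was read last or will be read next."""
    return SLOTS


def new_hub_wait(prev_addr: int, next_addr: int) -> int:
    """New scheme (simplified model): clocks from the slot for prev_addr
    to the next slot for next_addr. Same slice means a full rotation."""
    delta = (next_addr - prev_addr) % SLOTS
    return delta if delta else SLOTS


if __name__ == "__main__":
    print(old_hub_wait(1, 5), new_hub_wait(1, 5))    # xxx1 then xxx5
    print(old_hub_wait(2, 12), new_hub_wait(2, 12))  # xxx2 then xxx12
```

Under these assumptions the gap is 4 clocks for xxx1 then xxx5 and 10 clocks for xxx2 then xxx12, against a constant 16 for the old hub.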

Cluso99 · 2014-05-16 11:21

IIRC we can extract a byte or word, so we only have to index thru the 16 longs.

For wifey: chew ginger. Seasick tablets don't work once you feel queasy. I presume it's only in bad weather.

Bill Henning wrote: »

Yep. I am still chewing on it.

1 level write buffer + 2 level read caching would fix pretty much everything, and allow cache fills in the "background" - ie full speed hubexec. But it would cost gates.

I was thinking that instead of RDxxxC if we got the movfb and movfw instructions back for the 16 long block those could be used for vm's.

One day was a little choppy, rest was fine ... except for wifey who gets seasick. I never get seasick.

Rayman · 2014-05-16 11:28

I like the CLUT idea only because it seems required if there is going to be this HUB to DAC direct DMA type thing...
But, I also think I could live without the DMA thing...

I do like Bill's write buffering idea...
Maybe there could be a read buffer too? Perhaps you ask to read some location with one instruction and then 20 clocks later are able to get it?

Heater. · 2014-05-16 11:48

Kerry S,

"Why I see this as over complicated for process control timing."

Hmmm...

Most process control and other real-time embedded systems I have worked on have looked like this:

1) There is a bunch of external events that can be signalled to the processor by interrupts.

2) Those interrupts are in no way synchronized to your processor clock. They arrive at random times as far as your system is concerned.

3) Those interrupts are prioritized. The important or high speed ones displace processing of less important or lower speed ones.

4) Execution time of those interrupt handlers is often poorly controlled. It may well vary depending on the nature of the actual interrupting event this time. Do you have one byte to read from a device this time or many? Or it may vary depending on the internal state of the system.

The upshot of all this is that we see total random chaos in the internal timing of the execution of code within the system!

By comparison, having external events all handled by their own processor with a known upper bound on shared memory access is a billion times more deterministic.

By comparison to a traditional system as described this is certainly not "over complicated". In fact it's more predictable. Much easier to reason about.

Which is not to say I don't have some reservations about this random timing of HUB access as I have expressed elsewhere.

Martin Hodge · 2014-05-16 12:20

Kerry S wrote: »

Why I see this as over complicated for process control timing.

With the old style hub you had 16 clocks + get in sync for your first data read. Once you did that you KNOW it is 16 clocks back around so you can program your code to do work or wait for the next access. That is NOT dependent on the memory address you just read or the one you are going to read next.

With the new system this is no longer true, unless you are doing a block read.

Example:

I want to read a value at xxx1 and then xxx5. I wait to sync with xxx1. Now instead of a known 16 til next read it is 4. Want to read xxx2 then xxx12... oh now my wait is 10... you NEVER know how much time you have. You cannot use the time in between to do other things.

To say it again: The memory address I am randomly reading is going to change the time I have between reads. I can no longer program other things during that wait because I no longer KNOW how many instructions I have until I have to get back to issuing the next read instruction. Every random hub read is going to throw us out of sync.

<posted this by mistake over in the poll thread, putting it here where it really belongs, sorry for the double post>

+1 Thanks, Kerry. That is a much more intelligent and diplomatic way to say what I was trying to get across.

Rayman · 2014-05-16 12:36

Bill suggested a non-blocking write and maybe there could be a non-blocking read?

You'd do the non-blocking read for a long or byte or word and execution could continue, but the read data wouldn't be there until say 8 instructions later...

If you were in a loop, you could ask for the next long or byte or word you needed and then operate on the last one perhaps...

But, if this doesn't get added, I don't think we are really much worse off than before, except in rare circumstances.
I seem to recall only a few cases where people were synced to the hub for reading or writing...
Is there a particular example in use on P1 now that would be worse off?

mark · 2014-05-16 12:43

I'd find it very hard to justify complex video generation logic in every single cog when the number of practical uses is extremely limited. Sure, there might be some edge cases out there, but I just can't see how dedicating so much silicon to that could be worth it when there are at least 100 different and more valuable things that could be done with said silicon that would be useful to 1000x more people. You really want composite video? Bit-bang it like people have done on other micros, then. Even analog VGA is pushing it as a "useful" feature. There are plenty of ICs out there which will do PRGB -> LVDS/VGA/DVI/HDMI/MIPI/DP/etc.

If anything gets added, it should be as general purpose as possible. If this LUT has uses far beyond video generation into more broad signal generation, then that's great, and might certainly be worth pursuing. I'd go as far as to say that if there's highly valuable functionality that would need to take place between the pins and hub ram, but would be a kludge to incorporate in a cog, it would be worth reducing the number of cogs to be able to implement it. Why use, say, two cogs to do something which could be performed in half a cog's worth of silicon or less?

Anyway, that's just my (highly inflated) two cents.

mark · 2014-05-16 12:45

Rayman wrote: »

Bill suggested a non-blocking write and maybe there could be a non-blocking read?

You'd do the non-blocking read for a long or byte or word and execution could continue, but the read data wouldn't be there until say 8 instructions later...

If you were in a loop, you could ask for the next long or byte or word you needed and then operate on the last one perhaps...

But, if this doesn't get added, I don't think we are really much worse off than before, except in rare circumstances.
I seem to recall only a few cases where people were synced to the hub for reading or writing...
Is there a particular example in use on P1 now that would be worse off?

This has been discussed almost since the beginning of the thread. I asked Chip if it was possible, and he said yes, but I guess that wasn't a confirmation that he would implement it.

potatohead · 2014-05-16 12:51

Didn't Chip say the issue was the dual port RAM?

OnSemi doesn't have a triple port RAM, and that would be needed for these capabilities.

Baggers · 2014-05-16 13:05

mark

jmg · 2014-05-16 13:08

Kerry S wrote: »

To say it again: The memory address I am randomly reading is going to change the time I have between reads. I can no longer program other things during that wait because I no longer KNOW how many instructions I have until I have to get back to issuing the next read instruction. Every random hub read is going to throw us out of sync.

Yes, but only if you forget to turn on the suggested SNAPCNT attach

It comes back to the example of Bigger Motor => fine-tune of Clutch, Gearbox, Brakes....
SNAPCNT is a variant of the already existing WAIT, which is itself a simpler version of WAITCNT

jmg · 2014-05-16 13:11

Rayman wrote: »

I do like Bill's write buffering idea...
Maybe there could be a read buffer too? Perhaps you ask to read some location with one instruction and then 20 clocks later are able to get it?

Yes, already discussed - that RD is best split into RDREQ and RDGET for code interleave designs.
WRREQ self waits on the second one, if too close to the first.
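A toy model of the split read may help show why interleaving matters. It assumes a fixed latency between request and data. `SplitReader`, `rdreq`, `rdget` and `sum_pipelined` are hypothetical names that only mirror the RDREQ/RDGET idea; nothing here reflects an actual instruction encoding or timing.

```python
class SplitReader:
    """Hypothetical split hub read: request now, collect the data later."""

    def __init__(self, memory, latency):
        self.memory = memory
        self.latency = latency      # assumed fixed clocks from request to data
        self.pending = None         # (address, clock at which data is ready)

    def rdreq(self, addr, now):
        self.pending = (addr, now + self.latency)

    def rdget(self, now):
        addr, ready = self.pending
        self.pending = None
        stall = max(0, ready - now)  # only stall if we ask too early
        return self.memory[addr], stall


def sum_pipelined(memory, latency, work_clocks):
    """Sum memory while requesting item i+1 before working on item i."""
    reader = SplitReader(memory, latency)
    clock = total = stalls = 0
    reader.rdreq(0, clock)
    for i in range(len(memory)):
        value, stall = reader.rdget(clock)
        clock += stall
        stalls += stall
        if i + 1 < len(memory):
            reader.rdreq(i + 1, clock)   # next read overlaps the work below
        total += value
        clock += work_clocks             # work on the value just received
    return total, stalls


if __name__ == "__main__":
    print(sum_pipelined([1, 2, 3, 4], latency=16, work_clocks=16))
    print(sum_pipelined([1, 2, 3, 4], latency=16, work_clocks=4))
```

In this model, when the work per item is at least as long as the latency, only the very first read stalls. With less work per item, the loop still waits on every read.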

Kerry S · 2014-05-16 13:11

Heater. wrote: »

Kerry S,

Hmmm...

Most process control and other real-time embedded systems I have worked on have looked like this:

....

Which is not to say I don't have some reservations about this random timing of HUB access as I have expressed elsewhere.

Heater,

Yes that is all true for random input type control.

However once you move into motion control of stepper or servo motors then you need drivers that are locked. Same is true if you need more I/O than you have and need to use multiplexing yet still keep stable timed pulses (PWM, Step/Dir, etc.). Even on reading multiplexed inputs you NEED to know exactly how long the maximum latency will be in order to make sure that your minimum real world pulse will never be shorter than you can read reliably.

Think 12 stepper/servo motors all independently controlled with step/direction signals and with optical encoder feedback for closed loop operation in a robotics or CNC application with 24 limit switches + misc I/O. Industrial/manufacturing type systems, not RC servo types.

That is just one example. I am sure those involved in DSP type applications would be just as concerned about being able to accurately, and easily, control their program timings.

Right now that is easy to do on the P1 from a timing standpoint. What is lacking is speed, memory and bit operators to optimize code.

Kerry S · 2014-05-16 13:15

jmg wrote: »

Yes, but only if you forget to turn on the suggested SNAPCNT attach

It comes back to the example of Bigger Motor => fine-tune of Clutch, Gearbox, Brakes....
SNAPCNT is a variant of the already existing WAIT, which is itself a simpler version of WAITCNT

Yes but does that not lock the cog while in the SNAPCNT wait mode? Will that not make it where we have to waste up to 16 clocks to get back in sync instead of being able to put in some intermediate code to get work done while we wait?

Just curious, the concept of fast block reads is great. I am just not sure that the performance hit for all other memory access is worth it.

jmg · 2014-05-16 13:17

cgracey wrote: »

The video modes are just DAC output modes that can be used for video. There will be nothing about them (if we use a LUT) that would indicate any special video purpose. It's just generic data through DACs. So, aside from video usage, these are function generators.

The cog RAM could be used as a LUT, but it would completely tie up the cog during output. By having a separate LUT, it can become a free-running state machine, where the cog can drive the LUT with an NCO, causing functions to be output on the DACs while an ADC stream is correlated with the LUT values to form a spectral I/O loop which could resolve all kinds of wild things. This implementation would be much simpler than what was going on in the Prop2, but would enable the next chip to do some really amazing things.

Unless the LUT is in COG RAM, and the HW NCO you suggest can cycle steal ?
- which makes it very similar to the COG low speed Clock-Enable mode possible in lower speed Video fSys/N
This avoids an 'extra' LUT memory, which will have a low usage % over the die, for reasonable area impact.

Baggers · 2014-05-16 13:23

Chip, how power intensive was the RDBYTEC cache? could we have it again?

The reason I'm asking, is... I've had an idea....

Would it be possible to say, have a cache reader that when you RDXXXXC, in the background it DMAs a long possibly more? say 16 longs, only whilst the HUB is inactive for that cog, then you can rdbytec the next byte from the cache without it affecting or having to wait the hub slot?

Once it starts caching in from a RDXXXXC, only a HUB-OP would pause it from caching, that is also assuming that the HUB-OP was at the right slot, otherwise cache reading could continue.

Is that too much messing/too awkward?

As I think this would also vastly help reduce missed slots when reading bytes for buffers etc., and also help LMM too.
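The effect being asked for can be sketched roughly as follows, assuming a single cache line of 16 longs (64 bytes). `ByteCache` and its `rdbytec` method are illustrative names only. The fill is modelled as instantaneous, so the sketch counts hub accesses and says nothing about background DMA timing.

```python
LINE_BYTES = 64  # 16 longs, the block size assumed in the post above


class ByteCache:
    """Hypothetical one-line read cache in front of hub RAM."""

    def __init__(self, hub):
        self.hub = hub
        self.line_base = None   # hub address of the cached line, if any
        self.line = b""
        self.fills = 0          # how many times the hub had to be accessed

    def rdbytec(self, addr):
        base = addr - addr % LINE_BYTES
        if base != self.line_base:
            # Miss: fetch the whole 16-long line in one go.
            self.line = bytes(self.hub[base:base + LINE_BYTES])
            self.line_base = base
            self.fills += 1
        return self.line[addr - base]


if __name__ == "__main__":
    hub = bytes(range(256))
    cache = ByteCache(hub)
    data = [cache.rdbytec(a) for a in range(128)]
    print(len(data), "bytes read with", cache.fills, "hub fills")
```

Sequential byte reads touch the hub once per 64 bytes in this model, which is the saving that would benefit buffers and LMM-style fetches.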

mark · 2014-05-16 13:26

Baggers wrote: »

mark

jmg · 2014-05-16 13:27

Kerry S wrote: »

However once you move into motion control of stepper or servo motors then you need drivers that are locked. Same is true if you need more I/O than you have and need to use multiplexing yet still keep stable timed pulses (PWM, Step/Dir, etc.). Even on reading multiplexed inputs you NEED to know exactly how long the maximum latency will be in order to make sure that your minimum real world pulse will never be shorter than you can read reliably.

Think 12 stepper/servo motors all independently controlled with step/direction signals and with optical encoder feedback for closed loop operation in a robotics or CNC application with 24 limit switches + misc I/O. Industrial/manufacturing type systems, not RC servo types.

Of course time-determinism matters, but none of your examples need sub 100ns loop precision.
Encoders are done in the Smart Pins, PWM is done in the smart pins,
Stepper Motor step you are never going to chase up to 100ns, and if you go that fast, perhaps Video modes are better.
Plenty of WAIT opcodes can re-sync any code that needs clock-snap.

mark · 2014-05-16 13:30

jmg wrote: »

Unless the LUT is in COG RAM, and the HW NCO you suggest can cycle steal ?
- which makes it very similar to the COG low speed Clock-Enable mode possible in lower speed Video fSys/N
This avoids an 'extra' LUT memory, which will have a low usage % over the die, for reasonable area impact.

Is this more or less implementing TASKS?

jmg · 2014-05-16 13:35

cgracey wrote: »

Question:

Would it be very limiting to have all video modes use a LUT?

This would mean that all pixels would be represented by 8, 4, 2, or 1 bit(s), and those pixels would translate to 32-bit values which would drive four 8-bit DACs in parallel.

I ask because a separate LUT memory (single-port 256x32) would allow function generation, aside from video.

Sure, a separate LUT is nice to have, but there are also large-block-move apps where you need to go direct to pins.
i.e. forcing all modes through the LUT is less than ideal.

Using the COG memory as LUT seems a great way to save die area - it comes down to what area that (single-port 256x32) needs ?
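The translation described in the quoted question can be sketched like this: pixels of 8, 4, 2 or 1 bits index a 256x32 LUT, and each 32-bit entry is split into four 8-bit DAC values. The bit ordering (least significant pixel first, DAC 0 in the low byte) is an assumption, and the function names are invented for the example.

```python
def unpack_pixels(long_value, bits_per_pixel):
    """Split one 32-bit long into pixels, least significant pixel first
    (the ordering is an assumption, not something stated in the thread)."""
    if bits_per_pixel not in (1, 2, 4, 8):
        raise ValueError("pixels are 8, 4, 2 or 1 bit(s)")
    mask = (1 << bits_per_pixel) - 1
    return [(long_value >> shift) & mask
            for shift in range(0, 32, bits_per_pixel)]


def lut_to_dacs(lut, pixel):
    """Translate a pixel through the LUT into four 8-bit DAC values,
    with DAC 0 taken from the low byte (again an assumption)."""
    entry = lut[pixel] & 0xFFFFFFFF
    return tuple((entry >> (8 * ch)) & 0xFF for ch in range(4))


if __name__ == "__main__":
    lut = [0] * 256
    lut[3] = 0x11223344
    for px in unpack_pixels(0x00000003, 8):
        print(px, lut_to_dacs(lut, px))
```

For large block moves straight to pins, as mentioned above, this translation step would simply be bypassed.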

jmg · 2014-05-16 13:39

[QUOTE=mark

jmg · 2014-05-16 13:50

cgracey wrote: »

16 LUTs (one per cog) would cost 1.15 square mm of silicon.

What is that in COG-Area ratios ?, or how much Main Ram does that remove ?

cgracey wrote: »

I don't know if I'd make stacks out of them, as it complicates the cog more, and I think they should remain just LUTs for the sake of simplicity. They'd need to be writeable, of course, but maybe not readable.

hehe at 'not readable'

I'm not sure single port will work, as in many cases a change to the LUT is needed while it is being used by the background HW.
So it needs a COG Write, but it could be HW-Read only, at a pinch. However that would make testing harder, and limits other uses, and I think once you allow SWwrite-while-HWread, you get 2nd port SWread for free in most RAM macros ?

cgracey wrote: »

By using a separate small LUT memory, the pixel spooling/translation can happen in the background, which lets the cog do other things. Most importantly, a LUT run from an NCO affords very flexible function generation which can be parlayed into a simple Goertzel implementation, which opens big doors into measurements.

Yes, separate LUTs do have appealing use cases.
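As a rough illustration of "a LUT run from an NCO", here is a phase-accumulator sketch. It assumes a 32-bit accumulator whose top 8 bits index a 256-entry LUT holding one sine cycle scaled to 8 bits. The widths and names are assumptions for the example, not details from Chip's design.

```python
import math

LUT_SIZE = 256


def make_sine_lut():
    """One sine cycle scaled to the 0..255 range of an 8-bit DAC."""
    return [round(127.5 + 127.5 * math.sin(2 * math.pi * i / LUT_SIZE))
            for i in range(LUT_SIZE)]


def nco_samples(lut, freq_word, count, phase=0):
    """Step a 32-bit phase accumulator and look up each output sample.
    The output frequency is f_clk * freq_word / 2**32."""
    out = []
    for _ in range(count):
        out.append(lut[phase >> 24])              # top 8 bits index the LUT
        phase = (phase + freq_word) & 0xFFFFFFFF  # wrap like 32-bit hardware
    return out


if __name__ == "__main__":
    lut = make_sine_lut()
    print(nco_samples(lut, 1 << 30, 8))  # one output cycle every 4 samples
```

Loading a different table gives a different function at the same rate, which is the "function generator" use. The Goertzel correlation mentioned in the quote is not modelled here.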

potatohead · 2014-05-16 13:52

Re: Composite

Guys, there is no cost associated with this. The planned video features Chip has put out there involve VGA hardware support. That's it. Now, a side effect of that is being able to generate waves and perform measurements, etc... if it's all kept to an 8 bit look up table / DMA type of scheme.

The only reason Composite gets mentioned is so that it can be done in software. It's easy to ensure that happens, that's all.

Having a display, and or multiple display generators makes sense because one of the goals Chip has for his chip is to be able to interact with lots of things in real time, display, measure, etc... A secondary goal is to self-host development. Beyond that, VGA display support of the kind being considered here is nice for instrumentation, user interaction, measure, signal generate, and some interaction with the smart pins.

Composite won't be costing anything. We have DACs and whatever video serializer gets implemented will be used to do a software composite display for those who want it. It can also do a component display too, for those who want that.

In the first spec Chip posted, it was VGA hardware support, and that's all this is, just maximized to mesh with the rest of the design in progress.

koehler · 2014-05-16 14:20

mindrobots wrote: »

Most of us will be!

But there will be a healthy trail of feature creepers to lead any survivors back to our rotting carcasses and dying dreams!

Heater. · 2014-05-16 14:37

What is this SNAPCNT thing supposed to do?

As far as I understand the idea, it does not seem to help with the issue of randomized HUB access at all.

mark · 2014-05-16 14:41

Heater. wrote: »

What is this SNAPCNT thing supposed to do?

As far as I understand the idea, it does not seem to help with the issue of randomized HUB access at all.

It's a simple counter used to maintain determinism of loops which incorporate RD/WRx instructions by stalling execution for a certain number of clock cycles after hub data has been received/committed.
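One way to read that description is sketched below, under stated assumptions: 16 hub slices, a cog whose window is on slice `clock % 16`, and a snap that pads every loop iteration out to a fixed period. `hub_stall` and `loop_periods` are hypothetical helpers, and the real SNAPCNT proposal may differ in detail.

```python
SLOTS = 16


def hub_stall(clock, addr):
    """Clocks to wait until this cog's window reaches the slice for addr
    (simplified: the cog sees slice clock % 16 at any given clock)."""
    return (addr - clock) % SLOTS


def loop_periods(addresses, work_clocks, snap_period=None):
    """Clock count of each loop iteration: some work, then one hub read.
    With snap_period set, each iteration is padded to that fixed length."""
    clock = 0
    periods = []
    for addr in addresses:
        start = clock
        clock += work_clocks
        clock += hub_stall(clock, addr)
        if snap_period is not None:
            if clock - start > snap_period:
                raise ValueError("snap_period is shorter than this iteration")
            clock = start + snap_period  # stall out the remainder
        periods.append(clock - start)
    return periods


if __name__ == "__main__":
    print(loop_periods([1, 5, 2, 12], work_clocks=4))
    print(loop_periods([1, 5, 2, 12], work_clocks=4, snap_period=20))
```

Without the snap the iteration lengths vary with the addresses. With it they are constant, at the cost of always paying close to the worst case, which is the trade-off Kerry S raised.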

Heater. · 2014-05-16 14:58

mark

mark · 2014-05-16 15:28

Heater. wrote: »

mark

jmg · 2014-05-16 15:38

[QUOTE=mark
