The Block transfer opcode will stall the COG, as it needs 16 SysClks to fill the fifo ...
I guess, in that case, it's not really a FIFO -- maybe a RIFO (random-in first-out) or a BIFO (block-in first-out). It would be nice, I suppose, if reading from the "*IFO" could begin as soon as the first (lowest-hub-address) slot is filled, since further reads cannot keep up with subsequent filling. OTOH, that could affect determinism.
I've rephrased that a little better as I'm not sure if the Block read even needs the FIFO - the FIFO is really there for video-DMA style transfers, up to fSys/N. It fills when it can, and empties at a fixed rate. (& vice-versa on read to hub)
Block[16] transfers always need 16 fSys, but do not have to use the FIFO. They do need the address / data lines to the HUB, but in theory, Block Read could work time-shared with a FIFO running video values at less than fSys/1.
Hmm, I figured the FIFO was intended for Cog instructions to use. We'll just have to wait and see I guess.
You mean for HUB exec style code fetch, or general data flows ?
HUB exec is still mentioned by Chip, but no details yet.
A fifo could be useful for LMM style code, but the block[16] read would be more deterministic, and that would free FIFOs for data flows.
It may be possible to do a hw-assisted form of LMM, that behaves like HUB exec with a speed between LMM and COG code.
Straight line code could be 66% of the speed of COG code, and a size-definable block would allow software to tune/optimise the block sizes.(multiples of 16)
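One way the 66% figure could come about, as a rough sketch only: if a block of 16 longs costs 16 clocks to load, and each cog instruction takes 2 clocks to execute, then straight-line code spends 32 of every 48 clocks doing useful work. The 2-clocks-per-instruction figure and the function name below are my assumptions, not something stated in this thread.

```python
def straight_line_efficiency(block_longs=16, load_clocks_per_long=1,
                             clocks_per_instruction=2):
    """Fraction of native cog speed for load-then-execute block code.

    Assumptions (not from the thread): one clock per long loaded, and
    two clocks per executed instruction, with no overlap of load and
    execute and no branch or setup overhead.
    """
    load = block_longs * load_clocks_per_long
    execute = block_longs * clocks_per_instruction
    return execute / (load + execute)


if __name__ == "__main__":
    print(round(straight_line_efficiency(), 3))
```

In this simple model the ratio does not depend on the block size, so it says nothing about how tuning the block size would pay off for code with loops or branches.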
Chip, a question: Presumably the FIFOs are single-ported. I'm going to guess that any Cog instruction that accesses its FIFO while the hub is accessing it will stall the Cog, right?
I ask, trying not to be too greedy, because a Cog working on the FIFO contents concurrently would seem the most streamlined ... Double buffering anyone? 2x8 FIFO per Cog maybe? /me ducks.
To reliably accommodate an automatic peripheral in the cog that is using the FIFO, the FIFO gets priority over RDxxxx/WRxxxx instructions.
The FIFO is 19 levels deep, maximally. This is to accommodate the worst case of a long being read or written on every clock. Most of the time, in practice, the FIFO will only be a few levels deep.
What can be an automatic peripheral? What sort of flexibility is there?
So the plan for the FIFO is mainly around having it pace its stream at the rate needed for the I/O. So, then, the Cog can still RDxxxx/WRxxxx in between the FIFO's Hub accesses. Am I on to it now? That almost sounds a bit too luxurious for the Prop.
That's right. RDxxxx/WRxxxx must wait for cycles when the FIFO is not needing to issue reads or writes.
There will be simple state machines to read pins (byte/word/long groups) and write them to the hub RAM via the FIFO per NCO setting. Data can also be read from the hub RAM via the FIFO and written to pins or DACs per NCO setting. Video is a case of the latter.
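The priority rule described here can be illustrated with a toy model. This is only a sketch under assumptions: a 32-bit NCO accumulator whose overflow triggers a pin sample, a hub port that can take one access per clock, and none of the real hub latency that the 19-level depth is sized for. The names `simulate`, `nco_freq` and `cog_requests` are made up for illustration.

```python
from collections import deque


def simulate(clocks, nco_freq, cog_requests, acc_bits=32):
    """Toy model: NCO-paced pin samples go through a FIFO to hub RAM.

    The FIFO wins the hub port on any clock where it holds data; a cog
    RDxxxx/WRxxxx only completes on a clock the FIFO leaves free.
    Returns (samples_written, cog_accesses_done, cog_stall_clocks, max_depth).
    """
    mask = (1 << acc_bits) - 1
    acc = 0
    fifo = deque()
    written = done = stalls = max_depth = 0
    pending = cog_requests
    for clk in range(clocks):
        acc += nco_freq
        if acc > mask:            # NCO overflow: time to sample the pins
            acc &= mask
            fifo.append(clk)      # stand-in for the sampled pin data
        max_depth = max(max_depth, len(fifo))
        if fifo:                  # FIFO has priority on the hub port
            fifo.popleft()
            written += 1
            if pending:
                stalls += 1       # cog access has to wait this clock
        elif pending:             # free clock: one cog access completes
            pending -= 1
            done += 1
    return written, done, stalls, max_depth


if __name__ == "__main__":
    # NCO at SysClk/2: the cog gets every other clock.
    print(simulate(clocks=100, nco_freq=1 << 31, cog_requests=10))
    # NCO at SysClk/1: the FIFO uses every clock and the cog never gets in.
    print(simulate(clocks=20, nco_freq=1 << 32, cog_requests=5))
```

With the NCO at SysClk/1 the cog's hub accesses never complete in this model, which is the worst case the priority rule implies.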
So if I understand correctly: the NCO controls the pin read/write rate, and the FIFO buffers both reads and writes of pins (in groups of 8/16/32 pins for byte/word/long hub reads/writes). Also, for writes, it can send them to DACs?
I have to ask...
1) How fast are the ADCs in the pins? The FIFO engine would be more symmetric if the ADCs could also write to the hub, but I suspect this is not needed, as I don't think the ADCs are fast enough to require DMA.
2) What is the maximum NCO frequency? 200MHz? (i.e. SysClk?)
Too bad mixing clock domains is not easy in the Altera software. It would be nice to allow an external clock input for use instead of an NCO frequency. 165MHz oscillator comes to mind.
Do you have write-before-read handling, in case the FIFOs point to the same destination?
This would be rare, - only in Pin streaming -> Hub, not for Hub -> video/pins.
It would also be tricky, as any element in the FIFO might be due to replace what you are about to read in SW.
Checking all FIFO elements will require too much logic.
Most apps would first start the FIFO/DMA and then know whether they were ahead or behind of the write pointer.
(Usually SW would be slower, and behind the pointer, but at low NCO speeds, you may need to slow the SW.)
There, and for burst cases, it could be useful to have some means to read/track the FIFO burst progress.
Perhaps reading the registers used to configure this could give that? This might already be planned.
Too bad mixing clock domains is not easy in the Altera software. It would be nice to allow an external clock input for use instead of an NCO frequency. 165MHz oscillator comes to mind.
Why not clock at 165MHz in that case ?
External clock can be done, but it will always be sampled by SysCLK, as on most uC.
I'd expect smart pin counters to have External Edge option, & the sampling limits INC rates to < SysCLK/2 (100MHz)
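To show why sampling by SysCLK caps the countable edge rate, here is a small sketch. It models an ideal 50% duty square wave sampled once per SysCLK and counts rising edges by comparing each sample with the previous one. The function name and the idealised waveform are assumptions for illustration, not a description of the actual smart pin logic.

```python
def count_sampled_edges(ext_period, sysclks, phase=0.0):
    """Count rising edges of an external square wave as seen by a
    detector that samples the pin once per SysCLK.

    ext_period is the external clock period in SysCLK periods
    (50% duty cycle assumed); phase shifts the sampling instants.
    """
    count = 0
    prev = 0
    for n in range(sysclks):
        t = (n + phase) % ext_period
        level = 1 if t < ext_period / 2 else 0
        if level and not prev:    # low-to-high transition between samples
            count += 1
        prev = level
    return count


if __name__ == "__main__":
    # External clock at SysCLK/4: every edge is seen.
    print(count_sampled_edges(ext_period=4, sysclks=400))
    # External clock faster than SysCLK/2: edges are missed (aliasing).
    print(count_sampled_edges(ext_period=1.5, sysclks=300))
```

In the second case the real signal has 200 rising edges in 300 SysCLKs, but the sampled view sees far fewer, which is why the increment rate has to stay below SysCLK/2.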
So if I understand correctly: the NCO controls the pin read/write rate, and the FIFO buffers, for both reading and writing pins (in groups of 8/16/32 pins for byte/word/long hub reads/writes), also for writes, it can send them to DAC's?
Yep, that's my understanding now.
1) how fast are the ADC's in the pins? The fifo engine would be more symmetric if the ADC's could also write to the hub - but I suspect this is not needed, as I don't think the ADC's are fast enough to require dma
The counters are still in limbo I think so that part is still up in the air.
SERDES would be another candidate, but again not done yet.
I've got the cog working! The next step is to write the new boot loader. I'll need to update the assembler to do this. So far, I've been coding short test programs by hand and typing them into the memory files used by Quartus to compile the FPGA image.
Since we can't substitute ROM bit cells for RAM cells in the main memory (since we are using OnSemi's RAM), I had to instantiate a separate 4K*8 ROM (0.067 square mm) that is read via the CLKSET instruction. Each cog has a 5-unique-instruction program (that only cog0 uses) to load the ROM into the base of hub RAM at start-up. Doing it this way keeps the main memory simple, but it added a layer of complexity to the development. This has been slow-going, but once I get PNut.exe working with the downloader, things are going to really accelerate. That's when I'll add hub exec.
Thanks for your patience. I'm feeling confident about the direction of things. I think the chance of failure on the next chip will be very low.
Interesting. So this is a serial-like ROM, only accessible at Power-on ? (so the RAM is 100% RAM)
How does that compare with the previous ROM size ?
I noticed TI have released a ROM version of their Piccolo, with Motor control code/libraries in the ROM.
You can always access it by using WC in the CLKSET instruction, so that D returns the next byte. Being only 8 bits wide, it doesn't cost a lot of gates and wires.
The previous ROM was 4KB, too. I might make this ROM 16Kx8, since it would still only take 0.25 sq mm, but could be changed later to accommodate USB boot code, etc. I'd just need to make the 5-unique-instruction boot program read 16KB, instead of 4KB. That just means changing two 0's into 1's.
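As a rough illustration of the scheme, here is a sketch of a byte-wide ROM that hands out its next byte on each read, plus a boot loop that copies it to the base of hub RAM. The class and function names are hypothetical, and the real 5-instruction boot program is not shown in this thread. The last helper checks one possible reading of "changing two 0's into 1's": going from 4KB to 16KB sets two more bits in the last byte address (0x0FFF to 0x3FFF).

```python
class SerialRom:
    """Hypothetical byte-wide ROM that returns one byte per read,
    standing in for the 'CLKSET with WC returns the next byte' idea."""

    def __init__(self, contents):
        self._contents = bytes(contents)
        self._pos = 0

    def next_byte(self):
        value = self._contents[self._pos]
        self._pos += 1
        return value


def boot_copy(rom, hub_ram, size_bytes):
    """Copy size_bytes from the ROM into the base of hub RAM."""
    for addr in range(size_bytes):
        hub_ram[addr] = rom.next_byte()


def bits_to_flip(old_size, new_size):
    """Number of bits that differ between the last byte addresses."""
    return bin((old_size - 1) ^ (new_size - 1)).count("1")


if __name__ == "__main__":
    rom = SerialRom(bytes(range(256)) * 16)   # 4KB of made-up contents
    hub = bytearray(8 * 1024)                 # small stand-in for hub RAM
    boot_copy(rom, hub, 4 * 1024)
    print(hex(4 * 1024 - 1), hex(16 * 1024 - 1), bits_to_flip(4096, 16384))
    # Area estimate, assuming area scales linearly with bit count:
    print(round(0.067 * 4, 3), "sq mm for 16Kx8")
```

Linear scaling of the quoted 0.067 sq mm gives about 0.27 sq mm, close to the 0.25 sq mm figure mentioned above.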
Sounds great! I like the idea of the full 512K being available as hub RAM.
Congratulations Chip, I know you must have been working hard on this and it certainly is a milestone in the P16X64A development (that is the name for "Next Chip"?).
Yes, it is annoying having to implement the ROM in this manner, but no matter how good the tools are or having over 114,000 logic elements and memory to match to play with, engineers always have something to complain about.
Congratulations on getting the COG working! Is it complete enough that you can post an instruction set?
With the full 512KB hub RAM, we are not constrained by the ROM. This can be replaced from flash with whatever the user requires. While what you did was neat putting ROM into the RAM, OnSemi's cells make much more sense now. It's a shame about the wasted resources but the result will be way better.
IMHO, the only ROM use that makes sense to me, is to be able to boot to get running. If we can get USB working then that makes sense as one of the boot alternatives. The monitor code also makes sense to get a minimum user system running too. And of course we have the security code too.
Do OnSemi have any fuse bit cells? I am presuming they don't have flash cells, or that it increases the process steps ???
Great work!
Keeping the whole 512k Ram free is a bonus that rewards your efforts.
As for the fuse construction technique, when selecting between yours and OnSemi's, besides reliability and compactness considerations, IMHO an important aspect to take into account is the ability to "keep them well buried" in the mix of transistors and wires, so they cannot be easily pinpointed by geometrically (electron beam / optically) differentiable irregularities introduced in the layout by their presence and programmed state.
As for the current design step, does the RAM speed still outperform the maximum expected COG clock rate by about 2:1?
Yanomani
The chip is designed so that the logic speed does not exceed the RAM speed, allowing the logic to be sufficiently complicated, while not going (much) under the speed limit imposed by the RAMs. So, the RAM speed is the target for everything else. RAMs are slow, compared to logic, so they set the speed limit.
Do you have write-before-read handling, in case the FIFOs point to the same destination?
No. Do you think it would be that helpful to have?
2) What is the maximum NCO frequency? 200MHz? (i.e. SysClk?)
Yes, SysClk/N, N >= 1,
As I don't know how the FIFOs you made work, I can't answer that.
But I would always want to calculate on the latest values, not old ones.
Do OnSemi have any fuse bit cells? I am presuming they don't have flash cells, or that it increases the process steps ???
Here's an updated instruction list:
They do. I will ask them about their construction. I think ours are fine, but they might have some other technique that is more reliable and compact.
Looking forward to flexing the new chip's muscles...
With the ROM being copied to RAM now, we can design things that patch into the default code. Remember to add some clever hooks.
"LINK" appears in both "current instructions" and "instructions to be added" with different opcodes. Perhaps these are different variants of LINK.
The LINK instruction that is not implemented yet will be able to provide a 19-bit constant, but is limited in where the return address can be stored.