The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

David Betz · 2017-07-12 13:48

Dave Hein wrote: »

Each cog could implement an instruction cache in hub RAM. This would limit the collisions with other cogs. They would only need to access external memory when there is a cache miss. In PropGCC on the P1 we use a cache size of 8K. A larger cache on the P2 would be even better. With relative jumps, the P2 code is position independent. So code could be executed directly out of a hub cache even though the code is loaded at a different address than the original location. However, the P2 would need to be able to interrupt if it tried to execute an instruction outside of a cache line. It currently doesn't have that capability. Maybe in the P3.

This works for an LMM approach like the one used in P1 PropGCC, but it doesn't work as well with hub exec, where we don't explicitly fetch instructions in software. Some sort of minimal TLB would be needed to allow code to run at full hardware speed from a cache. Otherwise we will have to use overlays.

Dave Hein · 2017-07-12 16:24

TLB -- I had to look that up. A translation lookaside buffer isn't needed as long as the data is in a fixed location, relative jumps are used, and there is a way to interrupt when a jump is performed outside of the cache line. So the only thing the P2 is missing is the interrupt. I'm certainly not suggesting that it be added at this point, but maybe in a later version of the P2. Overlays should be a good solution for the current version of the P2.

Relative jumps are a useful feature of the P2 that allows making position-independent code. I used this feature when I was messing around with the Taz compiler to allow programs to be executed from anywhere in memory. I did have to fix up data addresses, however. It might be nice to have a relative data addressing feature in a future version of the P2 that would use a base address register to indicate the location of data space.

David Betz · 2017-07-12 16:40

I don't see how relative jumps help since they will be interpreted by the hardware as being relative to the position of the instruction in the cache not relative to its position in the larger program being run from external memory.

Seairth · 2017-07-12 17:50

One approach I'm expecting to see is to use shared LUT mode where one cog is the external memory driver for the other code that's executing from the LUT.

David Betz · 2017-07-12 18:07

Seairth wrote: »

One approach I'm expecting to see is to use shared LUT mode where one cog is the external memory driver for the other code that's executing from the LUT.

I still don't see how executing code from external memory will work unless we use an LMM approach. Any branches that target code that is in external memory will go off into the woods. Overlays, of course, should be fine.

potatohead · 2017-07-12 19:18

Agreed with David.

I wonder if it makes sense to make JMP / Branch instructions an interrupt event?

Just brainstorming.

The nice thing about LMM is it offers the capability to supervise the code.

How fast is an LMM loop on this one anyway? Has anyone made an LMM kernel for P2 yet?

An interrupt, pre JMP / Branch instructions, could split the middle. Run it all at HUBEXEC speed, supervise and act on jumps. Maybe only do that for jumps, and compile to local, relative branches, so that only the JMP needs supervision.

Again, just thinking out loud here.

potatohead · 2017-07-12 19:25

Or never use JMP or maybe even branch at all in a big, XMM type program. Use CALL instead and load the target address in advance of the CALL.

COG code could manage the branch case in a number of ways. Call jumps into COG code ready to do the work quickly.

The idea being to put the necessary handling into the COG where it runs quickly, is always available, etc...

The COG creates the right target and or fetches or signals for more XMM code to be loaded, and or waits for that to happen before continuing.

Call, in this context, becomes a path for custom "microcode" like instructions as far as a HUBEXEC program is concerned.

Agree on an address storage register or dummy instruction that precedes the call, so the COG can fetch that address to work with.

It does what is needed, depending on a range, or table made for that purpose.
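To make the CALL idea concrete, here is a minimal sketch of such a helper, modelled in plain C rather than PASM. Everything in it is an assumption for illustration: the names `branch_target`, `resident_range`, `xmm_branch` and `demo_loader` are made up, the "address storage register" is just a variable, and the loader is a stand-in that only pretends to fetch a 1K chunk. A real version would live in COG RAM and would hand back an address to jump to instead of returning a value.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical: one entry per chunk of XMM code currently resident in hub RAM. */
typedef struct {
    uint32_t xmm_start;  /* first address of the chunk in the external program */
    uint32_t length;     /* chunk length in bytes */
    uint32_t hub_start;  /* where the chunk currently sits in hub RAM */
} resident_range;

/* The agreed "address storage register": the program stores the branch
   target here just before the CALL into the helper. */
static uint32_t branch_target;

/* Hypothetical loader hook; returns the range that now holds 'target'. */
typedef const resident_range *(*xmm_loader)(uint32_t target);

/* Return the hub address to continue at for the stored branch_target. */
static uint32_t xmm_branch(const resident_range *table, size_t count,
                           xmm_loader load)
{
    for (size_t i = 0; i < count; i++) {
        uint32_t off = branch_target - table[i].xmm_start;
        if (off < table[i].length)            /* unsigned compare checks both bounds */
            return table[i].hub_start + off;  /* already resident: fast path */
    }
    /* Not resident: fetch it (or wait for another cog to fetch it). */
    const resident_range *r = load(branch_target);
    return r->hub_start + (branch_target - r->xmm_start);
}

/* Demo loader: pretends to load the 1K chunk holding 'target' at hub $4000. */
static resident_range demo_slot;
static int demo_loads;

static const resident_range *demo_loader(uint32_t target)
{
    demo_slot.xmm_start = target & ~0x3FFu;
    demo_slot.length    = 0x400;
    demo_slot.hub_start = 0x4000;
    demo_loads++;
    return &demo_slot;
}
```

With a table holding one chunk from $23000 resident at $3000, a branch to $23010 resolves to $3010 without calling the loader, while a branch to $25018 goes through the loader and resolves to $4018 in this toy setup.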

Dave Hein · 2017-07-12 19:28

David Betz wrote: »

I don't see how relative jumps help since they will be interpreted by the hardware as being relative to the position of the instruction in the cache not relative to its position in the larger program being run from external memory.

Like I said, relative jumps work only if the code stays within the cache line buffer. Let's say we have a program that has 900K of code and 350K of data, and we only have 512K of hub RAM. We use 350K for the data, and we'll use 128K for a program cache. Let's make a cache line 1K in size, which provides for 128 cache lines in our buffer. So at any one time we will have about 14% of our code in the cache. Depending on how the program is organized this may result in a very small number of cache misses.
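The numbers in that example can be checked with a few lines of C. This is only the arithmetic from the paragraph; the function names are made up for illustration.

```c
#include <stdint.h>

/* Number of lines in a cache of cache_bytes split into line_bytes lines. */
static uint32_t cache_line_count(uint32_t cache_bytes, uint32_t line_bytes)
{
    return cache_bytes / line_bytes;
}

/* Share of the program's code resident at once, in tenths of a percent. */
static uint32_t resident_permille(uint32_t cache_bytes, uint32_t code_bytes)
{
    return (uint32_t)(((uint64_t)cache_bytes * 1000u) / code_bytes);
}

/* Hub RAM left over after data and cache, in bytes. */
static uint32_t hub_spare(uint32_t hub_bytes, uint32_t data_bytes,
                          uint32_t cache_bytes)
{
    return hub_bytes - data_bytes - cache_bytes;
}
```

For a 128K cache with 1K lines and 900K of code this gives 128 lines and 142 tenths of a percent, i.e. the roughly 14% quoted, with 34K of the 512K hub left over for everything else.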

A 1K chunk of code that originally started at location $23000 can run in the cache starting at location $3000 because it uses relative jumps. The problem occurs when it jumps to a region outside of the cache line, such as less than $3000 or greater than or equal to $3400. An interrupt would be needed so that the ISR could ensure that the correct code is loaded in the target cache line, and then once loaded it would jump to the target location in the cache line. As long as the program jumps within the address range of $23000 to $233FF in the original address space it would stay within the range of $3000 to $33FF within the cache. A jump to $25018 in the original address space would require loading the 1K chunk starting at address $25000 into the cache area at location $5000, and then jumping to $5018 in the cache.
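The addresses in this example ($23000 to $3000, $25018 to $5018) are what you get if the cache is direct mapped, so that the low 17 bits of the original address pick the spot in the 128K cache. That is my reading of the example, not something stated outright. Under that assumption, the check the ISR (or a branch helper) would perform looks roughly like the C model below. The names are hypothetical, offsets are relative to wherever the cache sits in hub RAM, and the actual copy from external memory is left as a comment.

```c
#include <stdint.h>
#include <stdbool.h>

#define LINE_SIZE   0x400u                      /* 1K cache line */
#define CACHE_SIZE  0x20000u                    /* 128K cache */
#define NUM_LINES   (CACHE_SIZE / LINE_SIZE)    /* 128 lines */
#define NO_TAG      0xFFFFFFFFu                 /* marks an empty line */

/* tags[i] holds the original start address of the chunk loaded in line i. */
static uint32_t tags[NUM_LINES];

static void cache_init(void)
{
    for (uint32_t i = 0; i < NUM_LINES; i++)
        tags[i] = NO_TAG;
}

/* Offset within the cache where an original address lands (direct mapped). */
static uint32_t cache_offset(uint32_t addr)
{
    return addr % CACHE_SIZE;
}

/* Make sure the chunk holding 'target' is resident, then return the cache
   offset to jump to. *missed reports whether a load was needed. */
static uint32_t cache_jump(uint32_t target, bool *missed)
{
    uint32_t chunk = target & ~(LINE_SIZE - 1u);      /* e.g. $25018 -> $25000 */
    uint32_t line  = cache_offset(chunk) / LINE_SIZE;

    *missed = (tags[line] != chunk);
    if (*missed) {
        /* a real system would copy LINE_SIZE bytes from external memory here */
        tags[line] = chunk;
    }
    return cache_offset(target);
}
```

Because every address in a chunk is shifted by the same amount, a relative jump that stays inside the chunk lands on the right instruction in the cache. Note that $45018 would map to the same line as $25018 in this scheme, so the two chunks would evict each other.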

Data would reside at a fixed address, so there is no need to change the addresses for data access. If we wanted to cache data also we would need a way to easily remap the addresses. We could use a base address register to do this, but since data could be scattered randomly in memory it might be better to use an MMU at that point.

David Betz · 2017-07-12 19:30

potatohead wrote: »

Or never use JMP or maybe even branch all. Use CALL instead and load an address in advance. COG code could manage the branch case in a number of ways. Call jumps into COG code ready to do the work quickly.

The idea being to put the necessary handling into the COG where it runs quickly, is always available, etc...

The COG creates the right target and or fetches or signals for more XMM code to be loaded, and or waits for that to happen before continuing.

Call, in this context, becomes a path for custom "microcode" like instructions as far as a HUBEXEC program is concerned.

Agree on an address storage register or dummy instruction that precedes the call, so the COG can fetch that address to work with.

It does what is needed, depending on a range, or table made for that purpose.

That's an interesting idea. All branches essentially become LMM macros but everything else runs at full speed. Nice.

Dave Hein · 2017-07-12 19:39

Replacing jumps with a call eliminates the need for an interrupt except for the case where the program crosses from one cache line to the next without a jump. So code that reaches the end of a cache line would need to perform a call to get to the next address. In theory, the linker could insert calls at the end of each cache line boundary. It would have to add 4 to all the addresses after that, but it seems doable.
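As a rough sketch of that address fix-up, assume the 1K line size used earlier in the thread, one 4-byte instruction per inserted call, and code that starts on a line boundary. Each line then holds 1020 bytes of original code plus the call, and an original offset moves up by 4 for every inserted call before it. The function names are made up for illustration.

```c
#include <stdint.h>

#define LINE_SIZE  1024u   /* assumed cache line size */
#define CALL_SIZE  4u      /* one P2 instruction */

/* New offset of code originally at 'old_off' once a call has been
   inserted in the last slot of every cache line before it. */
static uint32_t relocated_offset(uint32_t old_off)
{
    uint32_t payload = LINE_SIZE - CALL_SIZE;   /* original code bytes per line */
    uint32_t calls_before = old_off / payload;  /* inserted calls preceding it */
    return old_off + calls_before * CALL_SIZE;
}

/* Offset of the inserted call that ends line number 'line'. */
static uint32_t inserted_call_offset(uint32_t line)
{
    return line * LINE_SIZE + (LINE_SIZE - CALL_SIZE);
}
```

So the instruction originally at offset 1016 stays put, the one at 1020 moves to 1024 (the start of the second line), and the first inserted call sits at 1020.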

David Betz · 2017-07-12 19:41

Dave Hein wrote: »

Replacing jumps with a call eliminates the need for an interrupt except for the case where the program crosses from one cache line to the next without a jump. So code that reaches the end of a cache line would need to perform a call to get to the next address. In theory, the linker could insert calls at the end of each cache line boundary. It would have to add 4 to all the addresses after that, but it seems doable.

Or just make the cache lines a little longer and have the cache manager supply the CALL instructions.

Dave Hein · 2017-07-12 19:50

I'm not sure how that would work. If I'm running straight line code that crosses a cache line the cache manager wouldn't know it. The cache manager would only run when the user code explicitly calls it. It would be easy for the linker to ensure that small objects never cross a cache line. When linking an object it could just skip to the next cache line boundary if the object won't fit in the current cache line. Small holes at the end of cache lines could be filled in by small objects later on in the linking process. Objects that are larger than a cache line would need special handling.

In theory, straight line chunks could be packed in any order as long as the previous straight-line chunk didn't just fall into it. The compiler could ensure that a straight-line chunk is never greater than a cache line by breaking it up with jumps.
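The placement rule for small objects is simple enough to sketch in C. This is only an illustration under the 1K line size assumed earlier, with made-up function names. Objects larger than a line are not really handled: they are just started on a line boundary and left for the special handling mentioned above.

```c
#include <stdint.h>

#define LINE_SIZE 1024u   /* assumed cache line size */

/* Pick the address for an object of 'size' bytes given the current link
   cursor. An object that fits in a line never straddles a boundary. */
static uint32_t place_object(uint32_t cursor, uint32_t size)
{
    uint32_t used = cursor % LINE_SIZE;          /* bytes already used in this line */
    if (used == 0 || size <= LINE_SIZE - used)
        return cursor;                           /* fits here, or line is fresh */
    return cursor + (LINE_SIZE - used);          /* skip to the next line boundary */
}

/* Size of the hole left behind when place_object() skips ahead; a later,
   smaller object could be dropped into it. */
static uint32_t hole_size(uint32_t cursor, uint32_t size)
{
    return place_object(cursor, size) - cursor;
}
```

With the cursor at 1000, a 24-byte object goes in at 1000, while a 28-byte object is pushed to 1024 and leaves a 24-byte hole for some later small object.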

Seairth · 2017-07-12 20:11

David Betz wrote: »

Seairth wrote: »

One approach I'm expecting to see is to use shared LUT mode where one cog is the external memory driver for the other code that's executing from the LUT.

I still don't see how executing code from external memory will work unless we use an LMM approach. Any branches that target code that is in external memory will go off into the woods. Overlays, of course, should be fine.

Overlays are all I had in mind. There is no provision for virtual memory and the use of external RAM doesn't change that.

Seairth · 2017-07-12 20:14

I'm not understanding the discussions about cache lines. What cache lines?

David Betz · 2017-07-12 20:26

Seairth wrote: »

I'm not understanding the discussions about cache lines. What cache lines?

We've basically started talking about XMM for P2. There is no need for LMM with hub exec but there needs to be some way to handle branches in cached code if we are to implement XMM. Potatohead came up with a good idea for how to handle that.

Dave Hein · 2017-07-12 20:48

This is why it's important to be working on PropGCC for the P2 now. I'm sure there will be things that we'll discover that would be useful to get into the chip for C. It will be too late if we wait till after the chip is done. Ken, are you listening?

potatohead · 2017-07-13 00:14

Agreed. Might be a resource matter though. Ken hinted it might be at this time.

For proofs, can we use something old and simple? Something mere mortals can work with? I'm good with PASM, as many of us are. SPIN will happen, and no brainer there.

It takes me a while to be successful with a big C environment, and I don't know anything about compiler guts. Linkers and loaders are a little more familiar, but LOL, you get the idea.

A simple compiler, maybe just C, could be used to make and or run some big C programs out there. Maybe an old one and its guts are enough? I'm not sure we need intense optimization to work out the bits needed to make XMM sing. It's maybe even better that programs are big!

The P2 will be like P1 in terms of RAM relative to capability.

Byte code and XMM, that is fast, will both see use, as will overlays, which are likely straight PASM. Hand or compiled.

Byte code SPIN will be a simple, big program option. Or could be. A lot will fit anyway. And, if we do just enough to make loading objects dynamically easy, that's done. Really big programs, simple. Just get the stuff from SD or storage as needed. No XMM needed. The simpler this is, the better. People will use it hard and make big programs in chunks. And will share those like they do objects now.

This cache XMM discussion seems important. It's the "just do it" big program case that seems like it will or could really perform. Maybe on par with or faster than byte code without overlays or loading schemes?

IMHO, the current expectation for byte code speed at 160 MHz is killer! I know PASM is way faster, but it's about matching speed to tasks people may want to do, and byte code at 160 MHz may be in the range that native COG PASM is on P1. Cool. Very useful and relevant in terms of speed, IMHO, given the COGS help out P1 style.

Testing XMM now, maybe to see if it is in a similar range? And or, so we don't miss out on some XBYTE type thing that would have been cake to do now?

Should happen somehow.

A P2 with a big RAM could be really attractive when used this way. Most of the HUB is data, some execute. One, maybe two big programs run fine, and a bunch of little ones drive the COGS as we all know how to do, and we know doing that works great. That may well be a very common use case. I think it might be, if it's real and not a hassle, and doesn't require very high skill to develop on.

There could be setups. Load one, and it's got goodies needed: video, sound, USB, storage, math, serial, motor drives, whatever. That stuff is ready to go, just a library call away. That's a lot like Arduino sketches.

From there, write your program as if that stuff were hardware. Same vision as P1 and the obex, only it's bigger. Big enough to tackle a much broader set of use cases. And more capable. Real color depths, for example, even at modest resolution can look good and be compelling to use.

P2 with 128Mb RAM? XMM at byte code type speed overall would make for an application that doesn't need to scrimp so much. People could maybe even use some library code out there. Or not worry about stuff like full on printf, or filesystem support as major constraints on the main task.

The core task will be lean and mean, but the user experience code could be much larger and the resulting application or appliance / control system presentation on par with much greater system on a chip solutions, but not requiring an OS and all that comes with doing that.

Anyhow, just a few more thoughts out loud.

jmg · 2017-07-13 00:28

potatohead wrote: »

..
P2 with 128Mb RAM? XMM at byte code type speed overall would make for an application that doesn't need to scrimp so much. People could maybe even use some library code out there.

Yes, Chip has the byte-code engine well optimised now, so it could be a good time to check how that tight code can/could reach into off-chip memory, and whether any simple hardware can help there.
E.g. many MCUs use a simple ready signal to allow them to auto-stretch fetches from slower memory sources.

potatohead · 2017-07-13 00:40

Yeah jmg I was really thinking more compiled PASM XMM style. Byte code was more of a speed reference, not so much complicating it to go XMM.

Nvm, I see you wrote code. Didn't see it.

msrobots · 2017-07-13 01:36

Somebody here tested PropGCC and the linker for overlay support on a P1.

And it did work, so overlays are still supported by GCC & Co. The concept of overlays dates back a long time, to when common PCs had less RAM than the P2 is proposing.

I never dug into compiler design and how stuff like overlays are implemented, but it seems to me a valid option to explore if it can be used to provide support for larger programs.

Enjoy!

Mike

Dave Hein · 2017-07-13 02:46

I've been working on the p2gcc compiler for about 3 months. It's documented in the "Can't Wait for PropGCC thread on the P2?" thread. Maybe I could get it to work with external memory. Has anybody tried interfacing external memory chips to the FPGA boards? I am able to access an SD card using a C program on the P2. Maybe I could try running programs from SD. I'll have to speed up my SPI driver though. It's currently written in C. Does anybody have a fast SPI driver for the P2?

jmg · 2017-07-13 04:57

Dave Hein wrote: »

I've been working on the p2gcc compiler for about 3 months. It's documented in the "Can't Wait for PropGCC thread on the P2?" thread. Maybe I could get it to work with external memory. Has anybody tried interfacing external memory chips to the FPGA boards? I am able to access an SD card using a C program on the P2. Maybe I could try running programs from SD. I'll have to speed up my SPI driver though. It's currently written in C. Does anybody have a fast SPI driver for the P2?

The Smart Pin cells have SPI modes. It is unclear what the top MHz limit on these is, which would be good to know.
The description mentions a 2 clk delay, edge to data, so that may limit it to SysCLK/4?

%11100 = synchronous serial transmit

This mode overrides OUT to control the pin output state.

Words of 1 to 32 bits are shifted out on the pin, LSB first, with each new bit being output two internal clock cycles after registering a positive edge on the B input.

%11101 = synchronous serial receive

Words of 1 to 32 bits are shifted in by sampling the A input around the positive edge of the B input. For negative-edge clocking, the B input may be inverted by setting B[3] in WRPIN’s D value.

The streamer has 1,2,4,8 bit modes so that should be able to combine with the SPI modes to do Quad and HyperRAM DDR type IO.

Flash SPI in one-bit mode has an 8b command and a 24b address, with an optional dummy byte for high speed.

The device I have open, the FT25H16, says 80MHz max for 03H reads and 120MHz for 0BH reads (with dummy), but it seems the highest speed may not be possible via Smart Pins.

There is Quad Fast Read (6BH) that uses 40 clocks one-bit, then streams Quad.

That might be supported as
a) * One 32b Pin Cell send, 06BH + Addr24
b) * One 8b Pin cell dummy (Rx?)
c) * Streamer x4 read ( I think this has a higher clock ability, so might get to 120MHz )

The hand-over details around the CLOCKs, between a), b), c) would need checking.

Slower, but simpler initially would be to use just Smart Pin cells
a) * One 32b Pin Cell send, 06BH + Addr24
b) * One 8b Pin cell dummy (Rx?)
c) * Multiple 32b reads

32b reads gives quite a few SysCLKs to decide what to do with the data.
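To put rough numbers on the Quad Fast Read sequence, here is a small C sketch of the 32-bit header and the clock count. It is not a driver and touches no Smart Pin registers. The function names are made up, and the bit reversal is an assumption: SPI flash expects the command MSB first, while the quoted transmit mode shifts LSB first.

```c
#include <stdint.h>

#define CMD_QUAD_FAST_READ 0x6Bu

/* 32-bit header for the first send: command byte followed by the 24-bit
   address, in the MSB-first order the flash expects. */
static uint32_t quad_read_header(uint32_t addr24)
{
    return (CMD_QUAD_FAST_READ << 24) | (addr24 & 0x00FFFFFFu);
}

/* The quoted smart-pin text says bits go out LSB first, so the header
   would need reversing before it is handed to the pin. */
static uint32_t reverse32(uint32_t v)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (v & 1u);
        v >>= 1;
    }
    return r;
}

/* Flash clocks for one Quad Fast Read: 8 command + 24 address + 8 dummy
   clocks on one data line, then two clocks per byte on four lines. */
static uint32_t quad_read_clocks(uint32_t nbytes)
{
    return 8u + 24u + 8u + 2u * nbytes;
}
```

Reading a 1K chunk this way takes 40 + 2048 = 2088 flash clocks. If the whole transfer were held to SysCLK/4 at the 160 MHz mentioned earlier in the thread, that would be a 40 MHz clock and about 52 microseconds per 1K.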

AntoineDoinel · 2017-07-13 13:54

Dave Hein wrote: »

I've been working on the p2gcc compiler for about 3 months. It's documented in the "Can't Wait for PropGCC thread on the P2?" thread. Maybe I could get it to work with external memory. Has anybody tried interfacing external memory chips to the FPGA boards? I am able to access an SD card using a C program on the P2. Maybe I could try running programs from SD. I'll have to speed up my SPI driver though. It's currently written in C. Does anybody have a fast SPI driver for the P2?

Some time ago I tried, with P1V on the DE0-nano, to connect the onboard SDRAM in a way essentially similar to what Jazzed did for his SDRAM board.
P1V had PORTB implemented, and I included a 16-bit latch in Verilog, so from the P1V point of view it took fewer pins (~24 pins, with 16 being the data/addr bus).

The advantage would be, even for P2, reusing most of the existing SDRAM cache driver.

Some time *LATER* from now, I meant to ask Chip to implement a similar thing for P2 for the DE0-nano (and maybe some other minor mods), but for the time being I don't want to distract him with such "amenities".

Yanomani · 2017-07-13 16:27

Dave Hein wrote: »

I've been working on the p2gcc compiler for about 3 months. It's documented in the "Can't Wait for PropGCC thread on the P2?" thread. Maybe I could get it to work with external memory. Has anybody tried interfacing external memory chips to the FPGA boards? I am able to access an SD card using a C program on the P2. Maybe I could try running programs from SD. I'll have to speed up my SPI driver though. It's currently written in C. Does anybody have a fast SPI driver for the P2?

About a month ago, Richard (RJSM) did outstanding work interfacing HyperRAM to the P2 and also the P1.

It is documented in the following threads:

https://forums.parallax.com/discussion/166802/hyperram-solutions-for-p2-and-p1/p1

https://forums.parallax.com/discussion/166850/boost-your-propeller-s-memory-256x-from-32kb-8mb#latest

I hope he can help your efforts too.

Henrique

Dave Hein · 2017-07-13 17:11

Thanks for the info on external memory. I think I'll try it first with an SD card or maybe the flash memory that is already connected to the P2 on the DE2-115. If I get that working then I'll try it with higher speed memory. However, it will be several weeks before I can get to it.

The_Master · 2017-08-07 11:31

cgracey wrote: »

Engineering is being over-prescribed these days. It's now for everybody. The reality is that maybe 2% of the population are inclined to be interested in it.

This is thought provoking and reminds me of something else I think I read somewhere. It said there are fewer programmers today (professional or hobbyist) than 20 years ago. Despite increased devices everywhere, fewer people attempt to do any programming, and this even includes HTML. 99% of people don't do any programming.

This goes against what you would think. Maybe there is a higher barrier to entry, with more complicated languages. You guys think that 99% assessment sounds right?

Heater. · 2017-08-07 17:30

A few years back Eben Upton was the admissions officer for computer science at Cambridge University. He noticed that the kids entering those courses did not know how to program. That was in contrast with 10 or 20 years before, when all the intake had been programming since they were ten or twelve years old. They used to dive straight into computer science; now much of the first year was wasted bringing them up to speed on just programming.

This observation led Upton to create the Raspberry Pi, an attempt to attract kids into computers and programming. Now there are some 14 million Pi out there...

Did all that inspire a new generation of programmers? We shall see ....

David Betz · 2017-08-12 22:31

What's happening with P2? Is Chip back from China yet?

msrobots · 2017-08-13 03:25

no, they are keeping Chip until they can produce a clone of him.

Mike

Dave Hein · 2017-08-13 13:38

On July 6 Chip stated that the new test chip should go into a shuttle run in about 4 weeks.

cgracey wrote: »

...
Things are coming together nicely on all fronts to get a full chip built. Our current test chip will go into a shuttle in about 4 weeks. It usually takes 10 weeks to get back. By that time, the Prop2 digital design should be quite certain and proven. Notice that in the last release, there were no new instructions, just refinements to what already exists.

Work on the Spin2 interpreter has been going slowly lately, as I've been working on refining the main Verilog code and getting prepared for a China trip where we are taking a bunch of kids to help teach Chinese kids how to program our Scribbler 3 robots in Blockly.
...

I suspect this has not happened yet since Chip has been away. However, Chip did know about the China trip when he made this estimate, so maybe he had factored that in.
