David Betz has recently informed me that the P2 is finally available, so if I can get my hands on one, there will be a P2 version of Catalina. However, I can't figure out how to order the evaluation board. The P2 discussion forums point you to the shop link for the "Propeller 2 ES Evaluation Board" but that page just says it is "unavailable". Does anyone know if it is actually available? Perhaps it is already sold out?
You need to talk with Chip, but there are about 100 P2 Eval boards in the wild, and some P2D2s as well.
I believe some damaged P2 Eval boards have been swapped for refurbished ones, so you might score a repaired return unit?
i.e. most of the packaged P2 dies ended up on boards.
There is a pause right now while P2+ is routed/taped out, so a progress report on that from OnSemi must be due any moment now?
The target for P2+ was/is to hit 250MHz and the lowest HDMI speeds. Current P2s top out on the bench, overclocked, at somewhere over 300MHz on good copper planes.
Welcome back, Ross! I should probably just send you my P2-Eval board since you're likely to make better use of it than I will. Actually, Peter Jakacki might have another P2D2 board available.
Ok - thanks all. Clearly I need to do some more reading. I'm not sure what state the P2 evaluation board is actually in, or what a P2D2 is. Were there problems with the original eval board? Or could it just not be run at the full clock speed? (which wouldn't be a problem for me, at least).
To answer, the existing Prop2 chips are "engineering samples", so not for general evaluation of the finished chip. There are some identified issues, but speed certainly wasn't one of them. It can reach over 300 MHz, double the original estimate of 160 MHz! Although it does run hot then.
I think everyone's hoping next run will be the finished product.
To answer, the existing Prop2 is "engineering sample" chips, so not for general evaluation of the finished chip.
? Engineering samples are certainly intended for evaluation, and whilst they may not operate exactly as the 'finished chip', they are certainly working now, and are fine for much of software and hardware development.
The errata list is quite small, and the code changes are also small. Some new hardware is coming in P2+, but that will not affect a compiler.
The biggest gotcha in P2-ES, I think, is the Verilog sign-extension oops between the FPGA and ASIC flows.
IIRC, the small opcode extensions in P2+ can be made backward binary compatible.
Thanks again, everyone. Fortuitously, a health issue has given me a few weeks where I can't be doing my normal work around our eco-retreat, so I have a window of opportunity to get up to speed on the P2.
Hi Ross,
There are two pcbs...
The P2D2 board, which Peter Jakacki (Brisbane) made, contains one of the first 10 P2-ES chips, which were epoxied by OnSemi into the IC package.
The P2_EVAL board was made by Parallax and 100 were built with the P2-ES chips correctly packaged. There were a few more of these P2-ES chips sent to Peter. IIRC there were about 110 of these P2-ES chips.
There are some minor problems with the P2-ES silicon, but IMHO we could have lived with them, although the sign-extension problem was a real one.
There are threads for all these titles.
Now, the P2 is real and we're expecting the next run back somewhere around May if my memory serves me correctly.
There are some really exciting instructions in the P2
Here are the real P2 hardware basics...
* 8 Cogs with 2KB Cog RAM and 2KB LUT RAM
* 512KB of HUB RAM dual-mapped into 1MB. The top 16KB is preloaded from internal serial ROM and can be write protected.
* Cog access to the hub is via the "egg-beater" (see below)
* 64 I/O, all with "smart-pins" (see below) and ADC
* inbuilt DACs
* 100pin QFP
* clock aimed at 160/180MHz but we've pushed this (with cooling) to over 340MHz
* code can execute from cog, lut and hub (hubexec)
* most instructions take 2 clocks
egg-beater
* the hub is divided into 8 x 64KB (16Kx32bits) blocks of ram
* consecutive long addresses fall in consecutive blocks, wrapping back to the first block after the last one.
* On every clock, the cog has access to one of the hub ram blocks!!!
* on each clock the cog gains access to the next hub block
* so, after 8 clocks the cog may have read in 8 longs!!!
* there is a streamer to facilitate this fast access.
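To make the egg-beater bullets above concrete, here is a minimal Python model of the slice rotation. It is only a sketch: the function names are made up for illustration, and the rotation order (cog `c` sees slice `(c + clock) mod 8`) is an assumption, since the thread only says each cog reaches the "next" block on every clock.

```python
NUM_SLICES = 8        # 8 hub RAM blocks ("slices"), per the list above
BYTES_PER_LONG = 4

def slice_of(byte_addr):
    """Slice holding the long that contains byte_addr."""
    return (byte_addr // BYTES_PER_LONG) % NUM_SLICES

def slice_visible(cog, clock):
    """Slice a cog can reach on a given clock (assumed rotation order)."""
    return (cog + clock) % NUM_SLICES

def burst_read_clocks(cog, start_addr, num_longs, start_clock=0):
    """Clocks on which each consecutive long becomes readable by the cog."""
    clock = start_clock
    # Wait until the cog's window lines up with the first long's slice.
    while slice_visible(cog, clock) != slice_of(start_addr):
        clock += 1
    # From then on, consecutive longs sit in consecutive slices,
    # so the cog can take one long per clock.
    return [clock + i for i in range(num_longs)]

if __name__ == "__main__":
    print(burst_read_clocks(cog=0, start_addr=0, num_longs=8))
    print(burst_read_clocks(cog=3, start_addr=0, num_longs=2))
```

In this model the only variable cost is the initial wait for the right slice to come around; after that the burst runs at one long per clock, which is the point of the "8 longs in 8 clocks" bullet.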
smart-pins
* each pin has a smart-pin state machine to do all sorts of things
And there is so much more.
Only down-side is that the P2 uses a lot of power. Hopefully the next P2 will not use so much if sections are not running.
Aim is for 200MHz
To answer, the existing Prop2 is "engineering sample" chips, so not for general evaluation of the finished chip.
? Engineering samples are certainly intended for evaluation, and whilst they may not operate exactly as the 'finished chip', they are certainly working now, and are fine for much of software and hardware development.
*Early* development with lots of caveats type evaluation sure. By general and finished I mean as a supported and production ready product.
I found this in the P2 documentation ...
"The globally-accessible hub RAM can be read and written as bytes, words, and longs. Hub addresses are always byte-oriented. There are no special alignment rules for words and longs in hub RAM. Cogs can read and write bytes, words, and longs at any hub address, as well as execute instruction longs from any hub address starting at $400."
But then the description of the "egg beater" says this ...
"Hub RAM is comprised of 32-bit-wide single-port RAMs with byte-level write controls. For each cog, there is one of these RAMs, but it is multiplexed among all cogs. Let’s call these separate RAMs “slices”. Each RAM slice holds every single/2nd/4th/8th/16th (depending on number of cogs) set of 4 bytes in the composite hub RAM. At every clock, each cog can access the “next” RAM slice, allowing for continuously-ascending bidirectional streaming of 32 bits per clock between the composite hub RAM and each cog."
So, what happens if I read or write a long that is not aligned on a long boundary? Wouldn't this long contain bytes from different slices?
Yes, it adds one clock to the read/write time. The big feature of the egg-beater is that it provides fast, burst-like access to consecutive addresses.
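As a rough illustration of why an unaligned access costs more, the sketch below lists which slices a hub access touches. The helper names are hypothetical, and the "one extra clock" figure is simply the one quoted in the reply above; the model only counts slices.

```python
NUM_SLICES = 8
BYTES_PER_LONG = 4

def slices_touched(byte_addr, size=4):
    """Slices covered by an access of `size` bytes starting at byte_addr."""
    first_long = byte_addr // BYTES_PER_LONG
    last_long = (byte_addr + size - 1) // BYTES_PER_LONG
    return [index % NUM_SLICES for index in range(first_long, last_long + 1)]

def extra_clocks(byte_addr, size=4):
    """Extra clocks in this model: one per additional slice touched."""
    return len(slices_touched(byte_addr, size)) - 1

if __name__ == "__main__":
    for addr in (0x1000, 0x1001, 0x1002, 0x1003):
        print(hex(addr), slices_touched(addr), extra_clocks(addr))
```

An aligned long stays inside one slice, while a long at any other byte offset straddles two neighbouring slices. A word only straddles when it starts on the last byte of a long.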
Ok, I guess that makes sense. But I can see that the timings of instructions on the P2 are going to be a headache!
Executing from hub, yes, timing is hard to know on branching, but straight-line code is just like within cog/LUT RAM.
Yes, I can already see several features that will make compilers easier ... but also quite a few that I can't figure out how a compiler could ever possibly use!
That's often true. RISC instruction sets were designed to only include instructions that were of use to compilers but I guess we've moved way beyond that here. However, all of these fancy features will be great for PASM programmers.
The P2 documentation says "The lookup RAM must be read and written using RDLUT/WRLUT instructions."
But the documentation also says you can access LUT using RDLONG and WRLONG, if you also use SETQ - e.g. "Use SETQ2+RDLONG to read multiple hub longs into cog lookup RAM". But I can't quite see how the RDLONG knows to use lookup RAM and not register RAM. Also, if it can do this, can you also just use individual RDLONG instructions to read into LUT RAM - i.e. without using SETQ?
My next question is about LUT sharing. You can use SETLUTS to share (for example) the LUT RAM between cog 0 and cog 1. I just want to make sure I understand this. My reading is that before any LUT writes, the LUT RAM of cog 0 and cog 1 would both contain their respective original values for a specific LUT RAM location, which may be different. But if either one writes to a LUT RAM location, they will both thereafter read the same value from that location. What happens if both cogs write to the same LUT RAM location at the same time - which value ends up in LUT RAM?
The P2 documentation says "The lookup RAM must be read and written using RDLUT/WRLUT instructions."
But the documentation also says you can access LUT using RDLONG and WRLONG, if you also use SETQ - e.g. "Use SETQ2+RDLONG to read multiple hub longs into cog lookup RAM". But I can't quite see how the RDLONG knows to use lookup RAM and not register RAM. Also, if it can do this, can you also just use individual RDLONG instructions to read into LUT RAM - i.e. without using SETQ?
Short answer is no. SETQ2 + RD/WRLONG is special case.
Internally, SETQ2 must be setting a hidden flag that RD/WRLONG alone will check for. When those two instructions see this flag waving at them, they become a different, and "complex", instruction. Same story for SETQ with another flag.
My next question is about LUT sharing. You can use SETLUTS to share (for example) the LUT RAM between cog 0 and cog 1. I just want to make sure I understand this. My reading is that before any LUT writes, the LUT RAM of cog 0 and cog 1 would both contain their respective original values for a specific LUT RAM location, which may be different. But if either one writes to a LUT RAM location, they will both thereafter read the same value from that location. What happens if both cogs write to the same LUT RAM location at the same time - which value ends up in LUT RAM?
LUTSON (SETLUTS #1) is a one-way control: it allows writes to this cog's LUT RAM from the other cog's WRLUT instructions. But both cogs can issue it and get bidirectional sharing going. I don't know the answer ... and the behaviour won't be the same between the P2-ES and the final Prop2 hardware either, since there was a design flaw around this that got fixed.
Two simultaneous writes to the one location is really not something that has been considered. The fix was with respect to simultaneous read and write.
I wonder if it might happen that each cog's write might end up in the LUT of the other cog? That would at least be symmetrical and consistent!
SETQ2 is a special case for RDLONG. This is copying a block from HUB to LUT, or LUT to HUB with SETQ2 & WRLONG.
This is how you move HUB <--> LUT
RD/WRLUT is for moving between COG <--> LUT and it has no SETQ/SETQ2 equivalent.
The PTRA/B effects will also be valid for RD/WRLUT in the next silicon.
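The "hidden flag" idea can be sketched as a toy Python model. This is not how the silicon is built, just an illustration under stated assumptions: the class and method names are invented, hub memory is indexed in longs rather than bytes for simplicity, and the transfer length of count + 1 longs reflects my understanding of SETQ/SETQ2 block moves rather than anything stated in this thread.

```python
class ToyCog:
    """Toy model: SETQ/SETQ2 arm the next RDLONG as a block transfer."""

    def __init__(self, hub):
        self.hub = hub              # hub memory as a list of longs (assumption)
        self.regs = [0] * 512       # cog register RAM
        self.lut = [0] * 512        # cog lookup RAM
        self.pending = None         # the "hidden flag" plus its count

    def setq(self, count):
        self.pending = ("regs", count)

    def setq2(self, count):
        self.pending = ("lut", count)

    def rdlong(self, dest, hub_index):
        # Without a preceding SETQ/SETQ2, read one long into a register.
        target, count = self.pending or ("regs", 0)
        self.pending = None         # the flag only affects the next instruction
        memory = self.lut if target == "lut" else self.regs
        for i in range(count + 1):
            memory[dest + i] = self.hub[hub_index + i]

if __name__ == "__main__":
    cog = ToyCog(hub=list(range(100, 164)))
    cog.setq2(3)
    cog.rdlong(0, 8)                # block of 4 longs lands in LUT RAM
    cog.rdlong(5, 0)                # plain RDLONG lands in register RAM
    print(cog.lut[:4], cog.regs[5])
```

The model also shows why a lone RDLONG cannot target LUT RAM: with no flag pending, the destination is always a register.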
COG0 & COG1 LUTs may contain different values.
If COG0 has enabled sharing, then when COG1 writes to its LUT, that will be also written to COG0's LUT.
Similarly, if COG1 has enabled sharing, then when COG0 writes to its LUT, that will also be written to COG1's LUT.
There is currently a bug when reading from the same location as writing. This will be fixed in the next silicon. Currently you need to read it twice and compare.
I don't think there is, or will be, any contention protection when both cogs write to the same LUT address.
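The one-way nature of the sharing described above can be sketched in Python. The names here are hypothetical, and the model deliberately covers only writes that do not collide: what happens when both cogs write the same address on the same clock is left out, because nobody in the thread knows.

```python
class LutCog:
    """Toy model of one cog's LUT RAM with optional sharing."""

    def __init__(self):
        self.lut = [0] * 512
        self.accepts_partner_writes = False   # models SETLUTS #1 on this cog
        self.partner = None

    def wrlut(self, value, addr):
        self.lut[addr] = value
        # The write is mirrored only if the partner has enabled sharing.
        if self.partner is not None and self.partner.accepts_partner_writes:
            self.partner.lut[addr] = value

def make_pair():
    """Create two adjacent cogs wired as LUT-sharing partners."""
    a, b = LutCog(), LutCog()
    a.partner, b.partner = b, a
    return a, b

if __name__ == "__main__":
    cog0, cog1 = make_pair()
    cog0.accepts_partner_writes = True
    cog1.wrlut(0xAA, 10)     # mirrored into cog0's LUT
    cog0.wrlut(0xBB, 11)     # not mirrored: cog1 has not enabled sharing
    print(hex(cog0.lut[10]), hex(cog1.lut[11]))
```

Until a location is written with sharing enabled, the two LUTs keep their own, possibly different, values at that location.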
I'd prefer each cog's LUT to get its own data.
What happens if there are simultaneous writes to different locations - does or will that work? And can we say for certain yet what will work in the final hardware for different combinations of LUT sharing and simultaneous reading and writing?
What happens if there are simultaneous writes to different locations - does or will that work?
That one's no issue. The RAM is dual-ported so there is no conflict there.
And can we say for certain yet what will work in the final hardware for different combinations of LUT sharing and simultaneous reading and writing?
Yes, Chip has it implemented in the FPGA. I've tested it. The read data corruption no longer occurs.
It is a little difficult to prove, given the somewhat unknown nature of coordinating two cogs, but supposedly the reading cog receives the new data being written by the writing cog. This will be done with an extra mux circuit that copies the data around the RAM when the two accesses coincide.
Comments
I doubt there will be any P2 variant by Ross.
Ross.
Everyone's been wondering for years if you were ever returning.
If you can private-message me, I will see what we can do to get you a P2 Eval board.
Of course
Chip - any word on how OnSemi is going on the P&R of P2+ and clock gating etc ?
I wonder how David Betz knew???
DOCs are in the first post of this thread
http://forums.parallax.com/discussion/162298/prop2-fpga-files-updated-2-june-2018-final-version-32i/p1
and this thread covers the improvements/fixes in P2+
http://forums.parallax.com/discussion/169282/list-of-changes-in-next-p2-silicon/p1
The P2 is a much nicer target for compilers than P1. No more worrying about LMM, and no hassles about getting large constants into registers.
Cheers,
Eric