Catalina and the P2
Ale
I was wondering, what happened to the efforts to produce P2 code with Catalina?
Comments
I doubt there will be any P2 variant by Ross.
David Betz has recently informed me that the P2 is finally available, so if I can get my hands on one, there will be a P2 version of Catalina. However, I can't figure out how to order the evaluation board. The P2 discussion forums point you to the shop link for the "Propeller 2 ES Evaluation Board" but that page just says it is "unavailable". Does anyone know if it is actually available? Perhaps it is already sold out?
Thanks in advance!
Ross.
You need to talk with Chip, but there are about 100 P2 EV's in the wild, and some P2D2's as well.
I believe some damaged P2 EVs have been swapped for refurbished ones, so you might score a repaired return unit.
That is, most of the packaged P2 dies ended up on boards.
There is a pause right now while P2+ is routed and taped out; a progress report on that from OnSemi must be due any moment now.
The target for P2+ was/is to hit 250MHz and the lowest HDMI speeds. Current P2s top out on the bench, overclocked, at somewhere over 300MHz on good copper planes.
Ross.
Everyone's been wondering for years if you were ever returning.
If you can private-message me, I will see what we can do to get you a P2 Eval board.
I think everyone's hoping next run will be the finished product.
The errata list is quite small, and the code changes are also small. Some new hardware is coming in P2+, but that will not affect a compiler.
The biggest gotcha in P2-ES, I think, is the Verilog sign-extension oops that came from a difference between the FPGA and ASIC flows.
IIRC, I saw that the small opcode extensions in P2+ can be made backward binary compatible.
Of course
Chip - any word on how OnSemi is going on the P&R of P2+ and clock gating etc ?
I wonder how David Betz knew???
There are two pcbs...
The P2D2 board, which Peter Jakacki (Brisbane) made, contains one of the first 10 P2-ES chips, which were epoxied by OnSemi into the IC package.
The P2_EVAL board was made by Parallax, and 100 were built with correctly packaged P2-ES chips. A few more of these P2-ES chips were sent to Peter. IIRC there were about 110 of these P2-ES chips.
There are some minor problems with the P2-ES silicon, but IMHO we could have lived with them, although the sign-extension problem was a real one.
There are threads for all these titles.
Now, the P2 is real and we're expecting the next run back somewhere around May if my memory serves me correctly.
There are some really exciting instructions in the P2
Here are the real P2 hardware basics...
* 8 Cogs with 2KB Cog RAM and 2KB LUT RAM
* 512KB of HUB RAM dual-mapped into 1MB. The top 16KB is preloaded from internal serial ROM and can be write protected.
* Cog access to the hub is via the "egg-beater" (see below)
* 64 I/O, all with "smart-pins" (see below) and ADC
* inbuilt DACs
* 100pin QFP
* clock aimed at 160/180MHz but we've pushed this (with cooling) to over 340MHz
* code can execute from cog, lut and hub (hubexec)
* most instructions take 2 clocks
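To put the clock and "2 clocks per instruction" figures together, here is a rough back-of-envelope sketch. It only gives an upper bound per cog; the function name is made up for illustration, and anything that takes more than 2 clocks (hub access, branches in hubexec) will pull the real number down.

```python
def peak_mips_per_cog(clock_hz, clocks_per_instruction=2):
    """Upper bound on millions of instructions per second for one cog.

    Assumes every instruction takes clocks_per_instruction clocks, which the
    list above only claims for *most* instructions.
    """
    return clock_hz / clocks_per_instruction / 1e6


if __name__ == "__main__":
    # Clock figures taken from the list above (target clocks and the overclock).
    for mhz in (160, 180, 340):
        rate = peak_mips_per_cog(mhz * 1_000_000)
        print(f"{mhz} MHz -> at most {rate:.0f} MIPS per cog, {8 * rate:.0f} across 8 cogs")
```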
egg-beater
* the hub is divided into 8 x 64KB (16K x 32 bits) blocks of RAM
* consecutive long addresses fall in consecutive blocks, wrapping back to the first block after the last one
* On every clock, the cog has access to one of the hub ram blocks!!!
* on each clock the cog gains access to the next hub block
* so, after 8 clocks the cog may have read in 8 longs!!!
* there is a streamer to facilitate this fast access.
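A small model of the egg-beater description above may help. It is only a sketch under simplifying assumptions: the function names are mine, the block a cog can see is assumed to advance by one per clock, and all pipeline latency is ignored, so the clock counts are not real hardware timings.

```python
NUM_SLICES = 8        # one hub RAM block per cog, as in the list above
BYTES_PER_LONG = 4


def slice_of(byte_addr):
    """Hub RAM block holding the long that contains byte_addr."""
    return (byte_addr // BYTES_PER_LONG) % NUM_SLICES


def burst_read_clocks(start_addr, n_longs, slice_at_clock0=0):
    """Clocks to read n_longs consecutive longs in this simplified model.

    The cog waits until its window reaches the block holding the first long,
    then gets one long per clock because the next long is in the next block.
    """
    wait = (slice_of(start_addr) - slice_at_clock0) % NUM_SLICES
    return wait + n_longs


if __name__ == "__main__":
    # Eight consecutive longs land in eight different blocks.
    print([slice_of(0x100 + 4 * i) for i in range(8)])
    print(burst_read_clocks(0x100, 8))
```

The point of the interleaving is that a sequential burst never has to wait for the same block twice in a row.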
smart-pins
* each pin has a smart-pin state machine to do all sorts of things
And there is so much more.
Only down-side is that the P2 uses a lot of power. Hopefully the next P2 will not use so much if sections are not running.
Aim is for 200MHz
DOCs are in the first post of this thread
http://forums.parallax.com/discussion/162298/prop2-fpga-files-updated-2-june-2018-final-version-32i/p1
and this thread covers the improvements/fixes in P2+
http://forums.parallax.com/discussion/169282/list-of-changes-in-next-p2-silicon/p1
I found this in the P2 documentation ...
"The globally-accessible hub RAM can be read and written as bytes, words, and longs. Hub addresses are always byte-oriented. There are no special alignment rules for words and longs in hub RAM. Cogs can read and write bytes, words, and longs at any hub address, as well as execute instruction longs from any hub address starting at $400."
But then the description of the "egg beater" says this ...
"Hub RAM is comprised of 32-bit-wide single-port RAMs with byte-level write controls. For each cog, there is one of these RAMs, but it is multiplexed among all cogs. Let’s call these separate RAMs “slices”. Each RAM slice holds every single/2nd/4th/8th/16th (depending on number of cogs) set of 4 bytes in the composite hub RAM. At every clock, each cog can access the “next” RAM slice, allowing for continuously-ascending bidirectional streaming of 32 bits per clock between the composite hub RAM and each cog."
So, what happens if I read or write a long that is not aligned on a long boundary? Wouldn't this long contain bytes from different slices?
(signed)
Confused!
Yes, it adds one clock to the read/write time. The big feature of the egg-beater is it provides a fast burst like action available to consecutive addresses.
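To make the unaligned case concrete, here is a rough sketch that works out which slices an access touches. The helper names are hypothetical, and the "one extra clock when two slices are involved" rule is just the statement above turned into code; applying it to words that straddle a long boundary is my assumption.

```python
def slices_touched(byte_addr, size=4, num_slices=8):
    """Slices (in access order) holding the bytes byte_addr .. byte_addr+size-1."""
    first_long = byte_addr // 4
    last_long = (byte_addr + size - 1) // 4
    return [long_index % num_slices for long_index in range(first_long, last_long + 1)]


def extra_clocks(byte_addr, size=4, num_slices=8):
    """Extra clocks for an access, assuming one per additional slice touched."""
    return len(slices_touched(byte_addr, size, num_slices)) - 1


if __name__ == "__main__":
    for addr in (0x1000, 0x1002, 30):
        print(hex(addr), slices_touched(addr), extra_clocks(addr))
```

Since the second half of an unaligned long is always in the "next" slice, the egg-beater reaches it on the following clock, which fits the one-clock penalty.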
Ok, I guess that makes sense. But I can see that the timings of instructions on the P2 are going to be a headache!
Executing from hub, yes, timing is hard to know on branching, but straight-line code is just like within cog/LUT RAM.
The P2 is a much nicer target for compilers than P1. No more worrying about LMM, and no hassles about getting large constants into registers.
Cheers,
Eric
Yes, I can already see several features that will make compilers easier ... but also quite a few that I can't figure out how a compiler could ever possibly use!
The P2 documentation says "The lookup RAM must be read and written using RDLUT/WRLUT instructions."
But the documentation also says you can access LUT using RDLONG and WRLONG, if you also use SETQ - e.g. "Use SETQ2+RDLONG to read multiple hub longs into cog lookup RAM". But I can't quite see how the RDLONG knows to use lookup RAM and not register RAM. Also, if it can do this, can you also just use individual RDLONG instructions to read into LUT RAM - i.e. without using SETQ?
My next question is about LUT sharing. You can use SETLUTS to share (for example) the LUT RAM between cog 0 and cog 1. I just want to make sure I understand this. My reading is that before any LUT writes, the LUT RAM of cog 0 and cog 1 would both contain their respective original values for a specific LUT RAM location, which may be different. But if either one writes to a LUT RAM location, they will both thereafter read the same value from that location. What happens if both cogs write to the same LUT RAM location at the same time - which value ends up in LUT RAM?
Internally, SETQ2 must be setting a hidden flag that RD/WRLONG alone will check for. When those two instructions see this flag waving at them, they become a different, and "complex", instruction. Same story for SETQ with another flag.
LUTSON (SETLUTS #1) is a one-way control: it allows writes to this cog's LUT RAM from the other cog's WRLUT instructions. If both cogs issue it, writes go in both directions. As for simultaneous writes, I don't know the answer ... and it won't be the same behaviour between the P2ES and the final Prop2 hardware either, since there was a design flaw that got fixed around this.
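The "hidden flag" idea can be sketched as a toy model. This is a guess at the behaviour, not the real logic: the class and its fields are invented, the hub is indexed by long rather than by byte to keep it short, and the Q value is taken to mean "count minus one", which is my reading of the docs.

```python
class CogModel:
    """Toy cog: SETQ/SETQ2 arm a one-shot flag that changes the next RDLONG."""

    def __init__(self, hub_longs):
        self.hub = hub_longs       # list of longs, indexed by long (simplification)
        self.regs = [0] * 512      # 2KB cog register RAM
        self.lut = [0] * 512       # 2KB LUT RAM
        self._pending = None       # (target, q) armed by SETQ or SETQ2

    def setq(self, q):
        self._pending = ("regs", q)

    def setq2(self, q):
        self._pending = ("lut", q)

    def rdlong(self, dest, hub_index):
        # With no flag armed, RDLONG is a plain single read into register RAM.
        target, q = self._pending or ("regs", 0)
        self._pending = None       # the flag only affects the very next instruction
        memory = self.lut if target == "lut" else self.regs
        for i in range(q + 1):
            memory[dest + i] = self.hub[hub_index + i]


if __name__ == "__main__":
    cog = CogModel(list(range(100, 164)))
    cog.setq2(3)                   # arm: next RDLONG moves 4 longs into LUT
    cog.rdlong(0, 10)
    print(cog.lut[:4], cog.regs[:4])
```

In this model a lone RDLONG can never land in LUT RAM, because nothing has armed the flag.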
Two simultaneous writes to the one location is really not something that has been considered. The fix was with respect to simultaneous read and write.
I wonder if it might happen that each cog's write might end up in the LUT of the other cog? That would at least be symmetrical and consistent!
SETQ2 + RDLONG/WRLONG is how you move HUB <--> LUT.
RD/WRLUT is for moving between COG <--> LUT and it has no SETQ/SETQ2 equivalent.
The PTRA/B effects will also be valid for RD/WRLUT in the next silicon.
COG0 & COG1 LUTs may contain different values.
If COG0 has enabled sharing, then when COG1 writes to its LUT, that will be also written to COG0's LUT.
Similarly, if COG1 has enabled sharing, then when COG0 writes to its LUT, that will also be written to COG1's LUT.
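The one-way nature of the sharing described above can be modelled in a few lines. This is only a sketch: the class and method names are invented, writes happen one at a time, and simultaneous writes to the same address are deliberately not modelled because their outcome is not established in this thread.

```python
class CogLut:
    """Toy model of one cog's LUT RAM with SETLUTS-style sharing."""

    def __init__(self, size=512):
        self.ram = [0] * size
        self.accepts_partner_writes = False   # what SETLUTS #1 turns on in this model
        self.partner = None

    def setluts(self, enable):
        self.accepts_partner_writes = bool(enable)

    def wrlut(self, addr, value):
        # A write always lands in the writer's own LUT ...
        self.ram[addr] = value
        # ... and is mirrored only if the *other* cog has enabled sharing.
        if self.partner is not None and self.partner.accepts_partner_writes:
            self.partner.ram[addr] = value

    def rdlut(self, addr):
        return self.ram[addr]


def pair_cogs(a, b):
    """Link two adjacent cogs (e.g. cog 0 and cog 1) as sharing partners."""
    a.partner = b
    b.partner = a


if __name__ == "__main__":
    cog0, cog1 = CogLut(), CogLut()
    pair_cogs(cog0, cog1)
    cog0.setluts(True)        # cog 0 accepts writes from cog 1
    cog1.wrlut(3, 99)         # lands in both LUTs
    cog0.wrlut(4, 7)          # lands only in cog 0's LUT
    print(cog0.rdlut(3), cog1.rdlut(3), cog0.rdlut(4), cog1.rdlut(4))
```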
There is currently a bug when reading from the same location as writing. This will be fixed in the next silicon. Currently you need to read it twice and compare.
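The "read it twice and compare" workaround might look something like this. It is a sketch only: `read_lut` is a stand-in for whatever actually performs the read, and the retry limit is an arbitrary choice of mine.

```python
def read_lut_stable(read_lut, addr, max_tries=8):
    """Re-read addr until two consecutive reads agree, then return that value.

    read_lut is a hypothetical callable taking an address and returning a long.
    """
    previous = read_lut(addr)
    for _ in range(max_tries):
        current = read_lut(addr)
        if current == previous:
            return current
        previous = current
    raise RuntimeError("LUT read did not settle")


if __name__ == "__main__":
    # Fake reader: the first read is corrupted, later reads are good.
    samples = iter([0xBAD, 5, 5])
    print(read_lut_stable(lambda addr: next(samples), 0))
```

This relies on a corrupted read being unlikely to repeat with the same wrong value, which seems reasonable but is not something I can confirm.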
I don't think there is, or will be, any contention protection when both cogs write to the same LUT address.
I'd prefer each cog's LUT to get its own data.
What happens if there are simultaneous writes to different locations - does or will that work? And can we say for certain yet what will work in the final hardware for different combinations of LUT sharing and simultaneous reading and writing?
Yes, Chip has it implemented in the FPGA. I've tested it. The read data corruption no longer occurs.
It is a little difficult to prove, given how hard it is to coordinate two cogs precisely, but supposedly the reading cog receives the new data being written by the writing cog. This will be done with an extra mux circuit that bypasses the RAM when the two accesses coincide.