Wish it was SERDES. I begged, and begged, and begged.
I know you did. I just didn't have time to venture into what it would take to make a universal serializer/deserializer with all kinds of modulation schemes, bit-stuffing, and so on. Attacking that problem is as hard as video, it seems. Next time, we will incorporate a SERDES, though.
Ah, the power of HDLs... You can retain all the IP you've already generated and just add new stuff when your brain's not so fried and you're not so under-the-gun. So glad to hear there will be a next time!
I don't know if it helps to point out that SER is still an order of magnitude faster than the fastest Transputer link. It's fabulous to have a machine-supported interprop channel.
I'm still not seeing it. How is TASKSW skipped over the first time?
It's not skipped over. Sorry about that.
What happens, though, is that TASKSW (1st time) saves the PC into pc+0, then loads the PC from pc+1, which points to thread, which would have been the next instruction anyway. Then the loop keeps executing for each task in a round-robin fashion.
OK! That's what I thought would happen. Now, in the original example, would that mean that there would have been five TASKSW calls before the first iteration through the loop, the first four being a jump back to loop (the TASKSW statement itself) with a PC+1 adjustment to each of the pc[] entries?
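The PC bookkeeping described above can be modeled in a few lines. This is only a toy sketch of the round-robin behavior as explained in this thread — the names pc and TASKSW follow the posts, and no timing or hardware detail is modeled:

```python
# Toy model of the TASKSW round-robin behavior described above.
# Only the PC shuffle is shown; clocks and hardware details are not modeled.

def tasksw(pc, cur, my_pc):
    """Save the switching task's PC into pc[cur], load the next task's PC."""
    pc[cur] = my_pc
    nxt = (cur + 1) % len(pc)
    return nxt, pc[nxt]

# Three tasks, each parked at its own (hypothetical) loop address.
pc = [100, 200, 300]
cur, ip = 0, 100
trace = []
for _ in range(6):
    cur, ip = tasksw(pc, cur, ip)
    trace.append(ip)

print(trace)  # round-robin: 200, 300, 100, 200, 300, 100
```

Each call saves the caller's PC into its own slot and resumes the next slot, which is exactly the "saves into pc+0, loads from pc+1" pattern Chip describes.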
A question came up at the conference yesterday regarding how Prop I assembler code will be portable to the Prop II.
Seems for most Prop I PASM only a few small changes will be required to get it running. That is, before one starts looking at optimizing with the new Prop II features.
Is there a porting guide anywhere that points out the changes to look for?
Most of the P1 stuff we do today works just fine. The workhorse instructions are pretty much unchanged. You do have to think about the pipeline more because it has more stages, and sometimes things need to be done at a particular stage, or you need to use a delayed-effect instruction to avoid NOP instructions.
Getting at pins is more complicated, and the pins are smart now and not anywhere close to emulated now either. Gotta have the real chip to really grok the pins. I'm waiting on that personally, not connecting much to the FPGA at the moment.
Video works much differently, though I think it's generally better, faster, easier, etc...
IMHO, a P1 port might go fairly easily if it's not touching too much, but then again, most things in PASM are touching the hardware to get speed or enable capability.
The new instructions change things. We can use REP on lots of loops where before we would use a register with DJNZ or CMP.
Most things need a rewrite, and they will be smaller and more capable when done.
Anything with math is going to be huge! The P2 has most things people need in hardware, so whole routines go away, replaced by a few instructions.
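For a concrete sense of "whole routines go away": the P1 has no hardware multiply, so an unsigned multiply is a software shift-add loop. Here is that loop sketched in Python — illustrative only, not the exact P1 routine:

```python
# P1-style unsigned 16x16 multiply as a shift-add loop -- the kind of
# routine that (per this thread) collapses to a single P2 instruction.

def mul_shift_add(a, b, bits=16):
    acc = 0
    for _ in range(bits):
        if b & 1:        # low bit of multiplier set: add the shifted multiplicand
            acc += a
        a <<= 1          # shift multiplicand left
        b >>= 1          # consume one multiplier bit
    return acc

print(mul_shift_add(123, 456))  # 56088
```

Sixteen iterations of test, add, and shift, versus one hardware instruction — that is where the big emulator speedups come from.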
Thing is, I've got this Z80 emulator which has been patiently waiting for the Prop II because it will no longer need the complexity of external RAM there. Then there is PullMoll's emulator. They are both going to want moving to the Prop II.
In my case the emulator engine does nothing with hardware and will probably port easily. No doubt it can be optimized with P2 instructions once it is up and running.
Actually I'm planning on migrating all the Spin parts to C, where I guess the equivalent of FSRW and such already exists.
Then there is the FFT. Again, no hardware involvement, but certainly the complex multiplications can be turbocharged somewhat. Some pointers there would be helpful.
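For reference, the hot spot in an FFT butterfly is the complex multiply: four real multiplies and two adds, or three multiplies with a Karatsuba-style rearrangement. Those real multiplies are exactly what hardware math instructions would accelerate. A sketch of both forms:

```python
# Complex multiply (ar + j*ai) * (br + j*bi), the inner work of an
# FFT butterfly. Two equivalent formulations:

def cmul(ar, ai, br, bi):
    """Four real multiplies, two adds."""
    return ar * br - ai * bi, ar * bi + ai * br

def cmul3(ar, ai, br, bi):
    """Three real multiplies (Karatsuba-style rearrangement)."""
    k1 = br * (ar + ai)
    k2 = ar * (bi - br)
    k3 = ai * (br + bi)
    return k1 - k3, k1 + k2

print(cmul(3, 4, 5, 6))   # (-9, 38)
print(cmul3(3, 4, 5, 6))  # (-9, 38)
```

Whether the three-multiply form wins depends on the relative cost of multiplies versus adds on the target, so it is worth benchmarking both once real P2 cycle counts are known.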
Just hope I can find the time to get into all of this.
Doug touched on the various points well. Based on what I've seen, a direct port will usually be possible because most of the instructions from the Prop 1 function identically in the Prop2.
Off the top of my head, here are some of the details:
* Pins are not addressed as INA and OUTA anymore; the register is now called PINx, where x is A, B, C, or D
* Pins are configured to do different things (smart pins, as Doug alluded to) and must be set up beforehand
* Self-modifying code must account for the change being effective on the 3rd clock after the instruction (see line 958 in doc TXT)
* Memory read instructions take 3..10 clock cycles instead of 8..23
More semantic changes abound, such as the counters supporting many more modes.
If you are actually reimplementing code from the Prop 1 for the Prop 2, there are many changes you can take advantage of, such as:
* REPS for simple immediate loops
* REPD for variable iteration loops
* PTRx for HUB memory transfer
* INDx for COG memory transfer
* Using the CLUT as a lookup table, stack, buffer, or cache, in addition to its color-lookup role
* Rewriting code to take advantage of P2-specific instructions, like the math, bit-manipulation, and data-manipulation instructions.
* Cooperative multi-tasking with TASKSW or temporal multi-threading via SETTASK/JMPTASK
An important item to note is looping. On the P1, looping is typically done with DJNZ; the P2 has this same construct, and you must actually use it if you plan to perform any branch instructions within a loop. The REPx instructions will not work if a branch occurs within the loop; they are only intended for compact local code loops.
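To make the DJNZ contrast concrete, here is its decrement-and-branch semantics modeled in Python. This is a toy interpretation of the idea, not P2 behavior:

```python
# DJNZ semantics in miniature: decrement a counter register and branch
# back to the loop top while it is non-zero. A REPx-style hardware
# repeat (not modeled here) avoids this per-iteration decrement/branch,
# which is why it only suits straight-line bodies with no branches.

def run_djnz_loop(count):
    regs = {"cnt": count}
    body_runs = 0
    while True:
        body_runs += 1           # ...loop body instructions...
        regs["cnt"] -= 1         # DJNZ cnt, #loop
        if regs["cnt"] == 0:
            break                # fall through when the counter hits zero
    return body_runs

print(run_djnz_loop(5))  # 5
```

The trade-off described above falls out of this: DJNZ spends an instruction per iteration but tolerates any control flow inside the body, while REPx trades that flexibility for compactness.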
Those are some of the specific differences, but by no means is it a complete list.
Thanks Pedward. That is certainly enough to inspire me to get started. Looks like there is a lot of new instructions that will speed up the emulation beyond a simple port as well.
Now that the emulator has the chance to run CP/M without external RAM, it seems the first available Prop II boards will come with stonking great SDRAMs attached. :)
heater: It will be much simpler than you think to convert ZiCog. There are only a few places where any changes are necessary.
I just tried porting my P2 debugger back to P1 and it is quite simple.
The getcnt, setp, etc. instructions are the main issues. But you will have to be careful with self-modifying code due to the pipeline delays.
Of course, these simple changes will not use the new features of the P2. But who cares, because it will run at least 8 times faster even without them.
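The self-modifying-code hazard mentioned above can be illustrated with a toy prefetching interpreter: a write to an instruction slot that has already been fetched does not affect the in-flight copy, so the old instruction runs one more time. All details here (two instructions in flight, the opcode format) are illustrative assumptions, not the real P2 pipeline:

```python
# Toy pipeline: two instructions are always fetched ahead of execution.
# A "patch" op writes a new instruction into memory; if the target slot
# was already fetched, the OLD copy still executes once.

def run(mem, steps):
    out = []
    fetched = [mem[0], mem[1]]              # two instructions in flight
    pc = 2
    for _ in range(steps):
        op = fetched.pop(0)                 # execute oldest fetched instruction
        fetched.append(mem[pc % len(mem)])  # fetch the next one (wraps around)
        pc += 1
        if op[0] == "emit":
            out.append(op[1])
        elif op[0] == "patch":              # self-modifying write: mem[addr] = new
            mem[op[1]] = op[2]
    return out

# Instruction 0 patches instruction 1, but slot 1 was already fetched,
# so the old value executes once before the new one does on the next pass.
mem = [("patch", 1, ("emit", "new")), ("emit", "old"), ("nop",), ("nop",)]
print(run(mem, 6))  # -> ['old', 'new']
```

The fix on real hardware is the same as in this model: put enough distance (or the documented number of clocks) between the modifying write and the modified instruction.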
Documentation has slowed down quite a bit lately. As discussed earlier, a big factor is the performance issues we've been encountering with Google Docs. I have been discussing the idea of "moving" the documentation to wikispaces, where I believe it can be more flexibly implemented (among other things). However, the Propeller conference has me second-guessing this approach. There is clearly new, official documentation (web-based) forthcoming. At the same time, there was mention of working with the community documentation efforts, but not much in the way of details.
So where does that leave us? Continue updating GDocs for the moment? Go ahead with migrating the documentation to wikispaces? Rework the community documentation to complement the official documentation? Something else entirely?
Here is an updated visual decoded instruction set for the P2. The spin file also contains the conditional assembly parameters.
Added are the PTRA/PTRB for rd/wr-byte/word/long/quad-c instructions and INDA/INDB for cog addresses (as shown below).
My info shows the second one is a spare opcode. I checked the latest pdf too.
Are you looking at the latest info or where are you seeing the second coginit?
It is in the list at the end of the GDoc version. This looks like it was a verbatim copy from an earlier post by Chip in this thread. But it's not mentioned in any of the details sections.
I believe I read somewhere that the bus between the hub and the cogs was 128 bits wide, which would make sense for the xxQUAD operations to be able to atomically latch the data. However, why does the RDQUAD require the additional (two?) clock cycles before they can be read? Is this only when the quads are mapped to registers? I'm guessing not, as the RDxxxxC operations will also block for those same two clock cycles (when the cache is dirty).
(note: This is only partially a clarification question. I'm also just curious about the internal workings of that mechanism.)
Yes, the hub to cog bus is 128 bits wide.
From the info I summarised into an Excel spreadsheet, all reads (byte/word/long) take an extra 2 clocks (i.e. 3..10 clocks). The cached versions vary depending on whether the data is in the cache or not. If it is in the cache, the read is a single cycle; otherwise there appears to be 3..10 clocks. Therefore, I presume there is some setup (2 clocks) required when interfacing the reads to the hub.
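That timing summary can be sketched as a toy model: one cached 16-byte (quad) line, 1-clock hits, and a hub-window wait on a miss. The window formula below is an assumption chosen to land in the 3..10 range quoted above, not the documented hub sync rule:

```python
# Toy model of cached hub reads: hit in the single quad (16-byte)
# cache line = 1 clock; miss = wait for this cog's hub window.
# hub_wait() is an illustrative formula, not the real sync rule.

def hub_wait(clock):
    """Clocks until the next hub window (one window every 8 clocks)."""
    return 3 + (-clock) % 8            # always somewhere in 3..10

class CachedReader:
    def __init__(self):
        self.line_addr = None          # which quad is currently cached

    def rdbytec(self, addr, clock):
        quad = addr >> 4               # 16-byte-aligned cache line
        if quad == self.line_addr:
            return 1                   # cache hit: single cycle
        self.line_addr = quad          # miss: fetch the whole quad from hub
        return hub_wait(clock)

r = CachedReader()
print(r.rdbytec(0x100, clock=0))   # miss: pays the hub window wait
print(r.rdbytec(0x105, clock=20))  # hit: same quad, 1 clock
print(r.rdbytec(0x200, clock=21))  # miss again: new quad
```

This also shows why sequential byte reads through RDBYTEC are cheap: fifteen of every sixteen land in the cached quad.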
Comments
Thanks Chip.
OK! That's what I thought would happen. Now, in the original example, would that mean that there would have been five TASKSW calls before the first iteration through the loop, the first four being a jump back to loop (the TASKSW statement itself) with a PC+1 adjustment to each of the pc[] entries?
That's right.
Added are the PTRA/PTRB for rd/wr-byte/word/long/quad-c instructions and INDA/INDB for cog addresses (as shown below). P2_Instructions.spin
Edit: err. That would obviously be a question for the source field only.
What is the second version for?
It is in the list at the end of the GDoc version. This looks like it was a verbatim copy from an earlier post by Chip in this thread. But it's not mentioned in any of the details sections.
Where is that list (the original post)?
It was the preliminary features PDF that was posted, I don't have a link, just the PDF.