What would be a good idea for a new CPU and platform to try on a P2? - Page 3 — Parallax Forums



Comments

  • PurpleGirl Posts: 107
    edited 2022-11-15 21:32

    @msrobots said:
    I tried to write an emulator for the IBM 360-370 ++ processor.

    Very nice instruction set, still supported by the new z/OS series of mainframes.

    But sadly it is not byte-oriented but nibble-oriented, and I had no clue how to handle that.

    There is an open-source project called 'HERCULES' that emulates a mainframe, along with all the software needed to run MVS and all the goodies.

    I wish I had time to go back to that.

    Mike

    I'm not sure if this will work for the P2, but you could probably do an inline interpretation style. Move the opcode nibble to the upper half of a byte and use it to index a jump list with the handler code interspersed, so each opcode gets 16 bytes of room. The last P2 instruction of each handler returns to the main loop. You can adjust the shift to fit the number of P2 instructions you need per emulated instruction: shift the opcode left by only 3 bits for 8-byte entries if you need less room, or by 5 bits for 32-byte entries if you need more. The right size depends on how much the handler size varies per opcode. If only a few handlers run out of space, handle what you can inside the table entry, use the last instruction of the entry to jump elsewhere, take what space you need there, and then return to the main handler.

    I hope this makes sense. It just maps 16 opcodes, at 16 bytes each, onto 256 bytes of P2 address space. So if you shift the opcode left by 4, OR that with (or add it to) the starting address (assuming the table starts on a page boundary), and then jump there, I think that should work. (Adjust for any errors I made in terms of word size. P2 instructions are 32 bits wide, so maybe only 4 of them would fit per entry in the table.)

    That is one way to do it. If you would rather have just a list of addresses and treat it like a trampoline, you could do that too: use the list to jump to the handlers, and return to the main loop from each. You just need a way to convert your opcodes to real addresses.
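
    The address arithmetic for both variants can be sketched in Python (not PASM2, just to check the numbers). This is a minimal model under stated assumptions: byte addressing, a table aligned to its own size so that OR and ADD agree, and made-up names (`handler_address`, `dispatch`, `op_inc`, and so on) that do not come from any P2 toolchain.

```python
P2_INSTRUCTION_BYTES = 4  # P2 instructions are 32-bit longs


def slot_offset_from_byte(byte):
    """With the opcode nibble in the upper half of a byte, masking it gives
    the byte offset of a 16-byte slot directly (same as nibble << 4)."""
    return byte & 0xF0


def handler_address(base, opcode, shift=4):
    """Address of the inline handler slot for a 4-bit opcode.

    shift=3 gives 8-byte slots, shift=4 gives 16, shift=5 gives 32.
    """
    if not 0 <= opcode <= 0xF:
        raise ValueError("opcode must be a 4-bit value")
    table_bytes = 16 << shift  # 16 slots of (1 << shift) bytes each
    if base % table_bytes:
        # OR only equals ADD when the table is aligned to its own size.
        raise ValueError("table base must be aligned to the table size")
    return base | (opcode << shift)


def instructions_per_slot(shift=4):
    """How many 32-bit P2 instructions fit in one slot."""
    return (1 << shift) // P2_INSTRUCTION_BYTES


def dispatch(trampoline, opcode, state):
    """Trampoline variant: look up the handler, call it, and return to the
    caller, which plays the role of the main loop."""
    return trampoline[opcode](state)


# Hypothetical handlers, only here to exercise the trampoline.
def op_nop(state):
    return state


def op_inc(state):
    return state + 1


trampoline = [op_nop] * 16
trampoline[0x1] = op_inc
```

    With a table at the hypothetical base 0x400, `handler_address(0x400, 0x3)` lands on 0x430, and `instructions_per_slot()` confirms the guess of 4 P2 instructions per 16-byte entry. If the table lived in cog RAM instead, addresses would count longs rather than bytes, so the shift would need adjusting.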

  • Would the UCSD p-System make sense?
    Some information is available: pascal.hansotten.com/ucsd-p-system/more-on-p-code/
    BASIC compiler, Pascal compiler, Fortran compiler, editor, overlays. Runs in 64K.

  • Again, I'm just going for a classic but new ISA that could have been done back in the day. Maybe something like a 16-bit 6502, with a 16-bit word size.

  • pik33 Posts: 1,818

    I thought about a 65032, 32-bit 6502.

    Like a normal 6502, but with 32-bit registers.

    This thing needs additional segment registers, as in x86, to retain compatibility. It boots as a normal 6502, with the stack segment register set to $00000100, CS and DS at 0, and 8-bit-wide A, X, and Y registers.

    The KIL (HALT, STOP, HCF) instruction slots can be used to switch the thing to 32-bit mode and to reload segments. So you load the code segment with a pointer to a 32-bit code block, set the stack segment to something other than $100, and then switch the mode to 32-bit. There are enough KIL opcodes available to do this.

    While in 32-bit mode, nothing changes in the instruction set; the instructions are just 32-bit now.

    Having CS, DS and SS segment registers makes it possible to run several virtual 6502s:

    • in 8-bit mode, reloading segments does nothing until switched to 32-bit
    • in 32-bit mode, reloading segments works

    So you can load CS, DS and SS with something, then go to 8-bit mode, and the CPU starts to run as an 8-bit CPU with memory addressed through the new segment values.

    Nothing more than an idea for now, and I have an instruction matrix for it somewhere.
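
    The segment idea can be modelled in a few lines of Python. This is only one reading of the post, with several assumptions: the physical address is the segment plus the offset, 8-bit mode keeps the 6502's 16-bit addresses and 8-bit stack pointer, and segment loads in 8-bit mode are simply ignored. The class and method names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


class Cpu65032:
    """Toy model of the segment registers only; no instructions are executed."""

    def __init__(self):
        # Boot state from the post: CS and DS at 0, SS at $00000100, 8-bit mode.
        self.cs = 0x00000000
        self.ds = 0x00000000
        self.ss = 0x00000100
        self.sp = 0xFF
        self.wide = False

    def set_mode(self, wide):
        """Stand-in for the mode switch done through a spare KIL opcode."""
        self.wide = wide

    def load_segments(self, cs, ds, ss):
        """Reload the segments; assumed to be ignored while in 8-bit mode."""
        if not self.wide:
            return False
        self.cs, self.ds, self.ss = cs & MASK32, ds & MASK32, ss & MASK32
        return True

    def _offset(self, value, narrow_mask):
        # 32-bit mode uses the full offset; 8-bit mode truncates it.
        return value & (MASK32 if self.wide else narrow_mask)

    def code_address(self, pc):
        return (self.cs + self._offset(pc, 0xFFFF)) & MASK32

    def data_address(self, addr):
        return (self.ds + self._offset(addr, 0xFFFF)) & MASK32

    def stack_address(self):
        return (self.ss + self._offset(self.sp, 0xFF)) & MASK32
```

    At boot, `stack_address()` gives $000001FF, the same place a stock 6502 puts the top of its stack, which is why SS starts at $100. After switching to 32-bit mode, loading new segments, and switching back, the same 16-bit addresses land in a different 64K window, which is the "several virtual 6502s" effect.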

  • @PurpleGirl said:
    Again, I'm just going for a classic but new ISA that could have been done back in the day. Maybe something like a 16-bit 6502, with a 16-bit word size.

    Well, UCSD p-code was a 16-bit ISA that was done back in the day in hardware, as the Western Digital WD-90. I think it is at least worth studying, because it is a processor tailored for compilers. It has rather powerful instructions. What was microcode then can be cog code on the P2.

  • 4x5n Posts: 744

    Has anyone tried a 6809? Then there is also the 6309. No one has gotten the timing right on the 6309.

  • @4x5n said:
    Has anyone tried a 6809? Then there is also the 6309. No one has gotten the timing right on the 6309.

    Yes, see the links in the first post. As long as you don't want a lot of complicated peripheral hardware (MMU!), the emulator written in C for both processors can be modified to run faster than the original. https://forums.parallax.com/discussion/174794/towards-os9-operating-system-on-p2/p1

  • @pik33 said:
    I thought about a 65032, 32-bit 6502.

    Like a normal 6502, but with 32-bit registers.

    I like this idea a lot!!

  • msrobots Posts: 3,614

    How about a P1-PASM-emulator for P2?

    just asking for a friend...

    Mike

  • What might be nice is a custom opcode set that is mostly routines. Memory is the bottleneck, so a "compression" scheme with low overhead can make up for a lot of that. In a sense, you sacrifice a cog to minimize memory usage: keep a runtime library in the P2 ROM and have an ISA that calls into it and also has the most common opcodes. Maybe paired instructions too.

  • Other ideas I've seen include a 6502/Z80 chimera, or an '816 that implements only its native mode and does not honor instructions to change it. One that might be fun to modify is the TMS9900, though I haven't been able to find a good opcode list for it yet.

  • The TMS9900, with its register block in RAM, would need a lot of hub accesses if you use some old software. For the P2, 16-bit CPUs like the MC68K or LSI-11 might fit better, I think?

  • PurpleGirl Posts: 107
    edited 2022-12-09 06:26

    @"Christof Eb." said:
    The TMS9900, with its register block in RAM, would need a lot of hub accesses if you use some old software. For the P2, 16-bit CPUs like the MC68K or LSI-11 might fit better, I think?

    That is where one should optimize by putting the first 256 or 512 bytes in cog RAM, so that actual registers cover the area where the TI-99/4A put an SRAM. That makes things roughly how it was done on real hardware: the TI-99/4A used SRAM for the workspace RAM and DRAM for the rest of the system. So treat cog RAM as SRAM and hub RAM as DRAM (with wait states and all).

    But with what info I could gather about that ISA, I don't think I'd base a design on it, for other reasons. The ISA seems a bit bulky in how it is laid out, and it is big-endian with a bit numbering that looks backward within each word: bit 0 is the most significant bit and bit 15 is the lowest bit.

    Let's see if I remember how the instructions are laid out. There is a 3-bit opcode with a 1-bit flag to select whether it is a byte or word instruction. Then comes a 2-bit destination addressing-mode descriptor, then the 4-bit destination register, then a 2-bit source mode descriptor, and then the 4-bit source register. That is for instructions with 2 register arguments, I think. And the first 3 bits (logically speaking; physically the highest 3) change the template depending on the type of instruction.
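
    The field layout recalled above can be pulled out of a 16-bit word with shifts and masks. The sketch below takes the field order and widths straight from the description (not checked here against a TMS9900 data manual) and uses conventional bit numbering, where bit 15 is the most significant. The names `Format1` and `decode_two_operand` are made up for illustration.

```python
from typing import NamedTuple


class Format1(NamedTuple):
    opcode: int     # 3 bits, the physically highest bits of the word
    byte_op: int    # 1 bit: 1 = byte instruction, 0 = word instruction
    dst_mode: int   # 2 bits: destination addressing mode
    dst: int        # 4 bits: destination register
    src_mode: int   # 2 bits: source addressing mode
    src: int        # 4 bits: source register


def decode_two_operand(word):
    """Split a 16-bit instruction word into the six fields listed above."""
    if not 0 <= word <= 0xFFFF:
        raise ValueError("instruction word must fit in 16 bits")
    return Format1(
        opcode=(word >> 13) & 0x7,
        byte_op=(word >> 12) & 0x1,
        dst_mode=(word >> 10) & 0x3,
        dst=(word >> 6) & 0xF,
        src_mode=(word >> 4) & 0x3,
        src=word & 0xF,
    )


# Example: 0xC081 is 110 0 00 0010 00 0001 in binary.
fields = decode_two_operand(0xC081)
```

    The widths add up to 3 + 1 + 2 + 4 + 2 + 4 = 16 bits, so the template fills the word exactly. An emulator on the P2 would use the opcode field to pick a handler and the two mode fields to pick the operand-fetch routines.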

    Another CPU to try would be the 65CE02. One can mod the 6502 core here to add the C and CE instructions (and extra registers). An advantage of starting from the 6502 core is that you won't need to constrain the cycles as much, because the CE changed the pipelining so that there are no errant reads. The 6502 was optimized for 2-cycle instructions. So when it hits what should be a 1-cycle instruction, it loads the next byte (an opcode), decides it doesn't need that byte as an immediate operand, discards it, and reads that exact same byte again. The 65CE02 catches this, forwards the errant "operand" to the decoder as an instruction, and then fetches the rest of the possibly partial instruction. For us, that might mean simplifying the 6502 core and removing any artificial timing constraints on single-byte instructions. That alone can make things run about 25% faster at the same clock rate, and coding around the added instructions can get closer to 40% more performance at the same clock rate.

  • pik33 Posts: 1,818

    Another idea: a virtual machine/instruction set optimized for PSRAM based code execution.

  • PurpleGirl Posts: 107
    edited 2022-12-09 12:42

    @pik33 said:
    Another idea: a virtual machine/instruction set optimized for PSRAM based code execution.

    Yes, for sure. I mean, my initial thoughts were to use some sort of "compression" in a sense, like complex opcodes, where you sacrifice a cog to get around even the hub bottleneck. I don't think a given cog will get more than, say, the real-world equivalent of about 20 MHz out of the hub.

    I'm not sure if actual compression such as LZW or Deflate would be of use or justify the resources vs. throughput. In that case, you'd likely need virtualization and a cache, since you'd need to read entire blocks of data, each with its own dictionary.

    So maybe, an approach for a faster emulation would be to use 2-3 cogs for a single core, though I'd hate to see the pipeline penalties on branches. I mean, since LUTs can be shared by pairs of cogs, you could have one doing all the memory work and one doing pure processing, with the LUT being used as a cache, with room left for registers. And having the PC in the LUT RAM may be good so that a caching/virtualization cog can alter it if it needs to, as long as you can live with the constraint of relocatable code.
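
    One way to picture the LUT-as-cache part is a direct-mapped cache, sketched here in Python. The post does not say how the cache would be organized, so the direct mapping, the 256-entry size (half of the LUT's 512 longs, leaving the rest for registers), the class name, and the `backing` callback standing in for the memory cog are all assumptions.

```python
class DirectMappedCache:
    """Each address maps to exactly one line; a miss refills that line."""

    def __init__(self, backing, lines=256):
        self.backing = backing        # callable: address -> value (slow path)
        self.lines = lines
        self.tags = [None] * lines    # which address block each line holds
        self.data = [0] * lines
        self.hits = 0
        self.misses = 0

    def read(self, addr):
        index = addr % self.lines     # line selected by the low address bits
        tag = addr // self.lines      # remaining bits identify the block
        if self.tags[index] == tag:
            self.hits += 1
        else:
            # Miss: in the two-cog picture, the memory cog would do this fetch.
            self.misses += 1
            self.tags[index] = tag
            self.data[index] = self.backing(addr)
        return self.data[index]


# Hypothetical slow memory: the value at an address is three times the address.
cache = DirectMappedCache(backing=lambda addr: addr * 3)
first = cache.read(5)      # miss, fetched from backing store
second = cache.read(5)     # hit, served from the cache
third = cache.read(5 + 256)  # same line, different tag: evicts address 5
```

    The hit and miss counters make the branch-penalty worry easy to measure in a model: a taken branch to an address that maps outside the cached lines shows up as a miss, and that is when the processing cog would stall waiting for the memory cog.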

    For my own design, I'd want to use 20 address lines, 16 data lines, and up to 5 control lines. That adds up to 40-41 GPIO lines, which would be costly. But with more board logic, maybe using the /CE line on the board and the RAM, you could stop processing and do other things with those lines if you have to.
