@msrobots said:
I tried to write an emulator for the IBM 360/370++ processor.
Very nice instruction set, still supported by the current mainframes running z/OS.
But sadly it is not byte-oriented but nibble-oriented, and I had no clue how to handle that.
There is an open-source project called Hercules that emulates a mainframe, and all the software needed to run MVS and all the goodies is available.
I wish I had time to go back to that.
Mike
I'm not sure if this will work for the P2, but you could probably do an inline interpretation style. For instance, move the nibble to the upper half of the byte and then address your jump list with interspersed code. So you have 16 bytes of room for each instruction. So just address the jumplist by the upper nibble. Just let the last P2 instruction return to the main loop handler. And you can adjust where you move the jump address to fit the number of instructions you need per instruction. If you need fewer, only shift the opcode 3 bits in to have 8, and if you need more shift it 5 in. And it depends on the variability in the size of the handler per opcode. If you run out of space for just a few instructions, then handle what you can in the opcode conversion table space and in the last instruction in the allotted entry, you jump elsewhere, take what space you need, and then return to the main handler.
I hope this makes sense. It is just mapping 16 opcodes, at 16 bytes each, onto a 256-byte P2 address space. So if you shift the opcode left by 4, OR that with (or add it to) the starting address (assuming that it's at the start of a page boundary), and then jump there, I think that should work. (And adjust for any errors I made in terms of word size. Maybe only 4 P2 instructions would fit per entry in the table.)
That is one way to do it. If you only want a list of addresses and treat it like a trampoline, you could do that too, and then use those to jump to the handlers, and return back to the main loop from each. Just find a way to convert your opcodes to real addresses.
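To make the address arithmetic concrete, here is a rough Python sketch of both styles (inline slots and the trampoline list). It is not P2 assembly, and every name in it (`handler_address`, `slot_capacity`, the toy `op_*` handlers) is made up for illustration. The one hard number is that a P2 instruction is a 32-bit long, which is why a 16-byte slot only holds 4 of them.

```python
# Sketch of the two dispatch styles described above (not P2 assembly).
# All names are illustrative assumptions, not an existing API.

P2_INSTRUCTION_BYTES = 4  # one P2 instruction is a 32-bit long


def handler_address(table_base, nibble, shift=4):
    """Address of the inline handler slot for a 4-bit opcode.

    shift=4 gives 16-byte slots, shift=3 gives 8, shift=5 gives 32.
    table_base is assumed to be aligned to the whole table, so OR and
    ADD produce the same address.
    """
    return table_base | ((nibble & 0xF) << shift)


def slot_capacity(shift=4):
    """Number of P2 instructions that fit in a slot of 2**shift bytes."""
    return (1 << shift) // P2_INSTRUCTION_BYTES


# Trampoline style: a list of handlers indexed by the opcode nibble.
def op_nop(state, operand):
    pass


def op_load(state, operand):
    state["acc"] = operand


def op_add(state, operand):
    state["acc"] = (state["acc"] + operand) & 0xF


HANDLERS = [op_nop] * 16
HANDLERS[0x1] = op_load
HANDLERS[0x2] = op_add


def run(program):
    """Each byte holds an opcode nibble (high) and an operand nibble (low)."""
    state = {"acc": 0}
    for byte in program:
        # Every handler returns here, i.e. to the main loop.
        HANDLERS[(byte >> 4) & 0xF](state, byte & 0xF)
    return state["acc"]
```

With a table at $100, opcode 3 lands at `handler_address(0x100, 3)`, which is $130. A handler that needs more than `slot_capacity()` instructions would do what it can in its slot and jump elsewhere with its last instruction.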
Again, I'm just going for a classic but new ISA that could have been done in the day. Like maybe something like a 16-bit 6502, with a 16-bit word size.
A 32-bit 6502 along these lines needs additional segment registers, as in x86, to retain compatibility. It boots as a normal 6502, with the stack segment register set to $00000100, CS and DS at 0, and 8-bit-wide A, X, Y registers.
The KIL (HALT, STOP, HCF) instruction slots can be used to switch the thing to 32-bit mode and to reload segments. So you load the code segment with a pointer to a 32-bit code block, set the stack segment to something other than $100, then switch the mode to 32-bit. There are enough KIL opcodes available to do this.
While in 32-bit mode, nothing changes in the instruction set... only the instructions are 32-bit now.
Having CS, DS and SS segment registers makes it possible to run several virtual 6502s:
in 8-bit mode, reloading segments does nothing until switched to 32-bit
in 32-bit mode, reloading segments works
so you can load CS, DS and SS with something, then go to 8-bit, and the CPU starts to run as 8-bit with the memory addressed with the new segment values.
Nothing more than an idea now... and I have an instruction matrix for it somewhere.
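A toy Python model of those mode and segment rules might look like the sketch below. The class and method names are hypothetical, and it makes two assumptions the post does not spell out: a segment is a plain base address added to the offset (no x86-style shifting), and a segment reload issued in 8-bit mode is simply ignored.

```python
class Cpu65032:
    """Toy model of the segment/mode idea above (all names hypothetical)."""

    MASK32 = 0xFFFFFFFF

    def __init__(self):
        self.wide = False        # False = 8-bit mode, True = 32-bit mode
        self.cs = 0x00000000     # boot values as given in the post
        self.ds = 0x00000000
        self.ss = 0x00000100

    def set_mode(self, wide):
        self.wide = bool(wide)

    def load_segments(self, cs=None, ds=None, ss=None):
        """Reload segments; assumed to be a no-op while in 8-bit mode."""
        if not self.wide:
            return False
        if cs is not None:
            self.cs = cs & self.MASK32
        if ds is not None:
            self.ds = ds & self.MASK32
        if ss is not None:
            self.ss = ss & self.MASK32
        return True

    def data_address(self, offset):
        """Physical address of a data access (assumed: base + offset)."""
        limit = self.MASK32 if self.wide else 0xFFFF
        return (self.ds + (offset & limit)) & self.MASK32

    def stack_address(self, sp):
        """Physical address of a stack access.

        In 8-bit mode S is 8 bits wide, so SS=$100 gives the usual
        6502 stack page at $0100-$01FF.
        """
        limit = self.MASK32 if self.wide else 0xFF
        return (self.ss + (sp & limit)) & self.MASK32
```

Switching to 32-bit mode, loading DS with $20000 and SS with $20100, and dropping back to 8-bit mode gives a second virtual 6502 whose zero page starts at $20000 and whose stack page starts at $20100.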
@PurpleGirl said:
Again, I'm just going for a classic but new ISA that could have been done in the day. Like maybe something like a 16-bit 6502, with a 16-bit word size.
Well, UCSD p-code was a 16-bit ISA that was done in the day, in hardware as the Western Digital WD-90. I think it is at least worth studying, because it is a processor tailored for compilers. It has rather powerful instructions. What was microcode then can be cog code on P2.
What might be nice is a custom opcode set that is mostly routines. Memory is the bottleneck, so if you can have a "compression" that has low overhead, you can make up for a lot of that. So in a sense, sacrifice a cog to minimize memory usage. So in a sense, keep a runtime library in the P2 ROM and have an ISA that calls those and has the most common opcodes. Maybe paired instructions too.
Other ideas I've seen have been to do stuff like do a 6502/Z80 chimera or to do an '816, but only its native mode, and not honor instructions to change it. One that might be fun to modify would be the TMS9900. I haven't been able to find a good opcode list for it yet.
The TMS9900, with its register block in RAM, would need a lot of hub accesses if you use some old software. For P2, MC68k or LSI-11 as 16-bit CPUs might fit better, I think?
@"Christof Eb." said:
The TMS9900, with its register block in RAM, would need a lot of hub accesses if you use some old software. For P2, MC68k or LSI-11 as 16-bit CPUs might fit better, I think?
That is where one should optimize it by putting the first 256 or 512 bytes in cog RAM. So use actual registers for the area where the TI-99/4A put an SRAM. So make things roughly how it was done on real hardware. The TI-99/4A used SRAM for the workspace RAM and DRAM for the rest of the system. So treat cog RAM as SRAM and hub RAM as DRAM (with wait states and all).
But with what info I could gather about that ISA, I don't think I'd make one based on that for other reasons. The ISA seems a bit bulky in how it is laid out, and it uses a weird Big Endian memory map where everything is arranged backward for each word. Bit 15 is the lowest bit.
Let's see if I remember how the instructions are laid out. There are 3-bit opcodes with a 1-bit flag to select whether it is a byte or word instruction. Then a 2-bit destination memory mode descriptor, then the 4-bit destination address, then a 2-bit source mode descriptor, and then the 4-bit source. That is for 2 register arguments, I think. And the first 3 bits (logically speaking, physically the highest 3) changed the template depending on the type of instruction.
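The field layout recalled above can be written down as a small decoder. This is only a sketch of that layout as described in the post; the names `DualOperand` and `decode_dual_operand` are made up, and the bit positions use the usual "bit 0 is the lowest bit" numbering rather than TI's reversed one.

```python
from typing import NamedTuple


class DualOperand(NamedTuple):
    opcode: int   # 3 bits, the physically highest bits of the word
    byte_op: int  # 1 bit: byte (1) or word (0) instruction
    td: int       # 2-bit destination mode
    d: int        # 4-bit destination
    ts: int       # 2-bit source mode
    s: int        # 4-bit source


def decode_dual_operand(word):
    """Split a 16-bit word into the fields listed in the post."""
    word &= 0xFFFF
    return DualOperand(
        opcode=(word >> 13) & 0x7,
        byte_op=(word >> 12) & 0x1,
        td=(word >> 10) & 0x3,
        d=(word >> 6) & 0xF,
        ts=(word >> 4) & 0x3,
        s=word & 0xF,
    )
```

For instance, `decode_dual_operand(0xC081)` splits into opcode 6, word-sized, destination register 2 and source register 1, both in mode 0.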
Another CPU to try would be the 65CE02. One can mod the 6502 core here to add the C and CE instructions (and extra registers). And an advantage to the 6502 core here is that you won't need to constrain the cycles as much. The CE changed the pipelining to where there are no errant reads. The 6502 was optimized for 2-cycle instructions. So if it hits what should be a 1-cycle instruction, it loads the next byte (an opcode), decides it doesn't need that as an immediate operand, discards it, and reads that exact same byte again. The 65CE02 catches this, forwards the errant "operand" to the decoder as an instruction, and then fetches the rest of the possibly partial instruction. So for us, that might mean simplifying the 6502 core and removing any artificial timing constraints on single-byte instructions. That alone can make things operate about 25% faster at the same clock rate. And of course, coding around the added instructions can get things closer to 40% more performance at the same clock rate.
@pik33 said:
Another idea: a virtual machine/instruction set optimized for PSRAM based code execution.
Yes, for sure. I mean, my initial thoughts were to use some sort of "compression" in a sense, like complex opcodes to where you are sacrificing a cog to get around even the hub bottleneck. I don't think a given cog will get more than, say, the real-world equivalent of about 20 MHz out of the hub.
I'm not sure if actual compression such as LZW or Deflate would be of use or justify the resources vs. throughput. In that case, you'd likely need virtualization and a cache, since you'd need to read entire blocks of data, each with its own dictionary.
So maybe, an approach for a faster emulation would be to use 2-3 cogs for a single core, though I'd hate to see the pipeline penalties on branches. I mean, since LUTs can be shared by pairs of cogs, you could have one doing all the memory work and one doing pure processing, with the LUT being used as a cache, with room left for registers. And having the PC in the LUT RAM may be good so that a caching/virtualization cog can alter it if it needs to, as long as you can live with the constraint of relocatable code.
For my own design, I'd want to use 20 address lines, 16 data lines, and up to 5 control lines. So 40-41 GPIO lines would be costly. But, with more board logic, maybe using the /CE line on the board and the RAM, you could stop processing and do other things with those lines if you have to.
RISC-V with 32 MB RAM. @ersmith has already written a RISC-V emulator. It uses hub RAM only. This emulator is a just-in-time compiler, so it performs well on loops. But that makes it a little harder than simply replacing hub access with external RAM access. EDIT: forgot about interrupts. I think Linux needs an interrupt, which is not supported by riscvp2.
Well, my intention here is for a new CPU type altogether. I want something that could have been made in the Home Computing era but wasn't. I was not intending to look for a meme CPU or things commonly done in FPGA cores already. So RISC-V, MicroBlaze (or OpenFire, its open-source clone), etc., are all out for me. The idea was to create something new on par with older CPUs that is more lightweight than the P2, not as complex or more so.
As for Doom, I don't remember what I had when I first ran it, but it was like 4 MB of memory or something under MS-DOS. I mean the original Doom. Doom 3 required much stronger hardware. That essentially had a script interpreter running underneath that extracted game resources from game resource files into memory and then read a human-readable script and interpreted it. It seems that they should have at least tokenized it to increase speed and decrease memory usage.
I don't get all these hobby projects that use Linux. Hell, if you want to go lightweight, go with your own build of CP/M or DOS. Or don't use an OS at all beyond the ROMs. You could do a lot of stuff with just BASIC back in the day. You know, format disks/tapes, etc. I remember how I sometimes sped up BASIC programs on the Atari 800. If I needed extra math speed, such as during loading, I'd disable the ANTIC chip. The screen would go black, but that turned off most of the DMA and interrupt usage (refresh still occurred, but no other DMA).
I am wondering something. Would it be good to have a few 4-bit opcodes with 12-bit operands? I ask that because I want to use word memory and word addresses as the memory model. I wouldn't want to have only that as you can only have 16 opcodes that way. It would be good to leave 1 or more slots as tunnels to longer opcodes. If you do things the Suite-16 way (somewhat similar to Sweet 16, an emulated ISA), you'd have only 31 opcodes. If a word is always read, then for instructions that have only the opcode and no operands, I guess some instructions could go in what would otherwise be the operand field. And 8-bit opcodes with 8-bit operands would be good, as would 24-bit operands. I don't know if I'd want to do 16 in such a scheme since 2 opcodes of the 8/8 format would likely be the same length of an instruction that uses the 16/16 format, and an 8/16 format would mess up the alignment.
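One way to picture the "tunnel" idea is the Python sketch below. The encoding is purely an assumption for illustration: primary slot $F is reserved as the escape, and an escaped word carries a 4-bit sub-opcode plus an 8-bit operand. The names `ESCAPE`, `decode` and `opcode_count` are hypothetical.

```python
ESCAPE = 0xF  # assumed: the one primary slot reserved as a tunnel


def decode(word):
    """Decode a 16-bit instruction word.

    Short form: 4-bit opcode + 12-bit operand.
    Long form (assumed): escape nibble + 4-bit sub-opcode + 8-bit operand.
    """
    word &= 0xFFFF
    primary = word >> 12
    if primary != ESCAPE:
        return ("short", primary, word & 0x0FFF)
    return ("long", (word >> 8) & 0xF, word & 0x00FF)


def opcode_count(escape_slots=1, sub_bits=4):
    """Total opcodes when some primary slots tunnel to sub-opcodes."""
    return (16 - escape_slots) + escape_slots * (1 << sub_bits)
```

With a single escape slot and 4-bit sub-opcodes, `opcode_count()` comes out at 15 + 16 = 31, which happens to match the 31 mentioned above, though I can't say that is how Suite-16 actually gets there. Reserving two slots would give `opcode_count(2)` = 46, at the cost of one more 12-bit-operand opcode.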
A few years ago I gave lectures on the basics of FPGA programming. I wanted to show students how to design a simple processor. The result was a simple CPU with this instruction set. It has only an A register, a stack pointer, Z and C flags, and an 8-bit address space, with separate I/O and memory. That's all. One of the simplest CPUs possible: about 380 LEs used.
Not so much for emulation by P2 but for real hardware:
Perhaps it would be fun to build something like the Gigatron, as mentioned in the other thread (https://gigatron.io/media/Gigatron-manual.pdf), but with dual-ported RAM (https://renesas.com/us/en/document/dst/7028-data-sheet?r=13320) to have a Harvard/von Neumann architecture. It would be able to use the simple Harvard hardware concept of the Gigatron but still be able to edit or load the program.
On a physical Gigatron, I have an idea to make one run at about 75 MHz with discrete chips, but it is quite advanced. I likely won't do it as the chips would be hard to get and would be hard to mount. That is, redesign the base machine from the ground up to use a 4-stage pipeline. Have Fetch, Decode, Access, and Execute. Now, you'd have to keep the stages under 15 ns each. I wouldn't know how to make a fast decoder unit or a fast ALU. Sure, you can use transparent latches to make an ALU and have it be 100 MHz compatible. And you could probably replace the control unit with a fast GAL or CPLD. But if nothing else, if you don't mind using loads of memory chips, you can make your own control unit and ALU in LUT form in ROMs, and have a circuit on boot to copy from slow ROMs to fast SRAMs, and use SRAMs in every stage. Put the core ROM in an SRAM, the CU in an SRAM, the user memory in an SRAM, and put the ALU in an SRAM. Just the pipeline changes would require a new core ROM. The standard one won't work due to adding more delay slots. And going that far, you might as well add new instructions. Like if you use a 16-bit wide ROM/SRAM, you could add 8/8/16 multiplication and 8/8/8+Mod division in a single cycle. Plus you could throw in a scrambled list of numbers, for RND, etc. And of course, change the control matrix (which with memory can be done arbitrarily) to where you change out all the unused instructions and add more useful ones. On the Ac=Ac+Ac instruction, you could make it more useful by giving it a parameter (the operand area in the ROM is unused for it anyway). So you could add a range to that instruction, and even a shift direction. Plus, to deal with the 75 MHz (or faster) and still use a 6.25 MHz pixel clock, you'd want a way to keep both the vCPU and video threads active together, and the easiest way would be to add about 3 more registers. So have at least 2 new memory indexes.
Thus in your video loop, you have the port command and then 11 instructions for vCPU, another pixel, etc. So you structure the code to divide the video activity by 12.
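The divide-by-12 figure falls straight out of the two clocks. A small Python check, with hypothetical helper names and clocks passed in kHz so the arithmetic stays in integers:

```python
def cycle_time_ns(clock_mhz):
    """Length of one clock cycle in nanoseconds."""
    return 1000.0 / clock_mhz


def slots_per_pixel(cpu_khz, pixel_khz):
    """Instruction slots between two pixel outputs.

    Raises ValueError if the CPU clock is not a whole multiple of the
    pixel clock, since the video loop needs a fixed slot count.
    """
    slots, remainder = divmod(cpu_khz, pixel_khz)
    if remainder:
        raise ValueError("CPU clock must be a multiple of the pixel clock")
    return slots


def vcpu_slots(cpu_khz, pixel_khz, video_slots=1):
    """Slots left for vCPU work after the port (pixel) instruction."""
    return slots_per_pixel(cpu_khz, pixel_khz) - video_slots
```

`slots_per_pixel(75_000, 6_250)` is 12, so one port instruction leaves `vcpu_slots(75_000, 6_250)` = 11 for the vCPU. One thing the numbers also show: at exactly 75 MHz a cycle is about 13.3 ns, so each pipeline stage would have to settle a bit faster than the 15 ns quoted above (15 ns corresponds to roughly 66.7 MHz).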
As for dual-ported RAM (which is deprecated), you don't need that or even bus-mastering DMA for the Gigatron. Instead, there are expander boards you can build/order to be able to increase the RAM, reduce the amount of bit-banging, add SD cards, etc. And if you really need DMA, I know of a way to add something close enough. The Gigatron expanders use the really odd instructions (the ones that assert both memory lines) to talk to external cards plugged into the SRAM socket. The address lines send the commands. Now, if you can do that, you could send an instruction to do something that requires bus-mastering, and immediately put the ROM into a spinlock that looks for a known value in RAM at a specific spot. That would only help if you move the bit-banged peripherals out into hardware.
I'm still interested in 16-bit since, at some point, I may want to use external word memory. And the TMS9900 doesn't interest me, and a 286 would be too slow to do in a P2. So I'd want to make my own 16-bit ISA for it, and I could use help in deciding what other board features to provide. Obviously, it would need to support SD or even CF cards. I think I'd prefer CF cards since they are mostly IDE, and thus you have a wider pipe. And then I don't know what I'd do for the sound, but I'd want better sound than on the Gigatron and most legacy machines for that matter. The PC had a 1-bit port for the sound, but at least you had an unused system timer channel, so at least you had square waves. Bit-banging could be done, but not consistently (due to the IRQ and DMA usage of everything else, so you'd have variation, though, for brief amounts of time, you could disable interrupts). Other machines had 3-4 sound channels (including the C64, TI-99/4A, and the Tandy 1000). And I haven't worked out what video capabilities to have. As for a keyboard, a PS/2 would be fine, and I'd likely want to port the Spin code for that to P2ASM.
I know that you are not interested in Forth, but have you had a look at the RTX2000? https://users.ece.cmu.edu/~koopman/stack_computers/sec4_5.html
It seems interesting for P2 in my opinion, because
A) the 2 stacks have 256 elements each and will fit into LUT+COG, helping to keep HUB access low. And the instruction set is so small that I think it should fit into COG+LUT.
C) Being 16-bit, it can still use 1 MB of RAM.
D) Perhaps it is possible to find some software for it. At least there is for the Novix NC4016.
I don't really understand that one.
I think I'd still want to make my own ISA, and unlike most 16-bit machines, I'd like to try making it word addressable. Most that used word memory were byte addressable. And maybe try to force alignment too.
Now, there is something I need to work out. What is a good way to add a character set to a P2? I mean, the P1 had a built-in character set, and the P2 gives you LUTs instead. If you do an 8x8 set, that is 8 bytes per character, and if you want high-ASCII too, that is 2K. Yes, you can stuff a LUT with that, but if you do so, you don't have it for other purposes. Wait, I just worked it out. I've been thinking. Okay, so what if you want to use 2 cogs to do the video? Like having one for loading/caching, and one for the actual display. So the one loads as fast as it can, gated by the amount the other has displayed. Once the buffer is saturated (up to 6 lines @320 pixels width), it only starts loading new lines when old ones are used. So maybe have a counter that runs in the porch time that updates a value in LUT memory. So once saturated, only start replacing lines as the lines are completely used. Emulating a 240 height means using scan lines twice. As for memory usage, I guess the other 2K LUT could be for a character map that's initially stored in the hub. But then, what about sprites? Too bad that 3 cogs cannot share LUTs. But with caching the framebuffer, I guess 2 cogs could do it, though cache aliasing isn't usually good practice, and there could be the cache for the display and the cache for the sprites and/or display list manager. However, simpler is probably better. So if one could just use the hub for the framebuffer but use the LUTs of 2 cogs for sprites and a character set, but that depends on timing.
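The sizes mentioned above can be checked with a few lines of Python. This is only a sizing sketch with made-up helper names; the one fixed number is that a P2 LUT is 512 longs, i.e. 2 KB. The font layout (8 consecutive bytes per character, one byte per pixel row) is an assumption.

```python
LUT_BYTES = 512 * 4  # one P2 LUT: 512 longs of 4 bytes


def font_bytes(chars=256, rows=8):
    """Size of an 8-pixel-wide font: one byte per row per character."""
    return chars * rows


def glyph_row(font, char_code, row):
    """Fetch one pixel row of a character (assumed 8 bytes per glyph)."""
    return font[(char_code & 0xFF) * 8 + (row & 7)]


def lines_in_lut(width_px=320, bits_per_pixel=8):
    """How many whole scan lines fit in one LUT used as a line cache."""
    bytes_per_line = width_px * bits_per_pixel // 8
    return LUT_BYTES // bytes_per_line
```

So `font_bytes()` is 2048, exactly one LUT, and `lines_in_lut()` is 6, which matches the "up to 6 lines @320 pixels" figure: one LUT for the line cache and the paired cog's LUT for the character set.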
So, can a P2 do 320x240, 256 colors (and maybe 640x480, 4 colors), with a palette of up to 512 colors (to still give pure greys), a text mode, and also hardware sprites? And what about a mixed mode where you can do graphics and a text mode simultaneously? Or even split-screen mode where you have a text box or 2 and a graphics box?
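On the memory side at least, those modes are small. A quick Python sizing sketch (helper names are hypothetical; the 512 KB hub size is the P2's):

```python
HUB_BYTES = 512 * 1024  # P2 hub RAM


def bits_per_pixel(colors):
    """Bits needed to index the given number of colors."""
    return (colors - 1).bit_length()


def framebuffer_bytes(width, height, colors):
    """Packed framebuffer size for an indexed-color mode."""
    return width * height * bits_per_pixel(colors) // 8


def grey_levels(palette_colors):
    """Pure greys available when the palette bits split evenly over R, G, B."""
    bits_per_channel = bits_per_pixel(palette_colors) // 3
    return 1 << bits_per_channel
```

Both 320x240 at 256 colors and 640x480 at 4 colors come to 76,800 bytes, about 15% of the hub. A 512-color palette is 9 bits, so 3 bits per channel and 8 pure grey levels. Whether the timing works out with sprites on top is a separate question that this does not answer.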
Part of me wants to do a "Gigatron" remake from the ground up where things can run vCPU code, but use a P2 ROM and not a Gigatron native ROM. But that will be a huge job to pull off. What I'd like to see with that is maybe more of a hybrid design. Like maybe have a compatible BASIC in ROM, but have it written in native P2 code. So BASIC gets a "coprocessor" and is executed outside the "Gigatron" memory range. So have a vCPU cog, a video/sound cog (which may or may not have keyboard input), a supervisor cog, a keyboard cog if it isn't covered in any other place, and a file I/O cog. Then the P2 code would have to make sure what is exposed to programs is the same as what the Gigatron and its ROM do. One of the reasons for having a supervisor or management engine cog is to make up for the tasks that emulating vCPU directly without emulating the Gigatron wouldn't provide. For instance, it would be what boots first, tests/initializes the memory, displays a starting menu, and provides the program loader. Plus it should provide hotkeys, and if you want frequent random numbers in RAM, that could be the cog that does that. The vCPU does have a syscall mechanism, so you can also ask it for random numbers on the fly. And for compatibility, give it both the RAM approach and the syscall approach. Having a dedicated cog for the vCPU would be good in that it would simplify the emulation and reduce overhead. There would be no "dispatcher" in the code since you'd have a single thread only and no software multitasking. So without a context switcher, you can do more efficient emulation. Now, that could make things too fast, so the way to handle that is to have the vCPU run only during video porches. If one could make such a framework, they could probably experiment with compatibility with other systems. Like, put the 6502 core in it, etc.
Yes, the Forwardcom thing sounds interesting, and even if you don't want all the features mentioned, those are some good principles.
Something I likely wouldn't want to try would be a way to use older CPUs in a somewhat parallel fashion to get wider data. For instance, what if you had a front end for 2 6502s that calls both CPUs to do single tasks? So have 16-bit instructions using both CPUs in parallel. But one bottleneck would be dealing with carries/borrows. And then, if you could do that, it would be more efficient to design an entire CPU. To do addition, you'd have to propagate the carry so the high byte gets incremented, and thus you consume an extra cycle or whatever. So the lower CPU couldn't continue until that happens in the upper one.
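The carry bottleneck is easy to see if you write the 16-bit add as two 8-bit adds. A minimal Python sketch, with hypothetical function names, standing in for the two CPUs:

```python
def add8(a, b, carry_in=0):
    """8-bit add with carry in; returns (result, carry_out)."""
    total = (a & 0xFF) + (b & 0xFF) + (carry_in & 1)
    return total & 0xFF, total >> 8


def add16_split(x, y):
    """16-bit add done as two 8-bit halves, low half first.

    The high half cannot produce its final result until the low half's
    carry is known, which is the serialization described above.
    """
    lo, carry = add8(x & 0xFF, y & 0xFF)
    hi, carry_out = add8((x >> 8) & 0xFF, (y >> 8) & 0xFF, carry)
    return (hi << 8) | lo, carry_out
```

`add16_split(0x00FF, 0x0001)` only gives $0100 because the carry from the low byte reaches the high byte; run the halves truly in parallel and the upper CPU would have produced $00.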
A 32-bit 6502 has been developed already. I have a preliminary datasheet at work dated somewhere in the 90s; I do not remember right now if it is from Rockwell or WDC.
@PurpleGirl said:
I am wondering something. Would it be good to have a few 4-bit opcodes with 12-bit operands? I ask that because I want to use word memory and word addresses as the memory model. I wouldn't want to have only that as you can only have 16 opcodes that way. It would be good to leave 1 or more slots as tunnels to longer opcodes. If you do things the Suite-16 way (somewhat similar to Sweet 16, an emulated ISA), you'd have only 31 opcodes. If a word is always read, then for instructions that have only the opcode and no operands, I guess some instructions could go in what would otherwise be the operand field. And 8-bit opcodes with 8-bit operands would be good, as would 24-bit operands. I don't know if I'd want to do 16 in such a scheme since 2 opcodes of the 8/8 format would likely be the same length of an instruction that uses the 16/16 format, and an 8/16 format would mess up the alignment.
With a clean slate, you can do anything you want !
The main drawback with a roll-your-own, is that the software side becomes a ship-load of work, and is likely to never be completed.
That's why emulations that hook into a large software base, make more practical sense.
For a current-era clean slate MCU, one opcode area that appeals is a variable length skip.
These days with streaming memory, and QPI flash, and XIP, you are better to skip rather than jump-forward, for short if/else conditionals.
Smarter HW or an emulation could take a short-jump opcode, and invisibly decide when to re-issue an address-and-fetch, and when to just skip in the stream.
And 8-bit opcodes with 8-bit operands would be good, as would 24-bit operands. I don't know if I'd want to do 16 in such a scheme since 2 opcodes of the 8/8 format would likely be the same length of an instruction that uses the 16/16 format, and an 8/16 format would mess up the alignment.
The 8051 has 8-bit opcodes and 8-bit operands, and it has 1-, 2-, and 3-byte instructions.
The newest 8051s actually fetch 24(+) bits, and decode the complete long instruction in the same time as a 1-byte one.
That's possible, with on chip flash.
@Ale said:
A 32-bit 6502 has been developed already. I have a preliminary datasheet at work dated somewhere in the 90s; I do not remember right now if it is from Rockwell or WDC.
WDC never got any further than that preliminary datasheet, which was really just a slightly marked up 65C816 datasheet.
@jmg said:
...
That's why emulations that hook into a large software base, make more practical sense.
I have been thinking a lot about this. I do agree very much with this sentence and I think there are some other criteria if we speak about doing the emulation on P2.
Software + good Documentation must be available.
Speed: The emulator must be at least as fast as the original to be fun. I think that, at all times, optimisation stopped when speed was "just good enough". Later, bloatware was created for better looks. Due to the HUB data access bandwidth, we are looking for a machine in the area of something like < 10 MHz.
RAM: We do have 512k and are limited to this amount. I do know that people are experimenting with external RAM, but I do not see this as "emulated RAM" because its bandwidth for random access will be rather slow. It might be useful for video. (I am reluctant to buy P2 hardware with external RAM, because if a project needs it, there are far better and cheaper solutions available...) So I think we should simply accept the limit of 512k. This includes the emulator, ROM and RAM. This means that we are looking for an operating system that is designed to run with <=256 kBytes of RAM. And we are looking for hardware/a processor that can handle >=512k. This means that we are looking for a 16/32-bit machine. A pure 8-bit machine needs an MMU, and we will lose a lot of the power of P2.
Multitasking: The special strength of P2 is that it has got 8 cores. It would be good if the operating system opened up the possibility of using the cores in parallel for user tasks. If we dedicate one for video, one for communication, and one for sound, we still have something like 5 cores for user tasks. Actually, I think this criterion has rather high priority, because if we are not able to use the cores, it would be better to use an MCU that has better memory bandwidth.
Emulator source available: Well, it might be easier to port an emulator than to invent a new one....
(Was not able to include a better table)
Here I try to give 0...3 points for the criteria:
| System | Software+Docu available | Speed | Designed for <256k | Multitasking | Emulator source available |
|---|---|---|---|---|---|
| a) MC6809+OS9 | 2 | 1 | 3 | 3 | 3 |
| b) MC68k+OS9k | 0 | 1? | 1? | 3 | 2? |
| c) MC68k+CP/M | 2 (not too much) | 2? | 2? | 0 | 2? |
| d) MC68008+QDOS | 2? | 2? | 3? | 3 | 1? |
| e) LSI11/23+RT11 | 1?? | 1? | 3? | 1 | ? |
| f) i8088+MSDOS | 2? | 1? | 3? | 0 | ? |
| g) Z80+CP/M | 3? | 3? | 3 | 0 | 2? |
| h) UCSD-p | 1? | 3? | 3 | 0 | 1? |
| ((( i) FORTH ))) | 1 | 3 | 3 | 3 | 3 |
Well, MC6809+OS9=coco3 is still quite an interesting match.
Funny last line. The virtual FORTH processor of TAQOZ fits really well. :-) Just not too much software for it? Perhaps I should try to port the C compiler of SWIEROS for it?
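Adding up the points makes the comparison easier to read. In this Python sketch the scores are copied from the list above with the question marks dropped, and a bare "?" is stored as `None` and counted as 0, so the totals for e) and f) are lower bounds. The names `SCORES`, `total` and `ranking` are just for illustration.

```python
# Columns: software+docu, speed, designed for <256k, multitasking, emulator source
SCORES = {
    "MC6809+OS9":    [2, 1, 3, 3, 3],
    "MC68k+OS9k":    [0, 1, 1, 3, 2],
    "MC68k+CP/M":    [2, 2, 2, 0, 2],
    "MC68008+QDOS":  [2, 2, 3, 3, 1],
    "LSI11/23+RT11": [1, 1, 3, 1, None],  # emulator source unknown
    "i8088+MSDOS":   [2, 1, 3, 0, None],  # emulator source unknown
    "Z80+CP/M":      [3, 3, 3, 0, 2],
    "UCSD-p":        [1, 3, 3, 0, 1],
    "FORTH":         [1, 3, 3, 3, 3],
}


def total(points):
    """Sum of the points, treating unknown entries as 0."""
    return sum(p for p in points if p is not None)


def ranking():
    """Systems sorted by total, highest first."""
    return sorted(SCORES.items(), key=lambda item: total(item[1]), reverse=True)


if __name__ == "__main__":
    for name, points in ranking():
        print(f"{total(points):2d}  {name}")
```

FORTH totals 13 and MC6809+OS9 totals 12, with MC68008+QDOS and Z80+CP/M both at 11, which lines up with the remarks about the CoCo 3 and the "funny last line".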
At this very moment, I think an emulator of the Sinclair QL with QDOS would also fit rather well on P2. The MC68008 was slow enough to be emulated. To be fast, software was written in assembler. It was designed for 128 KB of RAM. It has multitasking. As it was the first of a new family, things seem to be straightforward, with no quirks for compatibility with a predecessor. No MMU. Just 2 bitmapped graphics modes, which need a moderate 32k. I have not yet found an emulator source for it. Ada has written an emulator for the MCU.
With a clean slate, you can do anything you want !
The main drawback with a roll-your-own, is that the software side becomes a ship-load of work, and is likely to never be completed.
That's why emulations that hook into a large software base, make more practical sense.
That doesn't bother me. Once I come up with the specification, others can improve on it and write more code.
For a current-era clean slate MCU, one opcode area that appeals is a variable length skip.
These days with streaming memory, and QPI flash, and XIP, you are better to skip rather than jump-forward, for short if/else conditionals.
Smarter HW or an emulation could take a short-jump opcode, and invisibly decide when to re-issue an address-and-fetch, and when to just skip in the stream.
Or, just settle for simple predication. For single instructions, they make more sense than jumps. Like "Move if Equal," "Move if Not Equal," etc.
The 16-bit instructions that waste an extra opcode byte could have other uses such as using the extra byte to add more registers and/or modes.
Comments
Would UCSD-P System make sense?
Some information is available: pascal.hansotten.com/ucsd-p-system/more-on-p-code/
Basic-compiler, Pascal-Compiler, Fortran-compiler, Editor. Overlays. Runs with 64k.
Again, I'm just going for a classic but new ISA that could have been done in the day. Like maybe something like a 16-bit 6502, with a 16-bit word size.
I thought about a 65032, 32-bit 6502.
As a normal 6502, with 32-bit registers.
This thing needs additional segment registers, as in x86, to retain compatibility. Boots as a normal 6502, with stack segment register set at $00000100, CS and DS at 0, and 8 bit wide A,X,Y registers.
A KIL, (HALT, STOP HCF) instruction slots can be used to switch the thing to 32-bit mode and reload segments. So you load a code segment with the pointer to 32-bit code block, stack segment to something other than $100, then switch the mode to 32 bit. There is enough KIL instructions available to do this.
While in 32-bit mode, nothing changes in the instruction set... only they are 32 bit now.
Having CS, DS and SS segment registers makes it possible to run several virtual 6502s:
so you can load CS, DS and SS with something, then go to 8 bit and the CPU starts to run as 8-bit with the memory addressed with the new segment values.
Nothing more than an idea now... and I have an instruction matrix for it somewhere.
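A small Python sketch of how the segment idea could work. This is only a guess at the addressing rule: it assumes each segment register is a flat 32-bit base that is simply added to the offset (which is what a stack segment of $00000100 suggests), and the function and dictionary names are invented for illustration.

```python
# Assumption: segment registers are plain base addresses added to the offset.
def effective_address(segment, offset, offset_bits):
    """Add an offset (truncated to offset_bits) to a 32-bit segment base."""
    offset_mask = (1 << offset_bits) - 1
    return (segment + (offset & offset_mask)) & 0xFFFFFFFF


def stack_address(segs, sp):
    # In 8-bit mode the stack pointer is 8 bits wide.
    return effective_address(segs["SS"], sp, 8)


def data_address(segs, addr):
    # In 8-bit mode data addresses are 16 bits wide.
    return effective_address(segs["DS"], addr, 16)


# Boot state as described: behaves like a plain 6502 with the stack at $0100.
BOOT = {"CS": 0x00000000, "DS": 0x00000000, "SS": 0x00000100}

# A second "virtual 6502" relocated to a hypothetical 64 KB window at $00010000.
VM1 = {"CS": 0x00010000, "DS": 0x00010000, "SS": 0x00010100}
```

With BOOT, a stack pointer of $FF lands at $01FF as on a real 6502; with VM1 the same code sees its memory at $00010000 and up without knowing it was moved.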
Well, UCSD-p was a 16-bit ISA that was done in the day as the Western Digital WD-90. I think it is at least worth studying, because it is a processor tailored for compilers. It has rather powerful instructions. What was microcode then can be cog code on P2.
Has anyone tried a 6809? Then there is also the 6309. No one has gotten the timing right on the 6309.
Yes, see links in first post. As long as you don't want a lot of complicated peripheral hardware (MMU!) the emulator written in C for both processors can be modified to work faster than the original. https://forums.parallax.com/discussion/174794/towards-os9-operating-system-on-p2/p1
I like this idea a lot!!
How about a P1-PASM-emulator for P2?
just asking for a friend...
Mike
What might be nice is a custom opcode set that is mostly routines. Memory is the bottleneck, so if you can have a "compression" that has low overhead, you can make up for a lot of that. So in a sense, sacrifice a cog to minimize memory usage: keep a runtime library in the P2 ROM and have an ISA that calls those routines and has the most common opcodes. Maybe paired instructions too.
Other ideas I've seen have been to do stuff like do a 6502/Z80 chimera or to do an '816, but only its native mode, and not honor instructions to change it. One that might be fun to modify would be the TMS9900. I haven't been able to find a good opcode list for it yet.
The TMS9900 with its register block in RAM would need a lot of hub accesses, if you use some old software. For P2, MC68k or LSI-11 as 16-bit CPUs might fit better, I think?
That is where one should optimize it by putting the first 256 or 512 bytes in cog RAM. So use actual registers for the area where the TI-99/4A put an SRAM. So make things roughly how it was done on real hardware. The TI-99/4A used SRAM for the workspace RAM and DRAM for the rest of the system. So treat cog RAM as SRAM and hub RAM as DRAM (with wait states and all).
But with what info I could gather about that ISA, I don't think I'd make one based on that for other reasons. The ISA seems a bit bulky in how it is laid out, and it uses a weird big-endian convention where the bit numbering is backward within each word: bit 15 is the lowest bit.
Let's see if I remember how the instructions are laid out. There are 3-bit opcodes with a 1-bit flag to select whether it is a byte or word instruction. Then a 2-bit destination memory mode descriptor, then the 4-bit destination address, then a 2-bit source mode descriptor, and then the 4-bit source. That is for 2 register arguments, I think. And the first 3 bits (logically speaking, physically the highest 3) changed the template depending on the type of instruction.
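Taking that layout at face value (3-bit opcode, 1-bit byte flag, 2-bit destination mode, 4-bit destination, 2-bit source mode, 4-bit source, counted from the physically highest bit down), the field extraction can be sketched in Python like this. The field names are my own, and the layout is only as recalled above, not checked against a data sheet.

```python
def decode_two_operand(word):
    """Split a 16-bit word into the fields described above (MSB first)."""
    word &= 0xFFFF
    return {
        "opcode":   (word >> 13) & 0x7,  # top 3 bits
        "byte":     (word >> 12) & 0x1,  # 1 = byte operation, 0 = word
        "dst_mode": (word >> 10) & 0x3,  # 2-bit destination mode
        "dst":      (word >> 6) & 0xF,   # 4-bit destination
        "src_mode": (word >> 4) & 0x3,   # 2-bit source mode
        "src":      word & 0xF,          # 4-bit source
    }


fields = decode_two_operand(0xA0C1)
# fields["opcode"] is 5, fields["dst"] is 3 and fields["src"] is 1
```

The widths add up to 3 + 1 + 2 + 4 + 2 + 4 = 16 bits, so the whole two-operand form fits in one word, which is probably why it feels bulky: there is no room left for anything else.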
Another CPU to try would be the 65CE02. One can mod the 6502 core here to add the C and CE instructions (and extra registers). And an advantage to the 6502 core here is that you won't need to constrain the cycles as much. The CE changed the pipelining to where there are no errant reads. The 6502 was optimized for 2-cycle instructions. So if it hits what should be a 1-cycle instruction, it loads the next byte (an opcode), decides it doesn't need that as an immediate operand, discards it, and reads that exact same byte again. The 65CE02 catches this, forwards the errant "operand" to the decoder as an instruction, and then fetches the rest of the possibly partial instruction. So for us, that might mean simplifying the 6502 core and removing any artificial timing constraints on single-byte instructions. That alone can make things operate about 25% faster at the same clock rate. And of course, coding around the added instructions can get things closer to 40% more performance at the same clock rate.
Another idea: a virtual machine/instruction set optimized for PSRAM based code execution.
Yes, for sure. I mean, my initial thoughts were to use some sort of "compression" in a sense, like complex opcodes to where you are sacrificing a cog to get around even the hub bottleneck. I don't think a given cog will get more than say the real-world equivalent of about 20 MHz out of the hub.
I'm not sure if actual compression such as LZW or Deflate would be of use or justify the resources vs. throughput. In that case, you'd likely need virtualization and a cache, since you'd need to read entire blocks of data, each with its own dictionary.
So maybe, an approach for a faster emulation would be to use 2-3 cogs for a single core, though I'd hate to see the pipeline penalties on branches. I mean, since LUTs can be shared by pairs of cogs, you could have one doing all the memory work and one doing pure processing, with the LUT being used as a cache, with room left for registers. And having the PC in the LUT RAM may be good so that a caching/virtualization cog can alter it if it needs to, as long as you can live with the constraint of relocatable code.
For my own design, I'd want to use 20 address lines, 16 data lines, and up to 5 control lines. So 40-41 GPIO lines would be costly. But, with more board logic, maybe using the /CE line on the board and the RAM, you could stop processing and do other things with those lines if you have to.
RISC-V with 32 MB RAM. @ersmith has already written a RISC-V emulator. It uses hub RAM only. This emulator is a just-in-time compiler, so it performs well on loops. But that makes it a little harder than simply replacing hub access with external RAM access. EDIT: forgot about interrupts. I think Linux needs an interrupt, which is not supported by riscvp2.
https://hackaday.com/2022/12/07/a-tiny-risc-v-emulator-runs-linux-with-no-mmu-and-yes-it-runs-doom/ In my quick testing, 32 MB was not enough to run Doom. cnlohr's emulator is written in C, so it should be possible to compile that for P2 and make memory access for the emulated RISC-V core use the PSRAM. It won't be fast. But we could say "hey, it runs Linux."
Well, my intention here is for a new CPU type altogether. I want something that could have been made in the Home Computing era but wasn't. I was not intending to look for a meme CPU or things commonly done in FPGA cores already. So RISCV, Blaze (or OpenFire, its open-source clone), etc., are all out for me. The idea was to create something new on par with older CPUs that is more lightweight than the P2, not as complex or more so.
As for Doom, I don't remember what I had when I first ran it, but it was like 4 MB of memory or something under MS-DOS. I mean the original Doom. Doom 3 required much stronger hardware. That essentially had a script interpreter running underneath that extracted game resources from game resource files into memory and then read a human-readable script and interpreted it. It seems that they should have at least tokenized it to increase speed and decrease memory usage.
I don't get all these hobby projects that use Linux. Hell, if you want to go lightweight, go with your own build of CP/M or DOS. Or not use an OS at all beyond the ROMs. You could do a lot of stuff with just BASIC back in the day. You know, format disks/tapes, etc. I remember how I sometimes sped up BASIC programs on the Atari 800. If I needed extra math speed, such as during loading, I'd disable the ANTIC chip. The screen would go black, but that turned off most of the DMA and interrupt usage (refresh still occurred, but no other DMA).
I am wondering something. Would it be good to have a few 4-bit opcodes with 12-bit operands? I ask that because I want to use word memory and word addresses as the memory model. I wouldn't want to have only that as you can only have 16 opcodes that way. It would be good to leave 1 or more slots as tunnels to longer opcodes. If you do things the Suite-16 way (somewhat similar to Sweet 16, an emulated ISA), you'd have only 31 opcodes. If a word is always read, then for instructions that have only the opcode and no operands, I guess some instructions could go in what would otherwise be the operand field. And 8-bit opcodes with 8-bit operands would be good, as would 24-bit operands. I don't know if I'd want to do 16 in such a scheme since 2 opcodes of the 8/8 format would likely be the same length of an instruction that uses the 16/16 format, and an 8/16 format would mess up the alignment.
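One way to picture the mix of 4/12 and 8/8 formats with a "tunnel" slot is the Python sketch below. It assumes the top nibble value $F is the escape, which is purely my choice for illustration; the function name and the returned tuple shape are also made up.

```python
ESCAPE = 0xF  # hypothetical: top-nibble value reserved as the tunnel to longer opcodes


def decode(word):
    """Decode a 16-bit instruction word into (format, opcode, operand)."""
    word &= 0xFFFF
    top = word >> 12
    if top != ESCAPE:
        # 4-bit opcode with a 12-bit operand
        return ("4/12", top, word & 0x0FFF)
    # Escaped: the next nibble selects one of 16 opcodes with an 8-bit operand
    return ("8/8", (word >> 8) & 0xF, word & 0x00FF)
```

With a single escape slot this gives 15 short opcodes plus 16 escaped ones, so 31 in total. Reserving a second slot would trade one more 12-bit-operand opcode for another 16 of the 8-bit-operand kind, and every instruction still stays exactly one word, so alignment is never an issue.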
A few years ago I gave lectures on the basics of FPGA programming. I wanted to show students how to design a simple processor. The result was a simple CPU with this instruction set. It has only an A register, a stack pointer, Z and C flags, and an 8-bit address space, with separated I/O and memory. That's all. One of the simplest CPUs possible, with about 380 LEs used.
Not so much for emulation by P2 but for real hardware:
Perhaps it would be fun to build something like the Gigatron as mentioned in the other thread https://gigatron.io/media/Gigatron-manual.pdf but with **dual-ported RAM** https://renesas.com/us/en/document/dst/7028-data-sheet?r=13320 to have a Harvard-Von-Neumann architecture. It would be able to use the simple Harvard hardware concept of the Gigatron but still be able to edit or load the program.
On a physical Gigatron, I have an idea to make one run at about 75 MHz with discrete chips, but it is quite advanced. I likely won't do it as the chips would be hard to get and would be hard to mount. That is, redesign the base machine from the ground up to use a 4-stage pipeline. Have Fetch, Decode, Access, and Execute. Now, you'd have to keep the stages under 15 ns each. I wouldn't know how to make a fast decoder unit or a fast ALU. Sure, you can use transparent latches to make an ALU and have it be 100 MHz compatible. And you could probably replace the control unit with a fast GAL or CPLD. But if nothing else, if you don't mind using loads of memory chips, you can make your own control unit and ALU in LUT form in ROMs, and have a circuit on boot to copy from slow ROMs to fast SRAMs, and use SRAMs in every stage. Put the core ROM in an SRAM, the CU in an SRAM, the user memory in an SRAM, and put the ALU in an SRAM. Just the pipeline changes would require a new core ROM. The standard one won't work due to adding more delay slots. And going that far, you might as well add new instructions. Like if you use a 16-bit wide ROM/SRAM, you could add 8/8/16 multiplication and 8/8/8+Mod division in a single cycle. Plus you could throw in a scrambled list of numbers, for RND, etc. And of course, change the control matrix (which with memory can be done arbitrarily) to where you change out all the unused instructions and add more useful ones. On the Ac=Ac+Ac instruction, you could make it more useful by giving it a parameter (the operand area in the ROM is unused for it anyway). So you could add a range to that instruction, and even a shift direction. Plus, to deal with the 75 MHz (or faster) and still use a 6.25 MHz pixel clock, you'd want a way to keep both the vCPU and video threads active together, and the easiest way would be to add about 3 more registers. So have at least 2 new memory indexes.
Thus in your video loop, you have the port command and then 11 instructions for vCPU, another pixel, etc. So you structure the code to divide the video activity by 12.
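The divide-by-12 figure falls straight out of the two clock rates mentioned above. A quick Python check, assuming one instruction per clock and exactly one port write per pixel (the variable names are mine):

```python
CPU_HZ = 75_000_000    # proposed core clock
PIXEL_HZ = 6_250_000   # pixel clock to keep

slots_per_pixel = CPU_HZ // PIXEL_HZ   # instruction slots between two pixels
vcpu_slots = slots_per_pixel - 1       # one slot is the port write itself
cycle_ns = 1e9 / CPU_HZ                # time budget for each pipeline stage

print(slots_per_pixel, vcpu_slots, round(cycle_ns, 2))
```

So each pixel period is 12 slots: 1 port command plus 11 for vCPU work, and the cycle time of about 13.3 ns is what sets the "under 15 ns per stage" requirement.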
As for dual-ported RAM (which is deprecated), you don't need that or even bus-mastering DMA for the Gigatron. Instead, there are expander boards you can build/order to be able to increase the RAM, reduce the amount of bit-banging, add SD cards, etc. And if you really need DMA, I know of a way to add something close enough. The Gigatron expanders use the really odd instructions (the ones that assert both memory lines) to talk to external cards plugged into the SRAM socket. The address lines send the commands. Now, if you can do that, you could send an instruction to do something that requires bus-mastering, and immediately put the ROM into a spinlock that looks for a known value in RAM at a specific spot. That would only help if you move the bit-banged peripherals out into hardware.
I'm still interested in 16-bit since, at some point, I may want to use external word memory. And the TMS9900 doesn't interest me, and a 286 would be too slow to do in a P2. So I'd want to make my own 16-bit ISA for it, and I could use help in deciding what other board features to provide. Obviously, it would need to support SD or even CF cards. I think I'd prefer CF cards since they are mostly IDE, and thus you have a wider pipe. And then I don't know what I'd do for the sound, but I'd want better sound than on the Gigatron and most legacy machines for that matter. The PC had a 1-bit port for the sound, but at least you had an unused system timer channel, so at least you had square waves. Bit-banging could be done, but not consistently (due to the IRQ and DMA usage of everything else, so you'd have variation, though, for brief amounts of time, you could disable interrupts). Other machines had 3-4 sound channels (including the C64, TI-99/4A, and the Tandy 1000). And I haven't worked out what video capabilities to have. As for a keyboard, a PS/2 would be fine, and I'd likely want to port the SPIN code for that to P2ASM.
I know that you are not interested in Forth, but have you had a look at the RTX2000 https://users.ece.cmu.edu/~koopman/stack_computers/sec4_5.html ?
It seems to be interesting for P2 in my opinion, because
A) the 2 stacks have 256 elements each and will fit into LUT+COG. Helping to keep HUB access low.
B) The instruction set is so small that I think it should fit into COG+LUT.
C) Being 16 bit it still can use 1meg RAM.
D) Perhaps it is possible to find some software for it. At least there is for Novix4016.
I don't really understand that one.
I think I'd still want to make my own ISA, and unlike most 16-bit machines, I'd like to try word addressing. Most that used word memory were byte addressable. And maybe try to force alignment too.
Now, there is something I need to work out. What is a good way to add a character set to a P2? I mean, the P1 had a built-in character set, and the P2 gives you LUTs instead. If you do an 8x8 set, that is 8 bytes per character, and if you want high-ASCII too, that is 2K. Yes, you can stuff a LUT with that, but if you do so, you don't have it for other purposes. Wait, I think I just worked it out. What if you use 2 cogs to do the video? Like having one for loading/caching, and one for the actual display. So the one loads as fast as it can, gated by the amount the other has displayed. Once the buffer is saturated (up to 6 lines @320 pixels width), it only starts loading new lines when old ones are used. So maybe have a counter that runs in the porch time that updates a value in LUT memory. So once saturated, only start replacing lines as the lines are completely used. Emulating a 240 height means using scan lines twice. As for memory usage, I guess the other 2K LUT could be for a character map that's initially stored in the hub. But then, what about sprites? Too bad that 3 cogs cannot share LUTs. But with caching the framebuffer, I guess 2 cogs could do it, though cache aliasing isn't usually good practice, and there could be the cache for the display and the cache for the sprites and/or display list manager. However, simpler is probably better. So one could just use the hub for the framebuffer and use the LUTs of 2 cogs for sprites and a character set, but that depends on timing.
So, can a P2 do 320x240, 256 colors (and maybe 640x480, 4 colors), with a palette of up to 512 colors (to still give pure greys), a text mode, and also hardware sprites? And what about a mixed mode where you can do graphics and a text mode simultaneously? Or even split-screen mode where you have a text box or 2 and a graphics box?
Part of me wants to do a "Gigatron" remake from the ground up where things can run vCPU code, but use a P2 ROM and not a Gigatron native ROM. But that will be a huge job to pull off. What I'd like to see with that is maybe more of a hybrid design. Like maybe have a compatible BASIC in ROM, but have it written in native P2 code. So BASIC gets a "coprocessor" and is executed outside the "Gigatron" memory range. So have a vCPU cog, a video/sound cog (which may or may not have keyboard input), a supervisor cog, a keyboard cog if it isn't covered in any other place, and a file I/O cog. Then the P2 code would have to make sure what is exposed to programs is the same as what the Gigatron and its ROM do. One of the reasons for having a supervisor or management engine cog is to make up for the tasks that emulating vCPU directly without emulating the Gigatron wouldn't provide. For instance, it would be what boots first, tests/initializes the memory, displays a starting menu, and provides the program loader. Plus it should provide hotkeys, and if you want frequent random numbers in RAM, that could be the cog that does that. The vCPU does have a syscall mechanism, so you can also ask it for random numbers on the fly. And for compatibility, give it both the RAM approach and the syscall approach. Having a dedicated cog for the vCPU would be good in that it would simplify the emulation and reduce overhead. There would be no "dispatcher" in the code since you'd have a single thread only and no software multitasking. So without a context switcher, you can do more efficient emulation. Now, if that makes things too fast, the way to slow it down is to have the vCPU run only during video porches. If one could make such a framework, they could probably experiment with compatibility with other systems. Like, put the 6502 core in it, etc.
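To check the sizes and show what the character-map lookup would look like, here is a Python sketch. It assumes a row-major 8x8 font (8 consecutive bytes per character, most significant bit as the leftmost pixel) and one byte per pixel for the line buffer; those conventions and all names are assumptions, not anything the P2 dictates.

```python
GLYPH_ROWS = 8
CHARS = 256
FONT_BYTES = GLYPH_ROWS * CHARS   # 2048 bytes: fills one 512-long LUT exactly
LUT_BYTES = 512 * 4               # P2 LUT RAM is 512 longs

LINE_PIXELS = 320
LINES_BUFFERED = 6
LINE_BUFFER_BYTES = LINE_PIXELS * LINES_BUFFERED  # 1920 bytes at 1 byte/pixel


def glyph_row(font, code, row):
    """Fetch one 8-pixel row of a character from the font table."""
    return font[code * GLYPH_ROWS + row]


def render_scanline(font, codes, row):
    """Expand one scan line of a text row into a list of 0/1 pixels."""
    pixels = []
    for code in codes:
        bits = glyph_row(font, code, row)
        pixels.extend((bits >> (7 - x)) & 1 for x in range(8))
    return pixels
```

So the full 256-character set uses up one LUT completely, and the 6-line buffer (1920 bytes) just fits in the other, which leaves nothing over for sprites in LUT RAM under these assumptions.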
https://forwardcom.info/ ?
Yes, the Forwardcom thing sounds interesting, and even if you don't want all the features mentioned, those are some good principles.
Something I likely wouldn't want to try would be a way to use older CPUs in a somewhat parallel fashion to get wider data. For instance, what if you had a front end for 2 6502s that calls both CPUs to do single tasks? So have 16-bit instructions using both CPUs in parallel. But one bottleneck would be dealing with carries/borrows. And then, if you could do that, it would be more efficient to design an entire CPU. To do addition, you'd have to propagate the carry so the high byte gets incremented, and thus you consume an extra cycle or whatever. So the lower CPU couldn't continue until that happens in the upper one.
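The carry problem is easy to see if you model the two 8-bit halves separately. A minimal Python sketch, where add8 stands in for one 8-bit CPU's adder (both function names are made up):

```python
def add8(a, b, carry_in=0):
    """One 8-bit adder: returns (8-bit sum, carry out)."""
    total = (a & 0xFF) + (b & 0xFF) + carry_in
    return total & 0xFF, total >> 8


def add16_split(a, b):
    """16-bit add built from two 8-bit adds, low half first."""
    lo, carry = add8(a & 0xFF, b & 0xFF)
    # The high half cannot finish until the low half has produced its carry.
    hi, carry_out = add8(a >> 8, b >> 8, carry)
    return (hi << 8) | lo, carry_out
```

For example, add16_split(0x12FF, 0x0001) gives 0x1300 only because the carry from the low byte reaches the high byte. That dependency is the serialization described above: the two CPUs can start together, but one always ends up waiting on the other.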
A 32-bit 6502 has been developed already. I have a preliminary datasheet at work dated somewhere in the 90s; I do not remember right now if it is from Rockwell or WDC.
With a clean slate, you can do anything you want !
The main drawback with a roll-your-own, is that the software side becomes a ship-load of work, and is likely to never be completed.
That's why emulations that hook into a large software base, make more practical sense.
For a current-era clean slate MCU, one opcode area that appeals is a variable length skip.
These days with streaming memory, and QPI flash, and XIP, you are better to skip rather than jump-forward, for short if/else conditionals.
Smarter HW or an emulation could take a short-jump opcode, and invisibly decide when to re-issue an address-and-fetch, and when to just skip in the stream.
The 8051 has 8-bit opcodes and 8-bit operands, and it has 1-, 2-, and 3-byte instructions.
The newest 8051's actually fetch 24(+) bits, and decode the complete long opcode in the same time as the 1 byte opcodes.
That's possible, with on chip flash.
WDC never got any further than that preliminary datasheet, which was really just a slightly marked up 65C816 datasheet.
I have been thinking a lot about this. I do agree very much with this sentence and I think there are some other criteria if we speak about doing the emulation on P2.
(Was not able to include a better table)
Here I try to give 0...3 points for the criteria:
Well, MC6809+OS9=CoCo3 is still quite an interesting match.
Funny last line. The virtual FORTH processor of TAQOZ fits really well. :-) Just not too much software for it? Perhaps I should try to port the C compiler of SWIEROS for it?
At this very moment, I think an emulator of the Sinclair QL with QDOS would also fit rather well to P2. The MC68008 was slow enough to be emulated. To be fast, software was written in assembler. It was designed for 128 KB of RAM. It has multitasking. As it was the first of a new family, things seem to be straightforward, with no quirks for compatibility with a predecessor. No MMU. Just 2 bitmapped graphic modes, which need a moderate 32k. I have not yet found an emulator source for it. Ada has written an emulator for the MCU.
That doesn't bother me. Once I come up with the specification, others can improve on it and write more code.
Or, just settle for simple predication. For single instructions, they make more sense than jumps. Like "Move if Equal," "Move if Not Equal," etc.
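In an emulator, a predicated move is just a normal handler with a flag test in front of it, so the program counter never has to change. A tiny Python sketch, with made-up register names and a single Z flag:

```python
def compare(regs, flags, a, b):
    """Set the Z flag when two registers are equal."""
    flags["Z"] = regs[a] == regs[b]


def move_if(regs, flags, dst, src, want_equal=True):
    """'Move if Equal' (want_equal=True) or 'Move if Not Equal' (False)."""
    if flags["Z"] == want_equal:
        regs[dst] = regs[src]


regs = {"r0": 1, "r1": 1, "r2": 7, "r3": 0}
flags = {"Z": False}
compare(regs, flags, "r0", "r1")
move_if(regs, flags, "r3", "r2")                    # taken: r3 becomes 7
move_if(regs, flags, "r0", "r2", want_equal=False)  # not taken: r0 stays 1
```

Either way the instruction stream keeps flowing in order, which is the same benefit the variable-length skip is after for short if/else cases.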
The 16-bit instructions that waste an extra opcode byte could have other uses such as using the extra byte to add more registers and/or modes.