What would be a good idea for a new CPU and platform to try on a P2?
I sorta think it would be nice to have something somewhat based on a 6502, but made more for performance. For instance, make all the regular instructions at least 16-bit, and maybe keep the opcodes at 8 bits and variable length. Then one wouldn't have to get complex with the '816, multiple modes, etc. If X and Y are 16 bits, then 20-24-bit memory addressing would not be an issue, and Page 0 would be 64K. I guess it would be wise to throw in some 8-bit and 32-bit ops. I'd want RNG, multiplication, and division (with modulus) to be part of the opcode set. Those are things that the early machines lacked as instructions.
I don't know if BCD support should be added, or how to handle it. For instance, with 16 bits, would all 4 nibbles be treated as a 4-digit decimal number? One reason the older machines used BCD was not just accuracy in accounting, but also to somewhat simplify things like scores in games. Yes, BCD takes more overhead, but converting to ASCII was easier: just add each nibble to the ASCII code for 0 and build that into a string. Converting from binary to ASCII was a little harder. I know on the PC platforms, one way was to keep dividing by 10, keep the modulus each time, add each modulus to the ASCII code for 0, and build the string from right to left. I don't know if there are other ways. You can subtract in place of dividing, but that gets messy fast and can take a while. Going the other way, you'd subtract the ASCII offset, then multiply each digit by its place value and add them up. I don't think it would be feasible to have both of these conversions as opcodes, since you'd need more registers to specify the source, the destination buffer, and the number of characters involved.
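For what it's worth, the two conversions described above sketch out like this in Python (just an illustration of the algorithms, not any machine's actual microcode):

```python
def bin_to_ascii(value):
    """Binary to decimal ASCII: keep dividing by 10, keep the modulus,
    add each modulus to the ASCII code for 0, build right to left."""
    if value == 0:
        return "0"
    digits = []
    while value > 0:
        value, remainder = divmod(value, 10)
        digits.append(chr(ord('0') + remainder))  # modulus + ASCII '0'
    return ''.join(reversed(digits))

def ascii_to_bin(text):
    """Decimal ASCII to binary: subtract the ASCII offset, then weight
    each digit by its place value and accumulate."""
    value = 0
    for ch in text:
        value = value * 10 + (ord(ch) - ord('0'))  # shift one place, add digit
    return value
```

You can see why it's register-hungry as an opcode: each routine needs a source, a destination buffer pointer, and a running count.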
Are there any more instructions that would be good to add? I guess a stack would be good, and likely calls and interrupts. There are questions about where to put the reset vector, the interrupt vectors, etc.
And then there are the technical issues involved. When you get to 16-bit and higher, you must consider alignment. On the P2, alignment is not an issue since you can read from the hub in any place and read 1-4 bytes. However, I'd like to include the ability to add external SRAM, preferably parallel (yes, I know 1M of word addresses means 40 GPIO lines). In that case, you'd incur an alignment penalty since you'd need to access it twice and increment the address as you go. For a homebrew design using just chips, one could mitigate this if you had two address and data buses, allowing you to read the low byte from the odd bus and the high byte from the next address of the even bus. But that sort of complex trick is out of the question here.
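The two-access penalty on a byte-wide external bus amounts to this (a minimal Python sketch, little-endian assumed):

```python
def read16(mem, addr):
    """16-bit read from byte-wide external RAM: two accesses, with the
    address incremented between them. An aligned 16-bit-wide bus would
    do this in one access; this models the unaligned/narrow-bus case."""
    lo = mem[addr]        # first access: low byte
    hi = mem[addr + 1]    # second access: high byte, address + 1
    return lo | (hi << 8)
```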
How would fetching be handled? If you assume an opcode byte and a byte operand, that is a nice, even 16 bits. But what if that is a single-byte opcode? Then the next byte is an opcode too. The 6502 fetched and discarded these; it wasn't until the limited run of the 65CE02 that they were recycled. That "data" byte would get forwarded back into the pipeline, and while that happened, the immediate operand was loaded (or the next instruction, if this happened twice in a row). And what about the maximum immediate size? Should I add some 24-bit immediates? That would seem efficient, as a single-byte opcode and a 3-byte immediate would be an even 32 bits. And of course, for handling within the P2, should one just do 32-bit fetches from the hub?
How would this interpretation be implemented? I mean, is there an efficient way to branch based on virtual opcodes without polling, or testing bits and walking instruction trees? How would one go from a virtual opcode to its "microcode"? Does the P2 have an efficient way to deal with 256 instruction handlers?
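The usual answer is a flat jump table indexed by the opcode byte, which needs no bit tests or trees at all. A minimal sketch in Python (the handler names and `vm` layout here are made up for illustration; a real interpreter on the P2 would do the equivalent in PASM):

```python
# One handler function per virtual opcode; the opcode byte itself
# indexes the table, so dispatch is a single indexed jump.

def op_nop(vm):
    pass

def op_inc_a(vm):
    vm['A'] = (vm['A'] + 1) & 0xFFFF   # 16-bit accumulator, per the design

handlers = [op_nop] * 256    # 256-entry dispatch table
handlers[0x1A] = op_inc_a    # opcode numbers chosen arbitrarily here

def step(vm):
    opcode = vm['mem'][vm['PC']]
    vm['PC'] += 1
    handlers[opcode](vm)     # direct indexed dispatch, no polling
```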
I mentioned the possibility of using parallel SRAM. Some would consider that not worth the effort and suggest PSRAM or some other serial arrangement. But I can see it as possible if one has plenty of multiplexers on the board (or, hell, uses the /CS line on the memory) and allocates a line for that. And peripherals can watch that line too and know when they can speak to the P2 (even if it's another P2).
What about sound? I'd like the sound to be a little better than what existed in the retro era. The TI chip was nice in that you had 3 tone channels and a noise channel, and the Pokey improved on that by letting you use all 4 channels as you saw fit. Plus it added a 16-bit sound mode and gave you the ability to bit-bang through it too. And the IBM PC, while it had the worst sound of all, gave you the full range of frequencies. One of the high pitches I sometimes used was 15,750 Hz; that one is kind of nostalgic. And the Gigatron, I think, can only go up to around 3900 Hz or so before aliasing gets you. I never messed with a SID or even know how to program it; it seems a bit complex to me. At least it is more in tune, though jazz musicians and demo makers don't seem to care. There are features I wish had existed on the old platforms, such as a note mode and a chord mode. For some songs, at least 5 channels would be handy. It would also be nice to have a library of the most common samples built in: square, sine, triangle, ramp/sawtooth, noise, percussion, and anything used by common games. Maybe the more interesting nonstandard waveforms too, like 2 triangles convex in the center, one that starts as a square or triangle and transitions into a sine, one that starts as a square but ends in noise, etc. So I'd like more options and better sound than in the old days, but not too awfully complex. And vibrato, pitch-bending, and a few other variations would be nice.
And what about the video? I wouldn't mind something that could do 320x240 and 640x480. Of course, to do that in bitmap mode, you'd need 75K and 300K respectively, assuming 8-bit color. Hardware sprites of some sort would be nice; so would a text mode. I don't know what to think about an indirection table like the Gigatron uses. That is meant to make special effects easier, such as scrolling, duplicating lines, flipping the screen, etc., but the approach can be clunky to work with and fragments the memory. David Murray thought of a way to do better than CGA by trading memory for color: store things so that each line has a 16-byte header to define the colors for that line, and then let each byte represent 2 pixels. So you can use up to 256 colors on a page, but only 16 per line. You could do similar with 4-color modes by having a 4-byte header and then packing 4 pixels per byte. Once again, you can have 256 colors on a page, but only 4 per line. That would be more useful than the CGA hi-res mode. However, you couldn't use scrolling in these multi-pixel modes unless you scroll by 2-4 pixels at a time (unless you get really complex). I've thought about ways to have a 9-bit output, which isn't possible for an 8-bit system unless you let the exact opcode number determine the status of the highest bit. But then, how would one store it? So 16-bit would make that more possible, and at that point, you might as well go with 15 bits.
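Decoding one scanline of that 16-colors-per-line mode could look like this (a Python sketch under the assumptions above: a 16-byte per-line palette header, then two 4-bit pixels per byte):

```python
def decode_scanline(line_bytes, width=320):
    """One scanline = 16-byte palette header + width//2 pixel bytes.
    Each pixel byte holds two 4-bit indexes into that line's palette,
    so a page can use 256 colors overall but only 16 per line."""
    palette = line_bytes[:16]             # the 16 colors this line may use
    pixels = []
    for b in line_bytes[16:16 + width // 2]:
        pixels.append(palette[b >> 4])    # high nibble = left pixel
        pixels.append(palette[b & 0x0F])  # low nibble = right pixel
    return pixels
```

The 4-color variant would be the same shape with a 4-byte header and four 2-bit pixels per byte.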
I thought it was interesting how MS-DOS and text mode worked. You had 80x25 text. That is 2000 characters. However, it was stored as 4000 bytes, giving each character a 16-color foreground and background. The ASCII code determined where the pixels were, and the attribute byte determined what color they were.
Speaking of hardware sprites and layers, what do you think would be good? I mean, how large, and how many? Atari used a PMG scheme. One thing that I don't think has ever been tried is a hardware chase mode. You could name one sprite as the targeted sprite and assign it to the others, and the pursuing sprites would try to find the targeted one. They would need to know the background color and could only move over the background color. So they would avoid the borders, not walk through walls, and return a signal when a collision with the target occurs. And even a sprite and texture library in hardware could be nice. For instance, what if you had Pac-Man, ghosts, stars, numbers, walls, doors, etc.? You'd have many of the elements used in games already there.
Plus, why not video modes that use some sort of display-list format? There are ways to have "opcodes" for what you want to happen on the screen. For simple screens, you'd do better to set foreground and background, position things, and describe what you have on the page. So it can work as compression.
6809 / Dragon 32/64 / Color Computer 1/2/3
Lots of interesting video modes in the CoCo3 GIME.
Yeah, but I want something different, something that isn't already a standard CPU type and one that can make the P2 features shine.
Native P2 makes the P2 features shine.
True, but it would be nice to make a CPU type that doesn't already exist, comparable to what was around in the '80s but with a little more pep.
With the Gigatron, that is why it uses syscalls. While that mechanism adds overhead, it helps for more complex things, since that code is native, not vCPU. So with the approach proposed here, that might be a good thing to add. Then things like string manipulation, bin --> ASCII, etc., could run in native code.
The P2 is a non-existing CPU! It is only a dream. When you are working with the P2 you are dreaming. https://en.wikipedia.org/wiki/Dream_argument
Perhaps a Forth stack machine would be a good project for a virtual CPU. The beauty of Forth is that it can be extended starting from a small kernel. And there is Forth code out there for inspiration or applications.
There is already Tachyon, but you could try your different one.
You could try to include cooperative multitasking and local variables early in your concept; they are extremely useful for readability.
Something to read: forth.org/OffeteStore/1003_InsideF83.pdf
I'd rather try something more untried. Forth on a chip, BASIC on a chip, etc., have all been tried. I don't know what the draw is or why so many keep proposing such all over the web.
And yeah, some sort of local variables, private stack, Page 0 in a cog, etc., are all good.
I'm more after a new CPU I can call my own. I keep starting various threads with certain intentions, and they seem to keep getting hijacked, but I haven't complained much because some useful things of value to so many come out of them, even if they are not what I envision.
In my comment about making things shine, I meant I still have MY original design but want to let enough of the P2 features through to go a little better than in the 80s. You need some constraints for creativity. The worst killer of creativity and even of ourselves is to have everything we want. That is also the worst curse for those who practice the occult to pronounce, that someone gets all of their heart's desires. That sounds like a good thing, but it kills a person from the inside. Having something to accomplish is what pushes us to go on. It kills coding too since the more features you have, the less you understand them or make efficient use of them.
It exists. You can buy one and hold it. But that's not what I meant. I want to make a CPU specific to me and the P2 would just be my canvas and medium.
I was hoping for discussion within the constraints I already laid out, and some discussion of what might be better. My dream is to make a CPU, whether it ever becomes an ASIC or not. I have an FPGA board I could use, but the P2 already has many of the features in a possibly more obtainable way. For instance, I don't have to worry about a memory arbiter.
Some other ideas:
The P2 has plenty of ADCs and DACs. You could build a hybrid CPU and implement some functions with analogue circuits.
Or: use the relatively big hub RAM or the LUT to build an ALU based on tables, for example with a consistently logarithmic scale, like dB.
I often think that innovation comes from combining two known worlds.
Spoken by the people accomplishing great things with the P2. Who can argue with the results?
Yeah, you propose some neat ideas here.
If I wanted to do a LUT-based ALU/CPU, I think I'd rather do that with real hardware, and I already proposed such a design. I like the Gigatron TTL project, but I'd like to see something faster. So I proposed a Gigatron-like machine with a 4-stage pipeline, using 10 ns SRAM for each stage; about 75 MHz should be doable. The Harvard system core would be in a ROM that gets shadowed to SRAM on boot, the control unit would be ROMs that get copied to SRAMs on boot, then there would be an access stage for dealing with a data SRAM (and providing its own ALU and PRNG when it is not), and finally there would be the LUT-based ALU, which again would be a ROM shadowed to an SRAM. Having the CU and ALU in LUTs means I can change the opcodes just by burning new ROMs. In a way, that would be like a homebrew FPGA.
If one were to make new real HW, something that I believe needs more development is clockless computing. Sure, we have a hybrid in the form of clock-gating and power management, but there are few asynchronous CPUs, though they do exist. The advantages include being faster if designed right. They don't have a competitive edge at the moment because few tools exist for them and the field is in its infancy. There is the Amulet, which ran neck-and-neck with the ARM CPU it was meant to replace. It likely could go faster, but I don't think anyone is working on it anymore. The other advantages would be less heat dissipation, less power, and less RFI. I totally get why there would be less RFI: with no main clock, all the circuits work as fast as they can, with handshake mechanisms all over the place, so the RFI given off is more like spread-spectrum noise. The design considerations differ from clocked CPUs. For clocked designs, you have to make things fit within the cycles, so everything is designed around worst-case paths; asynchronous CPUs are designed around average-case paths. While asynchronous CPUs require about twice the circuitry, the parts they share with clocked CPUs can be simpler. For example, carry-skip adders would be less necessary in unclocked designs: the ALU finishes when it finishes, and it doesn't always have to allow time for a carry that might not occur. So you could get by with a ripple adder in your ALU and not worry about carry-lookahead and the ways that can go wrong. One place this approach is used is GPUs, where faster and faster clock rates no longer make sense and lead to other engineering challenges.
Plus, things like wearable and flexible computers would be more likely to work with an asynchronous design, since movement can throw off the timings of a clocked design for brief periods and lead to crashes. If something takes longer, the whole CPU just slows to accommodate. So if it slows because of overheating or power sags, the CPU just works slower and would be less likely to crash.
And yes, incorporating analog processing is starting to make a comeback. We've nearly reached the end of Moore's Law, and we can't keep miniaturizing CPU components to make them faster for too long into the future. So what we will see beyond asynchronous computing is multi-core processing that uses specialized analog cores for tasks that analog computing does better, while the digital cores do what they do best. Just imagine analog GPUs. I would assume they might produce better images, since digital only approximates images. Monitors would still impose their limits, but if they had the ability to use different-sized pixels (I know CRT-based displays technically did; I'm not sure about flat screens), the images would be amazing.
You wrote about flipper machines in another thread. Combining mechanics with computers is of course not a new CPU, but it could be fun, perhaps. For example, a dream would be to build a table-soccer robot using a low-res camera, or perhaps the Goertzel mode (?), to track the ball. A Kalman filter? Fuzzy logic for speed?
Yes, pinball machines, as we call them in the US. The solid-state ones have been around since 1978, if not late 1977. Before that they were electromechanical, and before that purely mechanical. Both of those earlier types are interesting. The EM machines used a motor with cams much like a CPU, and relays took the place of logic chips. And while there were semi-hybrid EM and solid-state machines, one approach that was never tried was solid-state or hybrid without a CPU.
There is an opportunity to do that now with older machines that cannot be repaired. Rather than part them out, I know a way to save them, though they won't be original anymore. If the scoring motor and cams are damaged, one could make a microcontroller board to emulate the cams. I'd use the power to the motor as the enable/halt line (of course, reduce and rectify it and maybe pass it through an optocoupler). Then the timings of the cams would be emulated in code, with the code stuck in a spinlock whenever the motor current is off. If you want to get fancy, provide sounds as the "switches" hit the virtual cams, and provide an LED so you know when the motor line is active. The sounds would be mostly for nostalgia, if you can get the right percussion sounds, and the LED (and the fake switch noises, to a degree) would assist techs who stumble onto this: if the virtual scoring unit is locked on, they would know it and consider looking for bad switches or contacts elsewhere. And there is a narrow range of acceptable speeds; too slow is better than too fast, since too fast won't work at all, but way too slow could cause overheating since coils would be energized for longer.
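A rough sketch of that cam-emulation idea, in Python rather than microcontroller code. Everything here is hypothetical: `motor_on` and `fire_switch` stand in for real GPIO reads and coil/switch outputs, and the cam timings are illustrative, not measured from any machine:

```python
import time

# Illustrative cam lobe positions as time offsets within one motor
# revolution, plus the total cycle time. Real values would be taken
# from the actual scoring-motor cam stack.
CAM_EVENTS = [(0.10, "1st switch"), (0.25, "2nd switch"), (0.40, "3rd switch")]
CYCLE_TIME = 0.50   # seconds per revolution (made up)

def run_cycle(motor_on, fire_switch):
    while not motor_on():
        pass                    # spinlock while the motor line is off
    start = time.monotonic()
    for offset, name in CAM_EVENTS:
        while time.monotonic() - start < offset:
            if not motor_on():
                return          # motor dropped out mid-cycle
        fire_switch(name)       # virtual cam lobe closes its "switch"
    while time.monotonic() - start < CYCLE_TIME:
        pass                    # finish out the revolution
```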
Would it be "your own" if you implement what others mention/discuss in this thread?
This feels like a déjà vu to me.
LOL! Well, I already laid out the framework. If others give suggestions, the framework would be the same. I gave a basic outline and was asking for guidance within that. If others suggest opcodes or access modes, then I'd still have to create the opcode map, the emulation logic, memory map, firmware, etc. It's not like anyone is demanding that I put their opcode suggestion at any given place in the opcode map. "I demand you put mine at $FF or you can't use it."
During his many improvement cycles of Tachyon, Peter came to the conclusion that on the P2 it is best to use 16-bit wordcode instead of bytecode. He used the wordcode itself as the address of the P2 code, so the interpreter is very fast.
I would not think of an 8- or 16-bit machine; I'd just use 32 bits for the ALU and data.
What do you want to do with the new CPU? A compiler will be needed.
I think it would be fun to do foundational development work on what would be a decimal computer. Decimal throughout. Later, it could be implemented in silicon if worthwhile.
However, I am specifically after what I said, and I will not compromise.
Of course, everyone knows a new assembler or compiler would be needed, and that is part of the point. Starting fresh with one's own ecosystem and not using anything old. And this is all just hypothetical at the moment. I was hoping everyone would have their own dream ideas. And as you can see, I was leaning toward a platform. So it would be a complete computer, but the CPU is just one part. I described all of that in my opening post.
It might be fun, but it wouldn't be my cup of tea. I only know how to do rudimentary calculations with BCD; I would have to study how one would multiply and divide. And I just don't care for the bottlenecks it would impose. For instance, you'd have to add or subtract 6 to adjust for carries and borrows.
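For reference, the add-6 adjustment I mean looks like this (a Python sketch of packed-BCD byte addition, not any particular machine's decimal-adjust instruction):

```python
def bcd_add_byte(a, b):
    """Add two packed-BCD bytes nibble by nibble. Whenever a nibble's
    sum exceeds 9, add 6 to push it past the unused hex codes A-F and
    propagate a decimal carry. Returns (result_byte, carry_out)."""
    result = 0
    carry = 0
    for shift in (0, 4):                # low nibble, then high nibble
        s = ((a >> shift) & 0xF) + ((b >> shift) & 0xF) + carry
        carry = 1 if s > 9 else 0
        if s > 9:
            s += 6                      # the decimal-adjust step
        result |= (s & 0xF) << shift
    return result, carry
```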
Ternary would be another interesting design one can do. While you can attach 2 P2s together, I don't think there is a way to know if its partner is tri-stated or not. I think peripherals could figure out if a P2 is tri-stated, but I am not sure a P2 would be able to tell in the other direction. And I don't know if using all 3 states could result in faster traffic. And a problem with ternary is that it would be wasteful to store using binary memory. So you'd use 2 bits per trit and waste a state. Now, quaternary encoding would translate well to binary RAM. Two bits per quad would work.
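The storage question is easy to see in a sketch: packing trits at two bits each leaves one of the four 2-bit states permanently unused, while quaternary digits use all four (Python, illustrative only):

```python
def pack_trits(trits):
    """Pack base-3 digits two bits each, least-significant trit first.
    The fourth 2-bit state (0b11) is simply never produced: wasted."""
    word = 0
    for t in reversed(trits):
        word = (word << 2) | t      # t is only ever 0, 1, or 2
    return word

def pack_quads(quads):
    """Base-4 digits map onto 2 bits with nothing wasted."""
    word = 0
    for q in reversed(quads):
        word = (word << 2) | q      # q uses all four states 0..3
    return word
```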
There's no really fast way to detect Tri-state. It takes time and testing. Better to hold all data affirmatively.
Imagine an assembly language in base 10. Instead of numbers of bits, you have numbers of digits. An ADC might be 3 or 4 digits. It would make it really easy for people to learn assembly language.
We could implement a simulator for the IBM 1620. It was all decimal and had variable word length. It was the second computer I ever programmed after the Monrobot XI.
I've thought a lot about this and it could work beautifully within its own sphere. The difficulties come when you try to connect it with the rest of the world's hardware which is all binary.
Yeah, I suspect that would be true. It did support character I/O and I suppose that could be used to communicate with the outside world but any kind of binary data would be difficult.
You could make a shifted-ASCII to two-decimal-digit system:
$00 (terminator) = 00
$1B (escape) = 01
$08 (backspace) = 02
$09 (tab) = 03
$0D (cr+lf) = 04
$20..$7E (printable chars) = 05..99
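That mapping could be sketched as (Python; the function name is mine):

```python
def ascii_to_dec(ch):
    """Map the shifted-ASCII scheme above onto two decimal digits:
    a few control codes take 00-04, printables shift down to 05-99."""
    specials = {0x00: 0, 0x1B: 1, 0x08: 2, 0x09: 3, 0x0D: 4}
    code = ord(ch) if isinstance(ch, str) else ch
    if code in specials:
        return specials[code]
    if 0x20 <= code <= 0x7E:
        return code - 0x20 + 5      # $20..$7E lands exactly on 05..99
    raise ValueError("code has no decimal mapping")
```

The 95 printable codes plus 5 specials fill the 00-99 range exactly, which is what makes the scheme tidy.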
I think that illustrates why decimal mapping faded into the past: it discards usable space for little benefit.
Memories will always be binary, because twice-as-big is a natural progression.
The vast majority of MCU users do not care how the MCU works internally; they are using them as tools for making decisions, calculating, and providing outputs.
I have heard some niche financial sector spreadsheets like BCD, but those are easily coded.
E.g., I have a PC calculator with 300+ bits of precision, and I don't really care how it does that under the hood, because it's always going to be way faster than I can create inputs.
I think it would be 10x easier for a new person to learn an assembly language if the architecture was completely decimal.
I'm not following here ?
When I write in assembler, I can use any radix I like, and the assembler converts that to 'native' hex.
It's possible for source to have almost no HEX numbers.
The biggest challenges in assembler are nothing to do with the number base, they are related to the core, opcodes and peripherals.
Only a junior student who had never seen HEX could have a slight benefit, but the vast majority will come from other MCUs and already know HEX.
Yeah, I figured that. I was wondering if that could be used as a 3rd state, and if compression or similar could use it, because if that were possible, you could transmit in trits. There isn't really an easy way to do base-3 on the electronics end. It would mean either floating voltages with changing polarities (i.e., balanced trits) or different voltage levels (unbalanced trits). That has been done to limited degrees, and I do think one retro machine relied on trits for its video. Even with just 3 wires, that could do more colors than CGA. CGA could do 8 colors plus the luminance bit, for a total of 16. Three trits without an intensity trit would give 27 colors, and if there were such a line, you'd have 81.
I don't even think encoding base-10 on 2 wires would be fast. I tried to figure out all the ways to do that on the electronics end. You could encode frequencies on a wire, but then you'd have to decode and split them back out, so each sample would need to last long enough for the circuitry to read what is encoded. That would require radio-type circuitry. Maybe differing voltages would be better, but I wouldn't know how to build a voltage tree where higher voltages can be detected as distinct without blowing the other detectors. But then again, maybe that's just an ADC circuit. The ranges don't need to be wide, only distinct. But being more analog, there are more chances for errors.
So, I am after something more traditional, but with twists, opcodes, or features that are different. I think the ++ thing during memory operations is good, since it saves an instruction. Like Mov [Y:X++], A: the value stored at [Y:X] is transferred to the accumulator, and X is incremented, making block memory operations easier. The Gigatron TTL computer refined that even more. You could do something like OR ([Y:X++], 192), Out: read [Y:X], OR it with 192 to force the top 2 bits high, increment X for next time, and send the result to the Out port. That is why it can bit-bang its video as efficiently as it does.
And refining that, why not (at least for a Harvard, hybrid, or cached von Neumann machine) have a way to bit-bang from main RAM with some sort of loop built into that same instruction, while continuing on with register-only or cached/private-memory work at the same time? So: "bit-bang x instructions while continuing on from there until a hazard occurs."
Even 2 stacks could be useful. Have a private stack (or one per core) for speed and guaranteed access to it, and a public one. And that is what the P2 natively has.
Even in the day, one could have made a custom coprocessor for a 6502 or similar, and some did. You'd use logic, both 74xx and PALs, and maybe ROMs for decoding and ALU tables, intercept the unused opcodes to give them their own meanings, and even add another index register. So you could have more memory modes and even some LUT-based math coprocessing, such as single-cycle multiplication and division. Thus programs written for both processors would gain advantages.
And cartridge games and programs sometimes cheated by adding such logic or even entire CPUs. For instance, if one wanted to, they could almost upgrade the Atari 2600 to a full-blown 8-bit (or 16-bit) computer through the cartridge slot: add an Antic, a 6502 or '816, RAM, and a Pokey and/or a PIA in the cartridge. One cartridge did have a "sound coprocessor," which turned out to be a ROM used as a LUT to bit-bang the bus with sounds/music. Since that ROM is fed externally and written through only one register/address, it doesn't cut into the program memory space. The Veronica BASIC cartridge for the A8 had its own '816 CPU, so the external CPU ran the code and the A8's CPU handled mostly I/O.