What would be a good idea for a new CPU and platform to try on a P2? - Page 6 — Parallax Forums

What would be a good idea for a new CPU and platform to try on a P2?

Comments

  • I was meaning the throughput, not the memory size: the largest frame there is enough time to handle. So I was asking what the highest resolution is that it has time to do using just the hub. I know about the sizes and needing 300 KiB for VGA at 8 bits. So what is the largest res that can be done in light of the bottleneck of the hub? That's all. This is still in the design phase, so getting an idea of what is reasonable to expect would inform the choices for res and so on. I'd mostly be after something closer to vintage resolutions, so QVGA (75 KiB) may be what I'd shoot for, or some other simulated mode.

    I'd probably go for a strategy similar to the Vera board in the X16. It uses clothesline memory, I think. I'd want to snoop the bus, so you write to just a page of SRAM from the 65c02. So it is just a matter of setting segment:page:offset in the "registers" for that and then the cargo. That may seem slow, but like the Vera board, one can have an auto-increment feature. So you use a single byte or whatever to write to whatever selected address and send the next without updating the other parameters of the table. So you have a keyhole into a much larger amount of memory, even more than the 65C02 could address.
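
    To make the keyhole idea concrete, here is a minimal sketch in Python. The class name, the three-register layout, and the 24-bit address split are assumptions for illustration only, not the actual register map of the Vera board or of any P2 design.

```python
class KeyholePort:
    """Hypothetical keyhole into a memory larger than the CPU can address."""

    def __init__(self, size):
        self.memory = bytearray(size)  # stands in for hub RAM on the P2 side
        self.address = 0               # combined segment:page:offset
        self.auto_increment = 1        # step applied after each data write

    def set_address(self, segment, page, offset):
        # Three 8-bit "registers" combine into one 24-bit address.
        self.address = ((segment & 0xFF) << 16) | ((page & 0xFF) << 8) | (offset & 0xFF)

    def write_data(self, value):
        # One store from the CPU side; the address advances by itself.
        self.memory[self.address % len(self.memory)] = value & 0xFF
        self.address += self.auto_increment


port = KeyholePort(512 * 1024)
port.set_address(segment=0x01, page=0x02, offset=0x03)
for byte in b"HELLO":
    port.write_data(byte)  # no address update needed between bytes
```

    After the initial three address stores, each further byte costs a single store, which is why this suits bitmap transfers.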

    Yes, the above would be inefficient for plotting (though perfect for bitmaps), but that is where a display list or primitives come in. Lines would be slow to draw if you had to keep doing 3 stores for each new location, but a display list format and coprocessing would mitigate that. There would be no need to plot a diagonal line point by point if the coprocessor implements Bresenham's line algorithm. Just specify the starting and ending coordinates or addresses, followed by a command for a line.
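
    As a sketch of what the coprocessor side of a "line" command might do, here is the integer-only form of Bresenham's line algorithm in Python. The function name and the idea of returning a list of points are assumptions for illustration; a real implementation would write pixels straight into the frame buffer.

```python
def bresenham_line(x0, y0, x1, y1):
    """Return the pixels on a line from (x0, y0) to (x1, y1), endpoints included."""
    points = []
    dx = abs(x1 - x0)
    dy = -abs(y1 - y0)
    step_x = 1 if x0 < x1 else -1
    step_y = 1 if y0 < y1 else -1
    error = dx + dy  # combined error term, integer arithmetic only
    while True:
        points.append((x0, y0))
        if x0 == x1 and y0 == y1:
            break
        doubled = 2 * error
        if doubled >= dy:  # step along x
            error += dy
            x0 += step_x
        if doubled <= dx:  # step along y
            error += dx
            y0 += step_y
    return points


diagonal = bresenham_line(0, 0, 3, 3)
```

    The CPU only has to send the two endpoints and a command; every intermediate pixel is generated on the other side.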

  • @PurpleGirl said:
    I was meaning the throughput, ...

    Hi,
    with the P2, throughput is limited by the cogbeater HUB access, so I would recommend studying that for these questions. Random HUB access has to be minimized. In the worst case, you need to wait 8 cycles until you can access the cell. Cached, consecutive assembler execution is much better. Jumps in HUB assembler code are rather bad, because both the micro cache and the pipeline have to be refilled.
    I got some improvement when I read 32 bits from HUB into COG memory, to be used for up to 4 consecutive byte accesses.
    I think using an external real processor makes the most sense for a complicated one that cannot be simulated in cog code because of the limited size of COG and LUT RAM, or power (or the amount of work). So if it is to be an 8-bit (bus) processor: 6309, 8088, 68008?
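
    The 32-bit read mentioned above can be illustrated in Python: one fetched long yields four bytes without any further hub access. The function name is hypothetical, and the little-endian byte order is stated here as an assumption about how the long is laid out.

```python
def unpack_long(value):
    """Split one 32-bit long into four bytes, lowest address first (little-endian)."""
    return [(value >> shift) & 0xFF for shift in (0, 8, 16, 24)]


first_four = unpack_long(0x44332211)  # one "hub read", four byte accesses
```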

  • AJL Posts: 515
    edited 2023-07-16 05:23

    @PurpleGirl said:

    @pik33 said:
    Speed related resolution limits are 1024x600/50 Hz on HDMI, because of pixel clock = sysclk/10, and 1920x1200 on VGA, because of pixel clock<=sysclk/2 and 200 MHz limit on the VGA specifications.

    I only asked what is the highest known resolution it can do in light of the access speed of the hub. I am asking what the P2 has time to be able to do. Nothing else.

    At 336 MHz there are approximately 26.5 P2 clocks per pixel for QVGA timing.

    Getting the data into or out of HubRAM is possible at suitably high speeds if the P2 sysclock is set high enough. Using a streamer at a sysclock of 336 MHz would definitely outpace the QVGA pixel rate.
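
    The arithmetic behind that estimate can be sketched as follows. The pixel clock is an assumption: QVGA is taken to be shown as pixel-doubled 640x480 at 60 Hz, so the QVGA pixel rate is half the standard 25.175 MHz VGA dot clock. The 8 bits per pixel and 60 Hz refresh are likewise assumptions.

```python
SYSCLK_HZ = 336_000_000
VGA_PIXEL_CLOCK_HZ = 25_175_000               # standard 640x480 at 60 Hz dot clock
QVGA_PIXEL_CLOCK_HZ = VGA_PIXEL_CLOCK_HZ / 2  # assumption: each QVGA pixel spans two VGA dots

clocks_per_pixel = SYSCLK_HZ / QVGA_PIXEL_CLOCK_HZ

frame_bytes = 320 * 240                # 8 bits per pixel
bytes_per_second = frame_bytes * 60    # assumed 60 frames per second

print(f"{clocks_per_pixel:.1f} sysclocks per QVGA pixel")
print(f"{frame_bytes // 1024} KiB per frame, {bytes_per_second / 1e6:.2f} MB/s sustained")
```

    Under these assumptions the result lands between 26 and 27 clocks per pixel, in line with the figure above, and the frame size matches the 75 KiB quoted earlier in the thread.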

  • @"Christof Eb." said:

    @PurpleGirl said:
    I was meaning the throughput, ...

    Hi,
    with the P2, throughput is limited by the cogbeater HUB access, so I would recommend studying that for these questions. Random HUB access has to be minimized. In the worst case, you need to wait 8 cycles until you can access the cell. Cached, consecutive assembler execution is much better. Jumps in HUB assembler code are rather bad, because both the micro cache and the pipeline have to be refilled.
    I got some improvement when I read 32 bits from HUB into COG memory, to be used for up to 4 consecutive byte accesses.
    I think using an external real processor makes the most sense for a complicated one that cannot be simulated in cog code because of the limited size of COG and LUT RAM, or power (or the amount of work). So if it is to be an 8-bit (bus) processor: 6309, 8088, 68008?

    Thanks.

    I was meaning only video I/O. Like a cog dedicated to simply reading the hub and outputting to the screen.

  • @AJL said:
    At 336 MHz there are approximately 26.5 P2 clocks per pixel for QVGA timing.

    Getting the data into or out of HubRAM is possible at suitably high speeds if the P2 sysclock is set high enough. Using a streamer at a sysclock of 336 MHz would definitely outpace the QVGA pixel rate.

    Thank you. That was the closest thing to a straight answer. It was like others thought I was speaking Martian.

    And I am wondering if I'd actually do this project. I mean, I really don't need the 6502. Still, if I'm just doing it for the experience, that would be fine. And I'd go for QVGA, since with a sufficient video controller, higher resolutions would be possible. So use tiling or keyhole access of the memory, and use a small amount of SRAM to access a much larger block of hub RAM.

    As for the snooper portion, there is always RDY to handle hub access, provided you pull it low early enough in the 6502 cycle. I think if you miss the setup time window, the 6502 will advance. Something to keep in mind is that I'd likely only be monitoring writes, and the 6502 is a Von Neumann architecture. That means it has to take turns between fetching and doing random accesses (including memory-mapped I/O). So I was told that after one write from the 6502, another might not be possible in fewer than four 6502 cycles. As for how to send data back to the 6502, unless I want to have return registers mapped to the SRAM, I guess I could always pull BE and RDY low and use bus-mastering DMA.

    I haven't figured out what to do about the 6502 ROM yet. I could do it the old way and use a wait state while in that range, or I guess I could consolidate it with the P2 ROM. Then the P2 would come up first, hold the reset, copy the ROM into SRAM, and then release the reset. The 6502 certainly needs the last 6 bytes to get the necessary vectors: the start (reset) vector, the interrupt vector, and the NMI vector. So the P2 could be what places the ROM in the SRAM for the 6502. That could be faster in the long run and eliminate the need for a wait state. I mean, 70 ns is at the edge of what is needed for 14 MHz.
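
    A rough sketch of the boot sequence described above, as a Python model of what the P2 would place in SRAM before releasing reset. The function name, the ROM base address, and the handler addresses are made up for illustration; the vector locations ($FFFA NMI, $FFFC reset, $FFFE IRQ, low byte first) are the standard 6502 ones.

```python
NMI_VECTOR, RESET_VECTOR, IRQ_VECTOR = 0xFFFA, 0xFFFC, 0xFFFE


def build_sram_image(rom, rom_base, nmi, reset, irq):
    """Build the 64 KiB image the P2 would copy into SRAM while holding reset."""
    if rom_base + len(rom) > 0x10000:
        raise ValueError("ROM does not fit in the 64 KiB address space")
    image = bytearray(0x10000)
    image[rom_base:rom_base + len(rom)] = rom
    for address, target in ((NMI_VECTOR, nmi), (RESET_VECTOR, reset), (IRQ_VECTOR, irq)):
        image[address] = target & 0xFF             # low byte first
        image[address + 1] = (target >> 8) & 0xFF  # then high byte
    return image


rom = bytes([0xEA] * 16)  # placeholder contents: a run of NOP opcodes
image = build_sram_image(rom, rom_base=0xE000, nmi=0xE000, reset=0xE004, irq=0xE008)

cycle_ns = 1e9 / 14_000_000  # one 14 MHz cycle is about 71.4 ns
```

    The last line shows why 70 ns memory is marginal: a full cycle at 14 MHz is only about 71.4 ns, before any setup or hold time is subtracted.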

  • Thanks.

    I was meaning only video I/O. Like a cog dedicated to simply reading the hub and outputting to the screen.

    Sorry, as your considerations in this thread are going in quite a few directions, I wrote a relatively unspecific comment. (A further one: I would not base a project on the need for overclocking, but that's just my personal view.)

  • No, I wasn't doing that. I was basing it on whether 320x240 (or maybe VGA) would have enough time just doing hub RAM. And this applied only to the most recent part of the conversation.

    Remember, this was intended as a general ideas thread for anyone. It isn't my thread, and everything mentioned here is only tentative.

  • I have 2 P2 boards now and only unwrapped one. What do I do next?

  • @PurpleGirl said:
    I have 2 P2 boards now and only unwrapped one. What do I do next?

    Well, you could unwrap the second.... ;)

    You will need an overall plan for your project. Write it down! Then you must split it into small parts that can be tackled one after another. I like to use mind maps for this work, so you can split again into subparts. Splitting stops when it is clear what the next real step is.
    I often have to try out an element of a project first. Do experiments.

    One of the first subparts might be to try to run an example of flexprop.

  • I'm more of a whim-based person. And I'd rather learn to write it in assembly. I want maximum performance.

    But it looks like I need more hardware/parts. Sound could be an easy first thing to work on. But then, I'd need an amplifier.

  • Christof Eb. Posts: 1,106
    edited 2023-09-08 09:44

    @PurpleGirl said:
    I'm more of a whim-based person. And I'd rather learn to write it in assembly. I want maximum performance.

    But it looks like I need more hardware/parts. Sound could be an easy first thing to work on. But then, I'd need an amplifier.

    I would say, because I am a whim-based person too, it is very important for me to have a top down plan. Especially if I need hardware, because these steps need weeks.

    Are you new to assembler programming? The reason I ask is that the P2 is an MCISC (most complicated instruction set computer), and the documentation is a problem. It might well be easier to start with a higher-level language on the P2 and then switch over to assembler. It is also easy to find very good books on assembler programming for the 6502 and its contemporaries. I think one can learn assembler with an emulator.

  • Thank you for your comments. I just have a different design philosophy. I know assembly, and learning it for the P2 would be a refreshing challenge.

  • AJL Posts: 515
    edited 2023-09-08 13:45

    I think the beauty of PASM2 is that there are often several ways to achieve the same thing.

    You can start by writing algorithms with a simple subset of the instructions, and then often find that there are single instructions or couplets that make them faster, smaller, or both.

    If your aim is to learn then tinker as your whim takes you. Plans can come along later when you have a product in mind.

  • Thanks. Well, I don't know what I'd do once I decide on a CPU and want to put it on a board with an actual P2 and whatever support parts. Looking at the P2 board, I notice there are ICs I can't even see. I presume those are LDOs. I think the other 3 chips that are more visible are the USB chip, flash, and a buffer for the LEDs (which are also hard to see on the board when they are not on). I never messed with CAD, so designing a PCB would be a learning curve.

    I think I'd prefer to work backward. The idea would be to start on peripherals, maybe use a cog as a test bench. Then once I work out things like protocols, memory maps, etc., then work on interfacing with external RAM and a CPU.

    I know I could design a "CPU" core on a P2, but I have it down to maybe 2 CPUs. One is the 65C02. It is the most flexible 6502. I mean, it has a BE line, Ready, and can use some unusual clocks. So there are plenty of ways to deal with DMA.

    I think the P2 can give the 65C02 options that you normally wouldn't think of. Most put the ROM in the upper part of the memory since the necessary vectors are in the last 6 addresses. But with a microcontroller like a P2, you can use SRAM for the entire range. Let the P2 hold the 65C02 in reset as it fills the "ROM" area of SRAM and plug in the 3 vectors at the top. Now, the 6502 only has 1 interrupt vector. But if the P2 is the only thing throwing interrupts, it could start the interrupt with a DMA transfer that edits the vector. So you can emulate multiple vectors. Retro machines didn't do that. So have multiple vector targets in the SRAM and the device calling the interrupts can select which one. Then you don't need to poll to try to figure out what needed the interrupt. So 6502 code can be simplified.
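
    The vector-editing trick could look roughly like this, modeled in Python with a bytearray standing in for the shared SRAM. The source names and handler addresses are hypothetical; on real hardware the write would be a DMA access made while the P2 owns the bus, followed by asserting the interrupt line.

```python
IRQ_VECTOR = 0xFFFE  # standard 6502 IRQ vector location, low byte first

# Hypothetical handler entry points, one per interrupt source.
HANDLERS = {"timer": 0x8000, "keyboard": 0x8100, "vsync": 0x8200}


def raise_interrupt(sram, source):
    """P2 side: patch the IRQ vector for this source before asserting the interrupt."""
    target = HANDLERS[source]
    sram[IRQ_VECTOR] = target & 0xFF
    sram[IRQ_VECTOR + 1] = (target >> 8) & 0xFF
    return target


def fetch_irq_vector(sram):
    """6502 side: what the CPU would read when it takes the interrupt."""
    return sram[IRQ_VECTOR] | (sram[IRQ_VECTOR + 1] << 8)


sram = bytearray(0x10000)
raise_interrupt(sram, "keyboard")
entry = fetch_irq_vector(sram)  # lands directly in the keyboard handler
```

    Because the vector already names the right handler, the 6502 code needs no polling loop to find out which device asked for service.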


    The other CPU I've considered doesn't exist yet. I'm thinking of something similar to the Gigatron, but I'd need a total respin. I'm giving up compatibility there. Like why not replace the ALU with a PLCC68 one? Those tend to be pulls and NOS. I'm thinking of the L4C383 (or IDT7383). The fastest I can find readily is 26ns. That is 16-bit and has a lower latency than the Gigatron at about 64 ns. I'd really have to study the schematics to figure out how to retrofit the 16-bit ALU into it. I know, at the least, that would require modifying the diode-decoder ROMs. I understand how they work. The opcodes drive inverting line decoders. Those are the address "rows." Resistor packs are used as pull-ups and are the "data" lines. Diodes are placed on each row at each column where you need a 0. For a new design, I'd say use BAT43 or the larger of the Toshiba SMDs for the Schottky diodes. Now, the Gigatron ALU does 3 things this one won't inherently do, namely branches, loads, and stores. I don't really see that as a problem. I don't see why the ALU needs to handle the latter 2. I understand how branches work. That's a matter of adding to the program counter. If you want to do bidirectional relative branches, then I think that is a matter of checking the highest bit you want to use (and sign-extend any bits above that if it is high), and then adding that to the PC. I think doing load and store could be as simple as reworking the diode matrix. I'd have to then determine whether the Accumulator needs to be clobbered. I don't see why it would need to be since that should be the target or the source to begin with. I think that can be done without the ALU (let the control unit do it). But really, I'd probably need to rework the opcode map, and that would mean modifying the control unit. It would be nice to have more registers.
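
    The bidirectional relative branch described above (check the top bit of the offset, sign-extend, add to the PC) can be sketched in Python. The 8-bit offset and 16-bit program counter widths are assumptions for illustration.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    value &= (1 << bits) - 1
    return (value ^ sign_bit) - sign_bit


def branch_target(pc, offset, offset_bits=8, pc_bits=16):
    """Add a signed relative offset to the program counter, wrapping at the PC width."""
    return (pc + sign_extend(offset, offset_bits)) & ((1 << pc_bits) - 1)


backward = branch_target(0x1000, 0xFE)  # offset 0xFE is -2 as a signed byte
forward = branch_target(0x1000, 0x10)
```

    In hardware the same effect comes from wiring the offset's top bit to all the higher adder inputs, so no separate subtract path is needed.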

    Then comes RAM. Should I go with 16-bit RAM or use 8-bit as the standard Gigatron does? Using 16-bit would certainly wreck running GT1 files (as would removing bit-banging from the ROM and moving the indirection table, if used at all, plus the note tables, frame buffer, etc. to the P2). Now, an 8-bit bus would be easier to send to the P2, but 16-bit would be much nicer. And I'm not sure what "word" size to map it to. I know of no older 16-bit CPU that used 16-bit addressing. They did tricks such as using the lowest bit to drive the extra control lines. That kept 8-bit addresses. But really, I think 16-bit addresses would be neat. But how would I interface it with the P2? I don't want to tie up 40 lines, but if I had to, I guess I could.

    Some things I've worked out. The Gigatron and the Gigasimilar one would not have inherent DMA or interrupt capabilities. That is no problem. If the homebrew CPU always initiates, then you can choose when DMA is needed and plan to emulate that. So essentially use a communications area in RAM to talk to the P2 and tell it what you want it to do and then immediately enter a polling spinlock. Have the P2 throw a multiplexer and use the RAM. Since the homebrew CPU is a Harvard machine, you can keep the RAM accesses clear by keeping the ROM busy. That is one of the abilities specific to Harvard machines. So constantly read an address for a specific value. If it is there, the spinlock breaks. For instance, adding an "FPU" could work like this. That would be a "function call." So you load the operands first, then when you send the opcode, you enter a spinlock. Maybe have a NOP before that to give time to switch the RAM connections. The reason for it being done as a function call is to let the native code handle it and have finer control over the time it takes without vCPU interfering. So doing it in native code would stop vCPU.
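
    Here is a small single-threaded Python model of that "function call" handshake. All mailbox offsets, the opcode number, and the status values are hypothetical; the `service` callback stands in for the P2, which on real hardware would run concurrently after the multiplexer hands it the RAM.

```python
# Hypothetical layout of the communication area (word offsets).
MAILBOX_OPCODE, MAILBOX_A, MAILBOX_B, MAILBOX_RESULT, MAILBOX_STATUS = range(5)
STATUS_BUSY, STATUS_DONE = 0x0000, 0x00A5
OP_NONE, OP_MULTIPLY = 0x00, 0x01


def coprocessor_service(ram):
    """P2 side: service a pending request, then post the value the spinlock waits for."""
    if ram[MAILBOX_OPCODE] == OP_MULTIPLY:
        ram[MAILBOX_RESULT] = (ram[MAILBOX_A] * ram[MAILBOX_B]) & 0xFFFF
        ram[MAILBOX_OPCODE] = OP_NONE
        ram[MAILBOX_STATUS] = STATUS_DONE


def cpu_call(ram, opcode, a, b, service, max_polls=1000):
    """Homebrew CPU side: load operands, send the opcode last, then spin on status."""
    ram[MAILBOX_A], ram[MAILBOX_B] = a, b
    ram[MAILBOX_STATUS] = STATUS_BUSY
    ram[MAILBOX_OPCODE] = opcode  # writing the opcode starts the request
    for _ in range(max_polls):
        service(ram)              # stands in for the P2 working in parallel
        if ram[MAILBOX_STATUS] == STATUS_DONE:
            return ram[MAILBOX_RESULT]
    raise TimeoutError("coprocessor did not answer")


ram = [0] * 256
product = cpu_call(ram, OP_MULTIPLY, 123, 45, coprocessor_service)
```

    Writing the opcode last matters: the operands are guaranteed to be in place by the time the other side sees the request.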

    So letting the P2 do all the I/O would greatly simplify the homebrew CPU's bytecode interpreter (vCPU). I mean, it would no longer have to multitask and compete with bit-banged peripherals. So the native ROM would not need to handle time-keeping tasks. So it would run code much faster not only because of offloading the I/O, but having a simpler interpreter. With the framebuffer done only on the P2 side, 320x240 should be no problem, and maybe more colors than 64. Plus, even at 320x240, Mandelbrot should be faster for a number of reasons. The base CPU rate would be faster since 6.25 MHz is no longer needed. With snooping, the P2 would have the necessary data already and thus there is no need to access video areas twice. The vCPU interpreter would require fewer cycles due to not needing multitasking. It would also have 16-bit ops. And then, for Mandelbrot, multiplication is critical. Well, there was the old Gigatron way of doing maybe 15-bit multiplication in maybe 120 cycles. There is a faster way that they are using in later ROMs which I don't think could be faster than 50-60 cycles. But what if it could be done in 2-3? So having some external math assistance would be helpful for sure.


    Regardless of the CPU used, I'd want to use snooping as the primary communication method between the 2 chips. Most traffic will go to the P2. For data that needs to flow the other way, use either cycle-stealing or some sort of bus-mastering. I'd say reserve a couple of pages for the communication area.
