@rogloh said:
Wow, okay. Lots of ideas for you to try out. I think it's probably best to get yourself a P2 and simply start experimenting.
I don't know which one to do first; as you can see, it is mainly just 3 different ideas. So either make a Gigatron respin with a P2 as the I/O controller, a fully-hardware Gigatron-like machine with .GT1 compatibility (perhaps with an optional FPGA-based I/O controller), or a P2-only system that incorporates the functionality of the Gigatron ROM but doesn't emulate the native Gigatron code.
The moment an FPGA goes onto the board, the whole computer will fit inside ... and run at 100 MHz native. The main circuit design effort then becomes supplying quality power to the FPGA ... and mounting the massive pin-count BGA package.
@evanh said:
The moment an FPGA goes onto the board, the whole computer will fit inside ... and run at 100 MHz native. The main circuit design effort then becomes supplying quality power to the FPGA ... and mounting the massive pin-count BGA package.
I know, but if one wants to do a hybrid design, there is nothing wrong with that. Take Stefany Allaire and the Foenix computer. The first one used a handful of FPGAs and still had an '816.
@PurpleGirl said:
I don't know where to start; as you can see, it is mainly just 3 different ideas. So either make a Gigatron respin with a P2 as the I/O controller, a fully-hardware Gigatron-like machine with .GT1 compatibility (perhaps with an optional FPGA-based I/O controller), or a P2-only system that incorporates the functionality of the Gigatron ROM but doesn't emulate the native Gigatron code.
Sometimes projects can grow organically once you begin them, even if you don't fully know where to begin. It doesn't matter if it doesn't work out first go; you can always learn from it and refine or start over, particularly if you are using a P2 and aren't making boards right away. One can always spend a lot of time thinking of everything ahead of time, but once you actually begin, things can often start to fall into place and get resolved, especially if you have already spent a lot of prior time thinking about it all, like you have, and as I also do at times.
I would suggest initially going down the path of your third option as a learning exercise before trying out your first option, i.e. make a P2-only system that incorporates the functionality of the Gigatron ROM. Get a P2 Edge board with PSRAM and a breakout adapter, download the tools/docs, read these forums, start simple, and learn the P2's capabilities: its video and hub memory stuff and the way it interfaces to I/O pins. Mess about with my emulator, see what it can do, and modify/extend/break it; then you can start to try your I/O controller ideas with this same setup, as you'll already have the HW to use and the P2 SW knowledge from your prior experimenting. You can try to attach a P2 to the bus of an existing Gigatron (with 3.3V conversion) and start tapping into its transfers, generating video, expanding the memory, etc. The benefit of the P2 is that it's very flexible and can do so much in software that you don't have to worry about designing HW quite as much as with other micros, e.g. having the Smart Pins on each I/O pin helps enormously there.
One of the reasons I quite like this Gigatron project is that some time back in the early 2000s, before I got into Propeller chips, I spent a while messing about with AVR micros and my own video project ideas. I started out with an experimenter board with just an AVR microcontroller and a serial port and started adding extra HW and SW functionality piece by piece, testing it as I went. Soon enough I ended up with this little 128kB system that ran my own BIOS and could run BASIC or an emulated mini-AVR core (like vCPU) from RAM instead of its 16kB ROM (von Neumann instead of Harvard). It could also output a VGA signal, supported multi-channel soft-synth amplified audio, accepted PS/2 input, and could read/write SD cards; very similar to what a Gigatron does. I actually had no idea ahead of time that it would eventually become all this when I started messing about and testing out my ideas, but it just did. It grew organically as I replaced various parts on the board and it all worked out fine. It was a pretty fun time actually.
[No, you don't want to see underneath...]
Interesting machine above.
Yeah, it was a pretty cool little beastie. It ran the AVR overclocked from 16MHz to 20MHz and could do an external memory transfer from RAM in 3 clocks and regular instructions in 1 or 2, so with the memory overhead it was around 6.7 MIPS while accessing memory, or 20 MIPS otherwise. I had a few video modes possible on screen, such as coloured text at 20MHz pixel rates with the shift register and a colour latch, and it did some 6bpp colour graphics stuff as it read video data directly out of RAM under control of the micro at up to 10MHz by toggling 8 memory address lines, giving me around 256 coloured pixels per VGA scan line in that mode. It could run CPU stuff in the blanking or on alternating empty lines, so it still felt pretty responsive with low latency. I wanted to write some games for it but eventually moved on to other stuff...
I reckon it would be neat to try to put something like the Gigatron ROM functionality onto it to have it run GT1 code, although it only had 16kB for ROM, so not much space there really. That limitation only added to the fun, keeping things small and making everything fit. I do now wish I'd made a full schematic for it, but I think it was only hand-drawn on paper back in the day (as I said, it evolved) and has been lost in time. There's no way I want to reverse engineer one from the rat's nest underneath.
@evanh said:
The FIFO is built to handle the streamer doing 32-bit words at sysclock/1 without hiccup. There is no chance of ever depleting it. I thought the question was how much it stalls single data read/writes on collisions ... and how close the first RFxxx can be after non-blocking RDFAST.
Yes. I'm not seeing evidence of collisions with the spacing in this code, nor am I seeing problems with the gap I have between RDFAST (no-wait) and the RET to XBYTE. So I think we can say we're pretty good now, and this emulator should run 100% fine.
Interestingly, the P2 still needed 52 clocks (~23 native instructions + XBYTE) to emulate even a very minimal processor such as this one (no flags, no stack). A 325MHz P2 = 6.25MHz Gigatron. This probably means we're not going to get much faster at emulating most processors using XBYTE with 100% cycle accuracy, if their instruction sets require HUB RAM accesses and they also do branching. So this 6.25MHz emulation speed is probably getting close to the limit of the P2 (though you could run the P2 at ~350MHz on a good day, or higher with more cooling).
Most older CPUs probably don't do branching in a single cycle. It might be possible to do a GETCT during horizontal blanking to check total cycles/line.
Worst-case RDFAST of 17 cycles occurs when slice difference between previous read and RDFAST = +1 where slice = hub address[4:2], e.g. RDBYTE from $00000 then RDFAST from $00004. Best-case RDFAST of 10 cycles when RDFAST - RDxxxx slice diff = +2. (Also, best-case / worst-case random RDxxxx of 9 / 16 cycles when RDxxxx - RDxxxx slice diff = +1 / 0.)
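In C form, assuming the latency simply rotates around the eight hub slots (an assumption; only the best/worst cases above are actual measurements), the rules could be written as:
#include <stdint.h>
// Hub "slice" of an address: bits [4:2], i.e. which of the 8 rotating hub slots it occupies.
static inline uint32_t slice(uint32_t addr) { return (addr >> 2) & 7; }
// Random RDxxxx: 9 ticks at slice diff +1 (best), 16 at diff 0 (worst).
static uint32_t rdxxxx_ticks(uint32_t prev, uint32_t next)
{
    uint32_t diff = (slice(next) - slice(prev)) & 7;
    return 9 + ((diff + 7) & 7);
}
// RDFAST: 10 ticks at slice diff +2 (best), 17 at diff +1 (worst).
static uint32_t rdfast_ticks(uint32_t prev, uint32_t next)
{
    uint32_t diff = (slice(next) - slice(prev)) & 7;
    return 10 + ((diff + 6) & 7);
}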
I have not seen partial FIFO filling cause a delay to random reads or writes, yet. In my tests, I wait 64 cycles using WAITX after RDFAST to ensure the FIFO is full and filling is over, then do four, five or six RFLONGs, then one or two RDBYTEs or WRBYTEs. I choose slices so all hub accesses should be best-case, and they are, i.e. no delays.
Full FIFO filling after RDFAST can delay random hub reads and writes, which tells us the FIFO depth experimentally (19 longs), but I can't get partial FIFO filling to delay random hub accesses at all.
Roger, could you measure time between HS or VS to see whether there are any "hidden" FIFO delays?
@TonyB_ said:
Roger, could you measure time between HS or VS to see whether there are any "hidden" FIFO delays?
Yeah, I will try to do so when I get back onto that stuff. I'm porting this Babelfish thing right now from C to SPIN2 so we can eventually download programs into the system. I found a couple of useful-looking Gigatron games that I'd like to test. They are visible and even playable online here on this Gigatron simulator web page. This could be a good stress test of this emulator too, IMO.
https://gigatron.io/emu-pucmon
https://gigatron.io/emu-invader
Update: It's interesting to see how much the system slows down when the scan lines are increased by pressing the Select button. This is something that the P2 could help with. Someone has built a Gigatron extender board that uses a hardware FIFO to enable all video lines to be displayed on screen while freeing the CPU, by only having it render one scan line in four. The P2 could also do this quite easily by saving off the new video data to HUB RAM and just reading from that for 3 out of 4 scan lines. It may need an extra COG, but we have them.
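A sketch of that 1-in-4 line buffering in C (the per-scanline hooks here are hypothetical, not actual emulator code):
#include <string.h>
#include <stdint.h>
#define LINE_BYTES 160                        // Gigatron renders 160 pixels per line
static uint8_t linebuf[LINE_BYTES];
// On line 0 of each group of 4 the emulated CPU renders and we capture it;
// on lines 1..3 we replay the captured copy, leaving the CPU free.
void scanline(int y, const uint8_t *fresh, uint8_t *out)
{
    if ((y & 3) == 0)
        memcpy(linebuf, fresh, LINE_BYTES);   // save the newly rendered line
    memcpy(out, linebuf, LINE_BYTES);         // display the buffered copy
}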
I'm getting RDFAST interfering ...
Ticks required without a RDFAST:
Ticks required with a RDFAST:
Hehe, I'm crapping on low hubRAM there. It might not be the best means of testing. Seems to survive long enough for the test though.
EDIT: Updated - Doesn't trample unallocated memory now.
@TonyB_ said:
Roger, could you measure time between HS or VS to see whether there are any "hidden" FIFO delays?
I did a quick check of the HSYNC pulse on my scope while playing the Tetris game and didn't see any wandering about. Its edges looked rock solid. Although this might not be the best way to examine it if any variation is very transient or random. Ideally we could get a COG to monitor the VSYNC or HSYNC pin and time it precisely. What's the best Smart Pin mode for that? I should be able to reroute its input source from one of these SYNC pins with a neighboring distance of 3.
The thing is, I am hearing poor-sounding audio (aliased/distorted) and I'm not sure if this is a result of actual SYNC jitter, a side effect of the unfiltered 4-bit DAC they use, or of feeding it through the A/V breakout to my amplified speakers. The Gigatron appears to use a total video line count of 521 scan lines, which is a problem because it is not a multiple of 4, and it will also affect audio. Looking at the ROM, they seem to try to compensate for this in one case, but I'm not sure if this is a good scheme or not. In this case the soundDiscontinuity is (521 mod 4) = 1.
1045
1046 # When the total number of scan lines per frame is not an exact multiple of the
1047 # (4) channels, there will be an audible discontinuity if no measure is taken.
1048 # This static noise can be suppressed by swallowing the first `lines mod 4'
1049 # partial samples after transitioning into vertical blank. This is easiest if
1050 # the modulo is 0 (do nothing), 1 (reset sample when entering the last visible
1051 # scan line), or 2 (reset sample while in the first blank scan line). For the
1052 # last case there is no solution yet: give a warning.
1053 extra = 0
1054 if soundDiscontinuity == 2:
1055 st(sample, [sample]) # Sound continuity
1056 extra += 1
1057 if soundDiscontinuity > 2:
1058 highlight('Warning: sound discontinuity not suppressed')
...
1404 label('.lastpixels#34')
1405 if soundDiscontinuity == 1:
.lastpixels#34:
02e5 c003 st $03,[$03] 1406 st(sample, [sample]) #34 Sound continuity
1407 else:
1408 nop() #34
If I could find a way to produce a pure tone, I could probably hear whether it is clean or not, but the demo wave generators use the 4-bit DAC, which complicates the sound.
It will interfere if you don't wait long enough. In my code I'm putting the RDFAST after the WRBYTE, and it will have at least 36 clocks (up to 18 instructions, including the extra effective 3 XBYTE ones) between the RDFAST and WRBYTE instructions. I think the FIFO is settled by then. It would be good to run this test again with different timing gaps between the RDFAST and WRBYTE to see when it tapers off (see below). I basically tried this independently in a different test and couldn't make it fail with my own setup's timing. The WRBYTE or RDBYTE would be okay. But that's not to say it's always perfect; there still might be some weird partially-filled FIFO state replenishing itself that triggers a read or write delay even with this gap. It's just hard to know what's already in the FIFO when you start the test. In my case, with 52 total clocks between XBYTE instructions and plenty of spare time for the FIFO to be loaded, there just doesn't seem to be a way to kill it (yet).
static uint32_t rdfast_test( uint32_t ticks, uint32_t gap, uint32_t * addr )
{
    __asm volatile { // no optimising and enforces Fcache use - Needed to free up the FIFO
        waitx   ticks           // initial wait (varies hub slice phase between calls)
        getct   ticks           // start timestamp
        rdfast  nonblk, #64     // no-wait RDFAST, FIFO begins filling from $40
        waitx   gap             // variable gap: iterate through values until the effect wears off
        wrbyte  ticks, addr     // the random write being timed
        getct   addr            // end timestamp (addr register reused)
        subr    ticks, addr     // ticks = end - start (includes the gap wait)
        waitx   #500            // wait for FIFO to fill before resuming hubexec
        ret
nonblk  long    $8000_0000      // bit 31 set = no-wait RDFAST
    }
    return ticks;
}
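A harness to sweep that gap might look like this (assuming the extra gap parameter added above; scratch is just a hub variable for the write target):
#include <stdio.h>
#include <stdint.h>
static uint32_t scratch;
void sweep(void)
{
    for (uint32_t gap = 0; gap < 40; gap += 2) {
        uint32_t t = rdfast_test(100, gap, &scratch);
        // Measured time grows linearly with the gap itself; interference
        // shows up as extra ticks on top of that linear growth.
        printf("gap=%u ticks=%u\n", (unsigned)gap, (unsigned)t);
    }
}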
@evanh I just modified your test and captured the results here. I've also removed the measurement overhead so I am timing just the execution of the RDBYTE or WRBYTE themselves. I added the RDBYTE case after WRBYTE and the reference case (no RDFAST) for that too.
It looks like things settle back down to the reference case once there are 13 separating instructions (26 clocks) between a RDFAST and the next WRBYTE, and the same 13 separating instructions between a RDFAST and the next RDBYTE. However, this is not being tested with the extra RFBYTE for XBYTE and the other one in my own emulator's code path both included at this time; that may change things slightly, although being less than a long read in total it probably shouldn't trigger another FIFO read, IMO, so with any luck this timing should still apply there too.
Here's a table showing when I issue different instructions. Both the earliest and latest positions of the P2 instructions of interest vary depending on the exact instruction selected and the code path taken via the EXECF bits. The main concern is that the latest issuance of RDFAST still gives enough time for the XBYTE read to be valid (which would otherwise corrupt data and crash the emulator), and that the FIFO refill activity has fully settled down by the time the next RDBYTE or WRBYTE is encountered following that RDFAST, so those operations don't get delayed beyond their usual worst-case timing (which would upset the video timing).
In this table each column represents the different Gigatron instructions emulated by XBYTE and they are precisely synchronized by the streamer at the yellow row XCONT instruction. The full sequence takes 52 clocks and should not exceed it.
Note: for the LD/ALU (MEM) case, the earliest RDBYTE memory access will still always precede its subsequent RDFAST in that instruction, so that is not a concern here. The preceding earliest possible RDFAST in that column only occurs when the memory is NOT accessed.
I still think that, given what the emulator has to do, it basically requires the P2 operating at 325MHz just for the FIFO stuff to work, and we are rather lucky it still fits at all, and that it didn't turn out to require over 350MHz or something ugly like that.
@rogloh said:
If I could find a way produce a pure tone, I could probably hear if it is clean or not, but the demo wave generators use the 4 bit DAC which complicates the sound.
Yeah, there is really no way to get around that unless you sacrifice at least 2 lights. And then software that is not aware of this would get sound artifacts from the lights. And those who are testing 6- and 8-bit sound configurations have to swap the order of the bits. I mean, the sounds are the lowest nibble, so if you add the higher-nibble bits, you'd have to use those as lower-order bits to keep any compatibility. So the extra bits of the sound would come from the higher nibble and be used as the lowest-order bits.
The ROM uses 6-bit sound tables; only 6 bits are used to provide headroom for mixing. You have 4 channels, and summing 4 channels adds 2 significant bits. Digital mixing is just adding and right-shifting, so you'd need no more than 8 total output bits to prevent clipping. Going to 6 bits of output would sound noticeably better. You won't notice as much improvement with 8 bits, at least not for single tones, but you can notice some improvement when multiple channels are active.
Now, if one went with an I/O controller or just a simple PSG, you could improve sound quality more. If one wanted to, the controller could use 8-bit samples and at least a 10-bit ALU/adder, then use an internal DAC. That could add some complexity: to retain compatibility, one could add a mechanism that monitors the sample locations and uses the external tables if they change. So then you'd have to have a way to mix the 2 types of samples. That way, you use the better samples unless a program changes the ones at those locations in the system memory map. Pucmon, the Pac-Man clone, edits the samples to provide the "siren" effect and other Pac-Man-type game sounds. So while better private samples would improve the sound and improve efficiency for most common tasks, you'd need a way to override that when the software changes the samples so that such games would work as expected. And if someone writes a tracker for the Gigatron, and there is one, such refinements of using better samples in a controller/PSG would be moot, since the software would be constantly changing the Gigatron sample locations.
If I were to make a similar system with a new memory map and different locations, I likely wouldn't include Blinkenlights, or would find another way to generate those. The fact that the sound bits are the lowest bits tends to complicate things if you want to increase the sound resolution, and it makes the wiring more complex.
I think the sound bits are the highest nibble according to the schematic. One could just use more lower bits, which would be ignored (actually drive LEDs) when a 4-bit DAC is fitted, but used when a DAC has more bits. I think that would still work. You just have to not overdrive the output, which admittedly would sacrifice quality for 4-bit HW if the audio levels are reduced for this purpose. I guess that is the point you are making.
This post on the Gigatron forums links to audio MP3 samples generated with 4 and 8 bits. Makes a huge difference when you listen to it.
https://forum.gigatron.io/viewtopic.php?p=3110&sid=01553f15e04fe07f98b300ac6cfe1074#p3110
A better Gigatron design would have had a full 8-bit DAC port updating once per scan line, and up to 8 LEDs (or other outputs) latched independently, probably on the rising VSYNC edge instead of HSYNC. There's no need to update diagnostic LED patterns faster than 60Hz. It would only require one more latch, assuming it can be done in the SW timing available. It's certainly possible for a P2 to do if the ROM is adjusted.
@rogloh said:
I did a quick check of the HSYNC pulse on my scope while playing the Tetris game and didn't see any wandering about. Its edges looked rock solid. Although this might not be the best way to examine it if any variation is very transient or random. Ideally we could get a COG to monitor the VSYNC or HSYNC pin and time it precisely. What's the best Smart Pin mode for that? I should be able to reroute its input source from one of these SYNC pins with a neighboring distance of 3.
The official online emulator is one place that one can listen to as a reference. You will hear a lot of quantization noise; some say that is part of the charm from when sound was "gritty" and a bit "dirty." It uses 6-bit samples, but you end up with just 4. I've heard of projects that mod that to 6, and they sound better. There is some filtering on the board, or it would sound worse; there are a couple of RC networks on it. But with the way it is done, you will get quantization noise and possibly some clipping on the low end. Upper-end clipping shouldn't be possible.
So you might have to listen closely to see if there are any unexpected distortions due to timing issues.
Yeah, using another cog as a buffer does make sense here. If this were real hardware, the solution could be to add a register to insert a cycle delay if there are variations within a cycle as to when things happen. So I get the idea of a monitoring cog. In code, the closest thing to a register is adding another variable. That wouldn't be useful here (particularly when latency and memory accesses are already an issue), so a monitoring cog makes sense.
There was a similar issue on a modded stock Gigatron where someone came up with a hi-res mode. There was an issue of unwanted black vertical lines between pixels. I don't know how they implemented this, but it sounds a bit like a DDR scheme where you try to push out pixel data at 12.5 MHz. They concluded there was a timing issue causing the lines between pixels. We debated how to fix it, and concluded that if you use faster chips, you only move where the timing glitch manifests. Something proposed and never tried was to pipeline that to make the transition between the sub-pixels happen more smoothly. It makes sense that a register could work there, since that sounded a bit like metastability. Technically, in that case, there was clock domain crossing, and adding registers tends to mitigate that sort of thing. (And in worst cases, you'd need up to 3 registers of depth, with one clocked at the slower speed and 2 at the faster speed.)
@rogloh said:
I think the sound bits are the highest nibble according to the schematic. One could just use more lower bits, which would be ignored (actually drive LEDs) in the case when a 4 bit DAC is fitted but used when a DAC has more bits. I think that would still work. You just have to not overdrive the output which admittedly would sacrifice quality for 4 bit HW if the audio levels are reduced for this purpose. I guess that is the point you are making.
Reading elsewhere, I was under the impression that the low nibble of x-out was the sound. Maybe I was wrong, but either way, the code to deal with that can get messy.
And the largest improvement is going to 6. You don't get as much going from 6 to 8 since the samples are only 6 bits. But you get some improvement when multiple sounds are being played, since they would add to more than 6 bits. If you add the samples in pairs, you get 7 bits max, and adding the 2 pairs gives you 8. So you only notice gains from 6 to 8 if multiple channels are being used.
The expander board isn't the fairest example since it uses active filtering in addition to an apparent R/C network. I don't think I've seen any real schematics, so I am going by observation of the board.
Using 6-bit samples from the ROM simplifies things since no shifting is needed first. The standard way to do digital mixing is to add all the sample data and then divide by the number of channels. Since 64 is 1/4 of 256, adding four 6-bit samples will not cause an 8-bit overflow. Then you'd use the 4 highest bits of that for 4-bit output.
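As a sketch in C (generic; the names are made up):
#include <stdint.h>
// Mix four 6-bit channel samples (0..63) into one output sample.
// Four 6-bit values sum to at most 4 * 63 = 252, which fits in 8 bits,
// so no clipping is possible; for a 4-bit DAC keep only the top 4 bits.
static uint8_t mix4(const uint8_t ch[4], int out_bits)
{
    uint16_t sum = ch[0] + ch[1] + ch[2] + ch[3];  // 8 significant bits
    return (uint8_t)(sum >> (8 - out_bits));       // out_bits = 4, 6 or 8
}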
If one wants a controller with true 8-bit output regardless of the number of samples used, then put 8-bit samples on the controller and have an adder/ALU on there that has at least 10 bits (assuming only 4 channels). If you use 1 channel, that would be 8-bit output, but if using all 4, it would be closer to 6 bits per sample (truncation losses), unless you use an internal DAC and the number of lines is not an issue.
I noticed what I thought was a discrepancy in the sound notes for the I/O expander, but the info is correct. He mentions that the Gigatron only updates the audio at 8 kHz, yet I discovered earlier that the maximum frequency shouldn't be much over 3900 Hz. Once you factor in the Nyquist theorem, it makes sense: the maximum frequency response should be nearly half the sampling rate (half of roughly 7.8 kHz is about 3.9 kHz). In my calculation, I took Nyquist into account first. Once you go past that, produced sounds will be lower in pitch than expected. That is aliasing.
@TonyB_ said:
Roger, could you measure time between HS or VS to see whether there are any "hidden" FIFO delays?
@rogloh said:
I did a quick check of the HSYNC pulse on my scope while playing the Tetris game and didn't see any wandering about. Its edges looked rock solid. Although this might not be the best way to examine it if any variation is very transient or random. Ideally we could get a COG to monitor the VSYNC or HSYNC pin and time it precisely. What's the best Smart Pin mode for that? I should be able to reroute its input source from one of these SYNC pins with a neighboring distance of 3.
Monitoring the VS or HS pin with another cog seems the easiest way to me. I haven't begun to look at Smart Pins in detail yet. Evan would know, I expect.
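In the meantime, a plain polling loop in another cog could do it (a sketch assuming flexspin's propeller2.h; HSYNC_PIN is hypothetical, and polling only resolves jitter larger than the loop itself, whereas a Smart Pin period mode would be tick-accurate):
#include <stdint.h>
#include <propeller2.h>
#define HSYNC_PIN 3        // hypothetical pin carrying the emulator's HSYNC
// Run in its own cog: timestamp rising edges of HSYNC and report the
// shortest and longest period seen, in sysclock ticks.
static void hsync_jitter(uint32_t *minmax, int edges)
{
    uint32_t last, now, period, lo = ~0u, hi = 0;
    while (_pinr(HSYNC_PIN)) ;          // wait for a low level
    while (!_pinr(HSYNC_PIN)) ;         // then the first rising edge
    last = _getcnt();
    for (int i = 0; i < edges; i++) {
        while (_pinr(HSYNC_PIN)) ;      // wait out the high phase
        while (!_pinr(HSYNC_PIN)) ;     // next rising edge
        now = _getcnt();
        period = now - last;
        last = now;
        if (period < lo) lo = period;
        if (period > hi) hi = period;
    }
    minmax[0] = lo;
    minmax[1] = hi;
}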
@rogloh said:
It will interfere if you don't wait long enough. In my code I'm putting the RDFAST after the WRBYTE
Tony seemed to be saying it wasn't possible to measure the collisions.
BTW, your measured clearance needed of 26 ticks sounds bang on the expected.
Minimum time from RDFAST to random read or write is known now and any delay (aka interference) to the latter can be measured, as mentioned in some of my earlier posts.
What I cannot detect is partial FIFO loading that delays a random read or write, i.e. when the FIFO is definitely full, do enough RFLONGs to force partial FIFO refilling, followed immediately by a random read or write. The latter are never delayed in my tests, which suggests they have precedence, but that conflicts with the bad stalling we've seen when the streamer is using the FIFO. At higher sysclks the FIFO must have priority.
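In the style of the earlier inline-asm test, the sequence I mean is something like this (a sketch; slice alignment of the addresses matters in practice):
#include <stdint.h>
static uint32_t partial_refill_test( uint32_t * addr )
{
    uint32_t ticks, tmp = 0;
    __asm volatile {            // Fcache'd, so the FIFO is free for RDFAST
        rdfast  #0, #0          // blocking RDFAST from hub address $00000
        waitx   #64             // make sure the FIFO is completely full
        rflong  tmp             // drain six longs to force a partial refill
        rflong  tmp
        rflong  tmp
        rflong  tmp
        rflong  tmp
        rflong  tmp
        getct   ticks           // start timestamp
        rdbyte  tmp, addr       // random read right on top of the refill
        getct   tmp             // end timestamp
        subr    ticks, tmp      // ticks = RDBYTE duration (plus GETCT overhead)
        waitx   #500            // let the FIFO settle before resuming hubexec
    }
    return ticks;
}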
@TonyB_ said:
What I cannot detect is partial FIFO loading that delays a random read or write, i.e. when the FIFO is definitely full, do enough RFLONGs to force partial FIFO refilling, followed immediately by a random read or write. The latter are never delayed in my tests, which suggests they have precedence, but that conflicts with the bad stalling we've seen when the streamer is using the FIFO. At higher sysclks the FIFO must have priority.
Ah, okay. Partial FIFO refills are obviously tricky to intentionally collide with; they're only 6 longwords at a time.
FIFO always has priority. There's no doubting that.
Isn't the biggest concern on non-blocking full reloads though? When are partials happening?
@TonyB_ said:
What I cannot detect is partial FIFO loading that delays a random read or write, i.e. when the FIFO is definitely full, do enough RFLONGs to force partial FIFO refilling, followed immediately by a random read or write. The latter are never delayed in my tests, which suggests they have precedence, but that conflicts with the bad stalling we've seen when the streamer is using the FIFO. At higher sysclks the FIFO must have priority.
@evanh said:
Ah, okay. Partial FIFO refills are obviously tricky to intentionally collide with; they're only 6 longwords at a time.
FIFO always has priority. There's no doubting that.
Isn't the biggest concern on non-blocking full reloads though?
It's usually not difficult to ensure enough time between a no-wait RDFAST and the next RFxxxx by moving other instructions between the two, with WAITX added if necessary.
That's what I'm trying and failing to discover!
I meant: when are partials possible/in use in this emulator? Isn't it just lots of full reloads?
BTW, the partials were detected with our older work. Just by doing the long block transfers with SETQ+WRLONG ... in combination with streamer consuming the FIFO content.
I guess the other approach would be to use hubexec in place of the streamer.
Partials are certainly possible when the code is being executed without branching. In that case the FIFO (already filled from some prior RDFAST) will slowly deplete byte by byte, one byte being read per XBYTE and another "D" argument byte read by each instruction handler (so roughly two of the FIFO's 76 bytes per emulated instruction). In theory this depletion will eventually retrigger a FIFO refill, which may upset things. This is probably the only unknown left, IMO. I am waiting long enough between RDFAST and the RET via XBYTE for the data to be valid (even though I'm not waiting for RDFAST to fully complete), and I'm leaving long enough after RDFAST to the next RDBYTE or WRBYTE so as to not slow them down. But I am not waiting very long between an RFBYTE and a regular RDBYTE or WRBYTE.
I don't have a clue about how that all works.
What I have just discovered is that a blocking RDFAST actually returns long before the FIFO completely fills. I'm getting WRLONG stalls up to 16 ticks after a hubexec branch. Err, maybe 12 ticks plus the usual slot rotation.
Ticks above minimum (3) using 5 consecutive hubexec instructions, last one being a WRLONG:
Ticks above minimum (3) using 6 consecutive hubexec instructions, last one being a WRLONG:
Ticks above minimum (3) using 7 consecutive hubexec instructions, last one being a WRLONG:
Ticks above minimum (3) using 8 consecutive hubexec instructions, last one being a WRLONG:
Ticks above minimum (3) using 9 consecutive hubexec instructions, last one being a WRLONG:
Ticks above minimum (3) using 10 consecutive hubexec instructions, last one being a WRLONG:
Ticks above minimum (3) using 11 consecutive hubexec instructions, last one being a WRLONG:
Ticks above minimum (3) using 12 consecutive hubexec instructions, last one being a WRLONG:
Here's the source code if interested.