I've updated the first post with a new version of micropython, upython_v20.zip. This one now supports a pyb.Cpu() object, which has two methods, start and stop, to allow running compiled PASM code in another COG. For now you'll have to compile the PASM to binary using PNut or fastspin, and then load the bytes into a bytearray; see the README.txt for a discussion of how to do this.
import pyb
cog=pyb.Cpu()
cog.start(code, data) # start the COG running
cog.stop() # stop it
Here's a complete example that starts two COGs blinking pins 56 and 57. The Spin code is:
' simple blinking demo
' enter with ptra pointing at a mailbox
' ptra[0] is the pin to blink
' ptra[1] is the time to wait between blinks
' ptra[2] is a count which we will update
'
dat
org 0
rdlong pinnum, ptra[0]
rdlong delay, ptra[1]
rdlong count, ptra[2]
loop
drvnot pinnum ' toggle pin
add count, #1
wrlong count, ptra[2] ' update count
waitx delay ' wait
jmp #loop
pinnum long 0
delay long 0
count long 0
and the Python to run this is:
#
# sample python code for running a blink program in another Cpu
#
import array
import pyb
import ubinascii
# the PASM code to run, compiled into hex
# the hex string here is the output of
# xxd -c 256 -ps blink.binary
code=ubinascii.unhexlify('001104fb011304fb021504fb5f1060fd011404f1021564fc1f1260fdecff9ffd0000000000000000000000000000000000000000000000000000000000000000')
# data for the first pin (pin 57)
# we'll toggle 4 times/second (system clock frequency is 160 MHz)
data=array.array('i', [57, 40000000, 0])
cog=pyb.Cpu()
cog.start(code, data)
# start a second COG
data2=array.array('i', [56, 20000000, 0])
cog2=pyb.Cpu()
cog2.start(code, data2) # start on pin 56
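As a side note, the delay values in those arrays can be derived from the system clock. A quick sketch in plain Python arithmetic, assuming the 160 MHz clock mentioned above:

```python
# Derive waitx delay values from the system clock.
# Assumes the 160 MHz system clock used in the example above.
CLKFREQ = 160_000_000

def blink_delay(toggles_per_second):
    # number of clock ticks waitx should wait between pin toggles
    return CLKFREQ // toggles_per_second

print(blink_delay(4))  # 40000000 ticks, as used for pin 57
print(blink_delay(8))  # 20000000 ticks, as used for pin 56
```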
In the example above I used ubinascii to convert the hex version of the compiled PASM. You could also load the binary code from an SD card, or get it into a byte array in some other way.
This version of micropython will be distributed with the next version of FlexGUI. For now you can just overwrite the files in flexgui\samples\upython\ with the contents of upython_v20.zip if you want to run the new micropython from FlexGUI.
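For example, loading the binary from a file rather than a hex string might look like the following sketch. The filename blink.bin is hypothetical, and the example writes the file first so it is self-contained:

```python
# Sketch: load compiled PASM from a file instead of embedding hex.
# The filename "blink.bin" is hypothetical; on the P2 it would be a
# file on the mounted SD card. We write a few bytes first so the
# example is self-contained.
sample = bytes([0x00, 0x11, 0x04, 0xFB])  # first long of the blink code

with open("blink.bin", "wb") as f:
    f.write(sample)

with open("blink.bin", "rb") as f:
    code = f.read()  # bytes object, ready to pass to cog.start()

print(len(code))  # 4
```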
Seems like a great start here Eric. I wonder if we can come up with or use some type of module format that can be imported and that already provides a method to get its own byte array which is already embedded into the module. Maybe using these "frozen" modules that seem to contain compressed Python byte code. Then it might be possible that they could be distributed in some type of OBEX for P2 Python modules that wrap the code.
I'll try to take a look at your code when I can, but does your new CPU object require anything that is specific to your RISC-V based port, or do you expect it to be portable enough for a native p2gcc MicroPython port to do the same thing to spawn COGs? (I expect the lowest level part would have to be different.) It's been a while since I even looked at my P2 MicroPython low level GCC port as I've been on the video thing, hence this basic question.
Had a false start a week or so ago. Got a couple of simple upython programs working, then became busy at work again!
I am interested in running upython from the uSD. There appear to be some gotchas regarding when the SD is declared and mounted; otherwise upython programs freeze. How does the program know???
Hopefully I'll get some time to work this out over Xmas/NYr.
I wonder if we can come up with or use some type of module format that can be imported and that already provides a method to get its own byte array which is already embedded into the module. Maybe using these "frozen" modules that seem to contain compressed Python byte code. Then it might be possible that they could be distributed in some type of OBEX for P2 Python modules that wrap the code.
I think for OBEX type things it's better to distribute source code than frozen binaries. Right now there's no way to compile PASM from within micropython, so we do have to compromise there, but it's certainly possible to include the binary blob in the python source code (e.g. see the blink.py example in the .zip file I posted -- there are micropython methods to convert hex or base64 strings to bytes, which can then be loaded into the COG).
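As a rough comparison of the two encodings mentioned (shown here with CPython's binascii, which matches micropython's ubinascii for these calls), base64 is noticeably more compact than hex:

```python
import binascii  # "ubinascii" on the micropython build

blob = bytes(range(48))  # stand-in for a compiled PASM blob

hex_str = binascii.hexlify(blob)              # 2 chars per byte
b64_str = binascii.b2a_base64(blob).strip()   # ~4 chars per 3 bytes

print(len(hex_str))  # 96
print(len(b64_str))  # 64

# either form round-trips back to the original bytes
assert binascii.unhexlify(hex_str) == blob
assert binascii.a2b_base64(b64_str) == blob
```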
I'll try to take a look at your code when I can, but does your new CPU object require anything that is specific to your RISC-V based port, or do you expect it to be portable enough for a native p2gcc MicroPython port to do the same thing to spawn COGs? (I expect the lowest level part would have to be different.)
No, there's nothing RISC-V specific. I'm just using functions like _coginit from the standard <propeller2.h> header file that FlexC, Catalina, and riscvp2 all support. I don't know if there's a propeller2.h for p2gcc yet, but it'd be easy to write.
I am interested in running upython from the uSD. There appear to be some gotchas regarding when the SD is declared and mounted; otherwise upython programs freeze. How does the program know???
How does the program know what? Whether the SD is mounted? If the program runs from the SD, then it may safely assume that the SD is already mounted; otherwise it wouldn't have run. The mount is persistent and only has to happen once, at the very start of the session. Supposedly it should happen automatically if an SD is present at boot time, but I've found that not all cards seem to work for that.
Do you see any way to have a main upython program in cog 0 start separate upython modules in other cogs?
I don't immediately see a way of doing that at this stage, if you mean running Python on the other COGs rather than PASM or Spin etc. MicroPython is single-core based right now, but from what I hear from Tubular, who meets up with the creator of MicroPython on a regular basis here in Melbourne, he does have plans to try to make it multi-threaded. That still may not help out for running multiple P2 cores, however. Each independent core may need its own heap, which complicates the memory management. Not saying it's impossible, just not quite ready for this from what I can see. Eric may see it differently...
Do you see any way to have a main upython program in cog 0 start separate upython modules in other cogs?
As Roger said, micropython is single threaded at the moment, so there's no way to do this. If a multi-threaded micropython is created, we'll need some library and perhaps compiler support to enable it on P2.
Eric & Rodger,
Thanks for the answer.
I am not a computer science (or computer anything) major, so I don't understand the need for multi-threading.
But with the P2 rev A I have run Eric's Basic and Spin and Peter's Tachyon in multiple cores, and on the P1 I have run various Forths and SimpleIDE C functions as well as Spin methods in multiple cores, each operating independently, generally passing data between cores as global variables, sometimes with various types of locks when some time synchronization of data was needed.
In the C, the function would have been compiled into PASM or something similar (LMM for example). I think the same for Basic. So I guess that would be similar to running PASM in a new core for upython. The forths and Spin (P1) load the interpreter (or something like that) into the new core, but I guess that is not possible with upython, since it is so large. Maybe I answered my own question, although probably with the wrong terminology.
@twm47099,
Yes, you answered your own question. IIRC the upython interpreter is ~250KB, so there cannot be multiple copies of it in memory at the same time. Unless it was written with multi-threading from the ground up, it would be a biiig job to convert it.
On a PC, multiple copies is fine. They don’t even have to run on separate cores. But this is why the PC sometimes goes off into lala land and your typing gets buffered. Happening more and more on W10, even with “fast” multi-core PCs.
I've updated the first post with a newer micropython build -- thanks to @Tubular for pointing out to me that the input() function was missing! This version also has a bit of rudimentary help text for the P2 and a few other minor missing features added.
I've been playing with starting up my video driver in MicroPython and have had some success there with a custom version of my driver. However, I am finding that it does consume a fair bit of extra heap space by using the code=ubinascii.unhexlify('xxxxx') method to create the code array before spawning the COG. It seems to need about 3 times the space of the code itself when I observed the free memory before and after it is created (including after the command leaves the history, which I thought might free it, and after gc.collect() is called manually). I was hoping it would eventually free the hex string after converting it to a byte array, but it seems to keep it around in memory for some reason.
Does anyone know if there is a way to define a byte array in Python without consuming a lot more heap than the actual array needs? I'm not a Python guy myself even though I was able to do a native P2GCC based port of it to the P2.
I just learned that "del <obj>" can be used to free the object. So perhaps I can do something like this...
codestr = 'xxxxxx'
code = ubinascii.unhexlify(codestr)
del codestr
This might delete the hex string but keep the converted code array.
Update: Just tried this and it didn't free any more heap space than without using it, though the "codestr" variable was deleted from the dir() list output. I guess we are at the mercy of the memory allocator in MicroPython if it doesn't ever free this memory.
PS can try calling gc.collect()
Yeah @rosco_pc I tried running gc.collect(). It does seem to free some memory, but not all of it. I can't seem to free the original string memory, or at least that seems to be the case according to the approximate memory sizes I see with gc.mem_free() output before/after calling it. I sort of wonder if underneath the covers the call to unhexlify is making some connection to the source string which stops it from being freed later, though I don't know this, it's just a guess. Or maybe it is just because it is global in scope, being done at the REPL command line rather than local inside a function, so it can't get freed... not sure. I was thinking I could try writing a function to do this, but that probably has to maintain the hex string too as part of the actual code. I might try the SD filesystem to see how that goes.
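One way to try the function idea is to keep the hex string as a local, so the only long-lived reference after the call is the converted bytes object. This is just a sketch with CPython's binascii standing in for ubinascii; note the caveat that the string constant still lives in the function's own code object, so it may not fully solve the problem on micropython:

```python
import binascii  # "ubinascii" on the micropython build
import gc

def load_code():
    # hexstr is local, so after we return the only long-lived
    # reference is the converted bytes object (though the string
    # constant still lives in the function's code object)
    hexstr = '001104fb011304fb'
    return binascii.unhexlify(hexstr)

code = load_code()
gc.collect()       # give the collector a chance to reclaim temporaries
print(len(code))   # 8
```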
@ersmith , does your MicroPython have a way to honour the frequency being passed down via loadp2, or is it fixed at some frequency? I can't seem to get it to operate at 252MHz with my loadp2 command... and the perfTest benchmark seems to stay in the vicinity of 373000 or so with the various frequency values I tried. It's like it is fixed at something slower. My VGA output doesn't sync but it does operate, slowly. VGA hsync is ~20kHz instead of 31.5kHz, so it is probably running at 160MHz instead of 252MHz.
Yeah @rosco_pc I tried running gc.collect(). It does seem to free some memory, but not all of it. I can't seem to free the original string memory, or at least that seems to be the case according to the approximate memory sizes I see with gc.mem_free() output before/after calling it. I sort of wonder if underneath the covers the call to unhexlify is making some connection to the source string which stops it from being freed later, though I don't know this, it's just a guess. Or maybe it is just because it is global in scope, being done at the REPL command line rather than local inside a function, so it can't get freed... not sure. I was thinking I could try writing a function to do this, but that probably has to maintain the hex string too as part of the actual code. I might try the SD filesystem to see how that goes.
again that is by design in the python interpreter (samething is happening in 'normal' python)
have a look at this discussion (related to module removal in micro python, but same applies to other 'types'): https://forum.micropython.org/viewtopic.php?t=2639
Reading the binary directly from SD card with f.read() might produce less memory overhead, I'm not sure.
It'd be very cool to see your video driver working with micropython. Were you able to make it position independent? Remember that we don't know where in HUB memory micropython will load the code, so there can't be any absolute hub addresses in it.
Reading the binary directly from SD card with f.read() might produce less memory overhead, I'm not sure.
It'd be very cool to see your video driver working with micropython. Were you able to make it position independent? Remember that we don't know where in HUB memory micropython will load the code, so there can't be any absolute hub addresses in it.
Yes it is now a position independent video driver I am loading.
I hope there is a way to get the MicroPython object size down. The executable code for my driver is 3200 bytes, plus I load a 4k font table. That is 7300 bytes or so in total that needs to be imported, but I don't want to burn 3x this (over 21kB) if it is not necessary, particularly if the excess space can't be freed afterwards. I also need two scan line buffers to operate; however, the byte arrays for those didn't seem to add too much more excess overhead.
@ersmith , does your MicroPython have a way to honour the frequency being passed down via loadp2 or is it fixed at some frequency?
I've updated the first post with a new version (v22) that honors the -PATCH settings. Older versions always set the frequency to 160 MHz.
Great, I'll give it a go. Yesterday, while looking at the codebase, I realized that the internal USB and your own embedded video driver will be affected by the loadp2 frequency change. Maybe your embedded video driver wouldn't be needed if replaced by this new driver, but it would be nice to be able to somehow keep USB in there as well. I think the best way forward is to see if @garryj could possibly extend that USB code so it could operate at different frequencies (and use a different P2 pin base) dynamically at COG spawn time, based on some input parameters. This may be possible by patching various values in the code, like I do in my video code, based on the frequency and pin parameters passed in.
@ersmith , does your MicroPython have a way to honour the frequency being passed down via loadp2 or is it fixed at some frequency?
I've updated the first post with a new version (v22) that honors the -PATCH settings. Older versions always set the frequency to 160 MHz.
Great, I'll give it a go. Yesterday, while looking at the codebase, I realized that the internal USB and your own embedded video driver will be affected by the loadp2 frequency change.
I haven't checked USB, but my video driver is fine running at other frequencies, it reads the clock from the standard place in low memory. Obviously if the CPU frequency doesn't line up well with the pixel clock then there will be jitter. Ultimately it would be nice to replace both the video and USB with modules loaded from disk, particularly the video (some users may want bitmapped graphics, others would be fine with text, and the resolution is something they'll want to change). If we can't figure out runtime loading then as a fallback we should replace my video driver with yours, since yours seems quite a bit more capable.
Success with v22 and VGA. I now have a VGA text screen in MicroPython with my driver. It does burn some heap though.
I allocated:
6400 character hex string for COG code
3200 bytes converted from COG code hex string
8192 character hex string for font data
4096 bytes converted from font data hex string
4800 byte screen buffer array (80x30 words)
2x2560 byte line buffers (for the worst case 640 pixel x 32bpp gfx mode; could be much less for text only)
Plus probably a couple of hundred more bytes for other miscellaneous data including display and region structs and a 16 entry VGA palette etc
Totals ~ 32kB
Before loading video stuff, with a fresh MicroPython heap:
>>> import gc
>>> gc.mem_free()
231616
After loading in and starting up my video COG I see this memory usage:
>>> gc.collect()
>>> gc.mem_free()
194752
The difference is 36864 bytes. I guess with any rounding and other internal heap overheads it might be in the ballpark of what is to be expected, but it would be nice to be able to free up that extra ~14kB from the temporary hex strings if there is some way to do this. Maybe what would be good is some way to pack extra code/data resources into flash (independently from the MicroPython image) for people who don't want to always include an SD. A type of small flash file system that is accessible (and updatable) from MicroPython could be quite handy and in some cases could eliminate the need for an additional EEPROM for dynamically configurable system settings for example.
Even at 252MHz it visibly appeared to take somewhere in the range of 500-1000ms to clear the screen using a simple fill loop. Actually the screen write timing is not quite as bad as that when I measured it properly (earlier I was looking at different code that wrote a pattern instead of zeroes, and visibly that slowed it down even further).
Running this specific benchmarking code:
def fillTest(loops):
    x = pyb.millis()
    for j in range(loops):
        for i in range(2400):
            screen.word[i] = 0
    x = pyb.millis() - x
    print(x)
I get these results with RISC-V MicroPython @ 252MHz:
>>> fillTest(20)
2090
>>> fillTest(20)
2089
>>> fillTest(100)
10484
so about 105ms per clear loop, or 22892 chars/second to the screen.
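The arithmetic behind those figures, as a quick check:

```python
# Back-of-envelope check of the throughput numbers quoted above
loops = 100
total_ms = 10484        # fillTest(100) result on the RISC-V build
chars_per_loop = 2400   # 80x30 character screen

ms_per_loop = total_ms / loops
chars_per_sec = chars_per_loop * loops * 1000 // total_ms

print(ms_per_loop)    # 104.84, i.e. about 105ms per clear loop
print(chars_per_sec)  # 22892
```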
With my native p2gcc based MicroPython at the same P2 clock speed I get these times:
>>> fillTest(20)
1654
>>> fillTest(20)
1646
>>> fillTest(100)
8253
about 83ms per clear loop, or 29080 chars/second to the screen.
I guess it's approximately comparable to sending the data over serial at 230400bps.
Here's a quick demo of my P2 VIDEO COG running under MicroPython on the P2...
You need to obtain Eric's latest v22 build and also set up the P2 clock frequency to 252MHz (similar to what I did below using these loadp2 arguments, but with your own serial port setup):
You will then need to edit and customize these lines in the attached vgapython file with your own P2 pin settings for any VGA or A/V breakout board on your system.
# setup the configuration for a basic VGA
vgaBasePin = 8
vgaVsyncPin = 12
You will then need to paste the vgapython text file into the console using MicroPython's PASTE-mode (press ^E to enable it from the REPL, then paste the text, then press ^D).
In my setup I found that when using Eric's MicroPython version, the pasting needs to be done in two sections, to introduce a delay so all the pasted content can be processed. The two-step paste break position I used is identified by this comment below in the file, but exact timing might vary on your terminal setup, or this two-step process could be avoided if you add some delay per line sent with some other terminal program.
# --- PASTE BREAK HERE FOR RISC-V MicroPython to avoid overrun
Completing this paste will automatically spawn a VGA driver COG and generate some sample text (in an 80x30 mode).
Once this video COG has started you can play with it and run various sample tests/demos below using these function names:
From the REPL command line you can also interactively experiment with various dynamic driver settings such as these and observe the effect:
display.border.top = 100 # 100 scanline top border
display.colour.green = 255 # make the border green
region.config.flags = 4 # enable pixel doubling
region.config.size = 100 # setup only a 100 scan line region
region.font += 1
region.screen1 += 2
etc.
You can also write to the screen using this:
screen.word[position] = attribute << 8 | character
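For instance, to place a character at a given row and column in the 80x30 text mode, the position and word could be computed like this (the attribute value 0x1F here is just a hypothetical example):

```python
# Sketch: pack a character and attribute into a screen word for the
# 80x30 text mode described above. The attribute value is hypothetical.
COLS = 80

def screen_word(char, attribute):
    return (attribute << 8) | ord(char)

def position(row, col):
    return row * COLS + col

w = screen_word('A', 0x1F)
print(hex(w))           # 0x1f41
print(position(2, 10))  # 170
# on the P2: screen.word[position(2, 10)] = w
```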
A few things I thought of while enjoying this demo (it works great on my machine):
(1) Rather than cutting and pasting you can use loadp2's script facility to download the code to the P2. This should avoid any serial overruns too, because loadp2 pauses periodically to let the other end keep up (there's a script command to control how often it pauses, but the default worked fine for me). I used the command line:
or you could put the script commands after -e into a script file.
(2) The scrolling speed seemed OK at 252 MHz, but we could always add PASM code to accelerate it.
(3) In theory you could hook into the serial output so that it goes to your object for display on the screen as well as to the serial port. To do that your COG would have to save off the UART CSR 0xbc0 write hook, which is located at $80c in HUB, and replace it with its own. The vector should point to some HUB code that accepts the character to transmit in pb and returns via a regular "ret" instruction. It should preserve all registers except pb and ptrb. You'd probably want to start by doing a call to the original serial output handler that you saved earlier (so the character would go to the serial port), then write the character pb into the screen buffer. To find the screen buffer and/or scratch space for saving registers you'd have to do a pc-relative LOC instruction, but I think it's do-able.