Sounds like it's mostly library work that needs to be done at this point. Maybe your approach is good enough for a first release of the P2? It would save Parallax money not having to do PropGCC for P2. On the other hand, I suspect that we could make use of more of the P2 features if we directly targeted it.
Yes, most of the work that I've done on p2gcc lately is on the library. At some point I want to grab the PropGCC library code from GitHub and compile it for p2gcc. A lot of the library functions are pretty generic and don't need to be tuned for the P2. I did create a fast version of memset that uses the P2 streamer. I also want to do a fast version of memcpy that uses the streamer, but this is a bit more complicated. memcpy will need to break the copy up into chunks, and alternate block reads and block writes using cog memory for the buffer.
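The chunked memcpy idea described above can be modeled in a few lines. This is only a rough Python simulation of the control flow, not P2 code: the `hub` bytearray stands in for hub RAM, `cog_buf` stands in for a cog-memory staging buffer, and the function name and the 256-byte buffer size are assumptions made for illustration.

```python
def chunked_copy(hub, dst, src, count, buf_bytes=256):
    """Copy count bytes inside hub from src to dst through a small buffer.

    Like memcpy, this assumes the source and destination do not overlap.
    """
    copied = 0
    while copied < count:
        n = min(buf_bytes, count - copied)
        # Block read: hub -> staging buffer (models a burst read into cog RAM).
        cog_buf = bytes(hub[src + copied:src + copied + n])
        # Block write: staging buffer -> hub (models a burst write from cog RAM).
        hub[dst + copied:dst + copied + n] = cog_buf
        copied += n
    return copied
```

The last chunk is simply shorter than the buffer, so the length does not have to be a multiple of the buffer size.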
I've found that the qvector and qrotate instructions have been very useful for doing the trig functions. qlog, qexp and qsqrt have also been useful for implementing the corresponding C functions. qdiv and qmul work well for 32-bit floating point. It's going to be a little trickier to implement 64-bit floating point with those instructions.
I'm implementing many of the P2-specific instructions by using small functions that contain __asm__() statements. This is easy to do, but it does incur the overhead of function calls. Eventually I'll try to define them as instructions to eliminate the function calls and allow the optimizer to do its magic. This will also improve the speed of hub execution by eliminating function calls.
Maybe Parallax could use p2gcc as a starting point, and replace components of it until we have a full GCC implementation. That might be a viable approach that would allow Parallax to spread the development cost over a longer period of time.
I just saw this post and wanted to leave a comment. I'm currently using the Prop1 in a production product of which I sell 3,000 to 5,000 units per year. Since my current codebase is in Spin/PASM, I'd love to move to the Prop2 as long as it supports that along with C++. As it is now, the 32K EEPROM limitation of the Prop1 is making new developments almost impossible. If the Prop2 doesn't come out soon in a form where I can easily move my codebase, I'm probably going to have to look at ARM processors, though I'd prefer the Prop as it is less power-hungry and the unit I produce is installed with a lithium-ion battery.
The Prop2 will be fabricated in 180nm technology and will probably leak 1mA quiescent current whenever powered up. That will not work well in your app. It can process a lot of signals, though.
You can always use another MCU as a sleep timer. (This is where fast boot times become very important...)
Will the test shuttle be able to indicate quiescent current, or is that dominated by the sea-of-logic not yet in the shuttle ?
Do you have a Core Vcc operating range ? Lowering that core Vcc, could give another means to reduce idle power.
Some 3v3 devices spec like eg that EEPROM needs 2.7V Min, but the MCU can operate down to 1.8V ?
Other, older MCUs used to specify a retention Vcc/Icc - that was the Clock stopped value, needed to preserve RAM contents.
Seems the P2 could do the same ?
We need to get the final chip built before we'll know what the quiescent leakage really is. Any time 1.8V is applied to the Prop2's VDD pins, this leakage is occurring.
If you want to eliminate this ~1mA of leakage, you'll need another circuit to control the 1.8V supply.
Thanks for the welcome, I've been a long time lurker and using my prop1 for a while (since 2011?) and love it. Much easier to work with than Arduino.
I think the leak will be okay; the board gets installed into a small space to do audio and some motion tracking, which draws far more current than the leak. It would be nice, though, if the Prop2 supported a way to sleep after a period of operation and then wake via one of the standard I/O pins. That would completely mitigate any issue with the Prop itself leaking current, since it could just go to "sleep" after some time, as long as the wake-up could come from the button I already use on one of the I/O pins.
Also, there's a HUGE advantage to not having to start from scratch in the code, so I'd prefer to stick with what I already have and know rather than start over somewhere else. I'm a big fan of C#, so I am looking at alternatives that support the .NET Micro Framework as well, but I do think the Prop2 is still going to be the best overall choice for my business.
How often does it wake, and for how long ?
What are the current budgets for each sleep/wake mode ?
As Chip mentioned above, the P2 is optimised for speed, so the Stopped-clock Icc will be nearer 1mA than 1uA.
If you need 1uA, then the 1.8V supply needs to be controlled.
There may be some lower core Vcc that supports RAM retention (no CLK), but that would need a small companion MCU to control Vcc, CLK and RST.
1mA is fine; the battery I use is a 3400mAh 18650, so @1mA that's going to be over 100 days of sleep time, which is way more than enough. Heck even 1 month of sleep is fine, so the entire system can probably go up to 4-5x that amount and be okay for the "sleep" mode.
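The sleep-time estimate above is simple to check. The sketch below assumes an ideal battery (no self-discharge, no regulator losses, full rated capacity usable), which is optimistic; the function name is made up for illustration.

```python
def standby_days(capacity_mah, drain_ma):
    """Ideal standby time in days for a constant current drain."""
    hours = capacity_mah / drain_ma
    return hours / 24.0


# 3400 mAh cell at the ~1 mA leakage discussed in the thread,
# and at 4x and 5x that drain for the whole system.
for drain in (1.0, 4.0, 5.0):
    print(f"{drain:.0f} mA -> {standby_days(3400, drain):.1f} days")
```

Under these assumptions 1 mA gives roughly 140 days, and a total drain of 4 to 5 mA still leaves about a month, which matches the "4-5x" margin mentioned in the post.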
The device will be awake for a good period of time (1 hourish?) but sleeps quite a while after that. I have no concerns about power draw when it's not asleep because that means the end user is engaged with it, and if it runs out they'll know to plug it in; I just want a better shelf life than the 7 days or so that I'm getting with the Prop1, since I cannot really re-wake from sleep without a physical reboot.
Again the ability to easily port my existing code to a platform with a ton more memory and more speed is already a huge win over anything else I've looked at so far.
I'd expect P2 to have some MHz/mA trade off slope, by varying Core Vcc.
The test shuttle may reveal the RCosc supply range, and the PLL supply range too.
After that it's just CMOS logic, which has quite wide ranges of MHz/mA vs Vcc
If you want ideas on the present P1 design, for ways to improve P1 idle drain, short of a reboot, you could also look at this thread
The OP there is running off RCSLOW, and Tracy Allen posts:
Here are experimental results, all pins configured for lowest current, running a counter in NCO mode at clkfreq/2.
RCslow: 8.6µA without NCO, or 9.8µA with NCO, period measured 97.18µs (clkfreq=20580Hz)
RCfast: 0.752mA without NCO, or 1.52mA with NCO, period measured 146.52ns . (clkfreq=13.680MHz)
XTAL1, 5MHz: 0.537mA without NCO, or 0.88mA with NCO, period 399.98ns (clkfreq=5MHz)
That XTAL1 is no PLL, just 5MHz core. P1 is Spec'd +125°C, >2.7V, 80MHz, so can go well under 2.7V, for 5MHz(no PLL)
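As a cross-check on the quoted measurements: with the counter running in NCO mode at clkfreq/2, the clock frequency should be two divided by the measured period. The helper below is a hypothetical illustration of that arithmetic, using only the periods listed above.

```python
def clkfreq_from_nco_period(period_s):
    """Clock frequency in Hz when the NCO output runs at clkfreq/2."""
    return 2.0 / period_s


measured = {
    "RCslow": 97.18e-6,    # seconds
    "RCfast": 146.52e-9,
    "XTAL1": 399.98e-9,
}
for mode, period in measured.items():
    print(f"{mode}: {clkfreq_from_nco_period(period) / 1e6:.4f} MHz")
```

The RCslow and XTAL1 periods agree with the quoted clkfreq values. The RCfast period works out nearer 13.65 MHz than the quoted 13.680 MHz, so one of those two figures may have been rounded or mistyped in the original post.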
If those choices of Speed/Icc are still not enough, another alternative is to add a small MCU as a smart-clock-generator.
The very cheap N76E003 has Slow(10KHz) and Fast(16MHz) Osc, and can divide osc -> Pin to then clock the Prop.
There is also a useful thread called Prop Limbo! how low (power, voltage) can it go!
- that post shows the RCSlow / Icc kHz.uA against Vcc.
Indicates 1.6~3.3 is a valid range, with below 1.6 having some caveats.
"Prior to 0.25um, static power leakage was practically ignorable. At 0.18um, it started to be noticeable." -Microprocessor Design Engineer, Intel Corp
As someone with a casual interest in chip design, I wonder what the leakage would be for a Prop2 in 250nm. Halfway between 350 and 180nm?
No idea. But 180nm is chipped in stone!
But the newer processes for the older geometries, including 180nm, have much less leakage than they used to. IIRC that is what Chip was told by OnSemi with their newer ONC18 process that will be used for the P2.
You cannot compare the P2 to a 99-cent Cortex-M0+, unless you plan to use four of them to create a DIY quad-core.
You could, for example, compare it to an EFM32 Giant Gecko GG11, available Q1 2018 for $6 each at 1K:
72 MHz Freq.
DSP & Floating Point Unit.
2Meg Flash with read-while-write support.
512K RAM.
Octal/ Quad-SPI memory controller with XIP.
10/100 Ethernet MAC.
Dual CAN bus controller.
Enhanced alpha blending graphics engine.
>wouldn't one need 16 CortexM0+ to compare to 16 P2 COGs?
The ARMs have some dedicated hardware, such as PWM, timers and communication protocols, that could do some of the jobs a COG would otherwise be dedicated to.
So it would take somewhere between four and eight 99-cent ARMs to create a Frankenstein P2.
I wouldn't call a multiple-ARM design a Frankenstein, as it is pretty typical for most products these days. It's normal to see a few ARMs in simple appliances, cell phones or tablets, and 10s to 100s of ARMs in cars. It makes for good design practice, as it lends itself to modularity.
A P2 will have an extra supply IC (needing 2), an external Flash, and external crystal. So when you can buy a 50 cent ARM you get probably more CPU power than a COG, and additionally some built in peripherals that chew up COGs, like UARTs, SPI, I2C, PWM timers and low power operation at around 1 microamp.
[quote="brucee;1419030"]...Not to mention I can buy those ARMs today.[/quote]
That is the main problem.
The two supplies, flash and crystal are - hmm - annoying, but not so bad. The P1 needs one supply, an EEPROM and a crystal at minimum, and usually two supplies (3.3/5), so what's the difference?
For me, I will anyways need some module, not able to hand solder a potential P2, so I guess at least those components might be on there.
I still would love some sort of bread-board-able module, and on a long DIP like module is for sure enough space to put some other components.
P1 (and P2?) can run without a crystal/oscillator, albeit quite slowly. Since I have never done anything with ARMs, how do they handle this? Do they have built-in clock generation, or do they also rely on external components?
RC Fast in P1 is a reasonable speed, just not as fast as the PLL.
One drawback of the P1, fixed in the P2, is the 5MHz crystal: today, crystals at that frequency are physically larger, come in fewer choices, and are more expensive than higher-frequency crystals.
I think P2 may be able to run the PLL from the RCfast, as that has better stability - and that would mean P2 can run to PLL/VCO limits, using no crystal.
I'm not sure many designs will bother to do that, as the crystal is a small incremental cost to P2.
Almost all small MCUs these days have internal RC oscillators. Even 30c ones are calibrated to 1~2%.
That's enough to do simpler tasks, but a crystal / resonator / oscillator is needed if you want better stability and/or lower jitter than an RC oscillator can give.
P2 should work well with Clipped Sine TCXOs - those are relatively new, have low prices driven by GPS volumes, and can give ±500ppb precision
you are a handful of knowledge, @jmg. Some of your posts in the P2 threads go right over my head, but other ones I am reading with the joy to understand something new.
That opens up another of the stupid questions I have to ask here.
The current serial boot-loader has - as far as I can follow - some support for having multiple P2's on the same 2 serial lines and with some magic being able to host 256(?) P2's on two serial lines for programming.
Is this still true, and how exactly does a P2 know if he is meant to react, or not?
Because I see the P2 as a very versatile io-expander, best driven by a very versatile P2 host.
@brucee mentioned 10s to 100s of ARMs in cars, so what about inter-P2 communication?
P2-Hot had some planned serial links for P2 to P2 communication, but what is the current situation?
@Beau had some nice work done on a fast Hub-Ring buffer to 'share' Hub-Ram in all used P1 in the ring, allowing the classical mailbox-behavior to work even from one P1 to another P1.
Question is, is there some support - even non standard - for really fast transfers between different P2's?
Depends on what 'really fast' means
There is the Sync streamer, and the Smart Pin cell UARTS can run in Async and Sync modes.
If the P2's all had a common clock source, you may be able to run up to SysCLK/2 speeds ?
I think the final answer depends on the analog substrate, which is about to be tested and will (no doubt) be on Chip's todo list:) Worst case we can always boost things a bit.
Depending upon the graduate students involved... I really expect surprising results from P2 clusters. I have some ideas about how to build one, but I would really like to see a (huge) Parallax cluster released with the first generation of P2 chips. I'm thinking $2500 for the unpopulated "mother of all boards."... or double that and give some away to the Profs.
"Each command keyword is followed by four 32-bit hex values which allow selection of certain chips by their INA and INB states. If you wanted to talk to any and all chips that are connected, you would use zeroes for these values. In case multiple chips are being loaded from the same serial line, you would probably want to differentiate each download by unique INA and INB mask and data values. When the serial loader receives data and mask values which do not match its own INA and INB ports, it waits for another command."
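The selection rule quoted above can be sketched as a mask-and-compare on each port. This is a guess at the logic, not the loader's actual implementation: the order of the four values (INA mask, INA data, INB mask, INB data) and the function name are assumptions for illustration.

```python
MASK32 = 0xFFFFFFFF


def chip_selected(ina, inb, ina_mask, ina_data, inb_mask, inb_data):
    """Return True if this chip's pin states match the command's mask/data.

    Only bits set in a mask are compared, so all-zero masks match any chip.
    """
    a_ok = (ina & ina_mask & MASK32) == (ina_data & ina_mask & MASK32)
    b_ok = (inb & inb_mask & MASK32) == (inb_data & inb_mask & MASK32)
    return a_ok and b_ok


# Broadcast: zeroes select every chip on the line.
print(chip_selected(0x1234, 0x0, 0, 0, 0, 0))
# Address chip "2" using the low two INA pins as strapped ID bits.
print(chip_selected(0b1010, 0x0, 0b11, 0b10, 0, 0))
```

With this reading, a chip that fails the comparison would simply ignore the download and wait for the next command, as the quoted text describes.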
That looks like a LOT of P2s could share a common bus, with port pins setting the address.
A more practical limit is likely to be the Download times.
Single stage loaders are looking ~ 2MBd MAX, but you could load a stub and enable the PLL to go faster.
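To get a feel for why download time becomes the practical limit, here is a rough estimate. It assumes raw 8N1 serial framing (10 bits on the wire per byte) with no protocol or encoding overhead, and the 64 KiB image size and 16-chip count are made-up example figures, not P2 specifications.

```python
def download_seconds(image_bytes, baud, bits_per_byte=10):
    """Time to send one image over an async serial link (8N1 by default)."""
    return image_bytes * bits_per_byte / baud


image = 64 * 1024          # hypothetical image size in bytes
chips = 16                 # hypothetical number of P2s on the bus
per_chip = download_seconds(image, 2_000_000)
print(f"per chip: {per_chip:.3f} s, all {chips} chips: {per_chip * chips:.2f} s")
```

Because each chip with a different image is loaded in turn, the total scales linearly with the number of chips, which is where loading a stub and switching to a faster PLL-clocked rate would help.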
Same here. I used to grouse about EEPROM because I was hand-wiring P1 boards. With the P2, I'll be buying modules. So it makes little difference if EEPROM/FLASH is on the module or in the uC. We pay for it either way.
Speaking of modules, Leaf Maple mini boards from China are about the cheapest ARM Cortex M3 header boards imaginable: $2.12/ea. Sure beats what I used to pay Olimex or EA AB.
For rapid product development, something like a P2-based motherboard for Tibbit modules would be very cool. There are integrators, such as myself, who need configurable, professionally packaged hardware that could be installed anywhere without looking like a hobbyist hack-job.
Comments
Welcome to the forums!
"will probably leak 1mA quiescent current whenever powered up"
1 mA doesn't sound too bad to me. Is this something that is at a very remote site, mostly in sleep mode?
wouldn't one need 16 CortexM0+ to compare to 16 P2 COGs?
just asking...
Mike
Not to mention I can buy those ARMs today.
just curious,
Mike
Thanks,
Mike
Why not just use a 48-core ARM?...
http://www.eenewseurope.com/news/power-management-key-10nm-48-core-arm-server-chip?news_id=97402
Something tells me that this chip might cost more than $6-10.
;-)
curious again,
Mike
BTW 16xARM =/= P2 because of the interconnection of hub ram, etc.
Rich
See http://forums.parallax.com/discussion/162298/prop2-fpga-files-updated-4-july-2017-version-20/p1
The Prop DOC there details the boot rules.