chipKIT® Development Platform

Inspired by Arduino™

Program Optimization

Created Mon, 18 Mar 2013 06:56:24 +0000 by mckeemj


mckeemj

Mon, 18 Mar 2013 06:56:24 +0000

Okay, I know there was some program size discussion back in 2011, but I'm unconvinced by the conclusion -- that there's no issue. Here's the situation. I'm working on a library for use in control systems ( robots mostly ) that will be used by our university's robotics/maker club. It is being designed to be usable by both Arduino and ChipKit boards at the moment, with plans to port it to other boards as they become available.

Now. A test program I'm working on builds to 16k for the Arduino Uno ( atmega328 ). The exact same program builds to 104k ( ! ) on the ChipKit Uno32. What takes about half of the AVR's memory is taking nearly 90% of the PIC32MX's. But that's not the whole story. I've not looked through the entire .lss file for the ChipKit build, but I have seen a pattern -- unreferenced functions are being included across the board. The compile command does have the -ffunction-sections and -fdata-sections options and the link command has --gc-sections, but the garbage collection does not seem to be working.

I'm not fluent enough with the PIC32MX asm to be able to tell how well, otherwise, it is optimizing things, but from the simple programs I've examined, it doesn't look too bad. If I could get the garbage collection working better I would be one happy camper. What I can say is that on both processors, the programs are working just fine -- and exactly the same ( as they should ). It is just that I'm almost out of program space on the ChipKit and it's, really, hardly doing anything...

I'll be the first to admit that the library does tend toward things that can inspire "bloat" ( virtual functions, templates, highly flexible objects with many methods and object parameters ), but I have certainly seen things optimized better than the PIC32 compiler is achieving. Ideas?

Martin Jay McKee


majenko

Mon, 18 Mar 2013 15:49:38 +0000

I think part of the problem is the code isn't the most efficient in the world. C++ itself tends to add a certain amount of overhead, but then that should be roughly the same on both platforms.

One of the down-sides of the PIC32 is that it is more powerful than the Atmel. Sounds strange, but it is ;) The chip can do more, so it does do more - i.e., mpide includes extra facilities that the Arduino doesn't - mainly the task scheduling system which Arduino doesn't have - that adds some extra overhead to the resultant code size.

Plus, because it's a 32-bit system compared to an 8-bit system, each instruction takes 4 bytes compared to the 2 ( occasionally 4 ) bytes of an AVR instruction, so there is extra bloat there. That can be countered by compiling in MIPS16e mode, at the cost of slower code.

Garbage collection is working - if you compile "BareMinimum" with the defaults, you get

text	   data	    bss	    dec	    hex	filename
   5528	     48	   4320	   9896	   26a8	BareMinimum.cpp.elf

But without -Wl,--gc-sections, you get:

text	   data	    bss	    dec	    hex	filename
   7736	     60	   4340	  12136	   2f68	BareMinimum.cpp.elf

Taking something more complex - AnalogInOutSerial - you get these values:

Normal:

text	   data	    bss	    dec	    hex	filename
   9284	     52	   4676	  14012	   36bc	AnalogInOutSerial.cpp.elf

Without gc:

text	   data	    bss	    dec	    hex	filename
 105828	   2348	   4812	 112988	  1b95c	AnalogInOutSerial.cpp.elf

Which is a huge difference.

-ffunction-sections and -fdata-sections have no effect on the output size whatsoever.
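For anyone not familiar with these options, here is a minimal sketch of what they are meant to do together. The function names are hypothetical and nothing here is taken from the mpide core; the flags in the comments are the usual GCC and GNU ld spellings.

```cpp
// Illustration only: usedHelper(), unusedHelper() and entryPoint() are
// hypothetical names, not functions from any core library.
//
// Compiled with  -ffunction-sections -fdata-sections, each function and each
// global is placed in its own input section (.text.<name>, .data.<name>, ...)
// instead of sharing one .text / .data section per object file.
// Linked with    -Wl,--gc-sections, the linker may discard every input
// section that is not reachable from the entry point or another kept section.

int usedHelper(int x) {
    return x * 2;
}

// Never referenced anywhere; with both sets of flags active the linker is
// free to drop the section that holds it.
int unusedHelper(int x) {
    return x * 3;
}

int entryPoint() {
    return usedHelper(21);
}
```

One way to see what was actually discarded is to ask the linker for a map file ( -Wl,-Map=out.map ) or to add -Wl,--print-gc-sections and compare builds.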

With garbage collection AnalogInOutSerial includes the following function list:

int HardwareSerial::available(void)
int HardwareSerial::peek()
int HardwareSerial::read(void)
int analogRead(uint8_t pin)
int main(void)
long map(long x, long in_min, long in_max, long out_min, long out_max)
uint32_t millisecondCoreTimerService(uint32_t curTime)
unsigned int __attribute__((nomips16))  INTEnableInterrupts(void)
unsigned int __attribute__((nomips16)) INTDisableInterrupts(void)
unsigned long micros()
unsigned long millis()
void HardwareSerial::begin(unsigned long baudRate)
void HardwareSerial::write(uint8_t theChar)
void Print::print(char c, int base)
void Print::print(const char str[])
void Print::print(int n, int base)
void Print::print(long n, int base)
void Print::printNumber(unsigned long n, uint8_t base)
void Print::println(int n, int base)
void Print::println(void)
void Print::write(const char *str)
void Print::write(const uint8_t *buffer, size_t size)
void __ISR(_CORE_TIMER_VECTOR, _CT_IPL_ISR) CoreTimerHandler(void)
void __ISR(_SER0_VECTOR, _SER0_IPL_ISR) IntSer0Handler(void)
void __ISR(_SER1_VECTOR, _SER1_IPL_ISR) IntSer1Handler(void)
void analogWrite(uint8_t pin, int val)
void delay(unsigned long ms)
void delayMicroseconds(unsigned int us)
void digitalWrite(uint8_t pin, uint8_t val)
void init()
void loop() {
void pinMode(uint8_t pin, uint8_t mode)
void setup() {
void turnOffPWM(uint8_t timer)

Without it, the list is huge:

boolean String::endsWith( const String &s2 ) const
boolean String::equals( const String &s2 ) const
boolean String::equalsIgnoreCase( const String &s2 ) const
boolean String::startsWith( const String &s2 ) const
boolean String::startsWith( const String &s2, unsigned int offset ) const
char & String::operator[]( unsigned int index )
char String::charAt( unsigned int loc ) const
char String::operator[]( unsigned int index ) const
const String & String::concat( const String &s2 )
const String & String::operator+=( const String &other )
const String & String::operator=( const String &rhs )
createTask(taskFunc task, unsigned long period, unsigned short state, void * var) {
destroyTask(int id) {
getTaskId(taskFunc task) {
getTaskNextExec(int id) {
getTaskPeriod(int id) {
getTaskState(int id) {
getTaskVar(int id) {
inline void String::getBuffer(unsigned int maxStrLen)
int HardwareSerial::available(void)
int HardwareSerial::peek()
int HardwareSerial::read(void)
int String::compareTo( const String &s2 ) const
int String::indexOf( char ch, unsigned int fromIndex ) const
int String::indexOf( char temp ) const
int String::indexOf( const String &s2 ) const
int String::indexOf( const String &s2, unsigned int fromIndex ) const
int String::lastIndexOf( char ch, unsigned int fromIndex ) const
int String::lastIndexOf( char theChar ) const
int String::lastIndexOf( const String &s2 ) const
int String::lastIndexOf( const String &s2, unsigned int fromIndex ) const
int String::operator!=( const String &rhs ) const
int String::operator<( const String &rhs ) const
int String::operator<=( const String &rhs ) const
int String::operator==( const String &rhs ) const
int String::operator>( const String &rhs ) const
int String::operator>=( const String & rhs ) const
int analogRead(uint8_t pin)
int digitalRead(uint8_t pin)
int main(void)
itoa(int val, char * buf, int base)
long String::toInt() {
long map(long x, long in_min, long in_max, long out_min, long out_max)
long random(long howbig)
long random(long howsmall, long howbig)
ltoa(long val, char * buf, int base)
setTaskPeriod(int id, unsigned long tmsSet)	{
setTaskState(int id, unsigned short st) {
setTaskVar(int id, void * var) {
startTaskAt(int id, unsigned long tms, unsigned short st) {
uint32_t millisecondCoreTimerService(uint32_t curTime)
uint8_t getPinMode(uint8_t pin)
ultoa(unsigned long val, char * buf, int base)
unsigned int __attribute__((nomips16))  INTEnableInterrupts(void)
unsigned int __attribute__((nomips16)) INTDisableInterrupts(void)
unsigned int detachCoreTimerService(uint32_t (* service)(uint32_t))
unsigned long micros()
unsigned long millis()
utoa(unsigned val, char * buf, int base)
void HardwareSerial::begin(unsigned long baudRate)
void HardwareSerial::write(uint8_t theChar)
void Print::print(char c, int base)
void Print::print(const String &argString)
void Print::print(const char str[])
void Print::print(double n, int digits)
void Print::print(int n, int base)
void Print::print(long n, int base)
void Print::print(unsigned char b, int base)
void Print::print(unsigned int n, int base)
void Print::print(unsigned long n, int base)
void Print::printFloat(double number, uint8_t digits)
void Print::printNumber(unsigned long n, uint8_t base)
void Print::println(char c, int base)
void Print::println(const String &argString)
void Print::println(const char c[])
void Print::println(double n, int digits)
void Print::println(int n, int base)
void Print::println(long n, int base)
void Print::println(unsigned char b, int base)
void Print::println(unsigned int n, int base)
void Print::println(unsigned long n, int base)
void Print::println(void)
void Print::write(const char *str)
void Print::write(const uint8_t *buffer, size_t size)
void String::getBytes(unsigned char *buf, unsigned int bufsize)
void String::setCharAt( unsigned int loc, const char aChar ) 
void String::toCharArray(char *buf, unsigned int bufsize)
void __ISR(_CORE_TIMER_VECTOR, _CT_IPL_ISR) CoreTimerHandler(void)
void __ISR(_SER0_VECTOR, _SER0_IPL_ISR) IntSer0Handler(void)
void __ISR(_SER1_VECTOR, _SER1_IPL_ISR) IntSer1Handler(void)
void analogReference(uint8_t mode)
void analogWrite(uint8_t pin, int val)
void delay(unsigned long ms)
void delayMicroseconds(unsigned int us)
void digitalWrite(uint8_t pin, uint8_t val)
void init()
void loop() {
void pinMode(uint8_t pin, uint8_t mode)
void randomSeed(unsigned int seed)
void setup() {
void turnOffPWM(uint8_t timer)

guymc

Mon, 18 Mar 2013 18:30:03 +0000

Hi Martin,

Would it be possible to get a copy of this project? Our compiler team would like to take a close look and see if anything can be done to improve the situation. I'll send you a PM with contact info.

As majenko explained, there are some architectural differences that will result in larger code size. Nevertheless, we'd like to improve the memory utilization if at all possible.

Cheers


majenko

Mon, 18 Mar 2013 21:09:40 +0000

I think I have found the culprit - or one of the culprits...

sprintf().

Take the following program:

void setup()
{
  Serial.begin(115200);
}

void loop()
{
  char temp[20];
  sprintf(temp, "%d", 23);
  Serial.println(temp);
  delay(1000);
}

On the Arduino, that compiles to 3,618 bytes. On the ChipKit it compiles to 42,068 bytes.

Comment out the sprintf() line, and the figures change:

Arduino: 2,088 (58% of the size), ChipKit: 7,492 (18% of the size).

... :o

By the way, I'm currently on mpide-0023-linux32-20120715-test


majenko

Mon, 18 Mar 2013 21:22:44 +0000

I think that there could be certain optimizations done to libc...

This is the printf contents of the mplab libc.a:

text	   data	    bss	    dec	    hex	filename
   1088	      0	      4	   1092	    444	lib_a-__dprintf.o
    224	      0	      0	    224	     e0	lib_a-asiprintf.o
    352	      0	      0	    352	    160	lib_a-asniprintf.o
    352	      0	      0	    352	    160	lib_a-asnprintf.o
    224	      0	      0	    224	     e0	lib_a-asprintf.o
     96	      0	      0	     96	     60	lib_a-diprintf.o
     96	      0	      0	     96	     60	lib_a-dprintf.o
     60	      0	      0	     60	     3c	lib_a-eprintf.o
     96	      0	      0	     96	     60	lib_a-fiprintf.o
     96	      0	      0	     96	     60	lib_a-fprintf.o
     96	      0	      0	     96	     60	lib_a-fwprintf.o
    108	      0	      0	    108	     6c	lib_a-iprintf.o
    108	      0	      0	    108	     6c	lib_a-printf.o
    192	      0	      0	    192	     c0	lib_a-siprintf.o
    316	      0	      0	    316	    13c	lib_a-sniprintf.o
    316	      0	      0	    316	    13c	lib_a-snprintf.o
    192	      0	      0	    192	     c0	lib_a-sprintf.o
   3912	      0	      0	   3912	    f48	lib_a-svfiprintf.o
   3592	      0	      0	   3592	    e08	lib_a-svfiwprintf.o
   6772	      0	      0	   6772	   1a74	lib_a-svfprintf.o
   6960	      0	      0	   6960	   1b30	lib_a-svfwprintf.o
    352	      0	      0	    352	    160	lib_a-swprintf.o
    124	      0	      0	    124	     7c	lib_a-vasiprintf.o
    224	      0	      0	    224	     e0	lib_a-vasniprintf.o
    224	      0	      0	    224	     e0	lib_a-vasnprintf.o
    124	      0	      0	    124	     7c	lib_a-vasprintf.o
    184	      0	      0	    184	     b8	lib_a-vdiprintf.o
    184	      0	      0	    184	     b8	lib_a-vdprintf.o
   3912	      0	      0	   3912	    f48	lib_a-vfiprintf.o
   3848	      0	      0	   3848	    f08	lib_a-vfiwprintf.o
   7036	      0	      0	   7036	   1b7c	lib_a-vfprintf.o
   7220	      0	      0	   7220	   1c34	lib_a-vfwprintf.o
     88	      0	      0	     88	     58	lib_a-viprintf.o
     88	      0	      0	     88	     58	lib_a-vprintf.o
    108	      0	      0	    108	     6c	lib_a-vsiprintf.o
    208	      0	      0	    208	     d0	lib_a-vsniprintf.o
    208	      0	      0	    208	     d0	lib_a-vsnprintf.o
    108	      0	      0	    108	     6c	lib_a-vsprintf.o
    228	      0	      0	    228	     e4	lib_a-vswprintf.o
     88	      0	      0	     88	     58	lib_a-vwprintf.o
    108	      0	      0	    108	     6c	lib_a-wprintf.o
  49912	      0	      4	  49916	   c2fc	(TOTALS)

Compare that to the Arduino's libc.a:

text	   data	    bss	    dec	    hex	filename
     38	      0	      0	     38	     26	fprintf.o
     62	      0	      0	     62	     3e	fprintf_p.o
     40	      0	      0	     40	     28	printf.o
     74	      0	      0	     74	     4a	printf_p.o
    102	      0	      0	    102	     66	snprintf.o
    102	      0	      0	    102	     66	snprintf_p.o
     74	      0	      0	     74	     4a	sprintf.o
     74	      0	      0	     74	     4a	sprintf_p.o
     40	      0	      0	     40	     28	vfprintf_p.o
   1038	      0	      0	   1038	    40e	vfprintf_std.o
     24	      0	      0	     24	     18	vprintf.o
     88	      0	      0	     88	     58	vsnprintf.o
     88	      0	      0	     88	     58	vsnprintf_p.o
     54	      0	      0	     54	     36	vsprintf.o
     54	      0	      0	     54	     36	vsprintf_p.o
   1952	      0	      0	   1952	    7a0	(TOTALS)

Somewhat leaner, yes?


mckeemj

Tue, 19 Mar 2013 15:25:15 +0000

Thanks for taking a look at this. I certainly understand the difference in the instruction set, and I would expect some increase in code size... but my feeling is that what I'm seeing is excessive. Just the fact that -ffunction-sections and -fdata-sections have no effect is somewhat troubling, as the library code is certainly written in such a way that it expects those to work -- that could be a major part of what I'm seeing. If that's the case, the library code should probably be considered suspect... at least for this particular platform ( it works like a dream though! ).

I've zipped the project up and sent it to the compiler team. Glad to be of help in any way possible. While I've never been a big fan of the Arduino Library API/Structure ( hence this library ), there is immense value in a well maintained/designed toolchain. It's even better with a powerful chip like the PIC32. So, anything that I'm able to do to support that... well, I'm glad to do. If nothing else, the code should act as something of a stress test!

The libc dumps are certainly interesting. I'm not sure they explain what I am seeing, however; I've got only the bare minimum of string I/O in this project. What I do have, though, tends to have a fairly substantial impact on code size ( but nothing on the order of your sprintf() example ).

Again, thanks for looking at this. Even bearing in mind the architectural differences between PIC32 and AVR... I'm a little stumped. Oh well, maybe I need to upgrade to a larger flash chip!

Martin Jay McKee


guymc

Thu, 21 Mar 2013 17:22:00 +0000

Thank you, majenko, for your helpful analysis.

Our compiler group has studied the project that Martin provided, and replies:

The -ffunction-sections and --gc-sections options are working as expected and are standard practice in the Arduino compilation paradigm. The cause here is a pure virtual function. Our C++ support libraries are compiled with RTTI and exceptions enabled. The existence of a pure virtual function causes a hook to get linked that calls std::terminate() when the pure virtual function gets called. This in turn causes a boatload of exception handling and name-demangling code to get linked. The 8-bit AVR compiler doesn’t support RTTI or exceptions. If we made the decision to disable support for RTTI and exceptions in the chipKIT toolchain, we should be able to eliminate at least 70 KB of library code from this particular project. We could probably use this as an opportunity to poll the chipKIT community in the forums and get some good feedback. In the short term, we could ask the customer to get rid of the pure virtual function.

Which leads to (at least) two questions for chipKIT developers and the community at large:

  1. Can we capture and document the issue with pure virtual functions in such a way that most users can avoid the associated code overhead without much difficulty?

  2. Should we consider disabling support for RTTI and exceptions, putting the chipKIT compiler on par with avr-gcc, but becoming less than a full C++ compiler?

Everyone, please feel free to offer your views and opinions on this important issue.

Thanks!

Guy


mckeemj

Thu, 21 Mar 2013 19:22:37 +0000

Thank you for the analysis -- some thoughts. Given that the code size issue is the result of pure virtual functions, it is simple to avoid. I would have to think on where best to place this in the documentation but I feel it would be an unjustified weakening of the compiler's standards compliance to remove RTTI and exceptions. Although pure virtual functions are useful from a code design standpoint, there is nothing in the library that requires them. It is simple enough to remove them throughout and handle errors internally. Having the issue documented allows for the choice.

Of course, if I'd looked a bit more closely, I might have found the same issue ( and solution ), but it's always difficult with "side" projects... I'm not sure that I would agree with one assertion ( that -ffunction-sections, -fdata-sections and -gc-sections lead to "sloppy" library design ). It moves the choice of what to link to a different section of the build process ( to garbage collection ), but it does allow for combining functions, logically, in implementation files with impunity. This can lead to libraries that are structurally easier to understand. Not a big deal... a difference of opinion.

So, how to deal with pure virtual functions... the issue is what level of detail the typical user is willing to wade through. Then again, the declaration of virtual functions more or less implies working at a library level; as such, the documentation shouldn't need to be targeted at neophyte users.

There are, it seems to me, two issues. The first is the use of pure virtual functions and the second the inclusion of full RTTI and exceptions. Pure virtual functions are useful from a design standpoint but, if a user is disciplined, there need be no difference between an abstract base class and a concrete base class. So they are an unnecessary abstraction from the standpoint of functionality ( code safety is another issue ). Thus, documenting that pure virtual functions are supported ( which is not, across the board, true with avr-gcc ) but that they can lead to excessive bloat seems a reasonable minimum. Additionally, simply saying that converting a pure virtual function to one that has a "safe" null implementation can eliminate that bloat also seems a reasonable work-around. This documentation could be logically connected with either "C++ Language Support" or "Creating Libraries" type topics. Additional information on the source of the bloat does seem reasonable also, though in a separate location.
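To make the work-around concrete, here is a rough sketch of the conversion. The class names are hypothetical and are not the ones in my library; the point is only the change from "= 0" to an empty default body.

```cpp
// Hypothetical classes for illustration only.
//
// Abstract version. The "= 0" is what makes the vtable reference the
// pure-virtual hook described by the compiler group:
//
//   class Controller {
//   public:
//       virtual int update(int input) = 0;
//   };

// Work-around: a concrete base class with a harmless default body.
class Controller {
public:
    // "Safe" null implementation: does nothing and reports 0.
    virtual int update(int input) { (void)input; return 0; }
};

class Doubler : public Controller {
public:
    virtual int update(int input) { return input * 2; }
};

// Callers are unchanged; the call is still dispatched virtually.
int runController(Controller& c, int input) {
    return c.update(input);
}
```

The cost is in code safety: forgetting to override update() is no longer a compile-time error, it just silently runs the null implementation.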

Thanks again for looking into this. Always good to know that ( while I may be blind! ) I'm not crazy. With the modifications, the libraries should be just as lean as I had hoped.

Martin Jay McKee


jasonk

Thu, 21 Mar 2013 20:10:36 +0000

Another possibility might be to change the library so that it does not call std::terminate():

extern "C" void
__cxxabiv1::__cxa_pure_virtual (void)
{
  writestr ("pure virtual method called\n");
  std::terminate (); // Possibly replace with a reset or while(1) loop.
}

I'd argue that the functionality in terminate() doesn't really apply to the chipKIT environment.
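Something similar could perhaps be tried from the sketch side, without rebuilding the library, by providing a replacement handler. This is only a sketch under an assumption I have not verified on the chipKIT toolchain: that the linker takes a definition in the user's code in preference to the one in the C++ support library. The pureVirtualHalt flag is a hypothetical name.

```cpp
// Assumption (unverified on chipKIT): a definition in the sketch is used in
// place of the one in the C++ support library, so the library's version,
// and the std::terminate() call inside it, is never linked.

// Hypothetical flag; volatile so the compiler cannot optimise the loop away.
static volatile bool pureVirtualHalt = true;

extern "C" void __cxa_pure_virtual(void) {
    // Deliberately no call to std::terminate() here.
    while (pureVirtualHalt) {
        // Spin forever; a board-specific reset could go here instead.
    }
}
```

Whether this actually keeps the exception-handling code out of the image would need to be checked against the linker map.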