chipKIT® Development Platform

Inspired by Arduino™

Tracking down (possible) memory problems?

Created Fri, 04 Sep 2015 08:57:42 +0000 by rasmadrak


rasmadrak

Fri, 04 Sep 2015 08:57:42 +0000

Hi there!

I've come a long way with my pinball machine, but got sidetracked and created a family for a while there. ;) However, as of late it has started to act up. The machine, that is.

This is a rather long post and possibly quite confusing, but you'll have my gratitude for reading it... I make intricate use of almost all pins on my ChipKit MAX32, including DSPI0 and DSPI1, softPWMServo, Serial1 and Serial3, SD cards, wavTrigger, and LED displays updated via interrupts on Timer3. It all works really well, but occasionally the ChipKit crashes for no apparent reason. The very same functions work perfectly in the idle/settings menu, but during game mode it crashes. My loop is (very simplified) like this:

updateServos
updateSDCard
if (game running) player->updateGame() else updateIdle()
updateSwitches
updateLights
updateSound
updateDrawbuffer
drawGraphics

I suspect a possible memory problem, for instance that there is no more RAM available for function calls. I'm dynamically allocating all large arrays and they quite possibly consume a lot of RAM, but as I wrote, they work fine when the machine is idling. Aha, you say, it's something in the logic that causes the crash. Well, no, it shouldn't be: the same "hardware" code runs in idle mode as well. The rendering writes to a backbuffer and draws from a frontbuffer, and the switching is done via a volatile bool once the buffer is complete, so it shouldn't be access violations either.

Commenting out some of the very same functions at random inside the game loop seems to get it running without crashing, but it makes no sense. Why would, for instance, "playAnimation()" suddenly cause the machine to crash when the very same function runs perfectly outside the game logic function? The functions that get disabled contain no heavy calculations and no extensive use of temporary variables (at most, a couple of bytes' worth), as they only modify fixed data.

I don't think there's a problem with interrupts or serial ports for the reasons stated above. It seems to me that something somewhere decides to overwrite a crucial part of the memory and thus causes the ChipKit to crash. I say this because I can do a simulated game (i.e. flash lights, servos, motors, sound, videos etc.) without any crashes or hiccups during idle, for instance by triggering everything all at once at the press of a button. I can hammer it like crazy without problems...

I've increased the minimum stack and heap sizes, which helped me historically with large sketches, but it has done very little since I moved to dynamically allocated memory instead of static. I've also made slight modifications to the MPIDE libraries (to prevent DSPI/serial conflicts, for instance), but they work great, no problem there. I've tried using a lot of "const" lately to push larger static lookup tables/arrays into flash instead of RAM. Sketch size is ~320 KB, so there's space available.

What I HAVE done lately:

  • Rewritten most of the core
  • Changed switches, solenoid and rendering to interrupt driven updates on Timer3 (tried Timer5 as well, no difference)
  • Added a large const array of animation information. (tried to cut it in half, made no difference for the crashes)
  • Added communication over Serial1@57600 to wavTrigger and Serial3@115200 to Arduino Mega2560
  • Added softPWMServo
  • Added EEPROM reading/writing (Unused during game mode)
  • Switched to dynamically allocated data (this was AFTER the crashes were observed, and it's never deleted/recreated - only once during startup)

All changes work perfectly (as in amazingly well, actually...) during idle. So my questions, should you choose to answer them, are:

  • Is there a way to see memory usage, or calculate if RAM is the culprit? (found this: http://www.chipkit.net/forum/viewtopic.php?f=7&t=1596)
  • Is "const" enough to force ANY variable or class into flash memory, or are there special conditions? (found this: http://www.chipkit.net/forum/viewtopic.php?f=6&t=2141&p=7992 , not sure if it allows ANY variable or not however.)
  • Is EEPROM a reserved space, or could program data/usage find its way into it?
  • Is there a limitation on a class's size on ChipKit? All game functions are part of a rather large (previously static, now dynamically allocated) player class.

majenko

Fri, 04 Sep 2015 10:01:36 +0000

You mention the use of volatile. I assume then you have some interrupt driven code? Are you protecting external access to those volatile variables through the use of critical sections?


rasmadrak

Fri, 04 Sep 2015 12:41:11 +0000

Yes, I have an interrupt on Timer3, as I stated in the (wall of) text. :) The interrupt handles rendering one line at a time, plus solenoids and switches, spread in a 1:32:256 ratio. In the main loop a frame gets constructed in a backbuffer, and when it is completed the volatile bool is toggled and the drawbuffer points towards the newly built buffer. The rendering always renders the currently active frontbuffer.
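The buffer hand-over described above can be sketched roughly as follows. This is a minimal illustration, not the poster's code: it swaps a volatile pointer instead of toggling a volatile bool, and the buffer size (`kWidth`, `kHeight`) and all function names are assumptions made up for the example.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical display size; the thread only states that there are 32 lines.
constexpr int kWidth = 128;
constexpr int kHeight = 32;

static uint8_t bufferA[kWidth * kHeight];
static uint8_t bufferB[kWidth * kHeight];

// Read by the timer interrupt, written only by the main loop.
static uint8_t* volatile frontBuffer = bufferA;
// Touched only by the main loop.
static uint8_t* backBuffer = bufferB;

// Main loop: build the next frame in the back buffer (here just a fill).
void clearBackBuffer(uint8_t value) {
    std::memset(backBuffer, value, sizeof(bufferA));
}

// Main loop: publish the finished frame by exchanging the two pointers.
// The interrupt only ever reads frontBuffer, so there is a single writer.
void swapBuffers() {
    uint8_t* finished = backBuffer;
    backBuffer = frontBuffer;
    frontBuffer = finished;
}

// Interrupt side: fetch one line of the frame that is currently on display.
const uint8_t* lineToRender(int line) {
    return frontBuffer + line * kWidth;
}
```

With only one side writing the pointer, the interrupt sees either the old frame or the new one, never a half-built one, provided the main loop stops drawing into a buffer before publishing it.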

I did check quickly on critical sections, but I thought volatile use and the lack of true multitasking was enough protection against this? Aren't volatile bools atomically toggled? But I figure disabling interrupts during the toggling is easy enough to test, so I will definitely do that. But as I wrote, everything works super during idle. It could of course be that the game-loop takes ever so slightly more resources and time to execute that somehow the interrupts crash into each other etc. I will try to get the whole spectacle on video later today to further showcase the issue. :)

I am reading switches and toggling solenoids outside of the interrupt, as well as reading and updating switches and solenoids inside the interrupt. I figure the critical sections should be around variables that are shared. Should I do this inside the interrupt as well, or only outside (i.e. in the main loop) to prevent the interrupt from firing?


majenko

Fri, 04 Sep 2015 12:58:03 +0000

Critical sections are used around shared variables when accessed outside the interrupt that can modify them. The idea is that it's possible for the interrupt to fire mid-way through making modifications to the variable and change it while it's being changed, thus resulting in strange behaviour and crashes.

The purpose of "volatile" is purely to tell the compiler that "this variable is subject to being changed outside the normal procedural flow of the program" so it a) doesn't optimize it out when it thinks it's never going to change, and b) always retrieves it direct from memory before doing anything with it.

Any operation on a volatile variable will take multiple instructions - it needs to load the address of the variable and then load the content of the variable, then do whatever it needs to do to the variable, and possibly write it back to memory if it's been changed in the course of that operation. At any point during that time it's feasible that the interrupt could fire and change the variable. If you're just reading in one place and writing in the other (reading in the interrupt and writing in loop(), say) then it shouldn't be a problem because only one place is ever doing the writing of data to memory. However, if you're writing in both loop() and the interrupt then there's the potential for things going wrong. Booleans, though, don't really have much of a problem in this regard since they only have two states - it's hard to end up with an invalid value in them.
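As a rough sketch of the "written on both sides" case: the main loop takes and clears a value that the interrupt also modifies, so the read-modify-write in the main loop is wrapped in a critical section. The two `...Stub` functions are host-side stand-ins invented for this example so it compiles anywhere; on the target they would be replaced by the platform's own interrupt disable/restore calls (the exact names there are an assumption to check against your core).

```cpp
#include <cstdint>

// Host-side stand-ins (hypothetical) for the platform's interrupt control.
static bool interruptsEnabled = true;

static bool disableInterruptsStub() {
    bool wasEnabled = interruptsEnabled;
    interruptsEnabled = false;
    return wasEnabled;
}

static void restoreInterruptsStub(bool wasEnabled) {
    interruptsEnabled = wasEnabled;
}

// Shared state written by BOTH the interrupt and the main loop.
static volatile uint32_t pendingSwitchEvents = 0;

// Interrupt side: record newly detected switch events.
void onSwitchInterrupt(uint32_t mask) {
    pendingSwitchEvents = pendingSwitchEvents | mask;
}

// Main loop side: take all pending events and clear them in one protected step.
// Without the critical section, an event arriving between the read and the
// clear would be lost.
uint32_t takeSwitchEvents() {
    bool wasEnabled = disableInterruptsStub();
    uint32_t events = pendingSwitchEvents;
    pendingSwitchEvents = 0;
    restoreInterruptsStub(wasEnabled);
    return events;
}
```

Restoring the previous state, rather than unconditionally re-enabling, keeps the helper safe to call from code that already runs with interrupts off.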

By your description though it does sound like there may be too much going on in your interrupt routine and it's taking too long to process before it needs to trigger again. What frequency is the interrupt triggering at? Do you clear the flag at the start or the end of the interrupt routine?


majenko

Fri, 04 Sep 2015 13:01:03 +0000

One other useful trick is to light an LED when you enter the interrupt routine and turn it off when you leave. When it crashes, if the LED is on, then it's crashed in the interrupt routine - otherwise the crash has happened outside the interrupt.
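A minimal sketch of that trick, with the LED replaced by a plain flag so the example runs off-target. `setDebugLed` and `timerInterruptBody` are hypothetical names; on real hardware `setDebugLed` would write directly to a port pin.

```cpp
// Stand-in for a real LED pin; on hardware this would be a port write.
static volatile bool debugLedOn = false;
static volatile unsigned linesRendered = 0;

static void setDebugLed(bool on) {
    debugLedOn = on;
}

// Body of the timer interrupt: LED on at entry, off at exit.
// If the system hangs with the LED lit, it stopped somewhere in between.
void timerInterruptBody() {
    setDebugLed(true);
    linesRendered = linesRendered + 1;  // placeholder for the real work
    setDebugLed(false);
}
```

At high interrupt rates the LED will appear dimly lit during normal operation, so it is the state after the freeze that matters.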


rasmadrak

Fri, 04 Sep 2015 13:11:32 +0000

Neat, I will try this as well!


rasmadrak

Fri, 04 Sep 2015 13:22:20 +0000

I'm currently triggering the interrupt at 15360 Hz (60 fps, 8 layers/colors, 32 lines). The code for handling everything is very neat, with exclusively port writing and reading, no digitalWrite for instance. I clear the interrupt flag immediately after rendering a line; after 32 lines I do a solenoid check, and after 256 lines I check the switches.
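The numbers above work out as follows. The 80 MHz CPU clock is an assumption (the Max32's usual configuration, not stated in the thread); everything else comes from the figures in the post.

```cpp
#include <cstdint>

constexpr uint32_t kFramesPerSecond = 60;
constexpr uint32_t kLayers = 8;
constexpr uint32_t kLinesPerLayer = 32;

// 60 * 8 * 32 = 15360 interrupts per second.
constexpr uint32_t kInterruptHz = kFramesPerSecond * kLayers * kLinesPerLayer;

// Assumption: CPU core clock of 80 MHz.
constexpr uint32_t kCpuHz = 80000000u;

// Time and cycle budget available to each interrupt (rounded down).
constexpr uint32_t kPeriodNs = 1000000000u / kInterruptHz;       // about 65 us
constexpr uint32_t kCyclesPerInterrupt = kCpuHz / kInterruptHz;  // about 5200 cycles

// Rates of the slower jobs that piggyback on the line interrupt.
constexpr uint32_t kSolenoidCheckHz = kInterruptHz / 32;
constexpr uint32_t kSwitchCheckHz = kInterruptHz / 256;

static_assert(kInterruptHz == 15360u, "matches the rate quoted in the post");
```

So each pass through the interrupt, including entry and exit overhead, has to fit in roughly 65 microseconds, and whatever it uses is taken away from the main loop.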

Every 256 lines there's "double" work at the moment as both switches and solenoids are updated, but the error is far too consistent for me to be hitting the exact time of a conflict. I can trigger "thousands" of events in idle mode without problems, and none during game mode, unless I comment out some random piece of code, i.e. if I comment out playAnimation the displayScoreboard works, and vice versa. In idle mode they both work simultaneously. The seemingly random nature of the commenting out makes me believe that it's the actual recompilation of the program that aligns data in a way that is less prone to crashing. Again, this is during game mode. Idle is solid...

The video will show what I mean. :)


rasmadrak

Fri, 04 Sep 2015 13:25:12 +0000

There's of course a chance I'm not hitting 60 fps in case the function takes longer than I believe, but the display is rock solid with no flickering, not even peripherally. I'm rather sensitive to flicker and this doesn't flicker. This leads me to at least believe it's performing like it should. :)

Edit: The flickering is visible when shot on video however. Don't really know if that means I'm not hitting 60Hz. :D


rasmadrak

Sat, 05 Sep 2015 19:50:44 +0000

Ok, I'm back. Took a little while to get things organised, but here goes -

  • I've changed things so that almost everything uses dynamically allocated memory. For some reason I believe this is better than using the stack.
  • I've added critical sections to rendering and switch updates (solenoids are disabled for now, but they will be protected too).

But the problems persist. Switch updates are buffered, so reading always occurs from a shadow variable that the interrupt routine updates. This could be wrong and I need to investigate whether it is correct, but writing is only performed on one side of the interrupt, never on both.

I've also tried to do the LED trick, but sometimes it's on and sometimes it's not. No help there.

Here's a video of the phenomenon where I basically simulate a "game" by firing audio, servos and animations all at once by pressing a switch. It works in idle, but as soon as I press start and "launch" a ball - it crashes. The crashing doesn't happen during the same event at all times, so it's not a specific function that is broken. Sometimes videos start correctly and run perfect during game, and sometimes they run until the end of the animation and then crash the machine. The animations appear flickering in the video but this is only because they are reset back to the idle animation (with higher priority) all the time. But you can see that the first frame appears correctly...and it doesn't crash.

And mind you - everything works in idle using the very same functions, animations, SD-reading, switches, lights, audio, servos etc...

https://youtu.be/P7qdjFlQXrE


majenko

Sat, 05 Sep 2015 20:08:49 +0000

O M F G!!!

THAT'S TOTALLY AWESOME!!!!!!

Sorry - got a bit carried away there :)

Ok... so do you get any crashes if you disconnect all external hardware except the buttons and the screen? I'm wondering if you're getting some noise feeding back from the motor(s) that's causing it to crash? Or do the motors run (and not crash it) during "idle" mode?


rasmadrak

Sat, 05 Sep 2015 20:23:02 +0000

Haha, thank you! It's starting to shape up, so it's rather annoying I got this stupid issue all of a sudden - just when I was about to get to programming some game rules. :)

No, there's no problem at all during idle mode. I've got a maintenance mode where I can run everything. The "whack-it-all"-button is currently sitting on top of everything so pressing that switch causes everything to trigger regardless of which mode I am in. So when running the motor and servos I can display animations and play sound without a hitch.

I've now disabled the entire player function and am trying a game with events triggered at random. Currently compiling...


rasmadrak

Sat, 05 Sep 2015 20:47:01 +0000

Ok, so with the entire player->update() function disabled it seems to run alright even when simulating a game. No problem entering/exiting maintenance mode etc. That's a good sign, I think...

I'm now leaning towards it being either a part of the player function that is broken programmatically, or that part of the program sitting in a region of memory that is more easily affected by overflows (or similar). The player function has always worked in the past, which fooled me into thinking the problem wasn't there. Now I think that, with the added information and processing, perhaps the machine had only been avoiding conflicts by pure luck previously...

It's of course not 100% certain yet, but it's something to go on at least. :)


majenko

Sat, 05 Sep 2015 20:56:06 +0000

How complex is the player->update() function?


rasmadrak

Sat, 05 Sep 2015 21:15:27 +0000

Very much so, or rather not complex per se - the update routine itself is a state machine with different modes in it, idleLoop (not the same as machine idle), playLoop etc. But it's really "nested": almost every switch (64 of them) has its own subset of rules and features, and they all run sequentially, one after the other. They are not very complex in nature, mostly "read this switch, turn that light on, set counter to this or that".

I've tried disabling stuff previously when I started troubleshooting, but it never made sense what feature I removed as it could be something totally unrelated to what caused a crash.


rasmadrak

Sun, 06 Sep 2015 11:08:11 +0000

I think I cracked it... :)

The little guy 'Mr Vertical-line' dividing the score and animation area of the screen was the problem. At least, so it seems, as the machine has yet to crash on me since I replaced the addLine with a simple rectangle, even after removing all critical section locks and interrupt-disables etc. The line function was only ever used inside - you guessed it - the game loop, so this makes it highly likely that it was the culprit. The problem, I believe, was a while-loop that was never exited in case it missed its target for some reason. As the direction never changes (why would it...) during the loop, it would simply loop infinitely. I believe that since it doesn't exit the loop() function, interrupts would never fire either. (Don't quote me on that though.)

As I'm a sceptical person, I'm not writing it off as solved just yet... But lesson learned - Never assume "it has worked before" to be a testament of a code's validity. :)

Looking at the line drawing code now, it seems something is rather broken with it. But since it's not used I updated it for keepsake with a max counter and check if x/y is valid in case it does fail again. That way it won't crash, at least. There's little to no random drawing taking place so in case I do decide to add a random "line rain" or something like that, it should survive. I think it's rather weird that it failed to hit home with a straight line, but the code was derived from Wikipedia if I recall correctly and was one of the first functions ever written for this machine and hasn't been touched since.

Here's what the original code looked like, in case you want to break it down:

void addLine(int x0,int y0,int x1,int y1, byte brightness)
{
	int dx = abs(x1-x0);
	int dy = abs(y1-y0);
	int sx,sy;
	float err,e2;
	if (x0 < x1) sx = 1; else sx = -1;
	if (y0 < y1) sy = 1; else sy = -1;
	err = dx-dy;

	while (true)
	{
		drawPixel(x0,y0,brightness);
		if (x0 == x1 and y0 == y1) break;
		e2 = 2*err;
		if (e2 > (dy*-1))
		{
			err = err - dy;
			x0 = x0 + sx;
		}
		if (e2 < dx)
		{
			err = err + dx;
			y0 = y0 + sy;
		}
	}
}
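For comparison, here is a sketch of the kind of guarded version described above: integer-only arithmetic, a cap on the number of steps, and a bounds check on every pixel. It is not the poster's updated code. The display size, the `frame` array, and the names `drawPixelChecked` and `addLineGuarded` are assumptions for illustration.

```cpp
#include <cstdint>
#include <cstdlib>

// Hypothetical display size; the thread only states that there are 32 lines.
constexpr int kDisplayWidth = 128;
constexpr int kDisplayHeight = 32;

static uint8_t frame[kDisplayHeight][kDisplayWidth];

// Silently ignore pixels outside the framebuffer instead of writing past it.
static void drawPixelChecked(int x, int y, uint8_t brightness) {
    if (x < 0 || x >= kDisplayWidth || y < 0 || y >= kDisplayHeight) return;
    frame[y][x] = brightness;
}

// Same stepping scheme as addLine above, but with integer error terms and a
// hard upper bound on iterations. Returns the number of pixels visited,
// or -1 if the step cap was reached without arriving at the endpoint.
int addLineGuarded(int x0, int y0, int x1, int y1, uint8_t brightness) {
    const int dx = std::abs(x1 - x0);
    const int dy = std::abs(y1 - y0);
    const int sx = (x0 < x1) ? 1 : -1;
    const int sy = (y0 < y1) ? 1 : -1;
    int err = dx - dy;

    // A line never needs more than dx + dy + 1 steps.
    const int maxSteps = dx + dy + 1;
    for (int step = 0; step < maxSteps; ++step) {
        drawPixelChecked(x0, y0, brightness);
        if (x0 == x1 && y0 == y1) return step + 1;
        const int e2 = 2 * err;
        if (e2 > -dy) { err -= dy; x0 += sx; }
        if (e2 < dx)  { err += dx; y0 += sy; }
    }
    return -1;
}
```

The bounds check matters as much as the step cap: a line function that is handed a bad coordinate will otherwise happily write outside the framebuffer, and that kind of stray write can corrupt unrelated data rather than hang the loop.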

Now... all I need is a servo library that doesn't use interrupts as the interrupts mess with the rendering interrupt... :)


majenko

Sun, 06 Sep 2015 15:51:39 +0000

Congratulations!

If you're interested, here is the line drawing routine from my DisplayCore library:

static void inline swap(int16_t &i0, int16_t &i1) {
    int i2 = i0;
    i0 = i1;
    i1 = i2;
}

void DisplayCore::drawLine(int16_t x0, int16_t y0, int16_t x1, int16_t y1, uint16_t color) {
    startBuffer();
    int16_t steep = abs(y1 - y0) > abs(x1 - x0);
    if (steep) {
        swap(x0, y0);
        swap(x1, y1);
    }

    if (x0 > x1) {
        swap(x0, x1);
        swap(y0, y1);
    }

    int16_t dx, dy;
    dx = x1 - x0;
    dy = abs(y1 - y0);

    int16_t err = dx / 2;
    int16_t ystep;

    if (y0 < y1) {
        ystep = 1;
    } else {
        ystep = -1;
    }

    for (; x0<=x1; x0++) {
        if (steep) {
            setPixel(y0, x0, color);
        } else {
            setPixel(x0, y0, color);
        }
        err -= dy;
        if (err < 0) {
            y0 += ystep;
            err += dx;
        }
    }
    endBuffer();
}

rasmadrak

Sun, 06 Sep 2015 16:48:26 +0000

Thanks! That looks like a bit more polished than my crude attempt... :D

Well... Time to write some servo code and then game rules!


rasmadrak

Thu, 10 Sep 2015 22:53:57 +0000

Sad to say I'm back...

It does crash again, it just takes a bit longer. I found a couple of writes outside the framebuffer that caused the reproducible errors, but even with those fixed I can still cause the machine to crash by playing for a while. I believe I'm running out of memory due to String usage, where String objects are never really freed and eventually crash the system. My own allocations are only created once and never freed, so they are probably innocent here. The problem is pretty likely since I've never patched the WString or malloc files, or updated MPIDE during the last 1-2 years.

  • I will try to implement a RAM-watcher to confirm if this is the case.
  • I will try a newer version of MPIDE where WString has been updated.
  • I will also try to minimise the usage of String-objects (been mostly using them out of comfort).
  • Update: I will try to avoid the use of the String-class completely. :P

Also - I found out that floating point operations sometimes turn shady and produce variable results, i.e. the result is not always consistent even with fixed or integer input. It usually is, but sometimes it's not. It's almost like there's a random value thrown in for fun in there... :) I believe this is what caused my original addLine function to crash, i.e. sometimes it would miss the endpoint due to a rounding error. This feels really fishy though and is probably mostly me speculating.

To be continued...


majenko

Thu, 10 Sep 2015 23:14:28 +0000

Here's a tip:

Never use the String class

It's a crutch for people who can't program. You're not one of them, so don't use it :P


rasmadrak

Fri, 11 Sep 2015 07:52:13 +0000

That bad, huh? :D But...but...it's convenient. :P

I'll see what can be done!


majenko

Fri, 11 Sep 2015 08:09:34 +0000

Prostitutes are convenient. It doesn't mean it's a good idea to use them :p


rasmadrak

Sun, 13 Sep 2015 21:15:52 +0000

Alright, now I THINK it's stable as a rock!

I've switched all Strings to const char* and...ta-da - const Strings (sorry Majenko!), as well as adding some generic safety precautions. I've measured RAM during idle and play and it does indeed stay fixed now. If I do a "new" without delete, the available RAM ticks down and once it is low enough it crashes, so the RAM function seems to work. There were just too many functions and special cases that I would have needed to replace, and by doing so I would most certainly have ended up with an even buggier version of the machine. So once I saw that RAM usage was static, I moved on. :)

I've also implemented the hardware watchdog, so in case it does crash it will reboot and it won't start burning (tried that, wasn't pretty). Works pretty great too! A peculiar thing though - if the watchdog was kept alive by an interrupt, the main loop could crash and the interrupt would continue. Don't ask me how, but it did. I did a divide-by-zero on purpose and the whole game loop, switches etc. froze, but the rendering kept going. So it seems the watchdog must be serviced from the main loop to have a proper effect.
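One way to cover both failure modes (a hung main loop and a dead interrupt) is to service the watchdog only from the main loop, and only if the interrupt has reported in since the last pass. This is a sketch of that pattern, not the poster's code: `kickWatchdogStub` is a host-side stand-in for whatever clears the hardware watchdog on the target, and the other names are hypothetical.

```cpp
// Host-side stand-in for clearing the hardware watchdog.
static unsigned watchdogKicks = 0;

static void kickWatchdogStub() {
    watchdogKicks = watchdogKicks + 1;
}

// Heartbeat set by the timer interrupt, consumed by the main loop.
static volatile bool interruptAlive = false;

// Called from the timer interrupt: only reports that it is still running.
void onTimerInterrupt() {
    interruptAlive = true;
}

// Called once per pass of the main loop. The watchdog is only serviced when
// the main loop got here AND the interrupt has fired since the last pass,
// so a freeze on either side ends in a watchdog reset.
bool serviceWatchdog() {
    if (!interruptAlive) return false;
    interruptAlive = false;
    kickWatchdogStub();
    return true;
}
```

The watchdog timeout has to be comfortably longer than the slowest legitimate main-loop pass, otherwise a busy frame would trigger a reset by itself.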

Lastly, what still caused crashes after everything else was fixed was actually the sound card, a Wav Trigger. If I sent "too many" requests, i.e. more than a few per 10-15 milliseconds, it would eventually crash. I'm not sure which end causes the problems, but I do know that I can crash the machine by sending a faulty command to the sound card. (RobertSonics, if you're reading this - there could be something wrong with the latest firmware. It shouldn't crash if the command is formatted correctly but unknown... But then again, the ChipKit shouldn't crash if the sound card crashes either...)

What I've done is to simply add a cooldown on each command sent to the sound card, so that it will sit and wait for its turn and then send the command. This works and no sound is ever skipped/missed, but the implementation is not nice, as everything except for solenoids and rendering is stalled in case there's lots of audio. Realistically there will probably only ever be 3-6 samples started every 300 ms during play, so perhaps I can add the cooldown only after a certain number of active tracks instead of every time, for instance.
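An alternative to waiting in place is to queue the commands and let the main loop send at most one per cooldown interval, so nothing else stalls. The sketch below is one possible shape for that; the queue size, the 15 ms cooldown and all names are assumptions, and the actual serial write to the sound card is left to the caller.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kQueueSize = 16;  // hypothetical capacity
constexpr uint32_t kCooldownMs = 15;    // hypothetical spacing between commands

struct SoundQueue {
    uint16_t tracks[kQueueSize] = {};
    std::size_t head = 0;     // index of the oldest queued track
    std::size_t count = 0;    // number of queued tracks
    uint32_t lastSendMs = 0;  // time of the most recent send
    bool sentOnce = false;    // no cooldown applies before the first send
};

// Queue a track number; returns false if the queue is full.
bool enqueueTrack(SoundQueue& q, uint16_t track) {
    if (q.count == kQueueSize) return false;
    q.tracks[(q.head + q.count) % kQueueSize] = track;
    ++q.count;
    return true;
}

// Call every pass of the main loop with the current time in milliseconds.
// Returns true and fills `track` when a command is due to be sent now;
// returns false immediately otherwise, so the loop never blocks.
bool nextDueTrack(SoundQueue& q, uint32_t nowMs, uint16_t& track) {
    if (q.count == 0) return false;
    // Unsigned subtraction stays correct across timer wrap-around.
    if (q.sentOnce && static_cast<uint32_t>(nowMs - q.lastSendMs) < kCooldownMs) {
        return false;
    }
    track = q.tracks[q.head];
    q.head = (q.head + 1) % kQueueSize;
    --q.count;
    q.lastSendMs = nowMs;
    q.sentOnce = true;
    return true;
}
```

In the main loop this would look like `if (nextDueTrack(queue, millis(), t)) { /* send play command for t */ }`, with the game logic calling `enqueueTrack` wherever it used to send directly.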

All in all, the machine is playable now with most features programmed.
I just need around 60 meters of light cable soldered to compensate for the new lightboard, and then the glass can be put on :)


majenko

Sun, 13 Sep 2015 21:55:32 +0000

Fantastical!!!

I'm glad you got that nobbled at last!