Deterministic periodic display glitch on STM32F7xx LTDC driver


nathanwiebe

I was wondering if anyone has any hints on how I can troubleshoot a problem I am having.  I am a paid customer with a commercial license and my first design using uGFX is almost production-ready, but there is a glitch in the software that is holding me up.  I am using an STM32F777 processor with an 800x480 LCD with the frame buffer located in external SDRAM.  I have a second frame buffer that I actually draw to, and I use the hardware blit/region copy functionality of the STM32's DMA2D to draw updates to the live frame buffer.  As far as uGFX is concerned, my code is very closely based on the STM32F746 discovery board demo.

The glitch lasts a very deterministic amount of time: ~67 seconds out of every ~71 minutes.  Note that this is pretty much exactly 1/64th of the time, and that 71 minutes is about 2^32 microseconds, although this might be a coincidence (I am grasping at straws to get some kind of a handle on what might be causing this issue).  The best I can describe the glitch is that the display seems to draw a fraction of the horizontal lines, then shift to a random horizontal offset and draw another bunch of lines, and then repeat.  The result is an image made up of horizontal bars, each showing the correct image but at a random horizontal offset, and the offsets seem to change on each scan, i.e. a weird flickery side-scrolling pattern.

Image quality is perfect the rest of the time.  It happens on all 5 prototype units and has been consistent all through development, and I feel like I am no closer to discovering the cause of the glitch.
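As a sanity check on the timing figures above, the arithmetic can be written out in a few lines of C. This is only a sketch: the function names are made up for illustration, and it assumes a free-running 32-bit counter with a 1 microsecond tick, which is a guess about the cause rather than anything established here.

```c
#include <assert.h>
#include <stdint.h>

/* Wrap period of a free-running 32-bit counter, in microseconds,
 * for a given tick length (hypothetical helper, not a uGFX or HAL call). */
uint64_t wrap_period_us(uint32_t tick_us)
{
    assert(tick_us > 0);
    return ((uint64_t)1 << 32) * tick_us;
}

/* Length of 1/64 of that wrap period, in microseconds. */
uint64_t glitch_window_us(uint32_t tick_us)
{
    return wrap_period_us(tick_us) / 64;
}
```

With a 1 microsecond tick, `wrap_period_us(1)` is 4294967296 us (about 71.6 minutes) and `glitch_window_us(1)` is 67108864 us (about 67.1 seconds), which is 2^26 us. Both match the observed numbers closely, so a 32-bit microsecond counter somewhere in the system is at least worth looking for.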

Because all drawing happens in Lua (a scripting language that wraps the uGFX library), I can easily lock up the Lua thread to halt all drawing operations, effectively preventing all uGFX API calls during the glitch.  Doing so seems to have some effect on the pattern, but the glitch continues until the 67 seconds are done.

I am over-the-top happy with how well uGFX is working overall, but I would love it if someone more familiar with the STM32F7 LTDC port could brainstorm with me on what might be causing this.  Please let me know if there is any additional information that I can post that would help get to the bottom of this.

First please post a link to a video showing the effect. It may help us recognise what is going on. Make sure the video shows both the starting of the glitch and the ending.

My first thought was something fishy with the dma2d copy of the framebuffer. The real test is whether the uGFX framebuffer is corrupt during that period or the display framebuffer or neither.

If the uGFX framebuffer is corrupt then we start looking at the uGFX drawing operations.

If the display buffer is corrupt then we start looking at the dma2d code.

If neither is corrupt then it must be to do with the display refresh from the display framebuffer.

Probably the easiest way to start is to use a hardware input on the board to stop operations on demand. Pass 1: after triggering the hardware input, stop any uGFX drawing (but leave the dma2d running). Trigger it during the problem and see whether the problem still resolves itself as it normally would. Pass 2: use the same mechanism, but turn off only the dma2d transfer. If the glitch carries on in both passes and only ends after the usual 67 seconds, then the problem is definitely display refresh related.

Another thing to do is to do bandwidth calculations on your external ram. For an 800x480 display at high bit color depths, check that you have plenty of bandwidth to spare when doing a refresh, a dma2d read, and a dma2d write from the external ram "simultaneously".
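A rough version of that bandwidth calculation can be sketched as follows. The helper names are hypothetical, and the 60 Hz refresh rate, the pixel sizes, and the 16-bit SDRAM bus at 100 MHz are assumptions chosen for illustration; substitute the real values from your LTDC and FMC configuration.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes per second the LTDC must fetch to refresh the visible area. */
uint64_t ltdc_refresh_bytes_per_s(uint32_t width, uint32_t height,
                                  uint32_t bytes_per_pixel, uint32_t refresh_hz)
{
    return (uint64_t)width * height * bytes_per_pixel * refresh_hz;
}

/* Theoretical peak SDRAM throughput: one bus-width transfer per clock.
 * Real throughput is lower (refresh cycles, row activation, turnaround). */
uint64_t sdram_peak_bytes_per_s(uint32_t clock_hz, uint32_t bus_width_bits)
{
    assert(bus_width_bits % 8 == 0);
    return (uint64_t)clock_hz * (bus_width_bits / 8);
}

/* Share of the peak consumed by a load, in whole percent (rounded down). */
uint32_t load_percent(uint64_t load, uint64_t peak)
{
    assert(peak > 0);
    return (uint32_t)(load * 100 / peak);
}
```

Under those assumptions, an 800x480 panel at 2 bytes per pixel needs 46080000 bytes/s for refresh alone, about 23% of a 200000000 bytes/s theoretical peak; at 4 bytes per pixel it is about 46%. A dma2d copy adds one read and one write stream on top of that, and the CPU's own SDRAM traffic comes on top again.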

Another thing to check is that you are flushing the cpu data cache before starting your dma2d copy. With the normal f746 driver this is not really required, because even if the data has not been flushed in time for this frame refresh, you will eventually see it when the cpu evicts it to make room for other data. Those gaps are not normally visible to a human. Once the display and drawing framebuffers are separated, not flushing the cpu data cache becomes very visible: updates will be missed by the dma2d, because the f7 dma does not flush or read through the cpu cache.
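As a sketch of the flush step, the helper below rounds a dirty region out to whole 32-byte cache lines (the Cortex-M7 data cache line size) so it can be handed to a clean-by-address call. `cache_span_t` and `cache_span_for` are hypothetical names. The CMSIS call is only shown in a comment because its exact argument types differ between CMSIS versions, and some versions may do this rounding themselves.

```c
#include <assert.h>
#include <stdint.h>

#define DCACHE_LINE_BYTES 32u  /* Cortex-M7 data cache line size */

typedef struct {
    uintptr_t start;  /* rounded down to a cache line boundary */
    uint32_t  size;   /* rounded up to a whole number of cache lines */
} cache_span_t;

/* Expand [addr, addr + len) to the cache lines that fully cover it. */
cache_span_t cache_span_for(uintptr_t addr, uint32_t len)
{
    const uintptr_t mask = ~(uintptr_t)(DCACHE_LINE_BYTES - 1u);
    cache_span_t span;
    uintptr_t end;

    assert(len > 0);
    span.start = addr & mask;
    end = (addr + len + DCACHE_LINE_BYTES - 1u) & mask;
    span.size = (uint32_t)(end - span.start);
    return span;
}

/* On target, before starting the dma2d copy (assumed usage, check the
 * signature in your CMSIS core_cm7.h):
 *
 *   cache_span_t s = cache_span_for((uintptr_t)dirty_ptr, dirty_len);
 *   SCB_CleanDCache_by_Addr((uint32_t *)s.start, (int32_t)s.size);
 */
```

For example, a 40-byte region starting at the illustrative address 0xC0000010 touches two cache lines, so the span becomes 64 bytes starting at 0xC0000000.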

I hope that will give you some hints on where to look.

Hello Nathan,

This is a very interesting problem. We've never encountered this before. We ran some tests ourselves yesterday with a quite sophisticated µGFX application on an STM32F746G-Discovery board. Unfortunately, we were not able to reproduce the problem.

Are you able to reproduce the problem on an STM32F746G-Discovery board? I don't want to point fingers at anyone, but let's start with the things that are easiest to check: Did you have a look at the errata of your microcontroller? I remember that one of the first STM32s to feature the LTDC had a bug that caused some artifacts on higher-resolution displays. Again: I'm not saying this is the case here; it's equally likely that the bug is in the µGFX driver, but things like that are just easier to check when starting to investigate.

Also, what version of the µGFX library are you using? There was a datatype fix four months ago:
I don't think that this has any impact here but again - let's start with the things that are simple to check :)

The above is just an addition to what @inmarket wrote.

Cache consistency on the STM32F7 platform is a pretty annoying thing to deal with.

If the problems described here are caused by the data cache, then it could be that the LTDC is reading invalid data causing the glitching. 

Making only the framebuffer RAM uncacheable will still decrease performance. However, you could try flushing the cache before the LTDC starts a DMA transfer. I have no idea whether that will kill the performance again, but it should also solve the problem.

Please post the video of the problem, do the bandwidth calculations and the testing suggested to identify the area causing the problem. Those will make a large difference to the solution required.

Personally, my feeling is that bandwidth could be a large part of this. What I suspect is happening is that at certain periods the synchronisation between the cpu clock and, say, the ltdc clock or ram clock is causing the available bandwidth to be exceeded. Another possibility is that some other operation causes the ltdc pins to be gated. When this aligns timing-wise with the ltdc refresh or possibly the dma2d operation, you get the symptoms you are seeing.

In any case the tests suggested will give you a lot more information on where to look.

Hello! I'm away from my machine these days and will continue testing on Monday. Before that I will read up on the caches and the LTDC appnote, in particular the parts on Cortex-M7 specifics. I will make some videos on Monday/Tuesday and upload them here.

I think that either some operations inside the M7 core cause the LTDC line/frame refresh process to halt for a moment, which shows up as a glitch on the screen, or this is a data cache problem. I will read about this...

I've spent some days with the 746 and found out that __arm_eabi_f2d (I don't remember the name exactly, but it is the function/intrinsic that converts floats to doubles) causes the glitch. Sometimes... But when it happens, it really does happen as you step through it in the debugger: just "Step over" and the display flickers.

A huge thanks for all the helpful suggestions.  I will look into all of these things, although I have to apologize for being slow to look into this because of other project demands.  For now, here is a quick update on my investigation:

As I mentioned, it takes 27 minutes to get the glitch to happen, and it only lasts 67 seconds, so gathering info is extremely inefficient.  I have stepped through various parts of the code during the glitch and nothing sticks out.  I can see some flickers as I single-step, but haven't correlated it to a particular area of code or RTOS thread priority level.  When the glitch starts, I can lock up my highest priority thread and the glitch stops, suggesting that something may be competing with the SDRAM for bandwidth during this time.  I don't, however, have any application code that fits this pattern of memory usage.  My next step (as time allows), is to remove the individual components of the application and driver code one by one to find the culprit, but that will take some time.

I should mention that the reason I am thinking that this is a single 'culprit' chunk of code rather than a general problem like overall SDRAM bandwidth is because the display looks beautiful 98% of the time with the full application running (the application code is all written in lua, and the lua heap is located in SDRAM).  As for possibilities such as caching and/or FPU interactions, my use of floating point numbers is consistent throughout the code, and my glitch lasts exactly 67 seconds out of every 71 minutes, so I don't see that as a good fit.  I have also essentially ruled out corruption of the contents of my SDRAM because the remainder of the SDRAM is the lua heap, and that would hard fault the whole system real quick if the odd byte were being corrupted.

I also appreciate the suggestion that the LTDC or DMA2D clocks may be in some state of synchronization during the glitches (67 seconds out of 71 minutes, exactly 1/64th of the time), causing my effective available SDRAM bandwidth to decrease.  However, as I mentioned, I can halt the lua engine (halting virtually all SDRAM access) during the glitch and it continues for the same exact amount of time.  I have also read out the SFR register bank of the LTDC during that time to make sure nothing is accidentally getting changed, and it all looks good.  I guess I could do the same with the GPIO (IO directions, speed config, and alternate function selections)...

Either way, thanks for all the suggestions!

1 hour ago, nathanwiebe said:

 I can see some flickers as I single-step, but haven't correlated it to a particular area of code or RTOS thread priority level.

Exactly the same as I get. But for the 746 I've found the place: floating point calculations. At first I found that atan2() causes the glitch, and then found out the reason. But the glitch could happen once per hour, and now I think that this is a question of caches. Try disabling the D-Cache and check.

Have you looked at the AN4861, AN4838 and AN4839 documents from ST? There are very interesting parts in AN4861 (4.6 Special recommendations for Cortex®-M7 (STM32F7 Series), 6.2.7 MPU and cache configuration - especially for the M7 core). I've read it carefully and wrote down some points to try. Moreover, it says that the CPU clock should be 200 MHz maximum (where the SDRAM clock will be 100 MHz), and I run it at 216...

Good news!  The bug has been fixed.  Unfortunately I am able to offer less clarity on the exact solution than I had hoped.  As I mentioned, my specific display glitch started 26:49 after boot, continued for 67 seconds, and repeated every 71 minutes.  I believe the root cause to have been related to something thrashing and eating up too much SDRAM bandwidth.  I hoped to narrow down precisely which lines of code this was, however this did not happen before the bug was solved another way.

FreeRTOS on Cortex-M parts forbids making any OS API call from within a critical section.  Probably 99.99% of the time you'll be lucky and get away with it, but sometimes it'll hang the OS or do some other unpredictable thing.  I had an occasional boot issue (unrelated to the LCD) that I narrowed down to a race condition caused by a syscall from within a critical section, and decided to go through all critical sections in the whole codebase to make sure there were no syscalls from within them.  Lo and behold, I found 10.  Most were in my code, but one was in an ATMEL WiFi driver, and one was in uGFX's FreeRTOS port for the GFX layer (which I should probably report as a bug).  Anyways, after fixing these... poof... no display glitch.

So the solution is far less satisfying than finding a specific line of code that was actually chewing up the SDRAM bandwidth, but it is an important lesson nonetheless.  And it uncovered a bug in uGFX, so I guess we all win.

Hello! I hope it helps, but I wonder what these unpredictable conditions are that cause the HW display controller (LTDC) to miss its sync. It even happens when stepping through the code in the debugger.

So, as I wrote before, I looked into the clocks and the MPU (cache) configuration. I changed the clocks to 200 for HCLK and 100 for APB1, turned off the Overclock, enabled BURST in SDRAM and set the MPU region for video memory to WT (write-through). Yesterday it worked well. I will keep an eye on it over the next few days. I had no critical sections in my code, only ..Suspend and ..Resume of the scheduler during the videobuffer switch (part 4.4.2 in ST's AN4861).

10 hours ago, nathanwiebe said:

Lo and behold, I found 10.  Most were in my code, but one was in an ATMEL WiFi driver, and one was in uGFX's FreeRTOS port for the GFX layer (which I should probably report as a bug).

We'd definitely appreciate it if you could create another forum topic regarding this bug.

Just wondering: Are you using the latest master branch of the µGFX library? There was a major rework of the FreeRTOS port about 6 months ago. I haven't checked but AFAIK that didn't make it into v2.7.

  • 4 weeks later...

I am using what I think is the stock 2.7.0 release rather than following the repo, so I am likely using an old FreeRTOS port.  In any case, it is a subtle bug, but not too hard to search for.  Searching for the text portENTER_CRITICAL and taskENTER_CRITICAL will find all critical sections in the FreeRTOS/uGFX interface code, and it should be checked that there is no FreeRTOS API usage (even non-blocking calls such as a binary semaphore give - see the note in the FreeRTOS critical section documentation) within the critical sections.  In the version I have (stock 2.7.0 I think), gos_freertos.c:139 contains a semaphore give inside a critical section in the gfxSemSignal() function.
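To illustrate the pattern being described, here is a host-side sketch that compiles without FreeRTOS. It is not the uGFX source: `enter_critical`, `exit_critical`, `sem_give` and `counted_sem_t` are hypothetical stand-ins for taskENTER_CRITICAL(), taskEXIT_CRITICAL(), a semaphore give, and a counted semaphore wrapper. The stub `sem_give` simply refuses to run inside a critical section so the difference between the two shapes is observable.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins so the sketch builds on a host; all names are hypothetical. */
int critical_nesting = 0;
int gives = 0;

void enter_critical(void) { critical_nesting++; }
void exit_critical(void)  { assert(critical_nesting > 0); critical_nesting--; }

/* Returns false when called inside a critical section, mimicking the
 * rule that the RTOS API must not be used there. */
bool sem_give(void)
{
    if (critical_nesting != 0) {
        return false;
    }
    gives++;
    return true;
}

typedef struct { int count; int limit; } counted_sem_t;

/* Problematic shape: the give happens while the critical section is held. */
bool signal_inside(counted_sem_t *s)
{
    bool ok = true;
    enter_critical();
    if (s->count < s->limit) {
        s->count++;
        ok = sem_give();          /* API call inside the critical section */
    }
    exit_critical();
    return ok;
}

/* Restructured shape: only the counter update is protected, and the give
 * is deferred until the critical section has been left. */
bool signal_outside(counted_sem_t *s)
{
    bool need_give = false;
    enter_critical();
    if (s->count < s->limit) {
        s->count++;
        need_give = true;
    }
    exit_critical();
    return need_give ? sem_give() : true;
}
```

Whether deferring the give like this is acceptable depends on how the counter and the underlying semaphore are used together, so treat it as one possible restructuring rather than the fix that was actually applied.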
