Sorry, something went wrong - it's timing out

I have my MIAB running on a Digital Ocean droplet, which was working fine, but as I’ve added more domains my MIAB has struggled. Now a lot of pages don’t load, instead giving me a “Sorry, something went wrong” message. I know what the problem is…

time ./status_checks.py
.....
real	1m48.639s
user	1m43.367s
sys	0m1.693s

Specifically, after

Network
=======
✓  Firewall is active.
✓  Outbound mail (SMTP port 25) is not blocked.
✓  IP address is not blacklisted by zen.spamhaus.org.

There’s a delay you could sleep through between each domain check.
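
To pin down which domain causes that delay, each check can be timed on its own. This is a minimal sketch only: check_domain() is a hypothetical stand-in for the real per-domain work and is not part of Mail-in-a-Box. Here it simply sleeps longer for names starting with “parked” to simulate a slow lookup.

```python
import time
from typing import Callable, Iterable


def time_checks(domains: Iterable[str], check: Callable[[str], object]) -> list[tuple[str, float]]:
    """Run `check` on each domain in turn and return (domain, seconds), slowest first."""
    timings = []
    for domain in domains:
        start = time.perf_counter()
        check(domain)
        timings.append((domain, time.perf_counter() - start))
    # Sort so the slowest domains are listed first.
    return sorted(timings, key=lambda item: item[1], reverse=True)


def check_domain(domain: str) -> None:
    """Hypothetical stand-in for a real per-domain check (DNS, TLS, ...)."""
    time.sleep(0.2 if domain.startswith("parked") else 0.01)


if __name__ == "__main__":
    domains = ["example.com", "parked1.example", "www.example.org"]
    for domain, seconds in time_checks(domains, check_domain):
        print(f"{domain:25} {seconds:6.2f}s")
```

Replacing check_domain with a real lookup against your own domains would show whether the slow ones are the parked domains or the ones in use.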

My DO droplet has been running fine and not costing me much money for years. My question is whether it’s time to add vCPUs and/or memory to the droplet. Will I get a performance increase that benefits MIAB? Does anyone else use a DO droplet above the specification outlined in the documentation and host a lot of domains? 17 in my case: 3 are in use, 5 just serve web pages and the rest are “parked” for various things I was planning to do and then never did. But the status checks check all of them, regardless of what they’re doing.

Is there an easier path where I tweak the timeouts? I haven’t looked into this and assume it would involve modifying the code rather than setting a preference.

Thanks

Steve

What version are you on? I believe for v61 the timeout should be over 10 minutes …
Let’s make sure: in the output of journalctl -u mailinabox you will probably see an exception. Can you post that?
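
If the journal is long, the exception can be pulled out of that output with a small script. A sketch, assuming the exception is logged in Python’s standard “Traceback (most recent call last):” format behind journalctl’s usual date/host/unit[pid] prefix; find_tracebacks and read_journal are hypothetical helper names, not MIAB tools.

```python
import subprocess
from typing import List, Optional


def message_of(journal_line: str) -> str:
    """Strip the 'date host unit[pid]: ' prefix journalctl puts before each message."""
    _, sep, message = journal_line.partition("]: ")
    return message if sep else journal_line


def find_tracebacks(log_text: str) -> List[str]:
    """Collect each Python traceback in the log as one string.

    Assumption: frames are indented and the first unindented line after the
    'Traceback' header is the exception itself (chained exceptions get split).
    """
    tracebacks: List[str] = []
    current: Optional[List[str]] = None
    for line in log_text.splitlines():
        message = message_of(line)
        if message.startswith("Traceback (most recent call last):"):
            current = [message]
        elif current is not None:
            current.append(message)
            if not message.startswith((" ", "\t")):
                tracebacks.append("\n".join(current))
                current = None
    return tracebacks


def read_journal(unit: str = "mailinabox") -> str:
    """Return the journal for a systemd unit (run on the box itself, as root)."""
    return subprocess.run(
        ["journalctl", "-u", unit, "--no-pager"],
        capture_output=True, text=True, check=True,
    ).stdout
```

On the box, `for tb in find_tracebacks(read_journal()): print(tb)` would print each traceback found.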

Thanks for replying, here are the answers:

root@box:~# ~/mailinabox/management/status_checks.py

System
======
✓  All system services are running.
✓  SSH disallows password-based login.
✓  Mail-in-a-Box is up to date. You are running version v61.1.

It’s currently 1:14pm where I am. I’m going to load the status page and wait for it to fail…

Right…

root@box:~# journalctl -u mailinabox

blah blah (yes, I did scroll down)

Apr 21 06:25:39 box.23wwc.email start[415072]: reconfig start, read /etc/nsd/nsd.conf
Apr 21 06:25:39 box.23wwc.email start[415072]: ok
Apr 21 06:25:39 box.23wwc.email start[415073]: ok
Apr 21 10:55:41 box.23wwc.email start[447514]: {SHA512-CRYPT} xxxx etc (verified)

So the last journal entry is from nearly 3 hours ago and I can’t see any exceptions. However, the nginx error.log does show:

2023/04/21 13:25:45 [error] 316407#316407: *2620 upstream timed out
(110: Unknown error) while reading response header from upstream,
client: <IP Address Removed>, server: box.fake.domain,
request: "POST /admin/system/status HTTP/2.0",
upstream: "http://127.0.0.1:10222/system/status",
host: "box.fake.domain", referrer: "https://box.fake.domain/admin"

So there’s the timeout that causes the “Sorry, something went wrong”.
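
To see how often this happens and for which requests, the timeout entries can be pulled out of the nginx error log. A sketch, assuming each entry is a single line in the format shown above (the post wraps it over several lines for readability) and that the log lives at the default path /var/log/nginx/error.log; upstream_timeouts is a hypothetical helper name.

```python
import os
import re
from datetime import datetime

# Matches nginx "upstream timed out" error lines and captures the timestamp,
# the request line and the upstream URL (assumes the single-line log format).
TIMEOUT_RE = re.compile(
    r'^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[error\].*upstream timed out'
    r'.*request: "(?P<request>[^"]*)".*upstream: "(?P<upstream>[^"]*)"'
)


def upstream_timeouts(lines):
    """Yield (timestamp, request, upstream) for each upstream timeout entry."""
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            ts = datetime.strptime(match["ts"], "%Y/%m/%d %H:%M:%S")
            yield ts, match["request"], match["upstream"]


if __name__ == "__main__":
    path = "/var/log/nginx/error.log"  # assumed default location
    if os.access(path, os.R_OK):
        with open(path, encoding="utf-8", errors="replace") as log:
            for ts, request, upstream in upstream_timeouts(log):
                print(ts, request, "->", upstream)
```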

Steve

Ok, that’s clear. I was thrown off because the earlier post said it took 1m48.639s :frowning: It’s curious that there is such a large difference between the check run from the command line and from the admin web interface.
My feeling with most of these checks is that they are not memory- or CPU-bound. I have 2 GB, of which 800 MB is used, and I have 5 domains on the box. Adding more CPU or RAM will not help me.

There are two other things you can try. Both are a bit hacky and I don’t know about any unintended consequences. All these changes will be overwritten on a reinstall/upgrade of Mail-in-a-Box:

  • First option: increase the timeout
    ** Edit file /usr/local/lib/mailinabox/start
    ** There’s a line exec gunicorn -b localhost:10222 -w 1 --timeout 630 wsgi:app which defines the timeout in seconds
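    ** You probably also need to edit /etc/nginx/conf.d/local.conf. Since /admin is proxied over HTTP to 127.0.0.1:10222 (the upstream in the error log), the nginx setting that applies here is proxy_read_timeout; fastcgi_read_timeout 630 only affects FastCGI backends such as PHP
    ** Restart nginx and mailinabox: systemctl restart nginx mailinabox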
  • Second option: the domain checks are done in parallel, so increase the number of checks that can run in parallel:
    ** Edit file ~/mailinabox/management/daemon.py
    ** Look for
        # Create a temporary pool of processes for the status checks
        with multiprocessing.pool.Pool(processes=5) as pool:
    
    ** Increase processes=5 to, for example, processes=10
    ** Restart mailinabox: systemctl restart mailinabox
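
Why a bigger pool can help even when the checks are not CPU-bound: most of each check is spent waiting on the network, so more workers means more of those waits overlap. A small simulation under stated assumptions: it uses multiprocessing.pool.ThreadPool (same map interface as the Pool in daemon.py, but threads instead of processes, to keep the demo light), fake_check is hypothetical and only sleeps, 17 matches the domain count in this thread, and the 0.2 s delay is made up.

```python
import math
import time
from multiprocessing.pool import ThreadPool

DOMAINS = [f"domain{i}.example" for i in range(17)]
DELAY = 0.2  # made-up per-domain wait, standing in for DNS/network latency


def fake_check(domain: str) -> str:
    """Hypothetical check that only waits, like a network-bound status check."""
    time.sleep(DELAY)
    return domain


def run_with_pool(processes: int) -> float:
    """Run all fake checks with the given pool size and return the wall-clock time."""
    start = time.perf_counter()
    with ThreadPool(processes=processes) as pool:
        pool.map(fake_check, DOMAINS)
    return time.perf_counter() - start


if __name__ == "__main__":
    for size in (5, 10):
        batches = math.ceil(len(DOMAINS) / size)
        print(f"processes={size}: {run_with_pool(size):.2f}s "
              f"(expected about {batches} x {DELAY}s)")
```

With 17 waiting tasks, 5 workers need about 4 rounds and 10 workers about 2, so the wall time roughly halves without any extra CPU being used.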

[Screenshot: htop while MIAB is doing the status checks]

[Screenshot: htop during normal operation]

Interestingly (maybe), making the changes you suggested to the multiprocessing pool has made the command-line execution much faster, but the web interface still times out:

2023/04/22 00:07:58 [error] 1602#1602: *69 upstream timed out (110: Unknown error) while reading response header from upstream

I still think this stuff is not CPU-bound, but maybe the CPU waiting for network or disk is also counted as CPU time?
I’m still curious why there is such a large difference between the command line and the admin web page when doing a status check. I’m out of ideas, however :frowning:
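
On whether waiting counts as CPU time: a process blocked on network or disk does not accumulate user/sys time, which is what `time` reports as user and sys and what Python’s time.process_time() measures. One hedged reading of the first `time` output in this thread (real 1m48.639s, user 1m43.367s) is that the run spent most of its time computing rather than waiting, although user time also includes child processes once they have been reaped, so with a worker pool it is summed across them. A minimal sketch comparing wall-clock and CPU time for a waiting task and a busy task; both task functions are hypothetical.

```python
import time


def measure(task) -> tuple[float, float]:
    """Return (wall-clock seconds, CPU seconds of this process) for one call of task."""
    wall_start, cpu_start = time.perf_counter(), time.process_time()
    task()
    return time.perf_counter() - wall_start, time.process_time() - cpu_start


def waiting_task() -> None:
    """Hypothetical stand-in for a check blocked on network or disk."""
    time.sleep(0.5)


def busy_task() -> None:
    """Hypothetical stand-in for a check doing real computation (spins for 0.5 s)."""
    total = 0
    deadline = time.perf_counter() + 0.5
    while time.perf_counter() < deadline:
        total += 1


if __name__ == "__main__":
    for name, task in (("waiting", waiting_task), ("busy", busy_task)):
        wall, cpu = measure(task)
        print(f"{name:8} wall={wall:.2f}s cpu={cpu:.2f}s")
```

The waiting task should show close to zero CPU time despite half a second of wall time, while the busy task’s CPU time should track its wall time.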