Parallel processing for SpamAssassin (speeding up bulk mail processing)

A long-standing speed problem when processing many mails at once is that /usr/local/bin/sa-learn-pipe.sh feeds them to SpamAssassin's sa-learn one at a time.

The default MaiB installation of /usr/local/bin/sa-learn-pipe.sh is:
cat<&0 >> /tmp/sendmail-msg-$$.txt
/usr/bin/sa-learn $* /tmp/sendmail-msg-$$.txt > /dev/null
rm -f /tmp/sendmail-msg-$$.txt
exit 0

But what if your mail server has multiple cores? How can the script make use of them?


The best solution so far is the script below, and I would suggest it become the default for MaiB installations. When there is a backlog, it runs sa-learn on most of the cores (all but two).

It is the best approach I have found for handling 100+ emails. On a 56-core machine it copes fine with 10000+ emails in one go.

Basically it is the default script, but it makes use of the available resources.
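
To make "available resources" concrete: the script below leaves two cores for the system and never drops below one job. A minimal sketch of that calculation on its own, where `max_parallel` is a hypothetical helper name that does not appear in the script:

```shell
#!/bin/bash
# Hypothetical helper (not part of the script): given a core count,
# print how many sa-learn jobs may run at the same time.
max_parallel() {
    local cores=$1
    # Leave two cores for the system, but never go below one job.
    echo $(( cores > 2 ? cores - 2 : 1 ))
}

max_parallel 56          # prints 54
max_parallel 2           # prints 1
max_parallel "$(nproc)"  # value for the current machine
```

So the 56-core machine mentioned above would run at most 54 sa-learn jobs at once, while a 1 or 2 core server falls back to one job at a time.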

#!/bin/bash

################################################################################
# SpamAssassin Parallel Learning Script
################################################################################
# This script is called by Dovecot's antispam plugin when users move emails
# to/from spam folders. Instead of processing emails one-by-one sequentially,
# it processes them in parallel to dramatically speed up training.
#
# Called with: --spam (when moving TO spam) or --ham (when moving FROM spam)
################################################################################

# Calculate maximum parallel processes
# We use (total CPUs - 2) to leave some resources for the system
# The ternary operation ensures we always have at least 1 process even on
# very small servers (1-2 CPUs)
NPROC=$(nproc)  # Get total number of CPU cores
MAX_PARALLEL=$(( NPROC > 2 ? NPROC - 2 : 1 ))  # Subtract 2, minimum 1

# Directory for lock files to control parallel execution
# Each parallel job will create a lock directory to claim a "slot"
LOCK_DIR="/tmp/sa-learn-locks"
mkdir -p "$LOCK_DIR"  # Create lock directory if it doesn't exist

################################################################################
# Save the incoming email to a temporary file
################################################################################
# The email content comes from stdin (file descriptor 0)
# We save it with a unique name using the process ID ($$)
tmpfile="/tmp/sendmail-msg-$$.txt"
cat > "$tmpfile"  # Read stdin into the temp file (overwrites any stale leftover)

################################################################################
# Wait for an available processing slot
################################################################################
# This section implements a semaphore-like mechanism using directory creation
# Directory creation is atomic in Linux, so it's safe for concurrent access
slot=-1  # Initialize slot as "not found"

while [ $slot -lt 0 ]; do
    # Try to claim one of the available slots (0 to MAX_PARALLEL-1)
    for i in $(seq 0 $((MAX_PARALLEL-1))); do
        # Try to create a lock directory for this slot
        # mkdir will succeed only if the directory doesn't exist (slot is free)
        if mkdir "$LOCK_DIR/slot-$i" 2>/dev/null; then
            slot=$i  # Successfully claimed this slot
            break    # Exit the for loop
        fi
    done
    
    # If no slot was available, wait a tiny bit and try again
    # 0.01 seconds = 10 milliseconds
    [ $slot -lt 0 ] && sleep 0.01
done

################################################################################
# Process the email in background
################################################################################
# By running in background (&), we return immediately to Dovecot
# This allows Dovecot to continue accepting more emails to process
(
    # Run SpamAssassin learning with the arguments passed to this script
    # "$@" passes them through unchanged ("--spam" or "--ham")
    # Output is discarded (> /dev/null) and errors too (2>&1)
    /usr/bin/sa-learn "$@" "$tmpfile" > /dev/null 2>&1
    
    # Clean up the temporary email file
    rm -f "$tmpfile"
    
    # Release the slot by removing the lock directory
    # This allows another process to claim this slot
    rmdir "$LOCK_DIR/slot-$slot"
) &  # The & makes this entire block run in background

################################################################################
# Exit immediately
################################################################################
# We exit 0 (success) right away, while sa-learn continues in background
# This prevents Dovecot from waiting for sa-learn to complete
exit 0
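
If you want to try the slot logic without touching sa-learn, here is a minimal sketch of the same mkdir-based mechanism in isolation. `try_claim_slot`, `release_slot` and the throwaway directory from `mktemp -d` are illustrative names and choices of mine, not part of the script above, and unlike the script this version tries each slot once instead of looping until one is free:

```shell
#!/bin/bash
# Sketch of the mkdir-based slot mechanism, isolated from sa-learn.
# try_claim_slot and release_slot are hypothetical helper names.

# Try each slot once. Print the claimed slot number and return 0,
# or return 1 if every slot is already taken.
try_claim_slot() {
    local lock_dir=$1 max=$2 i
    for (( i = 0; i < max; i++ )); do
        # mkdir fails if the directory already exists, so only one
        # caller can win a given slot.
        if mkdir "$lock_dir/slot-$i" 2>/dev/null; then
            echo "$i"
            return 0
        fi
    done
    return 1
}

# Free a slot so another caller can claim it.
release_slot() {
    rmdir "$1/slot-$2"
}

demo_dir=$(mktemp -d)   # throwaway lock directory for the demo

first=$(try_claim_slot "$demo_dir" 2)
second=$(try_claim_slot "$demo_dir" 2)
# With 2 slots, a third claim must fail until something is released.
if third=$(try_claim_slot "$demo_dir" 2); then third_status=0; else third_status=1; fi
release_slot "$demo_dir" "$first"
fourth=$(try_claim_slot "$demo_dir" 2)

echo "claimed: $first $second, third attempt status: $third_status, after release: $fourth"

# Clean up the demo directory.
release_slot "$demo_dir" "$second"
release_slot "$demo_dir" "$fourth"
rmdir "$demo_dir"
```

One caveat that follows from this design: if a background job is killed before its `rmdir` runs, that slot stays locked until the directory is removed by hand.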

Nice one. I’m certainly going to try it.
This is also discussed on GitHub.
