Declutter your Gmail inbox with Python: A Step-by-Step Guide

Discover how to reclaim control over your emails. Let's leverage Python to efficiently clean up and organize your Gmail inbox.

Joey Miller • Posted July 12, 2023




Emails are an important part of many of our lives, both personally and professionally. Staying on top of your inbox can be a daunting task. No matter how hard I try, my Gmail inevitably begins overflowing with countless unread messages.

In this guide we will explore how Python can be used to sort through your inbox, allowing you to regain control.

Note: The purpose of this post isn't to detail a fully-automated AI that can clean our inboxes unsupervised. Rather, the goal is to introduce you to the tools needed to supplement your efforts when cleaning your inbox.

Installing dependencies

Ensure you have python3 and pip installed.

I encourage you to install the dependencies into a virtual environment.

Navigate to your project directory and run the following:

python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib pandas

Getting Google API access

Before we can get started, we need to register our application with Google so we can access user data.

We will follow the official instructions to create an OAuth "Desktop app".

  1. Go to Credentials
  2. Click Create Credentials > OAuth client ID.
  3. Click Application type > Desktop app.
  4. In the Name field, type a name for the credential. This name is only shown in the Google Cloud console.
  5. Click Create. An "OAuth client created" popover appears, showing the client details. Click 'Download JSON' and save the file as credentials.json in your project directory.
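
Before moving on, it can help to confirm that the downloaded file is the Desktop app variant. The check below is a minimal sketch, not part of Google's instructions: the helper name looks_like_desktop_credentials is hypothetical, and it assumes that Desktop app client files nest their fields under a top-level "installed" key.

```python
import json


def looks_like_desktop_credentials(path="credentials.json"):
    """Return True if the file resembles a Desktop app OAuth client file."""
    try:
        with open(path) as f:
            data = json.load(f)
    except (OSError, json.JSONDecodeError):
        # Missing, unreadable, or malformed file
        return False
    # Assumption: Desktop app downloads nest their fields under "installed"
    installed = data.get("installed") if isinstance(data, dict) else None
    if not isinstance(installed, dict):
        return False
    return {"client_id", "client_secret"} <= set(installed)
```

If this returns False for your file, the credential was probably created with a different application type, or the download was saved somewhere else.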

Analyzing your inbox

In this simple example, we will focus on creating a Python script that gives a breakdown of the most common senders in our inbox.

Create a Python file called gmail_organizer.py in your project directory.

First, let's add the shebang and imports.

#!/usr/bin/env python3
import os.path
import re

import pandas as pd

from google.auth.transport.requests import Request
from google.oauth2.credentials import Credentials
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build

Then let's create the authentication function. This uses the credentials.json file to allow us to authenticate on behalf of a user. Once a user has authenticated, a token.json file will be created in the project directory. This matches the sample code provided by Google.

# If modifying these scopes, delete the file token.json.
SCOPES = ['https://www.googleapis.com/auth/gmail.readonly']

def get_creds():
    creds = None
    # The file token.json stores the user's access and refresh tokens, and is
    # created automatically when the authorization flow completes for the first
    # time.
    if os.path.exists('token.json'):
        creds = Credentials.from_authorized_user_file('token.json', SCOPES)
    # If there are no (valid) credentials available, let the user log in.
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                'credentials.json', SCOPES)
            creds = flow.run_local_server(port=0)
        # Save the credentials for the next run
        with open('token.json', 'w') as token:
            token.write(creds.to_json())
    return creds

Now, we create the functions get_inbox_emails(..) and process_email_metadata(..) - these will be doing most of the heavy lifting.

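email_metadata = []

def process_email_metadata(request_id, response, exception):
    # Batch callback: when an individual request fails, `exception` is set
    # and `response` is None, so skip that message.
    if exception is not None:
        print(f"Skipping request {request_id}: {exception}")
        return

    message_id = response.get('id')
    headers = response.get('payload', {}).get('headers', [])
    for header in headers:
        if header['name'] == "From":
            match = re.match(
                r'(?:.*<)?(.*)@(.*?)(?:>.*|$)', header['value']
            )
            # Ignore From headers that contain no recognizable address
            if match:
                username, domain = match.groups()
                email_metadata.append({
                    'message_id': message_id,
                    'username': username,
                    'domain': domain})
            break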

def get_inbox_emails(service):
    # Call the Gmail API (maxResults is capped at 500 per page)
    response = service.users().messages().list(
        userId='me',
        labelIds=['INBOX'],
        maxResults=500
    ).execute()

    # Retrieve all message ids; an empty page has no 'messages' key
    messages = []
    messages.extend(response.get('messages', []))
    while 'nextPageToken' in response:
        page_token = response['nextPageToken']
        response = service.users().messages().list(
            userId='me',
            labelIds=['INBOX'],
            maxResults=500,
            pageToken=page_token
        ).execute()
        messages.extend(response.get('messages', []))

    # Retrieve the metadata for all messages
    step = 100
    num_messages = len(messages)
    for batch in range(0, num_messages, step):
        batch_req = service.new_batch_http_request(callback=process_email_metadata)
        for i in range(batch, min(batch + step, num_messages)):
            batch_req.add(service.users().messages().get(
                userId='me',
                id=messages[i]['id'],
                format="metadata")
            )
        batch_req.execute()

Let's break down what these functions accomplish:

  1. Create a Gmail service object.
  2. Retrieve all message IDs by listing the messages in the inbox. Gmail returns at most 500 results per request, so we have to keep requesting more until there is no nextPageToken in the response.
  3. Retrieve the metadata for all messages. Gmail does not return these details when listing emails, so we need to request each of the message IDs we found in Step 2. For performance, we batch these requests so that each batch asks Gmail for the metadata of up to 100 emails.
  4. For each email's metadata we receive back, the callback process_email_metadata(..) is called. This is where we process our data. In this example, I process the From: field and apply a regex to extract the email username and domain name. This will allow us to find the most common senders in the inbox.
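
The From: parsing in Step 4 can also be sketched with the standard library instead of a regex. This is an alternative to the approach used in this post, not a replacement for it: the helper name split_sender is hypothetical, and it relies on email.utils.parseaddr to strip the display name before splitting on the last "@".

```python
from email.utils import parseaddr


def split_sender(from_header):
    """Split a From header into (username, domain), or return None."""
    _display_name, address = parseaddr(from_header)
    # Split on the last "@" so the domain is everything after it
    username, at, domain = address.rpartition("@")
    if not (username and at and domain):
        return None
    return username, domain.lower()


for value in ["Jane Doe <jane@example.com>", "noreply@youtube.com", "undisclosed"]:
    print(value, "->", split_sender(value))
```

Returning None for unparseable headers lets the caller skip those messages instead of failing partway through a batch.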

Finally, let's create the script entrypoint, which calls the functions we've made above.

def main():
    creds = get_creds()
    service = build('gmail', 'v1', credentials=creds)

    get_inbox_emails(service)

if __name__ == '__main__':
    main()

Printing results

Running the code above will print nothing. We need to process the data and display it to the user. We can use Pandas to easily report a descending list of email usernames and domains.

We've already done the work to process this data in process_email_metadata(..), so all we need to do is add the following lines to main() below get_inbox_emails(service):

    # Print the results
    df = pd.DataFrame(email_metadata)

    print("Most common email usernames -----------")
    print(df.groupby('username')
            .size().reset_index(name='count')
            .sort_values(by='count',ascending=False)
            .to_string(index=False))
    print()
    print("Most common email domains -------------")
    print(df.groupby('domain')
            .size().reset_index(name='count')
            .sort_values(by='count',ascending=False)
            .to_string(index=False))
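
If you would rather not depend on Pandas for the tally, the same counts can be sketched with collections.Counter from the standard library. The helper name most_common_values and the sample records are hypothetical; the records are only shaped like the entries that process_email_metadata(..) appends to email_metadata.

```python
from collections import Counter


def most_common_values(records, key, limit=10):
    """Return (value, count) pairs for `key`, most frequent first."""
    return Counter(record[key] for record in records).most_common(limit)


# Hypothetical sample shaped like the entries in email_metadata
sample = [
    {"message_id": "1", "username": "info", "domain": "example.com"},
    {"message_id": "2", "username": "noreply", "domain": "example.com"},
    {"message_id": "3", "username": "info", "domain": "youtube.com"},
]

for domain, count in most_common_values(sample, "domain"):
    print(f"{domain:>20} {count:>6}")
```

Pandas remains the better fit once you want to filter or join on several columns; the Counter version only covers the simple frequency table.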

See the complete script on GitHub.

Running

From the project directory:

python3 gmail_organizer.py

A new browser window will open prompting you to sign in to your Google account. The script will analyze the emails in the Gmail account associated with the Google account you sign in with at this point. The browser window will warn you that this is unsafe, but that is only because your application is unverified. If necessary, you can go through the process to verify your application.

After running the application, you should get an output similar to the following:

Most common email usernames -----------
       username  count
           info      6
        noreply      5
       no-reply      2
     donotreply      1
...

Most common email domains -------------
         domain  count
    example.com      5
    youtube.com      2
     change.org      1
...

Extending the script

The example above is a very simple demonstration of what you can accomplish. It serves as a scaffold that you can expand to tackle more complex situations. It is possible to extend the script to modify your inbox, including labeling or deleting emails.

Start by making sure you have the correct SCOPES for the operations you are attempting. Google outlines the different scopes here.

To be able to additionally label emails, we need the modify scope. This means we need to update:

# If modifying these scopes, delete the file token.json.
SCOPES = ['https://www.googleapis.com/auth/gmail.readonly']

to:

# If modifying these scopes, delete the file token.json.
SCOPES = ['https://www.googleapis.com/auth/gmail.readonly',
          'https://www.googleapis.com/auth/gmail.modify']

Make sure you delete token.json after changing the scopes.
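
Forgetting to delete token.json is an easy mistake, so one option is to let the script check for it. The following is a rough sketch under one assumption: that the saved token file lists its granted scopes under a "scopes" key. The helper names token_covers_scopes and remove_stale_token are hypothetical.

```python
import json
import os


def token_covers_scopes(token_path, required_scopes):
    """Return True if the saved token lists every required scope."""
    try:
        with open(token_path) as f:
            saved = json.load(f)
    except (OSError, json.JSONDecodeError):
        return False
    if not isinstance(saved, dict):
        return False
    # Assumption: the token file stores granted scopes under "scopes"
    granted = saved.get("scopes") or []
    return set(required_scopes) <= set(granted)


def remove_stale_token(token_path, required_scopes):
    """Delete the token file if it exists but lacks a required scope."""
    if os.path.exists(token_path) and not token_covers_scopes(
            token_path, required_scopes):
        os.remove(token_path)
        return True
    return False
```

Calling remove_stale_token('token.json', SCOPES) at the top of get_creds() would then force a fresh sign-in whenever the scopes have grown.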

From here, labeling/starring an email is very straightforward.

def label_emails(service, message_id):
    response = service.users().messages().modify(
        userId='me',
        id=message_id,
        body={
            "addLabelIds":['STARRED']
        }
    ).execute()

Note: For labeling large numbers of emails, consider using batchModify instead, for the same reason we batched the metadata requests earlier.
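
A batchModify version might look like the sketch below. The function name star_messages_in_batches is hypothetical, and the chunk size of 1000 reflects my understanding of the documented per-request limit on message IDs, so check the Gmail API reference before relying on it.

```python
def star_messages_in_batches(service, message_ids, chunk_size=1000):
    """Add the STARRED label to many messages using batchModify."""
    for start in range(0, len(message_ids), chunk_size):
        chunk = message_ids[start:start + chunk_size]
        # One request labels the whole chunk instead of one call per message
        service.users().messages().batchModify(
            userId='me',
            body={'ids': chunk, 'addLabelIds': ['STARRED']}
        ).execute()
```

The function takes a plain list of message ID strings, so you would collect the matching message_id values from email_metadata first and pass them in together.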

We've already done the work to process this data in process_email_metadata(..), so to star all emails from example.com all we need to do is add the following lines to main() below get_inbox_emails(service):

    for email in email_metadata:
        if email['domain'] == 'example.com':
            label_emails(service, email['message_id'])
