Overview

In this post I outline how I used the LEPP (Linux, Nginx, PostgreSQL, Python) stack to build a web app driven by text data. With these technologies I was able to create a fairly complex web application with relatively little overhead. You may already be familiar with each component, but I will also point out some interesting patterns I've adopted. For those unfamiliar with the stack: Linux is free and open source, ubiquitous in the cloud, and extremely robust. Nginx shares those qualities and is also performant. PostgreSQL is likewise open source and has a good reputation for security. Python gives access to web frameworks and libraries that can save you a lot of time (Django, requests, pandas, and newspaper are some of my favorites).

Goal

My goal was to create a web app that could aggregate content from various news outlets and wrangle the important data into a model defined within Django's ORM. I did this by using what would probably be regarded as an obscene number of third-party libraries (don't reinvent the wheel) and integrating them with Django's application patterns. Yes, I could probably remove half my dependencies by writing a handful of functions with the standard library, but the goal is a working prototype first and foremost.

On the front end I wanted a responsive, crisp, and readable interface. The most important thing was to present the articles that people want to read, or make them otherwise searchable/browseable. My target was pages with 1 MB or less of data and load times in the 500-750ms range. My goal for the style of the interface was for it to be tolerable (I prefer backend development).

Component Choices


Django

It was either this or Flask. I chose Django because it has an excellent object-relational mapper (ORM) and plenty of libraries such as allauth and django-rest-framework, plus mail and storage integrations such as anymail and storages. Not having to wrestle with user creation, password validation and storage, verification, forms, etc. saves me hundreds of hours of headache and gives my users a better, more secure experience.

To give you an idea of how simple querying the database in Django is, look at these typical examples:

>>> queryset = MyModel.objects.all()
# or
>>> queryset = MyModel.objects.filter(publish_date__range=[startdate, enddate], language='en', video=False).prefetch_related('author', 'domain').order_by('-publish_date')

The API covers 99% of SQL functionality [1].
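For comparison, here is a rough sketch of the kind of hand-written SQL the second query above saves you from, run against an in-memory SQLite database with Python's standard sqlite3 module. The table and column names are hypothetical stand-ins rather than my real schema, the sample rows are made up, and the prefetch_related part is left out entirely.

```python
import sqlite3

# Hypothetical schema: a single "article" table standing in for MyModel.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE article ("
    "id INTEGER PRIMARY KEY, title TEXT, publish_date TEXT, "
    "language TEXT, video INTEGER)"
)
conn.executemany(
    "INSERT INTO article (title, publish_date, language, video) "
    "VALUES (?, ?, ?, ?)",
    [
        ("Old story", "2019-01-05", "en", 0),
        ("Fresh story", "2019-03-10", "en", 0),
        ("Video story", "2019-03-11", "en", 1),
        ("Foreign story", "2019-03-12", "de", 0),
        ("Newest story", "2019-03-20", "en", 0),
    ],
)


def recent_articles(conn, startdate, enddate):
    """Roughly what the filtered, newest-first query asks the database for."""
    rows = conn.execute(
        "SELECT title FROM article "
        "WHERE publish_date BETWEEN ? AND ? AND language = 'en' AND video = 0 "
        "ORDER BY publish_date DESC",
        (startdate, enddate),
    )
    return [title for (title,) in rows]


titles = recent_articles(conn, "2019-03-01", "2019-03-31")
print(titles)
```

With the ORM, the table creation, parameter binding, and row unpacking above all disappear behind the model class.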

Newspaper3k

This library allows you to "scrape" or "crawl" websites and extract articles. It has a simple API and makes use of multiprocessing to quickly fetch pages (but not so quickly as to overwhelm the host server). It works by walking the DOM and scoring elements to decide which text is part of the article. There are parsers to extract data like title, authors, publish date, and so on. Extraction is not always reliable since every site is different, so I had to write some additional parsers for cases where the data is not extracted. I made a separate Django app based on newspaper, created models for the articles, authors, and domains, and wrote a management command that gathers articles from a defined list of websites. So adding more articles looks like:

docker-compose -f production.yml run django python manage.py cron
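Here is a minimal sketch of how a command like this might decide which sites to crawl. Django management commands declare their arguments with argparse, so the sketch uses argparse directly and runs without Django. The positional sites argument, the function names, and the default whitelist are all assumptions for illustration; the real command's interface is not shown in this post.

```python
import argparse

# Hypothetical default whitelist, for illustration only.
DEFAULT_WHITELIST = ["https://example.com", "https://example.org"]


def build_parser():
    """Mirror what a Django command would declare in add_arguments()."""
    parser = argparse.ArgumentParser(prog="manage.py cron")
    parser.add_argument(
        "sites",
        nargs="*",
        help="Sites to crawl; when omitted, the default whitelist is used.",
    )
    return parser


def sites_to_crawl(argv):
    """Return the sites given on the command line, else the default whitelist."""
    options = build_parser().parse_args(argv)
    return options.sites or list(DEFAULT_WHITELIST)


print(sites_to_crawl([]))
print(sites_to_crawl(["https://news.example.net"]))
```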

The management command allows for passing in lists of websites, overriding the default whitelist. That solves the content aspect of the site.
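To give a flavor of the additional parsers mentioned above, here is a minimal sketch that pulls a publish date out of a page's meta tags using only the standard library. It does not use newspaper's API, and the list of meta keys it checks is an assumption; real sites vary widely, so treat it as a starting point rather than a complete solution.

```python
from datetime import datetime
from html.parser import HTMLParser

# Assumed meta keys that often carry a publish date; extend per site.
DATE_META_KEYS = ("article:published_time", "pubdate", "date")


class MetaDateParser(HTMLParser):
    """Keeps the content of the first <meta> tag whose key is in DATE_META_KEYS."""

    def __init__(self):
        super().__init__()
        self.raw_date = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.raw_date is not None:
            return
        attrs = dict(attrs)
        key = attrs.get("property") or attrs.get("name")
        if key in DATE_META_KEYS and attrs.get("content"):
            self.raw_date = attrs["content"]


def fallback_publish_date(html):
    """Return a datetime parsed from meta tags, or None if nothing usable is found."""
    parser = MetaDateParser()
    parser.feed(html)
    if parser.raw_date is None:
        return None
    try:
        # Handles ISO 8601 values such as 2019-03-10T08:30:00+00:00.
        return datetime.fromisoformat(parser.raw_date)
    except ValueError:
        return None


page = (
    '<html><head><meta property="article:published_time" '
    'content="2019-03-10T08:30:00+00:00"></head></html>'
)
print(fallback_publish_date(page))
```

Returning None instead of raising keeps the crawl going when a site uses a format the parser does not recognize.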

VueJS

I wanted users to be able to do simple things like follow sites and authors, and bookmark articles. I envisioned having a profile page where you could keep track of articles and get a customized feed based on your preferences. The issue with Django is that to add or remove an object you typically make a POST request, which gets handled by a view that then redirects you to a GET request (another database hit). This didn't strike me as very efficient or modern, so I created a REST endpoint and added some VueJS to my templates. VueJS is amazing for applications with data - working with lists and API endpoints is a breeze [2]. You can create a web app entirely with VueJS and serve it as a SPA, or you can add the .js files to your page and make use of its functionality. I feel like I'm cheating a bit here, having my cake and eating it too, but it works for now.

PostgreSQL

Postgres was probably the easiest decision to make. It performs. It is secure. But most of all, it has a Django integration [3]. The full text search is perfect for my text heavy application.

Nginx

Nginx hasn't failed me yet. It acts as a reverse proxy, passing requests to the WSGI server (gunicorn). Here is a simplified version of my config:

upstream django {
    server django:5000;
}

server {
    listen 80;
    server_tokens off;
    server_name inshapemind.com;
    return 301 https://inshapemind.com$request_uri;
}

server {

    listen 443 ssl;
    server_name inshapemind.com;
    ssl_certificate /etc/nginx/certs/fullchain.pem;
    ssl_certificate_key /etc/nginx/certs/privkey.pem;

    location / {
        proxy_pass http://django;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header Host $host;
        proxy_redirect off;
    }
}

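It's good practice to put a traditional web server in front of the WSGI server: Nginx handles the client connections (TLS termination, redirects, slow clients) so that gunicorn only has to run the application. Usually you would also have a location block here for serving static assets, but I have a CDN set up for that.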
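One consequence of proxying is that the application sees Nginx as the peer of every request, which is why the config above sets X-Forwarded-For. Here is a minimal sketch of recovering the client address inside a WSGI application. It assumes Nginx is the only proxy in front of the app, and the function name is hypothetical.

```python
def client_ip(environ):
    """Best-effort client address for a WSGI request behind one trusted proxy."""
    forwarded = environ.get("HTTP_X_FORWARDED_FOR", "")
    # $proxy_add_x_forwarded_for appends the address Nginx saw to any
    # X-Forwarded-For value the client sent, so the last entry is the one
    # written by the proxy; earlier entries can be spoofed by the client.
    hops = [hop.strip() for hop in forwarded.split(",") if hop.strip()]
    if hops:
        return hops[-1]
    # No proxy header present: fall back to the socket peer address.
    return environ.get("REMOTE_ADDR", "")


request_environ = {
    "HTTP_X_FORWARDED_FOR": "10.0.0.1, 203.0.113.7",
    "REMOTE_ADDR": "172.18.0.5",
}
print(client_ip(request_environ))
```

If more proxies are added later (a load balancer, for example), the number of trusted hops changes and the last entry is no longer the right one to pick.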

Docker

Most people have heard of Docker by now, but I doubt everyone knows the full extent of what it has to offer. Yes, it allows you to run containerized applications, and yes, Kubernetes uses it. I like it because it lets you manage multiple containers and their configuration in a single YAML file with docker-compose. Furthermore, you can provision and manage remote servers with docker-machine. For cloud providers with an API, such as DigitalOcean, you can create an instance from the command line by passing in the parameters. The Docker engine will then be installed and you will be able to ssh into the instance simply by invoking docker-machine ssh myproject. Here is what my YAML configuration looks like:

version: '3'

services:
  django: &django
    build:
      context: .
      dockerfile: ./compose/production/django/Dockerfile
    image: inshapemind_production_django
    depends_on:
      - postgres
      - redis
    env_file:
      - ./.envs/.production/.django
      - ./.envs/.production/.postgres
    volumes:
      - newspaper_cache:/tmp/.newspaper_scraper
      - nltk:/root/nltk_data
    expose:
      - 5000
    command: /start

  postgres:
    build:
      context: .
      dockerfile: ./compose/production/postgres/Dockerfile
    image: inshapemind_production_postgres
    volumes:
      - production_postgres_data:/var/lib/postgresql/data
      - production_postgres_data_backups:/backups
    env_file:
      - ./.envs/.production/.postgres

  nginx:
    build:
      context: .
      dockerfile: ./compose/production/nginx/Dockerfile
    image: inshapemind_production_nginx
    depends_on:
      - django
    ports:
      - "0.0.0.0:80:80"
      - "0.0.0.0:443:443"
    volumes:
      - /etc/letsencrypt/live/inshapemind.com/fullchain.pem:/etc/nginx/certs/fullchain.pem:ro
      - /etc/letsencrypt/live/inshapemind.com/privkey.pem:/etc/nginx/certs/privkey.pem:ro

  redis:
    image: redis:5


volumes:
  production_postgres_data: {}
  production_postgres_data_backups: {}
  newspaper_cache: {}
  nltk: {}

Notice how easily I've added volumes for Django and Nginx. The Django volumes are named volumes managed by Docker, while the Nginx certificate files are mounted read-only from the host; both persist across container restarts. The Newspaper cache contains the memoization information so I don't hit the same URL multiple times. I have another container for acquiring a Let's Encrypt cert, which is then mounted in the production Nginx container.
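The memoization idea can be sketched in a few lines. This is not newspaper's actual cache format; the class name, file name, and one-URL-per-line layout are assumptions chosen for illustration. The point is only that a file kept on a persistent volume lets a later run skip URLs it has already fetched.

```python
import os
import tempfile


class SeenUrls:
    """Remembers fetched URLs in a plain text file, one URL per line."""

    def __init__(self, path):
        self.path = path
        self.urls = set()
        if os.path.exists(path):
            with open(path, encoding="utf-8") as fh:
                self.urls = {line.strip() for line in fh if line.strip()}

    def is_new(self, url):
        """Return True the first time a URL is seen and record it on disk."""
        if url in self.urls:
            return False
        self.urls.add(url)
        with open(self.path, "a", encoding="utf-8") as fh:
            fh.write(url + "\n")
        return True


# Hypothetical cache location; in a container this would sit on a mounted volume.
cache_file = os.path.join(tempfile.mkdtemp(), "seen_urls.txt")

first_run = SeenUrls(cache_file)
print(first_run.is_new("https://example.com/a"))   # first visit
second_run = SeenUrls(cache_file)                  # simulates a container restart
print(second_run.is_new("https://example.com/a"))  # already recorded
```

Without the volume, the file would vanish with the container and every run would start from an empty cache.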

Summary

That's my LEPP stack. It is easy to develop and deploy. I hope you learned something new and/or will consider some of these technologies in your next project if you haven't already.