CSV file converted to PDF, with variable column widths and footers

Printing a CSV file as a nicely formatted table is easy if you import it first into a spreadsheet like Microsoft Excel or LibreOffice Calc. I found it surprisingly hard using common Unix utilities. Toolchains like Latex are overkill, designed for complex publishing. My typical use cases are simple. I want to dump selected fields exported from an application like my Kanboard task list on a single sheet of paper, without over-running the margins.

My solution converts a CSV file to PDF in two hops. First I translate the CSV into HTML with a custom Python script. Then I send the HTML through WeasyPrint to generate the PDF. WeasyPrint, like the name implies, makes the HTML-to-PDF job a simple one-liner in a shell command.

Installing Weasyprint

Follow the Weasyprint docs. There are numerous dependencies. On Ubuntu 16.04, I ran:

sudo apt-get install build-essential python3-dev python3-pip python3-cffi libcairo2 libpango-1.0-0 libpangocairo-1.0.0 libgdk-pixbuf2.0-0 libffi-dev shared-mime-info

Then add WeasyPrint with the pip package installer:

pip install WeasyPrint

Now you’re ready to run some tests. Try converting some web pages or HTML to PDF using the quickstart tutorial. Then experiment with the options for injecting a stylesheet or in-line CSS.

Converting CSV to HTML

My python script takes any CSV file as a command-line parameter and outputs an HTML document. Optionally, I can specify the title, column widths, and a stylesheet name.

There’s nothing very special about the script. It uses Python’s standard csv module to read the file and writes writes the output line by line. The entire CSV becomes a <table> element in the body. Because the HTML document includes a link to the stylesheet, it will render nicely in the browser.

The main feature I needed was control over the individual column widths as percentages of the overall table width, using the <th> table header tags. If a column width is set to zero, then the field is skipped. If no width is specified, it’s left up to the browser.

The script reads these parameters in the form <csv-file-name> [<page-title> <"col-width-1, col-width-2, ..."> <style-sheet-name>]. For example:

./csv2html.py kbtasks.csv "My Kanboard Tasks" "35,8,8,5,5,8,30,0,0,0" csv2pdf.css

Styling

To fit wider CSV lists on the page, I specified CSS for smaller fonts and margins. I also wanted to supply a footer to WeasyPrint. My CSS file, created separately from the Python script, looks like this:

@page {
  size: letter landscape;
  margin-top: .40in;
  margin-left: .40in;
  margin-right: .40in;
  margin-bottom: .70in;
  font-size: 10px;
  @bottom-right { content: "Page " counter(page) " of " counter(pages)};
}
table {
  border-collapse: collapse;
}
td, th {
  font-size: 10px;
  line-height: 10pt; 
  border: 1px solid black;
}

Putting it all Together

The last problem for me was adding other elements to the page footer, like title and date. Weasyprint doesn’t seem to include the creation date automatically, even when I write it into a <meta name> content tag. I worked around this puzzle by adding the date dynamically, as in-line CSS:

weasyprint kbtasks.csv.html kbtasks.pdf -s <(echo '@page { @bottom-left { content: "'`date '+%Y-%m-%d %H:%M'`'" }; @bottom-center { content: "My Kanboard Tasks" }; }')

Here’s the completed bash and python scripts I use to create my CSV file and print it:

paul@pauldev:~/scripts$ cat kbprint.sh

#!/bin/bash
#TODAY=`date +20%y/%m/%d\ %k:%M`
SCRIPTDIR=~/scripts
OUTDIR=~/Downloads
cd $OUTDIR
# Note: the kbtasks2csv.py script is not discussed in this blog article.
echo
echo "Writing Kanboard tasks to CSV file..."
$SCRIPTDIR/kbtasks2csv.py
echo
echo "Converting CSV file to HTML..."
$SCRIPTDIR/csv2html.py kbtasks.csv "My Kanboard Tasks" "35,8,8,5,5,8,30,0,0,0" $SCRIPTDIR/csv2pdf.css
echo
echo "Converting HTML file to PDF..."
weasyprint kbtasks.csv.html kbtasks.pdf -s <(echo '@page { @bottom-left { content: "'`date '+%Y-%m-%d %H:%M'`'" }; @bottom-center { content: "My Kanboard Tasks" }; }')

paul@pauldev:~/scripts$ cat csv2html.py

#!/usr/bin/env python3
"""csv2html.py
Convert CSV files to HTML as tables, with ability to set percentage widths of each column.
"""
import csv
import datetime
import sys

def read_csv_file(any_csv_file):
    """Read a CSV file.  Return contents as a list."""
    in_reader = csv.reader(open(any_csv_file))
    row_list = []
    row_count = 0
    for row in in_reader:
        row_list.append(row)
        row_count += 1
    print(row_count, "rows read")
    return row_list

def write_html_file(any_list, any_html_file, any_title='', any_widths=None, css_sheet_name=None):
    """Write list to an HTML file, using the assigned column percentage widths."""
    with open(any_html_file, 'w') as out_file:
        out_file.write('<!DOCTYPE html>\n')
        out_file.write('<html>\n')
        out_file.write('<head>\n')
        out_file.write('    <title>' + any_title + '</title>\n')
        if css_sheet_name is not None:
            out_file.write('    <link rel="stylesheet" type="text/css" href="' \
                + css_sheet_name + '">\n')
        now_str = datetime.datetime.now().strftime('%Y-%m-%d')
        out_file.write('    <meta name=dcterms.created content="' + now_str + '" />\n')
        out_file.write('</head>\n\n')
        out_file.write('<body>\n')
        out_file.write("<table style='table-layout:fixed;width:100%' border:'1'>\n")
        row_count = 0
        col_width_pct = any_widths
        for row in any_list:
            col_count = 0
            out_file.write('    <tr>\n')
            for fld in row:
                if col_width_pct[col_count] != 0:
                    if row_count == 0: # header row
                        out_file.write("        <th")
                        if col_width_pct[col_count] > 0:
                            out_file.write(" style='width: " \
                            + str(col_width_pct[col_count]) + "%;'")
                        out_file.write(">")
                        out_file.write(fld + "</th>\n")
                    else:
                        out_file.write('        <td>' + fld + '</td>\n')
                col_count += 1
            out_file.write('    </tr>\n')
            row_count += 1
        out_file.write('</table>\n')
        out_file.write('</body>\n')
        out_file.write('</html>\n')
    return

def get_args():
    """Get command line parameters."""
    if (len(sys.argv) < 2) or (len(sys.argv) > 5):
        usage()
        sys.exit(0)
    in_file = sys.argv[1]
    print("Input file:", in_file)
    if len(sys.argv) >= 3:
        title = sys.argv[2]
    else:
        title = "Table"
    print("HTML header title:", title)
    if len(sys.argv) >= 4:
        column_widths = [x.strip() for x in sys.argv[3].split(",")]
        print("HTML column % widths:", column_widths)
    else:
        column_widths = []
    if len(sys.argv) >= 5:
        css_sheet_name = sys.argv[4]
        print("CSS style sheet:", css_sheet_name)
    else:
        css_sheet_name = None
    return in_file, title, column_widths, css_sheet_name

def usage():
    """Explain parameters if user fails to enter them."""
    text = """
    usage: csv2html.py <csv-file-name> [<page-title> <"col-width-1, col-width-2, ..."> <style-sheet-name>]
    """
    print(text)

def main():
    """Main program loop."""


    # Initial default for each potential column width is zero (meaning don't include it)
    my_html_column_widths = [-1 for x in range(1, 40)]
    my_specified_column_widths = []

    # Read command-line arguments
    my_csv_file, my_title, my_specified_column_widths, my_css_sheet_name = get_args()

    # Insert column-widths defined by user
    counter = 0
    for col in my_specified_column_widths:
        try:
            my_html_column_widths[counter] = int(col)
        except ValueError:
            my_html_column_widths[counter] = -1
        counter += 1

    # Read the CSV file
    my_html_file = my_csv_file + '.html'
    my_in_list = read_csv_file(my_csv_file)

    # Write the HTML file
    write_html_file(my_in_list, my_html_file, my_title, my_html_column_widths, my_css_sheet_name)

if __name__ == '__main__':
    main()

Resources