Techniques — Science writing
http://pobox.com/~rudolf/techniques/sciencewriting.html

Based on Ubuntu 10.04 (a Debian-derived Linux distribution).

Document workflow

Option 1: commercial: Word 97, MathType, EndNote 5.0

Option 2: OpenOffice/LibreOffice, Zotero standalone

LibreOffice (the major fork of OpenOffice) isn't bad, and Zotero itself is excellent. However, (a) OpenOffice Math remains something of a pain for equations; (b) citation entry and editing aren't as easy as they could be (e.g. there is a formatting bug, and it's hard to add text at the end of a line after inserting a citation); and (c) ridiculously, you can't copy and paste citations. So:

Option 3: LyX, LyZ, Zotero standalone

Zotero is superb, particularly the standalone version. LyX is an editor that does a good job of hiding its underlying platform, LaTeX, and has superb maths capabilities (e.g. inline equations appear perfectly, and it's very quick to write in). LyZ is a plugin for Zotero that integrates it very well with LyX.

So, installation:

In use:

Illustration: Illustrator

Bugs persist (Mar 2012) in Inkscape that mean it's not yet up to Illustrator. Specifically, cutting lines with a shape doesn't work.

Fetch a list of URLs to PDF

For example, to batch-fetch a list of sources in a convenient (relatively immutable, PDF) form, given a textfile containing a list of URLs: see below or fetch_multiple_urls_to_pdf.py.

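#!/usr/bin/python

# Requires Debian packages: wget wkhtmltopdf

import sys
from subprocess import call

if len(sys.argv) != 2:	# the program name and one other
	sys.exit("Syntax: fetch_multiple_urls_to_pdf.py urllistfile")

f = open(sys.argv[1], "r")
for url in f:
	url = url.strip()
	if not url:	# skip blank lines
		continue
	# Build a flat filename from the URL: drop the scheme and any trailing slash
	destfile = url.replace("http://", "").replace("https://", "")
	if destfile.endswith("/"):
		destfile = destfile[:-1]
	destfile = destfile.replace("/", "_")
	if destfile.endswith(".pdf"):
		command = ["wget", url, "-O", destfile]	# already a PDF: just download it
	else:
		destfile += ".pdf"
		command = ["wkhtmltopdf", url, destfile]	# render the page to PDF
	print("Executing " + " ".join(command))
	# Pass an argument list rather than a shell string, so that "&" and "?" in URLs are safe
	ret = call(command)
	if ret != 0:
		sys.stderr.write("Command failed with exit code %d\n" % ret)
f.close()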

To generate the URL list from a textfile...

#!/usr/bin/perl
use strict;
use warnings;

my $filename = shift;
if (!defined $filename || $filename eq "") {
	die "Syntax: list_urls.pl <filename>\n";
}

open(my $infile, "<", $filename)
	or die "Couldn't open $filename for reading: $!\n";
while (<$infile>) {
	# Print the first URL found on each line
	if (m{(https?://\S+)}) {
		print "$1\n";
	}
}
close($infile);
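
For consistency with the Python scripts elsewhere on this page, the same extraction can be sketched in Python. This is an illustrative alternative rather than the original script: the function name extract_urls is hypothetical, and it assumes a URL is simply "http://" or "https://" followed by non-whitespace characters. Unlike the Perl version, it returns every URL on a line rather than only the first.

```python
import re

# Assumption: a URL is "http://" or "https://" followed by non-whitespace.
URL_PATTERN = re.compile(r"https?://\S+")

def extract_urls(text):
    """Return every URL-like string in text, in order of appearance."""
    return URL_PATTERN.findall(text)
```

To use it on a file, read the file's contents into a string, pass that to extract_urls, and print each result on its own line; the output can then be fed to the fetching script above.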

PDF manipulation

# The main package: pdftk (does all sorts of things)

sudo apt-get install pdftk
man pdftk
# example:
pdftk in.pdf cat 1-12 14-end output out1.pdf

# Extracting text:
pdftotext in.pdf out.txt

# Processing a PDF file through GhostScript (useful to find PDF corruption, amongst other things)
gs -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=output.pdf input.pdf
# ... use -sDEVICE=x11 to view, or -sDEVICE=nullpage just to pass the input through (its error checks are less restrictive)

# When GhostScript can't handle a pdf:
pdftops myfile.pdf
# ... generates myfile.ps
# Now list printers:
lpstat -p -d
# Now print the PostScript:
lpr -P<printer> myfile.ps

Systematic reviews

Quick optical character recognition (OCR)

Suppose your scanner produces page1.png. Then we'll use convert (part of ImageMagick) and tesseract to do some quick OCR.

convert page1.png page1.tif
tesseract page1.tif page1

The final result is page1.txt.
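
For a multi-page scan, the two commands above can be wrapped in a loop. A minimal sketch, assuming the scans are named page1.png, page2.png, and so on in the current directory, and that convert and tesseract are on the PATH; the function names ocr_commands and ocr_all are hypothetical.

```python
import glob
import os
import subprocess

def ocr_commands(png_path):
    """Return the two commands (as argument lists) needed to OCR one PNG."""
    base = os.path.splitext(png_path)[0]
    tif_path = base + ".tif"
    return [
        ["convert", png_path, tif_path],  # PNG -> TIFF via ImageMagick
        ["tesseract", tif_path, base],    # tesseract writes base + ".txt"
    ]

def ocr_all(pattern="page*.png"):
    """OCR every file matching pattern; raises if any command fails."""
    for png_path in sorted(glob.glob(pattern)):
        for command in ocr_commands(png_path):
            subprocess.check_call(command)
```

Calling ocr_all() in the scan directory should leave one .txt file per page alongside the images.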

Quick Python/shell script hacks for indexing purposes

Renaming a bunch of files in a filename-sensitive manner

Probably achievable in a bash script, but this is quicker to write. Having split a PDF (with pdftk input.pdf burst), now copy the pages to new names so that pages 1–26 become page_roman_1.pdf to page_roman_26.pdf, and pages 27 onwards are renumbered from 1 (page_1.pdf, page_2.pdf, and so on). See below or rename_proof_pages.py.

#!/usr/bin/python2.7

import glob, re, shutil

ROMAN_PAGES = 26	# the first 26 pages are roman-numeralled

numberregex = re.compile(r"\d+")
# glob does the wildcard expansion, and copes with spaces in filenames
for infile in sorted(glob.glob("*.pdf")):
	result = numberregex.search(infile)
	if result is not None:
		ipagenum = int(result.group(0))	# first run of digits in the filename
		if ipagenum > ROMAN_PAGES:
			opagestr = str(ipagenum - ROMAN_PAGES)
		else:
			opagestr = "roman_" + str(ipagenum)
		outfile = "page_" + opagestr + ".pdf"
		shutil.copy(infile, outfile)	# copies rather than moves, so the originals are kept

Extracting plaintext from PDFs

Use the following shell script (extract_pdf_text) as e.g. extract_pdf_text *.pdf. All it does is call pdftotext.

#!/bin/bash

# extract_pdf_text
# does very little!
# usage: extract_pdf_text *.pdf

# make sure you always put $f in double quotes to avoid any nasty surprises i.e. "$f"
for f in "$@"
do
  echo "Processing $f file..."
  pdftotext "$f" "$f.txt"
done

A sort of "multigrep"

Take a list of words (in a wordlist file); grep each word in turn against a bunch of files specified by the filespec; format the output a little. See below or multigrep.py.

#!/usr/bin/python2.7

import glob, subprocess

def raw_default(prompt, dflt=None):
	prompt = "%s [%s]: " % (prompt, dflt)
	res = raw_input(prompt)
	if not res and dflt:
		return dflt
	return res

def grep_files(word, files):
	# grep exits with status 1 when nothing matches; treat that as empty output, not an error
	try:
		return subprocess.check_output(["grep", "-e", word, "--"] + files) # this needs Python 2.7 or higher
	except subprocess.CalledProcessError as e:
		if e.returncode == 1:
			return ""
		raise

wordlistfile = raw_default("Wordlist file", "../../wordlist.txt")
resultsfile = raw_default("Results file", "../../abbreviation_search_results.txt")
infilespec = raw_default("Input filespec", "*.txt")

files = sorted(glob.glob(infilespec))	# expand the wildcard here; no shell needed
if not files:
	raise SystemExit("No files match " + infilespec)
infile = open(wordlistfile, "r")
wordlist = infile.read().split()
infile.close()
rule = "------------------------------------"
outfile = open(resultsfile, "w")	# the results are printed and also saved here
for word in wordlist:
	report = "\n".join([rule, word, rule, grep_files(word, files)])
	print report
	outfile.write(report + "\n")
outfile.close()
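
The same search can also be done without calling grep at all. A sketch, assuming plain substring matching is good enough (grep treats each word as a regular expression, so results can differ for words containing characters such as "." or "*"); the function name multigrep is hypothetical.

```python
def multigrep(words, paths):
    """Map each word to a list of (path, line) pairs for lines containing it."""
    results = {word: [] for word in words}
    for path in paths:
        with open(path, "r") as infile:
            for line in infile:
                for word in words:
                    if word in line:  # plain substring test, not a regex
                        results[word].append((path, line.rstrip("\n")))
    return results
```

For example, multigrep(wordlist, sorted(glob.glob("*.txt"))) (with glob imported) covers the same files as the default filespec above, and reads each file only once however long the wordlist is.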

Converting DOCX files

You could use OpenOffice. But sometimes it crashes. Alternative way:

# Prerequisites
sudo apt-get install rpm libgif4

# Now fetch odf-converter. Go to Novell, register (free)/log in, search under "OpenOffice", get "OpenOffice.OpenXML Translator 4.0" (e.g. odf-converter-4.0-12.1.i586.rpm)

# Now unpack/install
rpm2cpio odf-converter*rpm | cpio -ivd
sudo cp usr/lib/ooo-2.0/program/OdfConverter /usr/bin

# It wants libtiff.so.3, and we probably have libtiff.so.4, so give it a symlink
cd /usr/lib/
sudo ln -s libtiff.so.4 libtiff.so.3

# Can now use it:
OdfConverter /i example.docx

# That should produce example.odt (suitable for OpenOffice).

OpenOffice bits and bobs
