What I did at Skimlinks

Commission Importers

This draft skips over a lot of detail about how the implementation works. I’m planning to present it as literate programming, but until then you can read the linked sources.

At our 2016 Christmas party I found myself venting at our CEO, “we’re wasting the operations team’s time,” I said, “they spend half the month on a task I’m sure a computer could do.” At the time I had no proof, but over the holidays I prototyped a solution and “operationalize commoditized processes”—legalese for “automate repetitive tasks”—appeared in our guiding principles for 2017.

Each month the four-person operations team spent a fortnight verifying that last month’s commissions were imported correctly; and in the all-too-common case that an error was found, they made the appropriate change in the database. Errors were rarely reported to the integrations team responsible for developing the importers because waiting for a fix took much longer than the manual change, and so their workload grew every month.

My idea was to make operations responsible for developing the importers so that fixing bugs was the path of least resistance. They aren't developers, but they are technically savvy. If you can write a SQL query or a spreadsheet then a parser is well within your grasp.

Commissions

The core business is commissions, Skimlinks acts as a middleman between content publishers and affiliate networks. When a visitor clicks on an outbound link we transparently convert it into a tracking link and if that click leads to a conversion we pass the commission on to the publisher after taking a small cut.

Each affiliate network has a slightly different API for commissions: paginated or not; XML, JSON or CSV; by creation or modified date; etc. But ultimately they all contain the same content1: the ID we set in the affiliate link, the amount of commission earned, and the date of the transaction.

There are actually four types of commission APIs, those reporting baskets vs items, and those reporting states vs differences. Most bugs are the result of developers misidentifying which type they are dealing with, the solution is to have the user to choose the type and enforce appropriate tests.

import collections, csv, decimal, json

# Definitions of parse_datetime, parse_date, parse_time and compose_datetime.

Commission = collections.namedtuple('Commission', 'id commission date')
Money = collections.namedtuple('Money', 'amount currency')

def csv_parser(rows):
    for row in rows:
        yield Commission(id=row['Reference'],
                         commission=Money(decimal.Decimal(row['Commission']),
                                          row['Currency']),
                         date=compose_datetime(parse_date(row['TransactionDate']),
                                               parse_time(row['TransactionTime'])))

def json_parser(doc):
    for row in doc['commissions']:
    	yield Commission(id=row['clickRef'],
    	                 amount=Money(decimal.Decimal(row['amount']), 'USD'),
    	                 date=parse_datetime(row['created']))

>>> list(csv_parser(csv.DictReader([
... "ID,Reference,Commission,Sale,Currency,TransactionDate,TransactionTime,PaidDate,PaidTime",
... "1,ac1d1c00,10.00,100.00,USD,2017-01-01,00:00:00,2017-01-02,00:00:00",
... "2,badcafe0,6.16,61.60,USD,2017-01-02,09:00:00,,",
... "3,c0ffee00,7.00,75.00,GBP,2017-01-02,12:30:00,,",
... "4,deadbeef,15.00,60.00,GBP,2017-01-02,16:00:00,,",
... "5,ea51e570,1.00,15.00,EUR,2017-01-02,17:45:00,,",
... ])))
[Commission(id='ac1d1c00',
            amount=Money(amount=Decimal('10.00'), currency='USD'),
            date=datetime.datetime(2017, 1, 1, 0, 0)),
 Commission(id='badcafe0',
             amount=Money(amount=Decimal('6.16'), currency='USD'),
             date=datetime.datetime(2017, 1, 2, 9, 0)),
 Commission(id='c0ffee00',
             amount=Money(amount=Decimal('7.00'), currency='GBP'),
             date=datetime.datetime(2017, 1, 2, 12, 30)),
 Commission(id='deadbeef',
             amount=Money(amount=Decimal('15.00'), currency='GBP'),
             date=datetime.datetime(2017, 1, 2, 16, 0)),
 Commission(id='ea51e570',
             amount=Money(amount=Decimal('1.00'), currency='EUR'),
             date=datetime.datetime(2017, 1, 2, 17, 45))]

>>> list(json_parser(json.loads(
... """\
... {
...   "commissions": [
...     {
...       "clickRef": "0a750000",
...       "amount": 12.30,
...       "created": "2017-01-01T09:30:00Z"
...     },
...     {
...       "clickRef": "13370000",
...       "amount": 17.00,
...       "created": "2017-01-02T01:30:00Z"
...     }
...   ]
... }
... """)))
[Commission(id='0a750000',
            amount=Money(amount=Decimal('12.30'), currency='USD'),
            date=datetime.datetime(2017, 1, 1, 9, 30)),
 Commission(id='13370000',
            amount=Money(amount=Decimal('17.00'), currency='USD'),
            date=datetime.datetime(2017, 1, 2, 1, 30))]
Parsing commission reports.

DSLs and IDEs

Each parser is a single expression mapped over each row that transforms the columns into parameters for the Commission constructor2. JSON and XML are simple to transform into a columnar format and we already have examples of all the transformations we would want to apply to a column, so I extracted those functions as the built-ins for the parser DSL.

And also how to fetch the report, and an expression that extracts the rows from the report which we deal with similarly.

We can use the types of the built-ins to limit where they are used; for example date is a datetime and thus the only reasonable candidates are compose_datetime and parse_datetime. In Python we handled errors by raising exceptions, but they only show the first error so for the DSL we use something similar to Haskell’s AccValidation to report all the errors3.

The elegance of AccValidation was discovered when rewriting the prototype code for production, all the implementations here use exceptions.

import Data.Time hiding (parseTime)
import Data.Validation

type Errors = [String]

newtype DatetimeFormat = DatetimeFormat String
iso8601DTF = DatetimeFormat "%Y-%m-%dT%H:%M:%S"

parseDatetime :: DatetimeFormat -> String -> AccValidation Errors UTCTime
parseDatetime (DatetimeFormat f) s =
  case parseTimeM False defaultTimeLocale f s of
    Just t  -> AccSuccess t
    Nothing -> AccFailure ["not a datetime"]

composeDatetime :: Day -> TimeOfDay -> UTCTime
composeDatetime d t = UTCTime d $ timeOfDayToTime t

newtype DateFormat = DateFormat String
iso8601DF = DateFormat "%Y-%m-%d"

parseDate :: DateFormat -> String -> AccValidation Errors Day
parseDate (DateFormat f) s =
  case parseTimeM False defaultTimeLocale f s of
    Just t  -> AccSuccess t
    Nothing -> AccFailure ["not a date"]

parseTime :: String -> AccValidation Errors TimeOfDay
parseTime s =
  case parseTimeM False defaultTimeLocale "%H:%M:%S" s of
    Just t  -> AccSuccess t
    Nothing -> AccFailure ["not a time"]

> parseDatetime iso8601DTF "2017-01-01T00:00:00"
AccSuccess 2017-01-01 00:00:00 UTC
> parseDatetime iso8601DTF "$1.00"
AccFailure ["not a datetime"]

> -- HINT: <$> and <*> are on the Applicative instance and apply
> --       the function if possible, and accumulates errors if not.
> composeDateTime <$> (parseDate iso8601DF "2017-01-01") <*> (parseTime "00:00:00")
AccSuccess 2017-01-01 00:00:00 UTC
> parseDate iso8601DF "$1.00"
AccFailure ["not a date"]
> composeDatetime <$> (parseDate iso8601DF "$1.00") <*> (parseTime "0x0123")
AccFailure ["not a date","not a time"]
Using AccValidation to collect errors.

An IDE for developing and testing parsers in our DSL is straightforward from here. We use type information (with automatic insertion of <$> and <*>) to limit which functions the user can choose, and map the expression over the rows to support debugging.

Prototype IDE (source).

Autocomplete

While developing the IDE I noticed that many of my choices were forced: only one column could be the tracking ID, or only one column is a date in the past, or a pair of columns that represented the commission and sale value. Inspired by Barliman, I wanted the IDE to fill in those choices automatically.

Taking Commission as the root type I searched the graph of all possible programs for implementations that successfully return a commission for each row.

import enum, inspect, typing

def signature(c):
    # Types have callable constructors (i.e. `__init__`).
    if isinstance(c, type):
        default_return = c
        skip = 1 # Ignore `self`.
        c = c.__init__
    # Regular functions.
    else:
        default_return = typing.Any
        skip = 0

    argspec = inspect.getfullargspec(c)
    types = typing.get_type_hints(c)
    parameters = [types.get(a, typing.Any) for a in argspec.args[skip:]]
    returns = types.get("return", default_return)
    return typing.Callable[parameters, returns]

def make_namespace(values):
    """Return a namespace containing `values` with inferred types.

    `Enum`s are expanded into their values:

    >>> class Toggle(enum.Enum):
    ...     ON = True
    ...     OFF = False
    >>> make_namespace([Toggle])
    [(<Toggle.ON: True>, <enum 'Toggle'>),
     (<Toggle.OFF: False>, <enum 'Toggle'>)]
    """
    namespace = []
    for value in values:
        if isinstance(value, enum.EnumMeta):
            for member in value.__members__.values():
                namespace.append((member, value))
        elif callable(value):
            namespace.append((value, signature(value)))
        else:
            namespace.append((value, type(value)))
    return namespace

def expressions(namespace, type_, arguments):
    """Return expressions of type `type_` in `namespace`."""
    if issubclass(type_, typing.Callable):
        parameters = ((Argument(i), t) for i, t in enumerate(type_.__args__[:-1]))
        namespace_ = itertools.chain(namespace, parameters)
        return expressions(namespace_, type_.__args__[-1], arguments)
    else:
        return _expressions(type_, arguments, oftype(namespace))

def _expressions(type_, arguments, oftype):
    # HINT: This type system is invariant.
    for value in oftype[type_]:
        if isinstance(value, Callable):
            potential = (_expressions(t, arguments, oftype)
                         for t in value.parameters)
            potential = list(map(list, potential))
            for _ in itertools.product(*potential):
                expressions, values = zip(*_)
                try:
                    results = []
                    # FIXME: This is ridiculous!
                    vs = (v if len(v) > 1 else v * len(arguments[0]) for v in values)
                    for args in zip(*vs):
                        results.append(value.callable(*args))
                except Exception as e:
                    results = None
                    bound = Call(value.callable, map(Value, args))
                    logger.debug("Discarded %s: %s raised %r",
                                 value,
                                 _pretty(bound),
                                 e)
                if results is not None:
                    yield Call(value.callable, expressions), results
        elif isinstance(value, Argument):
            yield value, arguments[value.position]
        else:
            yield Value(value), [value]

>>> import datetime, re
>>> def parse_datetime(s: str) -> datetime.datetime:
...     s = re.sub(r"[+-][0-9]{2}:[0-9]{2}$", "", s)
...     return datetime.datetime.strptime(s, "%Y-%m-%dT%H:%M:%S")

>>> class DateFormat(enum.Enum):
...     ISO_8601 = "%Y-%m-%d"
...     US = "%m/%d/%y"
...     UK = "%d/%m/%y"

>>> def parse_date(s: str, f: DateFormat) -> datetime.date:
...     return datetime.datetime.strptime(s, f.value).date()

>>> def parse_time(s: str) -> datetime.time:
...     return datetime.datetime.strptime(s, "%H:%M:%S").time()

>>> def compose_datetime(d: datetime.date, t: datetime.time) -> datetime.datetime:
...     return datetime.datetime.combine(d, t)

>>> namespace = make_namespace([parse_datetime, compose_datetime,
...                             parse_date, DateFormat,
...                             parse_time])
>>> list(expressions(namespace,
...                  typing.Callable[[str], datetime.datetime],
...                  [("2017-01-01T00:00:00",)]))
[(Call(callable=<function parse_datetime at 0x...>,
   arguments=(Argument(position=0),)),
[datetime.datetime(2017, 1, 1, 0, 0)])]
Searching programs (usage in prototype).

While the assert-based autocomplete eliminated some of the manual work it wasn’t able to decide which date was the transaction date (probably the earliest one) or which number was the commission amount (probably the smallest positive one). To address this I introduced a variant of tests that return confidence values (i.e. between zero and one) instead of booleans. When faced with multiple possible solutions these tests could break ties if one solution scored significantly higher than the rest; without a clear winner a human was still needed.

Benefits