Analytics: my personal journey

If actuaries do not embrace data science, the sexiest job in the 21st century and the best job of 2015 may both go to data scientists. Actuaries may one day disappear! Source: Barnett Waddingham – The sexiest job in the 21st century vs the best job of 2015

I heartily agree, but I want to go further. The article highlights actuaries' domain expertise in insurance and suggests we should partner with data scientists. If actuaries want to deploy their risk management and other expertise beyond insurance, however, they need to skill up. This is my personal journey.

Case studies: a quick way of thinking about big data

Getting our hands and heads around big data: simpler thinking

Here's a way of being realistic about – rather than intimidated by – big data. I cannot remember the source of this idea; it's not mine.

  • Small data fits in a computer's memory (RAM) when it is analysed, without chunking it up (a sketch of the difference follows this list). Stunningly, all three examples below are 'small data'.
  • Medium data does not fit into a single computer's memory and instead needs permanent storage, e.g. in a database. (There are other reasons to use databases.)
  • Big data needs to be distributed across many computers – perhaps hundreds – because the data is too big, or arrives too fast, to be handled in other ways.
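Here is a minimal sketch of the practical difference between the first two categories, assuming pandas and a hypothetical file 'claims.csv' with an 'amount' column:

```python
# A minimal sketch of the small-vs-medium distinction, assuming
# pandas and a hypothetical file 'claims.csv' with an 'amount' column.
import pandas as pd

# Small data: the whole file fits in RAM, so read it in one go.
df = pd.read_csv("claims.csv")
print(df["amount"].sum())

# Medium data: too big for RAM on one machine, so process it in
# chunks and accumulate, never holding the full file in memory.
total = 0.0
for chunk in pd.read_csv("claims.csv", chunksize=100_000):
    total += chunk["amount"].sum()
print(total)
```

Big data is what remains when even this chunked, single-machine approach fails and the work must be distributed across a cluster.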

Learning data science in practice: three strands

What can someone do in practice to learn data science? Here are three strands I've used, and still use today.

[1] Get grounded in data science work

Here are three examples I've used personally:

  1. Pulling the bible apart: I took John Walkenbach's example and turbo-charged it, first in VBA and then in Python. With data wrangling based on only c31,000 records (verses of the bible), this was a simple first exercise. But it allowed me to (a) switch between (free) bible versions using back-end databases and (b) check against BibleGateway and correct the errors found there. It's also satisfying to build a histogram of all c12,000 bible words in 4 seconds (VBA) or 3 seconds (Python).

    Initially this appears to have more to do with backward-looking business intelligence than with data science. It's certainly a low initial hurdle, but:

    • Game-like predictive analytics. For a given anonymised text, bible books can be ranked by the probability that they generated the text (see the sketch after this list).
    • Unsupervised learning. The textual nature of the bible makes it an ideal example for techniques that guess genres and authors, identify stylistic points, etc.
  2. Kaggle's titanic tutorial. Based on a classic example, Titanic: machine learning from disaster provides a great introduction to predictive analytics using Excel, Python, R and Random Forests. Described as a Kaggle 'Getting Started' competition, it is an ideal starting place for people without a lot of experience in data science and machine learning. An unusual but helpful feature is that the tutorial starts with Excel before introducing the more complex data science libraries associated with Python. As a final bonus the problem set also has a parallel R track, fitting nicely with the DEVeloPeR stack described below.
  3. Enron's email corpus. Officially released into the public domain by US authorities in 2004, this is a large (500,000+ message) collection of Enron emails, mainly from the period leading up to the firm's filing for bankruptcy in December 2001. Python can read all the emails in a single statement, taking just 12 seconds; VBA is almost as fast. My work takes the same route as the Kaggle tutorial above: initial exploratory analysis in Excel, making use of Excel's pivot table functionality. So-called 'bag-of-words' techniques can be implemented in both VBA and Python, while sophisticated unsupervised learning results are delivered using Python libraries.
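To make the 'game-like predictive analytics' bullet concrete, here is an illustrative reconstruction in Python (not my original code): a unigram bag-of-words model with add-one smoothing that ranks candidate books by the probability that they generated a snippet of text. The two sample 'books' are tiny stand-ins for word counts that would really come from the back-end bible database.

```python
# An illustrative reconstruction of the 'which book generated this
# text?' game: a unigram bag-of-words model with add-one smoothing.
# The two sample 'books' are tiny stand-ins for word counts that
# would really be built from the back-end bible database.
import math
import re
from collections import Counter

def tokens(text):
    """Crude tokeniser: lower-case words, keeping apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

book_counts = {
    "Genesis": Counter(tokens("In the beginning God created the heaven and the earth")),
    "Psalms":  Counter(tokens("The LORD is my shepherd I shall not want")),
}

def log_score(book, text):
    """Log-probability (up to a constant) that `book` generated `text`."""
    counts = book_counts[book]
    total = sum(counts.values())
    vocab = len(counts) + 1                     # add-one smoothing denominator
    return sum(math.log((counts[w] + 1) / (total + vocab)) for w in tokens(text))

snippet = "the LORD is my light"
ranked = sorted(book_counts, key=lambda b: log_score(b, snippet), reverse=True)
print(ranked)                                   # most plausible source first
```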

[2] Learn a data science programming language

I learned Python and will provide a separate article on how to do this. My brief recommendations are:

  1. Learn fast using simple-but-realistic examples. I found the first five chapters of Head First Python ideal; quick learning in a week of afternoons!
  2. Tackle more demanding projects. Think Python is ideal. Free online and pdf versions include excellent word play and data structure case studies.
  3. Do data science-specific work. That comes through the examples above.

[3] Read a data science book

Why read such books? The good ones (*) will help in several ways, by providing:

  • motivating and inspirational examples that show what is possible.
  • relatively non-technical ways of thinking about the field, e.g. the differences between exploratory, explanatory and predictive analytics.
  • coverage of a wide range of data science techniques, using examples more than theory. (**)

(*) Two recommendations: Data Science for Business and Data Mining for Business Analytics

(**) The coverage is based on the second recommendation. Not all techniques are covered; those missing include random forests, ensembles and deep learning.

Data science skills: how actuaries measure up

What skills do data scientists need? What have actuaries got?

The US Casualty Actuarial Society's Data and Technology Working party 'seeks to research and identify the knowledge and skills actuaries must possess to participate in the changes brought about by a rapidly evolving technology supporting data and analytics'.

Their presentation Actuarial science versus: competing for relevance included a diagram, itself based on Drew Conway's famous data science Venn diagram.

Conway suggests that good data science needs:

  1. maths and statistics knowledge
  2. substantive expertise (e.g. domain knowledge of the industry / firm you're working in)
  3. hacking skills (data wrangling, using software and machine learning techniques)

The initial conclusion from that diagram is that actuaries have what Conway calls the 'traditional research' skills: maths/statistics and domain knowledge, especially for insurance.

Compete or collaborate?

Actuaries and Data Scientists – Match Made In Heaven or Hell? suggests that actuaries and data scientists have overlapping skills and should collaborate. That also seems to be Cherry Chan's conclusion. But actuaries may face an uphill challenge in getting their skills acknowledged. Elsewhere I'll explain why, and what to do.

The three circles and skill gaps: mapping actuaries' skills

[1] Substantive expertise

Actuaries have significant insurance expertise e.g. historically in mortality studies for life insurers. This naturally extends to other types of claim (e.g. motor insurance) so that actuaries' expertise is valued throughout insurance. There has been limited expansion elsewhere in financial services e.g. to banking.

My personal experience extends to analysing customer 'movement' data: granular new business analytics (quote, apply, proceed, underwriting, time and decision effects etc), retention/detriment/lapse, claims. This easily extends to (e.g.) fraud and credit risks and beyond financial services – most firms have customers.

[2] Maths and statistics knowledge

What is this maths and statistics knowledge really about? There are three areas:

  1. Exploratory data analysis. This area interacts with 'data wrangling' – see below – since we usually clean the data as part of coming to understand it. EDA consists of producing counts, ratios etc – non-parametric analyses rather than parametric models – but its outputs can be used for predictive purposes.
  2. Statistical modelling. These regression and optimisation techniques are largely parametric 'curve fitting'. The analyses, grounded in classical statistics, are usually explanatory and backward-looking, seeking a best fit on the full data set. They are also used for forward-looking purposes, e.g. insurance pricing.
  3. Machine learning. This divides into several areas, all developing rapidly (see the sketch after this list):
    • Supervised learning: mainly trying to predict (continuous output) or classify (categorical output) output data from input data. Predictive techniques go beyond statistical curve fitting: a formal approach is taken, splitting data into training and validation sets, with performance measures deliberately emphasising predictive value over goodness of fit. People familiar with maths and statistical modelling should find this an interesting and manageable stretch.
    • Unsupervised learning: a more challenging area, with an emphasis on learning rather than prediction; there are initially no output categories or values with which to label the data, complicating the assessment – what's 'right'? Another way of thinking about the difference is in terms of probability: supervised learning models the outputs conditional on the inputs, while unsupervised learning models the structure of the inputs themselves.
    • Semi-supervised learning: naturally, a hybrid. Data which is essentially unlabelled is combined with a small amount of labelled data, usually via skilled human intervention. The labels can be a core (external) feature of the data or synthetic, i.e. the output of some process.
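To make the supervised-learning point concrete, here is a minimal sketch assuming Python with scikit-learn and one of its built-in datasets; the dataset, model and parameters are illustrative choices, not a recommendation. The key step is the hold-out validation set, so that performance is measured out of sample rather than as goodness of fit:

```python
# A minimal supervised-learning sketch, assuming scikit-learn and its
# built-in breast cancer dataset; dataset, model and parameters are
# illustrative choices, not a recommendation.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# The formal step: hold out a validation set so that performance is
# measured on data the model has never seen, not on goodness of fit.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
```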

Actuaries usually have a mathematical background, often with a maths degree. Early in their career an actuary develops significant practical cashflow modelling skills. Making progress in a firm usually requires developing many other skills, and that can lead to the mathematical skills becoming rusty, or even devalued.

In Actuaries, Risk Management and Data Science I suggest how the actuarial skillset could be applied more widely, starting with an initial model building focus.

My personal experience. I've done 1 and 2 above for 20+ years, using these techniques for insurance new business process development, including data-driven underwriting rule development. I really got into statistical modelling via papers such as On Graduation By Mathematical Formula and the CMI papers that used techniques such as logistic regression: GM(r,s) and LGM(r,s) in actuarial-speak (sketched below). Happy days!
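For the non-actuaries, and as I recall the definitions (check the paper for the exact conventions): GM(r,s) models the force of mortality as a polynomial plus an exponentiated polynomial, and LGM(r,s) passes the same form through a logistic link so the fitted rate stays between 0 and 1.

```latex
\mu_x = \mathrm{GM}(r,s)(x)
      = \sum_{i=0}^{r-1} \alpha_i x^i
      + \exp\!\left( \sum_{j=0}^{s-1} \beta_j x^j \right),
\qquad
q_x = \frac{\mathrm{GM}(r,s)(x)}{1 + \mathrm{GM}(r,s)(x)} \quad \text{(LGM)}
```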

[3] Hacking skills

Hacking is pure grinding excellence: so many problems to solve in such a short amount of time leads to large amounts of hacking. – Ricardo Vladimiro, in pointers to hackers and hacking

Vladimiro suggests that hackers:

  • ... love quick prototyping and coding in an exploratory and creative way.
  • ... often produce ugly, and perhaps not best-performing, code.
  • ... get the job done, where others say 'too hard' or 'impossible'.
  • ... accept that production code requires adherence to different standards.

This will sound good to some people. But what hard skills are required of a skilled hacker?

  1. Data wrangling. Sometimes called 'munging', this comprises a number of steps within a generic 'extract, transform and load' process. Perhaps surprisingly, getting to know, checking and cleaning the data can take up more than half of the analyst's time. The good news is that tools as simple as Excel and VBA have extraordinarily powerful methods to crunch through this stage, with the huge advantage of visual output and 'try as you go'. A hacker's paradise! (A wrangling sketch, combining a database with Python, follows this list.)
  2. Programming languages. A number of articles suggest that it's a choice between R and Python:

    Four main languages for Analytics, Data Mining and Data Science takes a slightly more rounded view, noting that SQL is a core skill. Purists will argue that this is a separate point; R and Python can interact with databases and act as 'hosting languages' for SQL. I have a DEVeloP perspective:

    • D is for Database. Not all analyses can be carried out 'in memory'. In any case sometimes we choose to pre-process and store results.
    • E is for Excel. Used well, Excel is ridiculously good! I agree with one expert who states that Excel is the most underutilised software in the world.
    • V is for VBA. It can glue Excel to a one-billion-record SQL Server database, or act as a standalone programming language. Do learn the Excel object model.
    • P is for Python. A general purpose programming language with data science strengths, learning Python has been a great move.

    Before you call me a heretic: according to the O'Reilly 2015 survey, Excel was the second most popular tool – after SQL – among data science professionals. When the need arises I can easily extend this to a DEVeloPeR perspective.

  3. Machine learning. Covered above; machine learning is the overlap between maths/statistics and hacking, according to Drew Conway.
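As promised in the data wrangling point above, here is a minimal sketch of the D and P of DEVeloP working together: let SQL do the set-based heavy lifting, then finish the wrangling in Python. The database file, table and column names are hypothetical.

```python
# A minimal sketch of the D and P of DEVeloP working together.
# The database file, table and column names are hypothetical.
import sqlite3

import pandas as pd

conn = sqlite3.connect("policies.db")

# Let SQL do the set-based heavy lifting (the 'hosting language' point)...
query = """
    SELECT product,
           COUNT(*)     AS n,
           AVG(premium) AS avg_premium
    FROM   policies
    WHERE  premium IS NOT NULL        -- basic cleaning pushed into SQL
    GROUP  BY product
"""
summary = pd.read_sql_query(query, conn)
conn.close()

# ...then finish the wrangling in pandas.
summary["avg_premium"] = summary["avg_premium"].round(2)
print(summary.sort_values("n", ascending=False))
```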

Actuaries have a relative weakness in this area, as identified by initial actuarial research. I believe this weakness ranges from the poor use of VBA (too much macro recording, little understanding of the Excel object model) to, more understandably, not being up to speed on modern data science techniques.

My personal story is that I've always had a more-than-average interest in this area. I've strengthened old abilities (Excel, VBA, data wrangling), extended and updated others (from statistical modelling to machine learning) and learned some brand new tricks (Python). There's a lot more to come here.

How I can help you: a range of possibilities

I've been doing real world analytics since 2002 – way before it became "the next big thing". Projects include:

  • Risk and retail pricing at a market-leading insurer
  • Data-driven underwriting and new business processes at a large reinsurer
  • Building a competitive MI facility using webbots and what used to be called "screen scraping"

My work is:

  • Practical: I'm your man if you're dipping your toe in the water and don't want to be sold a "big data" project or system.
  • Savvy: I've found rogue traders. I focus on commercially valuable insights. I've seen what works and it's not always so hard.
  • Robust: I can work with spreadsheets, databases or a combination of the two. Here's a toy example.

Analytics is fundamentally about making better decisions through enhancing knowledge: uncovering then exploiting new factors and relationships. There is a strong relationship to risk management via risk-adjusted returns and pricing. Analytics is absolutely not about backward-looking voluminous reports.

Big or small, let's talk about your project.

© 2014-2017: 4A Risk Management; a trading name of Transformaction Development Limited