Explainable AI

One of the emerging challenges in machine learning is transparency: put simply, not all ML models are easily explainable. Historically this does not seem to have been a major problem, and such models are often simply described as 'black boxes', a term which may even have added to the mystique surrounding AI.

The issue with black boxes is that you can't easily tell what's happening inside, a problem compounded where an ML model is also non-deterministic (in fact, or for all practical purposes). So how does this problem manifest itself?
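
As a minimal illustration of one way to peer inside a black box, the sketch below (Python with scikit-learn; the synthetic dataset, model choices and parameters are assumptions for illustration only) fits an opaque model and then trains a small decision tree as a global surrogate that approximates its behaviour:

    # Illustrative sketch only: approximating a 'black box' model with a small,
    # interpretable surrogate. Dataset, models and parameters are arbitrary choices.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import DecisionTreeClassifier, export_text
    from sklearn.metrics import accuracy_score

    # Synthetic data standing in for a real problem
    X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

    # The 'black box': accurate, but hard to inspect directly
    black_box = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

    # Global surrogate: a shallow tree trained to mimic the black box's predictions
    surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
    surrogate.fit(X, black_box.predict(X))

    # Fidelity: how closely the surrogate tracks the black box
    fidelity = accuracy_score(black_box.predict(X), surrogate.predict(X))
    print(f"Surrogate fidelity vs black box: {fidelity:.2f}")

    # The surrogate's rules are small enough to read and reason about
    print(export_text(surrogate, feature_names=[f"f{i}" for i in range(10)]))

The surrogate only approximates the black box, but its rules give a readable, global view of what the opaque model appears to be doing.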

Continue reading...

Six Myths About Data Science

There is a huge amount of hype around data science right now, on the back of previous hype around big data, and this has led to a number of misconceptions, myths and untruths, which we discuss here:


Myth No 1: Data Science is Now Largely Automated

Given the rapid advances in, and availability of, data science frameworks and data warehousing solutions, and because these tools are becoming so much simpler to use, there is a perception that data science is now easy and largely automated: just take all your data, push it to the cloud, run some analysis, and miraculously insights and solutions pop out.

Continue reading...

Blockchain consensus mechanisms

Below are overviews of various techniques associated with blockchain convergence: these are variants of the consensus algorithms by which a (typically cryptocurrency-based) blockchain network aims to achieve settlement. Note that not all variants are covered here and this page will be updated over time; the most widely used are described first:

  1. Proof of Work (PoW)
  2. Proof of Stake (PoS)
  3. Others: RAFT, PBFT, Binary BFT, Async BFT, RPCA

Proof of Work (PoW)

The ability to validate transactions and create new blocks (especially in an unpermissioned cryptocurrency such as Bitcoin) demands strict controls, to avoid the potential for double-spending or rewriting history. Proof of Work (PoW) is a blockchain consensus algorithm, implemented in Bitcoin by Satoshi Nakamoto in 2009 and later adopted by Ethereum.
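
To give a rough feel for the hash-puzzle idea behind PoW, the sketch below searches for a nonce such that the block's hash falls below a target. This is a simplified illustration only: the block contents, difficulty encoding and hash construction here are assumptions, not Bitcoin's actual format.

    # Simplified Proof of Work sketch: find a nonce so that the block's hash
    # falls below a target derived from the difficulty. Illustration of the
    # general idea only, not Bitcoin's actual block or difficulty format.
    import hashlib

    def mine(block_data, difficulty_bits=20):
        # The 'work': repeatedly hash until a nonce produces a hash below the target
        target = 2 ** (256 - difficulty_bits)
        nonce = 0
        while True:
            digest = hashlib.sha256(f"{block_data}|{nonce}".encode()).hexdigest()
            if int(digest, 16) < target:
                return nonce, digest
            nonce += 1

    def verify(block_data, nonce, difficulty_bits=20):
        # Verification is cheap: a single hash compared against the target
        digest = hashlib.sha256(f"{block_data}|{nonce}".encode()).hexdigest()
        return int(digest, 16) < 2 ** (256 - difficulty_bits)

    nonce, digest = mine("prev_hash|tx_root|timestamp")
    print(nonce, digest)
    print(verify("prev_hash|tx_root|timestamp", nonce))

The asymmetry is the point: finding a valid nonce is expensive, while checking one is a single hash, which is what makes rewriting history costly.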

Continue reading...

Blockchain Scalability Metrics

Below are a number of stats on blockchain scalability. It's important that we compare apples with apples, so note that some of these blockchains are public ('unpermissioned', such as Bitcoin) and some are private (permissioned). Because of the consensus and associated security mechanisms required to maintain integrity in public blockchains, it's typically far less challenging to achieve higher transaction rates (transactions per second, or tps) and faster settlement times in permissioned networks (where, for example, Proof of Work may not be required, or pre-emptive voting can be used).

As a point of reference for the financial industry:

  • VISA estimates it can handle up to 50,000 tps, and approximately 1,667 tps on average
  • Paypal estimates up to 450 tps, and approximately 200 tps on average
  • Settlement times using traditional industry methods are typically in the order of 3-5 days

In order for blockchain to support large-scale payment networks (e.g. Visa), stock markets (e.g. Nasdaq, NYSE, FTSE), as well as IoT and sensor networks, it will need to handle hundreds of thousands of transactions per second.
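
To make the gap concrete, raw throughput for a simple chain can be estimated from block capacity and block interval. The figures in the sketch below (block size, average transaction size, block interval) are illustrative, roughly Bitcoin-like assumptions rather than measured values for any specific network.

    # Rough throughput estimate for a simple blockchain:
    #   tps ~= (block size / average transaction size) / block interval
    # All figures are illustrative assumptions, not measured values.
    def estimated_tps(block_size_bytes, avg_tx_bytes, block_interval_s):
        txs_per_block = block_size_bytes / avg_tx_bytes
        return txs_per_block / block_interval_s

    # Roughly Bitcoin-like parameters: 1 MB blocks, ~500 byte transactions,
    # 10 minute (600 s) block interval -> about 3.3 tps
    print(f"{estimated_tps(1_000_000, 500, 600):.1f} tps")

Set against the VISA figures quoted above (roughly 1,667 tps on average, up to 50,000 tps peak), a chain with parameters like these is several orders of magnitude short of what large-scale payment networks require.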

Continue reading...

Selecting event sources for cybersecurity threat modelling

In order to experiment with cyber threat detection techniques you need lots of data. Fortunately, modern networked systems generate huge volumes of events, including packet traces, flow statistics, firewall logs, server logs, audit trails, transaction logs and so forth. The real challenge is getting hold of representative captures of this data that include recent threats with realistic distributions.

How to Obtain Representative Event Data

There are several ways in which you can go about obtaining useful data, including:

  1. Public repositories
  2. Private repositories
  3. Reproduce a Live Environment
  4. Simulation

Public Repositories: The advantage of public sources is that they provide a common dataset against which findings can be compared with other researchers. These datasets tend to be fairly generic, with a broad range of threat vectors. That said, there are particular challenges in finding fully representative and current data: these datasets may be several years old, the threat landscape changes quickly, and in some cases the overall event composition may be of questionable value (depending on how the original data was generated). Nevertheless these datasets are still valuable places to start, and several have been used widely in comparative studies within the research community.
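
As a minimal sketch of a first step with a public repository, the example below loads flow-style event records and inspects their composition before using them for experiments. The file name, column names and label values are hypothetical placeholders: each real dataset has its own schema, so these would need to be adapted.

    # Minimal sketch: load a (hypothetical) CSV of flow-style event records and
    # inspect its composition before using it for threat-detection experiments.
    # File name, column names and label values are placeholders, not a real schema.
    import pandas as pd

    flows = pd.read_csv("flow_records.csv")   # hypothetical export from a public dataset

    # How are threat classes distributed? A heavily skewed or stale label mix is a
    # warning sign that the capture may not represent current traffic.
    print(flows["label"].value_counts(normalize=True))

    # Basic sanity checks on the capture window and traffic mix
    flows["timestamp"] = pd.to_datetime(flows["timestamp"], errors="coerce")
    print("Capture window:", flows["timestamp"].min(), "to", flows["timestamp"].max())
    print(flows["protocol"].value_counts().head())

Checks like these (label balance, capture age, protocol mix) are a quick way to judge how representative a public dataset is likely to be before investing effort in modelling against it.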

Continue reading...