Blog – Information Technology Consulting

Gennaio 8, 2025

Fostering ethical machine learning

If there is a sport that, in my opinion, can serve well to explain how machine learning works, it’s tennis. Training requires thousands of balls, and it’s estimated that over ten years of practice, more than a million shots are played.

Why, then, do even professional players sometimes miss seemingly easy shots during matches when their error rate in training is often much lower? This is a typical case of overfitting, where the model has been generated from countless balls played, mostly of the same type, hit by the coach or trainer, while during a tournament, players encounter opponents and game situations that are very different—some never seen before.

Styles of play, speed, ball spin, and trajectories can be entirely different from those seen in training. Personally, I think I’ve learned more from matches I lost miserably than from months of training with similar drills. No offense to coaches and trainers—they know it well themselves, having built their experience largely through hundreds of tournaments and diverse opponents.

The similarity with machine learning is quite obvious. Machine learning, like tennis, requires preparation based on experience and the ability to adapt to unforeseen contexts. A model trained on overly homogeneous data may seem very effective during training but fail to recognize new situations—a limitation that only diverse exposure can overcome. Just as a tennis player grows stronger by facing opponents with different styles, a machine learning model improves with data that reflects the variety and complexity of the real world.

In both cases, improvement doesn’t come solely from mechanical repetition but from iterative learning, analyzing errors, and refining strategies. Each mistake, each failure, is a step toward a more resilient and capable system—or player. This is the key to overcoming the limits of overfitting and building skills that go beyond mere memorization, allowing excellence in unexpected conditions.

Now let’s move on to the part that interests us the most: biasing. By its nature, an explainable machine learning algorithm (for example, a decision tree) generates models that must “split” on attributes. At some point within the tree (unless the tree’s depth is reduced to avoid this situation), a decision will have to be made based on the value of an attribute, which could lead to discrimination based on gender, age, or other factors.

Less explainable algorithms produce results that evaluate all variables simultaneously, thus avoiding decisions based on a single variable. However, there are methodologies (like Shapley values or insights such as those from Antonio Ballarin https://doi.org/10.1063/5.0238654) that allow verification of the impact of a single variable’s variation on a particular target value.

In short, no matter how balanced the dataset is and how low the impact of the observed variable on the target is, there will always be slight biasing in the generated model. A temporary solution, considering the tennis example, is to eliminate the variable that could cause the model to behave in ways deemed unethical (e.g., age, gender, nationality) and construct an initial model that is certainly less accurate than one using all variables but usable from day one.

As the model learns, increasingly de-biased data will be provided (data must be filtered at the source, balancing the number of cases, for instance, between genders). Meanwhile, the algorithm (which at this point won’t know the value of the excluded attribute because it doesn’t exist) will update the model, enabling it to generalize more and more—like an athlete participating in a large number of tournaments.

Dicembre 6, 2024Dicembre 6, 2024

Ethics of algorithms or data? Or how they are used?

By now, we are all aware of the potential of AI and, to some extent, the risks associated with its unethical use.

However, I would like to bring attention to a use case that might change the perception of what is ethical and what the definition of ethics entails.

A well-known open dataset from UCI includes the characteristics of employees in a company, and among the attributes, there is a variable that can be used as a target, representing the status of the employee (attrition: yes or no).

The objective could be to pay more attention to employees who, according to the model, appear to be at higher risk of attrition, and this goal might alter the concept of ethics (which, by the way, is not uniform across communities, cultures, or contexts). For example, dataset bias related to attributes like gender or age in this case could help focus more on the disadvantaged groups (here meant as attributes). This is just a different point of view and it doesn’t necessarily mean it’s an ethical approach (e.g. somebody may object that a model built on this data would allow for retaining just resources with high scores in performance reviews).

Below is an analysis of the dataset that highlights some interesting aspects, such as the importance of certain attributes that may not be intuitively significant, or vice versa. For instance, after removing the employee number, which represents an identity, monthly salary ranks only fifth in importance, while gender is among the least important, thus having minimal influence on the target variable.

By breaking down according to the maximum value of Gini impurity 2p*(1-p), a binary tree is constructed in this way and shown in figure.

The second variable to observe is precisely OverTime, which also represents the dependency of attrition on the overtime value recorded for the employee. In this case we use a CNN and shapley values to determine dependence of the target from independent variables.

Finally, we must note that age has strong impact on the decision, but it is quite fragmented and it is selected to separate very well the classes close to the leaves. Here below two examples of clear separation between the two classes.

Edited by G.Fruscio

Settembre 7, 2024

Mean variance and 1/N heuristic portfolio

As a next step, as non-expert of portfolio management, i found interesting books and papers making different use of the nobel-prize Markowitz strategy, AKA mean-variance portfolio https://en.wikipedia.org/wiki/Markowitz_model (MPT Modern Portfolio Theory) applied to 10 common stocks from the Nasdaq100.

The applied algorithm doesn’t allow for trades at every timeframe (15m, hour, day) thus some theories cannot be applied (e.g. fixed allocation, long strategy, etc.), however the best portfolio and weights with lowest risk that was found brought to a CAGR = 0.5402, SHARPE RATIO = 9.5759, MAX DRAWDOWN = 2,6948%.

A different approach would be to apply the 1/N heuristic that is a simple investment strategy where an investor allocates an equal proportion of their total capital across N assets. It offers some advantages, that is it simple and it may reduce the unsystematic risk associated with individual asset (it is not guaranteed that past behavior will be the same as future behavior), but it’s not optimal in terms of risk-adjusted return since it treats all assets equally, regardless of their risk/return characteristics.

The underlying selection follows a fast and frugal approach that is selecting first the underlying with acceptable risk (e.g. Max Drawdown) and then picking the 10 best performing. Resulting CAGR = 1.0249, SHARPE RATIO = 10.0465, MAX DRAWDOWN = 3,2248%

Agosto 27, 2024Agosto 27, 2024

Multislot Performance Example

In algorithmic trading platforms, “multislot” refers to the system’s ability to manage multiple trading algorithms or strategies simultaneously. Each “slot” represents a distinct strategy or trading approach, allowing the system to execute several strategies in parallel. This capability enhances both optimization and diversification, as the system can apply these strategies within the same market or across different markets, assets, or trading techniques. Essentially, “multislot” allows the system to handle multiple orders at the same time. For example, a trader could place various orders to buy or sell different assets in varying quantities, each with specific execution criteria (e.g., market orders, limit orders). All of these are managed concurrently by the trading system.

In this example, we’ll make some assumptions:

The underlying assets are high-volume stocks, selected blue-chip companies (e.g., AAPL, ADBE, AMD, AMZN, CSCO, GOOG, INTC, MRVL, MSFT, NVDA, TSLA);
The algorithms demonstrate an average accuracy of 57%. For this simulation, the trading signals are selected randomly;
The trading strategy is straightforward: positions are either long or short, with positions opened at the market open and closed at the market close;
Each slot has a fixed capital allocation of $10,000;
Lever is equal to 1.

The results of the simulation is shown in the previous figure.

Deductions from gross gains should be applied and include:

Trading fees: Assume $2 per transaction. Given 250 trading days, this would result in an annual trading fee of $1,000 per underlying;
Taxes: Tax liabilities can vary significantly depending on the country;
Market volatility: Since this is a long/short strategy, we assume that the volatility at the open and close of the market will often cancel out, meaning any movement in one direction will likely be balanced by the opposite movement, resulting in a net sum of zero.

(Image: Photo of Coinstash from Pixabay)

Maggio 8, 2024Dicembre 15, 2024

Protetto: What’s the frequency Kenneth

This entry is part 5 of 6 in the series Machine Learning

Dicembre 17, 2023Dicembre 15, 2024

Timeseries Forecast

This entry is part 4 of 6 in the series Machine Learning

Ancient civilizations, such as the Romans and the Greeks, used various methods for predicting the future. These methods were often based on superstition and religious beliefs. Some of the techniques used by the Romans included divination through chickens, human sacrifice, urine, thunder, eggs, and mirrors. The Ancient Greeks also had a variety of methods, such as water divination, smoke interpretation, and the examination of birthmarks and birth membranes. These practices were deeply rooted in their cultural and religious traditions. A scientific approach (one of many possible) involves the use of Machine Learning. In Machine Learning there are a number of benefits of time series prediction: identifying patternsin data, such as seasonal trends, cyclical patterns, and other regularities that may not be apparent from individual data points. Obviously, the main goal is to predict future data points based on historical patterns, allowing businesses to anticipate trends, plan for demand, and make informed decisions. Finally, it helps in better understanding a data set and cleaning the data by filtering out the noise, removing outliers, and gaining an overall perspective of the data. These methods are widely used in various industries, including finance, economics, and business, to make data-driven decisions and predictions. Time-series prediction work was partially founded on the “Laocoonte” project and was based on the fast training of a large number of ML models, tested in parallel and applied each step with the most accurate model. The prediction module was tested in two specific areas, indicated by the availability of real data and future business opportunities, in particular time series prediction (temperature and stock value prediction), tabular data prediction (churn prediction).
In order to obtain a benchmark against traditional models based on statistical algorithms (ARIMA type), the time series relating to the average daily temperature of Vancouver over the last twenty years was taken as a reference. The sample was subsequently divided in this way:

66% of the samples to train the model;
33% of samples to test the model.
The result, applied for 365 samples, is shown in the following figure and the measured MAPE (Mean Absolute Percentage Error) is 1.206. The result, applied for 365 samples, is shown in the following figure and the measured MAPE (Mean Absolute Percentage Error) is 1.206.

For the same sample, this time drawn in a different year (September 2004 – September 2005), the result obtained using the Auto-ML predictive model showed a lower performance result, obtaining an MAPE score of 1.647.

Finally, the SMOreg regression algorithm with 60 lag was applied to the same sample, information on seasonality was introduced and the prediction was carried out using the same train/test ratio. The measured MAPE was 1.2053.

One of the potential applications of the timeseries prediction is forecasting of future values of an underlying, building the model by using historical data. Here below a chart representing a backtest on S&P Future ES-Mini (Equity Line in dollars, maintenance margin 11’200$).

Giugno 29, 2023Dicembre 15, 2024

Precise information filtering deriving from laws present in nature

This entry is part 3 of 6 in the series Machine Learning

When the probability of measuring a particular value of some quantity varies inversely as a power of that value, the quantity is said to follow a power law, also known variously as Zipf’s law or the Pareto distribution (1). Looking at the figures, it’s really astonishing how most of the information is carried by such a low pecentage of words in a text.

Having in mind other objectives, C. Shannon dedicated most of his life and research to pieces of work that became fundamental for the Information Theory, and the final results allow to understand what is the acceptable loss of information in order to transfer a sequence of signals in an efficient way.

We have transposed the two approaches in order to perform a deterministic research in text by filtering the total words in a number of texts of the same kind by identifying the keywords of an ontological map built for the purpose.

We allowed the human to perform reinforcement learning on the submitted text and after a very low number of iterations we got incredible results.

In about 88% of cases it was possible to identify exactly the piece of information we were looking for and in all cases we were able to identify the two or three sentences that contained the information, out of documents exceeding the 130 pages in length.

This approach, being a “weak AI” achievement, exploits the following advantages:

it uses certain, trackable and validated data sources;
it obtains certain results, not probabilistic, not using generative AI;
it relies on the ability of the domain expert for reinforcement learning;
the knowledge base can be kept in-house and will represent a corporate asset.

(to be continued)

By A. Ballarin, G.Fruscio

The Zipf-Mandelbrot-Pareto law is a combination of the Zipf, Pareto, and Mandelbrot distributions. It is a power-law distribution on ranked data that is used to model phenomena where a few events or entities are much more common than others. The probability mass function of the Zipf-Mandelbrot-Pareto law is given by a power-law distribution on ranked data, similar to Zipf’s law. The Zipf-Mandelbrot-Pareto model is often used to model co-authorship popularity, insurance frequency, vocabulary growth, and other phenomena (2,3,4,5)
Applications of the Zipf-Mandelbrot-Pareto law include modeling the frequency of words in a corpus of text data, modeling income or wealth distributions, and modeling insurance frequency. The Zipf-Mandelbrot-Pareto law is also used in the study of vocabulary growth and the relationship between the Heaps and Zipf Laws (2,3,4,5)
Overall, the Zipf-Mandelbrot-Pareto law is a useful tool for modeling phenomena where a few events or entities are much more common than others. It has applications in linguistics, economics, insurance, and other fields (2,3,4,5)

http://www.cs.cornell.edu/courses/cs6241/2019sp/readings/Newman-2005-distributions.pdf
https://eforum.casact.org/article/38501-modeling-insurance-frequency-with-the-zipf-mandelbrot-distribution
https://www.r-bloggers.com/2011/10/the-zipf-and-zipf-mandelbrot-distributions/
https://journalofinequalitiesandapplications.springeropen.com/articles/10.1186/s13660-018-1625-y
https://www.sciencedirect.com/science/article/pii/S0378437122008172

Gennaio 20, 2023Dicembre 15, 2024

Machine Learning – AutoML vs Hyperparameter Tuning

This entry is part 2 of 6 in the series Machine Learning

Starting back from where we left, majority voting (ranking) or averaging (regression) among the predictions combine the independent predictions into a single prediction. In most cases the combination of the single independent predictions doesn’t lead to a better prediction, unless all classifiers are binary and have equal error probability.

The test procedure of the Majority Voting methodology is aimed at defining a selection algorithm for the following steps:

Random sampling of the dataset and use of known algorithms for a first selection of the classifier
Validation of the classifier n times, with n = number of cross-validation folds
Extraction of the best result
Using the n classifiers to evaluate the predictions, apply the selector of the result expressed by the majority of the predictions
Comparison of the output with the real classes
Comparison of the result with each of the scores of the individual classifiers

Once the test has been carried out on a representative sample of datasets (not by number but by nature), the selection of the algorithm will be implemented and will take place in advance at the time of acquisition of the new dataset to be analysed.

First of all, the selection of the most accurate algorithm, in the event that there is a significant discrepancy on the accuracy found on the basis of the data used, certainly leads to the optimal result compared to the use of a majority voting algorithm which selects the result of the prediction in based on the majority vote (or the average in the case of regression).

If, on the other hand, the accuracy of the algorithms used is comparable or the same algorithm is used in parallel with different data sampling, then one can proceed as previously described and briefly reported below:

Random sampling of the dataset and use of known algorithms for a first selection of the classifier
Validation of the classifier n times, with n = number of cross-validation folds
Extraction of the best result
Using the n classifiers to evaluate the predictions, apply the selector of the result expressed by the majority of the predictions
Comparison of the output with the real classes
Comparison of the result with each of the scores of the individual classifiers

Below is the test log.

Target 5 bins

Accuracy for Random Forest is: 0.9447705370061973
— 29.69015884399414 seconds —
Accuracy for Voting is: 0.9451527478227634
— 46.41354298591614 seconds —

Target 10 bins

Accuracy for Random Forest is: 0.8798219989625706
— 31.42287015914917 seconds —
Accuracy for Voting is: 0.8820879630893554
— 58.07574200630188 seconds —

As can be seen, the improvement obtained compared to the models obtained with pure algorithms is between 0.1% and 0.5%.

Auto-ML vs Hyperparameters Tuning

We have used a python library called “Lazy Classifier” which simply takes the dataset and tests it against all the defined algorithms. The result for a 95k rows by 44 columns dataset (normalized log of a backdoor malware vs normal web traffic) is shown in the figure below.

Lazy classifier for Auto-ML algorithm test

It can be noted that picking the best algorithm would result in some cases in an improvement of several percent points (e.g. Logistic Regression vs Random Forest) and, in most cases, in savings in terms of computation time thus consumed energy. For example if we take Random Forest algorithm and we compare the computation with standard parameters and optimization obtained with Grid Search and Random Search, you can notice that the improvement is, again, not worth the effort of repeating the calculation n-times.

**** Random Forest Algorithm standard parameters ****
Accuracy: 0.95
Precision: 1.00
Recall: 0.91
F1 Score: 0.95

**** Grid Search ****
Best parameters: {‘max_depth’: 5, ‘n_estimators’: 100}
Best score: 0.93
**** Random Search ****
Best parameters: {‘n_estimators’: 90, ‘max_depth’: 5}
Best score: 0.93

Marzo 8, 2020Dicembre 15, 2024

Dissertations on Machine Learning, “Stacking” and “Voting”

This entry is part 1 of 6 in the series Machine Learning

There are several tools on the market for data analysis, from the most complex to the simplest to use. The aim of this series of discussions will be to dispel some myths about the real need to use supercomputers to achieve more accurate results than can be done with a 99-1 combination, that is, by leaving 99% of the work to the computer and the 1% to human common sense.

Allow me to indulge my nostalgic weakness and to report short historical notes on Machine Learning. It all began between 1763 and 1812, when Bayes’ theorem was formulated, which first explained how to minimize the classification error by minimizing the probability of occurrence of an event conditioned by a set of attributes.

Then the silence for about a century, up to Markov, that many will know, if only because seen in the design of telecommunications networks and much more, including the analysis of texts and natural language.

Then for another half century another silence; meanwhile the greatest scientist of the era, Alan Turing, was working on the “war”. But at the end of this dark period, he gave life to what will be the foundation of the analysis of the data in modern key: neural networks. In those years, the production of transistors on an industrial scale and the consequent increase in the computational capabilities of computers made it possible to implement most of the algorithms deriving from game theory. It is no coincidence that one of the main pitfalls of computer designers of the last century was to be able to beat the chess world champions. And IBM did it for the first time in 1996.

But let’s get back to us, precisely in the 70s, when the foundations were laid for the backpropagation algorithms that have evolved up to the present day in the Generative Adversarial Networks (GANs). By the way, I strongly advise you to read the contents of the MIT website which explains the basics of what is most advanced today in the field of research: http://introtodeeplearning.com/, especially as regards the generation of the so-called “Deep Fake” . If you want to know, for example, which of these faces is real and which is artificially generated, go to the indicated link and find out.

Finally, the news of the experimentation in the context of the Sync project, which has created the prototype of an elementary hybrid network of three neurons, one biological and two artificial, in the field of Bio Medicine, is just these days.

If you have already done so and you are updated and aware, you will have wondered what the ethical limit is for all these applications. Right now I don’t have an answer, but a suggestion: there is a way to deceive the calculators and it is contained in the link above.

Let’s go back to our goal: is it really useful to apply stacking and voting to our learning algorithm? Before expressing an evaluation and some experimental results, let’s remember the definitions. Stacking is the combination of several models in succession to improve the results of machine learning, while voting and averaging are two ensemble methods: voting is used for classification and the average is used for regression and in practice they select the value. of the target variable (both for classification and for prediction) based on the greater number of votes expressed by the algorithms used.

Attributes: width and length petal and sepal ; 150 instances

A good mathematician would answer this question: “it depends on the input data”, and therefore we follow the same procedure. Before starting, however, let’s summarize what is the problem of classification and, therefore, of prediction. And let’s start from the first and most famous public domain dataset: Iris, dated 1936, which describes a set of Iris flowers of three different species, which therefore have individually different characteristics according to four parameters.

The classification can take place, for example, by defining a “boundary” that can separate the blue samples (the choice of the noun is not accidental), that is iris-silky from iris-versicolor and iris-virginica. In this case a linear border manages to separate the samples quite well and there are only a few “exceptions” or outliers. However, we will always ask ourselves later: is it better to increase the complexity of the model to identify and correct classification errors or do those samples represent exceptions and is it better to focus on why such exceptions occur? If we call these exceptions, or anomalies, for example, with the name “Spam” or “Cyber Attack” we will understand why this is one of the main problems of IT at the moment. If we are good at classifying instances (events), we are also good at predicting the future, or simply at understanding the event better. In the example we have seen, the Adaboost and Random Forest algorithms lead to an accuracy of 97.9%, which means that given an Iris represented only by the four attributes, petal width, petal length, sepal width, sepal length , we can predict with what 97.9% accuracy what type of Iris it is, without seeing it. But without forgetting that that will be a new sample for my dataset (Knowledge Base). Of course we could do the same thing in many different areas, for example trying to predict the quality of a wine based on some attributes such as PH, residual sugar, citric acid, sulphates, etc., without tasting it.

Let’s go back to our models and represent from the side just an example of a stacking model for the evaluation of the classification accuracy starting from data of different nature, both playing with datasets. There are 497 available at this address: https://archive.ics.uci.edu/ml/datasets.php or you can access the European Open Data portal: https://data.europa.eu/euodp/it/ date/

Stacking Model with KNN5, 10 and 15

Zoo animals: dataset size 101 instances, 16 attributes

By injecting different datasets into the model, we find that, in most cases, stacking does not improve the precision of the algorithm used individually, therefore we can say that the goodness of the methodology depends not only on the input data but also on the type of data itself.

With elementary procedure it can be demonstrated, even in the case of voting, that preferring the most accurate model leads to higher accuracy than relying on the results obtained by the majority of the algorithms used, especially for the identification of anomalies.

Another matter, however, the methodologies used in the Convolutional Neural Networks (CNN), for example, for the recognition of images, which break the problem into different sub-problems and cascade different algorithms to these different sub-problems (see 6S191_MIT_DeepLearning_L3) .

I close this article in the hope of not having bored the more experienced, nor of discouraging the new “followers”; however, the areas of expertise embraced are numerous and I am sure that each of you will find a field of application for data analysis. And we will also see some code and algorithms in the next articles.

G. Fruscio