
Published · 11 min read · by Alp

Financial Statement Analysis Software | Part 2

[Thumbnail: Financial Statement Analysis Software | Part 2]

The thumbnail for this post was generated in part using DALL-E.

This is Part 2 of a multi-part series. Find Part 1 here.

I: Let’s get building (for real this time)

Hope you enjoyed Part 1, dear reader. If you didn’t, that’s fine as well, because this time we do something different. As promised in the first section of the previous part, this will not be a step-by-step guide; it’ll barely be a guide at all. By the way, I suggest you read that part if you haven’t. It’s a good way to decide whether this series is for you.

As for what we’re actually building, a short-ish description is available here.

Alright, with all that out of the way, let’s take a fresh look at our requirements, which I previously defined as:

  • Get financial report data. Yahoo Finance is preferred at first.
  • Read the data points in the spreadsheet from the financial report and add them to the output.
  • Calculate the metrics in the spreadsheet from the data points and add them to the output.
  • Calculate the OK/NOK based on the metrics.
  • Have it be easy to run without much setup. Or provide a guide or setup script.
  • Have it be easy to add or change data points, metrics and OK/NOK rules.
  • Print out in a way that is as close as possible to the sample spreadsheet.
  • Make it possible to take in a list of companies and execute the analysis on all of them.

These are not great requirements, but once again, that is the whole point. They aren’t concise, lean, or quantitative. They don’t follow any philosophy or framework for “requirements engineering”. The tools you use depend on the context. In this specific case, we’re talking about a developer (yours truly) who is prone to analysis paralysis, a project with zero expectations, and a goal of just delivering a rough POC. In this context, even writing down some requirements brings us enormous benefit.

Now, looking at the requirements, we can see that there are three main types of actions to perform on our data. We need to read a list of data points from the financial statements, calculate some metrics based on those data points, and finally calculate an OK/NOK value based on the metrics. It’s also very clear that we should build software that is easy to set up, use, and modify.
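To fix that shape in our heads before moving on, here is a minimal skeleton of the three stages. The function names are hypothetical placeholders, not code from the actual project:

    # A hypothetical skeleton of the pipeline; names are placeholders.
    def read_data_points(ticker):
        """Stage 1: read the configured data points from the financial statements."""
        ...

    def calculate_metrics(data_points):
        """Stage 2: derive the spreadsheet metrics from the raw data points."""
        ...

    def evaluate_ok_nok(metrics):
        """Stage 3: turn the metrics into OK/NOK verdicts."""
        ...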


Intermezzo I: POCs

The eagle-eyed reader might notice that I pretended not to see some of the requirements when implementing the first iteration of this project. That is because we are developing a POC, a Proof of Concept. Neither a Realisation of Concept, nor an Excellent Rendition of Concept. The whole point is to prove that a concept is possible at all and/or to visualise how the concept might come to be.

Our goal should not be to complete the project at all. Our goal for the POC should be to create something that has the rough silhouette of the end product we’re going for. One should act fast, cut corners, reduce feature sets and ignore some requirements when doing this.

I am so insistent on this point because I’ve been in countless projects where the POC stage is simply skipped, and either individual developers or entire teams sprint straight towards an MVP, or even (cover your ears, kids) towards a finished product. Can you guess what happens when one does that?

Of course you can!

Anybody can, if they only stop to think about it for a second. The answer is (drumroll): you throw away a good chunk of what you built. A company I know of is on its second from-scratch rewrite of its core product. Their devs (a couple of whom I call friends) are apparently already discussing whether or not this third iteration is salvageable. My view of their predicament is that it is largely caused by trying to jump from nothing to the end product, with no steps in between and no designs whatsoever. There are obviously other issues that I see there, but that’s a topic for another post.

To conclude: The. Whole. Point. Of. A. POC. Is. To. Prove. That. The. Concept. Is. Possible. Don’t. Do. More.

Intermezzo I over.


II: Setup and reading data

II-A: Setting up our DataFrames

For now, we’ll skip the output formatting and the multi-company execution functions and just focus on implementing the three core stages for one company.

We discussed the practicality of starting with larger building blocks and moving onto smaller ones in a previous intermezzo. If you read that entire thing, congratulations, and thank you! It was hard to articulate, which means it probably was even harder to read through.

With the aforementioned building philosophy in mind, let’s take a pyramidal granite slab of a building block:

    import yfinance as yf

    ticker = yf.Ticker("AMZN")

Let’s not forget to kiss yf on the forehead with tender affection for doing the hard work for us. Do not skip this step! 🫵

Why the Amazon financials? No real reason, really. Starts with A, is an irrevocable part of our cultural zeitgeist, etc. Would you look at the time! It’s footnote¹ o’clock!

Going back to the implementation, we’re able to read the following three financial statements from the object we just created. Each of them comes back as a plain pandas DataFrame, by the way. Nice.

    ticker.financials     # Gives us the income statement
    ticker.balance_sheet  # Gives us the balance sheet
    ticker.cash_flow      # Gives us the cash flow statement

Now, though it’s slightly inconvenient that the three sources are separate DataFrames, that isn’t really a big issue. My first instinct was to iterate over these, old-school style:

    dataframes = [
        ticker.financials,
        ticker.balance_sheet,
        ticker.cash_flow,
    ]
    for df in dataframes:
        # ...

I did go about the rest of the project like this for a while, before realising it is not a good approach for handling the keys to the data points that we want to read, and the metrics that follow from them. So, take two:

    financial_statements = [
        ["Income Statement", ticker.financials],
        ["Balance Sheet", ticker.balance_sheet],
        ["Cash Flow Statement", ticker.cash_flow],
    ]
    for k, df in financial_statements:
        # ...

Why no dict? First of all, call it a map like god intended. Second of all, it doesn’t matter! Such a freeing thought, right? You can create an array of keys, iterate over them, and read from your various dicts. You can create a dict of statements, get its keys, then iterate over that. You’ll be reading the same bytes in either case. Knock yourself out, do what you like. Lately my preference skews towards iterators, so that’ll be my poison this evening.
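For completeness, the dict flavour I just described might look something like this (a sketch, equivalent in spirit to the list of pairs above):

    # The dict variant: same bytes, different container.
    statements = {
        "Income Statement": ticker.financials,
        "Balance Sheet": ticker.balance_sheet,
        "Cash Flow Statement": ticker.cash_flow,
    }
    for name in statements.keys():
        df = statements[name]
        # ... read whatever you need from df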

Is the code understandable? I would say yes. Is this approach structurally inferior in its time complexity? No, and it also doesn’t matter: with n equal to 3, who cares about time complexity?

II-B: Reading the data

Now to read the data we’re actually interested in. We have a list of keys at hand, but we also want to make the keys configurable, right? Let’s think about that through the lens of the decisions we’ve made so far: how should we store and use these keys, and how difficult would it be to make them configurable? The answer to the latter is “not at all”, especially because our client is capable of opening up a Python file and editing it (which is exceedingly rare in my experience). Still, let’s make their life a little bit easier by isolating the configurable parts.

    # In a new Python file/script for config purposes:
    financial_data_keys = {
        "Income Statement": [
            "Gross Profit",
            # ...
        ],
        "Balance Sheet": [
            "Current Assets",
            # ...
        ],
        "Cash Flow Statement": [
            "Capital Expenditure",
            # ...
        ],
    }  # The arrays are truncated for your viewing pleasure

Then in our main file, we would simply import this variable. Nothing fancy needed.
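Something like the line below, assuming the config lives in a file called config.py (the module name is whatever you fancy):

    # Hypothetical: the config module's name is up to you.
    from config import financial_data_keys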

It’s that simple. No special environment variable parsing process needed. Now the only missing piece is a way to iterate over our types of financial statement and read all data points that we are interested in from the dataframe that corresponds to said financial statement. We’d want to do something like:

    # Read the paragraph after this block if you intend to use this code <3
    for k, df in financial_statements:
        for data_key in financial_data_keys[k]:
            print(f"Printing {data_key}: {df[data_key]}")
            # Or in reality, add it to the output data structure of your choice.

Turns out you can’t, because the first dimension (the columns) of these DataFrames holds dates rather than the names of the line items, so the for loops should look something like:

    year_range = 3
    y = 2025

    for k, df in financial_statements:
        date_keys = [f"{y-o}-12-31" for o in range(year_range)]
        for date_key in date_keys:
            for data_key in financial_data_keys[k]:
                print(f"Printing {data_key} on {date_key}: {df[date_key][data_key]}")

Let’s review what we’ve done so far. We thought for a bit about how to approach the first iteration of the project and went over how I first went about reading out the data points using yfinance. I’ve preserved what I now consider mistakes of my first implementation, because I only caught them myself when evaluating my work. The intention was to go through the first implementation of all three parts, then take you through my first round of self-review.

However, we have already approached roughly ten minutes of reading time, and there is no way we can go through all of those parts without at least doubling the post length. So let’s change the plan and focus on the refactors I ended up doing on the parts we’ve gone over so far.

III: Refactoring

Now, I want to clarify that this is not meant to glorify the level of evaluation I typically do, or to endorse it. The review actually came after implementing everything, not just after finishing this sub-module (the easiest of them all, at that).

As a matter of fact, I think it’s best to come back to code you’ve written after you’ve forgotten the thought process you had while developing it. It’s even better if you actually forget the code, and I’m getting really adept at emptying my mind (this is only half a joke; I would genuinely suggest developing this skill). The approach can be quite disorienting if you leave work half-done, though, so my personal preference is to first build a substantial part of something, then take a break long enough to empty my mind, then come back and review.

Coming back to the review, the first thing I realised was that I had ignored some of my own advice and jumped right in without a refresher on how one can interact with a DataFrame. So let’s look at our code through that lens.

Observe:

    # Old:
    financial_statements = [
        ["Income Statement", ticker.financials],
        # ...
    ]
    for k, df in financial_statements:
        # ...

    # New:
    import pandas as pd

    data_frames = [
        ticker.financials,
        ticker.balance_sheet,
        ticker.cash_flow,
    ]
    df = pd.concat(data_frames)  # pd being pandas of course
    """
    No more for loop needed to iterate over DataFrames!
    We can also get rid of the terrible keyed way we had to store everything else.
    For instance:
    """
    financial_data_keys = [
        "Gross Profit",
        "Current Assets",
        "Capital Expenditure",
        # ...
    ]  # This has a much simpler shape now!

The mighty concatenation makes our code much simpler, and removes a whole (impractical) dimension from our data throughout the project. That felt extremely good.
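To make the payoff concrete, reading a data point from the concatenated frame is now a single lookup. A sketch; note that .loc returns more than one row if a line item happens to appear in multiple statements:

    # One flat frame, one lookup per data point.
    for data_key in financial_data_keys:
        print(f"{data_key}:")
        print(df.loc[data_key])  # a Series of values indexed by date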

Another thing to catch was the deeply ugly and fragile way in which we iterated over dates. Anyone not running a line fever would easily see that we can iterate over the keys of the DataFrame instead. Let’s also use another feature of DataFrames to map the long date keys down to just the years. This way we simplify our outputs and avoid potential cases where parts of the dates other than the year might mismatch. Though such a mismatch is unexpected, it is best to expect the unexpected in these matters.

For the renaming part:

    data_frames = [
        ticker.financials.rename(mapper=lambda timestamp: timestamp.year, axis="columns"),
        ticker.balance_sheet.rename(mapper=lambda timestamp: timestamp.year, axis="columns"),
        ticker.cash_flow.rename(mapper=lambda timestamp: timestamp.year, axis="columns"),
    ]
    df = pd.concat(data_frames)

This way, the keys become plain integers representing the year instead of complex Timestamp objects. This simplification was in part triggered by a further development that you will have to discover by reading the next part. As I mentioned before, these changes came as part of a review that originally happened later than the narrative order would suggest. Anyway, let’s also look at the non-deranged way to iterate over the dates:

    # Old 🤮
    year_range = 3
    y = 2025
    date_keys = [f"{y-o}-12-31" for o in range(year_range)]
    for date_key in date_keys:
        # ...

    # New 😎
    for date_key in df.keys():
        # ...

🤦‍♂️ Muuuuuch better.
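For the road, here is the whole reading stage stitched together after the refactors. Treat it as a sketch under the same assumptions as above: yfinance and pandas installed, and the flat financial_data_keys list imported from our hypothetical config module:

    import pandas as pd
    import yfinance as yf

    from config import financial_data_keys  # hypothetical config module from earlier

    ticker = yf.Ticker("AMZN")
    data_frames = [
        ticker.financials.rename(mapper=lambda ts: ts.year, axis="columns"),
        ticker.balance_sheet.rename(mapper=lambda ts: ts.year, axis="columns"),
        ticker.cash_flow.rename(mapper=lambda ts: ts.year, axis="columns"),
    ]
    df = pd.concat(data_frames)

    for date_key in df.keys():  # plain integer years, e.g. 2024, 2023, ...
        for data_key in financial_data_keys:
            # May print several values if a line item appears in multiple statements.
            print(f"{data_key} in {date_key}: {df[date_key][data_key]}")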

I believe that concludes the meaningful refactors that happened after the first round of evaluations.

IV: Next time on FSA

📣American announcer voice📣

Next time ooon: Building our Financial Statement Analysis software! The tortured author builds a way for metrics to be calculated in a nice and easily editable way… The evaluation calculations watch from the shadows…

A personal evaluation follows! Will our author survive with their mental faculties intact!?!

Tune in next time to 🇺🇸🦅💥🔥FSA🔥💥🦅🇺🇸!!!

Take care!

V: Footnotes

1: Recommendation

This is an experience that will bend your understanding of the medium; I recommend this masterpiece. I am still incapable of defining what you will see in any real depth; think of this as a piece of art. Also my thanks to Tom Scott—who will undoubtedly never read these lines—for the recommendation.