I keep four chickens that keep the slope on the side of our property under control. They do the weeding, I do the feeding, and they produce eggs. Clearly there are a few tools I need to clean their chicken house, and fences that need maintaining so they don’t escape and foxes don’t break in.
I tore my work gloves the other day, and just the day before, I broke the rake. Both incidents were a combination of sheer stupidity and learning on the fly what I am doing and how to do it most efficiently. The total expense of getting new gloves and a new rake will be some 35 EUR. Relative to the total cost of keeping the chickens, building the fencing and all, not very much. These chickens will never see the day they break even – but that was never the point of my keeping chickens. It’s fun, it’s an activity off screen, and I love animals.
It would be different, though, if I were a farmer and the scale of my chicken keeping were different – to the point where I hope to make a living off of them. With more chickens and animals, the required gear becomes bigger and more expensive. A similar mess-up with a new tool, or just a mistake from not being cautious enough, could cause some serious financial damage – possibly risking a month’s or even a quarter’s profit.
So there’s an angle of being able to use the equipment right and learning what to do and what not to do with it. Ideally, the tools should allow for some error and forgive smaller beginner mistakes without longer-term damage. There’s also a safety aspect: not putting the chickens in danger.
In the software world, we’re talking about Total Cost of Ownership (TCO) for a product or feature – and that the product should guide and encourage correct use, and prevent accidental misuse from causing unexpected additional cost. This is the operational part of the TCO, which some would file under user and admin support cost: how much the company has to spend to prevent, correct, and recover from things that happened because product guidance or guard rails were missing. This doesn’t always have to be a big disaster that requires remediating downtime or restoring gigabytes of data. It could mean a helpdesk person coming in and correcting records, or second- and third-level support or an engineer bringing back a configuration that other admins or end users accidentally brought into a faulty state – using your product. That may not cost the company a lot – but if it occurs a few times per month and requires someone to spend 30 minutes fixing it, the cost adds up. I would be pissed if I had to get a new rake every two months; even though it’s not a big expense, it’s still annoying and destroys my TCO for the chickens.
One of the classic ways of trying to prevent misuse and bad things from happening is creating clear documentation – also possibly the one everybody hides behind in case things go wrong. I certainly haven’t read a manual for my rake, nor will I read the care recommendations for my new gloves. Sorry.
Focusing on in-product guard rails, there are a number of UX tricks you can pull to make sure user choices are double-checked, documentation is cross-linked, and you only allow input within safe boundaries. Depending on the product and the greater UX plan you have, you may even be able to provide longer explanation texts alongside your functionality that people may read.
I want to dive into a slightly more specific scenario though: imagine you’re building a cleanup function for administrators or engineers that, based on a review conducted by end users or the business, removes stale records, such as accounts or files. So the system’s very purpose is, after an admin has configured it, to let the users who know best decide whether or not to keep individual assets – or to revoke access to them. The function is, by its sheer purpose, there to get rid of things. How do you prevent too many things, or the wrong things, from being deleted or removed? There may be a fine line between removing what was intended and removing a few things too many. Even if 90% of the time things go right, what about the 10% of the time when someone changes their mind, needs things back, or clicked the wrong button and had things removed? Or worse yet – admins see across multiple cleanups in their campaign that a lot more was removed than anticipated – how can they walk that back?
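To make this a bit more tangible for the rest of the post, here is a minimal sketch in Python of what a single review decision in such a campaign could look like as a record. All names and fields here are made up for illustration, not taken from any particular product.

```python
# A minimal sketch of one review decision in a cleanup campaign.
# All identifiers are hypothetical and only illustrate the scenario.
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum


class Decision(Enum):
    KEEP = "keep"
    REMOVE = "remove"


@dataclass
class CleanupDecision:
    campaign_id: str      # the cleanup campaign the admin configured
    asset_id: str         # account, file, or other asset under review
    decision: Decision    # what the reviewer chose
    decided_by: str       # the end user or business reviewer
    decided_at: datetime  # when the decision was made


decision = CleanupDecision(
    campaign_id="q3-stale-accounts",
    asset_id="account-4711",
    decision=Decision.REMOVE,
    decided_by="reviewer@example.com",
    decided_at=datetime.now(timezone.utc),
)
```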
There are a number of preventive, product-assisted guard rails that we could put in place here. Their effect and usefulness depend on the exact type of product you’re building. In this example, one of the first things that needs establishing, if not there already, is transparency about what’s happening at every step of the way – especially if there are multiple actors that rely on one another. Give admins transparency as to how effective the cleanup is, and whether there’s a potential problem with the results that are trickling in over time. It may also give end users insight into why they’re asked to look at the assets or a user’s access, and how aggressively they’re supposed to decide in the cleanup. It may also be important for the business to know whether the cleanup is just a dry run where admins are only collecting results, or whether users’ actions result in deletion of stuff – immediately. Also: who is reviewing, and how many people, other than me, are involved?
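As a sketch of the kind of transparency I mean, imagine a campaign status object that an admin dashboard could render: is this a dry run, how many reviewers are involved, how far along are they, how much is marked for removal. The fields are assumptions about what such a product might track, not a prescribed shape.

```python
# A sketch of the transparency an admin dashboard could surface for a
# cleanup campaign. Field names are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class CampaignStatus:
    campaign_id: str
    dry_run: bool                     # collect decisions only, or actually delete?
    reviewers: int                    # how many people are involved in the review
    assets_total: int
    assets_reviewed: int
    assets_marked_for_removal: int

    def summary(self) -> str:
        mode = ("DRY RUN, nothing will be deleted" if self.dry_run
                else "LIVE, decisions lead to deletion")
        progress = (self.assets_reviewed / self.assets_total
                    if self.assets_total else 0.0)
        return (f"[{self.campaign_id}] {mode}; "
                f"{progress:.0%} reviewed by {self.reviewers} reviewer(s), "
                f"{self.assets_marked_for_removal} marked for removal")


print(CampaignStatus("q3-stale-accounts", True, 3, 1200, 480, 75).summary())
```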
The visibility part also extends to metadata that can be relayed to users or the business about the assets or users they’re cleaning up: when was the file last used? What type of asset are we reviewing? Does it have a classification, according to corporate classification? Can it be easily restored? If we’re cleaning up user accounts: when were they last used? What do they have access to? What type of accounts are they? The more contextually relevant data we give the person making the configuration change or cleanup decision, the further we can bring the risk down.
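A rough sketch of what that context could look like next to each asset in the review, again with purely hypothetical fields:

```python
# A sketch of the contextual metadata a reviewer could be shown next to
# each asset, so the keep-or-remove decision is an informed one.
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class AssetContext:
    asset_id: str
    asset_type: str                  # "file", "account", "share", ...
    last_used: Optional[date]        # None if the system simply doesn't know
    classification: Optional[str]    # e.g. a corporate label like "confidential"
    easily_restorable: bool          # can we get it back without much effort?

    def review_hint(self) -> str:
        last = self.last_used.isoformat() if self.last_used else "unknown"
        label = self.classification or "unclassified"
        restore = "restorable" if self.easily_restorable else "hard to restore"
        return f"{self.asset_type} {self.asset_id}: last used {last}, {label}, {restore}"


print(AssetContext("file-0815", "file", date(2021, 3, 2), "internal", True).review_hint())
```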
Depending on the criticality of the cleanup, once someone has made decisions about what to keep and what to get rid of, don’t apply the changes immediately. Let the system observe a grace period before actions are taken. Use that grace period to notify the relevant actors, the users themselves, the administrators, possibly auditors, that the cleanup will be enforced, and let them know how many results it will affect. This gives additional visibility and a few more hours or days of time to recover or revoke decisions.
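Sketched very roughly, the grace period could be nothing more than scheduling enforcement for a later point in time and notifying everyone involved. The print statement below stands in for whatever notification channel the product actually has; everything else is an assumption for illustration.

```python
# A sketch of a grace period: decisions are not enforced immediately but
# scheduled for later, and everyone involved gets a heads-up first.
from datetime import datetime, timedelta, timezone


def schedule_enforcement(campaign_id: str, removals: int,
                         grace_period: timedelta,
                         recipients: list[str]) -> datetime:
    enforce_at = datetime.now(timezone.utc) + grace_period
    for recipient in recipients:
        print(f"notify {recipient}: campaign '{campaign_id}' will remove "
              f"{removals} asset(s) at {enforce_at.isoformat()}")
    return enforce_at


schedule_enforcement("q3-stale-accounts", 75, timedelta(days=3),
                     ["admin@example.com", "auditor@example.com"])
```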
Until the grace period has run out, allow someone to come in and prevent any destructive actions or access revocations from happening. Treat this as a “Cancel” button that keeps the system from deleting assets or accounts in the first place. This is different from a restore capability, because you’re not carrying out the deletion in the first place, saving someone from having to restore, and a few people from downtime. This doesn’t replace proper restoration functionality, but merely complements it.
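A minimal sketch of that cancel path, with an in-memory flag assumed purely for illustration: the point is simply that enforcement checks the flag and becomes a no-op, instead of deleting now and restoring later.

```python
# A sketch of a "Cancel" guard: while the grace period is running, an
# admin can flag the campaign as canceled and enforcement does nothing.
canceled_campaigns: set[str] = set()


def cancel_campaign(campaign_id: str) -> None:
    canceled_campaigns.add(campaign_id)


def enforce_campaign(campaign_id: str, asset_ids: list[str]) -> int:
    if campaign_id in canceled_campaigns:
        print(f"campaign '{campaign_id}' was canceled, deleting nothing")
        return 0
    for asset_id in asset_ids:
        print(f"deleting {asset_id}")  # placeholder for the real deletion
    return len(asset_ids)


cancel_campaign("q3-stale-accounts")
enforce_campaign("q3-stale-accounts", ["account-4711", "file-0815"])
```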
Needless to say, your audit log should include and detail all the changes your system performs on other systems, with references to the cleanup action and who initiated it: everything necessary to trace destructive actions back to who started them and how they came about.
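A sketch of what one such audit entry could carry, with field names that are assumptions rather than a prescribed schema:

```python
# A sketch of an audit entry that ties each destructive change back to
# the cleanup campaign, the reviewer who decided, and the initiating admin.
import json
from datetime import datetime, timezone


def audit_entry(campaign_id: str, asset_id: str, action: str,
                decided_by: str, initiated_by: str) -> str:
    return json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "campaign_id": campaign_id,     # which cleanup this belongs to
        "asset_id": asset_id,           # what was changed in the target system
        "action": action,               # e.g. "delete", "revoke_access"
        "decided_by": decided_by,       # reviewer who made the call
        "initiated_by": initiated_by,   # admin who configured the campaign
    })


print(audit_entry("q3-stale-accounts", "account-4711", "delete",
                  "reviewer@example.com", "admin@example.com"))
```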
For more advanced products, you could imagine the system learning over time and taking some decisions for itself, preventing human error as far as possible; or decreasing the damage surface by not letting humans clean up all assets, but only the ones the system isn’t sure about.
Ideally, as soon as an irregular number of assets is deleted, or marked for deletion, the system could notify auditors or the admin who configured the cleanup. These should be flexible boundaries, relative to what is supposed to be cleaned up and what historically was cleaned up. There could even be a configurable level of confidence the system should have before it blocks destructive actions from happening – or, put differently, how liberal it should be with irregular amounts of deletions.
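As a sketch of such a flexible boundary, the check could be as simple as comparing the planned removals against what past campaigns typically removed, with a configurable tolerance. The factor used below is an arbitrary example value, not a recommendation.

```python
# A sketch of a flexible boundary: flag the campaign when it is about to
# delete noticeably more than past campaigns did.
from statistics import mean


def is_irregular(planned_removals: int, historical_removals: list[int],
                 tolerance: float = 1.5) -> bool:
    if not historical_removals:
        return False  # nothing to compare against yet
    return planned_removals > mean(historical_removals) * tolerance


if is_irregular(planned_removals=400, historical_removals=[80, 95, 70, 110]):
    print("irregular amount of deletions, notifying admin and auditors")
```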
If you’re running a cloud product, you may even benefit from building up patterns of what good and bad cleanups look like for your product for any given customer. Based on whether some of the results were revoked later, or how often and when admins came in and canceled cleanup activities before they started, there are a lot of learnings that could be applied to later reviews – raising flags early on, or prolonging the grace period before actions are applied.
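One very crude sketch of what such a learning could mean, assuming we only track how often a given customer canceled or revoked earlier cleanups; the thresholds and durations are invented for illustration, a real product would want something far more nuanced.

```python
# A sketch: customers with a history of second thoughts get a longer
# grace period for their next campaign. Values are made-up examples.
from datetime import timedelta


def grace_period_for(canceled_or_revoked: int, total_past_campaigns: int) -> timedelta:
    if total_past_campaigns == 0:
        return timedelta(days=3)      # default for a customer with no history
    rate = canceled_or_revoked / total_past_campaigns
    if rate > 0.25:                   # a lot of cleanups were walked back
        return timedelta(days=7)
    return timedelta(days=3)


print(grace_period_for(canceled_or_revoked=3, total_past_campaigns=8))
```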
More intelligence in finding and surfacing the right metadata for assets, accounts and files could let the system notify all relevant actors if, suddenly, an asset or account is used again or still in use – and prevent its deletion until someone overrides that decision. Or ask a third person to confirm the deletion if there’s reason to believe an accident happened. If admins could define characteristics on assets that capture classification, security demands, targeted user audience and other classifiers, the system could work as a scoring system and produce recommendations or automatically decided cleanup steps – making users’ jobs a lot easier. Better yet, the system could go and find assets that may need cleanup and suggest them to administrators, or to the right person in the business to make that decision, saving admin time and manual steps. If only relevant things are cleaned up, reviewed, and worked on by an individual, their attention to detail may be greater than if they repeatedly work on loooong lists of things with seemingly mediocre relevance to them.
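To illustrate the scoring idea, here is a deliberately simple sketch where a few admin-defined characteristics are weighted into a score, and only the unclear middle band is routed to a human. The weights and cut-offs are made up for the example; the takeaway is the shape of the mechanism, not the numbers.

```python
# A sketch of a scoring approach: characteristics are weighted into a
# score, clear cases get an automatic recommendation, unclear ones go to
# a human. Weights and thresholds are invented for illustration.
def cleanup_score(days_since_last_use: int, classification: str,
                  audience_size: int) -> float:
    score = 0.0
    score += min(days_since_last_use / 365, 1.0) * 0.5   # staleness weighs most
    score += 0.3 if classification in ("public", "internal") else 0.0
    score += 0.2 if audience_size <= 2 else 0.0           # hardly anyone uses it
    return score


def recommendation(score: float) -> str:
    if score >= 0.8:
        return "auto-remove"
    if score <= 0.3:
        return "auto-keep"
    return "ask a human"


print(recommendation(cleanup_score(400, "internal", 1)))
```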
Alright, that’s it for now. If only my rake and gloves had been clever enough to tell me when I wasn’t using them correctly. Well, I guess at the price point I am willing to invest, I can’t expect much intelligence in a simple tool in the first place.