Build vs. Buy? Test Data Doesn’t Have to be Another In-House Project

Guest post: the following article was written by Quentin Hartman, a technology enthusiast, leader, and long-time user of Tonic.

Discovering Tonic.ai

I’ve been in the tech industry for about 25 years, and in that time I’ve done everything from pulling cable through steam tunnels to consulting for Fortune 500 companies on the future of work. I’ve built private cloud clusters and managed power and cooling for a university data center. The last decade, though, has been dedicated to using DevOps to make software development work better. Getting good data into the hands of developers has always been a struggle. About three years ago, though, I discovered Tonic.ai, and that changed.


I was first introduced to Tonic.ai by the data team at a former employer. They were using the platform to mask and anonymize data before feeding it into analytics, to make sure they were compliant with relevant privacy laws. I was intrigued; I had never seen anything like it before. Instead of a framework for creating yet-another-project, here was a complete platform for transforming production data into useful, fake data. It immediately became apparent to me that it could do a lot to help with SDLC-related tasks and make developers more productive.

Since we already had licenses, I re-tasked it to create production-quality data sets for developers to use on their local machines, so their testing would be more representative of what we could expect to see in production. I was shocked that I could get useful results out of it less than an hour after first touching it. Because I already knew our schema, I could use the tool right away. The short path to value was second to none.

Test Data is a Universal Need

When I moved on to another employer, I immediately brought it in there too, where it once again proved its worth for providing developers with safe test data. Even though our needs were more complex, we were able to leverage the platform’s native post-job scripts to add known test users and organization accounts to the transformed data. This allowed developers and testers to keep using familiar logins even though the data set originated from production.
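
To make that concrete, here’s a minimal sketch of the kind of post-job seeding script I mean, assuming a Postgres target. The table and column names are illustrative assumptions, not our actual schema:

```python
# Hypothetical post-job seeding script: re-insert well-known test accounts
# into a freshly de-identified copy of production. Table and column names
# are illustrative assumptions, not a real schema.
import psycopg2

SEED_ORG_ID = "00000000-0000-0000-0000-000000000001"
SEED_USERS = [
    ("qa-admin@example.com", "QA Admin"),
    ("qa-tester@example.com", "QA Tester"),
]

conn = psycopg2.connect("dbname=masked_copy user=etl host=localhost")
with conn, conn.cursor() as cur:
    # Recreate the well-known test organization.
    cur.execute(
        "INSERT INTO organizations (id, name) VALUES (%s, %s) "
        "ON CONFLICT (id) DO NOTHING",
        (SEED_ORG_ID, "QA Test Org"),
    )
    # Add the familiar logins so developers and testers keep their usual accounts.
    for email, display_name in SEED_USERS:
        cur.execute(
            "INSERT INTO users (email, display_name, org_id) VALUES (%s, %s, %s) "
            "ON CONFLICT (email) DO NOTHING",
            (email, display_name, SEED_ORG_ID),
        )
conn.close()
```

The point is less the specific SQL than the pattern: let the platform do the masking, then layer a small, deterministic seeding step on top.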

Now I’m doing independent consulting work; part hired gun, part trusted advisor. One of my clients is a national wireless construction management company that is venturing into the software space. They have data for tens of thousands of projects in their system, and once again, here is the need to share production-like data with developers safely. I immediately thought of Tonic. Price is always a consideration, though, and it’s been a few years, so I decided to survey the open-source landscape to see whether anything had appeared that could compete with Tonic’s feature set. The last thing we need is another DIY project, but I felt some due diligence was in order.

Revisiting Alternatives

I spent the better part of two days researching and testing various tools, most of which were nowhere near what I was hoping for. Lots of libraries, lots of add-ons for other tools, lots of things crafted for very niche situations. The most promising thing I found was ARX. It was the only one that felt like a real stand-alone tool. It was easy enough to install; however, it refused to connect to my Postgres data source. More research! I finally got it going, but I had to upgrade the version of Postgres on the test server I was using to make it happy. There may have been a better solution, but this was quick-and-dirty research time, and I was focused on getting just enough information to make a call quickly.

Finally the tool was working, and what had appeared to be a very discoverable UI turned out to be pretty opaque. I imagine it would feel familiar to a data scientist, but that’s not me! After poking around for a bit, it became clear that I would be in for days more research and learning before I could get results out of this tool. It’s obviously a capable engine for automating data anonymization for research purposes, but it’s not purpose-built for what I need.

Coming Back to Tonic.ai

It’s at this point that I just turn back to Tonic. I create a fresh replica of our production database and make it reachable from Tonic’s IPs. While that’s spinning up, I sign up for a demo account, and within moments of it being available, I’m able to connect Tonic to my copy of prod, and Tonic is showing me the schema with suggestions about what should be anonymized. I spend about thirty minutes reviewing the data and making decisions about how I want the data changed. Yup, those are names, replace them. Yeah, those are phone numbers, replace them too. So on through the tables. Some processing time later, I have another new DB, but all the PII has been replaced with simulated data. It looks real, it feels real, but nobody’s privacy is going to be put at risk by giving this data to developers. In another 30 minutes, I’ve made a snapshot and handed it off to the developers to put to use. Total time to value was about an hour, and much of that was just waiting for data to move.
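
For a rough sense of what those thirty minutes of decisions amount to, here’s a hand-rolled illustration using Python’s Faker library against a hypothetical customers table. To be clear, this is not how Tonic works internally; it’s the manual version of the column-level replacement the platform automates, and it skips everything hard, like keeping values consistent across tables:

```python
# Illustration only: manual column-level PII replacement with Faker,
# against a hypothetical "customers" table. This is the tedious version
# of what the platform automates; cross-table consistency is ignored here.
import psycopg2
from faker import Faker

fake = Faker()
conn = psycopg2.connect("dbname=masked_copy user=etl host=localhost")
with conn, conn.cursor() as cur:
    cur.execute("SELECT id FROM customers")
    rows = cur.fetchall()
    for (row_id,) in rows:
        # Those are names: replace them. Those are phone numbers: replace them too.
        cur.execute(
            "UPDATE customers SET full_name = %s, phone = %s WHERE id = %s",
            (fake.name(), fake.phone_number(), row_id),
        )
conn.close()
```

Multiply that by every table, every PII column, and every foreign-key relationship, and the value of having a platform make those suggestions for you becomes obvious.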

The developers were thrilled to have better data to work with. Almost immediately they started asking questions about how I created the data set for them. While we were discussing Tonic, one of them mentioned pynonymizer as something we should look at long-term since “it would be cheaper than paying for Tonic.” Surprised that I hadn’t seen it in my previous research, I took a peek at it later, and once again, it’s clearly a powerful tool. It’s also clear that learning to use it would take longer than getting results with Tonic. It doesn’t look hard, but I don’t want to learn how to write a strategy file, and I don’t want to manage the infrastructure we would need to run it as a service. I could see it being a great fit in another situation, but not for us. We need results, not another project, and Tonic gives us results. Further, I just have to shake my head at the cost argument. If even a tiny fraction of a developer’s time had to go towards managing pynonymizer, the cost of that time would more than pay for Tonic.
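
To put a rough number on that, here’s the back-of-envelope math, with purely illustrative figures; neither the salary nor the time fraction comes from any real engagement:

```python
# Back-of-envelope: the hidden cost of "free" tooling.
# Both inputs are illustrative assumptions, not real figures.
fully_loaded_cost = 180_000   # assumed annual cost of one developer, USD
maintenance_share = 0.05      # assume 5% of their time maintains the tool

hidden_annual_cost = fully_loaded_cost * maintenance_share
print(f"Hidden annual cost of the 'free' tool: ${hidden_annual_cost:,.0f}")
# -> Hidden annual cost of the 'free' tool: $9,000
```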

There’s Still Nothing Like Tonic.ai

Until I discovered Tonic, these sorts of data issues were usually deemed “unsolvable” because of the level of effort solving them would take. Many orgs I came in contact with would just quietly let production data float around and hope they didn’t get caught. Others would settle for testing with bad data, only to inevitably create a fire drill in production because of some failure state that would have been easy to catch with better test data.

Three years on, Tonic is still the only solution I’ve found that doesn’t create yet another project. All told, I spent nearly three working days researching an alternative only to end up back where I started. Had I stuck with Tonic from the jump, I could have been done and on to something else in an afternoon. That’s a short path to value, and it’s one I’m going to be sticking with.

*** This is a Security Bloggers Network syndicated blog from Expert Insights on Synthetic Data from the Tonic.ai Blog authored by Expert Insights on Synthetic Data from the Tonic.ai Blog. Read the original post at: https://www.tonic.ai/blog/build-vs-buy-test-data-doesnt-have-to-be-another-in-house-project

