Caught in the Data Vortex: My Journey of Falling Out, Then Spiraling Back into the World of Gradient Descent

Andrea Vasco
13 min read · Feb 5, 2024

Disclaimer: The views and opinions expressed in my article are solely my own and do not necessarily reflect the official policy or position of Google or any of its affiliates. The content provided is for informational purposes only and is based on my personal experiences and insights as a Google Cloud Data Engineer. It should not be construed as representing the strategies, plans, or opinions of Google.

Imagine yourself back in your teenage years, those formative times when you began to measure the world against what your parents taught you. Picture that moment when you staunchly believed in something, grasping its importance with youthful certainty. Yet, as you ventured into the world, you found a dissonance between your beliefs and the world’s reception of them. Confused, and perhaps a bit disillusioned because you couldn’t yet fully articulate why, you gently set aside those teachings.

For a time, life carried on, weaving through its natural progress. Then, years later, as experiences accumulated and perspectives broadened, a realization dawned: those early lessons, the ones imparted by your parents, held a deep-seated wisdom. It wasn’t that they were incorrect or irrelevant; it was simply that the world wasn’t ready to align with them, or perhaps, more accurately, you weren’t ready to implement them in the broader context of life.

This journey of coming full circle, of rediscovering and appreciating the knowledge once shelved away, mirrors my own voyage with statistics. As a teenager and a young adult, I was enamored with the clarity and precision that statistics brought to understanding the world. It was a passion nurtured in the academic haven but one that seemed to lose its sheen and applicability as I stepped into the more complex, less predictable real world.

Yet, this story is not just about a passion lost and then found. It’s about the journey of understanding — how, over the years, I came to realize that the lessons in statistical thinking were always valuable. They needed the right moment, the right context, and a more elevated perspective to truly shine. This is a story of falling in love, drifting apart, and then rediscovering the true value of statistical wisdom in the ever-evolving landscape of data and decision-making.

Unconventional Applications of Statistics

In my high school days, I stood at a pivotal crossroads, torn between the allure of writing and the structured world of engineering. Opting for engineering in university, I didn’t leave behind my artistic passions, finding an expressive outlet in music and theater. This unique blend of technical rigor and creative exploration laid the groundwork for my deep fascination with the intricate dance of mathematics and statistics, particularly the captivating concepts of estimation theory and gradient descent. This journey would not only shape my academic pursuits but also carve the path for my professional odyssey.

During my university studies, my fascination with statistics took a rather unconventional turn. Merging my analytical prowess with a dash of youthful audacity, I embarked on a unique experiment: crafting a paper that used Bayesian statistics as a tool for romance. The premise was simple yet bold: to logically demonstrate to a girl why going on a date with me would statistically increase her chances of happiness. This was not just an academic exercise; it was a blend of probability, personal insights, and a good dose of humor. I meticulously analyzed our interactions and preferences, applying Bayesian principles to interpret these data points as convincing arguments for a date.

Unfortunately, the outcome wasn’t as successful as the calculations had suggested. The venture, humorous in hindsight, was more than a romantic misfire. It was a profound learning experience that showcased the real-world application of statistical theories. This episode, while not achieving its immediate romantic objective, served as a crucial step in my journey. It exemplified how statistical knowledge could be applied to everyday life scenarios, no matter how unconventional. The exercise was a testament to my growing belief in the power of data and probability to make sense of the world around us, even in matters of the heart.

Balancing Statistical Dreams with Market Demands

As my university chapter closed, I embarked on a thrilling internship at the European Space Agency, an opportunity that aligned perfectly with my academic pursuits. However, this experience unveiled a stark reality: the limited financial prospects within the field. This realization prompted a significant pivot in my career direction.

I turned to the burgeoning world of startups, joining a venture initiated by one of my former professors. This startup was at the forefront of harnessing the power of statistics to unravel the complexities of IT systems. We developed intricate capacity planning models and forged a pioneering data warehouse tool. This tool wasn’t just a repository of information; it was a dynamic framework designed to capture the nuanced performance and business intelligence of intricate IT networks.

I found myself applying complex mathematical theories to real-world IT challenges. I was no longer just a student of statistics; I was an innovator, translating numbers into actionable insights. This phase marked a significant transition from theoretical understanding to practical application, solidifying my place in the IT world. It was a journey of immense growth, where I not only honed my technical skills but also learned the value of adaptability and innovation in the face of rapidly evolving industry needs.

In the startup’s fast-paced environment, I faced a reality check: my vision for applying advanced statistics in IT clashed with the market’s simpler demands for straightforward reports. This led to introspection and a realization: my methods didn’t always align with industry needs. Embracing flexibility, I learned the value of balancing academic rigor with practical business applications.

Don’t confuse me with facts:
my mind is already made up!

My wandering career path, spanning the UK, Russia, the US, and back, further reinforced this lesson. Interacting with diverse leaders, I recognized that while my passion for detailed statistical analysis was strong, the industry often preferred simpler, direct solutions. This was a humbling, enlightening journey that reshaped my understanding of data’s role in business. It taught me the importance of adapting my statistical expertise to meet varying market demands and the value of aligning my skills with the diverse needs of the global market.

The Epiphany at Google

Joining Google marked an unexpected yet transformative chapter in my career. It was a leap into a realm I had never imagined for myself, driven by the glowing recommendations of friends who had joined Google’s expanding European team (I lived in Boston back then). What I found at Google was a world of innovation and excellence that reignited my passion for data and statistics.

As a ‘Noogler’, I was immersed in an environment brimming with groundbreaking projects and ideas. My curiosity knew no bounds as I delved into the company’s vast resources, uncovering details about projects I had only heard of in passing. This exposure was not just exhilarating; it was profoundly educational, feeding my passion for knowledge and innovation.

During this exploratory phase, I stumbled upon a field that would redefine my professional outlook: Decision Intelligence Engineering. Championed by Cassie Kozyrkov, Google’s first Chief Decision Scientist, this discipline represented a significant leap forward in data utilization, decision-making, and technology. It was a perfect blend of science and philosophy, and it spoke to the core of my professional and personal aspirations.

This discovery wasn’t just a professional revelation; it was a personal reawakening. It bridged the gap between the young, ambitious version of myself and the experienced, seasoned professional I had become. Cassie’s work in Decision Intelligence Engineering liberated me, fostering peace between my past aspirations and present realities. It showed me the limitless possibilities of applying data and statistics in innovative ways.

Decision Intelligence redefined my understanding of how data intersects with decision-making. A rich blend of applied data science, social science, and managerial science, the discipline offers a framework for turning information into actionable insights at any scale, something especially crucial in the AI-driven era.

Unraveling the Nuances of Decision Intelligence

In her approach, Cassie tackles a common yet critical issue in corporate environments: the tendency to use data as a tool for narrative rather than as a basis for decision-making. I had often witnessed this in executive meetings, where data was bent to fit pre-conceived narratives, leaving real data-driven decision-making behind. Cassie’s methodology brought clarity to this foggy landscape, emphasizing the importance of setting decision criteria before delving into the data, thereby ensuring that decisions are genuinely informed by data.

Her philosophy goes beyond the mere numbers and narratives, advocating for a disciplined approach where decision-makers set parameters upfront, letting data guide the final verdict. This approach requires a check on biases and ego at the door, acknowledging human tendencies like cherry-picking data, significance chasing, and p-hacking. These pitfalls were now framed within a more structured and honest approach to data interpretation.

Cassie’s work illuminated the crucial difference between being ‘Data-Inspired’ and truly ‘Data-Driven.’ Data-inspired decisions are those where data is present but doesn’t necessarily drive the decision, often clouded by confirmation bias where we interpret evidence in a way that confirms our pre-existing beliefs. On the other hand, data-driven decisions are those where data plays a central role in influencing the decision, free from preconceived notions.

The essence of Decision Intelligence, as presented by Cassie, lies in its ability to foster an environment where data is not just a bystander but a key player in the decision-making process. It’s about integrating the best practices of data science with an understanding of human behavior and managerial needs, creating a holistic approach to decisions that impact lives, businesses, and broader societal contexts.

This newfound knowledge was not just intellectually stimulating; it was practically liberating. It equipped me with a framework to approach decision-making in a more disciplined, clear, and effective manner. Decision Intelligence Engineering, as championed by Cassie, was the missing link I had been searching for in my journey to make truly informed, data-driven decisions.

Let the Data Speak: Statistical Inference and Hypothesis Testing

Encountering hypothesis testing throughout my university studies, my work, and an Executive MBA, I recognized its parallels with scientific inquiry. Just as scientists establish hypotheses and define criteria before conducting experiments, decision-makers must set clear criteria before analyzing data. This principle is integral to Decision Intelligence Engineering, advocating for a methodical and genuine approach to using data in decision processes.

Hypothesis testing in statistics revolves around a fundamental query: ‘Should I change my current thinking?’ It’s grounded in the concept of the null hypothesis (H0), representing our default assumption or decision in the absence of data, and the alternative hypothesis (H1), which encompasses all other possibilities.

Alpha, p, stdev, rejection areas: the inference gang

This framework prompts crucial considerations about evidence and error tolerance. In statistical terms, these are encapsulated in concepts such as:

  • alpha: The threshold of evidence required to shift from our initial decision; this significance level corresponds, under specific conditions, to 1 minus the confidence level of a confidence interval.
  • Type I Error: The risk of erroneously discarding our initial decision when we actually should have stuck to it; its probability is capped by alpha.
  • Type II Error: The risk of sticking with our initial decision when we actually should have changed it; its probability is known as beta.
  • p-value: The probability of observing your data, or something more extreme, if the null hypothesis were true. A low p-value means the observed data would be unlikely under the null hypothesis, and is therefore evidence against it.

In the realm of Machine Learning, these concepts are also joined by other counterparts:

  • Precision: The proportion of true positives in all positive predictions, reflecting the accuracy of these predictions.
  • Recall (Sensitivity): The ability of the model to identify all actual positives, measuring the effectiveness of the model in detecting the positive class.
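For intuition, both metrics can be computed directly from counts of true positives, false positives, and false negatives. A small sketch with toy labels (not from any real model in the article):

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # accuracy of positive calls
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # coverage of actual positives
    return precision, recall

# Toy predictions from a hypothetical classifier
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
p, r = precision_recall(y_true, y_pred)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.75
```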

Setting these parameters beforehand and letting the data guide the final decision is essential. Amassing sufficient evidence to warrant a change in perspective implies new learning. Conversely, a lack of substantial evidence doesn’t affirm the null hypothesis; it merely indicates no significant learning has occurred.
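As a concrete illustration of fixing the criteria up front, here is a minimal sketch: alpha is chosen before looking at the data, and a one-sided z-test for a proportion then delivers the verdict. The baseline rate, sample size, and conversion count are entirely made-up numbers, not data from the article.

```python
import math

def one_sided_z_test(successes, n, p0):
    """One-sided z-test for a proportion.
    H0: the true rate is p0; H1: it is higher. Returns the p-value."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)  # standard error under H0
    z = (p_hat - p0) / se
    # P(Z >= z) for a standard normal, via the error function
    return 0.5 * (1 - math.erf(z / math.sqrt(2)))

# Decision criteria are fixed BEFORE looking at the data
alpha = 0.05  # our tolerance for a Type I error

# Hypothetical campaign: 120 conversions in 1,000 trials vs. a 10% baseline
p_value = one_sided_z_test(successes=120, n=1000, p0=0.10)

if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: enough evidence to change our mind")
else:
    print(f"p = {p_value:.4f} >= {alpha}: stick with the default decision")
```

Either branch is a legitimate outcome; the discipline lies in not moving the threshold after seeing the p-value.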

This realization was a liberating epiphany, framing a comprehensive approach I had been seeking. A clear path connecting ‘Data-Inspired’ and truly, utterly satisfying (for my inner nerd) ‘Data-Driven’ decisions.

Harnessing Unsupervised Learning and Generative AI

A few weeks ago, I had the opportunity to run an experiment that brilliantly showcased the synergy of unsupervised learning and Generative AI. My objective was to understand if these technologies could be combined to decipher a complex dataset, revealing deeper insights into customer behaviors.

I began with a vast dataset mocking the content of a CRM, rich in details about our customers’ interactions and preferences; here’s a subset of the available data:

Account_Number,Account_Source,Annual_Revenue,State_Code,Country_Code,Channel_Program_Name,Industry,Name,Number_of_employees,Ownership,Rating,Site,Type,Year_Started,Total_Opportunity_Amount,First_Sale_Date,Last_Sale_Date,Average_Sales_Cycle_,Number_of_Renewals,Num_Opportunities_Won,Num_Opportunities_Lost
172,1,7970.48,AK,US,3,Agricultural Chemicals,Parker LLC,87725,1,1,US,3,1904,712311,9/21/2017,7/25/2018,48,3,4,2
304,1,842522.21,AK,US,3,Biotechnology: Biological Products (No Diagnostic Substances),"Crooks, Kilback and Howe",51673,1,3,US,2,1971,813675,04-07-2016,4/23/2018,90,2,2,5
1286,1,71638.3,AK,US,2,Property-Casualty Insurers,Schultz-Kris,69505,4,3,US,2,1922,442033,05-01-2016,10-07-2018,54,5,2,4
2065,1,854969.22,AK,US,2,Biotechnology: Biological Products (No Diagnostic Substances),Kertzmann-Trantow,23500,3,2,US,3,1940,953652,09-05-2016,4/21/2018,58,1,4,4
416,1,198014.41,AL,US,3,Computer Software: Prepackaged Software,"Kutch, Gutkowski and Goodwin",85786,1,1,US,7,1991,99654,09-09-2016,5/31/2018,70,1,3,2
699,1,733417.96,AL,US,3,Accident & Health Insurance,Casper and Sons,742,4,1,US,4,1988,385074,05-07-2017,01-07-2019,57,2,2,2
1042,1,716999.84,AL,US,2,Multi-Sector Companies,VonRueden LLC,96640,1,1,US,7,1940,353508,08-03-2017,09-08-2018,84,1,5,1
1099,1,617124.6,AL,US,3,,Carroll-Greenfelder,63880,1,3,US,1,1936,70350,3/19/2016,2/27/2018,35,3,4,1

To make it even more challenging, I had no metadata, glossary, or column descriptions available at the time. The first step was to employ unsupervised learning algorithms to analyze this data. Unlike supervised learning, unsupervised learning doesn’t rely on predefined labels or categories. Instead, it identifies patterns and relationships within the data, organically grouping customers into distinct clusters based on their behaviors.

CENTROID_ID,Account_Number
2,89
2,2301
2,1625
2,1478
2,1626
1,2044
3,846
3,1014
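The article doesn’t name the clustering tool that produced the assignments above, so here is a deliberately tiny, self-contained k-means sketch to show the mechanics: assign each point to its nearest centroid, move each centroid to the mean of its members, repeat. The two-feature account data and the initial centroids are invented for illustration.

```python
def dist2(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, centroids, iterations=10):
    """A miniature k-means: alternate assignment and centroid-update steps."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: nearest centroid wins
        clusters = [min(range(len(centroids)),
                        key=lambda c: dist2(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, clusters) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members)
                                     for x in zip(*members))
    return clusters, centroids

# Hypothetical accounts as (scaled revenue, scaled headcount) pairs
points = [(0.10, 0.10), (0.12, 0.08), (0.09, 0.11),   # small accounts
          (0.80, 0.80), (0.82, 0.79), (0.78, 0.81)]   # large accounts
clusters, _ = kmeans(points, centroids=[(0.0, 0.0), (1.0, 1.0)])
print(clusters)  # [0, 0, 0, 1, 1, 1]
```

In practice you would scale the features and pick the number of clusters with a criterion such as the elbow method; the interpretation problem discussed next remains either way.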

The real challenge emerged after clustering: interpreting what these clusters represented. The results from unsupervised learning, while statistically significant, were not immediately comprehensible. This is a common hurdle in data science — the findings are often abstract and require further interpretation. Why 3 clusters, and not 5? What does each cluster stand for?

To bridge this gap, we leveraged the capabilities of Generative AI. To comprehend the groups, we first had to grasp the data itself. With no metadata at hand, we turned to Generative AI to dissect and interpret each column for us. Imagine Generative AI as our data detective, meticulously sifting through the dataset, piecing together the story behind each customer group by analyzing the nuances and patterns embedded within:

- Account_Number: Unique identifier for each account.
- Account_Source: Source of the account (1–5).
- Annual_Revenue: Annual revenue of the account in USD.
- State_Code: State code of the account’s location.
- Country_Code: Country code of the account’s location.
- Channel_Program_Name: Name of the channel program associated with the account (2–4).
- Industry: Industry of the account.
- Name: Name of the account.
- Number_of_employees: Number of employees in the account.
- Ownership: Ownership type of the account (1–4).
- Rating: Rating of the account (1–3).
- Site: Site of the account (US).
- Type: Type of the account (1–7).
- Year_Started: Year the account was started.
- Total_Opportunity_Amount: Total opportunity amount for the account in USD.
- First_Sale_Date: Date of the first sale to the account.
- Last_Sale_Date: Date of the last sale to the account.
- Average_Sales_Cycle_: Average sales cycle for the account in days.
- Number_of_Renewals: Number of renewals for the account.
- Num_Opportunities_Won: Number of opportunities won for the account.
- Num_Opportunities_Lost: Number of opportunities lost for the account.
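The post doesn’t show the exact prompt used, but a column-description step like the one above can be scripted along these lines. Here `generate` is a hypothetical stand-in for whatever text-in, text-out model call you have available, not a real API:

```python
def describe_columns(header_row, sample_rows, generate):
    """Ask a generative model to infer what each CSV column means.
    `generate` is any prompt-in, text-out LLM wrapper (hypothetical here)."""
    prompt = (
        "Below is the header and a few rows of a CSV file with no "
        "documentation. For each column, infer a one-line description, "
        "including apparent units or value ranges.\n\n"
        + header_row + "\n"
        + "\n".join(sample_rows)
    )
    return generate(prompt)

# Usage with a stub standing in for a real model call:
header = "Account_Number,Annual_Revenue,State_Code,Number_of_employees"
rows = ["172,7970.48,AK,87725", "304,842522.21,AK,51673"]
print(describe_columns(header, rows, generate=lambda p: "<model output>"))
```

Including a handful of representative rows, rather than the whole table, keeps the prompt small while giving the model enough signal to guess units and ranges.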

Armed with insights into our dataset’s columns from Generative AI, we now delve deeper. We prompt Generative AI to scrutinize a data subset, aiming to decode and describe the essence of each customer group. This step transforms our understanding into targeted action, revealing unique group narratives for strategic engagement.

Group 1: High-performing, large enterprises
- Description: This group consists of large, well-established companies with high annual revenue and a significant number of employees. They are typically leaders in their respective industries and have strong financial performance.
- Annual revenue: $424,713.71 - $768,114.38
- Number of employees: 10,889 - 85,622
- Industry: Major Banks, Integrated oil Companies, Food Chains, Oil & Gas Production, Business Services, Specialty Chemicals

Group 2: Medium-sized, growing companies
- Description: This group includes medium-sized companies that are experiencing growth and have the potential to become larger enterprises. They have moderate annual revenue and a decent number of employees.
- Annual revenue: $89,930.12 - $912,714.78
- Number of employees: 5,119 - 87,430
- Industry: Restaurants, Computer Software: Prepackaged Software, Radio And Television Broadcasting And Communications Equipment, Real Estate Investment Trusts, Auto Parts:O.E.M., Commercial Banks, Medical/Dental Instruments, Finance Companies

Group 3: Small, emerging companies
- Description: This group consists of small, emerging companies that have the potential for growth but are still in their early stages. They have relatively low annual revenue and a limited number of employees.
- Annual revenue: $22,494.57 - $238,982.76
- Number of employees: 15,874 - 46,527
- Industry: EDP Services, Specialty Chemicals, Ordnance And Accessories

Now, the magic happens! We’ll run this analysis multiple times on different groups, looking for variations and insights that emerge over different runs. This is like getting multiple expert opinions, helping us build a comprehensive understanding.

Finally, we bring everything together. We’ll ask our AI to analyze all the collected interpretations and give us the ultimate summary — clear, concise descriptions of what makes each customer group unique.
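The run-several-times-then-reconcile loop described above might look like this in outline. Again, `generate` is a hypothetical wrapper for a model call, and the sampling scheme is an assumption, not the article’s actual pipeline:

```python
import random

def ensemble_descriptions(rows, generate, runs=3, sample_size=50):
    """Describe the clusters over several random subsets of the data,
    then ask the model to reconcile the individual answers."""
    answers = []
    for _ in range(runs):
        subset = random.sample(rows, min(sample_size, len(rows)))
        answers.append(generate(
            "Describe the customer groups in:\n" + "\n".join(subset)))
    # Final pass: consolidate the independent interpretations
    summary_prompt = (
        "Here are several independent interpretations of the same customer "
        "clusters. Produce one consolidated description per group:\n\n"
        + "\n---\n".join(answers)
    )
    return generate(summary_prompt)
```

Sampling different subsets on each run is what makes the "multiple expert opinions" analogy work: descriptions that stay stable across runs are more likely to reflect real structure than prompt noise.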

Customer Group 1
- Description: Small businesses with low annual revenue and a small number of employees.
- Annual Revenue: Low (typically below $500,000)
- Number of Employees: Small (a few hundred to a few thousand)
- Industry: Varies, including integrated oil companies, catalog/specialty distribution, hospital/nursing management, and mining & quarrying.

Customer Group 2
- Description: Mid-sized businesses with moderate annual revenue and a moderate number of employees.
- Annual Revenue: Moderate (between $500,000 and $1 million)
- Number of Employees: Moderate (a few thousand to tens of thousands)
- Industry: Diverse industries, including banks, industrial machinery/components, oil & gas production, and business services.

Customer Group 3
- Description: Large businesses with high annual revenue and a large number of employees.
- Annual Revenue: High (exceeding $1 million)
- Number of Employees: Large (tens of thousands to hundreds of thousands)
- Industry: Industries such as real estate investment trusts, industrial machinery/components, and major banks.

This endeavor was a clear demonstration of the untapped potential in combining different AI methodologies. By fusing the analytical strength of unsupervised learning with the narrative power of Generative AI, we could convert intricate data into meaningful narratives and strategic insights.

You can find a video showcasing the experiment in action here:

A New Understanding of Data and Decision-Making

Reflecting on my experiment at Google, I see it as a prime example of data-driven decision-making at its finest. I configured the unsupervised learning model to align with my desired decision context, effectively setting the stage for the data to lead the way. This approach allowed the data to reveal insights free from my biases, often yielding surprising and enlightening results. Leveraging GenAI, I then translated these complex data findings into understandable narratives, making the results both accessible and actionable.

This process epitomizes the essence of decision intelligence: meticulously framing the decision context within the learning model and then embracing the unbiased insights that emerge, often revealing unexpected truths. The journey of data-driven discovery is an ongoing one, and I am eager to see where it takes us next.

