DataPro is a weekly, expert-curated newsletter trusted by 120k+ global data professionals. Built by data practitioners, it blends first-hand industry experience with practical insights and peer-driven learning. Make sure to subscribe here so you never miss a key update in the data world.

Introduction

As demand for multilingual AI grows, fine-tuning large language models (LLMs) for under-resourced languages has become a strategic priority. Google’s Gemma competition on Kaggle showcased how open-source LLMs can be adapted to new languages using synthetic data, supervised fine-tuning (SFT), Direct Preference Optimization (DPO), and memory-efficient training techniques. At the same time, the LLM Prompt Recovery challenge pushed the boundaries of prompt engineering, asking competitors to reverse-engineer hidden instructions from model outputs. Together, these competitions reveal a powerful truth: modern LLM performance depends less on scale alone and more on data strategy, alignment methods, and metric-aware optimization.

Unlocking global communication with Gemma: fine-tuning LLMs for new languages

One of the landmark generative AI competitions on Kaggle was Google’s Unlocking Global Communication with Gemma, an analytics competition with a $150,000 prize pool. Announced in late 2024, this competition invited participants to fine-tune Google’s Gemma 2 LLM for a specific language or cultural domain. The backdrop here is that Gemma is a family of open-source language models (built with the same underlying tech as Google’s Gemini models) that Google released to foster a community-driven ecosystem of language-specific models. By the time of the competition, a “Gemmaverse” of developers had already begun adapting Gemma to languages ranging from Arabic to Zulu. The competition’s goal was to accelerate this trend: each team would pick one of many under-represented languages (or a unique cultural niche of language use) and fine-tune Gemma 2 to excel at it.
Competitors documented their approach in Kaggle notebooks, demonstrating improvements in areas like the model’s language fluency, its ability to handle literary traditions or historical texts of that language, and other culturally relevant capabilities.

Competition format and data

As an analytics competition, this wasn’t about submitting a model for automated scoring on a hidden test set. Instead, participants shared notebooks with their fine-tuned Gemma variants and qualitative/quantitative evaluations. Judges (including the Gemma model developers) assessed the submissions on criteria such as innovation, performance improvements, and the insightfulness of the approach. Google provided baseline resources: the base Gemma 2 model (with variants of 2B parameters that could be fine-tuned on Kaggle’s GPUs) and a list of ~70 eligible languages that were considered “under-resourced” in the LLM context. Participants often had to assemble their own fine-tuning datasets for the chosen language, drawing from public text sources or creating synthetic data, since by definition, many of these languages had limited ready-made datasets.

Top solutions overview

The winners of this competition delivered some impressive and instructive solutions. Many leveraged synthetic data generation to overcome data scarcity, creating large corpora of question-answer pairs or translated sentences using existing LLMs. For example, one of the winning teams focused on Italian (a moderately resourced language, but they aimed to push Gemma’s abilities in Italian to a new level). Their approach, as described by team member Stefano Fiorucci, was a “cheap recipe” that combined multiple techniques: synthetic data generation (with LLM-as-a-judge), supervised fine-tuning (SFT), Direct Preference Optimization, and efficient training with a method called Spectrum.
In simpler terms, they first used a large model to generate Italian text data (and employed an LLM to judge and filter the quality of this synthetic data), then fine-tuned Gemma on this data (SFT). Next, they applied Direct Preference Optimization (DPO), a technique related to RLHF, to further align the model’s outputs with human preferences by using a smaller reward model and optimizing the LLM against it. Finally, they used Spectrum, a memory-efficient fine-tuning strategy that selects which parts of the model to train, allowing them to fine-tune the 2B-parameter model on Kaggle’s limited hardware. This multifaceted pipeline yielded an improved Gemma-It model that performed better on Italian language tasks than the original Gemma. You can find the Kaggle notebook for this solution here: https://www.kaggle.com/code/anakin87/post-training-gemma-for-italian-and-beyond.

Another top team, headed by Justin Yang (which placed 2nd in the competition), tackled the task for Traditional Chinese and produced a notable open-source project called Kyara. In their write-up, the team explains that they generated over one million synthetic QA pairs for fine-tuning, plus an additional 150k prompts for preference optimization (DPO). This massive dataset was created by a method they dub “Retrieve, Rewrite, and Reformulate,” which uses a retrieval-augmented approach to generate question-answer pairs covering a wide range of knowledge. They also translated and paraphrased existing datasets from other languages to enrich the Chinese training data. By the end, they had essentially built a Chinese-centric instruction-following model on top of Gemma. The results, according to their report, showed strong performance compared to the original Gemma models; in other words, their fine-tuned Kyara model could understand and generate Chinese with greater accuracy and cultural relevance than baseline Gemma.
Impressively, they released the Kyara model and the huge dataset on Hugging Face for the community, contributing back to the open-source ecosystem. You can find the Kyara dataset here: https://huggingface.co/datasets/zake7749/kyara-zhsample-1M. Justin’s Kaggle solution notebook can be found here: https://www.kaggle.com/code/zake7749/kyara-retrieval-augmentation-for-llm-fine-tuning.

Fine-tuning Gemma in practice: example

How does one actually fine-tune and use an LLM like Gemma in a Kaggle notebook? A common workflow is:

1. Prepare a fine-tuning dataset of prompts and outputs.
2. Train or fine-tune the model on this data (often using Hugging Face’s Transformers library or Google’s JAX/TPU tools if provided).
3. After fine-tuning, load the new model and test it on some examples.

Many top teams used PyTorch with Hugging Face Transformers for this process, as it’s a straightforward way to implement custom training loops or use Trainer APIs.

Key techniques used in fine-tuning

The top solutions to the Gemma competition highlighted a few recurring themes that are instructive for any generative AI project on Kaggle:

Synthetic data generation with LLMs: When real training data is scarce, use a larger LLM (or multiple LLMs) to generate additional data. For example, you might prompt GPT-4 to produce Q&A pairs in the target language, or to translate and rephrase English text into the target language. One team mentioned using an “LLM-as-a-judge” approach: they generated candidate outputs and then used another model (or heuristic) to judge which outputs were high-quality, ensuring that the fine-tuning data was clean and relevant.

Supervised fine-tuning (SFT): This is the standard next step: take the base model and fine-tune it on the supervised dataset (prompt → ideal response pairs). This aligns the model with the task. In practice, teams often had to train for multiple epochs and monitor an evaluation set (if available) to avoid overfitting.
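As a minimal sketch of step 1, the (prompt, ideal response) pairs can be serialized into the model’s chat format before training. The turn markers below follow Gemma’s documented chat template; the tiny Italian pairs are invented purely for illustration:

```python
def format_example(prompt: str, response: str) -> str:
    """Serialize one (prompt, ideal response) pair using Gemma's turn markers."""
    return (
        "<start_of_turn>user\n" + prompt + "<end_of_turn>\n"
        "<start_of_turn>model\n" + response + "<end_of_turn>\n"
    )

# Build a toy SFT corpus from (prompt, response) pairs.
pairs = [
    ("Translate to Italian: Good morning", "Buongiorno"),
    ("Translate to Italian: Thank you", "Grazie"),
]
corpus = [format_example(p, r) for p, r in pairs]
print(corpus[0])
```

A tokenizer’s `apply_chat_template` method (in Hugging Face Transformers) produces the same kind of string automatically, but writing it out once makes clear what the model actually sees during SFT.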
Libraries like Hugging Face Transformers make this easier via the Trainer class or PEFT’s LoRA for parameter-efficient fine-tuning. Interestingly, however, one team reported that they attempted full LLM fine-tuning with LoRA but abandoned it due to poor validation correlation, indicating that not all fine-tuning attempts guarantee success, especially if the evaluation metrics are tricky (more on this in the Prompt Recovery section).

Direct Preference Optimization (DPO) or RLHF: Several top entrants went beyond SFT and performed a second stage of tuning to better align the model’s outputs with human preferences. DPO is an approach where, instead of doing full reinforcement learning (which can be complex to implement), one fine-tunes the model to maximize a reward score (a proxy for human preference) in a simpler, more direct way. To do this, you typically need a reward model; sometimes participants trained a smaller model to act as a judge between outputs, or they reused an existing one. By optimizing Gemma with DPO, teams were essentially making the model’s tone and style more user-friendly and its answers more “helpful” or correct where possible. Because Gemma is open, participants could experiment with these cutting-edge alignment techniques right on Kaggle.

Efficient training and deployment: Handling a multi-billion-parameter model within Kaggle’s constraints (limited GPU RAM and runtime) is non-trivial. Teams used tricks like 4-bit quantization of model weights to reduce memory usage, gradient checkpointing to trade compute for memory, and transferring parts of the training to faster hardware off-platform when allowed. The Italian team’s use of Spectrum and the general use of bitsandbytes (for 4-bit quantization) are examples of how winners squeezed more performance out of the available resources.
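To see why 4-bit quantization matters so much on a Kaggle GPU, a back-of-the-envelope calculation helps. The sketch below counts only the memory needed to hold the weights; gradients, optimizer state, and activations add substantially more on top:

```python
def model_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory (in GB) needed just to store the model weights."""
    return n_params * bits_per_param / 8 / 1e9

# A 2B-parameter model: weights alone in fp16 vs. 4-bit quantization.
fp16_gb = model_memory_gb(2e9, 16)   # 4.0 GB
int4_gb = model_memory_gb(2e9, 4)    # 1.0 GB
print(fp16_gb, int4_gb)
```

Quantizing the weights to 4 bits frees roughly three quarters of that budget, which is exactly the headroom needed for gradients and optimizer state when fine-tuning within a 16 GB GPU.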
In practice, Kaggle kernels with A100 GPUs (if provided) can handle fine-tuning a 2B model, but doing RLHF or DPO on top might require careful memory management. Participants often had to innovate on the engineering side as much as on the modeling side.

By the end of the competition, Google and the community had gained a suite of fine-tuned Gemma models in many languages. This showcased a path towards truly multilingual AI that is not dominated by only high-resource languages. The winning notebooks illustrated how thoughtful data curation and novel training strategies can localize a large model to perform impressively well on niche languages or dialects. For example, one project fine-tuned Gemma for a dialect from Korea’s Jeju Island to help preserve that dialect, something far outside the reach of most commercial models. The competition underscored two practical insights:

• LLMs can be fine-tuned with surprisingly good results using public tools and a bit of creativity.
• Even without low-level model architecture changes, one can achieve significant performance boosts through data and training strategy alone.

As Google’s official blog post put it, this collaborative effort helps “build a future where AI transcends language barriers,” and it exemplifies how Kaggle participants are contributing to the frontier of generative AI.

While the Gemma competition demonstrated the power of fine-tuning open-source LLMs for multilingual and culturally aware performance, the next challenge shifted the focus inward, toward understanding the behavior of LLMs themselves. Instead of asking what these models can produce, the LLM Prompt Recovery competition asked: can we reverse-engineer the very instructions that shaped those outputs?
Let’s now explore how this innovative Kaggle competition reframed the prompt as the central object of prediction.

LLM prompt recovery

Imagine you have an original piece of text, and then a second piece of text that is a transformed version of the first: maybe the second text is a summary of the first, or the first text translated into Shakespearean English, or perhaps the first text with all numerical information removed. In modern NLP, such transformations can be done by prompting an LLM. For instance, you might prompt an LLM: Translate the following passage to French or Rewrite this paragraph in a polite tone. The LLM takes the original and produces the transformed version.

The LLM Prompt Recovery competition asked the reverse: given the original text and the transformed text, recover the prompt that was used to generate the transformation. In other words, participants had to figure out what instruction an LLM had been given. This challenge was a Kaggle Featured Competition held in early 2024, with a hefty $200,000 prize pool and over 2,000 teams participating. It was one of the first competitions to explicitly focus on prompt engineering and LLM behavior as the core problem. Let us go into the details of this competition in the following sections.

Competition overview

The task in LLM Prompt Recovery can be viewed as an inverse problem. Normally, we craft prompts to get outputs from an LLM. Here, we have the outputs (and the inputs) and must deduce the prompt. For example, if the original text is a verbose passage and the transformed text is a short bulleted list, the hidden prompt might have been Summarize the following text in bullet points. If the transformed text is in Spanish, the prompt likely was Translate to Spanish. If the transformed text is a playful version of the original, maybe Rewrite this text in a lighthearted, humorous tone.
The challenge was that participants were not told what kinds of transformations existed; they had to infer patterns from data.

Examining the data

The competition dataset consisted of many pairs of texts, (original, transformed), and the goal was to predict the exact text of the prompt that was used to go from original to transformed. Importantly, the prompt was applied using a specific LLM (likely a Gemma model or similar) in a controlled environment. The competition description hinted that Google’s Gemma models were used to generate the transformed text given the prompt, so the transformations were realistic and fluent, not rule-based modifications. Participants were given some training examples with known prompts to learn from, and a test set where they had to predict prompts.

One complication: prompts could be phrased in multiple ways (e.g., “Translate to French” vs. “Translate this text into French language”). To evaluate predictions objectively, Kaggle defined a special metric called LLM Nerd-Off Sharpened Cosine Similarity. Despite the whimsical name, this metric was essentially a measure of semantic similarity between the predicted prompt and the true prompt, using embeddings. Likely, both prompts were embedded with a language model and their cosine similarity computed, then raised to a power or scaled (“sharpened”) to emphasize differences. In short, participants didn’t need to match the prompt exactly word for word; they needed to capture the same meaning. A submission earned a high score if its prompts were semantically very close to the actual prompts used in generation.

This evaluation method meant that the task was about capturing the essence of the prompt. If the true prompt was Summarize the article briefly, and a prediction was Give a short summary of the above passage, that should score very high (as it’s semantically equivalent).
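The exact formula behind the metric was not published, but a plausible form, assuming prompts are first embedded and the cosine similarity is then raised to a power to “sharpen” it, can be sketched as follows. The toy vectors here stand in for real sentence embeddings:

```python
import math

def cosine(u, v):
    """Plain cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def sharpened_cosine(u, v, power: int = 3) -> float:
    """Cosine similarity raised to a power: near-identical prompts keep a
    score close to 1, while moderate similarities are pushed down harder."""
    return cosine(u, v) ** power

# Toy embeddings: two paraphrases of a summarization prompt vs. an
# unrelated translation prompt (a real system would use a sentence encoder).
summarize_a = [0.9, 0.1, 0.0]
summarize_b = [0.8, 0.2, 0.1]
translate = [0.0, 0.1, 0.9]

print(sharpened_cosine(summarize_a, summarize_b))  # close to 1
print(sharpened_cosine(summarize_a, translate))    # near 0
```

The sharpening exponent is the key design choice: it widens the gap between “roughly related” and “actually equivalent” predictions, which is exactly what a prompt-recovery metric needs.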
If a prediction missed the mark (e.g., Translate to French when the prompt was actually Simplify the language), the cosine similarity would be low.

This competition encouraged participants to think about how different instructions manifest in text. They had to build a system that reads an original and its transformed version, and then outputs a plausible instruction. It’s a bit like playing detective with an AI’s behavior.

Evaluating the challenges of the competition

There are potentially hundreds of possible prompt types. Without additional structure, this is a challenging NLP problem: essentially natural language understanding combined with some creativity. A straightforward approach would be to fine-tune a sequence-to-sequence model that takes the original and transformed text as input and tries to output the prompt, but doing that naïvely might be tough with limited training data. Also, some prompt types might be extremely rare or ambiguous.

To succeed, competitors incorporated external knowledge and clever strategies. They suspected that certain transformation categories existed (such as translation, summarization, tone change, and information extraction). Likely, the competition forum discussions (and perhaps an initial “analysis” notebook by organizers) indicated the scope of tasks. Participants, therefore, had to leverage multiple techniques:

NLP heuristics: For example, if the transformed text is much shorter than the original and covers similar content, it’s probably a summarization prompt. If the transformed text is in parentheses or brackets, it’s possible that the prompt asked for something to be extracted.

Embedding-based similarity: You can embed the original and transformed text to examine the differences. For instance, if the embedding of the transformed text is close to that of a known French translation of the original, that signals a translation prompt.
Kagglers might cluster or classify pairs using embeddings to identify prompt categories.

LLMs themselves: Ironically, one could use an LLM to solve this LLM problem. For example, one could feed the original and transformed text into GPT-4 with a prompt like: Given the above original text and its modified version, what instruction was likely given to the model? This might give a very good answer much of the time.

Fine-tuning custom models: Many top teams fine-tuned their own smaller LLMs or sequence models on the training data, along with a substantial amount of synthetic data. They simulated the process: take a large amount of text, apply various known prompts using an API (such as OpenAI or a local model) to generate transformed text, and thereby create a large dataset of (original, prompt, transformed) text. This synthetic corpus could then be used to train a model for prompt inversion.

The creativity factor was high. In fact, the highest-performing strategies did some non-intuitive things, as we’ll see with the highlighted solution. Khoi (https://www.kaggle.com/suicaokhoailang) exploited the metric by formulating prompts that maximized similarity without necessarily being exact matches (an “adversarial” approach to the scoring metric). Khoi observed that predicting just the first half of the true prompt often yields a higher similarity score than the full prompt. This clever hack suggests that he might generate prompts that cover the key words and phrasing common to many actual prompts, thereby ensuring high embedding overlap. In Khoi’s words, “I think the special token pulls a sentence to some focal point in the embedding space, maybe the center of it. If the sentences are far enough from each other, it’s more likely that the new distance (from point B to point A </s>) is shorter than the old one (base of the triangle).
But if the two sentences are already very close to each other, this can hurt performance.”

Figure 13.2: Visualization of various distances between prompts and sentences

Khoi used a mix of Mistral 7B and Gemma-7b-1.1-it, trained on different datasets. He concatenated the predictions of both models, and the final mixture helped him achieve a winning score. However, this approach, while maximizing the score, might produce prompts that sound incomplete or odd to a human.

Third-place solution (team prompt = “don’t say anything”)

The third-place team tackled the prompt recovery task with a hybrid strategy combining a robust “mean prompt” baseline with several fine-tuned models (https://www.kaggle.com/competitions/llm-prompt-recovery/discussion/494621). In essence, they constructed a fixed template prompt (a kind of universal prompt) and then augmented it with dynamic content predicted by models. By blending this optimized static prompt with learned predictions, the team could closely mimic the style and content of the true prompts. All of this was done in a neutral, third-person manner in their final assembled prompts, as required by the competition’s rules. The solution can be viewed as a hybrid ensemble of a prompt template and model predictions.
It consists of five main components working in concert:

Mean prompt template: A fixed template string (discovered through optimization) that serves as a baseline prompt, into which model-generated phrases are inserted.

Full prompt prediction model: A fine-tuned language model (based on Mistral-7B) that attempts to predict the full rewrite prompt for a given sample.

Gate classifier: A filtering model that checks whether the predicted prompt from the previous component is credible (i.e., consistent with the given original and rewritten text) and should be used.

Tags prediction model: Another language model that predicts auxiliary tags (keywords describing style or instructions) relevant to the rewrite prompt.

Clustering mechanism: A strategy that groups test samples into clusters of similar examples and selects the most suitable prompt template for each cluster.

In the final inference pipeline, these components interact as follows. For a given test sample, the cluster assignment is first determined based on the sample’s characteristics (more on clustering below). A corresponding prompt template is chosen for that cluster, or a global default template is used if the cluster is uncertain. Then the full prompt model generates a candidate rewrite prompt for the sample, and the tags model produces supplementary keywords (such as a target style or tone). The gate model evaluates the candidate prompt against the original and rewritten text; if the prompt seems clearly incorrect (irrelevant or inconsistent), the system discards it. If it passes the gate, the final submission prompt is constructed by starting with the mean prompt template and inserting the additional predicted words (the tags and/or the full prompt prediction) at a specific position within the template. Notably, they only insert words that are not already present in the template, to avoid redundancy.
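The template-plus-insertion step can be sketched in a few lines. The insertion position is a tunable parameter, and the snippet below is a toy: the shortened template and predicted words are invented, and only the skip-duplicates logic mirrors the description above:

```python
def assemble_prompt(template: str, predicted_words: list[str], position: int) -> str:
    """Insert model-predicted words into a fixed template at a given word
    position, skipping any word the template already contains."""
    template_words = template.split()
    seen = {w.lower() for w in template_words}
    novel = [w for w in predicted_words if w.lower() not in seen]
    return " ".join(template_words[:position] + novel + template_words[position:])

# Toy example: "text" is dropped because the template already contains it.
final = assemble_prompt(
    "improve phrasing text rewrite this creatively",
    ["summarize", "text", "formal"],
    position=3,
)
print(final)  # improve phrasing text summarize formal rewrite this creatively
```

Deduplicating before insertion matters here: repeating a word the template already carries wastes limited "novel token" budget without moving the prediction’s embedding any closer to the true prompt.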
Through experimentation, the team found that the optimal insertion point was after the third word of the mean prompt template (this placement yielded the best validation performance). The result is a single coherent prompt string that includes the general template phrasing plus any unique details predicted by the models. This approach proved very effective: using the first four components (template and models, without clustering) already achieved an SCS of approximately 0.71 on validation. The clustering step added a small further boost (roughly +0.005), not enough to reach 0.72, but every bit helped on the leaderboard. The final solution was the mean prompt + tags + full prompt (if it passes the gate). The schematic of the solution can be found below:

Figure 13.3: Third-place solution schematic

Below, we take a closer look at each component and how it was developed and tuned.

Origins of the mean prompt

Long before any model was trained, the team explored whether a single fixed instruction could, on average, resemble many hidden prompts. They began by producing tens of thousands of candidate instructions using strong external LLMs, such as GPT-3.5 and Gemini. Each candidate was applied to an LLM-generated passage to fabricate a plausible rewritten_text instance. Because the public leaderboard exposed similarity scores for half the test set, the team could estimate how well any individual instruction aligned with real hidden prompts. By iteratively sampling subsets of their synthetic data whose score distribution mirrored that of the public leaderboard, they obtained a surrogate development set. Then, they executed a token-level beam search across a few thousand word candidates.
The search objective was simply to maximize average SCS on that surrogate set. The resulting sentence, obscure, repetitive, and studded with the Romanian word lucrarea, would look nonsensical to a human reader, yet it aligned strikingly well with the vectors used in the competition metric. Including lucrarea several times exploited an accidental quirk of the embedding space: this seemingly random token sat near the centroid of many true prompts, so repeating it within the instruction nudged predictions closer in cosine space. With no modeling at all, the “mean prompt” alone achieved a score of 0.70 SCS and provided a strong backbone for later components. Here is the final mean prompt sentence:

improve phrasing text {here we format full prompt prediction + tags} lucrarea tone lucrarea rewrite this creatively formalize discours involving lucrarea anyone emulate lucrarea description send casual perspective information alter it lucrarea ss plotline speaker recommend doing if elegy tone lucrarea more com n paraphrase ss forward this st text redesign poem above etc possible llm clear lucrarea

Building a full-prompt predictor

Although the universal instruction covered broad stylistic ground, it could not capture sample-specific details such as “translate to French” or “summarize in bullet points.” For that, the team fine-tuned Mistral-7B Instruct v0.2 under Low-Rank Adaptation (LoRA). They discovered that careful dataset design, rather than exotic training techniques, drove performance. The data pipeline ran backward from the goal. First, external LLMs generated diverse rewrite instructions, each enriched with stylistic hints so that a future model could infer the prompt’s intended effect. Next, the same LLMs paraphrased every instruction several times, yielding clusters of semantically identical yet lexically distinct prompts.
For each prompt variant, the team asked an LLM to draft an original_text whose content harmonized with the instruction, ensuring, for instance, that a “formalize this casual email” prompt was indeed paired with a casual email. To create the target rewritten_text, they let a quantized Gemma-2B model, accelerated with Unsloth, apply the prompt to the original passage. The outcome was a synthetic triple: original, prompt, rewritten.

Because some prompt families outnumbered others, the team embedded every prompt with a Sentence-T5 model, clustered the embeddings via HDBSCAN, and subsampled each cluster to an equal size. This balanced curriculum prevented the model from memorizing only the most frequent rewrite patterns. After stripping boilerplate phrases such as “Sure, here is your text,” the dataset contained hundreds of thousands of clean examples. Fine-tuned on this mixture, the Mistral LoRA model achieved approximately 0.62 SCS when asked, zero-shot, to guess the hidden prompt from a genuine competition pair. That raw score was modest but crucial: when its predictions were appended to the mean template, overall similarity jumped markedly.

Safeguarding with a gate classifier

Even a well-trained predictor occasionally hallucinated. A single rogue instruction (“Translate to Klingon,” say) could drag the composite prompt far from the ground truth and erase gains accrued by the mean template. To defend against such outliers, the team equipped another Mistral instance with a sequence-classification head. Its input concatenated the original passage, a candidate prompt, and the rewritten passage, framed inside a short meta-instruction asking whether the rewriting matched the candidate prompt. Positive examples comprised 40 percent of the classifier’s training: genuine triples drawn from the synthetic corpus. Another 20 percent were easy negatives, formed by pairing a random prompt with an unrelated rewritten text.
The remaining 40 percent were hard negatives: prompts retrieved from the nearest neighbors of the correct instruction in embedding space, and therefore superficially similar yet semantically misaligned. Through this curriculum, the gate learned nuanced distinctions. During inference, it scored each predicted prompt; only those above a conservative threshold graduated to the final composition, while the rest were discarded, leaving the mean template unperturbed.

Isolating attributes with a tag model

Certain facets of a rewrite, such as musicality, rhyme, or an imperative to summarize, proved hard for the full-prompt model to express reliably. The team addressed this by training a second causal-LM head, again atop Mistral-7B, whose sole aim was to emit a comma-separated list of keywords describing the transformation. The tag model digested the same original-rewritten pair and produced elements such as “summarize,” “formal tone,” “poem,” or “third-person perspective.” After inference, the system filtered out any tag already present in either the universal template or the accepted full-prompt prediction, then spliced the remaining tokens into the template at a fixed slot, immediately after the third word, a position found to be optimal via a validation sweep. The tag model’s brevity reduced its error surface, allowing it to contribute correct micro-details that nudged SCS upwards.

Riding the cluster spectrum

The final refinement sought a middle road between a single template and one-prompt-per-sample prediction. By embedding ground-truth prompts with Sentence-T5 and applying K-Means, the team observed 12 coherent clusters in their local validation set. For each cluster, they re-ran the beam-search optimization to craft a specialized mean prompt.
In validation, the hypothetical score obtained by always selecting the ideal cluster template approached 0.76 SCS, far exceeding any published leaderboard result and revealing untapped headroom. Realizing that perfect cluster identification was impossible, the team trained yet another multiclass Mistral classifier to label unseen pairs with cluster IDs. They reinforced that decision with an independent heuristic: the predicted prompt from the full-prompt model was embedded and assigned to a cluster using the earlier K-Means centroids. Only when both the classifier and the heuristic agreed did the pipeline adopt the cluster-specific template; otherwise, it reverted to the global mean prompt. This cautious protocol trimmed risk yet still delivered roughly five thousandths of a point on cross-validation, enough to matter in a contest decided at the third decimal place.

End-to-end inference flow

At test time, each sample passed through a deterministic cascade. Sentence-T5 embeddings were computed to map the pair into a provisional cluster. In parallel, the full-prompt Mistral proposed an instruction, the tag model suggested keywords, and the gate classifier vetted the instruction’s plausibility. The system next chose a base template: cluster-specific when confidently classified, otherwise the global version. Into this template, it stitched, in order, unique tokens from the vetted full prompt and the filtered tag list. Because the universal instruction already contained many high-yield generic phrases, additive tokens were often scarce, resulting in concise yet customized prompts. The composite string then stood as the team’s prediction.

Quantitative outcome and qualitative lessons

The five-stage assembly scored just above 0.71 SCS on the hidden private set, securing third place. Though marginally shy of the winning mark, it demonstrated several principles valuable beyond this singular task.
First, metric-aware prompt engineering, especially exploiting embedding artifacts such as lucrarea, can convert a daunting search space into a hill-climb around a sturdy baseline. Second, synthetic data, if diverse and cluster-balanced, can teach an LLM to infer latent instructions even when no labelled ground truth exists. Third, small specialist heads (the gate and tag models) can prune and polish the output of a larger generator, providing robustness and incremental gains. Finally, intermediate granularity, here a dozen clusters, yields a pragmatic compromise between universality and per-sample overfitting.

All code, from data generation through LoRA fine-tuning to inference, relied heavily on open-source scaffolding. Yet the decisive advantage arose less from tooling than from a mindset of relentless empirical probing. The team iterated through thousands of candidate tokens, balanced dozens of clusters, and rejected any enhancement that could not demonstrate reproducible uplift on a public subset that faithfully mirrored the test distribution. By combining a rule-based backbone with discriminative and generative neural modules, they reconstructed hidden prompts with a degree of fidelity previously thought unattainable in a zero-label setting.

Although the competition’s idiosyncratic metric may never reappear, the architecture of this solution (static template plus gated generative refinement, augmented by attribute tagging and distribution-aware clustering) offers a portable recipe for tasks where the goal is to reverse-engineer or approximate opaque human instructions.
In that sense, the third-place team delivered not merely a leaderboard score but a blueprint for marrying prompt engineering and fine-tuned language models under extreme supervision scarcity.

Having explored the complexities of recovering prompts from transformed text using LLMs, we now turn to a different but equally innovative challenge: designing AI assistants that proactively help users with data tasks.

Conclusion

The Gemma multilingual fine-tuning competition and the LLM Prompt Recovery challenge highlight the future of generative AI:

• Synthetic data can overcome language scarcity.
• Alignment techniques like SFT and DPO significantly improve output quality.
• Understanding evaluation metrics is critical for competitive performance.
• Open-source LLMs enable scalable, culturally aware AI development.

For practitioners building multilingual systems or refining prompt strategies, these approaches provide a practical blueprint for engineering high-performance, resource-efficient language models.

This article is an excerpt from The Kaggle Book, Second Edition by Luca Massaron, Bojan Tunguz, and Konrad Banachewicz.

Author Bio

Luca Massaron is a data scientist with over a decade of experience in transforming data into high-impact, innovative artifacts, solving real-world problems, and generating value for businesses and stakeholders. He is the author of numerous bestselling books on AI, machine learning, and algorithms. Luca is also a 3x Kaggle Grandmaster who reached number 7 in the worldwide user rankings for his performance in data science competitions. Additionally, he is recognized as a Google Developer Expert (GDE) in AI, Kaggle, and the cloud.

Bojan Tunguz is the founder and CEO of TabulAI, a start-up focused on applying machine learning and AI to structured-data problems. Before founding TabulAI, he worked at three other machine learning start-ups and most recently at NVIDIA.
He holds a PhD in theoretical physics from the University of Illinois and has taught as a professor at three liberal arts colleges.

Konrad Banachewicz holds a PhD in statistics from Vrije Universiteit Amsterdam. His academic work focused on extreme dependency modeling in credit risk. In addition to his research activities, he was a tutor and supervised master’s students. He transitioned from classical statistics to data mining and machine learning before “data science” became a buzzword. Over the next decade, he tackled quantitative analysis problems in various financial institutions, becoming an expert in the full life cycle of a data product. His work spanned high-frequency trading to credit risk, predicting potato prices, and analyzing anomalies in the performance of large-scale industrial equipment. He is a believer in knowledge sharing and also competes on Kaggle.