How-To Tutorials

22 Oct 2024

10 min read

Understanding Data Structures in Swift

This article is an excerpt from the book, The Ultimate iOS Interview Playbook, by Avi Tsadok. The iOS Interview Guide is an essential book for iOS developers who want to maximize their skills and prepare for the competitive world of interviews on their way to getting their dream job. The book covers all the crucial aspects, from writing a resume to reviewing interview questions, and passing the architecture interview successfully.

Introduction

In iOS development, data structures are fundamental tools for managing and organizing data. Whether you are preparing for a technical interview or building robust iOS applications, mastering data structures like arrays, dictionaries, and sets is essential. This tutorial will guide you through the essential data structures in Swift, explaining their importance and providing practical examples of how to use them effectively. By the end of this tutorial, you will have a solid understanding of how to work with these data structures, making your code more efficient, modular, and reusable.

Prerequisites

Before diving into the tutorial, make sure you have the following prerequisites:

Familiarity with Swift Programming Language: A basic understanding of Swift syntax, including variables, functions, and control flow, is essential for this tutorial.
Xcode Installed: Ensure you have Xcode installed on your Mac. You can download it from the Mac App Store if you haven’t done so already.
Basic Understanding of Object-Oriented Programming: Knowing concepts such as classes and objects will help you better understand the examples provided in this tutorial.

Step-by-Step Instructions

1. Learning the Importance of Data Structures

Data structures play a crucial role in iOS development. They allow you to store, manage, and manipulate data efficiently, which is especially important in performance-sensitive applications. Whether you're handling user data, managing app state, or working with APIs, choosing the right data structure can significantly impact your app's performance and scalability.

Swift provides several built-in data structures, including arrays, dictionaries, and sets. Each of these data structures offers unique advantages and is suitable for different use cases. Understanding when and how to use each of them is a key skill for any iOS developer.

2. Working with Arrays

Arrays are one of the most commonly used data structures in Swift. They allow you to store ordered collections of elements, making them ideal for tasks that require sequential access to data.

Declaring and Initializing an Array

To declare and initialize an array in Swift, you can use the following syntax:

    var numbers: [Int] = [1, 2, 3, 4, 5]

This creates an array of integers containing the values 1 through 5. Arrays in Swift are type-safe, meaning you can only store elements of the specified type (in this case, Int).

Removing Duplicates from an Array

A common task in programming is to remove duplicate elements from an array. Swift makes this easy by converting the array into a Set, which automatically removes duplicates, and then converting it back into an array:

    let arrayWithDuplicates = [1, 2, 3, 3, 4, 5, 5]
    let arrayWithNoDuplicates = Array(Set(arrayWithDuplicates))

This approach is concise and relies on the uniqueness guarantee of sets. Because a Set is unordered, the resulting array is not guaranteed to keep the original element order.

Iterating Over an Array

Arrays provide several methods for iterating over their elements. The most common approach is to use a for-in loop:

    for number in numbers {
        print(number)
    }

This loop prints each element in the array to the console. You can also use methods like map, filter, and reduce for more advanced operations on arrays.

3. Implementing a Queue Using an Array

A queue is a data structure that follows the First-In-First-Out (FIFO) principle, where the first element added is the first one to be removed. Queues are commonly used in scenarios like task scheduling, breadth-first search algorithms, and managing requests in networking.

In Swift, you can implement a basic queue using an array. Here’s an example:

    struct Queue<Element> {
        private var array: [Element] = []

        var isEmpty: Bool {
            return array.isEmpty
        }

        var count: Int {
            return array.count
        }

        // Adds an element to the back of the queue.
        mutating func enqueue(_ element: Element) {
            array.append(element)
        }

        // Removes and returns the front element, or nil if the queue is empty.
        // Note: removeFirst() on an array is O(n), so this suits small queues.
        mutating func dequeue() -> Element? {
            return array.isEmpty ? nil : array.removeFirst()
        }
    }

In this implementation:

The enqueue method adds an element to the end of the array.
The dequeue method removes and returns the first element in the array, or returns nil when the queue is empty.

Queues are useful in many scenarios, such as managing tasks in a multi-threaded environment or implementing a breadth-first search algorithm.

4. Dictionaries in Swift

Dictionaries are another powerful data structure in Swift. They store data in key-value pairs, allowing you to quickly look up values based on their associated keys. Dictionaries are ideal for tasks where you need fast access to data based on a unique identifier.

Declaring and Initializing a Dictionary

Here’s how you can declare and initialize a dictionary in Swift:

    var userAges: [String: Int] = ["Alice": 25, "Bob": 30]

In this example, the keys are strings representing user names, and the values are integers representing their ages.

Accessing and Modifying Dictionary Values

You can access and modify values in a dictionary using the key:

    // A lookup returns an optional, because the key may be missing.
    if let age = userAges["Alice"] {
        print("Alice is \(age) years old.")
    }
    userAges["Alice"] = 26

This code snippet retrieves Alice's age and updates it to 26. Dictionaries are highly efficient for lookups, making them a valuable tool when working with large datasets.

Adding and Removing Key-Value Pairs

To add a new key-value pair to a dictionary, simply assign a value to a new key:

    userAges["Charlie"] = 22

To remove a key-value pair, use the removeValue(forKey:) method:

    userAges.removeValue(forKey: "Bob")

5. Exploring Sets

Sets in Swift are similar to arrays, but with one key difference: they do not allow duplicate elements. Sets are unordered collections of unique elements, making them ideal for tasks like checking membership, ensuring uniqueness, and performing set operations (e.g., union, intersection).

Declaring and Initializing a Set

Here’s how you can declare and initialize a set in Swift:

    let uniqueNumbers: Set = [1, 2, 3, 4, 5]

Unlike arrays, sets do not maintain the order of elements. However, they are more efficient for operations like checking if an element exists.

Performing Set Operations

Swift sets support various operations that are common in set theory, such as union, intersection, and subtraction:

    let evenNumbers: Set = [2, 4, 6, 8]
    let oddNumbers: Set = [1, 3, 5, 7]
    let union = evenNumbers.union(oddNumbers)              // All unique elements from both sets
    let intersection = evenNumbers.intersection([4, 5, 6]) // Elements common to both sets: 4 and 6
    let difference = evenNumbers.subtracting([4, 6])       // Elements in evenNumbers but not in the other set: 2 and 8

6. Understanding the Codable Protocol

The Codable protocol in Swift simplifies encoding and decoding data, making it easier to work with JSON and other data formats. This is especially useful when interacting with web APIs or saving data to disk.

Defining a Codable Struct

Here’s an example of a struct that conforms to the Codable protocol:

    import Foundation // Provides JSONEncoder and JSONDecoder

    struct Person: Codable {
        var name: String
        var age: Int
        var address: String
    }

With Codable, you can easily encode and decode instances of Person using JSONEncoder and JSONDecoder. Both calls can throw, so outside a playground or top-level script they belong in a do-catch block or a throwing function:

    let person = Person(name: "Alice", age: 25, address: "123 Main St")
    let jsonData = try JSONEncoder().encode(person)
    let decodedPerson = try JSONDecoder().decode(Person.self, from: jsonData)

Output and Explanation

For each code snippet, you should test and verify that the output matches the expected results.
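One way to run such checks is sketched below. It repeats the Queue and Person types from the tutorial so that it is self-contained. The variable names (firstOut, removedAge, deduplicated, roundTripped) and the Equatable conformance on Person are assumptions added here for comparison purposes and are not part of the original snippets. The sketch uses precondition so that a failed check stops the program.

```swift
import Foundation

// Queue from step 3, repeated so this sketch is self-contained.
struct Queue<Element> {
    private var array: [Element] = []
    var isEmpty: Bool { array.isEmpty }
    var count: Int { array.count }
    mutating func enqueue(_ element: Element) { array.append(element) }
    mutating func dequeue() -> Element? { array.isEmpty ? nil : array.removeFirst() }
}

// Person from step 6. Equatable is an assumption added so values can be compared.
struct Person: Codable, Equatable {
    var name: String
    var age: Int
    var address: String
}

// Queue: elements should come out in the order they went in (FIFO).
var queue = Queue<String>()
queue.enqueue("first")
queue.enqueue("second")
let firstOut = queue.dequeue()
let secondOut = queue.dequeue()
let emptyOut = queue.dequeue() // The queue is empty by now, so this is nil.
precondition(firstOut == "first" && secondOut == "second")
precondition(emptyOut == nil && queue.isEmpty)

// Dictionary: update, add, and remove key-value pairs.
var userAges: [String: Int] = ["Alice": 25, "Bob": 30]
userAges["Alice"] = 26
userAges["Charlie"] = 22
let removedAge = userAges.removeValue(forKey: "Bob") // Returns the removed value.
precondition(userAges["Alice"] == 26 && userAges["Charlie"] == 22)
precondition(removedAge == 30 && userAges["Bob"] == nil)

// De-duplication through a Set. The result is sorted before comparing,
// because a Set does not guarantee any particular order.
let deduplicated = Array(Set([1, 2, 3, 3, 4, 5, 5])).sorted()
precondition(deduplicated == [1, 2, 3, 4, 5])

// Codable: encoding and then decoding should give back an equal value.
let person = Person(name: "Alice", age: 25, address: "123 Main St")
var roundTripped: Person? = nil
do {
    let jsonData = try JSONEncoder().encode(person)
    roundTripped = try JSONDecoder().decode(Person.self, from: jsonData)
} catch {
    print("Encoding or decoding failed: \(error)")
}
precondition(roundTripped == person)

print("All checks passed")
```

The sketch can be pasted into a playground or a main.swift file. In an app project, the same comparisons would more naturally live in unit tests.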
When implementing the queue structure, for example, enqueue and dequeue elements to ensure the correct order of processing. Similarly, when working with dictionaries, confirm that you can retrieve, add, and remove key-value pairs correctly.

Conclusion

This tutorial has covered fundamental data structures in Swift, including arrays, dictionaries, and sets, and their practical applications in iOS development. Understanding these data structures will make you a better Swift developer and prepare you for technical interviews and real-world projects.

Author Bio

Avi Tsadok, a seasoned iOS developer with a 13-year career, has proven his expertise leading projects for notable companies like Any.do, a top productivity app, and currently at Melio Payments, where he steers the mobile team. Known for his ability to simplify complex tech concepts, Avi has written four books and published 40+ tutorials and articles that enlighten and empower aspiring iOS developers. His voice resonates beyond the page, as he's a recognized public speaker and has conducted numerous interviews with fellow iOS professionals, furthering the field's discourse and development.

Packt

18 Jul 2024

10 min read

How we are Thinking About Generative AI

How we are Thinking About Generative AI for Developers and Tech Learning

Packt is a global tech publisher serving developers and tech professionals (TechPros). Over the last 20 years, we have published over 8,000 books and videos, gaining deep insights into the evolving challenges tech professionals face. Recently, the rapid emergence of generative AI (GenAI) technologies like CoPilot, ChatGPT, and Gemini has transformed the tech landscape, affecting everyone from software developers to business strategists.

The tech industry is at a critical inflection point with technology use, development, and education. At Packt, we are actively exploring generative AI's impact on the industry and TechPros' daily work and learning. Here, we outline our thoughts on how GenAI reshapes professional activities and tech learning, and our strategic responses to it.

We would love to hear your feedback on this document and your thoughts on the issues raised within it. Please do send any comments to: GenAI_feedback@packt.com.

The Impact of GenAI on TechPro Work

The rapid pace of advancement in Generative AI makes it difficult to predict, but we believe, on balance, that it is a force for good in software development. A core Packt value that we share with our TechPro users is a belief in and commitment to the power of technology for progress. Our default setting is to get on board with change.

GenAI is already changing the nature of many development jobs, but it will not mean the end of software development. We are fundamentally optimistic about the future for TechPros powered by GenAI.
It will mean more, faster, better work.

This is how we at Packt see these changes:

Increased Software Production

Humanity continuously evolves, adapts, and advances, maintaining a need for more sophisticated software solutions, whether those are built on traditional software platforms or on top of AI models themselves. GenAI is already transforming the economics of supply by making engineers more productive and enabling more engineering tasks. The demand for more, better software will remain, leading to an increase in the number of professionals building, designing, adapting, and managing software.

Shifts in Software Development

Much of what engineers spend time doing can be quite generic. GenAI is beginning to automate these middle-tier, routine activities, allowing developers to focus on higher-value, more creative tasks. This shift redistributes work in three dimensions from the center of the development stack. Work moves ‘up the stack’ into architecture, domain expertise, and design, ‘down the stack’ into complex algorithm development, infrastructure, and tooling, and outwards to the edges with specific integrations and implementations.

To meet the increased demand for software, there will be significantly more designers and implementors at those development edges, with increasing business and domain focus and specialization. There will be a continuously hard-to-meet need for deep tech engineers building the tools and infrastructure that enable this automation to operate efficiently at scale and speed. This will be seen at the hardware and firmware level as well as operating systems, cloud platforms, and the models and algorithms that modern software is built upon.

Increased Domain and Business Specialization

As GenAI moves tasks from generic operations upwards and outwards to more specialized domains, engineers will increasingly make decisions that require greater judgment and domain expertise.
This will lead to a greater focus on domain experience and knowledge, and a higher value on business relationships.

GenAI also democratizes the development and management of systems, making these processes accessible to more users and transforming many jobs from direct task execution to overseeing AI agents that perform the work. This evolution could significantly expand the roles involving aspects of software design or delivery.

Impact on Tech Pro Learning

GenAI integrates automation and problem solving, leading to profound change in how TechPros learn and solve problems. We see the core changes as being:

Shift Toward Just-In-Time (JIT) Continuous Learning

Developers have always preferred to learn by doing: starting work and solving problems on the fly. GenAI makes this the only viable approach. The ROI of upfront Just-In-Case (JIC) learning, where developers research technologies that might be useful in future, declines when co-pilots can accelerate initial builds and troubleshoot during development. GenAI tools can escalate to rapid JIT learning sprints to backfill knowledge gaps as they are discovered.

GenAI tools can help engineers to rapidly understand and work on existing complex and often undocumented code bases, again backfilling knowledge gaps JIT.

Entry Level Learning Moves to Simulated Environments

The JIT learning-by-doing model also applies to students and juniors, but the study work they do will be “as good as real.” Traditional, linear courseware will be replaced by personalized, hands-on projects in rich simulated environments. These environments provide shorter, contextual learning experiences that effectively bridge the gap between theory and practice, reducing the training load on increasingly busy senior developers.
Growth in Demand for Real World Experience and Peer Interaction

As development increasingly moves up the stack and routine tasks are automated, there is a growing need for TechPros to understand specific real-world applications of systems and solutions. Highly specific, detailed, and objective case studies with high relevance to a specific problem area and technical solution will become increasingly valuable. Demand for discussion and interaction with experienced fellow professionals to share knowledge and insights will also grow. Such authentic content not only aids learning but also enhances the training of AI models.

Authoritative and Expert Insight Remains Key

Despite the shift towards more automated and JIT learning approaches, a thorough understanding of core concepts remains crucial. Books will continue to be one of the most powerful and authoritative ways for technology originators to share their foundational knowledge. This will remain the key long-term use-case for tech books.

Continuing Need for Creator Trust and Authenticity

GenAI enables the rapid creation of written work. In the tech publishing domain, we estimate that up to around 50% of titles in certain categories on Amazon might already be AI-generated or derived. This AI content meets certain user needs, and this proliferation will continue across store platforms. We believe that human-generated work fulfils a different user need and that there will always be value in authentic creator insight and expertise. We continue to build direct relationships with tech professionals and authors to create and publish this content.

The Future is Uncertain

How this evolves is hard to know. The pace of change both in the technology and in the landscape around it has surfaced issues with reliability, compliance, cost, and memory/reasoning limitations. GenAI technology is moving extremely fast but has serious technical challenges.
These issues will be resolved over time, but they limit the pace of actual deployment.

A Cautious Approach to Change

The case for changing existing systems, practices, and organizational models should be approached with caution. Enterprises have a high bar for adopting core systems and the deployment phase will be long and require detailed work.

Uncertainty in Computing Platforms

It remains uncertain whether GenAI might evolve into the dominant general purpose computing platform or how it will evolve past the current transformer architecture. It may become a ubiquitous implementation layer for all services over time; we do not know. However, we share the view that this is a pivotal phase for technology and for humanity.

A Mixed Economy of the Old and the New

We see a long phase of a mixed economy of old methods and new GenAI tools. There will be pockets of rapid adoption of GenAI tooling, like we see in coding co-pilots and in application areas, such as customer service agents. However, with every deployment there will be a lot of “old style” engineering: problem solving, integrations, QA, optimization. The shifts to high level working will be gradual and not immediately noticeable.

Friction in Human Systems

Human systems inherently resist change. Individuals stick with working and learning systems with which they are comfortable. Teaching methods evolve slowly, and we see different generations working and learning in different ways. While a shift toward Just-In-Time (JIT) learning is underway, structured, long-form learning will continue to play a crucial role.

Rapid Adoption Among Developers

The pace at which individual developers have adopted co-pilots and are using GenAI for problem solving is striking. We expect these trends of grassroots, individual adoption to continue and accelerate.
How Packt is Responding

The insights gained from talking with TechPros, combined with our thinking about the impact of GenAI on TechPro work and learning, have resulted in these strategic initiatives:

Shift to the Edges of the Development Stack in Publishing

We are pioneering new approaches to developing and publishing real world practical case studies to answer the crucial questions: “What are people actually building with this right now?” and, “How are they actually doing it?”

We will increase our focus on publishing specific, definitive, deep, technical books from the creators and builders of new technology to help TechPros broaden their skills across the development stack. We will continue to build the tech book canon in the era of GenAI.

License for LLM Training Responsibly

The uniquely high-quality content tech authors create has immense value for LLM training. We want to support the evolution of this technology while developing model training as a potentially valuable new channel for published content.

We want authors to get fair value and the recognition they are due, and we will pursue all agreements with partners in a pragmatic but principled way.

Use GenAI to Enable a Step Change in Content Engineering and Derived Works

GenAI tools and automations can reduce the cost and effort of keeping a title up to date as technology evolves, and of creating a rich portfolio of derived works from the initial content. We call this BODE: Build Once, Deploy Everywhere.

We are exploring exciting use-cases to increase the value of the original work, and its reach into new platforms, formats, languages, and versions.

Build Packt Models and Explore JIT

We have already delivered experimental AI agents fine-tuned on specific Packt titles. We are expanding this to topic, role, and whole-library models.
We are exploring integration of the Packt corpus into co-pilots and tools to deliver workflow-embedded JIT knowledge and learning escalation.

Build Professional Memberships

Recognizing the increased value of live interactions in a post-GenAI world, we are committed to enabling Tech Professionals to engage in high-quality, trustworthy interactions with peers working on similar roles and projects.

Thoughts? Feedback?

Please send any comments to: GenAI_feedback@packt.com

05 Jul 2024

13 min read

Microsoft AI’s Skeleton Key, AutoML with AutoGluon, MultiOn AI's Retrieve API, Narrative BI’s Hybrid AI, Python's Duck Typing, Gibbs Diffusion

Subscribe to our Data Pro newsletter for the latest insights. Don't miss out: sign up today!

👋 Hello,

Happy Friday! Welcome to DataPro#101: Your Essential Data Science & ML Update! 🚀

This week, we’ve curated the latest techniques in data extraction, transforming unstructured data into structured formats, best practices for prompt engineering in NL2SQL, and much more. Consider this your all-in-one guide to staying informed in the ever-evolving world of data science and machine learning. Now, dive in and explore these exciting new ideas!

⚡ Tech Highlights: Stay Updated!
Prompt Engineering with Claude 3: Learn hands-on techniques on Amazon Bedrock.
Accelerated PyTorch: Boost models with torch.compile on AWS Graviton.
BigQuery Data Canvas: Perfect your prompts.
Skeleton Key AI: New AI jailbreak method.
GraphRAG: Complex data discovery tool on GitHub.

📚 New from Packt Library
Data Science for Web3 - Guide to blockchain data analysis and ML.

🔍 Latest in LLMs & GPTs
NASA-IBM's INDUS Models: Advanced science LLMs.
EvoAgent: Evolutionary multi-agent systems.
Kyutai's Moshi: Real-time AI model.
MultiOn AI's Retrieve API: Accurate web search.
Gibbs Diffusion (GDiff): Bayesian image denoising.
Narrative BI’s Hybrid AI: Business data analysis.
WildGuard: Safe LLM interactions.
ProgressGym: Ethical AI alignment.
OmniParse: Structuring unstructured data for GenAI.

✨ What's Fresh
Claude 3.5 Sonnet Use Cases: Future AI capabilities.
Explainability in ML: Make models understandable.
Group-By Aggregation: Powerful EDA tool.
OpenAI and PandasAI: Series operations.
AutoML with AutoGluon: ML in four lines of code.
Python's Duck Typing: Flexible coding concept.

🔰 GitHub Finds: Add These Repos
fal/AuraSR
arcee-ai/Arcee-Spark-GGUF
pprp/Pruner-Zero
ruiyiw/patient-psi
hrishioa/rakis
ragapp/ragapp
Doriandarko/claude-engineer
hao-ai-lab/MuxServe

DataPro Newsletter is not just a publication; it’s a complete toolkit for anyone serious about mastering the ever-changing landscape of data and AI. Grab your copy and start transforming your data expertise today!

📥 Feedback on the Weekly Edition
Take our weekly survey and get a free PDF copy of our best-selling book, "Interactive Data Visualization with Python - Second Edition." We appreciate your input and hope you enjoy the book!

Cheers,
Merlyn Shelley
Editor-in-Chief, Packt

🔰 Data Science Tool Kit
➔ fal/AuraSR: AuraSR, a GAN-based super-resolution model for upscaling images. Implemented in PyTorch, it's inspired by the GigaGAN paper, enhancing image quality significantly.
➔ arcee-ai/Arcee-Spark-GGUF: Arcee Spark, a 7B model from Qwen2, excels with fine-tuning and DPO, outperforming GPT-3.5 on tasks, ideal for efficient AI deployment.
➔ pprp/Pruner-Zero: Pruner-Zero automates symbolic pruning metric discovery for Large Language Models, surpassing current methods in language modeling and zero-shot tasks.
➔ ruiyiw/patient-psi: Patient-Ψ uses Large Language Models to simulate patient interactions for training mental health professionals, emphasizing cognitive modeling and practical deployment.
➔ hrishioa/rakis: Rakis is a browser-based permissionless AI inference network enabling decentralized consensus without servers, emphasizing open-source and educational use.
➔ ragapp/ragapp: RAGapp simplifies enterprise use of Agentic RAG models, configurable like OpenAI's custom GPTs, deployable via Docker on cloud infrastructure.
➔ Doriandarko/claude-engineer: Claude Engineer, powered by Anthropic's Claude-3.5-Sonnet, aids software development through an interactive CLI blending AI model capabilities with file operations and web search.
➔ hao-ai-lab/MuxServe: MuxServe efficiently serves multiple LLMs using spatial-temporal multiplexing, optimizing memory and computation resources based on LLM popularity and characteristics.
📚 Expert Insights from Packt Community

Data Science for Web3: A comprehensive guide to decoding blockchain data with data analysis basics and machine learning cases
By Gabriela Castillo Areco

Understanding the blockchain ingredients

If you have a background in blockchain development, you may skip this section.

Web3 represents a new generation of the World Wide Web that is based on decentralized databases, permissionless and trustless interactions, and native payments. This new concept of the internet opens up various business possibilities, some of which are still in their early stages.

Currently, we are in the Web2 stage, where centralized companies store significant amounts of data sourced from our interactions with apps. The promise of Web3 is that we will interact with Decentralized Apps (dApps) that store only the relevant information on the blockchain, accessible to everyone.

As of the time of writing, Web3 has some limitations recognized by the Ethereum organization:

Velocity: The speed at which the blockchain is updated poses a scalability challenge. Multiple initiatives are being tested to try to solve this issue.
Intuition: Interacting with Web3 is still difficult to understand. The logic and user experience are not as intuitive as in Web2 and a lot of education will be necessary before users can start utilizing it on a massive scale.
Cost: Recording an entire business process on the chain is expensive. Having multiple smart contracts as part of a dApp costs a lot for the developer and the user.

Blockchain technology is a foundational technology that underpins Web3. It is based on Distributed Ledger Technology (DLT), which stores information once it is cryptographically verified. Once reflected on the ledger, each transaction cannot be modified and multiple parties have a complete copy of it.

Two structural characteristics of the technology are the following:

It is structured as a set of blocks, where each block contains information (cryptographically hashed – we will learn more about this in this chapter) about the previous block, making it impossible to alter it at a later stage. Each block is chained to the previous one by this cryptographic sharing mechanism.
It is decentralized. The copy of the entire ledger is distributed among several servers, which we will call nodes. Each node has a complete copy of the ledger and verifies consistency every time it adds a new block on top of the blockchain.

This structure provides the solution to double spending, enabling for the first time the decentralized transfer of value through the internet. This is why Web3 is known as the internet of value.

Since the complete version of the ledger is distributed among all the participants of the blockchain, any new transaction that contradicts previously stored information will not be successfully processed (there will be no consensus to add it). This characteristic facilitates transactions among parties that do not know each other without the need for an intermediary acting as a guarantor between them, which is why this technology is known as trustless.

The decentralized storage also takes control away from each server and, thus, there is no sole authority with sufficient power to change any data point once the transaction is added to the blockchain. Since taking down one node will not affect the network, if a hacker wants to attack the database, they would require such high computing power that the attempt would be economically unfeasible. This adds a security level that centralized servers do not have.

This excerpt is from the latest book, "Data Science for Web3: A comprehensive guide to decoding blockchain data with data analysis basics and machine learning cases", written by Gabriela Castillo Areco.
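The block-chaining idea described in the excerpt can be illustrated with a small, deliberately simplified sketch. Everything in it is an assumption made for illustration: the Block type, the makeChain and isConsistent helpers, the sample transactions, and above all the hash function, which is FNV-1a, a simple non-cryptographic hash standing in for the cryptographic hashing a real blockchain uses. It only shows why changing an earlier block breaks the link stored in the next one.

```swift
// FNV-1a (64-bit): a simple NON-cryptographic hash, used here only as a
// stand-in for the cryptographic hash a real blockchain would use.
func fnv1a(_ text: String) -> UInt64 {
    var hash: UInt64 = 0xcbf29ce484222325
    for byte in text.utf8 {
        hash ^= UInt64(byte)
        hash = hash &* 0x100000001b3 // Wrapping multiply by the FNV prime.
    }
    return hash
}

// Hypothetical block: it stores its data plus the hash of the previous block.
struct Block {
    let index: Int
    let data: String
    let previousHash: UInt64

    // The block's own hash covers its contents and the link to its predecessor.
    var hash: UInt64 { fnv1a("\(index)|\(data)|\(previousHash)") }
}

// Builds a chain in which every block records the hash of the block before it.
func makeChain(_ entries: [String]) -> [Block] {
    var chain: [Block] = []
    for (index, data) in entries.enumerated() {
        let previousHash = chain.last?.hash ?? 0 // The first block has no predecessor.
        chain.append(Block(index: index, data: data, previousHash: previousHash))
    }
    return chain
}

// A chain is consistent when each stored previousHash matches the
// recomputed hash of the preceding block.
func isConsistent(_ chain: [Block]) -> Bool {
    return zip(chain, chain.dropFirst()).allSatisfy { pair in
        pair.1.previousHash == pair.0.hash
    }
}

let chain = makeChain(["A pays B 5", "B pays C 2", "C pays A 1"])
let chainIsValid = isConsistent(chain)

// Altering an earlier block changes its hash, so the next block's stored
// previousHash no longer matches and the check fails.
var tampered = chain
tampered[1] = Block(index: 1, data: "B pays C 200", previousHash: chain[1].previousHash)
let tamperedIsValid = isConsistent(tampered)

print("original consistent:", chainIsValid)
print("tampered consistent:", tamperedIsValid)
```

In this toy version a single check over one copy of the chain detects the change. The decentralized part of the excerpt corresponds to many nodes each holding a copy and running a comparable consistency check before accepting a new block.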
Unlock access to the full book and a wealth of other titles with a 7-day free trial in the Packt Library. Start exploring today!

⚡ Tech Tidbits: Stay Wired to the Latest Industry Buzz!

AWS
➤ Prompt engineering techniques and best practices: Learn by doing with Anthropic’s Claude 3 on Amazon Bedrock. In this blog post, the focus is on crafting effective prompts for generative AI models to achieve desired outputs. It emphasizes the importance of well-constructed prompts in guiding models like Claude 3 Haiku on Amazon Bedrock to produce accurate and relevant responses, showcasing examples of prompt variations and their impact.
➤ Accelerated PyTorch inference with torch.compile on AWS Graviton processors. In this blog post, AWS optimized PyTorch's torch.compile feature for AWS Graviton3 processors, significantly enhancing performance for Hugging Face and TorchBench model inference compared to the default eager mode. These optimizations, available from PyTorch 2.3.1, aim to streamline model execution on Graviton3-based Amazon EC2 instances.

Google
➤ How to write prompts for BigQuery data canvas? This blog post focuses on leveraging generative AI, specifically Gemini in BigQuery, to perform data tasks via natural language queries (NL2SQL and NL2Chart). It highlights how refining NL prompts can enhance query accuracy, promoting collaboration and efficiency among data professionals using BigQuery's data canvas tool.

Microsoft
➤ Microsoft AI Unveils Skeleton Key: A Novel Generative AI Jailbreak Method. This blog post discusses a newly discovered type of attack in generative AI called Skeleton Key, also known as Master Key. It explores how this attack bypasses AI guardrails, allowing models to generate unauthorized content, and outlines Microsoft's mitigation strategies using Prompt Shields in Azure AI.
➤ GraphRAG: New tool for complex data discovery now on GitHub. The update introduces GraphRAG, a graph-based approach to retrieval-augmented generation (RAG), now available on GitHub. It enhances information retrieval and response generation by automating knowledge graph extraction from text datasets, offering structured insights for global queries. An Azure-hosted API facilitates easy deployment without coding.

🔍 From Bits to BERT: Keeping Up with LLMs & GPTs
🔸 NASA-IBM Collaboration Develops INDUS Large Language Models for Advanced Science Research. The blog explores NASA's collaboration with IBM to develop INDUS, a suite of specialized large language models (LLMs) tailored for scientific domains. INDUS enhances data analysis, retrieval, and curation across Earth science, heliophysics, and more, advancing research capabilities in diverse scientific disciplines.
🔸 EvoAgent: Expanding Expert Agents to Multi-Agent Systems with Evolutionary Algorithms. EvoAgent automates the extension of expert agents to multi-agent systems using evolutionary algorithms, applicable to any LLM-based agent framework. It enhances agent diversity and performance across tasks, exemplified in debates by generating varied opinions and improving content quality dynamically.
🔸 Kyutai Releases Moshi: A Real-Time AI Model that Understands and Speaks. Kyutai introduces Moshi, a real-time native multimodal foundation model surpassing GPT-4o functionalities. Moshi understands emotions, speaks with accents like French, and handles dual audio streams, enabled by joint pre-training on text and audio. It supports open-source transparency and runs efficiently on consumer hardware.
🔸 MultiOn AI's Retrieve API Boosts Web Search with Real-Time Accuracy for Advanced Applications. MultiOn AI has launched the Retrieve API, a cutting-edge tool for autonomous web information retrieval. It enhances data extraction from web pages with real-time processing, catering to diverse applications such as personalized shopping assistants, automated lead generation, and content creation tools, setting new standards in web data extraction technology.
🔸 Gibbs Diffusion (GDiff): A Bayesian Blind Denoising Method for Images and Cosmology. The study introduces Gibbs Diffusion (GDiff) as an innovative method for blind denoising with deep generative models. It enables simultaneous sampling of signal and noise parameters, improving Bayesian inference for scenarios like natural image denoising and cosmological data analysis, enhancing accuracy in noise characterization and signal recovery.
🔸 Narrative BI Introduces Hybrid AI Approach for Business Data Analysis: The research explores hybrid approaches in business data analysis, combining rule-based systems' precision with Large Language Models' (LLMs) pattern recognition. This integration aims to generate actionable insights from complex datasets, improving efficiency and accuracy in decision-making processes for businesses.
🔸 WildGuard: A Lightweight Moderation Tool for User Safety in LLM Interactions. The paper introduces WildGuard, an open and lightweight moderation tool for enhancing safety in Large Language Models (LLMs). It focuses on identifying malicious intent in user prompts, detecting safety risks in model responses, and evaluating model refusal rates. WildGuard achieves state-of-the-art performance across these tasks, addressing critical gaps in existing moderation tools.
🔸 ProgressGym: ML Framework for Ethical Alignment in Frontier AI. This research addresses the influence of AI systems, particularly large language models (LLMs), on human epistemology and societal values. It introduces progress alignment as a technical solution to prevent AI reinforcement of problematic moral beliefs. ProgressGym, an experimental framework, facilitates learning from historical data to advance real-world moral decision-making challenges.
🔸 OmniParse: AI Platform for Structuring Unstructured Data for GenAI Applications. OmniParse tackles the challenge of managing diverse unstructured data types (documents, images, audio, video, and web content) by converting them into structured formats optimized for AI applications. It integrates various tools like Surya OCR and Florence-2 for accurate data extraction, enhancing workflow efficiency and data usability across platforms.

✨ On the Radar: Catch Up on What's Fresh
🔹 10 Use Cases of Claude 3.5 Sonnet: Unveiling the Future of Artificial Intelligence AI with Revolutionary Capabilities. Claude 3.5 Sonnet by Anthropic AI marks a leap forward in AI capabilities, showcasing versatility across diverse domains. It excels in generating n-body particle animations, interactive learning dashboards, escape room experiences, virtual psychiatry, interactive poster designs, educational visual demonstrations, customizable calendar applications, real-time object detection, financial tools, and advanced physics simulations.
🔹 Explainability, Interpretability and Observability in Machine Learning: The article explores the nuances of machine learning (ML) transparency through concepts like explainability, interpretability, and observability. It discusses their definitions, distinctions, and importance in fostering trust, accountability, and effective deployment of ML models across various industries and applications.
🔹 A Powerful EDA Tool: Group-By Aggregation. The article dives into Exploratory Data Analysis (EDA) techniques, focusing on group-by aggregation in Pandas.
Using the Metro Interstate Traffic dataset as an example, it demonstrates how to derive insights such as monthly traffic progression, daily traffic profiles, hourly traffic patterns by weekday versus weekend, and identifying top weather conditions associated with congestion rates. 🔹 Using OpenAI and PandasAI for Series Operations: This article explores PandasAI, leveraging AI models like OpenAI to enhance Pandas data manipulation tasks. It covers querying Series values, creating new Series, conditional value setting, and reshaping data using natural language commands. Examples include summarizing statistics, conditional operations, and reshaping COVID-19 and NLS youth study datasets efficiently. 🔹 AutoML with AutoGluon: ML workflow with Just Four Lines of Code. The article explores AutoGluon, an automated machine-learning framework developed by Amazon Web Services (AWS). It discusses how AutoGluon simplifies the entire machine-learning process—from data preprocessing to model selection and hyperparameter tuning—making it accessible and efficient for users across various data types like tabular, text, and image data. 🔹 Understanding Python's Duck Typing: The article explores the concept of duck typing in Python, emphasizing behavior over type. It allows objects to be used based on their methods rather than explicit types, promoting flexibility and polymorphism. Duck typing simplifies code but requires careful handling to avoid runtime errors. See you next time!


Merlyn Shelley

27 Jun 2024

14 min read

Top 100+ Essential Data Science Tools & Repos: Streamline Your Workflow Today!


Introduction

As data professionals, navigating the vast sea of Big Data often leaves us searching for the right tools to harness its potential. Whether we're defining intricate problems, identifying emerging trends, or crafting innovative solutions, the challenge is undeniable. Too often, this quest has us wandering aimlessly through the web, seeking elusive answers. Here at the DataPro Newsletter team, we understand this all too well.

That's why, in celebration of our 100th edition, we're thrilled to present a special gift to our valued readers: a thorough reference module brimming with resources. This carefully curated collection features over 100 of the most popular tools and GitHub repositories. Each one is not only widely used and trusted but is also consistently updated with the latest breakthroughs to enhance your data processing capabilities. Think of this module as your treasure chest, designed to streamline your workflow and inspire innovative solutions.

Bookmark this page for quick access whenever you encounter challenges in any area of data science and machine learning, from DataOps to Recommender Systems to Quantitative Finance. We've got it all covered! So, dive into this one-stop reference module, explore its depths, and let the spirit of data kinship propel you forward. Here's to more empowering tools and transformative insights from your DataPro team. Cheers!

DataOps/MLOps
kestra-io/kestra: Kestra is an open-source orchestrator for scheduled and event-driven workflows, leveraging Infrastructure as Code for reliable management.
open-metadata/OpenMetadata: OpenMetadata is a unified platform for data discovery, observability, and governance, featuring a central repository, column lineage, and team collaboration.
dolthub/dolt: Dolt is a SQL database with Git-like version control features, accessible via MySQL or a command line interface.
iterative/dvc: DVC is a tool for reproducible machine learning, enabling data and model versioning, lightweight pipelines, experiment tracking, and easy sharing.
quiltdata/quilt: Quilt allows creating versioned datasets with Python and an S3 bucket. It supports data-driven teams, aiding rapid experimentation and collaboration.

Real-time Data Processing
allinurl/goaccess: GoAccess is a real-time web log analyzer for *nix systems and browsers, offering fast HTTP statistics. More details: goaccess.io.
feathersjs/feathers: Feathers is a TypeScript/JavaScript framework for building APIs and real-time apps, compatible with various backends and frontends.
apache/age: Apache AGE extends PostgreSQL with graph database capabilities, supporting both relational SQL and openCypher graph queries seamlessly.
zephyrproject-rtos/zephyr: Real-time OS for diverse hardware, from IoT sensors to smart watches, emphasizing scalability, security, and resource efficiency.
hazelcast/hazelcast: Hazelcast integrates stream processing and fast data storage for real-time insights, enabling immediate action on data-in-motion within a unified platform.

Data Quality Management
WeBankFinTech/Qualitis: Qualitis manages data quality through verification, notification, and management across various data sources, solving data processing-related quality issues.
raystack/optimus: Optimus is a robust workflow orchestrator for data transformation, modeling, pipelines, and quality management, emphasizing ease of use and reliability.
Toloka/crowd-kit: Crowd-Kit is a Python library for crowdsourced annotation, featuring aggregation methods, metrics, and datasets to simplify working with crowd data.
ydataai/ydata-profiling: ydata-profiling offers a streamlined, fast EDA solution akin to pandas' df.describe(), providing detailed DataFrame analysis exportable in formats like HTML and JSON.
cleanlab/cleanlab: cleanlab automates data and label cleaning by detecting issues in ML datasets, enhancing model training with real-world data.

Predictive Analytics
spring-cloud/spring-cloud-dataflow: Spring Cloud Data Flow enables microservices-driven data processing pipelines on Cloud Foundry and Kubernetes, supporting diverse use cases like streaming and batch processing.
ScottfreeLLC/AlphaPy: AlphaPy, a Python ML framework, caters to speculators and data scientists with scikit-learn, pandas, and additional tools for feature engineering and visualization.
retentioneering/retentioneering-tools: Retentioneering simplifies analyzing clickstreams and user paths, offering deeper insights than funnel analysis, benefiting data and marketing analysts.
genular/pandora: PANDORA offers advanced analytics for biomedical research, employing machine learning tools like clustering, PCA, UMAP, and interpretable models for discovery.
nabeel-oz/qlik-py-tools: Qlik's SSE integrates modern data science into Qlik Sense, enabling business users to leverage advanced analytics through Python-based functions.

Deep Learning
Lightning-AI/pytorch-lightning: Lightning 2.0 simplifies PyTorch workflows with a stable API, enabling scalable training and deployment of AI models efficiently.
ultralytics/yolov5: YOLOv5 by Ultralytics is a leading vision AI model, built on extensive open-source research and development for advanced performance.
hpcaitech/ColossalAI: Colossal-AI simplifies distributed deep learning with user-friendly tools, enabling easy parallel training and inference similar to local model development.
naptha/tesseract.js: Tesseract.js simplifies OCR with a webassembly-based Tesseract engine, supporting both browser and Node.js environments with easy integration and setup.
microsoft/DeepSpeed: DeepSpeed enables efficient training of models like ChatGPT with significant speed improvements and cost reductions across all scales.
Reinforcement Learning
ray-project/ray: Ray is a unified framework that scales AI and Python applications with a distributed runtime and specialized AI libraries.
d2l-ai/d2l-en: An open-source book using Jupyter notebooks to make deep learning accessible, blending concepts, context, and interactive code examples.
Unity-Technologies/ml-agents: Unity ML-Agents enables games and simulations for training intelligent agents with deep reinforcement learning and imitation learning, fostering innovation in AI.
google/trax: Trax is a Google Brain-endorsed deep learning library known for clear code and speed, demonstrated in a Colab notebook.
wandb/wandb: The repository includes a CLI and Python API for visualizing and tracking machine learning experiments effectively.
VowpalWabbit/vowpal_wabbit: Vowpal Wabbit advances machine learning with online, hashing, allreduce, and active learning techniques, pushing the frontier of ML capabilities.

Time Series Analysis
taosdata/TDengine: TDengine is a high-performance, open-source time-series database designed for IoT, connected cars, industrial IoT, and DevOps environments.
timescale/timescaledb: An open-source SQL database for time-series data, optimized for rapid data ingestion and complex querying, available as a PostgreSQL extension.
influxdata/telegraf: Telegraf is an agent for gathering and processing metrics, logs, and data, featuring 300+ plugins and community-driven development for flexibility.
questdb/questdb: QuestDB is an open-source time-series database known for high throughput ingestion, fast SQL queries, and operational simplicity, ideal for various high-cardinality datasets.
ccfos/nightingale: Nightingale is an all-in-one, open-source, cloud-native monitoring system combining data collection, visualization, and alerting capabilities seamlessly.
Data Engineering
PrefectHQ/prefect: Prefect simplifies Python data pipeline orchestration, transforming scripts into dynamic workflows that react to changes and ensure resilience.
airbytehq/airbyte: Airbyte, an open-source data integration platform, offers 300+ connectors for seamless ELT pipelines between diverse data sources and destinations.
argoproj/argo-workflows: Argo Workflows orchestrates parallel jobs on Kubernetes via container-native workflows, supporting DAGs and accelerating compute-intensive tasks like ML and data processing.
dagster-io/dagster: Dagster is a cloud-native data pipeline orchestrator with integrated lineage, observability, declarative programming, and robust testability across the lifecycle.
Avaiga/taipy: Taipy simplifies web app development for data scientists & ML engineers using Python, focusing on AI algorithms with no extra languages.

Business Intelligence
ankane/blazer: SQL-based tool for data exploration, chart creation, dashboard sharing. Supports various data sources, variables, checks, audits, and security integrations.
evidence-dev/evidence: Open-source BI tool uses Markdown with SQL queries for data sourcing, rendering charts, and generating templated, dynamic web pages.
lightdash/lightdash: Empower teams with self-service data insights using dbt: define metrics, visualize data, and share dashboards seamlessly across your organization.
TuiQiao/CBoard: User-friendly open BI platform for self-service reporting and dashboards, simplifying data insights and sharing across teams effortlessly.
quarylabs/quary: BI platform for engineers to connect databases, write SQL for table transformations, create charts, dashboards, and reports with collaboration and deployment capabilities.

Data Visualization
netdata/netdata: Real-time metrics collection and visualization for servers, cloud, Kubernetes, and edge/IoT devices, scaling effortlessly across diverse environments.
directus/directus: Open-source API and dashboard for managing SQL database content with REST & GraphQL interfaces, supporting various databases, and customizable for on-premises or cloud deployment.
airbnb/visx: Reusable low-level visualization components combining d3's power with React's DOM updating capabilities for dynamic data visualization.
uber/react-vis: React component library for diverse data visualizations: line, bar, scatter, heatmaps, pie charts, sunbursts, radar charts, and more.
bokeh/bokeh: Interactive visualization library for web browsers, offering versatile graphics creation and high-performance interactivity for large datasets and dashboards.
apache/echarts: Free JavaScript library for intuitive, interactive, and customizable charts, ideal for enhancing commercial products with powerful visualizations.

Recommender Systems
NicolasHug/Surprise: Python scikit for building recommender systems with explicit rating data, emphasizing experiment control, dataset handling, and diverse prediction algorithms.
gorse-io/gorse: Open-source recommendation system in Go, designed for universal integration into online services, automating model training based on user interaction data.
recommenders-team/recommenders: Recommenders, a Linux Foundation project, offers Jupyter notebooks for building classic and cutting-edge recommendation systems, covering data prep, modeling, evaluation, optimization, and production deployment on Azure.
alibaba/Alink: Alink, developed by Alibaba's PAI team, integrates Flink for ML algorithms. PyAlink supports various Flink versions, maintaining compatibility up to Flink 1.13.
RUCAIBox/RecBole: RecBole, built on Python and PyTorch, facilitates research with 91 recommendation algorithms across general, sequential, context-aware, and knowledge-based categories.
Quantitative Finance
AI4Finance-Foundation/FinGPT: FinGPT is a cost-effective, adaptable financial large language model for quick updates and fine-tuning, enhancing accessibility compared to BloombergGPT.
google/tf-quant-finance: This library leverages TensorFlow's hardware acceleration and automatic differentiation for high-performance mathematical methods, mid-level functions, and pricing models support.
goldmansachs/gs-quant: GS Quant, a Python toolkit by Goldman Sachs, aids in developing quantitative trading strategies and risk management solutions with robust market experience.
domokane/FinancePy: A Python finance library specializing in pricing and managing financial derivatives across fixed-income, equity, FX, and credit markets.
romanmichaelpaolucci/Q-Fin: QFin is evolving with enhanced object-oriented principles, deprecating old modules like PDEs/SDEs, introducing 'stochastics' for model calibration and option pricing.
avhz/RustQuant: This Rust library for quantitative finance covers diverse modules from autodiff and data handling to instruments pricing and stochastic processes.

Responsible AI
microsoft/responsible-ai-toolbox: Responsible AI Toolbox offers interfaces and libraries for model and data exploration, enabling developers to monitor and improve AI responsibly.
Giskard-AI/giskard: Giskard, an open-source Python library, detects performance, bias, and security issues in AI applications, spanning LLMs to traditional ML models.
fairlearn/fairlearn: Fairlearn, a Python package, helps developers assess and mitigate fairness issues in AI systems with algorithms and assessment metrics provided.
Azure/PyRIT: PyRIT is an open-access Python tool for generative AI, aiding security professionals and ML engineers in identifying system risks.
ModelOriented/DALEX: DALEX enhances model transparency to prevent failure through its explainability tools, supporting understanding and trust in complex AI systems.
JohnSnowLabs/langtest: LangTest simplifies testing of AI models with over 60 tests in one line, covering robustness, bias, fairness, and accuracy across various NLP frameworks.

Explainable AI (XAI)
SeldonIO/alibi: Alibi is a Python library focused on machine learning model inspection, offering diverse explanation methods for classification and regression models.
Trusted-AI/AIX360: AI Explainability 360 offers an open-source Python toolkit for detailed model interpretability across various data types, supporting diverse explanation methods.
dssg/aequitas: Aequitas is an open-source toolkit for bias auditing and Fair ML, aiding data scientists and researchers in assessing and correcting model biases.
albermax/innvestigate: iNNvestigate is a Python library providing a unified interface for various methods to analyze neural networks' predictions and understand their internal workings.
mindsdb/lightwood: Lightwood is an AutoML framework simplifying machine learning pipelines with JSON-AI syntax, allowing customization and automation across diverse data types.

Anomaly Detection
SeldonIO/alibi-detect: Alibi Detect is a Python library for detecting outliers, adversarial attacks, and drift in tabular, text, image, and time series data.
datamllab/tods: TODS automates outlier detection in multivariate time-series data with modules for data processing, feature analysis, and diverse detection algorithms.
pygod-team/pygod: PyGOD is a Python library using PyTorch Geometric for graph outlier detection, offering 10+ algorithms and easy integration with PyOD.
Jingkang50/OpenOOD: This repository replicates methods from the Generalized Out-of-Distribution Detection Framework for fair comparison across anomaly, novelty, and out-of-distribution detection methods.
yzhao062/pyod: PyOD is a Python library for detecting anomalies in multivariate data, offering diverse algorithms for various project scales and datasets.
chaos-genius/chaos_genius: Chaos Genius is an open-source ML-powered analytics engine for outlier detection and root cause analysis at scale.

Supply Chain Analytics
guacsec/guac: GUAC creates a high fidelity graph database for software security, facilitating organizational outcomes like audit, policy, and risk management.
owasp-dep-scan/blint: BLint is a Binary Linter using lief to verify executable security and capabilities, now supporting SBOM generation for compatible binaries.
samirsaci/picking-route: This repository focuses on improving warehouse productivity through Python-based tools and methodologies, particularly addressing order batching and optimizing picking routes using the Single Picker Routing Problem (SPRP).
ragamarkely/scanalytics: Scanalytics automates Supply Chain Analytics & Design tasks in Python, streamlining analyses and reducing manual spreadsheet work for assignments.
aitechtools/SunFlow: SunFlow optimizes supply chain design with comprehensive modeling of materials, components, suppliers, manufacturers, and customers, integrating costs, capacities, and constraints.
CIOL-SUST/SupplyGraph: This repository introduces a benchmark dataset for applying Graph Neural Networks (GNNs) to supply chain networks, enabling research in optimization and prediction.

Network Optimization
ray-project/ray: Ray is a scalable framework with a distributed runtime and AI libraries designed to accelerate AI and Python applications.
svg/svgo: SVGO optimizes SVG files by removing redundant metadata, comments, and hidden elements to improve file efficiency and rendering performance.
zeux/meshoptimizer: meshoptimizer is a C/C++ library optimizing GPU rendering by reducing mesh complexity and storage overhead, compatible with Rust via meshopt crate.
cvxpy/cvxpy: CVXPY is a Python-based modeling language designed for convex optimization problems, providing a natural expression format aligned with mathematical conventions.
guofei9987/scikit-opt: The repository provides Python implementations of various swarm intelligence algorithms such as Genetic Algorithm, Particle Swarm Optimization, and others for optimization tasks.

Speech Processing
espnet/espnet: ESPnet is a detailed speech processing toolkit using PyTorch, covering recognition, synthesis, translation, enhancement, diarization, and understanding tasks.
mozilla/DeepSpeech: DeepSpeech is an open-source Speech-To-Text engine based on Baidu's research, implemented using TensorFlow for accessibility and performance.
microsoft/SpeechT5: The repository proposes SpeechT5, adapting T5's text-to-text approach for self-supervised speech and text representation learning using shared encoders and modality-specific nets.
sloria/TextBlob: Python library simplifying NLP tasks like POS tagging, sentiment analysis, and classification with a straightforward API for textual data.
pytorch/audio: Torchaudio integrates PyTorch with audio processing, emphasizing GPU acceleration, trainable features via autograd, and maintaining a consistent tensor-based style.

Graph Data Science
neo4j/graph-data-science: The Neo4j Graph Data Science (GDS) library offers graph algorithms, transformations, and ML pipelines, accessible via Cypher within Neo4j.
cncf/landscape-graph: This repository explores open source project dynamics, evolution, and collaboration using a Graph Data Model for insightful community analysis.
BlueBrain/nexus: Blue Brain Nexus organizes and enhances data with a Knowledge Graph ecosystem, featuring various products, libraries, and tools for comprehensive use.
lynxkite/lynxkite: LynxKite is a robust graph data science platform with a user-friendly interface and powerful Python API for large datasets.
dgraph-io/dgraph: Dgraph is a scalable GraphQL database optimized for performance, offering ACID transactions and distributed architecture for real-time queries.
arangodb/arangodb: ArangoDB is a versatile multi-model database supporting documents, graphs, and key-values, empowering high-performance applications with SQL-like queries and JavaScript extensions.

ETL/ELT (Extract, Transform, Load / Extract, Load, Transform)
redpanda-data/connect: Redpanda Connect is a robust stream processor for seamless data integration, featuring a powerful mapping language and easy deployment options.
turbot/steampipe: Steampipe simplifies data access from APIs with CLI, Postgres FDWs, SQLite extensions, export tools, and cloud-based Turbot Pipes.
risingwavelabs/risingwave: RisingWave is a cost-efficient streaming database compatible with Postgres, designed for real-time event streaming data processing and analysis.
apache/dolphinscheduler: Apache DolphinScheduler is a modern data orchestration platform with low-code workflow creation, robust task management, and cloud-native capabilities.
rudderlabs/rudder-server: RudderStack is a privacy-focused, Segment-alternative platform in Golang and React. It simplifies data collection and integrates with warehouses and tools for enriched customer data pipelines.

We hope this extensive collection of tools and techniques proves to be a valuable asset in your daily data practice. May it help you achieve smoother workflows and better outcomes!


Serge Gershkovich

27 Jun 2024

7 min read

Mastering Semi-Structured Data in Snowflake


This article is an excerpt from the book, Data Modeling with Snowflake, by Serge Gershkovich. Discover how Snowflake's unique objects and features can be used to leverage universal modeling techniques through real-world examples and SQL recipes.

Introduction

In the era of big data, the ability to efficiently manage and analyze semi-structured data is crucial for businesses. Snowflake, a leading cloud-based data platform, offers robust features to handle semi-structured data formats like JSON, Avro, and Parquet. This article explores the benefits of using the VARIANT data type in Snowflake and provides a hands-on guide to managing semi-structured data.

The Benefits of Semi-Structured Data in Snowflake

Semi-structured data formats are popular due to their flexibility when working with dynamically varying information. Unlike relational schemas, where a precise entity structure must be predefined, semi-structured data can adapt to include or omit attributes as needed, as long as they are properly nested within corresponding parent objects.

For example, consider the contact list on your phone. It contains a list of people and their contact details but does not capture those details uniformly. Some contacts may have multiple phone numbers, while others have only one. Some entries might include an email address and street address, while others have just a number and a vague description.

To handle this type of data, Snowflake uses the VARIANT data type, which allows semi-structured data to be stored as a column in a relational table. Snowflake optimizes how VARIANT data is stored internally, ensuring better compression and faster access. Semi-structured data can sit alongside relational data in the same table, and users can access it using basic extensions to standard SQL, achieving similar performance.

Another compelling reason to use the VARIANT data type is its adaptability to change.
If columns are added or removed from semi-structured data, there is no need to modify ELT (extract, load, and transform) pipelines. The VARIANT data type does not require schema changes, and read operations will not fail for an attribute that no longer exists; they simply return NULL.

Getting Hands-On with Semi-Structured Data

Let's delve into a practical example of working with semi-structured data in Snowflake. This example uses JSON data representing information about pirates, such as details about the crew, weapons, and their ship. All this information is stored in a single VARIANT data type column. In relational data, a row represents a single entity; in semi-structured data, a row can represent an entire file containing multiple entities.

Creating a Table for Semi-Structured Data

Here is a sample SQL script to create a table with semi-structured data:

```sql
CREATE TABLE pirates_data (
    id NUMBER AUTOINCREMENT PRIMARY KEY,
    load_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    data VARIANT
);
```

In this example, the `AUTOINCREMENT` keyword generates a unique ID for each record inserted, and the `VARIANT` column stores the semi-structured JSON data.

Loading Semi-Structured Data

To load semi-structured data into Snowflake, you can use the `COPY INTO` command. Here’s an example of how to load JSON data from an external stage into the `pirates_data` table. A JSON file yields a single VARIANT value per record, so the statement selects that value (`$1`) into the `data` column and leaves `id` and `load_time` to their defaults:

```sql
-- Load each JSON document into the VARIANT column only;
-- id and load_time are filled by AUTOINCREMENT and DEFAULT.
COPY INTO pirates_data (data)
FROM (SELECT $1 FROM @my_stage/pirates_data.json)
FILE_FORMAT = (TYPE = 'JSON');
```

Querying Semi-Structured Data

Once the data is loaded, you can query it using standard SQL.
For instance, to extract specific attributes from the JSON data, you can use Snowflake's path notation, placing a colon between the VARIANT column name and the first-level attribute:

```sql
SELECT
    data:id::NUMBER AS pirate_id,
    data:crew AS crew,
    data:weapons AS weapons
FROM pirates_data;
```

This query extracts the `id`, `crew`, and `weapons` fields from the JSON data stored in the `data` column.

Converting Semi-Structured Data into Relational Data

Although semi-structured data offers flexibility, converting it into a relational format can provide better performance for certain queries. Snowflake allows you to transform VARIANT data into relational columns using the `FLATTEN` function. Here's an example of how to flatten a JSON array into a relational table:

```sql
SELECT
    value:id::NUMBER AS pirate_id,
    value:name::STRING AS name,
    value:rank::STRING AS rank
FROM pirates_data,
    LATERAL FLATTEN(input => data:crew);
```

This query converts the `crew` array from the JSON data into individual rows in a relational format, making it easier to query and analyze.

Schema-on-Read vs. Schema-on-Write

One of the main advantages of using the VARIANT data type in Snowflake is the flexibility of schema-on-read. This approach allows you to ingest data without a predefined schema, and then define the schema at the time of reading the data.
This contrasts with the traditional schema-on-write approach, where the schema must be defined before data ingestion.

Benefits of Schema-on-Read

Flexibility: You can ingest data without worrying about its structure, which is particularly useful for unstructured or semi-structured data sources.
Adaptability: Schema changes do not require re-ingestion of data, as the schema is applied at read time.
Speed: Data can be loaded more quickly, as there is no need to enforce a schema during the ingestion process.

Example: Using Schema-on-Read with VARIANT Data

Here’s an example demonstrating schema-on-read with semi-structured data in Snowflake:

```sql
SELECT
    data:id::NUMBER AS pirate_id,
    data:ship.name::STRING AS ship_name,
    data:ship.type::STRING AS ship_type
FROM pirates_data;
```

In this query, the schema is defined at read time, allowing you to extract specific attributes from the nested JSON data.

Handling Nested and Repeated Data

Snowflake’s support for semi-structured data also extends to handling nested and repeated data structures. The FLATTEN function is particularly useful for working with such data, enabling you to transform nested arrays into a more manageable relational format.

Example: Flattening Nested Data

Consider a JSON structure where each pirate has a nested array of previous voyages. To flatten this nested data, you can use the following query:

```sql
SELECT
    data:id::NUMBER AS pirate_id,
    value:date::DATE AS voyage_date,
    value:destination::STRING AS voyage_destination
FROM pirates_data,
    LATERAL FLATTEN(input => data:previous_voyages);
```

This query extracts the nested `previous_voyages` array and converts it into individual rows in a relational format.

Performance Considerations

When working with semi-structured data in Snowflake, it’s important to consider performance implications.
While the VARIANT data type offers flexibility, it can also introduce overhead if not managed properly.Tips for Optimizing PerformanceUse Caching: Take advantage of Snowflake’s caching mechanisms to reduce query times for frequently accessed data.Optimize Queries: Write efficient SQL queries, avoiding unnecessary complexity and ensuring that only the required data is processed.Monitor Usage: Regularly monitor your Snowflake usage and performance metrics to identify and address potential bottlenecks.ConclusionHandling semi-structured data in Snowflake using the VARIANT data type provides immense flexibility and performance benefits. Whether you are dealing with dynamically changing schemas or integrating semi-structured data with relational data, Snowflake’s capabilities can significantly enhance your data management and analytics workflows. By leveraging the techniques outlined in this article, you can efficiently model and transform semi-structured data, unlocking new insights and value for your organization.For more detailed guidance and advanced techniques, refer to the book "Data Modeling with Snowflake," which provides comprehensive insights into modern data modeling practices and Snowflake’s powerful features.Author BioSerge Gershkovich is a seasoned data architect with decades of experience designing and maintaining enterprise-scale data warehouse platforms and reporting solutions. He is a leading subject matter expert, speaker, content creator, and Snowflake Data Superhero. Serge earned a bachelor of science degree in information systems from the State University of New York (SUNY) Stony Brook. Throughout his career, Serge has worked in model-driven development from SAP BW/HANA to dashboard design to cost-effective cloud analytics with Snowflake. He currently serves as product success lead at SqlDBM, an online database modeling tool.



Shankar Narayanan SGS

06 May 2024

10 min read

Prompt Engineering with Azure Prompt Flow

Introduction

The ability to write relevant, well-targeted prompts is an essential part of working with natural language processing systems, and it matters more as artificial intelligence evolves. Microsoft's Azure prompt flow addresses this need by helping data scientists and developers engineer prompts effectively. This article explores Azure prompt flow and the practice of prompt engineering.

Significance of Prompt Engineering

Prompt engineering is the practice of constructing prompts that guide machine learning models effectively. It involves formulating contextually relevant, specific questions or statements that elicit the desired responses from artificial intelligence models. Azure prompt flow is a tool from Microsoft Azure that simplifies this process, enabling developers to create prompts that produce meaningful and accurate outcomes.

Getting started with Azure prompt flow

Before exploring the practical applications of Azure prompt flow, it helps to understand a few of its essential components. Prompt flow connects to large language models such as GPT-3.5 to generate responses to prompts, and its integration with Azure provides a secure environment for prompt engineering. Let us consider a practical example of a chatbot application.
    from azure.ai.textanalytics import TextAnalyticsClient
    from azure.core.credentials import AzureKeyCredential

    # Set up the Azure Text Analytics client
    key = "YOUR_AZURE_TEXT_ANALYTICS_KEY"
    endpoint = "YOUR_AZURE_TEXT_ANALYTICS_ENDPOINT"
    credential = AzureKeyCredential(key)
    text_analytics_client = TextAnalyticsClient(endpoint=endpoint, credential=credential)

    # User input
    user_input = "Tell me a joke."

    # Build the prompt text in a chatbot-style format
    prompt = f"User: {user_input}\nChatbot:"

    # analyze_sentiment expects a list of documents, so wrap the prompt in a list
    response = text_analytics_client.analyze_sentiment([prompt])

    # Print the sentiment label returned for the first (and only) document
    print(f"Chatbot: {response[0].sentiment}")

In this example, the user inputs a request, the prompt is assembled in the chatbot's format, and the Text Analytics client returns a sentiment analysis result for it. Note that the value printed is a sentiment label for the prompt text, not a generated reply. The output has the following form:

    Chatbot: positive

Tuning prompts using Azure Promptflow

Crafting good prompts can be a challenging task. With the concept of variants, the user can test the behavior of the model under various conditions.

Example: Suppose the user wants to create a chatbot with Azure Promptflow that responds creatively to queries about movies.

Prompt tuning:

    User: "Tell me about your favorite movie."
    Chatbot: "Certainly! One of my favorite movies is 'Inception.' Directed by Christopher Nolan, it's a mind-bending sci-fi thriller that explores the depths of the human mind."

Python code:

    from azure.ai.textanalytics import TextAnalyticsClient
    from azure.core.credentials import AzureKeyCredential

    # Set up the Azure Text Analytics client
    key = "YOUR_AZURE_TEXT_ANALYTICS_KEY"
    endpoint = "YOUR_AZURE_TEXT_ANALYTICS_ENDPOINT"
    credential = AzureKeyCredential(key)
    text_analytics_client = TextAnalyticsClient(endpoint=endpoint, credential=credential)

    # User input
    user_input = "Tell me about your favorite movie."

    # Build a prompt that already contains a creative reply
    prompt = (
        f"User: {user_input}\n"
        "Chatbot: Certainly! One of my favorite movies is 'Inception.' "
        "Directed by Christopher Nolan, it's a mind-bending sci-fi thriller "
        "that explores the depths of the human mind."
    )

    # analyze_sentiment expects a list of documents, so wrap the prompt in a list
    response = text_analytics_client.analyze_sentiment([prompt])

    # Print the sentiment label returned for the first (and only) document
    print(f"Chatbot: {response[0].sentiment}")

In this example, the prompt is tailored to a specific user query and includes a creative, contextually relevant reply. The analyze_sentiment function from the Azure Text Analytics client is then used to assess the sentiment of that prompt text. Replace "YOUR_AZURE_TEXT_ANALYTICS_KEY" and "YOUR_AZURE_TEXT_ANALYTICS_ENDPOINT" with your actual Azure Text Analytics API key and endpoint.

Here are a few examples:

URL: https://music.apple.com/us/app/apple-music/id1108187390
Text Content: Apple Music is a comprehensive music streaming app that boasts an extensive library of songs, albums, and playlists. Users can enjoy curated playlists, radio shows, and exclusive content from their favorite artists. Apple Music allows offline downloads and offers a family plan for multiple users. It also integrates with the user's existing music library, making it seamless to access purchased and uploaded music.
OUTPUT: {"category": "App", "evidence": "Both"}

URL: https://www.youtube.com/user/premierleague
Text Content: Premier League Pass, in collaboration with the English Premier League, delivers live football matches, highlights, and exclusive behind-the-scenes content on YouTube. Football aficionados can stay updated with their favorite teams and players through this official channel. Subscribing to Premier League Pass on YouTube ensures fans never miss a moment from the most exciting football league in the world.
OUTPUT: {"category": "Channel", "evidence": "URL"}

URL: https://arxiv.org/abs/2305.06858
Text Content: This research paper explores the realm of image captioning, where advanced algorithms generate descriptive captions for images. The study delves into techniques that combine computer vision and natural language processing to achieve accurate and contextually relevant image captions. The paper discusses various models, evaluates their performance, and presents findings that contribute to the field of image captioning technology.
OUTPUT: {"category": "Academic", "evidence": "Text content"}

URL: https://exampleconstructionsite.com/
Text Content: This website is currently under construction. Please check back later for updates and exciting content.
OUTPUT: {"category": "None", "evidence": "None"}

For a given URL: {{url}}, and text content: {{text_content}}.
Classified Category: Travel
Evidence: The text contains information about popular tourist destinations, travel itineraries, and hotel recommendations.
OUTPUT:

After summarizing, here is the final Promptflow with 2 variants for the summarize_text_content node.

Benefits of using Azure ML prompt flow

Among its benefits, Azure ML prompt flow helps users move from ideation to experimentation and, ultimately, to production-ready LLM-based applications.

Prompt engineering agility

Azure prompt flow offers a visual representation of the structure of the flow. It allows users to understand and navigate their projects while offering a notebook-like coding experience for debugging and efficient flow development. Users can also create and compare more than one prompt variant, which facilitates an iterative refinement process.

Enterprise readiness

Prompt flow streamlines the entire prompt engineering process and builds on Azure's enterprise readiness solutions. It thus offers a secure, reliable, and scalable foundation for experimentation and development. It also supports team collaboration, where multiple users can work together, share knowledge, and maintain version control.
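As a rough illustration of the variant comparison described under "Prompt engineering agility", the sketch below renders two hypothetical variants of the classification template in plain Python. It does not use the prompt flow SDK; the names `TEMPLATE_VARIANTS`, `render`, and `render_variants`, and the wording of `variant_1`, are assumptions made for this example only.

```python
# Minimal sketch, plain Python only (no prompt flow SDK involved).
# TEMPLATE_VARIANTS, render and render_variants are hypothetical names.

TEMPLATE_VARIANTS = {
    # Wording taken from the template shown earlier in the article.
    "variant_0": (
        "For a given URL: {{url}}, and text content: {{text_content}}.\n"
        "OUTPUT:"
    ),
    # An alternative wording, invented here purely for comparison.
    "variant_1": (
        "Classify the page as App, Channel, Academic, or None.\n"
        "URL: {{url}}\n"
        "Text content: {{text_content}}\n"
        "OUTPUT:"
    ),
}


def render(template: str, **values: str) -> str:
    """Replace each {{name}} placeholder with the supplied value."""
    for name, value in values.items():
        template = template.replace("{{" + name + "}}", value)
    return template


def render_variants(variants: dict, **values: str) -> dict:
    """Render every variant with the same inputs so they can be compared."""
    return {name: render(text, **values) for name, text in variants.items()}


prompts = render_variants(
    TEMPLATE_VARIANTS,
    url="https://arxiv.org/abs/2305.06858",
    text_content="A research paper on image captioning.",
)

for name, prompt in prompts.items():
    print(f"--- {name} ---")
    print(prompt)
    print()
```

Rendering every variant with identical inputs keeps the comparison fair: any difference in the model's answer can then be attributed to the prompt wording rather than to the data.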
Application development

The well-defined process of Azure prompt flow facilitates the development of AI applications. By following it, users can progress through the successive stages of developing, testing, tuning, and deploying flows, ending with a fully fledged AI application. This methodical, structured approach lets them develop, fine-tune, and test rigorously, and then deploy with confidence.

Real-world applications of Azure Promptflow

Content creation

One application of Azure prompt flow is content creation. Content creators can generate outlines and creative ideas by engineering prompts tailored to specific topics, and can even generate entire paragraphs. This streamlines the content creation process and makes it more efficient.

Language translation

Developers are using Azure prompt flow to build language translation applications. By constructing prompts in the source language, they let the system translate the input into the desired language. This can help reduce language barriers in a globalized world.

Customer support chatbot

Integrating Azure prompt flow into customer support chatbots can enhance the user experience. Prompt engineering techniques help ensure that queries are accurately understood, which leads to relevant and precise responses. This reduces response time and improves customer satisfaction.

Azure prompt flow simplifies prompt engineering

Prompt engineering is an iterative and challenging process. Azure prompt flow simplifies the development, comparison, and evaluation of prompts.
This makes it easier to find the best prompt for a given use case. For example, a chatbot built on a large language model such as GPT-3.5 can help a company provide personalized product recommendations based on customer input. Azure prompt flow allows users to create, evaluate, and deploy flows built on machine learning models, which speeds up the development and deployment of artificial intelligence solutions. It also lets users create connections to large language models, such as GPT-3.5 through Azure OpenAI. These models can be used for different purposes, including chat completion or creating embeddings.

Designing and modifying prompts

Designing and modifying prompts for effective use is crucial when working with large language models. Azure prompt flow enables users to create, test, and deploy different prompt versions, for example for recommendation purposes. When dealing with multiple prompts, it is important to design and modify each one for better results. Once you have created the prompts, evaluate and test them in multiple scenarios. For instance, if you are creating prompts for a product company, you must describe how the prompts and their flow handle user queries. You can also note where custom coding and deployment of an end-to-end solution with Azure prompt flow is needed.

Conclusion

With its prompt engineering capabilities, Azure prompt flow enables developers to construct contextually relevant prompts, improving the efficiency and accuracy of AI applications across various domains. The potential of prompt engineering makes the future of AI development promising, with tools such as Azure AI helping to lead the way.
Author Bio

Shankar Narayanan (aka Shanky) has worked on numerous cloud and emerging technologies such as Azure, AWS, Google Cloud, IoT, Industry 4.0, and DevOps, to name a few. He has led the architecture design and implementation for many enterprise customers and helped them break the barrier and take the first step towards a long and successful cloud journey. He was one of the early adopters of Microsoft Azure and Snowflake Data Cloud. Shanky likes to contribute back to the community. He contributes to open source, is a frequently sought-after speaker, and has delivered numerous talks on Microsoft technologies and Snowflake. He is recognized as a Data Superhero by Snowflake and as an SAP Community Topic Leader by SAP.



Merlyn Shelley

29 Apr 2024

12 min read

Fabric’s Code-First AutoML and Hyperparameter Tuning, Google Cloud Cortex Framework, Snowflake’s Data Metric Functions, Qlik's AI Accelerator

👋 Hello,

Welcome to BI-Pro #54: Your Premier Destination for Data and Business Intelligence Insights! 🌟 In this edition, we dive deep into the cutting-edge solutions of business intelligence, data modeling, and advanced analytics. Prepare to explore an array of transformative topics and industry insights that will redefine how you interact with technology and data.

🧩 Highlights of This Issue:
Python Practice Platforms: The top 7 platforms where you can sharpen your Python skills.
Innovative Experiments: Dive into hands-on experiments with MLFlow and Microsoft Fabric to enhance your project's efficiency.
SAP Expertise: Master the complex data models of SAP and leverage them for optimal performance.
AI-Powered Business Management: Learn how to integrate AI to streamline and enhance business management functions.
Snowflake's Surveillance: Monitor your data pipelines effectively using Snowflake's Data Metric Functions.

🧬 Stay Informed with Industry Highlights:
Power BI: Learn about the significant deprecation of AutoML in Power BI using Dataflows V1.
Microsoft Fabric: Get the scoop on the new code-first AutoML and hyperparameter tuning, now available in public preview.
AWS BI: Discover how to build SAP Golden AMIs with EC2 Image Builder and Ansible and explore the transformative impact of Amazon Q on business experiences.
Google Cloud Data: Catch up with the latest updates from the Google Cloud Cortex Framework.
Tableau: Uncover how Einstein Copilot for Tableau is building the next generation of AI-driven analytics.
From the Experts at Packt Community: Gain insights from industry leaders on the fundamentals of Analytics Engineering.

🧮 What's the Latest from the BI Community?
Explore real-time AI capabilities with Datorios' new observability tool.
Learn about Snowflake's launch of Arctic, an enterprise-grade LLM.
Discover how Qlik's AI Accelerator is integrating generative AI to deliver customer outcomes.
Witness the future of AI with Avant Technologies' new supercomputing advancements.

Join us as we unpack these topics to keep you at the forefront of the data and BI world. Stay curious, stay informed!

📥 Feedback on the Weekly Edition

Take our weekly survey and get a free PDF copy of our best-selling book, "Interactive Data Visualization with Python - Second Edition."

📣 And here's the twist: we're tuning into YOUR frequency! Inspired by a reader's request, we're launching a column just for you. Got a burning question or a topic you're itching to dive into? Drop your suggestions in our content box, because your journey of discovery is our blueprint.

We appreciate your input and hope you enjoy the book!

Cheers,
Merlyn Shelley
Editor-in-Chief, Packt

Packt BI-Pro is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

🚀 GitHub's Most Sought-After Repos

🧩 pixiedust/pixiedust: PixieDust is an open-source library enhancing Jupyter notebooks, improving data work experience, particularly for cloud-hosted notebooks without configuration access.
🧩 plotly/plotly.py: plotly.py is an interactive, open-source graphing library for Python, offering over 30 chart types, including scientific, 3D, statistical, and financial charts.
🧩 AykutSarac/jsoncrack.com: JSON Crack is a free, open-source data visualization app for JSON, YAML, XML, CSV, etc., offering interactive graphs for easy data exploration and analysis.
🧩 apexcharts/apexcharts.js: ApexCharts is a JavaScript charting library with a simple API, 100+ samples, and over a dozen chart types for beautiful, responsive visualizations in apps and dashboards.
🧩 antvis/G2: G2 is a visualization library inspired by "The Grammar of Graphics," offering an introduction, examples, tutorials, and API reference for learning and using its core concepts.
🧩 visgl/deck.gl: deck.gl simplifies high-performance, WebGL2/WebGPU-based visualization of large datasets. It offers pre-built layers for easy setup or customizable architecture for tailored needs.

🔮 Revolutionizing Analytics: New BI Tools

🧬 7 Best Platforms to Practice Python: The article lists seven platforms (Practice Python, Edabit, CodeWars, Exercism, PYnative, LeetCode, and HackerRank) that offer various levels of programming challenges for learning and practicing Python, particularly for coding interviews and skill improvement.
🧬 Experimenting with MLFlow and Microsoft Fabric: The blog discusses the importance of systematic experimentation in machine learning (ML) to improve model performance, highlighting the use of MLFlow within Fabric for managing ML experiments. It covers setting up experiments, running them, logging results, and analyzing them, emphasizing the importance of tracking configurations and outcomes for iterative improvement in ML models.
🧬 Mastering SAP's data models: The article discusses challenges faced in understanding SAP data models for analytics, focusing on integrating procurement data. It explains SAP's ERP software, data architecture basics, table types (master vs. transaction), and data mapping for procurement tables.
🧬 Building an AI-Powered Business Manager: The post explores the concept of consolidating business management into a single, chat-based platform powered by Large Language Models (LLMs). It discusses the advantages for small businesses, outlines project structure, sets up the database, and updates the Tool class to handle SQLModel instances.
🧬 Monitor Data Pipelines Using Snowflake's Data Metric Functions: The author emphasizes the importance of data quality in gaining trust with stakeholders and focuses on using Google's Site Reliability Engineering principles to measure the health of data systems. The article discusses defining service level indicators and objectives for data quality dimensions and provides a technical implementation example in Snowflake.

⚡ Stay Informed with Industry Highlights

Power BI

🧮 Deprecation of AutoML in Power BI using Dataflows V1: The update announces the deprecation of Power BI Automated Machine Learning (AutoML) models for Dataflows V1 in all regions as of the third week of April. Customers are encouraged to migrate to the AutoML solution based on Synapse Data Science in Microsoft Fabric, offering a more customizable AutoML experience with advanced tools and features.

Microsoft Fabric

🧮 Introducing Code-First AutoML and Hyperparameter Tuning: Now in Public Preview for Fabric Data Science: The update introduces code-first automated machine learning (AutoML) and hyperparameter tuning in Public Preview for Fabric Data Science. Users can access both AutoML and Tune capabilities seamlessly within the Fabric 1.2 runtime, enhancing machine learning model optimization and accessibility.
🧮 Fabric Change the Game: Embracing Azure Cosmos DB for NoSQL: The post explores setting up Azure Cosmos DB for NoSQL and leveraging Vector Search capabilities of AI Search Services through Microsoft Fabric's Lakehouse features. It also discusses integrating Cosmos DB Mirror and using Python coding facilitated through Lakehouse, highlighting Fabric's integration capabilities for search or data mirroring.
🧮 Microsoft Fabric April 2024 Update: The April 2024 update brings various enhancements and previews to Microsoft Fabric, including new visuals like the 100% Stacked Area Chart, improvements to reporting, data connectivity, administration features, analytics, real-time analytics, data factory, and data pipelines. Additionally, the update includes the availability of Exam DP-600 for Fabric Analytics Engineer certification and free learning sessions.

AWS BI

🧮 Build SAP Golden AMIs with EC2 Image Builder and Ansible: This blog post guides users on building a reusable Amazon Machine Image (AMI) for deploying Amazon Elastic Compute Cloud (EC2) instances for SAP installations. It covers using Terraform and Ansible to automate the process and provides sample code.
🧮 Transforming Business Experiences: The Impact of Amazon Q and Generative BI for AWS Partners: This post highlights how advances in AI, particularly Amazon Q and generative BI, are transforming business operations. It showcases how AWS partners like ZS Associates, Tiger Analytics, and Compass UOL are leveraging these innovations for industry-specific solutions.

Google Cloud Data

🧮 What's new with Google Cloud Cortex Framework? The article discusses Google Cloud Cortex Framework, emphasizing its role in unifying enterprise data for AI-driven insights. It highlights new solutions for marketing, sustainability management, and finance, showcasing how Cortex Framework accelerates innovation, enhances decision-making, and drives business efficiency in the AI era.

Tableau

🧮 Einstein Copilot for Tableau: Building the Next Generation of AI-Driven Analytics: The post delves into the development of Einstein Copilot for Tableau, an AI-driven tool revolutionizing data analysis. It highlights the challenges and solutions in building its infrastructure, improving accuracy and efficiency, and enhancing AI and core capabilities through collaboration and continuous improvement.
✨ Expert Insights from Packt Community

Fundamentals of Analytics Engineering - By Dumky De Wilde, Fanny Kassapian, Jovan Gligorevic and 4 more

The role of dbt in analytics engineering

dbt emerged as a solution to the challenges relating to data transformation faced in data analysis. Initially crafted as an open-source Python package, dbt aimed to bring software engineering best practices to the world of analytics.

Over time, dbt matured beyond just a package, becoming a versatile cloud service. While the open-source package remains available and actively supported, dbt now offers a cloud-based version, packed with features such as an integrated development environment (IDE), scheduling tools, data lineage trackers, and hosted documentation. This is especially valuable for analysts who might not have a deep software engineering background. For more information on dbt's history, read https://www.getdbt.com/blog/what-exactly-is-dbt.

We will use dbt Cloud, which offers a free tier for a single developer: that's you! You can learn more about its pricing here: https://www.getdbt.com/pricing.

dbt seamlessly integrates into the ELT architecture. It does not store or process data but serves as a bridge between analysts and the data warehouse.

Figure: dbt's position in a data stack as an intermediary in the transformation layer.

This is how it works: analysts draft SQL queries, enhanced with dbt's unique capabilities. dbt then translates this specialized SQL into the native SQL of the data warehouse and dispatches it for execution. All the transformed data and results remain within the data warehouse, making dbt a lightweight yet powerful tool in the analytics toolkit.

Because of dbt's pivotal position in analytics engineering, we will spend more time discussing its features and zooming in on best practices. First, we will set up dbt for our use case.

Setting up dbt Cloud

The following steps are required for dbt:

Creating a dbt Cloud account.
Setting up a connection from dbt Cloud to BigQuery.
Testing the connection by querying the data using dbt Cloud.

Follow the step-by-step instructions here: https://github.com/PacktPublishing/Fundamentals-of-Analytics-Engineering/blob/main/chapter_8/guides/setting_up_dbt_cloud.md.

Now, let's focus on the various data layers in dbt.

Data layers in dbt

It is a widespread practice to separate the data we use for analytics into layers. This helps data practitioners communicate the distinct parts of the data transformation process. Broadly speaking, the process falls into three layers in dbt: Raw, Preparation, and Business. Let's take a closer look:

Raw layer: The source data is stored in the form it arrives in. Whenever you receive data, it should be stored as-is so that you have a backup in case something goes wrong during the transformations. When you copied the Excel sheets using Airbyte, they became part of the raw layer inside BigQuery.
Preparation layer: In the second layer, the raw data is cleaned, deduplicated, and transformed to conform to naming conventions and other rules. For our data, this could mean renaming fields for readability and standardizing sales figures from cents to euros.
Business layer: In the final layer, business rules are applied to the prepared data, and different data is joined and modeled into datasets that are ready for consumption by BI tools and stakeholders. In our case, we might add a business rule to disregard negative sales amounts when summing the total stroopwafels sold, as these are likely an error. The resulting data can then be served to the BI tool for dashboarding.

Discover more insights from Fundamentals of Analytics Engineering - By Dumky De Wilde, Fanny Kassapian, Jovan Gligorevic and 4 more. Unlock access to the full book and a wealth of other titles with a 7-day free trial in the Packt Library. Start exploring today!

💡 What's the Latest Scoop from the BI Community?
🧠 Datorios unleashes real-time AI with the first observability tool for streaming data: Datorios introduces the first observability tool for Apache Flink, offering deep insights into streaming data processing. It enables faster AI innovation and thorough auditability, providing developers with event visualization, event search, state monitoring, window analysis, and more. Datorios is now publicly available for free.
🧠 Snowflake Launches Arctic: The Most Open, Enterprise-Grade Large Language Model: Snowflake introduces Snowflake Arctic, an open, enterprise-grade large language model (LLM) with a Mixture-of-Experts architecture, optimized for complex enterprise workloads. Arctic sets new openness standards for AI technology, offering weights under an Apache 2.0 license and enhancing AI innovation.
🧠 Introducing Qlik's AI Accelerator - Delivering Tangible Customer Outcomes in Generative AI Integration: Qlik is at the forefront of integrating generative AI, particularly Large Language Models (LLMs), into data analysis and decision-making. They address key challenges like data privacy, technical complexity, and cost, offering seamless integration of popular LLMs and an AI Accelerator program to quickly prove the benefits of AI integration with minimal barriers to entry.
🧠 Avant Technologies Launches Advanced AI Supercomputing: Avant Technologies, an AI company, introduces a supercomputing network and licensable dataset with Wired4Tech, aiming to accelerate AI adoption. The offerings include a versatile AI dataset, dynamic resource scaling, accelerated AI processing, robust security measures, and seamless integration, designed to empower developers and drive innovation across industries.

See you next time!


Jakov Semenski

25 Apr 2024

6 min read

ChatGPT for Coding

Introduction

ChatGPT's coding style is terrible:

- Verbose
- Complex
- Outdated

Let's change that.

ChatGPT promised to be our coding savior, but sometimes it feels more like a blast from the past. Remember those early 2000s coding books? Yep, it's giving those vibes.

It's like having a sports car with a tractor engine. Great potential, but the performance? Not quite there.

Imagine harnessing the power of ChatGPT but with the finesse of a master coder. Ready for the upgrade?

Here are 12 pro prompts that will get you the right results.

Tip #1: Specificity is the king

As soon as you ask ChatGPT for a coding snippet, by default you will get the most basic HelloWorld example. The vaguer your prompt, the more mediocre your results will be.

Instead, specify exactly:

- language
- version
- framework

    Write backend code for Library app that uses Rest to communicate
    Cover endpoints for adding, removing, and filtering books by category and date published
    Use Java latest version.
    Use lambda streams instead of for loops
    Use Spring framework

Tip #2: Avoid code vomit

ChatGPT loves to write a lot of code, which I like to call "code vomit". We are no longer rewarded for the amount of code we produce, but for the clarity and principles we follow.

Give ChatGPT instructions to:

- write clean code
- use the latest principles
- cover logging and exception handling

    Write clean code
    Code needs to be covered with Logging and proper exception handling
    Use principles: Kiss & DRY, SOLID
    Keep in mind to use design patterns where it is applicable
    Using coding instructions I gave you, give me code for each class

Tip #3: Make it easy to use with IDE

Every time ChatGPT writes code, you get:

- explanations
- import statements
- comments

This can be good for a beginner, but it is not something we need in our IDE. Our IDE is already good at importing all the right packages, so let ChatGPT know:

    When writing code, avoid detailed explanations, just simple bullet points
    Don't add import statements, as IDE will do this instead

Tip #4: Write tests

Your code is not complete until you are done with tests. But not just any tests. We want unit and integration tests that:

- use a readable format (given-when-then)
- cover the happy and unhappy paths
- use the latest testing libraries such as AssertJ and BDDMockito

    For each class write a unit and integration tests
    Use given when then format
    For libraries use BDDMockito and AssertJ
    Cover happy and unhappy paths

Tip #5: Give REST call request examples

What good is an app if we have no examples to test it with? Instead of creating them manually, ask ChatGPT to create curl examples we can easily copy to Postman.

    For each request, generate curl examples

Now go ahead and use your terminal or copy/paste them to Postman.

Tip #6: Create documentation

We don't want just plain text. Instead, we need a quick start guide for developers.

    Write a quick start guide for developers using markdown.
    Imagine this app has been published to github repository
    Cover
    - Introduction
    - how to install app
    - how to run it
    - how to use it

Tip #7: Prepare deployment script for Cloud

This app cannot live just in your local environment. Instead, we need a deployment script. Depending on where you want to deploy your changes, it might be:

- a Kubernetes cluster script
- Google-specific Terraform scripts
- an AWS CloudFormation script
- an Azure-specific deployment script

Or ask ChatGPT to suggest a deployment script:

    Provide me deployment script for one of most popular cloud providers

Tip #8: Version Control

For now, our code lives only locally. Let's ask ChatGPT to give us instructions on how to set up version control.

    Provide Github version control setup instructions

Tip #9: Define CI/CD pipeline

CI/CD, or continuous integration and continuous deployment, is a must-have step for any serious development. There are plenty of options to choose from, such as:

- Jenkins
- GitHub Actions
- Bamboo

With CI we guarantee we can:

- safely merge our changes by running the build and tests
- check whether our code changes comply with Sonar policies

With CD we guarantee that we can safely deploy our changes.

    Provide github actions that for each open pull request we run the build and run all the tests
    Also automatically include sonarqube scans
    Also create github action to run deployment on every code merge

Tip #10: Performance optimization

Our backend REST service is now running, but we need to ask ourselves:

- how fast is it
- how many requests can it handle
- what is the maximum limit of requests

For that, we need to execute performance tests, e.g. using JMeter or Gatling.

    We need to test what is the limit of our app.
    Write a load test script for gatling that tests how many book searches we can execute

Tip #11: Run a security audit

How can we ensure our app is secure and not open to any threats? The best way is to run security scans.

    Our application might be open for security threats.
    Which security scan tools we can use for free and how can we use them.
    Give me step-by-step instruction on how to use it.

Tip #12: Optimize for observability

You have your app running somewhere in the cloud. But did you optimize it for observability? How can you easily troubleshoot issues? How can you trace requests between different services? Did you set up monitoring?

    We want to make sure our application is optimized for observability
    Create guideline and configuration for the cloud environment for
    Traceability - tracing request from start to finish
    Monitoring - monitoring key performance metrics
    Logging - have a centralized logging system

Conclusion

You can find the full prompt here: https://chat.openai.com/share/f0bef1ca-062d-4a22-96aa-9711615329a5

ChatGPT is a tool, and like any tool, it shines when used the right way. With these prompts, you get a coding assistant that keeps up with the latest trends, ensuring your code is not just functional but also follows modern standards.

Author Bio

Jakov Semenski is an IT Architect working at IBMiX with almost 20 years of experience. He is also a ChatGPT Speaker at the WeAreDevelopers conference and shares valuable tech stories on LinkedIn.
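The instructions from Tips #1 to #3 tend to be repeated in every session, so it can help to assemble them from a small template rather than retyping them. The following is only a minimal Python sketch of that idea: the names `CODING_RULES` and `build_prompt` are hypothetical and do not come from the article, and the rule texts are copied from the prompts above.

```python
# Hypothetical helper (not from the article): joins the prompt fragments
# from Tips #1 to #3 into one message that can be pasted into ChatGPT.
CODING_RULES = [
    "Write clean code",
    "Code needs to be covered with Logging and proper exception handling",
    "Use principles: Kiss & DRY, SOLID",
    "When writing code, avoid detailed explanations, just simple bullet points",
    "Don't add import statements, as IDE will do this instead",
]


def build_prompt(task, language, framework, rules=CODING_RULES):
    """Return one prompt: the task, the tech stack, then one rule per line."""
    lines = [task, f"Use {language}", f"Use {framework}"]
    lines.extend(rules)
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt(
        task="Write backend code for Library app that uses Rest to communicate",
        language="Java latest version",
        framework="Spring framework",
    ))
```

Keeping the rules in one list means a change, such as swapping the framework or adding a testing rule from Tip #4, is made in a single place.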



Merlyn Shelley

19 Apr 2024

14 min read

Gemini 1.0 Pro Vision in BigQuery, Python UI Library, Feature Engineering with Fabric and PySpark, Power analytics with Redshift, Amazon RDS for MySQL


Subscribe to our BI Pro newsletter for the latest insights. Don't miss out – sign up today!Get the first look at Sigma's new features and functionality at our virtual product launch on May 2nd at 12pm ET/9am PT.The virtual event will showcase talks and demos from Sigma's CEO, co-founders, and product managers about what's next in the future of analytics.Don't miss out. See how Sigma is reinventing BI.👋 Hello,Welcome to BI-Pro #52: Your Premier Destination for Data and BI Insights! 🌟 In This Edition: 🔮 Data Viz with Python Libraries Exploring causality with Python. Meet NiceGUI: Your Soon-to-be Favorite Python UI Library. Feature Engineering with Microsoft Fabric and PySpark. 10 GitHub Repositories to Master Python. 🔌 Power BI On-premises data gateway April 2024 release. Copilot in Power BI expansion. 🛠️ Microsoft Fabric Introducing Optimistic Job Admission for Fabric Spark. Introducing Job Queueing for Notebook in Microsoft Fabric. ☁️ AWS BI Meet Amazon QuickSight expert Sanjeeb Mohapatra. Handle tables without primary keys for Amazon Aurora MySQL and Amazon RDS for MySQL. Power analytics with Amazon Redshift. 🌐 Google Cloud Data Gemini 1.0 Pro Vision in BigQuery. BigQuery data canvas. Gemini in Looker AI-powered BI. Memorystore for Redis Cluster updates. Firestore launch updates. 📊Tableau Tableau vs Power BI: A Comparison of AI-Powered Analytics Tools. Salesforce-Informatica Deal Could Transform Enterprise GenAI Forever. ✨ Expert Insights from Packt Community ChatGPT for Cybersecurity Cookbook by Clint Bodungen. 💡 What's the Latest Scoop from the BI Community? Geospatial Data Analysis with Geemap. Microsoft Fabric Table Maintenance - Checkpoint and Statistics. Identifying Customer Buying Pattern in Power BI - Part 1. Full vs. Incremental Loads – Data Engineering with Fabric. Joining Queries in Azure Data Factory on Cosmos DB Sources. Feature Engineering with Microsoft Fabric and Dataflow Gen2. 
Stay ahead in the ever-evolving landscape of business intelligence with BI-Pro. Unleash the full potential of your data today! 📥 Feedback on the Weekly EditionTake our weekly survey and get a free PDF copy of our best-selling book, "Interactive Data Visualization with Python - Second Edition."📣 And here's the twist – we're tuning into YOUR frequency! Inspired by a reader's request, we're launching a column just for you. Got a burning question or a topic you're itching to dive into? Drop your suggestions in our content box – because your journey of discovery is our blueprint.We appreciate your input and hope you enjoy the book!Share your thoughts and opinions here! Cheers,Merlyn ShelleyEditor-in-Chief, PacktSign Up | Advertise | Archives🚀 GitHub's Most Sought-After Repos 🐾 altair - Vega-Altair is a Python library for statistical visualization, offering simplicity, friendliness, and consistency for creating beautiful and effective visualizations. 🐾 bokeh - Bokeh is a Python library for creating interactive plots and data applications in web browsers, offering elegant and versatile graphics. 🐾 bqplot - bqplot is a 2-D visualization system for Jupyter, based on the Grammar of Graphics, enabling interactive plots with other Jupyter widgets. 🐾 cartopy - Cartopy simplifies map drawing in Python, offering easy projection definitions, point transformations, and integration with Matplotlib for advanced mapping. 🐾 diagrams - Diagrams simplifies cloud system architecture design in Python, supporting major providers and frameworks, allowing prototyping and visualization of existing architectures. Email Forwarded? Join BI-Pro Here!🔮 Data Viz with Python Libraries 🐍 Exploring causality with Python. Difference-in-differences: The series dives into causal inference, crucial in modern analytics, explaining tools like difference-in-differences. It explores how events impact outcomes, using examples such as minimum wage effects on employment. 
The setup involves treatment and control groups to establish cause-and-effect relationships in diverse real-world scenarios. 🐍 Meet the NiceGUI: Your Soon-to-be Favorite Python UI Library. NiceGUI is a Python UI framework for web and desktop apps, offering a simple interface for small projects, dashboards, and robotics. It simplifies state management and interaction, boasting features like easy layout, visualization tools, and integration with popular libraries. 🐍 Feature Engineering with Microsoft Fabric and PySpark: The post delves into feature engineering in Microsoft Fabric, emphasizing its importance in ML development. It explores PySpark's role in handling large datasets and provides a basic overview and example of using PySpark for feature engineering. 🐍 10 GitHub Repositories to Master Python: The blog explores 10 essential GitHub repositories for mastering Python, emphasizing hands-on experience and real-world projects to enhance skills. It covers a range of topics, from beginner to advanced, including machine learning, web development, and data analysis. Asabeneh/30-Days-Of-Python trekhleb/learn-python Avik-Jain/100-Days-Of-ML-Code realpython/python-guide zhiwehu/Python-programming-exercises geekcomputers/Python practical-tutorials/project-based-learning avinashkranjan/Amazing-Python-Scripts TheAlgorithms/Python vinta/awesome-python ⚡Stay Informed with Industry Highlights Power BI 📊 On-premises data gateway April 2024 release: This update to the on-premises data gateway aligns it with the April 2024 release of Power BI Desktop, ensuring consistency in query execution. Additionally, the gateway now supports refreshes longer than one hour, allowing tokens to be refreshed mid-stream for continuous operation. 📊 Copilot in Power BI: Soon available to more users in your organization. The update introduces changes to Copilot in Power BI, including enabling Copilot by default for all tenants starting May 20th, 2024. 
It also addresses features reported by customers and community, updates abuse monitoring to not store prompts, and improves geo mapping for EU data boundary customers. Microsoft Fabric📊 Introducing Optimistic Job Admission for Fabric Spark: The post introduces Optimistic Job Admission for Spark in Microsoft Fabric, a new feature aimed at improving concurrency and job admission experience. It explains how this feature optimizes resource allocation and increases the number of concurrent jobs that can be admitted to the cluster. 📊 Introducing Job Queueing for Notebook in Microsoft Fabric: Microsoft Fabric introduces Job Queueing for Notebook Jobs to streamline data engineering and data science processes. This feature automatically queues notebook jobs when Fabric capacity is maxed out, eliminating manual retries and improving user experience. Jobs are retried when resources become available, enhancing efficiency for enterprise users. AWS BI 📊 Meet one of Amazon QuickSight’s Top Community Experts: Sanjeeb Mohapatra. The Amazon QuickSight Community, launched in 2022, is a hub for BI authors and developers to collaborate, ask and answer questions, and learn about QuickSight. Sanjeeb Mohapatra, the top Community Expert for 2023, exemplifies the community's spirit by providing over 1,700 replies and 235 solutions in one year. 📊 Handle tables without primary keys while creating Amazon Aurora MySQL or Amazon RDS for MySQL zero-ETL integrations with Amazon Redshift: AWS is advancing its zero-ETL vision with Amazon Aurora zero-ETL integration to Amazon Redshift, combining transactional data with analytics capabilities. This integration, along with four new ones announced at re:Invent 2023, empowers customers to implement near real-time analytics for various use cases. 📊 Power analytics as a service capabilities using Amazon Redshift: Analytics as a service (AaaS) leverages cloud-based analytic capabilities to enable cost-effective, scalable solutions for organizations. 
Amazon Redshift, a cloud data warehouse service, facilitates real-time insights and predictive analytics, empowering AaaS providers to embed rich data analytics capabilities. Delivery models include managed, bring-your-own-Redshift (BYOR), and hybrid options, offering flexibility to meet customer needs. Google Cloud Data 📊 How to use Gemini 1.0 Pro Vision in BigQuery? BigQuery integrates with Vertex AI to leverage Gemini 1.0 Pro, PaLM, Vision AI, Speech AI, Doc AI, Natural Language AI, enabling analysis of unstructured data like images, audio, and documents. New integrations support multimodal generative AI, enhancing capabilities for object recognition, info seeking, captioning, digital content understanding, and structured content generation, allowing structured data output for deeper analysis. 📊 Get to know BigQuery data canvas: BigQuery Data Canvas simplifies the data-to-insights journey by offering a natural language-driven experience. It centralizes data tasks, accelerates analysis, and fosters collaboration, all within a unified workspace, enabling faster and more efficient data analytics. 📊 Gemini in Looker to bring intelligent AI-powered BI to everyone: Gemini in Looker introduces Conversational Analytics, transforming how businesses engage with data. It offers a natural language-driven experience, simplifying data analytics and fostering collaboration, all within a unified workspace. 📊 Memorystore for Redis Cluster updates at Next ‘24: The article elaborates on the rapid adoption and recent enhancements of Google Cloud's Memorystore for Redis Cluster. It features customer testimonials from companies like Statsig, Character.AI, and AXON Networks, showcasing the service's performance, scalability, and cost-effectiveness. It also highlights new features such as data persistence, new node types, and ultra-fast vector search. 📊 Firestore launches at Next ‘24: Firestore is beloved by developers for its speed in app development. 
Updates include improved developer productivity, AI-enabled app building, richer queries, and enterprise-level scalability. Gemini Code Assist now supports Firestore, allowing natural language queries and data model definitions, enhancing the development experience. Firestore also supports AI applications and integrations with LangChain and LlamaIndex for generative AI.

Tableau

📊 Tableau vs Power BI: A Comparison of AI-Powered Analytics Tools. The comparison delves into the unique strengths of Tableau and Power BI, showcasing how each excels in different areas of data visualization and analytics. It outlines Tableau's robust visualizations and analytics capabilities, especially for large datasets, contrasting with Power BI's integration with Microsoft services and affordability for small to medium-sized businesses.

📊 Salesforce-Informatica Deal Could Transform Enterprise GenAI Forever: Salesforce is reportedly in advanced talks to acquire Informatica, a data-management software provider, for $11 billion. This aligns with Salesforce's strategy to expand beyond CRM, bolstered by recent AI advancements like Einstein Copilot, complementing Informatica's data integration expertise and potential synergy with Tableau and MuleSoft. It also supports Salesforce's aim to become a comprehensive data journey platform.

✨ Expert Insights from Packt Community

ChatGPT for Cybersecurity Cookbook - By Clint Bodungen

Sending API Requests and Handling Responses with Python

In this recipe, we will explore how to send requests to the OpenAI GPT API and handle the responses using Python. We'll walk through the process of constructing API requests, sending them, and processing the responses using the openai module.

Getting ready

Ensure you have Python installed on your system. Install the OpenAI Python module by running the following command in your Terminal or command prompt:

    pip install openai

How to do it…

The importance of using the API lies in its ability to communicate with and get valuable insights from ChatGPT in real time. By sending API requests and handling responses, you can harness the power of GPT to answer questions, generate content, or solve problems in a dynamic and customizable way. In the following steps, we'll demonstrate how to construct API requests, send them, and process the responses, enabling you to effectively integrate ChatGPT into your projects or applications:

1. Start by importing the required modules:

    from openai import OpenAI
    import os

2. Set up your API key by retrieving it from an environment variable, as we did in the Setting the OpenAI API key as an Environment Variable recipe, and create the client with it:

    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

3. Define a function to send a prompt to the OpenAI API and receive a response:

    def get_chat_gpt_response(prompt):
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=2048,
            temperature=0.7
        )
        return response.choices[0].message.content.strip()

4. Call the function with a prompt to send a request and receive a response:

    prompt = "Explain the difference between symmetric and asymmetric encryption."
    response_text = get_chat_gpt_response(prompt)
    print(response_text)

How it works…

First, we import the required modules. The openai module is the OpenAI API library, and the os module helps us retrieve the API key from an environment variable. We set up the API key by retrieving it from an environment variable using the os module and passing it to the client.

Next, we define a function called get_chat_gpt_response() that takes a single argument: the prompt. This function sends a request to the OpenAI API using the client.chat.completions.create() method. The call sets the following parameters:

- model: Here, we specify the model (in this case, gpt-3.5-turbo).
- messages: The conversation sent to the model, as a list of entries with a role and content. Here it holds a single user message containing the prompt.
- max_tokens: The maximum number of tokens in the generated response. A token can be as short as one character or as long as one word.
- temperature: A value that controls the randomness of the generated response. A higher temperature (for example, 1.0) will result in more random responses, while a lower temperature (for example, 0.1) will make the responses more focused and deterministic.

The method also accepts optional parameters that the code above leaves at their defaults:

- n: The number of generated responses you want to receive from the model. The default is 1, so we receive a single response.
- stop: A sequence of tokens that, if encountered by the model, will stop the generation process. This can be useful for limiting the response's length or stopping at specific points, such as the end of a sentence or paragraph.

💡 What's the Latest Scoop from the BI Community?

🧠 Geospatial Data Analysis with Geemap: This article introduces geospatial data analysis, focusing on raster data from Google Earth Engine, accessed and analyzed using the Geemap Python library. Earth Engine offers a vast catalog of geospatial datasets, and Geemap simplifies access and analysis, making it easier to work with such data in Python.

🧠 Microsoft Fabric Table Maintenance - Checkpoint and Statistics: This article discusses the maintenance requirements for warehouse tables in Microsoft Fabric, particularly focusing on tasks like updating statistics, removing fragmentation, and managing log files. While some maintenance tasks, such as data compaction and log file checkpointing, are automated, others, like managing statistics, may require manual intervention.
🧠 Identifying Customer Buying Pattern in Power BI - Part 1: This article is part 1 of a retail analytics analysis in Power BI, focusing on customer purchasing frequency for various products over the years. It includes identifying data elements, creating calculated columns, and analyzing trends to aid in business decision-making. 🧠 Full vs. Incremental Loads – Data Engineering with Fabric: This article discusses using Apache Spark in Microsoft Fabric to achieve data quality zones (bronze and silver) in a data lake. It explores loading weather data, transforming it with Spark SQL and DataFrames, and implementing full and incremental load patterns. 🧠 Joining Queries in Azure Data Factory on Cosmos DB Sources: This article provides a detailed guide on joining two queries in Azure Data Factory (ADF). It covers prerequisites, creation of data sources, defining queries for each dataset, and using the "Join" transformation in ADF to merge data. Different join types such as inner, left outer, right outer, and full outer joins are explained. 🧠 Feature Engineering with Microsoft Fabric and Dataflow Gen2: This article introduces Dataflow Gen2 as a low-code data transformation and integration engine for creating data pipelines in Microsoft Fabric. It focuses on using Dataflow Gen2 to create features needed for training a machine learning model with college basketball game data, offering different approaches from no code to all code. See you next time!


Merlyn Shelley

18 Apr 2024

5 min read

Elevate Your LLM Mastery


Subscribe to our Data Pro newsletter for the latest insights. Don't miss out – sign up today!👋 Hello,🚀 Welcome to DataPro Newsletter #84! Dive into the dynamic world of data science and AI, where breakthroughs and trends shape our future. 🔍 Highlights: Google's Genie Meta AI's Priority Sampling DeepMind's Hawk and Griffin CMU's OmniACT Qualcomm's GPTVQ Azure PyRIT Microsoft's ChunkAttention ✨ Data Community Blogs: ML Workflow with Scikit-learn Pipelines Text Embeddings AI System Design Mixture of Thought LLM Cascades GNN with Pytorch Implementation Vertex AI MLOps Platform 🏭 Industry Updates: Anthropic’s Claude 3 Sonnet in Amazon Bedrock Anthropic’s Claude 3 models in Vertex AI Microsoft’s Orca-Math Table Meets LLM OpenAI and Elon Musk 📚 New in Packt Library: "Building AI Applications with ChatGPT APIs" by Martin Yanev DataPro Newsletter is not just a publication; it’s a comprehensive toolkit for anyone serious about mastering the ever-changing landscape of data and AI. Grab your copy and start transforming your data expertise today! 📥 Feedback on the Weekly EditionTake our weekly survey and get a free PDF copy of our best-selling book, "Interactive Data Visualization with Python - Second Edition."We appreciate your input and hope you enjoy the book!Share your Feedback!Cheers,Merlyn ShelleyEditor-in-Chief, Packt Sign Up | Advertise | Archives🔰 GitHub Finds: Any of These Repos in Your Toolbox?🛠️ VAST-AI-Research/TripoSR: TripoSR, developed by Tripo AI and Stability AI, is an open-source model for fast 3D reconstruction from a single image. It outperforms others in speed and quality, generating 3D models in under 0.5 seconds on NVIDIA A100 GPUs. 🛠️ facebookresearch/ViewDiff: ViewDiff creates consistent, high-quality images of 3D objects in real-world settings from multiple angles. 🛠️ YubiaoYue/MedMamba: MedMamba, inspired by visual state space models, sets a new baseline for medical image classification, excelling across diverse datasets. 
🛠️ BAAI-Agents/Cradle: Cradle framework pioneers General Computer Control, enhancing agent capabilities for any task through reasoning and self-improvement.

📚 Expert Insights from Packt Community

Building AI Applications with ChatGPT APIs - By Martin Yanev

Setting Up the Code Bug Fixer Project

1. Open PyCharm: Double-click on the PyCharm icon on your desktop or search for it in your applications folder to open it.
2. On the PyCharm welcome screen, click on Create New Project or go to File | New Project.
3. Choose the directory where you want to save your project. You can either create a new directory or select an existing one.
4. Select the Python interpreter: Choose the version of Python you want to use for your project.
5. Configure project settings: Give your project the name CodeBugFixer, and choose a project location.
6. Once you've configured all the settings, click Create to create your new PyCharm project.

After creating a new PyCharm project, the next step is to create the necessary files and folders for the CodeBugFixer project. First, create two new Python files, called app.py and config.py, in the root directory of the project. The app.py file is where the main code for the CodeBugFixer app will be written, and the config.py file will contain any sensitive information such as API keys and passwords.

Next, create a new folder called templates in the root directory of the project. This folder will contain the HTML templates that the Flask app will render. Inside the templates folder, create a new file called index.html. This file will contain the HTML code for the home page of the CodeBugFixer app.

The project structure should look like the following:

    CodeBugFixer/
    ├── config.py
    ├── app.py
    ├── templates/
    │   └── index.html

By following these steps, you have created the necessary files and folders for your CodeBugFixer project in PyCharm. You can now start writing the code for your Flask app in the app.py file and the HTML code in the index.html file.

Once you have the correct interpreter, you can open the terminal within PyCharm by going to View | Tool Windows | Terminal. Check your terminal and ensure that you can see the (venv) indicator to confirm that you are working within your virtual environment. This is an essential step to prevent conflicting package installations between projects and guarantee that you are using the correct set of dependencies.

In the terminal window, you can install any necessary libraries as follows:

    (venv)$ pip install flask
    (venv)$ pip install openai

Finally, in order to establish the foundation for utilizing the ChatGPT API in your CodeBugFixer app, you'll need to add the following code to config.py and app.py:

config.py

    API_KEY = "<Your API Key>"

app.py

    from flask import Flask, request, render_template
    import openai
    import config

    app = Flask(__name__)

    # API Token
    openai.api_key = config.API_KEY

    @app.route("/")
    def index():
        return render_template("index.html")

    if __name__ == "__main__":
        app.run()

The config.py file holds your OpenAI API key, keeping it separate from the application code in app.py. Make sure to replace <Your API Key> with the actual API key that you obtained from OpenAI.

Message from our Partners!

👉 Octane AI Insights Analyst: Explore how Octane AI is revolutionizing ecommerce. Over 3,000 Shopify merchants have harnessed AI Quiz Funnels and Insights, generating over $500 million in revenue. It's more than growth; it's understanding and engaging customers on a new level. Join the community and see the difference.

👉 Cognism: Transform your sales strategy with Cognism. Experience a 3x boost in connect rate, gain access to verified B2B contacts, and enjoy seamless integration with your CRM tools. Expand globally with our comprehensive data coverage.
Streamline your outreach for better conversions. 👉 Freshdesk: Revolutionize your customer service with Freshworks Smart Suite's focus on analytics. Unlock actionable insights, anticipate needs, and streamline support through AI-driven dashboard. Empower your team with the tools to excel in efficiency and personalization. Start with a free trial and transform your service today! 👉 Murf AI: Enhance your projects with Murf's AI-powered voices, offering a range of realistic options for any use case. From corporate presentations to entertainment, find the perfect voice in over 20 languages. With Murf Studio, seamlessly integrate voice with your videos, music, or images, bringing your creative vision to life. Start your free trial and experience the difference. Thanks for reading Packt DataPro! Subscribe for free to receive new posts and support my work.⚡ Tech Tidbits: Stay Wired to the Latest Industry Buzz! AWS ML Made Easy 🌀 Anthropic’s Claude 3 Sonnet foundation model is now available in Amazon Bedrock: Amazon announced a collaboration with Anthropic to accelerate the development of Claude foundation models, making them accessible to AWS customers. Recently, Claude 3 was introduced, offering three models with varying levels of intelligence, speed, and cost. Claude 3 Sonnet is now available in Amazon Bedrock, providing faster speeds, increased steerability, and image-to-text vision capabilities. Mastering ML with Google 🌀 Announcing Anthropic’s Claude 3 models in Google Cloud Vertex AI: Google Cloud is enhancing customer choice and innovation in Vertex AI with the addition of Anthropic's Claude 3, a new family of state-of-the-art AI models. These models, optimized for various enterprise applications, include the highly capable Claude 3 Opus, the balanced Claude 3 Sonnet, and the fast, compact Claude 3 Haiku. Customers can soon access all three models via API in Vertex AI Model Garden, starting with private preview access to Claude 3 Sonnet. 
The Claude 3 models offer improved reasoning, content creation, language fluency, and vision capabilities, enabling customers to focus on applications while benefiting from flexible scaling, cost optimization, and Google Cloud's security and compliance. Microsoft Research Insights🌀 Orca-Math: Demonstrating the potential of SLMs with model specialization. The study on Orca and Orca 2 demonstrated how improved training methods can enhance the reasoning abilities of smaller language models, bringing them closer to larger models. Orca-Math, a 7 billion parameter model, specializes in solving math problems and outperforms larger models in this area. The research highlights the value of smaller models in specialized tasks and the potential of continual learning. The dataset and training procedure are available for further research. 🌀 Table Meets LLM: Improving LLM understanding of structured data and exploring advanced prompting methods: This paper explores how large language models (LLMs) understand structured table data. It investigates effective prompts, inherent structured data detection, leveraging existing knowledge, and trade-offs among input designs for better understanding and utilization of table-based data in LLMs. OpenAI Updates 🌀 OpenAI and Elon Musk: In a recent blog post, OpenAI shared its mission to ensure AGI benefits all of humanity, emphasizing the need for substantial resources. The post recounts disagreements with Elon Musk over funding and control, leading to his departure. OpenAI highlights its efforts to create widely available beneficial tools, such as GPT-4, and addresses ongoing legal disputes with Musk while reaffirming its commitment to its mission. Email Forwarded? Join DataPro Here!🔍 From Bits to BERT: Keeping Up with LLMs & GPTs 🧞 Google’s Genie: Generative Interactive Environments. Genie introduces a new generative AI paradigm for creating interactive, playable environments from a single image prompt. 
It can generate virtual worlds from unseen images, including real-world photos or sketches. Trained on a large dataset of Internet videos without action labels, Genie learns fine-grained controls, identifying controllable parts of an observation and inferring consistent latent actions across different environments. 🌀 Meta AI's Priority Sampling: Revolutionizing Machine Learning with Deterministic Code Generation. This research introduces Priority Sampling, a deterministic sampling technique for large language models that generates unique and confident code samples. It aims to improve code generation and optimization by providing a more structured and controllable exploration process, outperforming traditional sampling methods and enhancing model performance. 🌀 Google DeepMind Launches Hawk and Griffin: Efficient Language Models with Advanced Attention Mechanisms. This paper introduces Hawk, an RNN with gated linear recurrences, and Griffin, a hybrid model combining gated linear recurrences and local attention. Hawk outperforms Mamba on downstream tasks, while Griffin matches Llama-2's performance with significantly less training data. Both models are hardware-efficient, with Griffin showing exceptional scalability and the ability to extrapolate on long sequences. The study also details efficient distributed training for large-scale models. 🌀 CMU Unveils OmniACT: Groundbreaking AI Dataset for Measuring Program Execution Skills. OmniACT is a new dataset and benchmark designed to test if virtual agents can automate computer tasks by creating executable scripts. Initial tests show a significant gap between agent and human performance, highlighting the challenge and encouraging advancements in multimodal AI models. 🌀 Qualcomm's GPTVQ: Speeding Up Large AI Networks with Vector Quantization. GPTVQ is a new fast method for post-training vector quantization of Large Language Models (LLMs), improving size vs. accuracy trade-offs. 
It uses column-wise quantization and updates with Hessian information, efficient codebook initialization, and further compression techniques. GPTVQ sets new standards in LLM quantization efficiency and latency, even on mobile CPUs.

🌀 Azure PyRIT: Elevating ML Engineers with Python's Generative AI Risk Tool. PyRIT, a Python Risk Identification Tool for generative AI, automates AI Red Teaming tasks to assess the security of large language model (LLM) endpoints. It employs proactive methods, categorizes risks, and offers detailed metrics, enabling researchers to mitigate potential risks in LLM deployment effectively.

🌀 Microsoft Introduces ChunkAttention: Accelerating Self-Attention for LLMs! This research introduces ChunkAttention, a novel self-attention module for large language models (LLMs) that optimizes compute and memory operations by detecting shared prefixes in LLM requests. It breaks key/value tensors into chunks and uses a prefix tree to share them, speeding up the self-attention kernel by 3.2-4.8×.

✨ On the Radar: Catch Up on What's Fresh

🌀 Streamline Your Machine Learning Workflow with Scikit-learn Pipelines: This blog explores the benefits of using Scikit-learn pipelines for simplifying machine learning workflows. It covers how pipelines can streamline preprocessing, modeling, hyperparameter tuning, and workflow organization, making code more efficient and maintaining consistency in data preprocessing.

🌀 Do text embeddings perfectly encode text? The rapid advancement of generative AI has led to the widespread adoption of Retrieval Augmented Generation (RAG) systems, where AI retrieves relevant documents from a database to generate responses. This has given rise to vector databases, designed to store and search through embeddings, vector representations of documents.
The paper "Text Embeddings Reveal as Much as Text" explores the security of embedding vectors, asking whether they can be inverted back to text, which would pose challenges for privacy and information security.

🌀 End to End AI Use Case-Driven System Design: This blog explores the complexities of AI system performance beyond TOPS (Tera Operations Per Second), focusing on real AI use cases. It dives into optimizing an AI system for an infinite zoom feature, emphasizing power efficiency through model and memory optimizations, dynamic power scaling, and specialized hardware accelerators.

🌀 Navigating Cost-Complexity: Mixture of Thought LLM Cascades Illuminate a Path to Efficient Large Language Model Deployment: This post discusses how to significantly reduce costs while maintaining accuracy when using Large Language Models (LLMs), which is crucial for various applications. It introduces a novel approach called Mixture of Thought (MoT) Cascades, employing a blend of weaker and stronger LLMs, along with innovative prompting techniques and consistency measurements.

🌀 Structure and Relationships: Graph Neural Networks and a Pytorch Implementation. This article introduces Graph Neural Networks (GNNs), a powerful method for modeling spatial and graphical structures in data, such as molecular structures, social networks, and city designs. It covers the mathematical description of GNNs, including graph convolution networks (GCNs) and graph attention networks (GATs), and provides a regression example using the PyTorch library. The article aims to make GNNs more accessible by explaining their principles and demonstrating their potential applications.

🌀 Extensible and Customisable Vertex AI MLOps Platform: The article describes the development of an MLOps platform for scalable machine learning models on Vertex AI using Kubeflow pipelines.
It aims to provide a modular, flexible, and integrated solution for building operationalized ML models, serving as an educational resource and foundation for teams. The platform addresses common challenges and emphasizes testing, configuration, and CI/CD orchestration.

See you next time!

Mostafa Ibrahim

16 Apr 2024

10 min read

LLMOps in Action

Introduction

In an era dominated by the rise of artificial intelligence, the power and promise of Large Language Models (LLMs) stand distinct. These colossal architectures, designed to understand and generate human-like text, have revolutionized the realm of natural language processing. However, with great power comes great responsibility: the onus of managing, deploying, and refining these models in real-world scenarios. This article delves into the world of Large Language Model Operations (LLMOps), an emerging field that bridges the gap between the potential of LLMs and their practical application.

Background

The last decade has seen a significant evolution in language models, with models growing in size and capability. Starting with smaller models like Word2Vec and LSTM, we've advanced to behemoths like GPT-3, BERT, and T5. However, as these models grew in size and complexity, so did their operational challenges. Deploying, maintaining, and updating these models requires substantial computational resources, expertise, and effective management strategies.

MLOps vs LLMOps

If you've ventured into the realm of machine learning, you've undoubtedly come across the term MLOps. MLOps, or Machine Learning Operations, encapsulates best practices and methodologies for deploying and maintaining machine learning models throughout their lifecycle. It caters to the wide spectrum of models that fall under the machine learning umbrella.

On the other hand, with the growth of vast and intricate language models, a more specialized operational domain has emerged: LLMOps.
While both MLOps and LLMOps share foundational principles, the latter specifically zeroes in on the challenges and nuances of deploying and managing large-scale language models. Given the colossal size, data-intensive nature, and unique architecture of these models, LLMOps brings to the fore bespoke strategies and solutions that are fine-tuned to ensure the efficiency, efficacy, and sustainability of such linguistic powerhouses in real-world scenarios.

Core Concepts of LLMOps

Large Language Model Operations (LLMOps) focuses on the management, deployment, and optimization of large language models (LLMs). One of its foundational concepts is model deployment, which emphasizes scalability to handle varied loads, low latency for real-time responses, and version control. As these LLMs demand significant computational resources, efficient resource management becomes pivotal. This includes the use of optimized hardware like GPUs and TPUs, effective memory optimization strategies, and techniques to manage computational costs.

Continuous learning and updating, another core concept, revolves around fine-tuning models with new data, avoiding the pitfall of 'catastrophic forgetting', and effectively managing data streams for updates. In parallel, LLMOps emphasizes continuous monitoring for performance, bias, and fairness, along with iterative feedback loops for model improvement. Because LLMs are so large, model compression techniques like pruning, quantization, and knowledge distillation become crucial.

How LLMOps Works

Pre-training Model Development

Large Language Models typically start their journey through a process known as pre-training. This involves training the model on vast amounts of text data. The objective during this phase is to capture a broad understanding of language, learning from billions of sentences and paragraphs.
This foundational knowledge helps the model grasp grammar, vocabulary, factual information, and even some level of reasoning. This massive-scale training is what makes these models "large" and gives them a broad understanding of language.

Optimization & Compression

Models trained to this extent are often so large that they become impractical for daily tasks. To make these models more manageable without compromising much on performance, techniques like model pruning, quantization, and knowledge distillation are employed.

Model Pruning: After training, pruning is typically the first optimization step. This begins with trimming individual model weights and may advance to more intensive methods like neuron or channel pruning.

Quantization: Following pruning, the model's weights, and potentially its activations, are converted to lower-precision representations. Though weight quantization is generally a post-training process, for deeper reductions, such as very low-bit quantization, one might adopt quantization-aware training from the beginning.

Additional recommendations are:

Optimizing the model specifically for the intended hardware can elevate its performance.
Before initiating training, selecting inherently efficient architectures with fewer parameters is beneficial.
Approaches that adopt parameter sharing or tensor factorization prove advantageous.
For those planning to train a new model or fine-tune an existing one with an emphasis on sparsity, starting with sparse training is a prudent approach.

Deployment Infrastructure

After training and compressing our LLM, we use technologies like Docker and Kubernetes to deploy models scalably and consistently. This approach allows us to scale flexibly, using as many pods as needed. To conclude the deployment process, we implement edge deployment strategies. This positions our models nearer to the end devices, which is crucial for applications that demand real-time responses.

Continuous Monitoring & Feedback

The process starts with the Active model in production.
As it interacts with users and as language evolves, it can become less accurate, leading to the phase where the Model becomes stale as time passes.

To address this, feedback and interactions from users are captured, forming a vast range of new data. Using this data, adjustments are made, resulting in a New fine-tuned model.

As user interactions continue and the language landscape shifts, the current model is replaced with the new model. This iterative cycle of deployment, feedback, refinement, and replacement keeps the model relevant and effective.

Importance and Benefits of LLMOps

Much like the operational paradigms of AIOps and MLOps, LLMOps brings a wealth of benefits to the table when managing Large Language Models.

Maintenance

LLMs are computationally intensive, and LLMOps streamlines their deployment, ensuring they run smoothly and responsively in real-time applications. This involves optimizing infrastructure, managing resources effectively, and ensuring that models can handle a wide variety of queries without hiccups.

Consider the significant investment of effort, time, and resources required to maintain a Large Language Model like ChatGPT, especially given its vast user base.

Continuous Improvement

LLMOps emphasizes continuous learning, allowing LLMs to be updated with fresh data. This ensures that models remain relevant, accurate, and effective, adapting to the evolving nature of language and user needs.

Building on the foundation of GPT-3, the newer GPT-4 model brings enhanced capabilities. Furthermore, while ChatGPT was previously trained on data up to 2021, it has now been updated to encompass information through 2022.

It's important to recognize that constructing and sustaining large language models is an intricate endeavor, necessitating meticulous attention and planning.

Conclusion

The ascent of Large Language Models marks a transformative phase in the evolution of machine learning.
But it's not just about building them; it's about harnessing their power efficiently, ethically, and sustainably. LLMOps emerges as the linchpin, ensuring that these models not only serve their purpose but also evolve with the ever-changing dynamics of language and user needs. As we continue to innovate, the principles of LLMOps will undoubtedly play a pivotal role in shaping the future of language models and their place in our digital world.

Author Bio

Mostafa Ibrahim is a dedicated software engineer based in London, where he works in the dynamic field of Fintech. His professional journey is driven by a passion for cutting-edge technologies, particularly in the realms of machine learning and bioinformatics. When he's not immersed in coding or data analysis, Mostafa loves to travel.
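As a supplement to the Optimization & Compression step described in this article, the following is a minimal sketch of magnitude pruning followed by symmetric int8 quantization on a plain Python list of weights. The function names (`magnitude_prune`, `quantize_int8`, `dequantize`), the sample weights, and the 50% sparsity level are all assumptions chosen for illustration; real LLM workflows would typically rely on a framework's own pruning and quantization tooling rather than code like this.

```python
from typing import List, Tuple


def magnitude_prune(weights: List[float], sparsity: float) -> List[float]:
    """Zero out the smallest-magnitude weights (hypothetical helper)."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric post-training quantization to the range [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # A single scale factor maps the largest magnitude onto 127.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in quantized]


# Illustrative weights only; not taken from any real model.
weights = [0.02, -1.27, 0.5, -0.01, 0.9, 0.003]
pruned = magnitude_prune(weights, sparsity=0.5)
q, scale = quantize_int8(pruned)
restored = dequantize(q, scale)

print("pruned:   ", pruned)
print("quantized:", q)
print("restored: ", restored)
```

Pruning comes first here so that the quantization scale is computed only from the weights that survive, which mirrors the ordering the article describes. Knowledge distillation is not covered by this sketch.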

Louis Owen

12 Apr 2024

12 min read

AI for Investment

Introduction

One of the most important activities for an investor is to keep up to date with the latest relevant news. Usually, this means reading at least a dozen news articles covering macroeconomic issues, political issues, news related to the sector of the corresponding stock, analyst reports, and more. This takes a lot of time and can be overwhelming for new investors, since there is too much information to process.

Many ML developers have tried to solve this issue by building a traditional ML workflow usually called a sentiment analyzer. This system takes text from the news as the input and returns a sentiment score as the output. This is no doubt helpful for the investor, but it doesn't solve the bigger problem: the need to curate relevant articles and to know the impact of each news item on their investment decision. In other words, it lacks broader insight.

What if there's an AI assistant that can act as our personal investment news analyst? What if there's an AI assistant that is able to analyze dozens of news articles and generate an insights summary along with an investment recommendation? And what if I told you that this AI assistant is personalized toward your risk appetite and investment portfolio allocation?

In this article, I'll guide you on how to build an AI assistant that can do all the above-mentioned things with only a few lines of code, thanks to GPT-4! We'll discuss several ways to get the news data in bulk and in real-time. We'll discuss which search keywords we need to use to get relevant news data. We'll also discuss how to construct the prompt to fulfill all of the above-mentioned criteria while also getting a great generated output.
Finally, we'll see how to put all of this together to build our AI assistant!

Without wasting any more time, take a deep breath, make yourself comfortable, and get ready to learn how to build your personal AI investment news analyst!

News Data Sources

Getting as much news data as possible is important since we don't want to miss any important information out there. Once we get all the information, we just need to filter it with the help of our AI assistant.

SerpAPI is one of the best all-in-one scraping tools that we can utilize to get news data from Google, Yahoo, Bing, DuckDuckGo, and many other search engines. It also provides a free plan with a 100 searches/month limit. However, this limit is surely not enough for our use case. If you don't mind spending some money and want to get multiple search results from different search engines, then this tool is suitable for you.

Another, more budget-friendly solution is to use the DuckDuckGo search engine API directly. DuckDuckGo is a search engine that offers data privacy as its main unique selling point. No search history is stored. Moreover, its search engine API is open for free. We will use DuckDuckGo in this article and learn how to utilize it via Python!

The more effective way to widen our search results is actually not by using different search engines but by having a diverse yet mutually exclusive set of search keywords. The goal of our AI investment assistant is to summarize the important insights that are relevant to a particular stock that we're interested in. Hence, we need to provide relevant news data to be able to achieve our goal.

The following are some of the search keywords that we can use. Note that this list is not exhaustive; you can expand the search keywords based on your own needs. We'll use AAPL as the ticker example.
You can change it to any ticker you want.

$AAPL stock
$AAPL industry and competitors
$AAPL business model and strategy
$AAPL management and leadership

Besides ticker-specific search keywords, we can also search for more general information that is not ticker-specific. The following is an example list of such keywords.

economic growth this year
monetary and fiscal policies today
politics today
economy today
inflation rate today
interest rate today
real estate today

DuckDuckGo API

Once we have the keywords list, we can easily get the news data using DuckDuckGo via Python. First, we need to install the duckduckgo-search package by running the following command.

    pip install duckduckgo-search

Once it is installed, we can create a general Python function that takes the search keyword as the input and returns the search results.

    from duckduckgo_search import DDGS

    ddgs = DDGS()

    def web_search(query: str, num_results: int = 4, debug: bool = True) -> list:
        """Useful for general internet search queries."""
        if debug:
            print("Searching with query {0}...".format(query))
        search_results = []
        # Always return a list so callers can safely extend() with the result
        if not query:
            return search_results
        results = ddgs.text(query)
        if not results:
            return search_results
        total_added = 0
        for j in results:
            # Keep only the text snippet of each search hit
            search_results.append(j.get('body', ''))
            total_added += 1
            if total_added >= num_results:
                break
        return search_results

Using this function is very simple. We just need to pass the search keyword along with the number of search results to this function and get the list of search results.

    apple_competitors_news = web_search("$AAPL industry and competitors", num_results=10)

Prompt Engineering

The next important thing to do is to build our AI assistant. Here, we'll utilize GPT-4 to build our assistant. Since it's an LLM, we just need to provide the prompt without the need to train it from scratch. However, creating the prompt itself is indeed not an easy task.
I have published another article regarding prompt engineering if you're interested in learning more about it.

Remember that the goal of our assistant is to analyze the provided news data dump and return the summary insights along with the recommendation as the output. However, to be able to give a recommendation, our assistant needs to know our risk appetite along with our portfolio condition. The following is an example of the system prompt that we can give to GPT-4.

    system_prompt = """You are an expert in giving recommendation to BUY / SELL / HOLD for {} ({}).
    You can only return in JSON format with 4 fields:
    "Investment Thesis" (dictionary of string. Consist of elaborated decision reasoning (in bullet points) based on the risk profile of the investor, unrealized profit, and all of the factors as the basis of your recommendation. Provide numbers to justify your assertions, a lot ideally. The deeper the analysis the better.),
    "Investor Profiling" (dictionary of string. Connect the investment thesis with each of the investor profiles, including risk profile and unrealized profit.),
    "Summary Thesis" (string. Summary of your all investment thesis as the basis of the given recommendation. You have to take into account all factors in the investment thesis as well as the investor profiles.),
    "recommendation" ("BUY"/"SELL"/"HOLD")

    In the investment thesis, please cover the following factors. If a particular factor needed to write the investment thesis does not exist, don't try to make up the answer, just write "The information needed is unavailable".
    (1) Industry and Competitive Analysis: Assess the company's position within its industry and analyze industry trends, competition, barriers to entry, and market dynamics.
    (2) News and Events: Stay updated on relevant news, earnings announcements, product launches, regulatory changes, and other events that can impact the company or the overall market.
    (3) Market and Economic Conditions: Assess broader macroeconomic factors from news, including economic growth, interest rates, inflation, monetary and fiscal policies, geopolitical events, gold price, bond price, index price, real estate."""

And here's an example of the user prompt that contains all necessary data points. The risk profile can be "Moderate", "Aggressive", or "Conservative".

    user_prompt = """<INVESTOR PROFILE>
    Risk Profile: {}
    Unrealized Profit: {}%

    {}"""

Putting It All Together

Now, we just need to create the main function that will act as our personal AI investment assistant.

    import json
    import os

    import requests

    def get_gpt_response(model: str, temperature: float, messages: list) -> dict:
        """Call the OpenAI chat completions endpoint and return the parsed JSON."""
        headers = {
            'content-type': "application/json",
            'Authorization': "Bearer " + os.environ["OPENAI_API_KEY"]
        }
        endpoint = 'https://api.openai.com/v1/chat/completions'
        payload = json.dumps({
            "model": model,
            "messages": messages,
            "temperature": temperature,
        })
        try:
            response = requests.post(endpoint, data=payload, headers=headers, timeout=60)
            return json.loads(response.text)
        except Exception as e:
            print(e)
            # Return an empty dict so the caller can detect the failure
            return {}

    def personal_investment_assistant(company_name: str, ticker: str,
                                      risk_profile: str,
                                      unrealized_profit_perc: float) -> str:
        # Search keywords taken from the two lists in the News Data Sources section
        search_kwrds_lst = [
            "${} stock".format(ticker),
            "${} industry and competitors".format(ticker),
            "${} business model and strategy".format(ticker),
            "${} management and leadership".format(ticker),
            "economic growth this year",
            "monetary and fiscal policies today",
            "inflation rate today",
            "interest rate today",
        ]
        news_data = []
        for search_keyword in search_kwrds_lst:
            news_data.extend(web_search(search_keyword))
        news_data = "\n".join(news_data)
        messages = [
            {"role": "system",
             "content": system_prompt.format(company_name, ticker)},
            {"role": "user",
             "content": user_prompt.format(risk_profile, unrealized_profit_perc, news_data)},
        ]
        response = get_gpt_response("gpt-4", temperature=0.0, messages=messages)
        # Guard against a failed request or an error payload without "choices"
        if not response or "choices" not in response:
            return ""
        return response["choices"][0]["message"]["content"].strip()

Conclusion

Congratulations on making it to this point! Throughout this article, you have learned how to build your own personal AI investment analyst based on news data. You have learned how to get the news data, which search keywords are useful, and how to implement the AI assistant in code.
I wish you all the best on your investment journey, and see you in the next article!

Author Bio

Louis Owen is a data scientist/AI engineer from Indonesia who is always hungry for new knowledge. Throughout his career journey, he has worked in various fields of industry, including NGOs, e-commerce, conversational AI, OTA, Smart City, and FinTech. Outside of work, he loves to spend his time helping data science enthusiasts to become data scientists, either through his articles or through mentoring sessions. He also loves to spend his spare time on his hobbies: watching movies and working on side projects. Currently, Louis is an NLP Research Engineer at Yellow.ai, the world’s leading CX automation platform. Check out Louis’ website to learn more about him! Lastly, if you have any queries or any topics to be discussed, please reach out to Louis via LinkedIn.
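Because the system prompt in this article asks GPT-4 to answer in JSON, it can help to check the returned string before acting on it. The following is a minimal sketch of such a check. The helper name `parse_assistant_output`, the constants, and the sample payload are hypothetical and only mirror the field names listed in the system prompt; they are not part of the original implementation.

```python
import json

# Field names copied from the system prompt in the article.
REQUIRED_FIELDS = (
    "Investment Thesis",
    "Investor Profiling",
    "Summary Thesis",
    "recommendation",
)
VALID_RECOMMENDATIONS = {"BUY", "SELL", "HOLD"}


def parse_assistant_output(raw: str) -> dict:
    """Parse the assistant's JSON string and validate its fields (hypothetical helper)."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("Assistant output is not valid JSON") from exc
    if not isinstance(parsed, dict):
        raise ValueError("Assistant output must be a JSON object")
    missing = [field for field in REQUIRED_FIELDS if field not in parsed]
    if missing:
        raise ValueError("Missing fields: " + ", ".join(missing))
    # Normalize case and whitespace before checking the allowed values.
    recommendation = str(parsed["recommendation"]).strip().upper()
    if recommendation not in VALID_RECOMMENDATIONS:
        raise ValueError("Unexpected recommendation: " + recommendation)
    parsed["recommendation"] = recommendation
    return parsed


# Made-up sample standing in for a real model response.
sample_output = json.dumps({
    "Investment Thesis": {"News and Events": "The information needed is unavailable"},
    "Investor Profiling": {"Risk Profile": "Moderate"},
    "Summary Thesis": "Illustrative placeholder summary.",
    "recommendation": " hold ",
})
result = parse_assistant_output(sample_output)
print(result["recommendation"])
```

In practice, the string returned by `personal_investment_assistant` would be passed to this helper, and a `ValueError` would signal that the response should be retried or inspected manually.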

Merlyn Shelley

08 Apr 2024

12 min read

Apple’s ReALM, Google DeepMind’s Gecko, X.ai's Grok 1.5, Salesforce AI’s Moirai, Stability AI’s Stable Audio 2.0, TWIN-GPT, ChatGPT Instant usage

👋 Hello,

Welcome to DataPro#88, your portal to the innovations in Data Science & Machine Learning! 🚀 In this edition, you'll find:

⚙️ LLMs & GPTs Unleashed
TWIN-GPT: Digital Twins for Clinical Trials.
Apple’s ReALM: AI with contextual understanding.
Stability AI’s Stable Audio 2.0: Audio synthesis revolution.
Salesforce AI’s Moirai: Universal time series forecasting.
Google DeepMind’s Gecko: Versatile Text Embeddings.
X.ai's Grok 1.5: Enhanced reasoning and context.

✨ What's Fresh & Exciting
Distribute LLMs with llamafile: 5 Simple Steps.
Dockerized Python Environment: The Elegant Way.
Knowledge Distillation: Clone Powerful LLMs.
Sora’s Diffusion Transformer (DiT): A Deep Dive.
Generative AI: Copyright Reckoning.
OpenAI Agent: Function Calling Capabilities.

⚡ Industry Pulse
AWS & Mistral AI: Democratizing generative AI.
Amazon SageMaker: No-code to code-first ML.
Google Cloud Next: Database success stories.
Google’s SEEDS in Weather Forecasting: AI quantifies uncertainty.
Microsoft’s LLMs in the Imaginarium: Tool Learning.
OpenAI: Fine-tuning API and custom models.
ChatGPT: Instant usage.
Synthetic Voices: Challenges and Opportunities.

📚 Packt's Latest Gem
MATLAB for Machine Learning - Second Edition, By Giuseppe Ciaburro.

DataPro Newsletter is not just a publication; it’s a comprehensive toolkit for anyone serious about mastering the ever-changing landscape of data and AI. Grab your copy and start transforming your data expertise today!
📥 Feedback on the Weekly Edition

Take our weekly survey and get a free PDF copy of our best-selling book, "Interactive Data Visualization with Python - Second Edition." We appreciate your input and hope you enjoy the book!

Cheers,
Merlyn Shelley
Editor-in-Chief, Packt

🔰 GitHub Finds: Any of These Repos in Your Toolbox?

🛠️ UpstageAI/dataverse: Dataverse simplifies ETL pipelines in Python, providing a user-friendly solution for data processing and management, accessible to all.

🛠️ GAP-LAB-CUHK-SZ/gaustudio: GauStudio is a modular framework for 3D Gaussian Splatting, providing streamlined pipelines and tools for easier implementation and deployment.

🛠️ TencentARC/BrushNet: BrushNet is a text-guided image inpainting model that enhances pre-trained diffusion models, focusing on divided features and dense control.

🛠️ agiresearch/AIOS: AIOS embeds LLMs into OS, enhancing resource allocation, context switch, concurrent execution, tool service, access control, and toolkit availability for developers.

🛠️ jasonppy/VoiceCraft: VoiceCraft excels in speech editing and zero-shot text-to-speech, requiring only a few seconds of reference to clone or edit voices.

📚 Expert Insights from Packt Community

MATLAB for Machine Learning - Second Edition, By Giuseppe Ciaburro.

Anomaly Detection in MATLAB

Throughout the life cycle of a physical system, the occurrence of failures or malfunctions poses a potential threat to its normal functioning. To safeguard against critical interruptions, it becomes imperative to implement an anomaly detection system within the facility. Also termed a fault diagnosis system, this mechanism is designed to identify potential malfunctions within the monitored system. The pursuit of fault detection stands as a pivotal and defining phase in maintenance interventions, demanding a systematic and deterministic approach to comprehensively analyze all conceivable causes that might have led to the malfunction.
Anomaly detection overview

Anomaly detection is a technique used in data analysis and ML to identify data points or patterns that deviate significantly from the expected or normal behavior within a dataset. Anomalies, also known as outliers, are data points that do not conform to most of the data and may indicate errors, fraud, unusual events, or other important information. Anomaly detection has various applications across different domains, such as cybersecurity, industrial quality control (QC), finance, healthcare, and more.

To understand what the term covers, we start with an overview of the main types of anomalies:

Point anomalies: These are individual data points that are considered anomalies, such as a single fraudulent transaction in a credit card dataset.
Contextual anomalies: These are anomalies that are context-dependent. A data point might not be an anomaly on its own but is unusual in a particular context or time, such as a sudden spike in web traffic during a holiday sale.
Collective anomalies: These are anomalies that are identified by examining a group of data points collectively. These anomalies involve patterns or relationships between data points.

There are several methods for addressing anomaly detection problems, ranging from simple statistical techniques to complex ML algorithms. The choice of method depends on the nature of the data and the specific problem you are trying to solve. Here, we list the most commonly used ones:

Statistical methods: Statistical techniques such as z-scores, percentiles, and boxplots can be used to identify anomalies based on deviations from the mean or median of the data distribution.
ML: Supervised, unsupervised, and semi-supervised ML algorithms can be used for anomaly detection. Some popular methods include Isolation Forest, One-Class Support Vector Machine (One-Class SVM), autoencoders (AEs), and k-means clustering.
Time series analysis: Specialized techniques are used for detecting anomalies in time series data, such as autoregressive (AR) models, exponential smoothing, and moving averages (MAs).
Density estimation: Methods such as kernel density estimation (KDE) and Gaussian Mixture Models (GMMs) are used to estimate the probability density function of the data and identify anomalies as low-density regions.
Deep learning (DL): Neural networks (NNs), especially deep AEs (DAEs) and recurrent NNs (RNNs), are used for anomaly detection in high-dimensional data or sequences.
Ensemble methods: Combining multiple anomaly detection models can improve overall performance and robustness.

In addressing anomaly detection problems, we have to face some challenges. For example, determining an appropriate threshold for defining anomalies can be challenging. Imbalanced datasets, where anomalies are rare, can make model training and evaluation tricky. Handling high-dimensional data and noisy datasets can also be challenging.

Anomaly detection is a valuable tool for identifying rare but potentially important events or patterns in large datasets. The choice of method depends on the specific domain, data characteristics, and the nature of anomalies that need to be detected.

Discover more insights from "MATLAB for Machine Learning - Second Edition" by Giuseppe Ciaburro.

⚡ Tech Tidbits: Stay Wired to the Latest Industry Buzz!

AWS ML Made Easy

🌀 AWS and Mistral AI commit to democratizing generative AI with a strengthened collaboration: The article discusses the growing use of generative AI applications across industries, facilitated by Amazon Bedrock. It highlights Mistral AI's Mistral Large model, now available on Amazon Bedrock, offering advanced language capabilities.
This collaboration aims to provide customers with diverse model options to suit their specific business needs, promoting innovation in AI technology.

🌀 Seamlessly transition between no-code and code-first machine learning with Amazon SageMaker Canvas and Amazon SageMaker Studio: This post discusses Amazon SageMaker Studio, an integrated ML development environment, and SageMaker Canvas, a no-code ML tool, highlighting their features and integration for seamless collaboration between non-ML and ML experts.

Google Research

🌀 Get inspired: Database success stories at Google Cloud Next. This blog post previews Google Cloud Next '24, focusing on customers using Google Cloud databases for transformative purposes. It highlights sessions featuring Nuro, Lightricks, Bayer, Yahoo!, and Statsig, showcasing their innovative use cases.

🌀 Generative AI to quantify uncertainty in weather forecasting: Google is advancing weather forecasting with innovations like MetNet-3 and SEEDS, a generative AI model. SEEDS efficiently generates probabilistic ensembles, addressing the butterfly effect's uncertainty, and offers cost-effective solutions for extreme weather events.

Microsoft Research

🌀 LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error. This research enhances large language models' (LLMs) tool usage accuracy through simulated trial and error (STE), inspired by biological systems. STE improves learning by simulating tool use scenarios, interacting with tools, and leveraging short and long-term memory. Results show significant performance boosts over existing methods.

OpenAI Updates

🌀 Introducing improvements to the fine-tuning API and expanding our custom models program: This update discusses techniques to improve model performance, such as retrieval-augmented generation (RAG) and fine-tuning, and introduces new API features that give developers more control over their fine-tuning jobs, enhancing model quality while reducing costs and latency.
🌀 Start using ChatGPT instantly: This new initiative aims to make AI more accessible by allowing instant access to ChatGPT without the need to sign up. It targets those curious about AI's potential but hesitant to set up an account, offering a seamless experience for learning, creative inspiration, and answering questions. 🌀 Navigating the Challenges and Opportunities of Synthetic Voices: Voice Engine is a model by OpenAI that generates natural-sounding speech from text input and a short audio sample, closely resembling the original speaker. They're sharing insights from a small-scale preview, highlighting its potential for various applications like reading assistance and personalized responses in education. Email Forwarded? Join DataPro Here!🔍 From Bits to BERT: Keeping Up with LLMs & GPTs 🌀 TWIN-GPT: Digital Twins for Clinical Trials via LLM. The research explores virtual clinical trials' benefits in healthcare, emphasizing patient safety and cost reduction. Existing methods struggle with prediction accuracy due to limited data. TWIN-GPT, a proposed approach, uses large language models to create personalized digital twins, improving predictions and showcasing digital twins' potential in healthcare. 🌀 Apple’s ReALM: AI that can “See” to understand the context: ReALM (Reference Resolution As Language Modeling) addresses the challenge of context understanding, including non-conversational entities like on-screen elements. By leveraging Language Models (LLMs), it demonstrates significant improvements in reference resolution, even outperforming GPT-4, offering over 5% gains for on-screen references. 🌀 Stability AI’s Stable Audio 2.0: Stable Audio 2.0 introduces a groundbreaking AI-generated audio standard, offering high-quality, full tracks up to three minutes long at 44.1kHz stereo. It features audio-to-audio generation, honoring creator rights, and expands creative possibilities, available for free on the Stable Audio website. 
🌀 Salesforce AI’s Moirai: Moirai is a universal time series forecasting model designed to address diverse forecasting tasks across various domains, frequencies, and variables in a zero-shot manner. It tackles key challenges in forecasting and offers robust performance, making it valuable for IT operations, sales forecasting, and more. 🌀 Google DeepMind’s Gecko: Versatile Text Embeddings Distilled from LLMs. Gecko is a compact text embedding model that achieves strong retrieval performance by distilling knowledge from large language models (LLMs). Its two-step distillation process, generating synthetic paired data and refining data quality, outperforms larger models on the Massive Text Embedding Benchmark. Gecko with 256 dimensions outperforms all entries with 768 dimensions; Gecko with 768 dimensions competes with models that are 7x larger and with embeddings of 5x higher dimensionality. 🌀 X.ai Unveils Grok 1.5: Enhanced Reasoning and Long Context Features. Grok-1.5, the latest version of x.ai's Grok model, offers improved reasoning and long context capabilities. It excels in coding and math tasks, scoring 50.6% on MATH and 90% on GSM8K benchmarks. Grok-1.5 can process long contexts up to 128K tokens and boasts robust infrastructure for large-scale training. Early testers and existing Grok users on the x.ai platform will soon have access to Grok-1.5, with further features expected to roll out gradually. ✨ On the Radar: Catch Up on What's Fresh 🌀 Distribute and Run LLMs with llamafile in 5 Simple Steps: This blog introduces llamafile, a framework that simplifies using large language models (LLMs) by providing a one-file executable that runs locally without installation. It explains how to use llamafile with the LLaVA model, a 7-billion-parameter model quantized to 4 bits, for tasks like chat, image uploading, and question-answering. 🌀 Setting A Dockerized Python Environment — The Elegant Way. 
This blog post demonstrates a more elegant method for setting up a dockerized Python development environment using VScode and the Dev Containers extension. It provides step-by-step instructions and prerequisites, including Docker Desktop, a Docker Hub account, and VScode with the Dev Containers extension installed. The tutorial focuses on using the official Python image (`python:3.10`) and explains the Dev Containers extension's role in creating an isolated VScode session inside a docker container. 🌀 Clone the Abilities of Powerful LLMs into Small Local Models Using Knowledge Distillation: This post explores the use of specialized, smaller-scale language models for specific NLP tasks, such as grammatical error correction. It discusses the process of constructing tailored models through data annotation and fine-tuning, and the use of knowledge distillation to automate labeling. The post provides a workflow for distilling knowledge from a large language model to a smaller one, using prompts and APIs, and demonstrates this process in the context of building a grammatical error correction model. 🌀 Deep Dive into Sora’s Diffusion Transformer (DiT) by Hand: This blog introduces Sora, OpenAI's text-to-video model, explaining its unique approach combining diffusion transformer and transformer strength for video prediction. It explores key concepts like diffusion, dimension reduction, and noise addition, offering insights into how Sora converts text prompts into realistic videos. Ideal for AI enthusiasts and those interested in video generation technologies. 🌀 The Coming Copyright Reckoning for Generative AI: This blog explores the complexities of copyright law in America, particularly in the context of generative AI. It discusses key concepts like original works, fair use, and the implications of generative AI on copyright. It also delves into legal cases and future considerations, offering insights for data scientists and AI enthusiasts. 
🌀 Create an Agent with OpenAI Function Calling Capabilities: This article explores the advancements and challenges in developing AI-powered applications in 2024. It discusses how AI streamlines app features for a better user experience and introduces OpenAI's Function Calling to simplify structured data extraction. The article also highlights the ongoing innovations and the future of AI applications. See you next time!
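As a small illustration of the moving-average approach to time series anomaly detection mentioned in the "MATLAB for Machine Learning" excerpt in this issue, here is a minimal Python sketch. It is not taken from the book: the function name, the window size, the threshold of three standard deviations, and the sample readings are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def moving_average_anomalies(values, window=5, threshold=3.0):
    """Return the indices of points that deviate from the trailing
    moving average by more than `threshold` standard deviations.

    `window` and `threshold` are illustrative defaults, not recommendations.
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]  # the `window` points before index i
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: treat any change at all as an anomaly.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical sensor readings with one obvious spike at index 6.
readings = [10.0, 10.2, 9.9, 10.1, 10.0, 10.1, 25.0, 10.0, 9.8, 10.2]
print(moving_average_anomalies(readings))
```

The threshold is the hard part the excerpt warns about: a lower value flags more points, including ordinary noise, while a higher value can miss real anomalies. Note also that a spike inflates the standard deviation of the windows that follow it, so nearby anomalies may be masked.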


Merlyn Shelley

04 Apr 2024

11 min read

BI-Pro#49: Microsoft Fabric Lifecycle Management, Data Factory Adds CI/CD to Fabric Data Pipelines, Database Mirroring, AWS Well-Architected Data Analytics Lens


Subscribe to our BI Pro newsletter for the latest insights. Don't miss out – sign up today! 👋 Hello, Welcome to BI-Pro #49, your ultimate guide to data and BI insights! 🚀 ⏩ What's Inside? Python Simplified: Master data validation with Pydantic. Visualize Like a Pro: 30+ tools for stunning data visuals. R for Bioinformatics: Custom visuals for bio data. Interactive Data: JavaScript meets Handsontable. Seaborn Stories: Craft data tales with line plots. MetaGPT Insights: Next-gen data solutions unveiled. 🏭 Industry Scoop: Power BI’s Latest: March's must-know features. Fabric Innovations: Updates and new tools from Microsoft Fabric. AWS Well-Architected Data Analytics Lens: Analytics strategies for the real world. Google Cloud Savings: Cut costs on ETL workflows. Tableau Journeys: From student to BI analyst. 💎 Expert Takes: Deep Dive into Python Deep Learning: The latest from Packt. 👉 Community Buzz: Twitch Chat Analysis, Graph Networks, LLM Data Quality, and Ethical AI: Key conversations this week! Dive into the trends shaping data and BI today! 📥 Feedback on the Weekly Edition: Take our weekly survey and get a free PDF copy of our best-selling book, "Interactive Data Visualization with Python - Second Edition." 📣 And here's the twist – we're tuning into YOUR frequency! Inspired by a reader's request, we're launching a column just for you. Got a burning question or a topic you're itching to dive into? Drop your suggestions in our content box – because your journey of discovery is our blueprint. We appreciate your input and hope you enjoy the book! Share your thoughts and opinions here! Cheers, Merlyn Shelley, Editor-in-Chief, Packt 🚀 GitHub's Most Sought-After Repos 🌀 man-group/ArcticDB: ArcticDB is a high-performance DataFrame database designed for Python Data Science, with a Python-centric API for Pandas DataFrames. 
🌀 gradio-app/gradio: Gradio is an open-source Python package for building demos or web apps for ML models or Python functions, with easy sharing via built-in features. 🌀 Sinaptik-AI/pandas-ai: PandasAI is a Python library using generative AI to explore, clean, and analyze data with natural language queries.🌀 OpenRefine/OpenRefine: OpenRefine is a powerful Java-based tool for loading, understanding, cleaning, reconciling, and augmenting data, accessible from a web browser. 🌀 Kanaries/pygwalker: PyGWalker simplifies Jupyter Notebook workflows by converting pandas dataframes into interactive user interfaces for data analysis and visualization. 🌀 cleanlab/cleanlab: cleanlab aids in data and label cleaning by identifying issues in ML datasets automatically, enabling better model training with real-world data.Email Forwarded? Join BI-Pro Here!🔮 Data Viz with Python Libraries 🌀 Pydantic Tutorial: Data Validation in Python Made Simple. This blog tutorial explains how to use Pydantic, a data validation and serialization library in Python, to validate and serialize data classes, offering support for custom validators and Python's type hints for field validation. 🌀 30+ Data Visualization Libraries, Frameworks and Apps, Mastering Data Presentation: Explore over 30 data visualization tools like Metabase, Gephi, and Grafana, offering a range of features to transform raw data into meaningful visualizations for better decision-making in industries like tech, healthcare, finance, and marketing. 🌀 Mastering Data Visualization in R for Bioinformatics: The article delves into data visualization in R for bioinformatics, stressing its role in understanding complex biological data, communicating findings, hypothesis generation, and decision-making. It also discusses Anscombe's Quartet, highlighting the importance of visualizing data before analysis and the limitations of summary statistics. 
🌀 Integrating JavaScript charting libraries with Handsontable: The article guides developers on integrating Highcharts, Recharts, and Chart.js with Handsontable for data visualization. It explains the features of each library and provides demos for creating a stock portfolio with interactive charts. 🌀 Data Visualization with Seaborn Line Plot: The article introduces Seaborn, a Python library for data visualization, built on top of Matplotlib. It covers installation and demonstrates creating single line plots and customizing styles for better presentation of data. 🌀 MetaGPT’s Data Interpreter: SOTA Open Source LLM-based Data Solutions. MetaGPT introduces its Data Interpreter, a new agent for streamlined data interpretation and analysis. The Data Interpreter employs advanced techniques for real-time data adaptability, tool integration, and logical inconsistency identification, showcasing superior performance in machine learning tasks. ⚡Stay Informed with Industry HighlightsPower BI 🌀 Power BI March 2024 Feature Summary: The Power BI update introduces visual calculation editing, data model editing in the Power BI Service, and report subscription delivery to OneDrive SharePoint. A new Microsoft Fabric certification exam, DP-600, is also available, with free certification opportunities through the Fabric AI Skills Challenge. 🌀 Announcing the Public Preview of Database Mirroring in Microsoft Fabric: Mirroring, now in Public Preview, allows seamless integration of databases into Microsoft Fabric's OneLake, providing real-time insights without ETL. It simplifies data replication and warehousing, enabling easy data access and analysis across different sources, including data lakes and warehouses. 🌀 Get data with Power Query available in Power BI Report Builder (Preview): Power BI Report Builder now allows connecting to 100+ data sources like Snowflake, Databricks, and AWS Redshift. You can transform data using M-Query for paginated reports. 
Install the latest version and connect from the "Data" tab. Microsoft Fabric🌀 Microsoft Fabric March 2024 Update: This update brings new features like OneLake File Explorer, Autotune Query Tuning, and Test Framework for Power Query SDK in VS Code to Power BI, enhancing reporting, modeling, service, mobile, and developer experiences. 🌀 Data Factory Adds CI/CD to Fabric Data Pipelines: Fabric engineers with Azure Synapse Analytics and Azure Data Factory experience can now utilize Git integration and built-in Deployment Pipelines in Data Factory data pipelines in Fabric. This public preview offers source control, CI/CD features, and collaborative development environments, enhancing data analytics projects. 🌀 Microsoft Fabric Lifecycle Management – Getting started with Git Integration and Deployment Pipelines: Microsoft Fabric makes Lifecycle Management easy, enabling continuous releases through Git and Deployment Pipelines. Git allows reliable updates for supported items like Lakehouse, Notebooks, and Reports, while Deployment Pipelines clone content between stages like DEV, TEST, UAT, and PROD. AWS BI 🌀 Announcing the AWS Well-Architected Data Analytics Lens: The Data Analytics Lens helps assess and improve analytics platforms on AWS. It offers best practices, such as building ACID-compliant data lakes and leveraging Serverless for data pipelines, aligned with the AWS Well-Architected Framework's pillars for secure, efficient, and cost-effective solutions. 🌀 Improve healthcare services through patient 360: A zero-ETL approach to enable near real-time data analytics. 
The post discusses how healthcare providers can improve patient care by leveraging AWS services for real-time analytics and personalized healthcare, focusing on a zero-ETL approach to data integration. Google Cloud Data 🌀 Enrich streaming data in Bigtable with Dataflow: The post discusses the importance of event stream processing in data engineering and introduces Apache Beam's Enrichment transform, which simplifies the process of enriching streaming data with Bigtable, improving data context and enabling more meaningful analysis. 🌀 Dataflow at-least-once vs. exactly-once streaming modes: The post compares exactly-once and at-least-once processing modes in Dataflow Streaming Engine for streaming jobs. It explains the trade-offs between the two modes and provides guidance on choosing the right mode based on use case requirements. Tableau 🌀 Data is both art and science - My Tableau Story: Andy Cotgreave. The post highlights Andy Cotgreave's journey from a data analyst at Oxford to becoming a Senior Technical Evangelist at Tableau. It emphasizes the importance of community engagement, innovation, building a portfolio, and having fun in data visualization. 🌀 Student to BI Analyst, How Tableau Can Lead to a Successful Data Career: This blog discusses Karolina Grodzinska's data visualization journey, from discovering Tableau to winning Iron Viz: Student Edition and becoming a Business Intelligence Analyst at Schneider Electric. Karolina emphasizes the importance of an active Tableau Public profile in career development and shares tips for building a strong portfolio and networking with the Tableau Community. ✨ Expert Insights from Packt Community: Python Deep Learning - Third Edition - By Ivan Vasilev. Developing NN models for edge devices with TF Lite: TF Lite is a TF-derived set of tools that allows us to run models on mobile, embedded, and edge devices. 
Its versatility is part of TF’s appeal for industrial applications (as opposed to research applications, where PyTorch dominates).The key paradigm of TF Lite is that the models run on-device, contrary to client-server architecture, where the model is deployed on remote, more powerful, hardware. This organization has the following implications (both good and bad): Low-latency execution: The lack of server-round trip significantly reduces the model inference time and allows us to run real-time applications. Privacy: The user data never leaves the device. Internet connectivity: Internet connectivity is not required. Small model size: The devices have limited computational ability, hence the need for small and computationally efficient models. More specifically, TF Lite models are stored in the FlatBuffers (https://flatbuffers.dev/) special efficient portable format, identified by the .tflite file extension. Besides its small size, it allows us to access data directly without parsing/unpacking it first. TF Lite models support a subset of the TF Core operations and allow us to define custom ones: Low power consumption: The devices often run on battery. Divergent training and inference: NN training is a lot more computationally intensive compared to inference. Because of this, the model training runs on a different, more powerful, piece of hardware than the actual devices, where the models will run inference. In addition, TF Lite has the following key features: Multi-platform and multi-language support, including Android (Java), iOS (Objective-C and Swift) devices, web (JavaScript), and Python for all other environments. Google provides a TF Lite wrapper API called MediaPipe Solutions (https://developers.google.com/mediapipe, https://github.com/google/mediapipe/), which supersedes the previous TF Lite API. Optimized for performance. It has end-to-end solution pipelines. TF Lite is oriented toward practical applications, rather than research. 
Because of this, it includes different pipelines for common ML tasks such as image classification, object detection, text classification, and question answering among others. The computer vision pipelines use modified versions of EfficientNet or MobileNet, and the natural language processing pipelines use BERT-based models. So, how does TF Lite model development work? First, we’ll select a model in one of the following ways: An existing pre-trained .tflite model (https://tfhub.dev/s?deployment-format=lite). Use MediaPipe Model Maker (https://developers.google.com/mediapipe/solutions/model_maker) to apply feature engineering transfer learning on an existing .tflite model with a custom training dataset. Model Maker only works with Python. Convert a full-fledged TF model into .tflite format. Discover more insights from 'Python Deep Learning - Third Edition' by Ivan Vasilev. Unlock access to the full book and a wealth of other titles with a 7-day free trial in the Packt Library. Start exploring today! Read Here💡 What's the Latest Scoop from the BI Community? 🌀 Real-Time Twitch Chat Sentiment Analysis with Apache Flink: This blog explores building a real-time sentiment analysis application for Twitch chat using Apache Flink. It covers setting up the project, reading Twitch chat messages, performing sentiment analysis, and concludes with a demo. 🌀 Entity Type Prediction with Relational Graph Convolutional Network (PyTorch): This post discusses a Python setup for predicting entity types on heterogeneous graphs using the Relational Graph Convolutional Network (R-GCN) and the RGCNConv module from PyTorch. It explains knowledge graphs, entity type prediction, and the R-GCN model. 🌀 Data Quality Error Detection powered by LLMs: This article explores automating the identification of data errors in tabular datasets using Large Language Models (LLMs). It discusses the Data Dirtiness Score, challenges in data cleaning, and the potential of LLMs in detecting data quality issues. 
🌀 Building Ethical AI Starts with the Data Team — Here’s Why: This article discusses the ethical considerations of AI, focusing on model bias, AI usage, and data responsibility. It emphasizes the role of data teams in ensuring ethical AI and suggests steps for data teams to take towards a more ethical future. See you next time!


Merlyn Shelley

02 Apr 2024

10 min read

Databricks' DBRX, Stability AI's Stable Code Instruct 3B, SambaNova's Samba CoE v0.2, FrugalGPT, Advanced RAG Patterns on Amazon SageMaker


Subscribe to our Data Pro newsletter for the latest insights. Don't miss out – sign up today!👋 Hello,Welcome to DataPro#87 – Your Gateway to the Cutting-Edge of Data Science & Machine Learning! 🚀 Dive into this edition to explore: ⚙️ LLMs & GPTs Unleashed Samba CoE v0.2: SambaNova's Speedy AI Models Efficient Training of Language Models with OpenAI AI21's Revolutionary SSM-Transformer Model: Jamba Databricks' DBRX: The New Open LLM Benchmark Stable Code Instruct 3B: Stability AI's Latest Offering HyperLLaVA: Boosting Multimodal Language Models ✨ What's Fresh & Exciting FrugalGPT: Cutting LLM Operating Costs Building a Reliable AI Agent from Scratch with OpenAI Tool Calling Fine-Tuning Instruct Models over Raw Text Data Crafting an OpenAI-Compatible API ⚡ Industry Pulse: Deciphering Advanced RAG Patterns on Amazon SageMaker Unveil the Future with AutoBNN: Mastering Probabilistic Time Series Forecasting! Engaging with Microsoft Copilot (web): Learning from Interaction 📚 Packt's Latest Gem "Principles of Data Science - Third Edition" by Sinan Ozdemir DataPro Newsletter is not just a publication; it’s a comprehensive toolkit for anyone serious about mastering the ever-changing landscape of data and AI. Grab your copy and start transforming your data expertise today! 
📥 Feedback on the Weekly Edition: Take our weekly survey and get a free PDF copy of our best-selling book, "Interactive Data Visualization with Python - Second Edition." We appreciate your input and hope you enjoy the book! Share your Feedback! Cheers, Merlyn Shelley, Editor-in-Chief, Packt 🔰 GitHub Finds: Any of These Repos in Your Toolbox? 🛠️ Zejun-Yang/AniPortrait: AniPortrait is a new framework for creating high-quality animations using audio input and a reference portrait image, with face reenactment capabilities. 🛠️ agiresearch/AIOS: AIOS embeds large language models into operating systems, enabling smarter resource allocation, context switching, and concurrent agent execution, advancing AGI. 🛠️ lichao-sun/Mora: Mora is a multi-agent framework for video generation, enhancing OpenAI's Sora capabilities through collaborative visual agents for diverse tasks. 🛠️ jasonppy/VoiceCraft: VoiceCraft is a high-performing neural codec language model for speech editing and zero-shot text-to-speech, excelling with diverse real-world data. 🛠️ dvlab-research/MiniGemini: Mini-Gemini enhances LLMs (Large Language Models) from 2B to 34B, integrating image understanding, reasoning, and generation, inspired by LLaVA. 🛠️ Picsart-AI-Research/StreamingT2V: StreamingT2V is a technique for creating long videos with rich motion dynamics, ensuring temporal consistency and high image quality. 📚 Expert Insights from Packt Community: "Principles of Data Science - Third Edition" by Sinan Ozdemir. The Five Steps of Data Science: A question I’ve gotten at least once a month for the past decade is: What’s the difference between data science and data analytics? One could argue that there is no difference between the two; others will argue that there are hundreds of differences! 
I believe that, regardless of how many differences there are between the two terms, the following applies: Data science follows a structured, step-by-step process that, when followed, preserves the integrity of the results and leads to a deeper understanding of the data and the environment the data comes from. As with any other scientific endeavor, this process must be adhered to, or else the analysis and the results are in danger of scrutiny. On a simpler level, following a strict process can make it much easier for any data scientist, hobbyist, or professional to obtain results faster than if they were exploring data with no clear vision. While these steps are a guiding lesson for amateur analysts, they also provide the foundation for all data scientists, even those in the highest levels of business and academia. Every data scientist recognizes the value of these steps and follows them in some way or another. Overview of the five steps The process of data science involves a series of steps that are essential for effectively extracting insights and knowledge from data. These steps are presented as follows: Asking an interesting question: The first step in any data science project is to identify a question or challenge that you want to address with your analysis. This involves finding a topic that is relevant, important, and that can be addressed with data. Obtaining the data: Once you have identified your question, the next step is to collect the data that you will need to answer it. This can involve sourcing data from a variety of sources, such as databases, online platforms, or through data scraping or data collection methods. Exploring the data: After you have collected your data, the next step is to explore it and get a better understanding of its characteristics and patterns. This might involve examining summary statistics, visualizing the data, or applying statistical or machine learning (ML) techniques to identify trends or relationships. 
Modeling the data: Once you have explored your data, the next step is to build models that can be used to make predictions or inform decision-making. This might involve applying ML algorithms, building statistical models, or using other techniques to find patterns in the data. Communicating and visualizing the results: Finally, it’s important to communicate your findings to others in a clear and effective way. This might involve creating reports, presentations, or visualizations that help to explain your results and their implications. By following these five essential steps, you can effectively use data science to solve real-world problems and extract valuable insights from data. It’s important to note that different data scientists may have different approaches to the data science process, and the steps outlined previously are just one way of organizing the process. Some data scientists might group the steps differently or include additional steps such as feature engineering or model evaluation. Despite these differences, most data scientists agree that the steps listed previously are essential to the data science process. Whether they are organized in this specific way or not, these steps are all crucial for effectively using data to solve problems and extract valuable insights. Let’s dive into these steps one by one.Discover more insights from "Principles of Data Science - Third Edition" by Sinan Ozdemir. Unlock access to the full book and a wealth of other titles with a 7-day free trial in the Packt Library. Start exploring today! Read Here!⚡ Tech Tidbits: Stay Wired to the Latest Industry Buzz! AWS ML Made Easy 🌀 Advanced RAG patterns on Amazon SageMaker: This post discusses how customers across various industries are utilizing large language models (LLMs) like Mixtral-8x7B Instruct to build generative AI applications such as QnA chatbots and search engines. 
It highlights the challenges and solutions in improving the accuracy and performance of these applications, focusing on Retrieval Augmented Generation (RAG) patterns implemented with LangChain.Google Research 🌀 AutoBNN: Probabilistic time series forecasting with compositional bayesian neural networks. This research introduces AutoBNN, an open-source package for automated, interpretable time series forecasting using Bayesian neural networks (BNNs). It addresses limitations of traditional methods like Gaussian processes (GPs) and Structural Time Series by combining the interpretability of GPs with the scalability and flexibility of neural networks. AutoBNN automates model discovery, provides high-quality uncertainty estimates, and scales effectively for large datasets. Microsoft Research🌀 Learning from interaction with Microsoft Copilot (web): This research focuses on how AI systems like Bing and Microsoft Copilot learn and improve from user interactions, particularly through reinforcement learning from human feedback (RLHF). It also explores how Bing has evolved its search capabilities and how Copilot is changing user interactions to be more conversational and workflow oriented. The research introduces frameworks like TnT-LLM and SPUR to improve taxonomy generation and user satisfaction estimation in AI interactions. Email Forwarded? Join DataPro Here!🔍 From Bits to BERT: Keeping Up with LLMs & GPTs 🌀 Samba CoE v0.2 from SambaNova delivers accurate AI models at blazing speeds: This blog post highlights Samba's advancements in AI architecture, specifically focusing on the introduction of Samba-1, a CoE architecture for enterprise AI. It discusses the features and benefits of Samba-1, its performance benchmarks, and plans for future releases, emphasizing the role of RDUs in driving efficiency and speed in AI models. 
🌀 OpenAI’s Efficient Training of Language Models to Fill in the Middle: OpenAI demonstrates that autoregressive language models can effectively learn to infill text by moving a span of text from the middle of a document to its end, without harming generative capability. They propose training models with this method by default and provide benchmarks and best practices. 🌀 Jamba: AI21's Groundbreaking SSM-Transformer Model. Jamba is a groundbreaking model that merges Mamba SSM with Transformer elements, offering a 256K context window and outperforming similar models. Released under Apache 2.0, it will be available in the NVIDIA API catalog. Jamba optimizes memory, throughput, and performance, delivering remarkable efficiency. 🌀 Databricks’ DBRX: A New State-of-the-Art Open LLM. Databricks introduces DBRX, an open LLM setting new benchmarks in language understanding, programming, and math. With a 256K context window, it outperforms GPT-3.5 and competes with Gemini 1.0 Pro. DBRX is 40% smaller than Grok-1, offering 2x faster inference than LLaMA2-70B. 🌀 Introducing Stable Code Instruct 3B — Stability AI: Stable Code Instruct 3B, built on Stable Code 3B, offers state-of-the-art performance in code completion and natural language interactions for programming tasks. It outperforms Codellama 7B Instruct and matches StarChat 15B, with a focus on popular languages like Python and Java. Available for commercial use with a Stability AI Membership, the model is accessible on Hugging Face. 🌀 HyperLLaVA: Enhancing Multimodal Language Models with Dynamic Visual and Language Experts. This blog explores the advancements in Multimodal Large Language Models (MLLMs) and introduces HyperLLaVA, a dynamic model that improves performance by adaptively tuning parameters for handling diverse multimodal tasks, surpassing existing benchmarks and opening new avenues for multimodal learning systems. 
✨ On the Radar: Catch Up on What's Fresh🌀 FrugalGPT and Reducing LLM Operating Costs: The blog discusses the high cost of running Large Language Models (LLMs) and introduces the "FrugalGPT" framework, which reduces operating costs significantly while maintaining quality. It explains how different models cost different amounts and proposes using a cascade of LLMs to minimize costs while maximizing answer quality. 🌀 Leverage OpenAI Tool calling: Building a reliable AI Agent from Scratch. The blog discusses the future role of AI in everyday tasks, focusing on text creation, correction, and brainstorming. It highlights the importance of Retrieval-Augmented Generation (RAG) pipelines and aims to provide Large Language Models with better context to generate more valuable content. 🌀 Fine-tune an Instruct model over raw text data: The blog explores the challenges of integrating modern chatbots with large datasets, focusing on context window sizes and the use of Retrieval-Augmented Generation (RAG) techniques. It proposes a lighter approach to fine-tuning chatbots on smaller datasets, aiming to bridge the gap between the constraints of a 128K context window and the complexities of models fine-tuned on billions of tokens. The experiment involves fine-tuning a model on The Guardian's dataset and aims to provide reproducible instructions for cost-effective model training using accessible hardware. 🌀 How to build an OpenAI-compatible API: The blog discusses the dominance of OpenAI in the Gen AI market, and the reasons developers might choose alternative LLM providers. It explores implementing a Python FastAPI server compatible with the OpenAI API specs to wrap any LLM, aiming for flexibility and cost-effectiveness. See you next time!
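The cascade idea described in the FrugalGPT item above can be sketched in a few lines of Python. This is only an illustration of the general pattern (try cheaper models first, escalate when the answer looks unreliable), not the FrugalGPT implementation. The two model functions are hypothetical stubs, the cost units are made up, and the confidence score is simply returned by each stub, whereas a real system would need some separate way to judge answer quality.

```python
from typing import Callable, List, Tuple

# (name, cost in arbitrary units, callable returning (answer, confidence)).
# All three fields are hypothetical stand-ins for real model endpoints.
Model = Tuple[str, int, Callable[[str], Tuple[str, float]]]


def small_model(prompt: str) -> Tuple[str, float]:
    """Stub for a cheap model: confident only on short prompts."""
    if len(prompt) < 20:
        return "short answer", 0.95
    return "unsure", 0.30


def large_model(prompt: str) -> Tuple[str, float]:
    """Stub for an expensive model: always confident."""
    return "detailed answer", 0.99


def cascade(prompt: str, models: List[Model], min_confidence: float = 0.8):
    """Query models from cheapest to most expensive and stop at the first
    answer whose confidence reaches `min_confidence`.

    Returns (model name, answer, total cost spent). If no model is
    confident enough, the last model's answer is returned.
    """
    total_cost = 0
    name, answer = "", ""
    for name, cost, call in models:
        answer, confidence = call(prompt)
        total_cost += cost  # every attempted call is paid for
        if confidence >= min_confidence:
            break
    return name, answer, total_cost


models = [("small", 1, small_model), ("large", 30, large_model)]
print(cascade("2 + 2?", models))
print(cascade("Explain exactly-once stream processing.", models))
```

The design trade-off is visible in the return value: an easy prompt costs 1 unit, while a hard one costs 31 because the failed cheap attempt is still paid for. A cascade only saves money when a large enough share of prompts stop at the cheaper models.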


Understanding Data Structures in Swift

How we are Thinking About Generative AI

Microsoft AI’s Skeleton Key, AutoML with AutoGluon, MultiOn AI's Retrieve API, Narrative BI’s Hybrid AI, Python's Duck Typing, Gibbs Diffusion

Top 100+ Essential Data Science Tools & Repos: Streamline Your Workflow Today!

Mastering Semi-Structured Data in Snowflake

Prompt Engineering with Azure Prompt Flow

Fabric’s Code-First AutoML and Hyperparameter Tuning, Google Cloud Cortex Framework, Snowflake’s Data Metric Functions, Qlik's AI Accelerator

ChatGPT for Coding

Gemini 1.0 Pro Vision in BigQuery, Python UI Library, Feature Engineering with Fabric and PySpark, Power analytics with Redshift, Amazon RDS for MySQL

Elevate Your LLM Mastery

Trending Topics

LLMOps in Action

AI for Investment

Apple’s ReALM, Google DeepMind’s Gecko, X.ai's Grok 1.5, Salesforce AI’s Moira, Stability AI’s Stable Audio 2.0, TWIN-GPT, ChatGPT Instant usage

BI-Pro#49: Microsoft Fabric Lifecycle Management, Data Factory Adds CI/CD to Fabric Data Pipelines, Database Mirroring, AWS Well-Architected Data Analytics Lens

Databricks' DBRX, Stability AI's Stable Code Instruct 3B, SambaNova's Samba CoE v0.2, FrugalGPT, Advanced RAG Patterns on Amazon SageMaker
