
How-To Tutorials - Data

1210 Articles

Implementing the k-nearest neighbors algorithm in Python

Aaron Lazar
17 Nov 2017
7 min read
[box type="note" align="" class="" width=""]The following is an excerpt from Dávid Natingga's Data Science Algorithms in a Week. [/box] The nearest neighbor algorithm classifies a data instance based on its neighbors. The class of a data instance determined by the k-nearest neighbor algorithm is the class with the highest representation among the k-closest neighbors. In this short tutorial, we will cover the basics of the k-NN algorithm - understanding it and its implementation with a simple example: Mary and her temperature preferences. So let’s get right to it, shall we? Mary and her temperature preferences As an example, if we know that our friend Mary feels cold when it is 10 degrees Celsius, but warm when it is 25 degrees Celsius, then in a room where it is 22 degrees Celsius, the nearest neighbor algorithm would guess that our friend would feel warm, because 22 is closer to 25 than to 10. Suppose we would like to know when Mary feels warm and when she feels cold, as in the previous example, but in addition, wind speed data is also available when Mary was asked if she felt warm or cold: We could represent the data in a graph, as follows: Now, suppose we would like to find out how Mary feels at the temperature 16 degrees Celsius with a wind speed of 3km/h using the 1-NN algorithm: For simplicity, we will use a Manhattan metric to measure the distance between the neighbors on the grid. The Manhattan distance dMan of a neighbor N1=(x1,y1) from the neighbor N2=(x2,y2) is defined to be dMan=|x1- x2|+|y1- y2|. Let us label the grid with distances around the neighbors to see which neighbor with a known class is closest to the point we would like to classify: We can see that the closest neighbor with a known class is the one with the temperature 15 (blue) degrees Celsius and the wind speed 5km/h. Its distance from the questioned point is three units. Its class is blue (cold). The closest red (warm) neighbor is away four units from the questioned point. Since we are using the 1-nearest neighbor algorithm, we just look at the closest neighbor and, therefore, the class of the questioned point should be blue (cold). By applying this procedure to every data point, we can complete the graph as follows: Note that sometimes a data point can be distanced from two known classes with the same distance: for example, 20 degrees Celsius and 6km/h. In such situations, we would prefer one class over the other or ignore these boundary cases. The actual result depends on the specific implementation of an algorithm. Implementation of k-nearest neighbors algorithm in Python We’ll implement the k-NN algorithm in Python to find Mary's temperature preference: # source_code/1/mary_and_temperature_preferences/knn_to_data.py # Applies the knn algorithm to the input data. # The input text file is assumed to be of the format with one line per # every data entry consisting of the temperature in degrees Celsius, # wind speed and then the classification cold/warm. import sys sys.path.append('..') sys.path.append('../../common') import knn # noqa import common # noqa # Program start # E.g. "mary_and_temperature_preferences.data" input_file = sys.argv[1] # E.g. 
"mary_and_temperature_preferences_completed.data" output_file = sys.argv[2] k = int(sys.argv[3]) x_from = int(sys.argv[4]) x_to = int(sys.argv[5]) y_from = int(sys.argv[6]) y_to = int(sys.argv[7]) data = common.load_3row_data_to_dic(input_file) new_data = knn.knn_to_2d_data(data, x_from, x_to, y_from, y_to, k) common.save_3row_data_from_dic(output_file, new_data) # source_code/common/common.py # ***Library with common routines and functions*** def dic_inc(dic, key): if key is None: Pass if dic.get(key, None) is None: dic[key] = 1 Else: dic[key] = dic[key] + 1 # source_code/1/knn.py # ***Library implementing knn algorihtm*** def info_reset(info): info['nbhd_count'] = 0 info['class_count'] = {} # Find the class of a neighbor with the coordinates x,y. # If the class is known count that neighbor. def info_add(info, data, x, y): group = data.get((x, y), None) common.dic_inc(info['class_count'], group) info['nbhd_count'] += int(group is not None) # Apply knn algorithm to the 2d data using the k-nearest neighbors with # the Manhattan distance. # The dictionary data comes in the form with keys being 2d coordinates # and the values being the class. # x,y are integer coordinates for the 2d data with the range # [x_from,x_to] x [y_from,y_to]. def knn_to_2d_data(data, x_from, x_to, y_from, y_to, k): new_data = {} info = {} # Go through every point in an integer coordinate system. for y in range(y_from, y_to + 1): for x in range(x_from, x_to + 1): info_reset(info) # Count the number of neighbors for each class group for # every distance dist starting at 0 until at least k # neighbors with known classes are found. for dist in range(0, x_to - x_from + y_to - y_from): # Count all neighbors that are distanced dist from # the point [x,y]. if dist == 0: info_add(info, data, x, y) Else: for i in range(0, dist + 1): info_add(info, data, x - i, y + dist - i) info_add(info, data, x + dist - i, y - i) for i in range(1, dist): info_add(info, data, x + i, y + dist - i) info_add(info, data, x - dist + i, y - i) # There could be more than k-closest neighbors if the # distance of more of them is the same from the point # [x,y]. But immediately when we have at least k of # them, we break from the loop. if info['nbhd_count'] >= k: Break class_max_count = None # Choose the class with the highest count of the neighbors # from among the k-closest neighbors. for group, count in info['class_count'].items(): if group is not None and (class_max_count is None or count > info['class_count'][class_max_count]): class_max_count = group new_data[x, y] = class_max_count return new_data Input: The program above will use the file below as the source of the input data. The file contains the table with the known data about Mary's temperature preferences: # source_code/1/mary_and_temperature_preferences/ marry_and_temperature_preferences.data 10 0 cold 25 0 warm 15 5 cold 20 3 warm 18 7 cold 20 10 cold 22 5 warm 24 6 warm Output: We run the implementation above on the input file mary_and_temperature_preferences.data using the k-NN algorithm for k=1 neighbors. The algorithm classifies all the points with the integer coordinates in the rectangle with a size of (30-5=25) by (10-0=10), so with the a of (25+1) * (10+1) = 286 integer points (adding one to count points on boundaries). Using the wc command, we find out that the output file contains exactly 286 lines - one data item per point. Using the head command, we display the first 10 lines from the output file. 
We visualize all the data from the output file in the next section:

$ python knn_to_data.py mary_and_temperature_preferences.data mary_and_temperature_preferences_completed.data 1 5 30 0 10
$ wc -l mary_and_temperature_preferences_completed.data
286 mary_and_temperature_preferences_completed.data
$ head -10 mary_and_temperature_preferences_completed.data
7 3 cold
6 9 cold
12 1 cold
16 6 cold
16 9 cold
14 4 cold
13 4 cold
19 4 warm
18 4 cold
15 1 cold

So, there you have it! The k-nearest neighbors algorithm explained and implemented in Python. I hope you enjoyed this tutorial and found it interesting. If you want more, go ahead and purchase Dávid Natingga's Data Science Algorithms in a Week, from which this tutorial has been extracted.


Revolutionize Power BI Queries with OpenAI

Gus Frazer
11 Dec 2024
10 min read
This article is an excerpt from the book Data Cleaning with Power BI, by Gus Frazer. Unlock the full potential of your data by mastering the art of cleaning, preparing, and transforming data with Power BI for smarter insights and data visualizations.

Introduction

Discover the transformative potential of leveraging Azure OpenAI, integrated with ChatGPT functionality, to enhance Power BI's M query capabilities. In this article, we delve into how this powerful combination offers expert guidance, efficient solutions, and insightful recommendations for optimizing data transformation tasks. From generating M queries to streamlining complex transformations, explore how Azure OpenAI with ChatGPT empowers users to boost productivity and efficiency in Power BI.

Using OpenAI for M queries

Azure OpenAI, with ChatGPT functionality within it, can be a helpful tool for generating M queries in Power BI by providing suggestions, helping with syntax, and offering insights into data transformation tasks. In the following example, you will learn how you can leverage the chat playground within OpenAI to improve your productivity and efficiency when writing M queries. We will do this by asking a series of questions directly within Azure OpenAI. Complete the next steps to follow along with the example in your own environment:

1. Click on Deployment on the left-hand side and then select Create new deployment to get started.

2. Select a model from the base models, in this case gpt-35-turbo, and then name your deployment. In this example, name it CleaningDataOpenAI.

3. Select Chat playground from the Azure OpenAI Studio home screen or from the Playground tab on the left of your screen. This should open the following view:

Figure 15.4 – Screenshot of the chat playground from within Azure OpenAI
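Although this walkthrough uses the chat playground, the same deployment can also be called from Python. The following is a minimal, hypothetical sketch using the openai package's Azure client (v1 style); the endpoint, key, and prompt are illustrative placeholders rather than values from the book:

# Hypothetical sketch: querying the CleaningDataOpenAI deployment from
# Python instead of the chat playground. Endpoint, key, and prompt are
# placeholders, not values from the book.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",  # placeholder
    api_key="YOUR-API-KEY",                                   # placeholder
    api_version="2024-02-01",
)

response = client.chat.completions.create(
    model="CleaningDataOpenAI",  # the deployment name from step 2
    messages=[
        {"role": "system",
         "content": "You are an assistant that writes Power Query M code."},
        {"role": "user",
         "content": "Write an M query that groups a sales table by "
                    "ProductCategory and sums the Sales column."},
    ],
)

print(response.choices[0].message.content)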
Now that you have set up an instance of OpenAI, here's an example conversation taken from OpenAI, where the chat functionality has been used to assist with writing M queries in Power BI:

User: Hi, I'm working on a Power BI project, and I need to transform some data using M queries. I have a table with sales data, and I want to calculate the total sales for each product category. Can you help me with the M query for that?

ChatGPT: Of course! To calculate the total sales for each product category, you can use the Group By transformation in Power BI's Power Query Editor. Here's a sample M snippet for your scenario:

let
    Source = YourDataSource,
    GroupedTable = Table.Group(Source, {"ProductCategory"},
        {{"TotalSales", each List.Sum([Sales]), type number}})
in
    GroupedTable

Replace YourDataSource with the actual name of your data source. This code groups the data by the ProductCategory column and calculates the total sales for each group.

User: Thanks! That's a great starting point. What if I want to filter out products with sales less than $1,000 before calculating the total sales?

ChatGPT: You can add a filtering step before the Group By transformation. Here's an updated version of the M query:

let
    Source = YourDataSource,
    FilteredTable = Table.SelectRows(Source, each [Sales] >= 1000),
    GroupedTable = Table.Group(FilteredTable, {"ProductCategory"},
        {{"TotalSales", each List.Sum([Sales]), type number}})
in
    GroupedTable

This code filters out rows where sales are less than $1,000 before grouping the data by product category and calculating the total sales.

User: Perfect! This is really helpful. One more thing - what if I want to sort the results in descending order based on total sales?

ChatGPT: You can add a sorting step using the Table.Sort function. Here's the modified M query:

let
    Source = YourDataSource,
    FilteredTable = Table.SelectRows(Source, each [Sales] >= 1000),
    GroupedTable = Table.Group(FilteredTable, {"ProductCategory"},
        {{"TotalSales", each List.Sum([Sales]), type number}}),
    SortedTable = Table.Sort(GroupedTable, {{"TotalSales", Order.Descending}})
in
    SortedTable

This code sorts the grouped table in descending order based on the "TotalSales" column.

This interaction represents a simple example of how those who can leverage the power of tools such as OpenAI and ChatGPT will be able to quickly upskill in areas such as coding. It has to be said, though, that while this is still in its infancy, it's important to always test and validate the answers provided before implementing them in production. Also, ensure that you take precautions when using the publicly available ChatGPT model to avoid sharing sensitive data publicly. If you would like to use sensitive data, or you want to ensure that requests are made within a secure, governed environment, make sure to use the ChatGPT model within your own Azure OpenAI instance.

In more complex examples, optimizing Power Query transformations could involve efficient interaction with Azure OpenAI. This includes streamlining API calls, managing large datasets, and incorporating caching mechanisms for repetitive queries, ensuring a seamless and performant data cleaning process. As we begin to explore the use cases where this technology can be most effective, there are a number of clear early winners:

Optimizing query plans: ChatGPT's natural language understanding can assist in formulating more efficient Power Query plans. By describing the desired transformations in natural language, users can interact with ChatGPT to generate optimized query plans. This involves selecting the most suitable Power Query functions and structuring transformations for performance gains.

Caching strategies for repetitive queries: ChatGPT can guide users in devising effective caching strategies. By understanding the context of data transformations, it can recommend where to implement caching mechanisms to store and reuse intermediate results, minimizing redundant API calls and computations. The following is an example of just this, where I have asked Azure OpenAI to verify and optimize my query from the Power Query Advanced Editor. The model suggested I use the Table.Buffer function to help cache the table in memory and optimize the query.

Figure – An example request to OpenAI to help optimize my query for Power Query

Figure – An example response from OpenAI to help optimize my query for Power Query

Now, as we highlighted in Chapter 11, M Query Optimization, Table.Buffer can indeed improve the performance of your queries and refreshes, but this really depends on the data you are working with. In the previous example, the model doesn't take the characteristics, size, or complexity of your data into consideration, as it isn't plugged into your data at this stage. Also, linking back to the example you walked through in Chapter 11, the placement of where you add Table.Buffer can really impact how your query performs.
In the previous example, if you were connecting to a small dataset, you would likely cause it to run slower by adding the Table.Buffer function as the second variable in the query.

Lastly, it's worth mentioning that how you prompt these models is crucially important. In the previous example, we didn't specify what type of data source we were using in our query. As such, the model hasn't pointed out that using Table.Buffer on a data source that supports query folding will cause it to break the fold. Again, this is not so much of a problem if Table.Buffer is placed at the end of your query for smaller datasets, but it is a problem if you add it nearer to the beginning of the query, as in the previous example.

Handling large datasets: Dealing with large datasets often poses a challenge in Power Query. OpenAI models, including ChatGPT, can provide insights into dividing and conquering large datasets. This includes strategies for parallel processing, filtering data early in the transformation pipeline, and using aggregations to reduce computational load.

Dynamic query adjustments: ChatGPT's interactive nature allows users to dynamically adjust queries based on evolving requirements. It can assist in crafting queries that adapt to changing data scenarios, ensuring that Power Query transformations remain flexible and responsive to varied datasets.

Guidance on complex transformations: Power Query often involves intricate transformations. ChatGPT can act as a virtual assistant, guiding users through the process of complex transformations. It can suggest optimal function compositions, advise on conditional logic placement, and assist in structuring transformations to enhance efficiency. The best example of this can be seen in the following two screenshots of an active use case seen in many businesses. The example begins with a user asking the model for a description of what the query is doing. OpenAI then provides a breakdown of what the query is doing in each step to help the user interpret the code. It helps to break down the barriers to coding, and also helps to decipher code that has not been documented well by previous employees.

Figure – An example request to OpenAI to help translate my query

Figure – An example response from OpenAI to help describe my query

Error handling strategies: Optimizing Power Query also entails robust error handling. ChatGPT can provide recommendations for anticipating and handling errors gracefully within a query. This includes strategies for logging errors, implementing fallback mechanisms, and ensuring the stability of the overall data preparation process.

In this section, you learned how to optimize Power Query transformations with Azure OpenAI efficiently. Key takeaways include using ChatGPT for natural-language-based query planning and effective caching strategies. Insights include handling large datasets through parallel processing, early filtering, and aggregations. This knowledge equips you to streamline and enhance your Power Query processes effectively. In the next section, you will learn about Microsoft Copilot, how to set up a Power BI instance with Copilot activated, and also how you can use this new AI technology to help clean and prepare your data.

Conclusion

In conclusion, Azure OpenAI with ChatGPT presents a game-changing solution for maximizing Power BI's potential. From query optimization to error-handling strategies, this integration streamlines processes and enhances productivity.
As users navigate complex data transformations, the guidance provided fosters efficient decision-making and empowers users to tackle challenges with confidence. With Azure OpenAI and ChatGPT, the possibilities for revolutionizing Power BI workflows are endless, offering a glimpse into the future of data transformation and analytics.

Author Bio

Gus Frazer is a seasoned analytics consultant focused on business intelligence solutions. With over 7 years of experience working for the two market-leading platforms, Power BI and Tableau, he has amassed a wealth of knowledge and expertise. Gus has helped hundreds of customers to drive their digital and data transformations, scope data requirements, drive actionable insights, and, most important of all, cleanse data ready for analysis. Most recently, he helped to set up, organize, and run the Power BI UK community at Microsoft. He holds six Azure and Power BI certifications, including PL-300 and DP-500. In this book, Gus offers readers invaluable guidance on ingesting, preparing, and cleansing data for analysis in Power BI.


How to perform exception handling in Python with ‘try, catch and finally’

Guest Contributor
10 Dec 2019
9 min read
An integral part of using Python involves the art of handling exceptions. There are primarily two types of exceptions: built-in exceptions and user-defined exceptions. When an error interrupts the normal program flow, the error-handling resolution is to save the state of execution at the moment of the error and pass control to a special function or block of code called an exception handler. There are many types of errors, such as 'division by zero' and 'file open error', where an error handler needs to fix the issue so that the program can continue based on the previously saved state.

Just like Java, Python handles exceptions with code embedded in a try block. Where Java uses catch clauses to catch exceptions, Python uses a similar clause that begins with except. Custom exceptions are also possible in Python by using the raise statement, which forces a specified exception to take place.

Reason to use exceptions

Errors are always expected while writing a program in Python, which requires a backup mechanism. Such a mechanism is set to handle any encountered errors; not doing so may crash the program completely. The reason to equip a Python program with an exception mechanism is to set and define a backup plan in case any possible error situation erupts while executing it.

Catch exceptions in Python

The try statement is used for handling exceptions in Python. A try clause will contain the critical operation that can raise an exception, and the code for handling the exception is written within the except clause. The choice of what to do after catching the exception is up to the programmer. The program described below loops until the user enters an integer value having a valid reciprocal. The part of the code that can trigger an exception is contained inside the try block. If no exception is raised, the normal flow of execution continues and the except block is skipped; if an exception is raised, it is caught by the except block. Check out the example below.
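The article's original listing was an image; here is a minimal reconstruction (my own) matching the description: it loops until the user enters an integer with a valid reciprocal, naming the raised exception via sys.exc_info().

# Reconstructed sketch: loop until the user enters an integer whose
# reciprocal is defined, reporting the exception type on each failure.
import sys

while True:
    try:
        num = int(input("Enter an integer: "))
        reciprocal = 1 / num
        break
    except:
        print("Oops!", sys.exc_info()[0], "occurred. Please try again.")

print("The reciprocal of", num, "is", reciprocal)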
Naming the exception is possible by using the exc_info() function inside the sys module; the program can then ask the user to make another attempt. Any unexpected value like 'a' or '1.3' will trigger a ValueError, and a value of '0' leads to a ZeroDivisionError.

Exception handling in Python: try, except and finally

Suspicious code that may raise an exception is placed inside a try statement block, while the code dedicated to handling raised exceptions is placed within the except block:

try:
    # operational/suspicious code
except SomeException:
    # code to handle the exception

How they work in Python: the statements in the try block are executed first. If an exception occurs and matches the name given in the except clause ('SomeException' above), the except block containing the exception-handling statements is executed, and the program is able to continue. If there is no corresponding handler for the raised exception, program execution is halted along with an error describing it.

Defining except without an exception

Naming an exception in the except clause isn't always required. A try-except clause with a bare except is capable of handling all possible types of exceptions, although it will keep you ignorant about which exception was actually raised. The except statement can be used without the exceptions field, optionally together with an else block:

try:
    # do your operations here
except:
    # if there is an exception raised, execute these statements
else:
    # if there is no exception, execute these statements

Here is an example where the intent is to catch an exception within file handling. This is useful when the intention is to read a file that may not exist:

try:
    fp = open('example.txt', 'r')
except:
    print('File is not found')
else:
    fp.close()

This example deals with opening example.txt. When the file is not found or does not exist, the code executes the except block, printing the message 'File is not found'; otherwise, the file is closed in the else block.

Defining the except clause for multiple exceptions

It is possible to deal with multiple exceptions in a single try statement by specifying different exception handlers; it is also good programming practice to name the particular exceptions the code expects. One way is to define a tuple that handles multiple predefined exceptions within a single except clause, with a bare except at the end for everything else:

try:
    # do something
except (Exception1, Exception2, ExceptionN):
    # handle any exception from the given list
    pass
except:
    # handle all other exceptions

If the interpreter finds a matching exception in the tuple, the code written under that except clause is executed. A short runnable example follows.
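Here is a minimal runnable sketch (not from the original article) of catching two specific exception types with one tuple-based except clause:

# One except clause handling two specific exception types.
for value in ['10', '0', 'abc']:
    try:
        print('Reciprocal of', value, 'is', 1 / int(value))
    except (ValueError, ZeroDivisionError) as err:
        print('Cannot compute reciprocal of', value, '-', type(err).__name__)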
Exception handling in Python using the try-finally clause

Apart from combining the try and except blocks, it is also a good idea to put together try and finally blocks. Here, the finally block carries all the necessary statements that must be executed regardless of whether an exception is raised in the try block. One benefit of this method is that it helps in releasing external resources and clearing up cache memory, keeping the program lean. Here is the pseudocode for the try-finally clause:

try:
    # perform operations
finally:
    # these statements are always executed

Defining exceptions in the try-finally block

The example given below closes a file once all the operations are completed:

try:
    fp = open("example.txt", 'r')
    # file operations
finally:
    fp.close()

When using the try statement in Python, remember that it comes with an optional clause: finally. The finally block is executed under any circumstances, which is why it is usually used for releasing additional external resources. It is not new for developers to be connected to a remote data center over a network, or to work with a file or a graphical user interface; such situations push developers to clean up the used resources. Even when the resources in use yield successful results, such post-execution steps are always considered good practice. Actions like shutting down the GUI, closing a file, or disconnecting from a network, when written in the finally block, are assured of execution. The file operations example below illustrates this very well:

try:
    f = open("test.txt", encoding='utf-8')
    # perform file operations
finally:
    f.close()

Or, in simpler terms:

try:
    # You do your operations here;
    # due to any exception, this may be skipped.
finally:
    # This would always be executed.

Constructing such a block is a better way to ensure the file is closed even if an exception has taken place. Make a note that it is not possible to use the else clause along with this try-finally form.

Understanding user-defined exceptions

Python users can create their own exceptions by deriving classes from the built-in standard exceptions. There are instances where displaying specific information to users is crucial, especially upon catching an exception. In such cases, it is best to create a class subclassed from RuntimeError. Here, the try block raises a user-defined exception, which is caught in the except block with an instance of the Networkerror class bound to the variable e. Below is the class definition:

class Networkerror(RuntimeError):
    def __init__(self, arg):
        self.args = arg

Once the class is defined, the exception can be raised and caught as follows:

try:
    raise Networkerror("Bad hostname")
except Networkerror as e:
    print(e.args)
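Note that assigning a string directly to self.args stores it as a tuple of individual characters. A slightly more idiomatic sketch (my own, not from the article) defers to the base class instead:

# A more idiomatic user-defined exception: pass the message to the base
# class rather than assigning self.args directly.
class NetworkError(RuntimeError):
    def __init__(self, message):
        super().__init__(message)

try:
    raise NetworkError("Bad hostname")
except NetworkError as e:
    print(e.args)   # ('Bad hostname',)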
Key points to remember

An exception is an error that occurs while executing a program; such events occur relatively infrequently. As mentioned in the examples above, the most common exceptions are 'division by zero', 'attempting to access a non-existent file', and 'adding two non-compatible types'. Wrap a try statement around code where you are not sure whether an exception will occur, and specify an else block alongside the try-except statement to run when no exception is raised in the try block.

Author bio

Shahid Mansuri co-founded Peerbits, one of the leading software development companies in the USA, in 2011; the company provides Python development services. Under his leadership, Peerbits used Python on a project to embed reports and research on a platform, giving every user access to a freely available dashboard as well as to an exclusively available one. His visionary leadership and flamboyant management style have yielded fruitful results for the company. He believes in sharing his strong knowledge base with a learned concentration on entrepreneurship and business.

Read next:

- Introducing Spleeter, a TensorFlow-based Python library that extracts voice and sound from any music track
- Fake Python libraries removed from PyPI when caught stealing SSH and GPG keys, reports ZDNet
- There's more to learning programming than just writing code


Implementing gradient descent algorithm to solve optimization problems

Sunith Shetty
22 Feb 2018
7 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Rajdeep Dua and Manpreet Singh Ghotra titled Neural Network Programming with Tensorflow. In this book, you will learn to leverage the power of TensorFlow to train neural networks of varying complexities, without any hassle.[/box] Today we will focus on the gradient descent algorithm and its different variants. We will take a simple example of linear regression to solve the optimization problem. Gradient descent is the most successful optimization algorithm. As mentioned earlier, it is used to do weights updates in a neural network so that we minimize the loss function. Let's now talk about an important neural network method called backpropagation, in which we firstly propagate forward and calculate the dot product of inputs with their corresponding weights, and then apply an activation function to the sum of products which transforms the input to an output and adds non linearities to the model, which enables the model to learn almost any arbitrary functional mappings. Later, we back propagate in the neural network, carrying error terms and updating weights values using gradient descent, as shown in the following graph: Different variants of gradient descent Standard gradient descent, also known as batch gradient descent, will calculate the gradient of the whole dataset but will perform only one update. Therefore, it can be quite slow and tough to control for datasets which are extremely large and don't fit in the memory. Let's now look at algorithms that can solve this problem. Stochastic gradient descent (SGD) performs parameter updates on each training example, whereas mini batch performs an update with n number of training examples in each batch. The issue with SGD is that, due to the frequent updates and fluctuations, it eventually complicates the convergence to the accurate minimum and will keep exceeding due to regular fluctuations. Mini-batch gradient descent comes to the rescue here, which reduces the variance in the parameter update, leading to a much better and stable convergence. SGD and mini-batch are used interchangeably. Overall problems with gradient descent include choosing a proper learning rate so that we avoid slow convergence at small values, or divergence at larger values and applying the same learning rate to all parameter updates wherein if the data is sparse we might not want to update all of them to the same extent. Lastly, is dealing with saddle points. Algorithms to optimize gradient descent We will now be looking at various methods for optimizing gradient descent in order to calculate different learning rates for each parameter, calculate momentum, and prevent decaying learning rates. To solve the problem of high variance oscillation of the SGD, a method called momentum was discovered; this accelerates the SGD by navigating along the appropriate direction and softening the oscillations in irrelevant directions. Basically, it adds a fraction of the update vector of the past step to the current update vector. Momentum value is usually set to .9. Momentum leads to a faster and stable convergence with reduced oscillations. Nesterov accelerated gradient explains that as we reach the minima, that is, the lowest point on the curve, momentum is quite high and it doesn't know to slow down at that point due to the large momentum which could cause it to miss the minima entirely and continue moving up. 
Nesterov accelerated gradient (NAG) addresses a weakness of momentum: as we reach the minimum, that is, the lowest point on the curve, the momentum is quite high, and it doesn't know to slow down at that point, which can cause it to miss the minimum entirely and continue moving up. Nesterov proposed that we first make a long jump based on the previous momentum, then calculate the gradient, and then make a correction, which results in a parameter update. This update prevents us from going too fast and missing the minimum, and makes the method more responsive to changes.

Adagrad allows the learning rate to adapt based on the parameters. It performs large updates for infrequent parameters and small updates for frequent parameters, and is therefore very well suited to dealing with sparse data. Its main flaw is that its learning rate is always decreasing and decaying.

AdaDelta solves the problem of the decreasing learning rate in AdaGrad. In AdaGrad, the learning rate is computed as one divided by the sum of square roots. At each stage, we add another square root to the sum, which causes the denominator to decrease constantly. Instead of summing all prior square roots, AdaDelta uses a sliding window, which allows the sum to decrease.

Adaptive Moment Estimation (Adam) computes adaptive learning rates for each parameter. Like AdaDelta, Adam not only stores the decaying average of past squared gradients but additionally stores the momentum change for each parameter. Adam works well in practice and is one of the most used optimization methods today.

The following two images (image credit: Alec Radford) show the optimization behavior of the algorithms described earlier, as seen on the contours of a loss surface over time. Adagrad, RMSprop, and Adadelta almost immediately head off in the right direction and converge fast, whereas momentum and NAG are headed off-track. NAG is soon able to correct its course due to its improved responsiveness, by looking ahead and moving toward the minimum. The second image displays the behavior of the algorithms at a saddle point: SGD, momentum, and NAG find it challenging to break symmetry, but slowly they manage to escape the saddle point, whereas Adagrad, Adadelta, and RMSprop head down the negative slope.

Which optimizer to choose

In cases where the input data is sparse, or when we want fast convergence while training complex neural networks, we get the best results using adaptive learning rate methods. We also don't need to tune the learning rate. For most cases, Adam is usually a good choice.

Optimization with an example

Let's take an example of linear regression, where we try to find the best fit for a straight line through a number of data points by minimizing the squares of the distances from the line to each data point. This is why we call it least squares regression. Essentially, we are formulating the problem as an optimization problem, where we are trying to minimize a loss function.
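Concretely, for n data points (x_i, y_i) and a line parameterized by a weight W and bias b, the loss being minimized is the mean squared residual:

loss = (1/n) * sum over i of (y_i - (W * x_i + b))^2

This is exactly what the loss expression in the code below computes.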
Let's set up the input data and look at the scatter plot:

# imports assumed from earlier in the book
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

# input data
xData = np.arange(100, step=.1)
yData = xData + 20 * np.sin(xData/10)

Define the data size and batch size:

# define the data size and batch size
nSamples = 1000
batchSize = 100

We will need to resize the data to meet the TensorFlow input format, as follows:

# resize input for tensorflow
xData = np.reshape(xData, (nSamples, 1))
yData = np.reshape(yData, (nSamples, 1))

The following scope initializes the weights and bias, and describes the linear model and loss function:

# placeholders for mini-batches; X and y are assumed to be defined
# earlier in the book, for example as follows
X = tf.placeholder(tf.float32, shape=(None, 1))
y = tf.placeholder(tf.float32, shape=(None, 1))

with tf.variable_scope("linear-regression-pipeline"):
    W = tf.get_variable("weights", (1, 1),
                        initializer=tf.random_normal_initializer())
    b = tf.get_variable("bias", (1,),
                        initializer=tf.constant_initializer(0.0))

    # model
    yPred = tf.matmul(X, W) + b
    # loss function
    loss = tf.reduce_sum((y - yPred)**2 / nSamples)

We then set optimizers for minimizing the loss:

# set the optimizer
#optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)
#optimizer = tf.train.AdamOptimizer(learning_rate=.001).minimize(loss)
#optimizer = tf.train.AdadeltaOptimizer(learning_rate=.001).minimize(loss)
#optimizer = tf.train.AdagradOptimizer(learning_rate=.001).minimize(loss)
#optimizer = tf.train.MomentumOptimizer(learning_rate=.001, momentum=0.9).minimize(loss)
#optimizer = tf.train.FtrlOptimizer(learning_rate=.001).minimize(loss)
optimizer = tf.train.RMSPropOptimizer(learning_rate=.001).minimize(loss)

We then select the mini-batches and run the optimizer:

errors = []
with tf.Session() as sess:
    # init variables
    sess.run(tf.global_variables_initializer())
    for _ in range(1000):
        # select mini batch
        indices = np.random.choice(nSamples, batchSize)
        xBatch, yBatch = xData[indices], yData[indices]
        # run optimizer
        _, lossVal = sess.run([optimizer, loss],
                              feed_dict={X: xBatch, y: yBatch})
        errors.append(lossVal)

plt.plot([np.mean(errors[i-50:i]) for i in range(len(errors))])
plt.savefig("errors.png")
plt.show()

The output of the preceding code is a plot of the errors, smoothed with a sliding-window average. We learned that optimization is a complicated subject, and a lot depends on the nature and size of our data. Optimization also depends on weight matrices. A lot of these optimizers are trained and tuned for tasks like image classification or prediction; however, for custom or new use cases, we need to perform trial and error to determine the best solution. To know more about how to build and optimize neural networks using TensorFlow, do check out the book Neural Network Programming with TensorFlow.


Implementing face detection using the Haar Cascades and AdaBoost algorithm

Sugandha Lahoti
20 Feb 2018
7 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Ankit Dixit titled Ensemble Machine Learning. This book serves as an effective guide to using ensemble techniques to enhance machine learning models.[/box] In today’s tutorial, we will learn how to apply the AdaBoost classifier in face detection using Haar cascades. Face detection using Haar cascades Object detection using Haar feature-based cascade classifiers is an effective object detection method proposed by Paul Viola and Michael Jones in their paper Rapid Object Detection using a Boosted Cascade of Simple Features in 2001. It is a machine-learning-based approach where a cascade function is trained from a lot of positive and negative images. It is then used to detect objects in other images. Here, we will work with face detection. Initially, the algorithm needs a lot of positive images (images of faces) and negative images (images without faces) to train the classifier. Then we need to extract features from it. Features are nothing but numerical information extracted from the images that can be used to distinguish one image from another; for example, a histogram (distribution of intensity values) is one of the features that can be used to define several characteristics of an image even without looking at the image, such as dark or bright image, the intensity range of the image, contrast, and so on. We will use Haar features to detect faces in an image. Here is a figure showing different Haar features: These features are just like the convolution kernel; to know about convolution, you need to wait for the following chapters. For a basic understanding, convolutions can be described as in the following figure: So we can summarize convolution with these steps: Pick a pixel location from the image. Now crop a sub-image with the selected pixel as the center from the source image with the same size as the convolution kernel. Calculate an element-wise product between the values of the kernel and sub- image. Add the result of the product. Put the resultant value into the new image at the same place where you picked up the pixel location. Now we are going to do a similar kind of procedure, but with a slight difference for our images. Each feature of ours is a single value obtained by subtracting the sum of the pixels under the white rectangle from the sum of the pixels under the black rectangle. Now, all possible sizes and locations of each kernel are used to calculate plenty of features. (Just imagine how much computation it needs. Even a 24x24 window results in over 160,000 features.) For each feature calculation, we need to find the sum of the pixels under the white and black rectangles. To solve this, we will use the concept of integral image; we will discuss this concept very briefly here, as it's not a part of our context. Integral image Integral images are those images in which the pixel value at any (x,y) location is the sum of the all pixel values present before the current pixel. Its use can be understood by the following example: Image on the left and the integral image on the right. Let's see how this concept can help reduce computation time; let us assume a matrix A of size 5x5 representing an image, as shown here: Now, let's say we want to calculate the average intensity over the area highlighted: Region for addition Normally, you'd do the following: 9 + 1 + 2 + 6 + 0 + 5 + 3 + 6 + 5 = 37 37 / 9 = 4.11 This requires a total of 9 operations. 
Now we are going to do a similar kind of procedure for our images, but with a slight difference. Each Haar feature is a single value obtained by subtracting the sum of the pixels under the white rectangle from the sum of the pixels under the black rectangle. All possible sizes and locations of each kernel are used to calculate plenty of features. (Just imagine how much computation it needs: even a 24x24 window results in over 160,000 features.) For each feature calculation, we need to find the sum of the pixels under the white and black rectangles. To solve this, we will use the concept of the integral image; we will discuss this concept only briefly here, as it's not part of our context.

Integral image

An integral image is one in which the pixel value at any (x, y) location is the sum of all the pixel values before the current pixel. Let's see how this concept can help reduce computation time. Assume a 5x5 matrix A representing an image, and say we want to calculate the average intensity over a highlighted 3x3 region containing the values 9, 1, 2, 6, 0, 5, 3, 6, 5. Normally, you'd do the following:

9 + 1 + 2 + 6 + 0 + 5 + 3 + 6 + 5 = 37
37 / 9 = 4.11

This requires a total of 9 operations, and doing the same for 100 such regions would require 100 * 9 = 900 operations. Now, let us first make an integral image of the preceding image; making this image requires a total of 56 operations. Focusing on the same highlighted portion of the integral image, all you have to do to compute the sum is:

(76 - 20) - (24 - 5) = 37
37 / 9 = 4.11

This requires a total of 4 operations. To do this for 100 such regions, we would require 56 + 100 * 4 = 456 operations. For just a hundred operations over a 5x5 matrix, using an integral image requires about 50% fewer computations. Imagine the difference it makes for large images and other such operations. Creating an integral image reduces sum-difference operations to almost O(1) time complexity, thereby decreasing the number of calculations. It simplifies the calculation of the sum of pixels - no matter how large the number of pixels - to an operation involving just four pixels. Nice, isn't it? It makes things superfast.
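To make the arithmetic concrete, here is a minimal NumPy sketch (not from the book; the matrix values outside the highlighted 3x3 block are made up) that builds an integral image and recovers the same sum of 37 from just four lookups:

# A zero-padded integral image lets us sum any rectangle in four lookups.
# The central 3x3 block matches the 37 example above; the surrounding
# values are made-up padding.
import numpy as np

img = np.array([[1, 2, 3, 4, 5],
                [2, 9, 1, 2, 6],
                [3, 6, 0, 5, 7],
                [4, 3, 6, 5, 8],
                [5, 4, 7, 8, 9]])

# integral[i, j] holds the sum of img[:i, :j]; the extra zero row and
# column remove the need for boundary checks.
integral = np.zeros((6, 6), dtype=int)
integral[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)

def region_sum(r1, c1, r2, c2):
    # Sum of img[r1:r2, c1:c2] using four integral-image lookups.
    return (integral[r2, c2] - integral[r1, c2]
            - integral[r2, c1] + integral[r1, c1])

print(region_sum(1, 1, 4, 4))      # 37, same as summing the 9 pixels
print(region_sum(1, 1, 4, 4) / 9)  # average intensity, about 4.11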
However, most of the features we calculate are irrelevant. For example, consider the image where the top row shows two good features: the first selected feature seems to focus on the property that the region of the eyes is often darker than the region of the nose and cheeks, and the second relies on the property that the eyes are darker than the bridge of the nose. But the same windows applied to the cheeks or any other part of the face are irrelevant. So how do we select the best features out of 160,000+? This is achieved by AdaBoost.

To do this, we apply each and every feature to all the training images. For each feature, AdaBoost finds the best threshold that will classify the faces as positive and negative. Obviously, there will be errors or misclassifications. We select the features with the minimum error rate, which means they are the features that best separate the face and non-face images.

Note: The process is not as simple as this. Each image is given an equal weight in the beginning. After each classification, the weights of misclassified images are increased, and the same process is repeated: new error rates are calculated with the new weights. This continues until the required accuracy or error rate is achieved, or the required number of features is found.

The final classifier is a weighted sum of these weak classifiers. They are called weak because each alone can't classify the image, but together they form a strong classifier. The paper says that even 200 features provide detection with 95% accuracy. The authors' final setup had around 6,000 features. (Imagine a reduction from 160,000+ features to 6,000. That is a big gain.)

Face detection framework using the Haar cascade and AdaBoost algorithm

So now, you take an image, take each 24x24 window, apply the 6,000 features to it, and check whether it is a face or not. Wow! Isn't this a little inefficient and time-consuming? Yes, it is, and the authors of the algorithm have a good solution for that. In an image, most of the image region is non-face, so it is a better idea to have a simple method to verify that a window is not a face region. If it is not, discard it in a single shot and don't process it again; instead, focus on the regions where there can be a face. This way, we spend more time checking possible face regions.

For this, they introduced the concept of a cascade of classifiers. Instead of applying all 6,000 features to a window, we group the features into different stages of classifiers and apply them one by one (normally, the first few stages contain very few features). If a window fails the first stage, discard it; we don't consider the remaining features on it. If it passes, apply the second stage of features and continue the process. A window that passes all stages is a face region. How cool is the plan!

The authors' detector had 6,000+ features in 38 stages, with 1, 10, 25, 25, and 50 features in the first five stages (the two features in the preceding image were actually obtained as the best two features from AdaBoost). According to the authors, on average, 10 features out of the 6,000+ are evaluated per sub-window.

So this is a simple, intuitive explanation of how Viola-Jones face detection works. Read the paper for more details. If you found this post useful, do check out the book Ensemble Machine Learning to learn more about different machine learning aspects such as bagging, boosting, and stacking.


5 polarizing quotes from Professor Stephen Hawking on artificial intelligence

Richard Gall
15 Mar 2018
3 min read
Professor Stephen Hawking died today (March 14, 2018), aged 76, at his home in Cambridge, UK. Best known for his theory of cosmology that unified quantum mechanics with Einstein's General Theory of Relativity, and for his book A Brief History of Time that brought his concepts to a wider general audience, Professor Hawking is quite possibly one of the most important and well-known voices in the scientific world.

Among many things, Professor Hawking had a lot to say about artificial intelligence - its dangers, its opportunities, and what we should be thinking about, not just as scientists and technologists, but as humans. Over the years, Hawking remained cautious and consistent in his views on the topic, constantly urging AI researchers and machine learning developers to consider the wider implications of their work on society and the human race itself. The machine learning community is quite divided on all the issues Hawking raised and will probably continue to be so as the field grows faster than it can be fathomed.

Here are 5 widely debated things Stephen Hawking said about AI, arranged in chronological order - and if you're going to listen to anyone, you've surely got to listen to him.

On artificial intelligence ending the human race

"The development of full artificial intelligence could spell the end of the human race… It would take off on its own, and re-design itself at an ever-increasing rate. Humans, who are limited by slow biological evolution, couldn't compete and would be superseded."

From an interview with the BBC, December 2014

On the future of AI research

"The establishment of shared theoretical frameworks, combined with the availability of data and processing power, has yielded remarkable successes in various component tasks such as speech recognition, image classification, autonomous vehicles, machine translation, legged locomotion, and question-answering systems. As capabilities in these areas and others cross the threshold from laboratory research to economically valuable technologies, a virtuous cycle takes hold whereby even small improvements in performance are worth large sums of money, prompting greater investments in research. There is now a broad consensus that AI research is progressing steadily, and that its impact on society is likely to increase... Because of the great potential of AI, it is important to research how to reap its benefits while avoiding potential pitfalls."

From Research Priorities for Robust and Beneficial Artificial Intelligence, an open letter co-signed by Hawking, January 2015

On AI emulating human intelligence

"I believe there is no deep difference between what can be achieved by a biological brain and what can be achieved by a computer. It, therefore, follows that computers can, in theory, emulate human intelligence - and exceed it."

From a speech given by Hawking at the opening of the Leverhulme Centre for the Future of Intelligence, Cambridge, UK, October 2016

On making artificial intelligence benefit humanity

"Perhaps we should all stop for a moment and focus not only on making our AI better and more successful but also on the benefit of humanity."

From a speech given by Hawking at Web Summit in Lisbon, November 2017

On AI replacing humans

"The genie is out of the bottle. We need to move forward on artificial intelligence development but we also need to be mindful of its very real dangers. I fear that AI may replace humans altogether. If people design computer viruses, someone will design AI that replicates itself.
This will be a new form of life that will outperform humans."

From an interview with Wired, November 2017

Implementing matrix operations using SciPy and NumPy

Pravin Dhandre
07 Mar 2018
5 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book co-authored by L. Felipe Martins, Ruben Oliva Ramos and V Kishore Ayyadevara titled SciPy Recipes. This book includes hands-on recipes for using different components of the SciPy Stack such as NumPy, SciPy, matplotlib, pandas, etc.[/box] In this article, we will discuss how to leverage the power of SciPy and NumPy to perform numerous matrix operations and solve common challenges faced while proceeding with statistical analysis. Matrix operations and functions on two-dimensional arrays Basic matrix operations form the backbone of quite a few statistical analyses—for example, neural networks. In this section, we will be covering some of the most used operations and functions on 2D arrays: Addition Multiplication by scalar Matrix arithmetic Matrix-matrix multiplication Matrix inversion Matrix transposition In the following sections, we will look into the methods of implementing each of them in Python using SciPy/NumPy. How to do it… Let's look at the different methods. Matrix addition In order to understand how matrix addition is done, we will first initialize two arrays: # Initializing an array x = np.array([[1, 1], [2, 2]]) y = np.array([[10, 10], [20, 20]]) Similar to what we saw in a previous chapter, we initialize a 2 x 2 array by using the np.array function. There are two methods by which we can add two arrays. Method 1 A simple addition of the two arrays x and y can be performed as follows: x+y Note that x evaluates to: [[1 1] [2 2]] y evaluates to: [[10 10] [20 20]] The result of x+y would be equal to: [[1+10 1+10] [2+20 2+20]] Finally, this gets evaluated to: [[11 11] [22 22]] Method 2 The same preceding operation can also be performed by using the add function in the numpy package as follows: np.add(x,y) Multiplication by a scalar Matrix multiplication by a scalar can be performed by multiplying the vector with a number. We will perform the same using the following two steps: Initialize a two-dimensional array. Multiply the two-dimensional array with a scalar. We perform the steps, as follows: To initialize a two-dimensional array: x = np.array([[1, 1], [2, 2]]) To multiply the two-dimensional array with the k scalar: k*x For example, if the scalar value k = 2, then the value of k*x translates to: 2*x array([[2, 2], [4, 4]]) Matrix arithmetic Standard arithmetic operators can be performed on top of NumPy arrays too. The operations used most often are: Addition Subtraction Multiplication Division Exponentials The other major arithmetic operations are similar to the addition operation we performed on two matrices in the Matrix addition section earlier: # subtraction x-y array([[ -9, -9], [-18, -18]]) # multiplication x*y array([[10, 10], [40, 40]]) While performing multiplication here, there is an element to element multiplication between the two matrices and not a matrix multiplication (more on matrix multiplication in the next section): # division x/y array([[ 0.1, 0.1], [ 0.1, 0.1]]) # exponential x**y array([[ 1, 1], [1048576, 1048576]], dtype=int32) Matrix-matrix multiplication Matrix to matrix multiplication works in the following way: We have a set of two matrices with the following shape: Matrix A has n rows and m columns and matrix B has m rows and p columns. 
Matrix-matrix multiplication

Matrix-to-matrix multiplication works in the following way: suppose we have two matrices, where matrix A has n rows and m columns and matrix B has m rows and p columns. The matrix product C = AB then has n rows and p columns, with each entry calculated as C[i, j] = A[i, 1]*B[1, j] + A[i, 2]*B[2, j] + ... + A[i, m]*B[m, j]. The matrix operation is performed by using the built-in dot function available in NumPy, as follows:

1. Initialize the arrays:

x = np.array([[1, 1], [2, 2]])
y = np.array([[10, 10], [20, 20]])

2. Perform the matrix multiplication using the dot function in the numpy package:

np.dot(x, y)
array([[30, 30],
       [60, 60]])

The np.dot function does the multiplication in the following way:

array([[1*10 + 1*20, 1*10 + 1*20],
       [2*10 + 2*20, 2*10 + 2*20]])

Whenever matrix multiplication happens, the number of columns in the first matrix should be equal to the number of rows in the second matrix.

Matrix transposition

Matrix transposition is performed by using the transpose function available in the numpy package. The process to generate the transpose of a matrix is as follows:

1. Initialize a matrix:

A = np.array([[1, 2], [3, 4]])

2. Calculate the transpose of the matrix:

A.transpose()
array([[1, 3],
       [2, 4]])

The transpose of a matrix with m rows and n columns is a matrix with n rows and m columns.

Matrix inversion

While we performed most of the basic arithmetic operations on top of matrices earlier, we have not performed any of the specialist functions used within scientific computing and analysis - for example, matrix inversion, ranking of a matrix, and so on. This is where the other functions available within the scipy package shine, in scenarios where more data manipulation is required beyond the standard operations. Matrix inversion can be performed by using the function available in scipy.linalg. The process to perform matrix inversion and its implementation in Python is as follows:

1. Import the relevant function from within the package:

from scipy import linalg

2. Initialize a matrix:

A = np.array([[1, 2], [3, 4]])

3. Pass the initialized matrix through the inverse function of the package:

linalg.inv(A)
array([[-2. ,  1. ],
       [ 1.5, -0.5]])

We saw how to easily implement all the basic matrix operations with Python's scientific library, SciPy. You may check out the book SciPy Recipes to perform advanced computing tasks like discrete Fourier transforms and K-means with the SciPy stack.

MongoDB is going to acquire Realm, the mobile database management system, for $39 million

Richard Gall
25 Apr 2019
3 min read
MongoDB, the open source NoSQL database, is going to acquire mobile database platform Realm. The purchase is certainly one with clear technological and strategic benefits for both companies - and with MongoDB paying $39 million for a company that has up to now raised $40 million since its launch in 2011, it's clear that this is a move that isn't about short-term commercial gains.

It's important to note that the acquisition is not yet complete. It's expected to close in the second quarter of MongoDB's fiscal year 2020 (a fiscal year that ends in January 2020). Further details about the acquisition, and what it means for both products, will be revealed at MongoDB World in June.

Why is MongoDB acquiring Realm?

In the materials that announce the acquisition there's a lot of talk about the alignment between the two projects. "The best thing in the world is when someone just gets you, and you get them," MongoDB CTO Eliot Horowitz wrote in a blog post accompanying the announcement, "because when you share a vision of the world like that, you can do incredible things together. That's exactly the case with MongoDB and Realm."

At a more fundamental level, the acquisition allows MongoDB to do a number of things. It can reach a new community of developers working primarily in mobile development (according to the press release, Realm has 100,000 active users), but it also allows MongoDB to strengthen its capabilities as cloud evolves to become the dominant way that applications are built and hosted. According to Dev Ittycheria, MongoDB President and CEO, Realm "is a natural fit for our global cloud database, MongoDB Atlas, as well as a complement to Stitch, our serverless platform."

Serverless might well be a nascent trend at the moment, but the level of conversation and interest around it indicates that it's going to play a big part in application developers' lives in the months and years to come.

What's in it for Realm?

For Realm, the acquisition will give the project access to a new pool of users. Backing from MongoDB also provides robust foundations for the project to extend its roadmap and move even faster than it previously could have. Realm CEO David Ratner wrote yesterday (April 24) that: "The combination of MongoDB and Realm will establish the modern standard for mobile application development and data synchronization for a new generation of connected applications and services. MongoDB and Realm are fully committed to investing in the Realm Database and the future of data synchronization, and taking both to the next phase of their evolution. We believe that MongoDB will help accelerate Realm's product roadmap, go-to-market execution, and support our customers' use cases at a whole new level of global operational scale."

A new chapter for MongoDB?

2019 hasn't been the best year for MongoDB so far. The project withdrew its submission for its controversial Server Side Public License last month, following news that Red Hat was dropping MongoDB from Enterprise Linux and Fedora. This brought an initiative that the leadership viewed as strategically important in defending MongoDB's interests to a dramatic halt. However, the Realm acquisition sets up a new chapter and could go some way in helping MongoDB bolster itself for a future it has felt uncertain about.

Vertex AI Workbench: Your Complete Guide to Scaling Machine Learning with Google Cloud

Jasmeet Bhatia, Kartik Chaudhary
04 Nov 2024
15 min read
This article is an excerpt from the book, "The Definitive Guide to Google Vertex AI", by Jasmeet Bhatia and Kartik Chaudhary. The Definitive Guide to Google Vertex AI is for ML practitioners who want to learn Google best practices, MLOps tooling, and turnkey AI solutions for solving large-scale real-world AI/ML problems. This book takes a hands-on approach to help you become an ML rockstar on Google Cloud Platform in no time.

Introduction

While working on an ML project, if we are running a Jupyter Notebook in a local environment, or using a web-based Colab- or Kaggle-like kernel, we can perform some quick experiments and get some initial accuracy or results from ML algorithms very fast. But we hit a wall when it comes to performing large-scale experiments, launching long-running jobs, hosting a model, and also in the case of model monitoring. Additionally, if the data related to a project requires more granular permissions on security and privacy (fine-grained control over who can view/access the data), that's not feasible in local or Colab-like environments. All these challenges can be solved just by moving to the cloud.

Vertex AI Workbench within Google Cloud is a JupyterLab-based environment that can be leveraged for all kinds of development needs of a typical data science project. The JupyterLab environment is very similar to the Jupyter Notebook environment, and thus we will be using these terms interchangeably throughout the book. Vertex AI Workbench has options for creating managed notebook instances as well as user-managed notebook instances. User-managed notebook instances give more control to the user, while managed notebooks come with some key extra features. We will discuss these in more detail later in this section.

Some key features of the Vertex AI Workbench notebook suite include the following:

Fully managed – Vertex AI Workbench provides a Jupyter Notebook-based, fully managed environment that provides enterprise-level scale without managing infrastructure, security, and user-management capabilities.
Interactive experience – Data exploration and model experiments are easier, as managed notebooks can easily interact with other Google Cloud services such as storage systems, big data solutions, and so on.
Prototype to production AI – Vertex AI notebooks can easily interact with other Vertex AI tools and Google Cloud services, and thus provide an environment to run end-to-end ML projects from development to deployment with minimal transition.
Multi-kernel support – Workbench provides multi-kernel support in a single managed notebook instance, including kernels for tools such as TensorFlow, PyTorch, Spark, and R. Each of these kernels comes with useful ML libraries pre-installed and lets us install additional libraries as required.
Scheduling notebooks – Vertex AI Workbench lets us schedule notebook runs on an ad hoc and recurring basis. This functionality is quite useful for setting up and running large-scale experiments quickly. This feature is available through managed notebook instances. More information will be provided on this in the coming sections.

With this background, we can now start working with Jupyter Notebooks on Vertex AI Workbench. The next section provides basic guidelines for getting started with notebooks on Vertex AI.

Getting started with Vertex AI Workbench

Go to the Google Cloud console and open Vertex AI from the products menu on the left pane, or by using the search bar at the top.
Inside Vertex AI, click on Workbench, and it will open a page very similar to the one shown in Figure 4.3. More information on this is available in the official documentation (https://cloud.google.com/vertex-ai/docs/workbench/introduction).

Figure 4.3 – Vertex AI Workbench UI within the Google Cloud console

As we can see, Vertex AI Workbench is basically Jupyter Notebook as a service with the flexibility of working with managed as well as user-managed notebooks. User-managed notebooks are suitable for use cases where we need a more customized environment with relatively higher control. Another good thing about user-managed notebooks is that we can choose a suitable Docker container based on our development needs; these notebooks also let us change the type/size of the instance later on with a restart.

To choose the best Jupyter Notebook option for a particular project, it's important to know the common differences between the two solutions. Table 4.1 describes some common differences between fully managed and user-managed notebooks:

Table 4.1 – Differences between managed and user-managed notebook instances

Let's create one user-managed notebook to check the available options:

Figure 4.4 – Jupyter Notebook kernel configurations

As we can see in the preceding screenshot, user-managed notebook instances come with several customized image options to choose from. Along with support for tools such as TensorFlow Enterprise, PyTorch, JAX, and so on, it also lets us decide whether we want to work with GPUs (which can, of course, be changed later as per our needs). These customized images come with all the useful libraries for the desired framework pre-installed, plus the flexibility to install any third-party packages within the instance.

After choosing the appropriate image, we get more options to customize things such as the notebook name, notebook region, operating system, environment, machine type, accelerators, and so on (see the following screenshot):

Figure 4.5 – Configuring a new user-managed Jupyter Notebook

Once we click on the CREATE button, it can take a couple of minutes to create a notebook instance. Once it is ready, we can launch the Jupyter instance in a browser tab using the link provided inside Workbench (see Figure 4.6). We also get the option to stop the notebook when we are not using it (to reduce cost):

Figure 4.6 – A running Jupyter Notebook instance

This Jupyter instance can be accessed by all team members having access to Workbench, which helps in collaborating and sharing progress with other teammates. Once we click on OPEN JUPYTERLAB, it opens a familiar Jupyter environment in a new tab (see Figure 4.7):

Figure 4.7 – A user-managed JupyterLab instance in Vertex AI Workbench

A Google-managed JupyterLab instance also looks very similar (see Figure 4.8):

Figure 4.8 – A Google-managed JupyterLab instance in Vertex AI Workbench

Now that we can access the notebook instance in the browser, we can launch a new Jupyter Notebook or terminal and get started on the project. After providing sufficient permissions to the service account, many useful Google Cloud services such as BigQuery, GCS, Dataflow, and so on can be accessed from the Jupyter Notebook itself using SDKs. This makes Vertex AI Workbench a one-stop tool for every ML development need.

Note: We should stop Vertex AI Workbench instances when we are not using them or don't plan to use them for a long period of time.
This will help prevent us from incurring the costs of running them unnecessarily for a long period of time.

In the next sections, we will learn how to create notebooks using custom containers and how to schedule notebooks with Vertex AI Workbench.

Custom containers for Vertex AI Workbench

Vertex AI Workbench gives us the flexibility of creating notebook instances based on a custom container as well. The main advantage of a custom container-based notebook is that it lets us customize the notebook environment based on our specific needs. Suppose we want to work with a new TensorFlow version (or any other library) that is currently not available as a predefined kernel. We can create a custom Docker container with the required version and launch a Workbench instance using this container. Custom containers are supported by both managed and user-managed notebooks.

Here is how to launch a user-managed notebook instance using a custom container:

1. The first step is to create a custom container based on the requirements. Most of the time, a derivative container (a container based on an existing DL container image) is easy to set up. See the following example Dockerfile; here, we first pull an existing TensorFlow GPU image and then upgrade TensorFlow with pip:

FROM gcr.io/deeplearning-platform-release/tf-gpu:latest
RUN pip install --upgrade tensorflow

2. Next, build and push the container image to Container Registry, such that it is accessible to the Google Compute Engine (GCE) service account. See the following commands to build and push the container image:

export PROJECT=$(gcloud config list project --format "value(core.project)")
docker build . -f Dockerfile.example -t "gcr.io/${PROJECT}/tf-custom:latest"
docker push "gcr.io/${PROJECT}/tf-custom:latest"

Note that the service account should be provided with sufficient permissions to build and push the image to the container registry, and the respective APIs should be enabled.

3. Go to the User-managed notebooks page, click on the New Notebook button, and then select Customize. Provide a notebook name and select appropriate Region and Zone values.

4. In the Environment field, select Custom Container.

5. In the Docker Container Image field, enter the address of the custom image; in our case, it would look like this: gcr.io/${PROJECT}/tf-custom:latest

6. Make the remaining appropriate selections and click the Create button.

We are all set now. While launching the notebook, we can select the custom container as a kernel and start working in the custom environment.

Conclusion

Vertex AI Workbench stands out as a powerful, cloud-based environment that streamlines machine learning development and deployment. By leveraging its managed and user-managed notebook options, teams can overcome local development limitations, ensuring better scalability, enhanced security, and integrated access to Google Cloud services. This guide has explored the foundational aspects of working with Vertex AI Workbench, including its customizable environments, scheduling features, and the use of custom containers. With Vertex AI Workbench, data scientists and ML practitioners can focus on innovation and productivity, confidently handling projects from inception to production.

Author Bio

Jasmeet Bhatia is a machine learning solution architect with over 18 years of industry experience, with the last 10 years focused on global-scale data analytics and machine learning solutions.
In his current role at Google, he works closely with key GCP enterprise customers to provide them guidance on how to best use Google's cutting-edge machine learning products. At Google, he has also worked as part of the Area 120 incubator on building innovative data products such as Demand Signals, and he has been involved in the launch of Google products such as Time Series Insights. Before Google, he worked in similar roles at Microsoft and Deloitte.

When not immersed in technology, he loves spending time with his wife and two daughters, reading books, watching movies, and exploring the scenic trails of southern California.

He holds a bachelor's degree in electronics engineering from Jamia Millia Islamia University in India and an MBA from the University of California Los Angeles (UCLA) Anderson School of Management.

Kartik Chaudhary is an AI enthusiast, educator, and ML professional with 6+ years of industry experience. He currently works as a senior AI engineer with Google to design and architect ML solutions for Google's strategic customers, leveraging core Google products, frameworks, and AI tools. He previously worked with UHG as a data scientist, and helped in making the healthcare system work better for everyone. Kartik has filed nine patents at the intersection of AI and healthcare.

Kartik loves sharing knowledge and runs his own blog on AI, titled Drops of AI.

Away from work, he loves watching anime and movies and capturing the beauty of sunsets.

How greedy algorithms work

Richard Gall
10 Apr 2018
2 min read
What is a greedy algorithm?

Greedy algorithms are useful for optimization problems. They make the optimal choice at a localized and immediate level, with the aim that this leads you to the overall optimal solution. It's important to note that they don't always find you the best solution for the data science problem you're trying to solve - so apply them wisely. In the video tutorial below, from Fundamental Algorithms in Scala, you'll learn when and how to apply a simple greedy algorithm, with examples of both an iterative algorithm and a recursive algorithm in action.

The advantages and disadvantages of greedy algorithms

Greedy algorithms have a number of advantages and disadvantages. While on the one hand it's relatively easy to come up with them, it is actually pretty challenging to identify the issues around the 'correctness' of your algorithm. That means that ultimately the optimization problem you're trying to solve by using greedy algorithms isn't really a technical issue as such. Instead, it's more of an issue with the scope and definition of your data analysis project. It's a human problem, not a mechanical one.

Different ways to apply greedy algorithms

There are a number of areas where greedy algorithms can most successfully be applied. In fact, it's worth exploring some of these problems if you want to get to know them in more detail. They should give you a clearer indication of how greedy algorithms work, what makes them useful, and their potential drawbacks. Some of the best examples are:

Huffman coding
Dijkstra's algorithm
The continuous knapsack problem

A minimal sketch of a greedy solution to the last of these appears after the reading lists below.

To learn more about other algorithms check out these articles:

4 popular algorithms for Distance-based outlier detection
10 machine learning algorithms every engineer needs to know
4 Clustering Algorithms every Data Scientist should know
Backpropagation Algorithm

To learn to implement specific algorithms, use these tutorials:

Creating a reference generator for a job portal using Breadth First Search (BFS) algorithm
Implementing gradient descent algorithm to solve optimization problems
Implementing face detection using the Haar Cascades and AdaBoost algorithm
Getting started with big data analysis using Google's PageRank algorithm
Implementing the k-nearest neighbors' algorithm in Python
Machine Learning Algorithms: Implementing Naive Bayes with Spark MLlib
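As promised above, here is a minimal, self-contained Python sketch of the greedy approach to the continuous (fractional) knapsack problem; the item values and weights are made up purely for illustration. The greedy criterion is to repeatedly take as much as possible of the item with the best value-to-weight ratio, and for the fractional variant this choice is provably optimal:

def fractional_knapsack(items, capacity):
    # items: list of (value, weight) pairs; capacity: maximum total weight
    total_value = 0.0
    # Sort by value-to-weight ratio, best first - the greedy choice.
    for value, weight in sorted(items, key=lambda vw: vw[0] / vw[1], reverse=True):
        if capacity <= 0:
            break
        take = min(weight, capacity)          # take all of it, or whatever still fits
        total_value += value * (take / weight)
        capacity -= take
    return total_value

# Hypothetical items as (value, weight) pairs
items = [(60, 10), (100, 20), (120, 30)]
print(fractional_knapsack(items, capacity=50))  # prints 240.0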

Installing NumPy, SciPy, matplotlib, and IPython

Packt Editorial Staff
12 Oct 2014
7 min read
This article, written by Ivan Idris, author of the book Python Data Analysis, will guide you through installing NumPy, SciPy, matplotlib, and IPython. We can find a mind map describing software that can be used for data analysis at https://www.xmind.net/m/WvfC/. Obviously, we can't install all of this software in this article. We will install NumPy, SciPy, matplotlib, and IPython on different operating systems.

[box type="info" align="" class="" width=""]Packt has the following books that are focused on NumPy:

NumPy Beginner's Guide Second Edition, Ivan Idris
NumPy Cookbook, Ivan Idris
Learning NumPy Array, Ivan Idris
[/box]

SciPy is a scientific Python library, which supplements and slightly overlaps NumPy. NumPy and SciPy historically shared their codebase but were later separated. matplotlib is a plotting library based on NumPy. IPython provides an architecture for interactive computing. The most notable part of this project is the IPython shell.

Software used

The software used in this article is based on Python, so it is required to have Python installed. On some operating systems, Python is already installed. You, however, need to check whether the Python version is compatible with the software version you want to install. There are many implementations of Python, including commercial implementations and distributions.

[box type="note" align="" class="" width=""]You can download Python from https://www.python.org/download/. On this website, we can find installers for Windows and Mac OS X, as well as source archives for Linux, Unix, and Mac OS X.[/box]

The software we will install has binary installers for Windows, various Linux distributions, and Mac OS X. There are also source distributions, if you prefer that. You need to have Python 2.4.x or above installed on your system. Python 2.7.x is currently the best Python version to have because most scientific Python libraries support it. Python 2.7 will be supported and maintained until 2020. After that, we will have to switch to Python 3.

Installing software and setup on Windows

Installing on Windows is, fortunately, a straightforward task that we will cover in detail. You only need to download an installer, and a wizard will guide you through the installation steps. We will give the steps to install NumPy here. The steps to install the other libraries are similar.

The actions we will take are as follows:

Download installers for Windows from the SourceForge website (refer to the following table). The latest release versions may change, so just choose the one that fits your setup best.

Library      URL                                                  Latest version
NumPy        http://sourceforge.net/projects/numpy/files/         1.8.1
SciPy        http://sourceforge.net/projects/scipy/files/         0.14.0
matplotlib   http://sourceforge.net/projects/matplotlib/files/    1.3.1
IPython      http://archive.ipython.org/release/                  2.0.0

Choose the appropriate version. In this example, we chose numpy-1.8.1-win32-superpack-python2.7.exe.

Open the EXE installer by double-clicking on it.

Now, we can see a description of NumPy and its features. Click on the Next button. If you have Python installed, it should automatically be detected. If it is not detected, maybe your path settings are wrong. Click on the Next button if Python is found; otherwise, click on the Cancel button and install Python (NumPy cannot be installed without Python).

Click on the Next button. This is the point of no return. Well, kind of, but it is best to make sure that you are installing to the proper directory, and so on. Now the real installation starts.
This may take a while.

[box type="note" align="" class="" width=""]The situation around installers is rapidly evolving. Other alternatives exist in various stages of maturity (see https://www.scipy.org/install.html). It might be necessary to put the msvcp71.dll file in your C:\Windows\system32 directory. You can get it from http://www.dll-files.com/dllindex/dll-files.shtml?msvcp71.[/box]

Installing software and setup on Linux

Installing the recommended software on Linux depends on the distribution you have. We will discuss how you would install NumPy from the command line, although you could probably use graphical installers; it depends on your distribution (distro). The commands to install matplotlib, SciPy, and IPython are the same - only the package names are different. Installing matplotlib, SciPy, and IPython is recommended, but optional.

Most Linux distributions have NumPy packages. We will go through the necessary steps for some of the popular Linux distros:

Run the following instruction from the command line to install NumPy on Red Hat:

$ yum install python-numpy

To install NumPy on Mandriva, run the following command-line instruction:

$ urpmi python-numpy

To install NumPy on Gentoo, run the following command-line instruction:

$ sudo emerge numpy

To install NumPy on Debian or Ubuntu, we need to type the following:

$ sudo apt-get install python-numpy

The following table gives an overview of the Linux distributions and corresponding package names for NumPy, SciPy, matplotlib, and IPython.

Linux distribution   NumPy                              SciPy          matplotlib          IPython
Arch Linux           python-numpy                       python-scipy   python-matplotlib   ipython
Debian               python-numpy                       python-scipy   python-matplotlib   ipython
Fedora               numpy                              python-scipy   python-matplotlib   ipython
Gentoo               dev-python/numpy                   scipy          matplotlib          ipython
OpenSUSE             python-numpy, python-numpy-devel   python-scipy   python-matplotlib   ipython
Slackware            numpy                              scipy          matplotlib          ipython

Installing software and setup on Mac OS X

You can install NumPy, matplotlib, and SciPy on the Mac with a graphical installer or from the command line with a port manager such as MacPorts, depending on your preference. A prerequisite is to install Xcode, as it is not part of OS X releases.

We will install NumPy with a GUI installer using the following steps:

We can get a NumPy installer from the SourceForge website http://sourceforge.net/projects/numpy/files/. Similar files exist for matplotlib and SciPy. Just change numpy in the previous URL to scipy or matplotlib. IPython didn't have a GUI installer at the time of writing.

Download the appropriate DMG file; usually the latest one is the best. Another alternative is the SciPy Superpack (https://github.com/fonnesbeck/ScipySuperpack). Whichever option you choose, it is important to make sure that updates which impact the system Python library don't negatively influence already installed software, by not building against the Python library provided by Apple.

Open the DMG file (in this example, numpy-1.8.1-py2.7-python.org-macosx10.6.dmg).

Double-click on the icon of the opened box, the one having a subscript that ends with .mpkg. We will be presented with the welcome screen of the installer.

Click on the Continue button to go to the Read Me screen, where we will be presented with a short description of NumPy.

Click on the Continue button to go to the License screen.

Read the license, click on the Continue button, and then click on the Accept button when prompted to accept the license. Continue through the next screens and click on the Finish button at the end.
Alternatively, we can install NumPy, SciPy, matplotlib, and IPython through the MacPorts route, or with Fink or Homebrew. The following installation step installs all of these packages.

[box type="info" align="" class="" width=""]For installing with MacPorts, type the following command:

sudo port install py-numpy py-scipy py-matplotlib py-ipython
[/box]

Installing with setuptools

If you have pip, you can install NumPy, SciPy, matplotlib, and IPython with the following commands:

pip install numpy
pip install scipy
pip install matplotlib
pip install ipython

It may be necessary to prepend sudo to these commands if your current user doesn't have sufficient rights on your system.

Summary

In this article, we installed NumPy, SciPy, matplotlib, and IPython on Windows, Mac OS X, and Linux; a short sanity-check script follows at the end of this article.

Resources for Article:

Further resources on this subject:

Plotting Charts with Images and Maps
Importing Dynamic Data
Python 3: Designing a Tasklist Application
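As mentioned in the summary, once the packages are installed (by whichever route), a quick sanity check is to confirm that each one imports cleanly and report its version. This is a minimal sketch; the versions printed will, of course, depend on what you installed:

# Sanity check: confirm each package imports and report its version.
import numpy
import scipy
import matplotlib
import IPython

print("NumPy:", numpy.__version__)
print("SciPy:", scipy.__version__)
print("matplotlib:", matplotlib.__version__)
print("IPython:", IPython.__version__)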

Different types of NoSQL databases and when to use them

Richard Gall
10 Sep 2019
8 min read
Why NoSQL databases?

The popularity of NoSQL databases over the last decade or so has been driven by an explosion of data. Before what's commonly described as 'the big data revolution', relational databases were the norm - these are databases that contain structured data. Structured data can only be structured if it is based on an existing schema that defines the relationships (hence relational) between the data inside the database. However, with the vast quantities of data that are now available to just about every business with an internet connection, relational databases simply aren't equipped to handle the complexity and scale of large datasets.

Why not SQL databases?

This is for a couple of reasons. Not only will the defined schemas that are a necessary component of every relational database undermine the richness and integrity of the data you're working with; relational databases are also hard to scale. Relational databases can only scale vertically, not horizontally. That's fine to a certain extent, but when you start getting high volumes of data - such as when millions of people use a web application, for example - things get really slow and you need more processing power. You can do this by upgrading your hardware, but that isn't really sustainable. By scaling out, as you can with NoSQL databases, you can use a distributed network of computers to handle data. That gives you more speed and more flexibility.

This isn't to say that relational and SQL databases have had their day. They still fulfil many use cases. The difference is that NoSQL offers far greater power and control for data-intensive use cases. Indeed, using a NoSQL database when SQL will do is only going to add more complexity to something that just doesn't need it.

Seven NoSQL Databases in a Week

Different types of NoSQL databases and when to use them

So, now that we've looked at why NoSQL databases have grown in popularity in recent years, let's dig into some of the different options available. There are a huge number of NoSQL databases out there - some of them open source, some premium products - many of them built for very different purposes. Broadly speaking, there are four different models of NoSQL databases:

Key-value pair-based databases
Column-based databases
Document-oriented databases
Graph databases

Let's take a look at these four models, how they're different from one another, and some examples of the product options in each.

Key-value pair-based NoSQL database management systems

Key-value pair-based NoSQL databases store data in, as you might expect, pairs of keys and values. Data is stored with a matching key - keys have no relation or structure (so, keys could be height, age, or hair color, for example). A short sketch at the end of this article shows the model in action.

When should you use a key-value pair-based NoSQL DBMS?

Key-value pair-based NoSQL databases are the most basic type of NoSQL database. They're useful for storing fairly basic information, like details about a customer.

Which key-value pair-based DBMS should you use?

There are a number of different key-value pair databases. The most popular is Redis. Redis is incredibly fast and very flexible in terms of the languages and tools it can be used with. It can be used for a wide variety of purposes - one of the reasons high-profile organizations, including Verizon, Atlassian, and Samsung, use it. It's also open source, with enterprise options available for users with significant requirements.

Redis 4.x Cookbook

Other than Redis, other options include Memcached and Ehcache.
As well as those, there are a number of other multi-model options (which will crop up later, no doubt) such as Amazon DynamoDB, Microsoft's Cosmos DB, and OrientDB.

Hands-On Amazon DynamoDB for Developers [Video]
RDS PostgreSQL and DynamoDB CRUD: AWS with Python and Boto3 [Video]

Column-based NoSQL database management systems

Column-based databases separate data into discrete columns. Instead of using rows - whereby the row ID is the main key - column-based database systems flip things around to make the data the main key. By using columns you can gain much greater speed when querying data. Although it's true that querying a whole row of data would take longer in a column-based DBMS, the use cases for column-based databases mean you probably won't be doing this. Instead you'll be querying a specific part of the data rather than the whole row.

When should you use a column-based NoSQL DBMS?

Column-based systems are most appropriate for big data and instances where data is relatively simple and consistent (they don't handle volatility particularly well).

Which column-based NoSQL DBMS should you use?

The most popular column-based DBMS is Cassandra. The software prides itself on its performance, boasting 100% availability thanks to the absence of a single point of failure, and offering impressive scalability at a good price. Cassandra's popularity speaks for itself - Cassandra is used by 40% of the Fortune 100.

Mastering Apache Cassandra 3.x - Third Edition
Learn Apache Cassandra in Just 2 Hours [Video]

There are other options available, such as HBase and Cosmos DB.

HBase High Performance Cookbook

Document-oriented NoSQL database management systems

Document-oriented NoSQL systems are very similar to key-value pair database management systems. The only difference is that the value that is paired with a key is stored as a document. Each document is self-contained, which means no schema is required - giving a significant degree of flexibility over the data you have. For software developers, this is essential - it's for this reason that document-oriented databases such as MongoDB and CouchDB are useful components of the full-stack development toolchain.

Some search platforms, such as Elasticsearch, use mechanisms similar to standard document-oriented systems - so they could be considered part of the same family of database management systems.

When should you use a document-oriented DBMS?

Document-oriented databases can help power many different types of websites and applications - from stores to content systems. However, the flexibility of document-oriented systems means they are not built for complex queries.

Which document-oriented DBMS should you use?

The leader in this space is MongoDB. With an amazing 40 million downloads (and apparently 30,000 more every single day), it's clear that MongoDB is a cornerstone of the NoSQL database revolution.

MongoDB 4 Quick Start Guide
MongoDB Administrator's Guide
MongoDB Cookbook - Second Edition

There are other options as well as MongoDB - these include CouchDB, Couchbase, DynamoDB, and Cosmos DB.

Learning Azure Cosmos DB
Guide to NoSQL with Azure Cosmos DB

Graph-based NoSQL database management systems

The final type of NoSQL database is graph-based. The notable distinction of graph-based NoSQL databases is that they contain the relationships between different data. Subsequently, graph databases look quite different to any of the other databases above - they store data as nodes, with the 'edges' of the nodes describing their relationship to other nodes.
Graph databases, compared to relational databases, are multidimensional in nature. They display not just basic relationships between tables and data, but more complex and multifaceted ones.

When should you use a graph database?

Because graph databases contain the relationships between a set of data (customers, products, price, and so on), they can be used to build and model networks. This makes graph databases extremely useful for applications ranging from fraud detection to smart homes to search.

Which graph database should you use?

The world's most popular graph database is Neo4j. It's purpose-built for datasets that contain strong relationships and connections. Widely used in industry at companies such as eBay and Walmart, it has established its reputation as one of the world's best NoSQL database products. Back in 2015, Packt's Data Scientist demonstrated how he used Neo4j to build a graph application.

Learning Neo4j 3.x [Video]
Exploring Graph Algorithms with Neo4j [Video]

NoSQL databases are the future - but know when to use the right one for the job

Although NoSQL databases will remain a fixture in the engineering world, SQL databases will always be around. This is an important point - when it comes to databases, using the right tool for the job is essential. It's a valuable exercise to explore a range of options and get to know how they work - sometimes the difference might just be a personal preference about usability. And that's fine - you need to be productive, after all. But what's ultimately most essential is having a clear sense of what you're trying to accomplish, and choosing the database based on your fundamental needs.
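To make the key-value model from earlier in this article concrete, here is a minimal sketch using the redis-py client. It assumes a Redis server is running locally on the default port, and the keys and values are purely illustrative:

# Minimal key-value example with redis-py (pip install redis).
# Assumes a local Redis server on the default port 6379.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Store a few simple attributes about a customer as key-value pairs.
r.set("customer:1:name", "Mary")
r.set("customer:1:hair_color", "brown")

print(r.get("customer:1:name"))        # Mary
print(r.get("customer:1:hair_color"))  # brown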

Top 10 deep learning frameworks

Amey Varangaonkar
25 May 2017
9 min read
Deep learning frameworks are powering the artificial intelligence revolution. Without them, it would be almost impossible for data scientists to deliver the level of sophistication in their deep learning algorithms that advances in computing and processing power have made possible.

Put simply, deep learning frameworks make it easier to build deep learning algorithms of considerable complexity. This follows a wider trend that you can see in other fields of programming and software engineering; open source communities are continually developing new tools that simplify difficult tasks and minimize arduous ones. The deep learning framework you choose to use is ultimately down to what you're trying to do and how you work already. But to get you started, here is a list of 10 of the best and most popular deep learning frameworks being used today.

What are the best deep learning frameworks?

TensorFlow

One of the most popular deep learning libraries out there, TensorFlow was developed by the Google Brain team and open-sourced in 2015. Positioned as a 'second-generation machine learning system', TensorFlow is a Python-based library capable of running on multiple CPUs and GPUs. It is available on all platforms, desktop and mobile. It also has support for other languages such as C++ and R, and can be used directly to create deep learning models, or by using wrapper libraries (for example, Keras) on top of it.

In November 2017, TensorFlow announced a developer preview of TensorFlow Lite, a lightweight machine learning solution for mobile and embedded devices. The machine learning paradigm is continuously evolving - and the focus is now slowly shifting towards developing machine learning models that run on mobile and portable devices, in order to make applications smarter and more intelligent. Learn how to build a neural network with TensorFlow.

If you're just starting out with deep learning, TensorFlow is THE go-to framework. It's Python-based, backed by Google, has very good documentation, and there are tons of tutorials and videos available on the internet to guide you. You can check out Packt's TensorFlow catalog here.

Keras

Although TensorFlow is a very good deep learning library, creating models using only TensorFlow can be a challenge, as it is a pretty low-level library and can be quite complex for a beginner to use. To tackle this challenge, Keras was built as a simplified interface for building efficient neural networks in just a few lines of code, and it can be configured to work on top of TensorFlow. Written in Python, Keras is very lightweight, easy to use, and pretty straightforward to learn. Because of this, TensorFlow has incorporated Keras as part of its core API. (For a sense of how little code a Keras model takes, see the short sketch at the end of this article.) Despite being a relatively new library, Keras has very good documentation in place. If you want to know more about how Keras solves your deep learning problems, this interview with our best-selling author Sujit Pal should help you.

Read now: Why you should use Keras for deep learning

[box type="success" align="" class="" width=""]If you have some knowledge of Python programming and want to get started with deep learning, this is one library you definitely want to check out![/box]

Caffe

Built with expression, speed, and modularity in mind, Caffe is one of the first deep learning libraries, developed mainly by the Berkeley Vision and Learning Center (BVLC). It is a C++ library which also has a Python interface, and finds its primary application in modeling convolutional neural networks.
One of the major benefits of using this library is that you can get a number of pre-trained networks directly from the Caffe Model Zoo, available for immediate use. If you're interested in modeling CNNs or solving image processing problems, you might want to consider this library.

Following in the footsteps of Caffe, Facebook also recently open-sourced Caffe2, a new lightweight, modular deep learning framework which offers greater flexibility for building high-performance deep learning models.

Torch

Torch is a Lua-based deep learning framework and has been used and developed by big players such as Facebook, Twitter, and Google. It makes use of C/C++ libraries as well as CUDA for GPU processing. Torch was built with an aim to achieve maximum flexibility and make the process of building your models extremely simple. More recently, the Python implementation of Torch, called PyTorch, has found popularity and is gaining rapid adoption.

PyTorch

PyTorch is a Python package for building deep neural networks and performing complex tensor computations. While Torch uses Lua, PyTorch leverages the rising popularity of Python to allow anyone with some basic Python programming knowledge to get started with deep learning. PyTorch improves upon Torch's architectural style and does not have any support for containers - which makes the entire deep modeling process easier and more transparent to you. Still wondering how PyTorch and Torch are different from each other? Make sure you check out this interesting post on Quora.

Deeplearning4j

Deeplearning4j (or DL4J) is a popular deep learning framework developed in Java, and it supports other JVM languages as well. It is very slick and is widely used as a commercial, industry-focused distributed deep learning platform. The advantage of using DL4J is that you can bring together the power of the whole Java ecosystem to perform efficient deep learning, as it can be implemented on top of popular big data tools such as Apache Hadoop and Apache Spark.

[box type="success" align="" class="" width=""]If Java is your programming language of choice, then you should definitely check out this framework. It is clean, enterprise-ready, and highly effective. If you're planning to deploy your deep learning models to production, this tool can certainly be of great worth![/box]

MXNet

MXNet supports a wider range of languages than most deep learning frameworks, with support for languages such as R, Python, C++, and Julia. This is helpful because if you know any of these languages, you won't need to step out of your comfort zone at all to train your deep learning models. Its backend is written in C++ and CUDA, and it is able to manage its own memory like Theano. MXNet is also popular because it scales very well and is able to work with multiple GPUs and computers, which makes it very useful for enterprises. This is also one of the reasons why Amazon made MXNet its reference library for deep learning. In November, AWS announced the availability of ONNX-MXNet, an open source Python package to import ONNX (Open Neural Network Exchange) deep learning models into Apache MXNet.

Read why MXNet is a versatile deep learning framework here.

Microsoft Cognitive Toolkit

Microsoft Cognitive Toolkit, previously known by its acronym CNTK, is an open-source deep learning toolkit to train deep learning models. It is highly optimized and has support for languages such as Python and C++.
Known for its efficient resource utilization, the Cognitive Toolkit lets you easily implement efficient reinforcement learning models or generative adversarial networks (GANs). It is designed to achieve high scalability and performance, and is known to provide performance gains compared to toolkits like Theano and TensorFlow when running on multiple machines. Here is a fun comparison of TensorFlow versus CNTK, if you would like to know more.

deeplearn.js

Gone are the days when you required serious hardware to run your complex machine learning models. With deeplearn.js, you can now train neural network models right in your browser! Originally developed by the Google Brain team, deeplearn.js is an open-source, JavaScript-based deep learning library which runs on both WebGL 1.0 and WebGL 2.0. deeplearn.js is being used today for a variety of purposes - from education and research to training high-performance deep learning models. You can also run your pre-trained models in the browser using this library.

BigDL

BigDL is a distributed deep learning library for Apache Spark, and it is designed to scale very well. With the help of BigDL, you can run your deep learning applications directly on Spark or Hadoop clusters by writing them as Spark programs. It has rich deep learning support and uses Intel's Math Kernel Library (MKL) to ensure high performance. Using BigDL, you can also load your pre-trained Torch or Caffe models into Spark. If you want to add deep learning functionality to a massive set of data stored on your cluster, this is a very good library to use.

[box type="shadow" align="" class="" width=""]Editor's Note: We have removed Theano and Lasagne from the original list due to the Theano retirement announcement.

RIP Theano!

Before TensorFlow, Caffe, or PyTorch came to be, Theano was the most widely used library for deep learning. While it was a low-level library supporting CPU as well as GPU computations, you could wrap it with libraries like Keras to simplify the deep learning process. With the release of version 1.0, it was announced that future development and support for Theano would be stopped. There would be minimal maintenance to keep it working for the next one year, after which even the support activities on the library would be suspended completely. "Supporting Theano is no longer the best way we can enable the emergence and application of novel research ideas", said Prof. Yoshua Bengio, one of the main developers of Theano. Thank you Theano, you will be missed!

Goodbye Lasagne

Lasagne is a high-level deep learning library that runs on top of Theano. It has been around for quite some time now and was developed with the aim of abstracting the complexities of Theano and providing a more friendly interface to users for building and training neural networks. It requires Python and has many similarities to Keras, which we just saw above. However, if we are to find differences between the two, Keras is faster and has better documentation in place.[/box]

There are many other deep learning libraries and frameworks available for use today - DSSTNE, Apache Singa, and Veles are just a few worth an honorable mention. Which deep learning framework will best suit your needs? Ultimately, it depends on a number of factors. If you want to get started with deep learning, your safest bet would be to use a Python-based framework like TensorFlow, which is quite popular.
For seasoned professionals, the efficiency of the trained model, ease of use, speed, and resource utilization are all important considerations when choosing the best deep learning framework.
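As a postscript to the Keras section above, here is a minimal sketch of just how few lines a working Keras model definition takes; the layer sizes and input dimension are illustrative, not prescriptive:

# A tiny fully connected network in Keras (bundled with TensorFlow).
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(16,)),  # hidden layer
    keras.layers.Dense(10, activation="softmax"),                  # 10-class output
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()  # the model is now ready for model.fit(features, labels)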

Airflow Ops Best Practices: Observation and Monitoring

Dylan Intorf, Kendrick van Doorn, Dylan Storey
12 Nov 2024
15 min read
This article is an excerpt from the book, "Apache Airflow Best Practices", by Dylan Intorf, Kendrick van Doorn, and Dylan Storey. With a practical approach and detailed examples, this book covers the newest features of Apache Airflow 2.x and its potential for workflow orchestration, operational best practices, and data engineering.

Introduction

In this article, we will continue to explore the application of modern "ops" practices within Apache Airflow, focusing on the observation and monitoring of your systems and DAGs after they've been deployed.

We'll divide this observation into two segments - the core Airflow system and individual DAGs. Each segment will cover specific metrics and measurements you should be monitoring for alerting and potential intervention.

When we discuss monitoring in this section, we will consider two types of monitoring - active and suppressive.

In an active monitoring scenario, a process will actively check a service's health state, recording its state and potentially taking action directly on the return value.

In a suppressive monitoring scenario, the absence of a state (or state change) is usually meaningful. In these scenarios, the monitored application sends an active signal to a process to inform it that it is OK, usually suppressing an action (such as an alert) from occurring.

This chapter covers the following topics:

Monitoring core Airflow components
Monitoring your DAGs

Technical requirements

By now, we expect you to have a good understanding of Airflow and its core components, along with functional knowledge in the deployment and operation of Airflow and Airflow DAGs.

We will not be covering specific observability aggregators or telemetry tools; instead, we will focus on the activities you should be keeping an eye on. We strongly recommend that you work closely with your ops teams to understand what tools exist in your stack and how to configure them for capturing and alerting on your deployments.

Monitoring core Airflow components

All of the components we will discuss here are critical to ensuring a functioning Airflow deployment. Generally, all of them should be monitored with a bare minimum check of "Is it on?", and if a component is not, an alert should surface to your team for investigation. The easiest way to check this is to query the REST API on the web server at `/health/`; this will return a JSON object that can be parsed to determine whether components are healthy and, if not, when they were last seen.

Scheduler

This component needs to be running and working effectively in order for tasks to be scheduled for execution.

When the scheduler service is started, it also starts a `/health` endpoint that can be checked by an external process with an active monitoring approach.

The returned signal does not always indicate that the scheduler is working properly, as its state is simply indicative that the service is up and running. There are many scenarios where the scheduler may be operating but unable to schedule jobs; as a result, many deployments will include a canary DAG with a single task, acting to suppress an external alert from going off.

Important metrics that Airflow exposes for you include the following:

scheduler.scheduler_loop_duration: This should be monitored to ensure that your scheduler is able to loop and schedule tasks for execution.
As this metric increases, you will see tasks beginning to schedule more slowly, to the point where you may begin missing SLAs because tasks fail to reach a schedulable state.

scheduler.tasks.starving: This indicates how many tasks cannot be scheduled because there are no slots available. Pools are a mechanism that Airflow uses to balance large numbers of submitted task executions against a finite amount of execution throughput. It is likely that this number will not be zero, but if it stays high for extended periods of time, it may point to an issue in how DAGs are being written to schedule work.

scheduler.tasks.executable: This indicates how many tasks are ready for execution (i.e., queued). This number will sometimes not be zero, and that is OK, but if the number increases and stays high for extended periods of time, it indicates that you may need additional compute resources to handle the load. Look at your executor to increase the number of workers it can run.

Metadata database

The metadata database is used to store and track all of the metadata for your Airflow deployment's previous DAG/task executions, along with information about your environment's roles and permissions. Losing data from this database can interrupt normal operations and cause unintended consequences, with DAG runs being repeated.

While critical, because it is architecturally ubiquitous, the database is also the component least likely to encounter issues - and if it does, those issues are absolutely catastrophic in nature.

We generally suggest you utilize a managed service for provisioning and operating your backing database, ensuring that a disaster recovery plan for your metadata database is in place at all times.

Some active areas to monitor on your database include the following:

Connection pool size/usage: Monitor both the connection pool size and usage over time to ensure appropriate configuration, and identify potential bottlenecks or resource contention arising from Airflow components' concurrent connections.

Query performance: Measure query latency to detect inefficient queries or performance issues, while monitoring query throughput to ensure effective workload handling by the database.

Storage metrics: Monitor the disk space utilization of the metadata database to ensure that it has sufficient storage capacity. Set up alerts for low disk space conditions to prevent database outages due to storage constraints.

Backup status: Monitor the status of database backups to ensure that they are performed regularly and successfully. Verify backup integrity and retention policies to mitigate the risk of data loss if there is a database failure.

Triggerer

The triggerer instance manages all of the asynchronous operations of deferrable operators in a deferred state. As such, major operational concerns generally relate to ensuring that individual deferred operators don't make major blocking calls to the event loop. If this occurs, your deferrable tasks will not be able to check their state changes as frequently, and this will impact scheduling performance.

Important metrics that Airflow exposes for you include the following:

triggers.blocked_main_thread: The number of triggers that have blocked the main thread. This is a counter and should monotonically increase over time; pay attention to large differences between recordings (or quick acceleration) in counts, as that is indicative of a larger problem.

triggers.running: The number of triggers currently on a triggerer instance.
This metric should be monitored to determine whether you need to increase the number of triggerer instances you are running. While the official documentation claims that up to tens of thousands of triggers can run on an instance, the common operational number is much lower. Tune at your discretion, but depending on the complexity of your triggers, you may need to add a new instance for every few hundred consistent triggers you run.

Executors/workers

Depending on the executor you use, you will need to monitor your executors and workers a bit differently.

The Kubernetes executor utilizes the Kubernetes API to schedule tasks for execution; as such, you should utilize the Kubernetes events and metrics servers to gather logs and metrics for your task instances. Common metrics to collect for an individual task are CPU and memory usage. This is crucial for tuning requests or mutating individual task resource requests to ensure that they execute safely.

The Celery worker has additional components and long-lived processes that you need to metricize. You should monitor an individual Celery worker's memory and CPU utilization to ensure that it is not over- or under-provisioned, tuning allocated resources accordingly. You also need to monitor the message broker (usually Redis or RabbitMQ) to ensure that it is appropriately sized. Finally, it is critical to measure the queue length of your message broker and ensure that too much "back pressure" isn't being created in the system. If you find that your tasks are sitting in a queued state for a long period of time and the queue length is consistently growing, it's a sign that you should start an additional Celery worker to execute the scheduled tasks. You should also investigate the native Celery monitoring tool Flower (https://flower.readthedocs.io/en/latest/) for additional, more nuanced methods of monitoring.

Web server

The Airflow web server is the UI for not just your Airflow deployment but also the RESTful interface. Especially if you happen to be controlling Airflow scheduling behavior with API calls, you should keep an eye on the following metrics:

Response time: Measure the time taken for the API to respond to requests. This metric indicates the overall performance of the API and can help identify potential bottlenecks.

Error rate: Monitor the rate of errors returned by the API, such as 4xx and 5xx HTTP status codes. High error rates may indicate issues with the API implementation or underlying systems.

Request rate: Track the rate of incoming requests to the API over time. Sudden spikes or drops in request rates can impact performance and indicate changes in usage patterns.

System resource utilization: Monitor resource utilization metrics such as CPU, memory, disk I/O, and network bandwidth on the servers hosting the API. High resource utilization can indicate potential performance bottlenecks or capacity limits.

Throughput: Measure the number of successful requests processed by the API per unit of time. Throughput metrics provide insights into the API's capacity to handle incoming traffic.

Now that you have some basic metrics to collect from your core architectural components and can monitor the overall health of the application, we need to monitor the actual DAGs themselves to ensure that they function as intended.

Monitoring your DAGs

There are multiple aspects to monitoring your DAGs, and while they're all valuable, they may not all be necessary.
Now that you have some basic metrics to collect from your core architectural components and can monitor the overall health of the application, we need to monitor the actual DAGs themselves to ensure that they function as intended.

Monitoring your DAGs
There are multiple aspects to monitoring your DAGs, and while they are all valuable, they may not all be necessary. Take care to ensure that your monitoring and alerting stack matches your organizational needs with regard to operational parameters for resiliency and, if there is a failure, recovery times. No matter how much or how little you choose to implement, knowing that your DAGs work, and if and how they fail, is the first step in fixing the problems that will arise.

Logging
Airflow writes logs for tasks in a hierarchical structure that allows you to see each task's logs in the Airflow UI. The community also provides a number of providers for using other services as backing log storage and retrieval. A complete list of supported providers is available at https://airflow.apache.org/docs/apache-airflow-providers/core-extensions/logging.html.
Airflow uses the standard Python logging framework to write logs. If you're writing custom operators or executing Python functions with a PythonOperator, just make sure that you instantiate a Python logger instance, and the associated methods will handle everything for you.
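For instance, here is a minimal sketch of a function run by a PythonOperator that logs through the standard framework; the task logic and record list are purely illustrative:

# A sketch of task code run via PythonOperator, logging through the
# standard Python logging framework so messages land in the task's log.
import logging

logger = logging.getLogger(__name__)

def transform_records(**context):
    # Hypothetical work; the record list is a stand-in for real data.
    records = ["a", "b", "c"]
    logger.info("Transforming %d records", len(records))
    logger.warning("Warnings appear in the same task log in the UI")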
Alerting
Airflow provides mechanisms for alerting on operational aspects of your executing workloads that can be configured within your DAG; a short sketch combining the pieces follows the list below:
- Email notifications: Emails can be sent when a task enters a failed or retry state by setting the email_on_failure or email_on_retry argument, respectively. These arguments can be provided to all tasks in a DAG through the default_args keyword of the DAG, or to individual tasks by setting the keyword argument on each one.
- Callbacks: Callbacks are special actions that are executed when a specific state change occurs. Generally, these callbacks should be thoughtfully leveraged to send alerts that are critical operationally:
  - on_success_callback: This callback is executed at both the task and DAG levels when entering a successful state. Unless it is critical that you know whether something succeeds, we generally suggest not using this for alerting.
  - on_failure_callback: This callback is invoked when a task enters a failed state. Generally, this callback should always be set and, in critical scenarios, should alert on failures that require intervention and support.
  - on_execute_callback: This is invoked right before a task executes and only exists at the task level. Use it sparingly for alerting, as it can quickly become a noisy alert when overused.
  - on_retry_callback: This is invoked when a task is placed in a retry state. This is another callback to be cautious about as an alert, as it can become noisy and cause false alarms.
  - sla_miss_callback: This is invoked when a DAG misses its defined SLA. This callback is only executed at the end of a DAG's execution cycle, so it tends to be a very reactive notification that something has gone wrong.
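Here is that minimal sketch; the DAG ID, task, and on-call address are invented for illustration, and email delivery additionally requires SMTP to be configured for your deployment:

# A sketch of failure alerting via default_args; dag_id, task, and the
# email address are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

def notify_failure(context):
    # The context includes the failing task instance and the exception.
    ti = context["task_instance"]
    print(f"Task {ti.task_id} in DAG {ti.dag_id} failed; paging on-call.")

default_args = {
    "email": ["oncall@example.com"],  # assumed address
    "email_on_failure": True,
    "email_on_retry": False,
    "on_failure_callback": notify_failure,
}

with DAG(
    dag_id="example_alerting_dag",  # hypothetical
    start_date=datetime(2024, 1, 1),
    schedule=None,  # Airflow 2.4+ spelling; use schedule_interval on older versions
    default_args=default_args,
) as dag:
    BashOperator(task_id="always_fails", bash_command="exit 1")  # triggers the callback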
SLA monitoring
As awesome a tool as Airflow is, it is well known in the community that SLAs, while largely functional, have some unfortunate implementation details that can make them problematic at best; they are generally regarded as a broken feature in Airflow. If you require SLA monitoring for your workflows, we suggest deploying a cron job monitoring tool such as healthchecks (https://github.com/healthchecks/healthchecks), which allows you to create suppressive alerts for your services through its REST API. By pairing this third-party service with either HTTP operators or simple requests from callbacks, you can ensure that your most critical workflows achieve dynamic and resilient SLA alerting.
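A minimal sketch of that pattern follows, assuming a healthchecks check has already been created; the ping URL is a placeholder, and the function is intended to be attached as an on_success_callback so that a missed ping raises the alert:

# A sketch of dead-man's-switch SLA alerting: ping healthchecks on success,
# and let a missing ping trigger the alert. The UUID is a placeholder.
import requests

HEALTHCHECK_URL = "https://hc-ping.com/<your-check-uuid>"

def ping_healthcheck(context):
    try:
        # Signal that the run completed; healthchecks alerts if this
        # ping does not arrive within the check's grace period.
        requests.get(HEALTHCHECK_URL, timeout=10)
    except requests.RequestException:
        # Never fail the workflow because the monitoring ping failed.
        pass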
Performance profiling
The Airflow UI is a great tool for profiling the performance of individual DAGs:
- The Gantt chart view: This is a great visualization for understanding the amount of time spent on individual tasks and the relative order of execution. If you're worried about bottlenecks in your workflow, start here.
- Task duration: This allows you to profile the run characteristics of tasks within your DAG over a historical period. This tool is great at helping you understand temporal patterns in execution time and find outliers in execution. Especially if you find that a DAG slows down over time, this view can help you understand whether it is a systemic issue and which tasks might need additional development.
- Landing times: This shows the delta between task completion and the start of the DAG run. This is an unintuitive but powerful metric: increases in it, when paired with stable task durations in upstream tasks, can help identify whether a scheduler is under heavy load and may need tuning.
Additional metrics that have proven to be useful (but may need to be calculated) include the following:
- Task startup time: This is an especially useful metric when operating with a Kubernetes executor. To calculate it, take the difference between start_date and execution_date on each task instance (a query sketch follows this list). This metric will especially help you identify bottlenecks outside of Airflow that may impact task run times.
- Task failure and retry counts: Monitoring the frequency of task failures and retries can reveal information about the stability and robustness of your environment. Especially if these failures can be linked back to patterns in time or execution, this can help debug interactions with other services.
- DAG parsing time: Monitoring how long a DAG takes to parse is very important for understanding scheduler load and bottlenecks. If an individual DAG takes a long time to load (either due to heavy imports or long blocking calls executed during parsing), it can have a material impact on the timeliness of scheduling tasks.
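A rough sketch of that startup-time calculation against the metadata database follows, assuming Airflow 2.x models; the dag_id is hypothetical, and the query is illustrative rather than production-ready:

# A sketch of computing task startup latency (start_date - execution_date)
# from the metadata database; the dag_id is a hypothetical example.
from airflow.models import DagRun, TaskInstance
from airflow.utils.session import provide_session

@provide_session
def task_startup_latencies(dag_id="example_alerting_dag", session=None):
    rows = (
        session.query(TaskInstance, DagRun)
        .join(
            DagRun,
            (TaskInstance.dag_id == DagRun.dag_id)
            & (TaskInstance.run_id == DagRun.run_id),
        )
        .filter(TaskInstance.dag_id == dag_id, TaskInstance.start_date.isnot(None))
        .all()
    )
    # Larger deltas point to scheduling or pod-startup bottlenecks.
    return [(ti.task_id, ti.start_date - dr.execution_date) for ti, dr in rows]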
Conclusion
In this article, we covered some essential strategies to effectively monitor both the core Airflow system and individual DAGs post-deployment. We highlighted the importance of active and suppressive monitoring techniques and provided insights into the critical metrics to track for each component, including the scheduler, metadata database, triggerer, executors/workers, and web server. Additionally, we discussed logging, alerting mechanisms, SLA monitoring, and performance profiling techniques to ensure the reliability, scalability, and efficiency of Airflow workflows. By implementing these monitoring practices and leveraging the insights gained, operators can proactively manage and optimize their Airflow deployments for optimal performance and reliability.

Author Bio
Dylan Intorf is a solutions architect and data engineer with a BS in Computer Science from Arizona State University. He has 10+ years of experience in the software and data engineering space, delivering custom-tailored solutions to the tech, financial, and insurance industries.
Kendrick van Doorn is an engineering and business leader with a background in software development and over 10 years of developing tech and data strategies at Fortune 100 companies. In his spare time, he enjoys taking classes at different universities and is currently an MBA candidate at Columbia University.
Dylan Storey has a B.Sc. and M.Sc. from California State University, Fresno in Biology and a Ph.D. from the University of Tennessee, Knoxville in Life Sciences, where he leveraged computational methods to study a variety of biological systems. He has over 15 years of experience in building, growing, and leading teams, and in solving problems while developing and operating data products at a variety of scales and industries.

article-image-data-cleaning-worst-part-of-data-analysis
Amey Varangaonkar
04 Jun 2018
5 min read

Data cleaning is the worst part of data analysis, say data scientists

The year was 2012. Harvard Business Review had famously declared the role of data scientist the 'sexiest job of the 21st century'. Companies were slowly working with more data than ever before, and the real, actionable value of that data for commercial purposes was slowly beginning to surface. Someone who could derive these actionable insights from the data was needed, and the demand for data scientists was higher than ever. Fast forward to 2018: more data has been collected in the last two years than ever before. Data scientists are still in high demand, and the need for insights is higher than ever. There has been one significant change, though: the process of deriving insights has become more complex. If you ask data scientists, the initial phase of this process, which involves data cleansing, has become a lot more cumbersome. So much so that it is no longer a myth that data scientists spend almost 80% of their time cleaning and readying the data for analysis.

Why data cleaning is a nightmare
In the recently conducted Packt Skill-Up survey, we asked data professionals what the worst part of the data analysis process was, and a staggering 50% responded with data cleaning.
[Chart] Source: Packt Skill-Up Survey
We dived deep into this and tried to understand why many data science professionals share this dislike of data cleaning, or scrubbing, as many call it. Read the Skill Up report in full: sign up to our weekly newsletter and download the PDF for free.

There is no consistent data format
Organizations these days work with a lot of data. Some of it is in a structured, readily understandable format; this kind of data is usually quite easy to clean, parse, and analyze. However, some of the data is really messy and cannot be used as is for analysis. This includes missing data, irregularly formatted data, and irrelevant data that is not worth analyzing at all. There is also the problem of working with unstructured data, which needs to be pre-processed to extract the data worth analyzing. Audio and video files, email messages, presentations, XML documents, and web pages are some classic examples.

There's too much data to be cleaned
The volume of data that businesses deal with on a day-to-day basis is on the scale of terabytes or even petabytes. Making sense of all this data, coming from a variety of sources and in different formats, is undoubtedly a huge task. A whole host of tools is designed to ease this process today, but it remains an incredibly tricky challenge to sift through large volumes of data and prepare it for analysis.

Data cleaning is tricky and time-consuming
Data cleansing can be quite an exhaustive and time-consuming task, especially for data scientists. Cleaning the data requires removing duplicates, removing or replacing missing entries, correcting misfielded values, ensuring consistent formatting, and a host of other tasks that take a considerable amount of time. Once the data is cleaned, it needs to be placed in a secure location, and a log of the entire process needs to be kept to ensure the right data goes through the right steps. All of this requires data scientists to create a well-designed data-scrubbing framework to avoid the risk of repetition. It is largely grunt work and requires a lot of manual effort. Sadly, there are no tools on the market that can effectively automate this process end to end.
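To make those individual steps concrete, here is a small sketch in pandas of the kinds of operations described above; the column names and fill/format rules are invented for illustration:

# A sketch of routine cleaning steps in pandas; the column names and
# the chosen fill/format rules are illustrative assumptions.
import pandas as pd

df = pd.DataFrame({
    "customer": ["Ann", "Ann", "bob", None],
    "signup_date": ["2018-01-03", "2018-01-03", "03/01/2018", "2018-02-10"],
    "spend": [120.0, 120.0, None, 75.5],
})

df = df.drop_duplicates()                               # remove duplicate rows
df = df.dropna(subset=["customer"])                     # drop rows missing a key field
df["spend"] = df["spend"].fillna(df["spend"].median())  # replace missing values
df["customer"] = df["customer"].str.title()             # enforce consistent formatting
# Coerce dates; entries that don't parse become NaT for manual review.
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
print(df)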
Outsourcing the process is expensive
Given that data cleaning is a rather tedious job, many businesses consider outsourcing the task to third-party vendors. While this saves a lot of time and effort on the company's end, it definitely increases the cost of the overall process. Many small and medium-scale businesses may not be able to afford this and are thus heavily reliant on their data scientists to do the job for them.

You can hate it, but you cannot ignore it
It is quite obvious that data scientists need clean, ready-to-analyze data if they are to extract actionable business insights from it. Some data scientists equate data cleaning to donkey work, suggesting there is not a lot of innovation involved in the process. Others believe data cleaning is rather important and pay special attention to it, since once it is done right, most of the problems in data analysis are solved. It is very difficult to take advantage of the intrinsic value offered by a dataset if it does not adhere to the quality standards set by the business, making data cleaning a crucial component of the data analysis process.
Now that you know why data cleaning is essential, why not dive deeper into the technicalities? Check out our book Practical Data Wrangling for expert tips on turning your noisy data into relevant, insight-ready information using R and Python.

Read more:
- Cleaning Data in PDF Files
- 30 common data science terms explained
- How to create a strong data science project portfolio that lands you a job