How-To Tutorials

Why decision trees are more flexible than linear models, explains Stephen Klosterman

Guest Contributor
11 Dec 2019
7 min read
This blog post examines a hypothetical dataset of website visits and customer conversion to illustrate how decision trees are a more flexible mathematical model than linear models such as logistic regression.

Imagine you are monitoring the webpage of one of your products. You are keeping track of how many times individual customers visit this page, the total amount of time they've spent on the page across all their visits, and whether or not they bought the product. Your goal is to predict, for future visitors, how likely they are to buy the product based on the page visit data. You are considering presenting a discount, or some other kind of offer, to customers you think are likely to buy the product but haven't yet.

If you are interested in building your knowledge to prepare data for regularized logistic regression and random forest algorithms, read our book Data Science Projects with Python, written by Stephen Klosterman. The book gives you practical guidance on industry-standard data analysis and machine learning tools in Python, with the help of realistic data, and shows how to use pandas and Matplotlib to critically examine a dataset with summary statistics and graphs and extract the insights you seek.

After logging the data on many customers, you visualize them and see the following (some jitter has been added to help show all the data points):

There are several interesting patterns visible here. We see that, in general, the longer someone spends on the page, the more likely they are to purchase the item. However, this effect seems to depend on the number of visits in a complex way. Someone who visited the page once and spent at least two minutes there (i.e. two minutes per visit) seems likely to buy, at least up until 18 or so minutes. But someone who visited 10 times seems likely to buy after only 12 minutes of cumulative time (1.2 minutes per visit). Additionally, there is a group of customers who spend a relatively long time (at least 18 or 19 minutes) over a relatively small number of visits (just one or two) and don't buy. Maybe they opened the page, walked away from their computer, and closed the page as soon as they came back. Whatever the reason, the patterns in this dataset are interesting and complicated.

If you want to create a predictive model of these data, you should consider the likely success of non-linear models, such as decision trees, versus linear models, such as logistic regression (for more information see chapter 3 of my book, Data Science Projects with Python).

Logistic regression as a linear model

At a high level, linear models take the feature space (the two-dimensional space where time is on the x-axis and number of visits is on the y-axis, as in the graph above) and seek to draw a straight line somewhere that creates an accurate division of the two classes of the response variable ("Bought" or "Did not buy"). Consider how likely this is to work. Where would you draw a straight line on the graph above so that the two regions on either side of the line would primarily contain responses of only one class? It should be apparent that this is not likely to be an entirely successful task.
The best you could probably do would be to draw a line that isolates the non-buying customers who spent relatively little time on the page, represented by the region of dots to the left of the graph, from the blue dots representing buying customers to the right. While this would ignore the little group of customers in the lower right, it's probably the best you could do overall using the straight-line approach. In fact, this is essentially what a logistic regression classifier looks like when the model is calibrated to these data.

The above graph shows the regions of prediction ("Unlikely to buy" and "Likely to buy") as red or blue shading in the background. Deeper colors indicate a higher likelihood for either class. The conceptual straight-line decision boundary that divides the two regions runs right through the white portion of the background, where the predicted probability of belonging to either class is low; in other words, the model is "uncertain" about what prediction to make in this region.

From the above graph, it can be seen that, in addition to ignoring the small group of non-buying customers in the lower right, a straight line is also not a great boundary for isolating the non-buying customers on the left of the graph. While you can imagine that a curve might be able to define this boundary, a straight line cannot.

Decision trees as a non-linear model

How can we do better? Enter non-linear models. Decision trees are a prime example of non-linear models. Decision trees work by dividing the data up into regions based on "if-then" types of questions. For example, if a user spends less than three minutes over two or fewer visits, how likely are they to buy? Graphically, by asking many of these questions, a decision tree can divide up the feature space using small segments of vertical and horizontal lines. This approach can create a much more complex decision boundary, as shown below.

It should be clear that decision trees can model this dataset with more success. Given this, you would have a better model for the likelihood of customer conversion and could then proceed to design offers to increase conversion (for more information see chapter 5 of my book, Data Science Projects with Python).
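To make this comparison concrete, the following is a minimal sketch, not the author's code from the book, that fits both kinds of model to a synthetic two-feature dataset with scikit-learn. The data-generating rule is invented here purely to mimic the per-visit pattern described above.

# Minimal sketch: compare a linear model (logistic regression) with a
# non-linear model (decision tree) on a synthetic "time on page" /
# "number of visits" dataset. Details are illustrative, not the author's code.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
n = 1000
time_on_page = rng.uniform(0, 25, n)     # cumulative minutes on the page
visits = rng.integers(1, 11, n)          # number of visits (1 to 10)

# Hypothetical non-linear rule: buying depends on minutes *per visit*,
# except for long one-or-two-visit sessions (the "walked away" group).
per_visit = time_on_page / visits
bought = ((per_visit > 1.2) & ~((visits <= 2) & (time_on_page > 18))).astype(int)

X = np.column_stack([time_on_page, visits])
X_train, X_test, y_train, y_test = train_test_split(X, bought, random_state=0)

for model in (LogisticRegression(max_iter=1000), DecisionTreeClassifier(max_depth=5)):
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{model.__class__.__name__}: test accuracy = {acc:.2f}")

On data like this, the tree's axis-aligned splits can approximate the minutes-per-visit threshold, while the single straight boundary of logistic regression cannot, so the tree typically scores noticeably higher.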
In conclusion, this post has shown how non-linear models, such as decision trees, can describe relationships in complex datasets more effectively than linear models, such as logistic regression. It should be noted that linear models can be extended to non-linearity by various means, including feature engineering. On the other hand, non-linear models may suffer from overfitting, since they are so flexible. Nonetheless, approaches to prevent decision trees from overfitting have been formulated using ensemble models such as random forests and gradient boosted trees, which are among the most successful machine learning techniques in use today. As a final caveat, note that this blog post presents a hypothetical, synthetic dataset, which can be modeled almost perfectly with decision trees. Real-world data is messier, but the same principles hold. I hope you found this conceptual discussion helpful. For a more detailed explanation of how decision trees and logistic regression work "under the hood" with real-world data, and the Python code for a hypothetical example similar to the one shown here, check out my book Data Science Projects with Python.

Author Bio

Stephen Klosterman is a machine learning data scientist and the author of the book Data Science Projects with Python. He enjoys helping to frame problems in a data science context and delivering machine learning solutions that business stakeholders understand and value. His education includes a Ph.D. in biology from Harvard University, where he was an assistant teacher of the data science course.

About the Book

Data Science Projects with Python will help you understand the working and output of machine learning algorithms and gain insight into not only the predictive capabilities of the models but also the reasons behind their predictions. The book also provides detailed guidance on how to build a classification model and how to conduct a financial analysis to find the optimal threshold for binary classification, which helps with financial budgeting and operational strategy for a well-optimized usage model. By the end of this book, you will be able to confidently use various machine learning algorithms to perform detailed data analysis.

Related reading:

Netflix open-sources Metaflow, its Python framework for building and managing data science projects
What does a data science team look like?
Get Ready for Open Data Science Conference 2019 in Europe and California
How to learn data science: from data mining to machine learning
Dr. Brandon explains Decision Trees to Jon

Best practices for RESTful web services: Naming conventions and API versioning [Tutorial]

Sugandha Lahoti
12 Jul 2019
12 min read
This article covers two important best practices for REST and RESTful APIs: naming conventions and API versioning. It is taken from the book Hands-On RESTful Web Services with TypeScript 3 by Biharck Muniz Araújo, which will guide you in designing and developing RESTful web services with the power of TypeScript 3 and Node.js.

What are naming conventions?

One of the keys to achieving a good RESTful design is naming the HTTP verbs appropriately. It is really important to create understandable resources that allow people to easily discover and use your services. A good resource name implies that the resource is intuitive and clear to use. On the other hand, the usage of HTTP methods that are incompatible with REST patterns creates noise and makes the developer's life harder. This section offers some suggestions for creating clear and good resource URIs.

It is good practice to expose resources as nouns instead of verbs. Essentially, a resource represents a thing, and that is the reason you should use nouns; verbs refer to actions, which are expressed by the HTTP methods. Three words that describe good resource naming conventions are as follows:

Understandability: The resource's representation format should be understandable and usable by both the server and the client
Completeness: A resource should be completely represented by the format
Linkability: A resource can be linked to another resource

Some example resources are as follows:

Users of a system
Blog posts
An article
Disciplines in which a student is enrolled
Students whom a professor teaches
A blog post draft

In the best case, each resource exposed by a service should be identified by a unique URI. It is quite common to see the same resource being exposed by more than one URI, which is definitely not good. It is also good practice to make the URI itself meaningful, so that it clearly describes the resource. URIs need to be predictable, which means that they have to be consistent in terms of data structure. In general, this is not a rule required by REST, but it enhances the service and/or the API. A good way to write good RESTful APIs is to write them with your consumers in mind: there is no reason to name an API while thinking about the API's developers rather than its consumers, who are the people who will actually consume your resources and API (as the name suggests). Even when a resource has a good name and is therefore easier to understand, it can still be difficult to understand its boundaries. Imagine that services are not well named: bad naming creates a lot of chaos, such as business rule duplication, bad API usage, and so on.

With this in mind, let's explain naming conventions based on a hypothetical scenario. Imagine a company that manages orders, offers, products, items, customers, and so on. Considering everything that we've said about resources, if we decided to expose a customer resource and we want to insert a new customer, the URI might be as follows:

POST https://<HOST>/customers

The hypothetical request body might be as follows:

{
  "first-name": "john",
  "last-name": "doe",
  "e-mail": "john.doe@email.com"
}

Imagine that the previous request results in a customer with the ID 445839, which can be used later to retrieve the customer.
The GET method could then be called as follows:

GET https://<HOST>/customers/445839

The response will look something like this:

{
  "customer-id": 445839,
  "first-name": "john",
  "last-name": "doe",
  "e-mail": "john.doe@email.com"
}

The same URI can be used for the PUT and DELETE operations, respectively:

PUT https://<HOST>/customers/445839

The PUT request body might be as follows:

{
  "last-name": "lennon"
}

For the DELETE operation, the HTTP request to the URI will be as follows:

DELETE https://<HOST>/customers/445839

Following the same naming conventions, the product URIs might be as follows:

POST https://<HOST>/products

Sample request body:

{
  "name": "notebook",
  "description": "a fruit brand"
}

GET https://<HOST>/products/9384

PUT https://<HOST>/products/9384

Sample request body:

{
  "name": "desktop"
}

DELETE https://<HOST>/products/9384

Now, the next step is to expose the URI for order creation. Before we continue, we should go over the various ways to expose this URI. The first option is to do the following:

POST https://<HOST>/orders

However, this places the order outside the context of the customer it belongs to; the order would exist without a customer, which is quite odd. The second option is to expose the order inside a customer, like so:

POST https://<HOST>/customers/445839/orders

Based on that model, all orders belong to customer 445839. If we want to retrieve those orders, we can make a GET request, like so:

GET https://<HOST>/customers/445839/orders

As we mentioned previously, it is also possible to express hierarchical concepts when there is a relationship between resources or entities. Following the same idea, how should we represent the URI for the items within an order that belongs to customer 445839? First, if we would like to get a specific order, such as order 7384, we can do that like so:

GET https://<HOST>/customers/445839/orders/7384

Following the same approach, to get the items we could use the following:

GET https://<HOST>/customers/445839/orders/7384/items

The same concept applies to the create process, where the URI is still the same but the HTTP method is POST instead of GET. In this scenario, a body also has to be sent:

POST https://<HOST>/customers/445839/orders/7384

{
  "id": 7834,
  "quantity": 10
}

Now you should have a good idea of what the GET operation offers in regard to orders. The same approach can be applied to go deeper and get a specific item from a specific order for a specific customer:

GET https://<HOST>/customers/445839/orders/7384/items/1

Of course, this hierarchy also applies to the PUT, PATCH, and POST methods, and in some cases the DELETE method as well. It will depend on your business rules; for example, can an item be deleted? Can an order be updated?
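Before moving on to versioning, here is a small illustrative sketch, not taken from the book (which uses TypeScript 3 and Node.js), of how the noun-based customer resource above might look in Python with Flask 2.x. The in-memory store, seeded ID, and handler names are assumptions made only for the example.

# Illustrative sketch only: noun-based customer resource, not code from the book.
from flask import Flask, jsonify, request

app = Flask(__name__)
customers = {}          # in-memory store: id -> customer dict
next_id = 445839        # seeded to match the hypothetical example ID

@app.post("/customers")
def create_customer():
    global next_id
    customer = request.get_json()
    customer["customer-id"] = next_id
    customers[next_id] = customer
    next_id += 1
    return jsonify(customer), 201

@app.get("/customers/<int:customer_id>")
def get_customer(customer_id):
    if customer_id not in customers:
        return jsonify({"error": "not found"}), 404
    return jsonify(customers[customer_id])

@app.put("/customers/<int:customer_id>")
def update_customer(customer_id):
    if customer_id not in customers:
        return jsonify({"error": "not found"}), 404
    customers[customer_id].update(request.get_json())
    return jsonify(customers[customer_id])

@app.delete("/customers/<int:customer_id>")
def delete_customer(customer_id):
    customers.pop(customer_id, None)
    return "", 204

The point is simply that every URI names a thing (customers, customers/445839) and the HTTP method carries the action.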
What is API versioning?

As APIs are developed, gathering more business rules for their context day by day, accumulating tech debt, and maturing, there often comes a point where teams need to release breaking functionality while keeping their existing consumers working perfectly. One way to keep consumers working is by versioning APIs. Breaking changes can get messy: when something changes abruptly, it often generates issues for consumers, as it usually isn't planned and directly affects their ability to deliver new business experiences.

One school of thought says that APIs should be versionless. This means that building APIs that won't change their contract forces every change to be viewed through the lens of backward compatibility. This drives us to create better API interfaces, not only to solve any current issues, but to allow us to build APIs based on foundational capabilities or business capabilities themselves. Here are a few tips that should help you out:

Put yourself in the consumer's shoes: From a product perspective, think from the consumer's point of view when building APIs. Most breaking changes happen because developers build APIs without considering the consumers, which means that they are building something for themselves and not for the real users' needs.

Contract-first design: The API interface has to be treated as a formal contract, which is harder to change and more important than the code behind it. The key to API design success is understanding the consumer's needs and the business associated with them to create a reliable contract. This is essentially a good, productive conversation between the consumers and the producers.

Require tolerant readers: It is quite common to add new fields to a contract over time. Based on what we have learned so far, this could generate a breaking change. This sometimes occurs because, unfortunately, many consumers use a deserializer strategy that is strict by default, meaning that the plugin used to deserialize throws exceptions on fields it has never seen before. It is not recommended to version APIs only because you need to add a new optional field to the contract; at the same time, we don't want to break the client side. Some good advice is to document any changes, stating that new fields might be added, so that the consumers aren't surprised by them.

Add an object wrapper: This sounds obvious, but when teams release APIs without object wrappers, the APIs turn into hard APIs, which means that they are nearly impossible to evolve without making breaking changes. For instance, say your team has delivered an API based on JSON that returns a raw JSON array. So far, so good. However, later on they find out that they have to deal with paging, internationalize the service, or handle some other context change. There is no way of making such changes without breaking something, because the return value is a raw JSON array.

Always plan to version: Don't assume you have built the best API in the world that will never need to change. APIs are built with an end date, even if you don't know it yet. It's always a good plan to build APIs with versioning in mind.

Including the version in the URL

Including the version in the URL is an easy strategy: the version number is added into the URI. Let's see how this is done:

https://api.domain.com/v1/
https://api.domain.com/v2/
https://api.domain.com/v3/

Basically, this model tells the consumers which API version they are using. Every breaking change increases the version number. One issue that may occur when the URI for a resource changes is that the resource may no longer be found with the old URI unless redirects are used.
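As a hedged illustration of the URL strategy, and again not code from the book, here is how version prefixes might be wired up in Python with Flask blueprints (assuming Flask 2.x for the method shortcut decorators); the order payloads are invented for the example.

# Illustrative sketch: version prefix in the URL using Flask blueprints.
# Each breaking change gets a new blueprint mounted under a new prefix.
from flask import Flask, Blueprint, jsonify

v1 = Blueprint("v1", __name__, url_prefix="/v1")
v2 = Blueprint("v2", __name__, url_prefix="/v2")

@v1.get("/orders/<int:order_id>")
def get_order_v1(order_id):
    # v1 contract: flat fields
    return jsonify({"id": order_id, "quantity": 10})

@v2.get("/orders/<int:order_id>")
def get_order_v2(order_id):
    # v2 contract: breaking change, response wrapped in an object
    return jsonify({"data": {"id": order_id, "quantity": 10},
                    "meta": {"version": 2}})

app = Flask(__name__)
app.register_blueprint(v1)
app.register_blueprint(v2)

Consumers would then call /v1/orders/7384 or /v2/orders/7384 depending on which contract they target, and a breaking change simply ships as a new blueprint.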
Versioning in the subdomain

Subdomain versioning also puts the version within the URI, but associates it with the domain, like so:

https://v1.api.domain.com/
https://v2.api.domain.com/
https://v3.api.domain.com/

This is quite similar to putting the version in the URI path. One of the advantages of using a subdomain strategy is that each version of your API can be hosted on a different server.

Versioning on media types

Another approach to versioning is using MIME types to include the API version. In short, API producers register these MIME types on their backend, and the consumers then need to include Accept and Content-Type headers. The following requests use an additional header:

GET https://<HOST>/orders/1325 HTTP/1.1
Accept: application/json
Version: 1

GET https://<HOST>/orders/1325 HTTP/1.1
Accept: application/json
Version: 2

GET https://<HOST>/orders/1325 HTTP/1.1
Accept: application/json
Version: 3

The following requests use an additional field in the Accept/Content-Type header:

GET https://<HOST>/orders/1325 HTTP/1.1
Accept: application/json; version=1

GET https://<HOST>/orders/1325 HTTP/1.1
Accept: application/json; version=2

GET https://<HOST>/orders/1325 HTTP/1.1
Accept: application/json; version=3

The following requests use a custom media type:

GET https://<HOST>/orders/1325 HTTP/1.1
Accept: application/vnd.<host>.orders.v1+json

GET https://<HOST>/orders/1325 HTTP/1.1
Accept: application/vnd.<host>.orders.v2+json

GET https://<HOST>/orders/1325 HTTP/1.1
Accept: application/vnd.<host>.orders.v3+json

Recommendation

For a purely RESTful service, header-based versioning is often held up as the ideal. However, our recommendation is to keep the version in the URL. This strategy allows the consumers to open the API in a browser, send it in an email, bookmark it, share it more easily, and so on. This format also keeps logs human-readable. There are a few more recommendations regarding API versioning:

Use only the major version: API consumers should only care about breaking changes.

Use a version number: Keep things clear; numbering the API incrementally allows the consumer to track its evolution. Versioning APIs using timestamps or any other format only creates confusion in the consumer's mind and exposes more information about versioning than is necessary.

Require that the version is passed: Even though this is more convenient from the API producer's perspective, requiring a version from the start is a good strategy because consumers will know that the API version might change and will be prepared for that.

Document your API time-to-live policy: Good documentation is a good path to follow. Keeping everything well described will prevent consumers from suddenly discovering that Version 1 is no longer available because it has been deprecated. Policies allow consumers to prepare for events such as deprecation.

In this article, we learned about best practices related to RESTful web services, such as naming conventions and API versioning formats. Next, to look at how to design RESTful web services with OpenAPI and Swagger, focusing on the core principles while creating web services, read the book Hands-On RESTful Web Services with TypeScript 3.

Related reading:

7 reasons to choose GraphQL APIs over REST for building your APIs
Which Python framework is best for building RESTful APIs? Django or Flask?
Understanding advanced patterns in RESTful API [Tutorial]

Configuring the ESP8266

Packt
14 Jun 2017
10 min read
In this article by Marco Schwartz, the author of the book ESP8266 Internet of Things Cookbook, we will cover the following recipes:

Setting up the Arduino development environment for the ESP8266
Choosing an ESP8266 board
Required additional components

Setting up the Arduino development environment for the ESP8266

To start us off, we will look at how to set up the Arduino IDE development environment so that we can use it to program the ESP8266. This involves installing the Arduino IDE and getting the board definitions for our ESP8266 module.

Getting ready

The first thing you should do is download the Arduino IDE, if you do not already have it installed on your computer. You can do that from this link: https://www.arduino.cc/en/Main/Software. The webpage features the latest version of the Arduino IDE. Select your operating system and download the latest version available when you access the link (it was 1.6.13 when this article was being written).

When the download is complete, install the Arduino IDE and run it on your computer. Now that the installation is complete, it is time to get the ESP8266 definitions. Open the Preferences window in the Arduino IDE from File | Preferences or by pressing CTRL+Comma. Copy this URL: http://arduino.esp8266.com/stable/package_esp8266com_index.json. Paste it in the field labelled "Additional Boards Manager URLs", as shown in the figure. If you are adding other URLs too, use a comma to separate them.

Open the Boards Manager from the Tools | Board menu and install the ESP8266 platform. The Boards Manager will download the board definition files from the link provided in the Preferences window and install them. When the installation is complete, the ESP8266 board definitions should appear as shown in the screenshot, and you can select your ESP8266 board from the Tools | Board menu.

How it works…

The Arduino IDE is an open source development environment used for programming Arduino boards and Arduino-based boards. It is also used to upload sketches to other open source boards, such as the ESP8266. This makes it an important accessory when creating Internet of Things projects.

Choosing an ESP8266 board

The ESP8266 module is a self-contained System on Chip (SoC) that features an integrated TCP/IP protocol stack, which allows you to add Wi-Fi capability to your projects. The module is usually mounted on circuit boards that break out the pins of the ESP8266 chip, making it easy for you to program the chip and to interface with input and output devices. ESP8266 boards come in different forms depending on the company that manufactures them. All the boards use Espressif's ESP8266 chip as the main controller, but they have different additional components and different pin configurations, giving each board unique additional features. Therefore, before embarking on your IoT project, take some time to compare and contrast the different types of ESP8266 boards that are available. This way, you will be able to select the board whose features best suit your project.

Available options

The simple ESP8266-01 module is the most basic ESP8266 board available on the market. It has 8 pins, which include 4 General Purpose Input/Output (GPIO) pins, serial communication TX and RX pins, an enable pin, and the power pins VCC and GND. Since it only has 4 GPIO pins, you can only connect three inputs or outputs to it. The 8-pin header on the ESP8266-01 module has a 2.0mm spacing, which is not compatible with breadboards.
Therefore, you have to find another way to connect the ESP8266-01 module to your setup when prototyping; you can use female-to-male jumper wires to do that.

The ESP8266-07 is an improved version of the ESP8266-01 module. It has 16 pins, which comprise 9 GPIO pins, serial communication TX and RX pins, a reset pin, an enable pin, and the power pins VCC and GND. One of the GPIO pins can be used as an analog input pin. The board also comes with a U.FL connector that you can use to plug in an external antenna in case you need to boost the Wi-Fi signal. Since the ESP8266-07 has more GPIO pins, you can have more inputs and outputs in your project. Moreover, it supports both SPI and I2C interfaces, which can come in handy if you want to use sensors or actuators that communicate using either of those protocols. Programming the board requires an external FTDI breakout board based on a USB-to-serial converter such as the FT232RL chip. The pads/pinholes of the ESP8266-07 have a 2.0mm spacing, which is not breadboard friendly. To solve this, you have to acquire a plate holder that breaks out the ESP8266-07 pins to a breadboard-compatible pin configuration, with 2.54mm spacing between the pins. This makes prototyping easier. The board has to be powered from 3.3V, which is the operating voltage of the ESP8266 chip.

The Olimex ESP8266 module is a breadboard-compatible board that features the ESP8266 chip. Just like the ESP8266-07 board, it has SPI, I2C, serial UART, and GPIO interface pins. In addition, it also comes with a Secure Digital Input/Output (SDIO) interface, which is ideal for communicating with an SD card. This adds 6 extra pins to the configuration, bringing the total to 22 pins. Since the board does not have an on-board USB-to-serial converter, you have to program it using an FTDI breakout board or a similar USB-to-serial board/cable. Moreover, it has to be powered from a 3.3V source, which is the recommended voltage for the ESP8266 chip.

The Sparkfun ESP8266 Thing is a development board for the ESP8266 Wi-Fi SoC. It has 20 pins that are breadboard friendly, which makes prototyping easy. It features SPI, I2C, serial UART, and GPIO interface pins, enabling it to be interfaced with many input and output devices. There are 8 GPIO pins, including the I2C interface pins. The board has a 3.3V voltage regulator, which allows it to be powered from sources that provide more than 3.3V. It can be powered using a micro USB cable or a Li-Po battery. The USB cable also charges the attached Li-Po battery, thanks to the Li-Po battery charging circuit on the board. Programming has to be done via an external FTDI board.

The Adafruit Feather Huzzah ESP8266 is a fully stand-alone ESP8266 board. It has a built-in USB-to-serial interface, which eliminates the need for an external FTDI breakout board to program it. Moreover, it has an integrated battery charging circuit that charges any connected Li-Po battery when the USB cable is connected. There is also a 3.3V voltage regulator on the board, which allows the board to be powered with more than 3.3V. Though there are 28 breadboard-friendly pins on the board, only 22 are usable. 10 of those pins are GPIO pins and can also be used for SPI as well as I2C interfacing, and one of the GPIO pins is an analog pin.

What to choose?

All the ESP8266 boards will add Wi-Fi connectivity to your project. However, some of them lack important features and are difficult to work with. So, the best option is to use the module that has the most features and is easiest to work with.
The Adafruit ESP8266 fits the bill. It is completely stand-alone and easy to power, program, and configure thanks to its on-board features. Moreover, it offers many input/output pins that will enable you to add more features to your projects, and it is affordable and small enough to fit in projects with limited space.

There's more…

Wi-Fi isn't the only technology that we can use to connect our projects to the internet. There are other options, such as Ethernet and 3G/LTE, and there are shields and breakout boards that can be used to add these features to open source projects. You can explore these other options and see which works for you.

Required additional components

To demonstrate how the ESP8266 works, we will use some additional components. These components will help us learn how to read sensor inputs and control actuators using the GPIO pins. Through this, you can post sensor data to the internet and control actuators from internet resources such as websites.

Required components

The components we will use include:

Sensors: DHT11, photocell, soil humidity sensor
Actuators: relay, PowerSwitch Tail kit, water pump
Breadboard
Jumper wires
Micro USB cable

Sensors

Let us discuss the three sensors we will be using.

DHT11: The DHT11 is a digital temperature and humidity sensor. It uses a thermistor and a capacitive humidity sensor to monitor the humidity and temperature of the surrounding air and produces a digital signal on its data pin. A digital pin on the ESP8266 can be used to read the data from the sensor's data pin.

Photocell: A photocell is a light sensor that changes its resistance depending on the amount of incident light it is exposed to. It can be used in a voltage divider setup to detect the amount of light in the surroundings. In a setup where the photocell is on the VCC side of the voltage divider, the output of the voltage divider goes high when the light is bright and low when the light is dim. The output of the voltage divider is connected to an analog input pin, where the voltage readings can be read.

Soil humidity sensor: The soil humidity sensor is used for measuring the amount of moisture in soil and other similar materials. It has two large exposed pads that act as a variable resistor: if there is more moisture in the soil, the resistance between the pads decreases, leading to a higher output signal. The output signal is connected to an analog pin, from where its value is read.

Actuators

Let's discuss the actuators.

Relay: A relay is a switch that is operated electrically. It uses electromagnetism to switch large loads using small voltages. It comprises three parts: a coil, a spring, and contacts. When the coil is energized by a HIGH signal from a digital pin of the ESP8266, it attracts the contacts, forcing them closed. This completes the circuit and turns on the connected load. When the signal on the digital pin goes LOW, the coil is no longer energized and the spring pulls the contacts apart. This opens the circuit and turns off the connected load.

PowerSwitch Tail kit: A PowerSwitch Tail kit is a device that is used to control standard wall outlet devices with microcontrollers. It is already packaged to prevent you from having to mess around with high-voltage wiring. Using it, you can control appliances in your home with the ESP8266.
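As an aside that goes beyond this article (which programs the ESP8266 from the Arduino IDE), the same chip can also run MicroPython. The following minimal sketch illustrates the DHT11 reading and relay switching just described; the pin assignments (GPIO4 for the DHT11 data pin, GPIO5 for the relay) and the temperature threshold are assumptions made only for the example.

# Illustrative MicroPython sketch (not from the article): read the DHT11
# and drive a relay from the ESP8266. Pin numbers are assumptions.
import time
import dht
from machine import Pin

sensor = dht.DHT11(Pin(4))      # DHT11 data pin assumed on GPIO4
relay = Pin(5, Pin.OUT)         # relay coil assumed driven from GPIO5

while True:
    sensor.measure()                     # trigger a reading
    temperature = sensor.temperature()   # degrees Celsius
    humidity = sensor.humidity()         # percent relative humidity
    # Energize the relay coil (HIGH) when the room gets warm,
    # release it (LOW) otherwise.
    relay.value(1 if temperature > 28 else 0)
    print("T =", temperature, "C, RH =", humidity, "%")
    time.sleep(2)                        # DHT11 needs ~2 s between readings

The GPIO logic is the same whether the firmware is written as an Arduino sketch or in MicroPython; only the syntax differs.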
Water pump: A water pump is used to increase the pressure of fluids in a pipe. It uses a DC motor to rotate a fan and create a vacuum that sucks up the fluid. The sucked-up fluid is then forced onward by the fan, creating a vacuum again that sucks up the fluid behind it. This, in effect, moves the fluid from one place to another.

Breadboard: A breadboard is used to temporarily connect components without soldering. This makes it an ideal prototyping accessory that comes in handy when building circuits.

Jumper wires: Jumper wires are flexible wires that are used to connect different parts of a circuit on a breadboard.

Micro USB cable: A micro USB cable will be used to connect the Adafruit ESP8266 board to the computer.

Summary

In this article we have learned how to set up the Arduino development environment for the ESP8266, how to choose an ESP8266 board, and which additional components are required.

Further resources on this subject:

Internet of Things with BeagleBone [article]
Internet of Things Technologies [article]
BLE and the Internet of Things [article]

Essential SQL for Data Engineers

Kedeisha Bryan, Taamir Ransome
31 Oct 2024
10 min read
This article is an excerpt from the book Cracking the Data Engineering Interview, by Kedeisha Bryan and Taamir Ransome. The book is a practical guide that will help you prepare to successfully break into the data engineering role. The chapters cover technical concepts as well as tips for resume, portfolio, and brand building to catch an employer's attention, while also focusing on case studies and real-world interview questions.

Introduction

In the world of data engineering, SQL is the unsung hero that empowers us to store, manipulate, transform, and migrate data easily. It is the language that enables data engineers to communicate with databases, extract valuable insights, and shape data to meet their needs. Regardless of the nature of the organization or the data infrastructure in use, a data engineer will invariably need to use SQL for creating, querying, updating, and managing databases. As such, proficiency in SQL can often be the difference between a good data engineer and a great one. Whether you are new to SQL or looking to brush up your skills, this chapter will serve as a comprehensive guide. By the end of this chapter, you will have a solid understanding of SQL as a data engineer and be prepared to showcase your knowledge and skills in an interview setting. In this article, we will cover the following topics:

Must-know foundational SQL concepts
Must-know advanced SQL concepts
Technical interview questions

Must-know foundational SQL concepts

In this section, we will delve into the foundational SQL concepts that form the building blocks of data engineering. Mastering these fundamental concepts is crucial for acing SQL-related interviews and effectively working with databases. Let's explore the critical foundational SQL concepts every data engineer should be comfortable with, as follows:

SQL syntax: SQL syntax is the set of rules governing how SQL statements should be written. As a data engineer, understanding SQL syntax is fundamental because you'll be writing and reviewing SQL queries regularly. These queries enable you to extract, manipulate, and analyze data stored in relational databases.

SQL order of operations: The order of operations dictates the sequence in which the clauses of a query are executed: FROM and JOIN, then WHERE, GROUP BY, HAVING, SELECT, DISTINCT, ORDER BY, and finally LIMIT/OFFSET.

Data types: SQL supports a variety of data types, such as INT, VARCHAR, DATE, and so on. Understanding these types is crucial because they determine the kind of data that can be stored in a column, impacting storage considerations, query performance, and data integrity. As a data engineer, you might also need to convert data types or handle mismatches.

SQL operators: SQL operators are used to perform operations on data. They include arithmetic operators (+, -, *, /), comparison operators (>, <, =, and so on), and logical operators (AND, OR, and NOT). Knowing these operators helps you construct complex queries to solve intricate data-related problems.

Data Manipulation Language (DML), Data Definition Language (DDL), and Data Control Language (DCL) commands: DML commands such as SELECT, INSERT, UPDATE, and DELETE allow you to manipulate data stored in the database. DDL commands such as CREATE, ALTER, and DROP enable you to manage database schemas. DCL commands such as GRANT and REVOKE are used for managing permissions. As a data engineer, you will frequently use these commands to interact with databases.
Basic queries: Writing queries to select, filter, sort, and join data is an essential skill for any data engineer. These operations form the basis of data extraction and manipulation.

Aggregation functions: Functions such as COUNT, SUM, AVG, MAX, and MIN, together with GROUP BY, are used to perform calculations over multiple rows of data. They are essential for generating reports and deriving statistical insights, which are critical aspects of a data engineer's role.

The following section will dive deeper into must-know advanced SQL concepts, exploring advanced techniques to elevate your SQL proficiency. Get ready to level up your SQL game and unlock new possibilities in data engineering!

Must-know advanced SQL concepts

This section will explore advanced SQL concepts that will elevate your data engineering skills to the next level. These concepts will empower you to tackle complex data analysis, perform advanced data transformations, and optimize your SQL queries. Let's delve into the must-know advanced SQL concepts, as follows:

Window functions: These perform a calculation across a group of rows that are related to the current row. They are needed for more complex analyses, such as computing running totals or moving averages, which are common tasks in data engineering.

Subqueries: Queries nested within other queries. They provide a powerful way to perform complex data extraction, transformation, and analysis, often making your code more efficient and readable.

Common Table Expressions (CTEs): CTEs can simplify complex queries and make your code more maintainable. They are also essential for recursive queries, which are sometimes necessary for problems involving hierarchical data.

Stored procedures and triggers: Stored procedures help encapsulate frequently performed tasks, improving efficiency and maintainability. Triggers can automate certain operations, improving data integrity. Both are important tools in a data engineer's toolkit.

Indexes and optimization: Indexes speed up query performance by enabling the database to locate data more quickly. Understanding how and when to use indexes is key for a data engineer, as it affects the efficiency and speed of data retrieval.

Views: Views simplify access to data by encapsulating complex queries. They can also enhance security by restricting access to certain columns. As a data engineer, you'll create and manage views to facilitate data access and manipulation.

By mastering these advanced SQL concepts, you will have the tools and knowledge to handle complex data scenarios, optimize your SQL queries, and derive meaningful insights from your datasets. The following section will prepare you for technical interview questions on SQL, equipping you with example answers and strategies to excel in SQL-related interview discussions.
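As a quick, hedged illustration of two of these concepts working together, a CTE feeding a window function, the following snippet uses Python's built-in sqlite3 module (SQLite 3.25 or newer, as bundled with recent Python releases, is needed for window functions). The table and column names are invented for the example and are not from the book.

# Illustrative only: a CTE plus a window function (running total) in SQLite.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, order_date TEXT, amount REAL);
    INSERT INTO orders VALUES
        (1, '2024-01-05', 120.0),
        (1, '2024-02-10',  80.0),
        (2, '2024-01-20', 200.0),
        (2, '2024-03-02',  50.0);
""")

query = """
WITH monthly AS (                        -- CTE: total per customer and month
    SELECT customer_id,
           substr(order_date, 1, 7) AS month,
           SUM(amount) AS total
    FROM orders
    GROUP BY customer_id, month
)
SELECT customer_id,
       month,
       total,
       SUM(total) OVER (                 -- window function: running total
           PARTITION BY customer_id
           ORDER BY month
       ) AS running_total
FROM monthly
ORDER BY customer_id, month;
"""

for row in conn.execute(query):
    print(row)

The CTE first collapses the raw rows to one total per customer and month, and the window function then adds a running total per customer without a second GROUP BY.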
Technical interview questions

This section addresses technical interview questions specifically focused on SQL for data engineers. These questions will help you demonstrate your SQL proficiency and problem-solving abilities. Let's explore a combination of basic and advanced SQL interview questions and the best ways to approach and answer them, as follows:

Question 1: What is the difference between the WHERE and HAVING clauses?
Answer: The WHERE clause filters data based on conditions applied to individual rows, while the HAVING clause filters data based on grouped results. Use WHERE for filtering before aggregating data and HAVING for filtering after aggregating data.

Question 2: How do you eliminate duplicate records from a result set?
Answer: Use the DISTINCT keyword in the SELECT statement to eliminate duplicate records and retrieve unique values from a column or combination of columns.

Question 3: What are primary keys and foreign keys in SQL?
Answer: A primary key uniquely identifies each record in a table and ensures data integrity. A foreign key establishes a link between two tables, referencing the primary key of another table to enforce referential integrity and maintain relationships.

Question 4: How can you sort data in SQL?
Answer: Use the ORDER BY clause in a SELECT statement to sort data based on one or more columns. The ASC (ascending) keyword sorts data in ascending order, while the DESC (descending) keyword sorts it in descending order.

Question 5: Explain the difference between UNION and UNION ALL in SQL.
Answer: UNION combines result sets and removes duplicate records, while UNION ALL combines all records without eliminating duplicates. UNION ALL is faster than UNION because it does not involve the duplicate-elimination step.

Question 6: Can you explain what a self join is in SQL?
Answer: A self join is a regular join where a table is joined to itself. This is often useful when the data is related within the same table. To perform a self join, we have to use table aliases to help SQL distinguish the left table from the right table.

Question 7: How do you optimize a slow-performing SQL query?
Answer: Analyze the query execution plan, identify bottlenecks, and consider strategies such as creating appropriate indexes, rewriting the query, or using query optimization techniques such as JOIN order optimization or subquery optimization.

Question 8: What are CTEs, and how do you use them?
Answer: CTEs are temporary named result sets that can be referenced within a query. They enhance query readability, simplify complex queries, and enable recursive queries. Use the WITH keyword to define CTEs in SQL.

Question 9: Explain the ACID properties in the context of SQL databases.
Answer: ACID is an acronym that stands for Atomicity, Consistency, Isolation, and Durability. These are the basic properties that make database operations reliable and transactional. Atomicity ensures that a transaction is handled as a single unit and either completes fully or not at all. Consistency ensures that a transaction moves the database from one valid state to another. Isolation ensures that concurrent transactions do not interfere with each other. Durability ensures that once a transaction is committed, its changes are permanent and can survive system failures.

Question 10: How can you handle NULL values in SQL?
Answer: Use the IS NULL or IS NOT NULL operator to check for NULL values. Additionally, you can use the COALESCE function to replace NULL values with alternative non-null values.

Question 11: What is the purpose of stored procedures and functions in SQL?
Answer: Stored procedures and functions are reusable pieces of SQL code encapsulating a set of SQL statements. They promote code modularity, improve performance, enhance security, and simplify database maintenance.
Question 12: Explain the difference between a clustered and a non-clustered index.
Answer: A clustered index determines the physical order of the data in a table, which means that a table can only have one clustered index. The data rows of the table are stored in the leaf nodes of the clustered index. A non-clustered index, on the other hand, does not change the physical order of the data in the table; it maintains a separate structure of sorted pointers that point back to the original table rows. A table can have more than one non-clustered index.

Prepare for these interview questions by understanding the underlying concepts, practicing SQL queries, and being able to explain your answers.

Conclusion

This article explored the foundational and advanced principles of SQL that empower data engineers to store, manipulate, transform, and migrate data confidently. Understanding these concepts unlocks the door to seamless data operations, optimized query performance, and insightful data analysis. SQL is the language that bridges the gap between raw data and valuable insights. With a solid grasp of SQL, you possess the skills to navigate databases, write powerful queries, and design efficient data models. Whether preparing for interviews or tackling real-world data engineering challenges, the knowledge you have gained in this chapter will propel you toward success. Remember to continue exploring and honing your SQL skills. Stay updated with emerging SQL technologies, best practices, and optimization techniques to stay at the forefront of the ever-evolving data engineering landscape. Embrace the power of SQL as a critical tool in your data engineering arsenal, and let it empower you to unlock the full potential of your data.

Author Bio

Kedeisha Bryan is a data professional with experience in data analytics, science, and engineering. She has prior experience combining both Six Sigma and analytics to provide data solutions that have impacted policy changes and leadership decisions. She is fluent in tools such as SQL, Python, and Tableau. She is the founder and leader of the Data in Motion Academy, providing personalized skill development, resources, and training at scale to aspiring data professionals across the globe. Her other works include another Packt book in the works and an SQL course for LinkedIn Learning.

Taamir Ransome is a data scientist and software engineer. He has experience in building machine learning and artificial intelligence solutions for the US Army. He is also the founder of the Vet Dev Institute, where he currently provides cloud-based data solutions for clients. He holds a master's degree in analytics from Western Governors University.

Implementing Software Engineering Best Practices and Techniques with Apache Maven

Packt
24 Aug 2011
10 min read
Apache Maven 3 Cookbook: over 50 recipes towards optimal Java software engineering with Maven 3.

These techniques have been around for more than a decade and are well known by practitioners of software engineering. The benefits, trade-offs, and pros and cons of these practices are well understood and need only brief mention. The practices are not inter-dependent, but some of them are inter-related in the larger scheme of things. One such example is the relation between project modularization and dependency management: while nothing stops either from being implemented in isolation, they are more beneficial when implemented together. These techniques can be further supplemented by the industry's best practices, such as continuous integration, maintaining centralized repositories, source code integration, and so on. Our focus here will be on steadily understanding these software engineering techniques within the context of Maven projects, and we will look at practical ways to implement and integrate them.

Build automation

Build automation is the scripting of tasks that software developers have to do on a day-to-day basis. These tasks include:

Compilation of source code to binary code
Packaging of binary code
Running tests
Deployment to remote systems
Creation of documentation and release notes

Build automation offers a range of benefits, including faster builds, elimination of bad builds, standardization across teams and organizations, increased efficiency, and improvements in product quality. Today, it is considered an absolute essential for software engineering practitioners.

Getting ready

You need to have a Maven project ready. If you don't have one, run the following in the command line to create a simple Java project:

$ mvn archetype:generate -DgroupId=net.srirangan.packt.maven -DartifactId=MySampleApp

How to do it...

The archetype:generate command would have generated a sample Apache Maven project for us. If we choose the maven-archetype-quickstart archetype from the list, our project structure would look similar to the following:

└───src
    ├───main
    │   └───java
    │       └───net
    │           └───srirangan
    │               └───packt
    │                   └───maven
    └───test
        └───java
            └───net
                └───srirangan
                    └───packt
                        └───maven

In every Apache Maven project, including the one we just generated, the build is pre-automated following the default build lifecycle. Follow the steps given next to validate this:

Start the command-line terminal and navigate to the root of the Maven project, then try running the following commands one after another:

$ mvn validate
$ mvn compile
$ mvn package
$ mvn test

You just triggered some of the phases of the build lifecycle with individual commands. Maven lets you automate the running of all the phases in the correct order: just execute mvn install and it will run much of the default build lifecycle, including compiling, testing, packaging, and installing the artifact in the local repository.

How it works...

For every Apache Maven project, regardless of the packaging type, the default build lifecycle is applied and the build is automated. As we just witnessed, the default build lifecycle consists of phases that can be executed from the command-line terminal.
These phases are:

Validate: Validates that all project information is available and correct
Compile: Compiles the source code
Test: Runs unit tests within a suitable framework
Package: Packages the compiled code in its distribution format
Integration-test: Processes the package in the integration test environment
Verify: Runs checks to verify that the package is valid
Install: Installs the package in the local repository
Deploy: Installs the final package in a remote repository

Each of the build lifecycle phases is handled by a Maven plugin. When you execute them for the first time, Apache Maven downloads the plugin from the default online Maven Central repository (http://repo1.maven.org/maven2) and installs it in your local Apache Maven repository. This ensures that build automation is always set up in a consistent manner for everyone on the team, while the specifics and internals of the build are abstracted away. Maven build automation also pushes for standardization among different projects within an organization, as the commands to execute the build phases remain the same.

Project modularization

Suppose you're building a large enterprise application. It will need to interact with a legacy database, work with existing services, provide a modern web and device-capable user interface, and expose APIs for other applications to consume. It makes sense to split this rather large project into subprojects or modules. Apache Maven provides excellent support for such a project organization through multi-modular projects. Multi-modular projects consist of a "parent project" which contains "child projects", or "modules". The parent project's POM file contains references to all these sub-modules, and each module can be of a different type, with a different packaging value.

Getting ready

We begin by creating the parent project. Remember to set the value of packaging to pom, as highlighted in the following code:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>net.srirangan.packt.maven</groupId>
  <artifactId>TestModularApp</artifactId>
  <version>1.0-SNAPSHOT</version>
  <packaging>pom</packaging>
  <name>MyLargeModularApp</name>
</project>

This is the base parent POM file for our project MyLargeModularApp. It doesn't contain any sub-modules for now.

How to do it...

To create your first sub-module, start the command-line terminal, navigate to the parent POM directory, and run the following command:

$ mvn archetype:generate

This displays a list of archetypes for you to select from. You can pick archetype number 101, maven-archetype-quickstart, which generates a basic Java project. The archetype:generate command also asks you to fill in the Apache Maven project coordinates, including the project groupId, artifactId, package, and version. After project generation, inspect the POM file of the original parent project. You will find the following block added:

<modules>
  <module>moduleJar</module>
</modules>

The sub-module we created has been automatically added to the parent POM. It simply works; no intervention required!
We now create another sub-module, this time a Maven web application, by running the following in the command line:

$ mvn archetype:generate -DarchetypeArtifactId=maven-archetype-webapp

Let's have another look at the parent POM file; we should see both sub-modules included:

<modules>
  <module>moduleJar</module>
  <module>moduleWar</module>
</modules>

Our overall project structure should look like this:

MyLargeModularApp
├───MyModuleJar
│   └───src
│       ├───main
│       │   └───java
│       │       └───net
│       │           └───srirangan
│       │               └───packt
│       │                   └───maven
│       └───test
│           └───java
│               └───net
│                   └───srirangan
│                       └───packt
│                           └───maven
└───MyModuleWar
    └───src
        └───main
            ├───resources
            └───webapp
                └───WEB-INF

How it works...

Compiling and installing both sub-modules (in the correct order, in case the sub-modules are interdependent) is essential. It can be done from the command line by navigating to the parent POM folder and running the following command:

$ mvn clean install

Executing a build phase on the parent project automatically executes it for all of its child projects in the correct order. You should get an output similar to:

------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] MyLargeModularApp ........................ SUCCESS [0.439s]
[INFO] MyModuleJar .............................. SUCCESS [3.047s]
[INFO] MyModuleWar Maven Webapp ................. SUCCESS [0.947s]
------------------------------------------------------------------
[INFO] BUILD SUCCESS
------------------------------------------------------------------

Dependency management

Dependency management can be universally acknowledged as one of the best features of Apache Maven. In multi-modular projects, where dependencies can run into the tens or even hundreds, Apache Maven excels at allowing you to retain a high degree of control and stability. Apache Maven dependencies are transitive, which means Maven will automatically discover artifacts that your dependencies require. This feature has been available since Maven 2, and it especially comes in handy for the many open source dependencies found in today's enterprise projects.

Getting ready

Maven dependencies have six possible scopes:

Compile: The default scope. Compile dependencies are available on the classpaths.
Provided: This scope assumes that the JDK or the environment provides the dependency at runtime.
Runtime: Dependencies that are required at runtime and are specified in the runtime classpaths.
Test: Dependencies required for test compilation and execution.
System: The dependency is always available, but the JAR is provided explicitly by path.
Import: Imports dependencies specified in a POM included via the <dependencyManagement/> element.

How to do it...

Dependencies for Apache Maven projects are described in project POM files. While we take a closer look at these in the How it works... section of this recipe, here we will explore the Apache Maven dependency plugin. According to http://maven.apache.org/plugins/maven-dependency-plugin/: "The dependency plugin provides the capability to manipulate artifacts. It can copy and/or unpack artifacts from local or remote repositories to a specified location." It's a handy little plugin and provides us with a number of very useful goals.
They are as follows:

$ mvn dependency:analyze
Analyzes dependencies (used, unused, declared, undeclared)

$ mvn dependency:analyze-duplicate
Determines duplicate dependencies

$ mvn dependency:resolve
Resolves all dependencies

$ mvn dependency:resolve-plugins
Resolves all plugins

$ mvn dependency:tree
Displays the dependency tree

How it works...

Most Apache Maven projects have dependencies on other artifacts (that is, other projects, libraries, and tools). Managing dependencies and integrating them seamlessly is one of Apache Maven's strongest features. The dependencies for a Maven project are specified in the project's POM file:

<dependencies>
  <dependency>
    <groupId>...</groupId>
    <artifactId>...</artifactId>
    <version>...</version>
    <scope>...</scope>
  </dependency>
</dependencies>

In multi-module projects, dependencies can be defined in the parent POM file and subsequently inherited by child POM files as and when required. Having a single source for all dependency definitions makes dependency versioning simpler, keeping a large project's dependencies organized and manageable over time. The following example shows a multi-module project with a MySQL dependency. The parent POM contains the complete definition of the dependency:

<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>mysql</groupId>
      <artifactId>mysql-connector-java</artifactId>
      <version>5.1.2</version>
    </dependency>
  </dependencies>
</dependencyManagement>

All child modules that require MySQL include only a stub dependency definition:

<dependency>
  <groupId>mysql</groupId>
  <artifactId>mysql-connector-java</artifactId>
</dependency>

There will be no version conflicts between multiple child modules that share the same dependencies. The dependency's scope and type default to compile and jar; however, they can be overridden as required:

<dependency>
  <groupId>junit</groupId>
  <artifactId>junit</artifactId>
  <version>4.8.2</version>
  <scope>test</scope>
</dependency>

<dependency>
  <groupId>...</groupId>
  <artifactId>...</artifactId>
  <version>...</version>
  <type>war</type>
</dependency>

There's more...

System dependencies are not looked up in the repository. For them, we need to specify the path to the JAR:

<dependencies>
  <dependency>
    <groupId>sun.jdk</groupId>
    <artifactId>tools</artifactId>
    <version>1.5.0</version>
    <scope>system</scope>
    <systemPath>${java.home}/../lib/tools.jar</systemPath>
  </dependency>
</dependencies>

However, avoiding system dependencies is strongly recommended, because they defeat the whole purpose of Apache Maven dependency management. Ideally, a developer should be able to clone code out of the SCM and run Apache Maven commands; after that, it should be Apache Maven's responsibility to take care of including all dependencies. System dependencies force the developer to take extra steps, and that dilutes the effectiveness of Apache Maven in your team environment.
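The import scope mentioned in the Getting ready section deserves a quick illustration. The following is a hypothetical sketch (the BOM coordinates are invented for the example): a parent POM imports another project's dependencyManagement section, commonly called a BOM, so that all of the versions it manages become available without redeclaring them:

<dependencyManagement>
  <dependencies>
    <!-- Hypothetical BOM artifact; importing it pulls in all of its managed versions -->
    <dependency>
      <groupId>com.example</groupId>
      <artifactId>example-platform-bom</artifactId>
      <version>1.0.0</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>

Child modules can then declare any dependency managed by that BOM with just a groupId and artifactId, exactly as in the MySQL example above.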
Gemini 1.0 Pro Vision in BigQuery, Python UI Library, Feature Engineering with Fabric and PySpark, Power analytics with Redshift, Amazon RDS for MySQL

Merlyn Shelley
19 Apr 2024
14 min read
Subscribe to our BI Pro newsletter for the latest insights. Don't miss out – sign up today!Get the first look at Sigma's new features and functionality at our virtual product launch on May 2nd at 12pm ET/9am PT.The virtual event will showcase talks and demos from Sigma's CEO, co-founders, and product managers about what's next in the future of analytics.Don't miss out. See how Sigma is reinventing BI.👋 Hello,Welcome to BI-Pro #52: Your Premier Destination for Data and BI Insights! 🌟 In This Edition: 🔮 Data Viz with Python Libraries Exploring causality with Python. Meet NiceGUI: Your Soon-to-be Favorite Python UI Library. Feature Engineering with Microsoft Fabric and PySpark. 10 GitHub Repositories to Master Python. 🔌 Power BI On-premises data gateway April 2024 release. Copilot in Power BI expansion. 🛠️ Microsoft Fabric Introducing Optimistic Job Admission for Fabric Spark. Introducing Job Queueing for Notebook in Microsoft Fabric. ☁️ AWS BI Meet Amazon QuickSight expert Sanjeeb Mohapatra. Handle tables without primary keys for Amazon Aurora MySQL and Amazon RDS for MySQL. Power analytics with Amazon Redshift. 🌐 Google Cloud Data Gemini 1.0 Pro Vision in BigQuery. BigQuery data canvas. Gemini in Looker AI-powered BI. Memorystore for Redis Cluster updates. Firestore launch updates. 📊Tableau Tableau vs Power BI: A Comparison of AI-Powered Analytics Tools. Salesforce-Informatica Deal Could Transform Enterprise GenAI Forever. ✨ Expert Insights from Packt Community ChatGPT for Cybersecurity Cookbook by Clint Bodungen. 💡 What's the Latest Scoop from the BI Community? Geospatial Data Analysis with Geemap. Microsoft Fabric Table Maintenance - Checkpoint and Statistics. Identifying Customer Buying Pattern in Power BI - Part 1. Full vs. Incremental Loads – Data Engineering with Fabric. Joining Queries in Azure Data Factory on Cosmos DB Sources. Feature Engineering with Microsoft Fabric and Dataflow Gen2. Stay ahead in the ever-evolving landscape of business intelligence with BI-Pro. Unleash the full potential of your data today! 📥 Feedback on the Weekly EditionTake our weekly survey and get a free PDF copy of our best-selling book, "Interactive Data Visualization with Python - Second Edition."📣 And here's the twist – we're tuning into YOUR frequency! Inspired by a reader's request, we're launching a column just for you. Got a burning question or a topic you're itching to dive into? Drop your suggestions in our content box – because your journey of discovery is our blueprint.We appreciate your input and hope you enjoy the book!Share your thoughts and opinions here! Cheers,Merlyn ShelleyEditor-in-Chief, PacktSign Up | Advertise | Archives🚀 GitHub's Most Sought-After Repos 🐾 altair - Vega-Altair is a Python library for statistical visualization, offering simplicity, friendliness, and consistency for creating beautiful and effective visualizations. 🐾 bokeh - Bokeh is a Python library for creating interactive plots and data applications in web browsers, offering elegant and versatile graphics. 🐾 bqplot - bqplot is a 2-D visualization system for Jupyter, based on the Grammar of Graphics, enabling interactive plots with other Jupyter widgets. 🐾 cartopy - Cartopy simplifies map drawing in Python, offering easy projection definitions, point transformations, and integration with Matplotlib for advanced mapping. 🐾 diagrams - Diagrams simplifies cloud system architecture design in Python, supporting major providers and frameworks, allowing prototyping and visualization of existing architectures. 
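To give a quick taste of the declarative style several of these libraries share, here is a minimal Vega-Altair sketch; the DataFrame and its column names are invented purely for illustration:

import altair as alt
import pandas as pd

# A tiny, made-up dataset purely for illustration
df = pd.DataFrame({"day": [1, 2, 3, 4, 5], "sales": [3, 7, 4, 9, 6]})

# Declarative spec: map DataFrame columns to visual channels and let Altair render the chart
chart = alt.Chart(df).mark_line(point=True).encode(x="day", y="sales")

# Save to a standalone HTML file (or simply display the chart object in a Jupyter notebook)
chart.save("sales.html")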
Email Forwarded? Join BI-Pro Here!🔮 Data Viz with Python Libraries   🐍 Exploring causality with Python. Difference-in-differences: The series dives into causal inference, crucial in modern analytics, explaining tools like difference-in-differences. It explores how events impact outcomes, using examples such as minimum wage effects on employment. The setup involves treatment and control groups to establish cause-and-effect relationships in diverse real-world scenarios. 🐍 Meet the NiceGUI: Your Soon-to-be Favorite Python UI Library. NiceGUI is a Python UI framework for web and desktop apps, offering a simple interface for small projects, dashboards, and robotics. It simplifies state management and interaction, boasting features like easy layout, visualization tools, and integration with popular libraries. 🐍 Feature Engineering with Microsoft Fabric and PySpark: The post delves into feature engineering in Microsoft Fabric, emphasizing its importance in ML development. It explores PySpark's role in handling large datasets and provides a basic overview and example of using PySpark for feature engineering. 🐍 10 GitHub Repositories to Master Python: The blog explores 10 essential GitHub repositories for mastering Python, emphasizing hands-on experience and real-world projects to enhance skills. It covers a range of topics, from beginner to advanced, including machine learning, web development, and data analysis. Asabeneh/30-Days-Of-Python  trekhleb/learn-python  Avik-Jain/100-Days-Of-ML-Code  realpython/python-guide  zhiwehu/Python-programming-exercises  geekcomputers/Python  practical-tutorials/project-based-learning  avinashkranjan/Amazing-Python-Scripts  TheAlgorithms/Python  vinta/awesome-python   ⚡Stay Informed with Industry Highlights Power BI 📊 On-premises data gateway April 2024 release: This update to the on-premises data gateway aligns it with the April 2024 release of Power BI Desktop, ensuring consistency in query execution. Additionally, the gateway now supports refreshes longer than one hour, allowing tokens to be refreshed mid-stream for continuous operation.  📊 Copilot in Power BI: Soon available to more users in your organization. The update introduces changes to Copilot in Power BI, including enabling Copilot by default for all tenants starting May 20th, 2024. It also addresses features reported by customers and community, updates abuse monitoring to not store prompts, and improves geo mapping for EU data boundary customers. Microsoft Fabric📊 Introducing Optimistic Job Admission for Fabric Spark: The post introduces Optimistic Job Admission for Spark in Microsoft Fabric, a new feature aimed at improving concurrency and job admission experience. It explains how this feature optimizes resource allocation and increases the number of concurrent jobs that can be admitted to the cluster. 📊 Introducing Job Queueing for Notebook in Microsoft Fabric: Microsoft Fabric introduces Job Queueing for Notebook Jobs to streamline data engineering and data science processes. This feature automatically queues notebook jobs when Fabric capacity is maxed out, eliminating manual retries and improving user experience. Jobs are retried when resources become available, enhancing efficiency for enterprise users. AWS BI  📊 Meet one of Amazon QuickSight’s Top Community Experts: Sanjeeb Mohapatra. The Amazon QuickSight Community, launched in 2022, is a hub for BI authors and developers to collaborate, ask and answer questions, and learn about QuickSight. 
Sanjeeb Mohapatra, the top Community Expert for 2023, exemplifies the community's spirit by providing over 1,700 replies and 235 solutions in one year. 📊 Handle tables without primary keys while creating Amazon Aurora MySQL or Amazon RDS for MySQL zero-ETL integrations with Amazon Redshift: AWS is advancing its zero-ETL vision with Amazon Aurora zero-ETL integration to Amazon Redshift, combining transactional data with analytics capabilities. This integration, along with four new ones announced at re:Invent 2023, empowers customers to implement near real-time analytics for various use cases. 📊 Power analytics as a service capabilities using Amazon Redshift: Analytics as a service (AaaS) leverages cloud-based analytic capabilities to enable cost-effective, scalable solutions for organizations. Amazon Redshift, a cloud data warehouse service, facilitates real-time insights and predictive analytics, empowering AaaS providers to embed rich data analytics capabilities. Delivery models include managed, bring-your-own-Redshift (BYOR), and hybrid options, offering flexibility to meet customer needs. Google Cloud Data 📊 How to use Gemini 1.0 Pro Vision in BigQuery? BigQuery integrates with Vertex AI to leverage Gemini 1.0 Pro, PaLM, Vision AI, Speech AI, Doc AI, Natural Language AI, enabling analysis of unstructured data like images, audio, and documents. New integrations support multimodal generative AI, enhancing capabilities for object recognition, info seeking, captioning, digital content understanding, and structured content generation, allowing structured data output for deeper analysis. 📊 Get to know BigQuery data canvas: BigQuery Data Canvas simplifies the data-to-insights journey by offering a natural language-driven experience. It centralizes data tasks, accelerates analysis, and fosters collaboration, all within a unified workspace, enabling faster and more efficient data analytics. 📊 Gemini in Looker to bring intelligent AI-powered BI to everyone: Gemini in Looker introduces Conversational Analytics, transforming how businesses engage with data. It offers a natural language-driven experience, simplifying data analytics and fostering collaboration, all within a unified workspace. 📊 Memorystore for Redis Cluster updates at Next ‘24: The article elaborates on the rapid adoption and recent enhancements of Google Cloud's Memorystore for Redis Cluster. It features customer testimonials from companies like Statsig, Character.AI, and AXON Networks, showcasing the service's performance, scalability, and cost-effectiveness. It also highlights new features such as data persistence, new node types, and ultra-fast vector search. 📊 Firestore launches at Next ‘24: Firestore is beloved by developers for its speed in app development. Updates include improved developer productivity, AI-enabled app building, richer queries, and enterprise-level scalability. Gemini Code Assist now supports Firestore, allowing natural language queries and data model definitions, enhancing the development experience. Firestore also supports AI applications and integrations with LangChain and LlamaIndex for generative AI. Tableau📊 Tableau vs Power BI: A Comparison of AI-Powered Analytics Tools. The comparison delves into the unique strengths of Tableau and Power BI, showcasing how each excels in different areas of data visualization and analytics. 
It outlines Tableau's robust visualizations and analytics capabilities, especially for large datasets, contrasting with Power BI's integration with Microsoft services and affordability for small to medium-sized businesses. 📊 Salesforce-Informatica Deal Could Transform Enterprise GenAI Forever: Salesforce is reportedly in advanced talks to acquire Informatica, a data-management software provider, for $11 billion. This aligns with Salesforce's strategy to expand beyond CRM, bolstered by recent AI advancements like Einstein Copilot, complementing Informatica's data integration expertise and potential synergy with Tableau and MuleSoft. Additionally, it aligns with Salesforce's strategy to expand beyond CRM and become a comprehensive data journey platform. ✨ Expert Insights from Packt Community ChatGPT for Cybersecurity Cookbook - By Clint Bodungen Sending API Requests and Handling Responses with PythonIn this recipe, we will explore how to send requests to the OpenAI GPT API and handle the responses using Python. We’ll walk through the process of constructing API requests, sending them, and processing the responses using the openai module. Getting ready Ensure you have Python installed on your system. Install the OpenAI Python module by running the following command in your Terminal or command prompt: pip install openai How to do it… The importance of using the API lies in its ability to communicate with and get valuable insights from ChatGPT in real time. By sending API requests and handling responses, you can harness the power of GPT to answer questions, generate content, or solve problems in a dynamic and customizable way. In the following steps, we’ll demonstrate how to construct API requests, send them, and process the responses, enabling you to effectively integrate ChatGPT into your projects or applications: Start by importing the required modules: import openai from openai import OpenAI import os Set up your API key by retrieving it from an environment variable, as we did in the Setting the OpenAI API key as an Environment Variable recipe: openai.api_key = os.getenv("OPENAI_API_KEY") Define a function to send a prompt to the OpenAI API and receive a response:client = OpenAI() def get_chat_gpt_response(prompt):  response = client.chat.completions.create(    model="gpt-3.5-turbo",    messages=[{"role": "user", "content": prompt}],    max_tokens=2048,    temperature=0.7  )  return response.choices[0].message.content.strip() Call the function with a prompt to send a request and receive a response:prompt = "Explain the difference between symmetric and asymmetric encryption." response_text = get_chat_gpt_response(prompt) print(response_text) How it works… First, we import the required modules. The openai module is the OpenAI API library, and the os module helps us retrieve the API key from an environment variable. We set up the API key by retrieving it from an environment variable using the os module. Next, we define a function called get_chat_gpt_response() that takes a single argument: the prompt. This function sends a request to the OpenAI API using the openai.Completion.create() method. This method has several parameters: engine: Here, we specify the engine (in this case, chat-3.5-turbo). prompt: The input text for the model to generate a response. max_tokens: The maximum number of tokens in the generated response. A token can be as short as one character or as long as one word. n: The number of generated responses you want to receive from the model. 
In this case, we’ve set it to 1 to receive a single response. stop: A sequence of tokens that, if encountered by the model, will stop the generation process. This can be useful for limiting the response’s length or stopping at specific points, such as the end of a sentence or paragraph. temperature: A value that controls the randomness of the generated response. A higher temperature (for example, 1.0) will result in more random responses, while a lower temperature (for example, 0.1) will make the responses more focused and deterministic. Discover more insights from ChatGPT for Cybersecurity Cookbook - By Clint Bodungen. Unlock access to the full book and a wealth of other titles with a 7-day free trial in the Packt Library. Start exploring today!Read Here💡 What's the Latest Scoop from the BI Community?  🧠 Geospatial Data Analysis with Geemap: This article introduces geospatial data analysis, focusing on raster data from Google Earth Engine, accessed and analyzed using the Geemap Python library. Earth Engine offers a vast catalog of geospatial datasets, and Geemap simplifies access and analysis, making it easier to work with such data in Python. 🧠 Microsoft Fabric Table Maintenance - Checkpoint and Statistics: This article discusses the maintenance requirements for warehouse tables in Microsoft Fabric, particularly focusing on tasks like updating statistics, removing fragmentation, and managing log files. While some maintenance tasks, such as data compaction and log file checkpointing, are automated, others, like managing statistics, may require manual intervention. 🧠 Identifying Customer Buying Pattern in Power BI - Part 1: This article is part 1 of a retail analytics analysis in Power BI, focusing on customer purchasing frequency for various products over the years. It includes identifying data elements, creating calculated columns, and analyzing trends to aid in business decision-making. 🧠 Full vs. Incremental Loads – Data Engineering with Fabric: This article discusses using Apache Spark in Microsoft Fabric to achieve data quality zones (bronze and silver) in a data lake. It explores loading weather data, transforming it with Spark SQL and DataFrames, and implementing full and incremental load patterns. 🧠 Joining Queries in Azure Data Factory on Cosmos DB Sources: This article provides a detailed guide on joining two queries in Azure Data Factory (ADF). It covers prerequisites, creation of data sources, defining queries for each dataset, and using the "Join" transformation in ADF to merge data. Different join types such as inner, left outer, right outer, and full outer joins are explained. 🧠 Feature Engineering with Microsoft Fabric and Dataflow Gen2: This article introduces Dataflow Gen2 as a low-code data transformation and integration engine for creating data pipelines in Microsoft Fabric. It focuses on using Dataflow Gen2 to create features needed for training a machine learning model with college basketball game data, offering different approaches from no code to all code. See you next time!
How to start using AWS

Packt
23 Feb 2017
12 min read
In this article, by Lucas Chan and Rowan Udell, authors of the book AWS Administration Cookbook, we will cover the following:

- Infrastructure as Code
- AWS CloudFormation

What is AWS?

Amazon Web Services (AWS) is a public cloud provider. It provides infrastructure and platform services at a pay-per-use rate. This means you get on-demand access to resources that you used to have to buy outright. You can get access to enterprise-grade services while only paying for what you need, usually down to the hour. AWS prides itself on providing the primitives to developers so that they can build and scale the solutions that they require.

How to create an AWS account

In order to follow along, you will need an AWS account. Create an account on the AWS website by clicking on the Sign Up button and entering your details.

Regions and Availability Zones on AWS

A fundamental concept of AWS is that its services, and the solutions built on top of them, are architected for failure. This means that a failure of the underlying resources is a scenario actively planned for, rather than avoided until it cannot be ignored. Because of this, all the available services and resources are divided into geographically diverse regions. Using specific regions means you can provide services to your users that are optimized for speed and performance. Within a region, there are always multiple Availability Zones (a.k.a. AZ). Each AZ represents a geographically distinct, but still close, physical data center. AZs have their own facilities and power source, so an event that might take a single AZ offline is unlikely to affect the other AZs in the region. The smaller regions have at least two AZs, and the largest has five. At the time of writing, the following regions are active:

Code             Name            Availability Zones
us-east-1        N. Virginia     5
us-east-2        Ohio            3
us-west-1        N. California   3
us-west-2        Oregon          3
ca-central-1     Canada          2
eu-west-1        Ireland         3
eu-west-2        London          2
eu-central-1     Frankfurt       2
ap-northeast-1   Tokyo           3
ap-northeast-2   Seoul           2
ap-southeast-1   Singapore       2
ap-southeast-2   Sydney          3
ap-south-1       Mumbai          2
sa-east-1        Sao Paulo       3

The AWS web console

The web-based console is the first thing you will see after creating your AWS account, and you will often refer to it when viewing and confirming your configuration. The console provides an overview of all the services available as well as associated billing and cost information. Each service has its own section, and the information displayed depends on the service being viewed. As new features and services are released, the console will change and improve. Don't be surprised if you log in and things have changed from one day to the next. Keep in mind that the console always shows your resources by region. If you cannot see a resource that you created, make sure you have the right region selected. Choose the region closest to your physical location for the fastest response times. Note that not all regions have the same services available. The larger, older regions generally have the most services available. Some of the newer or smaller regions (which might be closest to you) might not have all services enabled yet. While services are continually being released to regions, you may have to use another region if you simply must use a newer service. The us-east-1 (a.k.a. North Virginia) region is special given its status as the first region. All services are available there, and new services are always released there.
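If you would rather check this information programmatically than click through the console, the AWS CLI tool (introduced in the next section) can list regions and Availability Zones for you. A quick sketch, assuming the CLI is installed and configured with valid credentials:

$ aws ec2 describe-regions --output table
$ aws ec2 describe-availability-zones --region us-east-1 --output table

The exact lists you see will differ from the table above, since AWS keeps adding regions and zones over time.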
As you get more advanced with your usage of AWS, you will spend less time in the console and more time controlling your services programmatically via the AWS CLI tool and CloudFormation, which we will go into in more detail in the next few topics. CloudFormation templates CloudFormation is the Infrastructure as Code service from AWS. Where CloudFormation was not applicable, we have used the AWS CLI to make the process repeatable and automatable. Since the recipes are based on CloudFormation templates, you can easily combine different templates to achieve your desired outcomes. By editing the templates or joining them, you can create more useful and customized configurations with minimal effort. What is Infrastructure as Code? Infrastructure as Code (IaC) is the practice of managing infrastructure though code definitions. On an Infrastructure-as-a-Service (IaaS) platform such as AWS, IaC is needed to get the most utility and value. IaC differs primarily from traditional interactive methods of managing infrastructure because it is machine processable. This enables a number of benefits: Improved visibility of resources Higher levels of consistency between deployments and environments Easier troubleshooting of issues The ability to scale more with less effort Better control over costs On a less tangible level, all of these factors contribute to other improvements for your developers: you can now leverage tried-and-tested software development practices for your infrastructure and enable DevOps practices in your teams. Visibility As your infrastructure is represented in machine-readable files, you can treat it like you do your application code. You can take the best-practice approaches to software development and apply them to your infrastructure. This means you can store it in version control (for example, Git and SVN) just like you do your code, along with the benefits that it brings: All changes to infrastructure are recorded in commit history You can review changes before accepting/merging them You can easily compare different configurations You can pick and use specific point-in-time configurations Consistency Consistent configuration across your environments (for example, dev, test, and prod) means that you can more confidently deploy your infrastructure. When you know what configuration is in use, you can easily test changes in other environments due to a common baseline. IaC is not the same as just writing scripts for your infrastructure. Most tools and services will leverage higher-order languages and DSLs to allow you to focus on your higher-level requirements. It enables you to use advanced software development techniques, such as static analysis, automated testing, and optimization. Troubleshooting IaC makes replicating and troubleshooting issues easier: since you can duplicate your environments, you can accurately reproduce your production environment for testing purposes. In the past, test environments rarely had exactly the same infrastructure due to the prohibitive cost of hardware. Now that it can be created and destroyed on demand, you are able to duplicate your environments only when they are needed. You only need to pay for the time that they are running for, usually down to the hour. Once you have finished testing, simply turn your environments off and stop paying for them. Even better than troubleshooting is fixing issues before they cause errors. As you refine your IaC in multiple environments, you will gain confidence that is difficult to obtain without it. 
By the time you deploy your infrastructure in to production, you have done it multiple times already. Scale Configuring infrastructure by hand can be a tedious and error-prone process. By automating it, you remove the potential variability of a manual implementation: computers are good at boring, repetitive tasks, so use them for it! Once automated, the labor cost of provisioning more resources is effectively zero—you have already done the work. Whether you need to spin up one server or a thousand, it requires no additional work. From a practical perspective, resources in AWS are effectively unconstrained. If you are willing to pay for it, AWS will let you use it. Costs AWS have a vested (commercial) interest in making it as easy as possible for you to provision infrastructure. The benefit to you as the customer is that you can create and destroy these resources on demand. Obviously, destroying infrastructure on demand in a traditional, physical hardware environment is simply not possible. You would be hard-pressed to find a data center that will allow you to stop paying for servers and space simply because you are not currently using them. Another use case where on demand infrastructure can make large cost savings is your development environment. It only makes sense to have a development environment while you have developers to use it. When your developers go home at the end of the day, you can switch off your development environments so that you no longer pay for them. Before your developers come in in the morning, simply schedule their environments to be created. DevOps DevOps and IaC go hand in hand. The practice of storing your infrastructure (traditionally the concern of operations) as code (traditionally the concern of development) encourages a sharing of responsibilities that facilitates collaboration. Image courtesy: Wikipedia By automating the PACKAGE, RELEASE, and CONFIGURE activities in the software development life cycle (as pictured), you increase the speed of your releases while also increasing the confidence. Cloud-based IaC encourages architecture for failure: as your resources are virtualized, you must plan for the chance of physical (host) hardware failure, however unlikely. Being able to recreate your entire environment in minutes is the ultimate recovery solution. Unlike physical hardware, you can easily simulate and test failure in your software architecture by deleting key components—they are all virtual anyway! Server configuration Server-side examples of IaC are configuration-management tools such as Ansible, Chef, and Puppet. While important, these configuration-management tools are not specific to AWS, so we will not be covering them in detail here. IaC on AWS CloudFormation is the IaC service from AWS. Templates written in a specific format and language define the AWS resources that should be provisioned. CloudFormation is declarative and can not only provision resources, but also update them. We will go into CloudFormation in greater detail in the following topic. CloudFormation We'll use CloudFormation extensively throughout, so it's important that you have an understanding of what it is and how it fits in to the AWS ecosystem. There should easily be enough information here to get you started, but where necessary, we'll refer you to AWS' own documentation. What is CloudFormation? The CloudFormation service allows you to provision and manage a collection of AWS resources in an automated and repeatable fashion. 
In AWS terminology, these collections are referred to as stacks. Note however that a stack can be as large or as small as you like. It might consist of a single S3 bucket, or it might contain everything needed to host your three-tier web app. In this article, we'll show you how to define the resources to be included in your CloudFormation stack. We'll talk a bit more about the composition of these stacks and why and when it's preferable to divvy up resources between a number of stacks. Finally, we'll share a few of the tips and tricks we've learned over years of building countless CloudFormation stacks. Be warned: pretty much everyone incurs at least one or two flesh wounds along their journey with CloudFormation. It is all very much worth it, though. Why is CloudFormation important? By now, the benefits of automation should be starting to become apparent to you. But don't fall in to the trap of thinking CloudFormation will be useful only for large collections of resources. Even performing the simplest task of, say, creating an S3 bucket can get very repetitive if you need to do it in every region. We work with a lot of customers who have very tight controls and governance around their infrastructure, especially in the finance sector, and especially in the network layer (think VPCs, NACLs, and security groups). Being able to express one's cloud footprint in YAML (or JSON), store it in a source code repository, and funnel it through a high-visibility pipeline gives these customers confidence that their infrastructure changes are peer-reviewed and will work as expected in production. Discipline and commitment to IaC SDLC practices are of course a big factor in this, but CloudFormation helps bring us out of the era of following 20-page run-sheets for manual changes, navigating untracked or unexplained configuration drift, and unexpected downtime caused by fat fingers. The layer cake Now is a good time to start thinking about your AWS deployments in terms of layers. Your layers will sit atop one another, and you will have well-defined relationships between them. Here's a bottom-up example of how your layer cake might look: VPC Subnets, routes, and NACLs NAT gateways, VPN or bastion hosts, and associated security groups App stack 1: security groups, S3 buckets App stack 1: cross-zone RDS and read replica App stack 1: app and web server auto scaling groups and ELBs App stack 1: cloudfront and WAF config In this example, you may have many occurrences of the app stack layers inside your VPC, assuming you have enough IP addresses in your subnets! This is often the case with VPCs living inside development environments. So immediately, you have the benefit of multi-tenancy capability with application isolation. One advantage of this approach is that while you are developing your CloudFormation template, if you mess up the configuration of your app server, you don't have to wind back all the work CFN did on your behalf. You can just turf that particular layer (and the layers that depend on it) and restart from there. This is not the case if you have everything contained in a single template. We commonly work with customers for whom ownership and management of each layer in the cake reflects the structure of the technological divisions within a company. The traditional infrastructure, network, and cyber security folk are often really interested in creating a safe place for digital teams to deploy their apps, so they like to heavily govern the foundational layers of the cake. 
Conway's Law, coined by Melvin Conway, starts to come into play here: "Any organization that designs a system will inevitably produce a design whose structure is a copy of the organization's communication structure." Finally, even if you are a single-person infrastructure coder working in a small team, you will benefit from this approach. For example, you'll find that it dramatically reduces your exposure to things such as AWS limits, timeouts, and circular dependencies.
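To make the layer-cake idea a little more concrete, here is a minimal, hypothetical CloudFormation template sketch: a single S3 bucket in a foundational layer, with its name exported so that a higher layer can import it rather than hardcode it. The resource and export names are invented for illustration.

AWSTemplateFormatVersion: '2010-09-09'
Description: Illustrative foundation-layer stack containing a single S3 bucket

Resources:
  AppBucket:
    Type: AWS::S3::Bucket   # the bucket name is generated by CloudFormation

Outputs:
  AppBucketName:
    Description: Bucket name exported for higher layers in the cake
    Value: !Ref AppBucket    # for S3 buckets, Ref returns the bucket name
    Export:
      Name: example-app-bucket-name

A stack in a higher layer could then consume it with !ImportValue example-app-bucket-name, which is one way the well-defined relationships between layers get expressed in CloudFormation.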
Build and train an RNN chatbot using TensorFlow [Tutorial]

Sunith Shetty
28 Jun 2018
21 min read
Chatbots are increasingly used as a way to provide assistance to users. Many companies, including banks, mobile/landline companies and large e-sellers now use chatbots for customer assistance and for helping users in pre and post sales queries. They are a great tool for companies which don't need to provide additional customer service capacity for trivial questions: they really look like a win-win situation! In today’s tutorial, we will understand how to train an automatic chatbot that will be able to answer simple and generic questions, and how to create an endpoint over HTTP for providing the answers via an API. This article is an excerpt from a book written by Luca Massaron, Alberto Boschetti, Alexey Grigorev, Abhishek Thakur, and Rajalingappaa Shanmugamani titled TensorFlow Deep Learning Projects. There are mainly two types of chatbot: the first is a simple one, which tries to understand the topic, always providing the same answer for all questions about the same topic. For example, on a train website, the questions Where can I find the timetable of the City_A to City_B service? and What's the next train departing from City_A? will likely get the same answer, that could read Hi! The timetable on our network is available on this page: <link>. This types of chatbots use classification algorithms to understand the topic (in the example, both questions are about the timetable topic). Given the topic, they always provide the same answer. Usually, they have a list of N topics and N answers; also, if the probability of the classified topic is low (the question is too vague, or it's on a topic not included in the list), they usually ask the user to be more specific and repeat the question, eventually pointing out other ways to do the question (send an email or call the customer service number, for example). The second type of chatbots is more advanced, smarter, but also more complex. For those, the answers are built using an RNN, in the same way, that machine translation is performed. Those chatbots are able to provide more personalized answers, and they may provide a more specific reply. In fact, they don't just guess the topic, but with an RNN engine, they're able to understand more about the user's questions and provide the best possible answer: in fact, it's very unlikely you'll get the same answers with two different questions using these types of chatbots. The input corpus Unfortunately, we haven't found any consumer-oriented dataset that is open source and freely available on the Internet. Therefore, we will train the chatbot with a more generic dataset, not really focused on customer service. Specifically, we will use the Cornell Movie Dialogs Corpus, from the Cornell University. The corpus contains the collection of conversations extracted from raw movie scripts, therefore the chatbot will be able to answer more to fictional questions than real ones. The Cornell corpus contains more than 200,000 conversational exchanges between 10+ thousands of movie characters, extracted from 617 movies. The dataset is available here: https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html. We would like to thank the authors for having released the corpus: that makes experimentation, reproducibility and knowledge sharing easier. The dataset comes as a .zip archive file. After decompressing it, you'll find several files in it: README.txt contains the description of the dataset, the format of the corpora files, the details on the collection procedure and the author's contact. 
Chameleons.pdf is the original paper for which the corpus has been released. Although the goal of the paper is strictly not around chatbots, it studies the language used in dialogues, and it's a good source of information to understanding more movie_conversations.txt contains all the dialogues structure. For each conversation, it includes the ID of the two characters involved in the discussion, the ID of the movie and the list of sentences IDs (or utterances, to be more precise) in chronological order. For example, the first line of the file is: u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197'] That means that user u0 had a conversation with user u2 in the movie m0 and the conversation had 4 utterances: 'L194', 'L195', 'L196' and 'L197' movie_lines.txt contains the actual text of each utterance ID and the person who produced it. For example, the utterance L195 is listed here as: L195 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Well, I thought we'd start with pronunciation, if that's okay with you. So, the text of the utterance L195 is Well, I thought we'd start with pronunciation, if that's okay with you. And it was pronounced by the character u2 whose name is CAMERON in the movie m0. movie_titles_metadata.txt contains information about the movies, including the title, year, IMDB rating, the number of votes in IMDB and the genres. For example, the movie m0 here is described as: m0 +++$+++ 10 things i hate about you +++$+++ 1999 +++$+++ 6.90 +++$+++ 62847 +++$+++ ['comedy', 'romance'] So, the title of the movie whose ID is m0 is 10 things i hate about you, it's from 1999, it's a comedy with romance and it received almost 63 thousand votes on IMDB with an average score of 6.9 (over 10.0) movie_characters_metadata.txt contains information about the movie characters, including the name the title of the movie where he/she appears, the gender (if known) and the position in the credits (if known). For example, the character “u2” appears in this file with this description: u2 +++$+++ CAMERON +++$+++ m0 +++$+++ 10 things i hate about you +++$+++ m +++$+++ 3 The character u2 is named CAMERON, it appears in the movie m0 whose title is 10 things i hate about you, his gender is male and he's the third person appearing in the credits. raw_script_urls.txt contains the source URL where the dialogues of each movie can be retrieved. For example, for the movie m0 that's it: m0 +++$+++ 10 things i hate about you +++$+++ http://www.dailyscript.com/scripts/10Things.html As you will have noticed, most files use the token  +++$+++  to separate the fields. Beyond that, the format looks pretty straightforward to parse. Please take particular care while parsing the files: their format is not UTF-8 but ISO-8859-1. Creating the training dataset Let's now create the training set for the chatbot. We'd need all the conversations between the characters in the correct order: fortunately, the corpora contains more than what we actually need. For creating the dataset, we will start by downloading the zip archive, if it's not already on disk. We'll then decompress the archive in a temporary folder (if you're using Windows, that should be C:Temp), and we will read just the movie_lines.txt and the movie_conversations.txt files, the ones we really need to create a dataset of consecutive utterances. Let's now go step by step, creating multiple functions, one for each step, in the file corpora_downloader.py. The first function we need is to retrieve the file from the Internet, if not available on disk. 
def download_and_decompress(url, storage_path, storage_dir): import os.path directory = storage_path + "/" + storage_dir zip_file = directory + ".zip" a_file = directory + "/cornell movie-dialogs corpus/README.txt" if not os.path.isfile(a_file): import urllib.request import zipfile urllib.request.urlretrieve(url, zip_file) with zipfile.ZipFile(zip_file, "r") as zfh: zfh.extractall(directory) return This function does exactly that: it checks whether the “README.txt” file is available locally; if not, it downloads the file (thanks for the urlretrieve function in the urllib.request module) and it decompresses the zip (using the zipfile module). The next step is to read the conversation file and extract the list of utterance IDS. As a reminder, its format is: u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197'], therefore what we're looking for is the fourth element of the list after we split it on the token  +++$+++ . Also, we'd need to clean up the square brackets and the apostrophes to have a clean list of IDs. For doing that, we shall import the re module, and the function will look like this. import re def read_conversations(storage_path, storage_dir): filename = storage_path + "/" + storage_dir + "/cornell movie-dialogs corpus/movie_conversations.txt" with open(filename, "r", encoding="ISO-8859-1") as fh: conversations_chunks = [line.split(" +++$+++ ") for line in fh] return [re.sub('[[]']', '', el[3].strip()).split(", ") for el in conversations_chunks] As previously said, remember to read the file with the right encoding, otherwise, you'll get an error. The output of this function is a list of lists, each of them containing the sequence of utterance IDS in a conversation between characters. Next step is to read and parse the movie_lines.txt file, to extract the actual utterances texts. As a reminder, the file looks like this line: L195 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Well, I thought we'd start with pronunciation, if that's okay with you. Here, what we're looking for are the first and the last chunks. def read_lines(storage_path, storage_dir): filename = storage_path + "/" + storage_dir + "/cornell movie-dialogs corpus/movie_lines.txt" with open(filename, "r", encoding="ISO-8859-1") as fh: lines_chunks = [line.split(" +++$+++ ") for line in fh] return {line[0]: line[-1].strip() for line in lines_chunks} The very last bit is about tokenization and alignment. We'd like to have a set whose observations have two sequential utterances. In this way, we will train the chatbot, given the first utterance, to provide the next one. Hopefully, this will lead to a smart chatbot, able to reply to multiple questions. Here's the function: def get_tokenized_sequencial_sentences(list_of_lines, line_text): for line in list_of_lines: for i in range(len(line) - 1): yield (line_text[line[i]].split(" "), line_text[line[i+1]].split(" ")) Its output is a generator containing a tuple of the two utterances (the one on the right follows temporally the one on the left). Also, utterances are tokenized on the space character. Finally, we can wrap up everything into a function, which downloads the file and unzip it (if not cached), parse the conversations and the lines, and format the dataset as a generator. 
As a default, we will store the files in the /tmp directory: def retrieve_cornell_corpora(storage_path="/tmp", storage_dir="cornell_movie_dialogs_corpus"): download_and_decompress("http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip", storage_path, storage_dir) conversations = read_conversations(storage_path, storage_dir) lines = read_lines(storage_path, storage_dir) return tuple(zip(*list(get_tokenized_sequencial_sentences(conversations, lines)))) At this point, our training set looks very similar to the training set used in the translation project. We can, therefore, use some pieces of code we've developed in the machine learning translation article. For example, the corpora_tools.py file can be used here without any change (also, it requires the data_utils.py). Given that file, we can dig more into the corpora, with a script to check the chatbot input. To inspect the corpora, we can use the corpora_tools.py, and the file we've previously created. Let's retrieve the Cornell Movie Dialog Corpus, format the corpora and print an example and its length: from corpora_tools import * from corpora_downloader import retrieve_cornell_corpora sen_l1, sen_l2 = retrieve_cornell_corpora() print("# Two consecutive sentences in a conversation") print("Q:", sen_l1[0]) print("A:", sen_l2[0]) print("# Corpora length (i.e. number of sentences)") print(len(sen_l1)) assert len(sen_l1) == len(sen_l2) This code prints an example of two tokenized consecutive utterances, and the number of examples in the dataset, that is more than 220,000: # Two consecutive sentences in a conversation Q: ['Can', 'we', 'make', 'this', 'quick?', '', 'Roxanne', 'Korrine', 'and', 'Andrew', 'Barrett', 'are', 'having', 'an', 'incredibly', 'horrendous', 'public', 'break-', 'up', 'on', 'the', 'quad.', '', 'Again.'] A: ['Well,', 'I', 'thought', "we'd", 'start', 'with', 'pronunciation,', 'if', "that's", 'okay', 'with', 'you.'] # Corpora length (i.e. number of sentences) 221616 Let's now clean the punctuation in the sentences, lowercase them and limits their size to 20 words maximum (that is examples where at least one of the sentences is longer than 20 words are discarded). This is needed to standardize the tokens: clean_sen_l1 = [clean_sentence(s) for s in sen_l1] clean_sen_l2 = [clean_sentence(s) for s in sen_l2] filt_clean_sen_l1, filt_clean_sen_l2 = filter_sentence_length(clean_sen_l1, clean_sen_l2) print("# Filtered Corpora length (i.e. number of sentences)") print(len(filt_clean_sen_l1)) assert len(filt_clean_sen_l1) == len(filt_clean_sen_l2) This leads us to almost 140,000 examples: # Filtered Corpora length (i.e. number of sentences) 140261 Then, let's create the dictionaries for the two sets of sentences. Practically, they should look the same (since the same sentence appears once on the left side, and once in the right side) except there might be some changes introduced by the first and last sentences of a conversation (they appear only once). 
To make the best out of our corpora, let's build two dictionaries of words and then encode all the words in the corpora with their dictionary indexes: dict_l1 = create_indexed_dictionary(filt_clean_sen_l1, dict_size=15000, storage_path="/tmp/l1_dict.p") dict_l2 = create_indexed_dictionary(filt_clean_sen_l2, dict_size=15000, storage_path="/tmp/l2_dict.p") idx_sentences_l1 = sentences_to_indexes(filt_clean_sen_l1, dict_l1) idx_sentences_l2 = sentences_to_indexes(filt_clean_sen_l2, dict_l2) print("# Same sentences as before, with their dictionary ID") print("Q:", list(zip(filt_clean_sen_l1[0], idx_sentences_l1[0]))) print("A:", list(zip(filt_clean_sen_l2[0], idx_sentences_l2[0]))) That prints the following output. We also notice that a dictionary of 15 thousand entries doesn't contain all the words and more than 16 thousand (less popular) of them don't fit into it: [sentences_to_indexes] Did not find 16823 words [sentences_to_indexes] Did not find 16649 words # Same sentences as before, with their dictionary ID Q: [('well', 68), (',', 8), ('i', 9), ('thought', 141), ('we', 23), ("'", 5), ('d', 83), ('start', 370), ('with', 46), ('pronunciation', 3), (',', 8), ('if', 78), ('that', 18), ("'", 5), ('s', 12), ('okay', 92), ('with', 46), ('you', 7), ('.', 4)] A: [('not', 31), ('the', 10), ('hacking', 7309), ('and', 23), ('gagging', 8761), ('and', 23), ('spitting', 6354), ('part', 437), ('.', 4), ('please', 145), ('.', 4)] As the final step, let's add paddings and markings to the sentences: data_set = prepare_sentences(idx_sentences_l1, idx_sentences_l2, max_length_l1, max_length_l2) print("# Prepared minibatch with paddings and extra stuff") print("Q:", data_set[0][0]) print("A:", data_set[0][1]) print("# The sentence pass from X to Y tokens") print("Q:", len(idx_sentences_l1[0]), "->", len(data_set[0][0])) print("A:", len(idx_sentences_l2[0]), "->", len(data_set[0][1])) And that, as expected, prints: # Prepared minibatch with paddings and extra stuff Q: [0, 68, 8, 9, 141, 23, 5, 83, 370, 46, 3, 8, 78, 18, 5, 12, 92, 46, 7, 4] A: [1, 31, 10, 7309, 23, 8761, 23, 6354, 437, 4, 145, 4, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0] # The sentence pass from X to Y tokens Q: 19 -> 20 A: 11 -> 22 Training the chatbot After we're done with the corpora, it's now time to work on the model. This project requires again a sequence to sequence model, therefore we can use an RNN. Even more, we can reuse part of the code from the previous project: we'd just need to change how the dataset is built, and the parameters of the model. We can then copy the training script, and modify the build_dataset function, to use the Cornell dataset. Mind that the dataset used in this article is bigger than the one used in the machine learning translation article, therefore you may need to limit the corpora to a few dozen thousand lines. On a 4 years old laptop with 8GB RAM, we had to select only the first 30 thousand lines, otherwise, the program ran out of memory and kept swapping. As a side effect of having fewer examples, even the dictionaries are smaller, resulting in less than 10 thousands words each. 
def build_dataset(use_stored_dictionary=False): sen_l1, sen_l2 = retrieve_cornell_corpora() clean_sen_l1 = [clean_sentence(s) for s in sen_l1][:30000] ### OTHERWISE IT DOES NOT RUN ON MY LAPTOP clean_sen_l2 = [clean_sentence(s) for s in sen_l2][:30000] ### OTHERWISE IT DOES NOT RUN ON MY LAPTOP filt_clean_sen_l1, filt_clean_sen_l2 = filter_sentence_length(clean_sen_l1, clean_sen_l2, max_len=10) if not use_stored_dictionary: dict_l1 = create_indexed_dictionary(filt_clean_sen_l1, dict_size=10000, storage_path=path_l1_dict) dict_l2 = create_indexed_dictionary(filt_clean_sen_l2, dict_size=10000, storage_path=path_l2_dict) else: dict_l1 = pickle.load(open(path_l1_dict, "rb")) dict_l2 = pickle.load(open(path_l2_dict, "rb")) dict_l1_length = len(dict_l1) dict_l2_length = len(dict_l2) idx_sentences_l1 = sentences_to_indexes(filt_clean_sen_l1, dict_l1) idx_sentences_l2 = sentences_to_indexes(filt_clean_sen_l2, dict_l2) max_length_l1 = extract_max_length(idx_sentences_l1) max_length_l2 = extract_max_length(idx_sentences_l2) data_set = prepare_sentences(idx_sentences_l1, idx_sentences_l2, max_length_l1, max_length_l2) return (filt_clean_sen_l1, filt_clean_sen_l2), data_set, (max_length_l1, max_length_l2), (dict_l1_length, dict_l2_length) By inserting this function into the train_translator.py file and rename the file as train_chatbot.py, we can run the training of the chatbot. After a few iterations, you can stop the program and you'll see something similar to this output: [sentences_to_indexes] Did not find 0 words [sentences_to_indexes] Did not find 0 words global step 100 learning rate 1.0 step-time 7.708967611789704 perplexity 444.90090078460474 eval: perplexity 57.442316329639176 global step 200 learning rate 0.990234375 step-time 7.700247814655302 perplexity 48.8545568311572 eval: perplexity 42.190180314697045 global step 300 learning rate 0.98046875 step-time 7.69800933599472 perplexity 41.620538109894945 eval: perplexity 31.291903031786116 ... ... ... global step 2400 learning rate 0.79833984375 step-time 7.686293318271639 perplexity 3.7086356605442767 eval: perplexity 2.8348589631663046 global step 2500 learning rate 0.79052734375 step-time 7.689657487869262 perplexity 3.211876894960698 eval: perplexity 2.973809378544393 global step 2600 learning rate 0.78271484375 step-time 7.690396382808681 perplexity 2.878854805600354 eval: perplexity 2.563583924617356 Again, if you change the settings, you may end up with a different perplexity. To obtain these results, we set the RNN size to 256 and 2 layers, the batch size of 128 samples, and the learning rate to 1.0. At this point, the chatbot is ready to be tested. Although you can test the chatbot with the same code as in the test_translator.py, here we would like to do a more elaborate solution, which allows exposing the chatbot as a service with APIs. Chatbox API First of all, we need a web framework to expose the API. In this project, we've chosen Bottle, a lightweight simple framework very easy to use. To install the package, run pip install bottle from the command line. To gather further information and dig into the code, take a look at the project webpage, https://bottlepy.org. Let's now create a function to parse an arbitrary sentence provided by the user as an argument. All the following code should live in the test_chatbot_aas.py file. 
Let's start with some imports and the function to clean, tokenize and prepare the sentence using the dictionary: import pickle import sys import numpy as np import tensorflow as tf import data_utils from corpora_tools import clean_sentence, sentences_to_indexes, prepare_sentences from train_chatbot import get_seq2seq_model, path_l1_dict, path_l2_dict model_dir = "/home/abc/chat/chatbot_model" def prepare_sentence(sentence, dict_l1, max_length): sents = [sentence.split(" ")] clean_sen_l1 = [clean_sentence(s) for s in sents] idx_sentences_l1 = sentences_to_indexes(clean_sen_l1, dict_l1) data_set = prepare_sentences(idx_sentences_l1, [[]], max_length, max_length) sentences = (clean_sen_l1, [[]]) return sentences, data_set The function prepare_sentence does the following: Tokenizes the input sentence Cleans it (lowercase and punctuation cleanup) Converts tokens to dictionary IDs Add markers and paddings to reach the default length Next, we will need a function to convert the predicted sequence of numbers to an actual sentence composed of words. This is done by the function decode, which runs the prediction given the input sentence and with softmax predicts the most likely output. Finally, it returns the sentence without paddings and markers: def decode(data_set): with tf.Session() as sess: model = get_seq2seq_model(sess, True, dict_lengths, max_sentence_lengths, model_dir) model.batch_size = 1 bucket = 0 encoder_inputs, decoder_inputs, target_weights = model.get_batch( {bucket: [(data_set[0][0], [])]}, bucket) _, _, output_logits = model.step(sess, encoder_inputs, decoder_inputs, target_weights, bucket, True) outputs = [int(np.argmax(logit, axis=1)) for logit in output_logits] if data_utils.EOS_ID in outputs: outputs = outputs[1:outputs.index(data_utils.EOS_ID)] tf.reset_default_graph() return " ".join([tf.compat.as_str(inv_dict_l2[output]) for output in outputs]) Finally, the main function, that is, the function to run in the script: if __name__ == "__main__": dict_l1 = pickle.load(open(path_l1_dict, "rb")) dict_l1_length = len(dict_l1) dict_l2 = pickle.load(open(path_l2_dict, "rb")) dict_l2_length = len(dict_l2) inv_dict_l2 = {v: k for k, v in dict_l2.items()} max_lengths = 10 dict_lengths = (dict_l1_length, dict_l2_length) max_sentence_lengths = (max_lengths, max_lengths) from bottle import route, run, request @route('/api') def api(): in_sentence = request.query.sentence _, data_set = prepare_sentence(in_sentence, dict_l1, max_lengths) resp = [{"in": in_sentence, "out": decode(data_set)}] return dict(data=resp) run(host='127.0.0.1', port=8080, reloader=True, debug=True) Initially, it loads the dictionary and prepares the inverse dictionary. Then, it uses the Bottle API to create an HTTP GET endpoint (under the /api URL). The route decorator sets and enriches the function to run when the endpoint is contacted via HTTP GET. In this case, the api() function is run, which first reads the sentence passed as HTTP parameter, then calls the prepare_sentence function, described above, and finally runs the decoding step. What's returned is a dictionary containing both the input sentence provided by the user and the reply of the chatbot. Finally, the webserver is turned on, on the localhost at port 8080. Isn't very easy to have a chatbot as a service with Bottle? It's now time to run it and check the outputs. 
To run it, execute the following from the command line:

$> python3 -u test_chatbot_aas.py

Then, let's start querying the chatbot with some generic questions. To do so, we can use curl, a simple command-line tool; any browser will also work, just remember that the URL should be encoded (for example, the space character should be replaced with its encoding, %20). curl makes things easier, since it has a simple way to encode the URL request. Here are a couple of examples:

$> curl -X GET -G http://127.0.0.1:8080/api --data-urlencode "sentence=where are you?"
{"data": [{"out": "i ' m here with you .", "in": "where are you?"}]}

$> curl -X GET -G http://127.0.0.1:8080/api --data-urlencode "sentence=are you here?"
{"data": [{"out": "yes .", "in": "are you here?"}]}

$> curl -X GET -G http://127.0.0.1:8080/api --data-urlencode "sentence=are you a chatbot?"
{"data": [{"out": "you ' for the stuff to be right .", "in": "are you a chatbot?"}]}

$> curl -X GET -G http://127.0.0.1:8080/api --data-urlencode "sentence=what is your name ?"
{"data": [{"out": "we don ' t know .", "in": "what is your name ?"}]}

$> curl -X GET -G http://127.0.0.1:8080/api --data-urlencode "sentence=how are you?"
{"data": [{"out": "that ' s okay .", "in": "how are you?"}]}

If the system doesn't work with your browser, try encoding the URL, for example:

$> curl -X GET http://127.0.0.1:8080/api?sentence=how%20are%20you?
{"data": [{"out": "that ' s okay .", "in": "how are you?"}]}

Replies are quite funny; always remember that we trained the chatbot on movie dialogue, so the replies follow that style.

To turn off the web server, use Ctrl + C.

To summarize, we've learned to implement a chatbot that is able to respond to questions through an HTTP endpoint and a GET API. To learn more about designing deep learning systems for a variety of real-world scenarios using TensorFlow, check out the book TensorFlow Deep Learning Projects.

Facebook's Wit.ai: Why we need yet another chatbot development framework?
How to build a chatbot with Microsoft Bot framework
Top 4 chatbot development frameworks for developers


The seven deadly sins of web design

Guest Contributor
13 Mar 2019
7 min read
Just 30 days before the debut of "Captain Marvel," the latest cinematic offering by the successful and prolific Marvel Studios, a delightful and nostalgia-filled website was unveiled to promote the movie. Since the story of "Captain Marvel" is set in the 1990s, the brilliant minds at the marketing department of Marvel Studios decided to design a website with the right look and feel, which in this case meant using FrontPage and hosting on Angelfire. The "Captain Marvel" promo website is filled with the typography, iconography, glitter, and crudely animated GIFs you would expect from a 1990s creation, including a guestbook, hidden easter eggs, flaming borders, hit counter, and even headers made with Microsoft WordArt. (Image courtesy of Marvel) The site is delightful not just for the dead-on nostalgia trip it provides to visitors, but also because it is very well developed. This is a site with a lot to explore, and it is clearly evident that the website developers met client demands while at the same time thinking about users. This site may look and feel like it was made during the GeoCities era, but it does not make any of the following seven mistakes: Sin #1: Non-Responsiveness In 2019, it is simply inconceivable to think of a web development firm that neglects to make a responsive site. Since 2016, internet traffic flowing through mobile devices has been higher than the traffic originating from desktops and laptops. Current rates are about 53 percent smartphones and tablets versus 47 percent desktops, laptops, kiosks, and smart TVs. Failure to develop responsive websites means potentially alienating more than 50 percent of prospective visitors. As for the "Captain Marvel" website, it is amazingly responsive when considering that internet users in the 1990s barely dreamed about the day when they would be able to access the web from handheld devices (mobile phones were yet to be mass distributed back then). Sin #2: Way too much Jargon (Image courtesy of the Botanical Linguist) Not all website developers have a good sense of readability, and this is something that often shows up when completed projects result in product visitors struggling to comprehend. We’re talking about jargon. There’s a lot of it online, not only in the usual places like the privacy policy and terms of service sections but sometimes in content too. Regardless of how jargon creeps onto your website, it should be rooted out. The "Captain Marvel" website features legal notices written by The Walt Disney Company, and they are very reader-friendly with minimal jargon. The best way to handle jargon is to avoid it as much as possible unless the business developer has good reasons to include it. Sin #3: A noticeable lack of content No content means no message, and this is the reason 46 percent of visitors who land on B2B websites end up leaving without further exploration or interaction. Quality content that is relevant to the intention of a website is crucial in terms of establishing credibility, and this goes beyond B2B websites. In the case of "Captain Marvel," the amount of content is reduced to match the retro sensibility, but there are enough photos, film trailers, character bios, and games to keep visitors entertained. Modern website development firms that provide full-service solutions can either provide or advise clients on the content they need to get started. Furthermore, they can also offer lessons on how to operate content management systems. 
Sin #4: Making essential information hard to find

There was a time when the "mystery meat navigation" issue of website development was thought to have been eradicated through the judicious application of recommended practices, but then mobile apps came around. Even technology giant Google fell victim to mystery meat navigation with its 2016 release of Material Design, which introduced bottom navigation bars intended to offer a more clarifying alternative to hamburger menus. Unless there is a clever purpose for prompting visitors to click or tap on a button, link, or page element that does not explain the next step, mystery meat navigation should be avoided, particularly when it comes to essential information.

When the 1990s "Captain Marvel" page loads, visitors can click or tap on labeled links to get information about the film, enjoy multimedia content, play games, interact with the guestbook, or get tickets. There is a mysterious old woman who pops up every now and then from the edges of the screen, but the reason behind this mysterious element is explained in the information section.

Sin #5: The website loads too slowly

(Image courtesy of Horton Marketing Solutions)

There is an anachronism related to the "Captain Marvel" website that users who actually used Netscape in the 1990s will notice: all pages load very fast. This is one retro aspect that Marvel Studios decided not to include on this site, and it makes perfect sense. For a fast-loading site, a web design rule of thumb is to simplify, and this responsibility lies squarely with the developer. It stands to reason that the more "stuff" you have on a page (images, forms, videos, widgets, shiny things), the longer it takes the server to send over the site files and the longer it takes the browser to render them. Here are a few design best practices to keep in mind:

1. Make the site light: get rid of non-essential elements, especially if they are bandwidth-sucking images or video.
2. Compress your pages: it's easy with Gzip (a quick way to check this is sketched at the end of this section).
3. Split long pages into several shorter ones.
4. Write clean code that doesn't rely on external sources.
5. Optimize images.

For more web design tips that help your site load in the sub-three-second range, like Google expects in 2019, check out our article on current design trends. Once you have design issues under control, investigate your web host. They aren't all created equal. Cheap, entry-level shared packages are notoriously slow and unpredictable, especially as your traffic increases. But even beyond that, the reality is that some companies spend money buying better, faster servers and don't overload them with too many clients. Some do. Recent testing from review site HostingCanada.org checked load times across the leading providers and found variances from a 'meh' 2,850 ms all the way down to a speedy 226 ms. With pricing amongst credible competitors roughly equal, web developers should know which hosts are the fastest and point clients in that direction.
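If you want to sanity-check a page against points 2 and 5 above, or compare hosts the way HostingCanada.org did, a short script is enough. The following is a minimal sketch using Python's requests library; the URL is a placeholder and the thresholds are only illustrative, not official guidance.

import requests

def check_page(url):
    # Measure server response time and check whether compression is enabled
    response = requests.get(url)
    elapsed_ms = response.elapsed.total_seconds() * 1000  # time until the response headers arrived
    encoding = response.headers.get("Content-Encoding", "none")

    print(url)
    print("  response time: %.0f ms" % elapsed_ms)
    print("  content-encoding: %s" % encoding)

    # Illustrative checks, not official thresholds
    if elapsed_ms > 3000:
        print("  warning: slower than the sub-three-second rule of thumb")
    if encoding == "none":
        print("  warning: no compression detected; consider enabling Gzip")

if __name__ == "__main__":
    check_page("https://example.com/")  # placeholder URL

Running it against your own pages before and after enabling compression, or against candidate hosts, makes the differences easy to see.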
Sin #6: Outdated information

Functional and accurate information will always triumph over form. The "Captain Marvel" website is garish to look at by 2019 standards, but all the information is current. The film's theater release date is clearly displayed, and should something happen that would require this date to change, you can be sure that Marvel Studios will fire up FrontPage to promptly make the adjustment.

Sin #7: No clear call to action

Every website should compel visitors to do something. Even if the purpose is to provide information, the call to action, or CTA, should encourage visitors to remember it and return for updates. The CTA should be as clear as the navigation elements; otherwise, the purpose of the visit is lost. Creating enticements is acceptable, but the CTA message should be explained nonetheless. In the case of "Captain Marvel," visitors can click on the "Get Tickets" link to be taken to a Fandango.com page with geolocation redirection for their region.

The Bottom Line

In the end, the seven mistakes listed herein are easy to avoid. Whenever developers run into clients whose instructions may result in one of these mistakes, proper explanations should be given.

Author Bio

Gary Stevens is a front-end developer. He's a full-time blockchain geek and a volunteer working for the Ethereum foundation as well as an active Github contributor.

7 Web design trends and predictions for 2019
How to create a web designer resume that lands you a job
Will Grant's 10 commandments for effective UX Design


How To Get Started with Redux in React Native

Emilio Rodriguez
04 Apr 2016
5 min read
In mobile development there is a need for architectural frameworks, but complex frameworks designed to be used in web environments may end up damaging the development process or even the performance of our app. Because of this, some time ago I decided to introduce in all of my React Native projects the leanest framework I ever worked with: Redux. Redux is basically a state container for JavaScript apps. It is 100 percent library-agnostic so you can use it with React, Backbone, or any other view library. Moreover, it is really small and has no dependencies, which makes it an awesome tool for React Native projects. Step 1: Install Redux in your React Native project. Redux can be added as an npm dependency into your project. Just navigate to your project’s main folder and type: npm install --save react-redux By the time this article was written React Native was still depending on React Redux 3.1.0 since versions above depended on React 0.14, which is not 100 percent compatible with React Native. Because of this, you will need to force version 3.1.0 as the one to be dependent on in your project. Step 2: Set up a Redux-friendly folder structure. Of course, setting up the folder structure for your project is totally up to every developer but you need to take into account that you will need to maintain a number of actions, reducers, and components. Besides, it’s also useful to keep a separate folder for your API and utility functions so these won’t be mixing with your app’s core functionality. Having this in mind, this is my preferred folder structure under the src folder in any React Native project: Step 3: Create your first action. In this article we will be implementing a simple login functionality to illustrate how to integrate Redux inside React Native. A good point to start this implementation is the action, a basic function called from the component whenever we want the whole state of the app to be changed (i.e. changing from the logged out state into the logged in state). To keep this example as concise as possible we won’t be doing any API calls to a backend – only the pure Redux integration will be explained. Our action creator is a simple function returning an object (the action itself) with a type attribute expressing what happened with the app. No business logic should be placed here; our action creators should be really plain and descriptive. Step 4: Create your first reducer. Reducers are the ones in charge of updating the state of the app. Unlike in Flux, Redux only has one store for the whole app, but it will be conveniently name-spaced automatically by Redux once the reducers have been applied. In our example, the user reducer needs to be aware of when the user is logged in. Because of that, it needs to import the LOGIN_SUCCESS constant we defined in our actions before and export a default function, which will be called by Redux every time an action occurs in the app. Redux will automatically pass the current state of the app and the action occurred. It’s up to the reducer to realize if it needs to modify the state or not based on the action.type. That’s why almost every time our reducer will be a function containing a switch statement, which modifies and returns the state based on what action occurred. It’s important to state that Redux works with object references to identify when the state is changed. Because of this, the state should be cloned before any modification. 
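The original article shows the action creator and reducer as code screenshots that are not reproduced here, so the following is a minimal sketch of what they might look like. The LOGIN_SUCCESS constant, the userReducers name, and the loggedIn flag are assumptions for illustration rather than the author's exact code.

// actions/userActions.js
// The action creator is plain and descriptive: no business logic, just a type.
export const LOGIN_SUCCESS = 'LOGIN_SUCCESS';

export function login() {
  return { type: LOGIN_SUCCESS };
}

// reducers/userReducers.js
// The reducer clones the state before modifying it, because Redux compares
// object references to detect that the state has changed.
import { LOGIN_SUCCESS } from '../actions/userActions';

const initialState = { loggedIn: false };

export default function userReducers(state = initialState, action) {
  switch (action.type) {
    case LOGIN_SUCCESS:
      return Object.assign({}, state, { loggedIn: true });
    default:
      return state;
  }
}

If mapStateToProps later exposes this slice as this.props.user, the !this.props.user.loggedIn check mentioned in Step 5 below reads straight off this state.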
It's also interesting to know that the action passed to the reducers can contain other attributes apart from type. For example, when doing a more complex login, the user's first name and last name can be added to the action by the action creator and used by the reducer to update the state of the app.

Step 5: Create your component.

This step is almost pure React Native coding. We need a component to trigger the action and to respond to the change of state in the app. In our case it will be a simple View containing a button that disappears when logged in. This is a normal React Native component except for some pieces of the Redux boilerplate:

- The three import lines at the top will require everything we need from Redux.
- 'mapStateToProps' and 'mapDispatchToProps' are two functions bound with 'connect' to the component: this makes Redux know that this component needs to be passed a piece of the state (everything under 'userReducers') and all the actions available in the app.
- Just by doing this, we will have access to the login action (as it is used in onLoginButtonPress) and to the state of the app (as it is used in the !this.props.user.loggedIn statement).

Step 6: Glue it all from your index.ios.js.

For Redux to apply its magic, some initialization should be done in the main file of your React Native project (index.ios.js). This is pure boilerplate and only done once: Redux needs to inject a store holding the app state into the app. To do so, it requires a 'Provider' wrapping the whole app. This store is basically a combination of reducers. For this article we only need one reducer, but a full app will include many others, and each of them should be passed into the combineReducers function to be taken into account by Redux whenever an action is triggered.

About the Author

Emilio Rodriguez started working as a software engineer for Sun Microsystems in 2006. Since then, he has focused his efforts on building a number of mobile apps with React Native while contributing to the React Native project. These contributions helped him understand how deep and powerful this framework is.

5 developers explain why they use Visual Studio Code [Sponsored by Microsoft]

Richard Gall
22 May 2019
7 min read
Visual Studio Code has quickly become one of the most popular text editors on the planet. While debate will continue to rage about the relative merits of every text editor, it’s nevertheless true that Visual Studio Code is unique in that it is incredibly customizable: it can be as lightweight as a text editor or as feature-rich as an IDE. This post is part of a series brought to you in conjunction with Microsoft. Download Learning Node.js Development for free from Microsoft here. Try Visual Studio Code yourself. Learn more here. This means the range of developers using Visual Studio Code are incredibly diverse. Each one faces a unique set of challenges alongside their personal preferences. I spoke to a few of them about why they use Visual Studio Code and how they make it work for them. “Visual Studio Code is streamlined and flexible” Ben Sibley is the Founder of Complete Themes. He likes Visual Studio Code because it is relatively lightweight while also offering considerable flexibility. “I love how streamlined and flexible Visual Studio Code is. Personally, I don’t need a ton of functionality from my IDE, so I appreciate how simple the default configuration is. There's a very concise set of features built-in like the Git integration. “I was using PHPStorm previously and while it was really feature-rich, it was also overwhelming at times. VSC is faster, lighter, and with the extension market you can pick and choose which additional tools you need. And it’s a popular enough editor that you can usually find a reliable and well-reviewed extension.” Read next: How Visual Studio Code can help bridge the gap between full-stack development and DevOps [Sponsored by Microsoft] “Visual Studio Code is the best in terms of extension ecosystem, language support and configuration” Libby Horacek is a developer at Position Development. She has worked with several different code editors but struggled to find one that allowed her to effectively move between languages. For Libby, Visual Studio Code offered the right level of flexibility. She also explained how the team at Position Development have used VSC’s Live Share feature which allows developers to directly share and collaborate on code inside their editor. “I currently use Visual Studio Code. I’ve tried a LOT of different editors. I’m a polyglot developer, so I need an editor that isn’t just for one language. RubyMine is great for Ruby, and PyCharm is good for Python, but I don’t want to switch editors every time I switch languages (sometimes multiple times a day). My main constraint is Haskell language support — there are plugins for most IDEs now, but some are better than others. “For a long time I used Emacs just because I was able to steal a great configuration setup for it from a coworker, but a few months back it stopped working due to updates and I didn’t want to acquire the Emacs expertise to fix it. So I tried IntelliJ, Visual Studio, Atom, Sublime Text, even Vim… but in the end I liked Visual Studio the best in terms of extension ecosystem, language support, and ease of use and configuration. “My team also uses Visual Studio’s Live Share for pairing. I haven’t tried it personally but it looks like a great option for remote pairing. The only thing my coworkers have cautioned is that they encountered a bug with the “undo” functionality that wiped out most of a file they were working on. 
Maybe that bug has been fixed by now, but as always, commit early and commit often!” “As a JavaScript dev shop, we love that VSC is written in JavaScript” Cody Swann is the CEO of Gunner Technology, a software development company that builds using JavaScript on AWS for both the public and private sector. “All our developers here [at Gunner Technology] use VSC. “We switched from Sublime about two years ago because Sublime started to feel slow and neglected. “Before that, we used TextMate and abandoned that for the same reasons. “As a JavaScript dev shop, we love that VSC is written in JavaScript. It makes it easier for us to write in-house extensions and such. “Additionally, we love that Microsoft releases monthly updates and keeps improving performance.” Read next: Microsoft Build 2019: Microsoft showcases new updates to MS 365 platform with focus on AI and developer productivity “The Visual Studio Code team pay close attention to the problems developers face” Ajeet Dhaliwal is a software developer at Tesults. He explains he has used several different IDEs and editors but came to Visual Studio Code after spending some time using Node.js and React on Brackets. “I have used Visual Studio Code almost exclusively for the last couple of years. “In years prior to making this switch, the nature of the development work that I did meant that I was broadly limited to using specific IDEs such as Visual Studio and Xcode. Then in 2014 I stated to get into Node.js and was looking for a code editor that would be more suitable. I tried out a few and ultimately settled on Brackets. “I used Brackets for a while but wasn’t always happy with it. The most annoying issue was the way text was rendered on my Mac. “Over time I started doing React work too and every time I revisited VSC the improvements were impressive, it seemed to me that the developers were closely paying attention to the problems developers face, they were creating features I had never even thought I would need and the extensions added highly useful features for Node.js and React dev work. The font rendering was not an issue either so it became an inevitable switch.” “I have to context switch regularly - I expect my brain to be the slowest element, not the IDE” Kyle Balnave is Senior Developer and Squad Manager at High Speed Training. Despite working with numerous editors and IDEs, he likes Visual Studio Code because it allows him to move between different contexts incredibly quickly. Put simply, it allows him to work faster than other IDEs do. "I've used several different editors over the years. They generally fall under two categories: Monolithic (I can do anything you'll ever want to do out of the box). Modular (I do the basics but allow extensions to be added to do most the rest). “The former are IDEs like Netbeans, IntelliJ and Visual Studio. In my experience they are slow to load and need a more powerful development machine to keep responsive. They have a huge range of functionality, but in everyday development I just need it to be an intelligent code editor. “The latter are IDEs like Eclipse, Visual Studio Code, Atom. They load quickly, respond fast and have a wide range of extensions that allow me to develop what I need. They sometimes fall short in their functionality, but I generally find this to be infrequent. “Why do I use VSCode? Because it doesn't slow me down when I code. I have to context switch regularly so I expect my own brain to be the slowest element, not the IDE. 
Learn how to develop with Node.js on Azure by downloading Learning Node.js with Azure for free, courtesy of Microsoft.


Google is circumventing GDPR, reveals Brave's investigation for the Authorized Buyers ad business case

Bhagyashree R
06 Sep 2019
6 min read
Last year, Dr. Johnny Ryan, the Chief Policy & Industry Relations Officer at Brave, filed a complaint against Google’s DoubleClick/Authorized Buyers ad business with the Irish Data Protection Commission (DPC). New evidence produced by Brave reveals that Google is circumventing GDPR and also undermining its own data protection measures. Brave calls Google’s Push Pages a GDPR workaround Brave’s new evidence rebuts some of Google’s claims regarding its DoubleClick/Authorized Buyers system, the world’s largest real-time advertising auction house. Google says that it prohibits companies that use its real-time bidding (RTB) ad system “from joining data they receive from the Cookie Matching Service.” In September last year, Google announced that it has removed encrypted cookie IDs and list names from bid requests with buyers in its Authorized Buyers marketplace. Brave’s research, however, found otherwise, “Brave’s new evidence reveals that Google allowed not only one additional party, but many, to match with Google identifiers. The evidence further reveals that Google allowed multiple parties to match their identifiers for the data subject with each other.” When you visit a website that has Google ads embedded on its web pages, Google will run a real-time bidding ad auction to determine which advertiser will get to display its ads. For this, it uses Push Pages, which is the mechanism in question here. Brave hired Zach Edwards, the co-founder of digital analytics startup Victory Medium, and MetaX, a company that audits data supply chains, to investigate and analyze a log of Dr. Ryan’s web browsing. The research revealed that Google's Push Pages can essentially be used as a workaround for user IDs. Google shares a ‘google_push’ identifier with the participating companies to identify a user. Brave says that the problem here is that the identifier that was shared was common to multiple companies. This means that these companies could have cross-referenced what they learned about the user from Google with each other. Used by more than 8.4 million websites, Google's DoubleClick/Authorized Buyers broadcasts personal data of users to 2000+ companies. This data includes the category of what a user is reading, which can reveal their political views, sexual orientation, religious beliefs, as well as their locations. There are also unique ID codes that are specific to a user that can let companies uniquely identify a user. All this information can give these companies a way to keep tabs on what users are “reading, watching, and listening to online.” Brave calls Google’s RTB data protection policies “weak” as they ask these companies to self-regulate. Google does not have much control over what these companies do with the data once broadcast. “Its policy requires only that the thousands of companies that Google shares peoples’ sensitive data with monitor their own compliance, and judge for themselves what they should do,” Brave wrote. A Google spokesperson, as a response to this news, told Forbes, “We do not serve personalised ads or send bid requests to bidders without user consent. The Irish DPC — as Google's lead DPA — and the UK ICO are already looking into real-time bidding in order to assess its compliance with GDPR. We welcome that work and are co-operating in full." 
Users recommend starting an “information campaign” instead of a penalty that will hardly affect the big tech This news triggered a discussion on Hacker News where users talked about the implications of RTB and what strict actions the EU can take to protect user privacy. A user explained, "So, let's say you're an online retailer, and you have Google IDs for your customers. You probably have some useful and sensitive customer information, like names, emails, addresses, and purchase histories. In order to better target your ads, you could participate in one of these exchanges, so that you can use the information you receive to suggest products that are as relevant as possible to each customer. To participate, you send all this sensitive information, along with a Google ID, and receive similar information from other retailers, online services, video games, banks, credit card providers, insurers, mortgage brokers, service providers, and more! And now you know what sort of vehicles your customers drive, how much they make, whether they're married, how many kids they have, which websites they browse, etc. So useful! And not only do you get all these juicy private details, but you've also shared your customers sensitive purchase history with anyone else who is connected to the exchange." Others said that a penalty is not going to deter Google. "The whole penalty system is quite silly. The fines destroy small companies who are the ones struggling to comply, and do little more than offer extremely gentle pokes on the wrist for megacorps that have relatively unlimited resources available for complete compliance, if they actually wanted to comply." Users suggested that the EU should instead start an information campaign. "EU should ignore the fines this time and start an "information campaign" regarding behavior of Google and others. I bet that hurts Google 10 times more." Some also said that not just Google but the RTB participants should also be held responsible. "Because what Google is doing is not dissimilar to how any other RTB participant is acting, saying this is a Google workaround seems disingenuous." With this case, Brave has launched a full-fledged campaign that aims to “reform the multi-billion dollar RTB industry spans sixteen EU countries.” To achieve this goal it has collaborated with several privacy NGOs and academics including the Open Rights Group, Dr. Michael Veale of the Turing Institute, among others. In other news, a Bloomberg report reveals that Google and other internet companies have recently asked for an amendment to the California Consumer Privacy Act, which will be enacted in 2020. The law currently limits how digital advertising companies collect and make money from user data. The amendments proposed include approval for collecting user data for targeted advertising, using the collected data from websites for their own analysis, and many others. Read the Bloomberg report to know more in detail. Other news in Data Facebook content moderators work in filthy, stressful conditions and experience emotional trauma daily, reports The Verge GDPR complaint in EU claim billions of personal data leaked via online advertising bids European Union fined Google 1.49 billion euros for antitrust violations in online advertising  


Detecting & Addressing LLM 'Hallucinations' in Finance

James Bryant, Alok Mukherjee
04 Jan 2024
9 min read
Dive deeper into the world of AI innovation and stay ahead of the AI curve! Subscribe to our AI_Distilled newsletter for the latest insights. Don't miss out – sign up today!This article is an excerpt from the book, The Future of Finance with ChatGPT and Power BI, by James Bryant, Alok Mukherjee. Enhance decision-making, transform your market approach, and find investment opportunities by exploring AI, finance, and data visualization with ChatGPT's analytics and Power BI's visuals.IntroductionLLMs, such as OpenAI’s GPT series, can sometimes generate responses that are referred to as “hallucinations.” These are instances where the output from the model is factually incorrect, it presents information that it could not possibly know (given it doesn’t have access to real-time or personalized data), or it might output something nonsensical or highly improbable.Let’s explore deeper into what hallucinations are, how to identify them, and what steps can be taken to mitigate their impact, especially in a context where accurate and reliable information is crucial, such as financial analysis, trading, or visual data presentations.Understanding hallucinationsLet’s look at some examples:Factual inaccuracies: Suppose an LLM provides information stating that Apple Inc. was founded in 1985. This is a clear factual inaccuracy because Apple was founded in 1976.Speculative statements: If an LLM were to suggest that “As of 2023, Tesla’s share price has hit $3,000,” this is a hallucination. The model doesn’t know real-time data and any post-2021 prediction or speculation it makes about specific stock prices is unfounded.Confident misinformation: For instance, if an LLM confidently states that “Amazon has declared bankruptcy in late 2022,” this is a hallucination and can have serious consequences if it’s acted upon without verification.How can we spot hallucinations?Here are some useful ways to spot hallucinations:Cross-verification: If an LLM suggests an unusual trading strategy, such as shorting a typically stable blue-chip stock based on some supposed insider information, always cross-verify this advice with other reliable sources or consult a financial advisor.Questioning the source: If an LLM claims that “our internal data shows a bullish trend for cryptocurrency X,” this is likely a hallucination. The model doesn’t have access to proprietary internal data.Time awareness: If the model provides information or trends post-September 2021 without the user explicitly asking for a hypothetical or simulated scenario, consider this a red flag. For example, GPT-4 giving specific “real-time” market cap values for companies in 2023 would be a hallucination.What can we do about hallucinations?Here are some ideas:Promote awareness: If you are developing an AI-assisted trading app that uses an LLM, ensure users are aware of potential hallucinations, perhaps with a disclaimer or notification upon usageImplement checks: You might integrate a news API that could help validate major financial events or claims made by the modelMinimizing hallucinations in the futureThere are various ways we can minimize hallucinations. Here are some examples:Training improvements: Imagine developing a better model that understands context and sticks to the known data more closely, avoiding speculative or incorrect financial statements. Future versions of the model could be specifically trained on financial data, news, and reports to understand the context and semantics of financial trading and investment better. 
We could do this to ensure that it understands a short squeeze scenario accurately, or is aware that penny stocks typically come with higher risks.

Better evaluation metrics: For instance, develop a specific metric that calculates the percentage of the model's outputs that were flagged as hallucinations during testing. In the development phase, the models could be evaluated on more focused tasks, such as generating valid trading strategies or predicting the impact of certain macroeconomic events on stock prices. The better the model performs on these tasks, the lower the chance of hallucinations occurring.

Post-processing methods: Develop an algorithm that cross-references model outputs against reliable financial data sources and flags potential inaccuracies. After the model generates a potential trading strategy or investment suggestion, this output could be cross-verified using a rules-based system. For instance, if the model suggests shorting a stock that has consistently performed well without any recent negative news or poor earnings reports, the system might flag this as a potential hallucination.

As an example, you can use libraries such as yfinance or pandas_datareader to access real-time or historical financial data:

!pip install yfinance pandas_datareader

import yfinance as yf

def get_stock_data(ticker, start, end):
    stock = yf.Ticker(ticker)
    data = stock.history(start=start, end=end)
    return data

# Example Usage:
data = get_stock_data("AAPL", "2021-01-01", "2023-01-01")

You could also develop a cross-verification algorithm and compare the model's outputs with the collected financial data to flag potential inaccuracies.

Integration with real-time data: While creating Power BI visualizations, data that's been pulled from the LLM could be cross-verified with real-time data from financial databases or APIs. Any discrepancies, such as inconsistent market share percentages or revenue growth rates, could be flagged. This reduces the risk of presenting hallucinated data in visualizations. Let's look at some examples:

- Extracting real-time data: You can continue to use yfinance or pandas_datareader to extract real-time data.
- Cross-verifying with real-time data: You can compare the model's output with real-time data to identify discrepancies:

def real_time_cross_verify(output, real_time_data):
    # output is a dict with keys 'market_share', 'revenue_growth', and 'ticker';
    # real_time_data holds the same keys for output['ticker'], fetched from a live
    # source (for example, via a helper such as get_real_time_data)
    if abs(output['market_share'] - real_time_data['market_share']) > 0.05 or \
       abs(output['revenue_growth'] - real_time_data['revenue_growth']) > 0.05:
        return True  # Flagged as a potential hallucination
    return False  # Not flagged

# Example Usage:
output = {'market_share': 0.25, 'revenue_growth': 0.08, 'ticker': 'AAPL'}
real_time_data = {'market_share': 0.24, 'revenue_growth': 0.07, 'ticker': 'AAPL'}
flagged = real_time_cross_verify(output, real_time_data)

User feedback loop: A mechanism can be incorporated to allow users to report potential hallucinations. For instance, if a user spots an error in the LLM's output during a Power BI data analysis session, they can report this.
Over time, these reports can be used to further train the model and reduce hallucinations.OpenAI is on the caseTo tackle the chatbot’s missteps, OpenAI engineers are working on ways for its AI models to reward themselves for outputting correct data when moving toward an answer, instead of rewarding themselves only at the point of conclusion. The system could lead to better outcomes as it incorporates more of a human-like chain-of-thought procedure, according to the engineers.These examples should help in illustrating the concept and risks of LLM hallucinations, particularly in high-stakes contexts such as finance. As always, these models should be seen as powerful tools for assistance, but not as a final authority.Trading examplesHallucination scenario: Let’s assume you’ve asked an LLM for a prediction on the future performance of a specific stock, let’s say Tesla. The LLM might generate a response that appears confident and factual, such as “Based on the latest earnings report, Tesla has declared bankruptcy.” If you acted on this hallucinated information, you might rush to sell Tesla shares only to find out that Tesla is not bankrupt at all. This is an example of a potentially disastrous hallucination.Action: Before making any trading decision based on the LLM’s output, always cross-verify the information from a reliable financial news source or the company’s official communications.Power BI visualization examplesHallucination scenario: Suppose you’re using an LLM to generate text descriptions for a Power BI dashboard that tracks the market share of different automakers in the EV market. The LLM might hallucinate and produce a statement such as “Rivian has surpassed Tesla in terms of global EV market share.” This statement might be completely inaccurate as Tesla had a significantly larger market share than Rivian.Action: When using LLMs to generate text descriptions or insights for your Power BI dashboards, it’s crucial to cross-verify any assertions that are made by the model. You can do this by cross-referencing the underlying data in your Power BI dashboard or by referring to reliable external sources of information.To minimize hallucinations in the future, the model can be fine-tuned with a dataset that’s been specifically curated to cover the relevant domain. The use of a structured validation set can help spot and rectify hallucinations during the model training process. Also, employing a robust fact-checking mechanism on the output of the model before acting on its suggestions or insights can help catch and rectify any hallucinations.Remember, while LLMs can provide valuable insights and suggestions, their output should always be used as one of many inputs in your decision-making process, particularly in high-stakes environments such as financial trading and analysis.ConclusionIn the dynamic world of financial analysis and data visualization, the presence of LLM 'hallucinations' poses a challenge. Awareness, verification, and ongoing improvement strategies stand as pillars against these inaccuracies. While LLMs offer invaluable support, their outputs must be scrutinized, verified, and used as one among many tools in decision-making. 
As we navigate this landscape, vigilance, continuous refinement, and a critical eye will fortify our ability to harness the power of LLMs while mitigating the risks they present in high-stakes financial contexts.Author BioJames Bryant, a finance and technology expert, excels at identifying untapped opportunities and leveraging cutting-edge tools to optimize financial processes. With expertise in finance automation, risk management, investments, trading, and banking, he's known for staying ahead of trends and driving innovation in the financial industry. James has built corporate treasuries like Salesforce and transformed companies like Stanford Health Care through digital innovation. He is passionate about sharing his knowledge and empowering others to excel in finance. Outside of work, James enjoys skiing with his family in Lake Tahoe, running half marathons, and exploring new destinations and culinary experiences with his wife and daughter.Aloke Mukherjee is a seasoned technologist with over a decade of experience in business architecture, digital transformation, and solutions architecture. He excels at applying data-driven solutions to real-world problems and has proficiency in data analytics and planning. Aloke worked at EMC Corp and Genentech and currently spearheads the digital transformation of Finance Business Intelligence at Stanford Health Care. In addition to his work, Aloke is a Certified Personal Trainer and is passionate about helping his clients stay fit. Aloke also has a passion for wine and exploring new vineyards. 

ChatGPT for Exploratory Data Analysis (EDA)

Rama Kattunga
08 Sep 2023
9 min read
Introduction

Exploratory data analysis (EDA) refers to the initial investigation of data to discover patterns, identify outliers and anomalies, test hypotheses, and check assumptions with the goal of informing future analysis and model building. It is an iterative, exploratory process of questioning, analyzing, and visualizing data.

Some key aspects of exploratory data analysis include:

- Getting to know the data: examining individual variables, their values, distributions, and relationships between variables.
- Data cleaning: checking and handling missing values, outliers, formatting inconsistencies, etc., before further analysis.
- Univariate analysis: looking at one variable at a time to understand its distribution, central tendency, spread, outliers, etc.
- Bivariate analysis: examining relationships between two variables using graphs, charts, and statistical tests. This helps find correlations.
- Multivariate analysis: analyzing patterns between three or more variables simultaneously using techniques like cluster analysis.
- Hypothesis generation: coming up with potential explanations or hypotheses about relationships in the data based on initial findings.
- Data visualization: creating graphs, plots, and charts to summarize findings and detect patterns and anomalies more easily.

The goal of EDA is to understand the dataset, detect useful patterns, formulate hypotheses, and make decisions on how to prepare/preprocess the data for subsequent modeling and analysis.

Why ChatGPT for EDA?

Exploratory data analysis is an important but often tedious process with challenges and pitfalls. The use of ChatGPT saves hours on repetitive tasks. ChatGPT handles preparatory data wrangling, exploration, and documentation, freeing you to focus on insights. Its capabilities will only grow through continued learning. Soon, it may autonomously profile datasets and propose multiple exploratory avenues. ChatGPT is the perfect on-demand assistant for solo data scientists and teams seeking an effortless boost to the EDA process. The drawback of ChatGPT is that it can only handle small datasets. There are a few workarounds, such as feeding it smaller samples of the data or having it generate Python code to do the necessary analysis.

The following are common challenges and pitfalls during EDA:

- Getting lost in the weeds: spending too much time on minor details without focusing on the big picture. This leads to analysis paralysis.
- Premature conclusions: drawing conclusions without considering all possible factors or testing different hypotheses thoroughly.
- Bias: personal biases, preconceptions, or domain expertise can skew analysis in a particular direction.
- Multiple comparisons: testing many hypotheses without adjusting for Type 1 errors, leading to false discoveries.
- Documentation: failing to properly document methods, assumptions, and thought processes along the way.
- Lack of focus: jumping around randomly without a clear understanding of the business objective.
- Ignoring outliers: not handling outliers appropriately, which can distort analysis and patterns.
- Correlation vs. causation: incorrectly inferring causation based only on observed correlations.
- Overfitting: finding patterns in sample data that may not generalize to new data.
- Publication bias: only focusing on publishable significant or "interesting" findings.
- Multiple roles: wearing both the data analyst and subject expert hats, mixing subjective and objective analysis.
With ChatGPT, you get an AI assistant to be your co-pilot on the journey of discovery. ChatGPT can provide EDA at various stages of your data analysis, within the limits that we discussed earlier. The following prompts cover different stages of data analysis (these prompts either generate the output directly or produce Python code for you to execute separately):

- Summary Statistics: Describe the structure and summary statistics of this dataset. Check for any anomalies in variable distributions or outliers.
- Univariate Analysis: Create histograms and density plots of each numeric variable to visualize their distributions and identify any unusual shapes or concentrations of outliers.
- Bivariate Analysis: Generate a correlation matrix and heatmap to examine relationships between variables. Flag any extremely high correlations that could indicate multicollinearity issues.
- Dimensionality Reduction: Use PCA to reduce the dimensions of this high-dimensional dataset and project it into 2D. Do any clusters or groupings emerge that provide new insights?
- Clustering: Apply K-Means clustering on the standardized dataset with different values of k. Interpret the resulting clusters and check if they reveal any meaningful segments or categories.
- Text Analysis: Summarize the topics and sentiments discussed in this text column using topic modeling algorithms like LDA. Do any dominant themes or opinions stand out?
- Anomaly Detection: Implement an isolation forest algorithm on the dataset to detect outliers independently in each variable. Flag and analyze any suspicious or influential data points.
- Model Prototyping: Quickly prototype different supervised learning algorithms like logistic regression, decision trees, and random forest on this classification dataset. Compare their performance and feature importance.
- Model Evaluation: Generate a correlation matrix between predicted vs. actual values from different models. Any low correlations potentially indicate nonlinear patterns worth exploring further.
- Report Generation: Autogenerate a Jupyter notebook report with key visualizations, findings, concentrations, and recommendations for the next steps based on the exploratory analyses performed.

How do we feed data to ChatGPT for EDA?

Describe your dataset through natural language prompts, and ChatGPT instantly runs analyses to find hidden insights. No need to write code - let the AI do the heavy lifting! For this article, let's use the CSV file available at https://media.githubusercontent.com/media/datablist/sample-csv-files/main/files/organizations/organizations-1000.csv (http://tinyurl.com/mphebj4k).

Here are some examples of prompts that can be used for exploratory data analysis:

- Describe the structure and summary statistics of this CSV file: [Pasted URL or file contents]
- What variable types are in this DataFrame? Import Pandas and show column data types.
- Generate a correlation matrix and heatmap for these variables.
- Check for missing values in each column and calculate the percentage missing.
- Create a histogram to visualize the distribution of this numeric variable.
- Compare the means of this categorical variable across groups using a boxplot.

In the example below, I pasted two hundred rows of the file together with the prompt I gave, and this resulted in the following output. Look at the column details it provided.
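If you want to reproduce that step, a couple of lines of pandas will pull a paste-sized slice of the sample file. This is only a convenience sketch; the 200-row cutoff mirrors the example above but is otherwise an arbitrary choice.

import pandas as pd

# The sample file linked above
url = ("https://media.githubusercontent.com/media/datablist/sample-csv-files/"
       "main/files/organizations/organizations-1000.csv")

# Read only the first 200 rows so the text stays within a reasonable prompt size
sample = pd.read_csv(url, nrows=200)

# Print a CSV string that can be pasted straight into the ChatGPT prompt
print(sample.to_csv(index=False))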
Asking for the list of categorical and numerical variables in table format produces the below:

Asking for a statistical summary of numerical columns would produce the following:

Python Code generation

For certain aspects of EDA, ChatGPT produces Python code with the pandas library imported, and the result will be shown like this:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the CSV file from the URL into a DataFrame
url = "https://media.githubusercontent.com/media/datablist/sample-csv-files/main/files/organizations/organizations-1000.csv"
df = pd.read_csv(url)

# Display basic information about the DataFrame
print("Basic Info About the DataFrame:")
print(df.info())

# Display the first few rows of the DataFrame
print("\nFirst Few Rows of the DataFrame:")
print(df.head())

# Summary statistics of numerical columns
print("\nSummary Statistics of Numerical Columns:")
print(df.describe())

# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())

# Visualize data
# Example: Histogram of a numerical column (replace 'col_name' with the column name you want to plot)
# plt.hist(df['col_name'], bins=20)
# plt.xlabel('X-axis Label')
# plt.ylabel('Y-axis Label')
# plt.title('Histogram of col_name')
# plt.show()

# You can create more visualizations and explore relationships between columns as needed.

# Correlation matrix heatmap (for numerical columns)
correlation_matrix = df.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Matrix Heatmap")
plt.show()

Running this in Spyder (the Anaconda UI) produces the output without a single error.

EDA on large datasets with millions of records

As mentioned earlier in this article, ChatGPT is very powerful for its size, but there are still limitations since it runs on general consumer hardware rather than massive server clusters. Here are a few things to keep in mind regarding its capabilities with large datasets:

- ChatGPT works best for datasets under 50-100 MB in size. It can handle some operations on larger files up to 1 GB, but performance will degrade.
- For initial exploration of very large datasets, ChatGPT is still useful. It can quickly summarize dimensions, types, distributions, outliers, etc., to help shape hypotheses.
- Advanced analytics like complex multi-variable modeling may not be feasible on the largest datasets directly in ChatGPT.
- However, it can help with the data prep: filtering, aggregations, feature engineering, and so on, to reduce a large dataset into a more manageable sample for detailed analysis (a sketch of this is shown after this list).
- Integration with tools that can load large datasets directly (e.g., BigQuery, Spark, Redshift) allows ChatGPT to provide insights on files too big to import wholesale.

As AI capabilities continue advancing, future versions powered by more computing may be able to handle larger files for a broader set of analytics tasks.
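To illustrate the data-prep point above, here is a minimal sketch of how a very large CSV might be boiled down to a compact summary you can paste into a prompt. The file name, chunk size, and the particular statistics are placeholder choices, not values from the article.

import pandas as pd

def profile_large_csv(path, chunk_size=100_000):
    # Stream the file in chunks so it never has to fit in memory at once
    row_count = 0
    numeric_sums = None
    missing_counts = None
    for chunk in pd.read_csv(path, chunksize=chunk_size):
        row_count += len(chunk)
        sums = chunk.select_dtypes("number").sum()
        numeric_sums = sums if numeric_sums is None else numeric_sums.add(sums, fill_value=0)
        missing = chunk.isnull().sum()
        missing_counts = missing if missing_counts is None else missing_counts.add(missing, fill_value=0)
    return row_count, numeric_sums, missing_counts

if __name__ == "__main__":
    rows, sums, missing = profile_large_csv("large_dataset.csv")  # placeholder file name
    print("rows:", rows)
    print("numeric column sums:")
    print(sums)
    print("missing values per column:")
    print(missing)

The summary this produces (row count, numeric column totals, missing-value counts) is tiny compared with the raw file, so the heavy lifting stays on your machine while ChatGPT only sees the aggregate.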
Conclusion

ChatGPT revolutionizes Exploratory Data Analysis (EDA) by streamlining the process and making it accessible to a wider audience. EDA is crucial for understanding data, and ChatGPT automates tasks like generating statistics, visualizations, and even code, simplifying the process. ChatGPT's natural language interface enables users to interact with data using plain language, eliminating the need for extensive coding skills. While it excels in initial exploration and data preparation, it may have limitations with large datasets or complex modeling tasks. ChatGPT is a valuable EDA companion, empowering data professionals to uncover insights and make data-driven decisions efficiently. ChatGPT's role in data analytics is expected to expand as AI technology evolves, offering even more support for data-driven decision-making.

Author Bio

Rama Kattunga has been working with data for over 15 years at tech giants like Microsoft, Intel, and Samsung. As a geek and a business wonk with degrees from Kellogg and two technology degrees from India, Rama uses his engineering know-how and strategy savvy to get stuff done with analytics, AI, and unlocking insights from massive datasets. When he is not analyzing data, you can find Rama sharing his thoughts as an author, speaker, and digital transformation specialist. Moreover, Rama also finds joy in experimenting with cooking, using videos as his guide to create delicious dishes that he can share with others. This diverse range of interests and skills highlights his well-rounded and dynamic character.

LinkedIn


10 tech startups for 2020 that will help the world build more resilient, secure, and observable software

Richard Gall
30 Dec 2019
10 min read
The Datadog IPO in September marked an important moment for the tech industry. This wasn’t just because the company was the fourth tech startup to reach a $10 billion market cap in 2019, but also because it announced something that many people, particularly those in and around Silicon Valley, have been aware of for some time: the most valuable software products in the world aren’t just those that offer speed, and efficiency, they’re those that provide visibility, and security across our software systems. It shouldn’t come as a surprise. As software infrastructure becomes more complex, constantly shifting and changing according to the needs of users and businesses, the ability to assume some degree of control emerges as particularly precious. Indeed, the idea of control and stability might feel at odds with a decade or so that has prized innovation at speed. The mantra ‘move fast and break things’ is arguably one of the defining ones of the last decade. And while that lust for change might never disappear, it’s nevertheless the case that we’re starting to see a mindset shift in how business leaders think about technology. If everyone really is a tech company now, there’s now a growing acceptance that software needs to be treated with more respect and care. The Datadog IPO, then, is just the tip of an iceberg in which monitoring, observability, security, and resiliency tools have started to capture the imagination of technology leaders. While what follows is far from exhaustive, it does underline some of the key players in a growing field. Whether you're an investor or technology decision maker, here are ten tech startups you should watch out for in 2020 from across the cloud and DevOps space. Honeycomb Honeycomb has been at the center of the growing conversation around observability. Designed to help you “own production in hi-res,” what makes it unique in the market is that it allows you to understand and visualize your systems through high-cardinality dimensions (eg. at a user by user level, rather than, say, browser type or continent). The driving force behind Honeycomb is Charity Majors, its co-founder and former CEO. I was lucky enough to speak to her at the start of the year, and it was clear that she has an acute understanding of the challenges facing engineering teams. What was particularly striking in our conversation is how she sees Honeycomb as a tool for empowering developers. It gives them ownership over the code they write and the systems they build. “Ownership gives you the power to fix the thing you know you need to fix and the power to do a good job…” she told me. “People who find ownership is something to be avoided – that’s a terrible sign of a toxic culture.” Honeycomb’s investment status At the time of writing, Honeycomb has received $26.9 million in funding, with $11.4 million series A back in September. Firehydrant “You just got paged. Now what?” That’s the first line that greets you on the FireHydrant website. We think it sums up many of the companies on this list pretty well; many of the best tools in the DevOps space are designed to help tackle the challenges on-call developers face. FireHydrant isn't a tech startup with the profile of Honeycomb. However, as an incident management tool that integrates very neatly into a massive range of workflow tools, we’re likely to see it gain traction in 2020. 
FireHydrant

"You just got paged. Now what?" That's the first line that greets you on the FireHydrant website. We think it sums up many of the companies on this list pretty well; many of the best tools in the DevOps space are designed to help tackle the challenges on-call developers face. FireHydrant isn't a tech startup with the profile of Honeycomb. However, as an incident management tool that integrates very neatly into a massive range of workflow tools, we're likely to see it gain traction in 2020. We particularly like the one-click post mortem feature - it's clear the product has been built in a way that allows developers to focus on the hard stuff and minimize the things that can just suck up time.

FireHydrant's investment status

FireHydrant has raised $1.5 million in seed funding.

NS1

Managing application traffic can be business-critical. That's why NS1 exists; with DNS, DHCP and IP address management capabilities, it's arguably one of the leading tools on the planet for dealing with the diverse and extensive challenges that come with managing massive amounts of traffic across complex interlocking software applications and systems. The company boasts an impressive roster of clients, including DropBox, The Guardian and LinkedIn, which makes it hard to bet against NS1 going from strength to strength in 2020. Like all software adoption, it might take some time to move beyond the realms of the largest and most technically forward-thinking organizations, but it's surely only a matter of time until the importance of smarter, more efficient traffic management becomes clear to even the smallest businesses.

NS1's investment status

NS1 has raised an impressive $78.4 million in funding from investors (although it's important to note that it's one of the oldest companies on this list, founded all the way back in 2013). It received $33 million in series C funding at the beginning of October.

Rookout

"It's time to liberate your data," Rookout implores us. For too long, the startup's argument goes, data has been buried inside our applications where it's useless for developers and engineers. Once it has been freed, it can help inform how we go about debugging and monitoring our systems. Designed to work for modern architectural and deployment patterns such as Kubernetes and serverless, Rookout is a tool that not only brings simplicity in the midst of complexity, it can also save engineering teams a serious amount of time when it comes to debugging and logging - the company claims by as much as 80%. Like FireHydrant, this means engineers can focus on other areas of application performance and resilience.

Rookout's investment status

Back in August, Rookout raised $8 million in Series A funding, taking its total funding to $12.2 million.

LaunchDarkly

Feature flags or toggles are a concept that has started to gain traction in engineering teams in the last couple of years or so. They allow engineering teams to "modify system behavior without changing code" (thank you, Martin Fowler). LaunchDarkly is a platform specifically built to allow engineers to use feature flags. At a fundamental level, the product allows DevOps teams to deploy code (i.e. change features) quickly and with minimal risk. This allows for testing in production and experimentation on a large scale. With support for just about every programming language, it's not surprising to see LaunchDarkly boast a wealth of global enterprises on its list of customers, including IBM and NBC.

LaunchDarkly's investment status

LaunchDarkly raised $44 million in series C funding early in 2019. To date, it has raised $76.3 million. It's certainly one to watch closely in 2020; its ability to help teams walk the delicate line between innovation and instability is well-suited to the reality of engineering today.
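The mechanic behind feature flags is easy to sketch. The snippet below is not LaunchDarkly's SDK; the FLAGS dictionary, the flag names and the percentage rollout are all hypothetical, but they show how behaviour can change without touching or redeploying the calling code.

```python
import hashlib

# Illustrative in-memory flag configuration; in practice this would be
# served and updated by a flag management platform.
FLAGS = {
    "new-checkout-flow": {"enabled": True, "rollout_percent": 10},
    "dark-mode":         {"enabled": False, "rollout_percent": 0},
}

def flag_is_on(flag_name: str, user_id: str) -> bool:
    """Return True if the flag is enabled for this user.

    Users are bucketed deterministically by hashing the user ID, so the
    same user always gets the same answer for a given rollout percentage."""
    flag = FLAGS.get(flag_name)
    if not flag or not flag["enabled"]:
        return False
    bucket = int(hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < flag["rollout_percent"]

# The calling code changes behaviour without any redeploy: flipping the
# flag configuration is enough.
if flag_is_on("new-checkout-flow", user_id="user-48213"):
    print("render new checkout flow")
else:
    print("render existing checkout flow")
```

Deterministic hashing keeps the experience consistent: a given user always lands in the same bucket for a given flag, which is what makes gradual rollouts and experiments trustworthy.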
Gremlin

Gremlin is a chaos engineering platform designed to help engineers 'stress test' their software systems. This is important in today's technology landscape. With system complexity making unpredictability a day-to-day reality, Gremlin lets you identify weaknesses before they impact customers and revenue. Gremlin's mission is to "help build a more reliable internet." That's not just a noble aim, it's an urgent one too. What's more, you can see that the business is really living out its mission. With Gremlin Free launching at the start of 2019, and the second ChaosConf taking place in the fall, it's clear that the company is thinking beyond the core product: it wants to make chaos engineering more accessible to a world where resilience can feel impossible in the face of increasing complexity.

Gremlin's investment status

Since being founded back in 2016 by CTO Matt Fornaciari and CEO Kolton Andrus, Gremlin has raised $26.8 million in funding from Redpoint Ventures, Index Ventures, and Amplify Partners.
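Gremlin ships agents and a control plane for doing this safely and at scale, but the core idea of deliberately injecting failure can be sketched in a few lines of Python. Everything here (the decorator, the failure rate, the fetch_inventory stand-in) is an illustrative assumption rather than anything from Gremlin's product.

```python
import functools
import random
import time

def inject_chaos(failure_rate=0.2, max_delay_s=1.0):
    """Wrap a function so that a fraction of calls fail or slow down,
    letting you observe how the rest of the system copes."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < failure_rate:
                raise ConnectionError(f"chaos: injected failure in {func.__name__}")
            time.sleep(random.uniform(0, max_delay_s))  # injected latency
            return func(*args, **kwargs)
        return wrapper
    return decorator

@inject_chaos(failure_rate=0.3, max_delay_s=0.5)
def fetch_inventory(sku: str) -> int:
    # Stand-in for a call to a downstream service.
    return 42

# A crude experiment: how often does the caller survive the injected faults?
successes = 0
for _ in range(20):
    try:
        fetch_inventory("sku-123")
        successes += 1
    except ConnectionError:
        pass
print(f"{successes}/20 calls succeeded under injected failure")
```

Even a toy experiment like this makes the point: if the caller falls over the moment a dependency misbehaves, it's better to find out in a controlled test than in production.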
Cockroach Labs

Cockroach Labs is the organization behind CockroachDB, the cloud-native distributed SQL database. CockroachDB's popularity comes from two things: its ability to scale from a single instance to thousands, and its impressive resilience. Indeed, its resilience is where it takes its name from. Like a cockroach, CockroachDB is built to keep going even after everything else has burned to the ground. It's been an interesting year for Cockroach Labs and CockroachDB - in June the company changed the CockroachDB core license from the open source Apache license to the Business Source License (BSL), developed by the MariaDB team. The reason for this was ultimately to protect the product as it seeks to grow. The BSL still means the source code is accessible for any use other than for a DBaaS (you'll need an enterprise license for that). A few months later, the company took another step forward in the market with $55 million in series C funding. Both stories were evidence that Cockroach Labs is setting itself up for a big 2020. Although the database market will always be seriously competitive, with resilience as a core USP it's hard to bet against Cockroach Labs. Cockroaches find a way, right?

Cockroach Labs' investment status

Cockroach Labs' total investment, following on from that impressive round of series C funding, is now $108.5 million.

Logz.io

Logz.io is another platform in the observability space that you really need to watch out for in 2020. Built on the ELK stack (Elasticsearch, Logstash, and Kibana), what makes Logz.io really stand out is its use of machine learning to help identify issues across thousands and thousands of logs. Logz.io has been on 'ones to watch' lists for a number of years now. This was, we think, largely down to the rising wave of AI hype. And while we wouldn't want to underplay its machine learning capabilities, it's perhaps with the increasing awareness of the need for more observable software systems that we'll see it really pack a punch across the tech industry.

Logz.io's investment status

To date, Logz.io has raised $98.9 million.

FaunaDB

Fauna is the organization behind FaunaDB. It describes itself as "a global serverless database that gives you ubiquitous, low latency access to app data, without sacrificing data correctness and scale." The database could be big in 2020. With serverless likely to go from strength to strength, and JAMstack increasingly dominant as an approach for web developers, everything the Fauna team has been doing looks as though it will be a great fit for the shape of the engineering landscape in the future.

Fauna's investment status

In total, Fauna has raised $32.6 million in funding from investors.

Clubhouse

One thing that gets overlooked when talking about DevOps and other issues in software development processes is simple project management. That's why Clubhouse is such a welcome entry on this list. Of course, there is a massive range of project management tools available at the moment. But one of the reasons Clubhouse is such an interesting product is that it's very deliberately built with engineers in mind. And more importantly, it appears it's been built with an acute sense of the importance of enjoyment in a project management product.

Clubhouse's investment status

Clubhouse has, to date, raised $16 million. As we see a continuing emphasis on developer experience, the tool is definitely one to watch in a tough marketplace.

Conclusion: embrace the unpredictable

The tech industry feels as unpredictable as the software systems we're building and managing. But while there will undoubtedly be some surprises in 2020, the need for greater security and resilience is a theme that no one should overlook. Similarly, the need to gain more transparency and build for observability is critical. Whether you're an investor, business leader, or even an engineer, then, exploring the products that are shaping and defining the space is vital.