Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Events
Videos
Audiobooks
Packt Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds

How-To Tutorials

7019 Articles
article-image-ordering-buildings-flash-virtual-worlds
Packt
16 Aug 2010
2 min read
Save for later

Ordering the Buildings in Flash Virtual Worlds

Packt
16 Aug 2010
2 min read
(For more resources on Flash, see here.) Ordering the buildings The buildings are not well placed on the map. They overlap with each other in a very strange way. That is because we are now viewing the 3D isometric world in 2D screen with wrong ordering. When we view the 3D perspective in the following way, the closer objects should block the view of the objects behind. The buildings in the preceding image do not obey this real-world rule and cause some strange overlapping. We are going to solve this problem in the next section. Ordering the movie clips in Flash In Flash, every movie clip has its own depth. The depth is called z-order or the z-index of the movie clip. A movie clip with bigger z-order number is higher and covers the others with lower z-order when overlapping. By swapping their z-order, we can rearrange the movie clips on how they overlap and create the correct ordering of isometric buildings. Determining an object's location and view According to our tile-based isometric map, the object that locates in larger number of the x and y axis is in front of the object that locates in smaller number of the x and y axis. We can thus compare the isometric x and y coordinate to determine which object is in front. There is a special case when all shapes of the buildings occupy square shapes. In this situation, the order of the movie clip's z order can be easily determined by comparing their y position.
Read more
  • 0
  • 0
  • 5437

article-image-introduction-raspberry-pi-zero-w-wirelessv
Packt
01 Mar 2018
13 min read
Save for later

Introduction to Raspberry Pi Zero W Wirelessv

Packt
01 Mar 2018
13 min read
In this article by Vasilis Tzivaras, the author of the book Raspberry Pi Zero W Wireless Projects, we will be covering the following topics: An overview of the Raspberry Pi family An overview of the Raspberry Pi family Common issues Raspberry Pi Zero W is the new product of the Raspberry Pi Zero family. In early 2017, Raspberry Pi community has announced a new board with wireless extension. It offers wireless functionality and now everyone can develop his own projects without cables and other components. Comparing the new board with Raspberry Pi 3 Model B we can easily see that it is quite smaller with many possibilities over the Internet of Things. But what is a Raspberry Pi Zero W and why do you need it? Let' s go though the rest of the family and introduce the new board. In the following article we will cover the following topics: Raspberry Pi family As said earlier Raspberry Pi Zero W is the new member of Raspberry Pi family boards. All these years Raspberry Pi are evolving and become more user friendly with endless possibilities. Let's have a short look at the rest of the family so we can understand the difference of the Pi Zero board. Right now, the heavy board is named Raspberry Pi 3 Model B. It is the best solution for projects such as face recognition, video tracking, gaming or anything else that is demanding: RASPBERRY PI 3 MODEL B Insert Image B07972_01_15.png It is the 3rd generation of Raspberry Pi boards after Raspberry Pi 2 and has the following specs: 1.2GHz 64-bit quad-core ARMv8 CPU 802.11n Wireless LAN Bluetooth 4.1 Bluetooth Low Energy (BLE) Like the Pi 2, it also has 1GB RAM 4 USB ports 40 GPIO pins Full HDMI port Ethernet port Combined 3.5 mm audio jack and composite video Camera interface (CSI) Display interface (DSI) Micro SD card slot (now push-pull rather than push-push) </li><li> VideoCore IV 3D graphics core</li></ol> The next board is Raspberry Pi Zero, in which the Zero W was based. A small low cost and power board able to do many things:Raspberry Pi Zero Insert Image B07972_01_18.png The specs of this board can be found as follows: <ol><li>1GHz, Single-core CPU </li><li> 512MB RAM </li><li> Mini-HDMI port </li><li>Micro-USB OTG port </li><li>Micro-USB power </li><li>HAT-compatible 40-pin header </li><li>Composite video and reset headers </li><li> CSI camera connector (v1.3 only) At this point we should not forget to mention that apart from the boards mentioned earlier there are several other modules and components such as the Sense Hat or Raspberry Pi Touch Display available which will work great for advance projects. The 7″ Touchscreen Monitor for Raspberry Pi gives users the ability to create all-in-one, integrated projects such as tablets, infotainment systems and embedded projects: RASPBERRY PI Touch Display Insert Image B07972_01_16.png Where Sense HAT is an add-on board for Raspberry Pi, made especially for the Astro Pi mission. The Sense HAT has an 8×8 RGB LED matrix, a five-button joystick and includes the following sensors: <ol><li> Gyroscope </li><li> Accelerometer </li><li> Magnetometer </li><li>Temperature</li><li> Barometric pressure</li><li> Humidity Sense HAT Insert Image B07972_01_17.png Stay tuned with more new boards and modules at the official website: https://www.raspberrypi.org/ Raspberry Pi Zero W Raspberry Pi Zero W is a small device that has the possibilities to be connected either on an external monitor or TV and of course it is connected to the internet. The operating system varies as there are many distros in the official page and almost everyone is baled on Linux systems. Raspberry Pi Zero W Insert Image B07972_01_01.png With Raspberry Pi Zero W you have the ability to do almost everything, from automation to gaming! It is a small computer that allows you easily program with the help of the GPIO pins and some other components such as a camera. Its possibilities are endless! Specifications If you have bought Raspberry PI 3 Model B you would be familiar with Cypress CYW43438 wireless chip. It provides 802.11 n wireless LAN and Bluetooth 4.0 connectivity. The new Raspberry Pi Zero W is equipped with that wireless chip as well. Following you can find the specifications of the new board Dimensions: 65mm × 30mm × 5mm SoC: Broadcom BCM 2835 chip ARM11 at 1 GHz, single core CPU 512ΜΒ RAM Storage: MicroSD card Video and Audio: 1080P HD video and stereo audio via mini-HDMI connector </li><li> Power: 5V, supplied via micro USB connector </li><li> Wireless: 2.4GHz 802.11 n wireless LAN Bluetooth: Bluetooth classic 4.1 and Bluetooth Low Energy (BLE) Output: Micro USB GPIO: 40-pin GPIO, unpopulated Raspberry Pi Zero W Insert Image B07972_01_03.png Notice that all the components are on the top side of the board so you can easily choose your case without any problems and keep it safe. As far as the antenna concern, it is formed by etching away copper on each layer of the PCB. It may not be visible as it is in other similar boards but it is working great and offers quite a lot functionalities: Raspberry Pi Zero W Capacitors Insert Image B07972_01_04.png Also, the product is limited to only one piece per buyer and costs 10$. You can buy a full kit with microsd card, a case and some more extra components for about 45$ or choose the camera full kit which contains a small camera component for 55$. Camera support Image processing projects such as video tracking or face recognition require a camera. Following you can see the official camera support of Raspberry Pi Zero W. The camera can easily be mounted at the side of the board using a cable like the Raspberry Pi 3 Model B board: The official Camera support of Raspberry Pi Zero W Insert Image B07972_01_05.png Depending on your distribution you many need to enable the camera though command line. More information about the usage of this module will be mentioned at the project. Accessories Well building projects with the new board there are some other gadgets that you might find useful working with. Following there is list of some crucial components. Notice that if you buy Raspberry Pi Zero W kit, it includes some of them. So, be careful and don't double buy them: OTG cable powerHUB GPIO header microSD card and card adapter HDMI to miniHDMI cable HDMI to VGA cable Distributions The official site https://www.raspberrypi.org/downloads/ contains several distributions for downloading. The two basic operating systems that we will analyze after are RASPBIAN and NOOBS. Following you can see how the desktop environment looks like. Both RASPBIAN and NOOBS allows you to choose from two versions. There is the full version of the operating system and the lite one. Obviously the lite version does not contain everything that you might use so if you tend to use your Raspberry with a desktop environment choose and download the full version. On the other side if you tend to just ssh and do some basic stuff pick the lite one. It' s really up to you and of course you can easily download again anything you like and re-write your microSD card: Insert Image B07972_01_14.png NOOBS distribution Download NOOBS: https://www.raspberrypi.org/downloads/noobs/. NOOBS distribution is for the new users with not so much knowledge in linux systems and Raspberry PI boards. As the official page says it is really "New Out Of the Box Software". There is also pre-installed NOOBS SD cards that you can purchase from many retailers, such as Pimoroni, Adafruit, and The Pi Hut, and of course you can download NOOBS and write your own microSD card. If you are having trouble with the specific distribution take a look at the following links: Full guide at https://www.raspberrypi.org/learning/software-guide/. View the video at https://www.raspberrypi.org/help/videos/#noobs-setup. NOOBS operating system contains Raspbian and it provides various of other operating systems available to download. RASPBIAN distribution Download RASPBIAN: https://www.raspberrypi.org/downloads/raspbian/. Raspbian is the official supported operating system. It can be installed though NOOBS or be downloading the image file at the following link and going through the guide of the official website. Image file: https://www.raspberrypi.org/documentation/installation/installing-images/README.md. It has pre-installed plenty of software such as Python, Scratch, Sonic Pi, Java, Mathematica, and more! Furthermore, more distributions like Ubuntu MATE, Windows 10 IOT Core or Weather Station are meant to be installed for more specific projects like Internet of Things (IoT) or weather stations. To conclude with, the right distribution to install actually depends on your project and your expertise in Linux systems administration. Raspberry Pi Zero W needs an microSD card for hosting any operating system. You are able to write Raspbian, Noobs, Ubuntu MATE, or any other operating system you like. So, all that you need to do is simple write your operating system to that microSD card. First of all you have to download the image file from https://www.raspberrypi.org/downloads/ which, usually comes as a .zip file. Once downloaded, unzip the zip file, the full image is about 4.5 Gigabytes. Depending on your operating system you have to use different programs 7-Zip for Windows The Unarchiver for Mac Unzip for Linux Now we are ready to write the image in the MicroSD card. You can easily write the .img file in the microSD card by following one of the next guides according to your system. For Linux users dd tool is recommended. Before connecting your microSD card with your adaptor in your computer run the following command: df -h Now connect your card and run the same command again. You must see some new records. For example if the new device is called /dev/sdd1 keep in your mind that the card is at /dev/sdd (without the 1). The next step is to use the dd command and copy the image to the microSD card. We can do this by the following command: dd if=<path to your image> of=</dev/***> Where if is the input file (image file or the distribution) and of is the output file (microSD card). Again be careful here and use only /dev/sdd or whatever is yours without any numbers. If you are having trouble with that please use the full manual at the following link https://www.raspberrypi.org/documentation/installation/installing-images/linux.md. A good tool that could help you out for that job is GParted. If it is not installed on your system you can easily install it with the following command: sudo apt-get install gparted Then run sudo gparted to start the tool. Its handles partitions very easily and you can format, delete or find information about all your mounted partitions. More information about dd can be found here: https://www.raspberrypi.org/documentation/installation/installing-images/linux.md For Mac OS users dd tool is always recommended: https://www.raspberrypi.org/documentation/installation/installing-images/mac.md For Windows users Win32DiskImager utility is recommended: https://www.raspberrypi.org/documentation/installation/installing-images/windows.md There are several other ways to write an image file in a microSD card. So, if you are against any kind of problems when following the guides above feel free to use any other guide available on the Internet. Now, assuming that everything is ok and the image is ready. You can now gently plugin the microcard to your Raspberry PI Zero W board. Remember that you can always confirm that your download was successful with the sha1 code. In Linux systems you can use sha1sum followed by the file name (the image) and print the sha1 code that should and must be the same as it is at the end of the official page where you downloaded the image. Common issues Sometimes, working with Raspberry Pi boards can lead to issues. We all have faced some of them and hope to never face them again. The Pi Zero is so minimal and it can be tough to tell if it is working or not. Since, there is no LED on the board, sometimes a quick check if it is working properly or something went wrong is handy. Debugging steps With the following steps you will probably find its status: 1. Take your board, with nothing in any slot or socket. Remove even the microSD card! 2. Take a normal micro-USB to USB-ADATA SYNC cable and connect the one side to your computer and the other side to the Pi's USB, (not the PWR_IN). 3. If the Zero is alive: • On Windows the PC will go ding for the presence of new hardware and you should see BCM2708 Boot in Device Manager. • On Linux, with a ID 0a5c:2763 Broadcom Corp message from dmesg. Try to run dmesg in a Terminal before your plugin the USB and after that. You will find a new record there. Output example: [226314.048026] usb 4-2: new full-speed USB device number 82 using uhci_hcd [226314.213273] usb 4-2: New USB device found, idVendor=0a5c, idProduct=2763 [226314.213280] usb 4-2: New USB device strings: Mfr=1, Product=2, SerialNumber=0 [226314.213284] usb 4-2: Product: BCM2708 Boot [226314.213] usb 4-2: Manufacturer: Broadcom If you see any of the preceding, so far so good, you know the Zero's not dead. microSD card issue Remember that if you boot your Raspberry and there is nothing working, you may have burned your microSD card wrong. This means that your card many not contain any boot partition as it should and it is not able to boot the first files. That problem occurs when the distribution is burned to /dev/sdd1 and not to /dev/sdd as we should. This is a quite common mistake and there will be no errors in your monitor. It will just not work! Case protection Raspberry Pi boards are electronics and we never place electronics in metallic surfaces or near magnetic objects. It will affect the booting operation of the Raspberry and it will probably not work. So a tip of advice, spend some extra money for the Raspberry PI Case and protect your board from anything like that. There are many problems and issues when hanging your raspberry pi using tacks. It may be silly, but there are many that do that. Summary Raspberry Pi Zero W is a new promising board allowing everyone to connect their devices to the Internet and use their skills to develop projects including software and hardware. This board is the new toy of any engineer interested in Internet of Things, security, automation and more! We have gone through an introduction in the new Raspberry Pi Zero board and the rest of its family and a brief analysis on some extra components that you should buy as well.
Read more
  • 0
  • 0
  • 5436

article-image-setting-openvpn-x509-certificates
Packt
23 Oct 2009
6 min read
Save for later

Setting Up OpenVPN with X509 Certificates

Packt
23 Oct 2009
6 min read
Creating Certificates One method could be setting up tunnels using pre-shared keys with static encryption, however, X509 certificates provide a much better level of security than pre-shared keys do. There is, however, slightly more work to be done to set up and connect two systems with certificate-based authentication. The following five steps have to be accomplished: Create a CA certificate for your CA with which we will sign and revoke client certificates.      Create a key and a certificate request for the clients.      Sign the request using the CA certificate and thereby making it valid.      Provide keys and certificates to the VPN partners.      Change the OpenVPN configuration so that OpenVPN will use the certificates and keys, and restart OpenVPN. There are a number of ways to accomplish these steps. easy-rsa is a command-line tool that comes with OpenVPN, and exists both on Linux and Windows. On Windows systems you could create certificates by clicking on the batch files in the Windows Explorer, but starting the batch files at the command-line prompt should be the better solution. On Linux you type the full path of the scripts, which share the same name as on Windows, simply without the extension .bat. Certificate Generation on Windows XP with easy-rsa Open the Windows Explorer and change to the directory C:Program Files OpenVPNeasy-rsa. The Windows version of easy-rsa consists of thirteen files. On Linux systems you will have to check your package management tools to find the right path to the easy-rsa scripts. On Debian Linux you will find them in /usr/share/doc/openvpn/examples/easy-rsa/. You find there are eight batch files, four configuration files, and a README (which is actually not really helpful). However, we must now create a directory called keys, copy the files serial.start and index.txt.start into it, and rename them to serial and index.txt respectively. The keys and certificates created by easy-rsa will be stored in this directory. These files are used as a database for certificate generation. Now we let easy-rsa prepare the standard configuration for our certificates. Double-click on the file C:Program FilesOpenVPNeasy-rsainit-config.bat or start this batch file at a command-line prompt. It simply copies the template files vars.bat.sample to vars.bat and openssl.cnf.sample to openvpn.ssl. While the file openssl is a standard OpenSSL configuration, the file vars.bat contains variables used by OpenVPN's scripts to create our certificates, and needs some editing in the next step. Setting Variables—Editing vars.bat Right-click on the vars.bat file's icon and select from the menu. In this file, several parameters are set that are used by the certificate generation scripts later. The following table gives a quick overview of the entries in the file: Entry in vars.bat Function set HOME=%ProgramFiles%OpenVPN easy-rsa The path to the directory where easy-rsa resides. set KEY_CONFIG=openssl.cnf The name of the OpenSSL configuration file. set KEY_DIR=keys The path to the directory where the newly generated keys are stored-relative to $HOME as set above. set KEY_SIZE=1024 The length of the SSL key. This parameter should be increased to 2048. set KEY_COUNTRY=US set KEY_PROVINCE=CA set KEY_CITY=SanFrancisco set KEY_ORG=FortFunston set KEY_EMAIL=mail@host.domain These five values are used as suggestions whenever you start a script and generate certificates with the easy-rsa software. Only the entry KEY_SIZE must be changed (unless you don't care much about security), but setting the last five entries to your needs might be very helpful later. Every time we generate a certificate, easy-rsa will ask (among others) for these five parameters, and give a suggestion that could be accepted simply by pressing Enter. The better the default values set here in vars.bat fit our needs, the less typing work we will have later. I leave it up to you to change these settings here. The next step is easy. Run vars.bat to set the variables. Even though you could simply double-click on its explorer icon, I recommend that you run it in a shell window. Select the entry Run from Windows' main menu, type cmd.exe, and change to the easy-rsa directory by typing cd "C:Program FilesOpenVPNeasy-rsa" and pressing Enter. By doing so, we will proceed in exactly the same way as we would do on a Linux system (except for the .bat extensions). Creating the Diffie-Hellman Key Now it is time to create the keys that will be used for encryption, authentication, and key exchange. For the latter, a Diffie-Hellman key is used by OpenVPN. The Diffie-Hellman key agreement protocol enables two communication partners to exchange a secret key safely. No prior secrets or safe lines are needed; a special mathematical algorithm guarantees that only the two partners know the used shared key. If you would like to know exactly what this algebra is about, have a look at this website: http://www.rsasecurity.com/rsalabs/node.asp?id=2248. easy-rsa provides a script (batch) file that generates the key for you: C:Program FilesOpenVPNeasy-rsabuild-dh.bat. Start it by typing build-dh.bat. A Diffie-Hellman key is being generated. The batch file tells you, This is going to take a long time, which is only true if your system is really old or if you are not patient enough. However, on modern systems some minutes may be a time span horribly long! Building the Certificate Authority OK, now it's time to generate our first CA. Enter build-ca.bat. This script generates a self-signed certificate for a CA. Such a certificate can be used to create and sign client certificates and thereby authenticate other machines. Depending on the data you entered in your vars.bat file, build-ca.bat will suggest different default parameters during the process of generating this certificate. Five of the last seven lines are taken from the variables set in vars.bat. If you edited these parameters, a simple return will do here and the certificate for the CA is generated in the keys directory.
Read more
  • 0
  • 0
  • 5430

article-image-when-do-we-use-r-over-python
Packt
10 Jul 2017
16 min read
Save for later

When do we use R over Python?

Packt
10 Jul 2017
16 min read
In the article Prabhanjan Tattar, author of book Practical Data Science Cookbook - Second Edition, explainsPython is an interpreted language (sometimes referred to as a scripting language), much like R. It requires no special IDE or software compilation tools and is therefore as fast as R to develop with and prototype. Like R, it also makes use of C shared objects to improve computational performance. Additionally, Python is a default system tool on Linux, Unix, and Mac OS X machines and is available on Windows. Python comes with batteries included, which means that the standard library is widely inclusive of many modules, from multiprocessing to compression toolsets. Python is a flexible computing powerhouse that can tackle any problem domain. If you find yourself in need of libraries that are outside of the standard library, Python also comes with a package manager (like R) that allows the download and installation of other code bases. (For more resources related to this topic, see here.) Python’s computational flexibility means that some analytical tasks take more lines of code than their counterpart in R. However, Python does have the tools that allow it to perform the same statistical computing. This leads to an obvious question: When do we use R over Python and vice versa? This article attempts to answer this question by taking an application-oriented approach to statistical analyses. From books to movies to people to follow on Twitter, recommender systems carve the deluge of information on the Internet into a more personalized flow, thus improving the performance of e-commerce, web, and social applications. It is no great surprise, given the success of Amazon-monetizing recommendations and the Netflix Prize, that any discussion of personalization or data-theoretic prediction would involve a recommender. What is surprising is how simple recommenders are to implement yet how susceptible they are to vagaries of sparse data and overfitting. Consider a non-algorithmic approach to eliciting recommendations; one of the easiest ways to garner a recommendation is to look at the preferences of someone we trust. We are implicitly comparing our preferences to theirs, and the more similarities you share, the more likely you are to discover novel, shared preferences. However, everyone is unique, and our preferences exist across a variety of categories and domains. What if you could leverage the preferences of a great number of people and not just those you trust? In the aggregate, you would be able to see patterns, not just of people like you, but also anti-recommendations— things to stay away from, cautioned by the people not like you. You would, hopefully, also see subtle delineations across the shared preference space of groups of people who share parts of your own unique experience. Understanding the data Understanding your data is critical to all data-related work. In this recipe, we acquire and take a first look at the data that we will be using to build our recommendation engine. Getting ready To prepare for this recipe, and the rest of the article, download the MovieLens data from the GroupLens website of the University of Minnesota. You can find the data at http://grouplens.org/datasets/movielens/. In this recipe, we will use the smaller MoveLens 100k dataset (4.7 MB in size) in order to load the entire model into the memory with ease. How to do it… Perform the following steps to better understand the data that we will be working with throughout: Download the data from http://grouplens.org/datasets/movielens/.The 100K dataset is the one that you want (ml-100k.zip): Unzip the downloaded data into the directory of your choice. The two files that we are mainly concerned with are u.data, which contains the user movie ratings, and u.item, which contains movie information and details. To get a sense of each file, use the head command at the command prompt for Mac and Linux or the more command for Windows: head -n 5 u.item Note that if you are working on a computer running the Microsoft Windows operating system and not using a virtual machine (not recommended), you do not have access to the head command; instead, use the following command: moreu.item 2 n The preceding command gives you the following output: 1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0 2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0 3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0 4|Get Shorty (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995)|0|1|0|0|0|1|0|0|1|0|0|0|0|0|0|0|0|0|0 5|Copycat (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Copycat%20(1995)|0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|1|0|0 The following command will produce the given output: head -n 5 u.data For Windows, you can use the following command: moreu.item 2 n 196 242 3 881250949 186 302 3 891717742 22 377 1 878887116 244 51 2 880606923 166 346 1 886397596 How it works… The two main files that we will be using are as follows: u.data: This contains the user moving ratings u.item: This contains the movie information and other details Both are character-delimited files; u.data, which is the main file, is tab delimited, and u.item is pipe delimited. For u.data, the first column is the user ID, the second column is the movie ID, the third is the star rating, and the last is the timestamp. The u.item file contains much more information, including the ID, title, release date, and even a URL to IMDB. Interestingly, this file also has a Boolean array indicating the genre(s) of each movie, including (in order) action, adventure, animation, children, comedy, crime, documentary, drama, fantasy, film-noir, horror, musical, mystery, romance, sci-fi, thriller, war, and western. There’s more… Free, web-scale datasets that are appropriate for building recommendation engines are few and far between. As a result, the movie lens dataset is a very popular choice for such a task but there are others as well. The well-known Netflix Prize dataset has been pulled down by Netflix. However, there is a dump of all user-contributed content from the Stack Exchange network (including Stack Overflow) available via the Internet Archive (https://archive.org/details/stackexchange). Additionally, there is a book-crossing dataset that contains over a million ratings of about a quarter million different books (http://www2.informatik.uni-freiburg.de/~cziegler/BX/). Ingesting the movie review data Recommendation engines require large amounts of training data in order to do a good job, which is why they’re often relegated to big data projects. However, to build a recommendation engine, we must first get the required data into memory and, due to the size of the data, must do so in a memory-safe and efficient way. Luckily, Python has all of the tools to get the job done, and this recipe shows you how. Getting ready You will need to have the appropriate movie lens dataset downloaded, as specified in the preceding recipe. If you skipped the setup in you will need to go back and ensure that you have NumPy correctly installed. How to do it… The following steps guide you through the creation of the functions that we will need in order to load the datasets into the memory: Open your favorite Python editor or IDE. There is a lot of code, so it should be far simpler to enter directly into a text file than Read-Eval-Print Loop (REPL). We create a function to import the movie reviews: In [1]: import csv ...: import datetime In [2]: defload_reviews(path, **kwargs): ...: “““ ...: Loads MovieLens reviews ...: “““ ...: options = { ...: ‘fieldnames’: (‘userid’, ‘movieid’, ‘rating’, ‘timestamp’), ...: ‘delimiter’: ‘t’, ...: } ...: options.update(kwargs) ...: ...: parse_date = lambda r,k: datetime.fromtimestamp(float(r[k])) ...: parse_int = lambda r,k: int(r[k]) ...: ...: with open(path, ‘rb’) as reviews: ...: reader = csv.DictReader(reviews, **options) ...: for row in reader: ...: row[‘movieid’] = parse_int(row, ‘movieid’) ...: row[‘userid’] = parse_int(row, ‘userid’) ...: row[‘rating’] = parse_int(row, ‘rating’) ...: row[‘timestamp’] = parse_date(row, ‘timestamp’) ...: yield row We create a helper function to help import the data: In [3]: import os ...: defrelative_path(path): ...: “““ ...: Returns a path relative from this code file ...: “““ ...: dirname = os.path.dirname(os.path.realpath(‘__file__’)) ...: path = os.path.join(dirname, path) ...: return os.path.normpath(path)   We create another function to load the movie information: In [4]: defload_movies(path, **kwargs): ...: ...: options = { ...: ‘fieldnames’: (‘movieid’, ‘title’, ‘release’, ‘video’, ‘url’), ...: ‘delimiter’: ‘|’, ...: ‘restkey’: ‘genre’, ...: } ...: options.update(kwargs) ...: ...: parse_int = lambda r,k: int(r[k]) ...: parse_date = lambda r,k: datetime.strptime(r[k], ‘%d-%b-%Y’) if r[k] else None ...: ...: with open(path, ‘rb’) as movies: ...: reader = csv.DictReader(movies, **options) ...: for row in reader: ...: row[‘movieid’] = parse_int(row, ‘movieid’) ...: row[‘release’] = parse_date(row, ‘release’) ...: row[‘video’] = parse_date(row, ‘video’) ...: yield row Finally, we start creating a MovieLens class that will be augmented later : In [5]: from collections import defaultdict In [6]: class MovieLens(object): ...: “““ ...: Data structure to build our recommender model on. ...: “““ ...: ...: def __init__(self, udata, uitem): ...: “““ ...: Instantiate with a path to u.data and u.item ...: “““ ...: self.udata = udata ...: self.uitem = uitem ...: self.movies = {} ...: self.reviews = defaultdict(dict) ...: self.load_dataset() ...: ...: defload_dataset(self): ...: “““ ...: Loads the two datasets into memory, indexed on the ID. ...: “““ ...: for movie in load_movies(self.uitem): ...: self.movies[movie[‘movieid’]] = movie ...: ...: for review in load_reviews(self.udata): ...: self.reviews[review[‘userid’]][review[‘movieid’]] = review Ensure that the functions have been imported into your REPL or the IPython workspace, and type the following, making sure that the path to the data files is appropriate for your system: In [7]: data = relative_path(‘../data/ml-100k/u.data’) ...: item = relative_path(‘../data/ml-100k/u.item’) ...: model = MovieLens(data, item) How it works… The methodology that we use for the two data-loading functions (load_reviews and load_movies) is simple, but it takes care of the details of parsing the data from the disk. We created a function that takes a path to our dataset and then any optional keywords. We know that we have specific ways in which we need to interact with the csv module, so we create default options, passing in the field names of the rows along with the delimiter, which is t. The options.update(kwargs) line means that we’ll accept whatever users pass to this function. We then created internal parsing functions using a lambda function in Python. These simple parsers take a row and a key as input and return the converted input. This is an example of using lambda as internal, reusable code blocks and is a common technique in Python. Finally, we open our file and create a csv.DictReader function with our options. Iterating through the rows in the reader, we parse the fields that we want to be int and datetime, respectively, and then yield the row. Note that as we are unsure about the actual size of the input file, we are doing this in a memory-safe manner using Python generators. Using yield instead of return ensures that Python creates a generator under the hood and does not load the entire dataset into the memory. We’ll use each of these methodologies to load the datasets at various times through our computation that uses this dataset. We’ll need to know where these files are at all times, which can be a pain, especially in larger code bases; in the There’s more… section, we’ll discuss a Python pro-tip to alleviate this concern. Finally, we created a data structure, which is the MovieLens class, with which we can hold our reviews’ data. This structure takes the udata and uitem paths, and then, it loads the movies and reviews into two Python dictionaries that are indexed by movieid and userid, respectively. To instantiate this object, you will execute something as follows: In [7]: data = relative_path(‘../data/ml-100k/u.data’) ...: item = relative_path(‘../data/ml-100k/u.item’) ...: model = MovieLens(data, item) Note that the preceding commands assume that you have your data in a folder called data. We can now load the whole dataset into the memory, indexed on the various IDs specified in the dataset. Did you notice the use of the relative_path function? When dealing with fixtures such as these to build models, the data is often included with the code. When you specify a path in Python, such as data/ml-100k/u.data, it looks it up relative to the current working directory where you ran the script. To help ease this trouble, you can specify the paths that are relative to the code itself: importos defrelative_path(path): “““ Returns a path relative from this code file “““ dirname = os.path.dirname(os.path.realpath(‘__file__’)) path = os.path.join(dirname, path) returnos.path.normpath(path) Keep in mind that this holds the entire data structure in memory; in the case of the 100k dataset, this will require 54.1 MB, which isn’t too bad for modern machines. However, we should also keep in mind that we’ll generally build recommenders using far more than just 100,000 reviews. This is why we have configured the data structure the way we have—very similar to a database. To grow the system, you will replace the reviews and movies properties with database access functions or properties, which will yield data types expected by our methods. Finding the highest-scoring movies If you’re looking for a good movie, you’ll often want to see the most popular or best rated movies overall. Initially, we’ll take a naïve approach to compute a movie’s aggregate rating by averaging the user reviews for each movie. This technique will also demonstrate how to access the data in our MovieLens class. Getting ready These recipes are sequential in nature. Thus, you should have completed the previous recipes in the article before starting with this one. How to do it… Follow these steps to output numeric scores for all movies in the dataset and compute a top-10 list: Augment the MovieLens class with a new method to get all reviews for a particular movie: In [8]: class MovieLens(object): ...: ...: ...: defreviews_for_movie(self, movieid): ...: “““ ...: Yields the reviews for a given movie ...: “““ ...: for review in self.reviews.values(): ...: if movieid in review: ...: yield review[movieid] ...: Then, add an additional method to compute the top 10 movies reviewed by users: In [9]: import heapq ...: from operator import itemgetter ...: class MovieLens(object): ...: ...: defaverage_reviews(self): ...: “““ ...: Averages the star rating for all movies. Yields a tuple of movieid, ...: the average rating, and the number of reviews. ...: “““ ...: for movieid in self.movies: ...: reviews = list(r[‘rating’] for r in self.reviews_for_movie(movieid)) ...: average = sum(reviews) / float(len(reviews)) ...: yield (movieid, average, len(reviews)) ...: ...: deftop_rated(self, n=10): ...: “““ ...: Yields the n top rated movies ...: “““ ...: return heapq.nlargest(n, self.bayesian_average(), key=itemgetter(1)) ...: Note that the … notation just below class MovieLens(object): signifies that we will be appending the average_reviews method to the existing MovieLens class. Now, let’s print the top-rated results: In [10]: for mid, avg, num in model.top_rated(10): ...: title = model.movies[mid][‘title’] ...: print “[%0.3f average rating (%i reviews)] %s” % (avg, num,title) Executing the preceding commands in your REPL should produce the following output: Out [10]: [5.000 average rating (1 reviews)] Entertaining Angels: The Dorothy Day Story (1996) [5.000 average rating (2 reviews)] Santa with Muscles (1996) [5.000 average rating (1 reviews)] Great Day in Harlem, A (1994) [5.000 average rating (1 reviews)] They Made Me a Criminal (1939) [5.000 average rating (1 reviews)] Aiqingwansui (1994) [5.000 average rating (1 reviews)] Someone Else’s America (1995) [5.000 average rating (2 reviews)] Saint of Fort Washington, The (1993) [5.000 average rating (3 reviews)] Prefontaine (1997) [5.000 average rating (3 reviews)] Star Kid (1997) [5.000 average rating (1 reviews)] Marlene Dietrich: Shadow and Light (1996) How it works… The new reviews_for_movie() method that is added to the MovieLens class iterates through our review dictionary values (which are indexed by the userid parameter), checks whether the movieid value has been reviewed by the user, and then presents that review dictionary. We will need such functionality for the next method. With the average_review() method, we have created another generator function that goes through all of our movies and all of their reviews and presents the movie ID, the average rating, and the number of reviews. The top_rated function uses the heapq module to quickly sort the reviews based on the average. The heapq data structure, also known as the priority queue algorithm, is the Python implementation of an abstract data structure with interesting and useful properties. Heaps are binary trees that are built so that every parent node has a value that is either less than or equal to any of its children nodes. Thus, the smallest element is the root of the tree, which can be accessed in constant time, which is a very desirable property. With heapq, Python developers have an efficient means to insert new values in an ordered data structure and also return sorted values. There’s more… Here, we run into our first problem—some of the top-rated movies only have one review (and conversely, so do the worst-rated movies). How do you compare Casablanca, which has a 4.457 average rating (243 reviews), with Santa with Muscles, which has a 5.000 average rating (2 reviews)? We are sure that those two reviewers really liked Santa with Muscles, but the high rating for Casablanca is probably more meaningful because more people liked it. Most recommenders with star ratings will simply output the average rating along with the number of reviewers, allowing the user to determine their quality; however, as data scientists, we can do better in the next recipe. See also The heapq documentation available at https://docs.python.org/2/library/heapq.html We have thus pointed out that companies such as Amazon track purchases and page views to make recommendations, Goodreads and Yelp use 5 star ratings and text reviews, and sites such as Reddit or Stack Overflow use simple up/down voting. You can see that preference can be expressed in the data in different ways, from Boolean flags to voting to ratings. However, these preferences are expressed by attempting to find groups of similarities in preference expressions in which you are leveraging the core assumption of collaborative filtering. More formally, we understand that two people, Bob and Alice, share a preference for a specific item or widget. If Alice too has a preference for a different item, say, sprocket, then Bob has a better than random chance of also sharing a preference for a sprocket. We believe that Bob and Alice’s taste similarities can be expressed in an aggregate via a large number of preferences, and by leveraging the collaborative nature of groups, we can filter the world of products. Summary In the recipes we learned various ways for understanding data and finding highest scoring reviews using IPython.  Resources for Article: Further resources on this subject: The Data Science Venn Diagram [article] Python Data Science Up and Running [article] Data Science with R [article]
Read more
  • 0
  • 0
  • 5425

article-image-building-product-recommendation-system
Packt
29 Mar 2016
25 min read
Save for later

Building a Product Recommendation System

Packt
29 Mar 2016
25 min read
 In this article by Raghav Bali and Dipanjan Sarkar, author of the book R Machine Learning By Example, we will discuss about, collaborative filtering is a simple yet very effective approach for predicting and recommending items to users. If we look closely, the algorithms work on input data, which is nothing but a matrix representation of the user ratings for different products. Bringing in a mathematical perspective into the picture, matrix factorization is a technique to manipulate matrices and identify latent or hidden features from the data represented in the matrix. Building upon the same concept, let us use matrix factorization as the basis for predicting ratings for items which the user has not yet rated. (For more resources related to this topic, see here.)  Matrix factorization Matrix factorization refers to identification of two or more matrices such that when these matrices are multiplied we get the original matrix. Matrix factorization, as mentioned earlier, can be used to discover latent features between two different kinds of entities. We will understand and use the concepts of matrix factorization as we go along preparing our recommender engine for our e-commerce platform. As our aim for the current project is to personalize the shopping experience and recommend product ratings for an e-commerce platform, our input data contains user ratings for various products on the website. We process the input data and transform it into matrix representation for analyzing it using matrix factorization. The input data looks like this: User ratings matrix As you can see, the input data is a matrix with each row representing a particular user's rating for different items represented in the columns. For the current case, the columns representing items are different mobile phones such as iPhone 4, iPhone 5s, Nexus 5, and so on. Each row contains ratings for each of these mobile phones as given by eight different users. The ratings range from 1 to 5 with 1 being the lowest and 5 being the highest. A rating of 0 represents unrated items or missing rating. The task of our recommender engine will be to predict the correct rating for the missing ones in the input matrix. We could then use the predicted ratings to recommend items most desired by the user. The premise here is that two users would rate a product similarly if they like similar features of the product or item. Since our current data is related to user ratings for different mobile phones, people might rate the phones based on their hardware configuration, price, OS, and so on. Hence, matrix factorization tries to identify these latent features to predict ratings for a certain user and a certain product. While trying to identify these latent features, we proceed with the basic assumption that the number of such features is less than the total number of items in consideration. This assumption makes sense because if this was the case, then each user would have a specific feature associated with him/her (and similarly for the product). This would in turn make recommendations futile as none of the users would be interested in items rated by the other users (which is not the case usually). Now let us get into the mathematical details of matrix factorization and our recommender engine. Since we are dealing with user ratings for different products, let us assume U to be a matrix representing user preferences and similarly, a matrix P representing the products for which we have the ratings. Then the ratings matrix R will be defined as R = U x PT (we take transpose of P as PT for matrix multiplication) where, |R| = |U| x |P|. Assuming the process helps us identify K latent features, our aim is to find two matrices X and Y such that their product (matrix multiplication) approximates R. X = |U| x K matrix Y = |P| x K matrix Here, X is a user related matrix which represents the associations between the users and the latent features. Y, on the other hand, is the product related matrix which represents the associations between the products and the latent features. The task of predicting the rating of a product pj by a user ui is done by calculating the dot product of the vectors corresponding to pj (vector Y, that is the user) and ui (vector X, that is the product). Now, to find the matrices X and Y, we utilize a technique called gradient descent. Gradient descent, in simple terms, tries to find the local minimum of a function; it is an optimization technique. We use gradient descent in the current context to iteratively minimize the difference between the predicted ratings and the actual ratings. To begin with, we randomly initialize the matrices X and Y and then calculate how different their product is from the actual ratings matrix R. The difference between the predicted and the actual values is what is termed as the error. For our problem, we will consider the squared error, which is calculated as: Here, rij is the actual rating by user i for product j and  is the predicted value of the same. To minimize the error, we need to find the correct direction or gradient to change our values to. To obtain the gradient for each of the variables x and y, we differentiate them separately as:   Hence, the equations to find xik and ykj can be given as:   Here α is the constant to denote the rate of descent or the rate of approaching the minima (also known as the learning rate). The value of α defines the size of steps we take in either direction to reach the minima. Large values may lead to oscillations as we may overshoot the minima every time. Usual practice is to select very small values for α, of the order 10-4.  and  are the updated values of xik and ykj after each iteration of gradient descent. To avoid overfitting, along with controlling extreme or large values in the matrices X and Y, we introduce the concept of regularization. Formally, regularization refers to the process of introducing additional information in order to prevent overfitting. Regularization penalizes models with extreme values. To prevent overfitting in our case, we introduce the regularization constant called β. With introduction of β, the equations are updated as follows: Also, As we already have the ratings matrix R and we use it determine how far are our predicted values from the actual, matrix factorization turns into a supervised learning problem. We use some of the rows as our training samples. Let S be our training set with elements being tuples of the form (ui, pj, rij). Thus, our task is to minimise the error (eij) for every tuple (ui, pj, rij) ϵ training set S. The overall error (say E) can be calculated as: E = ∑(ui, pj, rij) ϵS eij   = ∑(ui, pj, rij) ϵS (rij - ∑(k=1 to K) xikykj)2 Implementation Now that we have looked into the mathematics of matrix factorization, let us convert the algorithm into code and prepare a recommender engine for the mobile phone ratings input data set discussed earlier. As shown in the Matrix factorization section, the input dataset is a matrix with each row representing a user's rating for the products mentioned as columns. The ratings range from 1 to 5 with 0 representing the missing values. To transform our algorithm into working code, we need to compute and complete the following tasks: Load the input data and transform it into ratings matrix representation Prepare a matrix factorization based recommendation model Predict and recommend products to the users Interpret and evaluate the model Loading and transforming input data into matrix representation is simple. As seen earlier, R provides us with easy to use utility functions for the same. # load raw ratings from csv raw_ratings <- read.csv(<file_name>)   # convert columnar data to sparse ratings matrix ratings_matrix <- data.matrix(raw_ratings) Now that we have our data loaded into an R matrix, we proceed and prepare the user-latent features matrix X and item-latent features matrix Y. We initialize both from uniform distributions using the runif function. # number of rows in ratings rows <- nrow(ratings_matrix)   # number of columns in ratings matrix columns <- ncol(ratings_matrix)   # latent features K <- 2   # User-Feature Matrix X <- matrix(runif(rows*K), nrow=rows, byrow=TRUE)   # Item-Feature Matrix Y <- matrix(runif(columns*K), nrow=columns, byrow=TRUE) The major component is the matrix factorization function itself. Let us split the task into two, calculation of the gradient and subsequently the overall error. The calculation of the gradient involves the ratings matrix R and the two factor matrices X and Y, along with the constants α and β. Since we are dealing with matrix manipulations (specifically, multiplication), we transpose Y before we begin with any further calculations. The following lines of code convert the algorithm discussed previously into R syntax. All variables follow naming convention similar to the algorithm for ease of understanding. for (i in seq(nrow(ratings_matrix))){         for (j in seq(length(ratings_matrix[i, ]))){           if (ratings_matrix[i, j] > 0){             # error           eij = ratings_matrix[i, j] - as.numeric(X[i, ] %*% Y[, j])          # gradient calculation             for (k in seq(K)){             X[i, k] = X[i, k] + alpha * (2 * eij * Y[k, j]/             - beta * X[i, k])               Y[k, j] = Y[k, j] + alpha * (2 * eij * X[i, k]/             - beta * Y[k, j])           }         }       }     } The next part of the algorithm is to calculate the overall error; we again use similar variable names for consistency: # Overall Squared Error Calculation   e = 0   for (i in seq(nrow(ratings_matrix))){      for (j in seq(length(ratings_matrix[i, ]))){        if (ratings_matrix[i, j] > 0){          e = e + (ratings_matrix[i, j] - /            as.numeric(X[i, ] %*% Y[, j]))^2          for (k in seq(K)){           e = e + (beta/2) * (X[i, k]^2 + Y[k, j]^2)         }       }     } } As a final piece, we iterate over these calculations multiple times to mitigate the risks of cold start and sparsity. We term the variable controlling multiple starts as epoch. We also terminate the calculations once the overall error drops below a certain threshold. Moreover, as we had initialized X and Y from uniform distributions, the predicted values would be real numbers. We round the final output before returning the predicted matrix. Note that this is a very simplistic implementation and a lot of complexity has been kept out for ease of understanding. Hence, this may result in the predicted matrix to contain values greater than 5. For the current scenario, it is safe to assume the values above the max scale of 5 as equivalent to 5 (and similarly for values lesser than 0). We encourage the reader to fine tune the code to handle such cases. Setting α to 0.0002, β to 0.02, K (that is, latent features) to 2, and epoch to 1000, let us see a sample run of our code with overall error threshold set to 0.001: # load raw ratings from csv raw_ratings <- read.csv("product_ratings.csv")   # convert columnar data to sparse ratings matrix ratings_matrix <- data.matrix(raw_ratings)     # number of rows in ratings rows <- nrow(ratings_matrix)   # number of columns in ratings matrix columns <- ncol(ratings_matrix)   # latent features K <- 2   # User-Feature Matrix X <- matrix(runif(rows*K), nrow=rows, byrow=TRUE)   # Item-Feature Matrix Y <- matrix(runif(columns*K), nrow=columns, byrow=TRUE)   # iterations epoch <- 10000   # rate of descent alpha <- 0.0002   # regularization constant beta <- 0.02     pred.matrix <- mf_based_ucf(ratings_matrix, X, Y, K, epoch = epoch)   # setting column names colnames(pred.matrix)<- c("iPhone.4","iPhone.5s","Nexus.5","Moto.X","Moto.G","Nexus.6",/ "One.Plus.One") The preceding lines of code utilize the functions explained earlier to prepare the recommendation model. The predicted ratings or the output matrix looks like the following: Predicted ratings matrix Result interpretation Let us do a quick visual inspection to see how good or bad our predictions have been. Consider users 1 and 3 as our training samples. From the input dataset, we can clearly see that user 1 has given high ratings to iPhones while user 3 has done the same for Android based phones. The following side by side comparison shows that our algorithm has predicted values close enough to the actual values: Ratings by user 1 Let us see the ratings of user 3 in the following screenshot: Ratings by user 3 Now that we have our ratings matrix with updated values, we are ready to recommend products to users. It is common sense to show only the products which the user hasn't rated yet. The right set of recommendations will also enable the seller to pitch the products which have high probability of being purchased by the user. The usual practice is to return a list of top N items from the unrated list of products for each user. The user in consideration is usually termed as the active-user. Let us consider user 6 as our active-user. This user has only rated Nexus 6, One Plus One, Nexus 5, and iPhone4 in that order of rating, that is Nexus 6 was highly rated and iPhone4 was rated the least. Getting a list of Top 2 recommended phones for such a customer using our algorithm would result in Moto X and Moto G (very rightly indeed, do you see why?). Thus, we built a recommender engine smart enough to recommend the right mobile phones to an Android Fanboy and saved the world from yet another catastrophe! Data to the rescue! This simple implementation of recommender engine using matrix factorization gave us a flavor of how such a system actually works. Next, let us get into some real world action using recommender engines. Production ready recommender engines In this article so far, we have learnt about recommender engines in detail and even developed one from scratch (using matrix factorization). Through all this, it is clearly evident how widespread is the application of such systems. E-commerce websites (or for that fact, any popular technology platform) out there today have tonnes of content to offer. Not only that, but the number of users is also huge. In such a scenario, where thousands of users are browsing/buying stuff simultaneously across the globe, providing recommendations to them is a task in itself. To complicate things even further, a good user experience (response times, for example) can create a big difference between two competitors. These are live examples of production systems handling millions of customers day in and day out. Fun Fact Amazon.com is one of the biggest names in the e-commerce space with 244 million active customers. Imagine the amount of data being processed to provide recommendations to such a huge customer base browsing through millions of products! Source: http://www.amazon.com/b?ie=UTF8&node=8445211011 In order to provide a seamless capability for use in such platforms, we need highly optimized libraries and hardware. For a recommender engine to handle thousands of users simultaneously every second, R has a robust and reliable framework called the recommenderlab. Recommenderlab is a widely used R extension designed to provide a robust foundation for recommender engines. The focus of this library is to provide efficient handling of data, availability of standard algorithms and evaluation capabilities. In this section, we will be using recommenderlab to handle a considerably large data set for recommending items to users. We will also use the evaluation functions from recommenderlab to see how good or bad our recommendation system is. These capabilities will help us build a production ready recommender system similar (or at least closer) to what many online applications such as Amazon or Netflix use. The dataset used in this section contains ratings for 100 items as rated by 5000 users. The data has been anonymised and the product names have been replaced by product IDs. The rating scale used is 0 to 5 with 1 being the worst, 5 being the best, and 0 representing unrated items or missing ratings. To build a recommender engine using recommenderlab for a production ready system, the following steps are to be performed: Extract, transform, and analyze the data. Prepare a recommendation model and generate recommendations. Evaluate the recommendation model. We will look at all these steps in the following subsections. Extract, transform, and analyze As in case of any data intensive (particularly machine learning) application, the first and foremost step is to get the data, understand/explore it, and then transform it into the format required by the algorithm deemed fit for the current application. For our recommender engine using recommenderlab package, we will first load the data from a csv file described in the previous section and then explore it using various R functions. # Load recommenderlab library library("recommenderlab")   # Read dataset from csv file raw_data <- read.csv("product_ratings_data.csv")   # Create rating matrix from data ratings_matrix<- as(raw_data, "realRatingMatrix")   #view transformed data image(ratings_matrix[1:6,1:10]) The preceding section of code loads the recommenderlab package and then uses the standard utility function to read the product_ratings_data.csv file. For exploratory as well as further steps, we need the data to be transformed into user-item ratings matrix format (as described in the Core concepts and definitions section). The as(<data>,<type>) utility converts csv into the required ratings matrix format. The csv file contains data in the format shown in the following screenshot. Each row contains a user's rating for a specific product. The column headers are self explanatory. Product ratings data The realRatingMatrix conversion transforms the data into a matrix as shown in the following image. The users are depicted as rows while the columns represent the products. Ratings are represented using a gradient scale where white represents missing/unrated rating while black denotes a rating of 5/best. Ratings matrix representation of our data Now that we have the data in our environment, let us explore some of its characteristics and see if we can decipher some key patterns. First of all, we extract a representative sample from our main data set (refer to the screenshot Product ratings data) and analyse it for: Average rating score for our user population Spread/distribution of item ratings across the user population Number of items rated per user The following lines of code help us explore our data set sample and analyse the points mentioned previously: # Extract a sample from ratings matrix sample_ratings <-sample(ratings_matrix,1000)   # Get the mean product ratings as given by first user rowMeans(sample_ratings[1,])     # Get distribution of item ratings hist(getRatings(sample_ratings), breaks=100,/      xlab = "Product Ratings",main = " Histogram of Product Ratings")   # Get distribution of normalized item ratings hist(getRatings(normalize(sample_ratings)),breaks=100,/             xlab = "Normalized Product Ratings",main = /                 " Histogram of Normalized Product Ratings")   # Number of items rated per user hist(rowCounts(sample_ratings),breaks=50,/      xlab = "Number of Products",main =/      " Histogram of Product Count Distribution") We extract a sample of 1,000 users from our dataset for exploration purposes. The mean of product ratings as given by the first row in our user-rating sample is 2.055. This tells us that this user either hasn't seen/rated many products or he usually rates the products pretty low. To get a better idea of how the users rate products, we generate a histogram of item rating distribution. This distribution peaks around the middle, that is, 3. The histogram is shown next: Histogram for ratings distribution The histogram shows that the ratings are normally distributed around the mean with low counts for products with very high or very low ratings. Finally, we check the spread of the number of products rated by the users. We prepare a histogram which shows this spread: Histogram of number of rated products The preceding histogram shows that there are many users who have rated 70 or more products, as well as there are many users who have rated all the 100 products. The exploration step helps us get an idea of how our data is. We also get an idea about the way the users generally rate the products and how many products are being rated. Model preparation and prediction We have the data in our R environment which has been transformed into the ratings matrix format. In this section, we are interested in preparing a recommender engine based upon user-based collaborative filtering. We will be using similar terminology as described in the previous sections. Recommenderlab provides straight forward utilities to learn and prepare a model for building recommender engines. We prepare our model based upon a sample of just 1,000 users. This way, we can use this model to predict the missing ratings for rest of the users in our ratings matrix. The following lines of code utilize the first thousand rows for learning the model: # Create 'User Based collaborative filtering' model ubcf_recommender <- Recommender(ratings_matrix[1:1000],"UBCF") "UBCF" in the preceding code signifies user-based collaborative filtering. Recommenderlab also provides other algorithms, such as IBCF or Item-Based Collaborative Filtering, PCA or Principal Component Analysis, and others as well. After preparing the model, we use it to predict the ratings for our 1,010th and 1,011th users in the system. Recommenderlab also requires us to mention the number of items to be recommended to the users (in the order of preference of course). For the current case, we mention 5 as the number of items to be recommended. # Predict list of product which can be recommended to given users recommendations <- predict(ubcf_recommender,/                   ratings_matrix[1010:1011], n=5)   # show recommendation in form of the list as(recommendations, "list") The preceding lines of code generate two lists, one for each of the users. Each element in these lists is a product for recommendation. The model predicted that for user 1,010, product prod_93 should be recommended as the top most product followed by prod_79, and so on. # output generated by the model [[1]] [1] "prod_93" "prod_79" "prod_80" "prod_83" "prod_89"   [[2]] [1] "prod_80" "prod_85" "prod_87" "prod_75" "prod_79" Recommenderlab is a robust platform which is optimized to handle large datasets. With a few lines of code, we were able to load the data, learn a model, and even recommend products to the users in virtually no time. Compare this with the basic recommender engine we developed using matrix factorization which involved a lot many lines of code (when compared to recommenderlab) apart from the obvious difference in performance. Model evaluation We have successfully prepared a model and used it for predicting and recommending products to the users in our system. But what do we know about the accuracy of our model? To evaluate the prepared model, recommenderlab has handy and easy to use utilities. Since we need to evaluate our model, we need to split it into training and test data sets. Also, recommenderlab requires us to mention the number of items to be used for testing (it uses the rest for computing the error). For the current case, we will use 500 users to prepare an evaluation model. The model will be based upon 90-10 training-testing dataset split with 15 items used for test sets. # Evaluation scheme eval_scheme <- evaluationScheme(ratings_matrix[1:500],/                       method="split",train=0.9,given=15)   # View the evaluation scheme eval_scheme   # Training model training_recommender <- Recommender(getData(eval_scheme,/                        "train"), "UBCF")   # Preditions on the test dataset test_rating <- predict(training_recommender,/                getData(eval_scheme, "known"), type="ratings")   #Error error <- calcPredictionAccuracy(test_rating,/                    getData(eval_scheme, "unknown"))   error We use the evaluation scheme to train our model based upon UBCF algorithm. The prepared model from the training dataset is used to predict ratings for the given items. We finally use the method calcPredictionAccuracy to calculate the error in predicting the ratings between known and unknown components of the test set. For our case, we get an output as follows: The generated output mentions the values for RMSE or root mean squared error, MSE or mean squared error, and MAE or mean absolute error. For RMSE in particular, the values deviate from the correct values by 1.162 (note that the values might deviate slightly across runs due to various factors such as sampling, iterations, and so on). This evaluation will make more sense when the outcomes are compared from different CF algorithms. For evaluating UBCF, we use IBCF as comparator. The following few lines of code help us prepare an IBCF based model and test the ratings, which can then be compared using the calcPredictionAccuracy utility: # Training model using IBCF training_recommender_2 <- Recommender(getData(eval_scheme,/                                      "train"), "IBCF")   # Preditions on the test dataset test_rating_2 <- predict(training_recommender_2,/                   getData(eval_scheme, "known"),/                 type="ratings")   error_compare <- rbind(calcPredictionAccuracy(test_rating,/                 getData(eval_scheme, "unknown")),/                        calcPredictionAccuracy(test_rating_2,/                 getData(eval_scheme, "unknown")))   rownames(error_compare) <- c("User Based CF","Item Based CF") The comparative output shows that UBCF outperforms IBCF with lower values of RMSE, MSE, and MAE. Similarly, we can use the other algorithms available in recommenderlab to test/evaluate our models. We encourage the user to try out a few more and see which algorithm has the least error in predicted ratings. Summary In this arcticle, we continued our pursuit of using machine learning in the field of e-commerce to enhance sales and overall user experience. In this article, we accounted for the human factor and looked into the recommendation engines based upon user behavior. We started off by understanding what are recommendation systems and their classifications into user-based, content-based, and hybrid recommender systems. We touched upon the problems associated with recommender engines in general. Then we dived deep into the specifics of Collaborative Filters and discussed the math around prediction and similarity measures. After getting our basics straight, we moved onto building a recommender engine of our own from scratch. We utilized matrix factorization to build a recommender engine step by step using a small dummy dataset. We then moved onto building a production ready recommender engine using R's popular library called recommenderlab. We used user-based CF as our core algorithm to build a recommendation model upon a bigger dataset containing ratings for 100 products by 5,000 users. We closed our discussion by evaluating our recommendation model using recommenderlab's utility methods. Resources for Article: Further resources on this subject: Machine Learning with R [article] Introduction to Machine Learning with R [article] Training and Visualizing a neural network with R [article]
Read more
  • 0
  • 0
  • 5424

article-image-wpf-45-application-and-windows
Packt
24 Sep 2012
14 min read
Save for later

WPF 4.5 Application and Windows

Packt
24 Sep 2012
14 min read
Creating a window Windows are the typical top level controls in WPF. By default, a MainWindow class is created by the application wizard and automatically shown upon running the application. In this recipe, we'll take a look at creating and showing other windows that may be required during the lifetime of an application. Getting ready Make sure Visual Studio is up and running. How to do it... We'll create a new class derived from Window and show it when a button is clicked: Create a new WPF application named CH05.NewWindows. Right-click on the project node in Solution explorer, and select Add | Window…: In the resulting dialog, type OtherWindow in the Name textbox and click on Add. A file named OtherWindow.xaml should open in the editor. Add a TextBlock to the existing Grid, as follows: <TextBlock Text="This is the other window" FontSize="20" VerticalAlignment="Center" HorizontalAlignment="Center" /> Open MainWindow.xaml. Add a Button to the Grid with a Click event handler: <Button Content="Open Other Window" FontSize="30" Click="OnOpenOtherWindow" /> In the Click event handler, add the following code: void OnOpenOtherWindow(object sender, RoutedEventArgs e) { var other = new OtherWindow(); other.Show(); } Run the application, and click the button. The other window should appear and live happily alongside the main window: How it works... A Window is technically a ContentControl, so can contain anything. It's made visible using the Show method. This keeps the window open as long as it's not explicitly closed using the classic close button, or by calling the Close method. The Show method opens the window as modeless—meaning the user can return to the previous window without restriction. We can click the button more than once, and consequently more Window instances would show up. There's more... The first window shown can be configured using the Application.StartupUri property, typically set in App.xaml. It can be changed to any other window. For example, to show the OtherWindow from the previous section as the first window, open App.xaml and change the StartupUri property to OtherWindow.xaml: StartupUri="OtherWindow.xaml" Selecting the startup window dynamically Sometimes the first window is not known in advance, perhaps depending on some state or setting. In this case, the StartupUri property is not helpful. We can safely delete it, and provide the initial window (or even windows) by overriding the Application.OnStartup method as follows (you'll need to add a reference to the System.Configuration assembly for the following to compile): protected override void OnStartup(StartupEventArgs e) {    Window mainWindow = null;    // check some state or setting as appropriate          if(ConfigurationManager.AppSettings["AdvancedMode"] == "1")       mainWindow = new OtherWindow();    else       mainWindow = new MainWindow();    mainWindow.Show(); } This allows complete flexibility in determining what window or windows should appear at application startup. Accessing command line arguments The WPF application created by the New Project wizard does not expose the ubiquitous Main method. WPF provides this for us – it instantiates the Application object and eventually loads the main window pointed to by the StartupUri property. The Main method, however, is not just a starting point for managed code, but also provides an array of strings as the command line arguments passed to the executable (if any). As Main is now beyond our control, how do we get the command line arguments? Fortunately, the same OnStartup method provides a StartupEventArgs object, in which the Args property is mirrored from Main. The downloadable source for this chapter contains the project CH05.CommandLineArgs, which shows an example of its usage. Here's the OnStartup override: protected override void OnStartup(StartupEventArgs e) { string text = "Hello, default!"; if(e.Args.Length > 0) text = e.Args[0]; var win = new MainWindow(text); win.Show(); } The MainWindow instance constructor has been modified to accept a string that is later used by the window. If a command line argument is supplied, it is used. Creating a dialog box A dialog box is a Window that is typically used to get some data from the user, before some operation can proceed. This is sometimes referred to as a modal window (as opposed to modeless, or non-modal). In this recipe, we'll take a look at how to create and manage such a dialog box. Getting ready Make sure Visual Studio is up and running. How to do it... We'll create a dialog box that's invoked from the main window to request some information from the user: Create a new WPF application named CH05.Dialogs. Add a new Window named DetailsDialog.xaml (a DetailsDialog class is created). Visual Studio opens DetailsDialog.xaml. Set some Window properties: FontSize to 16, ResizeMode to NoResize, SizeToContent to Height, and make sure the Width is set to 300: ResizeMode="NoResize" SizeToContent="Height" Width="300" FontSize="16" Add four rows and two columns to the existing Grid, and add some controls for a simple data entry dialog as follows: <Grid.RowDefinitions> <RowDefinition Height="Auto" /> <RowDefinition Height="Auto"/> <RowDefinition Height="Auto"/> <RowDefinition Height="Auto"/> </Grid.RowDefinitions> <Grid.ColumnDefinitions> <ColumnDefinition Width="Auto" /> <ColumnDefinition /> </Grid.ColumnDefinitions> <TextBlock Text="Please enter details:" Grid.ColumnSpan="2" Margin="4,4,4,20" HorizontalAlignment="Center"/> <TextBlock Text="Name:" Grid.Row="1" Margin="4"/> <TextBox Grid.Column="1" Grid.Row="1" Margin="4" x_Name="_name"/> <TextBlock Text="City:" Grid.Row="2" Margin="4"/> <TextBox Grid.Column="1" Grid.Row="2" Margin="4" x_Name="_city"/> <StackPanel Grid.Row="3" Orientation="Horizontal" Margin="4,20,4,4" Grid.ColumnSpan="2" HorizontalAlignment="Center"> <Button Content="OK" Margin="4"  /> <Button Content="Cancel" Margin="4" /> </StackPanel> This is how it should look in the designer: The dialog should expose two properties for the name and city the user has typed in. Open DetailsDialog.xaml.cs. Add two simple properties: public string FullName { get; private set; } public string City { get; private set; } We need to show the dialog from somewhere in the main window. Open MainWindow.xaml, and add the following markup to the existing Grid: <Grid.RowDefinitions>     <RowDefinition Height="Auto" />     <RowDefinition /> </Grid.RowDefinitions> <Button Content="Enter Data" Click="OnEnterData"         Margin="4" FontSize="16"/> <TextBlock FontSize="24" x_Name="_text" Grid.Row="1"     VerticalAlignment="Center" HorizontalAlignment="Center"/> In the OnEnterData handler, add the following: private void OnEnterData(object sender, RoutedEventArgs e) { var dlg = new DetailsDialog(); if(dlg.ShowDialog() == true) { _text.Text = string.Format( "Hi, {0}! I see you live in {1}.", dlg.FullName, dlg.City); } } Run the application. Click the button and watch the dialog appear. The buttons don't work yet, so your only choice is to close the dialog using the regular close button. Clearly, the return value from ShowDialog is not true in this case. When the OK button is clicked, the properties should be set accordingly. Add a Click event handler to the OK button, with the following code: private void OnOK(object sender, RoutedEventArgs e) { FullName = _name.Text; City = _city.Text; DialogResult = true; Close(); } The Close method dismisses the dialog, returning control to the caller. The DialogResult property indicates the returned value from the call to ShowDialog when the dialog is closed. Add a Click event handler for the Cancel button with the following code: private void OnCancel(object sender, RoutedEventArgs e) { DialogResult = false; Close(); } Run the application and click the button. Enter some data and click on OK: You will see the following window: How it works... A dialog box in WPF is nothing more than a regular window shown using ShowDialog instead of Show. This forces the user to dismiss the window before she can return to the invoking window. ShowDialog returns a Nullable (can be written as bool? in C#), meaning it can have three values: true, false, and null. The meaning of the return value is mostly up to the application, but typically true indicates the user dismissed the dialog with the intention of making something happen (usually, by clicking some OK or other confirmation button), and false means the user changed her mind, and would like to abort. The null value can be used as a third indicator to some other application-defined condition. The DialogResult property indicates the value returned from ShowDialog because there is no other way to convey the return value from the dialog invocation directly. That's why the OK button handler sets it to true and the Cancel button handler sets it to false (this also happens when the regular close button is clicked, or Alt + F4 is pressed). Most dialog boxes are not resizable. This is indicated with the ResizeMode property of the Window set to NoResize. However, because of WPF's flexible layout, it certainly is relatively easy to keep a dialog resizable (and still manageable) where it makes sense (such as when entering a potentially large amount of text in a TextBox – it would make sense if the TextBox could grow if the dialog is enlarged). There's more... Most dialogs can be dismissed by pressing Enter (indicating the data should be used) or pressing Esc (indicating no action should take place). This is possible to do by setting the OK button's IsDefault property to true and the Cancel button's IsCancel property to true. The default button is typically drawn with a heavier border to indicate it's the default button, although this eventually depends on the button's control template. If these settings are specified, the handler for the Cancel button is not needed. Clicking Cancel or pressing Esc automatically closes the dialog (and sets DiaglogResult to false). The OK button handler is still needed as usual, but it may be invoked by pressing Enter, no matter what control has the keyboard focus within the Window. The CH05.DefaultButtons project from the downloadable source for this chapter demonstrates this in action. Modeless dialogs A dialog can be show as modeless, meaning it does not force the user to dismiss it before returning to other windows in the application. This is done with the usual Show method call – just like any Window. The term dialog in this case usually denotes some information expected from the user that affects other windows, sometimes with the help of another button labelled "Apply". The problem here is mostly logical—how to convey the information change. The best way would be using data binding, rather than manually modifying various objects. We'll take an extensive look at data binding in the next chapter. Using the common dialog boxes Windows has its own built-in dialog boxes for common operations, such as opening files, saving a file, and printing. Using these dialogs is very intuitive from the user's perspective, because she has probably used those dialogs before in other applications. WPF wraps some of these (native) dialogs. In this recipe, we'll see how to use some of the common dialogs. Getting ready Make sure Visual Studio is up and running. How to do it... We'll create a simple image viewer that uses the Open common dialog box to allow the user to select an image file to view: Create a new WPF Application named CH05.CommonDialogs. Open MainWindow.xaml. Add the following markup to the existing Grid: <Grid.RowDefinitions>     <RowDefinition Height="Auto" />     <RowDefinition /> </Grid.RowDefinitions> <Button Content="Open Image" FontSize="20" Click="OnOpenImage"         HorizontalAlignment="Center" Margin="4" /> <Image Grid.Row="1" x_Name="_img" Stretch="Uniform" /> Add a Click event handler for the button. In the handler, we'll first create an OpenFileDialog instance and initialize it (add a using to the Microsoft.Win32 namespace): void OnOpenImage(object sender, RoutedEventArgs e) { var dlg = new OpenFileDialog { Filter = "Image files|*.png;*.jpg;*.gif;*.bmp", Title = "Select image to open", InitialDirectory = Environment.GetFolderPath( Environment.SpecialFolder.MyPictures) }; Now we need to show the dialog and use the selected file (if any): if(dlg.ShowDialog() == true) { try { var bmp = new BitmapImage(new Uri(dlg.FileName)); _img.Source = bmp; } catch(Exception ex) { MessageBox.Show(ex.Message, "Open Image"); } } Run the application. Click the button and navigate to an image file and select it. You should see something like the following: How it works... The OpenFileDialog class wraps the Win32 open/save file dialog, providing easy enough access to its capabilities. It's just a matter of instantiating the object, setting some properties, such as the file types (Filter property) and then calling ShowDialog. This call, in turn, returns true if the user selected a file and false otherwise (null is never returned, although the return type is still defined as Nullable for consistency). The look of the Open file dialog box may be different in various Windows versions. This is mostly unimportant unless some automated UI testing is done. In this case, the way the dialog looks or operates may have to be taken into consideration when creating the tests. The filename itself is returned in the FileName property (full path). Multiple selections are possible by setting the MultiSelect property to true (in this case the FileNames property returns the selected files). There's more... WPF similarly wraps the Save As common dialog with the SaveFileDialog class (in the Microsoft.Win32 namespace as well). Its use is very similar to OpenFileDialog (in fact, both inherit from the abstract FileDialog class). What about folder selection (instead of files)? The WPF OpenFileDialog does not support that. One solution is to use Windows Forms' FolderBrowseDialog class. Another good solution is to use the Windows API Code Pack described shortly. Another common dialog box WPF wraps is PrintDialog (in System.Windows.Controls). This shows the familiar print dialog, with options to select a printer, orientation, and so on. The most straightforward way to print would be calling PrintVisual (after calling ShowDialog), providing anything that derives from the Visual abstract class (which include all elements). General printing is a complex topic and is beyond the scope of this book. What about colors and fonts? Windows also provides common dialogs for selecting colors and fonts. However, these are not wrapped by WPF. There are several alternatives: Use the equivalent Windows Forms classes (FontDialog and ColorDialog, both from System.Windows.Forms) Wrap the native dialogs yourself Look for alternatives on the Web The first option is possible, but has two drawbacks: first, it requires adding reference to the System.Windows.Forms assembly; this adds a dependency at compile time, and increases memory consumption at run time, for very little gain. The second drawback has to do with the natural mismatch between Windows Forms and WPF. For example, ColorDialog returns a color as a System.Drawing.Color, but WPF uses System.Windows.Media.Color. This requires mapping a GDI+ color (WinForms) to WPF's color, which is cumbersome at best. The second option of doing your own wrapping is a non-trivial undertaking and requires good interop knowledge. The other downside is that the default color and font common dialogs are pretty old (especially the color dialog), so there's much room for improvement. The third option is probably the best one. There are more than a few good candidates for color and font pickers. For a color dialog, for example, you can use the ColorPicker or ColorCanvas provided with the Extended WPF toolkit library on CodePlex (http://wpftoolkit.codeplex.com/). Here's how these may look (ColorCanvas on the left-hand side, and one of the possible views of ColorPicker on the right-hand side): The Windows API Code Pack The Windows API Code Pack is a Microsoft project on CodePlex (http://archive.msdn.microsoft.com/WindowsAPICodePack) that provides many .NET wrappers to native Windows features, in various areas, such as shell, networking, Windows 7 features (this is less important now as WPF 4 added first class support for Windows 7), power management, and DirectX. One of the Shell features in the library is a wrapper for the Open dialog box that allows selecting a folder instead of a file. This has no dependency on the WinForms assembly.
Read more
  • 0
  • 0
  • 5420
Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at $19.99/month. Cancel anytime
article-image-elasticsearch-administration
Packt
03 Mar 2015
28 min read
Save for later

Elasticsearch Administration

Packt
03 Mar 2015
28 min read
In this article by Rafał Kuć and Marek Rogoziński, author of the book Mastering Elasticsearch, Second Edition we will talk more about the Elasticsearch configuration and new features introduced in Elasticsearch 1.0 and higher. By the end of this article, you will have learned: (For more resources related to this topic, see here.) Configuring the discovery and recovery modules Using the Cat API that allows a human-readable insight into the cluster status The backup and restore functionality Federated search Discovery and recovery modules When starting your Elasticsearch node, one of the first things that Elasticsearch does is look for a master node that has the same cluster name and is visible in the network. If a master node is found, the starting node gets joined into an already formed cluster. If no master is found, then the node itself is selected as a master (of course, if the configuration allows such behavior). The process of forming a cluster and finding nodes is called discovery. The module responsible for discovery has two main purposes—electing a master and discovering new nodes within a cluster. After the cluster is formed, a process called recovery is started. During the recovery process, Elasticsearch reads the metadata and the indices from the gateway, and prepares the shards that are stored there to be used. After the recovery of the primary shards is done, Elasticsearch should be ready for work and should continue with the recovery of all the replicas (if they are present). In this section, we will take a deeper look at these two modules and discuss the possibilities of configuration Elasticsearch gives us and what the consequences of changing them are. Note that the information provided in the Discovery and recovery modules section is an extension of what we already wrote in Elasticsearch Server Second Edition, published by Packt Publishing. Discovery configuration As we have already mentioned multiple times, Elasticsearch was designed to work in a distributed environment. This is the main difference when comparing Elasticsearch to other open source search and analytics solutions available. With such assumptions, Elasticsearch is very easy to set up in a distributed environment, and we are not forced to set up additional software to make it work like this. By default, Elasticsearch assumes that the cluster is automatically formed by the nodes that declare the same cluster.name setting and can communicate with each other using multicast requests. This allows us to have several independent clusters in the same network. There are a few implementations of the discovery module that we can use, so let's see what the options are. Zen discovery Zen discovery is the default mechanism that's responsible for discovery in Elasticsearch and is available by default. The default Zen discovery configuration uses multicast to find other nodes. This is a very convenient solution: just start a new Elasticsearch node and everything works—this node will be joined to the cluster if it has the same cluster name and is visible by other nodes in that cluster. This discovery method is perfectly suited for development time, because you don't need to care about the configuration; however, it is not advised that you use it in production environments. Relying only on the cluster name is handy but can also lead to potential problems and mistakes, such as the accidental joining of nodes. Sometimes, multicast is not available for various reasons or you don't want to use it for these mentioned reasons. In the case of bigger clusters, the multicast discovery may generate too much unnecessary traffic, and this is another valid reason why it shouldn't be used for production. For these cases, Zen discovery allows us to use the unicast mode. When using the unicast Zen discovery, a node that is not a part of the cluster will send a ping request to all the addresses specified in the configuration. By doing this, it informs all the specified nodes that it is ready to be a part of the cluster and can be either joined to an existing cluster or can form a new one. Of course, after the node joins the cluster, it gets the cluster topology information, but the initial connection is only done to the specified list of hosts. Remember that even when using unicast Zen discovery, the Elasticsearch node still needs to have the same cluster name as the other nodes. If you want to know more about the differences between multicast and unicast ping methods, refer to these URLs: http://en.wikipedia.org/wiki/Multicast and http://en.wikipedia.org/wiki/Unicast. If you still want to learn about the configuration properties of multicast Zen discovery, let's look at them. Multicast Zen discovery configuration The multicast part of the Zen discovery module exposes the following settings: discovery.zen.ping.multicast.address (the default: all available interfaces): This is the interface used for the communication given as the address or interface name. discovery.zen.ping.multicast.port (the default: 54328): This port is used for communication. discovery.zen.ping.multicast.group (the default: 224.2.2.4): This is the multicast address to send messages to. discovery.zen.ping.multicast.buffer_size (the default: 2048): This is the size of the buffer used for multicast messages. discovery.zen.ping.multicast.ttl (the default: 3): This is the time for which a multicast message lives. Every time a packet crosses the router, the TTL is decreased. This allows for the limiting area where the transmission can be received. Note that routers can have the threshold values assigned compared to TTL, which causes that TTL value to not match exactly the number of routers that a packet can jump over. discovery.zen.ping.multicast.enabled (the default: true): Setting this property to false turns off the multicast. You should disable multicast if you are planning to use the unicast discovery method. The unicast Zen discovery configuration The unicast part of Zen discovery provides the following configuration options: discovery.zen.ping.unicats.hosts: This is the initial list of nodes in the cluster. The list can be defined as a list or as an array of hosts. Every host can be given a name (or an IP address) or have a port or port range added. For example, the value of this property can look like this: ["master1", "master2:8181", "master3[80000-81000]"]. So, basically, the hosts' list for the unicast discovery doesn't need to be a complete list of Elasticsearch nodes in your cluster, because once the node is connected to one of the mentioned nodes, it will be informed about all the others that form the cluster. discovery.zen.ping.unicats.concurrent_connects (the default: 10): This is the maximum number of concurrent connections unicast discoveries will use. If you have a lot of nodes that the initial connection should be made to, it is advised that you increase the default value. Master node One of the main purposes of discovery apart from connecting to other nodes is to choose a master node—a node that will take care of and manage all the other nodes. This process is called master election and is a part of the discovery module. No matter how many master eligible nodes there are, each cluster will only have a single master node active at a given time. If there is more than one master eligible node present in the cluster, they can be elected as the master when the original master fails and is removed from the cluster. Configuring master and data nodes By default, Elasticsearch allows every node to be a master node and a data node. However, in certain situations, you may want to have worker nodes, which will only hold the data or process the queries and the master nodes that will only be used as cluster-managed nodes. One of these situations is to handle a massive amount of data, where data nodes should be as performant as possible, and there shouldn't be any delay in master nodes' responses. Configuring data-only nodes To set the node to only hold data, we need to instruct Elasticsearch that we don't want such a node to be a master node. In order to do this, we add the following properties to the elasticsearch.yml configuration file: node.master: falsenode.data: true Configuring master-only nodes To set the node not to hold data and only to be a master node, we need to instruct Elasticsearch that we don't want such a node to hold data. In order to do that, we add the following properties to the elasticsearch.yml configuration file: node.master: truenode.data: false Configuring the query processing-only nodes For large enough deployments, it is also wise to have nodes that are only responsible for aggregating query results from other nodes. Such nodes should be configured as nonmaster and nondata, so they should have the following properties in the elasticsearch.yml configuration file: node.master: falsenode.data: false Please note that the node.master and the node.data properties are set to true by default, but we tend to include them for configuration clarity. The master election configuration We already wrote about the master election configuration in Elasticsearch Server Second Edition, but this topic is very important, so we decided to refresh our knowledge about it. Imagine that you have a cluster that is built of 10 nodes. Everything is working fine until, one day, your network fails and three of your nodes are disconnected from the cluster, but they still see each other. Because of the Zen discovery and the master election process, the nodes that got disconnected elect a new master and you end up with two clusters with the same name with two master nodes. Such a situation is called a split-brain and you must avoid it as much as possible. When a split-brain happens, you end up with two (or more) clusters that won't join each other until the network (or any other) problems are fixed. If you index your data during this time, you may end up with data loss and unrecoverable situations when the nodes get joined together after the network split. In order to prevent split-brain situations or at least minimize the possibility of their occurrences, Elasticsearch provides a discovery.zen.minimum_master_nodes property. This property defines a minimum amount of master eligible nodes that should be connected to each other in order to form a cluster. So now, let's get back to our cluster; if we set the discovery.zen.minimum_master_nodes property to 50 percent of the total nodes available plus one (which is six, in our case), we would end up with a single cluster. Why is that? Before the network failure, we would have 10 nodes, which is more than six nodes, and these nodes would form a cluster. After the disconnections of the three nodes, we would still have the first cluster up and running. However, because only three nodes disconnected and three is less than six, these three nodes wouldn't be allowed to elect a new master and they would wait for reconnection with the original cluster. Zen discovery fault detection and configuration Elasticsearch runs two detection processes while it is working. The first process is to send ping requests from the current master node to all the other nodes in the cluster to check whether they are operational. The second process is a reverse of that—each of the nodes sends ping requests to the master in order to verify that it is still up and running and performing its duties. However, if we have a slow network or our nodes are in different hosting locations, the default configuration may not be sufficient. Because of this, the Elasticsearch discovery module exposes three properties that we can change: discovery.zen.fd.ping_interval: This defaults to 1s and specifies the interval of how often the node will send ping requests to the target node. discovery.zen.fd.ping_timeout: This defaults to 30s and specifies how long the node will wait for the sent ping request to be responded to. If your nodes are 100 percent utilized or your network is slow, you may consider increasing that property value. discovery.zen.fd.ping_retries: This defaults to 3 and specifies the number of ping request retries before the target node will be considered not operational. You can increase this value if your network has a high number of lost packets (or you can fix your network). There is one more thing that we would like to mention. The master node is the only node that can change the state of the cluster. To achieve a proper cluster state updates sequence, Elasticsearch master nodes process single cluster state update requests one at a time, make the changes locally, and send the request to all the other nodes so that they can synchronize their state. The master nodes wait for the given time for the nodes to respond, and if the time passes or all the nodes are returned, with the current acknowledgment information, it proceeds with the next cluster state update request processing. To change the time, the master node waits for all the other nodes to respond, and you should modify the default 30 seconds time by setting the discovery.zen.publish_timeout property. Increasing the value may be needed for huge clusters working in an overloaded network. The Amazon EC2 discovery Amazon, in addition to selling goods, has a few popular services such as selling storage or computing power in a pay-as-you-go model. So-called Amazon Elastic Compute Cloud (EC2) provides server instances and, of course, they can be used to install and run Elasticsearch clusters (among many other things, as these are normal Linux machines). This is convenient—you pay for instances that are needed in order to handle the current traffic or to speed up calculations, and you shut down unnecessary instances when the traffic is lower. Elasticsearch works well on EC2, but due to the nature of the environment, some features may work slightly differently. One of these features that works differently is discovery, because Amazon EC2 doesn't support multicast discovery. Of course, we can switch to unicast discovery, but sometimes, we want to be able to automatically discover nodes and, with unicast, we need to at least provide the initial list of hosts. However, there is an alternative—we can use the Amazon EC2 plugin, a plugin that combines the multicast and unicast discovery methods using the Amazon EC2 API. Make sure that during the set up of EC2 instances, you set up communication between them (on port 9200 and 9300 by default). This is crucial in order to have Elasticsearch nodes communicate with each other and, thus, cluster functioning is required. Of course, this communication depends on network.bind_host and network.publish_host (or network.host) settings. The EC2 plugin installation The installation of a plugin is as simple as with most of the plugins. In order to install it, we should run the following command: bin/plugin install elasticsearch/elasticsearch-cloud-aws/2.4.0 The EC2 plugin's generic configuration This plugin provides several configuration settings that we need to provide in order for the EC2 discovery to work: cluster.aws.access_key: Amazon access key—one of the credential values you can find in the Amazon configuration panel cluster.aws.secret_key: Amazon secret key—similar to the previously mentioned access_key setting, it can be found in the EC2 configuration panel The last thing is to inform Elasticsearch that we want to use a new discovery type by setting the discovery.type property to ec2 value and turn off multicast. Optional EC2 discovery configuration options The previously mentioned settings are sufficient to run the EC2 discovery, but in order to control the EC2 discovery plugin behavior, Elasticsearch exposes additional settings: cloud.aws.region: This region will be used to connect with Amazon EC2 web services. You can choose a region that's adequate for the region where your instance resides, for example, eu-west-1 for Ireland. The possible values can be eu-west, sa-east, us-east, us-west-1, us-west-2, ap-southeast-1, and ap-southeast-1. cloud.aws.ec2.endpoint: If you are using EC2 API services, instead of defining a region, you can provide an address of the AWS endpoint, for example, ec2.eu-west-1.amazonaws.com. cloud.aws.protocol: This is the protocol that should be used by the plugin to connect to Amazon Web Services endpoints. By default, Elasticsearch will use the HTTPS protocol (which means setting the value of the property to https). We can also change this behavior and set the property to http for the plugin to use HTTP without encryption. We are also allowed to overwrite the cloud.aws.protocol settings for each service by using the cloud.aws.ec2.protocol and cloud.aws.s3.protocol properties (the possible values are the same—https and http). cloud.aws.proxy_host: Elasticsearch allows us to define a proxy that will be used to connect to AWS endpoints. The cloud.aws.proxy_host property should be set to the address to the proxy that should be used. cloud.aws.proxy_port: The second property related to the AWS endpoints proxy allows us to specify the port on which the proxy is listening. The cloud.aws.proxy_port property should be set to the port on which the proxy listens. discovery.ec2.ping_timeout (the default: 3s): This is the time to wait for the response for the ping message sent to the other node. After this time, the nonresponsive node will be considered dead and removed from the cluster. Increasing this value makes sense when dealing with network issues or we have a lot of EC2 nodes. The EC2 nodes scanning configuration The last group of settings we want to mention allows us to configure a very important thing when building cluster working inside the EC2 environment—the ability to filter available Elasticsearch nodes in our Amazon Elastic Cloud Computing network. The Elasticsearch EC2 plugin exposes the following properties that can help us configure its behavior: discovery.ec2.host_type: This allows us to choose the host type that will be used to communicate with other nodes in the cluster. The values we can use are private_ip (the default one; the private IP address will be used for communication), public_ip (the public IP address will be used for communication), private_dns (the private hostname will be used for communication), and public_dns (the public hostname will be used for communication). discovery.ec2.groups: This is a comma-separated list of security groups. Only nodes that fall within these groups can be discovered and included in the cluster. discovery.ec2.availability_zones: This is array or command-separated list of availability zones. Only nodes with the specified availability zones will be discovered and included in the cluster. discovery.ec2.any_group (this defaults to true): Setting this property to false will force the EC2 discovery plugin to discover only those nodes that reside in an Amazon instance that falls into all of the defined security groups. The default value requires only a single group to be matched. discovery.ec2.tag: This is a prefix for a group of EC2-related settings. When you launch your Amazon EC2 instances, you can define tags, which can describe the purpose of the instance, such as the customer name or environment type. Then, you use these defined settings to limit discovery nodes. Let's say you define a tag named environment with a value of qa. In the configuration, you can now specify the following: discovery.ec2.tag.environment: qa and only nodes running on instances with this tag will be considered for discovery. cloud.node.auto_attributes: When this is set to true, Elasticsearch will add EC2-related node attributes (such as the availability zone or group) to the node properties and will allow us to use them, adjusting the Elasticsearch shard allocation and configuring the shard placement. Other discovery implementations The Zen discovery and EC2 discovery are not the only discovery types that are available. There are two more discovery types that are developed and maintained by the Elasticsearch team, and these are: Azure discovery: https://github.com/elasticsearch/elasticsearch-cloud-azure Google Compute Engine discovery: https://github.com/elasticsearch/elasticsearch-cloud-gce In addition to these, there are a few discovery implementations provided by the community, such as the ZooKeeper discovery for older versions of Elasticsearch (https://github.com/sonian/elasticsearch-zookeeper). The gateway and recovery configuration The gateway module allows us to store all the data that is needed for Elasticsearch to work properly. This means that not only is the data in Apache Lucene indices stored, but also all the metadata (for example, index allocation settings), along with the mappings configuration for each index. Whenever the cluster state is changed, for example, when the allocation properties are changed, the cluster state will be persisted by using the gateway module. When the cluster is started up, its state will be loaded using the gateway module and applied. One should remember that when configuring different nodes and different gateway types, indices will use the gateway type configuration present on the given node. If an index state should not be stored using the gateway module, one should explicitly set the index gateway type to none. The gateway recovery process Let's say explicitly that the recovery process is used by Elasticsearch to load the data stored with the use of the gateway module in order for Elasticsearch to work. Whenever a full cluster restart occurs, the gateway process kicks in to load all the relevant information we've mentioned—the metadata, the mappings, and of course, all the indices. When the recovery process starts, the primary shards are initialized first, and then, depending on the replica state, they are initialized using the gateway data, or the data is copied from the primary shards if the replicas are out of sync. Elasticsearch allows us to configure when the cluster data should be recovered using the gateway module. We can tell Elasticsearch to wait for a certain number of master eligible or data nodes to be present in the cluster before starting the recovery process. However, one should remember that when the cluster is not recovered, all the operations performed on it will not be allowed. This is done in order to avoid modification conflicts. Configuration properties Before we continue with the configuration, we would like to say one more thing. As you know, Elasticsearch nodes can play different roles—they can have a role of data nodes—the ones that hold data—they can have a master role, or they can be only used for request handing, which means not holding data and not being master eligible. Remembering all this, let's now look at the gateway configuration properties that we are allowed to modify: gateway.recover_after_nodes: This is an integer number that specifies how many nodes should be present in the cluster for the recovery to happen. For example, when set to 5, at least 5 nodes (doesn't matter whether they are data or master eligible nodes) must be present for the recovery process to start. gateway.recover_after_data_nodes: This is an integer number that allows us to set how many data nodes should be present in the cluster for the recovery process to start. gateway.recover_after_master_nodes: This is another gateway configuration option that allows us to set how many master eligible nodes should be present in the cluster for the recovery to start. gateway.recover_after_time: This allows us to set how much time to wait before the recovery process starts after the conditions defined by the preceding properties are met. If we set this property to 5m, we tell Elasticsearch to start the recovery process 5 minutes after all the defined conditions are met. The default value for this property is 5m, starting from Elasticsearch 1.3.0. Let's imagine that we have six nodes in our cluster, out of which four are data eligible. We also have an index that is built of three shards, which are spread across the cluster. The last two nodes are master eligible and they don't hold the data. What we would like to configure is the recovery process to be delayed for 3 minutes after the four data nodes are present. Our gateway configuration could look like this: gateway.recover_after_data_nodes: 4gateway.recover_after_time: 3m Expectations on nodes In addition to the already mentioned properties, we can also specify properties that will force the recovery process of Elasticsearch. These properties are: gateway.expected_nodes: This is the number of nodes expected to be present in the cluster for the recovery to start immediately. If you don't need the recovery to be delayed, it is advised that you set this property to the number of nodes (or at least most of them) with which the cluster will be formed from, because that will guarantee that the latest cluster state will be recovered. gateway.expected_data_nodes: This is the number of expected data eligible nodes to be present in the cluster for the recovery process to start immediately. gateway.expected_master_nodes: This is the number of expected master eligible nodes to be present in the cluster for the recovery process to start immediately. Now, let's get back to our previous example. We know that when all six nodes are connected and are in the cluster, we want the recovery to start. So, in addition to the preceeding configuration, we would add the following property: gateway.expected_nodes: 6 So the whole configuration would look like this: gateway.recover_after_data_nodes: 4gateway.recover_after_time: 3mgateway.expected_nodes: 6 The preceding configuration says that the recovery process will be delayed for 3 minutes once four data nodes join the cluster and will begin immediately after six nodes are in the cluster (doesn't matter whether they are data nodes or master eligible nodes). The local gateway With the release of Elasticsearch 0.20 (and some of the releases from 0.19 versions), all the gateway types, apart from the default local gateway type, were deprecated. It is advised that you do not use them, because they will be removed in future versions of Elasticsearch. This is still not the case, but if you want to avoid full data reindexation, you should only use the local gateway type, and this is why we won't discuss all the other types. The local gateway type uses a local storage available on a node to store the metadata, mappings, and indices. In order to use this gateway type and the local storage available on the node, there needs to be enough disk space to hold the data with no memory caching. The persistence to the local gateway is different from the other gateways that are currently present (but deprecated). The writes to this gateway are done in a synchronous manner in order to ensure that no data will be lost during the write process. In order to set the type of gateway that should be used, one should use the gateway.type property, which is set to local by default. There is one additional thing regarding the local gateway of Elasticsearch that we didn't talk about—dangling indices. When a node joins a cluster, all the shards and indices that are present on the node, but are not present in the cluster, will be included in the cluster state. Such indices are called dangling indices, and we are allowed to choose how Elasticsearch should treat them. Elasticsearch exposes the gateway.local.auto_import_dangling property, which can take the value of yes (the default value that results in importing all dangling indices into the cluster), close (results in importing the dangling indices into the cluster state but keeps them closed by default), and no (results in removing the dangling indices). When setting the gateway.local.auto_import_dangling property to no, we can also set the gateway.local.dangling_timeout property (defaults to 2h) to specify how long Elasticsearch will wait while deleting the dangling indices. The dangling indices feature can be nice when we restart old Elasticsearch nodes, and we don't want old indices to be included in the cluster. Low-level recovery configuration We discussed that we can use the gateway to configure the behavior of the Elasticsearch recovery process, but in addition to that, Elasticsearch allows us to configure the recovery process itself. However, we decided that it would be good to mention the properties we can use in the section dedicated to gateway and recovery. Cluster- level recovery configuration The recovery configuration is specified mostly on the cluster level and allows us to set general rules for the recovery module to work with. These settings are: indices.recovery.concurrent_streams: This defaults to 3 and specifies the number of concurrent streams that are allowed to be opened in order to recover a shard from its source. The higher the value of this property, the more pressure will be put on the networking layer; however, the recovery may be faster, depending on your network usage and throughput. indices.recovery.max_bytes_per_sec: By default, this is set to 20MB and specifies the maximum number of data that can be transferred during shard recovery per second. In order to disable data transfer limiting, one should set this property to 0. Similar to the number of concurrent streams, this property allows us to control the network usage of the recovery process. Setting this property to higher values may result in higher network utilization and a faster recovery process. indices.recovery.compress: This is set to true by default and allows us to define whether ElasticSearch should compress the data that is transferred during the recovery process. Setting this to false may lower the pressure on the CPU, but it will also result in more data being transferred over the network. indices.recovery.file_chunk_size: This is the chunk size used to copy the shard data from the source shard. By default, it is set to 512KB and is compressed if the indices.recovery.compress property is set to true. indices.recovery.translog_ops: This defaults to 1000 and specifies how many transaction log lines should be transferred between shards in a single request during the recovery process. indices.recovery.translog_size: This is the chunk size used to copy the shard transaction log data from the source shard. By default, it is set to 512KB and is compressed if the indices.recovery.compress property is set to true. In the versions prior to Elasticsearch 0.90.0, there was the indices.recovery.max_size_per_sec property that could be used, but it was deprecated, and it is suggested that you use the indices.recovery.max_bytes_per_sec property instead. However, if you are using an Elasticsearch version older than 0.90.0, it may be worth remembering this. All the previously mentioned settings can be updated using the Cluster Update API, or they can be set in the elasticsearch.yml file. Index-level recovery settings In addition to the values mentioned previously, there is a single property that can be set on a per-index basis. The property can be set both in the elasticsearch.yml file and using the indices Update Settings API, and it is called index.recovery.initial_shards. In general, Elasticsearch will only recover a particular shard when there is a quorum of shards present and if that quorum can be allocated. A quorum is 50 percent of the shards for the given index plus one. By using the index.recovery.initial_shards property, we can change what Elasticsearch will take as a quorum. This property can be set to the one of the following values: quorum: 50 percent, plus one shard needs to be present and be allocable. This is the default value. quorum-1: 50 percent of the shards for a given index need to be present and be allocable. full: All of the shards for the given index need to be present and be allocable. full-1: 100 percent minus one shards for the given index need to be present and be allocable. integer value: Any integer such as 1, 2, or 5 specifies the number of shards that are needed to be present and that can be allocated. For example, setting this value to 2 will mean that at least two shards need to be present and Elasticsearch needs at least 2 shards to be allocable. It is good to know about this property, but in most cases, the default value will be sufficient for most deployments. Summary In this article, we focused more on the Elasticsearch configuration and new features that were introduced in Elasticsearch 1.0. We configured discovery and recovery, and we used the human-friendly Cat API. In addition to that, we used the backup and restore functionality, which allowed easy backup and recovery of our indices. Finally, we looked at what federated search is and how to search and index data to multiple clusters, while still using all the functionalities of Elasticsearch and being connected to a single node. If you want to dig deeper, buy the book Mastering Elasticsearch, Second Edition and read in a simple step-by-step fashion using Elasticsearch to enhance your knowlege further. Resources for Article: Further resources on this subject: Downloading and Setting Up ElasticSearch [Article] Indexing the Data [Article] Driving Visual Analyses with Automobile Data (Python) [Article]
Read more
  • 0
  • 0
  • 5417

article-image-tips-tricks-mysql-python
Packt
23 Dec 2010
5 min read
Save for later

Tips & Tricks on MySQL for Python

Packt
23 Dec 2010
5 min read
  MySQL for Python Integrate the flexibility of Python and the power of MySQL to boost the productivity of your Python applications Implement the outstanding features of Python's MySQL library to their full potential See how to make MySQL take the processing burden from your programs Learn how to employ Python with MySQL to power your websites and desktop applications Apply your knowledge of MySQL and Python to real-world problems instead of hypothetical scenarios A manual packed with step-by-step exercises to integrate your Python applications with the MySQL database server         Read more about this book       (For more resources on this subject, see here.) Objective: Install a C compiler on Windows installation. Tip: Windows binaries do not currently exist for the 1.2.3 version of MySQL for Python. To get them, you would need to install a C compiler on your Windows installation and compile the binary from source. Objective: Use tar.gz to use egg file. Tip: If you cannot use egg files or if you use an earlier version of Python, you should use the tar.gz file, a tar and gzip archive. The tar.gz archive follows the Linux egg files in the file listing. The current version of MySQL for Python is 1.2.3c1, so the file we want is as following: MySQL-python-1.2.3c1.tar.gz This method is by far more complicated than the others. If at all possible, use your operating system's installation method or an egg file. Objective: Limitation of using MySQL for Python on Python version. Tip: This version of MySQL for Python is compatible up to Python 2.6. It is worth noting that MySQL for Python has not yet been released for Python 3.0 or later versions. In your deployment of the library, therefore, ensure that you are running Python 2.6 or earlier. As noted, Python 2.5 and 2.6 have version-specifi c releases. Prior to Python 2.4, you will need to use either a tar.gz version of the latest release or use an older version of MySQL for Python. The latter option is not recommended. Objective: It is important to phrase the query in such a way as to narrow the returned values as much as possible. Tip: Here, instead of returning whole records, we tell MySQL to return only the namecolumn. This natural reduction in the data reduces processing time for both MySQL and Python. This saving is then passed on to your server in the form of more sessions able to be run at one time. Objective: This hard-wiring of the search query allows us to test the connection before coding the rest of the function. Tip: There may be a tendency here to insert user-determined variables immediately. With experience, it is possible to do this. However, if there are any doubts about the availability of the database, your best fallback position is to keep it simple and hardwired. This reduces the number of variables in making a connection and helps one to blackbox the situation, making troubleshooting much easier. Objective: Readability counts. Tip: The virtue of readability in programming is often couched in terms of being kind to the next developer who works on your code. There is more at stake, however. With readability comes not only maintainability but control. If it takes you too much effort to understand the code you have written, you will have a harder time controlling the program's flow and this will result in unintended behavior. The natural consequence of unintended program behavior is the compromising of process stability and system security. Objective: Quote marks not necessary when assigning MySQL statements. Tip: It is not necessary to use triple quote marks when assigning the MySQL sentence to statement or when passing it to execute(). However, if you used only a single pair of either double or single quotes, it would be necessary to escape every similar quote mark. As a stylistic rule, it is typically best to switch to verbatim mode with the triple quote marks in order to ensure the readability of your code. Objective: xrange() is much more memory efficient than range(). Tip: The differences between xrange() and range() are often overlooked or even ignored. Both count through the same values, but they do it differently. Where range() calculates a list the first time it is called and then stores it in memory, xrange() creates an immutable sequence that returns the next in the series each time it is called. As a consequence, xrange() is much more memory efficient than range(), especially when dealing with large groups of integers. As a consequence of its memory efficiency, however, it does not support functionality such as slicing, which range() does, because the series is not yet fully determined. Objective: autocommit feature is useful in MySQL for Python . Tip: Unless you are running several database threads at a time or have to deal with similar complexity, MySQL for Python does not require you to use either commit() or close(). Generally speaking, MySQL for Python installs with an autocommit feature switched on. It thus takes care of committing the changes for you when the cursor object is destroyed. Similarly, when the program terminates, Python tends to close the cursor and database connection as it destroys both objects.
Read more
  • 0
  • 0
  • 5412

article-image-integrating-zimbra-collaboration-suite-microsoft-outlook
Packt
21 Oct 2009
11 min read
Save for later

Integrating Zimbra Collaboration Suite with Microsoft Outlook

Packt
21 Oct 2009
11 min read
Introduction Let's face it, in today's business environment, there is only one email client that truly matters. I am not saying it is the best client, or that it offers more features, and I certainly am not saying it is the most secure. What I am saying is that you would be hard pressed to walk into an organization, of let's say more than 10 desktops, and not see users checking their email with Microsoft Outlook. Zimbra Collaboration Suite offers uncanny support for Outlook including: Native Sync with MAPI Support for both Online and Offline Modes Cached mode operation Support for multiple calendars Here, we will discuss these features and focus on configuring Outlook to work with the Zimbra server. As you will see with a Zimbra back end and an Outlook client, it is transparent to your users whether you are using Microsoft Exchange or Zimbra Collaboration Suite as your back end product. This transparency to users makes the migration from Exchange to Zimbra that much easier in the eyes of your users, especially when it comes to user training. The ability to seamlessly integrate Zimbra and Outlook is one of Zimbra's strongest assets and one of its strongest arguments for making the transition from Exchange to Zimbra. However, if you want to use the full power of Zimbra (not only the fancy look but great features such as the searches), you should use the Web Client. We will take a detailed look at Zimbra integration with Outlook, including: The Zimbra Connector for Outlook (ZCO) A look at Zimbra integration Sharing Outlook folders Outlook uses the Messaging Application Programming Interface (MAPI) to allow programs to communicate with messaging systems and stores. MAPI is proprietary to Microsoft and is key to Zimbra being able to synchronize and work with Outlook. Zimbra uses a connector to facilitate this communication—called the Zimbra Connector—for Outlook. The PST Import Wizard One of the beauties of Outlook's integration in Zimbra is that you won't start from scratch: Zimbra gives you tools that are able to import your data (emails, calendars, contacts, etc) either from a concurrent solution's server (Exchange or Domino) or directly from a PST file (the file used by Outlook to store all its data).   We'll have a look at the PST Import Wizard. To download it: Log in to the Administration Console at https://zimbra.emailcs.com:7071/. Click on the Downloads tab on the left of the navigation pane. 3. In the Content Pane, click on the PST Import Wizard to download the executable file. 4. Save the file to the local computer, or a network accessible shared folder. 5. Double click the .exe file to launch the wizard (there's no installation process). 6. Click on Next on the presentation page. 7. In the Hostname field enter: zimbra.emailcs.com. 8. In the Port field you can leave the default (80) and let the Use Secure Connection box unchecked. 9. In the Username field, enter your Zimbra's user (worker@emailcs.com) and in the Password field your Zimbra's password. Click on the Next button. You will now have to select the PST file that you want to import into the Zimbra server. The Zimbra Import Wizard helps you as it opens the default Outlook PST's directory when you click on the Browse button. Once you have selected the PST you want to import and clicked on the Open button, you can now click on the Next button of the initial window. Now, you can choose how your data will be imported: Import Junk-Mail Folder: If you leave this box checked, all your spam will be imported to the server and marked as spam on the server. Import Deleted Items Folder: With this, the deleted mails will be imported on the server. Ignore previously imported items: This option is used when you're importing your data in several attempts. If you leave it checked, the already imported mails won't be imported again. Import items received after: Checking this box and choosing a date in the calendar next to it, allows us to do a partial import based on date. Import messages with some headers but no message body: You should leave this one checked as it imports some badly formatted mails. Convert meeting, organized by me, from my old address to my new address: You need to check this box (and enter your previous email address) if you're getting a new email address on the Zimbra server. If you don't, the meetings will be imported but you won't be the owner. Migrate private appointments: This one is your own choice as Zimbra does not handle (yet) the private items; all the imported appointments will be viewable by anyone you share your calendar with.   Once your choices are made, you can click on the Next button. 14. Click OK in the confirmation window. 15. Next window will show you the import evolution (number of items and percentage). You can stop the import at anytime and start again later. Once the import is finished, a window pops up with a summary of the import session. 16. You can click on the OK button of the summary window then Finish in the last window. The Zimbra Connector for Outlook The Zimbra Connector for Outlook (ZCO) is a downloadable .msi installable file that must be installed on the desktop in order for Outlook and Zimbra to communicate. To download the ZCO to the client: Log in to the Administration Console at https://zimbra.emailcs.com:7071. Click on the Downloads tab on the left of the navigation pane. 3. In the Content Pane, click on the Zimbra Connector for Outlook to download the .msi installable file. Save the file to the local computer, or a network accessible shared folder. Double click the .msi file to start the installation process. The installation wizard will begin and go ahead and accept the License Agreement and accept all of the defaults. Once complete click FINISH. The ZCO creates a brand new profile within Outlook, called Zimbra. If you had previous profile(s) created on your computer, be sure to choose the "Zimbra" one. The ZCO is now installed, and the first time we run Outlook on this client, the connector will prompt us for configuration information. Zimbra currently supports Outlook 2003 only. In the Server Name field, enter: zimbra.emailcs.com. For port leave the default of 80. Email address will be the email address for this user. In our case, we will use the Worker Bee with an email address of worker@emailcs.com. For the password, enter the same password Worker would use to log in to the AJAX web client. Once completed, click OK and Outlook will open Worker's email box. The ZCO will now sync the Global Access List and will get all the emails from the server locally. It means that if you imported lots of item previously, you might need some time to get them back into Outlook from the server. Luckily, then the sync process happens in the background. As you can see in the following screenshot, the folders we created in the Web Client are now configured in Outlook. The first time Outlook is opened, it automatically performs a send/receive with the Zimbra server. After this initial synchronization, there is nothing a user needs to do from then on to initiate synchronization with the server. There is also nothing the user needs to set-up to let Outlook know whether or not the user is Online—connected to the server, or Offline—disconnected from the server. Outlook automatically checks the status and acts accordingly. Therefore, a user need not be connected to the server to work with email that has already been received, check the address book, or work with the Calendar. All changes that the user makes in Offline mode, will synchronize with the server the next time the server is connected and Outlook is online. At this time, we should take a quick look around Outlook and see how integrated Zimbra really is with Outlook. A Look at Zimbra Integration The integration of Zimbra is more than just the ability to send and receive email. Outlook is now acting as the front end to create contacts, appointments, and tasks that will be stored on the server. Let's take a moment to look at each one individually. Contacts The easiest way to see the integration of contacts between Outlook and Zimbra is to compose a new mail message, and instead of typing in an email address, click on the TO: button and select Global Address List from the Address Book drop-down menu. As you can see, this feature looks exactly the same whether you are using Exchange or Zimbra as your back end collaboration server. Users are familiar with this look and feel, and the ability to select users that are within the organization's Global Address List. This list comes directly from the Zimbra server and is maintained there as well. The user also has the option to select the own personal contact list. This list could be created and maintained either via the web client, through Outlook directly, or both as they will synchronize together. Appointments In most work organizations, the ability to create appointments, invite people to attend, and check invitees schedule is a key function of Exchange and Outlook. Luckily for us, the same functionality could be used with Zimbra and Outlook. As seen in the following figure, the process for creating an appointment is exactly the same. In the Calendar application, click New --> Appointment. Click on the Invite Attendees button. Click the To button and select the Global Address List from the drop-down menu. Select the CEO from the address list and click OK. Click on the Scheduling tab.   The Calendar is synced with the Zimbra server and is able to check the availability of the users within the organization, a key feature of any collaboration suite. Once you have found a schedule when all the attendees are free, go back to the Appointment tab, type in a Subject for your appointment then click on the Send button. The last feature we will look at is Sharing Outlook folders. Sharing Outlook Folders Users have the option to share any Outlook folder with users in the Global Address List. Essentially, this is the same ability we covered in an earlier chapter with the Web Client for the Contacts or Calendar. However, here the process is different. Users could be delegated different levels of access to Outlook folders. These levels include: Read View items in folder only Edit Edit any contents in the folder Create Create/add items to the folder Delete Delete/modify items in the folder Act on workflow Respond to meeting and task requests Administer Folder Modify the permissions on the folder There are also predefined roles that users could assign to other users in the Global Address List including: Administrator Has all rights to the folder listed above Delegate Has access to all rights except for Administer folder Editor Access to Read, Edit, Create and Delete Author Access to Read and Create Reviewer Read only To assign roles and rights to the folder: 1. Right click the folder and click Properties. 2. Click on the Sharing tab. 3. Click Add and select CEO from the Global Address List. 4. With CEO highlighted, select Administrator from the Permission Level drop-down-box. 5. With Administrator selected, you should be able to see all of the Permissions selected. 6. Change the Permission Level to Reviewer and you will see that only Read items, is selected. 7. Go ahead and play with the various levels so you can get a feel for the different permissions associated with the various levels. 8. Once complete, click OK. An email will be sent to the CEO informing him that he now has Administrator access to the Inbox of the Worker Bee. In order for the CEO to work with the new Shared Folder (the Worker's Inbox in this case), the CEO would simply: 1. Click on File --> Open --> Other User's Mailbox in the Outlook Menu bar. 2. Select Worker from the Global Address List. 3. Once the folder is added to Outlook, click on the Send/Receive button to synchronize the folder. Summary The goal of this article was to take a brief look at using the Microsoft Outlook client as a front end to the Zimbra Collaboration Suite. In my experience, users do not like change and they tend to be comfortable with applications they are familiar with. One of the most common objections to changing email systems is that users rely so heavily on their email and contacts that they do not want to have to learn a whole new system to access them. Hopefully, if I have done my job, you could now see how users need not be afraid of moving to a Zimbra system, because in the end, their everyday life and functionality is not going to change much. They could still use the tool that they are most familiar with, but still have the added benefit of using the AJAX Web Client when they are on the road or away from their desks.
Read more
  • 0
  • 0
  • 5411

article-image-load-validate-and-submit-forms-using-ext-js-30-part-2
Packt
19 Nov 2009
4 min read
Save for later

Load, Validate, and Submit Forms using Ext JS 3.0: Part 2

Packt
19 Nov 2009
4 min read
Creating validation functions for URLs, email addresses, and other types of data Ext JS has an extensive library of validation functions. This is how it can be used to validate URLs, email addresses, and other types of data. The following screenshot shows email address validation in action: This screenshot displays URL validation in action: How to do it... Initialize the QuickTips singleton: Ext.QuickTips.init(); Create a form with fields that accept specific data formats: Ext.onReady(function() { var commentForm = new Ext.FormPanel({ frame: true, title: 'Send your comments', bodyStyle: 'padding:5px', width: 550, layout: 'form', defaults: { msgTarget: 'side' }, items: [ { xtype: 'textfield', fieldLabel: 'Name', name: 'name', anchor: '95%', allowBlank: false }, { xtype: 'textfield', fieldLabel: 'Email', name: 'email', anchor: '95%', vtype: 'email' }, { xtype: 'textfield', fieldLabel: 'Web page', name: 'webPage', vtype: 'url', anchor: '95%' }, { xtype: 'textarea', fieldLabel: 'Comments', name: 'comments', anchor: '95%', height: 150, allowBlank: false }], buttons: [{ text: 'Send' }, { text: 'Cancel' }] }); commentForm.render(document.body);}); How it works... The vtype configuration option specifies which validation function will be applied to the field. There's more... Validation types in Ext JS include alphanumeric, numeric, URL, and email formats. You can extend this feature with custom validation functions, and virtually, any format can be validated. For example, the following code shows how you can add a validation type for JPG and PNG files: Ext.apply(Ext.form.VTypes, { Picture: function(v) { return /^.*.(jpg|JPG|png|PNG)$/.test(v); }, PictureText: 'Must be a JPG or PNG file';}); If you need to replace the default error text provided by the validation type, you can do so by using the vtypeText configuration option: { xtype: 'textfield', fieldLabel: 'Web page', name: 'webPage', vtype: 'url', vtypeText: 'I am afraid that you did not enter a URL', anchor: '95%'} See also... The Specifying the required fields in a form recipe, covered earlier in this article, explains how to make some form fields required The Setting the minimum and maximum length allowed for a field's value recipe, covered earlier in this article, explains how to restrict the number of characters entered in a field The Changing the location where validation errors are displayed recipe, covered earlier in this article, shows how to relocate a field's error icon Refer to the previous recipe, Deferring field validation until form submission, to know how to validate all fields at once upon form submission, instead of using the default automatic field validation The next recipe, Confirming passwords and validating dates using relational field validation, explains how to perform validation when the value of one field depends on the value of another field The Rounding up your validation strategy with server-side validation of form fields recipe (covered later in this article) explains how to perform server-side validation Confirming passwords and validating dates using relational field validation Frequently, you face scenarios where the values of two fields need to match, or the value of one field depends on the value of another field. Let's examine how to build a registration form that requires the user to confirm his or her password when signing up. How to do it… Initialize the QuickTips singleton: Ext.QuickTips.init(); Create a custom vtype to handle the relational validation of the password: Ext.apply(Ext.form.VTypes, { password: function(val, field) { if (field.initialPassField) { var pwd = Ext.getCmp(field.initialPassField); return (val == pwd.getValue()); } return true; }, passwordText: 'What are you doing?<br/>The passwords entered do not match!'}); Create the signup form: var signupForm = { xtype: 'form', id: 'register-form', labelWidth: 125, bodyStyle: 'padding:15px;background:transparent', border: false, url: 'signup.php', items: [ { xtype: 'box', autoEl: { tag: 'div', html: '<div class="app-msg"><img src="img/businessman add.png" class="app-img" /> Register for The Magic Forum</div>' } }, { xtype: 'textfield', id: 'email', fieldLabel: 'Email', allowBlank: false, minLength: 3, maxLength: 64,anchor:'90%', vtype:'email' }, { xtype: 'textfield', id: 'pwd', fieldLabel: 'Password', inputType: 'password',allowBlank: false, minLength: 6, maxLength: 32,anchor:'90%', minLengthText: 'Password must be at least 6 characters long.' }, { xtype: 'textfield', id: 'pwd-confirm', fieldLabel: 'Confirm Password', inputType: 'password', allowBlank: false, minLength: 6, maxLength: 32,anchor:'90%', minLengthText: 'Password must be at least 6 characters long.', vtype: 'password', initialPassField: 'pwd' }],buttons: [{ text: 'Register', handler: function() { Ext.getCmp('register-form').getForm().submit(); }},{ text: 'Cancel', handler: function() { win.hide(); } }]} Create the window that will host the signup form: Ext.onReady(function() { win = new Ext.Window({ layout: 'form', width: 340, autoHeight: true, closeAction: 'hide', items: [signupForm] }); win.show();
Read more
  • 0
  • 0
  • 5410
article-image-using-show-explain-running-queries
Packt
13 Mar 2014
5 min read
Save for later

Using SHOW EXPLAIN with running queries

Packt
13 Mar 2014
5 min read
(For more resources related to this topic, see here.) Getting ready Import the ISFDB database which is available under Creative Commons licensing. How to do it... Open a terminal window and launch the mysql command-line client and connect to the isfdb database using the following statement. mysql isfdb Next, we open another terminal window and launch another instance of the mysql command-line client. Run the following command in the first window: ALTER TABLE title_relationships DROP KEY titles; Next, in the first window, start the following example query: SELECT titles.title_id AS ID, titles.title_title AS Title, authors.author_legalname AS Name, (SELECT COUNT(DISTINCT title_relationships.review_id) FROM title_relationships WHERE title_relationships.title_id = titles.title_id) AS reviews FROM titles,authors,canonical_author WHERE (SELECT COUNT(DISTINCT title_relationships.review_id) FROM title_relationships WHERE title_relationships.title_id = titles.title_id)>=10 AND canonical_author.author_id = authors.author_id AND canonical_author.title_id=titles.title_id AND titles.title_parent=0 ; Wait for at least a minute and then run the following query to look for the details of the query that we executed in step 4 and QUERY_ID for that query: SELECT INFO, TIME, ID, QUERY_ID FROM INFORMATION_SCHEMA.PROCESSLIST WHERE TIME > 60G Run SHOW EXPLAIN in the second window (replace id in the following command line with the numeric ID that we discovered in step 5): SHOW EXPLAIN FOR id Run the following command in the second window to kill the query running in the first window (replace query_id in the following command line with the numeric QUERY_ID number that we discovered in step 5): KILL QUERY ID query_id; In the first window, reverse the change we made in step 3 using the following command: ALTER TABLE title_relationships ADD KEY titles (title_id); How it works... The SHOW EXPLAIN statement allows us to obtain information about how MariaDB executes a long-running statement. This is very useful for identifying bottlenecks in our database. The query in this article will execute efficiently only if it touches the indexes in our data. So, for demonstration purposes, we will first sabotage the title_relationships table by removing the title's index. This causes our query to unnecessarily iterate through hundreds of thousands of rows and generally take far too long to complete. The output of steps 3 and 4 will look similar to the following screenshot: While our sabotaged query is running, and after waiting for at least a minute, we switch to another window and look for all queries that have been running for longer than 60 seconds. Our sabotaged query will likely be the only one in the output. From this output, we get ID and QUERY_ID. The output of the command will look like the following with the ID and QUERY_ID as the last two items: Next, we use the ID number to execute SHOW EXPLAIN for our query. Incidentally, our query looks up all titles in the database that have 10 or more reviews and displays the title, author, and the number of reviews that the title has. The EXPLAIN for our query will look similar to the following screenshot: An easy-to-read version of this EXPLAIN is available at https://mariadb.org/ea/8v65g. Looking at rows 4 and 5 of EXPLAIN, it's easy to see why our query runs for so long. These two rows are dependent subqueries of the primary query (the first row). In the first query, we see that 117044 rows will be searched, and then, for the two dependent subqueries, MariaDB searches through 83389 additional rows, twice. Ouch. If we were analyzing a slow query in the real world at this point, we would fix the query to not have such an inefficient subquery, or we would add a KEY to our table to make the subquery efficient. If we're part of a larger development team, we could send the output of SHOW EXPLAIN and the query to the appropriate people to easily and accurately show them what the problem is with the query. In our case, we know exactly what to do; we will add back the KEY that we removed earlier. For fun, after adding back the KEY, we could rerun the query and the SHOW EXPLAIN command to see the difference that having the KEY in place makes. We'll have to be quick though, as with the KEY there, the query will only take a few seconds to complete (depending on the speed of our computer). There's more... The output of SHOW EXPLAIN is always accompanied by a warning. The purpose of this warning is to show us the command that is being run. After running SHOW EXPLAIN on a process ID, we simply issue SHOW WARNINGSG and we will see what SQL statement the process ID is running: This is useful for very long-running commands that after their start, takes a long time to execute, and then returns back at a time where we might not remember the command we started. In the examples of this article, we're using "G" as the delimiter instead of the more common ";" so that the data fits the page better. We can use either one. See also The full documentation of the KILL QUERY ID command can be found at https://mariadb.com/kb/en/data-manipulation-kill-connectionquery/ The full documentation of the SHOW EXPLAIN command can be found at https://mariadb.com/kb/en/show-explain/ Summary In this article, we saw the functionality of the SHOW EXPLAIN feature after altering the database using various queries. Further information regarding the SHOW EXPLAIN command can be found in the official documents provided in the preceding section. Resources for Article: Further resources on this subject: Installing MariaDB on Windows and Mac OS X [Article] A Look Inside a MySQL Daemon Plugin [Article] Visual MySQL Database Design in MySQL Workbench [Article]
Read more
  • 0
  • 0
  • 5406

article-image-recovery-postgresql-9
Packt
25 Oct 2010
14 min read
Save for later

Recovery in PostgreSQL 9

Packt
25 Oct 2010
14 min read
Recovery of all databases Recovery of a complete database server, including all of its databases, is an important feature. This recipe covers how to do that in the simplest way possible. Getting ready Find a suitable server on which to perform the restore. Before you recover onto a live server, always take another backup. Whatever problem you thought you had could be just about to get worse. How to do it... LOGICAL (from custom dump -F c): Restore of all databases means simply restoring each individual database from each dump you took. Confirm you have the correct backup before you restore: pg_restore --schema-only -v dumpfile | head | grep Started Reload globals from script file as follows: psql -f myglobals.sql Reload all databases. Create the databases using parallel tasks to speed things along. This can be executed remotely without needing to transfer dumpfile between systems. Note that there is a separate dumpfile for each database. pg_restore -d postgres -j 4 dumpfile LOGICAL (from script dump created by pg_dump –F p): As above, though with this command to execute the script. This can be executed remotely without needing to transfer dumpfile between systems. Confirm you have the correct backup before you restore. If the following command returns nothing, then the file is not timestamped, and you'll have to identify it in a different way: head myscriptdump.sql | grep Started Reload globals from script file as follows: psql -f myglobals.sql Reload all scripts like the following: psql -f myscriptdump.sql LOGICAL (from script dump created by pg_dumpall): We need to follow the procedure, which is shown next. Confirm you have the correct backup before you restore. If the following command returns nothing, then the file is not timestamped, and you'll have to identify it in a different way: head myscriptdump.sql | grep Started Find a suitable server, or create a new virtual server. Reload script in full psql -f myscriptdump.sql PHYSICAL: Restore the backup file onto the target server. Extract the backup file into the new data directory. Confirm that you have the correct backup before you restore. $ cat backup_label START WAL LOCATION: 0/12000020 (file 000000010000000000000012) CHECKPOINT LOCATION: 0/12000058 START TIME: 2010-06-03 19:53:23 BST LABEL: standalone Check all file permissions and ownerships are correct and links are valid. That should already be the case if you are using the postgres userid everywhere, which is recommended. Start the server That procedure is so simple. That also helps us understand that we need both a base backup and the appropriate WAL files. If you used other techniques, then we need to step through the tasks to make sure we cover everything required as follows: Shutdown any server running in the data directory. Restore the backup so that any files in the data directory that have matching names are replaced with the version from the backup. (The manual says delete all files and then restore backup—that might be a lot slower than running an rsync between your backup and the destination without the –-update option). Remember that this step can be performed in parallel to speed things up, though it is up to you to script that. Check that all file permissions and ownerships are correct and links are valid. That should already be the case if you are using the postgres userid everywhere, which is recommended. Remove any files in pg_xlog/. Copy in any latest WAL files from a running server, if any. Add in a recovery.conf and set its file permissions correctly also. Start the server. The only part that requires some thought and checking is which parameters you select for the recovery.conf. There's only one that matters here, and that is the restore_command. restore_command tells us how to restore archived WAL files. It needs to be the command that will be executed to bring back WAL files from the archive. If you are forward-thinking, there'll be a README.backup file for you to read to find out how to set the restore_command. If not, then presumably you've got the location of the WAL files you've been saving written down somewhere. Say, for example, that your files are being saved to a directory named /backups/pg/servername/archive, owned by the postgres user. On a remote server named backup1, we would then write this all on one line of the recovery.conf as follows: restore_command = 'scp backup1:/backups/pg/servername/archive/%f %p' How it works... PostgreSQL is designed to require very minimal information to perform a recovery. We try hard to wrap all the details up for you. Logical recovery: Logical recovery executes SQL to re-create the database objects. If performance is an issue, look at the recipe on recovery performance. Physical recovery: Physical recovery re-applies data changes at the block level so tends to be much faster than logical recovery. Physical recovery requires both a base backup and a set of archived WAL files. There is a file named backup_label in the data directory of the base backup. This tells us to retrieve a .backup file from the archive that contains the start and stop WAL locations of the base backup. Recovery then starts to apply changes from the starting WAL location, and must proceed as far as the stop address for the backup to be valid. After recovery completes, the recovery.conf file is renamed to recovery.done to prevent the server from re-entering recovery. The server log records each WAL file restored from the archive, so you can check progress and rate of recovery. You can query the archive to find out the name of the latest archived WAL file to allow you to calculate how many files to go. The restore_command should return 0 if a file has been restored and non-zero for failure cases. Recovery will proceed until there is no next WAL file, so there will eventually be an error recorded in the logs. If you have lost some of the WAL files, or they are damaged, then recovery will stop at that point. No further changes after that will be applied, and you will likely lose those changes; that would be the time to call your support vendor. There's more... You can start and stop the server once recovery has started without any problem. It will not interfere with the recovery. You can connect to the database server while it is recovering and run queries, if that is useful. That is known as Hot Standby mode. Recovery to a point in time If your database suffers a problem at 15:22 p.m. and yet your backup was taken at 04:00 a.m. you're probably hoping there is a way to recover the changes made between those two times. What you need is known as "point-in-time recovery". Regrettably, if you've made a backup with pg_dump at 04:00 a.m. then you won't be able to recover to any other time than 04:00. As a result, the term point-in-time recovery (PITR) has become synonymous with the physical backup and restore technique in PostgreSQL. Getting ready If you have a backup made with pg_dump, then give up all hope of using that as a starting point for a point in time recovery. It's a frequently asked question, but the answer is still "no"; the reason it gets asked is exactly why I'm pleading with you to plan your backups ahead of time. First, you need to decide what the point of time is that to which you would like to recover. If the answer is "as late as possible", then you don't need to do a PITR at all, just recover until end of logs. How to do it... How do you decide to what point to recover? The point where we stop recovery is known as the "recovery target". The most straightforward way is to do this based upon a timestamp. In the recovery.conf, you can add (or uncomment) a line that says the following: recovery_target_time = '2010-06-01 16:59:14.27452+01' or similar. Note that you need to be careful to specify the time zone of the target, so that it matches the time zone of the server that wrote the log. That might differ from the time zone of the current server, so check. After that, you can check progress during a recovery by running queries in Hot Standby mode. How it works... Recovery works by applying individual WAL records. These correspond to individual block changes, so there are many WAL records to each transaction. The final part of any successful transaction is a commit WAL record, though there are abort records as well. Each transaction completion record has a timestamp on it that allows us to decide whether to stop at that point or not. You can also define a recovery target using a transaction id (xid), though finding out which xid to use is somewhat difficult, and you may need to refer to external records if they exist. The recovery target is specified in the recovery.conf and cannot change while the server is running. If you want to change the recovery target, you can shutdown the server, edit the recovery.conf, and then restart the server. Be careful though, if you change the recovery target and recovery is already passed the point, it can lead to errors. If you define a recovery_target_timestamp that has already passed, then recovery will stop almost immediately, though this will be later than the correct stopping point. If you define a recovery_target_xid that has already passed, then recovery will just continue to the end of logs. Restarting recovery from the beginning using a fresh restore of the base backup is always safe. Once a server completes recovery, it will assign a new "timeline". Once a server is fully available, we can write new changes to the database. Those changes might differ from changes we made in a previous "future history" of the database. So we differentiate between alternate futures using different timelines. If we need to go back and run recovery again, we can create a new server history using the original or subsequent timelines. The best way to think about this is that it is exactly like a Sci-Fi novel—you can't change the past but you can return to an earlier time and take a different action instead. But you'll need to be careful not to confuse yourself. There's more... pg_dump cannot be used as a base backup for a PITR. The reason is that a log replay contains the physical changes to data blocks, not logical changes based upon Primary Keys. If you reload a pg_dump the data will likely go back into different data blocks, so the changes wouldn't correctly reference the data. WAL doesn't contain enough information to reconstruct all SQL fully that produced those changes. Later feature additions to PostgreSQL may add the required information to WAL. See also Planned in 9.1 is the ability to pause/resume/stop recovery, and to set recovery targets while the server is up dynamically. This will allow you to use the Hot Standby facility to locate the correct stopping point more easily. You can trick Hot Standby into stopping recovery, which may help. Recovery of a dropped/damaged table You may drop or even damage a table in some way. Tables could be damaged for physical reasons, such as disk corruption, or they could also be damaged by running poorly specified UPDATEs/DELETEs, which update too many rows or overwrite critical data. It's a common request to recover from this situation from a backup. How to do it... The methods differ, depending upon the type of backup you have available. If you have multiple types of backup, you have a choice. LOGICAL (from custom dump -F c): If you've taken a logical backup using pg_dump into a custom file, then you can simply extract the table you want from the dumpfile like the following: pg_restore -t mydroppedtable dumpfile | psql or connect direct to the database using –d. The preceding command tries to re-create the table and then load data into it. Note that pg_restore -t option does not dump out any of the indexes on the table selected. That means we need a slightly more complex procedure than it would first appear, and the procedure needs to vary depending upon whether we are repairing a damaged table or putting back a dropped table. To repair a damaged table we want to replace the data in the table in a single transaction. There isn't a specific option to do this, so we need to do the following: Dump the table to a script file as follows: pg_restore -t mydroppedtable dumpfile > mydroppedtable.sql Edit a script named restore_mydroppedtable.sql with the following code: BEGIN; TRUNCATE mydroppedtable; i mydroppedtable.sql COMMIT; Then, run it using the following: psql -f restore_mydroppedtable.sql If you've dropped a table then you need to: Create a new database in which to work, name it restorework, as follows: CREATE DATABASE restorework; Restore the complete schema to the new database as follows: pg_restore --schema-only -d restorework dumpfile Now, dump just the definitions of the dropped table into a new file, which will contain CREATE TABLE, indexes, other constraints and grants. Note that this database has no data in it, so specifying –-schema-only is optional, as follows: pg_dump -t mydroppedtable --schema-only restorework > mydroppedtable.sql Now, recreate the table on the main database as follows: psql -f mydroppedtable.sql Now, reload just the data into database maindb as follows pg_restore -t mydroppedtable --data-only -d maindb dumpfile If you've got a very large table, then the fourth step can be a problem, because it builds the indexes as well. If you want you can manually edit the script into two pieces, one before the load ("pre-load") and one after the load ("post-load"). There are some ideas for that at the end of the recipe. LOGICAL (from script dump): The easy way to restore a single table from a script is as follows: Find a suitable server, or create a new virtual server. Reload the script in full, as follows: psql -f myscriptdump.sql From the recovered database server, dump the table, its data, and all the definitions of the dropped table into a new file as follows: pg_dump -t mydroppedtable -F c mydatabase > dumpfile Now, recreate the table into the original server and database, using parallel tasks to speed things along. This can be executed remotely without needing to transfer dumpfile between systems. pg_restore -d mydatabase -j 2 dumpfile The only way to extract a single table from a script dump without doing all of the preceding is to write a custom Perl script to read and extract just the parts of the file you want. That can be complicated, because you may need certain SET commands at the top of the file, the table, and data in the middle of the file, and the indexes and constraints on the table are near the end of the file. It's complex; the safe route is the one already mentioned. PHYSICAL: To recover a single table from a physical backup, we need to: Find a suitable server, or create a new virtual server. Recover the database server in full, as described in previous recipes on physical recovery, including all databases and all tables. You may wish to stop at a useful point in time. From the recovered database server, dump the table, its data, and all the definitions of the dropped table into a new file as follows: pg_dump -t mydroppedtable -F c mydatabase > dumpfile Now, recreate the table into the original server and database using parallel tasks to speed things along. This can be executed remotely without needing to transfer dumpfile between systems as follows: pg_restore -d mydatabase -j 2 dumpfile How it works... At present, there's no way to restore a single table from a physical restore in just a single step. See also Splitting a pg_dump into multiple sections, "pre" and "post" was proposed by me for an earlier release of PostgreSQL, though I haven't had time to complete that yet. It's possible to do that using an external utility also; the best script I've seen to split a dump file into two pieces is available at the following website: http://bucardo.org/wiki/split_postgres_dump
Read more
  • 0
  • 0
  • 5403

article-image-user-extensions-and-add-ons-selenium-10-testing-tools
Packt
26 Nov 2010
6 min read
Save for later

User Extensions and Add-ons in Selenium 1.0 Testing Tools

Packt
26 Nov 2010
6 min read
Important preliminary points If you are creating an extension that can be used by all, make sure that it is stored in a central place, like a version control system. This will prevent any potential issues in the future when others in your team start to use it. User extensions Imagine that you wanted to use a snippet of code that is used in a number of different tests. You could use: type | locator | javascript{ .... } However, if you had a bug in the JavaScript you would need to go through all the tests that reused this snippet of code. This, as we know from software development, is not good practice and is normally corrected with a refactoring of the code. In Selenium, we can create our own function that can then be used throughout the tests. User extensions are stored in a separate file that we will tell Selenium IDE or Selenium RC to use. Inside there the new function will be written in JavaScript. Because Selenium's core is developed in JavaScript, creating an extension follows the standard rules for prototypal languages. To create an extension, we create a function in the following design pattern. Selenium.prototype.doFunctionName = function(){ . . . } The "do" in front of the function name tells Selenium that this function can be called as a command for a step instead of an internal or private function. Now that we understand this, let's see this in action. Time for action – installing a user extension Now that you have a need for a user extension, let's have a look at installing an extension into Selenium IDE. This will make sure that we can use these functions in future Time for action sections throughout this article. Open your favorite text editor. Create an empty method with the following text: Selenium.prototype.doNothing = function(){ . . . } Start Selenium IDE. Click on the Options menu and then click on Options. Place the path of the user-extension.js file in the textbox labeled Selenium IDE extensions. Click on OK. Restart Selenium IDE. Start typing in the Command textbox and your new command will be available, as seen in the next screenshot: What just happened? We have just seen how to create our first basic extension command and how to get this going in Selenium IDE. You will notice that you had to restart Selenium IDE for the changes to take effect. Selenium has a process that finds all the command functions available to it when it starts up, and does a few things to it to make sure that Selenium can use them without any issues. Now that we understand how to create and install an extension command let's see what else we can do with it. In the next Time for action, we are going to have a look at creating a randomizer command that will store the result in a variable that we can use later in the test. Time for action – using Selenium variables in extensions Imagine that you are testing something that requires some form of random number entered into a textbox. You have a number of tests that require you to create a random number for the test so you can decide that you are going to create a user extension and then store the result in a variable. To do this we will need to pass in arguments to our function that we saw earlier. The value in the target box will be passed in as the first argument and the value textbox will be the second argument. We will use this in a number of different examples throughout this article. Let's now create this extension. Open your favorite text editor and open the user-extension.js file you created earlier. We are going to create a function called storeRandom. The function will look like the following: Selenium.prototype.doStoreRandom = function(variableName){ random = Math.floor(Math.random()*10000000); storedVars[variableName] = random; } Save the file. Restart Selenium IDE. Create a new step with storeRandom and the variable that will hold the value will be called random. Create a step to echo the value in the random variable. What just happened? In the previous example, we saw how we can create an extension function that allows us to use variables that can be used throughout the rest of the test. It uses the storedVars dictionary that we saw in the previous chapter. As everything that comes from the Selenium IDE is interpreted as a string, we just needed to put the variable as the key in storedVars. It is then translated and will look like storedVars['random'] so that we can use it later. As with normal Selenium commands, if you run the command a number of times, it will overwrite the value that is stored within that variable, as we can see in the previous screenshot. Now that we know how to create an extension command that computes something and then stores the results in a variable, let's have a look at using that information with a locator. Time for action – using locators in extensions Imagine that you need to calculate today's date and then type that into a textbox. To do that you can use the type | locator | javascript{...} format, but sometimes it's just neater to have a command that looks like typeTodaysDate | locator. We do this by creating an extension and then calling the relevant Selenium command in the same way that we are creating our functions. To tell it to type in a locator, use: this.doType(locator,text); The this in front of the command text is to make sure that it used the doType function inside of the Selenium object and not one that may be in scope from the user extensions. Let's see this in action: Use your favorite text editor to edit the user extensions that you were using in the previous examples. Create a new function called doTypeTodaysDate with the following snippet: Selenium.prototype.doTypeTodaysDate = function(locator){ var dates = new Date(); var day = dates.getDate(); if (day < 10){ day = '0' + day; } month = dates.getMonth() + 1; if (month < 10){ month = '0' + month; } var year = dates.getFullYear(); var prettyDay = day + '/' + month + '/' + year; this.doType(locator, prettyDay); } Save the file and restart Selenium IDE. Create a step in a test to type this in a textbox. Run your script. It should look similar to the next screenshot: What just happened? We have just seen that we can create extension commands that use locators. This means that we can create commands to simplify tests as in the previous example where we created our own Type command to always type today's date in the dd/mm/yyyy format. We also saw that we can call other commands from within our new command by calling its original function in Selenium. The original function has do in front of it. Now that we have seen how we can use basic locators and variables, let's have a look at how we can access the page using browserbot from within an extension method.
Read more
  • 0
  • 0
  • 5401
article-image-integrating-keras-tensorflow-r
Amey Varangaonkar
08 Nov 2017
6 min read
Save for later

How to Integrate Keras and TensorFlow with R

Amey Varangaonkar
08 Nov 2017
6 min read
[box type="info" align="" class="" width=""]The following is an excerpt from the book Neural Networks with R, Chapter 7, Use Cases of Neural Networks - Advanced Topics, written by Giuseppe Ciaburro and Balaji Venkateswaran. In this post, we see how to integrate popular deep learning libraries and frameworks like TensorFlow with R for effective neural network modeling.[/box] TensorFlow is an open source numerical computing library provided by Google for machine intelligence. It hides all of the programming required to build deep learning models and gives the developers a black box interface to program. The Keras API for TensorFlow provides a high-level interface for neural networks. Python is the de facto programming language for deep learning, but R is catching up. Deep learning libraries are now available with R and a developer can easily download TensorFlow or Keras similar to other R libraries and use them. In TensorFlow, nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. TensorFlow was originally developed by the Google Brain Team within Google's machine intelligence research for machine learning and deep neural networks research, but it is now available in the public domain. TensorFlow exploits GPU processing when configured appropriately. The generic use cases for TensorFlow are as follows: Image recognition Computer vision Voice/sound recognition Time series analysis Language detection Language translation Text-based processing Handwriting Recognition (HWR) Many others Integrating Tensorflow with R In this section, we will see how we can bring TensorFlow libraries into R. This will open up a huge number of possibilities with deep learning using TensorFlow with R. In order to use TensorFlow, we must first install Python. If you don't have a Python installation on your machine, it's time to get it. Python is a dynamic Object-Oriented Programming (OOP) language that can be used for many types of software development. It offers strong support for integration with other languages and programs, is provided with a large standard library, and can be learned within a few days. Many Python programmers can confirm a substantial increase in productivity and feel that it encourages the development of higher quality code and maintainability. Python runs on Windows, Linux/Unix, macOS X, OS/2, Amiga, Palm Handhelds, and Nokia phones. It also works on Java and .NET virtual machines. Python is licensed under the OSI-approved open source license; its use is free, including for commercial products. [box type="shadow" align="" class="" width=""]If you do not know which version to use, there is a document that could help you choose. In principle, if you have to start from scratch, we recommend choosing Python 3, and if you need to use third-party software packages that may not be compatible with Python 3, we recommend using Python 2.7. All information about the available versions and how to install Python is given at https://www.python.org/[/box] After properly installing the Python version of our machine, we have to worry about installing TensorFlow. We can retrieve all library information and available versions of the operating system from the following link: https://www.tensorflow.org/ . Also, in the install section, we can find a series of guides that explain how to install a version of TensorFlow that allows us to write applications in Python. Guides are available for the following operating systems: Installing TensorFlow on Ubuntu Installing TensorFlow on macOS X Installing TensorFlow on Windows Installing TensorFlow from sources For example, to install Tensorflow on Windows, we must choose one of the following types: TensorFlow with CPU support only TensorFlow with GPU support To install TensorFlow, start a terminal with privileges as administrator. Then issue the appropriate pip3 install command in that terminal. To install the CPU-only version, enter the following command: C:> pip3 install --upgrade tensorflow A series of code lines will be displayed on the video to keep us informed of the execution of the installation procedure, as shown in the following figure: Note: The output screenshot has been cropped for clarity purposes At this point, we can return to our favorite environment; I am referring to the R development environment. We will need to install the interface to TensorFlow. The R interface to TensorFlow lets you work productively using the high-level Keras and Estimator APIs, and when you need more control, it provides full access to the core TensorFlow API. To install the R interface to TensorFlow, follow the steps below First, install the tensorflow R package from CRAN as follows: install.packages("tensorflow")  2. Then, use the install_tensorflow() function to install TensorFlow (for a proper installation procedure, you must have administrator privileges): library(tensorflow) install_tensorflow()  3. We can confirm that the installation succeeded: sess = tf$Session() hello <- tf$constant('Hello, TensorFlow!') sess$run(hello) This will provide you with a default installation of TensorFlow suitable for use with the tensorflow R package. Read on if you want to learn about additional installation options.  4. If you want to install a version of TensorFlow that takes advantage of NVIDIA GPUs, you need to have the correct CUDA libraries installed. In the following code, we can check the success of the installation: library(tensorflow) > sess = tf$Session() > hello <- tf$constant('Hello, TensorFlow!') > sess$run(hello) b'Hello, TensorFlow!' Integrating Keras with R Keras is a set of open source neural network libraries coded in Python. It is capable of running on top of MxNet, TensorFlow, or Theano. The steps to install Keras in RStudio are very simple. The following code snippet gives the steps for installation and we can check whether Keras is working by checking the load of the MNIST dataset. By default, RStudio loads the CPU version of TensorFlow. Once Keras is loaded, we have a powerful set of deep learning libraries that can be utilized by R programmers to execute neural networks and deep learning. To install Keras for R, follow these steps: Run the following code install.packages("devtools") devtools::install_github("rstudio/keras")  2. At this point, we load the keras library: library(keras)  3. Finally, we check whether keras is installed correctly by loading the MNIST dataset: > data=dataset_mnist()   If you found this excerpt useful, make sure you check out the book Neural Networks with R, containing an interesting coverage of many such useful and insightful topics.
Read more
  • 0
  • 0
  • 5400

article-image-ab-testing-for-marketing-comparison-using-openai-chatgpt
Valentina Alto
08 Jun 2023
5 min read
Save for later

A/B testing for marketing comparison using OpenAI ChatGPT

Valentina Alto
08 Jun 2023
5 min read
This article is an excerpt from the book, Modern Generative AI with ChatGPT and OpenAI Models, by Valentina Alto. This book will help harness the power of AI with innovative, real-world applications, and unprecedented productivity boosts, powered by the latest advancements in AI technology like ChatGPT and OpenAI.A/B testing in marketing is a method of comparing two different versions of a marketing campaign, advertisement, or website to determine which one performs better. In A/B testing, two variations of the same campaign or element are created, with only one variable changed between the two versions. The goal is to see which version generates more clicks, conversions, or other desired outcomes.An example of A/B testing might be testing two versions of an email campaign, using different subject lines, or testing two versions of a website landing page, with different call-to-action buttons. By measuring the response rate of each version, marketers can determine which version performs better and make data-driven decisions about which version to use going forward. A/B testing allows marketers to optimize their campaigns and elements for maximum effectiveness, leading to better results and a higher return on investment.Since this method involves the process of generating many variations of the same content, the generative power of ChatGPT can definitely assist in that.Let’s consider the following example. I’m promoting a new product I developed: a new, light and thin climbing harness for speed climbers. I’ve already done some market research and I know my niche audience. I also know that one great channel of communication for that audience is publishing on an online climbing blog, of which most climbing gyms’ members are fellow readers.My goal is to create an outstanding blog post to share the launch of this new harness, and I want to test two different versions of it in two groups. The blog post I’m about to publish and that I want to be the object of my A/B testing is the following:  Figure 1– An example of a blog post to launch climbing gearHere, ChatGPT can help us on two levels:The first level is that of rewording the article, using different keywords or different attention-grabbing slogans. To do so, once this post is provided as context, we can ask ChatGPT to work on the article and slightly change some elements:Figure 2 – New version of the blog post generated by ChatGPT As per my request, ChatGPT was able to regenerate only those elements I asked for (title, subtitle, and closing sentence) so that I can monitor the effectiveness of those elements by monitoring the reaction of the two audience groups.The second level is working on the design of the web page, namely, changing the collocation of the image rather than the position of the buttons. For this purpose, I created a simple web page for the blog post published in the climbing blog (you can find the code in the book’s GitHub repository at https://github.com/PacktPublishing/The-Ultimate-Guide- to-ChatGPT-and-OpenAI/tree/main/Chapter%207%20-%20ChatGPT%20 for%20Marketers/Code):Figure 3 – Sample blog post published on the climbing blogWe can directly feed ChatGPT with the HTML code and ask it to change some layout elements, such as the position of the buttons or their wording. For example, rather than Buy Now, a reader might be more gripped by an I want one! button.So, let's feed ChatGPT with the HTML source code:Figure 4 – ChatGPT changing HTML code Let’s see what the output looks like:Figure 5 – New version of the websiteAs you can see, ChatGPT only intervened at the button level, slightly changing their layout, position, color, and wording.Indeed, inspecting the source code of the two versions of the web pages, we can see how it differs in the button sections:Figure 6 – Comparison between the source code of the two versions of the website ConclusionIn conclusion, ChatGPT is a valuable tool for A/B testing in marketing. Its ability to quickly generate different versions of the same content can reduce the time to market of new campaigns. By utilizing ChatGPT for A/B testing, you can optimize your marketing strategies and ultimately drive better results for your business.Author BioAfter completing her Bachelor's degree in Finance, Valentina Alto pursued a Master's degree in Data Science in 2021. She began her professional career at Microsoft as an Azure Solution Specialist, and since 2022, she has been primarily focused on working with Data & AI solutions in the Manufacturing and Pharmaceutical industries. Valentina collaborates closely with system integrators on customer projects, with a particular emphasis on deploying cloud architectures that incorporate modern data platforms, data mesh frameworks, and applications of Machine Learning and Artificial Intelligence. Alongside her academic journey, she has been actively writing technical articles on Statistics, Machine Learning, Deep Learning, and AI for various publications, driven by her passion for AI and Python programming.Link - Medium Modern Generative AI with ChatGPT and OpenAI Models
Read more
  • 0
  • 0
  • 5399
Modal Close icon
Modal Close icon