Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds

How-To Tutorials

7019 Articles
article-image-troubleshooting-openvpn-2-configurations
Packt
21 Feb 2011
10 min read
Save for later

Troubleshooting OpenVPN 2: Configurations

Packt
21 Feb 2011
10 min read
OpenVPN 2 Cookbook 100 simple and incredibly effective recipes for harnessing the power of the OpenVPN 2 network Set of recipes covering the whole range of tasks for working with OpenVPN The quickest way to solve your OpenVPN problems! Set up, configure, troubleshoot and tune OpenVPN Uncover advanced features of OpenVPN and even some undocumented options Introduction The topic of this article is troubleshooting OpenVPN. This article will focus on troubleshooting OpenVPN misconfigurations. The recipes in this article will therefore deal first with breaking the things. We will then provide the tools on how to find and solve the configuration errors. Some of the configuration directives used in this article have not been demonstrated before, so even if you are not interested in breaking things this article will still be insightful. Cipher mismatches In this recipe, we will change the cryptographic ciphers that OpenVPN uses. Initially, we will change the cipher only on the client side, which will cause the initialization of the VPN connection to fail. The primary purpose of this recipe is to show the error messages that appear, not to explore the different types of ciphers that OpenVPN supports. Getting ready Install OpenVPN 2.0 or higher on two computers. Make sure the computers are connected over a network. Set up the client and server certificates. For this recipe, the server computer was running CentOS 5 Linux and OpenVPN 2.1.1. The client was running Fedora 13 Linux and OpenVPN 2.1.1. Keep the server configuration file basic-udp-server.conf (download code, ch:2) and the client configuration file basic-udp-client.conf at hand. How to do it... Start the server using the configuration file basic-udp-server.conf: [root@server]# openvpn --config basic-udp-server.conf Next, create the client configuration file by appending a line to the basic-udp-client.conf file: cipher CAST5-CBC Save it as example7-1-client.conf. Start the client, after which the following message will appear in the client log: [root@client]# openvpn --config example7-1-client.conf ... WARNING: 'cipher' is used inconsistently, local='cipher CAST5- CBC', remote='cipher BF-CBC' ... [openvpnserver] Peer Connection Initiated with server-ip:1194 ... TUN/TAP device tun0 opened ... /sbin/ip link set dev tun0 up mtu 1500 ... /sbin/ip addr add dev tun0 192.168.200.2/24 broadcast 192.168.200.255 ... Initialization Sequence Completed ... Authenticate/Decrypt packet error: cipher final failed And, similarly, on the server side: ... client-ip:52461 WARNING: 'cipher' is used inconsistently, local='cipher BF-CBC', remote='cipher CAST5-CBC' ... client-ip:52461 [openvpnclient1] Peer Connection Initiated with openvpnclient1:52461 ... openvpnclient1/client-ip:52461 Authenticate/Decrypt packet error: cipher final failed ... openvpnclient1/client-ip:52461 Authenticate/Decrypt packet error: cipher final failed The connection will not be successfully established, but it will also not be disconnected immediately. How it works... During the connection phase, the client and the server negotiate several parameters needed to secure the connection. One of the most important parameters in this phase is the encryption cipher, which is used to encrypt and decrypt all the messages. If the client and server are using different ciphers, then they are simply not capable of talking to each other. By adding the following configuration directive to the server configuration file, the client and the server can communicate again: cipher CAST5-CBC There's more... OpenVPN supports quite a few ciphers, although support for some of the ciphers is still experimental. To view the list of supported ciphers, type: $ openvpn --show-ciphers This will list all ciphers with both variables and fixed cipher length. The ciphers with variable cipher length are very well supported by OpenVPN, the others can sometimes lead to unpredictable results. TUN versus TAP mismatches A common mistake when setting up a VPN based on OpenVPN is the type of adapter that is used. If the server is configured to use a TUN-style network but a client is configured to use a TAP-style interface, then the VPN connection will fail. In this recipe, we will show what is typically seen when this common configuration error is made. Getting ready Install OpenVPN 2.0 or higher on two computers. Make sure the computers are connected over a network. Set up the client and server certificates (Download code-ch:2 here). For this recipe, the server computer was running CentOS 5 Linux and OpenVPN 2.1.1. The client was running Fedora 13 Linux and OpenVPN 2.1.1. Keep the server configuration file basic-udp-server.conf (Download code-ch:2 here) and the client configuration file basic-udp-client.confat hand. How to do it... Start the server using the configuration file basic-udp-server.conf: [root@server]# openvpn --config basic-udp-server.conf Next, create the client configuration: client proto udp remote openvpnserver.example.com port 1194 dev tap nobind ca /etc/openvpn/cookbook/ca.crt cert /etc/openvpn/cookbook/client1.crt key /etc/openvpn/cookbook/client1.key tls-auth /etc/openvpn/cookbook/ta.key 1 ns-cert-type server Save it as example7-2-client.conf. Start the client [root@client]# openvpn --config example7-2-client.conf The client log will show: ... WARNING: 'dev-type' is used inconsistently, local='dev-type tap', remote='dev-type tun' ... WARNING: 'link-mtu' is used inconsistently, local='link-mtu 1573', remote='link-mtu 1541' ... WARNING: 'tun-mtu' is used inconsistently, local='tun-mtu 1532', remote='tun-mtu 1500' ... [openvpnserver] Peer Connection Initiated with server-ip:1194 ... TUN/TAP device tap0 opened ... /sbin/ip link set dev tap0 up mtu 1500 ... /sbin/ip addr add dev tap0 192.168.200.2/24 broadcast 192.168.200.255 ... Initialization Sequence Completed At this point, you can try pinging the server, but it will respond with an error: [client]$ ping 192.168.200.1 PING 192.168.200.1 (192.168.200.1) 56(84) bytes of data. From 192.168.200.2 icmp_seq=2 Destination Host Unreachable From 192.168.200.2 icmp_seq=3 Destination Host Unreachable From 192.168.200.2 icmp_seq=4 Destination Host Unreachable How it works... A TUN-style interface offers a point-to-point connection over which only TCP/IP traffic can be tunneled. A TAP-style interface offers the equivalent of an Ethernet interface that includes extra headers. This allows a user to tunnel other types of traffic over the interface. When the client and the server are misconfigured, the expected packet size is different: ... WARNING: 'tun-mtu' is used inconsistently, local='tun-mtu 1532', remote='tun-mtu 1500' This shows that each packet that is sent through a TAP-style interface is 32 bytes larger than the packets sent through a TUN-style interface. By correcting the client configuration, this problem is resolved. Compression mismatches OpenVPN supports on-the-fly compression of the traffic that is sent over the VPN tunnel. This can improve the performance over a slow network line, but it does add a little overhead. When transferring uncompressible data (such as ZIP files), the performance actually decreases slightly. If the compression is enabled on the server but not on the client, then the VPN connection will fail. Getting ready Install OpenVPN 2.0 or higher on two computers. Make sure the computers are connected over a network. Set up the client and server certificates. For this recipe, the server computer was running CentOS 5 Linux and OpenVPN 2.1.1. The client was running Fedora 13 Linux and OpenVPN 2.1.1. Keep the server configuration file basic-udp-server.conf (Download code-ch:2 here) and the client configuration file basic-udp-client.confat hand.. How to do it... Append a line to the server configuration file basic-udp-server.conf: comp-lzo Save it as example7-3-server.conf. Start the server: [root@server]# openvpn --config example7-3-server.conf Next, start the client: [root@client]# openvpn --config basic-udp-client.conf The connection will initiate but when data is sent over the VPN connection, the following messages will appear: Initialization Sequence Completed ... write to TUN/TAP : Invalid argument (code=22) ... write to TUN/TAP : Invalid argument (code=22) How it works... During the connection phase, no compression is used to transfer information between the client and the server. One of the parameters that is negotiated is the use of compression for the actual VPN payload. If there is a configuration mismatch between the client and the server, then both the sides will get confused by the traffic that the other side is sending. With a network fully comprising OpenVPN 2.1 clients and an OpenVPN 2.1 server, this can be fixed for all the clients by just adding another line: push "comp-lzo" There's more... OpenVPN 2.0 did not have the ability to push compression directives to the clients. This means that an OpenVPN 2.0 server does not understand this directive, nor do OpenVPN 2.0 clients. So, if an OpenVPN 2.1 server pushes out this directive to an OpenVPN 2.0 client, the connection will fail. Key mismatches OpenVPN offers extra protection for its TLS control channel in the form of HMAC keys. These keys are exactly the same as the static "secret" keys used in point-to-point style networks. For multi-client style networks, this extra protection can be enabled using the tls-auth directive . If there is a mismatch between the client and the server related to this tls-auth key , then the VPN connection will fail to get initialized. Getting ready Install OpenVPN 2.0 or higher on two computers. Make sure the computers are connected over a network. Set up the client and server certificates using the first recipe. For this recipe, the server computer was running CentOS 5 Linux and OpenVPN 2.1.1. The client was running Fedora 13 Linux and OpenVPN 2.1.1. Keep the server configuration file basic-udp-server.conf (Download code-ch:2 here) and the client configuration file basic-udp-client.conf at hand. How to do it... Start the server using the configuration file basic-udp-server.conf: [root@server]# openvpn --config basic-udp-server.conf Next, create the client configuration: client proto udp remote openvpnserver port 1194 dev tun nobind ca /etc/openvpn/cookbook/ca.crt cert /etc/openvpn/cookbook/client1.crt key /etc/openvpn/cookbook/client1.key tls-auth /etc/openvpn/cookbook/ta.key ns-cert-type server Note the lack of the second parameter for tls-auth. Save it as example7-4-client.conf file. Start the client: [root@client]# openvpn --config example7-4-client.conf The client log will show no errors, but the connection will not be established either. In the server log we'll find: ... Initialization Sequence Completed ... Authenticate/Decrypt packet error: packet HMAC authentication failed ... TLS Error: incoming packet authentication failed from client- ip:54454 This shows that the client openvpnclient1 is connecting using the wrong tls-auth parameter and the connection is refused. How it works... At the very first phase of the connection initialization, the client and the server verify each other's HMAC keys. If an HMAC key is not configured correctly, then the initialization is aborted and the connection will fail to establish. As the OpenVPN server is not able to determine whether the client is simply misconfigured or whether a malicious client is trying to overload the server, the connection is simply dropped. This causes the client to keep listening for the traffic from the server, until it eventually times out. In this recipe, the misconfiguration consisted of the missing parameter 1 behind: tls-auth /etc/openvpn/cookbook/ta.key The second parameter to the tls-auth directive is the direction of the key. Normally, the following convention is used: 0: from server to client 1: from client to server This parameter causes OpenVPN to derive its HMAC keys from a different part of the ta.key file. If the client and server disagree on which parts the HMAC keys are derived from, the connection cannot be established. Similarly, when the client and server are deriving the HMAC keys from different ta.key files, the connection can also not be established.
Read more
  • 0
  • 0
  • 26252

article-image-classification-decision-trees-apache-spark-mllib
Wilson D'souza
02 Nov 2017
9 min read
Save for later

Building a classification system with Decision Trees in Apache Spark 2.0

Wilson D'souza
02 Nov 2017
9 min read
[box type="note" align="" class="" width=""]In this article by Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, and Shuen Mei from their book Apache Spark 2.x Machine Learning Cookbook we shall explore how to build a classification system with decision trees using Spark MLlib library. The code and data files are available at the end of the article.[/box] A decision tree in Spark is a parallel algorithm designed to fit and grow a single tree into a dataset that can be categorical (classification) or continuous (regression). It is a greedy algorithm based on stumping (binary split, and so on) that partitions the solution space recursively while attempting to select the best split among all possible splits using Information Gain Maximization (entropy based). Apache Spark provides a good mix of decision tree based algorithms fully capable of taking advantage of parallelism in Spark. The implementation ranges from the straightforward Single Decision Tree (the CART type algorithm) to Ensemble Trees, such as Random Forest Trees and GBT (Gradient Boosted Tree). They all have both the variant flavors to facilitate classification (for example, categorical, such as height = short/tall) or regression (for example, continuous, such as height = 2.5 meters). Getting and preparing real-world medical data for exploring Decision Trees in Spark 2.0 To explore the real power of decision trees, we use a medical dataset that exhibits real life non-linearity with a complex error surface. The Wisconsin Breast Cancer dataset was obtained from the University of Wisconsin Hospital from Dr. William H Wolberg. The dataset was gained periodically as Dr. Wolberg reported his clinical cases. The dataset can be retrieved from multiple sources, and is available directly from the University of California Irvine's webserver http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wi sconsin/breast-cancer-wisconsin.data The data is also available from the University of Wisconsin's web Server: ftp://ftp.cs.wisc.edu/math-prog/cpo-dataset/machine-learn/cancer/cancer1/ datacum The dataset currently contains clinical cases from 1989 to 1991. It has 699 instances, with 458 classified as benign tumors and 241 as malignant cases. Each instance is described by nine attributes with an integer value in the range of 1 to 10 and a binary class label. Out of the 699 instances, there are 16 instances that are missing some attributes. We will remove these 16 instances from the memory and process the rest (in total, 683 instances) for the model calculations. The sample raw data looks like the following: 1000025,5,1,1,1,2,1,3,1,1,2 1002945,5,4,4,5,7,10,3,2,1,2 1015425,3,1,1,1,2,2,3,1,1,2 1016277,6,8,8,1,3,4,3,7,1,2 1017023,4,1,1,3,2,1,3,1,1,2 1017122,8,10,10,8,7,10,9,7,1,4 ... The attribute information is as follows: # Attribute Domain 1 Sample code number ID number 2 Clump Thickness 1 - 10 3 Uniformity of Cell Size 1 - 10 4 Uniformity of Cell Shape 1 - 10 5 Marginal Adhesion 1 - 10 6 Single Epithelial Cell Size 1 - 10 7 Bare Nuclei 1 - 10 8 Bland Chromatin 1 - 10 9 Normal Nucleoli 1 - 10 10 Mitoses 1 - 10 11 Class (2 for benign, 4 for Malignant) presented in the correct columns, it will look like the following: ID Number Clump Thickness Uniformity of Cell Size Uniformity of Cell Shape Marginal Adhesion Single Epithelial Cell Size Bare Nucleoli Bland Chromatin Normal Nucleoli Mitoses Class 1000025 5 1 1 1 2 1 3 1 1 2 1002945 5 4 4 5 7 10 3 2 1 2 1015425 3 1 1 1 2 2 3 1 1 2 1016277 6 8 8 1 3 4 3 7 1 2 1017023 4 1 1 3 2 1 3 1 1 2 1017122 8 10 10 8 7 10 9 7 1 4 1018099 1 1 1 1 2 10 3 1 1 2 1018561 2 1 2 1 2 1 3 1 1 2 1033078 2 1 1 1 2 1 1 1 5 2 1033078 4 2 1 1 2 1 2 1 1 2 1035283 1 1 1 1 1 1 3 1 1 2 1036172 2 1 1 1 2 1 2 1 1 2 1041801 5 3 3 3 2 3 4 4 1 4 1043999 1 1 1 1 2 3 3 1 1 2 1044572 8 7 5 10 7 9 5 5 4 4 ... ... ... ... ... ... ... ... ... ... ... We will now use the breast cancer data and use classifications to demonstrate the Decision Tree implementation in Spark. We will use the IG and Gini to show how to use the facilities already provided by Spark to avoid redundant coding. This exercise attempts to fit a single tree using a binary classification to train and predict the label (benign (0.0) and malignant (1.0)) for the dataset. Implementing Decision Trees in Apache Spark 2.0 Start a new project in IntelliJ or in an IDE of your choice. Make sure the necessary JAR files are included. Set up the package location where the program will reside: package spark.ml.cookbook.chapter10 Import the necessary packages for the Spark context to get access to the cluster andLog4j.Logger to reduce the amount of output produced by Spark: import org.apache.spark.mllib.evaluation.MulticlassMetrics import org.apache.spark.mllib.tree.DecisionTree import org.apache.spark.mllib.linalg.Vectors import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.tree.model.DecisionTreeModel import org.apache.spark.rdd.RDD import org.apache.spark.sql.SparkSession import org.apache.log4j.{Level, Logger} Create Spark's configuration and the Spark session so we can have access to the cluster:  Logger.getLogger("org").setLevel(Level.ERROR) val spark = SparkSession .builder .master("local[*]") .appName("MyDecisionTreeClassification") .config("spark.sql.warehouse.dir", ".") .getOrCreate() We read in the original raw data file:  val rawData = spark.sparkContext.textFile("../data/sparkml2/chapter10/breast- cancer-wisconsin.data") We pre-process the dataset:  val data = rawData.map(_.trim) .filter(text => !(text.isEmpty || text.startsWith("#") || text.indexOf("?") > -1)) .map { line => val values = line.split(',').map(_.toDouble) val slicedValues = values.slice(1, values.size) val featureVector = Vectors.dense(slicedValues.init) val label = values.last / 2 -1 LabeledPoint(label, featureVector) } First, we trim the line and remove any empty spaces. Once the line is ready for the next step, we remove the line if it's empty, or if it contains missing values ("?"). After this step, the 16 rows with missing data will be removed from the dataset in the memory. We then read the comma separated values into RDD. Since the first column in the dataset only contains the instance's ID number, it is better to remove this column from the real calculation. We slice it out with the following command, which will remove the first column from the RDD: val slicedValues = values.slice(1, values.size) We then put the rest of the numbers into a dense vector. Since the Wisconsin Breast Cancer dataset's classifier is either benign cases (last column value = 2) or malignant cases (last column value = 4), we convert the preceding value using the following command: val label = values.last / 2 -1 So the benign case 2 is converted to 0, and the malignant case value 4 is converted to 1, which will make the later calculations much easier. We then put the preceding row into a Labeled Points: Raw data: 1000025,5,1,1,1,2,1,3,1,1,2 Processed Data: 5,1,1,1,2,1,3,1,1,0 Labeled Points: (0.0, [5.0,1.0,1.0,1.0,2.0,1.0,3.0,1.0,1.0]) We verify the raw data count and process the data count:  println(rawData.count()) println(data.count()) And you will see the following on the console: 699 683 We split the whole dataset into training data (70%) and test data (30%) randomly. Please note that the random split will generate around 211 test datasets. It is approximately but NOT exactly 30% of the dataset:  val splits = data.randomSplit(Array(0.7, 0.3)) val (trainingData, testData) = (splits(0), splits(1)) We define a metrics calculation function, which utilizes the Spark MulticlassMetrics: def getMetrics(model: DecisionTreeModel, data: RDD[LabeledPoint]): MulticlassMetrics = { val predictionsAndLabels = data.map(example => (model.predict(example.features), example.label) ) new MulticlassMetrics(predictionsAndLabels) } This function will read in the model and test dataset, and create a metric which contains the confusion matrix mentioned earlier. It will contain the model accuracy, which is one of the indicators for the classification model. We define an evaluate function, which can take some tunable parameters for the Decision Tree model, and do the training for the dataset:  def evaluate( trainingData: RDD[LabeledPoint], testData: RDD[LabeledPoint], numClasses: Int, categoricalFeaturesInfo: Map[Int,Int], impurity: String, maxDepth: Int, maxBins:Int ) :Unit = { val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins) val metrics = getMetrics(model, testData) println("Using Impurity :"+ impurity) println("Confusion Matrix :") println(metrics.confusionMatrix) println("Decision Tree Accuracy: "+metrics.precision) println("Decision Tree Error: "+ (1-metrics.precision)) } The evaluate function will read in several parameters, including the impurity type (Gini or Entropy for the model) and generate the metrics for evaluations. We set the following parameters:  val numClasses = 2 val categoricalFeaturesInfo = Map[Int, Int]() val maxDepth = 5 val maxBins = 32 Since we only have benign (0.0) and malignant (1.0), we put numClasses as 2. The other parameters are tunable, and some of them are algorithm stop criteria. We evaluate the Gini impurity first:  evaluate(trainingData, testData, numClasses, categoricalFeaturesInfo, "gini", maxDepth, maxBins) From the console output: Using Impurity :gini Confusion Matrix : 115.0 5.0 0 88.0 Decision Tree Accuracy: 0.9620853080568721 Decision Tree Error: 0.03791469194312791 To interpret the above Confusion metrics, Accuracy is equal to (115+ 88)/ 211 all test cases, and error is equal to 1 - accuracy We evaluate the Entropy impurity:  evaluate(trainingData, testData, numClasses, categoricalFeaturesInfo, "entropy", maxDepth, maxBins) From the console output: Using Impurity:entropy Confusion Matrix: 116.0 4.0 9.0 82.0 Decision Tree Accuracy: 0.9383886255924171 Decision Tree Error: 0.06161137440758291 To interpret the preceding confusion metrics, accuracy is equal to (116+ 82)/ 211 for all test cases, and error is equal to 1 - accuracy We then close the program by stopping the session:  spark.stop() How it works... The dataset is a bit more complex than usual, but apart from some extra steps, parsing it remains the same as other recipes presented in previous chapters. The parsing takes the data in its raw form and turns it into an intermediate format which will end up as a LabelPoint data structure which is common in Spark ML schemes: Raw data: 1000025,5,1,1,1,2,1,3,1,1,2 Processed Data: 5,1,1,1,2,1,3,1,1,0 Labeled Points: (0.0, [5.0,1.0,1.0,1.0,2.0,1.0,3.0,1.0,1.0]) We use DecisionTree.trainClassifier() to train the classifier tree on the training set. We follow that by examining the various impurity and confusion matrix measurements to demonstrate how to measure the effectiveness of a tree model. The reader is encouraged to look at the output and consult additional machine learning books to understand the concept of the confusion matrix and impurity measurement to master Decision Trees and variations in Spark. There's more... To visualize it better, we included a sample decision tree workflow in Spark which will read the data into Spark first. In our case, we create the RDD from the file. We then split the dataset into training data and test data using a random sampling function. After the dataset is split, we use the training dataset to train the model, followed by test data to test the accuracy of the model. A good model should have a meaningful accuracy value (close to 1). The following figure depicts the workflow: A sample tree was generated based on the Wisconsin Breast Cancer dataset. The red spot represents malignant cases, and the blue ones the benign cases. We can examine the tree visually in the following figure: [box type="download" align="" class="" width=""]Download the code and data files here: classification system with Decision Trees in Apache Spark_excercise files[/box] If you liked this article, please be sure to check out Apache Spark 2.0 Machine Learning Cookbook which consists of this article and many more useful techniques on implementing machine learning solutions with the MLlib library in Apache Spark 2.0.
Read more
  • 0
  • 0
  • 26234

article-image-sleep-loss-cuts-developers-productivity-in-half-research-finds
Vincy Davis
03 May 2019
3 min read
Save for later

All coding and no sleep makes Jack/Jill a dull developer, research confirms

Vincy Davis
03 May 2019
3 min read
In recent years, the software engineering community has been interested in factors related to human habits that can play a role in increasing developers' productivity. The researchers- D. Fucci from HITeC and the University of Hamburg, G. Scanniello and S. Romano from DiMIE - University and N. Juristo from Technical University of Madrid have published a paper “Need for Sleep: the Impact of a Night of Sleep Deprivation on Novice Developers’ Performance” that investigates how sleep deprivation can impact developers' productivity. What was the experiment? The researchers performed a quasi experiment with 45 undergraduate students in Computer Science at the University of Basilicata in Italy. The participants were asked to work on a programming task which required them to use the popular agile practice of test-first development (TFD). The students were divided into two groups - The treatment group where 23 students were asked to skip their sleep the night before the experiment and the control group where the remaining students slept the night before the experiment. The conceptual model and the operationalization of the constructs investigated is as shown below. Image source: Research paper Outcome of the Experiment The result of the experiment indicated that sleep deprivation has a negative effect on the capacity of software developers to produce a software solution that meets given requirements. In particular, novice developers who forewent one night of sleep, wrote code which was approximately 50% more likely not to fulfill the functional requirements with respect to the code produced by developers under normal sleep condition. Another observation was that sleep deprivation decreased developers' productivity with the development task and hindered their ability to apply the test-first development (TFD) practice. The researchers also found that sleep-deprived novice developers had to make more fixes to syntactic mistakes in the source code. As an aftereffect of this result paper, experienced developers are recollecting their earlier sleep deprived programming days. Some are even regretting them. https://twitter.com/zhenghaooo/status/1121937715413434369 Recently the Chinese ‘996’ work routine has come into picture, wherein tech companies are expecting their employees to work from 9 am to 9 pm, 6 days a week, leading to 60+ hours of work per week. This kind of work culture will devoid these developers of any work-life balance. This will also encourage the habit of skipping sleep. Thus decreasing developers productivity. A user on Reddit declares sleep as the key to being a productive coder and not burning out. Another user added, “There's a culture in university computer science around programming for 30+ hours straight (hackathons). I've participated and pulled off some pretty cool things in 48 hours of feverish keyboard whacking and near-constant swearing, but I'd rather stab myself repeatedly with a thumbtack than repeat that experience.” It’s high time that companies focus more on the ‘quality’ of work than insisting developers to work for long hours, which will in turn reduce their productivity. It is clear from the result of this research paper that no sleep in a night, can certainly affect one’s quality of work. To know more about the experiment, head over to the research paper. Microsoft and GitHub employees come together to stand with the 996.ICU repository Jack Ma defends the extreme “996 work culture” in Chinese tech firms Dorsey meets Trump privately to discuss how to make public conversation “healthier and more civil” on Twitter
Read more
  • 0
  • 0
  • 26224

article-image-getting-started-python-packages
Packt
02 Nov 2016
37 min read
Save for later

Getting Started with Python Packages

Packt
02 Nov 2016
37 min read
In this article by Luca Massaron and Alberto Boschetti the authors of the book Python Data Science Essentials - Second Edition we will cover steps on installing Python, the different installation packages and have a glance at the essential packages will constitute a complete Data Science Toolbox. (For more resources related to this topic, see here.) Whether you are an eager learner of data science or a well-grounded data science practitioner, you can take advantage of this essential introduction to Python for data science. You can use it to the fullest if you already have at least some previous experience in basic coding, in writing general-purpose computer programs in Python, or in some other data-analysis-specific language such as MATLAB or R. Introducing data science and Python Data science is a relatively new knowledge domain, though its core components have been studied and researched for many years by the computer science community. Its components include linear algebra, statistical modelling, visualization, computational linguistics, graph analysis, machine learning, business intelligence, and data storage and retrieval. Data science is a new domain and you have to take into consideration that currently its frontiers are still somewhat blurred and dynamic. Since data science is made of various constituent sets of disciplines, please also keep in mind that there are different profiles of data scientists depending on their competencies and areas of expertise. In such a situation, what can be the best tool of the trade that you can learn and effectively use in your career as a data scientist? We believe that the best tool is Python, and we intend to provide you with all the essential information that you will need for a quick start. In addition, other tools such as R and MATLAB provide data scientists with specialized tools to solve specific problems in statistical analysis and matrix manipulation in data science. However, only Python really completes your data scientist skill set. This multipurpose language is suitable for both development and production alike; it can handle small- to large-scale data problems and it is easy to learn and grasp no matter what your background or experience is. Created in 1991 as a general-purpose, interpreted, and object-oriented language, Python has slowly and steadily conquered the scientific community and grown into a mature ecosystem of specialized packages for data processing and analysis. It allows you to have uncountable and fast experimentations, easy theory development, and prompt deployment of scientific applications. At present, the core Python characteristics that render it an indispensable data science tool are as follows: It offers a large, mature system of packages for data analysis and machine learning. It guarantees that you will get all that you may need in the course of a data analysis, and sometimes even more. Python can easily integrate different tools and offers a truly unifying ground for different languages, data strategies, and learning algorithms that can be fitted together easily and which can concretely help data scientists forge powerful solutions. There are packages that allow you to call code in other languages (in Java, C, FORTRAN, R, or Julia), outsourcing some of the computations to them and improving your script performance. It is very versatile. No matter what your programming background or style is (object-oriented, procedural, or even functional), you will enjoy programming with Python. It is cross-platform; your solutions will work perfectly and smoothly on Windows, Linux, and Mac OS systems. You won't have to worry all that much about portability. Although interpreted, it is undoubtedly fast compared to other mainstream data analysis languages such as R and MATLAB (though it is not comparable to C, Java, and the newly emerged Julia language). Moreover, there are also static compilers such as Cython or just-in-time compilers such as PyPy that can transform Python code into C for higher performance. It can work with large in-memory data because of its minimal memory footprint and excellent memory management. The memory garbage collector will often save the day when you load, transform, dice, slice, save, or discard data using various iterations and reiterations of data wrangling. It is very simple to learn and use. After you grasp the basics, there's no better way to learn more than by immediately starting with the coding. Moreover, the number of data scientists using Python is continuously growing: new packages and improvements have been released by the community every day, making the Python ecosystem an increasingly prolific and rich language for data science. Installing Python First, let's proceed to introduce all the settings you need in order to create a fully working data science environment to test the examples and experiment with the code that we are going to provide you with. Python is an open source, object-oriented, and cross-platform programming language. Compared to some of its direct competitors (for instance, C++ or Java), Python is very concise.  It allows you to build a working software prototype in a very short time. Yet it has become the most used language in the data scientist's toolbox not just because of that. It is also a general-purpose language, and it is very flexible due to a variety of available packages that solve a wide spectrum of problems and necessities. Python 2 or Python 3? There are two main branches of Python: 2.7.x and 3.x. At the time of writing this article, the Python foundation (www.python.org) is offering downloads for Python version 2.7.11 and 3.5.1. Although the third version is the newest, the older one is still the most used version in the scientific area, since a few packages (check on the website py3readiness.org for a compatibility overview) won't run otherwise yet. In addition, there is no immediate backward compatibility between Python 3 and 2. In fact, if you try to run some code developed for Python 2 with a Python 3 interpreter, it may not work. Major changes have been made to the newest version, and that has affected past compatibility. Some data scientists, having built most of their work on Python 2 and its packages, are reluctant to switch to the new version. We intend to address a larger audience of data scientists, data analysts and developers, who may not have such a strong legacy with Python 2. Thus, we agreed that it would be better to work with Python 3 rather than the older version. We suggest using a version such as Python 3.4 or above. After all, Python 3 is the present and the future of Python. It is the only version that will be further developed and improved by the Python foundation and it will be the default version of the future on many operating systems. Anyway, if you are currently working with version 2 and you prefer to keep on working with it, you can still the examples. In fact, for the most part, our code will simply work on Python 2 after having the code itself preceded by these imports: from __future__ import (absolute_import, division, print_function, unicode_literals) from builtins import * from future import standard_library standard_library.install_aliases() The from __future__ import commands should always occur at the beginning of your scripts or else you may experience Python reporting an error. As described in the Python-future website (python-future.org), these imports will help convert several Python 3-only constructs to a form compatible with both Python 3 and Python 2 (and in any case, most Python 3 code should just simply work on Python 2 even without the aforementioned imports). In order to run the upward commands successfully, if the future package is not already available on your system, you should install it (version >= 0.15.2) using the following command to be executed from a shell: $> pip install –U future If you're interested in understanding the differences between Python 2 and Python 3 further, we recommend reading the wiki page offered by the Python foundation itself: wiki.python.org/moin/Python2orPython3. Step-by-step installation Novice data scientists who have never used Python (who likely don't have the language readily installed on their machines) need to first download the installer from the main website of the project, www.python.org/downloads/, and then install it on their local machine. We will now coversteps which will provide you with full control over what can be installed on your machine. This is very useful when you have to set up single machines to deal with different tasks in data science. Anyway, please be warned that a step-by-step installation really takes time and effort. Instead, installing a ready-made scientific distribution will lessen the burden of installation procedures and it may be well suited for first starting and learning because it saves you time and sometimes even trouble, though it will put a large number of packages (and we won't use most of them) on your computer all at once. This being a multiplatform programming language, you'll find installers for machines that either run on Windows or Unix-like operating systems. Please remember that some of the latest versions of most Linux distributions (such as CentOS, Fedora, Red Hat Enterprise, and Ubuntu) have Python 2 packaged in the repository. In such a case and in the case that you already have a Python version on your computer (since our examples run on Python 3), you first have to check what version you are exactly running. To do such a check, just follow these instructions: Open a python shell, type python in the terminal, or click on any Python icon you find on your system. Then, after having Python started, to test the installation, run the following code in the Python interactive shell or REPL: >>> import sys >>> print (sys.version_info) If you can read that your Python version has the major=2 attribute, it means that you are running a Python 2 instance. Otherwise, if the attribute is valued 3, or if the print statements reports back to you something like v3.x.x (for instance v3.5.1), you are running the right version of Python and you are ready to move forward. To clarify the operations we have just mentioned, when a command is given in the terminal command line, we prefix the command with $>. Otherwise, if it's for the Python REPL, it's preceded by >>>. The installation of packages Python won't come bundled with all you need, unless you take a specific premade distribution. Therefore, to install the packages you need, you can use either pip or easy_install. Both these two tools run in the command line and make the process of installation, upgrade, and removal of Python packages a breeze. To check which tools have been installed on your local machine, run the following command: $> pip To install pip, follow the instructions given at pip.pypa.io/en/latest/installing.html. Alternatively, you can also run this command: $> easy_install If both of these commands end up with an error, you need to install any one of them. We recommend that you use pip because it is thought of as an improvement over easy_install. Moreover, easy_install is going to be dropped in future and pip has important advantages over it. It is preferable to install everything using pip because: It is the preferred package manager for Python 3. Starting with Python 2.7.9 and Python 3.4, it is included by default with the Python binary installers. It provides an uninstall functionality. It rolls back and leaves your system clear if, for whatever reason, the package installation fails. Using easy_install in spite of pip's advantages makes sense if you are working on Windows because pip won't always install pre-compiled binary packages.Sometimes it will try to build the package's extensions directly from C source, thus requiring a properly configured compiler (and that's not an easy task on Windows). This depends on whether the package is running on eggs (and pip cannot directly use their binaries, but it needs to build from their source code) or wheels (in this case, pip can install binaries if available, as explained here: pythonwheels.com/). Instead, easy_install will always install available binaries from eggs and wheels. Therefore, if you are experiencing unexpected difficulties installing a package, easy_install can save your day (at some price anyway, as we just mentioned in the list). The most recent versions of Python should already have pip installed by default. Therefore, you may have it already installed on your system. If not, the safest way is to download the get-pi.py script from bootstrap.pypa.io/get-pip.py and then run it using the following: $> python get-pip.py The script will also install the setup tool from pypi.python.org/pypi/setuptools, which also contains easy_install. You're now ready to install the packages you need in order to run the examples provided in this article. To install the < package-name > generic package, you just need to run this command: $> pip install < package-name > Alternatively, you can run the following command: $> easy_install < package-name > Note that in some systems, pip might be named as pip3 and easy_install as easy_install-3 to stress the fact that both operate on packages for Python 3. If you're unsure, check the version of Python pip is operating on with: $> pip –V For easy_install, the command is slightly different: $> easy_install --version After this, the <pk> package and all its dependencies will be downloaded and installed. If you're not certain whether a library has been installed or not, just try to import a module inside it. If the Python interpreter raises an ImportError error, it can be concluded that the package has not been installed. This is what happens when the NumPy library has been installed: >>> import numpy This is what happens if it's not installed: >>> import numpy Traceback (most recent call last): File "<stdin>", line 1, in <module> ImportError: No module named numpy In the latter case, you'll need to first install it through pip or easy_install. Take care that you don't confuse packages with modules. With pip, you install a package; in Python, you import a module. Sometimes, the package and the module have the same name, but in many cases, they don't match. For example, the sklearn module is included in the package named Scikit-learn. Finally, to search and browse the Python packages available for Python, look at pypi.python.org. Package upgrades More often than not, you will find yourself in a situation where you have to upgrade a package because either the new version is required by a dependency or it has additional features that you would like to use. First, check the version of the library you have installed by glancing at the __version__ attribute, as shown in the following example, numpy: >>> import numpy >>> numpy.__version__ # 2 underscores before and after '1.9.2' Now, if you want to update it to a newer release, say the 1.11.0 version, you can run the following command from the command line: $> pip install -U numpy==1.11.0 Alternatively, you can use the following command: $> easy_install --upgrade numpy==1.11.0 Finally, if you're interested in upgrading it to the latest available version, simply run this command: $> pip install -U numpy You can alternatively run the following command: $> easy_install --upgrade numpy Scientific distributions As you've read so far, creating a working environment is a time-consuming operation for a data scientist. You first need to install Python and then, one by one, you can install all the libraries that you will need (sometimes, the installation procedures may not go as smoothly as you'd hoped for earlier). If you want to save time and effort and want to ensure that you have a fully working Python environment that is ready to use, you can just download, install, and use the scientific Python distribution. Apart from Python, they also include a variety of preinstalled packages, and sometimes, they even have additional tools and an IDE. A few of them are very well known among data scientists, and in the following content, you will find some of the key features of each of these packages. We suggest that you promptly download and install a scientific distribution, such as Anaconda (which is the most complete one). Anaconda (continuum.io/downloads) is a Python distribution offered by Continuum Analytics that includes nearly 200 packages, which comprises NumPy, SciPy, pandas, Jupyter, Matplotlib, Scikit-learn, and NLTK. It's a cross-platform distribution (Windows, Linux, and Mac OS X) that can be installed on machines with other existing Python distributions and versions. Its base version is free; instead, add-ons that contain advanced features are charged separately. Anaconda introduces conda, a binary package manager, as a command-line tool to manage your package installations. As stated on the website, Anaconda's goal is to provide enterprise-ready Python distribution for large-scale processing, predictive analytics, and scientific computing. Leveraging conda to install packages If you've decided to install an Anaconda distribution, you can take advantage of the conda binary installer we mentioned previously. Anyway, conda is an open source package management system, and consequently it can be installed separately from an Anaconda distribution. You can test immediately whether conda is available on your system. Open a shell and digit: $> conda -V If conda is available, there will appear the version of your conda; otherwise an error will be reported. If conda is not available, you can quickly install it on your system by going to conda.pydata.org/miniconda.html and installing the Miniconda software suitable for your computer. Miniconda is a minimal installation that only includes conda and its dependencies. conda can help you manage two tasks: installing packages and creating virtual environments. In this paragraph, we will explore how conda can help you easily install most of the packages you may need in your data science projects. Before starting, please check to have the latest version of conda at hand: $> conda update conda Now you can install any package you need. To install the <package-name> generic package, you just need to run the following command: $> conda install <package-name> You can also install a particular version of the package just by pointing it out: $> conda install <package-name>=1.11.0 Similarly you can install multiple packages at once by listing all their names: $> conda install <package-name-1> <package-name-2> If you just need to update a package that you previously installed, you can keep on using conda: $> conda update <package-name> You can update all the available packages simply by using the --all argument: $> conda update --all Finally, conda can also uninstall packages for you: $> conda remove <package-name> If you would like to know more about conda, you can read its documentation at conda.pydata.org/docs/index.html. In summary, as a main advantage, it handles binaries even better than easy_install (by always providing a successful installation on Windows without any need to compile the packages from source) but without its problems and limitations. With the use of conda, packages are easy to install (and installation is always successful), update, and even uninstall. On the other hand, conda cannot install directly from a git server (so it cannot access the latest version of many packages under development) and it doesn't cover all the packages available on PyPI as pip itself. Enthought Canopy Enthought Canopy (enthought.com/products/canopy) is a Python distribution by Enthought Inc. It includes more than 200 preinstalled packages, such as NumPy, SciPy, Matplotlib, Jupyter, and pandas. This distribution is targeted at engineers, data scientists, quantitative and data analysts, and enterprises. Its base version is free (which is named Canopy Express), but if you need advanced features, you have to buy a front version. It's a multiplatform distribution and its command-line install tool is canopy_cli. PythonXY PythonXY (python-xy.github.io) is a free, open source Python distribution maintained by the community. It includes a number of packages, which include NumPy, SciPy, NetworkX, Jupyter, and Scikit-learn. It also includes Spyder, an interactive development environment inspired by the MATLAB IDE. The distribution is free. It works only on Microsoft Windows, and its command-line installation tool is pip. WinPython WinPython (winpython.sourceforge.net) is also a free, open-source Python distribution maintained by the community. It is designed for scientists, and includes many packages such as NumPy, SciPy, Matplotlib, and Jupyter. It also includes Spyder as an IDE. It is free and portable. You can put WinPython into any directory, or even into a USB flash drive, and at the same time maintain multiple copies and versions of it on your system. It works only on Microsoft Windows, and its command-line tool is the WinPython Package Manager (WPPM). Explaining virtual environments No matter you have chosen installing a stand-alone Python or instead you used a scientific distribution, you may have noticed that you are actually bound on your system to the Python's version you have installed. The only exception, for Windows users, is to use a WinPython distribution, since it is a portable installation and you can have as many different installations as you need. A simple solution to break free of such a limitation is to use virtualenv that is a tool to create isolated Python environments. That means, by using different Python environments, you can easily achieve these things: Testing any new package installation or doing experimentation on your Python environment without any fear of breaking anything in an irreparable way. In this case, you need a version of Python that acts as a sandbox. Having at hand multiple Python versions (both Python 2 and Python 3), geared with different versions of installed packages. This can help you in dealing with different versions of Python for different purposes (for instance, some of the packages we are going to present on Windows OS only work using Python 3.4, which is not the latest release). Taking a replicable snapshot of your Python environment easily and having your data science prototypes work smoothly on any other computer or in production. In this case, your main concern is the immutability and replicability of your working environment. You can find documentation about virtualenv at virtualenv.readthedocs.io/en/stable, though we are going to provide you with all the directions you need to start using it immediately. In order to take advantage of virtualenv, you have first to install it on your system: $> pip install virtualenv After the installation completes, you can start building your virtual environments. Before proceeding, you have to take a few decisions: If you have more versions of Python installed on your system, you have to decide which version to pick up. Otherwise, virtualenv will take the Python version virtualenv was installed by on your system. In order to set a different Python version you have to digit the argument –p followed by the version of Python you want or inserting the path of the Python executable to be used (for instance, –p python2.7 or just pointing to a Python executable such as -p c:Anaconda2python.exe). With virtualenv, when required to install a certain package, it will install it from scratch, even if it is already available at a system level (on the python directory you created the virtual environment from). This default behavior makes sense because it allows you to create a completely separated empty environment. In order to save disk space and limit the time of installation of all the packages, you may instead decide to take advantage of already available packages on your system by using the argument --system-site-packages. You may want to be able to later move around your virtual environment across Python installations, even among different machines. Therefore you may want to make the functioning of all of the environment's scripts relative to the path it is placed in by using the argument --relocatable. After deciding on the Python version, the linking to existing global packages, and the relocability of the virtual environment, in order to start, you just launch the command from a shell. Declare the name you would like to assign to your new environment: $> virtualenv clone virtualenv will just create a new directory using the name you provided, in the path from which you actually launched the command. To start using it, you just enter the directory and digit activate: $> cd clone $> activate At this point, you can start working on your separated Python environment, installing packages and working with code. If you need to install multiple packages at once, you may need some special function from pip—pip freeze—which will enlist all the packages (and their version) you have installed on your system. You can record the entire list in a text file by this command: $> pip freeze > requirements.txt After saving the list in a text file, just take it into your virtual environment and install all the packages in a breeze with a single command: $> pip install -r requirements.txt Each package will be installed according to the order in the list (packages are listed in a case-insensitive sorted order). If a package requires other packages that are later in the list, that's not a big deal because pip automatically manages such situations. So if your package requires Numpy and Numpy is not yet installed, pip will install it first. When you're finished installing packages and using your environment for scripting and experimenting, in order to return to your system defaults, just issue this command: $> deactivate If you want to remove the virtual environment completely, after deactivating and getting out of the environment's directory, you just have to get rid of the environment's directory itself by a recursive deletion. For instance, on Windows you just do this: $> rd /s /q clone On Linux and Mac, the command will be: $> rm –r –f clone If you are working extensively with virtual environments, you should consider using virtualenvwrapper, which is a set of wrappers for virtualenv in order to help you manage multiple virtual environments easily. It can be found at bitbucket.org/dhellmann/virtualenvwrapper. If you are operating on a Unix system (Linux or OS X), another solution we have to quote is pyenv (which can be found at https://github.com/yyuu/pyenv). It lets you set your main Python version, allow installation of multiple versions, and create virtual environments. Its peculiarity is that it does not depend on Python to be installed and works perfectly at the user level (no need for sudo commands). conda for managing environments If you have installed the Anaconda distribution, or you have tried conda using a Miniconda installation, you can also take advantage of the conda command to run virtual environments as an alternative to virtualenv. Let's see in practice how to use conda for that. We can check what environments we have available like this: >$ conda info -e This command will report to you what environments you can use on your system based on conda. Most likely, your only environment will be just "root", pointing to your Anaconda distribution's folder. As an example, we can create an environment based on Python version 3.4, having all the necessary Anaconda-packaged libraries installed. That makes sense, for instance, for using the package Theano together with Python 3 on Windows (because of an issue we will explain in a few paragraphs). In order to create such an environment, just do: $> conda create -n python34 python=3.4 anaconda The command asks for a particular python version (3.4) and requires the installation of all packages available on the anaconda distribution (the argument anaconda). It names the environment as python34 using the argument –n. The complete installation should take a while, given the large number of packages in the Anaconda installation. After having completed all of the installation, you can activate the environment: $> activate python34 If you need to install additional packages to your environment, when activated, you just do: $> conda install -n python34 <package-name1> <package-name2> That is, you make the list of the required packages follow the name of your environment. Naturally, you can also use pip install, as you would do in a virtualenv environment. You can also use a file instead of listing all the packages by name yourself. You can create a list in an environment using the list argument and piping the output to a file: $> conda list -e > requirements.txt Then, in your target environment, you can install the entire list using: $> conda install --file requirements.txt You can even create an environment, based on a requirements' list: $> conda create -n python34 python=3.4 --file requirements.txt Finally, after having used the environment, to close the session, you simply do this: $> deactivate Contrary to virtualenv, there is a specialized argument in order to completely remove an environment from your system: $> conda remove -n python34 --all A glance at the essential packages We mentioned that the two most relevant characteristics of Python are its ability to integrate with other languages and its mature package system, which is well embodied by PyPI (the Python Package Index: pypi.python.org/pypi), a common repository for the majority of Python open source packages that is constantly maintained and updated. The packages that we are now going to introduce are strongly analytical and they will constitute a complete Data Science Toolbox. All the packages are made up of extensively tested and highly optimized functions for both memory usage and performance, ready to achieve any scripting operation with successful execution. A walkthrough on how to install them is provided next. Partially inspired by similar tools present in R and MATLAB environments, we will together explore how a few selected Python commands can allow you to efficiently handle data and then explore, transform, experiment, and learn from the same without having to write too much code or reinvent the wheel. NumPy NumPy, which is Travis Oliphant's creation, is the true analytical workhorse of the Python language. It provides the user with multidimensional arrays, along with a large set of functions to operate a multiplicity of mathematical operations on these arrays. Arrays are blocks of data arranged along multiple dimensions, which implement mathematical vectors and matrices. Characterized by optimal memory allocation, arrays are useful not just for storing data, but also for fast matrix operations (vectorization), which are indispensable when you wish to solve ad hoc data science problems: Website: www.numpy.org Version at the time of print: 1.11.0 Suggested install command: pip install numpy As a convention largely adopted by the Python community, when importing NumPy, it is suggested that you alias it as np: import numpy as np SciPy An original project by Travis Oliphant, Pearu Peterson, and Eric Jones, SciPy completes NumPy's functionalities, offering a larger variety of scientific algorithms for linear algebra, sparse matrices, signal and image processing, optimization, fast Fourier transformation, and much more: Website: www.scipy.org Version at time of print: 0.17.1 Suggested install command: pip install scipy pandas The pandas package deals with everything that NumPy and SciPy cannot do. Thanks to its specific data structures, namely DataFrames and Series, pandas allows you to handle complex tables of data of different types (which is something that NumPy's arrays cannot do) and time series. Thanks to Wes McKinney's creation, you will be able easily and smoothly to load data from a variety of sources. You can then slice, dice, handle missing elements, add, rename, aggregate, reshape, and finally visualize your data at will: Website: pandas.pydata.org Version at the time of print: 0.18.1 Suggested install command: pip install pandas Conventionally, pandas is imported as pd: import pandas as pd Scikit-learn Started as part of the SciKits (SciPy Toolkits), Scikit-learn is the core of data science operations on Python. It offers all that you may need in terms of data preprocessing, supervised and unsupervised learning, model selection, validation, and error metrics. Scikit-learn started in 2007 as a Google Summer of Code project by David Cournapeau. Since 2013, it has been taken over by the researchers at INRA (French Institute for Research in Computer Science and Automation): Website: scikit-learn.org/stable Version at the time of print: 0.17.1 Suggested install command: pip install scikit-learn Note that the imported module is named sklearn. Jupyter A scientific approach requires the fast experimentation of different hypotheses in a reproducible fashion. Initially named IPython and limited to working only with the Python language, Jupyter was created by Fernando Perez in order to address the need for an interactive Python command shell (which is based on shell, web browser, and the application interface), with graphical integration, customizable commands, rich history (in the JSON format), and computational parallelism for an enhanced performance. Jupyter is our favoured choice; it is used to clearly and effectively illustrate operations with scripts and data, and the consequent results: Website: jupyter.org Version at the time of print: 1.0.0 (ipykernel = 4.3.1) Suggested install command: pip install jupyter Matplotlib Originally developed by John Hunter, matplotlib is a library that contains all the building blocks that are required to create quality plots from arrays and to visualize them interactively. You can find all the MATLAB-like plotting frameworks inside the pylab module: Website: matplotlib.org Version at the time of print: 1.5.1 Suggested install command: pip install matplotlib You can simply import what you need for your visualization purposes with the following command: import matplotlib.pyplot as plt Statsmodels Previously part of SciKits, statsmodels was thought to be a complement to SciPy's statistical functions. It features generalized linear models, discrete choice models, time series analysis, and a series of descriptive statistics as well as parametric and nonparametric tests: Website: statsmodels.sourceforge.net Version at the time of print: 0.6.1 Suggested install command: pip install statsmodels Beautiful Soup Beautiful Soup, a creation of Leonard Richardson, is a great tool to scrap out data from HTML and XML files retrieved from the Internet. It works incredibly well, even in the case of tag soups (hence the name), which are collections of malformed, contradictory, and incorrect tags. After choosing your parser (the HTML parser included in Python's standard library works fine), thanks to Beautiful Soup, you can navigate through the objects in the page and extract text, tables, and any other information that you may find useful: Website: www.crummy.com/software/BeautifulSoup Version at the time of print: 4.4.1 Suggested install command: pip install beautifulsoup4 Note that the imported module is named bs4. NetworkX Developed by the Los Alamos National Laboratory, NetworkX is a package specialized in the creation, manipulation, analysis, and graphical representation of real-life network data (it can easily operate with graphs made up of a million nodes and edges). Besides specialized data structures for graphs and fine visualization methods (2D and 3D), it provides the user with many standard graph measures and algorithms, such as the shortest path, centrality, components, communities, clustering, and PageRank. Website: networkx.github.io Version at the time of print: 1.11 Suggested install command: pip install networkx Conventionally, NetworkX is imported as nx: import networkx as nx NLTK The Natural Language Toolkit (NLTK) provides access to corpora and lexical resources and to a complete suite of functions for statistical Natural Language Processing (NLP), ranging from tokenizers to part-of-speech taggers and from tree models to named-entity recognition. Initially, Steven Bird and Edward Loper created the package as an NLP teaching infrastructure for their course at the University of Pennsylvania. Now, it is a fantastic tool that you can use to prototype and build NLP systems: Website: www.nltk.org Version at the time of print: 3.2.1 Suggested install command: pip install nltk Gensim Gensim, programmed by Radim Rehurek, is an open source package that is suitable for the analysis of large textual collections with the help of parallel distributable online algorithms. Among advanced functionalities, it implements Latent Semantic Analysis (LSA), topic modelling by Latent Dirichlet Allocation (LDA), and Google's word2vec, a powerful algorithm that transforms text into vector features that can be used in supervised and unsupervised machine learning. Website: radimrehurek.com/gensim Version at the time of print: 0.12.4 Suggested install command: pip install gensim PyPy PyPy is not a package; it is an alternative implementation of Python 2.7.8 that supports most of the commonly used Python standard packages (unfortunately, NumPy is currently not fully supported). As an advantage, it offers enhanced speed and memory handling. Thus, it is very useful for heavy duty operations on large chunks of data and it should be part of your big data handling strategies: Website: pypy.org/ Version at time of print: 5.1 Download page: pypy.org/download.html XGBoost XGBoost is a scalable, portable, and distributed gradient boosting library (a tree ensemble machine learning algorithm). Initially created by Tianqi Chen from Washington University, it has been enriched by a Python wrapper by Bing Xu and an R interface by Tong He (you can read the story behind XGBoost directly from its principal creator at homes.cs.washington.edu/~tqchen/2016/03/10/story-and-lessons-behind-the-evolution-of-xgboost.html). XGBoost is available for Python, R, Java, Scala, Julia, and C++, and it can work on a single machine (leveraging multithreading) in both Hadoop and Spark clusters: Website: xgboost.readthedocs.io/en/latest Version at the time of print: 0.4 Download page: github.com/dmlc/xgboost Detailed instructions for installing XGBoost on your system can be found at this page: github.com/dmlc/xgboost/blob/master/doc/build.md The installation of XGBoost on both Linux and MacOS is quite straightforward, whereas it is a little bit trickier for Windows users. On a Posix system you just have For this reason, we provide specific installation steps to get XGBoost working on Windows: First download and install Git for Windows (git-for-windows.github.io). Then you need a MINGW compiler present on your system. You can download it from www.mingw.org accordingly to the characteristics of your system. From the command line, execute: $> git clone --recursive https://github.com/dmlc/xgboost $> cd xgboost $> git submodule init $> git submodule update Then, always from command line, copy the configuration for 64-byte systems to be the default one: $> copy makemingw64.mk config.mk Alternatively, you just copy the plain 32-byte version: $> copy makemingw.mk config.mk After copying the configuration file, you can run the compiler, setting it to use four threads in order to speed up the compiling procedure: $> mingw32-make -j4 In MinGW, the make command comes with the name mingw32-make. If you are using a different compiler, the previous command may not work; then you can simply try: $> make -j4 Finally, if the compiler completes its work without errors, you can install the package in your Python by this: $> cd python-package $> python setup.py install After following all the preceding instructions, if you try to import XGBoost in Python and yet it doesn't load and results in an error, it may well be that Python cannot find the MinGW's g++ runtime libraries. You just need to find the location on your computer of MinGW's binaries (in our case, it was in C:mingw-w64mingw64bin; just modify the next code to put yours) and place the following code snippet before importing XGBoost: import os mingw_path = 'C:\mingw-w64\mingw64\bin' os.environ['PATH']=mingw_path + ';' + os.environ['PATH'] import xgboost as xgb Depending on the state of the XGBoost project, similarly to many other projects under continuous development, the preceding installation commands may or may not temporarily work at the time you will try them. Usually waiting for an update of the project or opening an issue with the authors of the package may solve the problem. Theano Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Basically, it provides you with all the building blocks you need to create deep neural networks. Created by academics (an entire development team; you can read their names on their most recent paper at arxiv.org/pdf/1605.02688.pdf), Theano has been used for large scale and intensive computations since 2007: Website: deeplearning.net/software/theano Release at the time of print: 0.8.2 In spite of many installation problems experienced by users in the past (expecially Windows users), the installation of Theano should be straightforward, the package being now available on PyPI: $> pip install Theano If you want the most updated version of the package, you can get it by Github cloning: $> git clone git://github.com/Theano/Theano.git Then you can proceed with direct Python installation: $> cd Theano $> python setup.py install To test your installation, you can run from shell/CMD and verify the reports: $> pip install nose $> pip install nose-parameterized $> nosetests theano If you are working on a Windows OS and the previous instructions don't work, you can try these steps using the conda command provided by the Anaconda distribution: Install TDM GCC x64 (this can be found at tdm-gcc.tdragon.net) Open an Anaconda prompt interface and execute: $> conda update conda $> conda update --all $> conda install mingw libpython $> pip install git+git://github.com/Theano/Theano.git Theano needs libpython, which isn't compatible yet with the version 3.5. So if your Windows installation is not working, this could be the likely cause. Anyway, Theano installs perfectly on Python version 3.4. Our suggestion in this case is to create a virtual Python environment based on version 3.4, install, and use Theano only on that specific version. Directions on how to create virtual environments are provided in the paragraph about virtualenv and conda create. In addition, Theano's website provides some information to Windows users; it could support you when everything else fails: deeplearning.net/software/theano/install_windows.html An important requirement for Theano to scale out on GPUs is to install Nvidia CUDA drivers and SDK for code generation and execution on GPU. If you do not know too much about the CUDA Toolkit, you can actually start from this web page in order to understand more about the technology being used: developer.nvidia.com/cuda-toolkit Therefore, if your computer has an NVidia GPU, you can find all the necessary instructions in order to install CUDA using this tutorial page from NVidia itself: docs.nvidia.com/cuda/cuda-quick-start-guide/index.html Keras Keras is a minimalist and highly modular neural networks library, written in Python and capable of running on top of either Theano or TensorFlow (the source software library for numerical computation released by Google). Keras was created by François Chollet, a machine learning researcher working at Google: Website: keras.io Version at the time of print: 1.0.3 Suggested installation from PyPI: $> pip install keras As an alternative, you can install the latest available version (which is advisable since the package is in continuous development) using the command: $> pip install git+git://github.com/fchollet/keras.git Summary In this article, we performed a lot of installations, from Python packages to examples.They were installed either directly or by using a scientific distribution. We also introduced Jupyter notebooks and demonstrated how you can have access to the data run in the tutorials. Resources for Article: Further resources on this subject: Python for Driving Hardware [Article] Mining Twitter with Python – Influence and Engagement [Article] Python Data Structures [Article]
Read more
  • 0
  • 0
  • 26199

article-image-classification-using-convolutional-neural-networks
Mohammad Pezeshki
07 Feb 2017
5 min read
Save for later

Classification using Convolutional Neural Networks

Mohammad Pezeshki
07 Feb 2017
5 min read
In this blog post, we begin with a simple classification task that the reader can readily relate to. The task is a binary classification of 25000 images of cats and dogs, divided into 20000 training, 2500 validation, and 2500 testing images. It seems reasonable to use the most promising model for object recognition, which is convolutional neural network (CNN). As a result, we use CNN as the baseline for the experiments, and along with this post, we will try to improve its performance using different techniques. So, in the next sections, we will first introduce CNN and its architecture and then we will explore three techniques to boost the performance and speed. These three techniques are using Parametric ReLU and a method of Batch Normalization. In this post, we will show the experimental results as we go through each technique. The complete code for CNN is available online in the author’s GitHub repository. Convolutional Neural Networks Convolutional neural networks can be seen as feedforward neural networks that multiple copies of the same neuron are applied to in different places. It means applying the same function to different patches of an image. Doing this means that we are explicitly imposing our knowledge about data (images) into the model structure. That's because we already know that natural image data is translation invariant, meaning that probability distribution of pixels are the same across all images. This structure, which is followed by a non-linearity and a pooling and subsampling layer, makes CNN’s powerful models, especially, when dealing with images. Here's a graphical illustration of CNN from Prof. Hugo Larochelle's course of Neural Networks, which is originally from Prof. YannLecun's paper on ConvNets. Implementation of a CNN in a GPU-based language of Theano is so straightforward as well. So, we can create a layer like this: And then we can stack them on top of each other like this: CNN Experiments Armed with CNN, we attacked the task using two baseline models. A relatively big, and a relatively small model. In the figures below, you can see the number for layer, filter size, pooling size, stride, and a number of fully connected layers. We trained both networks with a learning rate of 0.01, and a momentum of 0.9 on a GTX580 GPU. We also used early stopping. The small model can be trained in two hours and results in 81 percent accuracy on validation sets. The big model can be trained in 24 hours and results in 92 percent accuracy on validation sets. Parametric ReLU Parametric ReLU (aka Leaky ReLU) is an extension to Rectified Linear Unitthat allows the neuron to learn the slope of activation function in the negative region. Unlike the actual paper of Parametric ReLU by Microsoft Research, I used a different parameterizationthat forces the slope to be between 0 and 1. As shown in the figure below, when alpha is 0, the activation function is just linear. On the other hand, if alpha is 1, then the activation function is exactly the ReLU. Interestingly, although the number of trainable parameters is increased using Parametric ReLU, it improves the model both in terms of accuracy and in terms of convergence speed. Using Parametric ReLU makes the training time 3/4 and increases the accuracy around 1 percent. In Parametric ReLU,to make sure that alpha remains between 0 and 1, we will set alpha = Sigmoid(beta) and optimize beta instead. In our experiments, we will set the initial value of alpha to 0.5. After training, all alphas were between 0.5 and 0.8. That means that the model enjoys having a small gradient in the negative region. “Basically, even a small slope in negative region of activation function can help training a lot. Besides, it's important to let the model decide how much nonlinearity it needs.” Batch Normalization Batch Normalization simply means normalizing preactivations for each batch to have zero mean and unit variance. Based on a recent paper by Google, this normalization reduces a problem called Internal Covariance Shift and consequently makes the learning much faster. The equations are as follows: Personally, during this post, I found this as one of the most interesting and simplest techniques I've ever used. A very important point to keep in mind is to feed the whole validation set as a single batch at testing time to have a more accurate (less biased) estimation of mean and variance. “Batch Normalization, which means normalizing pre-activations for each batch to have zero mean and unit variance, can boost the results both in terms of accuracy and in terms of convergence speed.” Conclusion All in all, we will conclude this post with two finalized models. One of them can be trained in 10 epochs or, equivalently, 15 minutes, and can achieve 80 percent accuracy. The other model is a relatively large model. In this model, we did not use LDNN, but the two other techniques are used, and we achieved 94.5 percent accuracy. About the Author Mohammad Pezeshki is a PhD student in the MILA lab at University of Montreal. He obtained his bachelor's in computer engineering from Amirkabir University of Technology (Tehran Polytechnic) in July 2014. He then obtained his Master’s in June 2016. His research interests lie in the fields of Artificial Intelligence, Machine Learning, Probabilistic Models and, specifically,Deep Learning.
Read more
  • 0
  • 0
  • 26196

article-image-fine-tune-your-web-application-profiling-and-automation
Packt
07 Jun 2016
17 min read
Save for later

Fine Tune Your Web Application by Profiling and Automation

Packt
07 Jun 2016
17 min read
In this article by James Singleton, author of the book,ASP.NET Core 1.0 High Performance,sheds some light on how to improve the performance of your web application by profiling and testing it. In this article, we will cover writing automated tests to monitor performance along with adding these to aContinuous Integration(CI) and deployment system by constantly checking for regressions. (For more resources related to this topic, see here.) Profiling and measurement It's impossible to overstate how important profiling, measuring, and analyzingreliable evidence is, especially when dealing with web application performance. Maybe you used Glimpseor MiniProfilerto provide insights into the running of your web application;or perhaps, you are familiar with the Visual Studio diagnostics tools and the Application InsightsSoftware Development Kit (SDK). There's another tool that's worth mentioning and that's the Prefix profiler, which you can get at prefix.io.Prefix is a free, web‑based,ASP.NET profiler thatsupports ASP.NET Core. However, it doesn't yet support .NET Core (although this is planned),so you'll need to run ASP.NETCore on .NET Framework 4.6, for now. There's a live demo on their website (at demo.prefix.io) if you want to quickly check it out. You may also want to look at the PerfView performance analysis tool from Microsoft, which is used in the development of .NET Core. You can download PerfView from https://www.microsoft.com/en-us/download/details.aspx?id=28567, as a ZIP file that you can just extract and run. It is useful to analyze the memory of .NET applications among other things. You can use PerfView for many debugging activities, for example, to snapshot the heap or force GC runs. We don't have space for a detailed walkthrough here, but the included instructions are good, and there blogs on MSDN with guides and many video tutorials on Channel 9 at channel9.msdn.com/Series/PerfView-Tutorial if you need more information.Sysinternals tools (technet.microsoft.com/sysinternals) can also be helpful, but as they are not focused on .NET, they are less useful in this context. While tools such as these are great, what would be even better is building performance monitoring into your development workflow. Automate everything that you can and make performance checks transparent, routine, and run by default. Manual processes are bad becausesteps can be skipped and errors can easily be made. You wouldn't dream of developing software by e-mailing files around or editing code directly on a production server, so why not automate your performance tests too? Change control processes exist to ensure consistency and reduce errors. This is why using a Source Control Management (SCM) system, such as git or Team Foundation Server (TFS) is essential. It's also extremely useful to have a build server and perform Continuous Integration(CI) or even fully automated deployments. If the code that is deployed in production differs from what you have on your local workstation, then you have very little chance of success. This is one of the reasons why SQL Stored Procedures (SPs/sprocs) are difficult to work with,at least without rigorous version control. It's far too easy to modify an old version of an SP on a development database, accidentally revert a bug fix, and end up with a regression.If you must use sprocs, then you will need a versioning system such, as ReadyRoll (which Redgate has now acquired). If you practice Continuous Delivery (CD),then you'll have a build server, such as JetBrains TeamCity, ThoughtWorksGoCD, orCruiseControl.NET,or a cloud service, such as AppVeyor. Perhaps, you even automating your deployments using a tool, such as Octopus Deploy, and have your own internal NuGet feeds using software such as TheMotleyFool's Klondike or a cloud service such as MyGet (which also supports npm, bower, and VSIX packages). Bypassing processes and doing things manually will cause problems, even if you follow a script. If it can be automated, then it probably should be, and this includes testing. Automated testing As previously mentioned, the key to improving almost everything is automation. Tests thatare only run manually on developer workstations add very little value. It should of course be possible to run the tests on desktops, but this shouldn't be the official result because there's no guarantee that they will pass on a server (where the correct functioning matters more). Although automation usually occurs on servers, it can be useful to automate tests running on developer workstations too. One way of doing this in Visual Studio is to use a plugin, such as NCrunch. This runs your tests as you work, which can be very useful if you practice Test-Driven Development (TDD) and write your tests before your implementations. You can read more about NCrunch and see the pricing at ncrunch.net, or there's a similar open source project at continuoustests.com. One way of enforcing testing is to use gated check-ins in TFS, but this can be a little draconian, and if you use an SCM-like git, then it's easier to work on branches and simply block merges until all of the tests pass. You want to encourage developers to check-in early and often because this makes merges easier.Therefore, it's a bad idea to have features in progress sitting on workstations for a long time (generally no longer than a day). Continuous integration CI systems automatically build and test all of your branches, and they feed this information back to your version control system. For example, using the GitHubAPI,you can block the merging of pull requests until the build server has reported success of the merge result. Both Bitbucket and GitLab offer free CI systems called pipelines, so you may not need any extra systems in addition to one for source control because everything is in one place. GitLab also offers an integrated Docker container registry, and there is an open source version that you can install locally. Docker is well supported by .NET Core, and the new version of Visual Studio.You cando something similar with Visual Studio Team Services for CI builds and unit testing. Visual Studioalso has git services built into it. This process works well for unit testing because unit tests must be quick so that you get feedback early.Shortening the iteration cycle is a good way of increasing productivity,and you'll want the lag to be as small as possible. However, running tests on each build isn't suitable for all types of testing because not all tests can be quick. In this case, you'll need an additional strategy so as not to slow down your feedback loop. There are many unit testing frameworks available for .NET, for example NUnit, xUnit, and MSTest (Microsoft's unit test framework), along with multiple graphical ways of running tests locally, such as the Visual Studio Test Explorer and the ReSharper plugin. People have their favorites, but it doesn't really matter what you choose because most CI systems will support all of them. Slow testing Some tests are slow,but even if each test is fast they can easily add up to a lengthy time if you have a lot of them. This is especially true if they can't be parallelized and need to be run in sequence.Therefore, you should always aim to have each test stand on its own, without any dependencies on others. It's good practice to divide your tests into rings of importance so that you can at least run a subset of the most crucial on every CI build. However, if you have a large test suite or some tests thatare unavoidably slow, then you may choose to only run these once a day (perhaps overnight) or every week (maybe over the weekend). Some testing is simply slow by nature, and performance testing can often fall into this category, for example, load testing or User Interface (UI) testing. These are usually classed as integration testing, rather than unit testing, because they require your code to be deployed to an environment for testing, and the tests can't simply exercise the binaries. To make use of such automated testing, you will need to have an automated deployment system in addition to your CI system. If you have enough confidence in your test system, then you caneven have live deployments happen automatically. This works well if you also use feature switching to control the rollout of new features. Realistic environments Using a test environment that is as close to production (or as live-like) as possible is a good step toward ensuring reliable results. You cantry and use a smaller set of servers, and then scale your results up to get an estimate of live performance, but this assumes that you have an intimate knowledge of how your application scales, and what hardware constraints will be the bottlenecks. A better option is to use your live environment or rather what will become your production stack. You first create a staging environment that is identical to live, then you deploy your code to it, and run your full test suite, including a comprehensive performance test, ensuring that it behaves correctly. Once you are happy, then you simply swap staging and production, perhaps using DNS or Azure staging slots. Your old live environment now either becomes your test environment or if you use immutable cloud instances, then you can simply terminate it and spin up a new staging system. This concept is known as blue‑green deployment. You don't necessarily have to move all users across at once in a big bang. You canmove a few over first to test whether everything is correct. Web UI testing tools One of the most popular web testing tools is Selenium, which allows you to easily write tests and automate web browsers using WebDriver. Selenium is useful for many other tasks apart from testing, and you can read more about it at docs.seleniumhq.org. WebDriver is a protocol for remote controlling web browsers, and you can read about it at w3c.github.io/webdriver/webdriver-spec.html. Selenium uses real browsers, the same versions your users will access your web application with. This makes it excellent to get representative results, but it can cause issues if itrunsfrom the command line in an unattended fashion. For example, you may find your test server's memory full of dead browser processes, which have timed out. You may find it easier to use a dedicated headless test browser, which while not exactly the same as what your users will see, is more suitable for automation. The best approach is of course to use a combination of both, perhaps running headless tests first and then running the same tests on real browsers with WebDriver. One of the most well-known headless test browsers is PhantomJS. This is based on the WebKit engine, so it should give similar results to Chrome and Safari. PhantomJS is useful for many things apart from testing, such as capturing screenshots, and many different testing frameworks can drive it. As the name suggests,JavaScript can control PhantomJS, and you can read more about it at phantomjs.org. WebKit is an open source engine for web browsers, which was originally part of the KDE Linux desktop environment. It is mainly used in Apple's Safari browser, but a fork called Blink is used in Google Chrome, Chromium, and Opera. You can read more at webkit.org. Other automatable testing browsers based on different engines are available, but they have some limitations. For example, SlimerJS (slimerjs.org) is based on the Gecko engine used by Firefox, but is not fully headless. You probably want to use a higher-level testing utility rather than scripting browser engines directly. One such utility that provides many useful abstractions is CasperJS(casperjs.org),which supports running onboth PhantomJS and SlimerJS. Another library is Capybara, which allows you to easily simulate user interactions in Ruby. It supports Selenium, WebKit, Rack, and PhantomJS (via Poltergeist), although it's more suitable for Rails apps.You can read more at jnicklas.github.io/capybara. There is also TrifleJS (triflejs.org), which uses the .NET WebBrowser class (the Internet Explorer Trident engine), but this is a work in progress. Additionally, there's Watir (watir.com), which is a set of Ruby libraries that target Internet Explorer and WebDriver. However, neither have been updated in a while, and IE has changed a lot recently. Microsoft Edge (codenamed Spartan)is the new version of IE, and the Trident engine has been forked to EdgeHTML.The JavaScript engine (Chakra) has been open sourced as ChakraCore (github.com/Microsoft/ChakraCore). It shouldn't matter too much what browser engine you use, and PhantomJS will work fine as a first pass for automated tests. You can always test with real browsers after using a headless one, perhaps with Selenium or with PhantomJS using WebDriver. When we refer to browser engines (WebKit/Blink, Gecko, and Trident/EdgeHTML), we generally mean only the rendering and layout engine, not the JavaScript engine (SFX/Nitro/FTL/B3, V8, SpiderMonkey, and Chakra/ChakraCore). You'll probably still want to use a utility such as CasperJS to make writing tests easier, and you'll likely need a test framework, such as Jasmine (jasmine.github.io) or QUnit (qunitjs.com), too. You can also use a test runner thatsupports both Jasmine and QUnit, such as Chutzpah (mmanela.github.io/chutzpah). You can integrate your automated tests with many different CI systems, for example, Jenkins or JetBrains TeamCity. If you prefer a cloud-hosted option, then there's Travis CI (travis-ci.org) andAppVeyor (appveyor.com), which is also suitableto build .NET apps. You may prefer to run your integration and UI tests from your deployment system, for example, to verify a successful deployment in Octopus Deploy. There are also dedicated,cloud-based,web-application UI testing services available, such as BrowserStack (browserstack.com). Automating UI performancetests Automated UI tests are clearly great to check functional regressions, but they are also useful to test performance. You have programmatic access to the same information provided by the network inspector in the browser developer tools. You can integrate the YSlow (yslow.org)performance analyzerwith PhantomJS, enabling your CI system to check for common web performance mistakes on every commit. YSlow came out of Yahoo!, and it provides rules used to identify bad practices, which can slow down web applications for users. It's a similar idea to Google's PageSpeed Insights service (which can be automated via its API). However, YSlow is pretty old, and things have moved on in web development recently, for example, HTTP/2. A modern alternative is "the coach" from sitespeed.io, and you can read more at github.com/sitespeedio/coach.You should check out their other open source tools too, such as the dashboard at dashboard.sitespeed.io, which uses Graphite and Grafana. You canalso export the network results (in industry standard HAR format) and analyze them however you like. For example, visualizing them graphically in waterfall format, as you might do manually with your browser developer tools. The HTTP Archive (HAR) format is a standard way of representing the content of monitored network data to export it to other software. You can copy or save as HAR in some browser developer tools by right-clicking on a network request. DevOps When using automation and techniques, such as feature switching, it is essential to have a good view of your environments so that you know the utilization of all the hardware. Good tooling is important to perform this monitoring, and you want to easily be able to see the vital statistics of every server. This will consist of at least the CPU, memory, and disk space consumption, but it may include more, and you will want alarms set up to alert you if any of these stray outside allowed bands. The practice of DevOps is the culmination of all of the automation that we covered previously with development, operations, and quality assurance testing teams all collaborating. The only missing pieces left now are provisioning and configuring infrastructure and then monitoring it while in use. Although DevOps is a culture, there is plenty of tooling that can help. DevOps tooling One of the primary themes of DevOps tooling is defining infrastructure as code. The idea is that you shouldn't manually perform a task, such as setting up a server, when you can create software to do it for you. You canthen reuse these provisioning scripts, which will not only save you time, but it will also ensure that all of the machines are consistent and free of mistakes or missed steps. Provisioning There are many systems available to commission and configure new machines. Some popular configuration management automation tools are Ansible (ansible.com), Chef (chef.io), and Puppet (puppet.com). Not all of these tools work great on Windows servers, partly because Linux is easier to automate. However, you can run ASP.NETCore on Linux and still develop on Windows using Visual Studio, while testing in a VM. Developing for a VM is a great idea because it solves the problems in setting up environments and issues where it "works on my machine" but not in production. Vagrant (vagrantup.com) is a great command line tool to manage developer VMs. It allows you to easily create, spin up, and share developer environments. The successor to Vagrant, Otto (ottoproject.io) takes this a step further and abstracts deployment too.Therefore,you can push to multiple cloud providers without worrying about the intricacies of CloudFormation, OpsWorks, or anything else. If you create your infrastructure as code, then your scripts can be versioned and tested, just like your application code. We'll stop before we get too far off-topic, but the point is that if you have reliable environments, which you can easily verify, instantiate, and perform testing on, then CI is a lot easier. Monitoring Monitoring is essential, especially for web applications, and there are many tools available to help with it. A popular open source infrastructure monitoring system is Nagios (nagios.org). Another more modern open source alerting and metrics tool is Prometheus(prometheus.io). If you use a cloud platform, then there will be monitoring built in, for example AWS CloudWatch or Azure Diagnostics.There are also cloud servicesto directly monitor your website, such as Pingdom (pingdom.com), UptimeRobot (uptimerobot.com),Datadog (datadoghq.com),and PagerDuty (pagerduty.com). You probably already have a system in place to measure availability, but you can also use the same systems to monitor performance. This is not only helpfulto ensure a responsive users experience, but it can also provide early warning signs that a failure is imminent. If you are proactive and take preventative action, then you can save yourself a lot of trouble reactively fighting fires. It helps consider application support requirements at design time. Development, testing, and operations aren't competing disciplines, and you will succeed more often if you work as one team rather than simply throwing an application over the fence and saying it "worked in test, ops problem now". Summary In this article, we saw how wecan integrate automated testing into a CI system in order to monitor for performance regressions. We also learned some strategies to roll out changes and ensure that tests accurately reflect real life. We also briefly covered some options for DevOps practices and cloud-hosting providers, which together make continuous performance testing much easier. Resources for Article: Further resources on this subject: Designing your very own ASP.NET MVC Application [article] Creating a NHibernate session to access database within ASP.NET [article] Working With ASP.NET DataList Control [article]
Read more
  • 0
  • 0
  • 26185
Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at $19.99/month. Cancel anytime
article-image-endpoint-protection-hardening-and-containment-strategies-for-ransomware-attack-protection-cisa-recommended-fireeye-report-highlights
Savia Lobo
12 Sep 2019
8 min read
Save for later

Endpoint protection, hardening, and containment strategies for ransomware attack protection: CISA recommended FireEye report Highlights

Savia Lobo
12 Sep 2019
8 min read
Last week, the Cybersecurity and Infrastructure Security Agency (CISA) shared some strategies with users and organizations to prevent, mitigate, and recover against ransomware. They said, “The Cybersecurity and Infrastructure Security Agency (CISA) has observed an increase in ransomware attacks across the Nation. Helping organizations protect themselves from ransomware is a chief priority for CISA.” They have also advised that those attacked by ransomware should report immediately to CISA, a local FBI Field Office, or a Secret Service Field Office. In the three resources shared, the first two include general awareness about what ransomware is and why it is a major threat, mitigations, and much more. The third resource is a FireEye report on ransomware protection and containment strategies. Also Read: Vulnerabilities in the Picture Transfer Protocol (PTP) allows researchers to inject ransomware in Canon’s DSLR camera CISA INSIGHTS and best practices to prevent ransomware The CISA, as a part of their first “CISA INSIGHTS” product, has put down three simple steps or recommendations organizations can take to manage their cybersecurity risk. CISA advises users to take necessary precautionary steps such as backing up the entire system offline, keeping the system updated and patched, update security solutions, and much more. If users have been affected by ransomware, they should contact the CISA or FBI immediately, work with an experienced advisor to help recover from the attack, isolate the infected systems and phase your return to operations, etc. Further, the CISA also tells users to practice good cyber hygiene, i.e. backup, update, whitelist apps, limit privilege, and using multi-factor authentication. Users should also develop containment strategies that will make it difficult for bad actors to extract information. Users should also review disaster recovery procedures and validate goals with executives, and much more. The CISA team has suggested certain best practices which the organizations should employ to stay safe from a ransomware attack. These include, users should restrict permissions to install and run software applications, and apply the principle of “least privilege” to all systems and services thus, limiting ransomware to spread further. The organization should also ensure using application whitelisting to allow only approved programs to run on a network. All firewalls should be configured to block access to known malicious IP addresses. Organizations should also enable strong spam filters to prevent phishing emails from reaching the end users and authenticate inbound emails to prevent email spoofing. A measure to scan all incoming and outgoing emails to detect threats and filter executable files from reaching end-users should be initiated. Read the entire CISA INSIGHTS to know more about the various ransomware outbreak strategies in detail. Also Read: ‘City Power Johannesburg’ hit by a ransomware attack that encrypted all its databases, applications and network FireEye report on Ransomware Protection and Containment strategies As a third resource, the CISA shared a FireEye report titled “Ransomware Protection and Containment Strategies: Practical Guidance for Endpoint Protection, Hardening, and Containment”. In this whitepaper, FireEye discusses different steps organizations can proactively take to harden their environment to prevent the downstream impact of a ransomware event. These recommendations can also help organizations with prioritizing the most important steps required to contain and minimize the impact of a ransomware event after it occurs. The FireEye report points out that any ransomware can be deployed across an environment in two ways. First, by Manual propagation by a threat actor after they have penetrated an environment and have administrator-level privileges broadly across the Environment to manually run encryptors on the targeted system through Windows batch files, Microsoft Group Policy Objects, and existing software deployment tools used by the victim’s organization. Second, by Automated propagation where the credential or Windows token is extracted directly from disk or memory to build trust relationships between systems through Windows Management Instrumentation, SMB, or PsExec. This binds systems and executes payloads. Hackers also automate brute-force attacks on unpatched exploitation methods, such as BlueKeep and EternalBlue. “While the scope of recommendations contained within this document is not all-encompassing, they represent the most practical controls for endpoint containment and protection from a ransomware outbreak,” FireEye researchers wrote. To combat these two deployment techniques, the FireEye researchers have suggested two enforcement measures which can limit the capability for a ransomware or malware variant to impact a large scope of systems within an environment. The FireEye report covers several technical recommendations to help organizations mitigate the risk of and contain ransomware events some of which include: RDP Hardening Remote Desktop Protocol (RDP) is a common method used by malicious actors to remotely connect to systems, laterally move from the perimeter onto a larger scope of systems for deploying malware. Organizations should also scan their public IP address ranges to identify systems with RDP (TCP/3389) and other protocols (SMB – TCP/445) open to the Internet in a proactive manner. RDP and SMB should not be directly exposed to ingress and egress access to/from the Internet. Other measures that organizations can take include: Enforcing Multi-Factor Authentication Organizations can either integrate a third-party multi-factor authentication technology or leverage a Remote Desktop Gateway and Azure Multi-Factor Authentication Server using RADIUS. Leveraging Network Level Authentication (NLA) Network Level Authentication (NLA) provides an extra layer of pre-authentication before a connection is established. It is also useful for protecting against brute force attacks, which mostly target open internet-facing RDP servers. Reducing the exposure of privileged and service accounts For ransomware deployment throughout an environment, both privileged and service accounts credentials are commonly utilized for lateral movement and mass propagation. Without a thorough investigation, it may be difficult to determine the specific credentials that are being utilized by a ransomware variant for connectivity within an environment. Privileged account and service account logon restrictions For accounts having privileged access throughout an environment, these should not be used on standard workstations and laptops, but rather from designated systems (e.g., Privileged Access Workstations (PAWS)) that reside in restricted and protected VLANs and Tiers. Explicit privileged accounts should be defined for each Tier, and only utilized within the designated Tier. The recommendations for restricting the scope of access for privileged accounts is based upon Microsoft’s guidance for securing privileged access. As a quick containment measure, consider blocking any accounts with privileged access from being able to login (remotely or locally) to standard workstations, laptops, and common access servers (e.g., virtualized desktop infrastructure). If a service account is only required to be leveraged on a single endpoint to run a specific service, the service account can be further restricted to only permit the account’s usage on a predefined listing of endpoints. Protected Users Security Group With the “Protected Users” security group for privileged accounts, an organization can minimize various risk factors and common exploitation methods for exposing privileged accounts on endpoints. Starting from Microsoft Windows 8.1 and Microsoft Windows Server 2012 R2 (and above), the “Protected Users” security group was introduced to manage credential exposure within an environment. Members of this group automatically have specific protections applied to their accounts, including: The Kerberos ticket granting ticket (TGT) expires after 4 hours, rather than the normal 10-hour default setting.  No NTLM hash for an account is stored in LSASS since only Kerberos authentication is used (NTLM authentication is disabled for an account).  Cached credentials are blocked. A Domain Controller must be available to authenticate the account. WDigest authentication is disabled for an account, regardless of an endpoint’s applied policy settings. DES and RC4 can’t be used for Kerberos pre-authentication (Server 2012 R2 or higher); rather Kerberos with AES encryption will be enforced. Accounts cannot be used for either constrained or unconstrained delegation (equivalent to enforcing the “Account is sensitive and cannot be delegated” setting in Active Directory Users and Computers). Cleartext password protections Organizations should also try minimizing the exposure of credentials and tokens in memory on endpoints. On older Windows Operating Systems, cleartext passwords are stored in memory (LSASS) to primarily support WDigest authentication. The WDigest should be explicitly disabled on all Windows endpoints where it is not disabled by default. WDigest authentication is disabled in Windows 8.1+ and in Windows Server 2012 R2+, by default. Starting from Windows 7 and Windows Server 2008 R2, after installing Microsoft Security Advisory KB2871997, WDigest authentication can be configured either by modifying the registry or by using the “Microsoft Security Guide” Group Policy template from the Microsoft Security Compliance Toolkit. To implement these and other ransomware protection and containment strategies, read the FireEye report. Other interesting news in Cybersecurity Wikipedia hit by massive DDoS (Distributed Denial of Service) attack; goes offline in many countries Exim patches a major security bug found in all versions that left millions of Exim servers vulnerable to security attacks CircleCI reports of a security breach and malicious database in a third-party vendor account
Read more
  • 0
  • 0
  • 26177

article-image-npm-inc-co-founder-and-chief-data-officer-quits-leaving-the-community-to-question-the-stability-of-the-javascript-registry
Fatema Patrawala
22 Jul 2019
6 min read
Save for later

Npm Inc. co-founder and Chief data officer quits, leaving the community to question the stability of the JavaScript Registry

Fatema Patrawala
22 Jul 2019
6 min read
On Thursday, The Register reported that Laurie Voss, the co-founder and chief data officer of JavaScript package registry, NPM Inc left the company. Voss’s last day in office was 1st July while he officially announced the news on Thursday. Voss joined NPM in January 2014 and decided to leave the company in early May this year. NPM has faced its share of unrest in the company in the past few months. In the month of March  5 NPM employees were fired from the company in an unprofessional and unethical way. Later 3 of those employees were revealed to have been involved in unionization and filed complaints against NPM Inc with the National Labor Relations Board (NLRB).  Earlier this month NPM Inc at the third trial settled the labor claims brought by these three former staffers through the NLRB. Voss’ s resignation will be third in line after Rebecca Turner, former core contributor who resigned in March and Kat Marchan, former CLI and community architect who resigned from NPM early this month. Voss writes on his blog, “I joined npm in January of 2014 as co-founder, when it was just some ideals and a handful of servers that were down as often as they were up. In the following five and a half years Registry traffic has grown over 26,000%, and worldwide users from about 1 million back then to more than 11 million today. One of our goals when founding npm Inc. was to make it possible for the Registry to run forever, and I believe we have achieved that goal. While I am parting ways with npm, I look forward to seeing my friends and colleagues continue to grow and change the JavaScript ecosystem for the better.” Voss also told The Register that he supported unions, “As far as the labor dispute goes, I will say that I have always supported unions, I think they're great, and at no point in my time at NPM did anybody come to me proposing a union,” he said. “If they had, I would have been in favor of it. The whole thing was a total surprise to me.” The Register team spoke to one of the former staffers of NPM and they said employees tend not to talk to management in the fear of retaliation and Voss seemed uncomfortable to defend the company’s recent actions and felt powerless to affect change. In his post Voss is optimistic about NPM’s business areas, he says, “Our paid products, npm Orgs and npm Enterprise, have tens of thousands of happy users and the revenue from those sustains our core operations.” However, Business Insider reports that a recent NPM Inc funding round of the company raised only enough to continue operating until early 2020. https://twitter.com/coderbyheart/status/1152453087745007616 A big question on everyone’s mind currently is the stability of the public Node JS Registry. Most users in the JavaScript community do not have a fallback in place. While the community see Voss’s resignation with appreciation for his accomplishments, some are disappointed that he could not raise his voice against these odds and had to quit. "Nobody outside of the company, and not everyone within it, fully understands how much Laurie was the brains and the conscience of NPM," Jonathan Cowperthwait, former VP of marketing at NPM Inc, told The Register. CJ Silverio, a principal engineer at Eaze who served as NPM Inc's CTO said that it’s good that Voss is out but she wasn't sure whether his absence would matter much to the day-to-day operations of NPM Inc. Silverio was fired from NPM Inc late last year shortly after CEO Bryan Bogensberger’s arrival. “Bogensberger marginalized him almost immediately to get him out of the way, so the company itself probably won’t notice the departure," she said. "What should affect fundraising is the massive brain drain the company has experienced, with the entire CLI team now gone, and the registry team steadily departing. At some point they’ll have lost enough institutional knowledge quickly enough that even good new hires will struggle to figure out how to cope." Silverio also mentions that she had heard rumors of eliminating the public registry while only continuing with their paid enterprise service, which will be like killing their own competitive advantage. She says if the public registry disappears there are alternative projects like the one spearheaded by Silverio and a fellow developer Chris Dickinson, Entropic. Entropic is available under an open source Apache 2.0 license, Silverio says "You can depend on packages from any other Entropic instance, and your home instance will mirror all your dependencies for you so you remain self-sufficient." She added that the software will mirror any packages installed by a legacy package manager, which is to say npm. As a result, the more developers use Entropic, the less they'll need NPM Inc's platform to provide a list of available packages. Voss feels the scale of npm is 3x bigger than any other registry and boasts of an extremely fast growth rate i.e approx 8% month on month. "Creating a company to manage an open source commons creates some tensions and challenges is not a perfect solution, but it is better than any other solution I can think of, and none of the alternatives proposed have struck me as better or even close to equally good." he said. With  NPM Inc. sustainability at stake, the JavaScript community on Hacker News discussed alternatives in case the public registry comes to an end. One of the comments read, “If it's true that they want to kill the public registry, that means I may need to seriously investigate Entropic as an alternative. I almost feel like migrating away from the normal registry is an ethical issue now. What percentage of popular packages are available in Entropic? If someone else's repo is not in there, can I add it for them?” Another user responds, “The github registry may be another reasonable alternative... not to mention linking git hashes directly, but that has other issues.” Other than Entropic another alternative discussed is nixfromnpm, it is a tool in which you can translate NPM packages to Nix expression. nixfromnpm is developed by Allen Nelson and two other contributors from Chicago. Surprise NPM layoffs raise questions about the company culture Is the Npm 6.9.1 bug a symptom of the organization’s cultural problems? Npm Inc, after a third try, settles former employee claims, who were fired for being pro-union, The Register reports
Read more
  • 0
  • 0
  • 26117

article-image-using-qiskit-with-ibm-qx-to-generate-quantum-circuits-tutorial
Natasha Mathur
20 Apr 2019
5 min read
Save for later

Using Qiskit with IBM QX to generate quantum circuits [Tutorial]

Natasha Mathur
20 Apr 2019
5 min read
This tutorial expands on the idea of quantum gates to introduce quantum circuits, the quantum analog of classical circuits. It goes over how classical gates can be reproduced by quantum circuits and proceeds to introduce a visual representation of quantum circuits that can be used to easily define a quantum circuit without reference to mathematics or use of a programming language. In this tutorial, we will look at how to use Qiskkit to generate quantum circuits. The Jupyter Notebook for this tutorial is available under chapter 05 at Github. This tutorial is an excerpt taken from the book Mastering Quantum Computing with IBM QX written by Dr. Christine Corbett Moran. The book explores principles of quantum computing and the areas in which they can be applied. You'll also learn about the IBM Ecosystem and Qiskit. Note: Every classic bit, either 0 or 1, can be written as either a "0" or a "1" qubit, which to show that it is a qubit, is written as a name surrounded by | and >. So, for example, the qubit named "0" is written as |"0"> and the qubit named "1" is written as |"1">. Throughout this post, the qubits will always be written as names surrounded by quotation marks and | and > to indicate that they are, indeed, qubits. Qiskit is the Quantum Information Science Kit. It is an SDK for working with the IBM QX quantum processors. It also has a variety of tools to simulate a quantum computer in Python. In this tutorial, we are going to learn to use it to generate quantum circuits. Single-qubit circuits in Qiskit First, let's import the tools to create classical and quantum registers as well as quantum circuits from qiskit: from qiskit import QuantumCircuit, ClassicalRegister, QuantumRegister Next let's make the X |"0"> circuit using qiskit: qr = QuantumRegister(1) circuit = QuantumCircuit(qr) circuit.x(qr[0]) Note that the argument to QuantumRegister is 1; this indicates that the quantum register is to contain one qubit. The XH|"0"> circuit using qiskit becomes the following: qr = QuantumRegister(1) circuit = QuantumCircuit(qr) circuit.h(qr[0]) circuit.x(qr[0]) Qiskit's QuantumCircuit class and universal gate methods We can see that the QuantumCircuit class allows us to execute a variety of gates on particular qubits in the quantum register used to initial it. The full gate set is available in the QuantumCircuit documentation, but I will give the equivalent of the gates we have learned so far: Gate Qiskit QuantumCircuit class method name I iden X x Y y Z z H h S s S† sdg T t T† tdg CNOT cx Multiqubit gates in Qiskit Now suppose we want to use qiskit to construct a circuit for CNOT using |"+"> as the control qubit and |"0"> as the target qubit. We will need to create a quantum register to hold two qubits with qr = QuantumRegister(2). We will also need to give each qubit in the register as an argument to the cx method of the QuantumCircuit class. The first qubit argument to cx is the control qubit; the second is the target qubit. The code is as follows: qr = QuantumRegister(2) circuit = QuantumCircuit(qr) circuit.h(qr[0]) circuit.cx(qr[0],qr[1]) Classical registers in Qiskit circuit We can add a classical register to our quantum circuit. We will need a classical register to hold the output of a measurement. Here is an example of adding a classical register to the circuit for CNOT using |"+"> as the control qubit and |"0"> as the target qubit: qr = QuantumRegister(2) cr = ClassicalRegister(2) circuit = QuantumCircuit(qr, cr) circuit.h(qr[0]) circuit.cx(qr[0],qr[1]) Here we can see that just like creating an instance of the QuantumRegister class requires us to specify the length of the quantum register in qubits, creating an instance of the ClassicalRegister class requires us to specify the size of the classical register in bits. Here we can see that initializing a member of the QuantumCircuit class with a classical register means that we need to give the ClassicalRegister instance as a second argument to the QuantumCircuit constructor. Measurement in a Qiskit circuit Now that we have a circuit with a two-qubit quantum register and a two-qubit classical register, we can perform a measurement of all the qubits in the circuit with the measure method of the QuantumCircuit class. This method takes as input the quantum register to measure as well as the classical register in which to place the result. Here is an example: qr = QuantumRegister(2) cr = ClassicalRegister(2) circuit = QuantumCircuit(qr, cr) circuit.h(qr[0]) circuit.cx(qr[0],qr[1]) circuit.measure(qr, cr) Note that we can also decide to measure just an individual qubit, by specifying which qubit to measure and which bit to put the output result in the following: qr = QuantumRegister(2) cr = ClassicalRegister(2) circuit = QuantumCircuit(qr, cr) circuit.h(qr[0]) circuit.cx(qr[0],qr[1]) circuit.measure(qr[0], cr[0]) That's it. We learned how to use Qiskit to write python code to represent different quantum circuits namely, single-qubit circuits, Qiskit's QuantumCircuit class, and universal gate methods, Multiqubit gates in Qiskit, Classical registers in Qiskit circuit, and Measurement in a Qiskit circuit. If you want to learn other core concepts and principles of Quantum computing with IBM QX, be sure to check out Mastering Quantum Computing with IBM QX. IBM Q System One, IBM’s standalone quantum computer unveiled at CES 2019 IBM launches Industry's first 'Cybersecurity Operations Center on Wheels' for on-demand cybersecurity support Say hello to IBM RXN, a free AI Tool in IBM Cloud for predicting chemical reactions
Read more
  • 0
  • 0
  • 26110

article-image-understanding-drivers
Packt
04 May 2016
7 min read
Save for later

Understanding Drivers

Packt
04 May 2016
7 min read
In this article by Jeff Stokes and Manuel Singer, authors of the book Mastering the Microsoft Deployment Toolkit 2013, we will discuss how to utilize Microsoft Deployment Toolkit (MDT) to make the complex world of device drivers into a much more manageable experience. We will focus on how drivers get installed via MDT, how to specifically control the drivers that get installed, and general best practices around proper driver management. We will cover the following topics in this article: Understanding offline servicing The MDT method of driver detection and injection (For more resources related to this topic, see here.) Understanding offline servicing Those of us who created images for the deployment of Windows XP were often met with an enormous challenge of dealing with drivers for many different models of hardware. We were already forced to create separate images for different hardware abstraction layer (HAL) families. Additionally, in order to deal with different hardware models within the same HAL family, the standard practice was to usually have a folder called C:Drivers, which contained a copy of every possible driver that could be required by this image for all of the different hardware models it would be installed to. There was an OemPnPDriversPath entry in the registry that individually listed each of the driver paths (subfolders under the C:Drivers directory) for the Windows Plug and Play process to locate and install the driver. As you can imagine, this was not a very efficient way to manage drivers. One reason is that every driver for every machine was staged in the image, causing the image size to grow. Another reason being that we were relying on Plug and Play to figure out the right driver to install, which gives us less control of the driver that actually gets installed, based on a driver ranking process. Fast forward to Windows Vista and current versions of Windows, and we can now utilize the magic of offline servicing to inject drivers into our Windows Imaging Format (WIM) as it is getting deployed. With this in mind, consider the concept of having your customized Windows image created through your reference image build process, but it contains no drivers. Now, when we deploy this image, we can utilize a process to detect all the hardware in the target machine, and then grab only the correct drivers that we need for this particular machine. Then, we can utilize Deployment Image Servicing and Management (DISM) to inject them into our WIM before the WIM actually gets installed, therefore, making the drivers available to be installed as Windows is installed on this machine. MDT is doing just that. The MDT method of driver detection and injection When we boot a target machine via our Lite Touch media, one of the initial task sequence steps will enumerate (via PnpEnum) all the PNP IDs for every device in the machine. Then, as part of the inject drivers task sequence step, we will search all of our Out-of-Box driver INF files to find the matching driver, then MDT will utilize DISM to inject these drivers offline into the WIM. Note that, by default, we will be searching our entire Out-of-Box repository and letting PnP figure things out. We can force MDT to only choose from drivers that we specify, therefore, gaining strict control over which drivers actually get installed. The preceding scenario indicates that this whole process hinges on the fact that we are searching through driver INF files to find matching PNP IDs in order to correctly detect and install the correct driver. This brings up a concern: what if the driver does not contain an INF file, but rather it simply has to be installed via an EXE program? In this scenario, we cannot utilize the driver injection process. Instead, we would treat that driver as an application in MDT, meaning we would add a new application, using the EXE program as the source files, specifying the command-line syntax to launch the driver install program and install silently, and then adding this application as a task sequence step. I will later demonstrate how to utilize conditional statements in your task sequence to only install that driver program on the model that it applies to; therefore, keeping our task sequence flexible to be able to install correctly on any hardware. Populating the Out-of-Box Drivers node of MDT The first step will be to visit the OEM Manufacturer’s website and download all the device drivers for each model machine that we will be deploying to. Note that many OEMs now offer a deployment-specific download or CAB file that has all the drivers for a particular model compressed into one single CAB file. This benefits you as you will not have to go through the hassle of downloading and extracting each individual driver for each device separately (NIC, video, audio, and so on). Once you download the necessary drivers, store them in a folder for each specific model, as you will need to extract the drivers within your folder before importing them into MDT. Next, we want to create a folder structure under the Out-of-Box Drivers node in MDT to organize our drivers. This will not only allow easy manageability of drivers, as new drivers are released by the OEM; but if we name the folders to match the model names exactly, we can later introduce logic to limit our PnP search to the exact folder that contains the correct drivers for our particular hardware model. As we will have different drivers for x86 and x64, as well as for different operating systems, a general best practice would be to create the first hierarchy of your folder structure. Perform the following steps to populate the node in MDT: In order to create the folder structure, simply click on Out-Of-Box-Drivers and choose New Folder, as shown in the following screenshot: Next, we will want to create a folder for each model that we will be deploying to: In order to ensure that you are using the correct model name, you can use the following WMI query to see what the hardware returns as the model name: Once you have your folder structure created, you are ready to inject the drivers. Right-click on the model folder and choose Import Drivers. Point the driver source directory to the folder, where you have downloaded and extracted the OEM drivers: There is a checkbox stating Import drivers even if they are duplicates of an existing driver. This is because MDT is utilizing the single instance storage technology to store the drivers in the actual deployment share. If you are importing multiple copies of a drivers to different folders, MDT only stores one copy of the file in the actual filesystem by default, and the folder structure you see within the MDT Workbench will be pointing duplicates to the same file in order to not waste space. As new drivers are released from the OEM, you can simply replace the drivers by going to the particular folder for this model, removing the old drivers, and importing the new drivers. Then, the next time you install your WIM in this model, you will be using the new drivers, and you won’t have to make any modifications or updates to your WIM. Summary In this article, we understood offline servicing, MDT method for driver detection and injection, and how to populate the Out-of-Box Drivers node of MDT. For more information related to MDT, refer to the following book by Packt Publishing: Mastering the Microsoft Deployment Toolkit 2013: https://www.packtpub.com/hardware-and-creative/mastering-microsoft-deployment-toolkit-2013 Resources for Article:   Further resources on this subject: The Configuration Manager Troubleshooting Toolkit [article] Social-Engineer Toolkit [article] Working with Entities in Google Web Toolkit 2 [article]
Read more
  • 0
  • 0
  • 26109
article-image-setting-intel-edison
Packt
21 Jun 2017
8 min read
Save for later

Setting up Intel Edison

Packt
21 Jun 2017
8 min read
In this article by Avirup Basu, the author of the book Intel Edison Projects, we will be covering the following topics: Setting up the Intel Edison Setting up the developer environment (For more resources related to this topic, see here.) In every Internet of Things(IoT) or robotics project, we have a controller that is the brain of the entire system. Similarly we have Intel Edison. The Intel Edison computing module comes in two different packages. One of which is a mini breakout board the other of which is an Arduino compatible board. One can use the board in its native state as well but in that case the person has to fabricate his/hers own expansion board. The Edison is basically a size of a SD card. Due to its tiny size, it's perfect for wearable devices. However it's capabilities makes it suitable for IoT application and above all, the powerful processing capability makes it suitable for robotics application. However we don't simply use the device in this state. We hook up the board with an expansion board. The expansion board provides the user with enough flexibility and compatibility for interfacing with other units. The Edison has an operating system that is running the entire system. It runs a Linux image. Thus, to setup your device, you initially need to configure your device both at the hardware and at software level. Initial hardware setup We'll concentrate on the Edison package that comes with an Arduino expansion board. Initially you will get two different pieces: The Intel® Edison board The Arduino expansion board The following given is the architecture of the device: Architecture of Intel Edison. Picture Credits: https://software.intel.com/en-us/ We need to hook these two pieces up in a single unit. Place the Edison board on top of the expansion board such that the GPIO interfaces meet at a single point. Gently push the Edison against the expansion board. You will get a click sound. Use the screws that comes with the package to tighten the set up. Once, this is done, we'll now setup the device both at hardware level and software level to be used further. Following are the steps we'll cover in details: Downloading necessary software packages Connecting your Intel® Edison to your PC Flashing your device with the Linux image Connecting to a Wi-Fi network SSH-ing your Intel® Edison device Downloading necessary software packages To move forward with the development on this platform, we need to download and install a couple of software which includes the drivers and the IDEs. Following is the list of the software along with the links that are required: Intel® Platform Flash Tool Lite (https://01.org/android-ia/downloads/intel-platform-flash-tool-lite) PuTTY (http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html) Intel XDK for IoT (https://software.intel.com/en-us/intel-xdk) Arduino IDE (https://www.arduino.cc/en/Main/Software) FileZilla FTP client (https://filezilla-project.org/download.php) Notepad ++ or any other editor (https://notepad-plus-plus.org/download/v7.3.html) Drivers and miscellaneous downloads Latest Yocto* Poky image Windows standalone driver for Intel Edison FTDI drivers (http://www.ftdichip.com/Drivers/VCP.htm) The 1st and the 2nd packages can be downloaded from (https://software.intel.com/en-us/iot/hardware/edison/downloads) Plugging in your device After all the software and drivers installation, we'll now connect the device to a PC. You need two Micro-B USB Cables(s) to connect your device to the PC. You can also use a 9V power adapter and a single Micro-B USB Cable, but for now we will not use the power adapter: Different sections of Arduino expansion board of Intel Edison A small switch exists between the USB port and the OTG port. This switch must be towards the OTG port because we're going to power the device from the OTG port and not through the DC power port. Once it is connected to your PC, open your device manager and expands the ports section. If all installations of drivers were successful, then you must see two ports: Intel Edison virtual com port USB serial port Flashing your device Once your device is successfully detected an installed, you need to flash your device with the Linux image. For this we'll use the flash tool provided by Intel: Open the flash lite tool and connect your device to the PC: Intel phone flash lite tool Once the flash tool is opened, click on Browse... and browse to the .zip file of the Linux image you have downloaded. After you click on OK, the tool will automatically unzip the file. Next, click on Start to flash: Intel® Phone flash lite tool – stage 1 You will be asked to disconnect and reconnect your device. Do as the tool says and the board should start flashing. It may take some time before the flashing is completed. You are requested not to tamper with the device during the process. Once the flashing is completed, we'll now configure the device: Intel® Phone flash lite tool – complete Configuring the device After flashing is successfully we'll now configure the device. We're going to use the PuTTY console for the configuration. PuTTY is an SSH and telnet client, developed originally by Simon Tatham for the Windows platform. We're going to use the serial section here. Before opening PuTTY console: Open up the device manager and note the port number for USB serial port. This will be used in your PuTTY console: Ports for Intel® Edison in PuTTY Next select Serialon PuTTY console and enter the port number. Use a baud rate of 115200. Press Open to open the window for communicating with the device: PuTTY console – login screen Once you are in the console of PuTTY, then you can execute commands to configure your Edison. Following is the set of tasks we'll do in the console to configure the device: Provide your device a name Provide root password (SSH your device) Connect your device to Wi-Fi Initially when in the console, you will be asked to login. Type in root and press Enter. Once entered you will see root@edison which means that you are in the root directory: PuTTY console – login success Now, we are in the Linux Terminal of the device. Firstly, we'll enter the following command for setup: configure_edison –setup Press Enter after entering the command and the entire configuration will be somewhat straightforward: PuTTY console – set password Firstly, you will be asked to set a password. Type in a password and press Enter. You need to type in your password again for confirmation. Next, we'll set up a name for the device: PuTTY console – set name Give a name for your device. Please note that this is not the login name for your device. It's just an alias for your device. Also the name should be at-least 5 characters long. Once you entered the name, it will ask for confirmation press y to confirm. Then it will ask you to setup Wi-Fi. Again select y to continue. It's not mandatory to setup Wi-Fi, but it's recommended. We need the Wi-Fi for file transfer, downloading packages, and so on: PuTTY console – set Wi-Fi Once the scanning is completed, we'll get a list of available networks. Select the number corresponding to your network and press Enter. In this case it 5 which corresponds to avirup171which is my Wi-Fi. Enter the network credentials. After you do that, your device will get connected to the Wi-Fi. You should get an IP after your device is connected: PuTTY console – set Wi-Fi -2 After successful connection you should get this screen. Make sure your PC is connected to the same network. Open up the browser in your PC, and enter the IP address as mentioned in the console. You should get a screen similar to this: Wi-Fi setup – completed Now, we are done with the initial setup. However Wi-Fi setup normally doesn't happens in one go. Sometimes your device doesn't gets connected to the Wi-Fi and sometimes we cannot get this page as shown before. In those cases you need to start wpa_cli to manually configure the Wi-Fi. Refer to the following link for the details: http://www.intel.com/content/www/us/en/support/boards-and-kits/000006202.html Summary In this article, we have covered the areas of initial setup of Intel Edison and configuring it to the network. We have also covered how to transfer files to the Edison and vice versa. Resources for Article: Further resources on this subject: Getting Started with Intel Galileo [article] Creating Basic Artificial Intelligence [article] Using IntelliTrace to Diagnose Problems with a Hosted Service [article]
Read more
  • 0
  • 0
  • 26088

article-image-implementing-simple-time-series-data-analysis-r
Amarabha Banerjee
09 Feb 2018
4 min read
Save for later

Implementing a simple Time Series Data Analysis in R

Amarabha Banerjee
09 Feb 2018
4 min read
[box type="note" align="" class="" width=""]This article is extracted from the book Machine Learning with R written by Brett Lantz. This book will methodically take you through stages to apply machine learning for data analysis using R.[/box] In this article, we will explore the popular time series analysis method and its practical implementation using R. Introduction When we think about time, we think about years, days, months, hours, minutes, and seconds. Think of any datasets and you will find some attributes which will be in the form of time, especially data related to stock, sales, purchase, profit, and loss. All these have time associated with them. For example, the price of stock in the stock exchange at different points on a given day or month or year. Think of any industry domain, and sales are an important factor; you can see time series in sales, discounts, customers, and so on. Other domains include but are not limited to statistics, economics and budgets, processes and quality control, finance, weather forecasting, or any kind of forecasting, transport, logistics, astronomy, patient study, census analysis, and the list goes on. In simple words, it contains data or observations in time order, spaced at equal intervals. Time series analysis means finding the meaning in the time-related data to predict what will happen next or forecast trends on the basis of observed values. There are many methods to fit the time series, smooth the random variation, and get some insights from the dataset. When you look at time series data you can see the following: Trend: Long term increase or decrease in the observations or data. Pattern: Sudden spike in sales due to christmas or some other festivals, drug consumption increases due to some condition; this type of data has a fixed time duration and can be predicted for future time also. Cycle: Can be thought of as a pattern that is not fixed; it rises and falls without any pattern. Such time series involve a great fluctuation in data. How to do There are many datasets available with R that are of the time series types. Using the command class, one can know if the dataset is time series or not. We will look into the AirPassengers dataset that shows monthly air passengers in thousands from 1949 to 1960. We will also create new time series to represent the data. Perform the following commands in RStudio or R Console: > class(AirPassengers) Output: [1] "ts" > start(AirPassengers) Output: [1] 1949 1 > end(AirPassengers) Output: [1] 1960 12 > summary(AirPassengers) Output: Min. 1st Qu. Median Mean 3rd Qu. Max. 104.0 180.0 265.5 280.3 360.5 622.0 Analyzing Time Series Data [ 89 ] In the next recipe, we will create the time series and print it out. Let's think of the share price of some company in the range of 2,500 to 4,000 from 2011 to be recorded monthly. Perform the following coding in R: > my_vector = sample(2500:4000, 72, replace=T) > my_series = ts(my_vector, start=c(2011,1), end=c(2016,12), frequency = 12) > my_series Output: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec 2011 2888 3894 3675 3113 3421 3870 2644 2677 3392 2847 2543 3147 2012 2973 3538 3632 2695 3475 3971 2695 2963 3217 2836 3525 2895 2013 3984 3811 2902 3602 3812 3631 2625 3887 3601 2581 3645 3324 2014 3830 2821 3794 3942 3504 3526 3932 3246 3787 2894 2800 2732 2015 3326 3659 2993 2765 3881 3983 3813 3172 2667 3517 3445 2805 2016 3668 3948 2779 2881 3285 2733 3203 3329 3854 3285 3800 2563 How it works In the first recipe, we used the AirPassengers dataset, using the class function. We saw that it is ts (ts stands for time series). The start and end functions will give the starting year and ending year of the dataset with the values. The frequency function tells us the interval of observations; 1 means annually, 4 means quarterly, 12 means yearly, and so on. In the next recipe, we want to generate samples between 2,500 to 40,000 to represent the price of a share. Using a sample function, we can create a sample; it takes the range as the first argument, and the number of samples required as the second argument. The last argument decides whether duplication is to be allowed in the sample or not. We stored the sample in the my_vector. Now we create a time series using the ts function. The ts function takes the vector as an argument followed by the start and end to show the period for which the time series is being constructed. The frequency specifies the number of observations in the start and end to be recorded. 12. To summarize we talked about how R can be utilized to perform time series analysis in different ways. If you would like to learn more useful machine learning techniques in R, be sure to check out Machine Learning with R.      
Read more
  • 0
  • 0
  • 26086

article-image-how-to-work-with-langchain-python-modules
Avratanu Biswas
22 Jun 2023
13 min read
Save for later

How to work with LangChain Python modules

Avratanu Biswas
22 Jun 2023
13 min read
This article is the second part of a series of articles, please refer to Part 1 for learning how to Get to grips with LangChain framework and how to utilize it for building LLM-powered AppsIntroductionIn this section, we dive into the practical usage of LangChain modules. Building upon the previous overview of LangChain components, we will work within a Python environment to gain hands-on coding experience. However, it is important to note that this overview is not a substitute for the official documentation, and it is recommended to refer to the documentation for a more comprehensive understanding.Choosing the Right Python EnvironmentWhen working with Python, Jupyter Notebook and Google Colab are popular choices for quickly getting started in the Python environment. Additionally, Visual Studio Code (VSCode) Atom, PyCharm, or Sublime Text integrated with a conda environment are also excellent options. While many of these can be used, Google Colab is used here for its convenience in quick testing and code sharing. Find the code link here.PrerequisitesBefore we begin, make sure to install the necessary Python libraries. Use the pip command within a notebook cell to install them.Installing LangChain: In order to install the "LangChain" library, which is essential for this section, you can conveniently use the following command:!pip install langchainRegular Updates: Personally, I would recommend taking advantage of LangChain’s frequent releases by frequently upgrading the packages. Use the following command for this purpose:!pip install langchain  - -  upgradeIntegrating LangChain with LLMs: Previously, we discussed how the LangChain library facilitates interaction with Large Language Models (LLMs) provided by platforms such as OpenAI, Cohere, or HuggingFace. To integrate LangChain with these models, we need to follow these steps:Obtain API Keys: In this tutorial, we will use OpenAI. We need to sign up; to easily access the API keys for the various endpoints which Open AI provides. The key must be confidential. You can obtain the API via this link.Install Python Package: Install the required Python package associated with your chosen LLM provider. For OpenAI language models, execute the command:!pip install openaiConfiguring the API Key for OpenAI: To initialize the API key for the OpenAI library, we will use the getpass Python Library. Alternatively, you can set the API key as an environment variable.# Importing the library OPENAI_API_KEY = getpass.getpass() import getpass # In order to double check # print(OPENAI_API_KEY) # not recommendedRunning the above lines of code will create a secure text input widget where we can enter the API key, obtained for accessing OpenAI LLMs endpoints. After hitting enter, the inputted value will be stored as the assigned variable OPENAI_API_KEY, allowing it to be used for subsequent operations throughout our notebook.We will explore different LangChain modules in the section below:Prompt TemplateWe need to import the necessary module, PromptTemplate, from the langchain library. A multi-line string variable named template is created - representing the structure of the prompt and containing placeholders for the context, question, and answer which are the crucial aspects of any prompt template.Image by Author | Key components of a prompt template is shown in the figure. A PromptTemplate the object is instantiated using the template variable. The input_variables parameter is provided with a list containing the variable names used in the template, in this case, only the query.:from langchain import PromptTemplate template = """ You are a Scientific Chat Assistant. Your job is to answer scientific facts and evidence, in a bullet point wise. Context: Scientific evidence is necessary to validate claims, establish credibility, and make informed decisions based on objective and rigorous investigation. Question: {query} Answer: """ prompt = PromptTemplate(template=template, input_variables=["query"])The generated prompt structure can be further utilized to dynamically fill in the question placeholder and obtain responses within the specified template format. Let's print our entire prompt! print(prompt) lc_kwargs={'template': ' You are an Scientific Chat Assistant.\nYour job is to reply scientific facts and evidence in a bullet point wise.\n\nContext: Scientific evidence is necessary to validate claims, establish credibility, \nand make informed decisions based on objective and rigorous investigation.\n\nQuestion: {query}\n\nAnswer: \n', 'input_variables': ['query']} input_variables=['query'] output_parser=None partial_variables={} template=' You are an Scientific Chat Assistant.\nYour job is to reply scientific facts and evidence in a bullet point wise.\n\nContext: Scientific evidence is necessary to validate claims, establish credibility, \nand make informed decisions based on objective and rigorous investigation.\n\nQuestion: {query}\n\nAnswer: \n' template_format='f-string' validate_template=TrueChainsThe LangChain documentation covers various types of LLM chains, which can be effectively categorized into two main groups: Generic chains and Utility chains.Image 2: ChainsChains can be broadly classified into Generic Chains and Utility Chains. (a) Generic chains are designed to provide general-purpose language capabilities, such as generating text, answering questions, and engaging in natural language conversations by leveraging LLMs. On the other contrary, (b) Utility Chains: are specialized to perform specific tasks or provide targeted functionalities. These chains are fine-tuned and optimized for specific use cases. Note, although Index-related chains can be classified into a sub-group, here we keep such chains under the banner of utility chains. They are often considered to be very useful while working with Vector databases.Since this is the very first time we are running the LLM chain, we will walk through the code in detail.We need to import the OpenAI LLM module from langchain.llms and the LLMChain module from langchain Python package.Then, an instance of the OpenAI LLM is created, using the arguments such as temperature (affects the randomness of the generated responses), openai_api_key (the API key for OpenAI which we just assigned before), model (the specific OpenAI language model to be used - other models are available here), and streaming. Note the verbose argument is pretty useful to understand the abstraction that LangChain provides under the hood, while executing our query.Next, an instance of LLMChain is created, providing the prompt (the previously defined prompt template) and the LLM (the OpenAI LLM instance).The query or question is defined as the variable query.Finally, the llm_chain.run(query) line executes the LLMChain with the specified query, generating the response based on the defined prompt and the OpenAI LLM:# Importing the OpenAI LLM module from langchain.llms import OpenAI # Importing the LLMChain module from langchain import LLMChain # Creating an instance of the OpenAI LLM llm = OpenAI(temperature=0.9, openai_api_key=OPENAI_API_KEY, model="text-davinci-003", streaming=True) # Creating an instance of the LLMChain with the provided prompt and OpenAI LLM llm_chain = LLMChain(prompt=prompt,llm=llm, verbose=True) # Defining the query or question to be asked query = "What is photosynthesis?" # Running the LLMChain with the specified query print(llm_chain.run(query)) Let's have a look at the response that is generated after running the chain with and without verbose,a) with verbose = True;Prompt after formatting:You are an Scientific Chat Assistant. Your job is to reply scientific facts and evidence in a bullet point wise.Context: Scientific evidence is necessary to validate claims, establish credibility, and make informed decisions based on objective and rigorous investigation. Question: What is photosynthesis?Answer:> Finished chain.• Photosynthesis is the process used by plants, algae and certain bacteria to convert light energy from the sun into chemical energy in the form of sugars.• Photosynthesis occurs in two stages: the light reactions and the Calvin cycle. • During the light reactions, light energy is converted into ATP and NADPH molecules.• During the Calvin cycle, ATP and NADPH molecules are used to convert carbon dioxide into sugar molecules.  b ) with verbose = False;• Photosynthesis is a process used by plants and other organisms to convert light energy, normally from the sun, into chemical energy which can later be released to fuel the organisms' activities.• During photosynthesis, light energy is converted into chemical energy and stored in sugars.• Photosynthesis occurs in two stages: light reactions and the Calvin cycle. The light reactions trap light energy and convert it into chemical energy in the form of the energy-storage molecule ATP. The Calvin cycle uses ATP and other molecules to create glucose.Seems like our general-purpose LLMChain has done a pretty decent job and given a reasonable output by leveraging the LLM.Now let's move onto the utility chain and understand it, using a simple code snippet:from langchain import OpenAI from langchain import LLMMathChain llm = OpenAI(temperature=0.9,openai_api_key= OPENAI_API_KEY) # Using the LLMMath Chain / LLM defined in Prompt Template section llm_math = LLMMathChain.from_llm(llm = llm, verbose = True) question = "What is 4 times 5" llm_math.run(question) # You know what the response would be 🎈Here the utility chain serves a specific function, i.e. to solve a fundamental maths question using the LLMMathChain. It's crucial to look at the prompt used under the hood for such chains. However , in addition, a few more notable utility chains are there as well,BashChain: A utility chain designed to execute Bash commands and scripts.SQLDatabaseChain: This utility chain enables interaction with SQL databasesSummarizationChain: The SummarizationChain is designed specifically for text summarization tasks.Such utility chains, along with other available chains in the LangChain framework, provide specialized functionalities and ready-to-use tools that can be utilized to expedite and enhance various aspects of the language processing pipeline.MemoryUntil now, we have seen, each incoming query or input to the LLMs or to its subsequent chain is treated as an independent interaction, meaning it is "stateless" (in simpler terms, information IN, information OUT). This can be considered as one of the major drawbacks, as it hinders the ability to provide a seamless and natural conversational experience for users who are seeking reasonable responses further on. To overcome this limitation and enable better context retention, LangChain offers a broad spectrum of memory components that are extremely helpful.Image by Author | The various types of Memory modules that LangChain provides.By utilizing the memory components supported, it becomes possible to remember the context of the conversation, making it more coherent and intuitive. These memory components allow for the storage and retrieval of information, enabling the LLMs to have a sense of continuity. This means they can refer back to previous relevant contexts, which greatly enhances the conversational experience for users. A typical example of such memory-based interaction is the very popular chatbot - ChatGPT, which remembers the context of our conversations.Let's have a look at how we can leverage such a possibility using LangChain:from langchain.llms import OpenAI from langchain.chains import ConversationChain from langchain.memory import ConversationBufferMemory llm = OpenAI(temperature=0, openai_api_key= OPENAI_API_KEY) conversation = ConversationChain( llm=llm, verbose=True, memory = ConversationBufferMemory() ) In the above code, we have initialized an instance of the ConversationChain class, configuring it with the OpenAI language model, enabling verbose mode for detailed output, and utilizing a ConversationBufferMemory for memory management during conversations. Now, let's begin our conversation,conversation.predict(input="Hi there!I'm Avra") Prompt after formatting:The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.Current conversation:Human: Hi there! I'm AvraAI:> Finished chain.' Hi, Avra! It's nice to meet you. My name is AI. What can I do for you today?Let's add a few more contexts to the chain, so that later we can test the context memory of the chain.conversation.predict(input="I'm interested in soccer and building AI web apps.")Prompt after formatting:The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.Current conversation:Human: Hi there!I'm AvraAI:  Hi Avra! It's nice to meet you. My name is AI. What can I do for you today?Human: I'm interested in soccer and building AI web apps.AI:> Finished chain.' That's great! Soccer is a great sport and AI web apps are a great way to explore the possibilities of artificial intelligence. Do you have any specific questions about either of those topics?Now, we make a query, which requires the chain to trace back to its memory storage and provide a reasonable response based on it.conversation.predict(input="Who am I and what's my interest ?")Prompt after formatting:The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know. Current conversation:Human: Hi there!I'm AvraAI:  Hi Avra! It's nice to meet you. My name is AI. What can I do for you today?Human: I'm interested in soccer and building AI web apps.AI:  That's great! Soccer is a great sport and AI web apps are a great way to explore the possibilities of artificial intelligence. Do you have any specific questions about either of those topics?Human: Who am I and what's my interest ?AI:> Finished chain.' That's a difficult question to answer. I don't have enough information to answer that question. However, based on what you've told me, it seems like you are Avra and your interests are soccer and building AI web apps.The above response highlights the significance of the ConversationBufferMemory chain in retaining the context of the conversation. It would be worthwhile to try out the above example without a buffer memory to get a clear perspective of the importance of the memory module. Additionally, LangChain provides several memory modules that can enhance our understanding of memory management in different ways, to handle conversational contexts.Moving forward, we will delve into the next section, where we will focus on the final two components called the “Indexes” and the "Agent." During this section, we will not only gain a hands-on understanding of its usage but also build and deploy a web app using an online workspace called Databutton.ReferencesLangChain Official Docs - https://python.langchain.com/en/latest/index.htmlCode available for this section here (Google Collab) - https://colab.research.google.com/drive/1_SpAvehzfbYYdDRnhU6v9-KHwIHMC1yj?usp=sharingPart 1: Using LangChain for Large Language Model — powered Applications : https://www.packtpub.com/article-hub/using-langchain-for-large-language-model-powered-applicationsPart 3 : Building and deploying Web App using LangChain <Insert Link>How to build a Chatbot with ChatGPT API and a Conversational Memory in Python: https://medium.com/@avra42/how-to-build-a-chatbot-with-chatgpt-api-and-a-conversational-memory-in-python-8d856cda4542Databutton - https://www.databutton.io/Author BioAvratanu Biswas, Ph.D. Student ( Biophysics ), Educator, and Content Creator, ( Data Science, ML & AI ).Twitter    YouTube    Medium     GitHub
Read more
  • 0
  • 0
  • 26077
article-image-efficient-llm-querying-with-lmql
Alan Bernardo Palacio
12 Sep 2023
14 min read
Save for later

Efficient LLM Querying with LMQL

Alan Bernardo Palacio
12 Sep 2023
14 min read
IntroductionIn the world of natural language processing, Large Language Models (LLMs) have proven to be highly successful at a variety of language-based tasks, such as machine translation, text summarization, question answering, reasoning, and code generation. LLMs like ChatGPT, GPT-4, and others have demonstrated outstanding performance by predicting the next token in a sequence based on input prompts. Users interact with these models by providing language instructions or examples to perform various downstream tasks. However, to achieve optimal results or adapt LLMs for specific tasks, complex and task-specific programs must be implemented, often requiring ad-hoc interactions and deep knowledge of the model's internals.In this article, we discuss LMQL, a framework for Language Model Programming (LMP), that allows users to specify complex interactions, control flow, and constraints without needing deep knowledge of the LLM's internals using a declarative programming language similar to SQL. LMQL supports high-level, logical constraints and users can express a wide range of prompting techniques concisely, reducing the need for ad-hoc interactions and manual work to steer model generation, avoiding costly re-querying, and guiding the text generation process according to their specific criteria. Let’s start.Overview of Large Language ModelsLanguage models (LMs) operate on sequences of tokens, where tokens are discrete elements that represent words or sub-words in a text. The process involves using a tokenizer to map input words to tokens, and then a language model predicts the probabilities of possible next tokens based on the input sequence. Various decoding methods are used in the LMs to output the right sequence of tokens from the language model's predictions out of which we can name:Decoding Methods:Greedy decoding: Select the token with the highest probability at each step.Sampling: Randomly sampling tokens based on the predicted probabilities.Full decoding: Enumerating all possible sequences and selecting the one with the highest probability (computationally expensive).Beam search: Maintaining a set of candidate sequences and refining them by predicting the next token.Masked Decoding: In some cases, certain tokens can be ruled out based on a mask that indicates which tokens are viable. Decoding is then performed on the remaining set of tokens.Few-Shot Prompting: LMs can be trained on broad text-sequence prediction datasets and then provided with context in the form of examples for specific tasks. This approach allows LMs to perform downstream tasks without task-specific training.Multi-Part Prompting: LMs are used not only for simple prompt completion but also as reasoning engines integrated into larger programs. Various LM programming schemes explore compositional reasoning, such as iterated decompositions, meta prompting, tool use, and composition of multiple prompts.It is also important to name that for beam searching and sampling there is a parameter named temperature which we can use to control the diversity of the output.These techniques enable LMs to be versatile and perform a wide range of tasks without requiring task-specific training, making them powerful multi-task reasoners.Asking the Right QuestionsWhile LLMs can be prompted with examples or instructions, using them effectively and adapting to new models often demands a deep understanding of their internal workings, along with the use of vendor-specific libraries and implementations. Constrained decoding to limit text generation to legal words or phrases can be challenging. Many advanced prompting methods require complex interactions and control flows between the LLM and the user, leading to manual work and restricting the generality of implementations. Additionally, generating complete sequences from LLMs may require multiple calls and become computationally expensive, resulting in high usage costs per query in pay-to-use APIs. Generally, the challenges that can associated with creating proper promts for LLMs are:Interaction Challenge: One challenge in LM interaction is the need for multiple manual interactions during the decoding process. For example, in meta prompting, where the language model is asked to expand the prompt and then provide an answer, the current approach requires inputting the prompt partially, invoking the LM, extracting information, and manually completing the sequence. This manual process may involve human intervention or several API calls, making joint optimization of template parameters difficult and limiting automated optimization possibilities.Constraints & Token Representation: Another issue arises when considering completions generated by LMs. Sometimes, LMs may produce long, ongoing sequences of text that do not adhere to desired constraints or output formats. Users often have specific constraints for the generated text, which may be violated by the LM. Expressing these constraints in terms of human-understandable concepts and logic is challenging, and existing methods require considerable manual implementation effort and model-level understanding of decoding procedures, tokenization, and vocabulary.Efficiency and Cost Challenge: Efficiency and performance remain significant challenges in LM usage. While efforts have been made to improve the inference step in modern LMs, they still demand high-end GPUs for reasonable performance. This makes practical usage costly, particularly when relying on hosted models running in the cloud with paid APIs. The computational and financial expenses associated with frequent LM querying can become prohibitive.Addressing these challenges, Language Model Programming and constraints offer new optimization opportunities. By defining behavior and limiting the search space, the number of LM invocations can be reduced. In this context, the cost of validation, parsing, and mask generation becomes negligible compared to the significant cost of a single LM call.So the question arises, how can we overcome the challenges of implementing complex interactions and constraints with LLMs while reducing computational costs and retaining or improving accuracy on downstream tasks?Introducing LMQLTo address these challenges and enhance language model programming, a team of researchers has introduced LMQL (Language Model Query Language). LMQL is an open-source programming language and platform for LLM interaction that combines prompts, constraints, and scripting. It is designed to elevate the capabilities of LLMs like ChatGPT, GPT-4, and any future models, offering a declarative, SQL-like approach based on Python.LMQL enables Language Model Programming (LMP), a novel paradigm that extends traditional natural language prompting by allowing lightweight scripting and output constraining. This separation of front-end and back-end interaction allows users to specify complex interactions, control flow, and constraints without needing deep knowledge of the LLM's internals. This approach abstracts away tokenization, implementation, and architecture details, making it more portable and easier to use across different LLMs.With LMQL, users can express a wide range of prompting techniques concisely, reducing the need for ad-hoc interactions and manual work. The language supports high-level, logical constraints, enabling users to steer model generation and avoid costly re-querying and validation. By guiding the text generation process according to specific criteria, users can achieve the desired output with fewer iterations and improved efficiency.Moreover, LMQL leverages evaluation semantics to automatically generate token masks for LM decoding based on user-specified constraints. This optimization reduces inference cost by up to 80%, resulting in significant latency reduction and lower computational expenses, particularly beneficial for pay-to-use APIs.LMQL ddresses certain challenges in LM interaction and usage which are namely.Overcoming Manual Interaction: LMQL simplifies the prompt and eliminates the need for manual interaction during the decoding process. It achieves this by allowing the use of variables, represented within square brackets, which store the answers obtained from the language model. These variables can be referenced later in the query, avoiding the need for manual extraction and input. By employing LMQL syntax, the interaction process becomes more automated and efficient.Constraints on Variable Parts: To address issues related to long and irrelevant outputs, LMQL introduces constraints on the variable parts of LM interaction. These constraints allow users to specify word and phrase limitations for the generated text. LMQL ensures that the decoded tokens for variables meet these constraints during the decoding process. This provides more control over the generated output and ensures that it adheres to user-defined restrictions.Generalization of Multi-Part Prompting: Language Model Programming through LMQL generalizes various multi-part prompting approaches discussed earlier. It streamlines the process of trying different values for variables by automating the selection process. Users can set constraints on variables, which are then applied to multiple inputs without any human intervention. Once developed and tested, an LMQL query can be easily applied to different inputs in an unsupervised manner, eliminating the need for manual trial and error.Efficient Execution: LMQL offers efficiency benefits over manual interaction. The constraints and scripting capabilities in LMQL are applied eagerly during decoding, reducing the number of times the LM needs to be invoked. This optimized approach results in notable time and cost savings, especially when using hosted models in cloud environments.The LMQL syntax involves components such as the decoder, the actual query, the model to query, and the constraints. The decoder specifies the decoding procedure, which can include argmax, sample, or beam search. LMQL allows for constraints on the generated text using Python syntax, making it more user-friendly and easily understandable. Additionally, the distribution instruction allows users to augment the returned result with probability distributions, which is useful for tasks like sentiment analysis.Using LMQL with PythonLMQL can be utilized in various ways - as a standalone language, in the Playground, or even as a Python library being the latter what we will demonstrate now. Integrating LMQL into Python projects allows users to streamline their code and incorporate LMQL queries seamlessly. Let's explore how to use LMQL as a Python library and understand some examples.To begin, make sure you have LMQL and LangChain installed by running the following command:!pip install lmql==0.0.6.6 langchain==0.0.225You can then define and execute LMQL queries within Python using a simple approach. Decorate a Python function with the lmql.query decorator, providing the query code as a multi-line string. The decorated function will automatically be compiled into an LMQL query. The return value of the decorated function will be the result of the LMQL query.Here's an example code snippet demonstrating this:import lmql import aiohttp import os os.environ['OPENAI_API_KEY'] = '<your-openai-key>' @lmql.query async def hello():    '''lmql    argmax        "Hello[WHO]"    from        "openai/text-ada-001"    where        len(TOKENS(WHO)) < 10    ''' print(await hello())LMQL provides a fully asynchronous API that enables running multiple LMQL queries in parallel. By declaring functions as async with @lmql.query, you can use await to execute the queries concurrently.The code below demonstrates how to look up information from Wikipedia and incorporate it into an LMQL prompt dynamically:async def look_up(term):    # Looks up term on Wikipedia    url = f"<https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro&explaintext&redirects=1&titles={term}&origin=*>"    async with aiohttp.ClientSession() as session:        async with session.get(url) as response:            # Get the first sentence on the first page            page = (await response.json())["query"]["pages"]            return list(page.values())[0]["extract"].split(".")[0] @lmql.query async def greet(term):    '''    argmax        """Greet {term} ({await look_up(term)}):        Hello[WHO]        """    from        "openai/text-davinci-003"    where        STOPS_AT(WHO, "\\n")    ''' print((await greet("Earth"))[0].prompt)As an alternative to @lmql.query you can use lmql.query(...) as a function that compiles a provided string of LMQL code into a Python function.q = lmql.query('argmax "Hello[WHO]" from "openai/text-ada-001" where len(TOKENS(WHO)) < 10') await q()LMQL queries can also be easily integrated into langchain's Chain components. This allows for sequential prompting using multiple queries.pythonCopy code from langchain import LLMChain, PromptTemplate from langchain.chat_models import ChatOpenAI from langchain.prompts.chat import (ChatPromptTemplate, HumanMessagePromptTemplate) from langchain.llms import OpenAI # Setup the LM to be used by langchain llm = OpenAI(temperature=0.9) human_message_prompt = HumanMessagePromptTemplate(    prompt=PromptTemplate(        template="What is a good name for a company that makes {product}?",        input_variables=["product"],    ) ) chat_prompt_template = ChatPromptTemplate.from_messages([human_message_prompt]) chat = ChatOpenAI(temperature=0.9) chain = LLMChain(llm=chat, prompt=chat_prompt_template) # Run the chain chain.run("colorful socks")Lastly, by treating LMQL queries as Python functions, you can easily build pipelines by chaining functions together. Furthermore, the guaranteed output format of LMQL queries ensures ease of processing the returned values using data processing libraries like Pandas.Here's an example of processing the output of an LMQL query with Pandas:pythonCopy code import pandas as pd @lmql.query async def generate_dogs(n: int):    '''lmql    sample(n=n)        """Generate a dog with the following characteristics:        Name:[NAME]        Age: [AGE]        Breed:[BREED]        Quirky Move:[MOVE]        """    from        "openai/text-davinci-003"    where        STOPS_BEFORE(NAME, "\\n") and STOPS_BEFORE(BREED, "\\n") and        STOPS_BEFORE(MOVE, "\\n") and INT(AGE) and len(AGE) < 3    ''' result = await generate_dogs(8) df = pd.DataFrame([r.variables for r in result]) dfBy employing LMQL as a Python library, users can make their code more efficient and structured, allowing for easier integration with other Python libraries and tools.LMQL can be used in various ways - as a standalone language, in the Playground, or even as a Python library. When integrated into Python projects, LMQL queries can be executed seamlessly. Below, we provide a brief overview of using LMQL as a Python library.ConclusionLMQL introduces an efficient and powerful approach to interact with language models, revolutionizing language model programming. By combining prompts, constraints, and scripting, LMQL offers a user-friendly interface for working with large language models, significantly improving efficiency and accuracy across diverse tasks. Its capabilities allow developers to leverage the full potential of language models without the burden of complex implementations, making language model interaction more accessible and cost-effective.With LMQL, users can overcome challenges in LM interaction, including manual interactions, constraints on variable parts, and generalization of multi-part prompting. By automating the selection process and eager application of constraints during decoding, LMQL reduces the number of LM invocations, resulting in substantial time and cost savings. Moreover, LMQL's declarative, SQL-like approach simplifies the development process and abstracts away tokenization and implementation details, making it more portable and user-friendly.In conclusion, LMQL represents a promising advancement in the realm of large language models and language model programming. Its efficiency, flexibility, and ease of use open up new possibilities for creating complex interactions and steering model generation without deep knowledge of the model's internals. By embracing LMQL, developers can make the most of language models, unleashing their potential across a wide range of language-based tasks with heightened efficiency and reduced computational costs.Author BioAlan Bernardo Palacio is a data scientist and an engineer with vast experience in different engineering fields. His focus has been the development and application of state-of-the-art data products and algorithms in several industries. He has worked for companies such as Ernst and Young, and Globant, and now holds a data engineer position at Ebiquity Media helping the company to create a scalable data pipeline. Alan graduated with a Mechanical Engineering degree from the National University of Tucuman in 2015, participated as the founder of startups, and later on earned a Master's degree from the faculty of Mathematics at the Autonomous University of Barcelona in 2017. Originally from Argentina, he now works and resides in the Netherlands.LinkedIn
Read more
  • 0
  • 0
  • 26048

article-image-creating-views-in-odoo-12-list-form-search-tutorial
Sugandha Lahoti
02 Feb 2019
10 min read
Save for later

Creating views in Odoo 12 - List, Form, Search [Tutorial]

Sugandha Lahoti
02 Feb 2019
10 min read
Odoo provides a rapid application development framework that's particularly suited to building business applications. This type of application is usually concerned with keeping business records, centered around create, read, update, and delete (CRUD) operations. Not only does Odoo makes it easy to build this type of application, but it also provides rich components to create compelling user interfaces, such as kanban, calendar, and graph views. In this tutorial, we will create list, form, and search views, the basic building blocks for the user interface. This article is taken from the book Odoo 12 Development Essentials by Daniel Reis. This book will tecah you to build a business application from scratch by using  Odoo 12. Technical requirements The minimal requirement is for you to have a modern web browser, such as Firefox, Chrome, or Edge. You may go a little further and use a packaged Odoo distribution to have it locally installed on your computer. For that, you only need an operating system such as Windows, macOS, Debian-based Linux (such as Ubuntu), or Red Hat-based Linux (such as Fedora). Windows, Debian, and Red Hat have installation packages available. Another option is to use Docker, available for all these systems and for macOS. In this article, we will mostly have point-and-click interaction with the user interface. You will find the code snippets used and a summary of the steps performed in the book's code repository, under the ch01 folder. It's important to note that Odoo databases are incompatible between Odoo major versions. If you run an Odoo 11 server against a database created for a previous major version of Odoo, it won't work. Non-trivial migration work is needed before a database can be used with a later version of the product. The same is true for add-on modules: as a general rule, an add-on module developed for an Odoo major version will not work on other versions. When downloading a community module from the web, make sure it targets the Odoo version you are using. On the other hand, major releases (10.0, 11.0) are expected to receive frequent updates, but these should be mostly bug fixes. They are assured to be API-stable, meaning that model data structures and view element identifiers will remain stable. This is important because it means there will be no risk of custom modules breaking due to incompatible changes in the upstream core modules. Creating a new Model Models are the basic components for applications, providing the data structures and storage to be used. We will create the Model for To-do Items. It will have three fields: Description  Is done? flag Work team partner list Model definitions are accessed in the Settings app, in the Technical | Database Structure | Models menu. To create a Model, follow these steps: Visit the Models menu, and click on the upper-left Create button. Fill in the new Model form with these values: Model Description: To-do Item Model: x_todo_item We should save it before we can properly add new fields to it. So, click on Save and then Edit it again. You can see that a few fields were automatically added. The ORM includes them in all Models, and they can be useful for audit purposes: The x_name (or Name) field is a title representing the record in lists or when it is referenced in other records. It makes sense to use it for the To-do Item title. You may edit it and change the Field Label to a more meaningful label description. Adding the Is Done? flag to the Model should be straightforward now. In the Fields list, click on Add a line, at the bottom of the list, to create a new field with these values: Field Name: x_is_done Field Label: Is Done? Field Type: boolean The new Fields form should look like this: Now, something a little more challenging is to add the Work Team selection. Not only it is a relation field, referring to a record in the res.partner Model, it also is a multiple-value selection field. In many frameworks, this is not a trivial task, but fortunately, that's not the case in Odoo, because it supports many-to-many relations. This is the case because one to-do can have many people, and each person can participate in many to-do items. In the Fields list, click again on Add a line to create the new field: Field Name: x_work_team_ids Field Label: Work Team Field Type: many2many Object Relation: res.partner Domain: [('x_is_work_team', '=', True)] The many-to-many field has a few specific definitions—Relation Table, Column 1, and Column 2 fields. These are automatically filled out for you and the defaults are good for most cases, so we don't need to worry about them now. The domain attribute is optional, but we used it so that only eligible work team members are selectable from the list. Otherwise, all partners would be available for selection. The Domain expression defines a filter for the records to be presented. It follows an Odoo-specific syntax—it is a list of triplets, where each triplet is a filter condition, indicating the Field Name to filter, the filter operator to use, and the value to filter against. Odoo has an interactive domain filter wizard that can be used as a helper to generate Domain expressions. You can use it at Settings | User Interface | User-defined Filters. Once a target Model is selected in the form, the Domain field will display an add filter button, which can be used to add filter conditions, and the text box below it will dynamically show the corresponding Domain expression code. Creating views We have created the To-do Items Model. Next, we will be creating the two essential views for it—a list (also called a tree) and a form. List views We will now create a list view: In Settings, navigate to Technical | User Interface | Views and create a new record with the following values: View Name: To-do List View View Type: Tree Model: x_todo_item This is how the View definition is expected to look like: In the Architecture tab, we should write XML with the view structure. Use the following XML code: <tree> <field name="x_name" /> <field name="x_is_done" /> </tree> The basic structure of a list view is quite simple—a <tree> element containing one or more <field> elements for each of the columns to display in the list view. Form views Next, we will create the form view: Create another View record, using the following values: View Name: To-do Form View View Type: Form Model: x_todo_item If we don't specify the View Type, it will be auto-detected from the view definition. In the Architecture tab, type the following XML code: <form> <group> <field name="x_name" /> <field name="x_is_done" /> <field name="x_work_team_ids" widget="many2many_tags" context="{'default_x_is_work_team': True}" /> </group> </form> The form view structure has a root <form> element, containing elements such as <field>, Here, we also chose a specific widget for the work team field, to be displayed as tag buttons instead of a list grid. We added the widget attribute to the Work Team field, to have the team members presented as button-like tags. By default, relational fields allow you to directly create a new record to be used in the relationship. This means that we are allowed to create new Partner directly from the Work Team field. But if we do so, they won't have the Is Work Team? flag enabled, which can cause inconsistencies. For better user experience, we can have this flag set by default for these cases. This is done with the context attribute, used to pass session information to the next View, such as default values to be used. This will be discussed in detail in later chapters, and for now, we just need to know that it is a dictionary of key-value pairs. Values prefixed with default_ provide the default value for the corresponding field. So in our case, the expression needed to set a default value for the partner's Is Work Team? flag is {'default_x_is_work_team': True}. That's it. If we now try the To-Do menu option, and create a new item or open an existing one from the list, we will see the form we just added. Search views We can also make predefined filter and grouping options available, in the search box in the upper-right corner of the list view. Odoo considers these view elements also, and so they are defined in Views records, just like lists and forms are. As you may already know by now, Views can be edited either in the Settings | Technical | User Interface menu, or from the contextual Developer Tools menu. Let's go for the latter now; navigate to the to-do list, click on the Developer Tools icon in the upper-right corner, and select Edit Search view from the available options: Since no search view is yet defined for the To-do Items Model, we will see an empty form, inviting us to create the first one. Fill in these values and save it: View Name: Some meaningful description, such as To-do Items Filter View Type: Search Model: x_todo_item Architecture: Add this XML code: <search> <filter name="item_not_done" string="Not Done" domain="[('x_is_done', '=', False)]" /> </search> If we now open the to-do list from the menu, so that it is reloaded, we will see that our predefined filter is now available from the Filters button below the search box. If we type Not Done inside the search box, it will also show a suggested selection. It would be nice to have this filter enabled by default and disable it when needed. Just like default field values, we can also use context to set default filters. When we click on the To-do menu option, it runs a Window Actions to open the To-do list view. This Window Actions can set a context value, signaling the Views to enable a search filter by default. Let's try this: Click on the To-do menu option to go to the To-do list. Click on the Developer Tools icon and select the Edit Action option. This will open the Window Actions used to open the current Views. In the lower-right corner, there is a Filter section, where we have the Domain and Context fields. The Domain allows setting a fixed filter on the records shown, which can't be removed by the user. We don't want to use that. Instead, we want to enable the item_not_done filter created before by default, which can be deselected whenever the user wishes to. To enable a filter by default, add a context key with its name prefixed with search_default_, in this case {'search_default_item_not_done': True}. If we click on the To-do menu option now, we should see the Not Done filter enabled by default on the search box. In this article, we created create list, form, and search views, the basic building blocks for the user interface for our model. To learn more about Odoo development in depth, read our book Odoo 12 Development Essentials. “Everybody can benefit from adopting Odoo, whether you’re a small start-up or a giant tech company - An interview by Yenthe van Ginneken. Implement an effective CRM system in Odoo 11 [Tutorial] Handle Odoo application data with ORM API [Tutorial]
Read more
  • 0
  • 0
  • 26034
Modal Close icon
Modal Close icon