How-To Tutorials - Data

1210 Articles

How to Debug an application using Qt Creator

Gebin George
27 Apr 2018
9 min read
Today, we will learn about debugging an application using Qt Creator. A debugger is a program that can be used to test and debug other programs in case of a sudden crash during program execution or unexpected behavior in the program's logic. Most of the time (if not always), debuggers are used in the development environment and in conjunction with an IDE. In our case, we will learn how to use a debugger with Qt Creator. It is important to note that debuggers are not part of the Qt Framework and, just like compilers, they are usually provided by the operating system SDK. Qt Creator automatically detects and uses debuggers if they are present on a system.

This can be checked by navigating to the Qt Creator Options page via the main menu (Tools, then Options). Make sure to select Build & Run from the list on the left side and then switch to the Debuggers tab at the top. You should be able to see one or more autodetected debuggers on the list.

Note (Windows users): You should see something similar to the screenshot after this information box. If not, it means you have not installed any debuggers. You can easily download and install one using the instructions provided at https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/, or you can search online for the topic Debugging Tools for Windows (WinDbg, KD, CDB, NTSD). After the debugger is installed (typically CDB, the Microsoft Console Debugger, for Microsoft Visual C++ compilers, and GDB for GCC compilers), you can restart Qt Creator and return to this page. You should see one or more entries similar to the following. Since we have installed a 32-bit version of the Qt and OpenCV frameworks, choose the entry with x86 in its name to view its path, type, and other properties.

Note (macOS and Linux users): There shouldn't be any action needed on your part and, depending on the OS, you'll see GDB, LLDB, or some other debugger in the entries.

Here's the screenshot of the Build & Run tab on the Options page. Depending on the operating system and the installed debugger, the preceding screenshot might be slightly different. Nevertheless, you'll have a debugger that you need to make sure is correctly set as the debugger for the Qt Kit you are using. So, make a note of the debugger path and name, switch to the Kits tab and, after selecting the Qt Kit you were using, make sure the debugger for it is correctly set, as you can see in the following screenshot.

Don't worry about choosing the wrong debugger, or any other option, since you'll be warned with relevant icons beside the selected Qt Kit icon at the top. The icon seen on the left side of the following image is usually displayed when everything is okay with the kit, the second one from the left indicates that something is not right, and the one on the right means a critical error. Move your mouse over the icon when it appears to see more information about the actions required to fix the issue.

Note: Critical issues with Qt Kits can be caused by many different factors, such as a missing compiler, which will make the kit completely useless until the issue is resolved.
An example of a warning in a Qt Kit would be a missing debugger, which does not make the kit useless, but you won't be able to use the debugger with it, so it offers less functionality than a completely configured Qt Kit.

After the debugger is correctly set, you can start debugging your applications in one of the following ways, both of which have the same result: ending up in the Debugger view of Qt Creator:

- Starting an application in debugging mode
- Attaching to a running application (or process)

Note: A debugging process can be started in many other ways, such as remotely, by attaching to a process running on a separate machine, and so on. However, the preceding methods will suffice for most cases, especially the ones relevant to the Qt+OpenCV application development we learned throughout this book.

Getting started with the debugging mode

To start an application in debugging mode, after opening a Qt project, you can use one of the following methods:

- Pressing the F5 key
- Using the Start Debugging button, right below the usual Run button, with a similar icon but with a small bug on it
- Using the main menu entries in the following order: Debug/Start Debugging/Start Debugging

To attach the debugger to a running application, use the main menu entries in the following order: Debug/Start Debugging/Attach to Running Application. This opens the List of Processes window, from which you can choose your application or any other process you want to debug using its process ID or executable name. You can also use the Filter field (as seen in the following image) to find your application, since the list of processes will most probably be quite long. After choosing the correct process, make sure to press the Attach to Process button.

No matter which of the preceding methods you use, you will end up in the Qt Creator Debug mode, which is quite similar to the Edit mode, but it also allows you to do the following, among many other things:

- Add, enable, disable, and view breakpoints in the code (a breakpoint is simply a point or a line in the code where we want the debugger to pause the process and allow us to do a more detailed analysis of the status of the program)
- Interrupt running programs and processes to view and examine the code
- View and examine the function call stack (the call stack is a stack containing the hierarchical list of functions that led to a breakpoint or interrupted state)
- View and examine the variables
- Disassemble the source code (disassembling in this sense means extracting the exact instructions that correspond to the function calls and other C++ code in our program)

You'll notice a performance drop in the application when it is started in debugging mode, which is because the code is being monitored and traced by the debugger.

Here's a screenshot of the Qt Creator Debug mode, in which all of the capabilities mentioned earlier are visible in a single window. The area marked with the number 1 in the preceding screenshot is the code editor that you have already used throughout the book and are quite familiar with. Each line of code has a line number; you can click to the left of a line number to toggle a breakpoint anywhere you want in the code.
You can also right-click on the line numbers to set, remove, disable, or enable a breakpoint by selecting Set Breakpoint at Line X, Remove Breakpoint X, Disable Breakpoint X, or Enable Breakpoint X, where X needs to be replaced by the line number. Apart from the code editor, you can also use the area marked with number 4 in the preceding screenshot to add, delete, edit, and further modify breakpoints in the code. You can also right-click on the toolbar below the code editor that contains the debugger controls to open up the following menu and add or remove panes that display additional debug and analysis information. We will cover the default debugger view, but make sure to check out each one of the options on your own to familiarize yourself with the debugger even more.

The area marked with number 2 in the preceding screenshot can be used to view the call stack. Whether you interrupt the program by pressing the Interrupt button or choosing Debug/Interrupt from the menu while it is running, set a breakpoint and stop the program at a specific line of code, or malfunctioning code causes the program to fall into a trap and pause the process (since a crash or exception will be caught by the debugger), you can always view the hierarchy of function calls that led to the interrupted state, and further analyze them, by checking area 2 in the preceding Qt Creator screenshot.

Finally, you can use the third area in the previous screenshot to view the local and global variables of the program at the interrupted location in the code. You can see the contents of the variables, whether they are standard data types such as integers and floats, or structures and classes, and you can further expand and analyze their contents to test and analyze any possible issues in your code.

Using a debugger efficiently can mean hours of difference in testing and solving the issues in your code. In terms of practical usage, there is really no substitute for using the debugger as much as you can and developing habits of your own, while also making note of good practices and tricks you find along the way, including the ones we just went through. If you are interested, you can also read online about other possible methods of debugging, such as remote debugging, debugging using crash dump files (on Windows), and more.

We saw how to practically debug an application using Qt Creator's debugging mode.

Note: You read an excerpt from the book Computer Vision with OpenCV 3 and Qt 5, written by Amin Ahmadi Tazehkandi. The book covers the development of cross-platform applications using OpenCV 3 and Qt 5.

Related articles: 3 ways to deploy a QT and OpenCV application; Debugging Your .NET Application


Predicting Bitcoin price from historical and live data

Sunith Shetty
06 Apr 2018
17 min read
Today, you will learn how to collect Bitcoin historical and live price data. You will also learn to transform the data into a time series and train your model to make insightful predictions.

Historical and live-price data collection

We will be using the Bitcoin historical price data from Kaggle. For the real-time data, the Cryptocompare API will be used.

Historical data collection

For training the ML algorithm, there is a Bitcoin Historical Price Data dataset available to the public on Kaggle (version 10). The dataset can be downloaded from Kaggle. It has 1-minute OHLC data for the BTC-USD pair from several exchanges. At the beginning of the project, for most of them, data was available from January 1, 2012 to May 31, 2017; but for the Bitstamp exchange, it's available until October 20, 2017 (as well as for Coinbase, but that dataset became available later).

Figure 1: The Bitcoin historical dataset on Kaggle

Note that you need to be a registered user and be logged in in order to download the file. The file that we are using is bitstampUSD_1-min_data_2012-01-01_to_2017-10-20.csv. Now, let us look at the data we have. It has eight columns:

- Timestamp: The time elapsed in seconds since January 1, 1970. It is 1,325,317,920 for the first row and 1,325,317,980 for the second one (sanity check: the difference is 60 seconds).
- Open: The price at the opening of the time interval. It is 4.39 dollars, that is, the price of the first trade that happened after Timestamp (1,325,317,920 in the first row's case).
- Close: The price at the closing of the time interval.
- High: The highest price from all orders executed during the interval.
- Low: The same as High, but the lowest price.
- Volume_(BTC): The sum of all Bitcoins that were transferred during the time interval. So, take all transactions that happened during the selected interval and sum up the BTC values of each of them.
- Volume_(Currency): The sum of all dollars transferred.
- Weighted_Price: This is derived from the volumes of BTC and USD. By dividing all dollars traded by all bitcoins, we get the weighted average price of BTC during this minute. So Weighted_Price = Volume_(Currency) / Volume_(BTC).

One of the most important parts of the data-science pipeline after data collection (which is, in a sense, outsourced; we use data collected by others) is data preprocessing: cleaning a dataset and transforming it to suit our needs.

Transformation of historical data into a time series

Stemming from our goal of predicting the direction of the price change, we might ask ourselves: does having the actual price in dollars help to achieve this? Historically, the price of Bitcoin was usually rising, so if we try to fit a linear regression, it will show further exponential growth (whether this will hold in the long run is yet to be seen).

Assumptions and design choices

One of the assumptions of this project is as follows: whether we are thinking about Bitcoin trading in November 2016 with a price of about $700, or trading in November 2017 with a price in the $6,500-7,000 range, patterns in how people trade are similar. We also make several other assumptions, as described in the following points:

Assumption one: From what has been said previously, we can ignore the actual price and rather look at its change. As a measure of this, we can take the delta between the opening and closing prices. If it is positive, the price grew during that minute; the price went down if it is negative, and stayed the same if the delta is 0.
In the following figure, we can see that Delta was -1.25 for the first minute observed, -12.83 for the second one, and -0.23 for the third one. Sometimes, the open price can differ significantly from the close price of the previous minute (although Delta is negative during all three of the observed minutes, for the third minute the open price shown was actually higher than the previous close for a second). But such things are not very common, and usually the open price doesn't change significantly compared to the close price of the previous minute.

Assumption two: The next thing we need to consider is that we predict the price change in a black-box environment. We do not use other sources of knowledge, such as news or Twitter feeds, to predict how the market would react to them; that is a more advanced topic. The only data we use is price and volume. For simplicity of the prototype, we focus on price only and construct time series data. Time series prediction is the prediction of a parameter based on the values of this parameter in the past. One of the most common examples is temperature prediction; although there are many supercomputers using satellite and sensor data to predict the weather, a simple time series analysis can lead to some valuable results. We predict the price at T+60 seconds, for instance, based on the price at T, T-60s, T-120s, and so on.

Assumption three: Not all data in the dataset is valuable. The first 600,000 records are not informative, as price changes are rare and trading volumes are small. This can affect the model we are training and make the end results worse. That is why the first 600,000 rows are eliminated from the dataset.

Assumption four: We need to label our data so that we can use a supervised ML algorithm. Labelling by the sign of the price change is the easiest measure, without concerns about transaction fees.

Data preprocessing

Taking into account the goals of data preparation, Scala was chosen as an easy and interactive way to manipulate the data:

val priceDataFileName: String = "bitstampUSD_1-min_data_2012-01-01_to_2017-10-20.csv"
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "E:/Exp/")
  .appName("Bitcoin Preprocessing")
  .getOrCreate()
val data = spark.read.format("com.databricks.spark.csv").option("header", "true").load(priceDataFileName)
data.show(10)
println((data.count(), data.columns.size))
>>> (3045857, 8)

In the preceding code, we load the data from the file downloaded from Kaggle and look at what is inside. There are 3,045,857 rows in the dataset and 8 columns, described before. Then we create the Delta column, containing the difference between the closing and opening prices (later we will consider only the data where meaningful trading had started to occur):

val dataWithDelta = data.withColumn("Delta", data("Close") - data("Open"))

The following code labels our data by assigning 1 to the rows whose Delta value is positive, and 0 otherwise:

import org.apache.spark.sql.functions._
import spark.sqlContext.implicits._
val dataWithLabels = dataWithDelta.withColumn("label", when($"Close" - $"Open" > 0, 1).otherwise(0))
rollingWindow(dataWithLabels, 22, outputDataFilePath, outputLabelFilePath)

This code transforms the original dataset into time series data. It takes the Delta values of WINDOW_SIZE rows (22 in this experiment) and makes a new row out of them. In this way, the first row has Delta values from t0 to t21, and the second one has values from t1 to t22. Then we create the corresponding array of labels (1 or 0).
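To make the windowing idea concrete before looking at the Spark implementation, here is a minimal, illustrative plain-Scala sketch that is not part of the original code; it uses a toy series and a window size of 3 instead of 22:

// Illustrative only: each feature row is `window` consecutive Delta values,
// and its label is the label (sign of Delta) of the minute right after the window.
val deltas = Seq(-1.25, -12.83, -0.23, 0.45, -0.10, 0.80)
val labels = deltas.map(d => if (d > 0) 1 else 0)
val window = 3
val rows = deltas.sliding(window).toList.zip(labels.drop(window))
rows.foreach { case (features, label) =>
  println(features.mkString(",") + " -> " + label)
}

The real rollingWindow function shown next does the same thing over the Spark DataFrame and writes the windows and labels to two CSV files.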
Finally, we save X and Y into files. The first 612,000 rows are cut off from the original dataset, 22 is the rolling window size, and the two classes represent the binary labels 0 and 1:

val dropFirstCount: Int = 612000

def rollingWindow(data: DataFrame, window: Int, xFilename: String, yFilename: String): Unit = {
  var i = 0
  val xWriter = new BufferedWriter(new FileWriter(new File(xFilename)))
  val yWriter = new BufferedWriter(new FileWriter(new File(yFilename)))
  val zippedData = data.rdd.zipWithIndex().collect()
  System.gc()
  val dataStratified = zippedData.drop(dropFirstCount) // slice 612K
  while (i < (dataStratified.length - window)) {
    val x = dataStratified
      .slice(i, i + window)
      .map(r => r._1.getAs[Double]("Delta")).toList
    val y = dataStratified.apply(i + window)._1.getAs[Integer]("label")
    val stringToWrite = x.mkString(",")
    xWriter.write(stringToWrite + "\n")
    yWriter.write(y + "\n")
    i += 1
    if (i % 10 == 0) {
      xWriter.flush()
      yWriter.flush()
    }
  }
  xWriter.close()
  yWriter.close()
}

In the preceding code segment:

val outputDataFilePath: String = "output/scala_test_x.csv"
val outputLabelFilePath: String = "output/scala_test_y.csv"

Real-time data through the Cryptocompare API

For real-time data, the Cryptocompare API is used, more specifically HistoMinute, which gives us access to OHLC data for the past seven days at most. The details of the API will be discussed in a section devoted to implementation, but the API response is very similar to our historical dataset, and this data is retrieved using a regular HTTP request. For example, a simple JSON response has the following structure:

{
  "Response":"Success", "Type":100, "Aggregated":false,
  "Data":
  [{"time":1510774800,"close":7205,"high":7205,"low":7192.67,"open":7198,
    "volumefrom":81.73,"volumeto":588726.94},
   {"time":1510774860,"close":7209.05,"high":7219.91,"low":7205,"open":7205,
    "volumefrom":16.39,"volumeto":118136.61},
   ... (other price data)
  ],
  "TimeTo":1510776180,
  "TimeFrom":1510774800,
  "FirstValueInArray":true,
  "ConversionType":{"type":"force_direct","conversionSymbol":""}
}

Through Cryptocompare HistoMinute, we can get the open, high, low, close, volumefrom, and volumeto values for each minute of historical data. This data is stored for 7 days only; if you need more, use the hourly or daily path. The API falls back to BTC conversion if the data is not available because the coin is not being traded in the specified currency.
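The URL-building helper itself is not shown in this excerpt. As an illustration only, a HistoMinute request typically looks like the following sketch; the endpoint and parameter names are assumptions based on the public CryptoCompare API documentation, not code from the book:

// Illustrative sketch: building a HistoMinute request URL for BTC/USD.
// Endpoint and parameter names are assumed from the public CryptoCompare API docs.
def histoMinuteUrl(fsym: String = "BTC",
                   tsym: String = "USD",
                   limit: Int = 60,
                   aggregate: Int = 1): String =
  s"https://min-api.cryptocompare.com/data/histominute" +
    s"?fsym=$fsym&tsym=$tsym&limit=$limit&aggregate=$aggregate"

// Example: the last hour of minute-level OHLC data.
val url = histoMinuteUrl(limit = 60)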
Now, the following method fetches the correctly formed URL of the Cryptocompare API, that is, a fully formed URL with all parameters, such as currency, limit, and aggregation, specified. It finally returns a future that will have the response body parsed into the data model, with the price list to be processed at an upper level:

import javax.inject.Inject
import play.api.libs.json.{JsResult, Json}
import scala.concurrent.Future
import play.api.mvc._
import play.api.libs.ws._
import processing.model.CryptoCompareResponse

class RestClient @Inject() (ws: WSClient) {
  def getPayload(url: String): Future[JsResult[CryptoCompareResponse]] = {
    val request: WSRequest = ws.url(url)
    val future = request.get()
    implicit val context = play.api.libs.concurrent.Execution.Implicits.defaultContext
    future.map { response =>
      response.json.validate[CryptoCompareResponse]
    }
  }
}

In the preceding code segment, the CryptoCompareResponse class is the model of the API response, which takes the following parameters: Response, Type, Aggregated, Data, FirstValueInArray, TimeTo, and TimeFrom. It has the following signature:

case class CryptoCompareResponse(Response: String, Type: Int, Aggregated: Boolean, Data: List[OHLC], FirstValueInArray: Boolean, TimeTo: Long, TimeFrom: Long)

object CryptoCompareResponse {
  implicit val cryptoCompareResponseReads = Json.reads[CryptoCompareResponse]
}

Similarly, OHLC (open-high-low-close) is a model class for mapping the internals of the Data array in the CryptoCompare API response. It takes these parameters:

- Time: Timestamp in seconds, 1508818680, for instance.
- Open: Open price at the given minute interval.
- High: Highest price.
- Low: Lowest price.
- Close: Price at the closing of the interval.
- Volumefrom: Trading volume in the from currency; BTC in our case.
- Volumeto: The trading volume in the to currency; USD in our case. Dividing Volumeto by Volumefrom gives us the weighted price of BTC.

It has the following signature:

case class OHLC(time: Long, open: Double, high: Double, low: Double, close: Double, volumefrom: Double, volumeto: Double)

object OHLC {
  implicit val implicitOHLCReads = Json.reads[OHLC]
}

Model training for prediction

Inside the project, in the package folder prediction.training, there is a Scala object called TrainGBT.scala. Before launching it, you have to specify/change four things:

- In the code, you need to set spark.sql.warehouse.dir to an actual place on your computer that has several gigabytes of free space: set("spark.sql.warehouse.dir", "/home/user/spark")
- rootDir is the main folder where all files and trained models will be stored: rootDir = "/home/user/projects/btc-prediction/"
- Make sure that the x filename matches the one produced by the Scala script in the preceding step: x = spark.read.format("com.databricks.spark.csv").schema(xSchema).load(rootDir + "scala_test_x.csv")
- Make sure that the y filename matches the one produced by the Scala script: y_tmp = spark.read.format("com.databricks.spark.csv").schema(ySchema).load(rootDir + "scala_test_y.csv")

The code for training uses the Apache Spark ML library (and the libraries required by it) to train the classifier, which means they have to be present in your classpath for you to be able to run it. The easiest way to do that (since the whole project uses SBT) is to run it from the project root folder by typing sbt run-main prediction.training.TrainGBT, which will resolve all dependencies and launch training. Depending on the number of iterations and the depth, it can take several hours to train the model. Now let us see how training is performed, using the gradient-boosted trees model as an example.
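First, we need to create a SparkSession object. Its construction is only spelled out in the full listing at the end of this section; the sketch below simply mirrors that listing (the warehouse directory is a placeholder you should point at your own disk):

import org.apache.spark.sql.SparkSession

// Mirrors the SparkSession created in the full listing below; the warehouse
// directory is a placeholder for a local path with several gigabytes free.
val spark = SparkSession
  .builder()
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "/home/user/spark")
  .appName("Bitcoin Preprocessing")
  .getOrCreate()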
Next, we define the schema for the 22 time-series columns:

val xSchema = StructType(Array(
  StructField("t0", DoubleType, true), StructField("t1", DoubleType, true),
  StructField("t2", DoubleType, true), StructField("t3", DoubleType, true),
  StructField("t4", DoubleType, true), StructField("t5", DoubleType, true),
  StructField("t6", DoubleType, true), StructField("t7", DoubleType, true),
  StructField("t8", DoubleType, true), StructField("t9", DoubleType, true),
  StructField("t10", DoubleType, true), StructField("t11", DoubleType, true),
  StructField("t12", DoubleType, true), StructField("t13", DoubleType, true),
  StructField("t14", DoubleType, true), StructField("t15", DoubleType, true),
  StructField("t16", DoubleType, true), StructField("t17", DoubleType, true),
  StructField("t18", DoubleType, true), StructField("t19", DoubleType, true),
  StructField("t20", DoubleType, true), StructField("t21", DoubleType, true)))

Then we read the files for which we defined the schema. It was more convenient to generate two separate files in Scala for the data and the labels, so here we have to join them into a single DataFrame:

import spark.implicits._
val y = y_tmp.withColumn("y", 'y.cast(IntegerType))
import org.apache.spark.sql.functions._
val x_id = x.withColumn("id", monotonically_increasing_id())
val y_id = y.withColumn("id", monotonically_increasing_id())
val data = x_id.join(y_id, "id")

The next step is required by Spark: we need to vectorize the features:

val featureAssembler = new VectorAssembler()
  .setInputCols(Array("t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8", "t9", "t10", "t11",
    "t12", "t13", "t14", "t15", "t16", "t17", "t18", "t19", "t20", "t21"))
  .setOutputCol("features")

We split the data randomly into train and test sets in the proportion of 75% to 25%. We set the seed so that the splits are identical every time we run the training:

val Array(trainingData, testData) = dataWithLabels.randomSplit(Array(0.75, 0.25), 123)

We then define the model. It specifies which columns are features and which are labels, and it also sets parameters:

val gbt = new GBTClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setMaxIter(10)
  .setSeed(123)

We create a pipeline of steps: vector assembling of the features and running the GBT:

val pipeline = new Pipeline()
  .setStages(Array(featureAssembler, gbt))

Next, we define the evaluator function, which is how the model knows whether it is doing well or not.
As we have only two classes, which are imbalanced, accuracy is a bad measurement; the area under the ROC curve is better:

val rocEvaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("rawPrediction")
  .setMetricName("areaUnderROC")

K-fold cross-validation is used to avoid overfitting; at each iteration it holds out one fold of the data, trains the model on the remaining folds, and then tests on the held-out fold:

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(rocEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(numFolds)
  .setSeed(123)
val cvModel = cv.fit(trainingData)

After we get the trained model (which can take an hour or more, depending on the number of iterations and the parameters we want to iterate over, specified in paramGrid), we compute the predictions on the test data:

val predictions = cvModel.transform(testData)

In addition, we evaluate the quality of the predictions:

val roc = rocEvaluator.evaluate(predictions)

The trained model is saved for later usage by the prediction service:

val gbtModel = cvModel.bestModel.asInstanceOf[PipelineModel]
gbtModel.save(rootDir + "__cv__gbt_22_binary_classes_" + System.nanoTime() / 1000000 + ".model")
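The prediction service itself is not shown in this excerpt. As an illustration only, reloading the saved pipeline and scoring fresh windowed data would look roughly like the following sketch; the path suffix and the newData DataFrame are hypothetical placeholders:

import org.apache.spark.ml.PipelineModel

// Illustrative sketch: reloading the persisted pipeline and scoring new data.
// "savedModelPath" and "newData" are hypothetical placeholders, not names from the book.
val savedModelPath = rootDir + "__cv__gbt_22_binary_classes_1234567890.model"
val loadedModel = PipelineModel.load(savedModelPath)

// newData must contain the same t0..t21 Delta columns used during training.
val scored = loadedModel.transform(newData)
scored.select("prediction").show(5)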
In summary, the code for model training is given as follows:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.{GBTClassificationModel, GBTClassifier, RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.evaluation.{BinaryClassificationEvaluator, MulticlassClassificationEvaluator}
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler, VectorIndexer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.types.{DoubleType, IntegerType, StructField, StructType}
import org.apache.spark.sql.SparkSession

object TrainGradientBoostedTree {
  def main(args: Array[String]): Unit = {
    val maxBins = Seq(5, 7, 9)
    val numFolds = 10
    val maxIter: Seq[Int] = Seq(10)
    val maxDepth: Seq[Int] = Seq(20)
    val rootDir = "output/"
    val spark = SparkSession
      .builder()
      .master("local[*]")
      .config("spark.sql.warehouse.dir", "/home/user/spark/")
      .appName("Bitcoin Preprocessing")
      .getOrCreate()
    val xSchema = StructType(Array(
      StructField("t0", DoubleType, true), StructField("t1", DoubleType, true),
      StructField("t2", DoubleType, true), StructField("t3", DoubleType, true),
      StructField("t4", DoubleType, true), StructField("t5", DoubleType, true),
      StructField("t6", DoubleType, true), StructField("t7", DoubleType, true),
      StructField("t8", DoubleType, true), StructField("t9", DoubleType, true),
      StructField("t10", DoubleType, true), StructField("t11", DoubleType, true),
      StructField("t12", DoubleType, true), StructField("t13", DoubleType, true),
      StructField("t14", DoubleType, true), StructField("t15", DoubleType, true),
      StructField("t16", DoubleType, true), StructField("t17", DoubleType, true),
      StructField("t18", DoubleType, true), StructField("t19", DoubleType, true),
      StructField("t20", DoubleType, true), StructField("t21", DoubleType, true)))
    val ySchema = StructType(Array(StructField("y", DoubleType, true)))
    val x = spark.read.format("csv").schema(xSchema).load(rootDir + "scala_test_x.csv")
    val y_tmp = spark.read.format("csv").schema(ySchema).load(rootDir + "scala_test_y.csv")
    import spark.implicits._
    val y = y_tmp.withColumn("y", 'y.cast(IntegerType))
    import org.apache.spark.sql.functions._
    // joining 2 separate datasets into a single Spark dataframe
    val x_id = x.withColumn("id", monotonically_increasing_id())
    val y_id = y.withColumn("id", monotonically_increasing_id())
    val data = x_id.join(y_id, "id")
    val featureAssembler = new VectorAssembler()
      .setInputCols(Array("t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8", "t9", "t10", "t11",
        "t12", "t13", "t14", "t15", "t16", "t17", "t18", "t19", "t20", "t21"))
      .setOutputCol("features")
    val encodeLabel = udf[Double, String] {
      case "1" => 1.0
      case "0" => 0.0
    }
    val dataWithLabels = data.withColumn("label", encodeLabel(data("y")))
    // 123 is the seed number to get the same data split so we can tune params
    val Array(trainingData, testData) = dataWithLabels.randomSplit(Array(0.75, 0.25), 123)
    val gbt = new GBTClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setMaxIter(10)
      .setSeed(123)
    val pipeline = new Pipeline()
      .setStages(Array(featureAssembler, gbt))
    // ***********************************************************
    println("Preparing K-fold Cross Validation and Grid Search")
    // ***********************************************************
    val paramGrid = new ParamGridBuilder()
      .addGrid(gbt.maxIter, maxIter)
      .addGrid(gbt.maxDepth, maxDepth)
      .addGrid(gbt.maxBins, maxBins)
      .build()
    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(new BinaryClassificationEvaluator())
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(numFolds)
      .setSeed(123)
    // ************************************************************
    println("Training model with GradientBoostedTrees algorithm")
    // ************************************************************
    // Train model. This also runs the indexers.
    val cvModel = cv.fit(trainingData)
    cvModel.save(rootDir + "cvGBT_22_binary_classes_" + System.nanoTime() / 1000000 + ".model")
    println("Evaluating model on train and test data and calculating RMSE")
    // **********************************************************************
    // Make a sample prediction
    val predictions = cvModel.transform(testData)
    // Select (prediction, true label) and compute test error.
    val rocEvaluator = new BinaryClassificationEvaluator()
      .setLabelCol("label")
      .setRawPredictionCol("rawPrediction")
      .setMetricName("areaUnderROC")
    val roc = rocEvaluator.evaluate(predictions)
    val prEvaluator = new BinaryClassificationEvaluator()
      .setLabelCol("label")
      .setRawPredictionCol("rawPrediction")
      .setMetricName("areaUnderPR")
    val pr = prEvaluator.evaluate(predictions)
    val gbtModel = cvModel.bestModel.asInstanceOf[PipelineModel]
    gbtModel.save(rootDir + "__cv__gbt_22_binary_classes_" + System.nanoTime() / 1000000 + ".model")
    println("Area under ROC curve = " + roc)
    println("Area under PR curve = " + pr)
    println(predictions.select().show(1))
    spark.stop()
  }
}

Now let us see how the training went:

>>> Area under ROC curve = 0.6045355104779828
Area under PR curve = 0.3823834607704922

Therefore, we have not received very high accuracy, as the area under the ROC curve is only about 60.45% for the best GBT model. Nevertheless, if we tune the hyperparameters, we will get better accuracy. We learned how a complete ML pipeline can be implemented, from collecting historical data to transforming it into a format suitable for testing hypotheses. We also performed machine learning model training to carry out predictions.

Note: You read an excerpt from the book Scala Machine Learning Projects, written by Md. Rezaul Karim. In this book, you will learn to build powerful machine learning applications for performing advanced numerical computing and functional programming.


2019 Stack Overflow survey: A quick overview

Sugandha Lahoti
10 Apr 2019
5 min read
The results of the 2019 Stack Overflow survey have just been published: 90,000 developers took the 20-minute survey this year. The survey shed light on some very interesting insights, from developers' preferred programming languages, to the development platform they hate the most, to the blockers to developer productivity. As the survey is quite detailed and comprehensive, here's a quick look at the most important takeaways.

Key highlights from the Stack Overflow survey

Programming languages

Python again emerged as the fastest-growing programming language and was voted the second most loved language, a close second behind Rust. Interestingly, Python and TypeScript achieved the same share, with almost 73% of respondents saying it was their most loved language. Python was the language most developers said they wanted to learn next, and JavaScript remains the most used programming language. The most dreaded languages were VBA and Objective-C.

Source: Stack Overflow

Frameworks and databases in the Stack Overflow survey

Developers preferred the React.js and Vue.js web frameworks, while they dreaded Drupal and jQuery. Redis was voted the most loved database and MongoDB the most wanted database. MongoDB's inclusion in the list is surprising considering its controversial Server Side Public License. Over the last few months, Red Hat dropped support for MongoDB over this license, as did GNU Health Federation. Both of these organizations chose PostgreSQL over MongoDB, which is probably one of the reasons why PostgreSQL was the second most loved and wanted database of the Stack Overflow Survey 2019.

Source: Stack Overflow

It's interesting to see WebAssembly making its way into the popular technology segment as well as being one of the top-paying technologies. Respondents who use Clojure, F#, Elixir, and Rust earned the highest salaries. Stack Overflow also added a new segment this year called "Blockchain in the real world", which gives insight into the adoption of blockchain. Most respondents (80%) said that their organizations are not using or implementing blockchain technology.

Source: Stack Overflow

Developer lifestyles and learning

About 80% of respondents say that they code as a hobby outside of work, and over half of respondents had written their first line of code by the time they were sixteen, although this experience varies by country and by gender. For instance, women wrote their first code later than men, and non-binary respondents wrote code earlier than men. About one-quarter of respondents are enrolled in a formal college or university program full-time or part-time. Of professional developers who studied at the university level, over 60% said they majored in computer science, computer engineering, or software engineering.

DevOps specialists and site reliability engineers are among the highest-paid and most experienced developers, are the most satisfied with their jobs, and are looking for new jobs at the lowest rates. The survey also noted that developers who are system admins or DevOps specialists are 25-30 times more likely to be men than women. Chinese developers are the most optimistic about the future, while developers in Western European countries like France and Germany are among the least optimistic.

Developers also overwhelmingly believe that Elon Musk will be the most influential person in tech in 2019. With more than 30,000 people responding to a free-text question asking who they think will be the most influential person this year, an amazing 30% named Tesla CEO Musk.
For perspective, Jeff Bezos was in second place, named by 'only' 7.2% of respondents.

Although the proportion of women among US survey respondents went up from 9% to 11% this year, this is still slow growth and points to problems with inclusion in the tech industry in general and on Stack Overflow in particular. When thinking about blockers to productivity, different kinds of developers report different challenges. Men are more likely to say that being tasked with non-development work is a problem for them, while gender-minority respondents are more likely to say that toxic work environments are a problem.

Stack Overflow survey demographics and diversity challenges

This report is based on a survey of 88,883 software developers from 179 countries around the world. It was conducted between January 23 and February 14, and the median time spent on the survey for qualified responses was 23.3 minutes. The majority of survey respondents this year were people who said they are professional developers, who code sometimes as part of their work, or who are students preparing for such a career. The majority of them were from the US, India, China, and Europe.

Stack Overflow acknowledged that its results do not represent racial disparities evenly and that people of color continue to be underrepresented among developers. This year nearly 71% of respondents continued to be of White or European descent, a slight improvement from last year (74%). The survey notes that, "In the United States this year, 22% of respondents are people of color; last year 19% of United States respondents were people of color." This clearly signifies that a lot of work still needs to be done, particularly for people of color, women, and underrepresented groups. Last August, Stack Overflow revamped its Code of Conduct to include more virtues around kindness, collaboration, and mutual respect. It also updated its developer salary calculator to include 8 new countries.

Go through the full report to learn more about developer salaries, job priorities, career values, the best music to listen to while coding, and more.

Related articles: Developers believe Elon Musk will be the most influential person in tech in 2019, according to Stack Overflow survey results; Creators of Python, Java, C#, and Perl discuss the evolution and future of programming language design at PuPPy; Stack Overflow is looking for a new CEO as Joel Spolsky becomes Chairman


Exploring HDFS

Packt
10 Mar 2016
17 min read
In this article by Tanmay Deshpande, the author of the book Hadoop Real World Solutions Cookbook, Second Edition, we'll cover the following recipes:

- Loading data from a local machine to HDFS
- Exporting HDFS data to a local machine
- Changing the replication factor of an existing file in HDFS
- Setting the HDFS block size for all the files in a cluster
- Setting the HDFS block size for a specific file in a cluster
- Enabling transparent encryption for HDFS
- Importing data from another Hadoop cluster
- Recycling deleted data from trash to HDFS
- Saving compressed data in HDFS

Hadoop has two important components: storage, which includes HDFS, and processing, which includes MapReduce. HDFS takes care of the storage part of Hadoop, so let's explore the internals of HDFS through various recipes.

Loading data from a local machine to HDFS

In this recipe, we are going to load data from a local machine's disk to HDFS.

Getting ready

To perform this recipe, you should have an already running Hadoop cluster.

How to do it...

Performing this recipe is as simple as copying data from one folder to another. There are a couple of ways to copy data from the local machine to HDFS.

Using the copyFromLocal command. To copy a file to HDFS, let's first create a directory on HDFS and then copy the file. Here are the commands to do this:

hadoop fs -mkdir /mydir1
hadoop fs -copyFromLocal /usr/local/hadoop/LICENSE.txt /mydir1

Using the put command. We will first create the directory, and then put the local file in HDFS:

hadoop fs -mkdir /mydir2
hadoop fs -put /usr/local/hadoop/LICENSE.txt /mydir2

You can validate that the files have been copied to the correct folders by listing the files:

hadoop fs -ls /mydir1
hadoop fs -ls /mydir2

How it works...

When you use the HDFS copyFromLocal or put command, the following things occur. First of all, the HDFS client (the command prompt, in this case) contacts the NameNode because it needs to copy the file to HDFS. The NameNode then asks the client to break the file into chunks of the configured cluster block size; in Hadoop 2.X, the default block size is 128MB. Based on the capacity and availability of space on the DataNodes, the NameNode decides where these blocks should be copied. Then, the client starts copying data to the specified DataNodes for a specific block. The blocks are copied sequentially, one after another. When a single block is copied, the block is sent to the DataNode in packets that are 4MB in size. With each packet, a checksum is sent; once the packet copying is done, it is verified against the checksum to check whether it matches. The packets are then sent to the next DataNode, where the block will be replicated. The HDFS client's responsibility is to copy the data to only the first node; the replication is taken care of by the respective DataNodes, so the data block is pipelined from one DataNode to the next. While the block copying and replication are taking place, the metadata of the file is updated in the NameNode by the DataNodes.
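As an aside (not part of the original recipe), the same copy can also be done programmatically through the Hadoop FileSystem API. The short Scala sketch below is illustrative only; the NameNode URI and paths are placeholders to adjust for your cluster:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative sketch: copying a local file into HDFS through the FileSystem API.
// The fs.defaultFS value and the paths are placeholders.
val conf = new Configuration()
conf.set("fs.defaultFS", "hdfs://localhost:9000")
val fs = FileSystem.get(conf)

fs.mkdirs(new Path("/mydir1"))
fs.copyFromLocalFile(new Path("/usr/local/hadoop/LICENSE.txt"), new Path("/mydir1/LICENSE.txt"))
fs.close()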
Exporting data from HDFS to a local machine

In this recipe, we are going to export/copy data from HDFS to the local machine.

Getting ready

To perform this recipe, you should already have a running Hadoop cluster.

How to do it...

Performing this recipe is as simple as copying data from one folder to the other. There are a couple of ways in which you can export data from HDFS to the local machine.

Using the copyToLocal command:

hadoop fs -copyToLocal /mydir1/LICENSE.txt /home/ubuntu

Using the get command:

hadoop fs -get /mydir1/LICENSE.txt /home/ubuntu

How it works...

When you use the HDFS copyToLocal or get command, the following things occur. First of all, the client contacts the NameNode because it needs a specific file from HDFS. The NameNode then checks whether such a file exists in its FSImage; if the file is not present, an error code is returned to the client. If the file exists, the NameNode checks the metadata for its blocks and the replica placements on the DataNodes. The NameNode then directly points the client to the DataNodes from which the blocks are served, one by one. The data is copied directly from the DataNodes to the client machine and never goes through the NameNode, to avoid bottlenecks. Thus, the file is exported to the local machine from HDFS.

Changing the replication factor of an existing file in HDFS

In this recipe, we are going to take a look at how to change the replication factor of a file in HDFS. The default replication factor is 3.

Getting ready

To perform this recipe, you should already have a running Hadoop cluster.

How to do it...

Sometimes, there might be a need to increase or decrease the replication factor of a specific file in HDFS. In this case, we use the setrep command, which is used as follows:

hadoop fs -setrep [-R] [-w] <noOfReplicas> <path> ...

In this command, the path can either be a file or a directory; if it is a directory, the command recursively sets the replication factor for all files under it. The -w option makes the command wait until the replication is complete, and -R is accepted for backward compatibility.

First, let's check the replication factor of the file we copied to HDFS in the previous recipe:

hadoop fs -ls /mydir1/LICENSE.txt
-rw-r--r-- 3 ubuntu supergroup 15429 2015-10-29 03:04 /mydir1/LICENSE.txt

Once you list the file, it shows you the read/write permissions on the file, and the very next parameter is the replication factor. We have the replication factor set to 3 for our cluster, hence the number is 3. Let's change it to 2 using this command:

hadoop fs -setrep -w 2 /mydir1/LICENSE.txt

It will wait until the replication is adjusted. Once done, you can verify this again by running the ls command:

hadoop fs -ls /mydir1/LICENSE.txt
-rw-r--r-- 2 ubuntu supergroup 15429 2015-10-29 03:04 /mydir1/LICENSE.txt

How it works...

Once the setrep command is executed, the NameNode is notified, and then the NameNode decides whether replicas need to be added to or removed from certain DataNodes. When you use the -w option, this process may sometimes take a long time if the file size is big.

Setting the HDFS block size for all the files in a cluster

In this recipe, we are going to take a look at how to set the block size at the cluster level.

Getting ready

To perform this recipe, you should already have a running Hadoop cluster.

How to do it...

The HDFS block size is configurable for all files in the cluster or for a single file. To change the block size at the cluster level, we need to modify the hdfs-site.xml file. By default, the HDFS block size is 128MB. In case we want to modify this, we need to update this property, as shown in the following code.
This property changes the default block size to 64MB:

<property>
  <name>dfs.block.size</name>
  <value>67108864</value>
  <description>HDFS Block size</description>
</property>

If you have a multi-node Hadoop cluster, you should update this file on all the nodes, that is, the NameNode and the DataNodes. Make sure you save these changes and restart the HDFS daemons:

/usr/local/hadoop/sbin/stop-dfs.sh
/usr/local/hadoop/sbin/start-dfs.sh

This sets the block size for files that are added to the HDFS cluster from now on. Note that this does not change the block size of files that are already present in HDFS; there is no way to change the block size of existing files.

How it works...

By default, the HDFS block size is 128MB for Hadoop 2.X. Sometimes, we may want to change this default block size for optimization purposes. When this configuration is successfully updated, all new files will be saved in blocks of this size. These changes do not affect the files that are already present in HDFS; their block size was defined at the time they were copied.

Setting the HDFS block size for a specific file in a cluster

In this recipe, we are going to take a look at how to set the block size for a specific file only.

Getting ready

To perform this recipe, you should already have a running Hadoop cluster.

How to do it...

In the previous recipe, we learned how to change the block size at the cluster level, but this is not always required. HDFS also provides the facility to set the block size for a single file. The following command copies a file called myfile to HDFS, setting the block size to 1MB:

hadoop fs -Ddfs.block.size=1048576 -put /home/ubuntu/myfile /

Once the file is copied, you can verify whether the block size is set to 1MB and the file has been broken into chunks of exactly that size:

hdfs fsck -blocks /myfile
Connecting to namenode via http://localhost:50070/fsck?ugi=ubuntu&blocks=1&path=%2Fmyfile
FSCK started by ubuntu (auth:SIMPLE) from /127.0.0.1 for path /myfile at Thu Oct 29 14:58:00 UTC 2015
.Status: HEALTHY
Total size: 17276808 B
Total dirs: 0
Total files: 1
Total symlinks: 0
Total blocks (validated): 17 (avg. block size 1016282 B)
Minimally replicated blocks: 17 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 1
Average block replication: 1.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 3
Number of racks: 1
FSCK ended at Thu Oct 29 14:58:00 UTC 2015 in 2 milliseconds

The filesystem under path '/myfile' is HEALTHY

How it works...

When we specify the block size at the time of copying a file, it overrides the default block size and copies the file to HDFS by breaking it into chunks of the given size. Generally, these modifications are made in order to perform other optimizations; make sure you are aware of their consequences when you make them. If the block size is too small, it will increase parallelization, but it will also increase the load on the NameNode, as it will have more entries in the FSImage. On the other hand, if the block size is too big, it will reduce parallelization and degrade processing performance.

Enabling transparent encryption for HDFS

When handling sensitive data, it is always important to consider security measures. Hadoop allows us to encrypt sensitive data that's present in HDFS. In this recipe, we are going to see how to encrypt data in HDFS.
Getting ready

To perform this recipe, you should already have a running Hadoop cluster.

How to do it...

For many applications that hold sensitive data, it is very important to adhere to standards such as PCI, HIPAA, FISMA, and so on. To enable this, HDFS provides a utility called an encryption zone, a directory in which data is encrypted on write and decrypted on read. To use this encryption facility, we first need to enable the Hadoop Key Management Server (KMS):

/usr/local/hadoop/sbin/kms.sh start

This starts KMS in the Tomcat web server. Next, we need to append the following properties to core-site.xml and hdfs-site.xml. In core-site.xml, add the following property:

<property>
  <name>hadoop.security.key.provider.path</name>
  <value>kms://http@localhost:16000/kms</value>
</property>

In hdfs-site.xml, add the following property:

<property>
  <name>dfs.encryption.key.provider.uri</name>
  <value>kms://http@localhost:16000/kms</value>
</property>

Restart the HDFS daemons:

/usr/local/hadoop/sbin/stop-dfs.sh
/usr/local/hadoop/sbin/start-dfs.sh

Now, we are all set to use KMS. Next, we need to create a key that will be used for the encryption:

hadoop key create mykey

This creates a key and saves it in KMS. Next, we have to create an encryption zone, which is a directory in HDFS where all the encrypted data is saved:

hadoop fs -mkdir /zone
hdfs crypto -createZone -keyName mykey -path /zone

We will change the ownership to the current user:

hadoop fs -chown ubuntu:ubuntu /zone

If we put any file into this directory, it is encrypted on write and decrypted at the time of reading:

hadoop fs -put myfile /zone
hadoop fs -cat /zone/myfile

How it works...

There are various types of encryption one can use in order to comply with security standards, for example, application-level, database-level, file-level, and disk-level encryption. HDFS transparent encryption sits between database-level and file-level encryption. KMS acts as a proxy between HDFS clients and HDFS's encryption provider via HTTP REST APIs. There are two types of keys used for encryption: the Encryption Zone Key (EZK) and the Data Encryption Key (DEK). The EZK is used to encrypt the DEK, producing the Encrypted Data Encryption Key (EDEK), which is then saved on the NameNode. When a file needs to be written to the HDFS encryption zone, the client gets the EDEK from the NameNode and the EZK from KMS to form the DEK, which is used to encrypt the data and store it in HDFS (the encryption zone). When an encrypted file needs to be read, the client needs the DEK, which is formed by combining the EZK and the EDEK; these are obtained from KMS and the NameNode, respectively. Thus, encryption and decryption are automatically handled by HDFS, and the end user does not need to worry about performing them on their own. You can read more on this topic at http://blog.cloudera.com/blog/2015/01/new-in-cdh-5-3-transparent-encryption-in-hdfs/.

Importing data from another Hadoop cluster

Sometimes, we may want to copy data from one HDFS cluster to another, whether for development, testing, or production migration. In this recipe, we will learn how to copy data from one HDFS cluster to another.

Getting ready

To perform this recipe, you should already have a running Hadoop cluster.

How to do it...

Hadoop provides a utility called DistCp, which helps us copy data from one cluster to another.
Using this utility is as simple as copying from one folder to another:

hadoop distcp hdfs://hadoopCluster1:9000/source hdfs://hadoopCluster2:9000/target

This uses a MapReduce job to copy data from one cluster to another. You can also specify multiple source files to be copied to the target. There are a couple of other options that we can use:

- -update: When we use DistCp with the update option, it only copies those files from the source that are not present in the target or differ from the target.
- -overwrite: When we use DistCp with the overwrite option, it overwrites the target directory with the source.

How it works...

When DistCp is executed, it uses MapReduce to copy the data and also assists in error handling and reporting. It expands the list of source files and directories and feeds them to the map tasks. When copying from multiple sources, collisions are resolved in the destination based on the option (update/overwrite) that's provided. By default, a file is skipped if it is already present at the target. Once the copying is complete, the count of skipped files is reported. You can read more on DistCp at https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html.

Recycling deleted data from trash to HDFS

In this recipe, we are going to see how to recover deleted data from the trash to HDFS.

Getting ready

To perform this recipe, you should already have a running Hadoop cluster.

How to do it...

To recover accidentally deleted data from HDFS, we first need to enable the trash folder, which is not enabled by default in HDFS. This can be achieved by adding the following property to core-site.xml:

<property>
  <name>fs.trash.interval</name>
  <value>120</value>
</property>

Then, restart the HDFS daemons:

/usr/local/hadoop/sbin/stop-dfs.sh
/usr/local/hadoop/sbin/start-dfs.sh

This sets the deleted-file retention to 120 minutes. Now, let's try to delete a file from HDFS:

hadoop fs -rmr /LICENSE.txt
15/10/30 10:26:26 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 120 minutes, Emptier interval = 0 minutes.
Moved: 'hdfs://localhost:9000/LICENSE.txt' to trash at: hdfs://localhost:9000/user/ubuntu/.Trash/Current

We have 120 minutes to recover this file before it is permanently deleted from HDFS. To restore the file to its original location, we can execute the following commands. First, let's confirm whether the file exists:

hadoop fs -ls /user/ubuntu/.Trash/Current
Found 1 items
-rw-r--r-- 1 ubuntu supergroup 15429 2015-10-30 10:26 /user/ubuntu/.Trash/Current/LICENSE.txt

Now, restore the deleted file or folder; it's better to use the distcp command instead of copying each file one by one:

hadoop distcp hdfs://localhost:9000/user/ubuntu/.Trash/Current/LICENSE.txt hdfs://localhost:9000/

This starts a MapReduce job to restore the data from the trash to the original HDFS folder. Check the HDFS path; the deleted file should be back in its original form.

How it works...

Enabling the trash enforces a file retention policy for a specified amount of time. So, when the trash is enabled, HDFS does not execute any block deletions or movements immediately, but only updates the metadata of the file and its location. This way, we can recover accidentally deleted files from HDFS; make sure that the trash is enabled before experimenting with this recipe.

Saving compressed data in HDFS

In this recipe, we are going to take a look at how to store and process compressed data in HDFS.

Getting ready

To perform this recipe, you should already have a running Hadoop cluster.

How to do it...
It's always good to use compression while storing data in HDFS. HDFS supports various compression algorithms such as LZO, BZIP2, Snappy, GZIP, and so on. Every algorithm has its own pros and cons when you consider the time taken to compress and decompress and the space efficiency. These days, people prefer Snappy compression, as it aims to achieve very high speed with a reasonable amount of compression. We can easily store and process any number of files in HDFS. To store compressed data, we don't need to make any specific changes to the Hadoop cluster; you can simply copy the compressed data to HDFS in the same way as any other file. Here is an example:

hadoop fs -mkdir /compressed
hadoop fs -put file.bz2 /compressed

Now, we'll run a sample program to take a look at how Hadoop automatically uncompresses the file and processes it:

hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar wordcount /compressed /compressed_out

Once the job is complete, you can verify the output.

How it works...

Hadoop explores native libraries to find the support needed for various codecs and their implementations. Native libraries are specific to the platform that you run Hadoop on, and you don't need to make any configuration changes to enable the compression algorithms. As mentioned earlier, Hadoop supports various compression algorithms that are already familiar to the computing world. Based on your needs and requirements (more space or more time), you can choose your compression algorithm. Take a look at http://comphadoop.weebly.com/ for more information.

Summary

We covered the major aspects of HDFS in this article, through recipes that help us load, extract, import, export, and save data in HDFS. It also covered enabling transparent encryption for HDFS as well as adjusting the block size of an HDFS cluster.

Further resources on this subject: Hadoop and MapReduce [article], Advanced Hadoop MapReduce Administration [article], Integration with Hadoop [article]

FAT* 2018 Conference Session 2 Summary: Interpretability and Explainability

Savia Lobo
22 Feb 2018
5 min read
This session of FAT* 2018 is about interpretability and explainability in machine learning models. With the advances in deep learning, machine learning models have become more accurate. However, with that accuracy it is a tough task to keep the models highly explainable. This means these models may appear as black boxes to business users, who utilize them without knowing what lies within. Thus, it is equally important to make ML models interpretable and explainable, which is beneficial and essential for understanding ML models and having a 'behind the scenes' knowledge of what's happening within them. This understanding can be highly essential for heavily regulated industries like finance, medicine, defence, and so on.

The Conference on Fairness, Accountability, and Transparency (FAT), which will be held on the 23rd and 24th of February, 2018, is a multi-disciplinary conference that brings together researchers and practitioners interested in fairness, accountability, and transparency in socio-technical systems. The FAT 2018 conference will witness 17 research papers, 6 tutorials, and 2 keynote presentations from leading experts in the field. This article covers research papers pertaining to the 2nd session, which is dedicated to interpretability and explainability of machine-learned decisions. If you've missed our summary of the 1st session on Online Discrimination and Privacy, visit the article link for a catch up.

Paper 1: Meaningful Information and the Right to Explanation

This paper addresses an active debate in policy, industry, academia, and the media about whether and to what extent Europe's new General Data Protection Regulation (GDPR) grants individuals a "right to explanation" of automated decisions. The paper explores two major papers:

European Union Regulations on Algorithmic Decision Making and a "Right to Explanation" by Goodman and Flaxman (2017)
Why a Right to Explanation of Automated Decision-Making Does Not Exist in the General Data Protection Regulation by Wachter et al. (2017)

This paper demonstrates that the specified framework is built on incorrect legal and technical assumptions. In addition to responding to the existing scholarly contributions, the article articulates a positive conception of the right to explanation, located in the text and purpose of the GDPR. The authors take the position that the right should be interpreted functionally and flexibly, and should, at a minimum, enable a data subject to exercise his or her rights under the GDPR and human rights law.

Key takeaways:

The first paper by Goodman and Flaxman states that GDPR creates a "right to explanation", but without any argument.
The second paper is in response to the first paper, where Wachter et al. have published an extensive critique, arguing against the existence of such a right.
The current paper, on the other hand, is partially concerned with responding to the arguments of Wachter et al.

Paper 2: Interpretable Active Learning

The paper highlights how, due to complex and opaque ML models, the process of active learning has also become opaque: not much is known about what specific trends and patterns the active learning strategy may be exploring. The paper expands on LIME (the Local Interpretable Model-agnostic Explanations framework) to provide explanations for active learning recommendations. The authors, Richard Phillips, Kyu Hyun Chang, and Sorelle A. Friedler, demonstrate uses of LIME in generating locally faithful explanations for an active learning strategy. Further, the paper shows how these explanations can be used to understand how different models and datasets explore a problem space over time. (A short, illustrative LIME code sketch follows this session summary.)

Key takeaways:

The paper demonstrates how active learning choices can be made more interpretable to non-experts.
It also discusses techniques that make active learning interpretable to expert labelers, so that queries and query batches can be explained and the uncertainty bias can be tracked via interpretable clusters.
It showcases per-query explanations of uncertainty to develop a system that allows experts to choose whether to label a query. This will allow them to incorporate domain knowledge and their own interests into the labeling process.
It introduces a quantified notion of uncertainty bias, the idea that an algorithm may be less certain about its decisions on some data clusters than others.

Paper 3: Interventions over Predictions: Reframing the Ethical Debate for Actuarial Risk Assessment

Actuarial risk assessments might be unduly perceived as a neutral way to counteract implicit bias and increase the fairness of decisions made within the criminal justice system, from pretrial release to sentencing, parole, and probation. Recently, however, these assessments have come under increased scrutiny, as critics claim that the statistical techniques underlying them might reproduce existing patterns of discrimination and historical biases that are reflected in the data. The paper proposes that machine learning should not be used for prediction, but rather to surface covariates that are fed into a causal model for understanding the social, structural, and psychological drivers of crime. The authors, Chelsea Barabas, Madars Virza, Karthik Dinakar, Joichi Ito (MIT), and Jonathan Zittrain (Harvard), propose an alternative application of machine learning and causal inference away from predicting risk scores and toward risk mitigation.

Key takeaways:

The paper gives a brief overview of how risk assessments have evolved from a tool used solely for prediction to one that is diagnostic at its core.
The paper places the debate around risk assessment in a broader context, giving a fuller understanding of the way these actuarial tools have evolved to achieve a varied set of social and institutional agendas.
It argues for a shift away from predictive technologies, towards diagnostic methods that will help in understanding the criminogenic effects of the criminal justice system itself, as well as evaluating the effectiveness of interventions designed to interrupt cycles of crime.
It proposes that risk assessments, when viewed as a diagnostic tool, can be used to understand the underlying social, economic, and psychological drivers of crime.
The authors also posit that causal inference offers the best framework for pursuing the goal of a fair and ethical risk assessment tool.
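To make Paper 2's use of LIME more concrete, here is a minimal, hedged sketch of how the open-source lime package is typically used to explain a single tabular prediction. The classifier, dataset, and instance index are placeholders, and the snippet illustrates plain LIME rather than the authors' active-learning extension.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

# Placeholder model and data standing in for the active learner's classifier
iris = load_iris()
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(iris.data, iris.target)

explainer = LimeTabularExplainer(
    iris.data,
    feature_names=iris.feature_names,
    class_names=list(iris.target_names),
    discretize_continuous=True,
)

# Explain one instance that the active learner might query next
exp = explainer.explain_instance(iris.data[42], clf.predict_proba, num_features=4)
for feature, weight in exp.as_list():
    print(f"{feature}: {weight:+.3f}")

The printed feature/weight pairs are the "locally faithful" explanation: they describe which features pushed the model's prediction for this one instance, which is the kind of per-query explanation the paper builds on.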

GQL (Graph Query Language) joins SQL as a Global Standards Project and will be the international standard declarative query language for graphs

Amrata Joshi
19 Sep 2019
6 min read
On Tuesday, the team at Neo4j, the graph database management system, announced that the international committees behind the development of the SQL standard have voted to initiate GQL (Graph Query Language) as the new database query language. GQL is now going to be the international standard declarative query language for property graphs, and it is also a Global Standards Project. GQL is developed and maintained by the same international group that maintains the SQL standard.

How did the proposal for GQL pass?

Last year in May, the initiative for GQL was processed for the first time in the GQL Manifesto. This year in June, the national standards bodies across the world from the ISO/IEC's Joint Technical Committee 1 (responsible for IT standards) started voting on the GQL project proposal. The ballot closed earlier this week and the proposal was passed, wherein ten countries including Germany, Korea, United States, UK, and China voted in favor, and seven countries agreed to put forward their experts to work on this project. Japan was the only country to vote against in the ballot because, according to Japan, existing languages already do the job, and SQL/Property Graph Query extensions along with the rest of the SQL standard can do the same job.

According to the Neo4j team, the GQL project will initiate development of next-generation technology standards for accessing data. Its charter mandates building on core foundations that are established by SQL and ongoing collaboration in order to ensure SQL and GQL interoperability and compatibility. GQL would reflect rapid growth in the graph database market by increasing adoption of the Cypher language.

Stefan Plantikow, GQL project lead and editor of the planned GQL specification, said, "I believe now is the perfect time for the industry to come together and define the next generation graph query language standard."

Plantikow further added, "It's great to receive formal recognition of the need for a standard language. Building upon a decade of experience with property graph querying, GQL will support native graph data types and structures, its own graph schema, a pattern-based approach to data querying, insertion and manipulation, and the ability to create new graphs, and graph views, as well as generate tabular and nested data. Our intent is to respect, evolve, and integrate key concepts from several existing languages including graph extensions to SQL."

Keith Hare, who has served as the chair of the international SQL standards committee for database languages since 2005, charted the progress toward GQL, saying, "We have reached a balance of initiating GQL, the database query language of the future, whilst preserving the value and ubiquity of SQL."

Hare further added, "Our committee has been heartened to see strong international community participation to usher in the GQL project. Such support is the mark of an emerging de jure and de facto standard."

The need for a graph-specific query language

Researchers and vendors needed a graph-specific query language because of the following limitations:

The SQL/PGQ language is restricted to read-only queries.
SQL/PGQ cannot project new graphs.
The SQL/PGQ language can only access graphs that are based on taking a graph view over SQL tables.

Researchers and vendors needed a language like Cypher that would cover insertion and maintenance of data and not just data querying. But SQL wasn't the apt model for a graph-centric language that takes graphs as query inputs and outputs a graph as a result. GQL, on the other hand, builds on openCypher, a project that brings Cypher to Apache Spark and gives users a composable graph query language.

SQL and GQL can work together

According to most of the companies and national standards bodies that are supporting the GQL initiative, GQL and SQL are not competitors. Instead, these languages can complement each other via interoperation and shared foundations. Alastair Green, Query Languages Standards & Research Lead at Neo4j, writes, "A SQL/PGQ query is in fact a SQL sub-query wrapped around a chunk of proto-GQL."

SQL is a language that is built around tables, whereas GQL is built around graphs. Users can use GQL to find and project a graph from a graph. Green further writes, "I think that the SQL standards community has made the right decision here: allow SQL, a language built around tables, to quote GQL when the SQL user wants to find and project a table from a graph, but use GQL when the user wants to find and project a graph from a graph. Which means that we can produce and catalog graphs which are not just views over tables, but discrete complex data objects."

It is still not clear when the first implementable version of GQL will be out. The official page reads, "The work of the GQL project starts in earnest at the next meeting of the SQL/GQL standards committee, ISO/IEC JTC 1 SC 32/WG3, in Arusha, Tanzania, later this month. It is impossible at this stage to say when the first implementable version of GQL will become available, but it is highly likely that some reasonably complete draft will have been created by the second half of 2020."

Developer community welcomes the new addition

Users are excited to see how GQL will incorporate Cypher. A user commented on Hacker News, "It's been years since I've worked with the product and while I don't miss Neo4j, I do miss the query language. It's a little unclear to me how GQL will incorporate Cypher but I hope the initiative is successful if for no other reason than a selfish one: I'd love Cypher to be around if I ever wind up using a GraphDB again."

A few others mistook GQL for Facebook's GraphQL and are sceptical about the name. A comment on Hacker News reads, "Also, the name is of course justified, but it will be a mess to search for due to (Facebook) GraphQL." A user commented, "I read the entire article and came away mistakenly thinking this was the same thing as GraphQL." Another user commented, "That's quiet an unfortunate name clash with the existing GraphQL language in a similar domain."

Other interesting news in Data

Media manipulation by Deepfakes and cheap fakes require both AI and social fixes, finds a Data & Society report
Percona announces Percona Distribution for PostgreSQL to support open source databases
Keras 2.3.0, the first release of multi-backend Keras with TensorFlow 2.0 support is now out
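Since GQL builds on a decade of property graph querying in Cypher, a tiny Cypher example helps ground what a "pattern-based approach to data querying" looks like in practice. The sketch below uses the neo4j Python driver against a hypothetical local instance; the connection URI, credentials, and the Person/KNOWS data model are illustrative only, and the syntax shown is Cypher, not the still-unwritten GQL standard.

from neo4j import GraphDatabase

# Hypothetical local Neo4j instance and credentials
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

with driver.session() as session:
    # Create two nodes and a relationship using a graph pattern
    session.run(
        "MERGE (a:Person {name: $a}) "
        "MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:KNOWS]->(b)",
        a="Alice", b="Bob",
    )

    # Pattern matching: find who Alice knows and return a table of names
    result = session.run(
        "MATCH (a:Person {name: $name})-[:KNOWS]->(friend) "
        "RETURN friend.name AS friend",
        name="Alice",
    )
    for record in result:
        print(record["friend"])

driver.close()

The second query returns tabular output from a graph pattern, which is exactly the kind of interoperation point (table out of graph versus graph out of graph) that the SQL/PGQ-versus-GQL discussion above is about.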

Transformers 2.0: NLP library with deep interoperability between TensorFlow 2.0 and PyTorch, and 32+ pretrained models in 100+ languages

Fatema Patrawala
30 Sep 2019
3 min read
Last week, Hugging Face, a startup specializing in natural language processing, released a landmark update to their popular Transformers library, offering unprecedented compatibility between two major deep learning frameworks, PyTorch and TensorFlow 2.0.

Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet, and more) for Natural Language Understanding (NLU) and Natural Language Generation (NLG), with over 32+ pretrained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch.

Transformers 2.0 embraces the 'best of both worlds', combining PyTorch's ease of use with TensorFlow's production-grade ecosystem. The new library makes it easier for scientists and practitioners to select different frameworks for the training, evaluation and production phases of developing the same language model.

"This is a lot deeper than what people usually think when they talk about compatibility," said Thomas Wolf, who leads Hugging Face's data science team. "It's not only about being able to use the library separately in PyTorch and TensorFlow. We're talking about being able to seamlessly move from one framework to the other dynamically during the life of the model."

https://twitter.com/Thom_Wolf/status/1177193003678601216

"It's the number one feature that companies asked for since the launch of the library last year," said Clement Delangue, CEO of Hugging Face.

Notable features in Transformers 2.0

8 architectures with over 30 pretrained models, in more than 100 languages
Load a model and pre-process a dataset in less than 10 lines of code
Train a state-of-the-art language model in a single line with the tf.keras fit function
Share pretrained models, reducing compute costs and carbon footprint
Deep interoperability between TensorFlow 2.0 and PyTorch models
Move a single model between TF2.0/PyTorch frameworks at will
Seamlessly pick the right framework for training, evaluation, production
As powerful and concise as Keras

About Hugging Face Transformers

With half a million installs since January 2019, Transformers is the most popular open-source NLP library. More than 1,000 companies including Bing, Apple or Stitchfix are using it in production for text classification, question-answering, intent detection, text generation or conversational use cases. Hugging Face, the creators of Transformers, have raised US$5M so far from investors in companies like Betaworks, Salesforce, Amazon and Apple. On Hacker News, users are appreciating the company and how Transformers has become the most important library in NLP.

Other interesting news in data

Baidu open sources ERNIE 2.0, a continual pre-training NLP model that outperforms BERT and XLNet on 16 NLP tasks
Dr Joshua Eckroth on performing Sentiment Analysis on social media platforms using CoreNLP
Facebook open-sources PyText, a PyTorch based NLP modeling framework
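To illustrate the PyTorch/TensorFlow interoperability described above, here is a small, hedged sketch that loads the same pretrained BERT checkpoint into both frameworks and runs it on one sentence. It assumes transformers, torch, and tensorflow are installed; the checkpoint name and the example sentence are just placeholders.

import torch
import tensorflow as tf
from transformers import BertTokenizer, BertModel, TFBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
ids = tokenizer.encode("Transformers 2.0 speaks both frameworks.")

# Same checkpoint, two frameworks
pt_model = BertModel.from_pretrained("bert-base-uncased")     # PyTorch
tf_model = TFBertModel.from_pretrained("bert-base-uncased")   # TensorFlow 2.0

# PyTorch forward pass; the first element of the output tuple is the last hidden state
pt_hidden = pt_model(torch.tensor([ids]))[0]

# TensorFlow 2.0 forward pass on the same input ids
tf_hidden = tf_model(tf.constant([ids]))[0]

print(pt_hidden.shape, tf_hidden.shape)  # both: (1, sequence_length, 768)

Because the two models share the same weights and tokenizer, a team could, for example, fine-tune in TensorFlow with the Keras fit function and still serve or debug the model from PyTorch.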

Saving backups on cloud services with ElasticSearch plugins

Savia Lobo
29 Dec 2017
9 min read
[box type="note" align="" class="" width=""]Our article is a book excerpt taken from Mastering Elasticsearch 5.x. written by Bharvi Dixit. This book guides you through the intermediate and advanced functionalities of Elasticsearch, such as querying, indexing, searching, and modifying data. In other words, you will gain all the knowledge necessary to master Elasticsearch and put into efficient use.[/box] This article will explain how Elasticsearch, with the help of additional plugins, allows us to push our data outside of the cluster to the cloud. There are three possibilities where our repository can be located, at least using officially supported plugins: The S3 repository: AWS The HDFS repository: Hadoop clusters The GCS repository: Google cloud services The Azure repository: Microsoft's cloud platform Let's go through these repositories to see how we can push our backup data on the cloud services. The S3 repository The S3 repository is a part of the Elasticsearch AWS plugin, so to use S3 as the repository for snapshotting, we need to install the plugin first on every node of the cluster and each node must be restarted after the plugin installation: sudo bin/elasticsearch-plugin install repository-s3 After installing the plugin on every Elasticsearch node in the cluster, we need to alter their configuration (the elasticsearch.yml file) so that the AWS access information is available. The example configuration can look like this: cloud: aws: access_key: YOUR_ACCESS_KEY secret_key: YOUT_SECRET_KEY To create the S3 repository that Elasticsearch will use for snapshotting, we need to run a command similar to the following one: curl -XPUT 'http://localhost:9200/_snapshot/my_s3_repository' -d '{ "type": "s3", "settings": { "bucket": "bucket_name" } }' The following settings are supported when defining an S3-based repository: bucket: This is the required parameter describing the Amazon S3 bucket to which the Elasticsearch data will be written and from which Elasticsearch will read the data. region: This is the name of the AWS region where the bucket resides. By default, the US Standard region is used. base_path: By default, Elasticsearch puts the data in the root directory. This parameter allows you to change it and alter the place where the data is placed in the repository. server_side_encryption: By default, encryption is turned off. You can set this parameter to true in order to use the AES256 algorithm to store data. chunk_size: By default, this is set to 1GB and specifies the size of the data chunk that will be sent. If the snapshot size is larger than the chunk_size, Elasticsearch will split the data into smaller chunks that are not larger than the size specified in the chunk_size. The chunk size can be specified in size notations such as 1GB, 100mb, and 1024kB. buffer_size: The size of this buffer is set to 100mb by default. When the chunk size is greater than the value of the buffer_size, Elasticsearch will split it into buffer_size fragments and use the AWS multipart API to send it. The buffer size cannot be set lower than 5 MB because it disallows the use of the multipart API. endpoint: This defaults to AWS's default S3 endpoint. Setting a region overrides the endpoint setting. protocol: Specifies whether to use http or https. It default to cloud.aws.protocol or cloud.aws.s3.protocol. compress: Defaults to false and when set to true. This option allows snapshot metadata files to be stored in a compressed format. Please note that index files are already compressed by default. 
read_only: Makes a repository to be read only. It defaults to false. max_retries: This specifies the number of retries Elasticsearch will take before giving up on storing or retrieving the snapshot. By default, it is set to 3. In addition to the preceding properties, we are allowed to set two additional properties that can overwrite the credentials stored in elasticsearch.yml, which will be used to connect to S3. This is especially handy when you want to use several S3 repositories, each with its own security settings: access_key: This overwrites cloud.aws.access_key from elasticsearch.yml secret_key: This overwrites cloud.aws.secret_key from elasticsearch.yml    Note:    AWS instances resolve S3 endpoints to a public IP. If the Elasticsearch instances reside in a private subnet in an AWS VPC then all traffic to S3 will go through that VPC's NAT instance. If your VPC's NAT instance is a smaller instance size (for example, a t1.micro) or is handling a high volume of network traffic, your bandwidth to S3 may be limited by that NAT instance's networking bandwidth limitations. So, if you running your Elasticsearch cluster inside a VPC then make sure that you are using instances with a high networking bandwidth and there is no network congestion. Note:  Instances residing in a public subnet in an AWS VPC will   connect to S3 via the VPC's Internet gateway and not be bandwidth limited by the VPC's NAT instance. The HDFS repository If you use Hadoop and its HDFS (http://wiki.apache.org/hadoop/HDFS) filesystem, a good alternative to back up the Elasticsearch data is to store it in your Hadoop cluster. As with the case of S3, there is a dedicated plugin for this. To install it, we can use the following  command: sudo bin/elasticsearch-plugin install repository-hdfs   Note :    The HDFS snapshot/restore plugin is built against the latest Apache Hadoop 2.x (currently 2.7.1). If your Hadoop distribution is not protocol compatible with Apache Hadoop, you can replace the Hadoop libraries inside the plugin folder with your own (you might have to adjust the security permissions required). Note:  Even if Hadoop is already installed on the Elasticsearch nodes, for security        reasons, the required libraries need to be placed under the plugin folder. Note that in most cases, if the distribution is compatible, one simply needs to configure the repository with the appropriate Hadoop configuration files. After installing the plugin on each node in the cluster and restarting every node, we can use the following command to create a repository in our Hadoop cluster: curl -XPUT 'http://localhost:9200/_snapshot/es_hdfs_repository' -d '{ "type": "hdfs" "settings": { "uri": "hdfs://namenode:8020/", "path": "elasticsearch_snapshots/es_hdfs_repository" } }' The available settings that we can use are as follows: uri: This is a required parameter that tells Elasticsearch where HDFS resides. It should have a format like hdfs://HOST:PORT/. path: This is the information about the path where snapshot files should be stored. It is a required parameter. load_default: This specifies whether the default parameters from the Hadoop configuration should be loaded and set to false if the reading of the settings should be disabled. This setting is enabled by default. chunk_size: This specifies the size of the chunk that Elasticsearch will use to split the snapshot data. If you want the snapshotting to be faster, you can use smaller chunks and more streams to push the data to HDFS. By default, it is disabled. 
conf.<key>: This is an optional parameter and tells where a key is in any Hadoop argument. The value provided using this property will be merged with the configuration. As an alternative, you can define your HDFS repository and its settings inside the elasticsearch.yml file of each node as follows: repositories: hdfs: uri: "hdfs://<host>:<port>/" path: "some/path" load_defaults: "true" conf.<key> : "<value>" compress: "false" chunk_size: "10mb" The Azure repository Just like Amazon S3, we are able to use a dedicated plugin to push our indices and metadata to Microsoft cloud services. To do this, we need to install a plugin on every node of the cluster, which we can do by running the following command: sudo bin/elasticsearch-plugin install repository-azure The configuration is also similar to the Amazon S3 plugin configuration. Our elasticsearch.yml file should contain the following section: cloud: azure: storage: my_account: account: your_azure_storage_account key: your_azure_storage_key Do not forget to restart all the nodes after installing the plugin. After Elasticsearch is configured, we need to create the actual repository, which we do by running the following command: curl -XPUT 'http://localhost:9200/_snapshot/azure_repository' -d '{ "type": "azure" }' The following settings are supported by the Elasticsearch Azure plugin: account: Microsoft Azure account settings to be used. container: As with the bucket in Amazon S3, every piece of information must reside in the container. This setting defines the name of the container in the Microsoft Azure space. The default value is elasticsearch-snapshots. base_path: This allows us to change the place where Elasticsearch will put the data. By default, the value for this setting is empty which causes Elasticsearch to put the data in the root directory. compress: This defaults to false and when enabled it allows us to compress the metadata files during the snapshot creation. chunk_size: This is the maximum chunk size used by Elasticsearch (set to 64m by default, and this is also the maximum value allowed). You can change it to change the size when the data should be split into smaller chunks. You can change the chunk size using size value notations such as, 1g, 100m, or 5k.  An example of creating a repository using the settings follows: curl -XPUT "http://localhost:9205/_snapshot/azure_repository" -d' { "type": "azure", "settings": { "container": "es-backup-container", "base_path": "backups", "chunk_size": "100m", "compress": true } }' The Google cloud storage repository Similar to Amazon S3 and Microsoft Azure, we can use a GCS repository plugin for snapshotting and restoring of our indices. The settings for this plugin are almost similar to other cloud plugins. To know how to work with the Google cloud repository plugin please refer to the following URL: https://www.elastic.co/guide/en/elasticsearch/plugins/5.0/repository-gcs.htm Thus, in the article we learn how to carry out backup of your data from Elasticsearch clusters to the cloud, i.e. within different cloud repositories by making use of the additional plugin options with Elasticsearch. If you found our excerpt useful, you may explore other interesting features and advanced concepts of Elasticsearch 5.x like aggregation, index control, sharding, replication, and clustering in the book Mastering Elasticsearch 5.x.
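As a quick recap of the snapshot workflow covered in this excerpt, the following Python sketch registers an S3-backed repository and then triggers a snapshot through Elasticsearch's REST API using the requests library. The cluster endpoint, bucket name, and repository/snapshot names are placeholders, and it assumes the repository-s3 plugin and the AWS credentials in elasticsearch.yml are already configured as described above.

import json
import requests

ES = "http://localhost:9200"                      # hypothetical cluster endpoint
HEADERS = {"Content-Type": "application/json"}

# 1. Register an S3-backed snapshot repository (bucket name is illustrative)
repo_settings = {
    "type": "s3",
    "settings": {
        "bucket": "my-es-backups",
        "base_path": "snapshots",
        "compress": True,
    },
}
r = requests.put(f"{ES}/_snapshot/my_s3_repository",
                 data=json.dumps(repo_settings), headers=HEADERS)
r.raise_for_status()

# 2. Take a snapshot of all indices and wait for it to finish
r = requests.put(f"{ES}/_snapshot/my_s3_repository/snapshot_1",
                 params={"wait_for_completion": "true"}, headers=HEADERS)
r.raise_for_status()
print(r.json()["snapshot"]["state"])               # expect "SUCCESS"

Switching the repository type to "hdfs", "azure", or "gcs" only changes the body of the first request; the snapshot call itself stays the same.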

Machine Learning Algorithms: Implementing Naive Bayes with Spark MLlib

Wilson D'souza
07 Nov 2017
7 min read
[box type="note" align="" class="" width=""]In this article by Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, and Shuen Mei from their book Apache Spark 2.x Machine Learning Cookbook, we look at how to implement Naïve Bayes classification algorithm with Spark 2.0 MLlib. The associated code and exercise are available at the end of the article.[/box] How to implement Naive Bayes with Spark MLlib Naïve Bayes is one of the most widely used classification algorithms which can be trained and optimized quite efficiently. Spark’s machine learning library, MLlib, primarily focuses on simplifying machine learning and has great support for multinomial naïve Bayes and Bernoulli naïve Bayes. Here we use the famous Iris dataset and use Apache Spark API NaiveBayes() to classify/predict which of the three classes of flower a given set of observations belongs to. This is an example of a multi-class classifier and requires multi-class metrics for measurements of fit. Let’s have a look at the steps to achieve this: For the Naive Bayes exercise, we use a famous dataset called iris.data, which can be obtained from UCI. The dataset was originally introduced in the 1930s by R. Fisher. The set is a multivariate dataset with flower attribute measurements classified into three groups. In short, by measuring four columns, we attempt to classify a species into one of the three classes of Iris flower (that is, Iris Setosa, Iris Versicolour, Iris Virginica).We can download the data from here: https://archive.ics.uci.edu/ml/datasets/Iris/  The column definition is as follows: Sepal length in cm Sepal width in cm Petal length in cm Petal width in cm  Class: -- Iris Setosa => Replace it with 0 -- Iris Versicolour => Replace it with 1 -- Iris Virginica => Replace it with 2 The steps/actions we need to perform on the data are as follows: Download and then replace column five (that is, the label or classification classes) with a numerical value, thus producing the iris.data.prepared data file. The Naïve Bayes call requires numerical labels and not text, which is very common with most tools. Remove the extra lines at the end of the file. Remove duplicates within the program by using the distinct() call. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included. Set up the package location where the program will reside: package spark.ml.cookbook.chapter6 Import the necessary packages for SparkSession to gain access to the cluster and Log4j.Logger to reduce the amount of output produced by Spark:  import org.apache.spark.mllib.linalg.{Vector, Vectors} import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel} import org.apache.spark.mllib.evaluation.{BinaryClassificationMetrics, MulticlassMetrics, MultilabelMetrics, binary} import org.apache.spark.sql.{SQLContext, SparkSession} import org.apache.log4j.Logger import org.apache.log4j.Level Initialize a SparkSession specifying configurations with the builder pattern, thus making an entry point available for the Spark cluster: val spark = SparkSession .builder .master("local[4]") .appName("myNaiveBayes08") .config("spark.sql.warehouse.dir", ".") .getOrCreate() val data = sc.textFile("../data/sparkml2/chapter6/iris.data.prepared.txt") Parse the data using map() and then build a LabeledPoint data structure. In this case, the last column is the Label and the first four columns are the features. 
Again, we replace the text in the last column (that is, the class of Iris) with numeric values (that is, 0, 1, 2) accordingly: val NaiveBayesDataSet = data.map { line => val columns = line.split(',') LabeledPoint(columns(4).toDouble , Vectors.dense(columns(0).toDouble,columns(1).toDouble,columns(2).to Double,columns(3).toDouble )) } Then make sure that the file does not contain any redundant rows. In this case, it has three redundant rows. We will use the distinct dataset going forward: println(" Total number of data vectors =", NaiveBayesDataSet.count()) val distinctNaiveBayesData = NaiveBayesDataSet.distinct() println("Distinct number of data vectors = ", distinctNaiveBayesData.count()) Output: (Total number of data vectors =,150) (Distinct number of data vectors = ,147) We inspect the data by examining the output: distinctNaiveBayesData.collect().take(10).foreach(println(_)) Output: (2.0,[6.3,2.9,5.6,1.8]) (2.0,[7.6,3.0,6.6,2.1]) (1.0,[4.9,2.4,3.3,1.0]) (0.0,[5.1,3.7,1.5,0.4]) (0.0,[5.5,3.5,1.3,0.2]) (0.0,[4.8,3.1,1.6,0.2]) (0.0,[5.0,3.6,1.4,0.2]) (2.0,[7.2,3.6,6.1,2.5]) .............. ................ ............. Split the data into training and test sets using a 30% and 70% ratio. The 13L in this case is simply a seeding number (L stands for long data type) to make sure the result does not change from run to run when using a randomSplit() method: val allDistinctData = distinctNaiveBayesData.randomSplit(Array(.30,.70),13L) val trainingDataSet = allDistinctData(0) val testingDataSet = allDistinctData(1) Print the count for each set: println("number of training data =",trainingDataSet.count()) println("number of test data =",testingDataSet.count()) Output: (number of training data =,44) (number of test data =,103) Build the model using train() and the training dataset: val myNaiveBayesModel = NaiveBayes.train(trainingDataSet Use the training dataset plus the map() and predict() methods to classify the flowers based on their features: val predictedClassification = testingDataSet.map( x => (myNaiveBayesModel.predict(x.features), x.label)) Examine the predictions via the output: predictedClassification.collect().foreach(println(_)) (2.0,2.0) (1.0,1.0) (0.0,0.0) (0.0,0.0) (0.0,0.0) (2.0,2.0) ....... ....... ....... Use MulticlassMetrics() to create metrics for the multi-class classifier. As a reminder, this is different from the previous recipe, in which we used BinaryClassificationMetrics(): val metrics = new MulticlassMetrics(predictedClassification) Use the commonly used confusion matrix to evaluate the model: val confusionMatrix = metrics.confusionMatrix println("Confusion Matrix= n",confusionMatrix) Output: (Confusion Matrix= ,35.0 0.0 0.0 0.0 34.0 0.0 0.0 14.0 20.0 ) We examine other properties to evaluate the model: val myModelStat=Seq(metrics.precision,metrics.fMeasure,metrics.recall) myModelStat.foreach(println(_)) Output: 0.8640776699029126 0.8640776699029126 0.8640776699029126 How it works... We used the IRIS dataset for this recipe, but we prepared the data ahead of time and then selected the distinct number of rows by using the NaiveBayesDataSet.distinct() API. We then proceeded to train the model using the NaiveBayes.train() API. In the last step, we predicted using .predict() and then evaluated the model performance via MulticlassMetrics() by outputting the confusion matrix, precision, and F-Measure metrics. The idea here was to classify the observations based on a selected feature set (that is, feature engineering) into classes that correspond to the left-hand label. 
The difference here was that we are applying joint probability given conditional probability to the classification. This concept is known as Bayes' theorem, which was originally proposed by Thomas Bayes in the 18th century. There is a strong assumption of independence that must hold true for the underlying features to make Bayes' classifier work properly. At a high level, the way we achieved this method of classification was to simply apply Bayes' rule to our dataset. As a refresher from basic statistics, Bayes' rule can be written as follows: The formula states that the probability of A given B is true is equal to probability of B given A is true times probability of A being true divided by probability of B being true. It is a complicated sentence, but if we step back and think about it, it will make sense. The Bayes' classifier is a simple yet powerful one that allows the user to take the entire probability feature space into consideration. To appreciate its simplicity, one must remember that probability and frequency are two sides of the same coin. The Bayes' classifier belongs to the incremental learner class in which it updates itself upon encountering a new sample. This allows the model to update itself on-the-fly as the new observation arrives rather than only operating in batch mode. We evaluated a model with different metrics. Since this is a multi-class classifier, we have to use MulticlassMetrics() to examine model accuracy. [box type="download" align="" class="" width=""]Download exercise and code files here. Exercise Files_Implementing Naive Bayes algorithm with Spark MLlib[/box] For more information on Multiclass Metrics, please see the following link: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib .evaluation.MulticlassMetrics Documentation for constructor can be found here: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.classification.NaiveBayes If you enjoyed this article, you should have a look at Apache Spark 2.0 Machine Learning Cookbook which contains this excerpt.
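The recipe above is written in Scala, but the same flow — parse the prepared Iris data into LabeledPoints, take distinct rows, split 30/70, train Naive Bayes, and evaluate with MulticlassMetrics — can be sketched in PySpark as below. The file path and split ratios mirror the recipe and are assumptions about your local setup; Bayes' rule, P(A|B) = P(B|A) * P(A) / P(B), is noted in a comment for reference.

from pyspark import SparkContext
from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.evaluation import MulticlassMetrics
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext("local[4]", "irisNaiveBayes")

def parse(line):
    # First four columns are features, fifth is the numeric class label (0, 1, 2)
    cols = line.split(",")
    return LabeledPoint(float(cols[4]), Vectors.dense([float(c) for c in cols[:4]]))

data = sc.textFile("iris.data.prepared.txt").map(parse).distinct()
train, test = data.randomSplit([0.3, 0.7], seed=13)

# Naive Bayes applies Bayes' rule, P(class | features) = P(features | class) * P(class) / P(features),
# under the assumption that the features are conditionally independent given the class.
model = NaiveBayes.train(train)

predictions_and_labels = test.map(lambda p: (float(model.predict(p.features)), p.label))
metrics = MulticlassMetrics(predictions_and_labels)
print(metrics.confusionMatrix())
print(metrics.accuracy)

As with the Scala version, the confusion matrix and overall accuracy are what you would compare across experiments when tuning the classifier.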

(13*3)+ Halloween costume ideas for Data science nerds

Packt Editorial Staff
31 Oct 2017
14 min read
Are you a data scientist, a machine learning engineer, an AI researcher or simply a data enthusiast? Channel the inner data science nerd within you with these geeky ideas for your Halloween costumes! The Data Science Spectrum Don't know what to go as to this evening's party because you've been busy cleaning that terrifying data? Don’t worry, here are some easy-to-put-together Halloween costume ideas just for you. [dropcap]1[/dropcap] Big Data Go as Baymax, the healthcare robot, (who can also turn into battle mode when required). Grab all white clothes that you have. Stuff your tummy with some pillows and wear a white mask with cutouts for eyes. You are all ready to save the world. In fact, convince a friend or your brother to go as Hiro! [dropcap]2[/dropcap] A.I. agent Enter as Agent Smith, the AI antagonist, this Halloween. Lure everyone with your bold black suit paired with a white shirt and a black tie. A pair of polarized sunglasses would replicate you as the AI agent. Capture the crowd by being the most intelligent and cold-hearted personality of all. [dropcap]3[/dropcap] Data Miner Put on your dungaree with a tee. Fix a flashlight atop your cap. Grab a pickaxe from the gardening toolkit, if you have one. Stripe some mud onto your face. Enter the party wheeling with loads of data boxes that you have freshly mined. You’ll definitely grab some traffic for data. Unstructured data anyone? [dropcap]4[/dropcap] Data Lake Go as a Data lake this Halloween. Simply grab any blue item from your closet. Draw some fishes, crabs, and weeds. (Use a child’s marker for that). After all, it represents the data you have. And you’re all set. [dropcap]5[/dropcap] Dark Data Unleash the darkness within your soul! Just kidding. You don’t actually have to turn to the evil side. Just coming up with your favorite black-costume character would do. Looking for inspiration? Maybe, a witch, The dark knight, or The Darth Vader. [dropcap]6[/dropcap] Cloud A fluffy, white cloud is what you need to be this Halloween. Raid your nearby drug store for loads of cotton balls. Better still, tear up that old pillow you have been meaning to throw away for a while. Use the fiber inside to glue onto an unused tee. You will be the cutest cloud ever seen. Don’t forget to carry an umbrella in case you turn grey! [dropcap]7[/dropcap] Predictive Analytics Make your own paper wizard hat with silver stars and moons pasted on it. If you can arrange for an advocate gown, it would be great. Else you could use a long black bed sheet as a cape. And most importantly, a crystal ball to show off some prediction stunts at the Halloween. [dropcap]8[/dropcap] Gradient boosting Enter Halloween as the energy booster. Wear what you want. Grab loads of empty energy drink tetra packs and stick it all over you. Place one on your head too. Wear a nameplate that says “ G-booster Energy drink”. Fuel up some weak models this Halloween. [dropcap]9[/dropcap] Cryptocurrency Wear head to toe black. In fact, paint your face black as well, like the Grim reaper. Then grab a cardboard piece. Cut out a circle, paint it orange, and then draw a gold B symbol, just like you see in a bitcoin. This Halloween costume will definitely grab you the much-needed attention just as this popular cryptocurrency. [dropcap]10[/dropcap] IoT Are you a fan of IoT and the massive popularity it has gained? Then you should definitely dress up as your web-slinging, friendly neighborhood Spiderman. Just grab a spiderman costume from any costume store and attach some handmade web slings. 
Remember to connect with people by displaying your IoT knowledge. [dropcap]11[/dropcap] Self-driving car Choose a mono-color outfit of your choice (P.S. The color you would choose for your car). Cut out four wheels and paste two on your lower calves and two on your arms. Cut out headlights too. Put on a wiper goggle. And yes you do not need a steering wheel or the brakes, clutch and the accelerator. Enter the Halloween at your own pace, go self-driving this Halloween. Bonus point: You can call yourself Bumblebee or Optimus Prime. Machine Learning and Deep learning Frameworks If machine learning or deep learning is your forte, here are some fresh Halloween costume ideas based on some of the popular frameworks in that space. [dropcap]12[/dropcap] Torch Flame up the party with a costume inspired by the fantastic four superhero, Johnny Storm a.k.a The Human Torch. Wear a yellow tee and orange slacks. Draw some orange flames on your tee. And finally, wear a flame-inspired headband. Someone is a hot machine learning library! [dropcap]13[/dropcap] TensorFlow No efforts for this one. Just arrange for a pumpkin costume, paste a paper cut-out of the TensorFlow logo and wear it as a crown. Go as the most powerful and widely popular deep learning library. You will be the star of the Halloween as you are a Google Kid. [dropcap]14[/dropcap] Caffe Go as your favorite Starbucks coffee this Halloween. Wear any of your brown dress/ tee. Draw or stick a Starbucks logo. And then add frothing to the top by bunching up a cream-colored sheet. Mamma Mia! [dropcap]15[/dropcap] Pandas Go as a Panda this Halloween! Better still go as a group of Pandas. The best option is to buy a panda costume. But if you don’t want that, wear a white tee, black slacks, black goggles and some cardboard cutouts for ears. This will make you not only the cutest animal in the party but also a top data manipulation library. Good luck finding your python in the party by the way. [dropcap]16[/dropcap] Jupyter Notebook Go as a top trending open-source web application by dressing up as the largest planet in our solar system. People would surely be intimidated by your mass and also by your computing power. [dropcap]17[/dropcap] H2O Go to Halloween as a world famous open source deep learning platform. No, no, you don’t have to go as the platform itself. Instead go as the chemical alter-ego, water. Wear all blue and then grab some leftover asymmetric, blue cloth pieces to stick at your sides. Thirsty anyone? Data Viz & Analytics Tools If you’re all about analytics and visualization, grab the attention of every data geek in your party by dressing up as your favorite data insight tools. [dropcap]18[/dropcap] Excel Grab an old white tee and paint some green horizontal stripes. You’re all ready to go as the most widely used spreadsheet. The simplest of costumes, yet the most useful - a timeless classic that never goes out of fashion. [dropcap]19[/dropcap] MatLab If you have seriously run out of all costume ideas, going out as MatLab is your only solution. Just grab a blue tablecloth. Stick or sew it with some orange curtain and throw it over your head. You’re all ready to go as the multi-paradigm numerical computing environment. [dropcap]20[/dropcap] Weka Wear a brown overall, a brown wig, and paint your face brown. Make an orange beak out of a chart paper, and wear a pair orange stockings/ socks with your trousers tucked in. You are all set to enter as a data mining bird with ML algorithms and Java under your wings. 
[dropcap]21[/dropcap] Shiny Go all Shimmery!! Get some glitter powder and put it all over you. (You’ll have a tough time removing it though). Else choose a glittery outfit, with glittery shoes, and touch-up with some glitter on your face. Let the party see the bling of R that you bring. You will be the attractive storyteller out there. [dropcap]22[/dropcap] Bokeh A colorful polka-dotted outfit and some dim lights to do the magic. You are all ready to grab the show with such a dazzle. Make sure you enter the party gates with Python. An eye-catching beauty with the beast pair. [dropcap]23[/dropcap] Tableau Enter the Halloween as one of your favorite characters from history. But there is a term and condition for this: You cannot talk or move. Enjoy your Halloween by being still. Weird, but you’ll definitely grab everyone’s eye. [dropcap]24[/dropcap] Microsoft Power BI Power up your Halloween party by entering as a data insights superhero. Wear a yellow turtleneck, a stylish black leather jacket, black pants, some mid-thigh high boots and a slick attitude. You’re ready to save your party! Data Science oriented Programming languages These hand-picked Halloween costume ideas are for you if you consider yourself a top coder. By a top coder we mean you’re all about learning new programming languages in your spare and, well, your not so spare time.   [dropcap]25[/dropcap] Python Easy peasy as the language looks, the reptile is not that easy to handle. A pair of python-printed shirt and trousers would do the job. You could be getting more people giving you candies some out of fear, other out of the ease. Definitely, go as a top trending and a go-to language which everyone loves! And yes, don’t forget the fangs. [dropcap]26[/dropcap] R Grab an eye patch and your favorite leather pants. Wear a loose white shirt with some rugged waistcoat and a sword. Here you are all decked up as a pirate for your next loot. You’ll surely thank me for giving you a brilliant Halloween idea. But yes! Don’t forget to make that Arrrr (R) noise! [dropcap]27[/dropcap] Java Go as a freshly roasted coffee bean! People in your Halloween party would be allured by your aroma. They would definitely compliment your unique idea and also the fact that you’re the most popular programming language. [dropcap]28[/dropcap] SAS March in your Halloween party up as a Special Airforce Service (SAS) agent. You would be disciplined, accurate, precise and smart. Just like the advanced software suite that goes by the same name. You would need a full black military costume, with a gas mask, some fake ammunition from a nearby toy store, and some attitude of course! [dropcap]29[/dropcap] SQL If you pride yourself on being very organized or are a stickler for the rules, you should go as SQL this Halloween. Prep-up yourself with an overall blue outfit. Spike up your hair and spray some temporary green hair color. Cut out bold letters S, Q, and L from a plain white paper and stick them on your chest. You are now ready to enter the Halloween party as the most popular database of all times. Sink in all the data that you collect this Halloween. [dropcap]30[/dropcap] Scala If Scala is your favorite programming language, add a spring to your Halloween by going as, well, a spring! Wear the brightest red that you have. Using a marker, draw some swirls around your body (You can ask your mom to help). Just remember to elucidate a 3D picture. And you’re all set. 
[dropcap]31[/dropcap] Julia If you want to make a red carpet entrance to your Halloween party, go as the Academy award-winning actress, Julia Roberts. You can even take up inspiration from her character in the 90s hit film Pretty Woman. For extra oomph, wear a pink, red, and purple necklace to highlight the Julia programming language [dropcap]32[/dropcap] Ruby Act pricey this Halloween. Be the elegant, dynamic yet simple programming language. Go blood red, wear on your brightest red lipstick, red pumps, dazzle up with all the red accessories that you have. You’ll definitely gather some secret admirers around the hall. [dropcap]33[/dropcap] Go Go as the mascot of Go, the top trending programming language. All you need is a blue mouse costume. Fear not if you don’t have one. Just wear a powder blue jumpsuit, grab a baby pink nose, and clip on a fake single, large front tooth. Ready for the party! [dropcap]34[/dropcap] Octave Go as a numerically competent programming language. And if that doesn’t sound very trendy, go as piano keys depicting an octave. You simply need to wear all white and divide your space into 8 sections. Then draw 5 horizontal black stripes. You won’t be able to do that vertically, well, because they are a big number. Here you go, you’re all set to fill the party with your melody. Fancy an AI system inspired Halloween costume? This is for you if you love the way AI works and the enigma that it has thrown around the world. This is for you if you are spellbound with AI magic. You should go dressed as one of these at your Halloween party this season. Just pick up the AI you want to look like and follow as advised. [dropcap]35[/dropcap] IBM Watson Wear a dark blue hat, a matching long overcoat, a vest and a pale blue shirt with a dark tie tucked into the vest. Complement it with a mustache and a brooding look. You are now ready to be IBM Watson at your Halloween party. [dropcap]36[/dropcap] Apple Siri If you want to be all cool and sophisticated like the Apple’s Siri, wear an alluring black turtleneck dress. Don’t forget to carry your latest iPhone and air pods. Be sure you don’t have a sore throat, in case someone needs your assistance. [dropcap]37[/dropcap] Microsoft Cortana If Microsoft Cortana is your choice of voice assistant, dress up as Cortana, the fictional synthetic intelligence character in the Halo video game series. Wear a blue bodysuit. Get a bob if you’re daring. (A wig would also do). Paint some dark blue robot like designs over your body and well, your face. And you’re all set. [dropcap]38[/dropcap] Salesforce Einstein Dress up as the world’s most famous physicist and also an AI-powered CRM. How? Just grab a white shirt, a blue pullover and a blue tie (Salesforce colors). Finish your look with a brown tweed coat, brown pants and shoes, a rugged white wig and mustache, and a deep thought on your face. [dropcap]39[/dropcap] Facebook Jarvis Get inspired by the Iron man’s Jarvis, the coolest A.I. in the Marvel universe. Just grab a plexiglass, draw some holograms and technological symbols over it with a neon marker. (Try to keep the color palette in shades of blues and reds). And fix this plexiglass in a curved fashion in front of your face by a headband. Do practice saying “Hello Mr. Stark.”  [dropcap]40[/dropcap] Amazon Echo This is also an easy one. Grab a long, black chart paper. Roll it around in a tube form around your body. Draw the Amazon symbol at the bottom with some glittery, silver sketch pen, color your hair blue, and there you go. 
If you have a girlfriend, convince her to go as Amazon Alexa. [dropcap]41[/dropcap] SAP Leonardo Put on a hat, wear a long cloak, some fake overgrown mustache, and beard. Accessorize with a color palette and a paintbrush. You will be the Leonardo da Vinci of the Halloween party. Wait a minute, don’t forget to cut out SAP initials and stick them on your cap. After all, you are entering as SAP’s very own digital revolution system. [dropcap]42[/dropcap] Intel Neon Deck the Halloween hall with a Harley Quinn costume. For some extra dramatization, roll up some neon blue lights around your head. Create an Intel logo out of some blue neon lights and wear it as your neckpiece. [dropcap]43[/dropcap] Microsoft Brainwave This one will require a DIY task. Arrange for a red and green t-shirt, cut them into a vertical half. Stitch it in such a way that the green is on the left and the red on the right. Similarly, do that with your blue and yellow pants; with yellow on the left and blue on the right. You will look like the most powerful Microsoft’s logo. Wear a skullcap with wires protruding out and a Hololens like eyewear to go with. And so, you are all ready to enter the Halloween party as Microsoft’s deep learning acceleration platform for real-time AI. [dropcap]44[/dropcap] Sophia, the humanoid Enter with all the confidence and a top-to-toe professional attire. Be ready to answer any question thrown at you with grace and without a stroke of skepticism. And to top it off, sport a clean shaved head. And there, you are all ready to blow off everyone’s mind with a mix of beauty with super intelligent brains.   Happy Halloween folks!

How to recognize Patterns with Neural Networks in Java

Kunal Chaudhari
04 Jan 2018
8 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Fabio M. Soares and Alan M. F. Souza, titled Neural Network Programming with Java Second Edition. This book covers the current state-of-art in the field of neural network that helps you understand and design basic to advanced neural networks with Java.[/box] Our article explores the power of neural networks in pattern recognition by showcasing how to recognize digits from 0 to 9 in an image. For pattern recognition, the neural network architectures that can be applied are MLPs (supervised) and the Kohonen Network (unsupervised). In the first case, the problem should be set up as a classification problem, that is, the data should be transformed into the X-Y dataset, where for every data record in X there should be a corresponding class in Y. The output of the neural network for classification problems should have all of the possible classes, and this may require preprocessing of the output records. For the other case, unsupervised learning, there is no need to apply labels to the output, but the input data should be properly structured. To remind you, the schema of both neural networks are shown in the next figure: Data pre-processing We have to deal with all possible types of data, i.e., numerical (continuous and discrete) and categorical (ordinal or unscaled).  However, here we have the possibility of performing pattern recognition on multimedia content, such as images and videos. So, can multimedia could be handled? The answer to this question lies in the way these contents are stored in files. Images, for example, are written with a representation of small colored points called pixels. Each color can be coded in an RGB notation where the intensity of red, green, and blue define every color the human eye is able to see. Therefore an image of dimension 100x100 would have 10,000 pixels, each one having three values for red, green and blue, yielding a total of 30,000 points. That is the challenge for image processing in neural networks. Some methods, may reduce this huge number of dimensions. Afterwards an image can be treated as big matrix of numerical continuous values. For simplicity, we are applying only gray-scale images with small dimensions in this article. Text recognition (optical character recognition) Many documents are now being scanned and stored as images, making it necessary to convert these documents back into text, for a computer to apply edition and text processing. However, this feature involves a number of challenges: Variety of text font Text size Image noise Manuscripts In spite of that, humans can easily interpret and read even texts produced in a bad quality image. This can be explained by the fact that humans are already familiar with text characters and the words in their language. Somehow the algorithm must become acquainted with these elements (characters, digits, signalization, and so on), in order to successfully recognize texts in images. Digit recognition Although there are a variety of tools available on the market for OCR, it still remains a big challenge for an algorithm to properly recognize texts in images. So, we will be restricting our application to in a smaller domain, so that we'll face simpler problems. Therefore, in this article, we are going to implement a neural network to recognize digits from 0 to 9 represented on images. Also, the images will have standardized and small dimensions, for the sake of simplicity. 
Digit representation

We applied the standard dimension of 10x10 (100 pixels) for the gray-scale images, resulting in 100 gray-scale values per image:

In the preceding image we have a sketch representing the digit 3 on the left and a corresponding matrix of gray values for the same digit, in gray scale. We apply this pre-processing in order to represent all ten digits in this application.

Implementation in Java

To recognize optical characters, we produced the data to train and test the neural network ourselves. In this example, gray-scale values from 0 (black) to 255 (white) were considered. According to the pixel disposal, two versions of each digit's data were created: one to train and another to test. Classification techniques will be used here.

Generating data

Numbers from zero to nine were drawn in Microsoft Paint®. The images have been converted into matrices, of which some examples are shown in the following image; all digits from zero to nine are stored as gray-scale pixel values:

For each digit we generated five variations, where one is the perfect digit and the others contain noise, introduced either by the drawing or by the image quality.

Each matrix row was merged into vectors (Dtrain and Dtest) to form a pattern that will be used to train and test the neural network. Therefore, the input layer of the neural network will be composed of 101 neurons. The output dataset was represented by ten patterns, each of which has one expressive value (one) while the rest of the values are zero. Therefore, the output layer of the neural network will have ten neurons.

Neural architecture

So, in this application our neural network will have 100 inputs (for images that have a 10x10 pixel size) and ten outputs, with the number of hidden neurons remaining unrestricted. We created a class called DigitExample to handle this application. The neural network architecture was chosen with these parameters:

Neural network type: MLP
Training algorithm: Backpropagation
Number of hidden layers: 1
Number of neurons in the hidden layer: 18
Number of epochs: 1000
Minimum overall error: 0.001
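Before moving on to the experiments, it may help to see the data layout spelled out. The book's implementation is written in Java, so the following NumPy sketch is purely our own illustration of the idea, and the variable names and sample digit are assumptions rather than the book's code: a 10x10 gray-scale matrix is flattened row by row into the input pattern, and the expected digit is one-hot encoded across the ten output neurons.

import numpy as np

# A hypothetical 10x10 gray-scale digit, with pixel values from 0 (black) to 255 (white).
digit_matrix = np.random.randint(0, 256, size=(10, 10))

# Flatten row by row into a single input pattern and rescale to the 0-1 range.
input_vector = digit_matrix.flatten() / 255.0   # shape: (100,)

# One-hot encode the expected digit (here, 3) across the ten output neurons.
expected_digit = 3
output_vector = np.zeros(10)
output_vector[expected_digit] = 1.0             # shape: (10,)

print(input_vector.shape, output_vector)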
Experiments

Now, as has been done in other cases previously presented, let's find the best neural network topology by training several nets. The strategy to do that is summarized in the following table:

Experiment   Learning rate   Activation functions
#1           0.3             Hidden layer: SIGLOG / Output layer: LINEAR
#2           0.5             Hidden layer: SIGLOG / Output layer: LINEAR
#3           0.8             Hidden layer: SIGLOG / Output layer: LINEAR
#4           0.3             Hidden layer: HYPERTAN / Output layer: LINEAR
#5           0.5             Hidden layer: SIGLOG / Output layer: LINEAR
#6           0.8             Hidden layer: SIGLOG / Output layer: LINEAR
#7           0.3             Hidden layer: HYPERTAN / Output layer: SIGLOG
#8           0.5             Hidden layer: HYPERTAN / Output layer: SIGLOG
#9           0.8             Hidden layer: HYPERTAN / Output layer: SIGLOG

The following DigitExample class code defines how to create a neural network to read the digit data:

// enter neural net parameters via keyboard (omitted)
// load dataset from external file (omitted)
// data normalization (omitted)
// create ANN and define parameters to TRAIN:
Backpropagation backprop = new Backpropagation(nn, neuralDataSetToTrain,
        LearningAlgorithm.LearningMode.BATCH);
backprop.setLearningRate( typedLearningRate );
backprop.setMaxEpochs( typedEpochs );
backprop.setGeneralErrorMeasurement(Backpropagation.ErrorMeasurement.SimpleError);
backprop.setOverallErrorMeasurement(Backpropagation.ErrorMeasurement.MSE);
backprop.setMinOverallError(0.001);
backprop.setMomentumRate(0.7);
backprop.setTestingDataSet(neuralDataSetToTest);
backprop.printTraining = true;
backprop.showPlotError = true;
// train ANN:
try {
    backprop.forward();
    //neuralDataSetToTrain.printNeuralOutput();
    backprop.train();
    System.out.println("End of training");
    if (backprop.getMinOverallError() >= backprop.getOverallGeneralError()) {
        System.out.println("Training successful!");
    } else {
        System.out.println("Training was unsuccessful");
    }
    System.out.println("Overall Error:" + String.valueOf(backprop.getOverallGeneralError()));
    System.out.println("Min Overall Error:" + String.valueOf(backprop.getMinOverallError()));
    System.out.println("Epochs of training:" + String.valueOf(backprop.getEpoch()));
} catch (NeuralException ne) {
    ne.printStackTrace();
}
// test ANN (omitted)

Results

After running each experiment using the DigitExample class and collecting the training overall error, the testing overall error, and the number of correct classifications on the test data (shown in the following table), it is possible to observe that experiments #2 and #4 have the lowest MSE values. The differences between these two experiments are the learning rate and the activation function used in the output layer.

Experiment   Training overall error   Testing overall error   # Right number classifications
#1           9.99918E-4               0.01221                 2 by 10
#2           9.99384E-4               0.00140                 5 by 10
#3           9.85974E-4               0.00621                 4 by 10
#4           9.83387E-4               0.02491                 3 by 10
#5           9.99349E-4               0.00382                 3 by 10
#6           273.70                   319.74                  2 by 10
#7           1.32070                  6.35136                 5 by 10
#8           1.24012                  4.87290                 7 by 10
#9           1.51045                  4.35602                 3 by 10

The corresponding figure shows the MSE evolution (train and test) by epoch for experiment #2; it is interesting to notice that the curve stabilizes near the 30th epoch. The same graphic analysis was performed for experiment #8, where it is possible to check that the MSE curve stabilizes near the 200th epoch. As already explained, MSE values alone should not be used to attest to the quality of a neural net. Accordingly, the test dataset was used to verify the neural network's generalization capacity. The next table shows the comparison between the real output (test dataset with noise) and the neural net's estimated output for experiments #2 and #8.
It is possible to conclude that the neural network weights from experiment #8 can recognize seven digit patterns, better than experiment #2's:

Output comparison

Real output (test dataset):
Digit 0: 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0
Digit 1: 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0
Digit 2: 0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0
Digit 3: 0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0
Digit 4: 0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0
Digit 5: 0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0
Digit 6: 0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0
Digit 7: 0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
Digit 8: 0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
Digit 9: 1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0

Estimated output (test dataset) – Experiment #2:
Digit 0 (OK):   0.20  0.26  0.09 -0.09  0.39  0.24  0.35  0.30  0.24  1.02
Digit 1 (ERR):  0.42 -0.23  0.39  0.06  0.11  0.16  0.43  0.25  0.17 -0.26
Digit 2 (ERR):  0.51  0.84 -0.17  0.02  0.16  0.27 -0.15  0.14 -0.34 -0.12
Digit 3 (OK):  -0.20 -0.05 -0.58  0.20 -0.16  0.27  0.83 -0.56  0.42  0.35
Digit 4 (ERR):  0.24  0.05  0.72 -0.05 -0.25 -0.38 -0.33  0.66  0.05 -0.63
Digit 5 (OK):   0.08  0.41 -0.21  0.41  0.59 -0.12 -0.54  0.27  0.38  0.00
Digit 6 (OK):  -0.76 -0.35 -0.09  1.25 -0.78  0.55 -0.22  0.61  0.51  0.27
Digit 7 (ERR): -0.15  0.11  0.54 -0.53  0.55  0.17  0.09 -0.72  0.03  0.12
Digit 8 (ERR):  0.03  0.41  0.49 -0.44 -0.01  0.05 -0.05 -0.03 -0.32 -0.30
Digit 9 (OK):   0.63 -0.47 -0.15  0.17  0.38 -0.24  0.58  0.07 -0.16  0.54

Estimated output (test dataset) – Experiment #8:
Digit 0 (OK):   0.10  0.10  0.12  0.10  0.12  0.13  0.13  0.26  0.17  0.39
Digit 1 (OK):   0.13  0.10  0.11  0.10  0.11  0.10  0.29  0.23  0.32  0.10
Digit 2 (OK):   0.26  0.38  0.10  0.10  0.12  0.10  0.10  0.17  0.10  0.10
Digit 3 (ERR):  0.10  0.10  0.10  0.10  0.10  0.17  0.39  0.10  0.38  0.10
Digit 4 (OK):   0.15  0.10  0.24  0.10  0.10  0.10  0.10  0.39  0.37  0.10
Digit 5 (ERR):  0.20  0.12  0.10  0.10  0.37  0.10  0.10  0.10  0.17  0.12
Digit 6 (OK):   0.10  0.10  0.10  0.39  0.10  0.16  0.11  0.30  0.14  0.10
Digit 7 (OK):   0.10  0.11  0.39  0.10  0.10  0.15  0.10  0.10  0.17  0.10
Digit 8 (ERR):  0.10  0.25  0.34  0.10  0.10  0.10  0.10  0.10  0.10  0.10
Digit 9 (OK):   0.39  0.10  0.10  0.10  0.28  0.10  0.27  0.11  0.10  0.21

The experiments shown in this article have taken into consideration 10x10 pixel images. We recommend that you try to use 20x20 pixel datasets to build a neural net able to classify digit images of this size. You should also change the training parameters of the neural net to achieve better classifications.

To summarize, we applied neural network techniques to perform pattern recognition on a series of numbers from 0 to 9 in an image. The application here can be extended to any type of character instead of digits, under the condition that the neural network is presented with all of the predefined characters.

If you enjoyed this excerpt, check out the book Neural Network Programming with Java Second Edition to know more about leveraging the multi-platform feature of Java to build and run your personal neural networks everywhere.

2 ways to customize your deep learning models with Keras

Amey Varangaonkar
22 Dec 2017
8 min read
[box type="note" align="" class="" width=""]The following extract is taken from the book Deep Learning with Keras, co-authored by Antonio Gulli and Sujit Pal. [/box] Keras has a lot of built-in functionality for you to build all your deep learning models without much need for customization. In this article, the authors explain how your Keras models can be customized for better and more efficient deep learning. As you will recall, Keras is a high level API that delegates to either a TensorFlow or Theano backend for the computational heavy lifting. Any code you build for your customization will call out to one of these backends. In order to keep your code portable across the two backends, your custom code should use the Keras backend API (https://keras.io/backend/), which provides a set of functions that act like a facade over your chosen backend. Depending on the backend selected, the call to the backend facade will translate to the appropriate TensorFlow or Theano call. The full list of functions available and their detailed descriptions can be found on the Keras backend page. In addition to portability, using the backend API also results in more maintainable code, since Keras code is generally more high-level and compact compared to equivalent TensorFlow or Theano code. In the unlikely case that you do need to switch to using the backend directly, your Keras components can be used directly inside TensorFlow (not Theano though) code as described in this Keras blog (https://blog.keras.io/keras-as-a-simplified-interface-to-tensorflow-tutorial.html) Customizing Keras typically means writing your own custom layer or custom distance function. In this section, we will demonstrate how to build some simple Keras layers. You will see more examples of using the backend functions to build other custom Keras components, such as objectives (loss functions), in subsequent sections. Keras example — using the lambda layer Keras provides a lambda layer; it can wrap a function of your choosing. For example, if you wanted to build a layer that squares its input tensor element-wise, you can say simply: model.add(lambda(lambda x: x ** 2)) You can also wrap functions within a lambda layer. For example, if you want to build a custom layer that computes the element-wise euclidean distance between two input tensors, you would define the function to compute the value itself, as well as one that returns the output shape from this function, like so: def euclidean_distance(vecs): x, y = vecs return K.sqrt(K.sum(K.square(x - y), axis=1, keepdims=True)) def euclidean_distance_output_shape(shapes): shape1, shape2 = shapes return (shape1[0], 1) You can then call these functions using the lambda layer shown as follows: lhs_input = Input(shape=(VECTOR_SIZE,)) lhs = dense(1024, kernel_initializer="glorot_uniform", activation="relu")(lhs_input) rhs_input = Input(shape=(VECTOR_SIZE,)) rhs = dense(1024, kernel_initializer="glorot_uniform", activation="relu")(rhs_input) sim = lambda(euclidean_distance, output_shape=euclidean_distance_output_shape)([lhs, rhs]) Keras example - building a custom normalization layer While the lambda layer can be very useful, sometimes you need more control. As an example, we will look at the code for a normalization layer that implements a technique called local response normalization. 
Keras example - building a custom normalization layer

While the lambda layer can be very useful, sometimes you need more control. As an example, we will look at the code for a normalization layer that implements a technique called local response normalization. This technique normalizes the input over local input regions, but it has since fallen out of favor because it turned out not to be as effective as other regularization methods such as dropout and batch normalization, as well as better initialization methods.

Building custom layers typically involves working with the backend functions, so it involves thinking about the code in terms of tensors. As you will recall, working with tensors is a two-step process. First, you define the tensors and arrange them in a computation graph, and then you run the graph with actual data. So working at this level is harder than working in the rest of Keras. The Keras documentation has some guidelines for building custom layers (https://keras.io/layers/writing-your-own-keras-layers/), which you should definitely read.

One of the ways to make it easier to develop code with the backend API is to have a small test harness that you can run to verify that your code is doing what you want it to do. Here is a small harness I adapted from the Keras source to run your layer against some input and return a result:

import numpy as np
from keras.models import Sequential
from keras.layers.core import Dropout, Reshape

def test_layer(layer, x):
    layer_config = layer.get_config()
    layer_config["input_shape"] = x.shape
    layer = layer.__class__.from_config(layer_config)
    model = Sequential()
    model.add(layer)
    model.compile("rmsprop", "mse")
    x_ = np.expand_dims(x, axis=0)
    return model.predict(x_)[0]

And here are some tests with layer objects provided by Keras to make sure that the harness runs okay:

from keras.layers.core import Dropout, Reshape
from keras.layers.convolutional import ZeroPadding2D
import numpy as np

x = np.random.randn(10, 10)
layer = Dropout(0.5)
y = test_layer(layer, x)
assert(x.shape == y.shape)

x = np.random.randn(10, 10, 3)
layer = ZeroPadding2D(padding=(1,1))
y = test_layer(layer, x)
assert(x.shape[0] + 2 == y.shape[0])
assert(x.shape[1] + 2 == y.shape[1])

x = np.random.randn(10, 10)
layer = Reshape((5, 20))
y = test_layer(layer, x)
assert(y.shape == (5, 20))

Before we begin building our local response normalization layer, we need to take a moment to understand what it really does. This technique was originally used with Caffe, and the Caffe documentation (http://caffe.berkeleyvision.org/tutorial/layers/lrn.html) describes it as a kind of lateral inhibition that works by normalizing over local input regions. In ACROSS_CHANNEL mode, the local regions extend across nearby channels but have no spatial extent. In WITHIN_CHANNEL mode, the local regions extend spatially, but are in separate channels. We will implement the WITHIN_CHANNEL mode as follows. The formula for local response normalization in the WITHIN_CHANNEL mode is given by:
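The formula image from the original chapter is not reproduced in this excerpt. Reading the computation off the layer code that follows, it is approximately the following (our reconstruction, with n, alpha, beta, and k being the layer's hyperparameters, and avgpool denoting n x n average pooling of the squared activations with stride 1 and same padding):

y_{r,c,f} = \frac{x_{r,c,f}}{\left( k + \alpha \sum_{f'} \operatorname{avgpool}_{n \times n}\!\left( x^{2} \right)_{r,c,f'} \right)^{\beta}}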
The code for the custom layer follows the standard structure. The __init__ method is used to set the application-specific parameters, that is, the hyperparameters associated with the layer. Since our layer only does a forward computation and doesn't have any learnable weights, all we do in the build method is to set the input shape and delegate to the superclass's build method, which takes care of any necessary book-keeping. In layers where learnable weights are involved, this method is where you would set the initial values. The call method does the actual computation. Notice that we need to account for dimension ordering. Another thing to note is that the batch size is usually unknown at design time, so you need to write your operations so that the batch size is not explicitly invoked.

The computation itself is fairly straightforward and follows the formula closely. The sum in the denominator can also be thought of as average pooling over the row and column dimensions with a padding size of (n, n) and a stride of (1, 1). Because the pooled data is averaged already, we no longer need to divide the sum by n. The last part of the class is the get_output_shape_for method. Since the layer normalizes each element of the input tensor, the output size is identical to the input size:

from keras import backend as K
from keras.engine.topology import Layer, InputSpec

class LocalResponseNormalization(Layer):

    def __init__(self, n=5, alpha=0.0005, beta=0.75, k=2, **kwargs):
        self.n = n
        self.alpha = alpha
        self.beta = beta
        self.k = k
        super(LocalResponseNormalization, self).__init__(**kwargs)

    def build(self, input_shape):
        self.shape = input_shape
        super(LocalResponseNormalization, self).build(input_shape)

    def call(self, x, mask=None):
        if K.image_dim_ordering() == "th":
            _, f, r, c = self.shape
        else:
            _, r, c, f = self.shape
        squared = K.square(x)
        pooled = K.pool2d(squared, (self.n, self.n), strides=(1, 1),
                          padding="same", pool_mode="avg")
        if K.image_dim_ordering() == "th":
            summed = K.sum(pooled, axis=1, keepdims=True)
            averaged = self.alpha * K.repeat_elements(summed, f, axis=1)
        else:
            summed = K.sum(pooled, axis=3, keepdims=True)
            averaged = self.alpha * K.repeat_elements(summed, f, axis=3)
        denom = K.pow(self.k + averaged, self.beta)
        return x / denom

    def get_output_shape_for(self, input_shape):
        return input_shape

You can test this layer during development using the test harness we described here. It is easier to run this instead of trying to build a whole network to put it into, or worse, waiting till you have fully specified the layer before running it:

x = np.random.randn(225, 225, 3)
layer = LocalResponseNormalization()
y = test_layer(layer, x)
assert(x.shape == y.shape)

Now that you have a good idea of how to build a custom Keras layer, you might find it instructive to look at Keunwoo Choi's melspectrogram layer (https://keunwoochoi.wordpress.com/2016/11/18/for-beginners-writing-a-custom-keras-layer/).

Though building custom Keras layers seems to be fairly commonplace for experienced Keras developers, they may not be widely useful in a general context. Custom layers are usually built to serve a specific, narrow purpose, depending on the use case in question, and Keras gives you enough flexibility to do so with ease.

If you found our post useful, make sure to check out our best-selling title Deep Learning with Keras, for other intriguing deep learning concepts and their implementation using Keras.

Introduction to Data Analysis and Libraries

Packt
07 Oct 2015
13 min read
In this article by Martin Czygan and Phuong Vothihong, the authors of the book Getting Started with Python Data Analysis, we look at how raw data becomes knowledge. Data is raw information that can exist in any form, usable or not. We can easily get data everywhere in our lives; for example, the price of gold today is $1.158 per ounce. On its own it does not have any meaning, except describing the gold price. This also shows that data is useful based on context.

When data is related and connected, information appears and allows us to expand our knowledge beyond the range of our senses. When we possess gold price data gathered over time, one piece of information we might have is that the price has continuously risen from $1.152 to $1.158 for three days. It is used by someone who tracks gold prices.

Knowledge helps people to create value in their lives and work. It is based on information that is organized, synthesized, or summarized to enhance comprehension, awareness, or understanding. It represents a state or potential for action and decisions. If the gold price continuously increases for three days, it will likely decrease on the next day; this is useful knowledge. The following figure illustrates the steps from data to knowledge; we call this process the data analysis process and we will introduce it in the next section:

In this article, we will cover the following topics:

Data analysis and process
Overview of libraries in data analysis using different programming languages
Common Python data analysis libraries

Data analysis and process

Data is getting bigger and more diversified every day. Therefore, analyzing and processing data to advance human knowledge or to create value are big challenges. To tackle these challenges, you will need domain knowledge and a variety of skills, drawing from areas such as computer science, artificial intelligence (AI) and machine learning (ML), statistics and mathematics, and knowledge domain, as shown in the following figure:

Let us go through data analysis and its domain knowledge:

Computer science: We need this knowledge to provide abstractions for efficient data processing. Basic Python programming experience is required. We will introduce Python libraries used in data analysis.
Artificial intelligence and machine learning: If computer science knowledge helps us to program data analysis tools, artificial intelligence and machine learning help us to model the data and learn from it in order to build smart products.
Statistics and mathematics: We cannot extract useful information from raw data if we do not use statistical techniques or mathematical functions.
Knowledge domain: Besides technology and general techniques, it is important to have an insight into the specific domain. What do the data fields mean? What data do we need to collect? Based on the expertise, we explore and analyze raw data by applying the above techniques, step by step.

Data analysis is a process composed of the following steps:

Data requirements: We have to define what kind of data will be collected based on the requirements or problem analysis. For example, if we want to detect a user's behavior while reading news on the internet, we should be aware of visited article links, date and time, article categories, and the user's time spent on different pages.
Data collection: Data can be collected from a variety of sources: mobile, personal computer, camera, or recording device.
It may also be obtained through different channels: communication, events, and interactions between person and person, person and device, or device and device. Data appears whenever and wherever in the world. The problem is, how can we find and gather it to solve our problem? This is the mission of this step.
Data processing: Data that is initially obtained must be processed or organized for analysis. This process is performance-sensitive: How fast can we create, insert, update, or query data? When building a real product that has to process big data, we should consider this step carefully. What kind of database should we use to store data? What kind of data structure, such as analysis, statistics, or visualization, is suitable for our purposes?
Data cleaning: After being processed and organized, the data may still contain duplicates or errors. Therefore, we need a cleaning step to reduce those situations and increase the quality of the results in the following steps. Common tasks include record matching, deduplication, or column segmentation. Depending on the type of data, we can apply several types of data cleaning. For example, a user's history on a visited news website might contain a lot of duplicate rows, because the user might have refreshed certain pages many times. For our specific issue, these rows might not carry any meaning when we explore the user's behavior, so we should remove them before saving the data to our database. Another situation we may encounter is click fraud on news, where someone just wants to improve their website ranking or sabotage the website. In this case, the data will not help us to explore a user's behavior. We can use thresholds to check whether a visit-page event comes from a real person or from malicious software.
Exploratory data analysis: Now, we can start to analyze data through a variety of techniques referred to as exploratory data analysis. We may detect additional problems in data cleaning or discover requests for further data. Therefore, these steps may be iterative and repeated throughout the whole data analysis process. Data visualization techniques are also used to examine the data in graphs or charts. Visualization often facilitates the understanding of data sets, especially if they are large or high-dimensional.
Modelling and algorithms: A lot of mathematical formulas and algorithms may be applied to detect or predict useful knowledge from the raw data. For example, we can use similarity measures to cluster users who have exhibited similar news-reading behavior and recommend articles of interest to them next time. Alternatively, we can detect users' gender based on their news-reading behavior by applying classification models such as Support Vector Machines (SVM) or linear regression. Depending on the problem, we may use different algorithms to get an acceptable result. It can take a lot of time to evaluate the accuracy of the algorithms and to choose the best one to implement for a certain product.
Data product: The goal of this step is to build data products that receive data input and generate output according to the problem requirements. We will apply computer science knowledge to implement our selected algorithms as well as manage the data storage.

Overview of libraries in data analysis

There are numerous data analysis libraries that help us to process and analyze data. They use different programming languages, and have different advantages as well as disadvantages when solving various data analysis problems.
Now, we introduce some common libraries that may be useful for you. They should give you an overview of the libraries in the field. However, the rest of this article focuses on Python-based libraries.

Some of the libraries that use the Java language for data analysis are as follows:

Weka: This is the library that I got familiar with the first time I learned about data analysis. It has a graphical user interface that allows you to run experiments on a small dataset. This is great if you want to get a feel for what is possible in the data processing space. However, if you build a complex product, I think it is not the best choice because of its performance: sketchy API design, non-optimal algorithms, and little documentation (http://www.cs.waikato.ac.nz/ml/weka/).
Mallet: This is another Java library that is used for statistical natural language processing, document classification, clustering, topic modelling, information extraction, and other machine learning applications on text. There is an add-on package to Mallet, called GRMM, that contains support for inference in general graphical models, and training of Conditional Random Fields (CRF) with arbitrary graphical structure. In my experience, the library's performance as well as its algorithms are better than Weka's. However, its only focus is on text processing problems. The reference page is at http://mallet.cs.umass.edu/.
Mahout: This is Apache's machine learning framework built on top of Hadoop; its goal is to build a scalable machine learning library. It looks promising, but comes with all the baggage and overhead of Hadoop. The homepage is at http://mahout.apache.org/.
Spark: This is a relatively new Apache project, supposedly up to a hundred times faster than Hadoop. It is also a scalable library that consists of common machine learning algorithms and utilities. Development can be done in Python as well as in any JVM language. The reference page is at https://spark.apache.org/docs/1.5.0/mllib-guide.html.

Here are a few libraries that are implemented in C++:

Vowpal Wabbit: This library is a fast out-of-core learning system sponsored by Microsoft Research and (previously) Yahoo! Research. It has been used to learn a tera-feature (10^12) dataset on 1,000 nodes in one hour. More information can be found in the publication at http://arxiv.org/abs/1110.4198.
MultiBoost: This package is a multiclass, multi-label, and multitask classification boosting software implemented in C++. If you use this software, you should refer to the paper published in 2012 in the Journal of Machine Learning Research, MultiBoost: A Multi-purpose Boosting Package, D. Benbouzid, R. Busa-Fekete, N. Casagrande, F.-D. Collin, and B. Kégl.
MLpack: This is also a C++ machine learning library, developed by the Fundamental Algorithmic and Statistical Tools Laboratory (FASTLab) at Georgia Tech. It focuses on scalability, speed, and ease of use, and was presented at the BigLearning workshop of NIPS 2011. Its homepage is at http://www.mlpack.org/about.html.
Caffe: The last C++ library we want to mention is Caffe. This is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and community contributors. You can find more information about it at http://caffe.berkeleyvision.org/.

Other libraries for data processing and analysis are as follows:

Statsmodels: This is a great Python library for statistical modelling and is mainly used for predictive and exploratory analysis.
Modular toolkit for data processing (MDP): This is a collection of supervised and unsupervised learning algorithms and other data processing units that can be combined into data processing sequences and more complex feed-forward network architectures (http://mdp-toolkit.sourceforge.net/index.html).
Orange: This is an open source data visualization and analysis tool for novices and experts. It is packed with features for data analysis and has add-ons for bioinformatics and text mining. It contains an implementation of self-organizing maps, which sets it apart from the other projects as well (http://orange.biolab.si/).
Mirador: This is a tool for the visual exploration of complex datasets, supporting Mac and Windows. It enables users to discover correlation patterns and derive new hypotheses from data (http://orange.biolab.si/).
RapidMiner: This is another GUI-based tool for data mining, machine learning, and predictive analysis (https://rapidminer.com/).
Theano: This bridges the gap between Python and lower-level languages. Theano gives very significant performance gains, particularly for large matrix operations, and is therefore a good choice for deep learning models. However, it is not easy to debug because of the additional compilation layer.
Natural Language Toolkit (NLTK): This is written in Python with very unique and salient features.

Here, I could not list all libraries for data analysis. However, I think the above libraries are enough to take a lot of your time to learn and build data analysis applications.

Python libraries in data analysis

Python is a multi-platform, general-purpose programming language that can run on Windows, Linux/Unix, and Mac OS X, and has been ported to the Java and .NET virtual machines as well. It has a powerful standard library. In addition, it has many libraries for data analysis: Pylearn2, Hebel, Pybrain, Pattern, MontePython, and MILK. We will cover some common Python data analysis libraries such as NumPy, Pandas, Matplotlib, PyMongo, and scikit-learn. Now, to help you get started, I will briefly present an overview of each library for those who are less familiar with the scientific Python stack.

NumPy

One of the fundamental packages used for scientific computing with Python is NumPy. Among other things, it contains the following:

A powerful N-dimensional array object
Sophisticated (broadcasting) functions for performing array computations
Tools for integrating C/C++ and Fortran code
Useful linear algebra operations, Fourier transforms, and random number capabilities

Besides this, it can also be used as an efficient multidimensional container of generic data. Arbitrary data types can be defined and integrated with a wide variety of databases.
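As a minimal illustration of the points above (the N-dimensional array object, broadcasting, and the bundled linear algebra and random number helpers), here is a short sketch of our own, not taken from the book:

import numpy as np

# An N-dimensional array object.
a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# Broadcasting: the scalar and the 1-D row are stretched across the 2-D array.
b = a * 10 + np.array([0.1, 0.2, 0.3])

# A few of the bundled linear algebra and random number capabilities.
print(b.mean(axis=0))            # column means
print(np.linalg.norm(a))         # Frobenius norm
print(np.random.normal(size=3))  # random samples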
Pandas

Pandas is a Python package that supports rich data structures and functions for analyzing data, and is developed by the PyData Development Team. It is focused on the improvement of Python's data libraries. Pandas consists of the following things:

A set of labelled array data structures, the primary ones being Series, DataFrame, and Panel
Index objects enabling both simple axis indexing and multilevel/hierarchical axis indexing
An integrated group-by engine for aggregating and transforming datasets
Date range generation and custom date offsets
Input/output tools that load and save data from flat files or the PyTables/HDF5 format
Memory-efficient versions of the standard data structures
Moving window statistics and static and moving window linear/panel regression

Because of these features, Pandas is an ideal tool for systems that need complex data structures or high-performance time series functions, such as financial data analysis applications.

Matplotlib

Matplotlib is the single most used Python package for 2D graphics. It provides both a very quick way to visualize data from Python and publication-quality figures in many formats: line plots, contour plots, scatter plots, or Basemap plots. It comes with a set of default settings, but allows customizing all kinds of properties. However, we can easily create our chart with the defaults of almost every property in Matplotlib.

PyMongo

MongoDB is a type of NoSQL database. It is highly scalable, robust, and perfect to work with JavaScript-based web applications, because we can store data as JSON documents and use flexible schemas. PyMongo is a Python distribution containing tools for working with MongoDB. Many tools have also been written for working with PyMongo to add more features, such as MongoKit, Humongolus, MongoAlchemy, and Ming.

scikit-learn

scikit-learn is an open source machine learning library using the Python programming language. It supports various machine learning models, such as classification, regression, and clustering algorithms, interoperating with the Python numerical and scientific libraries NumPy and SciPy. The latest scikit-learn version is 0.16.1, published in April 2015.

Summary

In this article, we presented three main points. Firstly, we figured out the relationship between raw data, information, and knowledge. Secondly, we discussed an overview of the data analysis process and its steps. Finally, we introduced a few common libraries that are useful for practical data analysis applications; among those, we will focus on Python libraries for data analysis.

Resources for Article:

Further resources on this subject:
Exploiting Services with Python [Article]
Basics of Jupyter Notebook and Python [Article]
How to do Machine Learning with Python [Article]

How to build a cartpole game using OpenAI Gym

Savia Lobo
10 Mar 2018
11 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from the book Mastering TensorFlow 1.x written by Armando Fandango. In this book, you will learn advanced features of TensorFlow1.x, such as distributed TensorFlow with TF Clusters, deploy production models with TensorFlow Serving, and more. [/box] Today, we will help you understand OpenAI Gym and how to apply the basics of OpenAI Gym onto a cartpole game. OpenAI Gym 101 OpenAI Gym is a Python-based toolkit for the research and development of reinforcement learning algorithms. OpenAI Gym provides more than 700 opensource contributed environments at the time of writing. With OpenAI, you can also create your own environment. The biggest advantage is that OpenAI provides a unified interface for working with these environments, and takes care of running the simulation while you focus on the reinforcement learning algorithms. Note : The research paper describing OpenAI Gym is available here: http://arxiv.org/abs/1606.01540 You can install OpenAI Gym using the following command: pip3  install  gym Note: If the above command does not work, then you can find further help with installation at the following link: https://github.com/openai/ gym#installation  Let us print the number of available environments in OpenAI Gym: all_env  =  list(gym.envs.registry.all()) print('Total  Environments  in  Gym  version  {}  :  {}' .format(gym.     version     ,len(all_env))) Total  Environments  in  Gym  version  0.9.4  :  777  Let us print the list of all environments: for  e  in  list(all_env): print(e) The partial list from the output is as follows: EnvSpec(Carnival-ramNoFrameskip-v0) EnvSpec(EnduroDeterministic-v0) EnvSpec(FrostbiteNoFrameskip-v4) EnvSpec(Taxi-v2) EnvSpec(Pooyan-ram-v0) EnvSpec(Solaris-ram-v4) EnvSpec(Breakout-ramDeterministic-v0) EnvSpec(Kangaroo-ram-v4) EnvSpec(StarGunner-ram-v4) EnvSpec(Enduro-ramNoFrameskip-v4) EnvSpec(DemonAttack-ramDeterministic-v0) EnvSpec(TimePilot-ramNoFrameskip-v0) EnvSpec(Amidar-v4) Each environment, represented by the env object, has a standardized interface, for example: An env object can be created with the env.make(<game-id-string>) function by passing the id string. Each env object contains the following main functions: The step() function takes an action object as an argument and returns four objects: observation: An object implemented by the environment, representing the observation of the environment. reward: A signed float value indicating the gain (or loss) from the previous action. done: A Boolean value representing if the scenario is finished. info: A Python dictionary object representing the diagnostic information. The render() function creates a visual representation of the environment. The reset() function resets the environment to the original state. Each env object comes with well-defined actions and observations, represented by action_space and observation_space. One of the most popular games in the gym to learn reinforcement learning is CartPole. In this game, a pole attached to a cart has to be balanced so that it doesn't fall. The game ends if either the pole tilts by more than 15 degrees or the cart moves by more than 2.4 units from the center. The home page of OpenAI.com emphasizes the game in these words: The small size and simplicity of this environment make it possible to run very quick experiments, which is essential when learning the basics. The game has only four observations and two actions. The actions are to move a cart by applying a force of +1 or -1. 
The observations are the position of the cart, the velocity of the cart, the angle of the pole, and the rotation rate of the pole. However, knowledge of the semantics of the observations is not necessary to learn to maximize the rewards of the game.

Now let us load a popular game environment, CartPole-v0, and play it with stochastic control:

Create the env object with the standard make function:

env = gym.make('CartPole-v0')

The number of episodes is the number of game plays. We shall set it to one for now, indicating that we just want to play the game once. Since every episode is stochastic, in actual production runs you will run over several episodes and calculate the average values of the rewards. Additionally, we can initialize an array to store the visualization of the environment at every timestep:

n_episodes = 1
env_vis = []

Run two nested loops: an external loop for the number of episodes and an internal loop for the number of timesteps you would like to simulate. You can either keep running the internal loop until the scenario is done or set the number of steps to a higher value. At the beginning of every episode, reset the environment using env.reset(). At the beginning of every timestep, capture the visualization using env.render().

for i_episode in range(n_episodes):
    observation = env.reset()
    for t in range(100):
        env_vis.append(env.render(mode='rgb_array'))
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished at t{}".format(t+1))
            break

Render the environment using the helper function:

env_render(env_vis)

The code for the helper function is as follows:

def env_render(env_vis):
    plt.figure()
    plot = plt.imshow(env_vis[0])
    plt.axis('off')
    def animate(i):
        plot.set_data(env_vis[i])
    anim = anm.FuncAnimation(plt.gcf(),
                             animate,
                             frames=len(env_vis),
                             interval=20,
                             repeat=True,
                             repeat_delay=20)
    display(display_animation(anim, default_mode='loop'))

We get the following output when we run this example:

[-0.00666995 -0.03699492 -0.00972623  0.00287713]
[-0.00740985  0.15826516 -0.00966868 -0.29285861]
[-0.00424454 -0.03671761 -0.01552586 -0.00324067]
[-0.0049789  -0.2316135  -0.01559067  0.28450351]
[-0.00961117 -0.42650966 -0.0099006   0.57222875]
[-0.01814136 -0.23125029  0.00154398  0.27644332]
[-0.02276636 -0.0361504   0.00707284 -0.01575223]
[-0.02348937  0.1588694   0.0067578  -0.30619523]
[-0.02031198 -0.03634819  0.00063389 -0.01138875]
[-0.02103895  0.15876466  0.00040612 -0.3038716 ]
[-0.01786366  0.35388083 -0.00567131 -0.59642642]
[-0.01078604  0.54908168 -0.01759984 -0.89089036]
[ 1.95594914e-04  7.44437934e-01 -3.54176495e-02 -1.18905344e+00]
[ 0.01508435  0.54979251 -0.05919872 -0.90767902]
[ 0.0260802   0.35551978 -0.0773523  -0.63417465]
[ 0.0331906   0.55163065 -0.09003579 -0.95018025]
[ 0.04422321  0.74784161 -0.1090394  -1.26973934]
[ 0.05918004  0.55426764 -0.13443418 -1.01309691]
[ 0.0702654   0.36117014 -0.15469612 -0.76546874]
[ 0.0774888   0.16847818 -0.1700055  -0.52518186]
[ 0.08085836  0.3655333  -0.18050913 -0.86624457]
[ 0.08816903  0.56259197 -0.19783403 -1.20981195]
Episode finished at t22

It took 22 time-steps for the pole to become unbalanced. At every run, we get a different time-step value because we picked the action stochastically by using env.action_space.sample().

Since the game results in a loss so quickly, randomly picking an action and applying it is probably not the best strategy. There are many algorithms for finding solutions to keeping the pole straight for a longer number of time-steps that you can use, such as Hill Climbing, Random Search, and Policy Gradient.

Note: Some of the algorithms for solving the Cartpole game are available at the following links:
https://openai.com/requests-for-research/#cartpole
http://kvfrans.com/simple-algoritms-for-solving-cartpole/
https://github.com/kvfrans/openai-cartpole
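To give a taste of the first of those approaches, here is a minimal hill-climbing sketch; it is not part of the book's code, and the noise scale and episode counts are arbitrary choices for illustration. The idea is to start from a random weight vector, perturb it a little every episode, and keep the perturbed weights only when they improve the episode reward. It uses the same gym API as the code above:

import numpy as np
import gym

def run_episode(env, theta, max_steps=200):
    obs = env.reset()
    total_reward = 0
    for _ in range(max_steps):
        # A linear policy: weights times observations decide left (0) or right (1).
        action = 0 if np.matmul(theta, obs) < 0 else 1
        obs, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward

env = gym.make('CartPole-v0')
best_theta = np.random.rand(4) * 2 - 1
best_reward = run_episode(env, best_theta)
noise_scale = 0.1

for i in range(100):
    candidate = best_theta + noise_scale * (np.random.rand(4) * 2 - 1)
    reward = run_episode(env, candidate)
    if reward > best_reward:
        best_reward, best_theta = reward, candidate

print('best reward:', best_reward)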
Applying simple policies to a cartpole game

So far, we have randomly picked an action and applied it. Now let us apply some logic to picking the action instead of random chance. The third observation refers to the angle. If the angle is greater than zero, that means the pole is tilting right, thus we move the cart to the right (1). Otherwise, we move the cart to the left (0). Let us look at an example:

We define two policy functions as follows:

def policy_logic(env,obs):
    return 1 if obs[2] > 0 else 0

def policy_random(env,obs):
    return env.action_space.sample()

Next, we define an experiment function that will run for a specific number of episodes; each episode runs until the game is lost, namely when done is True. We use rewards_max to indicate when to break out of the loop, as we do not wish to run the experiment forever:

def experiment(policy, n_episodes, rewards_max):
    rewards = np.empty(shape=(n_episodes))
    env = gym.make('CartPole-v0')
    for i in range(n_episodes):
        obs = env.reset()
        done = False
        episode_reward = 0
        while not done:
            action = policy(env, obs)
            obs, reward, done, info = env.step(action)
            episode_reward += reward
            if episode_reward > rewards_max:
                break
        rewards[i] = episode_reward
    print('Policy:{}, Min reward:{}, Max reward:{}'
          .format(policy.__name__, min(rewards), max(rewards)))

We run the experiment 100 times, or until the rewards are less than or equal to rewards_max, which is set to 10,000:

n_episodes = 100
rewards_max = 10000
experiment(policy_random, n_episodes, rewards_max)
experiment(policy_logic, n_episodes, rewards_max)

We can see that the logically selected actions do better than the randomly selected ones, but not that much better:

Policy:policy_random, Min reward:9.0, Max reward:63.0, Average reward:20.26
Policy:policy_logic, Min reward:24.0, Max reward:66.0, Average reward:42.81

Now let us modify the process of selecting the action further, to be based on parameters. The parameters will be multiplied by the observations and the action will be chosen based on whether the multiplication result is zero or one. Let us modify the random search method in which we initialize the parameters randomly. The code looks as follows:

def policy_logic(theta,obs):
    # just ignore theta
    return 1 if obs[2] > 0 else 0

def policy_random(theta,obs):
    return 0 if np.matmul(theta,obs) < 0 else 1

def episode(env, policy, rewards_max):
    obs = env.reset()
    done = False
    episode_reward = 0
    if policy.__name__ in ['policy_random']:
        theta = np.random.rand(4) * 2 - 1
    else:
        theta = None
    while not done:
        action = policy(theta, obs)
        obs, reward, done, info = env.step(action)
        episode_reward += reward
        if episode_reward > rewards_max:
            break
    return episode_reward

def experiment(policy, n_episodes, rewards_max):
    rewards = np.empty(shape=(n_episodes))
    env = gym.make('CartPole-v0')
    for i in range(n_episodes):
        rewards[i] = episode(env, policy, rewards_max)
        #print("Episode finished at t{}".format(reward))
    print('Policy:{}, Min reward:{}, Max reward:{}, Average reward:{}'
          .format(policy.__name__,
                  np.min(rewards),
                  np.max(rewards),
                  np.mean(rewards)))

n_episodes = 100
rewards_max = 10000
experiment(policy_random, n_episodes, rewards_max)
experiment(policy_logic, n_episodes, rewards_max)

We can see that random search does improve the results:

Policy:policy_random, Min reward:8.0, Max reward:200.0, Average reward:40.04
Policy:policy_logic, Min reward:25.0, Max reward:62.0, Average reward:43.03

With the random search, we have improved our results to get the max reward of 200. On average, the rewards for random search are lower because random search tries various bad parameters that bring the overall results down. However, we can select the best parameters from all the runs and then, in production, use the best parameters. Let us modify the code to train the parameters first:

def policy_logic(theta,obs):
    # just ignore theta
    return 1 if obs[2] > 0 else 0

def policy_random(theta,obs):
    return 0 if np.matmul(theta,obs) < 0 else 1

def episode(env, policy, rewards_max, theta):
    obs = env.reset()
    done = False
    episode_reward = 0
    while not done:
        action = policy(theta, obs)
        obs, reward, done, info = env.step(action)
        episode_reward += reward
        if episode_reward > rewards_max:
            break
    return episode_reward

def train(policy, n_episodes, rewards_max):
    env = gym.make('CartPole-v0')
    theta_best = np.empty(shape=[4])
    reward_best = 0
    for i in range(n_episodes):
        if policy.__name__ in ['policy_random']:
            theta = np.random.rand(4) * 2 - 1
        else:
            theta = None
        reward_episode = episode(env, policy, rewards_max, theta)
        if reward_episode > reward_best:
            reward_best = reward_episode
            theta_best = theta.copy()
    return reward_best, theta_best

def experiment(policy, n_episodes, rewards_max, theta=None):
    rewards = np.empty(shape=[n_episodes])
    env = gym.make('CartPole-v0')
    for i in range(n_episodes):
        rewards[i] = episode(env, policy, rewards_max, theta)
        #print("Episode finished at t{}".format(reward))
    print('Policy:{}, Min reward:{}, Max reward:{}, Average reward:{}'
          .format(policy.__name__,
                  np.min(rewards),
                  np.max(rewards),
                  np.mean(rewards)))

n_episodes = 100
rewards_max = 10000
reward, theta = train(policy_random, n_episodes, rewards_max)
print('trained theta: {}, rewards: {}'.format(theta, reward))
experiment(policy_random, n_episodes, rewards_max, theta)
experiment(policy_logic, n_episodes, rewards_max)

We train for 100 episodes and then use the best parameters to run the experiment for the random search policy:

n_episodes = 100
rewards_max = 10000
reward, theta = train(policy_random, n_episodes, rewards_max)
print('trained theta: {}, rewards: {}'.format(theta, reward))
experiment(policy_random, n_episodes, rewards_max, theta)
experiment(policy_logic, n_episodes, rewards_max)

We find that training the parameters gives us the best result of 200:

trained theta: [-0.14779543  0.93269603  0.70896423  0.84632461], rewards: 200.0
Policy:policy_random, Min reward:200.0, Max reward:200.0, Average reward:200.0
Policy:policy_logic, Min reward:24.0, Max reward:63.0, Average reward:41.94

We may optimize the training code to continue training until we reach a maximum reward.

To summarize, we learnt the basics of OpenAI Gym and also applied them to a cartpole game for relevant output.

If you found this post useful, do check out this book Mastering TensorFlow 1.x to build, scale, and deploy deep neural network models using star libraries in Python.

Get SQL Server user management right

Vijin Boricha
10 Apr 2018
8 min read
The question who are you sounds pretty simple, right? Well, possibly not where philosophy is concerned, and neither is it where databases are concerned. But user management is essential for anyone managing databases. In this tutorial, learn how SQL Server user management works, and how to configure it in the right way.

SQL Server user management: the authentication process

During the setup procedure, you have to select a password, which actually uses the SQL Server authentication process. This database engine comes from Windows and it is tightly connected with Active Directory and internal Windows authentication. In this phase of development, SQL Server on Linux only supports SQL authentication.

SQL Server has a very secure entry point. This means no access without the correct credentials. Every information system has some way of checking a user's identity, but SQL Server has three different ways of verifying identity, and the ability to select the most appropriate method, based on individual or business needs.

When using SQL Server authentication, logins are created on SQL Server. Both the user name and the password are created by using SQL Server and stored in SQL Server. Users connecting through SQL Server authentication must provide their credentials every time that they connect (the user name and password are transmitted through the network).

Note: When using SQL Server authentication, it is highly recommended to set strong passwords for all SQL Server accounts.

As you'll have noticed, so far you have not had any problems accessing SQL Server resources. The reason for this is very simple: you are working under the sa login. This login has unlimited SQL Server access. In some real-life scenarios, sa is not something to play with. It is good practice to create a login under a different name with the same level of access. Now let's see how to create a new SQL Server login. But, first, we'll check the list of current SQL Server logins. To do this, access the sys.sql_logins system catalog view and three attributes: name, is_policy_checked, and is_expiration_checked. The attribute name is clear; the second one shows whether the login enforces the password policy; and the third one is for enforcing account expiration. Both attributes have a Boolean type of value: TRUE or FALSE (1 or 0).

1. Type the following command to list all SQL logins:

1> SELECT name, is_policy_checked, is_expiration_checked
2> FROM sys.sql_logins
3> WHERE name = 'sa'
4> GO
name           is_policy_checked  is_expiration_checked
-------------- -----------------  ---------------------
sa             1                  0
(1 rows affected)

2. If you want to see what your password for the sa login looks like, just type this version of the same statement. The result is the output of the hash function:

1> SELECT password_hash
2> FROM sys.sql_logins
3> WHERE name = 'sa'
4> GO
password_hash
-------------------------------------------------------------
0x0200110F90F4F4057F1DF84B2CCB42861AE469B2D43E27B3541628
B72F72588D36B8E0DDF879B5C0A87FD2CA6ABCB7284CDD0871
B07C58D0884DFAB11831AB896B9EEE8E7896
(1 rows affected)

3. Now let's create the login dba, which will require a strong password and will not expire:

1> USE master
2> GO
Changed database context to 'master'.
1> CREATE LOGIN dba
2> WITH PASSWORD ='S0m3c00lPa$$',
3> CHECK_EXPIRATION = OFF,
4> CHECK_POLICY = ON
5> GO

4. Re-check the dba login on the login list:

1> SELECT name, is_policy_checked, is_expiration_checked
2> FROM sys.sql_logins
3> WHERE name = 'dba'
4> GO
name              is_policy_checked  is_expiration_checked
----------------- -----------------  ---------------------
dba               1                  0
(1 rows affected)

Notice that the dba login does not have any kind of privilege. Let's check that part. First close your current sqlcmd session by typing exit. Now, connect again but, instead of using sa, you will connect with the dba login. After the connection has been successfully created, try to change the content of the active database to AdventureWorks. This process, based on the login name, should look like this:

# dba@tumbleweed:~> sqlcmd -S suse -U dba
Password:
1> USE AdventureWorks
2> GO
Msg 916, Level 14, State 1, Server tumbleweed, Line 1
The server principal "dba" is not able to access the database "AdventureWorks" under the current security context

As you can see, the authentication process will not grant you anything. Simply put, you can enter the building but you can't open any door. You will need to pass the process of authorization first.

Authorization process

After authenticating a user, SQL Server will then determine whether the user has permission to view and/or update data, view metadata, or perform administrative tasks (server-side level, database-side level, or both). If the user, or a group of which the user is a member, has some type of permission within the instance and/or specific databases, SQL Server will let the user connect. In a nutshell, authorization is the process of checking a user's access rights to specific securables.

In this phase, SQL Server will check the login policy to determine whether there are any access rights at the server and/or database level. A login can have successful authentication, but no access to the securables. This means that authentication is just one step before the login can proceed with any action on SQL Server. SQL Server will check the authorization process on every T-SQL statement. In other words, if a user has SELECT permissions on some database, SQL Server will not check once and then forget until the next authentication/authorization process. Every statement will be verified against the policy to determine whether there are any changes.

Permissions are the set of rules that govern the level of access that principals have to securables. Permissions in an SQL Server system can be granted, revoked, or denied. Each of the SQL Server securables has associated permissions that can be granted to each principal. The only way a principal can access a resource in an SQL Server system is if it is granted permission to do so.

At this point, it is important to note that authentication and authorization are two different processes, but they work in conjunction with one another. Furthermore, the terms login and user are to be used very carefully, as they are not the same:

Login is the authentication part
User is the authorization part

Prior to accessing any database on SQL Server, the login needs to be mapped as a user. Each login can have one or many user instances in different databases. For example, one login can have read permission in AdventureWorks and write permission in WideWorldImporters. This type of granular security is a great SQL Server security feature. A login name can be the same or different from a user name in different databases.
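The excerpt uses sqlcmd throughout, but the same SQL authentication credentials are what an application would supply on every connection. As an aside, and assuming the Microsoft ODBC driver and the pyodbc package are installed (neither is covered in this excerpt, and the driver name below is an assumption you may need to adjust), a connection from Python would look roughly like this:

import pyodbc

# SQL Server authentication: the login name and password travel with every new connection.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"  # assumed driver name; match your installation
    "SERVER=suse;"                             # same host used with sqlcmd -S above
    "DATABASE=AdventureWorks;"
    "UID=dba;PWD=S0m3c00lPa$$"
)

cursor = conn.cursor()
# This query only succeeds once the dba login is mapped to a database user with SELECT rights.
for row in cursor.execute("SELECT TOP 5 FirstName, LastName FROM Person.Person"):
    print(row.FirstName, row.LastName)
conn.close()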
In the following lines, we will create a database user dba based on the login dba. The process will be based on the AdventureWorks database. After that, we will try to enter the database and execute a SELECT statement on the Person.Person table:

dba@tumbleweed:~> sqlcmd -S suse -U sa
Password:
1> USE AdventureWorks
2> GO
Changed database context to 'AdventureWorks'.
1> CREATE USER dba
2> FOR LOGIN dba
3> GO
1> exit
dba@tumbleweed:~> sqlcmd -S suse -U dba
Password:
1> USE AdventureWorks
2> GO
Changed database context to 'AdventureWorks'.
1> SELECT *
2> FROM Person.Person
3> GO
Msg 229, Level 14, State 5, Server tumbleweed, Line 1
The SELECT permission was denied on the object 'Person', database 'AdventureWorks', schema 'Person'

We are making progress. Now we can enter the database, but we still can't execute SELECT or any other SQL statement. The reason is very simple: our dba user is still not authorized to access any type of resource.

Schema separation

In Microsoft SQL Server, a schema is a collection of database objects that are owned by a single principal and form a single namespace. All objects within a schema must be uniquely named, and a schema itself must be uniquely named in the database catalog. SQL Server (since version 2005) breaks the link between users and schemas. In other words, users do not own objects; schemas own objects, and principals own schemas.

Users can now have a default schema assigned using the DEFAULT_SCHEMA option of the CREATE USER and ALTER USER commands. If a default schema is not supplied for a user, then dbo will be used as the default schema. If a user needs to access objects in a schema other than their default schema, then the user will need to type the full name. For example, Denis needs to query the Contact table in the Person schema, but his default schema is Sales. To resolve this, he would type:

SELECT * FROM Person.Contact

Keep in mind that the default schema is dbo. When database objects are created and not explicitly put in schemas, SQL Server will assign them to the dbo default database schema. Therefore, there is no need to type dbo, because it is the default schema.

You read a book excerpt from SQL Server on Linux written by Jasmin Azemović. From this book, you will be able to recognize and utilize the full potential of setting up an efficient SQL Server database solution in the Linux environment.

Check out other posts on SQL Server:
How SQL Server handles data under the hood
How to implement In-Memory OLTP on SQL Server in Linux
How to integrate SharePoint with SQL Server Reporting Services