How-To Tutorials


27 May 2019

6 min read

‘Facial Recognition technology is faulty, racist, biased, abusive to civil rights; act now to restrict misuse’ say experts to House Oversight and Reform Committee


Aaron Lazar

20 Feb 2018

12 min read

How to develop a stock price predictive model using Reinforcement Learning and TensorFlow


This article is an extract from the book Predictive Analytics with TensorFlow, authored by Md. Rezaul Karim. This book helps you build, tune, and deploy predictive models with TensorFlow.

In this article we'll show you how to create a predictive model to predict stock prices, using TensorFlow and Reinforcement Learning.

An emerging area for applying Reinforcement Learning is stock market trading, where a trader acts like a reinforcement learning agent: buying and selling a particular stock (the action) changes the state of the trader by generating a profit or loss (the reward). The following figure shows some of the most active stocks on July 15, 2017, as an example:

Now, we want to develop an intelligent agent that will predict stock prices such that a trader will buy at a low price and sell at a high price. However, this type of prediction is not so easy and depends on several parameters, such as the current number of stocks, recent historical prices, and, most importantly, the budget available for buying and selling.

The states in this situation are vectors containing information about the current budget, the current number of stocks, and a recent history of stock prices (the last 200 stock prices). So each state is a 202-dimensional vector. For simplicity, there are only three actions a stock market agent can perform: buy, sell, and hold.

So, we have the state and the action. What else do we need? A policy, right? Yes, we should have a good policy, so that an action is chosen in each state based on it.
A simple policy can consist of the following rules:

Buying a stock (the action) at the current stock price (the state) decreases the budget while incrementing the current stock count.
Selling a stock trades it in for money at the current share price.
Holding does neither; performing this action simply waits for a particular time period and yields no reward.

To find the stock prices, we can use the yahoo_finance library in Python. To install yahoo_finance, issue the following command:

    $ sudo pip3 install yahoo_finance

A common error you might encounter is "HTTPError: HTTP Error 400: Bad Request"; if it appears, retry the request. Now, let's try to get familiar with this module:

    >>> from yahoo_finance import Share
    >>> msoft = Share('MSFT')
    >>> print(msoft.get_open())
    72.24
    >>> print(msoft.get_price())
    72.78
    >>> print(msoft.get_trade_datetime())
    2017-07-14 20:00:00 UTC+0000

So on July 14, 2017, the stock price of Microsoft went up from 72.24 to 72.78, an increase of about 0.75%. However, a single day of data doesn't give us any significant information. At least we now know the present state of this particular stock or instrument.

Now it would be worth looking at the historical data. The following function helps us get the historical data for Microsoft:

    import numpy as np
    from yahoo_finance import Share

    def get_prices(share_symbol, start_date, end_date, cache_filename):
        # Reuse the cached prices if they exist; otherwise download and cache them.
        try:
            stock_prices = np.load(cache_filename)
        except IOError:
            share = Share(share_symbol)
            stock_hist = share.get_historical(start_date, end_date)
            # Convert to float so the prices can be used numerically later.
            stock_prices = [float(stock_price['Open']) for stock_price in stock_hist]
            np.save(cache_filename, stock_prices)
        return stock_prices

The get_prices() function takes several parameters: the share symbol of an instrument in the stock market, the start date, and the end date. It also takes a cache filename, so that the historical data is stored locally and is not downloaded repeatedly. Once you have downloaded the data, it's time to plot it to get some insights.
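The cache-then-download pattern used by get_prices() can be tried without any network access. The following is only a minimal sketch under stated assumptions: it swaps the .npy cache for a JSON file so that it runs on the standard library alone, and synthetic_prices() and counting_fetch() are hypothetical stand-ins for a real data source, not part of yahoo_finance.

```python
import json
import os
import random
import tempfile


def get_prices_cached(cache_filename, fetch_prices):
    """Return prices from the cache file, fetching and caching them on a miss."""
    try:
        with open(cache_filename) as cache_file:
            return json.load(cache_file)
    except IOError:
        prices = [float(p) for p in fetch_prices()]
        with open(cache_filename, 'w') as cache_file:
            json.dump(prices, cache_file)
        return prices


def synthetic_prices(n_days=500, start=50.0, seed=0):
    """Hypothetical stand-in for a real data source: a random walk floored at 1.0."""
    rng = random.Random(seed)
    prices, price = [], start
    for _ in range(n_days):
        price = max(1.0, price + rng.gauss(0.05, 1.0))
        prices.append(price)
    return prices


calls = []  # records how many times the (pretend) download actually ran


def counting_fetch():
    calls.append(1)
    return synthetic_prices()


cache_path = os.path.join(tempfile.mkdtemp(), 'synthetic_prices.json')
first = get_prices_cached(cache_path, counting_fetch)   # cache miss: fetches and saves
second = get_prices_cached(cache_path, counting_fetch)  # cache hit: read from disk
print(len(first), first == second, len(calls))
```

The second call never reaches the fetch function, which is exactly what the cache file is for when the real download is slow or unreliable.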
The following function helps us to plot the price:

    import matplotlib.pyplot as plt

    def plot_prices(prices):
        plt.title('Opening stock prices')
        plt.xlabel('day')
        plt.ylabel('price ($)')
        plt.plot(prices)
        plt.savefig('prices.png')

Now we can call these two functions with real arguments as follows:

    if __name__ == '__main__':
        prices = get_prices('MSFT', '2000-07-01', '2017-07-01', 'historical_stock_prices.npy')
        plot_prices(prices)

Here I have chosen a wide range of 17 years of historical data to get a better insight. Now, let's take a look at the output of this data:

The goal is to learn a policy that gains the maximum net worth from trading in the stock market. So what will a trading agent achieve in the end? Figure 8 gives you some clue:

Figure 8 shows that if the agent buys a certain instrument at $20 and sells it at a peak price of, say, $180, it makes a reward (that is, a profit) of $160. So, implementing such an intelligent agent using RL algorithms is a cool idea, isn't it?

From the previous example, we have seen that a successful RL agent needs two well-defined operations:

How to select an action
How to improve the utility Q-function

To be more specific, given a state, the decision policy calculates the next action to take. The Q-function, in turn, is improved from each new experience of taking an action.

Also, most reinforcement learning algorithms boil down to just three main steps: infer, perform, and learn. During the first step, the algorithm selects the best action (a) given a state (s) using the knowledge it has so far. Next, it performs the action to find out the reward (r) as well as the next state (s'). Then, it improves its understanding of the world using the newly acquired knowledge (s, r, a, s'), as shown in the following figure:

Now, let's start implementing the decision policy that determines whether to buy, sell, or hold a stock. Again, we will do it in an incremental way.
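Before moving to the trading policies, the infer, perform, and learn steps can be sketched on a tiny lookup table. This is a simplified, tabular illustration and not the neural network policy built later in this article; the state names 'low' and 'high', the learning rate alpha, and the reward of -1.0 are made-up values for demonstration only.

```python
from collections import defaultdict

ACTIONS = ['Buy', 'Sell', 'Hold']


def infer(q_table, state):
    """Infer: pick the action with the highest current Q-value in this state."""
    return max(ACTIONS, key=lambda a: q_table[(state, a)])


def learn(q_table, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Learn: move Q(s, a) toward reward + gamma * max over a' of Q(s', a')."""
    best_next = max(q_table[(next_state, a)] for a in ACTIONS)
    target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (target - q_table[(state, action)])


q_table = defaultdict(float)        # every Q-value starts at 0.0
q_table[('high', 'Sell')] = 10.0    # pretend selling at a high price was already learned

state = 'low'
action = infer(q_table, state)      # all values tie at 0.0, so the first action wins
reward, next_state = -1.0, 'high'   # perform: hypothetical outcome of taking the action
learn(q_table, state, action, reward, next_state)
print(action, q_table[('low', 'Buy')])
```

Even though buying costs money immediately (a negative reward), its Q-value rises because the next state offers a valuable action. That look-ahead is what the Q-function contributes beyond the immediate reward.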
At first, we will create a random decision policy and evaluate the agent's performance. But before that, let's create an abstract class so that we can implement it accordingly:

    class DecisionPolicy:
        def select_action(self, current_state, step):
            pass

        def update_q(self, state, action, reward, next_state):
            pass

The next task is to inherit from this superclass to implement a random decision policy:

    import random

    class RandomDecisionPolicy(DecisionPolicy):
        def __init__(self, actions):
            self.actions = actions

        def select_action(self, current_state, step):
            # Pick any action uniformly at random, ignoring the state.
            action = self.actions[random.randint(0, len(self.actions) - 1)]
            return action

The previous class does nothing except define a function named select_action(), which randomly picks an action without even looking at the state. Now, if you would like to use this policy, you can run it on the real-world stock price data.

A decision policy takes care of exploration and exploitation at each interval of time, as shown in the following figure, where the intervals form states S1, S2, and S3. The policy suggests an action to be taken, which we may either choose to exploit or otherwise randomly explore another action. As we get rewards for performing an action, we can update the policy function over time:

Fantastic, so we have the policy, and now it's time to use this policy to make decisions and return the performance. Now, imagine a real scenario: if you're trading on a platform such as Forex or ForTrade, you also need to compute the portfolio and the current profit or loss, that is, the reward. Typically, these can be calculated as follows:

    portfolio = budget + number of stocks * share value
    reward = new_portfolio - current_portfolio

At first, we initialize the values needed to compute the net worth of a portfolio, where the state is a (hist + 2)-dimensional vector. In our case, it is 202-dimensional.
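The portfolio and reward formulas, together with the layout of the state vector, can be checked with a few lines of arithmetic. This is a sketch with made-up prices; the helper names make_state() and portfolio_value() are illustrative and do not appear in the book's code.

```python
def make_state(prices, start, hist, budget, num_stocks):
    """State = the last `hist` prices followed by the budget and the share count."""
    return list(prices[start:start + hist]) + [budget, num_stocks]


def portfolio_value(budget, num_stocks, share_value):
    """portfolio = budget + number of stocks * share value"""
    return budget + num_stocks * share_value


hist = 200
prices = [50.0 + 0.1 * day for day in range(300)]   # made-up, steadily rising prices
budget, num_stocks = 1000.0, 0

state = make_state(prices, 0, hist, budget, num_stocks)

# Buy one share at 60.0, then the price moves to 62.5.
buy_price, later_price = 60.0, 62.5
before = portfolio_value(budget, num_stocks, buy_price)
budget, num_stocks = budget - buy_price, num_stocks + 1
after_buy = portfolio_value(budget, num_stocks, buy_price)
after_move = portfolio_value(budget, num_stocks, later_price)

print(len(state))               # 202
print(after_buy - before)       # 0.0: buying at the current price leaves net worth unchanged
print(after_move - after_buy)   # 2.5: the reward comes from the later price move
```

Note that the act of buying or selling at the current price never changes the portfolio value by itself; the reward only appears once the share price moves while shares are held.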
Then we iterate over the price series. The number of iterations is the length of the price series minus (hist + 1), so that a full history window and the price that follows it are always available. In each iteration, we first compute the current value of the portfolio. Since we have already defined our random policy, we can then select an action from the current policy. Next, we update the budget and the number of stocks based on that action and compute the new portfolio value after taking the action. From the two portfolio values, we compute the reward (that is, the profit) of taking the action in that state. We also need to update the policy after experiencing the new action. Finally, we compute the final portfolio worth:

    def run_simulation(policy, initial_budget, initial_num_stocks, prices, hist, debug=False):
        budget = initial_budget
        num_stocks = initial_num_stocks
        share_value = 0
        transitions = list()
        for i in range(len(prices) - hist - 1):
            if i % 100 == 0:
                print('progress {:.2f}%'.format(float(100*i) / (len(prices) - hist - 1)))
            # State: the last `hist` prices plus the budget and the number of stocks.
            current_state = np.asmatrix(np.hstack((prices[i:i+hist], budget, num_stocks)))
            current_portfolio = budget + num_stocks * share_value
            action = policy.select_action(current_state, i)
            share_value = float(prices[i + hist + 1])
            if action == 'Buy' and budget >= share_value:
                budget -= share_value
                num_stocks += 1
            elif action == 'Sell' and num_stocks > 0:
                budget += share_value
                num_stocks -= 1
            else:
                action = 'Hold'
            new_portfolio = budget + num_stocks * share_value
            reward = new_portfolio - current_portfolio
            next_state = np.asmatrix(np.hstack((prices[i+1:i+hist+1], budget, num_stocks)))
            transitions.append((current_state, action, reward, next_state))
            policy.update_q(current_state, action, reward, next_state)
        portfolio = budget + num_stocks * share_value
        if debug:
            print('${}\t{} shares'.format(budget, num_stocks))
        return portfolio

A single simulation run may give a reasonable result, but the outcome varies considerably from run to run. Thus, to obtain a more robust measurement of success, let's run the simulation several times and average the results. Running it, say, 100 times may take a while to complete, but the results will be more reliable:

    def run_simulations(policy, budget, num_stocks, prices, hist):
        num_tries = 100
        final_portfolios = list()
        for i in range(num_tries):
            final_portfolio = run_simulation(policy, budget, num_stocks, prices, hist)
            final_portfolios.append(final_portfolio)
        avg, std = np.mean(final_portfolios), np.std(final_portfolios)
        return avg, std

The previous function computes the average and the standard deviation of the final portfolio by running the previous simulation function 100 times. Now, it's time to evaluate the previous agent. As already stated, there are three possible actions for the stock trading agent: buy, sell, and hold. We have a state vector of 202 dimensions and a budget of only $1000. The evaluation goes as follows:

    actions = ['Buy', 'Sell', 'Hold']
    hist = 200
    policy = RandomDecisionPolicy(actions)
    budget = 1000.0
    num_stocks = 0
    avg, std = run_simulations(policy, budget, num_stocks, prices, hist)
    print(avg, std)

    >>> 1512.87102405 682.427384814

The first value is the mean and the second is the standard deviation of the final portfolio. So, our stock prediction agent predicts that as a trader we could make a profit of about $513. Not bad. However, the problem is that since we have used a random decision policy, the result is not so reliable. To be more specific, a second execution will almost certainly produce a different result:

    >>> 1518.12039077 603.15350649

Therefore, we should develop a more robust decision policy. This is where neural network-based Q-learning comes in as a decision policy. Next, we will introduce a new hyperparameter, epsilon, to keep the solution from getting stuck when applying the same action over and over.
The lower its value, the more often the policy will randomly explore new actions.

Next, I am going to write a class containing the following functions:

Constructor: This sets the hyperparameters of the Q-function. It also sets the number of hidden nodes in the neural network. Once we have these two, it defines the input and output tensors. It then defines the structure of the neural network. Further, it defines the operations to compute the utility. Then, it uses an optimizer to update the model parameters to minimize the loss, sets up the session, and initializes the variables.
select_action: This function exploits the best option with probability epsilon (ramped up from zero over the first steps) and otherwise picks a random action.
update_q: This updates the Q-function by updating its model parameters.

Refer to the following code:

    import tensorflow as tf

    class QLearningDecisionPolicy(DecisionPolicy):
        def __init__(self, actions, input_dim):
            self.epsilon = 0.9
            self.gamma = 0.001
            self.actions = actions
            output_dim = len(actions)
            h1_dim = 200

            self.x = tf.placeholder(tf.float32, [None, input_dim])
            self.y = tf.placeholder(tf.float32, [output_dim])
            W1 = tf.Variable(tf.random_normal([input_dim, h1_dim]))
            b1 = tf.Variable(tf.constant(0.1, shape=[h1_dim]))
            h1 = tf.nn.relu(tf.matmul(self.x, W1) + b1)
            W2 = tf.Variable(tf.random_normal([h1_dim, output_dim]))
            b2 = tf.Variable(tf.constant(0.1, shape=[output_dim]))
            self.q = tf.nn.relu(tf.matmul(h1, W2) + b2)

            loss = tf.square(self.y - self.q)
            self.train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
            self.sess = tf.Session()
            # global_variables_initializer replaces the deprecated initialize_all_variables.
            self.sess.run(tf.global_variables_initializer())

        def select_action(self, current_state, step):
            threshold = min(self.epsilon, step / 1000.)
            if random.random() < threshold:
                # Exploit best option with probability epsilon
                action_q_vals = self.sess.run(self.q, feed_dict={self.x: current_state})
                action_idx = np.argmax(action_q_vals)
                action = self.actions[action_idx]
            else:
                # Random option with probability 1 - epsilon
                action = self.actions[random.randint(0, len(self.actions) - 1)]
            return action

        def update_q(self, state, action, reward, next_state):
            action_q_vals = self.sess.run(self.q, feed_dict={self.x: state})
            next_action_q_vals = self.sess.run(self.q, feed_dict={self.x: next_state})
            next_action_idx = np.argmax(next_action_q_vals)
            # Update the Q-value of the action that was actually taken.
            action_idx = self.actions.index(action)
            action_q_vals[0, action_idx] = reward + self.gamma * next_action_q_vals[0, next_action_idx]
            action_q_vals = np.squeeze(np.asarray(action_q_vals))
            self.sess.run(self.train_op, feed_dict={self.x: state, self.y: action_q_vals})

There you go! We have a stock price predictive model running, and we've built it using Reinforcement Learning and TensorFlow. If you found this tutorial interesting and would like to learn more, head over to grab this book, Predictive Analytics with TensorFlow, by Md. Rezaul Karim.
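To see how the threshold in select_action behaves over time, the exploration schedule can be isolated from TensorFlow. The following is a sketch only: the Q-values are made up, warmup_steps is a hypothetical name for the constant 1000 used in the code above, and a seeded random.Random instance stands in for the module-level random functions.

```python
import random


def exploit_probability(step, epsilon=0.9, warmup_steps=1000):
    """Probability of exploiting at a given step: ramps linearly up to epsilon."""
    return min(epsilon, step / float(warmup_steps))


def select_action(q_values, actions, step, rng):
    """Exploit the best-valued action with the ramped probability, else explore."""
    if rng.random() < exploit_probability(step):
        best = max(range(len(actions)), key=lambda i: q_values[i])
        return actions[best]
    return rng.choice(actions)


actions = ['Buy', 'Sell', 'Hold']
q_values = [0.2, 1.5, 0.7]   # made-up Q-values; 'Sell' is currently the best action
rng = random.Random(42)

for step in (0, 100, 500, 900, 5000):
    print(step, exploit_probability(step))

# Late in training the agent mostly exploits, but still explores now and then.
picks = [select_action(q_values, actions, 5000, rng) for _ in range(2000)]
print(picks.count('Sell') / float(len(picks)))
```

At step 0 the agent always explores, and after the first 900 steps the exploitation probability stays at epsilon. A smaller epsilon therefore caps exploitation lower and leaves more room for random actions.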


Nicholas Maccharoli

30 Mar 2016

6 min read

Golang Decorators: Logging & Time Profiling


Golang's imperative world

Golang is not, by any means, a functional language; its design remains true to its jingle, which says that it is "C for the 21st Century". One task I tried to do early on in learning the language was to search for the map, filter, and reduce functions in the standard library, but to no avail. Next, I tried rolling my own versions, but I felt as though I hit a bit of a roadblock when I discovered that there is no support for generics in the language at the time of writing this. There is, however, support for Higher Order Functions or, more simply put, functions that take other functions as arguments and return functions.

If you have spent some time in Python, you may have come to love a design pattern called "Decorator". In fact, decorators make life in Python so great that support for applying them is built right into the language with a nifty @ operator! Python frameworks such as Flask use decorators extensively. If you have little or no experience in Python, fear not, for the concept is a design pattern independent of any language.

Decorators

An alternative name for the decorator pattern is "wrapper", which pretty much sums it all up in one word! A decorator's job is only to wrap a function so that additional code can be executed when the original function is called. This is accomplished by writing a function that takes a function as its argument and returns a function of the same type (Higher Order Functions in action!). While the returned function still calls the original function and passes through its return value, it does something extra along the way.

Decorators for logging

We can easily log calls to a specific function with a little help from our decorator friends. Say we wanted to log which user liked a blog post and what the ID of the post was, all without touching any code in the original likePost function.
Here is our original function:

    func likePost(userId int, postId int) bool {
        fmt.Printf("Update Complete!\n")
        return true
    }

Our decorator might look something similar to this:

    type LikeFunc func(int, int) bool

    func decoratedLike(f LikeFunc) LikeFunc {
        return func(userId int, postId int) bool {
            fmt.Printf("likePost Log: User %v liked post# %v\n", userId, postId)
            return f(userId, postId)
        }
    }

Note the use of the type definition here. I encourage you to use it for the sake of readability when defining functions with long signatures, such as those of decorators, as you would otherwise need to type the function signature twice.

Now, we can apply the decorator and allow the logging to begin:

    r := decoratedLike(likePost)
    r(1414, 324)
    r(5454, 324)
    r(4322, 250)

This produces the following output:

    likePost Log: User 1414 liked post# 324
    Update Complete!
    likePost Log: User 5454 liked post# 324
    Update Complete!
    likePost Log: User 4322 liked post# 250
    Update Complete!

Our original likePost function still gets called and runs as expected, but now we get an additional log detailing the user and post IDs that were passed to the function each time it was called. Hopefully, this will help speed up debugging our likePost function if and when we encounter strange behavior!

Decorators for performance!

Say we run a "Top 10" site and, previously, our main sorting routine to find the top 10 cat photos of this week on the Internet was written with Golang's func Sort(data Interface) function from the sort package of the Golang standard library. Everything is fine until we are informed that Fluffy the cat is infuriated that she is coming in at number six on the list and not number five. The cats at ranks five and six on the list both had 5000 likes each, but Fluffy reached 5000 likes a day earlier than Bozo the cat, who is currently higher ranked.
We like to give credit where it's due, so we apologize to Fluffy and go on to use the stable sort, func Stable(data Interface), which preserves the original order of elements that are equal in value. We improve our code and tests so that this does not happen again (we promised Fluffy!). The tests pass, everything looks great, and we deploy gracefully... or so we think. Over the course of the day, other developers also deploy their changes, and then, after checking our performance reports, we notice a slowdown somewhere. Is it from our switch to the stable sort? Well, let's use decorators to measure the performance of both sort functions and check whether there is a noticeable dip in performance.

Here's our testing function:

    type SortFunc func(sort.Interface)

    func timedSortFunc(f SortFunc) SortFunc {
        return func(data sort.Interface) {
            defer func(t time.Time) {
                fmt.Printf("--- Time Elapsed: %v ---\n", time.Since(t))
            }(time.Now())
            f(data)
        }
    }

In case you are unfamiliar with defer, all it does is call the function it is passed just before the surrounding function returns. The arguments passed to the deferred call are evaluated right away, so the value we get from time.Now() is really the start time of the function!

Let's go ahead and give this test a go:

    stable := timedSortFunc(sort.Stable)
    unStable := timedSortFunc(sort.Sort)

    // 10000 Elements with values ranging
    // between 0 and 5000
    randomCatList1 := randomCatScoreSlice(10000, 5000)
    randomCatList2 := randomCatScoreSlice(10000, 5000)

    fmt.Printf("Unstable Sorting Function:\n")
    unStable(randomCatList1)
    fmt.Printf("Stable Sorting Function:\n")
    stable(randomCatList2)

The following output is yielded:

    Unstable Sorting Function:
    --- Time Elapsed: 282.889µs ---
    Stable Sorting Function:
    --- Time Elapsed: 93.947µs ---

Wow! Fluffy's complaint not only made our top 10 list more accurate, but the list now sorts about three times as fast with the stable version of sort as well!
(However, we still need to be careful; sort.Stable most likely uses way more memory than the standard sort.Sort function.)

Final thoughts

Figuring out when and where to apply the decorator pattern is really up to you and your team. There are no hard rules, and you can completely live without it. However, when it comes to things like extra logging or profiling a pesky area of your code, this technique may prove to be a valuable tool.

Where is the rest of the code?

In order to get this example up and running, there is some setup code that was not shown here in order to keep the post from becoming too bloated. I encourage you to take a look at this code here if you are interested!

About the author

Nick Maccharoli is an iOS/backend developer and open source enthusiast working at a start-up in Tokyo and enjoying the current development scene. You can see what he is up to at @din0sr or github.com/nirma.


Pravin Dhandre

19 Jul 2018

5 min read

PostGIS extension: pgRouting for calculating driving distance [Tutorial]



Packt Editorial Staff

18 Jul 2018

8 min read

How to set up an Ethereum development environment [Tutorial]


There are various ways to develop on the Ethereum blockchain. In this article, we will look at the mainstream options, which are:

Test networks
How to set up an Ethereum private net

This tutorial is extracted from the book Mastering Blockchain - Second Edition written by Imran Bashir.

There are multiple ways to develop smart contracts on Ethereum. The usual and sensible approach is to develop and test Ethereum smart contracts either on a local private net or in a simulated environment, and then deploy them on a public testnet. After all the relevant tests are successful on the public testnet, the contracts can then be deployed to the public mainnet. There are, however, variations in this process, and many developers opt to only develop and test contracts in locally simulated environments and then deploy them on the public mainnet or on their private production blockchain networks. Developing in a simulated environment and then deploying directly to a public network can lead to a faster time to production, as setting up private networks may take longer than setting up a local development environment with a blockchain simulator.

Let's start with connecting to a test network.

Ethereum connection on test networks

Geth, the Ethereum Go client (https://geth.ethereum.org), can be connected to the test network using the following command:

    $ geth --testnet

A sample output is shown in the following screenshot. The screenshot shows the type of the network chosen and various other pieces of information regarding the blockchain download:

The output of the geth command connecting to Ethereum test net

A blockchain explorer for the testnet is located at https://ropsten.etherscan.io and can be used to trace transactions and blocks on the Ethereum test network. There are other test networks available too, such as Frontier, Morden, Ropsten, and Rinkeby.
Geth can be issued with a command-line flag to connect to the desired network:

--testnet: Ropsten network: pre-configured proof-of-work test network
--rinkeby: Rinkeby network: pre-configured proof-of-authority test network
--networkid value: Network identifier (integer, 1=Frontier, 2=Morden (disused), 3=Ropsten, 4=Rinkeby) (default: 1)

Now let us do some experiments with building a private network, and then we will see how a contract can be deployed on this network using Mist and the command-line tools.

Setting up a private net

A private net allows the creation of an entirely new blockchain. This is different from testnet or mainnet in the sense that it uses its own genesis block and network ID. In order to create a private net, three components are needed:

Network ID
The genesis file
Data directory to store blockchain data

Even though the data directory is not strictly required to be specified, if there is more than one blockchain already active on the system, then the data directory should be specified so that a separate directory is used for the new blockchain.

On the mainnet, the Geth Ethereum client is capable of discovering boot nodes by default, as they are hardcoded in the Geth client, and it connects automatically. But on a private net, Geth needs to be configured by specifying appropriate flags and configuration in order for it to be discoverable by other peers or to be able to discover other peers. We will see how this is achieved shortly.

In addition to the previously mentioned three components, it is desirable to disable node discovery so that other nodes on the internet cannot discover your private network and it stays secure. If other networks happen to have the same genesis file and network ID, they may connect to your private net. The chance of having the same network ID and genesis block is very low, but, nevertheless, disabling node discovery is good practice, and is recommended.
In the following section, all these parameters are discussed in detail with a practical example.

Network ID

Network ID can be any positive number except 1 and 3, which are already in use by Ethereum mainnet and testnet (Ropsten), respectively. Network ID 786 has been chosen for the example private network discussed later in this section.

The genesis file

The genesis file contains the necessary fields required for a custom genesis block. This is the first block in the network and does not point to any previous block. The Ethereum protocol performs checks in order to ensure that no other node on the internet can participate in the consensus mechanism unless it has the same genesis block. Chain ID is usually used as an identification of the network. A custom genesis file that will be used later in the example is shown here:

    {
      "nonce": "0x0000000000000042",
      "timestamp": "0x00",
      "parentHash": "0x0000000000000000000000000000000000000000000000000000000000000000",
      "extraData": "0x00",
      "gasLimit": "0x8000000",
      "difficulty": "0x0400",
      "mixhash": "0x0000000000000000000000000000000000000000000000000000000000000000",
      "coinbase": "0x3333333333333333333333333333333333333333",
      "alloc": {
      },
      "config": {
        "chainId": 786,
        "homesteadBlock": 0,
        "eip155Block": 0,
        "eip158Block": 0
      }
    }

This file is saved as a text file with the JSON extension; for example, privategenesis.json. Optionally, Ether can be pre-allocated by specifying the beneficiary's addresses and the amount of Wei, but this is usually not necessary because, on a private network, Ether can be mined very quickly. In order to pre-allocate, a section can be added to the genesis file, as shown here:

    "alloc": {
      "0xcf61d213faa9acadbf0d110e1397caf20445c58f": {
        "balance": "100000"
      }
    }

Now let's see what each of these parameters means.

nonce: This is a 64-bit hash used to prove that PoW has been sufficiently completed. This works in combination with the mixhash parameter.
timestamp: This is the Unix timestamp of the block. This is used to verify the sequence of the blocks and for difficulty adjustment. For example, if blocks are being generated too quickly, then the difficulty goes higher.
parentHash: This is always zero for the genesis (first) block, as the first block has no parent.
extraData: This parameter allows a 32-byte arbitrary value to be saved with the block.
gasLimit: This is the limit on the expenditure of gas per block.
difficulty: This parameter is used to determine the mining target. It represents the difficulty level of the hash required to prove the PoW.
mixhash: This is a 256-bit hash which works in combination with nonce to prove that a sufficient amount of computational resources has been spent in order to complete the PoW requirements.
coinbase: This is the 160-bit address to which the mining reward is sent as a result of successful mining.
alloc: This parameter contains the list of pre-allocated wallets. The long hex string is the account to which the balance is allocated.
config: This section contains various configuration information defining the chain ID and the blockchain hard fork block numbers. This parameter is not required to be used in private networks.

Data directory

This is the directory where the blockchain data for the private Ethereum network will be saved. In the following example, it is ~/etherprivate/.

In the Geth client, a number of parameters are specified in order to launch the private network and further fine-tune its configuration. These flags are listed here.

Flags and their meaning

The following are the flags used with the Geth client:

--nodiscover: This flag ensures that the node is not automatically discoverable if it happens to have the same genesis file and network ID.
--maxpeers: This flag is used to specify the number of peers allowed to be connected to the private net. If it is set to 0, then no one will be able to connect, which might be desirable in a few scenarios, such as private testing.
--rpc: This is used to enable the RPC interface in Geth.
--rpcapi: This flag takes a list of APIs to be allowed as a parameter. For example, eth,web3 will enable the Eth and Web3 interfaces over RPC.
--rpcport: This sets up the TCP RPC port; for example: 9999.
--rpccorsdomain: This flag specifies the URL that is allowed to connect to the private Geth node and perform RPC operations. cors in --rpccorsdomain means cross-origin resource sharing.
--port: This specifies the TCP port that will be used to listen to the incoming connections from other peers.
--identity: This flag is a string that specifies the name of a private node.

Static nodes

If there is a need to connect to a specific set of peers, then these nodes can be added to a file in the directory where the chaindata and keystore files are saved; for example, in the ~/etherprivate/ directory. The filename should be static-nodes.json. This is valuable in a private network because this way the nodes can be discovered on a private network. An example of the JSON file is shown as follows:

    [
      "enode://44352ede5b9e792e437c1c0431c1578ce3676a87e1f588434aff1299d30325c233c8d426fc57a25380481c8a36fb3be2787375e932fb4885885f6452f6efa77f@xxx.xxx.xxx.xxx:TCP_PORT"
    ]

Here, xxx.xxx.xxx.xxx is the public IP address and TCP_PORT can be any valid and available TCP port on the system. The long hex string is the node ID.

To summarize, we explored Ethereum test networks and how to set up private Ethereum networks. Learn about cryptography and cryptocurrencies from this book, Mastering Blockchain - Second Edition, to build highly secure, decentralized applications and conduct trusted in-app transactions.
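The genesis file and the flags described above can be put together with a short script. This is a sketch that only writes the genesis file and prints the commands; it does not run Geth. The --datadir flag and the init subcommand are assumptions not covered in this extract, the values given to --maxpeers, --rpccorsdomain, --port, and --identity are purely illustrative, and a temporary directory stands in for ~/etherprivate/. The flag names are taken as listed in this article and may differ in other Geth releases.

```python
import json
import os
import tempfile

NETWORK_ID = 786  # the example network ID chosen in this article

genesis = {
    "nonce": "0x0000000000000042",
    "timestamp": "0x00",
    "parentHash": "0x" + "00" * 32,
    "extraData": "0x00",
    "gasLimit": "0x8000000",
    "difficulty": "0x0400",
    "mixhash": "0x" + "00" * 32,
    "coinbase": "0x" + "33" * 20,
    "alloc": {},
    "config": {"chainId": NETWORK_ID, "homesteadBlock": 0,
               "eip155Block": 0, "eip158Block": 0},
}

# Temporary stand-in for the ~/etherprivate/ data directory.
data_dir = tempfile.mkdtemp(prefix="etherprivate-")
genesis_path = os.path.join(data_dir, "privategenesis.json")
with open(genesis_path, "w") as genesis_file:
    json.dump(genesis, genesis_file, indent=2)

# Assumed one-off step: initialise the data directory from the genesis file.
init_cmd = ["geth", "--datadir", data_dir, "init", genesis_path]

# Launch command built from the flags described above; values are illustrative.
run_cmd = [
    "geth",
    "--datadir", data_dir,
    "--networkid", str(NETWORK_ID),
    "--nodiscover",
    "--maxpeers", "5",
    "--rpc",
    "--rpcapi", "eth,web3",
    "--rpcport", "9999",
    "--rpccorsdomain", "http://localhost:8000",
    "--port", "30301",
    "--identity", "PrivateNode",
]

print(" ".join(init_cmd))
print(" ".join(run_cmd))
```

Keeping the commands as argument lists makes it easy to hand them to a process runner later, and printing them first is a cheap way to review the configuration before any blockchain data is created.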


Sugandha Lahoti

03 Dec 2019

6 min read

Amazon re:Invent 2019 Day One: AWS launches Braket, its new quantum service and releases SageMaker Operators for Kubernetes


On day one of the ongoing Amazon re:Invent 2019, there was a flurry of announcements made for AWS. Most importantly, AWS announced the preview launch of Braket, its own quantum computing service, following the likes of IBM, Microsoft, and Google. Amazon also released Amazon SageMaker Operators for Kubernetes to help data scientists using Kubernetes to train, tune, and deploy machine learning models in Amazon SageMaker.

re:Invent is Amazon's flagship conference hosted by Amazon Web Services for the global cloud computing community. This year, re:Invent is taking place in Las Vegas, December 2-6, 2019.

re:Invent 2019 Day One announcements

Braket: AWS' new quantum service in preview now

Amazon Braket (named after the common notation for quantum states) is a fully managed service that helps you get started with quantum computing. Braket consists of a full development environment that helps data scientists to:

design quantum algorithms from scratch or choose from a set of pre-built algorithms,
test these algorithms on simulated quantum computers (including gate-based and quantum annealing superconductors, and ion trap hardware),
run them on your choice of different quantum hardware technologies (including D-Wave, IonQ, and Rigetti).

Once your tests are complete, you will be automatically notified and your results will be stored in Amazon S3. Amazon Braket publishes event logs and performance metrics, such as completion status and execution time, to Amazon CloudWatch. To make it easier to develop hybrid algorithms that combine classical and quantum tasks, Amazon Braket helps manage classical compute resources and establish low-latency connections to the quantum hardware.

At re:Invent 2019, AWS also launched the Amazon Quantum Solutions Lab, a collaborative research program that connects you with quantum computing experts from Amazon and its technology and consulting partners.
They can help you identify potential uses of quantum computing, build internal expertise, and collaborate on programs to design and test quantum algorithms. Braket is available in preview now.

Amazon SageMaker Operators for Kubernetes

Developers and data scientists can now use Kubernetes to train, tune, and deploy machine learning models in Amazon SageMaker, with the new Amazon SageMaker Operators for Kubernetes. Customers can install these operators on their Kubernetes cluster to create Amazon SageMaker jobs natively using the Kubernetes API and command-line Kubernetes tools such as ‘kubectl’. Operators can be used to train machine learning models, optimize hyperparameters for a given model, run batch transform jobs over existing models, and set up inference endpoints. With these operators, users can manage their jobs in Amazon SageMaker from their Kubernetes cluster in Amazon Elastic Kubernetes Service (EKS). Amazon SageMaker Operators for Kubernetes are available in select AWS regions.

AWS DeepComposer, a creative way to learn Machine Learning

Amazon has launched AWS DeepComposer, the world’s first machine learning-enabled musical keyboard, at re:Invent 2019. AWS DeepComposer is an educational tool that gives developers of all skill levels a creative way to experience machine learning: music.

https://youtu.be/XH2EbK9dQlg

You can input a melody by connecting the AWS DeepComposer keyboard to your computer, or play the virtual keyboard in the AWS DeepComposer console. You can generate an original music composition using the pre-trained genre models in the console. You can then publish your tracks to SoundCloud. It is designed specifically to educate developers by means of tutorials, sample code, and training data. These can be used to get started with building generative AI models, all without having to write a single line of code.
With AWS DeepComposer, you can train and optimize GAN models to create original music. GAN models pit two different neural networks against each other to produce new and original digital works based on sample inputs. AWS DeepComposer is available in preview now.

Amazon Transcribe now extended to healthcare patients

Amazon’s automatic speech recognition service, Amazon Transcribe, is now available for medical speech, as announced at re:Invent 2019. Amazon Transcribe Medical allows physicians to easily and quickly dictate their clinical notes and see their speech converted to accurate text in real time, without any human intervention. Clinicians can use natural speech and do not have to explicitly call out punctuation like “comma” or “full stop”. This text can then be automatically fed to downstream applications such as EHR systems, or to AWS language services such as Amazon Comprehend Medical for entity extraction.

To make it work, you need to capture audio using your device’s microphone and send PCM (pulse-code modulation) audio to a streaming API based on the popular WebSocket protocol. This API will respond with a series of JSON blobs containing the transcribed text, as well as word-level time stamps, punctuation, etc. Optionally, you can save this data to an Amazon Simple Storage Service (S3) bucket. Amazon Transcribe Medical is available in the US East (N. Virginia) and US West (Oregon) regions.

Updates to Microsoft Windows Server

AWS has released a bring-your-own-license (BYOL) experience as an easier way for customers to bring and manage their existing licenses for Microsoft Windows Server and SQL Server on AWS. The new BYOL experience enables customers who want to use their existing Windows Server or SQL Server licenses to seamlessly create virtual machines in EC2, while AWS takes care of managing their licenses to help ensure compliance with the licensing rules specified by the customer. Amazon is also providing the End-of-Support Migration Program (EMP) for Windows Server.
On January 14, 2020, support for Windows Server 2008 and 2008 R2 will end. Having an application that can run only on an unsupported version of Windows Server is problematic, as you will no longer get free security patch updates, leaving you vulnerable to security and compliance risks. This new program combines technology with expert guidance to migrate your legacy applications running on outdated versions of Windows Server to newer, supported versions on AWS.

Other updates announced at Amazon re:Invent 2019

- Amazon EventBridge Schema Registry is now in preview. The schema registry stores the structure (schema) of Amazon EventBridge events and maps them to Java, Python, and TypeScript bindings so that you can use the events as typed objects.
- The existing AWS IoT SiteWise preview adds new features such as creating a virtual representation of your facility, monitoring production performance metrics, and using AWS IoT SiteWise Monitor to visualize the data in real time. AWS IoT SiteWise Monitor is a new SaaS application that lets you monitor and interact with the data collected and organized by AWS IoT SiteWise.
- The upcoming AWS DeepRacer Evo car will include a stereo camera and a Light Detection and Ranging (LIDAR) sensor. The DeepRacer League in 2020 will have 8 additional races in 5 countries.
- EC2 Image Builder is now in preview. It is a service that makes it easier and faster to build and maintain secure OS images for Windows Server and Amazon Linux 2, using automated build pipelines.

Amazon re:Invent will continue throughout this week (the last day is the 6th of December). You can access the Livestream here. Keep checking this space for news on other updates and launches.


Savia Lobo

28 May 2019

4 min read

Introducing InNative, an AOT compiler that runs WebAssembly using LLVM outside the Sandbox at 95% native speed


On May 17, a team of WebAssembly enthusiasts introduced InNative, an AOT (Ahead-Of-Time) compiler for WebAssembly using LLVM, with a customizable level of sandboxing for Windows and Linux. It helps run WebAssembly outside the sandbox at 95% native speed. The team also announced an initial release of the inNative Runtime v0.1.0 for Windows and Linux today.

https://twitter.com/inNative_sdk/status/1133098611514830850

With InNative, users can grab a precompiled SDK from GitHub or build from source. If users turn off all the isolation, the LLVM optimizer can almost reach native speeds and nearly recreate the same optimized assembly that a fully optimized C++ compiler would give, while leveraging all the features of the host CPU. Given below are some benchmarks, adapted from these C++ benchmarks:

[Figure: benchmark results. Source: InNative]

The benchmark reports average speed in microseconds and is compiled using GCC -O3 --march=native on WSL. “We usually see 75% native speed with sandboxing and 95% without. The C++ benchmark is actually run twice - we use the second run, after the cache has had time to warm up. Turning on fastmath for both inNative and GCC makes both go faster, but the relative speed stays the same”, the official website reads.

“The only reason we haven’t already gotten to 99% native speed is because WebAssembly’s 32-bit integer indexes break LLVM’s vectorization due to pointer aliasing”, the WebAssembly researcher mentions. Once fixed-width SIMD instructions are added, native WebAssembly will close the gap entirely, as this vectorization analysis will have happened before the WebAssembly compilation step.

Some features of InNative

InNative has the same advantage that JIT compilers have: it can always take full advantage of the native processor architecture. It can also perform expensive brute force optimizations like a traditional AOT compiler, by caching its compilation result.
By compiling on the target machine once, one can get the best of both Just-In-Time and Ahead-Of-Time compilation. It also allows WebAssembly modules to interface directly with the operating system. inNative uses its own unofficial extension to pass WebAssembly pointers into C functions, as this kind of C interop is not yet supported by the standard. However, there is a proposal for it.

inNative also lets users write C libraries that expose themselves as WebAssembly modules, which would make it possible to build an interop library in C++. Once WebIDL bindings are standardized, it will be a lot easier to compile WebAssembly that binds to C APIs. This opens up a world of tightly integrated WebAssembly plugins for any language that supports calling standard C interfaces, integrated directly into the program. inNative lays the groundwork needed for this, and it doesn’t need to be platform-independent, only architecture-independent.

“We could break the stranglehold of i386 on the software industry and free developers to experiment with novel CPU architectures without having to worry about whether our favorite language compiles to it. A WebAssembly application built against POSIX could run on any CPU architecture that implements a POSIX compatible kernel!”, the official blog announced.

A user on Hacker News commented, “The differentiator for InNative seems to be the ability to bypass the sandbox altogether as well as additional native interop with the OS. Looks promising!” Another user on Reddit wrote, “This is really exciting! I've been wondering why we ship x86 and ARM assembly for years now, when we could more efficiently ship an LLVM-esque assembly that compiles on first run for the native arch. This could be the solution!”

To learn more about InNative, head over to its official blog post.


Fatema Patrawala

15 Aug 2018

6 min read

Understanding functional reactive programming in Scala [Tutorial]


Like OOP (Object-Oriented Programming), Functional Programming (FP) is a programming paradigm. It is a programming style in which we write programs in terms of pure functions and immutable data, and it treats programs as function evaluation. As we use pure functions and immutable data to write our applications, we get lots of benefits for free. For instance, with immutable data, we do not need to worry about shared mutable state, side effects, and thread safety. FP follows a declarative programming style, which means programming is done in terms of expressions, not statements. For instance, in OOP or imperative programming paradigms, we use statements to write programs, whereas FP treats everything as an expression.

In this Scala functional programming tutorial, we will understand the principles and benefits of FP and why FP is the best fit for Reactive programming (RP) in Scala. This Scala tutorial is an extract taken from the book Scala Reactive Programming written by Rambabu Posa.

Principles of functional programming

FP has the following principles:

- Pure functions
- Immutable data
- No side effects
- Referential transparency (RT)
- Functions are first-class citizens
- Functions that include anonymous functions, higher-order functions, combinators, partial functions, partially-applied functions, function currying, closures
- Tail recursion
- Functions composability

A pure function is a function that always returns the same results for the same inputs, irrespective of how many times and where you run it. We get lots of benefits with immutable data: for instance, no shared data, no side effects, thread safety for free, and so on.

Just as an object is a first-class citizen in OOP, a function is a first-class citizen in FP. This means that we can use a function as any of these:

- An object
- A value
- Data
- A data type
- An operation

In simple words, in FP, we treat both functions and data as the same.
We can compose functions that are in sequential order so that we can solve even complex problems easily. Higher-Order Functions (HOF) are functions that take one or more functions as their parameters, return a function as their result, or do both. For instance, map(), flatMap(), filter(), and so on are some of the important and frequently used higher-order functions. Consider the following example:

    map(x => x*x)

Here, the map() function is an example of a Higher-Order Function because it takes an anonymous function as its parameter. This anonymous function, x => x*x, is of type Int => Int: it takes an Int as input and returns an Int as its result. An anonymous function is a function without any name.

Benefits of functional programming

FP provides us with many benefits:

- Thread-safe code
- Easy-to-write concurrency and parallel code
- We can write simple, readable, and elegant code
- Type safety
- Composability
- Supports Declarative programming

As we use pure functions and immutability in FP, we get thread safety for free. One of the greatest benefits of FP is function composability. We can compose multiple functions one by one and execute them either sequentially or in parallel. This gives us a great approach to solving complex problems easily.

Functional Reactive programming

The combination of FP and RP is known as Functional Reactive Programming or, for short, FRP. It is multiparadigm and combines the benefits and best features of two of the most popular programming paradigms, FP and RP. FRP is a new programming paradigm, or a new style of programming, that uses the RP paradigm to support asynchronous non-blocking data streaming with backpressure and also uses the FP paradigm to utilize its features (such as pure functions, immutability, no side effects, RT, and more) and its HOF or combinators (such as map, flatMap, filter, reduce, fold, and zip). In simple words, FRP is a new programming paradigm to support RP using FP features and its building blocks.
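The higher-order functions and function composition described above can be sketched in a few lines. This is only an illustrative sketch: it uses Python for brevity rather than the tutorial's Scala, and the `compose` helper is a hypothetical name introduced here, not something from the book.

```python
from functools import reduce


def compose(*funcs):
    """Compose functions left to right: compose(f, g)(x) == g(f(x)).

    Hypothetical helper for illustration; it is itself a higher-order
    function because it takes functions and returns a function.
    """
    return lambda x: reduce(lambda acc, f: f(acc), funcs, x)


# An anonymous function of type Int => Int, as in map(x => x*x).
square = lambda x: x * x

# map and filter take a function as a parameter (higher-order functions).
squares = list(map(square, [1, 2, 3, 4]))
evens = list(filter(lambda x: x % 2 == 0, squares))

# reduce folds a collection into a single value.
total = reduce(lambda a, b: a + b, evens, 0)

# Composition: build a new function out of two small pure functions.
add_one_then_square = compose(lambda x: x + 1, square)
result = add_one_then_square(3)

print(squares, evens, total, result)
```

Because every function here is pure and no data is mutated, the individual steps can be reordered, reused, or run on separate inputs without affecting one another, which is the composability benefit the text refers to.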
FRP = FP + RP, as shown here:

[Figure: FRP = FP + RP]

Today, we have many FRP solutions, frameworks, tools, or technologies. Here's a list of a few FRP technologies:

- Scala, Play Framework, and Akka Toolkit
- RxJS
- Reactive-banana
- Reactive
- Sodium
- Haskell

This book is dedicated to discussing Lightbend's FRP technology stack: Lagom Framework, Scala, Play Framework, and Akka Toolkit (Akka Streams). FRP technologies are mainly useful in developing interactive programs, such as rich GUIs (graphical user interfaces), animations, multiplayer games, computer music, or robot controllers.

Types of Reactive Programming

Even though most projects or companies use the FP paradigm to develop their Reactive systems or solutions, there are a couple of ways to use RP. They are known as types of RP:

- FRP (Functional Reactive Programming)
- OORP (Object-Oriented Reactive Programming)

However, FP is the best programming paradigm to combine with RP. We get all the benefits of FP for free.

Why FP is the best fit for RP

When we combine RP with FP, we get the following benefits:

- Composability: we can compose multiple data streams using functional operations so that we can solve even complex problems easily
- Thread safety
- Readability
- Simple, concise, clear, and easy-to-understand code
- Easy-to-write asynchronous, concurrent, and parallel code
- Supports very flexible and easy-to-use operations
- Supports Declarative programming
- Easy to write, more scalable, highly available, and robust code

In FP, we concentrate on what to do to fulfill a job, whereas in other programming paradigms, such as OOP or imperative programming (IP), we concentrate on how to do it. Declarative programming gives us the following benefits:

- No side effects
- Enforces the use of immutability
- Easy to write concise and understandable code

The main property of RP is real-time data streaming, and the main property of FP is composability. If we combine these two paradigms, we get more benefits and can develop better solutions easily.
In RP, everything is a stream, while everything is a function in FP. We can use these functions to perform operations on data streams.

We learnt the principles and benefits of Scala functional programming. To build fault-tolerant, robust, and distributed applications in Scala, grab the book Scala Reactive Programming today.


Yohei Yoshimuta

15 Jan 2015

4 min read

Part1. Learning AWS CLI


As an application developer, you should be familiar with the CLI. Using the CLI instead of the UI has the benefit that operations can be documented, which makes them reproducible and shareable. Fortunately, AWS provides both an API and a unified CLI tool named aws-cli. You need to understand the AWS CLI especially when you want to control something the AWS UI doesn't provide yet; for example, Scheduled Scaling for Auto Scaling is available only via the AWS CLI.

Before explaining the full process, I will assume that you are using AWS VPC and S3. Also, make sure that all of your network resources, such as security groups, are inside the VPC, and that you know the access key and secret key of your own AWS account or IAM account. Let's see how we can control EC2 instances and S3:

Install aws-cli package

The first thing you need to do is to install the aws-cli package on your machine.

    # Install pip if your machine doesn't have pip yet
    $ sudo easy_install pip

    # Install awscli with pip
    $ sudo pip install awscli

    # Configure AWS credential and config
    $ aws configure
    AWS Access Key ID: foo
    AWS Secret Access Key: bar
    Default region name [us-west-2]: us-west-2
    Default output format [None]: json

Note: You have to configure the AWS Access Key ID and Secret Access Key of an IAM account that has the necessary but minimum policies attached. For now, I recommend you create an IAM account with AmazonEC2FullAccess-AMI-201412181939 and AmazonS3FullAccess-AMI-201502041017 attached.
    # AmazonEC2FullAccess-AMI-201412181939
    {
      "Version": "2012-10-17",
      "Statement": [
        { "Action": "ec2:*", "Effect": "Allow", "Resource": "*" },
        { "Effect": "Allow", "Action": "elasticloadbalancing:*", "Resource": "*" },
        { "Effect": "Allow", "Action": "cloudwatch:*", "Resource": "*" },
        { "Effect": "Allow", "Action": "autoscaling:*", "Resource": "*" }
      ]
    }

    # AmazonS3FullAccess-AMI-201502041017
    {
      "Version": "2012-10-17",
      "Statement": [
        { "Effect": "Allow", "Action": "s3:*", "Resource": "*" }
      ]
    }

Run an EC2 instance

Okay, you are ready to control AWS resources via the CLI. The most important part of running an EC2 instance is preparing the option parameters. You can confirm the details on the run-instances page of the AWS CLI documentation. The following command generates a JSON file containing skeleton option parameters. Note that AvailabilityZone must be a zone, not a region, and must match the zone of your subnet (us-west-2a below is only an example):

    $ aws ec2 run-instances --generate-cli-skeleton > /tmp/run-instances_base.json

    # We overwrite this skeleton file to be shorter and easier to understand
    $ vi /tmp/run-instances_base.json
    $ cat /tmp/run-instances_base.json
    {
      "ImageId": "ami-936d9d93",
      "KeyName": "YOUR Key pair name",
      "InstanceType": "t2.micro",
      "Placement": { "AvailabilityZone": "us-west-2a" },
      "NetworkInterfaces": [
        {
          "DeviceIndex": 0,
          "SubnetId": "subnet-***",
          "Groups": [ "sg-***" ],
          "DeleteOnTermination": true,
          "AssociatePublicIpAddress": true
        }
      ]
    }

    # Run an instance
    $ aws ec2 run-instances --cli-input-json file:///tmp/run-instances_base.json

List running EC2 instances

Now confirm your running EC2 instances. The details of the command are on the describe-instances page of the AWS CLI documentation. I recommend you use the jq tool because the output is formatted as JSON and you might be overwhelmed by its volume. You can install jq via brew or build it from source.
    # Install jq if your machine doesn't have it yet and you want to use it on Mac OS X
    $ brew install jq

    # List EC2 instances
    $ aws ec2 describe-instances | jq -r '.Reservations[].Instances[] | [.LaunchTime, .State.Name, .InstanceId, .InstanceType, .PrivateIpAddress, (.Tags[] | select(.Key=="Name").Value)] | join("\t")'
    2015-09-22T10:16:41.000Z running i-f19f6e54 t2.micro 10.0.1.61

Terminate an EC2 instance

Well, it's time to terminate an EC2 instance to save money. The details of the command are on the terminate-instances page of the AWS CLI documentation.

    # DryRun the command
    $ aws ec2 terminate-instances --instance-ids i-f19f6e54 --dry-run

    # Terminate an EC2 instance
    $ aws ec2 terminate-instances --instance-ids i-f19f6e54

List S3 directory contents

If you are an operations engineer investigating a problem, you will often want to find and grep AWS ELB access logs. To start, find the specific file. The details of the command are on the ls page of the AWS CLI documentation.

    # List ELB access logs created at 2015/09/18
    $ aws s3 ls s3://example-elb-log/example-app-elb/AWSLogs/717669809617/elasticloadbalancing/us-west-2/2015/09/18/

Download S3 content

Then you can download the file in question and grep it for a specific keyword. The details of the command are on the cp page of the AWS CLI documentation.

    # Find access logs whose SSL cipher are ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2
    $ aws s3 cp s3://example-elb-log/example-app-elb/AWSLogs/717669809617/elasticloadbalancing/us-west-2/2015/09/18/717669809617_elasticloadbalancing_us-west-2_example-app-elb_20150918T0230Z_54.92.79.213_5wo8k1of.log - | head -n 1000 | grep 'ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2'

Conclusion

AWS CLI is a very useful tool. It supports a wide range of important services; for example, I recently upgraded the SSL certificate of an ELB from a SHA-1-signed certificate to a SHA-2-signed one before the iOS 9 release, due to iOS 9 ATS. During these operations, I received an asynchronous peer review of the planned aws-cli commands. That is one of the AWS CLI's benefits.
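If jq is not available, the instance listing shown earlier can be approximated in Python using only the standard library. The following is a minimal sketch, not part of the original tutorial: the function name `summarize_instances` is hypothetical, and the sample document reuses the instance values from the output above with an assumed Name tag of `example-app`. In practice you would load the JSON printed by `aws ec2 describe-instances` instead of the inline sample.

```python
import json


def summarize_instances(describe_output):
    """Return one tab-separated row per instance, mirroring the jq filter.

    Hypothetical helper: expects the parsed JSON printed by
    `aws ec2 describe-instances`.
    """
    rows = []
    for reservation in describe_output.get("Reservations", []):
        for inst in reservation.get("Instances", []):
            # Pick the value of the "Name" tag, or "" when it is missing.
            name = next(
                (t["Value"] for t in inst.get("Tags", []) if t["Key"] == "Name"),
                "",
            )
            rows.append("\t".join([
                inst["LaunchTime"],
                inst["State"]["Name"],
                inst["InstanceId"],
                inst["InstanceType"],
                inst.get("PrivateIpAddress", ""),
                name,
            ]))
    return rows


# Sample input; the Name tag value is an assumption for illustration.
sample = json.loads("""
{
  "Reservations": [
    {
      "Instances": [
        {
          "LaunchTime": "2015-09-22T10:16:41.000Z",
          "State": {"Name": "running"},
          "InstanceId": "i-f19f6e54",
          "InstanceType": "t2.micro",
          "PrivateIpAddress": "10.0.1.61",
          "Tags": [{"Key": "Name", "Value": "example-app"}]
        }
      ]
    }
  ]
}
""")

rows = summarize_instances(sample)
for row in rows:
    print(row)
```

One deliberate difference from the jq filter: an instance without a Name tag still produces a row here (with an empty last column), whereas the jq expression produces no row for it.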
About the author Yohei Yoshimuta is a software engineer with a proven record of delivering high quality software in both game and advertising industries. He has extensive experience building products from scratch in both small and large teams. His primary focuses are Perl, Go, and AWS technologies. You can reach him at @yoheimuta on GitHub and Twitter.


Natasha Mathur

12 Oct 2018

10 min read

Implementing Proximal Policy Optimization (PPO) algorithm in Unity [Tutorial]


ML-Agents uses a reinforcement learning technique called PPO, or Proximal Policy Optimization. Unity's implementation is the preferred training method in ML-Agents and uses a neural network. This PPO algorithm is implemented in TensorFlow and runs in a separate Python process (communicating with the running Unity application over a socket). In this tutorial, we look at how to implement PPO, a reinforcement learning algorithm used for training the ML agents in Unity. This tutorial also explores training statistics with TensorBoard. This tutorial is an excerpt taken from the book 'Learn Unity ML-Agents – Fundamentals of Unity Machine Learning' by Micheal Lanham.

Before implementing PPO, let's have a look at how to set up the special Unity environment needed for controlling the Unity training environment. Go through the following steps to learn how to configure the 3D ball environment for external training.

How to set up a 3D environment in Unity for external training

Open the Unity editor and load the ML-Agents demo unityenvironment project. If you still have it open from the last chapter, then that will work as well. Open the 3DBall.scene in the ML-Agents/Examples/3DBall folder. Locate the Brain3DBrain object in the Hierarchy window and select it. In the Inspector window, set the Brain Type to External. From the menu, select Edit | Project Settings | Player. From the Inspector window, set the properties as shown in the following screenshot:

[Screenshot: Setting the Player resolution properties]

From the menu, select File | Build Settings. Click on the Add Open Scene button and make sure that only the 3DBall scene is active, as shown in the following dialog:

[Screenshot: Setting the Build Settings for the Unity environment]

Set the Target Platform to your chosen desktop OS (Windows, in this example) and click the Build button at the bottom of the dialog. You will be prompted to choose a folder to build into. Select the python folder in the base of the ml-agents folder.
If you are prompted to enter a name for the file, enter 3DBall. On newer versions of Unity, from 2018 onward, the name of the folder will be set by the name of the Unity environment build folder, which will be python. Be sure that you know where Unity is placing the build, and be sure that the file is in the python folder. At the time of writing, on Windows, Unity will name the executable python.exe and not 3DBall.exe. This is important to remember when we get to set up the Python notebook.

With the environment built, we can move on to running the Basics notebook against the app. Let's see how to run the Jupyter notebook to control the environment.

Running the environment

Open up the Basics Jupyter notebook again; remember that we wanted to leave it open after testing the Python install. Go through the following steps to run the environment.

Ensure that you update the first code block with your environment name, like so:

    env_name = "python"  # Name of the Unity environment binary to launch
    train_mode = True  # Whether to run the environment in training or inference mode

We have the environment name set to 'python' here because that is the name of the executable that gets built into the python folder. You can include the file extension, but you don't have to. If you are not sure what the filename is, check the folder; it really will save you some frustration.

Go inside the first code block and then click the Run button on the toolbar. Clicking Run will run the block of code you currently have your cursor in. This is a really powerful feature of a notebook; being able to move back and forth between code blocks and execute what you need is very useful when building complex algorithms.

Click inside the second code block and click Run. The second code block is responsible for loading code dependencies. Note the following line in the second code block:

    from unityagents import UnityEnvironment

This line is where we import the unityagents UnityEnvironment class.
This class is our controller for running the environment.

Run the third code block. Note how a Unity window will launch, showing the environment. You should also notice an output showing a successful startup and the brain stats. If you encounter an error at this point, go back and ensure you have the env_name variable set with the correct filename.

Run the fourth code block. You should again see some more output, but unfortunately, with this control method, you don't see interactive activity. We will try to resolve this issue in a later chapter.

Run the fifth code block. This will run through some random actions in order to generate some random output.

Finally, run the sixth code block. This will close the Unity environment.

Feel free to review the Basics notebook and play with the code. Take advantage of the ability to modify the code or make minor changes and quickly rerun code blocks. Now that we know how to set up a 3D environment in Unity, we can move on to the implementation of PPO.

How to implement PPO in Unity

The implementation of PPO provided by Unity for training has been set up in a single script that we can put together quite quickly. Open up Unity to the unityenvironment sample projects and go through the following steps:

Locate the GridWorld scene in the Assets/ML-Agents/Examples/GridWorld folder. Double-click it to open it. Locate the GridWorldBrain and set it to External. Set the project up using the steps mentioned in the previous section. From the menu, select File | Build Settings.... Uncheck any earlier scenes and be sure to click Add Open Scenes to add the GridWorld scene to the build. Click Build to build the project, and again make sure that you put the output in the python folder. Again, if you are lost, refer to the ML-Agents external brains section. Open a Python shell or Anaconda prompt window. Be sure to navigate to the root source folder, ml-agents.
Activate the ml-agents environment with the following:

    activate ml-agents

From the ml-agents folder, run the following command:

    python python/learn.py python/python.exe --run-id=grid1 --train

You may have to use python3 instead of python, depending on your Python installation. This will execute the learn.py script against the python/python.exe environment; be sure to put your own executable name if you are not on Windows. Then we set a useful run-id that we can use to identify runs later. Finally, we set the --train switch in order for the agent/brain to also be trained.

As the script runs, you should see the Unity environment get launched, and the shell window or prompt will start to show training statistics, as shown in the following screenshot of the console window:

[Screenshot: Training output generated from learn.py]

Let the training run for as long as it needs. Depending on your machine and the number of iterations, you could be looking at a few hours of training (yes, you read that right). As the environment is trained, you will see the agent moving around and getting reset over and over again. In the next section, we will take a closer look at what the statistics are telling us.

Understanding training statistics with TensorBoard

Inherently, ML has its roots in statistics, statistical analysis, and probability theory. While we won't strictly use statistical methods to train our models like some ML algorithms do, we will use statistics to evaluate training performance. Hopefully, you have some memory of high school statistics, but if not, a quick refresher will certainly be helpful. The Unity PPO and other RL algorithms use a tool called TensorBoard, which allows us to evaluate statistics as an agent/environment is running. Go through the following steps as we run another Grid environment while watching the training with TensorBoard:

Open the trainer_config.yaml file in Visual Studio Code or another text editor. This file contains the various training parameters we use to train our models.
Locate the configuration for the GridWorldBrain, as shown in the following code:

    GridWorldBrain:
        batch_size: 32
        normalize: false
        num_layers: 3
        hidden_units: 256
        beta: 5.0e-3
        gamma: 0.9
        buffer_size: 256
        max_steps: 5.0e5
        summary_freq: 2000
        time_horizon: 5

Change the num_layers parameter from 1 to 3, as shown in the preceding code. This parameter sets the number of layers the neural network will have. Adding more layers allows our model to better generalize, which is a good thing. However, this will decrease our training performance, that is, increase the time it takes our agent to learn. Sometimes, this isn't a bad thing if you have the CPU/GPU to throw at training, but not all of us do, so evaluating training performance will be essential.

Open a command prompt or shell in the ml-agents folder and run the following command:

    python python/learn.py python/python.exe --run-id=grid2 --train

Note how we updated the --run-id parameter to grid2 from grid1. This will allow us to add another run of data and compare it to the last run in real time. This will run a new training session. If you have problems starting a session, make sure you are only running one environment at a time.

Open a new command prompt or shell window to the same ml-agents folder. Keep your other training window running. Run the following command:

    tensorboard --logdir=summaries

This will start the TensorBoard web server, which will serve up a web UI to view our training results. Copy the hosting endpoint (typically http://localhost:6006, or perhaps the machine name) and paste it into a web browser. After a while, you should see the TensorBoard UI, as shown in the following screenshot:

[Screenshot: TensorBoard UI showing the results of training on GridWorld]

You will need to wait a while to see progress from the second training session. When you do, though, as shown in the preceding image, you will notice that the new model (grid2) is lagging behind in training.
Note how the blue line on each of the plots takes several thousand iterations to catch up. This is a result of the more general multilayer network. This isn't a big deal in this example, but on more complex problems, that lag could make a huge difference. While some of the plots, such as the entropy plot, show the potential for improvement, overall we don't see a significant improvement. Using a single-layer network for this example is probably sufficient.

We learned about PPO and its implementation in Unity. To learn more PPO concepts in Unity, be sure to check out the book Learn Unity ML-Agents – Fundamentals of Unity Machine Learning.

Implementing Unity game engine and assets for 2D game development [Tutorial]
Creating interactive Unity character animations and avatars [Tutorial]
Unity 2D & 3D game kits simplify Unity game development for beginners



Prasad Ramesh

22 Feb 2019

4 min read

Google engineers work towards large scale federated learning


In a paper published on February 4, Google engineers laid out plans for advancing federated learning at scale. The paper presents the high-level plans, challenges, solutions, and applications. Federated learning was first introduced in 2017 by Google. The idea is to use data from a number of computing devices, such as smartphones, instead of a centralized data source.

Federated learning can help with privacy

Federated learning can be beneficial because it addresses privacy concerns. The system runs on Android phones, where the data is used locally but never uploaded to any server. A deep neural network is trained using TensorFlow on the data stored on the Android phone. The Federated Averaging algorithm by Brendan McMahan uses an approach similar to synchronous training. The weights of the neural network are combined in the cloud using Federated Averaging. This creates a global model, which is then pushed back to the phones as results/desirable actions. To enhance privacy, approaches like differential privacy and Secure Aggregation are used. The paper addresses challenges like time zone differences, connectivity issues, and interrupted execution. Their work is mature enough to deploy the system in production for tens of millions of devices. They are now working towards supporting billions of devices.

The training protocol

The system involves devices and the Federated Learning server communicating availability, and the server selecting devices to run a task. A subset of the available devices is selected for a task. The Federated Learning server tells the devices what computing task to run by sending a plan. A plan consists of a TensorFlow graph and instructions to execute it.
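To make the weight-combining step mentioned above more concrete, here is a minimal plain-Python sketch of a Federated Averaging style update. The article does not give the formula; weighting each device by its number of local training examples is the usual formulation and is an assumption here, as are the function name `federated_average` and the sample numbers.

```python
from typing import List, Sequence


def federated_average(
    client_weights: Sequence[Sequence[float]],
    client_sizes: Sequence[int],
) -> List[float]:
    """Combine per-device weight vectors into one global vector.

    Each device's weights count in proportion to the number of local
    training examples it used (an assumed, commonly used weighting).
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one example count per device update")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total example count must be positive")
    dim = len(client_weights[0])
    global_weights = [0.0] * dim
    for weights, size in zip(client_weights, client_sizes):
        if len(weights) != dim:
            raise ValueError("all weight vectors must have the same length")
        for i, w in enumerate(weights):
            # Add this device's contribution, scaled by its share of the data.
            global_weights[i] += w * size / total
    return global_weights


# Three hypothetical phones report locally trained weights.
updates = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
examples = [10, 30, 60]
global_model = federated_average(updates, examples)
print(global_model)
```

Only the weight vectors and example counts reach the averaging step in this sketch; the raw training data never leaves the simulated devices, which is the property the article highlights.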
There are three phases for the training to take place:

Selection of the devices that meet eligibility criteria
Configuring the server with simple or Secure Aggregation
Reporting from the devices, where reaching a certain number gets the training round started

Source: Towards Federated Learning at Scale: System Design

The devices are supposed to maintain a repository of the collected data, and the applications are responsible for providing data to the Federated Learning runtime as an example store. The Federated Learning server is designed to operate across many orders of magnitude of scale. Each round can mean updates from devices in the range of KBs to tens of MBs coming into and going out of the server.

Data collection

To avoid harming the phone’s battery life and performance, various analytics are collected in the cloud. The logs don’t contain any personally identifiable information.

Secure aggregation

Secure aggregation uses encryption to make individual device updates uninspectable. They plan to use it for protection against threats in data centers. Secure aggregation would ensure data encryption even when the data is in memory.

Challenges of federated learning

Compared to a centralized dataset, federated learning poses a number of challenges. The training data is not inspectable, so tooling is required to work with proxy data. Models cannot be run interactively and must be compiled to be deployed in the Federated Learning server. Model resource consumption and runtime compatibility also come into the picture when working with many devices in real time.

Applications of Federated Learning

Federated learning is best for cases where the data on devices is more relevant than the data on servers. Examples include ranking items for better navigation, suggestions for on-device keyboards, and next-word prediction. This has already been implemented on Google Pixel and Gboard.
Future work is to eliminate bias caused by restrictions in device selection, develop algorithms that support better parallelism (more devices in one round), avoid retraining already trained tasks on devices, and apply compression to save bandwidth.

Federated computation, not federated learning

The authors do not mention machine learning explicitly anywhere in the paper. They believe that the applications of such a model are not limited to machine learning. Federated Computation is the term they want to use for this concept.

Federated computation and edge computing

Federated learning and edge computing are very similar, but there are subtle differences in the purpose of the two. Federated learning is used to solve problems with specific tasks assigned to endpoint smartphones. Edge computing is for predefined tasks to be processed at end nodes, for example, IoT cameras. Federated learning decentralizes the data used, while edge computing decentralizes the task computation to various devices.

For more details on the architecture and its working, you can check out the research paper.

Technical and hidden debts in machine learning – Google engineers’ give their perspective
Researchers introduce a machine learning model where the learning cannot be proved
What if AIs could collaborate using human-like values? DeepMind researchers propose a Hanabi platform.



Amey Varangaonkar

24 Oct 2018

8 min read

What we learnt from the GitHub Octoverse 2018 Report


Highlighting key accomplishments over the last year, GitHub, Microsoft’s recent major acquisition, released its yearly Octoverse report. The last 365 days have seen GitHub go from strength to strength as the world’s leading source code management platform. The Octoverse report highlights how developers work and learn on GitHub. It also gives us some interesting insights into the way developers and even organizations are collaborating across geographies and time zones on a variety of interesting projects. The Octoverse report is based on the data collected from October 1, 2017 to September 30, 2018, exactly 365 days from the publication of the last Octoverse report. In this article, we look at some of the key takeaways from the Octoverse 2018 report.

Asia is home to GitHub’s fastest growing community

GitHub developers who are currently based in Asia can feel proud of themselves. Octoverse 2018 states that more open source projects have been created in Asia than anywhere else in the world. While developers all over the world are joining and using GitHub, most new signups over the last year have come from countries such as China, India, and Japan. At the same time, GitHub usage is also growing quite rapidly in Asian countries such as Hong Kong, Singapore, Bangladesh, and Malaysia. This is quite interesting, considering the growth of AI has become part of the national policies in countries such as China, Hong Kong, and Japan. We can expect these trends to continue, and developing countries such as India and Bangladesh to contribute even more going forward.

An ever-growing developer community squashes doubts on GitHub’s credibility

When Microsoft announced their plans to buy GitHub in a deal worth $7.5 billion, many eyebrows were raised. Given Microsoft’s earlier stance against Open Source projects, some developers were skeptical of this move.
They feared that Microsoft would exploit GitHub’s popularity and inject some kind of a subscription model into GitHub in order to recover the huge investment. Many even migrated their projects from GitHub on to rival platforms such as BitBucket and GitLab in protest. However, the numbers presented in the Octoverse report seem to suggest otherwise. According to the report, the number of new registrations last year alone was more than the number of registrations in the first 6 years of GitHub, which is quite impressive. The number of active contributors on GitHub has increased by more than 1.5 times over the last year, suggesting GitHub is still the undisputed leader when it comes to code management and collaboration. With more than 1.1 billion contributions across private and public projects over one year, I think we all know where major developers’ loyalty lies. Not just developers, organizations love GitHub too The Octoverse report states that 2.1 million organizations are using GitHub in some capacity, across public and private repositories. This number is a staggering 40% increase from 2017 - indicating the huge reliance on GitHub for effective code management and collaboration between the developers. Not just that, over 150,000 developers and organizations are using the apps and tools available on the GitHub marketplace for quick, efficient and seamless code development and management. GitHub had also launched a new feature called Security Alerts way back in November 2017. This feature alerted developers of any vulnerabilities in their project dependencies, and also suggested fixes for them from the community. Many organizations have found this feature to be an invaluable offering by GitHub, as it allowed for the development of secure, bug-free applications. Their faith in GitHub will be reinforced even more now that the report has revealed that over the last year, more than 5 million vulnerabilities were detected and communicated across to the developers. 
The report also suggests that members of an organization make substantial contributions to the projects and are twice as much active when they install and use the company app on GitHub. This suggests that GitHub offers them the best environment and the luxury to develop apps just as they want. All these insights only point towards one simple fact - Organizations and businesses trust GitHub. Microsoft are walking the talk with active open source contribution Microsoft joined the Linux Foundation after its initial (and vehement) opposition to the Open Source movement. With a change in leadership and the long-term vision came the realization that open source is essential for them - and the world - to progress. Eventually, they declared their support for the cause by going platinum with the Open Source initiative. That is now clearly being reflected in their achievements of the past year. Probably the most refreshing takeaway from the Octoverse report was to see Microsoft leading the pack when it comes to active open source contribution. The report states that Microsoft’s VSCode was the top open source project with 19,000 contributors. Also, it declared that the open source documentation of Azure was the fastest growing project on GitHub. Top open source projects on GitHub (Image courtesy: GitHub State of Octoverse 2018 Report) If this was not enough evidence to suggest Microsoft has amped up their claims of supporting the Open Source movement wholeheartedly, there’s more. Over 7000 Microsoft employees have contributed to various open source projects over the past one year, making it the top-most organization with the most Open Source contribution. Open source contribution by organization (Image source: GitHub State of Octoverse 2018 Report) When we said that Microsoft’s acquisition of GitHub was a good move, we were right! React Native and Machine Learning are red hot right now React Native has been touted to be the future of mobile development by many. 
This claim is corroborated by some strong activity on its GitHub repository over the last year. With over 10k contributors, React Native is one of the most active open source projects right now. With JavaScript continuing to rule the roost for the 5th straight year when it comes to being the top programming language, it comes as no surprise that the cross-platform framework for building native apps is now getting a lot of traction. Top languages over time (Image source: GitHub State of Octoverse 2018 Report) With the rise in popularity of Artificial Intelligence and specifically Machine Learning, the report also highlighted the continued rise of Tensorflow and PyTorch. While Tensorflow is the third most popular open source project right now with over 9000 contributors, Pytorch is one of the fastest growing projects on GitHub. The report also showed that Google and Facebook’s experimental frameworks for machine learning, called Dopamine and Detectron respectively are getting deserved attention thanks to how they are simplifying machine learning. Given the scale at which AI is being applied in the industry right now, these tools are expected to make developers’ lives easier going forward. Hence, it is not surprising to see their interest centered around these tools. GitHub’s Student Developer Pack to promote learning is a success According to the Octoverse report, over 1 million developers have honed their skills by learning best coding practices on GitHub. With over 600,000 active developer students learning how to write effective code through their Student Developer Pack, GitHub continue to give free access to the best development tools so that the students learn by doing and get valuable hands-on experience. In the academia, yet another fact that points to GitHub’s usefulness when it comes to learning is how teachers use the platform to implement real-world workflows for teaching. 
Over 20,000 teachers in over 18000 schools and universities have used GitHub to create over 200,000 assignments till date. Safe to say that this number is only going to grow in the near future. You can read more about how GitHub is promoting learning in their GitHub Education Classroom Report. GitHub’s competition has some serious catching up to do Since Google’s parent company Alphabet lost out to Microsoft in the race to buy GitHub, they have diverted their attention to GitHub’s competitor GitLab. Alphabet have even gone on to suggest that GitLab can surpass GitHub. According to the Octoverse report, Google are only behind Microsoft when it comes to the most open source contributions by any organization. With Gitlab joining forces with Google by moving their operations to Google Cloud Platform from Azure cloud, we might see Google’s contribution to GitHub reduce significantly over the next few years. Who knows, the next Octoverse report might not feature Google at all! That said, the size of the GitHub community, along with the volume of activity that happens on the platform on a per day basis - are both staggering and no other platforms come even close. This fact was supported by the enormity of some of the numbers that the report presented, such as: There are over 31 million developers on the platform till date. More than 96 million repositories are currently being hosted on GitHub There have been 65 million pull requests created in the last one year alone, contributing to almost 33% of the total number of pull requests created till date These numbers dwarf the other platforms such as GitLab, BitBucket and others, in comparison. Not only is GitHub the world’s most popular code collaboration and version control platform, it is currently the #1 choice of tool for most of the developers in the world. It will take some catching up for the likes of GitLab and others, to come even close to GitHub. 
In 5 years, machines will do half of our job tasks of today; 1 in 2 employees need reskilling/upskilling now – World Economic Forum survey. Survey reveals how artificial intelligence is impacting developers across the tech landscape What the IEEE 2018 programming languages survey reveals to us



Serge Gershkovich

27 Jun 2024

7 min read

Mastering Semi-Structured Data in Snowflake


This article is an excerpt from the book, Data Modeling with Snowflake, by Serge Gershkovich. Discover how Snowflake's unique objects and features can be used to leverage universal modeling techniques through real-world examples and SQL recipes.

Introduction

In the era of big data, the ability to efficiently manage and analyze semi-structured data is crucial for businesses. Snowflake, a leading cloud-based data platform, offers robust features to handle semi-structured data formats like JSON, Avro, and Parquet. This article explores the benefits of using the VARIANT data type in Snowflake and provides a hands-on guide to managing semi-structured data.

The Benefits of Semi-Structured Data in Snowflake

Semi-structured data formats are popular due to their flexibility when working with dynamically varying information. Unlike relational schemas, where a precise entity structure must be predefined, semi-structured data can adapt to include or omit attributes as needed, as long as they are properly nested within corresponding parent objects. For example, consider the contact list on your phone. It contains a list of people and their contact details but does not capture those details uniformly. Some contacts may have multiple phone numbers, while others have only one. Some entries might include an email address and street address, while others have just a number and a vague description.

To handle this type of data, Snowflake uses the VARIANT data type, which allows semi-structured data to be stored as a column in a relational table. Snowflake optimizes how VARIANT data is stored internally, ensuring better compression and faster access. Semi-structured data can sit alongside relational data in the same table, and users can access it using basic extensions to standard SQL, achieving similar performance.

Another compelling reason to use the VARIANT data type is its adaptability to change.
If columns are added or removed from semi-structured data, there is no need to modify ELT (extract, load, and transform) pipelines. The VARIANT data type does not require schema changes, and read operations will not fail for an attribute that no longer exists.

Getting Hands-On with Semi-Structured Data

Let's delve into a practical example of working with semi-structured data in Snowflake. This example uses JSON data representing information about pirates, such as details about the crew, weapons, and their ship. All this information is stored in a single VARIANT data type column. In relational data, a row represents a single entity; in semi-structured data, a row can represent an entire file containing multiple entities.

Creating a Table for Semi-Structured Data

Here is a sample SQL script to create a table with semi-structured data:

```sql
CREATE TABLE pirates_data (
    id NUMBER AUTOINCREMENT PRIMARY KEY,
    load_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    data VARIANT
);
```

In this example, the `AUTOINCREMENT` keyword generates a unique ID for each record inserted, and the `VARIANT` column stores the semi-structured JSON data.

Loading Semi-Structured Data

To load semi-structured data into Snowflake, you can use the `COPY INTO` command. Here’s an example of how to load JSON data from an external stage into the `pirates_data` table:

```sql
COPY INTO pirates_data
FROM @my_stage/pirates_data.json
FILE_FORMAT = (TYPE = 'JSON');
```

Querying Semi-Structured Data

Once the data is loaded, you can query it using standard SQL.
For instance, to extract specific attributes from the JSON data, you can use the colon notation on the VARIANT column:

```sql
SELECT
    data:id::NUMBER AS pirate_id,
    data:crew AS crew,
    data:weapons AS weapons
FROM pirates_data;
```

This query extracts the `id`, `crew`, and `weapons` fields from the JSON data stored in the `data` column.

Converting Semi-Structured Data into Relational Data

Although semi-structured data offers flexibility, converting it into a relational format can provide better performance for certain queries. Snowflake allows you to transform VARIANT data into relational columns using the `FLATTEN` function. Here's an example of how to flatten a JSON array into a relational table:

```sql
SELECT
    value:id::NUMBER AS pirate_id,
    value:name::STRING AS name,
    value:rank::STRING AS rank
FROM pirates_data,
    LATERAL FLATTEN(input => data:crew);
```

This query converts the `crew` array from the JSON data into individual rows in a relational format, making it easier to query and analyze.

Schema-on-Read vs. Schema-on-Write

One of the main advantages of using the VARIANT data type in Snowflake is the flexibility of schema-on-read. This approach allows you to ingest data without a predefined schema, and then define the schema at the time of reading the data.
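As a rough analogy for schema-on-read, the following plain-Python sketch stores raw JSON strings untouched and applies types only when a row is read. It is not how Snowflake implements VARIANT; the field names, sample documents, and the helpers `get_path` and `read_pirate` are all hypothetical.

```python
import json
from datetime import date

# Raw documents are kept exactly as they arrive; no schema is enforced
# at load time. The field names here are made up for illustration.
raw_rows = [
    '{"id": "1", "ship": {"name": "Black Pearl"}, "joined": "1720-05-01"}',
    '{"id": "2", "ship": {"name": "Revenge", "type": "sloop"}}',
]


def get_path(doc, *keys):
    """Walk nested keys; return None when any key is missing."""
    for key in keys:
        if not isinstance(doc, dict) or key not in doc:
            return None
        doc = doc[key]
    return doc


def read_pirate(raw):
    """Apply types at read time; the 'schema' lives in this function."""
    doc = json.loads(raw)
    joined = get_path(doc, "joined")
    return {
        "pirate_id": int(doc["id"]),
        "ship_name": get_path(doc, "ship", "name"),
        "ship_type": get_path(doc, "ship", "type"),
        "joined": date.fromisoformat(joined) if joined else None,
    }


pirates = [read_pirate(r) for r in raw_rows]
for p in pirates:
    print(p)
```

A missing attribute simply reads as `None` in this sketch, which mirrors the earlier point that reads do not fail when an attribute no longer exists.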
Schema-on-read contrasts with the traditional schema-on-write approach, where the schema must be defined before data ingestion.

Benefits of Schema-on-Read

Flexibility: You can ingest data without worrying about its structure, which is particularly useful for unstructured or semi-structured data sources.
Adaptability: Schema changes do not require re-ingestion of data, as the schema is applied at read time.
Speed: Data can be loaded more quickly, as there is no need to enforce a schema during the ingestion process.

Example: Using Schema-on-Read with VARIANT Data

Here’s an example demonstrating schema-on-read with semi-structured data in Snowflake:

```sql
SELECT
    data:id::NUMBER AS pirate_id,
    data:ship.name::STRING AS ship_name,
    data:ship.type::STRING AS ship_type
FROM pirates_data;
```

In this query, the schema is defined at read time, allowing you to extract specific attributes from the nested JSON data.

Handling Nested and Repeated Data

Snowflake’s support for semi-structured data also extends to handling nested and repeated data structures. The FLATTEN function is particularly useful for working with such data, enabling you to transform nested arrays into a more manageable relational format.

Example: Flattening Nested Data

Consider a JSON structure where each pirate has a nested array of previous voyages. To flatten this nested data, you can use the following query:

```sql
SELECT
    data:id::NUMBER AS pirate_id,
    value:date::DATE AS voyage_date,
    value:destination::STRING AS voyage_destination
FROM pirates_data,
    LATERAL FLATTEN(input => data:previous_voyages);
```

This query extracts the nested `previous_voyages` array and converts it into individual rows in a relational format.

Performance Considerations

When working with semi-structured data in Snowflake, it’s important to consider performance implications.
While the VARIANT data type offers flexibility, it can also introduce overhead if not managed properly.

Tips for Optimizing Performance

Use Caching: Take advantage of Snowflake’s caching mechanisms to reduce query times for frequently accessed data.
Optimize Queries: Write efficient SQL queries, avoiding unnecessary complexity and ensuring that only the required data is processed.
Monitor Usage: Regularly monitor your Snowflake usage and performance metrics to identify and address potential bottlenecks.

Conclusion

Handling semi-structured data in Snowflake using the VARIANT data type provides immense flexibility and performance benefits. Whether you are dealing with dynamically changing schemas or integrating semi-structured data with relational data, Snowflake’s capabilities can significantly enhance your data management and analytics workflows. By leveraging the techniques outlined in this article, you can efficiently model and transform semi-structured data, unlocking new insights and value for your organization.

For more detailed guidance and advanced techniques, refer to the book "Data Modeling with Snowflake," which provides comprehensive insights into modern data modeling practices and Snowflake’s powerful features.

Author Bio

Serge Gershkovich is a seasoned data architect with decades of experience designing and maintaining enterprise-scale data warehouse platforms and reporting solutions. He is a leading subject matter expert, speaker, content creator, and Snowflake Data Superhero. Serge earned a bachelor of science degree in information systems from the State University of New York (SUNY) Stony Brook. Throughout his career, Serge has worked in model-driven development from SAP BW/HANA to dashboard design to cost-effective cloud analytics with Snowflake. He currently serves as product success lead at SqlDBM, an online database modeling tool.


Packt

06 Jul 2017

11 min read

Spark Streaming


In this article by Romeo Kienzler, the author of the book Mastering Apache Spark 2.x - Second Edition, we will look at the Apache Spark Streaming module, a stream processing-based module within Apache Spark. It uses the Spark cluster to offer the ability to scale to a high degree. Being based on Spark, it is also highly fault tolerant, having the ability to rerun failed tasks by checkpointing the data stream that is being processed. The following areas will be covered in this article after an initial section, which will provide a practical overview of how Apache Spark processes stream-based data:

Error recovery and checkpointing
TCP-based stream processing
File streams
Kafka stream source

For each topic, we will provide a worked example in Scala, and will show how the stream-based architecture can be set up and tested.

Overview

The following diagram shows potential data sources for Apache Streaming, such as Kafka, Flume, and HDFS:

These feed into the Spark Streaming module, and are processed as Discrete Streams. The diagram also shows that other Spark module functionality, such as machine learning, can be used to process the stream-based data. The fully processed data can then be an output for HDFS, databases, or dashboards. This diagram is based on the one at the Spark streaming website, but we wanted to extend it for expressing the Spark module functionality:

When discussing Spark Discrete Streams, the previous figure, again taken from the Spark website at http://spark.apache.org/, is the diagram we like to use. The green boxes in the previous figure show the continuous data stream sent to Spark, being broken down into a Discrete Stream (DStream). The size of each element in the stream is then based on a batch time, which might be two seconds. It is also possible to create a window, expressed as the previous red box, over the DStream.
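The batch and window ideas described above can be illustrated with a small plain-Python sketch. It does not use Spark at all; it only mimics how a continuous stream might be cut into fixed-length batches and how a window spans the most recent batches. The event data, the six-second window, and the helper names `discretize` and `window_counts` are assumptions made for illustration.

```python
from collections import Counter, defaultdict

BATCH_SECONDS = 2    # batch time, as in the two-second example above
WINDOW_SECONDS = 6   # hypothetical window length, a multiple of the batch time

# (timestamp_in_seconds, hashtag) events; the values are made up.
events = [(0.5, "#spark"), (1.2, "#scala"), (2.1, "#spark"),
          (3.9, "#kafka"), (4.0, "#spark"), (7.5, "#scala")]


def discretize(events, batch_seconds):
    """Group a continuous event stream into fixed-length batches."""
    batches = defaultdict(list)
    for ts, value in events:
        batches[int(ts // batch_seconds)].append(value)
    return batches


def window_counts(batches, last_batch, batches_per_window):
    """Count values over the most recent batches ending at last_batch."""
    counts = Counter()
    for index in range(last_batch - batches_per_window + 1, last_batch + 1):
        counts.update(batches.get(index, []))
    return counts


batches = discretize(events, BATCH_SECONDS)
per_window = WINDOW_SECONDS // BATCH_SECONDS
top = window_counts(batches, last_batch=2,
                    batches_per_window=per_window).most_common(2)
print(top)
```

Here each batch plays the role of one element of the DStream, and the window covers three consecutive batches; in Spark itself the batching and windowing are handled by the streaming context rather than by user code like this.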
For instance, when carrying out trend analysis in real time, it might be necessary to determine the top ten Twitter-based hashtags over a ten-minute window. So, given that Spark can be used for stream processing, how is a stream created? The following Scala-based code shows how a Twitter stream can be created. This example is simplified because Twitter authorization has not been included, but you get the idea. The Spark Stream Context (SSC) is created using the Spark Context sc. A batch time is specified when it is created; in this case, 5 seconds. A Twitter-based DStream, called stream, is then created from the StreamingContext using a window of 60 seconds:

    val ssc = new StreamingContext(sc, Seconds(5))
    val stream = TwitterUtils.createStream(ssc, None).window(Seconds(60))

The stream processing can be started with the stream context start method (shown next), and the awaitTermination method indicates that it should process until stopped. So, if this code is embedded in a library-based application, it will run until the session is terminated, perhaps with a Ctrl + C:

    ssc.start()
    ssc.awaitTermination()

This explains what Spark Streaming is, and what it does, but it does not explain error handling, or what to do if your stream-based application fails. The next section will examine Spark Streaming error management and recovery.

Errors and recovery

Generally, the question that needs to be asked for your application is: is it critical that you receive and process all the data? If not, then on failure you might just be able to restart the application and discard the missing or lost data. If this is not the case, then you will need to use checkpointing, which will be described in the next section. It is also worth noting that your application's error management should be robust and self-sufficient. What we mean by this is that if an exception is non-critical, then manage the exception, perhaps log it, and continue processing.
For instance, when a task reaches the maximum number of failures (specified by spark.task.maxFailures), it will terminate processing. Checkpointing It is possible to set up an HDFS-based checkpoint directory to store Apache Spark-based streaming information. In this Scala example, data will be stored in HDFS, under /data/spark/checkpoint. The following HDFS file system ls command shows that before starting, the directory does not exist: [hadoop@hc2nn stream]$ hdfs dfs -ls /data/spark/checkpoint ls: `/data/spark/checkpoint': No such file or directory The Twitter-based Scala code sample given next, starts by defining a package name for the application, and by importing Spark Streaming Context, and Twitter-based functionality. It then defines an application object named stream1: package nz.co.semtechsolutions import org.apache.spark._ import org.apache.spark.SparkContext._ import org.apache.spark.streaming._ import org.apache.spark.streaming.twitter._ import org.apache.spark.streaming.StreamingContext._ object stream1 { Next, a method is defined called createContext, which will be used to create both the spark, and streaming contexts. It will also checkpoint the stream to the HDFS-based directory using the streaming context checkpoint method, which takes a directory path as a parameter. The directory path being the value (cpDir) that was passed into the createContext method: def createContext( cpDir : String ) : StreamingContext = { val appName = "Stream example 1" val conf = new SparkConf() conf.setAppName(appName) val sc = new SparkContext(conf) val ssc = new StreamingContext(sc, Seconds(5) ) ssc.checkpoint( cpDir ) ssc } Now, the main method is defined, as is the HDFS directory, as well as Twitter access authority and parameters. The Spark Streaming context ssc is either retrieved or created using the HDFS checkpoint directory via the StreamingContext method—getOrCreate. 
If the directory doesn't exist, then the previous method called createContext is called, which will create the context and checkpoint. Obviously, we have truncated our own Twitter auth.keys in this example for security reasons: def main(args: Array[String]) { val hdfsDir = "/data/spark/checkpoint" val consumerKey = "QQpxx" val consumerSecret = "0HFzxx" val accessToken = "323xx" val accessTokenSecret = "IlQxx" System.setProperty("twitter4j.oauth.consumerKey", consumerKey) System.setProperty("twitter4j.oauth.consumerSecret", consumerSecret) System.setProperty("twitter4j.oauth.accessToken", accessToken) System.setProperty("twitter4j.oauth.accessTokenSecret", accessTokenSecret) val ssc = StreamingContext.getOrCreate(hdfsDir, () => { createContext( hdfsDir ) }) val stream = TwitterUtils.createStream(ssc,None).window( Seconds(60) ) // do some processing ssc.start() ssc.awaitTermination() } // end main Having run this code, which has no actual processing, the HDFS checkpoint directory can be checked again. This time it is apparent that the checkpoint directory has been created, and the data has been stored: [hadoop@hc2nn stream]$ hdfs dfs -ls /data/spark/checkpoint Found 1 items drwxr-xr-x - hadoop supergroup 0 2015-07-02 13:41 /data/spark/checkpoint/0fc3d94e-6f53-40fb-910d-1eef044b12e9 This example, taken from the Apache Spark website, shows how checkpoint storage can be set up and used. But how often is checkpointing carried out? The metadata is stored during each stream batch. The actual data is stored with a period, which is the maximum of the batch interval, or ten seconds. This might not be ideal for you, so you can reset the value using the method: DStream.checkpoint( newRequiredInterval ) Where newRequiredInterval is the new checkpoint interval value that you require, generally you should aim for a value which is five to ten times your batch interval. Checkpointing saves both the stream batch and metadata (data about the data). 
If the application fails, then when it restarts, the checkpointed data is used when processing is started. The batch data that was being processed at the time of failure is reprocessed, along with the batched data since the failure. Remember to monitor the HDFS disk space being used for checkpointing. In the next section, we will begin to examine the streaming sources, and will provide some examples of each type.

Streaming sources

We will not be able to cover all the stream types with practical examples in this section, but where this article is too small to include code, we will at least provide a description. In this article, we will cover the TCP and file streams, and the Flume, Kafka, and Twitter streams. We will start with a practical TCP-based example. This article also examines stream processing architecture. For instance, what happens in cases where the stream data delivery rate exceeds the potential data processing rate? Systems like Kafka offer a way of solving this issue by allowing the use of multiple data topics and consumers.

TCP stream

The Spark Streaming Context method called socketTextStream can be used to stream data via TCP/IP, by specifying a hostname and a port number. The Scala-based code example in this section will receive data on port 10777 that was supplied using the Netcat Linux command. The code sample starts by defining the package name, and importing Spark, the context, and the streaming classes. The object named stream2 is defined, as is the main method with arguments:

package nz.co.semtechsolutions

import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

object stream2 {

  def main(args: Array[String]) {

The number of arguments passed to the class is checked to ensure that both the hostname and the port number have been supplied. A Spark configuration object is created with an application name defined.
The Spark and streaming contexts are then created. Then, a streaming batch time of 10 seconds is set:

    if ( args.length < 2 ) {
      System.err.println("Usage: stream2 <host> <port>")
      System.exit(1)
    }
    val hostname = args(0).trim
    val portnum = args(1).toInt
    val appName = "Stream example 2"
    val conf = new SparkConf()
    conf.setAppName(appName)
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(10) )

A DStream called rawDstream is created by calling the socketTextStream method of the streaming context using the hostname and port number parameters:

    val rawDstream = ssc.socketTextStream( hostname, portnum )

A top-ten word count is created from the raw stream data by splitting each line into words on spaces. Then a (key,value) pair is created as (word,1), which is reduced by the key value, this being the word. So now, there is a list of words and their associated counts. Next, the key and value are swapped, so the list becomes (count,word). Then, a sort is done on the key, which is now the count. Finally, the top 10 items in the RDD, within the DStream, are taken and printed out:

    val wordCount = rawDstream
      .flatMap(line => line.split(" "))
      .map(word => (word,1))
      .reduceByKey(_+_)
      .map(item => item.swap)
      .transform(rdd => rdd.sortByKey(false))
      .foreachRDD( rdd => { rdd.take(10).foreach(x=>println("List : " + x)) })

The code closes with the Spark Streaming start and awaitTermination methods being called to start the stream processing and await process termination:

    ssc.start()
    ssc.awaitTermination()
  } // end main
} // end stream2

The data for this application is provided, as we stated previously, by the Linux Netcat (nc) command. The Linux cat command dumps the contents of a log file, which is piped to nc. The lk options force Netcat to listen for connections, and keep on listening if the connection is lost.
This example shows that the port being used is 10777:

[root@hc2nn log]# pwd
/var/log
[root@hc2nn log]# cat ./anaconda.storage.log | nc -lk 10777

The output from this TCP-based stream processing is shown here. The actual output is not as important as the method demonstrated. However, the data shows, as expected, a list of 10 log file words in descending count order. Note that the top word is empty because the stream was not filtered for empty words:

List : (17104,)
List : (2333,=)
List : (1656,:)
List : (1603,;)
List : (1557,DEBUG)
List : (564,True)
List : (495,False)
List : (411,None)
List : (356,at)
List : (335,object)

This is interesting if you want to stream data using Apache Spark Streaming, based upon TCP/IP from a host and port. But what about more exotic methods? What if you wish to stream data from a messaging system, or via memory-based channels? What if you want to use some of the big data tools available today like Flume and Kafka? The next sections will examine these options, but first I will demonstrate how streams can be based upon files.

Summary

We could have provided streaming examples for systems like Kinesis, as well as queuing systems, but there was not room in this article. This article has provided practical examples of data recovery via checkpointing in Spark Streaming. It has also touched on the performance limitations of checkpointing and shown that the checkpointing interval should be set at five to ten times the Spark stream batch interval.

Resources for Article:

Further resources on this subject:
Understanding Spark RDD [article]
Spark for Beginners [article]
Setting up Spark [article]
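Returning to the TCP stream example: the top-ten word count can be roughly cross-checked on the local machine before any data is piped to Netcat. The following is a minimal sketch using standard POSIX tools rather than Spark; top_words is a hypothetical helper name, and the sample data is made up for the demonstration. Like line.split(" ") in the Scala code, it splits on single spaces, so runs of spaces produce empty words.

```shell
#!/bin/sh
# Local cross-check for a top-N word count. This uses plain POSIX tools,
# not Spark; top_words is a hypothetical helper name.
# Usage: top_words FILE [N]   (N defaults to 10)
top_words() {
  # Put one word per line (splitting on single spaces, so runs of spaces
  # yield empty words), count duplicates, then sort by count, highest first.
  tr ' ' '\n' < "$1" | sort | uniq -c | sort -k1,1nr | head -n "${2:-10}"
}

# Small self-contained demonstration on a temporary file with made-up data.
sample=$(mktemp)
printf 'DEBUG a b\nDEBUG a\nDEBUG\n' > "$sample"
top_words "$sample" 3
rm -f "$sample"
```

On the host used in the example, top_words /var/log/anaconda.storage.log would give whole-file counts. These need not match the Spark output exactly, because the streaming job counts words per 10-second batch rather than over the entire file.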



Packt

01 Feb 2016

8 min read

Cluster Basics and Installation On CentOS 7


In this article by Gabriel A. Canepa, author of the book CentOS High Performance, we will review the basic principles of clustering and show you, step by step, how to set up two CentOS 7 servers as nodes to later use them as members of a cluster.

As part of this process, we will install CentOS 7 from scratch on a brand new server as our first cluster member, along with the necessary packages, and finally, configure key-based authentication for SSH access from one node to the other.

Clustering fundamentals

In computing, a cluster consists of a group of computers (which are referred to as nodes or members) that work together so that the set is seen as a single system from the outside.

One typical cluster setup involves assigning a different task to each node, thus achieving a higher performance than if several tasks were performed by a single member on its own. Another classic use of clustering is helping to ensure high availability by providing failover capabilities to the set, where one node may automatically replace a failed member to minimize the downtime of one or several critical services. In either case, the concept of clustering implies not only taking advantage of the computing functionality of each member alone, but also maximizing it by complementing it with the others.

As we just mentioned, HA (High-availability) clusters aim to eliminate system downtime by failing over services from one node to another in case one of them experiences an issue that renders it inoperative. As opposed to switchover, which requires human intervention, a failover procedure is performed automatically by the cluster without any downtime. In other words, this operation is transparent to end users and clients from outside the cluster.

On the other hand, HP (High-performance) clusters use their nodes to perform operations in parallel in order to enhance the performance of one or more applications.
High-performance clusters are typically seen in scenarios involving applications that use large collections of data.

Why CentOS?

Just as the saying goes, "Every journey begins with a small step," we will begin our own journey toward clustering by setting up the separate nodes that will make up our system. Our choice of operating system is Linux, with CentOS version 7 as the distribution, that being the latest available release of CentOS as of today. The binary compatibility with Red Hat Enterprise Linux © (which is one of the most widely used distributions in enterprise and scientific environments), along with its well-proven stability, are the reasons behind this decision.

CentOS 7, along with previous versions of the distribution, is available for download, free of charge, from the project's website at http://www.centos.org/. In addition, specific details about the release can always be consulted in the CentOS wiki, http://wiki.centos.org/Manuals/ReleaseNotes/CentOS7.

Among the distinguishing features of CentOS 7, I would like to name the following:

It includes systemd as the central system management and configuration utility
It uses XFS as the default filesystem
It only supports the x86_64 architecture

Downloading CentOS

To download CentOS, go to http://www.centos.org/download/ and click on one of the three options outlined in the following figure:

Download options for CentOS 7

These options are detailed as follows:

DVD ISO (~4 GB) is an .iso file that can be burned into regular DVD optical media and includes the common tools. Download this file if you have immediate access to a reliable Internet connection that you can use to download other packages and utilities.

Everything ISO (~7 GB) is an .iso file with the complete set of packages that are made available in the base repository of CentOS 7.
Download this file if you do not have access to a reliable Internet connection or if your plan contemplates the possibility of installing or populating a local or network mirror.

The alternative downloads link will take you to a public directory within an official nearby CentOS mirror, where the previous options are available as well as others, including different choices of desktop versions (GNOME or KDE) and the minimal .iso file (~570 MB), which contains the bare-bones packages of the distribution.

As the minimal install is sufficient for our purpose at hand, and we can install other needed packages using yum later, the minimal image is the recommended .iso file to download. Its name follows this pattern:

CentOS-7.X-YYMM-x86_64-Minimal.iso

Here, X indicates the current update number of CentOS 7 and YYMM represents the year and month, both in two-digit notation, when the source code this version is based on was released. For example:

CentOS-7.0-1406-x86_64-Minimal.iso

This tells us that the source code this release is based on dates from June 2014. Independently of our preferred download method, we will need this .iso file in order to begin with the installation. In addition, feel free to burn it to optical media or a USB drive.

Setting up CentOS 7 nodes

If you do not have dedicated hardware that you can use to set up the nodes of your cluster, you can still create one using virtual machines over some virtualization software, such as Oracle VirtualBox © or VMware ©, for example. The following setup is going to be performed on a VirtualBox VM with 1 GB of RAM and 30 GB of disk space. We will use the default partitioning schema over LVM as suggested by the installation process.

Installing CentOS 7

The splash screen shown in the following screenshot is the first step in the installation process.
Highlight Install CentOS 7 using the up and down arrows and press Enter:

Splash screen before starting the installation of CentOS 7

Select English (or your preferred installation language) and click on Continue, as shown in the following screenshot:

Selecting the language for the installation of CentOS 7

In the following screenshot, you can choose a keyboard layout, set the current date and time, choose a partitioning method, connect the main network interface, and assign a unique hostname for the node. We will name the current node node01 and leave the rest of the settings as default (we will configure the extra network card later). Then, click on Begin installation:

Configure keyboard layout, date and time, network and hostname, and partitioning schema

While the installation continues in the background, we will be prompted to set the password for the root account and create an administrative user for the node. Once these steps have been confirmed, the corresponding warnings no longer appear, as shown in the following screenshot:

Setting the password for root and creating an administrative user account

When the process is completed, click on Finish configuration and the installation will finish configuring the system and devices. When the system is ready to boot on its own, you will be prompted to do so. Remove the installation media and click on Reboot. Now, we can proceed with setting up our network interfaces.

Setting up the network infrastructure

Our rather basic network infrastructure consists of two CentOS 7 boxes, with the host names node01 [192.168.0.2] and node02 [192.168.0.3], respectively, and a gateway router called simply gateway [192.168.0.1]. In CentOS, network cards are configured using scripts in the /etc/sysconfig/network-scripts directory.
This is the minimum content that is needed in /etc/sysconfig/network-scripts/ifcfg-enp0s3 for our purposes:

HWADDR="08:00:27:C8:C2:BE"
TYPE="Ethernet"
BOOTPROTO="static"
NAME="enp0s3"
ONBOOT="yes"
IPADDR="192.168.0.2"
NETMASK="255.255.255.0"
GATEWAY="192.168.0.1"
PEERDNS="yes"
DNS1="8.8.8.8"
DNS2="8.8.4.4"

Note that the HWADDR value (and the UUID value, if your file contains one) will be different in your case. In addition, be aware that cluster machines need to be assigned a static IP address; never leave that up to DHCP! In the preceding configuration file, we used Google's DNS, but if you wish, feel free to use another DNS.

When you're done making changes, save the file and restart the network service in order to apply them:

systemctl restart network.service # Restart the network service

You can verify that the previous changes have taken effect (shown in the Restarting the network service and verifying settings figure) with the following two commands:

systemctl status network.service # Display the status of the network service
ip addr | grep 'inet ' # Display the IP addresses

Restarting the network service and verifying settings

You can disregard all error messages related to the loopback interface, as shown in the preceding screenshot. However, you will need to examine carefully any error messages related to the enp0s3 interface, if any, and get them resolved in order to proceed further.

The second interface will be called enp0sX, where X is typically 8. You can verify this with the following command (shown in the following figure):

ip link show

Displaying NIC information

As for the configuration file of enp0s8, you can safely create it by copying the contents of ifcfg-enp0s3. Do not forget, however, to change the hardware (MAC) address to the one reported for the NIC, and leave the IP address field blank for now.
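The copy-and-edit step just described could be scripted roughly as follows. This is only a sketch: clone_ifcfg is a hypothetical helper name, the MAC address in the usage comment is a placeholder, and also updating the NAME entry to match the new interface is an assumption that the text above does not state explicitly.

```shell
#!/bin/sh
# Hypothetical helper: derive a second interface file from an existing one.
# It rewrites HWADDR, rewrites NAME (an assumption, see the lead-in), and
# blanks IPADDR, leaving every other line untouched.
# Usage: clone_ifcfg SRC DEST NEW_NAME NEW_MAC
clone_ifcfg() {
  src=$1; dest=$2; name=$3; mac=$4
  sed -e "s/^HWADDR=.*/HWADDR=\"$mac\"/" \
      -e "s/^NAME=.*/NAME=\"$name\"/" \
      -e 's/^IPADDR=.*/IPADDR=""/' \
      "$src" > "$dest"
}

# Example invocation (run as root; substitute the MAC address that
# "ip link show enp0s8" reports on your machine for the placeholder):
# clone_ifcfg /etc/sysconfig/network-scripts/ifcfg-enp0s3 \
#             /etc/sysconfig/network-scripts/ifcfg-enp0s8 \
#             enp0s8 "08:00:27:AA:BB:CC"
```

Writing to a separate destination file, rather than editing in place, leaves the working enp0s3 configuration untouched if something goes wrong.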
To inspect enp0s8 and copy the configuration file, run:

ip link show enp0s8
cp /etc/sysconfig/network-scripts/ifcfg-enp0s3 /etc/sysconfig/network-scripts/ifcfg-enp0s8

Then, restart the network service.

Note that you will also need to set up at least a basic DNS resolution method. Considering that we will set up a cluster with two nodes only, we will use /etc/hosts for this purpose. Edit /etc/hosts with the following content:

192.168.0.2 node01
192.168.0.3 node02
192.168.0.1 gateway

Summary

In this article, we reviewed how to install the operating system and listed the necessary software components to implement the basic cluster functionality.

Resources for Article:

Further resources on this subject:
CentOS 7's new firewalld service [article]
Mastering CentOS 7 Linux Server [article]
Resource Manager on CentOS 6 [article]
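Returning to the /etc/hosts step: because the same three entries have to be added on every node, it can help to add them in a way that is safe to repeat. The following is a minimal sketch, assuming a hypothetical helper named add_host; the demonstration works on a temporary file so that it can be tried without root privileges.

```shell
#!/bin/sh
# Hypothetical helper: append "IP NAME" to a hosts-style file only if NAME
# is not already listed as a host name on a non-comment line.
# Usage: add_host FILE IP NAME
add_host() {
  file=$1; ip=$2; name=$3
  if ! awk -v n="$name" '
         $1 !~ /^#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
         END { exit !found }' "$file" 2>/dev/null; then
    printf '%s %s\n' "$ip" "$name" >> "$file"
  fi
}

# Demonstration on a temporary file; point it at /etc/hosts (as root) for real use.
hosts=$(mktemp)
add_host "$hosts" 192.168.0.2 node01
add_host "$hosts" 192.168.0.3 node02
add_host "$hosts" 192.168.0.1 gateway
add_host "$hosts" 192.168.0.2 node01   # repeated call adds nothing
cat "$hosts"
rm -f "$hosts"
```

Once the real /etc/hosts has been updated, a command such as getent hosts node02 on node01 should return the address configured for that name.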

