TensorFlow 2.0 released with tighter Keras integration, eager execution enabled by default, and more!

Bhagyashree R
03 Oct 2019
5 min read
After releasing the beta version of TensorFlow 2.0 in June, Google announced its final release on Monday. This release comes with tighter integration with Keras, eager execution enabled by default, a promised three times faster training performance, a cleaned-up API, and more.

Key updates in TensorFlow 2.0

Tighter Keras integration for better developer productivity

One of the most important updates in TensorFlow 2.0 is its tighter integration with Keras, a popular high-level API used for easy and fast prototyping, building, and training of deep learning models. This will enable developers to easily leverage its various model-building APIs, including Sequential, Functional, and Subclassing. Explaining the motivation behind this change, the TensorFlow team wrote, “By establishing Keras as the high-level API for TensorFlow, we are making it easier for developers new to machine learning to get started with TensorFlow. A single high-level API reduces confusion and enables us to focus on providing advanced capabilities for researchers.”

Eager execution enabled by default

In TensorFlow 1.x, developers were required to define an abstract data structure named Graph, and to run this graph they needed an encapsulation called Session. TensorFlow 2.0 has eager execution enabled by default, so code runs “eagerly”, similar to normal Python code. Eager execution enables fast iteration and intuitive debugging without building a graph. It also makes creating and experimenting with models using TensorFlow much easier, and can be especially useful when using the tf.keras model subclassing API.

Also Read: Keras 2.3.0, the first release of multi-backend Keras with TensorFlow 2.0 support is now out

Distribution Strategy API

The Distribution Strategy API in TensorFlow 2.0 allows machine learning researchers to distribute training across a wide variety of compute configurations. This will allow them to “attain great out-of-the-box performance” with minimal code changes. This release also allows distributed training with Keras’ model.fit and with custom training loops.

Performance improvements on GPUs

TensorFlow 2.0 includes multi-GPU support and experimental support for multi-worker and Cloud TPU configurations. This release also brings a number of performance improvements on GPUs: it promises three times faster training performance when using mixed precision on NVIDIA’s Volta and Turing GPUs, and it includes tight integration with NVIDIA TensorRT, a platform for high-performance deep learning inference.

The standardized SavedModel file format

The SavedModel API allows you to save your trained ML model in a language-neutral format. With TensorFlow 2.0, all TensorFlow ecosystem projects, including TensorFlow Lite, TensorFlow.js, TensorFlow Serving, and TensorFlow Hub, support SavedModels. Standardizing on the SavedModel file format will enable developers to run their models on a variety of runtimes, including the cloud, web, browser, Node.js, mobile, and embedded systems. “This allows you to run your models with TensorFlow, deploy them with TensorFlow Serving, use them on mobile and embedded systems with TensorFlow Lite, and train and run in the browser or Node.js with TensorFlow.js,” the team writes.

API simplification

TensorFlow 2.0 includes a number of API updates. Many API symbols have been removed or renamed for better consistency and clarity, and the tf.app, tf.flags, and tf.logging APIs have been removed in favor of abseil-py.
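To make the headline changes concrete, here is a minimal sketch of what eager execution, tf.function, and the built-in Keras API look like in practice (a generic illustration, not taken from the release notes):

```python
import tensorflow as tf

# Eager execution is on by default in 2.0: ops run immediately,
# with no Graph/Session boilerplate.
x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
print(tf.reduce_sum(x))  # tf.Tensor(10.0, shape=(), dtype=float32)

# tf.function traces Python into an optimized graph for cases
# where graph-level performance is still wanted.
@tf.function
def squared_error(y_true, y_pred):
    return tf.reduce_mean(tf.square(y_true - y_pred))

# Keras as the single high-level model-building API, built in.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss=squared_error)
```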
Because of the huge number of API changes, developers in a discussion on Hacker News said that transitioning from TensorFlow 1.x to TensorFlow 2.0 is quite complicated, and some mentioned switching to PyTorch instead. A user commented, “As someone who uses TensorFlow a lot, I predict an enormous clusterfuck of a transition. Tensorflow has turned into a multiheaded monster, supporting many things and approaches but none of them very well...In my opinion, there are some architectural problems with TF, which have not been addressed in this update...If you need to transition from TF1 to TF2, consider doing the TF1 to PyTorch transition instead.”

Others, however, were happy with the recommended Keras API and eager execution. “I don't know if I'm the only one, but I actually love the changes they've made since v1. Eager execution and tf.function are fantastic, and the built-in Keras is even better than the standalone version. A big improvement compared to TF from last year,” a user commented on Reddit. Another user added, “The most important change in terms of usability, IMO, is the use of tf.keras as the recommended interface to TensorFlow. There hasn't been a case yet where I've needed to dip outside of Keras into raw TensorFlow, but the option is there and is easy to do. That said, TF 2.0 changes a lot. Many repos might break, so expect to see lots of tensorflow==1.14 in requirement.txt files from now on.”

These were some of the updates in TensorFlow 2.0. Check out the official announcement and release notes to know more in detail.

Transformers 2.0: NLP library with deep interoperability between TensorFlow 2.0 and PyTorch, and 32+ pretrained models in 100+ languages
TensorFlow 2.0 to be released soon with eager execution, removal of redundant APIs, tf function and more
Introducing TensorFlow Graphics packed with TensorBoard 3D, object transformations, and much more
Train a convolutional neural network in Keras and improve it with data augmentation [Tutorial]

How Chaos Engineering can help predict and prevent cyber-attacks preemptively

Sugandha Lahoti
02 Oct 2019
7 min read
It's no surprise that cybersecurity has become a major priority for global businesses of all sizes, which often employ a dedicated IT team to focus on thwarting attacks. Huge budgets are spent on acquiring and integrating security solutions, but one of the most effective tools might be hiding in plain sight: encouraging your own engineers and developers to purposely break systems, through a process known as chaos engineering, can drive a huge return on investment by identifying unknown weaknesses in your digital architecture. As modern networks become more complex, with a vastly larger threat surface, stopping break-ins before they happen takes on even greater importance.

Evolution of Cyberattacks

Hackers typically have a background in software engineering and actually run their criminal enterprises in cycles that are similar to what happens in the development world. That means these criminals focus on iterations and agile changes to keep themselves one step ahead of security tools. Most hackers are focused on financial gain and look to steal data from enterprise or government systems for the purpose of selling it on the Dark Web. However, there is a demographic of cybercriminals focused simply on causing the most damage to the organizations they have targeted. In both scenarios, the hacker first needs to find a way inside the perimeter of the network, either physically or digitally. Many cybercrimes now start with what's known as social engineering, where a hacker convinces an internal employee to divulge information that allows unauthorized access to take place.

Principles of Chaos Engineering

So how can you predict cyberattacks and stop hackers before they infiltrate your systems? That's where chaos engineering can help. The term first gained popularity a decade ago when Netflix created a tool called Chaos Monkey that would randomly take a node of their network offline to force teams to react accordingly. This proved effective because Netflix learned how to better keep their streaming service online and reduce dependencies between cloud servers.

The Chaos Monkey tool can be seen as an example of a much broader practice: chaos engineering. The central insight of this form of testing is that, regardless of how all-encompassing your test suite is, as soon as your code is running on enough machines, errors are going to occur. In large, complex systems, it is essentially impossible to predict where these points of failure are going to occur. Rather than viewing this as a problem, chaos engineering sees it as an opportunity: since failure is unavoidable, why not deliberately introduce it, and then attempt to solve the randomly generated errors, in order to ensure your systems and processes can deal with failure? For example, Netflix runs on AWS, but in response to frequent regional failures has changed its systems to become region agnostic. Netflix tests that this works using chaos engineering: a “chaos monkey” regularly takes down important services in separate regions by randomly selecting servers to disable, challenging engineers to work around these failures.

Though Netflix popularized the concept, chaos engineering and testing are now used by many other companies, including Microsoft. The rise of cloud computing has meant that a high proportion of the systems running today have a similar level of complexity to Netflix’s server architecture.
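The chaos-monkey idea itself is small enough to sketch. The snippet below is a generic illustration, not Netflix's tooling; the instance names and the termination command are hypothetical stand-ins for your orchestrator's real API:

```python
import random
import subprocess

# Hypothetical inventory of redundant service instances in a staging
# environment; in practice this would come from your orchestrator.
INSTANCES = ["web-1", "web-2", "worker-1", "worker-2", "cache-1"]

def chaos_monkey(kill_probability=0.2):
    """Randomly pick one instance and terminate it, forcing the
    system (and the team) to handle the failure gracefully."""
    if random.random() < kill_probability:
        victim = random.choice(INSTANCES)
        # Illustrative only: swap in your platform's real call,
        # e.g. `kubectl delete pod <victim>`.
        subprocess.run(["echo", f"terminating {victim}"], check=True)
        return victim
    return None

if __name__ == "__main__":
    print(chaos_monkey())
```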
The cloud computing model has added a high level of complexity to online systems, and the practice of chaos engineering will help to manage that over time, especially when it comes to cybersecurity. It's impossible to predict how and when a hacker will execute an attack, but chaos tests can help you be more proactive in patching your internal vulnerabilities. At first, your developers may be reluctant to jump into chaos activities, since they will think their normal process works. There's a good chance, however, that over time they will grow to love chaos testing. You might have to do some convincing in the beginning, though, because this new approach likely goes against everything they've been taught about securing a network.

Designing a Chaos Test

Chaos tests need to be run in a controlled manner in order to be effective, and in many ways the process lines up with the scientific method: to run a successful chaos test, you first need to identify a well-formed and refutable hypothesis. A test can then be designed that will prove (or, more likely, falsify) that hypothesis. For example, you might want to know how your network will react if one DNS server drops offline during a distributed denial of service (DDoS) attack by a hacker. This tactic of masking one's IP address and overloading a website's server can be executed through a VPN. Your hypothesis, in this case, is that you will be able to re-route traffic around the affected server.

From there, you can begin designing and scheduling the experiment. If you are new to chaos engineering, you will want to restrict the test to a testing or staging environment to minimize the impact on live systems. Then make sure you have one person responsible for documenting the outcomes as the experiment begins. If you identify an issue during the first run of the chaos test, you can pause that effort and focus on coming up with a solution plan. If not, expand the radius of the experiment until it produces worthwhile results. Running through this type of fire drill can be positive for various groups within a software organization, as it will provide practice for real incidents.

[Image source: Medium]
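That hypothesis-then-expand flow can be captured in a small harness. The sketch below is illustrative, assuming you supply your own `inject_failure` and `verify` callables (for example, taking a DNS server offline and checking that traffic re-routes):

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("chaos-test")

def run_chaos_test(hypothesis, inject_failure, verify, max_radius=4):
    """Inject a failure at an increasing blast radius until the
    hypothesis is falsified or the maximum radius is reached,
    logging each outcome for the person documenting the run."""
    for radius in range(1, max_radius + 1):
        log.info("Hypothesis: %s (radius=%d)", hypothesis, radius)
        inject_failure(radius)   # e.g. take one more DNS server offline
        if not verify():         # e.g. can traffic still be re-routed?
            log.info("Falsified at radius %d; pausing to plan a fix.", radius)
            return False
        log.info("Held at radius %d; expanding the experiment.", radius)
    return True
```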
How Chaos Engineering Fits Into Quality Assurance

A lot of companies hear about chaos engineering and jump into it at full speed, but over time they lose their enthusiasm for the practice or get distracted by other projects. So how should chaos tests be run, and how do you ensure that they remain consistent and valuable? First, define the role of chaos engineering within your larger quality assurance efforts; in fact, your QA team may want to lead all chaos tests that are run. One important clarification is to distinguish between penetration tests and chaos tests. Both have the same goal of proactively finding system issues, but a penetration test is a specific event with a finite focus, while chaos testing must be more open-ended.

When teaching your teams about chaos engineering, it's vital to frame it as a practice and not a one-time activity. Return on investment will be hard to find if you do not systematically follow up after chaos tests are run. The aim should be continuous improvement, to ensure your systems are prepared for any type of cyberattack.

Final Thoughts

These days, security vendors offer a wide range of tools designed to help companies protect their people, data, and infrastructure. Solutions like firewalls and virus scanners are certainly deployed to great success as part of most cybersecurity strategies, but they should never be treated as foolproof. Hackers are notorious for finding ways past these types of tools and exploiting companies that are not prepared at a deeper level. The most mature organizations go one step further and find ways to proactively locate their own weaknesses before an outsider can expose them. Chaos engineering is a great way to accomplish this, as it encourages developers to look for gaps and bugs that they might not stumble upon normally. No matter how much planning goes into a system's architecture, unforeseen issues still come up, and hackers are a very dangerous variable in that equation.

Author Bio

Sam Bocetta is a freelance journalist specializing in U.S. diplomacy and national security, with an emphasis on technology trends in cyberwarfare, cyberdefense, and cryptography.

An unpatched vulnerability in NSA’s Ghidra allows a remote attacker to compromise exposed systems

Savia Lobo
01 Oct 2019
3 min read
On September 28, the National Security Agency revealed a vulnerability in Ghidra, its free, open-source software reverse-engineering tool. The NSA released the Ghidra toolkit at the RSA security conference in San Francisco on March 6 this year. The vulnerability, tracked as CVE-2019-16941, allows a remote attacker to compromise exposed systems, according to a NIST National Vulnerability Database description. The vulnerability is rated medium severity and currently has no fix available.

The NSA tweeted on its official account, “A flaw currently exists within Ghidra versions through 9.0.4. The conditions needed to exploit this flaw are rare and a patch is currently being worked. This flaw is not a serious issue as long as you don’t accept XML files from an untrusted source.”

According to the bug description, the flaw manifests itself “when [Ghidra] experimental mode is enabled.” This “allows arbitrary code execution if the Read XML Files feature of Bit Patterns Explorer is used with a modified XML document,” the description further reads. “Researchers add since the feature is experimental, to begin with, it’s already an area to expect bugs and vulnerabilities. They also contend, that despite descriptions of how the bug can be exploited, it can’t be triggered remotely,” Threatpost reports.

Ghidra, a disassembler written in Java, breaks down executable files into assembly code that can then be analyzed. By deconstructing malicious code and malware, cybersecurity professionals can gain a better understanding of potential vulnerabilities in their networks and systems. The NSA used it internally for years before recently deciding to open-source it.

This is not the first bug found in Ghidra. In March, a proof-of-concept was released showing how an XML external entity (XXE) vulnerability (rated serious) could be exploited to attack Ghidra project users (version 9.0 and below). In July, researchers found an additional path-retrieval bug (CVE-2019-13623) that was also rated high severity. That bug, similar to CVE-2019-1694, also impacts ghidra.app.plugin.core.archive and allows an attacker to achieve arbitrary code execution on vulnerable systems, Threatpost reports. Researchers say they are unaware of this most recent bug (CVE-2019-16941) being exploited in the wild.

To know more about this news in detail, read the bug description.

A Cargo vulnerability in Rust 1.25 and prior makes it ignore the package key and download a wrong dependency
10 times ethical hackers spotted a software vulnerability and averted a crisis
A zero-day pre-auth vulnerability is currently being exploited in vBulletin, reports an anonymous researcher

Transformers 2.0: NLP library with deep interoperability between TensorFlow 2.0 and PyTorch, and 32+ pretrained models in 100+ languages

Fatema Patrawala
30 Sep 2019
3 min read
Last week, Hugging Face, a startup specializing in natural language processing, released a landmark update to its popular Transformers library, offering unprecedented compatibility between two major deep learning frameworks, PyTorch and TensorFlow 2.0. Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural Language Generation (NLG), with 32+ pretrained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch.

Transformers 2.0 embraces the ‘best of both worlds’, combining PyTorch’s ease of use with TensorFlow’s production-grade ecosystem. The new library makes it easier for scientists and practitioners to select different frameworks for the training, evaluation and production phases of developing the same language model. “This is a lot deeper than what people usually think when they talk about compatibility,” said Thomas Wolf, who leads Hugging Face’s data science team. “It’s not only about being able to use the library separately in PyTorch and TensorFlow. We’re talking about being able to seamlessly move from one framework to the other dynamically during the life of the model.” https://twitter.com/Thom_Wolf/status/1177193003678601216

“It’s the number one feature that companies have asked for since the launch of the library last year,” said Clement Delangue, CEO of Hugging Face.

Notable features in Transformers 2.0

8 architectures with over 30 pretrained models, in more than 100 languages
Load a model and pre-process a dataset in less than 10 lines of code
Train a state-of-the-art language model in a single line with the tf.keras fit function
Share pretrained models, reducing compute costs and carbon footprint
Deep interoperability between TensorFlow 2.0 and PyTorch models (see the sketch at the end of this piece)
Move a single model between TF2.0/PyTorch frameworks at will
Seamlessly pick the right framework for training, evaluation, production
As powerful and concise as Keras

About Hugging Face Transformers

With half a million installs since January 2019, Transformers is the most popular open-source NLP library. More than 1,000 companies, including Bing, Apple and Stitchfix, are using it in production for text classification, question answering, intent detection, text generation and conversational AI. Hugging Face, the creator of Transformers, has raised US$5M so far from investors at companies like Betaworks, Salesforce, Amazon and Apple. On Hacker News, users are praising the company and noting how Transformers has become the most important library in NLP.

Other interesting news in data

Baidu open sources ERNIE 2.0, a continual pre-training NLP model that outperforms BERT and XLNet on 16 NLP tasks
Dr Joshua Eckroth on performing Sentiment Analysis on social media platforms using CoreNLP
Facebook open-sources PyText, a PyTorch based NLP modeling framework
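As referenced in the feature list above, here is a minimal sketch of the cross-framework interoperability, assuming the transformers package is installed and the pretrained weights can be downloaded:

```python
from transformers import BertTokenizer, BertModel, TFBertModel

# Load a pretrained tokenizer and model in a few lines.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# The same architecture and weights, as a PyTorch model...
pt_model = BertModel.from_pretrained("bert-base-uncased")

# ...or as a TensorFlow 2.0 (tf.keras) model.
tf_model = TFBertModel.from_pretrained("bert-base-uncased")

# Moving between frameworks: save from one, load into the other.
pt_model.save_pretrained("./my-bert")
tf_model = TFBertModel.from_pretrained("./my-bert", from_pt=True)
```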

Researchers release a study into Bug Bounty Programs and Responsible Disclosure for ethical hacking in IoT

Savia Lobo
30 Sep 2019
6 min read
On September 26, researchers from the Delft University of Technology (TU Delft) in the Netherlands released a research paper highlighting the importance of crowdsourced ethical hacking approaches for enhancing IoT vulnerability management. They focused on Bug Bounty Programs (BBP) and Responsible Disclosure (RD), which stimulate hackers to report vulnerabilities in exchange for monetary rewards. Supported by a literature survey and expert interviews, the researchers investigated how BBP and RD can facilitate the practice of identifying, classifying, prioritizing, remediating, and mitigating IoT vulnerabilities in an effective and cost-efficient manner. They also made recommendations on how BBP and RD can be integrated with existing security practices to further boost IoT security.

The researchers first identified the causes of the lack of security practices in IoT from socio-technical and commercial perspectives. By identifying the hidden pitfalls and blind spots, stakeholders can avoid repeating the same mistakes. They also derived a set of recommendations as best practices that can benefit IoT vendors, developers, and regulators. The researchers added in their paper, “We note that this study does not intend to cover all the potential vulnerabilities in IoT nor to provide an absolute solution to cure the IoT security puzzle. Instead, our focus is to provide practical and tangible recommendations and potentially new options for stakeholders to tackle IoT-oriented vulnerabilities in consumer goods, which will help enhance the overall IoT security practices.”

Challenges in IoT Security

The researchers highlighted six major reasons from the system and device perspective:

IoT systems do not have well-defined perimeters, as they can continuously change.
IoT systems are highly heterogeneous with respect to communication medium and protocols.
IoT systems often include devices that are not designed to be connected to the Internet or to be secure.
IoT devices can often autonomously control other IoT devices without human supervision.
IoT devices can be physically unprotected and/or controlled by different parties.
The large number of devices increases the security complexity.

Other practical challenges for IoT include the fact that enterprises targeting end users do not treat security as a priority and are generally driven by time-to-market rather than by security requirements. Several IoT products are the result of the increasing number of startup companies that have entered this market recently. The vast majority of these startups have fewer than 10 employees, and their obvious priority is to develop functional rather than secure products. In this scenario, investing in security can be perceived as a costly and time-consuming obstacle. In addition, consumer demand for security is low, and consumers tend to prefer cheaper rather than secure products. As a result, companies lack explicit incentives to invest in security.

Crowdsourced security methods: An alternative in Ethical Hacking

"Government agencies and business organizations today are in constant need of ethical hackers to combat the growing threat to IT security. A lot of government agencies, professionals and corporations now understand that if you want to protect a system, you cannot do it by just locking your doors," the researchers observe. The benefits of ethical hacking include:

Preventing data from being stolen and misused by malicious attackers.
Discovering vulnerabilities from an attacker’s point of view so that weak points can be fixed.
Implementing a secure network that prevents security breaches.
Protecting networks with real-world assessments.
Gaining trust from customers and investors by ensuring security.
Defending national security by protecting data from terrorism.

Bug bounty benefits and Responsible Disclosure

Crowdsourced security methods are the alternative to pen testing in ethical hacking. These methods involve the participation of large numbers of ethical hackers, who report vulnerabilities to companies in exchange for rewards that can consist of money or just recognition. Similar practices have been utilized at large scale in the software industry, for example the Vulnerability Rewards Programs (VRP) applied to Chrome and Firefox, which yielded several lessons on secure software development. The Chrome VRP has cost approximately $580,000 over 3 years and has resulted in 501 bounties paid for the identification of security vulnerabilities. Crowdsourced methods involve thousands of hackers working on a security target; in particular, instead of a point-in-time test, crowdsourcing enables continuous testing. In addition, compared with pen testing, crowdsourced hackers are only paid when a valid vulnerability is reported. Let us understand Bug Bounty Programs (BBP) and Responsible Disclosure (RD) in detail.

How do Bug Bounty Programs (BBP) work?

Also known as bug bounties, BBPs represent reward-driven crowdsourced security testing, where ethical hackers who successfully discover and report vulnerabilities to companies are rewarded. BBPs can further be classified into public and private programs. Public programs allow entire communities of ethical hackers to participate; they typically consist of large-scale bug bounty programs and can be both time-limited and open-ended. Private programs, on the other hand, are generally limited to a selected sub-group of hackers, scoped to specific targets, and limited in time. These programs usually take place through commercial bug bounty platforms, where hackers are selected based on reputation, skills, and experience. The main platform vendors offering BBPs are HackerOne, BugCrowd, Cobalt Labs, and Synack. These platforms have made it easier for organizations to establish and maintain BBPs.

What is Responsible Disclosure (RD)?

Also known as coordinated vulnerability disclosure, RD consists of rules and guidelines from companies that allow individuals to report vulnerabilities to organizations. RD policies define the models for a controlled and responsible disclosure of information about vulnerabilities discovered by users. Most software vulnerabilities are discovered by both benign users and ethical hackers. In many situations, individuals might feel responsible for reporting a vulnerability to the organization, but the company may lack a channel for them to report it. Hence, three different outcomes might occur: failed disclosure, full disclosure, and organization capture. Among these three, the target of RD is organization capture, where companies create a safe channel and provide rules for submitting vulnerabilities to their security team, and further allocate resources to follow up on the process.

One limitation of this qualitative study is that only experts who were conveniently available participated in the interviews.
The experts who participated in the research were predominantly from the Netherlands (13 experts) and, more generally, all from Europe, so the results should be generalized to other countries and to the security industry as a whole with caution. For IoT vulnerability management, the researchers recommend launching a BBP only after companies have performed initial security testing and fixed the problems found. The objective of BBP and RD policies should always be to provide additional support in finding undetected vulnerabilities, never to be the only security practice. To know more about this study in detail, read the original research paper.

How Blockchain can level up IoT Security
How has ethical hacking benefited the software industry
MITRE’s 2019 CWE Top 25 most dangerous software errors list released

Chaos engineering company Gremlin launches Scenarios, making it easier to tackle downtime issues

Richard Gall
26 Sep 2019
2 min read
At the second ChaosConf in San Francisco, Gremlin CEO Kolton Andrus revealed the company's latest step in its war against downtime: 'Scenarios.' Scenarios makes it easy for engineering teams to simulate common issues that lead to downtime. It's a natural and necessary progression for Gremlin, which is seeing even the most forward-thinking teams struggle to figure out how to implement chaos engineering in a way that's meaningful to their specific use case. "Since we released Gremlin Free back in February thousands of customers have signed up to get started with chaos engineering," said Andrus. "But many organisations are still struggling to decide which experiments to run in order to avoid downtime and outages." Scenarios, then, is a useful way into chaos engineering for teams that are reticent about taking their first steps. As Andrus notes, it makes it possible to inject failure "with a couple of clicks."

What failure scenarios does Scenarios let engineering teams simulate?

Scenarios lets Gremlin users simulate common issues that can cause outages. These include:

Traffic spikes (think Black Friday site failures)
Network failures
Region evacuation

This provides a great starting point for anyone who wants to stress test their software (a generic sketch of the first scenario appears at the end of this piece). Indeed, it's inevitable that these issues will arise at some point, so taking advance steps to understand what the consequences could be will minimise their impact - and their likelihood.

Why chaos engineering?

Over the last couple of years plenty of people have been attempting to answer the question: why chaos engineering? In truth the reasons are clear: software - indeed, the internet as we know it - is becoming increasingly complex, a mesh of interdependent services and platforms. At the same time, the software being developed today is more critical than ever. For eCommerce sites downtime means money, but for those in the IoT and embedded systems world (like self-driving cars, for example), it can be a matter of life and death. This makes Gremlin's Scenarios an incredibly exciting and important prospect - it should end the speculation and debate about whether we should be doing chaos engineering, and instead help the world to simply start doing it. At ChaosConf Andrus said that Gremlin's mission is to build a more reliable internet. We should all hope they can deliver.
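Gremlin drives these experiments from its own product, but the idea behind a traffic-spike scenario can be sketched generically. The snippet below is an illustrative stand-in, not Gremlin's API, and the staging URL is hypothetical:

```python
import concurrent.futures
import urllib.request

STAGING_URL = "http://staging.example.com/health"  # hypothetical target

def hit(url):
    """Issue one request and return the HTTP status (or the error)."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status
    except Exception as exc:
        return repr(exc)

def traffic_spike(url=STAGING_URL, clients=200):
    """Simulate a sudden burst of concurrent traffic, Black Friday
    style, and summarize how the service held up."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=clients) as pool:
        results = list(pool.map(hit, [url] * clients))
    ok = sum(1 for status in results if status == 200)
    print(f"{ok}/{clients} requests succeeded under the spike")
```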

ReactOS 0.4.12 releases with kernel improvements, Intel e1000 NIC driver support, and more

Bhagyashree R
25 Sep 2019
2 min read
Earlier this week, the ReactOS team announced the release of ReactOS 0.4.12. This release comes with a bunch of kernel improvements, Intel e1000 NIC driver support, font improvements, and more.

Key updates in ReactOS 0.4.12

Kernel updates

The filesystem infrastructure of ReactOS 0.4.12 has received quite a few improvements to enable the use of Microsoft filesystem drivers. The team has also worked on the common cache module, which has deep ties to the memory manager, improved device power management, fixed support for PXE booting, and overhauled the write-protection functionality.

Window snapping

ReactOS 0.4.12 comes with support for window snapping, so users will now be able to align windows to the sides of the screen, or maximize and minimize them, by dragging in specific directions. This release also adds the keyboard shortcuts that accompany this feature.

Intel e1000 NIC driver

ReactOS 0.4.12 ships a new driver that supports Intel e1000 Network Interface Cards (NICs) out of the box, so end users no longer need to manually find and install one.

Font improvements

In ReactOS 0.4.12, font rendering has been made more robust and correct. This release fixes a series of problems that badly affected text rendering for buttons in a range of applications, from iTunes to various .NET applications.

User-mode DLLs

In this release, the team has made a range of improvements to user-mode components. Among them is the common controls (comctl) library, which is used by most Windows applications to draw generic user interface elements; the team has fixed an issue in it. Other updates include drivers for MIDI instruments and an animated rotation bar in the startup/shutdown dialog.

Check out the official announcement to know more about ReactOS 0.4.12 in detail.

ReactOS 0.4.11 is now out with kernel improvements, manifests support, and more!
Btrfs now boots ReactOS, a free and open-source alternative for Windows NT
ReactOS version 0.4.9 released with Self-hosting and FastFAT crash fixes
Understanding network port numbers, TCP, UDP, and ICMP on an operating system
Google’s secret Operating System ‘Fuchsia’ will run Android Applications: 9to5Google Report

Red Hat announces CentOS Stream, a “developer-forward distribution” jointly with the CentOS Project

Savia Lobo
25 Sep 2019
3 min read
On September 24, just after the much-awaited CentOS 8 was released, the Red Hat community, in agreement with the CentOS Project, announced a new model in the CentOS Linux community called CentOS Stream.

CentOS Stream is an upstream development platform for ecosystem developers. It is a single, continuous stream of content, with updates several times daily, encompassing the latest and greatest from the RHEL codebase. It is like having a view into what the next version of RHEL will look like, available to a much broader community than a beta or "preview" release. Chris Wright, Red Hat's CTO, says CentOS Stream is "a developer-forward distribution that aims to help community members, Red Hat partners, and others take full advantage of open source innovation within a more stable and predictable Linux ecosystem. It is a parallel distribution to existing CentOS."

With previous CentOS releases, developers had no advance view of what was coming in RHEL. Because the CentOS Stream project sits between the Fedora Project and RHEL in the RHEL development process, it will provide a "rolling preview" of future RHEL kernels and features, enabling developers to stay one or two steps ahead of what's coming in RHEL. "CentOS Stream is parallel to existing CentOS builds; this means that nothing changes for current users of CentOS Linux and services, even those that begin to explore the newly-released CentOS 8. We encourage interested users that want to be more tightly involved in driving the future of enterprise Linux, however, to transition to CentOS Stream as the new 'pace-setting' distribution," the Red Hat blog states. CentOS Stream is part of Red Hat's broader focus on engaging with communities and developers in a way that better aligns with the modern IT world.

A user on Hacker News commented, "I like it, at least in theory. I develop some industrial software that runs on RHEL so being able to run somewhat similar distribution on my machine would be convenient. I tried running CentOS but it was too frustrating and limiting to deal with all the outdated packages on a dev machine. I suppose it will also be good for devs who just like the RHEL environment but don't need a super stable, outdated packages." Another user commented, "I wonder what future Fedora will have if this new CentOS Stream will be stable enough for developer daily driver. 6 month release cycle of Fedora always felt awkwardly in-between, not having the stability of lts nor the continuity of rolling. I guess lot depends on details on how the packages flow to CentOS Stream, do they come from released Fedora versions or rawhide etc."

To know more about CentOS Stream in detail, read Red Hat's official blog post.

After RHEL 8 release, users awaiting the release of CentOS 8
After Red Hat, Homebrew removes MongoDB from core formulas due to its Server Side license adoption
Red Hat announces the general availability of Red Hat OpenShift Service Mesh
Introducing ESPRESSO, an open-source, PyTorch based, end-to-end neural automatic speech recognition (ASR) toolkit for distributed training across GPUs
.NET Core 3.0 is now available with C# 8, F# 4.7, ASP.NET Core 3.0 and general availability of EF Core 3.0 and EF 6.3

Can a modified MIT ‘Hippocratic License’ to restrict misuse of open source software prompt a wave of ethical innovation in tech?

Savia Lobo
24 Sep 2019
5 min read
Open source licenses allow software to be freely distributed, modified, and used. These licenses give developers the additional advantage of letting others use their software under their own rules and conditions. Recently, software developer and open source advocate Coraline Ada Ehmke caused a stir in the software engineering community with 'The Hippocratic License.' Ehmke was also the original author of Contributor Covenant, a "code of conduct" for open source projects that encourages participants to use inclusive language and to refrain from personal attacks and harassment. In a tweet posted in September last year, following the code of conduct, she mentioned, "40,000 open source projects, including Linux, Rails, Golang, and everything OSS produced by Google, Microsoft, and Apple have adopted my code of conduct."

The term 'Hippocratic' is derived from the Hippocratic Oath, the most widely known of the Greek medical texts. The Hippocratic Oath, in literal terms, requires a new physician to swear upon a number of healing gods that he will uphold a number of professional ethical standards.

Ehmke explained the license in more detail in a post published on Sunday. In it, she highlights how the idea that software should be written only with the goals of clarity, conciseness, readability, performance, and elegance is limiting, and potentially dangerous. "All of these technologies are inherently political," she writes. "There is no neutral political position in technology. You can't build systems that can be weaponized against marginalized people and take no responsibility for them." The concept of the Hippocratic License is relatively simple: in a tweet, Ehmke said that it "specifically prohibits the use of open-source software to harm others."

Open source software and the associated harm

Among the many freedoms that open source software licenses grant, such as free redistribution of the software and the source code, the OSI also specifies that there must be no discrimination against who uses the software or where it is put to use. A few days ago, a software engineer, Seth Vargo, pulled his open-source software Chef Sugar offline after finding out that Chef (a popular DevOps company whose product uses the software) had recently signed a contract selling $95,000-worth of licenses to US Immigration and Customs Enforcement (ICE), which has faced widespread condemnation for separating children from their parents at the U.S. border and other abuses. Vargo took down the Chef Sugar library from both GitHub and RubyGems, the main Ruby package repository, as a sign of protest.

In May this year, Mijente, an advocacy organization, released documents stating that Palantir was responsible for the 2017 ICE operation that targeted and arrested family members of children crossing the border alone. In May 2018, Amazon employees, in a letter to Jeff Bezos, protested against the sale of its facial recognition tech to Palantir, saying they "refuse to contribute to tools that violate human rights" and citing the mistreatment of refugees and immigrants by ICE. And in July, WNYC revealed that Palantir's mobile app FALCON was being used by ICE to carry out raids on immigrant communities and to enable workplace raids in New York City in 2017.

Founder of OSI responds to Ehmke's Hippocratic License

Bruce Perens, one of the founders of the Open Source movement in software, responded to Ehmke in a post titled "Sorry, Ms. Ehmke, The 'Hippocratic License' Can't Work".
"The software may not be used by individuals, corporations, governments, or other groups for systems or activities that actively and knowingly endanger harm, or otherwise threaten the physical, mental, economic, or general well-being of underprivileged individuals or groups," he highlights in his post, quoting the license. "The terms are simply far more than could be enforced in a copyright license," he adds. "Nobody could enforce Ms. Ehmke's license without harming someone, or at least threatening to do so. And it would be easy to make a case for that person being underprivileged," he continued. He concluded by saying that, though he found the terms of Ehmke's license unagreeable, he will "happily support Ms. Ehmke in pursuit of legal reforms meant to achieve the protection of underprivileged people."

Many have welcomed Ehmke's idea of an open source license with an ethical clause. However, the license is not OSI-approved yet, and the chances are slim after Perens' response. There are also many users who do not agree with the license, and reaching a consensus will be hard.

https://twitter.com/seannalexander/status/1175853429325008896
https://twitter.com/AdamFrisby/status/1175867432411336704
https://twitter.com/rishmishra/status/1175862512509685760

Even though developers host their source code on open source repositories, a license may bring a certain level of restriction on who is allowed to use the code. However, as Perens points out, many of the terms in Ehmke's license would be hard to enforce. Irrespective of the outcome of this license's approval process, Coraline Ehmke has opened up the long-overdue topic of FOSS licensing reform in the open source community. It will be interesting to see whether such a license can spur ethical reform by giving developers more authority to embed their values and prevent the misuse of their software.

Read the Hippocratic License to know more in detail.

Other interesting news in tech

ImageNet Roulette: New viral app trained using ImageNet exposes racial biases in artificial intelligent systems
Machine learning ethics: what you need to know and what you can do
Facebook suspends tens of thousands of apps amid an ongoing investigation into how apps use personal data

Introducing Weld, a runtime written in Rust and LLVM for cross-library optimizations

Bhagyashree R
24 Sep 2019
5 min read
Weld is an open-source Rust project for improving the performance of data-intensive applications. It is an interface and runtime that can be integrated into existing frameworks, including Spark, TensorFlow, Pandas, and NumPy, without changing their user-facing APIs.

The motivation behind Weld

Data analytics applications today often require developers to combine functions from different libraries and frameworks to accomplish a particular task. For instance, a typical Python ecosystem application might select some data using Spark SQL, transform it using NumPy and Pandas, and train a model with TensorFlow. This improves developers' productivity, as they are taking advantage of functions from high-quality libraries. However, these functions are usually optimized in isolation, which is not enough to achieve the best application performance. Weld aims to solve this problem by providing an interface and runtime that can optimize across data-intensive libraries and frameworks while preserving their user-facing APIs.

In an interview with Federico Carrone, a Tech Lead at LambdaClass, Weld's main contributor, Shoumik Palkar, shared, "The motivation behind Weld is to provide bare-metal performance for applications that rely on existing high-level APIs such as NumPy and Pandas. The main problem it solves is enabling cross-function and cross-library optimizations that other libraries today don't provide."

How Weld works

Weld serves as a common runtime that allows libraries from different domains, like SQL and machine learning, to represent their computations in a common functional intermediate representation (IR). This IR is then optimized by a compiler optimizer and JIT'd to efficient machine code for diverse parallel hardware. Weld performs a wide range of optimizations on the IR, including loop fusion, loop tiling, and vectorization. "Weld's IR is natively parallel, so programs expressed in it can always be trivially parallelized," said Palkar.

When Weld was first introduced, it was mainly used for cross-library optimizations. Over time, however, people have started to use it for other applications as well: it can be used to build JITs or new physical execution engines for databases or analytics frameworks, to accelerate individual libraries, to target new kinds of parallel hardware using the IR, and more. To evaluate Weld's performance, the team integrated it with popular data analytics frameworks, including Spark, NumPy, and TensorFlow. This prototype showed up to 30x improvements over the native framework implementations, while cross-library optimizations between Pandas and NumPy improved performance by up to two orders of magnitude.

[Image source: Weld]
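Loop fusion, one of the IR optimizations named above, is easy to illustrate outside Weld. The sketch below is a conceptual Python analogy, not Weld code: the fused version makes a single pass over the data instead of materializing an intermediate result between two library calls:

```python
data = list(range(1_000_000))

# Unfused: two passes and one intermediate list, as happens when
# two independently optimized library functions are chained.
squared = [x * x for x in data]       # pass 1 (library A)
total = sum(v + 1 for v in squared)   # pass 2 (library B)

# Fused: the same computation in a single pass, no intermediate.
# A cross-library optimizer can perform this rewrite automatically.
total_fused = sum(x * x + 1 for x in data)

assert total == total_fused
```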
Why Rust and LLVM were chosen for its implementation

The first iteration of Weld was implemented in Scala, because of its algebraic data types, powerful pattern matching, and large ecosystem. However, Scala had some shortcomings. Palkar shared in the interview, "We moved away from Scala because it was too difficult to embed a JVM-based language into other runtimes and languages." It had a managed runtime and a clunky build system, and its JIT compilations were quite slow for larger programs. Because of these shortcomings, the team wanted to redesign the JIT compiler, core API, and runtime from the ground up. They were searching for a language that was fast and safe, didn't have a managed runtime, and provided a rich standard library, functional paradigms, a good package manager, and a strong community. So they zeroed in on Rust, which happens to meet all these requirements.

Rust provides a very minimal, no-setup-required runtime and can be easily embedded into other languages such as Java and Python. To make development easier, it has high-quality packages, known as crates, and functional paradigms such as pattern matching. Lastly, it is backed by a great community.

Read also: "Rust is the future of systems programming, C is the new Assembly": Intel principal engineer, Josh Triplett

Explaining why they chose LLVM, Palkar said in the interview, "We chose LLVM because its an open-source compiler framework that has wide use and support; we generate LLVM directly instead of C/C++ so we don't need to rely on the existence of a C compiler, and because it improves compilation times (we don't need to parse C/C++ code)."

In a discussion on Hacker News, many users listed other Weld-like projects that developers may find useful. A user commented, "Also worth checking out OmniSci (formerly MapD), which features an LLVM query compiler to gain large speedups executing SQL on both CPU and GPU." Users also talked about Numba, an open-source JIT compiler that translates Python functions to optimized machine code at runtime with the help of the LLVM compiler library. "Very bizarre there is no discussion of numba here, which has been around and used widely for many years, achieves faster speedups than this, and also emits an LLVM IR that is likely a much better starting point for developing a "universal" scientific computing IR than doing yet another thing that further complicates it with fairly needless involvement of Rust," a user added.

To know more about Weld, check out the full interview on Medium. Also, watch this RustConf 2019 talk by Shoumik Palkar: https://www.youtube.com/watch?v=AZsgdCEQjFo&t

Other news in Programming

Darklang available in private beta
GNU community announces 'Parallel GCC' for parallelism in real-world compilers
TextMate 2.0, the text editor for macOS releases

Machine learning ethics: what you need to know and what you can do

Richard Gall
23 Sep 2019
10 min read
Ethics is, without a doubt, one of the most important topics to emerge in machine learning and artificial intelligence over the last year. While the reasons for this are complex, it nevertheless underlines that the field has reached technological maturity. After all, if artificial intelligence systems weren't having a real, demonstrable impact on wider society, why would anyone be worried about its ethical implications?

It's easy to dismiss the debate around machine learning and artificial intelligence as abstract and irrelevant to engineers' and developers' immediate practical concerns. However, this is wrong. Ethics needs to be seen as an important practical consideration for anyone using and building machine learning systems. If we fail to do so, the consequences could be serious. The last 12 months have been packed with stories of artificial intelligence not only showing theoretical bias, but also causing discriminatory outcomes in the real world. Amazon scrapped its AI tool for hiring last October because it showed significant bias against female job applicants. Even more recently, last month it emerged that algorithms built to detect hate speech online have in-built biases against black people. Although these might seem like edge cases, it's vital that everyone in the industry takes responsibility. This isn't something we can leave up to regulation or other organizations; the people who can really effect change are the developers and engineers on the ground.

It's true that many machine learning and artificial intelligence systems will be operating in ways where ethics isn't really an issue - and that's fine. But by focusing on machine learning ethics, and thinking carefully about the impact of your work, you will ultimately end up building better systems that are more robust and have better outcomes. So, with that in mind, let's look at practical ways to start thinking about ethics in machine learning and artificial intelligence.

Machine learning ethics and bias

The first step towards thinking seriously about ethics in machine learning is to think about bias. Once you are aware of how bias can creep into machine learning systems, and how that can have ethical implications, it becomes much easier to identify issues and make changes - or, even better, stop them before they arise. Bias isn't strictly an ethical issue; it could be a performance issue that's affecting the effectiveness of your system. But in the conversation around AI and machine learning ethics, it's the most practical way of starting to think seriously about the issue.

Types of machine learning and algorithmic bias

Although there are a range of different types of bias, the best place to begin is with two top-level concepts. You may have read lists of numerous different biases, but for the purpose of talking about ethics there are two important things to think about.

Pre-existing and data set biases

Pre-existing biases are embedded in the data on which we choose to train algorithms. While it's true that just about every data set will be 'biased' in some way (data is a representation, after all - there will always be something missing), the point here is that we need to be aware of the extent of the bias and the potential algorithmic consequences. You might have heard terms like 'sampling bias', 'exclusion bias' and 'prejudice bias' - these aren't radically different. They all result from pre-existing biases about how a data set looks or what it represents.
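One concrete way to surface a pre-existing bias is simply to measure how groups are represented in the training data, and how historical outcomes differ by group, before any model is fit. Here is a minimal sketch with pandas, using a small hypothetical hiring data set:

```python
import pandas as pd

# Hypothetical historical data for a hiring model.
df = pd.DataFrame({
    "gender": ["F", "M", "M", "F", "M", "M", "M", "F"],
    "hired":  [0,   1,   1,   0,   1,   0,   1,   0],
})

# How is each group represented in the data set?
print(df["gender"].value_counts(normalize=True))

# How does the historical outcome differ by group? A large gap
# here is exactly what a model trained on this data will learn
# and reproduce.
print(df.groupby("gender")["hired"].mean())
```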
Technical and contextual biases

Technical machine learning bias is about how an algorithm is programmed. It refers to the problems that arise when an algorithm is built to operate in a specific way; essentially, it occurs when the programmed elements of an algorithm fail to properly account for the context in which it is being used. A good example is the plagiarism checker Turnitin - this used an algorithm that was trained to identify strings of text, which meant it would target non-native English speakers over native English speakers, who were able to make changes to avoid detection. Although there are, as I've said, many different biases in the field of machine learning, by thinking about the data on which your algorithm is trained and the context in which the system is working, you will be in a much better place to think about the ethical implications of your work. Equally, you will also be building better systems that don't cause unforeseen issues.

Read next: How to learn data science: from data mining to machine learning

The importance of context in machine learning

The most important thing for anyone working in machine learning and artificial intelligence is context. Put another way, you need to have a clear sense of why you are trying to do something and what the possible implications could be. If this is unclear, think about it this way: when you use an algorithm, you're essentially automating away decision making. That's a good thing when you want to make lots of decisions at a huge scale. But the one thing you lose when turning decision making into a mathematical formula is context. The decisions an algorithm makes lack context because it is programmed to react in a very specific way. This means contextual awareness is your problem. That's part of the bargain of using an algorithm.

Context in data collection

Let's look at what thinking about context means when it comes to your data set.

Step 1: What are you trying to achieve?

Essentially, the first thing you'll want to consider is what you're trying to achieve. Do you want to train an algorithm to recognise faces? Do you want it to understand language in some way?

Step 2: Why are you doing this?

What's the point of doing what you're doing? Sometimes this might be a straightforward answer, but be cautious if the answer comes too easily. Making something work more efficiently or faster isn't really a satisfactory reason - what's the point of making something more efficient? This is often where you'll start to see ethical issues emerge more clearly. Sometimes they're not easily resolved; you might not even be in a position to resolve them yourself (if you're employed by a company, after all, you're quite literally contracted to perform a specific task). But even if you do feel like there's little room to maneuver, it's important to ensure that these discussions actually take place and that you consider the impact of an algorithm. That will make it easier for you to put safeguarding steps in place.

Step 3: Understanding the data set

Think about how your data set fits alongside the what and the why. Is there anything missing? How was the data collected? Could it be biased or skewed in some way? Indeed, it might not even matter - but if it does, it's essential that you pay close attention to the data you're using. It's worth recording any potential limitations or issues, so that if a problem arises at a later stage in your machine learning project, the causes are documented and visible to others.
The context of algorithm implementation

The other aspect of thinking about context is to think carefully about how your machine learning or artificial intelligence system is being implemented. Is it working how you thought it would? Is it showing any signs of bias? Many articles about the limitations of artificial intelligence and machine learning ethics cite the case of Microsoft's Tay. Tay was a chatbot that 'learned' from its interactions with users on Twitter. Built with considerable naivety, Tay was turned racist by Twitter users in a matter of days. Users 'spoke' to Tay using racist language, and because Tay learned through interactions with Twitter users, the chatbot quickly became a reflection of the language and attitudes of those around it. This is a good example of how an algorithm's designers can fail to consider how its real-world implementation will have negative consequences. Despite, you'd think, the best of intentions, the developers didn't have the foresight to consider the reality of the world into which they were releasing their algorithmic progeny.

Read next: Data science vs. machine learning: understanding the difference and what it means today

Algorithmic impact assessments

It's true that ethics isn't always going to be an urgent issue for engineers. But in certain domains, it's going to be crucial, particularly in public services and other aspects of government, like justice. Maybe there should be a debate about whether artificial intelligence and machine learning should be used in those contexts at all. But if we can't have that debate, at the very least we can have tools that help us to think about the ethical implications of the machine learning systems we build. This is where algorithmic impact assessments come in. The idea was developed by the AI Now Institute and outlined in a paper published last year, and was recently implemented by the Canadian government. There's no one way to do an algorithmic impact assessment - the Canadian government uses a questionnaire "designed to help you assess and mitigate the risks associated with deploying an automated decision system." This essentially provides a framework for those using and building algorithms to understand the scope of their project and to identify any potential issues or problems that could arise.

Tools for assessing bias and supporting ethical engineering

Although algorithmic impact assessments can provide you with a solid conceptual grounding for thinking about the ethical implications of artificial intelligence and machine learning systems, there are also a number of tools that can help you better understand the ways in which algorithms could be perpetuating biases or prejudices. One of these is FairML, "an end-to-end toolbox for auditing predictive models by quantifying the relative significance of the model's inputs" - helping engineers to identify the extent to which algorithmic inputs could cause harm or bias. Another is LIME (Local Interpretable Model-Agnostic Explanations). LIME is not dissimilar to FairML: it aims to understand why an algorithm makes the decisions it does by 'perturbing' inputs and seeing how this affects its outputs. There's also Deon, which is a lot like a more lightweight, developer-friendly version of an algorithmic impact assessment. It's a command line tool that allows you to add an ethics checklist to your projects. All these tools underline some of the most important elements in the fight for machine learning ethics.
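To give a flavour of how LIME works in practice, here is a minimal sketch using the open source lime package (pip install lime) with a scikit-learn classifier. The model and data set are stand-ins chosen only so the example runs end to end; the interesting part is the explain_instance call, which perturbs a single input and reports which features drove the prediction.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

# Train any classifier - a random forest on the iris data as a stand-in.
data = load_iris()
clf = RandomForestClassifier(random_state=0).fit(data.data, data.target)

# LIME perturbs samples around one instance and fits a simple local model
# to approximate the classifier's behaviour in that neighbourhood.
explainer = LimeTabularExplainer(
    data.data,
    feature_names=data.feature_names,
    class_names=list(data.target_names),
    mode="classification",
)
explanation = explainer.explain_instance(
    data.data[0], clf.predict_proba, num_features=4
)
# Each (feature, weight) pair shows how much that feature pushed the prediction.
print(explanation.as_list())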
FairML and LIME are both attempting to make interpretability easier, while Deon is making it possible for engineers to bring a holistic and critical approach directly into their day-to-day work. It aims to promote transparency and improve communication between engineers and others.

The future of artificial intelligence and machine learning depends on developers taking responsibility

Machine learning and artificial intelligence are hitting maturity. After decades of incubation in computer science departments and military intelligence organizations, these technologies are now transforming, and having an impact on, a truly impressive range of domains. With this maturity comes more responsibility. Ethical questions arise as machine learning effects change everywhere, spilling out into everything from marketing to justice systems. If we can't get machine learning ethics right, then we'll never properly leverage the benefits of artificial intelligence and machine learning. People won't trust it, and legislation will start to severely curb what it can do. It's only by taking responsibility for its effects and consequences that we can be sure it will not only have a transformative impact on the world, but also one that's safe and for the benefit of everyone.
How Quarkus brings Java into the modern world of enterprise tech

Guest Contributor
22 Sep 2019
6 min read
What is old is new again, even - and maybe especially - in the world of technology. To name a few milestones that are being celebrated this year: Java is roughly 25 years old, it is the 10th anniversary of Minecraft, and Nintendo is back in vogue. None of these three examples are showing signs of slowing down anytime soon. I would argue that they are continuing to be leaders in innovation because of the simple fact that there are still people behind them creatively breathing new life into what otherwise could have been "been there, done that" technologies.

With Java in particular, it is so widely used that, from an enterprise efficiency perspective, it simply does not make sense NOT to have Java be a key language in the development of emerging tech apps. In fact, more and more apps are being developed with a Java-first approach. But how can this be done, especially when apps are being built using new architectures like serverless and microservices? One technology that shows promise is Quarkus, a newly introduced Kubernetes-native platform that addresses many of the barriers hindering Java's ability to shine in the modern world of emerging tech.

Why does Java still matter

Even though its continued relevance has been questioned for years, I believe Java still matters and is not likely to go away anytime soon. This is for two reasons. First, there is a whole list of programming languages that are based on Java and the Java Virtual Machine (JVM), such as Kotlin, Groovy, Clojure and JRuby. Also, Java continues to be one of the most popular programming languages for Android apps, as well as for the development of edge devices and the internet of things. In fact, according to SlashData's State of the Developer Nation Q4 2018 report, there are 7.6 million active Java developers worldwide.

Other factors that I think are contributing to Java's continued popularity include network portability, the fact that it is object-oriented, and that it compiles to bytecode that can be read and run on any platform with a JVM installed. Maybe most importantly, it has a syntax similar to C++, making it a relatively easy language for developers to learn.

Additionally, SlashData's research suggested that newer and niche languages do not seem to be adding many new developers, if any, per year, raising the question of whether it is easy for newer languages to scale beyond their niche and become the next big thing. It also makes it clear that while there is value in newer programming languages that serve a narrower purpose, they may not be able to, or need to, overtake languages like Java. In fact, the success of Java relies on the entire ecosystem surrounding it, including the editors, third party libraries, CI/CD pipelines, and systems. Each aspect of the ecosystem is easy to take for granted in established languages but needs to be created from scratch in new languages if they want to catch up to or overtake Java.

How Quarkus brings Java into modern enterprise tech

Quarkus is more than just a cool name. It is a Kubernetes-native Java framework that is tailored for GraalVM and HotSpot, and crafted from best-of-breed Java libraries and standards. The overall goal of Quarkus is to make Java one of the leading platforms in Kubernetes and serverless environments, while also enabling developers to work within what they know and in both a reactive and an imperative programming model.
Put simply, Quarkus works to bring Java into the modern microservices and serverless modes of developing. This is important because Java continues to be a top programming language for back-end enterprise developers. Many organizations have tied both time and money into Java, which has been a dominant force in the development landscape for a number of years. As enterprises increasingly shift toward cloud computing, it is important for Java to carry over into these new programming methods.

Why a "Java First" approach

Java has been a top programming language for enterprises for over a decade. We should not lose sight of that fact, that there are many developers with excellent Java skills, and that there are existing applications that run on Java. Furthermore, because Java has been around so long, it has matured not only as a language but also as an ecosystem. There are editors, logging systems, debuggers, build systems, unit testing environments, QA testing environments, and more - all tuned for Java, if not also implemented in Java. Therefore, when starting a new Java application it can be easier to find third-party components or entire systems that can help the developer gain productivity advancements over other languages that have not yet grown to have the breadth and depth of the Java ecosystem.

Using a full-stack framework such as Quarkus, and taking advantage of libraries that use Java, such as Eclipse MicroProfile and Eclipse Vert.x, makes this easier, and also encourages the use of different combinations of tools and dependencies. Quarkus in particular also includes an extension framework that third party authors can use to build native executables and expand the functionality of Java in the enterprise. Quarkus not only brings Java into the modern world of containers, but it also does so quickly, with short start-up times.

Java is not looking like it will go away anytime soon. Between the number of developers who still use Java as their first language and the number of apps that run almost entirely on it, Java's stake in the game is as solid as ever. Through new tools like Quarkus, it can continue to evolve in the modern app dev world.

Author Bio

Mark Little works at Red Hat, where he leads JBoss technical direction and research & development. Prior to this, he was SOA Technical Development Manager and Director of Standards. He also has experience with two successful startup companies.

Other interesting news in Tech

Media manipulation by Deepfakes and cheap fakes require both AI and social fixes, finds a Data and Society report.

Open AI researchers advance multi-agent competition by training AI agents in a hide and seek environment.

France and Germany reaffirm blocking Facebook's Libra cryptocurrency
Bad Metadata can get you in legal hot water

Guest Contributor
21 Sep 2019
6 min read
Metadata isn't just something that concerns business intelligence and IT teams; lawyers are extremely interested in it as well. Metadata, it turns out, can win or lose lawsuits, send politicians to jail, and even decide medical malpractice cases. It's not uncommon for attorneys who conduct discovery of electronic records in organizations to find that the claims of plaintiffs or defendants are contradicted by metadata, like time and date, type of data, etc. If a discovery process is initiated against them, an organization had better be sure that its metadata is in order. All it would take for an organization to lose a case would be for an attorney to discover a discrepancy in different databases - a different time stamp on some communication, or a different job title for a principal in the case. Such discrepancies could lead to accusations of data tampering, fraud, or worse - and would most definitely put the organization in a very tough position versus a judge or jury.

Metadata errors are difficult to spot

The problem, of course, is that catching metadata errors is extremely difficult. In large organizations, data is stored in repositories that are spread throughout the organization, maybe even the world, across different departments. Each department is responsible for maintaining its own database and the metadata in it, and data may also sit on different cloud storage repositories, each with its own system of classifying data. An enterprising attorney could have a field day with the different categories and tags data is stored under, making claims that the organization is trying to "hide something." The organization's only defense: we're poor administrators. That may not be enough to impress the court.

Types of Metadata

Metadata is "data about data," and comes in three flavors. System metadata is automatically generated by the computer and includes specific labeled criteria, like the date and time of creation and the date a document was modified. Substantive metadata reflects changes to a document, like tracked changes. Embedded metadata is data entered into a document or file but not normally visible, such as formulas in cells in an Excel spreadsheet. All of these have increasingly become targets for attorneys in recent years.

Metadata has been used in thousands of cases - medical, financial, patent and trademark law, product liability, civil rights, and many more. Metadata is both discoverable and admissible as evidence. According to one New York court, "General information about the creation of a document, including who authored a document and when it was created, is pedigree information often important for purposes of determining admissibility at trial." According to legal experts, "from a legal standpoint metadata is evidence… that describes the characteristics, origins, usage, and validity of other electronic evidence."

The biggest metadata-linked payout until now - $10.8 million - occurred in 2017, when a jury awarded a plaintiff $8 million (eventually increased to nearly $11 million) after he claimed he was fired from a biotechnology company for telling authorities about potential bribery in China. The key piece of evidence was the metadata timestamp on a performance review that was written after the plaintiff was fired; with that evidence, the court increased the defendant's payout for violating laws against firing whistleblowers.
In that case, records claiming that the employee was fired for cause were belied by the metadata in the performance review. That, of course, was a case in which there was clear wrongdoing by an organization. But the same metadata errors could crop up in any number of scenarios, even if no laws were broken. The precedent in this case, and others like it, might be enough to convince a court to penalize an organization based on the claims of a plaintiff.

How can organizations defend themselves from this legal bind of metadata

The answer would seem obvious: get control of your metadata and make sure it corresponds to the data it represents. With that kind of control over data, organizations would discover for themselves if something was amiss that could cost them in a settlement later. But execution of that obvious answer is a different story. With reams of data to pore through, it would take an organization's business intelligence team months, or even years, to manually sift through the databases. And because to err is human, there would be no guarantee they hadn't missed something. Clearly, Business Intelligence and Data Analysis teams need some help in doing this.

One solution would be to hire more staff, expanding teams at least temporarily to make sense of the data and metadata that could prove problematic. There are services that will lend their staff to an organization to do just that, and for companies that prefer the "human touch," adding that temporary staff may be the best solution. Another idea is to automate the process, with advanced tools that will do a full examination of data, both across systems and within repositories themselves. Such automated tools would examine the data in the various repositories and find where the metadata for the same information is different - pointing BI teams in the right direction and cutting down on the time needed to determine what needs to be fixed.

Using automated metadata management tools, companies can ensure that they remain secure. If a company is being sued and discovery has commenced, it will be too late for the organization to fix anything. Honest mistakes or disorganized file keeping can no longer be corrected, and the fate of the organization will be in the hands of a jury or a judge. Automated metadata management tools can help Business Intelligence and Data Analysis teams figure out which metadata entries are not consistent across repositories, ensuring that things are fixed before discovery takes place.

There are a variety of tools on the market, with various strengths and weaknesses. Companies will need to decide whether a data dictionary, a business glossary, or a more all-encompassing product best answers their needs. They'll also need to make sure the enterprise software they currently use is supported by the metadata management solution they are after. As the market develops, AI will be a huge distinguishing factor between metadata solutions, as machine learning will significantly reduce the cost and manpower investment of solution onboarding.

With the success of recent metadata-based lawsuits, you can be sure more attorneys will be using metadata in their discovery processes. Organizations that want to defend themselves need to get their data in order, and ensure that they won't end up losing lots of money because of their own errors.

Author Bio

Amnon Drori is the Co-Founder and CEO of Octopai and has over 20 years of leadership experience in technology companies.
Before co-founding Octopai he led sales efforts at companies like Panaya (acquired by Infosys), Zend Technologies (acquired by Rogue Wave Software), ModusNovo and Alvarion.

Other interesting news in Tech

Media manipulation by Deepfakes and cheap fakes require both AI and social fixes, finds a Data and Society report.

Open AI researchers advance multi-agent competition by training AI agents in a hide and seek environment.

France and Germany reaffirm blocking Facebook's Libra cryptocurrency
How to handle categorical data for machine learning algorithms

Packt Editorial Staff
20 Sep 2019
9 min read
The quality of data and the amount of useful information are key factors that determine how well a machine learning algorithm can learn. Therefore, it is absolutely critical that we make sure to encode categorical variables correctly before we feed data into a machine learning algorithm. In this article, with simple yet effective examples, we will explain how to deal with categorical data when building machine learning models, and how to map ordinal and nominal feature values to integer representations.

The article is an excerpt from the book Python Machine Learning - Third Edition by Sebastian Raschka and Vahid Mirjalili. This book is a comprehensive guide to machine learning and deep learning with Python. It acts as both a clear step-by-step tutorial and a reference you'll keep coming back to as you build your machine learning systems.

It is not uncommon for real-world datasets to contain one or more categorical feature columns. When we are talking about categorical data, we have to further distinguish between nominal and ordinal features. Ordinal features can be understood as categorical values that can be sorted or ordered. For example, t-shirt size would be an ordinal feature, because we can define an order XL > L > M. In contrast, nominal features don't imply any order and, to continue with the previous example, we could think of t-shirt color as a nominal feature since it typically doesn't make sense to say that, for example, red is larger than blue.

Categorical data encoding with pandas

Before we explore different techniques to handle such categorical data, let's create a new DataFrame to illustrate the problem:

>>> import pandas as pd
>>> df = pd.DataFrame([
...     ['green', 'M', 10.1, 'class1'],
...     ['red', 'L', 13.5, 'class2'],
...     ['blue', 'XL', 15.3, 'class1']])
>>> df.columns = ['color', 'size', 'price', 'classlabel']
>>> df
   color size  price classlabel
0  green    M   10.1     class1
1    red    L   13.5     class2
2   blue   XL   15.3     class1

As we can see in the preceding output, the newly created DataFrame contains a nominal feature (color), an ordinal feature (size), and a numerical feature (price) column. The class labels (assuming that we created a dataset for a supervised learning task) are stored in the last column.

Mapping ordinal features

To make sure that the learning algorithm interprets the ordinal features correctly, we need to convert the categorical string values into integers. Unfortunately, there is no convenient function that can automatically derive the correct order of the labels of our size feature, so we have to define the mapping manually. In the following simple example, let's assume that we know the numerical difference between features, for example, XL = L + 1 = M + 2:

>>> size_mapping = {
...     'XL': 3,
...     'L': 2,
...     'M': 1}
>>> df['size'] = df['size'].map(size_mapping)
>>> df
   color  size  price classlabel
0  green     1   10.1     class1
1    red     2   13.5     class2
2   blue     3   15.3     class1

If we want to transform the integer values back to the original string representation at a later stage, we can simply define a reverse-mapping dictionary, inv_size_mapping = {v: k for k, v in size_mapping.items()}, which can then be used via the pandas map method on the transformed feature column, similar to the size_mapping dictionary that we used previously.
We can use it as follows:

>>> inv_size_mapping = {v: k for k, v in size_mapping.items()}
>>> df['size'].map(inv_size_mapping)
0     M
1     L
2    XL
Name: size, dtype: object

Encoding class labels

Many machine learning libraries require that class labels are encoded as integer values. Although most estimators for classification in scikit-learn convert class labels to integers internally, it is considered good practice to provide class labels as integer arrays to avoid technical glitches. To encode the class labels, we can use an approach similar to the mapping of ordinal features discussed previously. We need to remember that class labels are not ordinal, and it doesn't matter which integer number we assign to a particular string label. Thus, we can simply enumerate the class labels, starting at 0:

>>> import numpy as np
>>> class_mapping = {label: idx for idx, label in
...                  enumerate(np.unique(df['classlabel']))}
>>> class_mapping
{'class1': 0, 'class2': 1}

Next, we can use the mapping dictionary to transform the class labels into integers:

>>> df['classlabel'] = df['classlabel'].map(class_mapping)
>>> df
   color  size  price  classlabel
0  green     1   10.1           0
1    red     2   13.5           1
2   blue     3   15.3           0

We can reverse the key-value pairs in the mapping dictionary as follows to map the converted class labels back to the original string representation:

>>> inv_class_mapping = {v: k for k, v in class_mapping.items()}
>>> df['classlabel'] = df['classlabel'].map(inv_class_mapping)
>>> df
   color  size  price classlabel
0  green     1   10.1     class1
1    red     2   13.5     class2
2   blue     3   15.3     class1

Alternatively, there is a convenient LabelEncoder class directly implemented in scikit-learn to achieve this:

>>> from sklearn.preprocessing import LabelEncoder
>>> class_le = LabelEncoder()
>>> y = class_le.fit_transform(df['classlabel'].values)
>>> y
array([0, 1, 0])

Note that the fit_transform method is just a shortcut for calling fit and transform separately, and we can use the inverse_transform method to transform the integer class labels back into their original string representation:

>>> class_le.inverse_transform(y)
array(['class1', 'class2', 'class1'], dtype=object)

Performing one-hot encoding on nominal features

In the Mapping ordinal features section, we used a simple dictionary-mapping approach to convert the ordinal size feature into integers. Since scikit-learn's estimators for classification treat class labels as categorical data that does not imply any order (nominal), we used the convenient LabelEncoder to encode the string labels into integers. It may appear that we could use a similar approach to transform the nominal color column of our dataset, as follows:

>>> X = df[['color', 'size', 'price']].values
>>> color_le = LabelEncoder()
>>> X[:, 0] = color_le.fit_transform(X[:, 0])
>>> X
array([[1, 1, 10.1],
       [2, 2, 13.5],
       [0, 3, 15.3]], dtype=object)

After executing the preceding code, the first column of the NumPy array X now holds the new color values, which are encoded as follows: blue = 0, green = 1, red = 2. If we stop at this point and feed the array to our classifier, we will make one of the most common mistakes in dealing with categorical data. Can you spot the problem? Although the color values don't come in any particular order, a learning algorithm will now assume that green is larger than blue, and red is larger than green.
Although this assumption is incorrect, the algorithm could still produce useful results. However, those results would not be optimal. A common workaround for this problem is to use a technique called one-hot encoding. The idea behind this approach is to create a new dummy feature for each unique value in the nominal feature column. Here, we would convert the color feature into three new features: blue, green, and red. Binary values can then be used to indicate the particular color of an example; for example, a blue example can be encoded as blue=1, green=0, red=0. To perform this transformation, we can use the OneHotEncoder that is implemented in scikit-learn's preprocessing module:

>>> from sklearn.preprocessing import OneHotEncoder
>>> X = df[['color', 'size', 'price']].values
>>> color_ohe = OneHotEncoder()
>>> color_ohe.fit_transform(X[:, 0].reshape(-1, 1)).toarray()
array([[0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])

Note that we applied the OneHotEncoder to a single column (X[:, 0].reshape(-1, 1)) only, to avoid modifying the other two columns in the array as well. If we want to selectively transform columns in a multi-feature array, we can use the ColumnTransformer, which accepts a list of (name, transformer, column(s)) tuples as follows:

>>> from sklearn.compose import ColumnTransformer
>>> X = df[['color', 'size', 'price']].values
>>> c_transf = ColumnTransformer([
...     ('onehot', OneHotEncoder(), [0]),
...     ('nothing', 'passthrough', [1, 2])
... ])
>>> c_transf.fit_transform(X).astype(float)
array([[0.0, 1.0, 0.0, 1, 10.1],
       [0.0, 0.0, 1.0, 2, 13.5],
       [1.0, 0.0, 0.0, 3, 15.3]])

In the preceding code example, we specified that we only want to modify the first column and leave the other two columns untouched via the 'passthrough' argument.

An even more convenient way to create those dummy features via one-hot encoding is to use the get_dummies method implemented in pandas. Applied to a DataFrame, the get_dummies method will only convert string columns and leave all other columns unchanged:

>>> pd.get_dummies(df[['price', 'color', 'size']])
   price  size  color_blue  color_green  color_red
0   10.1     1           0            1          0
1   13.5     2           0            0          1
2   15.3     3           1            0          0

When we are using one-hot encoded datasets, we have to keep in mind that this introduces multicollinearity, which can be an issue for certain methods (for instance, methods that require matrix inversion). If features are highly correlated, matrices are computationally difficult to invert, which can lead to numerically unstable estimates. To reduce the correlation among variables, we can simply remove one feature column from the one-hot encoded array. Note that we do not lose any important information by removing a feature column, though; for example, if we remove the column color_blue, the feature information is still preserved since if we observe color_green=0 and color_red=0, it implies that the observation must be blue.

If we use the get_dummies function, we can drop the first column by passing a True argument to the drop_first parameter, as shown in the following code example:

>>> pd.get_dummies(df[['price', 'color', 'size']],
...                drop_first=True)
   price  size  color_green  color_red
0   10.1     1            1          0
1   13.5     2            0          1
2   15.3     3            0          0

In order to drop a redundant column via the OneHotEncoder, we need to set drop='first' and set categories='auto' as follows:

>>> color_ohe = OneHotEncoder(categories='auto', drop='first')
>>> c_transf = ColumnTransformer([
...     ('onehot', color_ohe, [0]),
...     ('nothing', 'passthrough', [1, 2])
... ])
>>> c_transf.fit_transform(X).astype(float)
array([[ 1. ,  0. ,  1. , 10.1],
       [ 0. ,  1. ,  2. , 13.5],
       [ 0. ,  0. ,  3. , 15.3]])

In this article, we have gone through some of the methods to deal with categorical data in datasets. We distinguished between nominal and ordinal features, and with examples we explained how they can be handled. To harness the power of the latest Python open source libraries in machine learning, check out the book Python Machine Learning - Third Edition, written by Sebastian Raschka and Vahid Mirjalili.

Other interesting reads in data!

The best business intelligence tools 2019: when to use them and how much they cost

Introducing Microsoft's AirSim, an open-source simulator for autonomous vehicles built on Unreal Engine

Media manipulation by Deepfakes and cheap fakes require both AI and social fixes, finds a Data & Society report
GitHub acquires Semmle to secure open-source supply chain; attains CVE Numbering Authority status

Savia Lobo
19 Sep 2019
5 min read
Yesterday, GitHub announced that it has acquired Semmle, a code analysis platform provider, and also that it is now a Common Vulnerabilities and Exposures (CVE) Numbering Authority.

https://twitter.com/github/status/1174371016497405953

The Semmle acquisition is part of the plan to secure the open-source supply chain, Nat Friedman explains in his blog post. Semmle provides a code analysis engine, named QL, which allows developers to write queries that identify code patterns in large codebases and search for vulnerabilities and their variants. Security researchers use Semmle to quickly find vulnerabilities in code with simple declarative queries. "Semmle is trusted by security teams at Uber, NASA, Microsoft, Google, and has helped find thousands of vulnerabilities in some of the largest codebases in the world, as well as over 100 CVEs in open source projects to date," Friedman writes.

Also Read: GitHub now supports two-factor authentication with security keys using the WebAuthn API

Semmle, which originally spun out of research at Oxford in 2006, announced a $21 million Series B investment led by Accel Partners last year. "In total, the company raised $31 million before this acquisition," TechCrunch reports. Shanku Niyogi, Senior Vice President of Product at GitHub, writes in his blog post, "An important measure of the success of Semmle's approach is the number of vulnerabilities that have been identified and disclosed through their technology. Today, over 100 CVEs in open source projects have been found using Semmle, including high-profile projects like Apache Struts, Apple's XNU, the Linux Kernel, Memcached, U-Boot, and VLC. No other code analysis tool has a similar success rate."

GitHub also announced that it has been approved as a CVE Numbering Authority for open source projects. Now, GitHub will be able to issue CVEs for security advisories opened on GitHub, allowing for even broader awareness across the industry. With the Semmle integration, every CVE-ID can be associated with a Semmle QL query, which can then be shared and tracked by the broader developer community. The CVE approval will make it easier for project maintainers to report security flaws directly from their repositories. Also, GitHub can assign CVE identifiers directly and post them to the CVE List and the National Vulnerability Database (NVD).

Earlier this year, GitHub acquired Dependabot to provide automatic security fixes natively within GitHub. With automatic security fixes, developers no longer need to manually patch their dependencies. When a vulnerability is found in a dependency, GitHub will automatically issue a pull request on downstream repositories with the information needed to accept the patch.

In August, GitHub was in the limelight for being a part of the Capital One data breach that affected 106 million users in the US and Canada. The law firm Tycko & Zavareei LLP filed a lawsuit in California's federal district court on behalf of their plaintiffs Seth Zielicke and Aimee Aballo.

Also Read: GitHub acquires Spectrum, a community-centric conversational platform

Both plaintiffs claimed Capital One and GitHub were unable to protect users' personal data. The complaint highlighted that Paige A. Thompson, the alleged hacker, stole the data in March and posted about the theft on GitHub in April.
According to the lawsuit, "As a result of GitHub's failure to monitor, remove, or otherwise recognize and act upon obviously-hacked data that was displayed, disclosed, and used on or by GitHub and its website, the Personal Information sat on GitHub.com for nearly three months." The Semmle acquisition may be GitHub's move to improve security for users in the future. It will be interesting to see how GitHub shapes security for users now that it also holds CVE Numbering Authority status.

A user on Reddit writes, "I took part in a tutorial session Semmle held at a university CS society event, where we were shown how to use their system to write semantic analysis passes to look for things like use-after-free and null pointer dereferences. It was only an hour and a bit long, but I found the query language powerful & intuitive and the platform pretty effective. At the time, you could set up your codebase to run Semmle passes on pre-commit hooks or CI deployments etc. and get back some pretty smart reporting if you had introduced a bug."

The user further writes, "The session focused on Java, but a few other languages were supported as first-class, iirc. It felt kinda like writing an SQL query, but over AST rather than tuples in a table, and using modal logic to choose the selections. It took a little while to first get over the 'wut' phase (like 'how do I even express this'), but I imagine that a skilled team, once familiar with the system, could get a lot of value out of Semmle's QL/semantic analysis, especially for large/enterprise-scale codebases."

https://twitter.com/kurtseifried/status/1174395660960796672

https://twitter.com/timneutkens/status/1174598659310313472

To know more about this announcement in detail, read GitHub's official blog post.

Other news in Data

Keras 2.3.0, the first release of multi-backend Keras with TensorFlow 2.0 support is now out

Introducing Microsoft's AirSim, an open-source simulator for autonomous vehicles built on Unreal Engine

GQL (Graph Query Language) joins SQL as a Global Standards Project and is now the international standard declarative query language for graphs