
How-To Tutorials - Data

1215 Articles

What can you expect at NeurIPS 2019?

Sugandha Lahoti
06 Sep 2019
5 min read
Popular machine learning conference NeurIPS 2019 (Conference on Neural Information Processing Systems) will be held from Sunday, December 8 through Saturday, December 14 at the Vancouver Convention Center. The conference invites papers, tutorials, and submissions on cross-disciplinary research where machine learning methods are used in other fields, as well as methods and ideas from other fields applied to ML.

NeurIPS 2019 accepted papers

Yesterday, the conference published the list of its accepted papers. A total of 1,429 papers have been selected. Submissions opened on May 1 on a variety of topics such as Algorithms, Applications, Data implementations, Neuroscience and Cognitive Science, Optimization, Probabilistic Methods, Reinforcement Learning and Planning, and Theory. (The full list of subject areas is available here.)

This year at NeurIPS 2019, authors of accepted submissions were required to prepare either a 3-minute video or a PDF of slides summarizing the paper, or a PDF of the poster used at the conference. This was done to make NeurIPS content accessible to those unable to attend the conference. NeurIPS 2019 also introduced a mandatory abstract submission deadline, a week before final submissions were due; only a submission with a full abstract was allowed to have the full paper uploaded. The authors were also asked to answer questions from the Reproducibility Checklist during the submission process.

NeurIPS 2019 tutorial program

NeurIPS also invites experts to present tutorials on topics that are of interest to a sizable portion of the NeurIPS community and different from the ones already presented at other ML conferences like ICML or ICLR. They looked for tutorial speakers who cover topics beyond their own research in a comprehensive manner that encompasses multiple perspectives. The tutorial chairs for NeurIPS 2019 are Danielle Belgrave and Alice Oh.
They initially compiled a list based on the last few years' publications, workshops, and tutorials presented at NeurIPS and at related venues. They asked colleagues for recommendations and conducted independent research. In reviewing the potential candidates, the chairs read papers to understand their expertise and watched their videos to appreciate their style of delivery. The list of candidates was emailed to the General Chair, the Diversity & Inclusion Chairs, and the rest of the Organizing Committee for their comments on this shortlist. Following a few adjustments based on their input, the potential speakers were selected.

A total of 9 tutorials have been selected for NeurIPS 2019:

Deep Learning with Bayesian Principles - Emtiyaz Khan
Efficient Processing of Deep Neural Network: from Algorithms to Hardware Architectures - Vivienne Sze
Human Behavior Modeling with Machine Learning: Opportunities and Challenges - Nuria Oliver, Albert Ali Salah
Interpretable Comparison of Distributions and Models - Wittawat Jitkrittum, Dougal Sutherland, Arthur Gretton
Language Generation: Neural Modeling and Imitation Learning - Kyunghyun Cho, Hal Daume III
Machine Learning for Computational Biology and Health - Anna Goldenberg, Barbara Engelhardt
Reinforcement Learning: Past, Present, and Future Perspectives - Katja Hofmann
Representation Learning and Fairness - Moustapha Cisse, Sanmi Koyejo
Synthetic Control - Alberto Abadie, Vishal Misra, Devavrat Shah

NeurIPS 2019 Workshops

NeurIPS workshops are primarily used for discussion of work in progress and future directions. This time the number of Workshop Chairs doubled, from two to four; the selected chairs are Jenn Wortman Vaughan, Marzyeh Ghassemi, Shakir Mohamed, and Bob Williamson. However, the number of workshop submissions went down from 140 in 2018 to 111 in 2019. Of these 111 submissions, 51 workshops were selected. The full list of selected workshops is available here.
The NeurIPS 2019 chair committee introduced new guidelines, expectations, and selection criteria for the workshops. This time workshops had an important focus on the nature of the problem, the intellectual excitement of the topic, diversity and inclusion, the quality of proposed invited speakers, the organizational experience and ability of the team, and more. The Workshop Program Committee consisted of 37 reviewers, with each workshop proposal assigned to two reviewers. The reviewer committee included more senior researchers who have been involved with the NeurIPS community. Reviewers were asked to provide a summary and overall rating for each workshop, a detailed list of pros and cons, and specific ratings for each of the new criteria. After all reviews were submitted, each proposal was assigned to two of the four chair committee members. The chair members looked through their assigned proposals and reviews to form an educated assessment of the pros and cons of each. Finally, the entire chair committee held a meeting to discuss every submitted proposal and make decisions.

You can find more details about the conference on the NeurIPS website. As always, keep checking this space for more content about the conference. In the meantime, you can read our previous years' coverage:

NeurIPS Invited Talk: Reproducible, Reusable, and Robust Reinforcement Learning
NeurIPS 2018: How machine learning experts can work with policymakers to make good tech decisions [Invited Talk]
NeurIPS 2018: Rethinking transparency and accountability in machine learning
NeurIPS 2018: Developments in machine learning through the lens of Counterfactual Inference [Tutorial]
Accountability and algorithmic bias: Why diversity and inclusion matters [NeurIPS Invited Talk]


Google is circumventing GDPR, reveals Brave's investigation for the Authorized Buyers ad business case

Bhagyashree R
06 Sep 2019
6 min read
Last year, Dr. Johnny Ryan, the Chief Policy & Industry Relations Officer at Brave, filed a complaint against Google's DoubleClick/Authorized Buyers ad business with the Irish Data Protection Commission (DPC). New evidence produced by Brave reveals that Google is circumventing GDPR and also undermining its own data protection measures.

Brave calls Google's Push Pages a GDPR workaround

Brave's new evidence rebuts some of Google's claims regarding its DoubleClick/Authorized Buyers system, the world's largest real-time advertising auction house. Google says that it prohibits companies that use its real-time bidding (RTB) ad system "from joining data they receive from the Cookie Matching Service." In September last year, Google announced that it had removed encrypted cookie IDs and list names from bid requests with buyers in its Authorized Buyers marketplace. Brave's research, however, found otherwise: "Brave's new evidence reveals that Google allowed not only one additional party, but many, to match with Google identifiers. The evidence further reveals that Google allowed multiple parties to match their identifiers for the data subject with each other."

When you visit a website that has Google ads embedded on its web pages, Google runs a real-time bidding ad auction to determine which advertiser will get to display its ads. For this, it uses Push Pages, the mechanism in question here. Brave hired Zach Edwards, the co-founder of digital analytics startup Victory Medium, and MetaX, a company that audits data supply chains, to investigate and analyze a log of Dr. Ryan's web browsing. The research revealed that Google's Push Pages can essentially be used as a workaround for user IDs. Google shares a 'google_push' identifier with the participating companies to identify a user. Brave says that the problem here is that the identifier that was shared was common to multiple companies.
This means that these companies could have cross-referenced what they learned about the user from Google with each other. Used by more than 8.4 million websites, Google's DoubleClick/Authorized Buyers broadcasts personal data of users to 2,000+ companies. This data includes the category of what a user is reading, which can reveal their political views, sexual orientation, and religious beliefs, as well as their location. There are also unique ID codes specific to a user that can let companies uniquely identify them. All this information can give these companies a way to keep tabs on what users are "reading, watching, and listening to online."

Brave calls Google's RTB data protection policies "weak" as they ask these companies to self-regulate. Google does not have much control over what these companies do with the data once it is broadcast. "Its policy requires only that the thousands of companies that Google shares peoples' sensitive data with monitor their own compliance, and judge for themselves what they should do," Brave wrote. A Google spokesperson, in response to this news, told Forbes, "We do not serve personalised ads or send bid requests to bidders without user consent. The Irish DPC — as Google's lead DPA — and the UK ICO are already looking into real-time bidding in order to assess its compliance with GDPR. We welcome that work and are co-operating in full."

Users recommend starting an "information campaign" instead of a penalty that will hardly affect big tech

This news triggered a discussion on Hacker News where users talked about the implications of RTB and what strict actions the EU could take to protect user privacy. A user explained, "So, let's say you're an online retailer, and you have Google IDs for your customers. You probably have some useful and sensitive customer information, like names, emails, addresses, and purchase histories.
In order to better target your ads, you could participate in one of these exchanges, so that you can use the information you receive to suggest products that are as relevant as possible to each customer. To participate, you send all this sensitive information, along with a Google ID, and receive similar information from other retailers, online services, video games, banks, credit card providers, insurers, mortgage brokers, service providers, and more! And now you know what sort of vehicles your customers drive, how much they make, whether they're married, how many kids they have, which websites they browse, etc. So useful! And not only do you get all these juicy private details, but you've also shared your customers' sensitive purchase history with anyone else who is connected to the exchange."

Others said that a penalty is not going to deter Google: "The whole penalty system is quite silly. The fines destroy small companies who are the ones struggling to comply, and do little more than offer extremely gentle pokes on the wrist for megacorps that have relatively unlimited resources available for complete compliance, if they actually wanted to comply." Users suggested that the EU should instead start an information campaign: "EU should ignore the fines this time and start an 'information campaign' regarding behavior of Google and others. I bet that hurts Google 10 times more." Some also said that not just Google but the RTB participants should be held responsible: "Because what Google is doing is not dissimilar to how any other RTB participant is acting, saying this is a Google workaround seems disingenuous."

With this case, Brave has launched a full-fledged campaign that aims to "reform the multi-billion dollar RTB industry" and spans sixteen EU countries. To achieve this goal it has collaborated with several privacy NGOs and academics, including the Open Rights Group and Dr. Michael Veale of the Turing Institute, among others.
In other news, a Bloomberg report reveals that Google and other internet companies have recently asked for an amendment to the California Consumer Privacy Act, which will be enacted in 2020. The law currently limits how digital advertising companies collect and make money from user data. The proposed amendments include approval for collecting user data for targeted advertising, using the data collected from websites for their own analysis, and many others. Read the Bloomberg report to know more in detail.

Other news in Data

Facebook content moderators work in filthy, stressful conditions and experience emotional trauma daily, reports The Verge
GDPR complaint in EU claims billions of personal data leaked via online advertising bids
European Union fined Google 1.49 billion euros for antitrust violations in online advertising


Key skills every database programmer should have

Sugandha Lahoti
05 Sep 2019
7 min read
According to Robert Half Technology's 2019 IT salary report, 'Database programmer' is one of the 13 most in-demand tech jobs for 2019. For an entry-level programmer, the average salary is $98,250, which goes up to $167,750 for a seasoned expert. A typical database programmer is responsible for designing, developing, testing, deploying, and maintaining databases. In this article, we list the top critical tech skills essential to database programmers.

#1 Ability to perform data modeling

The first step is to learn to model the data. In data modeling, you create a conceptual model of how data items relate to each other. In order to efficiently plan a database design, you should know the organization you are designing the database for. This is because data models describe real-world entities such as 'customer', 'service', and 'products', and the relations between these entities. Data models provide an abstraction for the relations in the database. They aid programmers in modeling business requirements and in translating business requirements into relations. They are also used for exchanging information between the developers and business owners.

During the design phase, the database developer should pay great attention to the underlying design principles, run a benchmark stack to ensure performance, and validate user requirements. They should also avoid pitfalls such as data redundancy, null saturation, and tight coupling.

#2 Know a database programming language, preferably SQL

Database programmers need to design, write, and modify programs to improve their databases. SQL is one of the top languages used to manipulate the data in a database and to query the database. It's also used to define and change the structure of the data—in other words, to implement the data model. Therefore it is essential that you learn SQL.
In general, SQL has three parts:

Data Definition Language (DDL): used to create and manage the structure of the data
Data Manipulation Language (DML): used to manage the data itself
Data Control Language (DCL): controls access to the data

Considering that data is constantly inserted into the database, changed, or retrieved, DML is used more often in day-to-day operations than DDL, so you should have a strong grasp of DML. If you plan to grow into a database architect role in the near future, then having a good grasp of DDL will go a long way.

Another reason why you should learn SQL is that almost every modern relational database supports SQL. Although different databases might support different features and implement their own dialect of SQL, the basics of the language remain the same. If you know SQL, you can quickly adapt to MySQL, for example. At present, there are several predominant categories of database models: relational, object-relational, and NoSQL databases. All of these are meant for different purposes. Relational databases often adhere to SQL. Object-relational databases (ORDs) are also similar to relational databases. NoSQL, which stands for "not only SQL," is an alternative to traditional relational databases, useful for working with large sets of distributed data. NoSQL databases provide benefits such as availability, schema-free design, and horizontal scaling, but also have limitations such as performance, data retrieval constraints, and learning time. For beginners, it is advisable to first experiment with relational databases and learn SQL, gradually transitioning to a NoSQL DBMS.

#3 Know how to Extract, Transform, and Load various data types and sources

A database programmer should have a good working knowledge of ETL (Extract, Transform, Load) programming. ETL developers basically extract data from different databases, transform it, and then load the data into the data warehouse system.
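The DDL/DML split described above is easy to see in practice. Here is a minimal sketch using Python's built-in sqlite3 module; the table and column names are invented for illustration, and note that SQLite has no DCL statements (GRANT/REVOKE), so only DDL and DML appear here:

```python
import sqlite3

# In-memory database for illustration only.
conn = sqlite3.connect(":memory:")

# DDL: define the structure of the data.
conn.execute("""
    CREATE TABLE customers (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,
        email TEXT UNIQUE
    )
""")

# DML: manage the data itself.
conn.execute("INSERT INTO customers (name, email) VALUES (?, ?)",
             ("Ada Lovelace", "ada@example.com"))
conn.execute("UPDATE customers SET name = ? WHERE email = ?",
             ("Ada King", "ada@example.com"))

row = conn.execute("SELECT name FROM customers").fetchone()
print(row[0])  # Ada King
```

The same statements work, with minor dialect differences, against MySQL or PostgreSQL, which is exactly why learning the SQL basics transfers so well between databases.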
A data warehouse provides a common data repository that is essential for business needs. A database programmer should know how to tune existing packages, tables, and queries for faster ETL processing. They should conduct unit tests before applying any change to the existing ETL process. Since ETL takes data from different data sources (SQL Server, CSV, and flat files), a database developer should know how to deal with different data sources.

#4 Design and test database plans

Database programmers perform regular tests to identify ways to solve database usage concerns and malfunctions. As databases are usually found at the lowest level of the software architecture, testing is done in an extremely cautious fashion. This is because changes in the database schema affect many other software components. A database developer should make sure that when changing the database structure, they do not break existing applications and that they are using the new structures properly.

You should be proficient in unit testing your database. Unit tests are typically used to check if small units of code are functioning properly. For databases, unit testing can be difficult, so the easiest way to do it is by writing the tests as SQL scripts. You should also know about System Integration Testing (SIT), which is done on the complete system after the hardware and software modules of that system have been integrated. SIT validates the behavior of the system and ensures that modules in the system are functioning suitably.

#5 Secure your database

Data protection and security are essential for the continuity of business. Databases often store sensitive data, such as user information, email addresses, geographical addresses, and payment information. A robust security system to protect your database against any data breach is therefore necessary.
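The advice above is to write database unit tests as SQL scripts; as an illustrative alternative, here is a small Python sketch (the schema and names are invented, not from the article) that checks a UNIQUE constraint actually fires when it should:

```python
import sqlite3

def make_db():
    """Create a fresh schema for each test; hypothetical table for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE)"
    )
    return conn

def test_email_must_be_unique():
    """A second insert with the same email should be rejected by the database."""
    conn = make_db()
    conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")
    try:
        conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")
    except sqlite3.IntegrityError:
        return True   # the constraint did its job
    return False      # the duplicate slipped through

print(test_email_must_be_unique())  # True
```

Running each test against a fresh in-memory schema keeps the tests isolated, which is the hard part of database unit testing that the article alludes to.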
While a database architect is responsible for designing and implementing secure design options, a database admin must ensure that the right security and privacy policies are in place and are being observed. However, this does not absolve database programmers from adopting secure coding practices. Database programmers need to ensure that data integrity is maintained over time and is secure from unauthorized changes or theft. They especially need to be careful about table permissions, i.e. who can read and write to what tables. You should be aware of who is allowed to perform the four basic operations of INSERT, UPDATE, DELETE, and SELECT against which tables. Database programmers should also adopt authentication best practices depending on the infrastructure setup, the application's nature, the user's characteristics, and data sensitivity. If the database server is accessed from the outside world, it is beneficial to encrypt sessions using SSL certificates to avoid packet sniffing. Also, you should secure database servers that trust all localhost connections, as anyone who accesses the localhost can access the database server.

#6 Optimize your database performance

A database programmer should also be aware of how to optimize their database performance to achieve the best results. At the basic level, they should know how to rewrite SQL queries and maintain indexes. Other aspects of optimizing database performance include hardware configuration, network settings, and database configuration. Generally speaking, tuning database performance requires knowledge of the system's nature. Once the database server is configured, you should calculate the number of transactions per second (TPS) for the database server setup. Once the system is up and running, you should set up a monitoring system or log analysis, which periodically finds slow queries, the most time-consuming queries, and so on.
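Maintaining indexes, mentioned above, is easiest to appreciate by comparing query plans before and after an index exists. A small SQLite sketch (the table and index names are invented; the exact plan wording varies between SQLite versions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, i % 100) for i in range(1000)])

def plan(sql):
    """Return the query plan SQLite would use for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(str(r[-1]) for r in rows)

query = "SELECT * FROM orders WHERE customer_id = 42"
print(plan(query))  # without an index: a full table scan

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
print(plan(query))  # now: a search using idx_orders_customer
```

The same rewrite-and-inspect loop works with EXPLAIN in MySQL or EXPLAIN ANALYZE in PostgreSQL, which is where the monitoring of slow queries described above usually starts.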
#7 Develop your soft skills

Apart from the above technical skills, a database programmer needs to be comfortable communicating with developers, testers, and project managers while working on any software project. A keen eye for detail and critical thinking can often spot malfunctions and errors that may otherwise be overlooked. A database programmer should be able to quickly fix issues within the database and streamline the code. They should also possess quick thinking to prioritize tasks and meet deadlines effectively. Often database programmers are required to work on documentation and technical user guides, so strong writing and technical skills are a must.

Get started

If you want to get started with becoming a database programmer, Packt has a range of products. Here are some of the best:

PostgreSQL 11 Administration Cookbook
Learning PostgreSQL 11 - Third Edition
PostgreSQL 11 in 7 days [Video]
Using MySQL Databases With Python [Video]
Basic Relational Database Design [Video]

How to learn data science: from data mining to machine learning
How to ace a data science interview
5 barriers to learning and technology training for small software development teams


How to learn data science: from data mining to machine learning

Richard Gall
04 Sep 2019
6 min read
Data science is a field that's complex and diverse. If you're trying to learn data science and become a data scientist, it can be easy to fall down a rabbit hole of machine learning or data processing. To a certain extent, that's good. To be an effective data scientist you need to be curious. You need to be prepared to take on a range of different tasks and challenges. But that's not always that efficient: if you want to learn quickly and effectively, you need a clear structure - a curriculum - that you can follow. This post will show you what you need to learn and how to go about it.

Statistics

Statistics is arguably the cornerstone of data science. Nate Silver called data scientists "sexed up statisticians", a comment that was perhaps unfair but nevertheless contains a kernel of truth: data scientists are always working in the domain of statistics. Once you understand this, everything else you need to learn will follow easily. Machine learning, data manipulation, data visualization - these are all ultimately technological methods for performing statistical analysis really well.

Best Packt books and videos content for learning statistics

Statistics for Data Science
R Statistics Cookbook
Statistical Methods and Applied Mathematics in Data Science [Video]

Before you go any deeper into data science, it's critical that you gain a solid foundation in statistics.

Data mining and wrangling

This is an important element of data science that often gets overlooked with all the hype about machine learning. However, without effective data collection and cleaning, all your efforts elsewhere are going to be pointless at best. At worst they might even be misleading or problematic. Sometimes called data manipulation or data munging, it's really all about managing and cleaning data from different sources so it can be used for analytics projects. To do it well you need to have a clear sense of where you want to get to - do you need to restructure the data?
Sort or remove certain parts of a data set? Once you understand this, it's much easier to wrangle data effectively.

Data mining and wrangling tools

There are a number of different tools you can use for data wrangling. Python and R are the two key programming languages, and both have some useful tools for data mining and manipulation. Python in particular has a great range of tools for data mining and wrangling, such as pandas and NLTK (Natural Language Toolkit), but that isn't to say R isn't powerful in this domain. Other tools are available too - Weka and Apache Mahout, for example, are popular. Weka is written in Java, so it is a good option if you have experience with that programming language, while Mahout integrates well with the Hadoop ecosystem.

Data mining and data wrangling books and videos

If you need to learn data mining, wrangling, and manipulation, Packt has a range of products. Here are some of the best:

Data Wrangling with R
Data Wrangling with Python
Python Data Mining Quick Start Guide
Machine Learning for Data Mining

Machine learning and artificial intelligence

Although machine learning and artificial intelligence are huge trends in their own right, they are nevertheless closely aligned with data science. Indeed, you might even say that their prominence today has grown out of the excitement around data science that we first witnessed just under a decade ago. It's a data scientist's job to use machine learning and artificial intelligence in a way that can drive business value. That could, for example, be to recommend products or services to customers, to gain a better understanding of existing products, or even to better manage strategic and financial risks through predictive modelling. So, while we can see machine learning in a massive range of digital products and platforms - all of which require smart development and design - for it to work successfully, it needs to be supported by a capable and creative data scientist.
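To give a taste of the kind of cleaning the data wrangling section describes, here is a minimal sketch with pandas (named above as a key Python tool); the data frame, its column names, and the values are all invented for illustration:

```python
import pandas as pd

# Hypothetical messy extract: a missing value, a duplicate row, numbers as strings.
df = pd.DataFrame({
    "user":  ["alice", "bob", "bob", None],
    "spend": ["10.5", "3", "3", "7"],
})

clean = (
    df.dropna(subset=["user"])   # drop rows with no user
      .drop_duplicates()         # drop exact duplicate rows
      .assign(spend=lambda d: d["spend"].astype(float))  # fix the column type
)

print(clean["spend"].sum())  # 13.5
```

Even a toy pipeline like this shows why knowing where you want to get to matters: each step encodes a decision about what counts as bad data.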
Machine learning and artificial intelligence books for data scientists

Machine Learning Algorithms
Machine Learning with R - Third Edition
Machine Learning with Apache Spark Quick Start Guide
Machine Learning with TensorFlow 1.x
Keras Deep Learning Cookbook

Data visualization

A talented data scientist isn't just a great statistician and engineer, they're also a great communicator. This means so-called soft skills are highly valuable - the ability to communicate insights and ideas to key stakeholders is essential. But great communication isn't just about soft skills, it's also about data visualization. Data visualization is, at a fundamental level, about organizing and presenting data in a way that tells a story, clarifies a problem, or illustrates a solution. It's essential that you don't overlook this step. Indeed, spending time learning about effective data visualization can also help you to develop your soft skills. The principles behind storytelling and communication through visualization are, in truth, exactly the same when applied to other scenarios.

Data visualization tools

There is a huge range of data visualization tools available. As with machine learning, understanding the differences between them and working out which solution will work for you is actually an important part of the learning process. For that reason, don't be afraid to spend a little bit of time with a range of data visualization tools. Many of the most popular data visualization tools are paid-for products. Perhaps the best known of these is Tableau (which, incidentally, was bought by Salesforce earlier this year). Tableau and its competitors are very user friendly, which means the barrier to entry is pretty low. They allow you to create some pretty sophisticated data visualizations fairly easily. However, sticking to these tools is not only expensive, it can also limit your abilities.
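For a feel of what the programmatic alternative to Tableau looks like, a basic chart takes only a few lines with Matplotlib; the data here is invented (a made-up monthly active users series), and the file name is arbitrary:

```python
import matplotlib
matplotlib.use("Agg")  # render without a display, e.g. on a server
import matplotlib.pyplot as plt

# Hypothetical sample data for a made-up product.
months = ["Jan", "Feb", "Mar", "Apr"]
users = [120, 150, 185, 210]

fig, ax = plt.subplots()
ax.bar(months, users)
ax.set_title("Monthly active users")  # a chart should tell one clear story
ax.set_ylabel("Users")
fig.savefig("mau.png")
```

Libraries like this are free, scriptable, and version-controllable, which is exactly the flexibility the paragraph above says the paid tools can end up limiting.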
We'd recommend trying a number of different data visualization tools, such as Seaborn, D3.js, Matplotlib, and ggplot2.

Data visualization books and videos for data scientists

Applied Data Visualization with R and ggplot2
Tableau 2019.1 for Data Scientists [Video]
D3.js Data Visualization Projects [Video]
Tableau in 7 Steps [Video]
Data Visualization with Python

If you want to learn data science, just get started!

As we've seen, data science requires a number of very different skills and takes in a huge breadth of tools. That means that if you're going to be a data scientist, you need to be prepared to commit to learning forever: you're never going to reach a point where you know everything. While that might sound intimidating, it's important to have confidence. With a sense of direction and purpose, and a learning structure that works for you, it's possible to develop and build your data science capabilities in a way that could unlock new opportunities and act as the basis for some really exciting projects.


How to ace a data science interview

Richard Gall
02 Sep 2019
12 min read
So, you want to be a data scientist. It's a smart move: it's a job that's in high demand, can command a healthy salary, and can also be richly rewarding and engaging. But to get the job, you're going to have to pass a data science interview - something that's notoriously tough. One of the reasons for this is that data science is a field that is incredibly diverse. I mean that in two different ways: on the one hand, it's a role that demands a variety of different skills (being a good data scientist is about much more than just being good at math). But it's also diverse in the sense that data science will be done differently at every company. That means that every data science interview is going to be different. If you specialize too much in one area, you might well be severely limiting your opportunities.

There are plenty of articles out there that pretend to have all the answers to your next data science interview. And while these can be useful, they also treat job interviews like they're just exams you need to pass. They're not - you need to have a wide range of knowledge, but you also need to present yourself as a curious and critical thinker, and someone who is very good at communicating. You won't get a data science job by knowing all the answers. But you might get it by asking the right questions and talking in the right way. So, with all that in mind, here's what you need to do to ace your data science interview.

Know the basics of data science

This is obvious, but it's impossible to overstate. If you don't know the basics, there's no way you'll get the job - indeed, it's probably better for your sake that you don't get it! But what are these basics?

Basic data science interview questions

"What is data science?" This seems straightforward, but proving you've done some thinking about what the role actually involves demonstrates that you're thoughtful and self-aware - a sign of any good employee.

"What's the difference between supervised and unsupervised learning?"
Again, this is straightforward, but it will give the interviewer confidence that you understand the basics of machine learning algorithms.

"What is the bias-variance tradeoff? What are overfitting and underfitting?" Being able to explain these concepts in a clear and concise manner demonstrates your clarity of thought. It also shows that you have a strong awareness of the challenges of using machine learning and statistical systems.

If you're applying for a job as a data scientist you'll probably already know the answers to all of these. Just make sure you have a clear answer for each and that you can explain it concisely.

Know your algorithms

Knowing your algorithms is a really important part of any data science interview. However, it's important not to get hung up on the details. Trying to learn everything about every algorithm isn't only impossible, it's also not going to get you the job. What's important instead is demonstrating that you understand the differences between algorithms, and when to use one over another.

Data science interview questions about algorithms you might be asked

"When would you use a supervised machine learning algorithm?"

"Can you name some supervised machine learning algorithms and the differences between them?" (Supervised machine learning algorithms include Support Vector Machines, Naive Bayes, the K-nearest Neighbor algorithm, Regression, and Decision Trees.)

"When would you use an unsupervised machine learning algorithm?" (Unsupervised machine learning algorithms include K-Means, autoencoders, Generative Adversarial Networks, and Deep Belief Nets.)

"Name some unsupervised machine learning algorithms and how they're different from one another."

"What are classification algorithms?"

There are others, but try to focus on these as core areas.
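If you are asked to explain what makes an algorithm "supervised", a tiny worked example can help. Here is a minimal 1-nearest-neighbour classifier in plain Python - one of the K-nearest Neighbor family listed above, with k=1; the data points and labels are invented for illustration. It is supervised precisely because the training points carry labels:

```python
def predict_1nn(train, labels, point):
    """Return the label of the closest training point (squared Euclidean distance)."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    best = min(range(len(train)), key=lambda i: dist(train[i], point))
    return labels[best]

# Hypothetical labelled training data: two clusters of products.
train = [(1.0, 1.0), (1.2, 0.8), (8.0, 9.0), (9.0, 8.5)]
labels = ["cheap", "cheap", "premium", "premium"]

print(predict_1nn(train, labels, (1.1, 0.9)))  # cheap
print(predict_1nn(train, labels, (8.5, 9.2)))  # premium
```

An unsupervised algorithm such as K-Means would receive only the points, without the labels list, and have to discover the two clusters on its own - that contrast is usually all an interviewer is looking for.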
Remember, it’s also important to always talk about your experience - that’s just as useful, if not even more useful, than listing off the differences between different machine learning algorithms. Some of the questions you face in a data science interview might even be about how you use algorithms: "Tell me about a time you used an algorithm. Why did you decide to use it? Were there any other options?" "Tell me about a time you used an algorithm and it didn’t work how you expected it to. What did you do?" When talking about algorithms in a data science interview it’s useful to present them as tools for solving business problems. It can be tempting to talk about them as mathematical concepts, and although it’s good to show off your understanding, showing how algorithms help solve real-world business problems will be a big plus for your interviewer. Be confident talking about data sources and infrastructure challenges One of the biggest challenges for data scientists is dealing with incomplete or poor quality data. If that’s something you’ve faced - or even if it’s something you think you might face in the future - then make sure you talk about that. Data scientists aren’t always responsible for managing a data infrastructure (that will vary from company to company), but even if that isn’t in the job description, it’s likely that you’ll have to work with a data architect to make sure data is available and accurate enough to carry out data science projects. This means that understanding topics like data streaming, data lakes and data warehouses is very important in a data science interview. Again, remember that it’s important that you don’t get stuck on the details. You don’t need to recite everything you know, but instead talk about your experience or how you might approach problems in different ways. Data science interview questions you might get asked about using different data sources "How do you work with data from different sources?" 
"How have you tackled dirty or unreliable data in the past?" Data science interview questions you might get asked about infrastructure "Talk me through a data infrastructure challenge you’ve faced in the past" "What’s the difference between a data lake and a data warehouse? How would you approach each one differently?" Show that you have a robust understanding of data science tools You can’t get through a data science interview without demonstrating that you have knowledge and experience of data science tools. It’s likely that the job you’re applying for will mention a number of different skill requirements in the job description, so make sure you have a good knowledge of them all. Obviously, the best case scenario is that you know all the tools mentioned in the job description inside out - but this is unlikely. If you don’t know one - or more - make sure you understand what they’re for and how they work. The hiring manager probably won’t expect candidates to know everything, but they will expect them to be ready and willing to learn. If you can talk about a time you learned a new tool, that will give the interviewer a lot of confidence that you’re someone who can pick up knowledge and skills quickly. Show you can evaluate different tools and programming languages Another element here is being able to talk about the advantages and disadvantages of different tools. Why might you use R over Python? Which Python libraries should you use to solve a specific problem? And when should you just use Excel? Sometimes the interviewer might ask for your own personal preferences. Don’t be scared about giving your opinion - as long as you’ve got a considered explanation for why you hold the opinion that you do, you’re fine! Read next: Why is Python so good for AI and Machine Learning? 5 Python Experts Explain Data science interview questions about tools that you might be asked "What tools have you - or could you - use for data processing and cleaning? 
What are their benefits and disadvantages?" (These include tools such as Hadoop, Pentaho, Flink, Storm, Kafka.) "What tools do you think are best for data visualization and why?" (This includes tools like Tableau, PowerBI, D3.js, Infogram, Chartblocks - there are so many different products in this space that it’s important that you are able to talk about what you value most about data visualization tools.) "Do you prefer using Python or R? Are there times when you’d use one over another?" "Talk me through machine learning libraries. How do they compare to one another?" (This includes tools like TensorFlow, Keras, and PyTorch. If you don’t have any experience with them, make sure you’re aware of the differences, and talk about which you are most curious about learning.) Always focus on business goals and results This sounds obvious, but it’s so easy to forget. This is especially true if you’re a data geek that loves to talk about statistical models and machine learning. To combat this, make sure you’re very clear on how your experience was tied to business goals. Take some time to think about why you were doing what you were doing. What were you trying to find out? What metrics were you trying to drive? Interpersonal and communication skills Another element to this is talking about your interpersonal skills and your ability to work with a range of different stakeholders. Think carefully about how you worked alongside other teams, how you went about capturing requirements and building solutions for them. Think also about how you managed - or would manage - expectations. It’s well known that business leaders can expect data to be a silver bullet when it comes to results, so how do you make sure that people stay realistic? Show off your data science portfolio A good way of showing your business acumen as a data scientist is to build a portfolio of work. 
Portfolios are typically viewed as something for creative professionals, but they’re becoming increasingly popular in the tech industry as competition for roles gets tougher. This post explains everything you need to build a great data science portfolio. Broadly, the most important thing is that it demonstrates how you have added value to an organization. This could be:

- Insights you’ve shared in reports with management
- Customer-facing applications you’ve built that rely on data
- Internal dashboards and applications you’ve built

Bringing a portfolio to an interview can give you a solid foundation on which you can answer questions. But remember - you might be asked questions about your work, so make sure you have an answer prepared! Data science interview questions about business performance "Talk about a time you have worked across different teams." "How do you manage stakeholder expectations?" "What do you think are the most important elements in communicating data insights to management?" If you can talk fluently about how your work impacts business performance and how you worked alongside others in non-technical positions, you will give yourself a good chance of landing the job! Show that you understand ethical and privacy issues in data science This might seem like a superfluous point, but given the events of recent years - like the Cambridge Analytica scandal - ethics has become a big topic of conversation. Employers will expect prospective data scientists to have an awareness of some of these problems and of how to go about mitigating them. To some extent, this is an extension of the previous point. Showing you are aware of ethical issues, such as privacy and discrimination, proves that you are fully engaged with the needs and risks a business might face. It also underlines that you are aware of the consequences and potential impact of data science activities on customers - what your work does in the real world. 
Read next: Introducing Deon, a tool for data scientists to add an ethics checklist Data science interview questions about ethics and privacy "What are some of the ethical issues around machine learning and artificial intelligence?" "How can you mitigate any of these issues? What steps would you take?" "Has GDPR impacted the way you do data science?"  "What are some other privacy implications for data scientists?" "How do you understand explainability and interpretability in machine learning?" Ethics is a topic that’s easy to overlook but it’s essential for every data scientist. To get a good grasp of the issues it’s worth investigating more technical content on things like machine learning interpretability, as well as following news and commentary around emergent issues in artificial intelligence. Conclusion: Don’t treat a data science interview like an exam Data science is a complex and multi-faceted field. That can make data science interviews feel like a serious test of your knowledge - and it can be tempting to revise like you would for an exam. But, as we’ve seen, that’s foolish. To ace a data science interview you can’t just recite information and facts. You need to talk clearly and confidently about your experience and demonstrate your drive and curiosity. That doesn’t mean you shouldn’t make sure you know the basics. But rather than getting too hung up on definitions and statistical details, it’s a better use of your time to consider how you have performed your roles in the past, and what you might do in the future. A thoughtful, curious data scientist is immensely valuable. Show your interviewer that you are one.

Richard Gall
02 Sep 2019
8 min read

Data science vs. machine learning: understanding the difference and what it means today

One of the things that I really love about the tech industry is how often different terms - buzzwords especially - can cause confusion. It isn’t hard to see this in the wild. Quora is replete with confused people asking about the difference between a ‘developer’ and an ‘engineer’ and how ‘infrastructure’ is different from ‘architecture'. One of the biggest points of confusion is the difference between data science and machine learning. Both terms refer to different but related domains - given their popularity it isn’t hard to see how some people might be a little perplexed. This might seem like a purely semantic problem, but in the context of people’s careers, as they make decisions about the resources they use and the courses they pay for, the distinction becomes much more important. Indeed, it can be perplexing for developers thinking about their career - with ‘machine learning engineer’ starting to appear across job boards, it’s not always clear where that role ends and ‘data scientist’ begins. Tl;dr: To put it simply - and if you can’t be bothered to read further - data science is a discipline or job role that’s all about answering business questions through data. Machine learning, meanwhile, is a technique that can be used to analyze or organize data. So, data scientists might well use machine learning to find something out, but it would only be one aspect of their job. But what are the implications of this distinction between machine learning and data science? What can the relationship between the two terms tell us about how technology trends evolve? And how can it help us better understand them both? Read next: 9 data science myths debunked What’s causing confusion about the difference between machine learning and data science? The data science v machine learning confusion comes from the fact that both terms have a significant grip on the collective imagination of the tech and business world. 
Back in 2012 the Harvard Business Review declared data scientist to be the ‘sexiest job of the 21st century’. This was before the machine learning and artificial intelligence boom, but it’s the point we need to go back to in order to understand how data has shaped the tech industry as we know it today. Data science v machine learning on Google Trends Take a look at this Google Trends graph: Both terms broadly received a similar level of interest. ‘Machine learning’ was slightly higher throughout the noughties and a larger gap has emerged more recently. However, despite that, it’s worth looking at the period around 2014 when ‘data science’ managed to eclipse machine learning. Today, that feels remarkable given how machine learning is a term that’s extended out into popular consciousness. It suggests that the HBR article was incredibly timely, identifying the emergence of the field. But more importantly, it’s worth noting that this spike for ‘data science’ comes at the time that both terms surge in popularity. So, although machine learning eventually wins out, ‘data science’ was becoming particularly important at a time when these twin trends were starting to grow. This is interesting, and it’s contrary to what I’d expect. Typically, I’d imagine the more technical term to take precedence over a more conceptual field: a technical trend emerges first, and a more abstract concept gains traction afterwards. But here the concept - the discipline - spikes just at the point before machine learning can properly take off. This suggests that the evolution and growth of machine learning begins with the foundations of data science. This is important. It highlights that the obsession with data science - which might well have seemed somewhat self-indulgent - was, in fact, an integral step for business to properly make sense of what the ‘big data revolution’ (a phrase that sounds eighty years old) meant in practice. 
Insofar as ‘data science’ is a term that really just refers to a role that’s performed, its growth was ultimately evidence of a space being carved out inside modern businesses that gave a domain expert the freedom to explore and invent in the service of business objectives. If that was the baseline, then the continued rise of machine learning feels inevitable. Machine learning started out contained in computer science departments in academia, then spread into business thanks to the emergence of the data scientist job role; from there, we started to see a whole suite of tools and use cases that were about much more than analytics and insight. Machine learning became a practical tool that had practical applications everywhere. From cybersecurity to mobile applications, from marketing to accounting, machine learning couldn’t be contained within the data science discipline. This wasn’t just a conceptual point - practically speaking, a data scientist simply couldn’t provide support to all the different ways in which business functions wanted to use machine learning. So, the confusion around the relationship between machine learning and data science stems from the fact that the two trends go hand in hand - or at least they used to. To properly understand how they’re different, let’s look at what a data scientist actually does. Read next: Data science for non-techies: How I got started (Part 1) What is data science, exactly? I know you’re not supposed to use Wikipedia as a reference, but the opening sentence in the entry for ‘data science’ is instructive: “Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.” The word that deserves your attention is multi-disciplinary, as this underlines what makes data science unique and why it stands outside of the more specific taxonomy of machine learning terms. 
Essentially, it’s a human activity as much as a technical one - it’s about arranging, organizing, interpreting, and communicating data. To a certain extent it shares a common thread of DNA with statistics. But although Nate Silver said that ‘data scientist’ was “a sexed up term for statistician”, I think there are some important distinctions. To do data science well you need to be deeply engaged with how your work integrates with the wider business strategy and processes. The term ‘statistics’ - like ‘machine learning’ - doesn’t quite do this. Indeed, to a certain extent this has made data science a challenging field to work in. It isn’t hard to find evidence that data scientists are trying to leave their jobs, frustrated with how their roles are being used and how they integrate into existing organisational structures. How do data scientists use machine learning? As a data scientist, your job is to answer questions. These are questions like:

- What might happen if we change the price of a product in this way?
- What do our customers think of our products?
- How often do customers purchase products?
- How are customers using our products?
- How can we understand the existing market? How might we tackle it?
- Where could we improve efficiencies in our processes?

That’s just a small set. The types of questions data scientists will be tackling will vary depending on the industry, their company - everything. Every data science job is unique. But whatever questions data scientists are asking, it’s likely that at some point they’ll be using machine learning. Whether it’s to analyze customer sentiment (grouping and sorting) or to predict outcomes, a data scientist will have a number of algorithms up their proverbial sleeves ready to tackle whatever the business throws at them. Machine learning beyond data science The machine learning revolution might have started in data science, but it has rapidly expanded far beyond that strict discipline. 
Indeed, one of the reasons that some people are confused about the relationship between the two concepts is because machine learning is today touching just about everything, like water spilling out of its neat data science container. Machine learning is for everyone Machine learning is being used in everything from mobile apps to cybersecurity. And although data scientists might sometimes play a part in these domains, we’re also seeing subject-specific developers and engineers taking more responsibility for how machine learning is used. One of the reasons for this is, as I mentioned earlier, the fact that a data scientist - or even a couple of them - can’t do all the things that a business might want them to when it comes to machine learning. But another is the fact that machine learning is getting easier. You no longer need to be an expert to employ machine learning algorithms - instead, you need to have the confidence and foundational knowledge to use existing machine learning tools and products. This ‘productization’ of machine learning is arguably what’s having the biggest impact on how we understand the topic. It’s even shrinking data science, making it a more specific role. That might sound like data science is less important today than it was in 2014, but it can only be a good thing for data scientists - it means they’re no longer being asked to spread themselves so thinly. So, if you've been googling 'data science v machine learning', you now know the answer. The two terms are distinct but they both come out of the 'big data revolution' which we're still living through. Both trends and terms are likely to evolve in the future, but they're certainly not going to disappear - as the data at our disposal grows, making effective use of it is only going to become more important.
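The pricing question listed above is a good example of the kind of thing a data scientist answers with a simple model. Here is a hedged sketch of a first pass at it with an ordinary least-squares line fit - the price points and sales figures are invented purely for illustration:

```python
# Fit a straight line to historical (price, units sold) pairs, then use it
# to project sales at a new price point. Toy numbers, not real data.
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

prices = [10, 12, 14, 16, 18]       # price points tried historically
units  = [200, 180, 160, 140, 120]  # units sold at each price

a, b = fit_line(prices, units)
print(round(a + b * 20))  # projected units at a hypothetical price of 20 -> 100
```

The statistics here are elementary; the data science part is framing the business question, knowing the data's limits, and communicating what a projection like this does and doesn't tell you.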
Fatema Patrawala
21 Aug 2019
6 min read

Bitbucket to no longer support Mercurial, users must migrate to Git by May 2020

Yesterday marked the end of an era for Mercurial users, as Bitbucket announced it will no longer support Mercurial repositories after May 2020. Bitbucket, owned by Atlassian, is a web-based version control repository hosting service for source code and development projects. It has supported Mercurial since its launch in 2008 and Git since October 2011. Now, after almost ten years of sharing its journey with Mercurial, the Bitbucket team has decided to remove Mercurial support from Bitbucket Cloud and its API. The official announcement reads, “Mercurial features and repositories will be officially removed from Bitbucket and its API on June 1, 2020.” The Bitbucket team also communicated the timeline for the sunsetting of the Mercurial functionality. After February 1, 2020 users will no longer be able to create new Mercurial repositories. And post June 1, 2020 users will not be able to use Mercurial features in Bitbucket or via its API, and all Mercurial repositories will be removed. Additionally, all current Mercurial functionality in Bitbucket will be available through May 31, 2020. The team said the decision was not an easy one for them and that Mercurial held a special place in their heart. But according to a Stack Overflow Developer Survey, almost 90% of developers use Git, while Mercurial is the least popular version control system with only about 3% developer adoption. Apart from this, Mercurial usage on Bitbucket saw a steady decline, and the percentage of new Bitbucket users choosing Mercurial fell to less than 1%. Hence they decided on removing the Mercurial repos. How can users migrate and export their Mercurial repos The Bitbucket team recommends users migrate their existing Mercurial repos to Git. They have also extended support for migration, and kept the available options open for discussion in their dedicated Community thread. Users can discuss conversion tools and migration tips, and also offer troubleshooting help. 
If users prefer to continue using the Mercurial system, there are a number of free and paid Mercurial hosting services for them. The Bitbucket team has also created a Git tutorial that covers everything from the basics of creating pull requests to rebasing and Git hooks. Community shows anger and sadness over decision to discontinue Mercurial support There is an outrage among the Mercurial users as they are extremely unhappy and sad with this decision by Bitbucket. They have expressed anger not only on one platform but on multiple forums and community discussions. Users feel that Bitbucket’s decision to stop offering Mercurial support is bad, but the decision to also delete the repos is evil. On Hacker News, users speculated that this decision was influenced by potential to market rather than based on technically superior architecture and ease of use. They feel GitHub has successfully marketed Git and that's how both have become synonymous to the developer community. One of them comments, “It's very sad to see bitbucket dropping mercurial support. Now only Facebook and volunteers are keeping mercurial alive. Sometimes technically better architecture and user interface lose to a non user friendly hard solutions due to inertia of mass adoption. So a lesson in Software development is similar to betamax and VHS, so marketing is still a winner over technically superior architecture and ease of use. GitHub successfully marketed git, so git and GitHub are synonymous for most developers. Now majority of open source projects are reliant on a single proprietary solution Github by Microsoft, for managing code and project. Can understand the difficulty of bitbucket, when Python language itself moved out of mercurial due to the same inertia. Hopefully gitlab can come out with mercurial support to migrate projects using it from bitbucket.” Another user comments that Mercurial support was the only reason for him to use Bitbucket when GitHub is miles ahead of Bitbucket. 
Now that it is dropping Mercurial support too, the user argues, Bitbucket will not survive long. The comment reads, “Mercurial support was the one reason for me to still use Bitbucket: there is no other Bitbucket feature I can think of that Github doesn't already have, while Github's community is miles ahead since everyone and their dog is already there. More importantly, Bitbucket leaves the migration to you (if I read the article correctly). Once I download my repo and convert it to git, why would I stay with the company that just made me go through an annoying (and often painful) process, when I can migrate to Github with the exact same command? And why isn't there a "migrate this repo to git" button right there? I want to believe that Bitbucket has smart people and that this choice is a good one. But I'm with you there - to me, this definitely looks like Bitbucket will die.” On Reddit, programming folks see this as a big change from Bitbucket, as it was the major Mercurial hosting provider. They also feel Bitbucket announced this on pretty short notice, and that they need more time for migration. Apart from the developer community forums, users have expressed displeasure on the Atlassian community blog as well. A team of scientists commented, “Let's get this straight : Bitbucket (offering hosting support for Mercurial projects) was acquired by Atlassian in September 2010. Nine years later Atlassian decides to drop Mercurial support and delete all Mercurial repositories. Atlassian, I hate you :-) The image you have for me is that of a harmful predator. We are a team of scientists working in a university. We don't have computer scientists, we managed to use a version control simple as Mercurial, and it was a hard work to make all scientists in our team to use a version control system (even as simple as Mercurial). We don't have the time nor the energy to switch to another version control system. But we will, forced and obliged. 
I really don't want to check out Github or something else to migrate our projects there, but we will, forced and obliged.” Atlassian Bitbucket, GitHub, and GitLab take collective steps against the Git ransomware attack Attackers wiped many GitHub, GitLab, and Bitbucket repos with ‘compromised’ valid credentials leaving behind a ransom note BitBucket goes down for over an hour

Sugandha Lahoti
21 Aug 2019
3 min read

Google open sources an on-device, real-time hand gesture recognition algorithm built with MediaPipe

Google researchers have unveiled a new real-time hand tracking algorithm that could be a breakthrough for people communicating via sign language. Their algorithm uses machine learning to compute 3D keypoints of a hand from a video frame. This research is implemented in MediaPipe, an open-source cross-platform framework for building multimodal (e.g. video, audio, or any time series data) applied ML pipelines. What is interesting is that the 3D hand perception can be viewed in real-time on a mobile phone. How does real-time hand perception and gesture recognition work with MediaPipe? The algorithm is built using the MediaPipe framework. Within this framework, the pipeline is built as a directed graph of modular components. The pipeline employs three different models: a palm detector model, a hand landmark detector model and a gesture recognizer. The palm detector operates on full images and outputs an oriented bounding box. For this, the researchers employ a single-shot detector model called BlazePalm, which achieves an average precision of 95.7% in palm detection. Next, the hand landmark model takes the cropped image defined by the palm detector and returns 3D hand keypoints. For detecting keypoints on the palm images, researchers manually annotated around 30K real-world images with 21 coordinates. They also generated a synthetic dataset to improve the robustness of the hand landmark detection model. The gesture recognizer then classifies the previously computed keypoint configuration into a discrete set of gestures. The algorithm determines the state of each finger, e.g. bent or straight, by the accumulated angles of joints. The existing pipeline supports counting gestures from multiple cultures, e.g. American, European, and Chinese, and various hand signs including “Thumb up”, closed fist, “OK”, “Rock”, and “Spiderman”. They also trained their models to work in a wide variety of lighting situations and with a diverse range of skin tones. 
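The finger-state step described above can be sketched in a few lines. This is an illustrative reconstruction of the idea (accumulated joint angles), not Google's actual code - the keypoints and the bend threshold are made up:

```python
# Classify a finger as bent or straight from its 3D keypoints by summing how
# far each joint deviates from a straight line. Illustrative values only.
import math

def angle_at(p_prev, p_joint, p_next):
    """Angle (radians) between the two bone segments meeting at p_joint."""
    v1 = [a - b for a, b in zip(p_prev, p_joint)]
    v2 = [a - b for a, b in zip(p_next, p_joint)]
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(a * a for a in v2))
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def finger_state(keypoints, bend_threshold=0.8):
    """Accumulate each joint's deviation from straight (pi radians)."""
    bend = sum(
        math.pi - angle_at(keypoints[i - 1], keypoints[i], keypoints[i + 1])
        for i in range(1, len(keypoints) - 1)
    )
    return "bent" if bend > bend_threshold else "straight"

straight_finger = [(0, 0, 0), (0, 1, 0), (0, 2, 0), (0, 3, 0)]
curled_finger   = [(0, 0, 0), (0, 1, 0), (1, 1.5, 0), (1, 0.5, 0)]
print(finger_state(straight_finger))  # straight
print(finger_state(curled_finger))    # bent
```

Mapping the resulting set of finger states to a discrete gesture (fist, "OK", thumbs up, and so on) is then a simple lookup over the per-finger states.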
Gesture recognition - Source: Google blog With MediaPipe, the researchers built their pipeline as a directed graph of modular components, called Calculators. Individual calculators for cropping, rendering, and neural network computation can be performed exclusively on the GPU, and they employed TFLite GPU inference on most modern phones. The researchers are open sourcing the hand tracking and gesture recognition pipeline in the MediaPipe framework along with the source code. The researchers Valentin Bazarevsky and Fan Zhang write in a blog post, “Whereas current state-of-the-art approaches rely primarily on powerful desktop environments for inference, our method achieves real-time performance on a mobile phone, and even scales to multiple hands. We hope that providing this hand perception functionality to the wider research and development community will result in an emergence of creative use cases, stimulating new applications and new research avenues.” People commended the fact that this algorithm can run on mobile devices and is useful for people who communicate via sign language. https://twitter.com/SOdaibo/status/1163577788764495872 https://twitter.com/anshelsag/status/1163597036442148866 https://twitter.com/JonCorey1/status/1163997895835693056 Microsoft Azure VP demonstrates Holoportation, a reconstructed transmittable 3D technology Terrifyingly realistic Deepfake video of Bill Hader transforming into Tom Cruise is going viral on YouTube. Google News Initiative partners with Google AI to help ‘deep fake’ audio detection research

Sugandha Lahoti
20 Aug 2019
5 min read

Twitter and Facebook removed accounts of Chinese state-run media agencies aimed at undermining Hong Kong protests

Update August 23, 2019: After Twitter and Facebook, Google has shut down 210 YouTube channels that were tied to misinformation about the Hong Kong protests. The article has been updated accordingly. Chinese state-run media agencies have been buying advertisements and promoted tweets on Twitter and Facebook to portray Hong Kong protestors and their pro-democracy demonstrations as violent. These ads, reported by Pinboard’s Twitter account, were circulated by state-run news agency Xinhua, describing the protesters as "escalating violence" and calling for "order to be restored." In reality, the Hong Kong protests have been described as completely peaceful marches. Pinboard warned and criticized Twitter about these tweets and asked for their takedown. Though Twitter and Facebook are banned in China, the Chinese state-run media runs several English-language accounts to present its views to the outside world. https://twitter.com/pinboard/status/1162711159000055808 https://twitter.com/Pinboard/status/1163072157166886913 Twitter bans 936 accounts managed by the Chinese state Following this revelation, in a blog post yesterday, Twitter said that they had discovered a “significant state-backed information operation focused on the situation in Hong Kong, specifically the protest movement”. They identified 936 accounts that were undermining “the legitimacy and political positions of the protest movement on the ground.” They found a larger, spammy network of approximately 200,000 accounts which represented the most active portions of this campaign. These were suspended for a range of violations of their platform manipulation policies. These accounts were able to access Twitter through VPNs and over a "specific set of unblocked IP addresses" from within China. “Covert, manipulative behaviors have no place on our service — they violate the fundamental principles on which our company is built,” said Twitter. 
Twitter bans ads from Chinese state-run media

Twitter also banned advertising from Chinese state-run news media entities worldwide, declaring that affected accounts will be free to continue using Twitter to engage in public conversation, but not its advertising products. The policy applies to news media entities that are either financially or editorially controlled by the state, Twitter said. Affected entities will be notified directly and given 30 days to off-board from advertising products; no new campaigns will be allowed. However, Pinboard argues that 30 days is too long, and that Twitter should suspend Xinhua's ad account immediately.

https://twitter.com/Pinboard/status/1163676410998689793

Pinboard also calls on Twitter to disclose how much money it took from Xinhua, how many ads it ran for the agency since the start of the Hong Kong protests in June, and how those ads were targeted.

Facebook blocks Chinese accounts engaged in inauthentic behavior

Following a tip shared by Twitter, Facebook also removed seven Pages, three Groups and five Facebook accounts involved in coordinated inauthentic behavior as part of a small network that originated in China and focused on Hong Kong. However, unlike Twitter, Facebook did not announce any policy changes in response to the discovery. YouTube was also notably absent from the fight against Chinese misinformation propaganda.

https://twitter.com/Pinboard/status/1163694701716766720

However, on August 22, YouTube axed 210 channels found to be spreading misinformation about the Hong Kong protests. "Earlier this week, as part of our ongoing efforts to combat coordinated influence operations, we disabled 210 channels on YouTube when we discovered channels in this network behaved in a coordinated manner while uploading videos related to the ongoing protests in Hong Kong," Shane Huntley, director of software engineering for Google Security's Threat Analysis Group, said in a blog post.
“We found use of VPNs and other methods to disguise the origin of these accounts and other activity commonly associated with coordinated influence operations.”

Kyle Bass, Chief Investment Officer at Hayman Capital Management, called on all social media outlets to ban Chinese state-run propaganda sources. He tweeted, “Twitter, Facebook, and YouTube should BAN all State-backed propaganda sources in China. It’s clear that these 200,000 accounts were set up by the “state” of China. Why allow Xinhua, global times, china daily, or any others to continue to act? #BANthemALL”

Public acknowledges Facebook and Twitter’s role in exposing Chinese state media

Experts and journalists appreciated the role social media played in exposing those responsible and the platforms' responses to state interference. Bethany Allen-Ebrahimian, President of the International China Journalist Association, called it huge news. “This is the first time that US social media companies are openly accusing the Chinese government of running Russian-style disinformation campaigns aimed at sowing discord,” she tweeted. She added, “We’ve been seeing hints that China has begun to learn from Russia’s MO, such as in Taiwan and Cambodia. But for Twitter and Facebook to come out and explicitly accuse the Chinese govt of a disinformation campaign is another whole level entirely.”

Adam Schiff, Representative (D-CA 28th District), tweeted, “Twitter and Facebook announced they found and removed a large network of Chinese government-backed accounts spreading disinformation about the protests in Hong Kong. This is just one example of how authoritarian regimes use social media to manipulate people, at home and abroad.” He added, “Social media platforms and the U.S. government must continue to identify and combat state-backed information operations online, whether they’re aimed at disrupting our elections or undermining peaceful protesters who seek freedom and democracy.”

Social media platforms took an appreciable step against Chinese state-run media actors attempting to manipulate their platforms to discredit grassroots organizing in Hong Kong. It will be interesting to see whether they continue to protect individual freedoms and provide a safe, transparent platform if state actors from countries where they have huge audiences, such as India or the US, adopt similar tactics to suppress or manipulate the public or target movements.

Facebook bans six toxic extremist accounts and a conspiracy theory organization
Cloudflare terminates services to 8chan following yet another set of mass shootings in the US
YouTube’s ban on “instructional hacking and phishing” videos receives backlash from the infosec community

Terrifyingly realistic Deepfake video of Bill Hader transforming into Tom Cruise is going viral on YouTube

Sugandha Lahoti
14 Aug 2019
4 min read
Deepfakes are becoming frighteningly, indistinguishably real. A YouTube clip of Bill Hader in conversation with David Letterman on his late-night show in 2008 is going viral: Hader’s face subtly shifts into Cruise’s as Hader does his impression. The clip has been viewed over 3 million times and was uploaded by Ctrl Shift Face (a Slovakian creator who goes by the name of Tom), who has made other entertaining videos using Deepfake technology. For the unaware, a Deepfake uses artificial intelligence and deep neural networks to alter audio or video and pass it off as authentic content.

https://www.youtube.com/watch?v=VWrhRBb-1Ig

Deepfakes are problematic because they make it hard to differentiate fake videos or images from real ones. That opens the door to using deepfakes for harassment and illegal activities. The most common uses of deepfakes are revenge porn, political abuse, and fake celebrity videos such as this one. The top comments on the clip express the dangers of realistic AI manipulation:

“The fade between faces is absolutely unnoticeable and it's flipping creepy. Nice job!”
“I’m always amazed with new technology, but this is scary.”
“Ok, so video evidence in a court of law just lost all credibility”

https://twitter.com/TheMuleFactor/status/1160925752004624387

Deepfakes can also be used as a weapon of misinformation, maliciously hoaxing governments and populations and stoking internal conflict. Gavin Sheridan, CEO of Vizlegal, also tweeted the clip: “Imagine when this is all properly weaponized on top of already fractured and extreme online ecosystems and people stop believing their eyes and ears.”

He also talked about the future impact: “True videos will be called fake videos, fake videos will be called true videos. People steered towards calling news outlets "fake", will stop believing their own eyes. People who want to believe their own version of reality will have all the videos they need to support it,” he tweeted. He also wondered whether we would need A-list movie actors at all in the future, and whether viewers could choose which actor plays which role: “Will we need A-list actors in the future when we could just superimpose their faces onto the faces of other actors? Would we know the difference? And could we not choose at the start of a movie which actors we want to play which roles?”

The past year has seen accelerated growth in the use of deepfakes. In June, a fake video of Mark Zuckerberg was posted on Instagram under the username bill_posters_uk. In the video, Zuckerberg appears to give a threatening speech about the power of Facebook. Facebook had received strong criticism for promoting fake videos on its platform when, in May, the company refused to remove a doctored video of senior politician Nancy Pelosi. Samsung researchers also released a deepfake that could animate faces with just your voice and a picture using temporal GANs. Following this, the House Intelligence Committee held a hearing to examine the public risks posed by “deepfake” videos.

Tom, the creator of the viral video, told The Guardian that he doesn't see deepfake videos as the end of the world and hopes his deepfakes will raise public awareness of the technology's potential for misuse. “It’s an arms race; someone is creating deepfakes, someone else is working on other technologies that can detect deepfakes. I don’t really see it as the end of the world like most people do. People need to learn to be more critical. The general public are aware that photos could be Photoshopped, but they have no idea that this could be done with video.” Ctrl Shift Face is also on Patreon, offering bonus materials, behind-the-scenes footage, deleted scenes, and early access to videos for supporters.

Now there is a Deepfake that can animate your face with just your voice and a picture.
Mark Zuckerberg just became the target of the world’s first high profile white hat deepfake op.
Worried about Deepfakes? Check out the new algorithm that manipulates talking-head videos by altering the transcripts.

How Data Privacy awareness is changing how companies do business

Guest Contributor
09 Aug 2019
7 min read
Not so long ago, data privacy was a relatively small part of business operations at many companies. They paid attention to it to a minor degree, but it was not a focal point or prime area of concern. That's all changing now, as businesses recognize that failing to take privacy seriously harms the bottom line. That realization is changing how they operate and engage with customers. One reason for this change is the General Data Protection Regulation (GDPR), which now affects all European Union companies and those that do business with EU residents. Some analysts viewed regulators as slow to begin enforcing the GDPR with fines, but the fines imposed in 2019 total more than $100 million. In 2018, Twitter and Nielsen cited the GDPR as a reason for their falling share prices.

No Single Way to Demonstrate Data Privacy Awareness

One essential thing for companies to keep in mind is that there is no all-encompassing way to show customers they emphasize data security. Although security and privacy are distinct, they are closely related and impact each other. What privacy awareness means differs depending on how a business operates. For example, a business might collect data from customers and feed it back to them through an analytics platform. In this case, showing data privacy awareness might mean publishing a policy stating that the company will never sell a person's information to others. For an e-commerce company, emphasizing a commitment to keeping customer information secure might mean going into detail about how it protects sensitive data such as credit card numbers. It might also cover the internal strategies used to keep customer information as safe as possible from cybercriminals. One universal aspect of data privacy awareness is that it makes good business sense. The public is now much more aware of data privacy issues than in past years, largely due to the high-profile breaches that capture the headlines.

Lost customers, gigantic fines and damaged reputations after Data breaches and misuse

When companies don't invest in data privacy measures, they can be victimized by severe data breaches, and the ramifications are often substantial. A 2019 study from PCI Pal surveyed customers in the United States and the United Kingdom to determine how their perceptions and spending habits changed following data breaches. It found that 41% of United Kingdom customers and 21% of U.S. customers stop spending money at a business forever if it suffers a data breach. The more common response, the poll revealed, is for consumers to stop spending money at breached businesses for several months afterward: in total, 62% of Americans and 44% of Brits said they'd take that approach. However, that's not the only potential hit to a company's profitability. As the Facebook example mentioned earlier indicates, there can also be massive fines. Two other recent examples involve the British Airways and Marriott Hotels breaches. A data regulatory body in the United Kingdom imposed the largest-ever data breach fine on British Airways after a 2018 hack, with the penalty totaling £183 million (more than $228 million). Then, that same authority gave Marriott Hotels the equivalent of a $125 million fine for its incident, alleging inadequate cybersecurity and data privacy due diligence. These enormous fines don't only happen in the United Kingdom. Besides its recent decision on Facebook, the U.S. Federal Trade Commission (FTC) reached a settlement with Equifax that required the company to pay $700 million after its now-infamous data breach. It's easy to see why losing customers after such issues could make such substantial fines even more painful for the companies that have to pay them. The FTC also investigated Facebook's Cambridge Analytica scandal and handed the company a $5 billion fine for failing to adequately protect customer data, the largest fine the FTC has imposed.

Problems also occur when companies misuse data. Take the class-action lawsuit filed against AT&T. The telecom giant and a couple of data aggregation enterprises allegedly permitted third-party companies to access individuals' real-time locations via mobile phone data. Those companies didn't check first whether the customers allowed such access. Such news can bring irreparable reputational damage and make people hesitate to do business.

Expecting customers to read privacy policies is not sufficient

Companies rely on both back-end and customer-facing strategies to meet their data security goals and earn customer trust. Some businesses go beyond the norm by publishing sections on their websites that detail how their infrastructure supports data privacy. They discuss the implementation of things like a multi-layered data access authorization framework, physical access controls for server rooms, and data encryption at rest and in transit. But one of the more prominent customer-facing declarations of a company's commitment to keeping data secure is the privacy policy, now a fixture of modern websites. Companies cannot bypass publishing their privacy policies, of course. However, most people don't take the time to read those documents. An Axios/SurveyMonkey poll spotlighted a disconnect between respondents' beliefs and actions. It found that although 87% of them felt it was either somewhat or very important to understand a company's privacy policy before signing up for something, 56% always or usually agree to it without reading it. Further research on the subject by Varonis found that some privacy policies take nearly half an hour to read, and that the required reading level became more advanced after the GDPR came into effect. Together, these studies illustrate that companies need to go beyond expecting customers to read their privacy policies. Moreover, they should work hard to make those policies shorter and easier for people to understand.

Most people want companies to take a stand for Data Privacy

A study of 1,000 people conducted in the United Kingdom supported an earlier finding from Gemalto that people hold the companies storing their data responsible for maintaining its security. It concluded that customers felt it was "highly important" for businesses to take a stand for information security and privacy, and that 53% expected firms to do so. Moreover, the results of a CIGI-Ipsos worldwide survey said that 53% of those polled were more concerned about online privacy than a year earlier. Additionally, 49% said their rising distrust of the internet made them provide less information online. Companies must show they care about data privacy and work that aspect into their business strategies. Otherwise, they could find that customers leave them in favor of more privacy-centric organizations. To get an idea of what can happen when companies have data privacy blunders, people only need to look at how Facebook users responded in the Cambridge Analytica aftermath. Statistics published by the Pew Research Center showed that 54% of adults changed their privacy settings in the past year, while approximately a quarter stopped using the site. After the news broke about Facebook and Cambridge Analytica, many media outlets reminded people that they could download all the data Facebook had about them. The Pew Research Center found that although only 9% of its respondents took that step, 47% of the people in that group removed the app from their phones.

Data Privacy is a Top-of-Mind concern

The studies and examples mentioned here strongly suggest consumers are no longer willing to accept the possible wrongful treatment of their data. They increasingly hold companies accountable and show little forgiveness when companies don't meet their privacy expectations. The most forward-thinking companies see this change and respond accordingly. Those that choose inaction instead are at risk of losing out. Individuals understand that companies value their data, but they aren't willing to part with it freely unless companies convey trustworthiness first.

Author Bio

Kayla Matthews writes about big data, cybersecurity, and technology. You can find her work on The Week, Information Age, KDnuggets and CloudTweaks, or over at ProductivityBytes.com.

Facebook fails to block ECJ data security case from proceeding
ICO to fine Marriott over $124 million for compromising 383 million users' data
Facebook fined $2.3 million by Germany for providing incomplete information about hate speech content


Facebook research suggests chatbots and conversational AI are on the verge of empathizing with humans

Fatema Patrawala
06 Aug 2019
6 min read
Last week, the Facebook AI research team published a progress report on dialogue research aimed at building more engaging and personalized AI systems. According to the team, “Dialogue research is a crucial component of building the next generation of intelligent agents. While there’s been progress with chatbots in single-domain dialogue, agents today are far from capable of carrying an open-domain conversation across a multitude of topics. Agents that can chat with humans in the way that people talk to each other will be easier and more enjoyable to use in our day-to-day lives — going beyond simple tasks like playing a song or booking an appointment.”

In the blog post, the team describes new open source data sets, algorithms, and models that address five common weaknesses of today's open-domain chatbots: maintaining consistency, specificity, empathy, knowledgeability, and multimodal understanding. Let us look at each one in detail.

Dataset called Dialogue NLI introduced for maintaining consistency

Inconsistencies are a common issue for chatbots, partly because most models lack explicit long-term memory and semantic understanding. The Facebook team, in collaboration with colleagues at NYU, developed a new way of framing the consistency of dialogue agents as natural language inference (NLI) and created a new NLI data set called Dialogue NLI, used to improve and evaluate the consistency of dialogue models. In the Dialogue NLI setup, two utterances in a dialogue are treated as the premise and hypothesis, respectively, and each pair is labeled to indicate whether the premise entails, contradicts, or is neutral with respect to the hypothesis. Training an NLI model on this data set and using it to rerank the model’s responses to entail previous dialogues (that is, to stay consistent with them) improved the overall consistency of the dialogue agent.
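The reranking idea can be sketched in miniature. The real Dialogue NLI work trains a neural NLI model over annotated utterance pairs; in the toy sketch below, a simple rule over hand-annotated (attribute, value) pairs stands in for that trained model, so all names and the scoring rule are illustrative assumptions, not the paper's implementation.

```python
# Toy sketch of NLI-based reranking for dialogue consistency.
# Dialogue NLI labels utterance pairs as entailment / neutral / contradiction;
# here a rule over (attribute, value) pairs stands in for the trained model.

def nli_label(history_facts, candidate_facts):
    """Label a candidate reply against facts established in the dialogue."""
    for attr, val in candidate_facts:
        if attr in history_facts:
            return "entailment" if history_facts[attr] == val else "contradiction"
    return "neutral"

def rerank(history_facts, candidates):
    """Order candidate replies: entailing first, contradicting last."""
    rank = {"entailment": 0, "neutral": 1, "contradiction": 2}
    return sorted(candidates,
                  key=lambda c: rank[nli_label(history_facts, c[1])])

# Facts asserted earlier in the conversation (e.g. from a persona).
history = {"occupation": "construction worker"}

candidates = [
    ("I work as a lawyer.", [("occupation", "lawyer")]),
    ("Nice weather today!", []),
    ("Yes, I build houses for a living.",
     [("occupation", "construction worker")]),
]

ordered = rerank(history, candidates)
print(ordered[0][0])   # the consistent reply surfaces first
```

In a real system a trained NLI model would replace `nli_label`, scoring raw sentence pairs instead of hand-annotated facts.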
Across these tests, the team reports three times fewer contradictions in the generated sentences.

Several conversational attributes were studied to balance specificity

As per the team, generative dialogue models frequently default to generic, safe responses, like “I don’t know,” even to queries that need specific answers. The Facebook team, in collaboration with Stanford AI researcher Abigail See, studied how to fix this by controlling several conversational attributes, such as the level of specificity. In one experiment, they conditioned a bot on character information and asked, “What do you do for a living?” A typical chatbot responds with the generic statement “I’m a construction worker.” With control methods, the chatbots proposed more specific and engaging responses, like “I build antique homes and refurbish houses.” In addition to specificity, the team notes that balancing question-asking with answering, and controlling how repetitive the models are, makes a significant difference. The better the overall conversation flow, the more engaging and personable the chatbots and dialogue agents of the future will be.

Chatbot’s ability to display empathy while responding was measured

The team worked with researchers from the University of Washington to introduce the first benchmark task of human-written empathetic dialogues centered on specific emotional labels to measure a chatbot’s ability to display empathy. In addition to improving on automatic metrics, the team showed that using this data both for fine-tuning and as retrieval candidates leads to responses that humans evaluate as more empathetic, with an average improvement of 0.95 points (on a 1-to-5 scale) across three different retrieval and generative models. The next challenge is for empathy-focused models to perform well in complex dialogue situations, where agents may need to balance empathy with staying on topic or providing information.
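The specificity control discussed earlier needs a way to measure how specific a response is. One proxy used in the controllable-dialogue literature is normalized inverse document frequency (NIDF): rare words score near 1, ubiquitous words near 0, and a response's specificity is its mean word score. The sketch below is an illustrative assumption, with a four-line "corpus" standing in for the large dialogue corpus a real system would use.

```python
import math

# Sketch of a specificity score via normalized inverse document frequency
# (NIDF). Rare words score near 1, ubiquitous words near 0; a sentence's
# specificity is the mean score of its words. The tiny corpus is illustrative.

corpus = [
    "i do not know",
    "i do not know about that",
    "i am not sure i know",
    "i build antique homes and refurbish houses",
]
n_docs = len(corpus)

def doc_freq(word):
    return sum(1 for doc in corpus if word in doc.split())

def nidf(word):
    """Inverse document frequency, normalized to [0, 1]."""
    df = max(doc_freq(word), 1)          # unseen words count as rarest
    return math.log(n_docs / df) / math.log(n_docs)

def specificity(sentence):
    words = sentence.lower().split()
    return sum(nidf(w) for w in words) / len(words)

generic = "i do not know"
specific = "i build antique homes and refurbish houses"
print(specificity(generic) < specificity(specific))   # True
```

A reranker or weighted-decoding scheme can then push generation toward (or away from) high-NIDF responses to dial specificity up or down.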
Wikipedia dataset used to make dialogue models more knowledgeable

The research team improved dialogue models’ ability to demonstrate knowledge by collecting a data set of conversations grounded in Wikipedia, and by creating new model architectures that retrieve knowledge, read it, and condition responses on it. This generative model yielded the most pronounced improvement: humans rated it 26% more engaging than its knowledgeless counterparts.

To engage with images, personality-based captions were used

To engage with humans, agents should not only comprehend dialogue but also understand images. In this research, the team focused on image captioning that is engaging for humans by incorporating personality. They collected a data set of human comments grounded in images and trained models capable of discussing images with given personalities, which makes the system interesting for humans to talk to; 64% of humans preferred these personality-based captions over traditional captions. To build strong models, the team considered both retrieval and generative variants and leveraged modules from both the vision and language domains. They defined a powerful retrieval architecture, named TransResNet, that works by projecting the image, personality, and caption into the same space using image, personality, and text encoders. The team showed that their system produces captions close to human performance in terms of engagement and relevance, and annotators preferred the retrieval model’s captions over captions written by people 49.5% of the time. Apart from this, the Facebook team released a new data collection and model evaluation tool: a Messenger-based chatbot game called Beat the Bot that lets people interact directly with bots and other humans in real time, creating rich examples to help train models.
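The shared-space retrieval scoring described above can be sketched as: encode the image, the personality, and each candidate caption into one vector space, combine image and personality into a context vector, and return the caption with the highest inner product. The hand-made 3-d embeddings below are illustrative stand-ins for TransResNet's learned image, personality, and text encoders.

```python
# Sketch of TransResNet-style retrieval: image, personality, and candidate
# captions live in one embedding space; the retrieved caption maximizes the
# inner product with the combined image + personality context vector.
# Hand-made 3-d vectors stand in for the learned encoders.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

image_emb = (1.0, 0.2, 0.0)                 # e.g. a photo of a sunset
personality_emb = {
    "sweet":     (0.0, 1.0, 0.0),
    "sarcastic": (0.0, 0.0, 1.0),
}
caption_emb = {
    "What a lovely sunset!":       (0.9, 0.9, 0.0),
    "Great, another sunset. Wow.": (0.9, 0.0, 0.9),
    "I had pasta for lunch.":      (0.0, 0.1, 0.1),
}

def retrieve(image, personality):
    """Pick the caption whose embedding best matches image + personality."""
    context = tuple(i + p for i, p in zip(image, personality_emb[personality]))
    return max(caption_emb, key=lambda cap: dot(caption_emb[cap], context))

print(retrieve(image_emb, "sweet"))      # -> "What a lovely sunset!"
print(retrieve(image_emb, "sarcastic"))  # -> "Great, another sunset. Wow."
```

Changing only the personality changes which caption wins, which is the point of conditioning the retrieval on personality rather than on the image alone.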
To conclude, the Facebook AI team writes, “Our research has shown that it is possible to train models to improve on some of the most common weaknesses of chatbots today. Over time, we’ll work toward bringing these subtasks together into one unified intelligent agent by narrowing and eventually closing the gap with human performance. In the future, intelligent chatbots will be capable of open-domain dialogue in a way that’s personable, consistent, empathetic, and engaging.”

On Hacker News, the research has drawn both positive and negative reviews. Some commenters argue that AI that converses like humans will do a lot of harm, while others call it an impressive improvement in the field of conversational AI. One comment reads, “I gotta say, when AI is able to converse like humans, a lot of bad stuff will happen. People are so used to the other conversation partner having self-interest, empathy, being reasonable. When enough bots all have a “swarm” program to move conversations in a particular direction, they will overwhelm any public conversation. Moreover, in individual conversations, you won’t be able to trust anything anyone says or negotiates. Just like playing chess or poker online now. And with deepfakes, you won’t be able to trust audio or video either. The ultimate shock will come when software can render deepfakes in realtime to carry on a conversation, as your friend but not. As a politician who “said crazy stuff” but really didn’t, but it’s in the realm of believability. I would give it about 20 years until it all goes to shit. If you thought fake news was bad, realtime deepfakes and AI conversations with “friends” will be worse.”

Scroll Snapping and other cool CSS features come to Firefox 68
Google Chrome to simplify URLs by hiding special-case subdomains
Lyft releases an autonomous driving dataset “Level 5” and sponsors research competition


Why are experts worried about Microsoft's billion dollar bet in OpenAI's AGI pipe dream?

Sugandha Lahoti
23 Jul 2019
6 min read
Microsoft has invested $1 billion in OpenAI with the goal of building next-generation supercomputers and a platform within Microsoft Azure which will scale to AGI (Artificial General Intelligence). This is a multiyear partnership with Microsoft becoming OpenAI’s preferred partner for commercializing new AI technologies. Open AI will become a big Azure customer, porting its services to run on Microsoft Azure. The $1 billion is a cash investment into OpenAI LP, which is Open AI’s for-profit corporate subsidiary. The investment will follow a standard capital commitment structure which means OpenAI can call for it, as they need it. But the company plans to spend it in less than five years. Per the official press release, “The companies will focus on building a computational platform in Azure for training and running advanced AI models, including hardware technologies that build on Microsoft’s supercomputing technology. These will be implemented in a safe, secure and trustworthy way and is a critical reason the companies chose to partner together.” They intend to license some of their pre-AGI technologies, with Microsoft becoming their preferred partner. “My goal in running OpenAI is to successfully create broadly beneficial A.G.I.,” Sam Altman, who co-founded Open AI with Elon Musk, said in a recent interview. “And this partnership is the most important milestone so far on that path.” Musk left the company in February 2019, to focus on Tesla and because he didn’t agree with some of what OpenAI team wanted to do. What does this partnership mean for Microsoft and Open AI OpenAI may benefit from this deal by keeping their innovations private which may help commercialization, raise more funds and get to AGI faster. 
For OpenAI this means the availability of resources for AGI, while potentially allowing founders and other investors with the opportunity to either double-down on OpenAI or reallocate resources to other initiatives However, this may also lead to them not disclosing progress, papers with details, and open source code as much as in the past. https://twitter.com/Pinboard/status/1153380118582054912 As for Microsoft, this deal is another attempt in quietly taking over open source. First, with the acquisition of GitHub and the subsequent launch of GitHub Sponsors, and now with becoming OpenAI’s ‘preferred partner’ for commercialization. Last year at an Investor conference, Nadella said, “AI is going to be one of the trends that is going to be the next big shift in technology. It's going to be AI at the edge, AI in the cloud, AI as part of SaaS applications, AI as part of in fact even infrastructure. And to me, to be the leader in it, it's not enough just to sort of have AI capability that we can exercise—you also need the ability to democratize it so that every business can truly benefit from it. That to me is our identity around AI.” Partnership with OpenAI seems to be a part of this plan. This deal can also possibly help Azure catch up with Google and Amazon both in hardware scalability and Artificial Intelligence offerings. A hacker news user comments, “OpenAI will adopt and make Azure their preferred platform. And Microsoft and Azure will jointly "develop new Azure AI supercomputing technologies", which I assume is advancing their FGPA-based deep learning offering. Google has a lead with TensorFlow + TPUs and this is a move to "buy their way in", which is a very Microsoft thing to do.” https://twitter.com/soumithchintala/status/1153308199610511360 It is also likely that Microsoft is investing money which will eventually be pumped back into its own company, as OpenAI buys computing power from the tech giant. 
Under the terms of the contract, Microsoft will eventually become the sole cloud computing provider for OpenAI, and most of that $1 billion will be spent on computing power, Altman says. OpenAI, who were previously into building ethical AI will now pivot to build cutting edge AI and move towards AGI. Sometimes even neglecting ethical ramifications, wanting to deploy tech at the earliest which is what Microsoft would be interested in monetizing. https://twitter.com/CadeMetz/status/1153291410994532352 I see two primary motivations: For OpenAI—to secure funding and to gain some control over hardware which in turn helps differentiate software. For MSFT—to elevate Azure in the minds of developers for AI training. - James Wang, Analyst at ARKInvest https://twitter.com/jwangARK/status/1153338174871154689 However, the news of this investment did not go down well with some experts in the field who saw this as a pure commercial deal and questioned whether OpenAI’s switch to for-profit research undermines its claims to be “democratizing” AI. https://twitter.com/fchollet/status/1153489165595504640 “I can't really parse its conversion into an LP—and Microsoft's huge investment—as anything but a victory for capital” - Robin Sloan, Author https://twitter.com/robinsloan/status/1153346647339876352 “What is OpenAI? I don't know anymore.” - Stephen Merity, Deep learning researcher https://twitter.com/Smerity/status/1153364705777311745 https://twitter.com/SamNazarius/status/1153290666413383682 People are also speculating whether creating AGI is really even possible. In a recent survey experts estimated that there was a 50 percent chance of creating AGI by the year 2099. Pet New York Times, most experts believe A.G.I. will not arrive for decades or even centuries. Even Altman admits OpenAI may never get there. But the race is on nonetheless. Then why is Microsoft delivering the $1 billion over five years considering that is neither enough money nor enough time to produce AGI. 
Whatever the answer, OpenAI has certainly impressed the tech community with its AI innovations. In April, OpenAI's new algorithm, trained to play the complex strategy game Dota 2, beat the world champion e-sports team OG at an event in San Francisco, winning the first two matches of the 'best-of-three' series. The competition pitted a human team of five professional Dota 2 players against an AI team of five OpenAI bots. In February, OpenAI released a new AI model, GPT-2, capable of generating coherent paragraphs of text without needing any task-specific training. However, experts felt that the move signalled a turn toward 'closed AI' and propagated the 'fear of AI', given the model's ability to write convincing fake news from just a few words.

Github Sponsors: Could corporate strategy eat FOSS culture for dinner?
Microsoft is seeking membership to Linux-distros mailing list for early access to security vulnerabilities
OpenAI: Two new versions and the output dataset of GPT-2 out!
Why Intel is betting on BFLOAT16 to be a game changer for deep learning training? Hint: Range trumps Precision.

Vincy Davis
22 Jul 2019
4 min read
A group of researchers from Intel Labs and Facebook have published a paper titled “A Study of BFLOAT16 for Deep Learning Training”. The paper presents a comprehensive study indicating the success of the Brain Floating Point (BFLOAT16) half-precision format in deep learning training across image classification, speech recognition, language modeling, generative networks, and industrial recommendation systems.

BFLOAT16 has the same 8-bit exponent as FP32 but only a 7-bit mantissa, so it preserves FP32's dynamic range while giving up precision. BFLOAT16 was originally developed by Google and implemented in its third-generation Tensor Processing Unit (TPU).

https://twitter.com/JeffDean/status/1134524217762951168

Many state-of-the-art training platforms use IEEE-754 single precision or automatic mixed precision as their preferred numeric format for deep learning training. However, these formats fall short in representing error gradients during back propagation and thus cannot deliver the required performance gains. BFLOAT16 exhibits a dynamic range that can represent error gradients during back propagation, which enables easier migration of deep learning workloads to BFLOAT16 hardware.

Image Source: BFLOAT16

In the above table, all values are represented as trimmed full-precision floating point values with 8 bits of mantissa, with a dynamic range comparable to FP32. By adopting the BFLOAT16 numeric format, core compute primitives such as the Fused Multiply Add (FMA) can be built using 8-bit multipliers. This leads to a significant reduction in area and power while preserving the full dynamic range of FP32.

How are deep neural networks (DNNs) trained with BFLOAT16?

The figure below shows the mixed-precision data flow used to train deep neural networks with the BFLOAT16 numeric format.

Image Source: BFLOAT16

The BFLOAT16 tensors are fed as input to the core compute kernels, represented as General Matrix Multiply (GEMM) operations, which produce FP32 tensors as output.
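The key property above — same exponent as FP32, truncated mantissa — is easy to see in code. The following is a minimal NumPy sketch (not the paper's actual implementation) that emulates BFLOAT16 by keeping only the top 16 bits of each FP32 bit pattern:

```python
import numpy as np

def fp32_to_bf16(x: np.ndarray) -> np.ndarray:
    """Emulate BFLOAT16 by zeroing the low 16 bits of each FP32 value.

    BFLOAT16 keeps FP32's sign bit and full 8-bit exponent (so the dynamic
    range is unchanged) but only the top 7 mantissa bits. Truncating the
    FP32 bit pattern to its upper 16 bits yields the round-toward-zero
    BFLOAT16 value, stored back in an FP32 container.
    """
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

x = np.array([3.14159265, 1e-30, 1e30], dtype=np.float32)
print(fp32_to_bf16(x))  # precision drops, but 1e-30 and 1e30 stay representable
```

Note how pi collapses to 3.140625 (only 7 mantissa bits survive), while the very small and very large values keep their magnitude — exactly the "range trumps precision" trade-off the paper argues for.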
The researchers have developed a library called Quantlib, represented as Q in the figure, to implement the emulation in multiple deep learning frameworks. One function of Quantlib is to modify the elements of an input FP32 tensor to mimic the behavior of BFLOAT16. Quantlib is also used to convert a copy of the FP32 weights to BFLOAT16 for the forward pass. The non-GEMM computations include batch normalization and activation functions. The bias tensors are always maintained in FP32, and the FP32 copy of the weights is used in the update step to maintain model accuracy.

How does BFLOAT16 perform compared to FP32?

Convolutional Neural Networks

Convolutional neural networks (CNNs) are primarily used for computer vision applications such as image classification, object detection, and semantic segmentation. AlexNet and ResNet-50 were used as the two representative models for the BFLOAT16 evaluation. AlexNet demonstrates that the BFLOAT16 emulation stays very close to the actual FP32 run, achieving 57.2% top-1 and 80.1% top-5 accuracy. In ResNet-50, the BFLOAT16 emulation follows the FP32 baseline almost exactly and achieves the same top-1 and top-5 accuracy.

Image Source: BFLOAT16

Similarly, the researchers successfully demonstrated that BFLOAT16 can represent tensor values across many application domains, including Recurrent Neural Networks, Generative Adversarial Networks (GANs), and industrial-scale recommendation systems. They thus established that BFLOAT16 covers the same dynamic range as FP32 and that conversion to and from FP32 is straightforward. Maintaining the same range as FP32 matters because it means no hyper-parameter tuning is required for convergence. (A hyperparameter is a parameter whose value is set before the learning process begins, rather than learned during training.) The researchers expect to see industry-wide adoption of BFLOAT16 across emerging domains.
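The mixed-precision flow described above can be sketched as follows. This is a rough illustration only: a NumPy truncation helper stands in for Quantlib, and a single hypothetical linear layer stands in for a real network. BFLOAT16 copies feed the GEMMs, accumulation happens in FP32, and the update is applied to the FP32 master weights:

```python
import numpy as np

def fp32_to_bf16(x: np.ndarray) -> np.ndarray:
    """Truncate FP32 values to BFLOAT16 precision (kept in an FP32 container)."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

rng = np.random.default_rng(0)
W_master = rng.standard_normal((4, 3)).astype(np.float32)  # FP32 master weights
x = rng.standard_normal((2, 4)).astype(np.float32)
target = np.zeros((2, 3), dtype=np.float32)

# Forward pass: the GEMM consumes BFLOAT16 copies of activations and weights
# and accumulates into an FP32 output tensor.
y = fp32_to_bf16(x) @ fp32_to_bf16(W_master)

# Backward pass: error gradients also flow through the GEMM in BFLOAT16.
grad_y = fp32_to_bf16(y - target)
grad_W = fp32_to_bf16(x).T @ grad_y

# Weight update: applied to the FP32 master copy to preserve model accuracy.
W_master -= np.float32(0.01) * grad_W
```

The design choice the paper emphasizes is visible here: precision is shed only on GEMM operands, while everything accuracy-sensitive (accumulation, biases, the weight update) stays in FP32.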
Recent reports suggest that Intel is planning to graft Google's BFLOAT16 onto its processors, as well as onto its initial Nervana Neural Network Processor for training, the NNP-T 1000. Pradeep Dubey, who directs the Parallel Computing Lab at Intel and is one of the researchers on this paper, believes that for deep learning a processor's range matters more than its precision, the inverse of the rationale behind IEEE's floating point formats.

Users are finding it interesting that the BFLOAT16 half-precision format is suitable for deep learning applications.

https://twitter.com/kevlindev/status/1152984689268781056
https://twitter.com/IAmMattGreen/status/1152769690621448192

For more details, head over to the “A Study of BFLOAT16 for Deep Learning Training” paper.

Intel’s new brain inspired neuromorphic AI chip contains 8 million neurons, processes data 1K times faster
Google plans to remove XSS Auditor used for detecting XSS vulnerabilities from its Chrome web browser
IntelliJ IDEA 2019.2 Beta 2 released with new Services tool window and profiling tools

#TechWontBuildIt: Entropic maintainer calls for a ban on Palantir employees contributing to the project and asks other open source communities to take a stand on ethical grounds

Sugandha Lahoti
19 Jul 2019
6 min read
The tech industry is being plagued by moral and ethical issues as top players become increasingly explicit about prioritizing profits over people or the planet. Recent times are rife with cases of tech companies actively selling facial recognition technology to law enforcement agencies, helping ICE separate immigrant families, taking large contracts with the Department of Defense, accelerating the extraction of fossil fuels, and deploying surveillance technology.

As the US becomes alarmingly dangerous for minority groups, asylum seekers, and other vulnerable communities, the tech worker community has been spurred to organize to keep employers in check. Since 2018, workers have been grouping together under the hashtag #TechWontBuildIt to push back against ethically questionable decisions made by their employers. Most recently, several open source communities, activists, and developers have strongly demonstrated against Palantir for its involvement with ICE.

Palantir, a data analytics company founded by Peter Thiel, one of President Trump's most vocal supporters in Silicon Valley, has been called out for its association with Immigration and Customs Enforcement (ICE). According to emails obtained by WNYC, Palantir's mobile app FALCON is being used by ICE to carry out raids on immigrant communities as well as to enable workplace raids.

According to the emails, an ICE supervisor sent an email to his officers before a planned spate of raids in New York City in 2017, ordering them to use a Palantir program, called FALCON mobile, for the operation. The email was sent in preparation for a worksite enforcement briefing on January 8, 2018. Two days later, ICE raided nearly a hundred 7-Elevens across the U.S. According to WNYC, ICE workplace raids led to 1,525 arrests over immigration status from October 2017 to October 2018.
The email reads, "[REDACTION] we want all the team leaders to utilize the FALCON mobile app on your GOV iPhones. We will be using the FALCON mobile app to share info with the command center about the subjects encountered in the stores as well as team locations."

Other emails obtained by WNYC detail a Palantir staffer notifying an ICE agent to test their FALCON mobile application because of his or her "possible involvement in an upcoming operation." Another message, from April 2017, shows a Palantir support representative instructing an agent on how to classify a datapoint so that Palantir's Investigative Case Management (ICM) platform could properly ingest records of a cell phone seizure.

In December 2018, Palantir told the New York Times' Dealbook that its technology is not used by the division of ICE responsible for carrying out the deportation and detention of undocumented immigrants. Palantir declined WNYC's requests for comment. Citing law enforcement "sensitivities," ICE also declined to comment on how it uses Palantir during worksite enforcement operations.

In May this year, new documents released by Mijente, an advocacy organization, revealed that Palantir was responsible for a 2017 operation that targeted and arrested family members of children crossing the border alone. The documents stand in stark contrast to what Palantir had said its software was doing. As part of the operation, ICE arrested 443 people solely for being undocumented. Mijente has since urged Palantir to drop its contract with ICE and stop providing software to agencies that aid in tracking, detaining, and deporting migrants, refugees, and asylum seekers.

Open source communities, activists and developers strongly oppose Palantir

Following the revelation of Palantir's involvement with ICE, several open-source developers are strongly opposing the company. The Entropic project, a JS package registry, is debating the idea of banning Palantir employees from participating in the project.
Kat Marchán, an Entropic maintainer, posted on the forum: "I find it unconscionable for tech folks to be building the technological foundations for this deeply unethical and immoral (and fascist) practice, and I would like it if we, in our limited power as a community to actually affect the situation, officially banned any Palantir employees from participating in or receiving any sort of direct support from the Entropic community." She has further proposed explicitly banning Palantir employees from the Discourse, the Discord, and the GitHub communities, as well as any other forums Entropic may use for coordinating the project.

https://twitter.com/maybekatz/status/1151355320314187776

Amazon is also facing renewed calls from employees and external immigration advocates to stop working with Palantir. According to an internal email obtained by Forbes, Amazon employees are recirculating a June 2018 letter to executives calling for Palantir to be kicked off Amazon Web Services. More than 500 Amazon employees have signed the letter, addressed to CEO Jeff Bezos and AWS head Andy Jassy. Beyond that, pro-immigration organizations such as Mijente and Jews for Racial and Economic Justice interrupted the keynote speech at Amazon's annual AWS Summit last Thursday.

https://twitter.com/altochulo/status/1149326296092164097

More than a dozen groups of activists also protested against Palantir Technologies in Palo Alto on July 12 over the company's provision of software facilitating ICE raids, detentions, and deportations. City residents joined the protests, expanding the total to hundreds.

Back in August 2018, the Lerna team took a strong stand against ICE by modifying their MIT license to ban companies that have collaborated with ICE from using Lerna. The updated license barred known ICE collaborators such as Microsoft, Palantir, and Amazon, among others, from using Lerna.
To quote Meredith Whittaker, the Google walkout organizer who recently left the company, from her farewell letter: "Tech workers have emerged as a force capable of making real change, pushing for public accountability, oversight, and meaningful equity. And this right when the world needs it most." She further adds, "The stakes are extremely high. The use of AI for social control and oppression is already emerging, even in the face of developers' best of intentions. We have a short window in which to act, to build in real guardrails for these systems before AI is built into our infrastructure and it's too late."

Extraordinary times call for extraordinary measures. As the tech industry grapples with the consequences of its hypergrowth, technosolutionist mindset, where do tech workers draw the line? Can tech workers afford to be apolitical, or to separate their values from the work they do? There are no simple answers, but one thing is for sure: the questions must be asked and faced. Open source, as part of the commons, has a key role to play, and how it evolves in the next couple of years is likely to define the direction the world takes.

Lerna relicenses to ban major tech giants like Amazon, Microsoft, Palantir from using its software as a protest against ICE
Palantir’s software was used to separate families in a 2017 operation reveals Mijente
ACLU files lawsuit against 11 federal criminal and immigration enforcement agencies for disclosure of information on government hacking.