
Data Science

Apache Ozone Powers Data Science in CDP Private Cloud

Apache Ozone is a scalable distributed object store that can efficiently manage billions of small and large files. Ozone natively provides Amazon S3-compatible and Hadoop Filesystem-compatible endpoints in addition to its own object store API endpoint, and it is designed to work seamlessly with enterprise-scale data warehousing, machine learning and streaming workloads. The object store is readily available alongside HDFS in CDP (Cloudera Data Platform) Private Cloud Base 7.1.3+.
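To make the idea of several endpoints over one store concrete, here is a minimal sketch of how a single object might be addressed through each of them. It assumes the URI layouts described in the Apache Ozone documentation (`ofs://`, `o3fs://`, and a path-style S3 gateway URL on port 9878); the helper `ozone_uris` and every host, volume, bucket, and key name are hypothetical, so check the layouts against the release you actually run.

```python
def ozone_uris(om_host, volume, bucket, key, s3_gateway="localhost:9878"):
    """Build illustrative URIs for one object (all names are placeholders)."""
    key = key.lstrip("/")
    return {
        # Hadoop-compatible filesystem rooted at the Ozone Manager:
        # volume and bucket appear as the first two path components.
        "ofs": f"ofs://{om_host}/{volume}/{bucket}/{key}",
        # Older Hadoop-compatible scheme scoped to a single bucket.
        "o3fs": f"o3fs://{bucket}.{volume}.{om_host}/{key}",
        # S3 has no volume concept, so the volume is not part of the path;
        # the gateway serves buckets from a volume it is configured for
        # (assumption: check which volume your gateway exposes).
        "s3": f"http://{s3_gateway}/{bucket}/{key}",
    }


uris = ozone_uris("om.example.com", "vol1", "bucket1", "data/part-0000.parquet")
for name, uri in uris.items():
    print(f"{name:5} {uri}")
```

Which form to use depends on the client: Spark and Hive jobs would typically take one of the filesystem URIs, while S3 tooling would talk to the gateway.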

Data Science + Cybersecurity

Cybersecurity is at a critical turning point, especially in the wake of the global lockdown that caused companies worldwide to conduct more online business than ever before. No organization is immune to data breaches, as hackers are using more sophisticated techniques, such as artificial intelligence, to carry out cyberattacks.

Recruiting and Building the Data Science Team at Etsy

In this episode of Data+AI Battlescars (formerly CDO Battlescars), Sandeep Uttamchandani talks to Chu-Cheng, CDO at Etsy. This episode focuses on Chu-Cheng’s battlescars related to recruiting and building a data science team. Chu-Cheng leads the global data organization at Etsy, where he is responsible for data science, AI innovation, machine learning and data infrastructure. Prior to Etsy, Chu-Cheng held various data leadership roles at companies including Amazon, Intuit, Rakuten and eBay.

Data Science vs. Big Data Marketing

Data science and big data are essential in today’s world of marketing. You’ve probably already seen multiple instances of both being used for advertising and sales purposes, but you may not realize just how useful they are. If you own a business, you need to know how to use data for your own marketing programs.

Building a Machine Learning Application With Cloudera Data Science Workbench And Operational Database, Part 3: Productionization of ML models

In this last installment, we’ll discuss a demo application that uses PySpark.ML to build a classification model from training data stored in both Cloudera’s Operational Database (powered by Apache HBase) and Apache HDFS. The model is then scored and served through a simple web application. For more context, this demo is based on concepts discussed in the blog post “How to deploy ML models to production”.

Building a Machine Learning Application With Cloudera Data Science Workbench And Operational Database, Part 2: Querying/ Loading Data

In this installment, we’ll discuss how to do Get/Scan Operations and utilize PySpark SQL. Afterward, we’ll talk about Bulk Operations and then some errors you may come across while trying this yourself, along with how to troubleshoot them. Read the first blog here. Get/Scan Operations: In this example, let’s load the table ‘tblEmployee’ that we created in the “Put Operations” section of Part 1. I used the exact same catalog in order to load the table. Executing table.show() will give you:

5 Best Practices for Integrating Data Science Into Your Marketing Analytics

Personalization enables marketers to send hypertargeted content and offers that are more likely to drive purchases and cultivate brand loyalty. Research by Accenture from 2018 shows that 91% of consumers are more likely to shop with companies that provide relevant offers and recommendations. Though personalization helps marketers optimize ad spend and drive improvements in customer lifetime value, basket size, and retention, it’s still untenable at scale in many organizations.

Snowflake and Saturn Cloud Partner to Bring 100x Faster Data Science to Millions of Python Users

Snowflake and Saturn Cloud are thrilled to announce our partnership to provide the fastest data science and machine learning (ML) platform. Snowflake’s Data Cloud comprises a global network where thousands of organizations mobilize data with near-unlimited scale, concurrency, and performance. Saturn Cloud’s platform provides lightning-fast data science. Combined, our solutions enable customers to maximize their ML and data science initiatives.

Building a Machine Learning Application With Cloudera Data Science Workbench And Operational Database, Part 1: The Set-Up & Basics

Introduction: Python is used extensively among Data Engineers and Data Scientists to solve all sorts of problems, from ETL/ELT pipelines to building machine learning models. Apache HBase is an effective data storage system for many workflows, but accessing this data specifically through Python can be a struggle. For data professionals who want to make use of data stored in HBase, the recent upstream project “hbase-connectors” can be used with PySpark for basic operations.
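As a rough illustration of what such a basic operation involves, the connector is given a JSON catalog that maps DataFrame columns to HBase column families. The sketch below only builds that catalog string in plain Python. The helper `build_catalog`, the column names, and the `personal` column family are hypothetical and not taken from the series. The commented-out read call reflects the connector's usual data source name and options as an assumption, and it is not executed here.

```python
import json


def build_catalog(table, rowkey, rowkey_type, columns, namespace="default"):
    """Return a JSON catalog string mapping DataFrame columns to HBase cells.

    `columns` maps a column name to a (column_family, type) pair.
    """
    # The row key is declared under the special "rowkey" column family.
    mapped = {rowkey: {"cf": "rowkey", "col": rowkey, "type": rowkey_type}}
    for name, (family, col_type) in columns.items():
        mapped[name] = {"cf": family, "col": name, "type": col_type}
    return json.dumps(
        {
            "table": {"namespace": namespace, "name": table},
            "rowkey": rowkey,
            "columns": mapped,
        }
    )


# Hypothetical schema for illustration only.
catalog = build_catalog(
    "tblEmployee",
    rowkey="key",
    rowkey_type="int",
    columns={
        "empName": ("personal", "string"),
        "empState": ("personal", "string"),
    },
)
print(catalog)

# Assumed usage with a SparkSession named `spark` and the hbase-connectors
# jars on the classpath (not run in this sketch):
# table = (spark.read.format("org.apache.hadoop.hbase.spark")
#          .options(catalog=catalog)
#          .option("hbase.spark.use.hbasecontext", False)
#          .load())
```

Building the catalog from a dictionary rather than a hand-written string keeps the JSON valid as columns are added, which avoids one common source of errors when loading a table.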