Term Archives - Page 16 of 16 - Data Ideology

Uplift or Persuasion Modeling

A combination of treatment comparisons (e.g. send a sales solicitation to one group, send nothing to another group) and predictive modeling to determine which cases or subjects respond (e.g. purchase or not) to which treatments. Here are the steps, in conceptual terms, for a typical uplift model: 1. Conduct an A/B test, where B is the control […]
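To make the idea concrete, here is a minimal sketch of the simplest uplift estimate: the response rate of the treated group minus the response rate of the control group, computed per segment. It does not reproduce the steps elided above; the function name `uplift_by_segment`, the segment labels, and the toy records are all assumptions made for illustration. A real uplift model would typically replace the per-segment rates with a predictive model.

```python
from collections import defaultdict


def uplift_by_segment(records):
    """Estimate uplift per segment from A/B test records.

    records: iterable of (segment, treated, responded) tuples, where
    treated and responded are booleans. (Hypothetical input format.)
    Returns {segment: treated_response_rate - control_response_rate}.
    """
    counts = defaultdict(lambda: {"t_n": 0, "t_r": 0, "c_n": 0, "c_r": 0})
    for segment, treated, responded in records:
        prefix = "t" if treated else "c"
        counts[segment][prefix + "_n"] += 1
        counts[segment][prefix + "_r"] += int(responded)

    result = {}
    for segment, c in counts.items():
        # Uplift cannot be estimated unless both groups are represented.
        if c["t_n"] == 0 or c["c_n"] == 0:
            continue
        result[segment] = c["t_r"] / c["t_n"] - c["c_r"] / c["c_n"]
    return result


# Toy data, invented for illustration only.
records = [
    ("new", True, True), ("new", True, True),
    ("new", False, False), ("new", False, True),
    ("loyal", True, True), ("loyal", True, False),
    ("loyal", False, True), ("loyal", False, False),
]
uplift = uplift_by_segment(records)
```

In this toy data the "new" segment responds more often when treated, while the "loyal" segment responds at the same rate either way, so only the "new" segment would be worth soliciting.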

Data mining

This term means different things in different contexts. To a lay person, it might mean the automated searching of large databases. To an analyst, it may refer to the collection of statistical and machine learning methods used with those databases (predictive modeling, clustering, recommendation systems, …)

Apache Kafka

Kafka, named after the famous Czech writer, is used for building real-time data pipelines and streaming apps. Why is it so popular? Because it enables storing, managing, and processing streams of data in a fault-tolerant way and is supposedly ‘wicked fast’. Given that social network environments deal with streams of data, Kafka is currently very […]

Text mining

The application of data mining methods to text.
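As a small illustration, one of the most basic text mining steps is turning raw text into term counts that data mining methods can work with. The sketch below assumes a crude tokenizer (lowercase runs of the letters a to z) and two made-up sentences; the function name `term_frequencies` is hypothetical.

```python
import re
from collections import Counter


def term_frequencies(documents):
    """Count how often each lowercase word appears across all documents."""
    counts = Counter()
    for doc in documents:
        # Crude tokenizer (an assumption): keep only runs of letters a-z.
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    return counts


# Example sentences invented for illustration.
docs = [
    "Data mining finds patterns in data.",
    "Text mining applies data mining to text.",
]
freqs = term_frequencies(docs)
```

Counts like these are the usual starting point for clustering or classifying documents.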

Apache Mahout

Mahout provides a library of pre-made algorithms for machine learning and data mining, as well as an environment for creating more algorithms. In other words, an environment in heaven for machine learning geeks. Machine learning and data mining are covered in my previous article mentioned above.

Apache Oozie

In any programming environment, you need some workflow system to schedule and run jobs in a predefined manner and with defined dependencies. Oozie provides that for Big Data jobs written in languages like Pig, MapReduce, and Hive.
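The core idea of running jobs "with defined dependencies" can be sketched with a topological sort. This is not Oozie's API (Oozie workflows are defined in XML); it is only an illustration using Python's standard `graphlib` module, and the job names are hypothetical.

```python
from graphlib import TopologicalSorter

# Map each job to the set of jobs that must finish before it starts.
# Job names are invented for illustration.
dependencies = {
    "load_raw": set(),
    "clean": {"load_raw"},
    "aggregate": {"clean"},
    "report": {"aggregate", "clean"},
}

# static_order() yields the jobs in an order that respects every dependency.
order = list(TopologicalSorter(dependencies).static_order())
```

A workflow scheduler adds a great deal on top of this ordering, such as triggers, retries, and monitoring, but the dependency graph is the part the definition above describes.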

Data science, data analytics, analytics

“Data science” is often used to describe a (new) profession whose practitioners are capable in many or all of the above areas; one often sees the term “data scientist” in job postings. While “statistician” typically implies familiarity with research methods and the collection of data for studies, “data scientist” implies the ability to work with large […]

Apache Drill, Apache Impala, Apache Spark SQL

All of these provide quick, interactive, SQL-like access to Apache Hadoop data. They are useful if you already know SQL and work with data stored in big data systems (e.g. HBase or HDFS). Sorry for being a little geeky here.

Statistics

Covers nearly all of the above methods, and also carries the mantle of a well-established profession dating back to the mid-1800s. Although statisticians work on “big data” problems, the field of statistics has traditionally concentrated on focused research studies (e.g. drug trials).