Critical Machine Learning

(Proposal for ITP independent study project)

Description

Critical Machine Learning (CML) is an online resource and workshop series that aims to help students and researchers of varying interests and technical backgrounds understand the basic concepts and assumptions of the computational methods grouped under the term machine learning. Workshop participants will get hands-on experience applying some of these methods in a practical setting, which they can later bring to their own research. Furthermore, through the workshop and the online resource, they will be encouraged to critically examine the social and academic implications of machine learning.
 
Machine learning, as a subset of computer science, refers to a “set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data, or to perform other kinds of decision making under uncertainty.” (Murphy) The key concepts here are patterns, prediction and decision-making. These are not especially novel concepts by themselves; even before computers existed, statistical methods were developed and used, for example, to build mathematical models of the world from collected data. However, even with the help of computers, this process can be difficult or even impossible when one intends to deal with large amounts of complex data. At the least, a great deal of trial and error is involved in finding a model that fits the data and can be used to make sense of it: one formulates a hypothetical model based on observation and intuition, checks whether the hypothesis holds well enough in different cases, adjusts the model, and so on. What is notable about the various methods of machine learning is that they attempt to automate this trial-and-error process. Instead of a human programmer specifying a particular model that expresses the data, machine learning algorithms are designed to find the model that best fits the data they are given. This process is called training, and in a sense it can be described as a simulation of intuition (or prejudice, for that matter).
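The training process described above can be sketched in a few lines of code. The example below is purely illustrative and not part of the workshop material: the data, the hidden rule y = 2x + 1, and the choice of a linear model are all assumptions made for the sake of demonstration. The point is that we never hand-code the rule; we hand the algorithm example data and let it recover the rule by fitting.

```python
# Illustrative sketch of "training": rather than hand-specifying a model,
# we give the algorithm example data and let it fit a model itself.
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from the (hidden) rule y = 2x + 1; in practice this
# would be observed data whose underlying pattern is unknown.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = 2 * X.ravel() + 1

model = LinearRegression()
model.fit(X, y)  # the "training" step: the trial-and-error is automated

# The fitted parameters recover the pattern from the data alone.
print(model.coef_[0], model.intercept_)  # approximately 2.0 and 1.0
print(model.predict([[5.0]]))            # approximately [11.0]
```

Real training data is of course noisier and higher-dimensional than this, which is exactly where automated fitting pays off over manual hypothesis-tweaking.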
 
One result of automating this guesswork is the possibility of working with very large and complex data faster and more precisely. As recent years have seen an astronomical increase in the use of digital data across many fields of human activity, i.e. big data, the importance of machine learning has also increased; the technology industry has invested in developing more efficient methods such as deep neural networks. Within a relatively short period of time, machine learning came to be used in numerous applications that affect daily life: search engines, social media feeds, OCR, face recognition, machine translation, spam filtering, stock trading, recommendation systems and self-driving cars, to name just a few. Another result is that by relying on ‘training’ the algorithm on data rather than explicitly programming what it should do, it becomes more difficult to explain exactly what is happening inside. In a way this is a tradeoff between the ability to analyze more data and an understanding of the mechanics: machine learning is like a black box, which enables and obfuscates at the same time.
 
The increasing usage and availability of machine learning methods offer new tools with which to investigate the world, and raise important issues at the same time. Data collection, data representation, modeling and prediction all have large social implications, from governmental and corporate surveillance to labor transformation, policymaking and human subjectivity. What data exists and what does not, and by whom is it created? What prejudices are we training into machine learning algorithms when we choose a dataset? It is not difficult to find projects that rely on the labor of Amazon Mechanical Turk workers to create the datasets used to train algorithms. When these workers might be among the first labor force to be massively replaced by the very algorithms they helped to train, what does it mean to use such methods in one’s research, for example? What elements of academia will also be replaced? Is it possible to avoid using computational power “primarily to augment dominant institutions of corporate and state power?” (Dawson) What do we want to put the technology to use for? To quote Dewey-Hagborg’s talk: “to work in algorithms is a political act, because to work with them has such tremendous impact on people’s lives.” Understanding not only what the tools do, but also what value systems and implicit decisions they embody, is important in order to provide critical insight and potential room for engagement.
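The point that a dataset can encode prejudice can be made concrete with a deliberately contrived sketch. Everything below is hypothetical, invented for illustration: a “hiring” dataset in which past decisions happened to track an irrelevant attribute rather than a relevant skill. A model trained on such data will faithfully reproduce that pattern, without anyone having programmed the bias explicitly.

```python
# Contrived sketch: a model trained on biased decisions reproduces the bias.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 200

# Hypothetical "hiring" data: feature 0 is a relevant skill score; feature 1
# is an irrelevant attribute that, in this biased sample, perfectly tracks
# past hiring decisions.
skill = rng.normal(size=n)
irrelevant = rng.integers(0, 2, size=n)
hired = irrelevant  # past decisions followed the irrelevant attribute

X = np.column_stack([skill, irrelevant])
clf = DecisionTreeClassifier(random_state=0).fit(X, hired)

# The trained model ignores skill entirely: a high-skill candidate with
# attribute 0 is rejected, a low-skill candidate with attribute 1 accepted.
print(clf.predict([[3.0, 0], [-3.0, 1]]))
```

The model is “correct” with respect to its training data; the problem lies entirely in what the data was chosen to represent, which is precisely the kind of assumption the workshop aims to surface.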
 
There are many resources and training programs for machine learning in general, many of them leaning toward specific problem-solving and applications; Stanford’s CS 229 is arguably one of the most famous. In other fields like the digital humanities and social sciences, tutorials also tend to focus on application to problems such as text analysis: for example, the GC Digital Fellows’ Text Analysis with MALLET workshop or David Bamman’s Machine Learning for the Computational Humanities tutorial (http://www.cs.cmu.edu/~dbamman/mlch.html). These are important approaches that help non-CS researchers harness the power of new technologies. However, the increasing usage of machine learning and its impact on society call for critical discussion as well. Such discussions, like the problematization of biases in an algorithmic society, are found much less often in typical resources. CML’s goal is to equip students with both technical familiarity and an understanding of the concepts and assumptions of machine learning, and to provide a space that fosters interdisciplinary discourse around it.
 

Work plan

CML will consist of a WordPress website hosted on the CUNY Academic Commons (or the New Media Lab server), and a workshop series. The website will be minimal and contain introductory materials along with a reading list, relevant links and tutorials one can follow along. It will be published, complete or not, in late September, along with the workshop announcement. The workshop series will be held at the Graduate Center between late October and November; since the workshop aims more at a general understanding of machine learning and the discussions around it than at highly technical skills, participants from all disciplines and technical backgrounds are welcome. They are, however, strongly encouraged to engage in the discussions. The discussions and feedback from the workshops will be incorporated into the website material, which will be finalized in December. All material and code produced as part of CML will be open-sourced.
 
The workshop will consist of two 3-hour sessions, both of which participants are required to attend. The first session will introduce participants to the fundamental concepts and assumptions of machine learning and include a hands-on exercise solving a simple machine learning problem. The second session will start with another machine learning exercise, designed to address the shortcomings of machine learning methods and to problematize certain approaches to data in general; this exercise will be followed by a discussion of the pitfalls of machine learning and critical issues around it.
 
Workshop exercises will use the scikit-learn Python package. We will use the cloud-based DH Box (dhbox.org) platform, which provides open source tools such as a command line shell and IPython in a virtual Linux environment accessible via web browser. Participants will need to bring their own laptops or borrow one from the library. Using DH Box has advantages over installing tools directly on students’ devices because it eliminates most of the configuration process, which can be time-consuming and distract from the content. While DH Box can be slower than many laptops, the ease of preparation is important, and the technical parts of the workshops will not be highly intensive in terms of computational power. One thing to consider is that a stable wi-fi connection must be secured, although this should not usually be a big problem in the Graduate Center building. Also, having students bring their own devices allows the workshop to be held in settings other than computer labs; this will be crucial for conducting a discussion.
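As a rough sketch of what a first-session exercise might look like in scikit-learn (the dataset, classifier, and parameters here are placeholders, not the finalized workshop material), participants would train a simple classifier and evaluate it on held-out data:

```python
# Sketch of a possible first-session exercise: train a simple classifier on
# a dataset bundled with scikit-learn and evaluate it on held-out data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)             # training on 70% of the data
accuracy = clf.score(X_test, y_test)  # evaluation on the unseen 30%
print("held-out accuracy:", round(accuracy, 2))
```

Nothing here requires heavy computation, which supports the choice of DH Box over local installs: the whole exercise runs comfortably in a browser-based environment.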
 

Timeline (2016)

  • June – July
    • Subject research
    • Draft: introductory document, reading list
    • Application: NYCDH
  • August
    • Build website
    • Finalize introductory document and reading list
    • Draft: tutorials (2 total)
    • (optional) Run test workshop
  • September
    • Finalize tutorials
    • Coordinate workshop (time, place, announcement, registration)
  • October
    • Run workshop session 1: introduction & exercise 1
  • November
    • Run workshop session 2: exercise 2 & discussion
    • Incorporate feedback into website
  • December
    • Project write-up
    • Submit to JITP, etc
 

References

  • Dawson, Ashley. “DIY Academy? Cognitive Capitalism, Humanist Scholarship, and the Digital Transformation.” The Social Media Reader. Ed. Michael Mandiberg. New York: NYU Press, 2012. 257-74. Print.
  • Dewey-Hagborg, Heather. Alt-AI. School for Poetic Computation, New York. 21 May 2016. Presentation. Re-cited from https://twitter.com/carolinesinders/status/734167428343726080.
  • Murphy, Kevin P. Machine Learning: A Probabilistic Perspective. Cambridge, MA: MIT Press, 2012. Print.
  • Tanz, Jason. “Soon We Won’t Program Computers. We’ll Train Them Like Dogs.” Wired.com. Web. 17 May 2016.
Questions
  • Should I split the workshop sessions the other way, with technical exercises on one day and concepts and discussion on the other?
  • Could visualization methods (graphs, t-SNE, …) be helpful in explaining the concepts?

Feedback from 05/23/2016 presentation

  • What is the end product?
  • Think of ways to incorporate other people’s input. Don’t reinvent the wheel for all of the discussion
  • Machine learning can be a big, hard-to-tackle subject. Framing the outreach, narrowing down the audience, defining constraints, employing focus groups, etc. will be helpful.
    • e.g. Adaptive learning
    • google translate + K-12
  • Where is the point of intended intervention?
    • involving computer scientists?
    • critiquing the industrial trend?
    • providing enough skills to critique?
    • bringing together the separated activities, i.e. facilitating dialogue between computer science and other fields?
  • Check out:
    • Queens College’s big data analytics program
    • Beyond Citation

Responses

  • This is a terrific project, Achim — as you note, with the increase in machine learning and algorithms used in so many services and products, it’s increasingly important to be able to understand what they are and what they do. I think these workshops will be of interest to students (and faculty!) in all disciplines, and I can certainly see future ITP cohorts benefiting from them as well.
  • To your question about splitting up the workshops, I would suggest having some presentation, some activity, and some discussion for each of the two workshops. 3 hours is a long time — though I think you need that much time, for sure — and I think that breaking up the workshop into discrete sections will help it flow smoothly.

    I’m also curious about the end product — you’ll have the website of resources and the workshop lesson plans; would you consider offering the workshops again in a subsequent semester, or training another interested GC student to do so?

    Again, this is a great project proposal, Achim. I’m looking forward to hearing how the workshops go next semester.
