[Loading Data in Python] Memory Datasets from the UCI Machine Learning Repository

Where can you get good datasets to practice machine learning?

The best practice datasets are real-world, so that they are interesting and relevant, yet small enough for you to review in Excel and work through on your desktop.

In this post you will discover a database of high-quality, real-world, and well understood machine learning datasets that you can use to practice applied machine learning.

This database is called the UCI machine learning repository and you can use it to structure a self-study program and build a solid foundation in machine learning.

Practice Practice Practice

Photo by Phil Roeder, some rights reserved.

Why Do We Need Practice Datasets?

If you are interested in practicing applied machine learning, you need datasets on which to practice.

This problem can stop you dead.

Which dataset should you use?

Should you collect your own or use one off the shelf?

Which one and why?

I teach a top-down approach to machine learning where I encourage you to learn a process for working a problem end-to-end, map that process onto a tool and practice the process on data in a targeted way. For more information see my post “Machine Learning for Programmers: Leap from developer to machine learning practitioner”.

So How Do You Practice In A Targeted Way?

I teach that the best way to get started is to practice on datasets that have specific traits.

I recommend you select traits that you will encounter and need to address when you start working on problems of your own such as:

Different types of supervised learning such as classification and regression.

Different sized datasets, from tens to hundreds, thousands and millions of instances.

Different numbers of attributes, from fewer than ten to tens, hundreds and thousands of attributes.

Different attribute types, from real, integer, categorical and ordinal to mixtures of these.

Different domains that force you to quickly understand and characterize a new problem in which you have no previous experience.
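The size and type traits listed above can be read off a dataset before you commit to it. The following is a minimal sketch, assuming the data is comma-separated text with no header row. The function names `infer_type` and `summarize_traits` and the sample rows are hypothetical, made up for illustration, and the type guess is deliberately crude (it cannot tell ordinal from categorical, for example).

```python
import csv
import io


def infer_type(values):
    """Guess an attribute type from its string values: integer, real, or categorical."""
    try:
        for v in values:
            int(v)
        return "integer"
    except ValueError:
        pass
    try:
        for v in values:
            float(v)
        return "real"
    except ValueError:
        return "categorical"


def summarize_traits(csv_text):
    """Return the number of instances, number of attributes, and a type guess per column."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    columns = list(zip(*rows))
    return {
        "instances": len(rows),
        "attributes": len(columns),
        "types": [infer_type(col) for col in columns],
    }


# Made-up rows for illustration only (not taken from any real dataset).
SAMPLE = """5.1,3,red,yes
4.9,2,blue,no
6.3,7,red,yes
"""

summary = summarize_traits(SAMPLE)
print(summary)
```

A summary like this is only a starting point for choosing datasets. The dataset's own description remains the authority on what each attribute means.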

You can create a program of traits to study, along with the algorithms you need to address them, by designing a program of test problem datasets to work through.

Such a program has a number of practical requirements, for example:

Real-World: The datasets should be drawn from the real world (rather than being contrived). This will keep them interesting and introduce the challenges that come with real data.

Small: The datasets need to be small so that you can inspect and understand them, and so that you can run many models quickly to accelerate your learning cycle.

Well-Understood: There should be a clear idea of what the data contains, why it was collected, and what problem needs to be solved, so that you can frame your investigation.

Baseline: It is also important to have an idea of what algorithms are known to perform well and the scores they achieved so that you have a useful point of comparison. This is important when you are getting started and learning because you need quick feedback as to how well you are performing (close to state-of-the-art or something is broken).

Plentiful: You need many datasets to choose from, both to satisfy the traits you would like to investigate and (if possible) your natural curiosity and interests.
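For the Baseline requirement above, the point of comparison the post has in mind is the scores that known algorithms have achieved on a dataset. A complementary and much cruder floor, swapped in here as an illustration, is the zero-rule baseline: always predict the most common class. This is a minimal sketch. The function name `zero_rule_accuracy` and the labels are hypothetical and made up for the example.

```python
from collections import Counter


def zero_rule_accuracy(train_labels, test_labels):
    """Predict the most common training label for every test instance and score it."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(1 for label in test_labels if label == majority_label)
    return majority_label, correct / len(test_labels)


# Made-up labels for illustration only.
train = ["neg", "neg", "pos", "neg", "pos", "neg"]
test = ["neg", "pos", "neg", "neg"]

label, accuracy = zero_rule_accuracy(train, test)
print(label, accuracy)
```

If a model scores below this floor, something is probably broken. If it scores near the published results for the dataset, you are likely on the right track.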

For beginners, you can get everything you need and more in terms of datasets to practice on from the UCI Machine Learning Repository.

What is the UCI Machine Learning Repository?

The UCI Machine Learning Repository is a database of machine learning problems that you can access for free.

It is hosted and maintained by the Center for Machine Learning and Intelligent Systems at the University of California, Irvine. It was originally created by David Aha as a graduate student at UC Irvine.

For more than 25 years it has been the go-to place for machine learning researchers and practitioners who need a dataset.

UCI Machine Learning Repository

Each dataset gets its own webpage that lists all the details known about it, including any relevant publications that investigate it. The datasets themselves can be downloaded as ASCII files, often in the useful CSV format.

For example, here is the webpage for the Abalone Data Set, which requires the prediction of the age of abalone from their physical measurements.
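Once a dataset has been downloaded in CSV format, loading it takes only a few lines. The following is a minimal sketch using Python's standard `csv` module, assuming a comma-separated file with no header row. The column names are an assumption based on the dataset's description and should be checked against its webpage, and the sample rows are made up to show the layout rather than copied from the real file.

```python
import csv
import io

# Column names are an assumption; verify them against the dataset's webpage.
COLUMNS = ["sex", "length", "diameter", "height", "whole_weight",
           "shucked_weight", "viscera_weight", "shell_weight", "rings"]


def load_records(csv_text, columns):
    """Parse header-less CSV text into a list of dicts, converting numeric fields."""
    records = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        record = {}
        for name, value in zip(columns, row):
            try:
                record[name] = float(value)
            except ValueError:
                record[name] = value  # keep non-numeric values as strings
        records.append(record)
    return records


# Made-up rows in a comma-separated layout, for illustration only.
SAMPLE = """M,0.45,0.36,0.10,0.51,0.22,0.10,0.15,15
F,0.53,0.42,0.13,0.68,0.26,0.14,0.21,9
"""

records = load_records(SAMPLE, COLUMNS)
print(len(records), records[0]["sex"], records[0]["rings"])
```

To use a file you have downloaded, read its contents with `open(path, newline="")` and pass the text to `load_records`. Libraries such as pandas offer the same thing in a single call, but the standard library is enough for datasets of this size.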

Benefits of the Repository

Some beneficial features of the repository include:

Almost all datasets are drawn from the domain (as opposed to being synthetic), meaning that they have real-world qualities.

Datasets cover a wide range of subject matter from biology to particle physics.

The details of datasets are summarized by aspects like attribute types, number of instances, number of attributes and year published that can be sorted and searched.

Datasets are well studied which means that they are well known in terms of interesting properties and expected “good” results. This can provide a useful baseline for comparison.

Most datasets are small (hundreds to thousands of instances), meaning that you can readily load them in a text editor or MS Excel to review them, and you can model them quickly on your workstation.

Browse the 300+ datasets using this handy table that supports sorting and searching.

Criticisms of the Repository

Some criticisms of the repository include:

The datasets are cleaned, meaning that the researchers who prepared them have often already performed some pre-processing in terms of the selection of attributes and instances.

The datasets are small, which is not helpful if you are interested in investigating larger scale problems and techniques.

There are so many to choose from that you can be frozen by indecision and over-analysis. It can be hard to just pick a dataset and get started when you are unsure if it is a “good dataset” for what you’re investigating.

Datasets are limited to tabular data, primarily for classification (although clustering and regression datasets are listed). This is limiting for those interested in natural language, computer vision, recommender and other data.

Take a look at the repository homepage, as it shows featured datasets, the newest datasets, and which datasets are currently the most popular.

A Self-Study Program

So, how can you make the best use of the UCI machine learning repository?

I would advise you to think about the traits in problem datasets that you would like to learn about.

These may be traits that you would like to model (like regression), or algorithms that model these traits that you would like to get more skillful at using (like random forest for multi-class classification).

An example program might look like the following:

Binary Classification: Pima Indians Diabetes Data Set

Multi-Class Classification: Iris Data Set

Regression: Wine Quality Data Set

Categorical Attributes: Breast Cancer Data Set

Integer Attributes: Computer Hardware Data Set

Classification Cost Function: German Credit Data

Missing Data: Horse Colic Data Set

This is just a list of traits; you can pick and choose your own traits to investigate.

I have listed one dataset for each trait, but you could pick 2-3 different datasets and complete a few small projects to improve your understanding and put in more practice.
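For a trait such as missing data, a useful first step on any candidate dataset is to count how many values are missing in each attribute. The following is a minimal sketch. The function name `count_missing` is hypothetical, and both the comma delimiter and the `?` marker are assumptions: each dataset documents its own file layout and missing-value convention, so check the dataset's webpage first. The sample rows are made up.

```python
import csv
import io


def count_missing(csv_text, missing_marker="?"):
    """Count, per column index, how many fields equal the missing-value marker."""
    counts = {}
    for row in csv.reader(io.StringIO(csv_text)):
        for index, value in enumerate(row):
            if value.strip() == missing_marker:
                counts[index] = counts.get(index, 0) + 1
    return counts


# Made-up rows for illustration only.
SAMPLE = """1,?,38.5,yes
2,4,?,no
?,3,?,yes
"""

missing = count_missing(SAMPLE)
print(missing)
```

Columns that never contain the marker are simply absent from the result, so an empty dictionary means no missing values were found.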

For each problem, I would advise that you work it systematically from end-to-end, for example, go through the following steps in the applied machine learning process:

Define the problem

Prepare data

Evaluate algorithms

Improve results

Write-up results

Select a systematic and repeatable process that you can use to deliver results consistently.

For more on the process of working through a machine learning problem systematically, see my post titled “Process for working through Machine Learning Problems”.

The write-up is a key part.

It allows you to build up a portfolio of projects that you can refer back to on future projects to get a jump-start, and that you can use as a public resume of your growing skills and capabilities in applied machine learning.

For more on building a portfolio of projects, see my post “Build a Machine Learning Portfolio: Complete Small Focused Projects and Demonstrate Your Skills”.

But, What If…

I don’t know a machine learning tool.

Pick a tool or platform (like Weka, R or scikit-learn) and use this process to learn it. That way you practice machine learning and get good at your tool at the same time.

I don’t know how to program (or code very well).

Use Weka. It has a graphical user interface and no programming is required. I would recommend this to beginners regardless of whether they can program or not because the process of working machine learning problems maps so well onto the platform.

I don’t have the time.

With a strong systematic process and a good tool that covers the whole process, I think that you could work through a problem in one or two hours. This means you could complete one project in an evening or over two evenings.

You choose the level of detail to investigate and it is a good idea to keep it light and simple when just starting out.

I don’t have a background in the domain I’m modeling.

The dataset pages provide some background on the dataset. Often you can dive deeper by looking at publications or the information files accompanying the main dataset.

I have little to no experience working through machine learning problems.

Now is your time to start. Pick a systematic process, pick a simple dataset and a tool like Weka and work through your first problem. Place that first stone in your machine learning foundation.

I have no experience at data analysis.

No experience in data analysis is required. The datasets are simple, easy to understand and well explained. You simply need to read up on them using each dataset's home page and by looking at the data files themselves.

Action Step

Select a dataset and get started.

If you are serious about your self-study, consider designing a modest list of traits and corresponding datasets to investigate.

You will learn a lot and build a valuable foundation for diving into more complex and interesting problems.

Did you find this post useful? Leave a comment and let me know.
