成为技术高手:想更了解自己的偶像么?教你用技术手段挖掘他/她的关系

发布时间:2024-12-17 06:56

简介: 是否在不同的电影中,总是能看到那些熟悉却叫不上名字的演员么,想知道他们之间相互的关系么?本文将带你一步一步地挖掘出他们的关系。想更了解自己的偶像么,那就试试吧。

是否在不同的电影中,总是能看到那些熟悉却叫不上名字的演员么,想知道他们之间相互的关系么?本文将带你一步一步地挖掘出他们的关系。想更了解自己的偶像么,那就试试吧。

以下为译文

我最近换了个工作,在入职之前,我看了很多电影来打发闲暇时光。然而,演员之间的联系引起了我的注意。我经常回想,为什么我总能看到他们在一起工作呢?然后,我就使用 IMDB 电影数据库来分析演员之间的关联。

# Load up the useful libraries for building and visualizing networks: library(reshape2) library(network) library(sna) library(ggplot2) library(GGally) library(readr) system("ls ../input") # Read in the data, and strip out the unnecessary details: data <- read_csv("../input/movie_metadata.csv") network: Classes for Relational Data Version 1.13.0 created on 2015-08-31. copyright (c) 2005, Carter T. Butts, University of California-Irvine Mark S. Handcock, University of California -- Los Angeles David R. Hunter, Penn State University Martina Morris, University of Washington Skye Bender-deMoll, University of Washington For citation information, type citation("network"). Type help("network-package") to get started. Loading required package: statnet.common sna: Tools for Social Network Analysis Version 2.4 created on 2016-07-23. copyright (c) 2005, Carter T. Butts, University of California-Irvine For citation information, type citation("sna"). Type help(package="sna") to get started. Parsed with column specification: cols( .default = col_integer(), color = col_character(), director_name = col_character(), actor_2_name = col_character(), genres = col_character(), actor_1_name = col_character(), movie_title = col_character(), actor_3_name = col_character(), plot_keywords = col_character(), movie_imdb_link = col_character(), language = col_character(), country = col_character(), content_rating = col_character(), imdb_score = col_double(), aspect_ratio = col_double() ) See spec(...) for full column specifications.

输出结果

Warning message: “4 parsing failures. row col expected actual 2324 budget an integer 2400000000 2989 budget an integer 12215500000 3006 budget an integer 2500000000 3860 budget an integer 4200000000 ”

需要做的第一件事就是构建可生成网络图的对象。我们有很多库可以完成这件事情,而我选择了sna和network两个库。另外,我推荐包含GGally的GGplot2库可以将网络完美地可视化。我决定对这些进行一层简单的封装来获得我们需要的对象。

getIMDBGraph<-function(data, firstyear =1, lastyear = 3000, genre = FALSE, minscore = 0, maxscore = 10){ if(is.character(genre)){ data<-data[grep(genre, data$genres),]} data<-subset(data, data$title_year >= firstyear & data$title_year <= lastyear & data$imdb_score <= maxscore & data$imdb_score >= minscore) df<- data.frame(data$movie_title, data$actor_1_name, data$actor_2_name, data$actor_3_name) df<- melt(df, id.vars = 'data.movie_title') names(df)<-c('title', 'actornum', 'actor') df<-df[,c(1,3)] edges<-merge(x = df, y = df, by = 'title') edges<-subset(edges,edges$actor.x != edges$actor.y) edgelist<-as.matrix(edges[,c(2:3)]) graph<-network(edgelist) return(graph)}

现在就可以传递之前导入的按时间排序的数据了。在这次案例中,我使用的数据是2000年到2009年的。

graph <- getIMDBGraph(data, firstyear = 2000, lastyear = 2009) Warning message: “attributes are not identical across measure variables; they will be dropped”

试用GGplot2 的扩展可以很容易地画出交互性好的网络图。下面是一些图中的参数。

p <- ggnet2(graph, size = 'degree', # feature by which the nodes are scaled size.min = 40, # lower bound of the nodes to be plotted label = T, #plot labels? # mode = 'kamadakawai', # plotting algo for node placement (defaults to FR) label.size = 2.5, node.size = 7, node.color = 'grey70', node.alpha = 0.7, edge.alpha = 0.2, edge.size = 0.3, legend.size = FALSE, # I don't want a legend legend.position = 'None') # Now add a title and subtitle p <- p + ggtitle('Network of actor connections', subtitle = '2000\'s') # And format the title p <- p + theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5)) p

输出结果

Loading required package: scales Attaching package: ‘scales’ The following object is masked from ‘package:readr’: col_factor size.min removed 3072 nodes out of 3144

这个图的密度由我们过滤节点的方法决定。在这个案例中,我们一般用节点的度来过滤,也就是说一个节点和其它节点关联的数量。

raph <- getIMDBGraph(data, firstyear = 1990, lastyear = 1999) p <- ggnet2(graph, size = 'degree', size.min = 18, label = T, # mode = 'kamadakawai', label.size = 2.5, node.size = 7, node.color = 'grey70', node.alpha = 0.7, edge.alpha = 0.2, edge.size = 0.3, legend.size = FALSE, legend.position = 'None') # Now add a title and subtitle p <- p + ggtitle('Network of actor connections', subtitle = '90\'s') # And format the title p <- p + theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5)) p

输出结果

Warning message: “attributes are not identical across measure variables; they will be dropped”size.min removed 1256 nodes out of 1336

接下来让我们尝试90年代的动作电影:

genre <- 'Action' firstyear <- 1990 lastyear <- 1999 graph <- getIMDBGraph(data, firstyear = firstyear, lastyear = lastyear, genre = genre) p <- ggnet2(graph, size = 'degree', size.min = 6, label = T, # mode = 'kamadakawai', label.size = 2.5, node.size = 7, node.color = 'grey70', node.alpha = 0.7, edge.alpha = 0.2, edge.size = 0.3, legend.size = FALSE, legend.position = 'None') # Now add a title and subtitle p <- p + ggtitle('Network of actor connections', subtitle = paste(genre, 'movies between', firstyear, '&', lastyear)) # And format the title p <- p + theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5)) p

输出结果

Warning message: “attributes are not identical across measure variables; they will be dropped”size.min removed 347 nodes out of 446

那么对于80年代以后且评分7.0分以上的电影又是怎样的情况呢?为了方便浏览,我将图着了色。

genre <- 'Action' firstyear <- 1980 minscore <- 7 graph <- getIMDBGraph(data, firstyear = firstyear, genre = genre, minscore = minscore) p <- ggnet2(graph, size = 'degree', size.min = 6, label = T, label.color = 'white', label.size = 3, node.color = 'grey70', node.alpha = 0.5, edge.alpha = 0.3, edge.size = 0.5, legend.size = FALSE, legend.position = 'None') # Now add a title and subtitle p <- p + ggtitle('Network of actor connections', subtitle = 'from the best action films:') # And format the title p <- p + theme(plot.title = element_text(hjust = 0.5, colour = 'white', size = 25), plot.subtitle = element_text(hjust = 0.5, colour = 'white', size = 17), plot.background = element_rect(fill = 'black')) # And plot it p

输出结果

Warning message: “attributes are not identical across measure variables; they will be dropped”size.min removed 392 nodes out of 518

数十款阿里云产品限时折扣中,赶紧点击领劵开始云上实践吧!

本文由北邮@爱可可-爱生活 老师推荐,阿里云云栖社区组织翻译。

文章原标题《Network Mapping Hollywood actor overlap | Kaggle》,作者:wouldntyaliktono,译者:爱小乖

文章为简译,更为详细的内容,请查看原文

网址:成为技术高手:想更了解自己的偶像么?教你用技术手段挖掘他/她的关系 http://c.mxgxt.com/news/view/243823

相关内容

成为技术高手:想更了解自己的偶像么?教你用技术手段挖掘他/她的关系
虚拟偶像 :“技术中立”不能成为避风港
数据挖掘是对业务和用户的理解
挖掘竞争对手人才也是有技巧
挖掘竞争对手人才的策略
百度天量数据挖掘明星关系,极客都是好娱记
技术演讲的技巧和经验
《媒体信息技术在英语课堂和课外教学中的运用》
【成为偶像】我的二次元女友是虚拟歌手
【故乡的云·校友访谈录之十四】李彦宏:技术与梦想的力量

随便看看