baybay's guide to New Zealand SIM cards and mobile plans: Spark, Vodafone, 2degrees, Skinny
Telecom has been renamed Spark.
Christmas promotion: a 5 GB data SIM for only $30.
Auckland: straight ahead and to the left as you come out of customs.
Christchurch: turn left as you come out of customs.
If anything is still unclear, leave a comment and I'll keep updating the post.
Post 1: Vodafone
Post 2: Spark (formerly Telecom)
Post 3: 2degrees
Post 4: Skinny
Choosing a phone plan is the biggest headache. For travellers the main goal is convenient mobile data, so plenty of data is what matters most. I'll put up some screenshots first; if anything isn't clear I'll explain it in more detail when I have time, and no doubt some helpful readers in New Zealand will beat me to it.
I'm not asking anyone for likes. If you found this useful or helpful, just give the thread a bump; taking all these screenshots was no small job.
Vodafone (4G): http://www.vodafone.co.nz
Vodafone is the market leader. SIM cards that won't work with the other carriers will usually work on Vodafone, apart from a few unusual handset models. You'll see the Vodafone and Telecom counters right at the airport exit. If you're worried your English isn't up to it, look for the Chinese-speaking staff and just talk to them in Chinese; have them test the SIM in your phone, and remember to dial 777 to activate it. Make sure you ask any staff member specifically for the $19 plan, otherwise they'll usually sign you up for Freebee Talk or Freebee Data by default. Travellers mostly need data and calls, so the unlimited texts aren't that important. Once the SIM is set up, 4G kicks in automatically and it is fast.
The $19 plan includes:
100 minutes of calls
Unlimited local texts
500 MB of data
The $49 plan has 2 GB of data (now back at its original price) and includes:
120 minutes of calls to 15 countries (China included)
200 texts to the same 15 countries
$10 of credit to use once the included minutes run out
Spark: http://www.spark.co.nz/
The original Telecom is a bit like China Telecom back home, the leader in home landlines. Its mobile customers used to be mostly local Kiwis, but fierce competition has pushed it onto a budget track, and with the free 1 GB per day phone-booth Wi-Fi promotion launched on 18 December, more and more people are joining it. Not every handset is compatible, though; most smartphones such as the iPhone work fine. As with the others, have the staff at the airport Telecom counter test the SIM before you buy. If your phone can't use it, you'll have to go next door to Vodafone instead.
The $19 plan includes:
100 minutes of calls
Unlimited local texts
500 MB of data
New Christmas offer: for one month after buying the $19 plan, calls to any Telecom mobile or landline are completely free. In other words, Telecom-to-Telecom calls cost nothing.
2degrees: http://www.2degreesmobile.co.nz/
2degrees is the carrier that first launched the $19 plan, which forced Vodafone and Telecom to cut their prices, and it still offers that $19 deal today. There is no 2degrees shop at the airport, but you can pick up a free visitor SIM from the i-SITE service centre on the ground floor of SkyCity and top it up from there. The airport i-SITE was out of SIMs for a while, so I'm not sure whether you can still get one there. 2degrees now also offers Carryover plans: choose Carryover Minutes and you get 200 minutes to any New Zealand or Australian number, with unused allowances rolling over, so you don't have to worry about your data running out either. I've switched to this myself. The downside is that it suits travellers staying outside the main centres less well, since coverage there used to be quite poor, although apart from the bigger cities such as Dunedin the signal in the smaller towns has also started to improve.
Update: my opinion of 2degrees has improved a great deal. The signal is now strong almost everywhere, and the places where it is weak, like Milford Sound, are spots where no carrier gets reception at all.
Skinny: http://www.skinny.co.nz
Skinny is the youngest mobile operator and currently the cheapest on the market. Unlike the others, Skinny targets budget users and only sells prepaid plans. It does not build its own network; it uses Spark's (Telecom's) cell sites and transport network, which is how it keeps its prices at the bottom of the market.
What makes Skinny different: it has its own systems, customer service, billing, call centre and sales team, but its radio network comes entirely from Spark. It only does mobile service, and only prepaid plans. Because it runs on Telecom's (now Spark's) network, its phones use the CDMA2000 standard on the 1900 MHz band. SIM cards cost $5 and are easy to find in most convenience stores, supermarkets, computer shops and bookshops, and topping up is just as easy. Calls between Skinny numbers are free, and plans are weekly (7-day), starting from as little as $4 for a bundle of minutes, texts and data.
Its plans are otherwise much the same as the three carriers above; the one addition is a $40-a-month plan with unlimited NZ & AU calls, which suits people living here rather than short-term visitors.
Special love for special you~
Spark修炼之道 (Advanced): Spark Source Code Reading, Part 2: Creating the SparkContext

Recommended background reading: Zhang Anzhan's overview of Spark's architecture (http://blog.csdn.net/anzhsoft/article/details/), written against Spark 1.2. This article uses Spark 1.5.0 as its baseline and describes the execution flow of a Spark application.
This article, and the source-code analysis in the posts that follow, use the following program as the running example:
import org.apache.spark.{SparkConf, SparkContext}

object SparkWordCount {
  def main(args: Array[String]) {
    if (args.length == 0) {
      System.err.println("Usage: SparkWordCount <inputfile> <outputfile>")
      System.exit(1)
    }
    val conf = new SparkConf().setAppName("SparkWordCount")
    val sc = new SparkContext(conf)
    val file = sc.textFile("file:///hadoopLearning/spark-1.5.1-bin-hadoop2.4/README.md")
    val counts = file.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile("file:///hadoopLearning/spark-1.5.1-bin-hadoop2.4/countReslut.txt")
  }
}
The SparkContext in this code plays the leading role throughout the execution of a Spark application: it is responsible for all interaction with the Spark cluster, including requesting cluster resources and creating RDDs, accumulators and broadcast variables. The interaction between the SparkContext, the cluster resource manager and the Worker nodes is shown in the figure below.
The official documentation makes the following points about this diagram:
(1) Each Spark application gets its own Executors, which stay alive for the whole lifetime of the application and can run tasks in multiple threads. The benefit is that applications are isolated from one another; apart from writing data to an external storage system, Spark applications have no way to share data directly.
(2) Spark is agnostic to the underlying cluster manager: as long as it can acquire Executors and communicate with them, the execution flow is the same no matter which resource manager is used, so Spark can work with several different ones.
(3) The driver program communicates back and forth with its Executors for the entire duration of the application.
(4) Because the driver schedules the application's tasks, it should run close to the Worker nodes, preferably on the same local network.
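To make the driver-side responsibilities above concrete, here is a minimal sketch (not from the original article) of the SparkContext APIs it mentions: creating an RDD, an accumulator and a broadcast variable. All names and values are illustrative only.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("driver-demo").setMaster("local[2]"))
val rdd = sc.parallelize(1 to 100)                  // create an RDD from a local collection
val misses = sc.accumulator(0, "misses")            // accumulator: written by tasks, read on the driver
val lookup = sc.broadcast(Map(1 -> "a", 2 -> "b"))  // read-only broadcast variable shipped to executors
rdd.foreach { i => if (!lookup.value.contains(i % 3)) misses += 1 }
println(misses.value)
sc.stop()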
The cluster managers Spark currently supports are:
Standalone
Apache Mesos
Hadoop YARN
When a Spark application is submitted, the master URL can take any of the following forms, exactly the forms matched by createTaskScheduler below: local, local[N], local[N, maxFailures], local-cluster[N, cores, memory], spark://host:port, mesos:// or zk://, yarn-client, yarn-cluster (formerly yarn-standalone), and simr://.
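As a quick illustration (this snippet is not part of the original article), the master URL is normally supplied through SparkConf.setMaster or through the --master flag of spark-submit; the host name below is a placeholder:

import org.apache.spark.SparkConf

// Run with 4 local threads
val confLocal = new SparkConf().setAppName("demo").setMaster("local[4]")
// Connect to a standalone cluster master (placeholder host)
val confStandalone = new SparkConf().setAppName("demo").setMaster("spark://master-host:7077")
// Simulate a local cluster: 2 workers, 2 cores each, 1024 MB per worker
val confLocalCluster = new SparkConf().setAppName("demo").setMaster("local-cluster[2,2,1024]")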
With that background in place, we can walk through how the SparkContext is created. The core of the construction logic is as follows:
// We need to register "HeartbeatReceiver" before "createTaskScheduler" because Executor will
// retrieve "HeartbeatReceiver" in the constructor. (SPARK-6640)
_heartbeatReceiver = env.rpcEnv.setupEndpoint(
  HeartbeatReceiver.ENDPOINT_NAME, new HeartbeatReceiver(this))

// Create and start the scheduler
// Create the TaskScheduler from the master URL and this SparkContext;
// this returns both the SchedulerBackend and the TaskScheduler.
val (sched, ts) = SparkContext.createTaskScheduler(this, master)
_schedulerBackend = sched
_taskScheduler = ts
// Create the DAGScheduler from this SparkContext
_dagScheduler = new DAGScheduler(this)
_heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)

// start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
// constructor
_taskScheduler.start()

_applicationId = _taskScheduler.applicationId()
_applicationAttemptId = taskScheduler.applicationAttemptId()
_conf.set("spark.app.id", _applicationId)
_env.blockManager.initialize(_applicationId)
Jumping into the createTaskScheduler method, we find the following source:
/**
 * Create a task scheduler based on a given master URL.
 * Return a 2-tuple of the scheduler backend and the task scheduler.
 */
private def createTaskScheduler(
    sc: SparkContext,
    master: String): (SchedulerBackend, TaskScheduler) = {
  // Regular expression used for local[N] and local[*] master formats
  val LOCAL_N_REGEX = """local\[([0-9]+|\*)\]""".r
  // Regular expression for local[N, maxRetries], used in tests with failing tasks
  val LOCAL_N_FAILURES_REGEX = """local\[([0-9]+|\*)\s*,\s*([0-9]+)\]""".r
  // Regular expression for simulating a Spark cluster of [N, cores, memory] locally
  val LOCAL_CLUSTER_REGEX = """local-cluster\[\s*([0-9]+)\s*,\s*([0-9]+)\s*,\s*([0-9]+)\s*]""".r
  // Regular expression for connecting to Spark deploy clusters
  val SPARK_REGEX = """spark://(.*)""".r
  // Regular expression for connecting to a Mesos cluster by mesos:// or zk:// url
  val MESOS_REGEX = """(mesos|zk)://.*""".r
  // Regular expression for connecting to a Simr cluster
  val SIMR_REGEX = """simr://(.*)""".r

  // When running locally, don't try to re-execute tasks on failure.
  val MAX_LOCAL_TASK_FAILURES = 1

  master match {
    case "local" =>
      val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
      val backend = new LocalBackend(sc.getConf, scheduler, 1)
      scheduler.initialize(backend)
      (backend, scheduler)

    case LOCAL_N_REGEX(threads) =>
      def localCpuCount: Int = Runtime.getRuntime.availableProcessors()
      // local[*] uses the number of cores on the machine; local[N] uses exactly N threads.
      val threadCount = if (threads == "*") localCpuCount else threads.toInt
      if (threadCount <= 0) {
        throw new SparkException(s"Asked to run locally with $threadCount threads")
      }
      val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
      val backend = new LocalBackend(sc.getConf, scheduler, threadCount)
      scheduler.initialize(backend)
      (backend, scheduler)

    case LOCAL_N_FAILURES_REGEX(threads, maxFailures) =>
      def localCpuCount: Int = Runtime.getRuntime.availableProcessors()
      // local[*, M] uses the number of cores on the machine with M task failures allowed;
      // local[N, M] uses exactly N threads with M task failures allowed.
      val threadCount = if (threads == "*") localCpuCount else threads.toInt
      val scheduler = new TaskSchedulerImpl(sc, maxFailures.toInt, isLocal = true)
      val backend = new LocalBackend(sc.getConf, scheduler, threadCount)
      scheduler.initialize(backend)
      (backend, scheduler)

    case SPARK_REGEX(sparkUrl) =>
      val scheduler = new TaskSchedulerImpl(sc)
      val masterUrls = sparkUrl.split(",").map("spark://" + _)
      val backend = new SparkDeploySchedulerBackend(scheduler, sc, masterUrls)
      scheduler.initialize(backend)
      (backend, scheduler)

    case LOCAL_CLUSTER_REGEX(numSlaves, coresPerSlave, memoryPerSlave) =>
      // Check to make sure memory requested <= memoryPerSlave. Otherwise Spark will just hang.
      val memoryPerSlaveInt = memoryPerSlave.toInt
      if (sc.executorMemory > memoryPerSlaveInt) {
        throw new SparkException(
          "Asked to launch cluster with %d MB RAM / worker but requested %d MB/worker".format(
            memoryPerSlaveInt, sc.executorMemory))
      }

      val scheduler = new TaskSchedulerImpl(sc)
      val localCluster = new LocalSparkCluster(
        numSlaves.toInt, coresPerSlave.toInt, memoryPerSlaveInt, sc.conf)
      val masterUrls = localCluster.start()
      val backend = new SparkDeploySchedulerBackend(scheduler, sc, masterUrls)
      scheduler.initialize(backend)
      backend.shutdownCallback = (backend: SparkDeploySchedulerBackend) => {
        localCluster.stop()
      }
      (backend, scheduler)

    case "yarn-standalone" | "yarn-cluster" =>
      if (master == "yarn-standalone") {
        logWarning(
          "\"yarn-standalone\" is deprecated as of Spark 1.0. Use \"yarn-cluster\" instead.")
      }
      val scheduler = try {
        val clazz = Utils.classForName("org.apache.spark.scheduler.cluster.YarnClusterScheduler")
        val cons = clazz.getConstructor(classOf[SparkContext])
        cons.newInstance(sc).asInstanceOf[TaskSchedulerImpl]
      } catch {
        case e: Exception => {
          throw new SparkException("YARN mode not available ?", e)
        }
      }
      val backend = try {
        val clazz =
          Utils.classForName("org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend")
        val cons = clazz.getConstructor(classOf[TaskSchedulerImpl], classOf[SparkContext])
        cons.newInstance(scheduler, sc).asInstanceOf[CoarseGrainedSchedulerBackend]
      } catch {
        case e: Exception => {
          throw new SparkException("YARN mode not available ?", e)
        }
      }
      scheduler.initialize(backend)
      (backend, scheduler)

    case "yarn-client" =>
      val scheduler = try {
        val clazz = Utils.classForName("org.apache.spark.scheduler.cluster.YarnScheduler")
        val cons = clazz.getConstructor(classOf[SparkContext])
        cons.newInstance(sc).asInstanceOf[TaskSchedulerImpl]
      } catch {
        case e: Exception => {
          throw new SparkException("YARN mode not available ?", e)
        }
      }
      val backend = try {
        val clazz =
          Utils.classForName("org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend")
        val cons = clazz.getConstructor(classOf[TaskSchedulerImpl], classOf[SparkContext])
        cons.newInstance(scheduler, sc).asInstanceOf[CoarseGrainedSchedulerBackend]
      } catch {
        case e: Exception => {
          throw new SparkException("YARN mode not available ?", e)
        }
      }
      scheduler.initialize(backend)
      (backend, scheduler)

    case mesosUrl @ MESOS_REGEX(_) =>
      MesosNativeLibrary.load()
      val scheduler = new TaskSchedulerImpl(sc)
      val coarseGrained = sc.conf.getBoolean("spark.mesos.coarse", false)
      val url = mesosUrl.stripPrefix("mesos://") // strip the scheme from raw Mesos URLs
      val backend = if (coarseGrained) {
        new CoarseMesosSchedulerBackend(scheduler, sc, url, sc.env.securityManager)
      } else {
        new MesosSchedulerBackend(scheduler, sc, url)
      }
      scheduler.initialize(backend)
      (backend, scheduler)

    case SIMR_REGEX(simrUrl) =>
      val scheduler = new TaskSchedulerImpl(sc)
      val backend = new SimrSchedulerBackend(scheduler, sc, simrUrl)
      scheduler.initialize(backend)
      (backend, scheduler)

    case _ =>
      throw new SparkException("Could not parse Master URL: '" + master + "'")
  }
}
For resource scheduling, the SchedulerBackend class and its subclasses (LocalBackend, SparkDeploySchedulerBackend, SimrSchedulerBackend, the Mesos backends and the coarse-grained YARN backends seen in the code above) are shown as a class-hierarchy figure in the original post.
The TaskScheduler class and its subclasses (TaskSchedulerImpl and the YARN schedulers) are likewise shown as a class-hierarchy figure in the original.
In later sections we will look at each of these components in more detail.
Copyright notice: this is the author's original work; please do not reproduce it without permission.
Natural Language Processing (NLP)
TF-IDF (term frequency and inverse document frequency)
Term weighting: assign each word a weight based on its term frequency
Feature hashing: use a hash function to assign each feature an index in the feature vector (a small sketch follows this list)
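As a quick, self-contained illustration of how the hashing trick assigns vector indices and how IDF re-weights the counts (this snippet is not from the original article; it is meant to run in spark-shell where sc already exists, and the tiny dimension of 16 is arbitrary):

import org.apache.spark.mllib.feature.{HashingTF, IDF}

// Two tiny "documents" represented as token sequences
val docs = sc.parallelize(Seq(Seq("spark", "and", "spark"), Seq("hadoop", "and", "spark")))
val hashingTF = new HashingTF(16)     // every token is hashed to one of 16 indices
val tf = hashingTF.transform(docs)    // sparse term-frequency vectors
val idfModel = new IDF().fit(tf)      // MLlib computes idf = log((m + 1) / (df + 1))
val tfidf = idfModel.transform(tf)
tfidf.collect().foreach(println)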
0 Environment
tar xfvz 20news-bydate.tar.gz
export SPARK_HOME=/Users/erichan/Garden/spark-1.5.1-bin-hadoop2.6
cd $SPARK_HOME
bin/spark-shell --name my_mlib --packages org.jblas:jblas:1.2.4-SNAPSHOT --driver-memory 4G --executor-memory 4G --driver-cores 2
1 Extracting features
val PATH = "/Users/erichan/sourcecode/book/Spark机器学习/20news-bydate"
val path = PATH+"/20news-bydate-train/*"
val rdd = sc.wholeTextFiles(path)
println(rdd.count)
Count the number of documents per newsgroup topic:
val newsgroups = rdd.map { case (file, text) => file.split("/").takeRight(2).head }
val countByGroup = newsgroups.map(n => (n, 1)).reduceByKey(_ + _).collect.sortBy(-_._2).mkString("\n")
println(countByGroup)
(rec.sport.hockey,600)
(soc.religion.christian,599)
(rec.motorcycles,598)
(rec.sport.baseball,597)
(sci.crypt,595)
(rec.autos,594)
(sci.med,594)
(comp.windows.x,593)
(sci.space,593)
(sci.electronics,591)
(comp.os.ms-windows.misc,591)
(comp.sys.ibm.pc.hardware,590)
(misc.forsale,585)
(comp.graphics,584)
(comp.sys.mac.hardware,578)
(talk.politics.mideast,564)
(talk.politics.guns,546)
(alt.atheism,480)
(talk.politics.misc,465)
(talk.religion.misc,377)
val text = rdd.map { case (file, text) => text }
val whiteSpaceSplit = text.flatMap(t => t.split(" ").map(_.toLowerCase))
println(whiteSpaceSplit.distinct.count)
println(whiteSpaceSplit.sample(true, 0.3, 42).take(100).mkString(","))
402978

A sample of the raw whitespace tokens (note the leftover punctuation, numbers and empty strings):
from:,mathew,mathew,faq:,faq:,atheist,resourcessummary:,music,--,fiction,,mantis,consultants,,uk.supersedes:,290
archive-name:,1.0
,,,,,,,,,,,,,,,,,,,organizations
,organizations
,,,,,,,,,,,,,,,,stickers,and,and,the,from,from,in,to:,to:,ffrf,,256-8900
evolution,designs
evolution,a,stick,cars,,writteninside.,fish,us.
write,evolution,,,,,,,bay,can,get,get,,to,theprice,is,of,the,the,so,on.,and,foote.,,atheist,pp.,0--4,,,atrocities,,foote:,aap.,,the
2.2 Improving the tokenization
val nonWordSplit = text.flatMap(t => t.split("""\W+""").map(_.toLowerCase))
println(nonWordSplit.distinct.count)
println(nonWordSplit.distinct.sample(true, 0.3, 42).take(100).mkString(","))
val regex = """[^0-9]*""".r
val filterNumbers = nonWordSplit.filter(token => regex.pattern.matcher(token).matches)
println(filterNumbers.distinct.count)
println(filterNumbers.distinct.sample(true, 0.3, 42).take(100).mkString(","))
2.3 Removing stop words
val tokenCounts = filterNumbers.map(t => (t, 1)).reduceByKey(_ + _)
val oreringDesc = Ordering.by[(String, Int), Int](_._2)
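The stop-word filtering step itself is missing from this copy of the article; a minimal sketch of what it typically looks like, using a small illustrative stop-word set rather than the book's full list:

// Small, illustrative stop-word set (an assumption, not the original list)
val stopwords = Set("the", "a", "an", "of", "or", "in", "for", "by", "on", "but", "is",
  "not", "with", "as", "was", "if", "they", "are", "this", "and", "it", "have", "from",
  "at", "my", "be", "that", "to")
val tokenCountsFilteredStopwords = tokenCounts.filter { case (k, v) => !stopwords.contains(k) }
println(tokenCountsFilteredStopwords.top(20)(oreringDesc).mkString("\n"))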
2.4 Removing rare terms
val oreringAsc = Ordering.by[(String, Int), Int](-_._2)
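Again the filtering code is missing here; building on the sketch above, removing tokens that occur only once across the corpus could look like this (the variable names are assumptions):

// Tokens that appear only once in the whole corpus carry very little signal
val rareTokens = tokenCounts.filter { case (k, v) => v < 2 }.map { case (k, v) => k }.collect.toSet
val tokenCountsFilteredAll = tokenCountsFilteredStopwords.filter { case (k, v) => !rareTokens.contains(k) }
println(tokenCountsFilteredAll.top(20)(oreringAsc).mkString("\n"))
println(tokenCountsFilteredAll.count)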
2.5 Stemming
Stemming is done with standard NLP methods.
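Purely for illustration (this is not from the original, and it is far cruder than a real stemmer such as Porter's), a toy suffix-stripping function shows the idea of collapsing inflected forms onto a common stem:

// Toy stemmer: strips a few common English suffixes. A real pipeline would use a proper
// stemmer or lemmatizer; this only illustrates what "stemming" means.
def crudeStem(token: String): String = {
  val suffixes = Seq("ing", "ed", "es", "s")
  suffixes.find(s => token.endsWith(s) && token.length > s.length + 2) match {
    case Some(s) => token.dropRight(s.length)
    case None    => token
  }
}
println(Seq("walking", "walked", "walks", "walk").map(crudeStem))  // all collapse to "walk"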
3 Training the model
3.1 HashingTF feature hashing
import org.apache.spark.mllib.linalg.{ SparseVector =& SV }
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.feature.IDF
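The lines that define the tokenizer and build the term-frequency vectors are missing from this copy; the snippets below assume a tokenize helper, a hashingTF instance and a tf RDD roughly like the following. This is a reconstruction chosen to be consistent with the 262144-dimensional (2^18) vectors printed below, not necessarily the book's exact code:

// Combine the cleaning steps from section 2 into one helper used throughout the rest of the article
def tokenize(line: String): Seq[String] = {
  line.split("""\W+""")
    .map(_.toLowerCase)
    .filter(token => """[^0-9]*""".r.pattern.matcher(token).matches)
    .filter(token => token.length >= 2)
    .toSeq
}

val dim = math.pow(2, 18).toInt           // 262144 features, matching the output below
val hashingTF = new HashingTF(dim)
val tokens = text.map(doc => tokenize(doc))
val tf = hashingTF.transform(tokens)      // RDD of sparse term-frequency vectors
tf.cache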
val v = tf.first.asInstanceOf[SV]
println(v.size)
println(v.values.size)
println(v.values.take(10).toSeq)
println(v.indices.take(10).toSeq)
262144
706
WrappedArray(1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1.0)
WrappedArray(313, 713, 871, ..., 3166)
fit & transform
val idf = new IDF().fit(tf)
val tfidf = idf.transform(tf)
val v2 = tfidf.first.asInstanceOf[SV]
println(v2.values.size)
println(v2.values.take(10).toSeq)
println(v2.indices.take(10).toSeq)
706
WrappedArray(2., 6.856, 4.142, 8.111, 5.528, 2., 3.9)
WrappedArray(313, 713, 871, ..., 3166)
3.2 Analysing the weights
val minMaxVals = tfidf.map { v =>
  val sv = v.asInstanceOf[SV]
  (sv.values.min, sv.values.max)
}
val globalMinMax = minMaxVals.reduce { case ((min1, max1), (min2, max2)) =>
  (math.min(min1, min2), math.max(max1, max2))
}
println(globalMinMax)
globalMinMax: (Double, Double) = (0.0,09753)
val common = sc.parallelize(Seq(Seq("you", "do", "we")))
val tfCommon = hashingTF.transform(common)
val tfidfCommon = idf.transform(tfCommon)
val commonVector = tfidfCommon.first.asInstanceOf[SV]
println(commonVector.values.toSeq)
WrappedArray(0.5, 0.9175)
Less common words:
val uncommon = sc.parallelize(Seq(Seq("telescope", "legislation", "investment")))
val tfUncommon = hashingTF.transform(uncommon)
val tfidfUncommon = idf.transform(tfUncommon)
val uncommonVector = tfidfUncommon.first.asInstanceOf[SV]
println(uncommonVector.values.toSeq)
WrappedArray(5., 5.579)
4 Using the model
4.1 Cosine similarity
import breeze.linalg._
val hockeyText = rdd.filter { case (file, text) => file.contains("hockey") }
val hockeyTF = hockeyText.mapValues(doc => hashingTF.transform(tokenize(doc)))
val hockeyTfIdf = idf.transform(hockeyTF.map(_._2))
val hockey1 = hockeyTfIdf.sample(true, 0.1, 42).first.asInstanceOf[SV]
val breeze1 = new SparseVector(hockey1.indices, hockey1.values, hockey1.size)
val hockey2 = hockeyTfIdf.sample(true, 0.1, 43).first.asInstanceOf[SV]
val breeze2 = new SparseVector(hockey2.indices, hockey2.values, hockey2.size)
val cosineSim = breeze1.dot(breeze2) / (norm(breeze1) * norm(breeze2))
println(cosineSim)
cosineSim: Double = 0.164626
val graphicsText = rdd.filter { case (file, text) => file.contains("comp.graphics") }
val graphicsTF = graphicsText.mapValues(doc => hashingTF.transform(tokenize(doc)))
val graphicsTfIdf = idf.transform(graphicsTF.map(_._2))
val graphics = graphicsTfIdf.sample(true, 0.1, 42).first.asInstanceOf[SV]
val breezeGraphics = new SparseVector(graphics.indices, graphics.values, graphics.size)
val cosineSim2 = breeze1.dot(breezeGraphics) / (norm(breeze1) * norm(breezeGraphics))
println(cosineSim2)
cosineSim2: Double = 0.792852
val baseballText = rdd.filter { case (file, text) => file.contains("baseball") }
val baseballTF = baseballText.mapValues(doc => hashingTF.transform(tokenize(doc)))
val baseballTfIdf = idf.transform(baseballTF.map(_._2))
val baseball = baseballTfIdf.sample(true, 0.1, 42).first.asInstanceOf[SV]
val breezeBaseball = new SparseVector(baseball.indices, baseball.values, baseball.size)
val cosineSim3 = breeze1.dot(breezeBaseball) / (norm(breeze1) * norm(breezeBaseball))
println(cosineSim3)
4.2 Learning the mapping between words and topics
Build the multiclass label mapping:
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.evaluation.MulticlassMetrics
val newsgroupsMap = newsgroups.distinct.collect().zipWithIndex.toMap
val zipped = newsgroups.zip(tfidf)
val train = zipped.map { case (topic, vector) => LabeledPoint(newsgroupsMap(topic), vector) }
train.cache
Train the naive Bayes model:
val model = NaiveBayes.train(train, lambda = 0.1)
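As a usage sketch (not from the original article), the trained model can classify a new document by pushing it through the same tokenize, hashingTF and idf pipeline; the example sentence is made up:

// Classify a single, made-up document with the trained pipeline
val doc = "the goalie made a great save late in the third period of the playoff game"
val docVector = idf.transform(hashingTF.transform(tokenize(doc)))
val predictedLabel = model.predict(docVector)
// newsgroupsMap maps topic name -> index, so invert it to report the predicted topic name
val indexToTopic = newsgroupsMap.map(_.swap)
println(indexToTopic(predictedLabel.toInt))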
Load the test set:
val testPath = PATH+"/20news-bydate-test/*"
val testRDD = sc.wholeTextFiles(testPath)
val testLabels = testRDD.map { case (file, text) =>
  val topic = file.split("/").takeRight(2).head
  newsgroupsMap(topic)
}
val testTf = testRDD.map { case (file, text) => hashingTF.transform(tokenize(text)) }
val testTfIdf = idf.transform(testTf)
val zippedTest = testLabels.zip(testTfIdf)
val test = zippedTest.map { case (topic, vector) => LabeledPoint(topic, vector) }
Compute the accuracy and the multiclass weighted F-measure:
val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))
val accuracy = 1.0 * predictionAndLabel.filter(x => x._1 == x._2).count() / test.count()
println(accuracy)
val metrics = new MulticlassMetrics(predictionAndLabel)
println(metrics.weightedFMeasure)
For comparison, the same model can be trained on raw whitespace-split tokens, without the cleaning and TF-IDF weighting used above:
val rawTokens = rdd.map { case (file, text) => text.split(" ") }
val rawTF = rawTokens.map(doc => hashingTF.transform(doc))
val rawTrain = newsgroups.zip(rawTF).map { case (topic, vector) => LabeledPoint(newsgroupsMap(topic), vector) }
val rawModel = NaiveBayes.train(rawTrain, lambda = 0.1)
val rawTestTF = testRDD.map { case (file, text) => hashingTF.transform(text.split(" ")) }
val rawZippedTest = testLabels.zip(rawTestTF)
val rawTest = rawZippedTest.map { case (topic, vector) => LabeledPoint(topic, vector) }
val rawPredictionAndLabel = rawTest.map(p => (rawModel.predict(p.features), p.label))
val rawAccuracy = 1.0 * rawPredictionAndLabel.filter(x => x._1 == x._2).count() / rawTest.count()
println(rawAccuracy)
val rawMetrics = new MulticlassMetrics(rawPredictionAndLabel)
println(rawMetrics.weightedFMeasure)
6 The Word2Vec model
Word2Vec (a distributed vector representation) represents each word as a dense vector; MLlib's implementation uses the skip-gram model.
import org.apache.spark.mllib.feature.Word2Vec
val word2vec = new Word2Vec()
word2vec.setSeed(42)
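The fitting step is missing from this copy; the word2vecModel used below would come from a call like the following, where tokens is the tokenized corpus assumed earlier (a sketch, not necessarily the book's exact code):

// Fit the skip-gram model on the tokenized documents to obtain the word vectors
val word2vecModel = word2vec.fit(tokens)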
The 20 words most similar to a query term:
word2vecModel.findSynonyms("hockey", 20).foreach(println)
(sport,1.7133)
(ecac,1.254)
(hispanic,1.5194)
(glens,1.2825)
(woofers,1.8116)
(tournament,1.1586)
(champs,1.3941)
(boxscores,1.543)
(aargh,1.267)
(ahl,1.253)
(playoff,1.0572)
(ncaa,1.8046)
(pool,1.0224)
(champion,1.9134)
(filinuk,1.0915)
(olympic,1.0243)
(motorcycles,1.9679)
(yankees,1.3371)
(calder,1.493)
(homeruns,1.3932)
word2vecModel.findSynonyms("legislation", 20).foreach(println)
(accommodates,0.8688)
(briefed,0.2989)
(amended,0.3344)
(telephony,0.3956)
(pitted,0.2533)
(aclu,0.2372)
(licensee,0.7975)
(agency,0.648)
(policies,0.5566)
(senate,0.0903)
(businesses,0.0467)
(permit,0.1389)
(cpsr,0.4367)
(cooperation,0.6543)
(surveillance,0.8756)
(congress,0.2855)
(restricted,0.7126)
(procure,0.6356)
(inquiry,0.4405)
(industry,0.4752)
For reference, aclu is the American Civil Liberties Union; the other neighbours (senate, surveillance, inquiry) are likewise terms closely tied to legislation, which is exactly the kind of relationship we hope the learned vectors capture.