hadoop configurations

Posted on 2014-09-12 | Category: hadoop

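	YARN configuration keys are defined in org.apache.hadoop.yarn.conf.YarnConfiguration
	MapReduce configuration keys are defined in org.apache.hadoop.mapreduce.MRJobConfig

See also: YARN详解_参数配置 (YARN explained: parameter configuration)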

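mapreduce.job.user.classpath.first
Load the user's classes ahead of the framework's classes
mapreduce.job.jvm.numtasks
JVM reuse (number of tasks run per JVM)
mapreduce.map.speculative
Speculative execution of map tasks
mapreduce.reduce.speculative
Speculative execution of reduce tasks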
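
One common way to set keys like these per job is to pass them as `-D key=value` generic options, assuming the job driver parses generic options (for example through ToolRunner). The sketch below only builds that argument list with plain Java collections; the class name, method names, and values are hypothetical and purely illustrative.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: builds "-D key=value" arguments, it does not call Hadoop.
class JobOptionSketch {

    // Example values are assumptions chosen for illustration, not recommendations.
    static Map<String, String> exampleSettings() {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("mapreduce.job.user.classpath.first", "true");
        settings.put("mapreduce.job.jvm.numtasks", "1");
        settings.put("mapreduce.map.speculative", "false");
        settings.put("mapreduce.reduce.speculative", "false");
        return settings;
    }

    // Turns each setting into the pair of arguments "-D" and "key=value".
    static List<String> toGenericOptions(Map<String, String> settings) {
        List<String> args = new ArrayList<>();
        for (Map.Entry<String, String> entry : settings.entrySet()) {
            args.add("-D");
            args.add(entry.getKey() + "=" + entry.getValue());
        }
        return args;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", toGenericOptions(exampleSettings())));
    }
}
```

The resulting arguments would go after the main class name on the `hadoop jar` command line and before the job's own arguments.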

Calculating the Capacity of a Node

Posted on 2014-09-12 | Category: hadoop

Reposted from: Calculating the Capacity of a Node
Because YARN has now removed the hard partitioned mapper and reducer slots of Hadoop Version 1, new capacity calculations are required. There are eight important parameters for calculating a node’s capacity that are specified in mapred-site.xml and yarn-site.xml:

In mapred-site.xml:

mapreduce.map.memory.mb
mapreduce.reduce.memory.mb

These are the hard limits enforced by Hadoop on each mapper or reducer task.

mvn jetty debug in eclipse

Posted on 2014-09-12 | Category: maven

### create eclipse Program
Eclipse -> Run -> External Tools -> External Tools Configurations -> Program -> right-click -> New
- Location: ${M2_HOME}/bin/mvn
- Working Directory: select the project to debug
- Arguments: jetty:run
- Environment:
  MAVEN_OPTS=-Xdebug -Xnoagent -Djava.compiler=NONE -Xrunjdwp:transport=dt_socket,address=8000,server=y,suspend=y
### create remote java application
Eclipse -> Run -> Debug Configurations -> Remote Java Application
- Project: select the project to debug
- Host: localhost
- Port: 8000
- Select "Allow termination of remote VM"
### run
1. Run the external tool program; because of suspend=y the JVM waits for a debugger to attach
2. Start the Remote Java Application configuration in debug mode

hadoop file splitting and merging

Posted on 2014-08-18 | Category: hadoop

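File splitting
So that one large file can be processed by several mappers at the same time, Hadoop usually divides a file into multiple splits. If the file is compressed, the compression format must be splittable.
Source: org.apache.hadoop.mapreduce.lib.input.FileInputFormat
- Parameters involved
  - blocksize: the block size of the Hadoop file system
  - splitMaxSize: mapreduce.input.fileinputformat.split.maxsize
  - splitMinSize: mapreduce.input.fileinputformat.split.minsize
  splitSize = max{splitMinSize, min{splitMaxSize, blocksize}}
  Interestingly, setting splitMaxSize larger than blocksize has no effect. splitSize exceeds blocksize only when splitMinSize > blocksize. Perhaps Hadoop would rather not have splitSize > blocksize? A likely reason is data locality: a split that spans more than one block may have to read part of its data from another node.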
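
The formula above can be checked with a few lines of plain Java. This is a minimal sketch that mirrors the formula only; it does not call Hadoop, and the class name and the 128 MB block size are assumptions made for illustration.

```java
// Sketch only: reproduces splitSize = max{splitMinSize, min{splitMaxSize, blocksize}}.
class SplitSizeSketch {

    static long computeSplitSize(long blockSize, long splitMinSize, long splitMaxSize) {
        return Math.max(splitMinSize, Math.min(splitMaxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 128 * mb; // hypothetical block size

        // splitMaxSize above the block size: the split stays at the block size.
        System.out.println(computeSplitSize(blockSize, 1L, 256 * mb) / mb); // 128

        // splitMaxSize below the block size: the split shrinks.
        System.out.println(computeSplitSize(blockSize, 1L, 64 * mb) / mb); // 64

        // splitMinSize above the block size: the only way to exceed the block size.
        System.out.println(computeSplitSize(blockSize, 256 * mb, Long.MAX_VALUE) / mb); // 256
    }
}
```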
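File merging
In a MapReduce job, each split is handled by one map JVM process (although tasks of the same job can reuse a JVM by setting mapred.job.reuse.jvm.num.tasks, the older name of mapreduce.job.jvm.numtasks). Too many small files hurt HDFS performance, so small files sometimes need to be merged into larger ones.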

eclipse download mirror

Posted on 2014-05-25 | Category: eclipse

## Updating Eclipse plugins through a mirror

#### Edit hosts
127.0.0.1 download.eclipse.org
#### Run a proxy service

Node.js proxy

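// Redirect every request for download.eclipse.org (now resolving to 127.0.0.1) to the mirror.
var http = require('http');
http.createServer(function (request, response) {
	// Log each request for troubleshooting.
	console.log(request.method + '\t' + request.url);
	// Answer with a 302 redirect that keeps the original path.
	response.writeHead(302, {
		'Location': 'http://mirror.bit.edu.cn/eclipse' + request.url
	});
	response.end();
}).listen(80); // binding to port 80 usually requires elevated privileges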

Serializer

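Posted on 2014-05-25 | Category: storm

### References
https://github.com/nathanmarz/storm/wiki/Serialization
### Motivation
Part of the Storm logic is shared with Hadoop MapReduce. The MapReduce side uses Protobuf, so the goal is to use the Protobuf-generated message classes in Storm exactly as in MapReduce.
However, custom classes in Storm are serialized with Java's built-in serialization by default, which is inefficient. This post therefore tries registering Protobuf's own serialization with Storm.
### Serializer

Serializer for GeneratedMessage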

import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.KryoException;
import com.esotericsoftware.kryo.Serializer;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;
import com.google.protobuf.GeneratedMessage;

public class ProtoBuffSerializer<T extends GeneratedMessage> extends
		Serializer<T> {

	@Override
	public void write(Kryo kryo, Output output, T object) {
		// Length-prefix the message so read() knows where it ends; a protobuf
		// message does not mark its own end inside a shared stream.
		byte[] bytes = object.toByteArray();
		output.writeInt(bytes.length);
		output.writeBytes(bytes);
	}

	@SuppressWarnings("unchecked")
	@Override
	public T read(Kryo kryo, Input input, Class<T> type) {
		// Read exactly the bytes that write() produced for this message.
		byte[] bytes = input.readBytes(input.readInt());
		try {
			// Generated message classes expose a static parseFrom(byte[]).
			return (T) type.getMethod("parseFrom", byte[].class).invoke(null,
					(Object) bytes);
		} catch (ReflectiveOperationException e) {
			// Fail loudly instead of printing the stack trace and returning null.
			throw new KryoException("Cannot parse " + type.getName(), e);
		}
	}
}
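
A protobuf message does not record its own length on the wire, so a serializer that shares a stream with other values typically has to frame each message. As a standalone illustration of that idea, the sketch below writes two payloads with a 4-byte length prefix and reads them back using only java.io. The strings stand in for encoded messages, and the class and method names are hypothetical; this is not Storm or Kryo API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Sketch only: length-prefixed framing of opaque payloads in one shared stream.
class LengthPrefixedFraming {

    // Writes the payload length first, then the payload bytes.
    static void writeFramed(DataOutputStream out, byte[] payload) throws IOException {
        out.writeInt(payload.length);
        out.write(payload);
    }

    // Reads the length, then exactly that many bytes, leaving the rest of the stream untouched.
    static byte[] readFramed(DataInputStream in) throws IOException {
        byte[] payload = new byte[in.readInt()];
        in.readFully(payload);
        return payload;
    }

    // Writes two payloads back to back and recovers them separately.
    static String[] roundTrip(String first, String second) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            writeFramed(out, first.getBytes(StandardCharsets.UTF_8));
            writeFramed(out, second.getBytes(StandardCharsets.UTF_8));
            out.flush();

            DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(buffer.toByteArray()));
            String a = new String(readFramed(in), StandardCharsets.UTF_8);
            String b = new String(readFramed(in), StandardCharsets.UTF_8);
            return new String[] {a, b};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] values = roundTrip("person-1", "person-2");
        System.out.println(values[0] + " " + values[1]);
    }
}
```

Without the length prefix, a reader that consumes bytes until the end of the stream would swallow the second payload together with the first.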

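### Using registerSerialization
I tried registering GeneratedMessage itself with registerSerialization, but it has no effect; every subclass has to be registered separately, which is tedious. Presumably the registration is matched against the exact class rather than its superclasses.

Registering the serializer for GeneratedMessage does not work:

Config conf = new Config();
conf.registerSerialization(GeneratedMessage.class, ProtoBuffSerializer.class);

Registering the serializer for Person works:

Config conf = new Config();
conf.registerSerialization(Person.class, ProtoBuffSerializer.class);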
....

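Reliable message processing

Posted on 2014-05-25 | Category: storm

### References
http://blog.linezing.com/?p=1898
### ack
- IBasicBolt acks automatically
- IRichBolt requires a manual ack. If you use IRichBolt and forget to ack, the spout's fail method is called once Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS has elapsed
### Anchoring
- IBasicBolt anchors automatically
- IRichBolt requires manual anchoring
### Disabling reliability
- Set Config.TOPOLOGY_ACKERS to 0. With this setting, the spout's ack method is called immediately whenever the spout emits a message.
- Emit the message from the spout without a message ID. Use this approach to turn off reliability for specific messages.
- If you do not care about the reliability of the messages derived from a given message, do not anchor them when emitting, i.e. do not pass the input tuple to emit. Because these descendant messages are not anchored in any tuple tree, their failure never causes a spout to re-emit.
### Related
TOPOLOGY_MAX_SPOUT_PENDING only takes effect when reliability is enabled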

protobuf

Posted on 2014-05-23 | Category: protobuf

### References
https://developers.google.com/protocol-buffers/docs/overview
http://www.ibm.com/developerworks/cn/linux/l-cn-gpb/

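GoldenDict: adding Youdao and its word book

Posted on 2014-05-07 | Category: tools

## Reposted from:
妙用GoldenDict使用有道词典和有道单词本 (Using Youdao Dictionary and the Youdao word book from GoldenDict)
### Addendum: logging in to the word book
I use GoldenDict 1.5.0. Typing into the web page does not work there, but copying and pasting with Ctrl+C and Ctrl+V gets around it.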

eclipse fonts and theme

Posted on 2014-05-07 | Category: eclipse

theme

Install the 'Eclipse Color Theme' plugin in Eclipse.
Themes I personally like: Pastel, Havenjark, Obsidian

fonts
- Windows
  1. Control Panel -> Fonts: find the Courier New font, right-click it and choose "Show".
  2. In the menu Window -> Preferences, go to General -> Appearance -> Colors and Fonts -> Basic -> Text Font, click Edit, and select "Courier New".
- ubuntu
  1. Install the Windows fonts: sudo apt-get install ttf-mscorefonts-installer
  2. In the menu Window -> Preferences, go to General -> Appearance -> Colors and Fonts -> Basic -> Text Font, click Edit, and select "Courier New".
    Personal impression: FreeMono on Ubuntu looks much like Courier New.

Xuehui He
